Streaming datasets with pipe mode
By default, estimators copy the dataset to training instances before training starts, which is known as file mode. Pipe mode instead streams the dataset directly from S3. The feature takes its name from its use of Unix named pipes (also known as FIFOs): at the beginning of each epoch, one pipe is created per input channel.
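To get a feel for how a named pipe behaves, here is a minimal local sketch that does not involve S3 or SageMaker at all. A background thread plays the role of the data producer, and the main thread reads the stream until the producer closes its end. The function names, the temporary directory, and the `<channel>_<epoch>` file name are assumptions made for this illustration. The code needs a Unix-like system, since os.mkfifo is not available on Windows.

```python
import os
import tempfile
import threading


def write_records(fifo_path, records):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "wb") as fifo:
        for record in records:
            fifo.write(record)
    # Leaving the with-block closes the pipe, which the reader sees as end of stream.


def read_stream(fifo_path, chunk_size=4096):
    # Read chunk by chunk until the writer closes its end of the pipe.
    chunks = []
    with open(fifo_path, "rb") as fifo:
        while True:
            chunk = fifo.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def stream_one_epoch(records, channel="train", epoch=0):
    # Hypothetical naming: one pipe per channel, created anew for each epoch.
    with tempfile.TemporaryDirectory() as workdir:
        fifo_path = os.path.join(workdir, f"{channel}_{epoch}")
        os.mkfifo(fifo_path)
        writer = threading.Thread(target=write_records, args=(fifo_path, records))
        writer.start()
        data = read_stream(fifo_path)
        writer.join()
    return data


if __name__ == "__main__":
    print(stream_one_epoch([b"1,0.5,0.2\n", b"0,0.1,0.9\n"]))
```

The point to notice is that the data never lands on disk as a regular file: the reader consumes bytes as the writer produces them, and a new pipe is needed to read the data again for the next epoch.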
Pipe mode removes the need to copy the dataset to training instances before training begins, so training jobs start sooner. They generally run faster too, as pipe mode is highly optimized. Another benefit is that you won't have to provision any storage for the dataset on training instances.
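Switching between the two modes is a configuration choice, not a change to the dataset in S3. As a rough sketch, the helper below builds the part of a low-level CreateTrainingJob request that selects the input mode. The field names follow that API, but the helper itself, the channel name, and the S3 location are hypothetical, and a real request needs further fields (training image, role, resources, output location). With SageMaker SDK estimators, the corresponding setting is the input_mode parameter.

```python
def training_input_config(channel_name, s3_uri, input_mode="Pipe"):
    """Build a fragment of a CreateTrainingJob request (illustrative helper)."""
    if input_mode not in ("File", "Pipe"):
        raise ValueError(f"unsupported input mode in this sketch: {input_mode}")
    return {
        # "File" copies the dataset to the instance; "Pipe" streams it from S3.
        "AlgorithmSpecification": {"TrainingInputMode": input_mode},
        "InputDataConfig": [
            {
                "ChannelName": channel_name,
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, used only for illustration.
    config = training_input_config("train", "s3://example-bucket/dataset/train/")
    print(config["AlgorithmSpecification"]["TrainingInputMode"])
```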
Cutting down on training time and storage means that you'll save money. The larger the dataset, the more you'll save. You can find benchmarks at https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/.
In practice, you can start experimenting with pipe mode for datasets in the hundreds of...