Processing data at scale on AWS
In the previous section, Analyzing large amounts of unstructured data, the training data was stored in an S3 bucket. In some scenarios, you will need the data to load faster, rather than waiting for the training job to copy it from S3 onto the local storage of your training instance. In these scenarios, you can store the data on a file system, such as Amazon Elastic File System (EFS) or Amazon FSx, and mount it to the training instance. Mounting avoids the up-front copy from S3 and can shorten the time before training starts. The code for this is in the 3_unstructured_data.ipynb notebook. Refer to the Optimize it with data on EFS and Optimize it with data on FSX sections in the notebook.
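To make the idea of mounting a file system as a training input concrete, the following is a minimal sketch, not the notebook's code. It builds the input channel description in the shape used by the SageMaker CreateTrainingJob request (InputDataConfig with a FileSystemDataSource). The helper names, the file system ID, the subnet and security group IDs, and the directory path are all hypothetical placeholders.

```python
# Minimal sketch: describe a file-system-backed training channel.
# Helper names and all IDs/paths below are hypothetical placeholders.
# The dictionary keys follow the CreateTrainingJob request shape.

VALID_FILE_SYSTEM_TYPES = {"EFS", "FSxLustre"}
VALID_ACCESS_MODES = {"ro", "rw"}


def file_system_channel(channel_name, file_system_id, file_system_type,
                        directory_path, access_mode="ro"):
    """Return one InputDataConfig entry that points at a mounted file system."""
    if file_system_type not in VALID_FILE_SYSTEM_TYPES:
        raise ValueError(f"Unsupported file system type: {file_system_type}")
    if access_mode not in VALID_ACCESS_MODES:
        raise ValueError(f"Unsupported access mode: {access_mode}")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": access_mode,
            }
        },
    }


def vpc_config(subnet_ids, security_group_ids):
    """Return the VpcConfig block; the training job must run in a VPC
    that can reach the file system."""
    return {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


# Example with placeholder values (read-only EFS mount for the "train" channel).
train_channel = file_system_channel(
    channel_name="train",
    file_system_id="fs-0123456789abcdef0",   # hypothetical EFS ID
    file_system_type="EFS",
    directory_path="/data/train",            # hypothetical path on the file system
)
network = vpc_config(["subnet-0abc"], ["sg-0abc"])  # hypothetical IDs
```

A dictionary like train_channel could be passed as an element of InputDataConfig, together with VpcConfig, when creating a training job through the AWS SDK. The SageMaker Python SDK offers a higher-level equivalent in sagemaker.inputs.FileSystemInput, which takes the same four pieces of information. Which of the two the notebook uses is not stated here, so treat this only as an illustration of the inputs involved.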
Note
Before you run the Optimize it with data on EFS and Optimize it with data on FSX sections, launch the template_filesystems.yaml CloudFormation template, in the same way as in the Setting up EMR and SageMaker Studio section.