Summary
One of the main difficulties in DL projects comes from the sheer volume of data: training a DL model requires large datasets, so the data processing steps can consume significant resources. In this chapter, we therefore learned how to use the most popular cloud service, AWS, to process terabytes and even petabytes of data efficiently. Such a system includes a scheduler, data storage, databases, visualization tools, and a data processing tool for running the ETL logic.
We spent extra time on ETL because it plays a major role in data processing. We introduced Spark, the most popular tool for ETL, and described four different ways of setting up ETL jobs on AWS: a single-node EC2 instance, an EMR cluster, Glue, and SageMaker. Each setup has distinct advantages, and the right choice depends on the situation, since you must weigh both the technical and non-technical aspects of the project.
Similar to how...