Summary
In this chapter, we discussed data collection practices that are used by organizations and the issue of dark data. We also discussed different storage and processing techniques, such as OLTP and OLAP, and how organizations are using a combination of these two techniques to extract value from the data gathered. We briefly discussed the evolution of data management strategies such as data warehousing, data lakes, the data lakehouse, and data meshes and the role played by ETL and ELT processes in ingesting data into OLAP systems for analysis.
Then, we introduced the Apache Spark framework and talked about how Spark executes workloads by dividing them into different Spark Jobs, stages, and tasks. After this, we discussed different services in the AWS cloud that can be used to execute Spark workloads. We introduced AWS Glue and the different features available in Glue that make it a full-fledged data integration platform and not just a managed ETL service.
In the next chapter, we will discuss the different microservices that are available in AWS Glue and how they work. We will also focus on some Glue-specific features/enhancements that make AWS Glue an ideal service for your data integration workloads.