Introducing Apache Airflow
Data for ML model training can come from a variety of sources, such as databases, data warehouses, or data lakes. These repositories store data in many different formats. For example, data may be stored as unstructured objects, as in the case of image, video, or sound files. Data may also be stored in a semi-structured form, such as JSON documents that don't conform to a standardized tabular schema. In the case of relational databases or data warehouses, the data is stored in an organized, structured format, but it may span multiple different schemas.
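To illustrate the difference, the following sketch flattens a semi-structured JSON record into a tabular form using pandas. The record itself, its field names, and the nested layout are hypothetical and only stand in for whatever the source data looks like:

```python
import json
import pandas as pd

# A hypothetical semi-structured record: nested fields and a variable-length
# list of orders do not map cleanly onto a fixed tabular schema.
raw_record = json.loads("""
{
  "customer_id": 42,
  "profile": {"name": "Jane Doe", "tier": "gold"},
  "orders": [
    {"order_id": "A-100", "amount": 19.99},
    {"order_id": "A-101", "amount": 5.50}
  ]
}
""")

# Flatten the nested structure into rows and columns so it could be loaded
# into a relational table or a data warehouse.
flat = pd.json_normalize(
    raw_record,
    record_path="orders",
    meta=["customer_id", ["profile", "name"], ["profile", "tier"]],
)
print(flat)
```

Each nested order becomes its own row, with the customer-level attributes repeated as columns; this is the kind of reshaping that must happen before semi-structured data can sit alongside structured, relational data.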
To make matters worse, some datasets can be very large, often terabytes or even petabytes in size, and joining, merging, and transforming the data, often referred to as Extract, Transform, and Load (ETL) processing, requires large compute clusters, such as Hadoop and Apache Spark clusters. AWS provides infrastructure resources and dedicated services to scale these big data...