Using Airflow to process the abalone dataset
To set the scene, recall from Chapter 1, Getting Started with Automated Machine Learning on AWS, that the ACME Fishing Logistics company trains its ML model on an outdated dataset from the UCI Machine Learning Repository. Because an ML model is only as good as the data it's trained on, the ML practitioners have found that no amount of tweaking and tuning will improve the production model without newer data.
To resolve this problem, ACME has hired an external company to survey abalone catches and supply daily updates of the surveyed dataset. The already tuned ML model can therefore be retrained on fresh data and further optimized. It also means that the data engineering teams need to orchestrate a process, or data pipeline, that merges the original dataset with the new survey data and supplies the new training, validation, and testing datasets to a new model training...
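To make the merge-and-split step concrete, here is a minimal sketch in plain Python. It is not the pipeline built in this chapter. The function name `merge_and_split`, the split fractions, the fixed seed, and the tuple-based row format are all assumptions made for illustration.

```python
import random


def merge_and_split(original_rows, survey_rows,
                    train_frac=0.7, val_frac=0.15, seed=42):
    """Concatenate two lists of records and split them three ways.

    Hypothetical helper: the fractions and seed are illustrative
    defaults, not values taken from the ACME pipeline.
    """
    # Merge by simple concatenation; the inputs are left unmodified.
    merged = list(original_rows) + list(survey_rows)

    # Shuffle with a seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(merged)

    # Compute the split boundaries; the test set takes the remainder.
    n_train = int(len(merged) * train_frac)
    n_val = int(len(merged) * val_frac)

    train = merged[:n_train]
    validation = merged[n_train:n_train + n_val]
    test = merged[n_train + n_val:]
    return train, validation, test


if __name__ == "__main__":
    # Made-up rows standing in for abalone records (sex, length, rings).
    original = [("M", 0.40 + i / 100, 10 + i) for i in range(10)]
    survey = [("F", 0.50 + i / 100, 8 + i) for i in range(10)]
    train, validation, test = merge_and_split(original, survey)
    print(len(train), len(validation), len(test))
```

In an orchestrated pipeline, a function like this would typically run as a single task, with the three resulting datasets written to storage for the downstream training step to read.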