Creating synthetic Abalone survey data
In the previous section, we created the two primary artifacts that the data engineering team requires to implement the data-centric workflow. The first is the set of ETL artifacts that merge the raw Abalone data with new data to create the training, validation, and test datasets. The second is the Airflow DAG artifact that integrates these ETL artifacts into the data-centric workflow, automating the ML process so that we can train, evaluate, and deploy a production-grade Age Calculator model.
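To make the merge-and-split responsibility of the ETL artifacts concrete, here is a minimal sketch in pandas. The 80/10/10 split ratio, the `merge_and_split` function name, and the column names (taken from the standard UCI Abalone schema) are illustrative assumptions, not the book's actual ETL code:

```python
import numpy as np
import pandas as pd

# Standard UCI Abalone column names (assumed; the book's ETL artifacts
# may name or order them differently).
COLUMNS = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]

def merge_and_split(raw_df, new_df, seed=42):
    """Merge the raw Abalone data with new survey data, then split the
    result into training (80%), validation (10%), and test (10%) sets.

    The 80/10/10 ratio is a common convention, chosen here only for
    illustration."""
    combined = pd.concat([raw_df, new_df], ignore_index=True)
    shuffled = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    train_end = int(n * 0.8)
    val_end = int(n * 0.9)
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:val_end],
        shuffled.iloc[val_end:],
    )

# Tiny random frames standing in for the real raw and new survey CSVs
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.random((80, len(COLUMNS))), columns=COLUMNS)
new = pd.DataFrame(rng.random((20, len(COLUMNS))), columns=COLUMNS)
train, val, test = merge_and_split(raw, new)
print(len(train), len(val), len(test))  # 80 10 10
```

In the actual workflow, each of the three resulting frames would be written out (for example, to S3) for the downstream training and evaluation tasks in the DAG to consume.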
As you may recall from the Using Airflow to process the Abalone dataset section of Chapter 8, Automating the Machine Learning Process Using Apache Airflow, we established the context for the data-centric workflow by expanding the ACME Fishing Logistics use case to address the need to add updated Abalone survey data.
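Since no updated survey data actually exists yet, it has to be synthesized. One simple way to do this, shown here as a hedged sketch rather than the book's method, is to sample each numeric column from a normal distribution fitted to the existing data and to resample the categorical and integer columns empirically. The `synthesize_survey` helper and the schema below are assumptions based on the UCI Abalone dataset:

```python
import numpy as np
import pandas as pd

# UCI Abalone numeric measurement columns (assumed schema)
NUMERIC_COLS = [
    "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight",
]

def synthesize_survey(reference, n_rows, seed=123):
    """Generate n_rows of synthetic survey records: numeric columns are
    drawn from a normal distribution fitted to the reference data
    (clipped at zero, since measurements cannot be negative), while the
    categorical 'sex' and integer 'rings' columns are resampled from the
    reference values."""
    rng = np.random.default_rng(seed)
    out = {}
    out["sex"] = rng.choice(reference["sex"].to_numpy(), size=n_rows)
    for col in NUMERIC_COLS:
        mu, sigma = reference[col].mean(), reference[col].std()
        out[col] = rng.normal(mu, sigma, size=n_rows).clip(min=0)
    out["rings"] = rng.choice(reference["rings"].to_numpy(), size=n_rows)
    return pd.DataFrame(out)

# Demo against a small stand-in reference frame (the real workflow would
# load the existing Abalone dataset instead)
ref = pd.DataFrame({
    "sex": ["M", "F", "I"] * 10,
    **{c: np.linspace(0.1, 0.9, 30) for c in NUMERIC_COLS},
    "rings": list(range(1, 31)),
})
survey = synthesize_survey(ref, n_rows=50)
print(survey.shape)  # (50, 9)
```

Fitting independent per-column distributions ignores correlations between the measurements (a larger abalone has a larger whole weight), so data synthesized this way is plausible only column by column; that trade-off is usually acceptable when the goal is to exercise the pipeline rather than to model abalone biology.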
So, before we can execute the data-centric workflow, we must address the...