Getting started with data processing and analysis
In the previous chapter, we used a data warehouse and a data lake to store, manage, and query our data. Before the data stored in these sources can be used as a training dataset for ML experiments, it generally must undergo a series of data processing and transformation steps similar to those shown in Figure 5.1:
Figure 5.1 – Data processing and analysis
Figure 5.1 shows that these data processing steps may involve merging different datasets, as well as cleaning, converting, analyzing, and transforming the data using a variety of options and techniques. In practice, data scientists and ML engineers often spend many hours cleaning data and preparing it for use in ML experiments. Some professionals are used to writing and running custom Python or R scripts to perform this work. However, it may be more practical to use no-code or low-code solutions such as AWS Glue DataBrew...
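To make the script-based approach concrete, here is a minimal sketch of what a custom Python script covering the merging, cleaning, converting, and transforming steps might look like. It uses only the Python standard library. The `customers` and `orders` records, their field names, and the helper functions `merge`, `clean_and_convert`, and `impute_amount` are all hypothetical and exist only for illustration; filling missing values with the mean is one simple choice among many.

```python
from statistics import mean

# Hypothetical sample records standing in for two source datasets.
customers = [
    {"customer_id": 1, "country": " us "},
    {"customer_id": 2, "country": "PH"},
    {"customer_id": 3, "country": None},
]
orders = [
    {"customer_id": 1, "amount": "120.50"},
    {"customer_id": 2, "amount": ""},
    {"customer_id": 3, "amount": "80"},
    {"customer_id": 4, "amount": "15"},  # no matching customer record
]


def merge(customers, orders):
    """Merge: inner-join orders to customers on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], **o}
        for o in orders
        if o["customer_id"] in by_id
    ]


def clean_and_convert(rows):
    """Clean and convert: normalize country codes, parse amounts as floats.

    Missing or empty values become None.
    """
    cleaned = []
    for row in rows:
        country = (row["country"] or "").strip().upper() or None
        amount = float(row["amount"]) if row["amount"] else None
        cleaned.append({**row, "country": country, "amount": amount})
    return cleaned


def impute_amount(rows):
    """Transform: replace missing amounts with the mean of observed amounts."""
    observed = [r["amount"] for r in rows if r["amount"] is not None]
    fill = mean(observed)
    return [
        {**r, "amount": fill if r["amount"] is None else r["amount"]}
        for r in rows
    ]


dataset = impute_amount(clean_and_convert(merge(customers, orders)))
for row in dataset:
    print(row)
```

Even for a toy dataset like this one, each step needs its own function and its own decisions about edge cases (unmatched records, empty strings, missing values), which hints at why this kind of work can take many hours on real data.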