Automatic data preparation
The first stage of a typical machine learning pipeline deals with data preparation (recall the pipeline in Figure 13.1). There are two main aspects that should be taken into account: data cleansing and data synthesis:
Data cleansing is about improving the quality of the data by checking for wrong data types, missing values, and errors, and by applying normalization, bucketization, scaling, and encoding. A robust AutoML pipeline should automate all of these mundane but extremely important steps as much as possible (a minimal, hand-written sketch of what gets automated is shown right after this list).
Data synthesis is about generating synthetic data via augmentation for training, evaluation, and validation. Normally, this step is domain-specific. For instance, we have seen how to generate synthetic CIFAR-10-like images (Chapter 4) by using cropping, rotation, resizing, and flipping operations. One can also think about generating additional images or video via GANs (see Chapter 9) and using the augmented synthetic dataset for training. A small augmentation example follows the cleansing sketch below.
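To make the cleansing step concrete, here is a minimal, hand-written sketch of the kind of work an AutoML system would automate. The toy table and its column names (age, income, city) are hypothetical, and the choice of Keras preprocessing layers for scaling, bucketization, and encoding is just one reasonable option:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical raw table: a wrong dtype, missing values, and a categorical column.
df = pd.DataFrame({
    "age": ["25", "31", None, "47"],           # stored as strings, one value missing
    "income": [40_000.0, 52_000.0, 61_000.0, np.nan],
    "city": ["paris", "london", "paris", "rome"],
})

# Cleansing: fix the wrong data type and impute missing values with the median.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Scaling / normalization: zero mean, unit variance for the income column.
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(df["income"].to_numpy(dtype="float32"))
income_scaled = normalizer(df["income"].to_numpy(dtype="float32"))

# Bucketization: map ages into discrete bins.
bucketizer = tf.keras.layers.Discretization(bin_boundaries=[30.0, 40.0, 50.0])
age_buckets = bucketizer(df["age"].to_numpy(dtype="float32"))

# Encoding: one-hot encode the categorical column.
encoder = tf.keras.layers.StringLookup(output_mode="one_hot")
encoder.adapt(df["city"].to_numpy())
city_one_hot = encoder(df["city"].to_numpy())

print(income_scaled.numpy(), age_buckets.numpy(), city_one_hot.numpy(), sep="\n")
```

An AutoML system performs this kind of inspection and transformation automatically, typically by inferring column types and choosing sensible defaults for imputation, scaling, and encoding.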
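For the synthesis step, the sketch below builds a small augmentation pipeline on CIFAR-10 using Keras preprocessing layers. The specific operations and parameters (horizontal flips, small rotations, and zooms as a stand-in for cropping and resizing) are illustrative assumptions rather than a fixed recipe:

```python
import tensorflow as tf

# Load CIFAR-10 (50,000 32x32 color training images).
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

# A small augmentation pipeline: flip, rotate, and zoom.
augmenter = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),   # rotate by up to +/- 10% of a full turn
    tf.keras.layers.RandomZoom(0.2),       # zoom in/out by up to 20%
])

# Apply the augmentation on the fly in a tf.data pipeline, so every epoch
# sees a slightly different, synthetically enlarged dataset.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10_000)
    .batch(128)
    .map(lambda x, y: (augmenter(x, training=True), y),
         num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

for images, labels in train_ds.take(1):
    print(images.shape, labels.shape)  # (128, 32, 32, 3) (128, 1)
```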