Automatic data preparation
The first stage of a typical machine learning pipeline deals with data preparation (recall the pipeline of Figure 1). Two main aspects should be taken into account: data cleansing and data synthesis.
Data cleansing is about improving the quality of data by checking for wrong data types, missing values, and errors, and by applying data normalization, bucketization, scaling, and encoding. A robust AutoML pipeline should automate as many of these mundane but extremely important steps as possible.
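To make these cleansing steps concrete, here is a minimal sketch using only NumPy. The helper names (clean_numeric, impute_median, min_max_scale, bucketize, one_hot), the toy age column, and the choice of median imputation are all assumptions made for illustration; they are not the API of any particular AutoML tool, which would typically select such steps automatically.

```python
import numpy as np

def clean_numeric(column):
    """Coerce raw values to float; missing or unparseable entries become NaN."""
    out = []
    for value in column:
        try:
            out.append(float(value))
        except (TypeError, ValueError):
            out.append(np.nan)
    return np.array(out, dtype=float)

def impute_median(x):
    """Replace NaN entries with the median of the observed values."""
    return np.where(np.isnan(x), np.nanmedian(x), x)

def min_max_scale(x):
    """Rescale values to the [0, 1] range (all zeros if the column is constant)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def bucketize(x, edges):
    """Map each value to the index of the bucket delimited by `edges`."""
    return np.digitize(x, edges)

def one_hot(labels):
    """Encode categorical labels as one-hot rows, with categories sorted."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    return categories, encoded

# Hypothetical toy column with a wrong type, a missing value, and an error.
raw_age = ["23", None, "41", "n/a", 35]
age = impute_median(clean_numeric(raw_age))
age_scaled = min_max_scale(age)
age_bucket = bucketize(age, edges=[30, 40])
categories, color_encoded = one_hot(["red", "blue", "red"])
```

Each helper covers one of the steps listed above: type checking, missing-value handling, scaling, bucketization, and encoding. In a real pipeline, the imputation strategy and bucket edges are themselves choices that an AutoML system may search over.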
Data synthesis is about generating synthetic data via augmentation for training, evaluation, and validation. Normally, this step is domain-specific. For instance, we have seen how to generate synthetic CIFAR10-like images (Chapter 4, Convolutional Neural Networks) by using cropping, rotation, resizing, and flipping operations. One can also think about generating additional images or videos via GANs (see Chapter 6, Generative Adversarial Networks)...
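As an illustration of image augmentation, the following sketch applies a random flip, a random rotation, and a random crop to a CIFAR10-sized array using NumPy alone. The augment function, the pad parameter, the restriction to 90-degree rotations, and the random stand-in image are assumptions made for this example; this is not the code used in Chapter 4, and resizing is omitted to keep the sketch dependency-free.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Return a randomly flipped, rotated, and cropped copy of an HxWxC image."""
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Rotate by a random multiple of 90 degrees in the image plane.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k=k, axes=(0, 1))
    h, w, _ = image.shape
    # Pad by reflection, then crop a window of the original size at a random offset.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = int(rng.integers(0, 2 * pad + 1))
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(seed=0)
# Random pixels standing in for one 32x32 RGB image.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [augment(image, rng) for _ in range(8)]
```

Every call produces a new variant with the same shape as the input, so the augmented samples can be fed to the same model as the originals without further changes.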