Data engineering
The objective of data engineering is to make sure that the datasets represent the real ML problem and are in the right format for ML model training. We often use statistical techniques to sample, balance, and scale datasets, and to handle missing values and outliers. This section covers the following:
- Sampling data with sub-datasets
- Balancing dataset classes
- Transforming data
Let us start with data sampling and balancing.
Data sampling and balancing
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of a larger dataset. Sampling plays an important role in data construction. When sampling data, you need to be very careful not to introduce bias, for example, by drawing records in a way that over- or under-represents part of the population. For more details, please refer to https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/sampling.
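One common way to reduce sampling bias is stratified sampling, where the same fraction of rows is drawn from every class so that the subset keeps the class proportions of the full dataset. The following is a minimal sketch using only the Python standard library; the function name stratified_sample, its parameters, and the toy dataset are illustrative assumptions, not part of any particular library.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each class.

    `label_of` is a hypothetical callback that returns the class of a row.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # Sample the same fraction from every class (at least one row per class).
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy dataset: 90 "neg" rows and 10 "pos" rows.
rows = [("neg", i) for i in range(90)] + [("pos", i) for i in range(10)]

sample = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.2)
counts = Counter(label for label, _ in sample)
print(counts)
```

With a 20% fraction, the sample keeps the original 9:1 ratio between the two classes, whereas a small simple random sample could miss the smaller class entirely.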
A classification dataset has two or more classes. We call the classes...