Cleaning and preparing data
Data preparation is a crucial step in machine learning because the quality, relevance, and suitability of the training data directly affect the accuracy and reliability of the resulting models.
General data preparation steps include the following:
- Removing null values
- Removing columns that are not needed
- Encoding categorical values (for example, the one-hot encoding that we used in some of the examples in Chapter 2)
- Feature scaling
- Splitting into test and training datasets
- Setting correct data types
- Removing duplicates
- Correcting data errors
- Removing outliers
The steps that Qlik AutoML takes care of automatically are shown in bold in the preceding list. The remaining steps can be done in Qlik Sense.
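To make the steps in the list more concrete, here is a minimal sketch of the same ideas written in Python with pandas rather than in Qlik. It is only an illustration: the dataset, the column names, and the helper functions `prepare` and `min_max_scale` are hypothetical, and the 1.5 x IQR outlier rule and min-max scaling are assumed choices, not a description of what Qlik AutoML does internally.

```python
import pandas as pd

# Hypothetical toy data: all column names and values are invented for illustration.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5, 6],
    "age": ["34", "45", "45", None, "29", "51", "38"],   # stored as text, one null
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic", "pro"],
    "monthly_spend": [50.0, 55.0, 55.0, 52.0, 60.0, 900.0, 58.0],
    "notes": ["", "vip", "vip", "", "", "", ""],          # not needed for modeling
})


def prepare(df):
    """Apply the basic cleaning steps from the list (hypothetical helper)."""
    df = df.drop(columns=["notes"])        # remove columns that are not needed
    df = df.drop_duplicates()              # remove duplicates
    df = df.dropna()                       # remove rows with null values
    df = df.astype({"age": "int64"})       # set correct data types

    # Remove outliers using the 1.5 * IQR rule (an assumed, common choice).
    q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df["monthly_spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[in_range]

    # One-hot encode the categorical column.
    df = pd.get_dummies(df, columns=["plan"], dtype=int)
    return df.reset_index(drop=True)


def min_max_scale(df, col_min, col_range, cols):
    """Scale the given columns with precomputed minimum and range values."""
    scaled = df.copy()
    for col in cols:
        scaled[col] = (df[col] - col_min[col]) / col_range[col]
    return scaled


clean = prepare(raw)

# Split into training and test datasets (75% / 25%).
test = clean.sample(frac=0.25, random_state=42)
train = clean.drop(test.index)

# Feature scaling: compute the statistics on the training data only,
# then apply the same transformation to both datasets.
num_cols = ["age", "monthly_spend"]
col_min = train[num_cols].min()
col_range = (train[num_cols].max() - col_min).replace(0, 1)  # avoid division by zero
train_scaled = min_max_scale(train, col_min, col_range, num_cols)
test_scaled = min_max_scale(test, col_min, col_range, num_cols)

print(clean)
print(train_scaled.shape, test_scaled.shape)
```

The split is done before scaling so that the minimum and range values come from the training data alone; the test data is then transformed with the same values, which keeps information about the test set from leaking into training.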
Let’s take a closer look at some of these steps using examples.
Example 1 – one-hot encoding
Let’s assume that we have the following dataset...