Preprocessing data
Preprocessing data is a technique that transforms raw data into a usable and efficient format. It is often one of the most important steps in the data mining and machine learning process, because the quality of the input data limits the quality of any model built on it.
When we preprocess data, we are cleaning it, transforming it, or reducing it. In this section, we will take a look at what each of these means.
Data cleaning
Data cleaning refers to the process of finding and fixing problems in our dataset so that it is more reliable and more efficient to work with. Cleaning a really large dataset can speed up the algorithm, help us avoid errors, and lead to better results. There are two things we deal with when data cleaning:
- Missing data: This can be handled by ignoring the affected data or by manually entering a value for the missing data.
- Noisy data: This can be reduced by using binning, regression, or clustering, among other processes.
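As a rough illustration of the two cases above, here is a minimal sketch in plain Python. The `readings` list and the helper names (`drop_missing`, `fill_missing`, `smooth_by_bin_means`) are hypothetical and chosen only for this example. Filling with the mean is one possible choice of value to enter, and smoothing by bin means is just one form of binning.

```python
from statistics import mean

# Hypothetical toy data: None marks a missing reading.
readings = [4.0, None, 8.0, 9.0, None, 15.0, 21.0, 24.0]

def drop_missing(values):
    """Ignore missing entries by leaving them out."""
    return [v for v in values if v is not None]

def fill_missing(values, fill_value):
    """Replace each missing entry with a value we supply."""
    return [fill_value if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Missing data: either ignore it or enter a value (here, the mean
# of the known readings, which is an assumption of this sketch).
dropped = drop_missing(readings)
filled = fill_missing(readings, fill_value=mean(dropped))

# Noisy data: smooth the known readings using bins of three values.
smoothed = smooth_by_bin_means(dropped, bin_size=3)

print(dropped)
print(filled)
print(smoothed)
```

Dropping keeps only observed values but shrinks the dataset, while filling keeps every record at the cost of introducing values that were never measured. Which trade-off is acceptable depends on the dataset.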
We're going to look at each of these in more detail.
Working with missing data
Let's take a look at...