Data Preprocessing
Before building a classifier, we need to format our data so that the relevant data is kept in the form most suitable for classification and the data we are not interested in is removed.
The following steps are common ways to achieve this:
- Replacing or dropping values:
For instance, if there are N/A (or NA) values in the dataset, we may be better off substituting these values with a numeric value we can handle. Recall from the previous chapter that NA stands for Not Available and that it represents a missing value. We may choose to ignore rows with NA values or replace them with an outlier value.
Note
An outlier value is a value such as -1,000,000 that clearly stands out from regular values in the dataset.
The fillna() method of a DataFrame does this type of replacement. The replacement of NA values with an outlier looks as follows:
df.fillna(-1000000, inplace=True)
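As a minimal sketch of the two options described above (ignoring rows with NA values, or replacing NA values with an outlier), the following example uses a small made-up DataFrame. The column names and values are assumptions chosen for illustration and do not come from the chapter's dataset. Unlike the call above, this sketch omits inplace=True so that the original DataFrame stays unchanged and the two results can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "open": [10.0, np.nan, 12.5],
    "volume": [100.0, 200.0, np.nan],
})

# Option 1: ignore (drop) every row that contains at least one NA value.
dropped = df.dropna()

# Option 2: replace every NA value with an outlier value.
# Without inplace=True, fillna() returns a new DataFrame.
filled = df.fillna(-1000000)

print(dropped)
print(filled)
```

Dropping rows discards the other values in those rows as well, while filling keeps every row at the cost of introducing an artificial value, so the choice depends on how much data we can afford to lose.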
The fillna() method changes all NA values into numeric values. This numeric value...