Introduction
This chapter addresses the clean subtask of the data preparation phase. CRISP-DM describes this subtask in the following way:
Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling.
This chapter can't cover the entire subject of data cleaning. Instead, it addresses three themes, all of which involve working with data that is incomplete in some way:
- Avoiding the missing data
- Imputing the missing data
- Fuzzy matching
The first two recipes address the first theme, that is, how to deal with missing data. Sometimes a null value indicates that a value is unknown, but very often a null is the only appropriate value because, for the particular case (customer), the value is not applicable. In these instances, imputation is usually not the best choice.
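The distinction between an unknown value and a value that is not applicable can be sketched in a few lines of Python. This is only an illustration, not the method used in the recipes: the customer records, the field names (has_card, card_balance), and the classify_null helper are all hypothetical, and None stands in for a null.

```python
# Hypothetical customer records; None stands in for a null value.
customers = [
    {"id": 1, "has_card": True, "card_balance": 250.0},
    {"id": 2, "has_card": True, "card_balance": None},   # has a card, balance not recorded
    {"id": 3, "has_card": False, "card_balance": None},  # no card, so no balance can exist
]


def classify_null(record):
    """Label card_balance as 'present', 'unknown', or 'not_applicable'."""
    if record["card_balance"] is not None:
        return "present"
    # A null balance is only "unknown" if the customer actually holds a card.
    return "unknown" if record["has_card"] else "not_applicable"


labels = {c["id"]: classify_null(c) for c in customers}

# Only nulls labeled 'unknown' are candidates for imputation;
# 'not_applicable' nulls are left as they are.
impute_candidates = [cid for cid, label in labels.items() if label == "unknown"]

print(labels)
print(impute_candidates)
```

Under these assumptions, only customer 2 would be considered for imputation. Customer 3 has no card, so filling in a balance would invent a value that cannot exist.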
However, when the missing...