Real data is dirty, and its integrity must be ensured before useful insights can be harvested. Missing or corrupt values can lead to spurious conclusions or cause genuine insights to go completely undiscovered. Beyond data integrity, feature scaling and variable types (that is, continuous or discrete) contribute heavily to the effectiveness of downstream methods. I will explain the reasons for these contributions in the dedicated sections for each topic.
Cleaning input data
Missing values
Missing values can ruin a data mining job. Sometimes, an entire record or row is empty, and at other times a single cell or value inside a record is missing. The latter situation is much harder to spot and, indeed, these missing cells can be quiet...
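To make the distinction between an entirely empty record and a single missing cell concrete, here is a minimal sketch in plain Python. The sample records, the column names, and the helper names is_missing and find_missing are all hypothetical, and treating None or a blank string as "missing" is an assumption; real datasets often use other sentinels.

```python
# Hypothetical sample data; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Oslo"},
    {"id": None, "age": None, "city": None},  # entirely empty record
    {"id": 3, "age": None, "city": "Lima"},   # single missing cell
    {"id": 4, "age": 51, "city": "Pune"},
]


def is_missing(value):
    """Assumption: None and blank/whitespace-only strings count as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_missing(rows):
    """Return (indices of fully empty rows, {row index: missing columns})."""
    empty_rows, partial = [], {}
    for i, row in enumerate(rows):
        missing_cols = [col for col, val in row.items() if is_missing(val)]
        if missing_cols and len(missing_cols) == len(row):
            empty_rows.append(i)       # every cell in the record is missing
        elif missing_cols:
            partial[i] = missing_cols  # only some cells are missing
    return empty_rows, partial


empty_rows, partial = find_missing(records)
print(empty_rows)  # [1]
print(partial)     # {2: ['age']}
```

Reporting the two cases separately reflects the point above: a fully empty row is easy to notice and drop, whereas a partially missing record has to be located cell by cell before deciding how to handle it.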