Section 1 – Data Cleaning and Machine Learning Algorithms
I try to avoid thinking about the parts of the model-building process sequentially, as if I were cleaning data, then preprocessing, and so on until model validation is done. I do not want to think of that process as a series of phases that ever end. We start with data cleaning in this section, but I hope its chapters convey that we are always looking ahead, anticipating modeling challenges as we clean data, and that we typically reflect back on the data cleaning we have done when we evaluate our models.
To some extent, the clean and dirty metaphor hides the nuance in preparing data for subsequent analysis. The real concern is how representative our instances and attributes (observations and variables) are of the phenomena of interest. That representativeness can always be improved, and without care it is easily made worse. One thing is for certain, though. There is nothing we can do in any other part of the model...