Introduction
Now that we have a thorough understanding of how RDDs and DataFrames work and what they can do, we can start preparing ourselves and our data for modeling.
Someone famous (Albert Einstein) once said (paraphrasing):
"The universe and the problems with any dataset are infinite, and I am not sure about the former."
The preceding is, of course, a joke. However, any dataset you work with, whether acquired at work, found online, collected yourself, or obtained through any other means, is dirty until proven otherwise. You should not trust it, play with it, or even look at it until you have proven to yourself that it is sufficiently clean (there is no such thing as totally clean).
What problems can your dataset have? Well, to name a few:
- Duplicated observations: These arise through systemic and operator faults
- Missing observations: These can emerge due to sensor problems, respondents' unwillingness to provide an answer to a question, or simply some...