Data cleaning
In this section, we will review some methods for data cleaning on Spark with a focus on data incompleteness. Then, we will discuss some of Spark's special features for data cleaning and also some data cleaning solutions made easy with Spark.
After this section, we will be able to clean data and make datasets ready for machine learning.
Dealing with data incompleteness
For machine learning, the more the data the better. However, as is often the case, the more the data, the dirtier it could be—that is, the more the work to clean the data.
There are many issues to deal with data quality control, which can be as simple as data entry errors or data duplications. In principal, the methods of treating them are similar—for example, utilizing data logic for discovery and subject matter knowledge and analytical logic to correct them. For this reason, in this section, we will focus on missing value treatment so as to illustrate our usage of Spark for this topic. Data cleaning covers data...