Data scrubbing
Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.
The result of the data analysis process depends not only on the algorithms but also on the quality of the data. That is why the next step after obtaining the data is data scrubbing. To avoid dirty data, our dataset should possess the following characteristics:
Correctness
Completeness
Accuracy
Consistency
Uniformity
Dirty data can be detected by applying simple statistical validation, by parsing text fields, or by deleting duplicate values. Missing or sparse data can lead to highly misleading results.
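The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the record fields ("age", "email") and their validity rules are hypothetical, chosen only to show format validation, range validation, and duplicate removal in one pass.

```python
# Toy record set with the three kinds of dirt mentioned above:
# an out-of-range value, an exact duplicate, and a malformed field.
records = [
    {"age": 34, "email": "ann@example.com"},
    {"age": -5, "email": "bob@example.com"},   # fails the range check
    {"age": 34, "email": "ann@example.com"},   # exact duplicate
    {"age": 28, "email": "not-an-email"},      # fails the format check
]

def is_valid(rec):
    # Range check on age plus a crude format check on the email text.
    return 0 <= rec["age"] <= 120 and "@" in rec["email"]

# Drop invalid rows, then remove duplicates while preserving order.
clean, seen = [], set()
for rec in filter(is_valid, records):
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        clean.append(rec)

print(clean)  # only the first record survives
```

In a real project the same logic would usually be expressed with a dataframe library, but the structure is identical: validate each row, then deduplicate.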
Statistical methods
With this method, we need some context about the problem (domain knowledge) to find values that are unexpected and therefore erroneous, such as values whose data type is correct but which fall outside the plausible range. This can be resolved by replacing those values with a representative value such as the mean.
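A small sketch of this statistical approach, under stated assumptions: the sensor readings and the 0 to 50 plausible range are hypothetical stand-ins for the domain knowledge the paragraph describes. Out-of-range values are treated as erroneous and replaced with the mean of the in-range values.

```python
from statistics import mean

# Hypothetical temperature readings; two entries are sensor glitches.
readings = [21.5, 22.0, 23.1, 999.0, 20.8, -40.0]

# Plausible range supplied by domain knowledge (assumed here).
LOW, HIGH = 0.0, 50.0

# Compute the replacement value from the trustworthy readings only,
# so the outliers do not distort it.
in_range = [x for x in readings if LOW <= x <= HIGH]
replacement = mean(in_range)

# Substitute the mean for every out-of-range value.
scrubbed = [x if LOW <= x <= HIGH else replacement for x in readings]
print(scrubbed)
```

Computing the mean from the in-range values first is the important design choice: averaging over the raw list would let the erroneous 999.0 and -40.0 pull the replacement value far from the true center of the data.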