Data scrubbing
Data scrubbing, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.
The result of the data analysis process depends not only on the algorithms but also on the quality of the data. That is why the step that follows obtaining the data is data scrubbing. To avoid dirty data, a dataset should possess the following characteristics:
Correctness
Completeness
Accuracy
Consistency
Uniformity
Dirty data can be detected by applying simple statistical data validation, by parsing text, or by deleting duplicate values, as sketched below. Missing or sparse data can lead to highly misleading results.
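A minimal sketch of these checks in Python with pandas follows; the dataset, column names, and validation rules here are hypothetical illustrations, not a prescribed procedure:

```python
import pandas as pd

# Hypothetical raw dataset containing duplicates, a missing value,
# and an out-of-range age.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Carol"],
    "age": [34, 29, 29, -5],
    "email": ["alice@example.com", "bob@example.com", "bob@example.com", None],
})

# Delete exact duplicate rows.
df = df.drop_duplicates()

# Simple statistical validation: flag ages outside a plausible range.
invalid_age = ~df["age"].between(0, 120)
print("Rows with invalid ages:\n", df[invalid_age])

# Parse text: a crude format check on the email column.
bad_email = df["email"].isna() | ~df["email"].str.contains("@", na=True)
print("Rows with missing or malformed emails:\n", df[bad_email])
```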
Statistical methods
In this method we need some context about the problem (the knowledge domain) to find values that are unexpected and therefore erroneous. Even when the data type matches, the values may fall outside the valid range; this can be resolved by setting such values to a mean (average) value.
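A minimal sketch of this idea in Python with pandas follows; the column name and the valid range are hypothetical assumptions standing in for real domain knowledge:

```python
import pandas as pd

# Hypothetical sensor readings; domain knowledge says valid
# temperatures lie in [-40, 60] degrees Celsius.
temps = pd.Series([21.5, 22.0, 999.0, 20.8, -273.0, 23.1],
                  name="temperature_c")

# Detect values outside the domain-given range.
in_range = temps.between(-40, 60)

# Replace the unexpected values with the mean of the valid observations.
mean_valid = temps[in_range].mean()
cleaned = temps.where(in_range, mean_valid)

print(cleaned)
```

Imputing with the mean keeps the column's overall average unchanged, though other statistics (such as the median) may be preferable when the valid values are skewed.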