Techniques for Data Cleaning
In this chapter, we will cover six key dimensions of data quality and the techniques used to improve each of them, commonly known as data cleaning techniques in machine learning. Simply put, data cleaning is the process of improving data quality by fixing errors in data or removing erroneous data. As covered in Chapters 1 and 2, reducing errors in data is a highly efficient and effective way to improve model quality compared with model-centric techniques such as adding more data and/or implementing complex algorithms.
At a high level, data cleaning techniques involve fixing or removing incorrect, incomplete, invalid, biased, inconsistent, stale, or corrupted data. Data is often captured from multiple sources, and it can vary because different annotators follow their own judgment or because of poor system design. Combining these sources can therefore result in data being mislabeled, inconsistent, duplicated, or incomplete. As discovered in earlier...