Summary
In this chapter, we gained a good understanding of the six key dimensions of data quality and why it’s important to improve data quality for superior model performance. We also explored the data-centric approach, which improves model performance by iterating over the data and improving its overall health, rather than by iterating over various algorithms (the model-centric approach).
Next, we learned how to ensure data is consistent, unique, accurate, valid, fresh, and complete. We examined various techniques for imputing missing values and when to apply each one. We concluded that imputing missing values with machine learning can be better than using simple imputation methods, especially when data is missing at random (MAR) or missing not at random (MNAR). We also showed how to conduct error analysis and how to use the results to further improve model performance, either through feature engineering, which involves building new features, or by increasing the data size by creating synthetic data...