Detecting missing data
Missing data is a common and often unavoidable issue in real-world datasets. It occurs when one or more values are absent from an observation or record. Such gaps can undermine the validity and reliability of any analysis or model built on that data. As we say in the data world: garbage in, garbage out. If the input data is flawed, the models and analyses built on it will be flawed too.
In the following sections, we will use a scenario to demonstrate how to detect missing data and how the different imputation methods work. The scenario is as follows:
Imagine you are analyzing a dataset containing information about students, including their ages and test scores. For various reasons, some of the ages and test scores are missing.
The code for this section can be found at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter08/1.detect_missing_data.py...
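To make the scenario concrete, here is a minimal sketch of how missing values in such a student dataset might be detected with pandas. It is not the code from the repository above. The column names (Age, Test_Score) and all the values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical student data; column names and values are invented
# for illustration and do not come from the book's repository.
data = {
    "Age": [18, 20, np.nan, 22, 21, np.nan],
    "Test_Score": [85.0, np.nan, 90.0, 92.0, np.nan, 88.0],
}
df = pd.DataFrame(data)

# Boolean mask: True wherever a value is missing (NaN).
missing_mask = df.isnull()

# Number of missing values in each column.
missing_counts = missing_mask.sum()

# Share of missing values in each column, as a percentage.
missing_pct = missing_mask.mean() * 100

# Rows that contain at least one missing value.
rows_with_missing = df[missing_mask.any(axis=1)]

print("Missing values per column:")
print(missing_counts)
print("\nPercentage missing per column:")
print(missing_pct.round(1))
print("\nRows with at least one missing value:")
print(rows_with_missing)
```

Counting missing values per column shows which fields are affected, while the row-level filter shows which individual records would be touched by any later imputation or deletion step.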