Performing data quality checks
Missing data are values not captured or observed in the dataset. Values can be missing for a particular feature (column), or an entire observation (row). When ingesting the data using pandas, missing values will show up as either NaN, NaT, or NA.
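As a minimal sketch of how these three markers appear in practice, the following builds a small DataFrame and counts its missing values. The column names and values are hypothetical and are not taken from the recipe's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    # float column: the missing value is shown as NaN
    "price": [3.99, np.nan, 5.25],
    # datetime column: the missing value is shown as NaT
    "date": pd.to_datetime(["2021-01-01", None, "2021-01-03"]),
    # nullable integer column: the missing value is shown as <NA>
    "clicks": pd.array([10, None, 7], dtype="Int64"),
})

print(df)

# isna() flags NaN, NaT, and NA alike, so one call covers all three.
missing_per_column = df.isna().sum()
rows_with_missing = df.isna().any(axis=1).sum()

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_missing}")
```

Because isna() treats all three markers the same way, you do not need a separate check per data type.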
Sometimes, missing observations are replaced with other values in the source system; for example, a numeric filler such as 99999 or 0, or a string such as missing or N/A. When missing values are represented by 0, you need to be cautious and investigate further to determine whether those zeros are legitimate or indicate missing data.
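One way to handle such placeholders, sketched here under assumptions, is to convert the unambiguous fillers to NaN and to count the zeros separately so they can be reviewed before any decision is made. The inline CSV content, column names, and filler values below are hypothetical stand-ins, not the recipe's actual files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a source-system export.
raw = io.StringIO(
    "date,price,clicks\n"
    "2021-01-01,3.99,120\n"
    "2021-01-02,99999,missing\n"
    "2021-01-03,0,85\n"
)

# na_values adds extra strings to the set pandas already reads as missing
# (the string N/A is in that default set).
df = pd.read_csv(raw, na_values=["missing"])

# Assumption: 99999 is never a legitimate price in this toy data,
# so it is safe to convert it to NaN.
df["price"] = df["price"].replace(99999, np.nan)

# Zeros are ambiguous: count them for review rather than replacing them.
zero_count = (df["price"] == 0).sum()

print(df)
print(df.isna().sum())
print(f"Zero prices to investigate: {zero_count}")
```

Counting zeros instead of replacing them keeps legitimate zero values intact until you have confirmed what they represent.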
In this recipe, you will explore how to identify the presence of missing data.
Getting ready
You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.
You will be using two datasets from the Ch7 folder: clicks_missing_multiple.csv and...