Performing data quality checks
Missing data are values not captured or not observed in the dataset. Values can be missing for a particular feature (column), or an entire observation (row). When ingesting the data using pandas, missing values will show up as either NaN, NaT, or NA.
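As a quick illustration of the three markers, the following minimal sketch builds a small DataFrame by hand. The column names and values are hypothetical and are not taken from the recipe's datasets; the sketch only assumes that pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    # float column: the missing value is shown as NaN
    "clicks": [100.0, np.nan, 250.0],
    # datetime column: the missing value is shown as NaT
    "date": pd.to_datetime(["2021-01-01", None, "2021-01-03"]),
    # nullable integer column: the missing value is shown as <NA>
    "price": pd.array([3, None, 5], dtype="Int64"),
})

print(df)

# isna() flags all three markers, so one call counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Number of rows that have at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)
```

Which marker you see depends on the column's data type, but isna() treats all three the same way.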
Sometimes, the source system replaces missing observations in a dataset with placeholder values; for example, a numeric filler such as 99999 or 0, or a string such as missing or N/A. When missing values are represented by 0, you need to be cautious and investigate further to determine whether those zero values are legitimate or indicate missing data.
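Placeholder values are easy to overlook because pandas does not treat them as missing. The sketch below shows one possible way to surface them, assuming you already know which fillers the source system uses. The DataFrame, its column names, and the filler list are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing source-system fillers
df = pd.DataFrame({
    "clicks": [120, 99999, 0, 87],
    "location": ["US", "missing", "N/A", "UK"],
})

# The fillers are ordinary values, so isna() finds nothing
naive_count = int(df.isna().sum().sum())
print(naive_count)

# Convert the known fillers to NaN so pandas can recognize them
cleaned = df.copy()
cleaned["clicks"] = cleaned["clicks"].replace(99999, np.nan)
cleaned["location"] = cleaned["location"].replace(["missing", "N/A"], np.nan)
print(cleaned.isna().sum())

# Zeros are only counted here, not replaced: whether a zero is a
# legitimate value or a filler needs domain knowledge to decide
zero_count = int((df["clicks"] == 0).sum())
print(zero_count)
```

The zeros are deliberately left untouched: counting them first lets you judge whether their frequency is plausible before deciding to treat them as missing.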
In this recipe, you will explore how to identify the presence of missing data.
Getting ready
You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.
You will be using two datasets from the Ch7 folder: clicks_missing_multiple...