Functions for checking overall data quality
We can tighten up our data quality checks by being more explicit and upfront about what we are evaluating. Very early in a data analysis project, we likely have some expectations about the distribution of variable values, the range of allowable values, and the number of missing values. These expectations may come from documentation, our knowledge of the underlying real-world processes represented by the data, or our understanding of statistics. It is a good idea to have a routine for delineating those initial assumptions, testing them, and then revising them throughout a project. This recipe will demonstrate what that process might look like.
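To make the idea of writing down assumptions and then testing them concrete, here is a minimal sketch in plain Python. It is not the code this recipe uses; the variable names (`region`, `age`), the thresholds, and the helper functions `check_variable` and `check_data` are all hypothetical, chosen only for illustration. Skewness is computed with a simple population formula, which may differ slightly from the sample-adjusted value a statistics library reports.

```python
import math
from statistics import mean, pstdev

# Hypothetical targets, written down before exploring the data in depth.
targets = {
    "region": {"type": "categorical",
               "allowed": {"north", "south", "east", "west"},
               "max_missing": 0.05},
    "age": {"type": "numeric", "min": 0, "max": 110,
            "max_missing": 0.10, "max_abs_skew": 1.0},
}

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def skewness(values):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    m, sd = mean(values), pstdev(values)
    if sd == 0:
        return 0.0
    return mean((v - m) ** 3 for v in values) / sd ** 3

def check_variable(values, target):
    """Return a list of problems found; an empty list means every target was met."""
    problems = []
    present = [v for v in values if not is_missing(v)]
    missing_share = 1 - len(present) / len(values) if values else 0.0
    if missing_share > target["max_missing"]:
        problems.append(
            f"missing share {missing_share:.2f} exceeds {target['max_missing']:.2f}")
    if target["type"] == "categorical":
        unexpected = sorted(set(present) - target["allowed"])
        if unexpected:
            problems.append(f"unexpected values: {unexpected}")
    else:
        low, high = target["min"], target["max"]
        out_of_range = [v for v in present if not low <= v <= high]
        if out_of_range:
            problems.append(f"{len(out_of_range)} value(s) outside [{low}, {high}]")
        if len(present) > 2 and abs(skewness(present)) > target["max_abs_skew"]:
            problems.append(f"absolute skewness exceeds {target['max_abs_skew']}")
    return problems

def check_data(data, targets):
    """Check every variable that has a target; data maps names to lists of values."""
    return {name: check_variable(data[name], t) for name, t in targets.items()}

# Made-up example data with a misspelled category, a missing value, and an outlier.
data = {
    "region": ["north", "south", "nrth", None],
    "age": [25, 31, 47, 130, 52],
}
report = check_data(data, targets)
for name, problems in report.items():
    print(name, problems if problems else "OK")
```

Keeping the targets in a single dictionary, separate from the checking logic, makes it easy to revise an assumption later in the project: only the target changes, and the same checks can be rerun.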
We set up data quality targets for each variable of interest. For categorical variables, this includes allowable values and thresholds for missing values. For numeric variables, it includes ranges of values; thresholds for missing values, skewness, and kurtosis; and checks for outliers. We will...