Checking data quality
Checking data quality is a critical part of a machine learning system: it protects the integrity and correctness of both model training and inference. The principles of software testing and quality assurance should be borrowed and applied to the data layer of machine learning platforms.
From a data quality perspective, there are three critical dimensions along which to assess and profile a dataset:
- Schema compliance: Ensuring the data is of the expected types, for example that numeric columns contain only numeric values
- Valid data: Assessing whether the values in the data are valid from a business perspective
- Missing data: Assessing whether all the data needed to run analytics and algorithms is available
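To make the three dimensions concrete, here is a minimal sketch in plain Python. It does not use Great Expectations; the sample records, the column names, the `profile` helper, and the business rule that a closing price must be positive are all hypothetical and chosen only for illustration.

```python
from numbers import Real

# Hypothetical records; the column names and values are illustrative only.
rows = [
    {"ticker": "AAA", "close": 101.5},   # passes all three checks
    {"ticker": "BBB", "close": "n/a"},   # wrong type: schema violation
    {"ticker": "CCC", "close": -3.0},    # right type, but invalid for the business
    {"ticker": "DDD", "close": None},    # missing value
]


def is_number(value):
    """Return True for real numbers, excluding bool (a subclass of int)."""
    return isinstance(value, Real) and not isinstance(value, bool)


def profile(rows, column):
    """Count the rows that fail each check for one numeric column."""
    report = {"missing": 0, "schema_violations": 0, "invalid_values": 0}
    for row in rows:
        value = row.get(column)
        if value is None:
            # Missing data: the value is absent or null.
            report["missing"] += 1
        elif not is_number(value):
            # Schema compliance: the value is present but not numeric.
            report["schema_violations"] += 1
        elif value <= 0:
            # Valid data: assumed business rule that prices are positive.
            report["invalid_values"] += 1
    return report


report = profile(rows, "close")
print(report)
```

Each row is counted under at most one heading, checked in the order missing, schema, validity, so a null value is not also reported as a type error. A real pipeline would typically express the same checks declaratively rather than in a hand-written loop.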
For data validation, we will use the Great Expectations Python package (available at https://github.com/great-expectations/great_expectations). It allows making assertions on data with many...