Managing data integrity with Great Expectations
Great Expectations is a third-party tool that allows you to capture and define the properties of a dataset. You can save these properties and then use them to validate future data to ensure data integrity. This is particularly useful when building machine learning models, because previously unseen categorical values and numeric outliers tend to make a model perform poorly or fail outright.
In this section, we will use the Kaggle dataset to build an expectation suite that tests and validates the data.
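Before turning to the library itself, the underlying idea of capturing properties from known-good data and checking later data against them can be sketched in plain Python. This is only an illustration of the concept, not the Great Expectations API: the helper names `capture_profile` and `validate`, the column names, and the sample rows are all hypothetical.

```python
def capture_profile(rows, categorical, numeric):
    """Record the categories and numeric ranges seen in the reference rows."""
    profile = {}
    for col in categorical:
        profile[col] = {"kind": "set", "values": {row[col] for row in rows}}
    for col in numeric:
        values = [row[col] for row in rows]
        profile[col] = {"kind": "range", "min": min(values), "max": max(values)}
    return profile


def validate(rows, profile):
    """Return (row_index, column, value) for every value that breaks the profile."""
    failures = []
    for i, row in enumerate(rows):
        for col, rule in profile.items():
            value = row[col]
            if rule["kind"] == "set":
                ok = value in rule["values"]
            else:
                ok = rule["min"] <= value <= rule["max"]
            if not ok:
                failures.append((i, col, value))
    return failures


# Hypothetical reference data standing in for a cleaned survey table.
reference = [
    {"country": "US", "age": 25},
    {"country": "India", "age": 40},
    {"country": "Germany", "age": 31},
]
profile = capture_profile(reference, categorical=["country"], numeric=["age"])

# A later batch containing one unseen category and one numeric outlier.
new_batch = [
    {"country": "US", "age": 30},
    {"country": "Brazil", "age": 28},
    {"country": "India", "age": 120},
]
failures = validate(new_batch, profile)
print(failures)
```

Both problem rows are reported here: the unseen category and the out-of-range age. These are the kinds of values that would otherwise reach a model unnoticed. An expectation suite plays the role of `profile` in this sketch, with a much richer set of checks.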
How to do it…
- Read the data using the tweak_kag function previously defined:
  >>> kag = tweak_kag(df)
- Use the Great Expectations from_pandas function to read in a Great Expectations DataFrame (a subclass of DataFrame with some extra methods):
  >>> import great_expectations as ge
  >>> kag_ge = ge.from_pandas(kag)
- Examine the extra methods on the DataFrame: ...