Data validation
The “garbage in, garbage out” principle in computing says that no matter how great your code may be, if you start with poor-quality data, your analysis will yield poor-quality results. All too often, data practitioners struggle with issues like unexpected missing data, duplicate values, and broken relationships between modeling entities.
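To make these failure modes concrete, here is a minimal sketch of checking each one by hand with plain pandas. The orders and customers tables, along with their column names, are hypothetical and invented for this illustration; they are not part of the vehicles dataset used in this recipe.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "customer_id": [10, 11, 11, 99],
    "amount": [25.0, None, 13.5, 40.0],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Unexpected missing data: count the nulls in each column
missing_counts = orders.isna().sum()

# Duplicate values: every row whose key appears more than once
duplicate_keys = orders[orders["order_id"].duplicated(keep=False)]

# Broken relationships: orders that point at a customer we do not know about
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

print(missing_counts)
print(duplicate_keys)
print(orphans)
```

Hand-written checks like these are fine for a one-off inspection, but they have to be rewritten and rerun for every new dataset.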
Fortunately, there are tools that help you automate the validation of both the data that goes into your models and the data that comes out of them, which helps build trust in the work that you are performing. In this recipe, we are going to look at Great Expectations.
Great Expectations
This book was written using Great Expectations version 1.0.2. To get started, let’s once again look at our vehicles dataset:
df = pd.read_csv(
    "data/vehicles.csv.zip",
    dtype_backend="numpy_nullable",
    dtype={
        "rangeA": pd.StringDtype(),
        "mfrCode": pd.StringDtype(),
        "c240Dscr...