Defining constraints
In the previous section, we looked at examples of how Deequ can automatically suggest constraints and how we can gather various metrics about the data. We will now define the actual constraints that we expect the dataframe to pass. In the following code, we expect the flights data to satisfy these constraints:
- The airline column should not contain any NULL values
- The flight_number column should not contain any NULL values
- The cancelled column should contain only 0 or 1
- The distance column should not contain any negative value
- The cancellation_reason column should contain only A, B, C, or D
If all of the checks pass, we print data looks good on the console; otherwise, we print each constraint along with its result status.
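Before walking through the full code, it may help to see roughly how the five constraints listed above map onto Deequ's check API. The following is a minimal sketch, not necessarily the exact code used in this chapter: the object name FlightsChecksSketch, the method verifyFlights, the parameter flightsDf, and the check description string are hypothetical names introduced for illustration, and the sketch assumes the Deequ and Spark SQL libraries are on the classpath.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper object; the names here are illustrative only.
object FlightsChecksSketch {

  // flightsDf is assumed to be a dataframe holding the flights data.
  def verifyFlights(flightsDf: DataFrame): Unit = {
    // One Check that bundles the five constraints described above.
    val check = Check(CheckLevel.Error, "flights data checks")
      .isComplete("airline")                        // no NULL values
      .isComplete("flight_number")                  // no NULL values
      .isContainedIn("cancelled", Array("0", "1"))  // only 0 or 1
      .isNonNegative("distance")                    // no negative values
      .isContainedIn("cancellation_reason", Array("A", "B", "C", "D"))

    // Run the check against the dataframe.
    val result = VerificationSuite()
      .onData(flightsDf)
      .addCheck(check)
      .run()

    if (result.status == CheckStatus.Success) {
      println("data looks good")
    } else {
      // Print every constraint together with its result status.
      result.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }
        .foreach { r => println(s"${r.constraint}: ${r.status}") }
    }
  }
}
```

Grouping all five constraints into a single Check keeps the pass/fail decision in one place: the overall status is a success only when every constraint holds.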
Here is the code for it.
As a first step, we will create a dataframe using the flights table we loaded in MySQL:
val session = Spark.initSparkSession("de-with-scala...