Performing data analysis
Deequ can generate statistics, called metrics, on data. For example, we can use Deequ to count the records in a dataset, tell us whether a particular column is unique, measure the degree of correlation between columns, and so on. Deequ offers this functionality through case classes such as ApproxCountDistinct, Completeness, Correlation, and so on, defined in the com.amazon.deequ.analyzers package. For a complete list of metrics along with their definitions, please refer to https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/.
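Before turning to real data, it can help to see how these analyzer case classes are combined. The following is only a minimal sketch on a small in-memory dataframe, assuming Spark and Deequ are on the classpath. The object name AnalyzerSketch, the sample rows, and the columns flight_id, departure_delay, and arrival_delay are hypothetical and chosen purely for illustration:

```scala
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Correlation, Size, Uniqueness}
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.{DataFrame, SparkSession}

object AnalyzerSketch {

  // Builds a tiny, made-up dataset and computes a handful of metrics on it.
  def computeMetrics(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample rows; the third row has a missing airline.
    val sample = Seq(
      (1, Option("AA"), "JFK", 5.0, 7.0),
      (2, Option("DL"), "ATL", 0.0, 1.0),
      (3, Option.empty[String], "JFK", 12.0, 15.0),
      (4, Option("UA"), "ORD", 3.0, 2.0)
    ).toDF("flight_id", "airline", "origin_airport", "departure_delay", "arrival_delay")

    // Each analyzer is a case class from com.amazon.deequ.analyzers.
    val result = AnalysisRunner
      .onData(sample)
      .addAnalyzer(Size())                                          // number of records
      .addAnalyzer(Completeness("airline"))                         // fraction of non-NULL values
      .addAnalyzer(Uniqueness(Seq("flight_id")))                    // fraction of values occurring exactly once
      .addAnalyzer(ApproxCountDistinct("origin_airport"))           // approximate distinct count
      .addAnalyzer(Correlation("departure_delay", "arrival_delay")) // Pearson correlation
      .run()

    // Convert the successfully computed metrics into a dataframe.
    successMetricsAsDataFrame(spark, result)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyzer-sketch")
      .getOrCreate()

    computeMetrics(spark).show(truncate = false)
    spark.stop()
  }
}
```

Each analyzer contributes one row to the resulting dataframe, so adding or removing a metric is a matter of adding or removing a single addAnalyzer call.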
In the following example, we will be using the flight data that we loaded into a MySQL table named flights. We analyze the flights data to obtain the record count, check whether the airline column contains any NULL values, compute an approximate distinct count of origin_airport, and so on. The result set is then converted into a dataframe and finally printed on the screen:
package com...