Once we have a decent understanding of the data and its main properties, the next step is to look for concrete relationships between data elements. We can use well-established statistical techniques to understand how the data is distributed and how its elements relate to one another.
Let's continue with our Spark example from the previous section by comparing Total Population to Total Households. We can expect the two numbers to be strongly correlated:
println("Covariance: " + df.stat.cov("Total Population", "Total Households"))
println("Correlation: " + df.stat.corr("Total Population", "Total Households"))
The output looks something like this:
Covariance: 1.2338126298368526E8
Correlation: 0.9090567549637986
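The covariance is large because it carries the units and scale of both columns, whereas the correlation is rescaled to lie between -1 and 1. Computing both by hand makes the link visible. The following is a minimal plain-Scala sketch, not Spark's implementation: the object name CorrelationSketch and the sample values are made up for illustration, and it assumes the sample (n - 1) form of covariance and the Pearson form of correlation, which is how Spark documents df.stat.cov and df.stat.corr.

```scala
// Plain-Scala sketch (no Spark required). The object name and the sample
// values are hypothetical and are not taken from the dataset in the text.
object CorrelationSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample covariance: average product of deviations, dividing by n - 1.
  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1,
      "need two equally sized samples with at least two points")
    val (mx, my) = (mean(xs), mean(ys))
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.size - 1)
  }

  // Pearson correlation: covariance divided by both standard deviations.
  // covariance(xs, xs) is the sample variance of xs.
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double =
    covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

  def main(args: Array[String]): Unit = {
    val population = Seq(1200.0, 5400.0, 800.0, 15000.0, 9700.0)
    val households = Seq(500.0, 2100.0, 350.0, 5200.0, 4100.0)
    println("Covariance: " + covariance(population, households))
    println("Correlation: " + correlation(population, households))
  }
}
```

Rescaling one column (for example, counting households in thousands) changes the covariance by the same factor but leaves the correlation untouched, which is why the correlation is the easier of the two numbers to interpret.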
As expected, the correlation coefficient is close to 1, indicating a...