Computing correlations
Features correlated with the outcome are desirable, but those that are also correlated among themselves can make the model unstable. In this recipe, we will show you how to calculate correlations between features.
Getting ready
To execute this recipe, you need a working Spark environment. We will also be working with the no_outliers DataFrame we created in the Handling outliers recipe, so we assume you have followed the steps to handle duplicates, missing observations, and outliers.
No other prerequisites are required.
How to do it...
To calculate the correlation between two features, all you have to do is provide their names:
no_outliers.corr('Cylinders', 'Displacement')
That's it!
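If more than one pair of features is of interest, the same call can be repeated for every pair. The following is a minimal sketch rather than part of the recipe: the helper name pairwise_correlations is our own, and it only assumes that the object passed in exposes a .corr(col1, col2) method, as the no_outliers DataFrame does.

```python
from itertools import combinations


def pairwise_correlations(df, features):
    """Return {(a, b): coefficient} for every unordered pair of feature names.

    `df` is assumed to be any object with a .corr(col1, col2) method,
    for example a Spark DataFrame. This helper is hypothetical and is
    not part of the Spark API.
    """
    return {
        (a, b): df.corr(a, b)  # one .corr(...) call per pair
        for a, b in combinations(features, 2)
    }
```

Calling pairwise_correlations(no_outliers, ['Cylinders', 'Displacement']) would return a dictionary with a single entry keyed by ('Cylinders', 'Displacement'). Each pair triggers its own .corr(...) call, so this approach is best kept to a handful of features.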
How it works...
The .corr(...) method takes two required parameters: the names of the two features whose correlation coefficient you want to calculate.
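For intuition about the number that comes back, here is a minimal pure-Python sketch of the Pearson correlation coefficient for two small lists. The function name pearson is our own and not part of Spark; Spark computes the statistic over distributed data, so this only illustrates the arithmetic, not how Spark implements it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Illustrative only: raises ZeroDivisionError if either sequence
    is constant (zero variance).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)
```

The result always lies between -1 and 1: values near 1 mean the two features rise together, values near -1 mean one falls as the other rises, and values near 0 indicate little linear relationship.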
Note
Currently, only the Pearson correlation coefficient is available.
The preceding command will produce a correlation coefficient...