Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Training Systems Using Python Statistical Modeling

You're reading from   Training Systems Using Python Statistical Modeling Explore popular techniques for modeling your data in Python

Arrow left icon
Product type Paperback
Published in May 2019
Publisher Packt
ISBN-13 9781838823733
Length 290 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Curtis Miller Curtis Miller
Author Profile Icon Curtis Miller
Curtis Miller
Arrow right icon
View More author details
Toc

Finding correlations

In the final section of this chapter, we will learn about computing correlations using pandas and SciPy. We will look at how to use pandas and SciPy to compute correlations in datasets, and also explore some statistical tests to detect correlation.

In this section, I have used Pearson's correlation coefficient, which quantifies how strongly two variables are linearly correlated. This is a unitless number that takes values between -1 and 1. The sign of the correlation coefficient indicates the direction of the relationship. A positive r indicates that as one variable increases, the other tends to increase; while a negative r indicates that as one variable increases, the other tends to decrease. The magnitude indicates the strength of the relationship. If r is close to 1 or -1, then the relationship is almost a perfect linear relationship, while an r that is close to 0 indicates no linear relationship. The NumPy corrcoef() function computes the number for two NumPy arrays.

So, here's the correlation coefficient's definition. We're going to work with what's known as the Boston housing price dataset. This is known as a toy dataset, and is often used for evaluating statistical learning methods. I'm only interested in looking at correlations between the variables in this dataset.

Note that we will be using a slightly modified version of the dataset, which can be found in the GitHub repository for this chapter.

We will use the following steps:

  1. We will first load up all the required libraries, as follows:
  1. We will then load in the dataset using the following code:
  1. Then, we will print the dataset information, so that we can have a look at it, as follows:

So, these attributes tell us what this dataset contains. It has a few different variables such as, for example, the crime rate by town, the proportion of residential land zones, non-retail business acres, and others. Therefore, we have a number of different attributes for you to play with.

  1. We will now load the actual dataset, as follows:

This is going to very useful to us. Now, let's take a look at the correlation between the two variables.

  1. We're going to import the corrcoef() function, and we're going to look at one of the columns from this dataset as if it were a NumPy array, as follows:

It started out as a NumPy array, but it's perfectly fine to just grab a pandas series and turn it into an array.

  1. Now we're going to take corrcoef(), and compute the correlation between the crime rate and the price:

We can see that the numbers in the matrix are the same.

The number in the off-diagonal entries corresponds to the correlation between the two variables. Therefore, there is a negative relationship between crime rate and price, which is not surprising. As the crime rate increases, you would expect the price to decrease. But this is not a very strong correlation. We often want correlations not just for two variables, but for many combinations of two variables. We can use the pandas DataFrame corr() method to compute what's known as the correlation matrix. This matrix contains correlations for every combination of variables in our dataset.

  1. So, let's go ahead and compute a correlation matrix for the Boston dataset, using the corr() function:

These are all correlations. Every entry off the diagonal will have a corresponding entry on the other side of the diagonal—this is a symmetric matrix. Now, while this matrix contains all the data that we might want, it is very difficult to read.

  1. So, let's use a package called seaborn, as this is going to allow us to plot a heatmap. A heatmap has colors with differing intensities for how large or small a number tends to be. So, we compute a heatmap and plot it as follows:

The preceding screenshot shows the resulting map. Here, black indicates correlations close to -1, and very light colors indicate correlations that are close to 1. From this heatmap, we can start to see some patterns.

Testing for correlation

SciPy contains the pearsonr() function in its statistics module, which not only computes the correlation between two variables but also allows for hypothesis testing to detect correlation. Rejecting the null hypothesis will signify that the correlation between the two variables, in general, is not 0. However, be aware that rejecting the null hypothesis does not automatically signify a meaningful relationship between the two variables. It's possible that the correlation is not 0, but still too small to be meaningful. This is especially true when performing the test on large samples.

We're going to do a statistical test using the following hypotheses:

We import the pearsonr() function, and we run it on crime and price to see whether these things are correlated with statistical significance:

Here, we have a very small p value, which suggests that yes, indeed, the crime rate and the price of a home appear to be correlated. If we had to take a guess, then we would say that it would be negatively correlated.

You have been reading a chapter from
Training Systems Using Python Statistical Modeling
Published in: May 2019
Publisher: Packt
ISBN-13: 9781838823733
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image