Chapter 5: Predicting Customer Revenue Using Linear Regression
Activity 8: Examining Relationships between Storefront Locations and Features about their Area
Load the data from location_rev.csv and then take a look at it:
import pandas as pd
df = pd.read_csv('location_rev.csv')
df.head()
Use seaborn's pairplot function to visualize the data and its relationships:
import seaborn as sns
%matplotlib inline
sns.pairplot(df)
Use correlations to investigate the relationship between the different variables and the location revenue:
df.corr()
The resulting correlations should make sense. The more competitors there are in the area, the lower a location's revenue, whereas median income, the number of loyalty members, and population density are all positively correlated with revenue. A location's age is also positively correlated with revenue, which suggests that the longer a location is open, the better known it is and the more customers it attracts (or perhaps that only locations that do well last a long time).
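To read the correlations with revenue at a glance, it can help to pull out just the revenue column of the correlation matrix and sort it. The following is a minimal sketch on a small synthetic DataFrame; the name toy and all of its values are made up for illustration and are not taken from location_rev.csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'num_competitors': rng.integers(0, 10, size=n),
    'median_income': rng.normal(40000, 8000, size=n),
})
# Revenue is constructed to fall with competitors and rise with income, plus noise
toy['revenue'] = (-2000 * toy['num_competitors']
                  + 0.5 * toy['median_income']
                  + rng.normal(0, 1000, size=n))

# Keep only each column's correlation with revenue, sorted from most negative
# to most positive
revenue_corr = toy.corr()['revenue'].drop('revenue').sort_values()
print(revenue_corr)
```

The same one-liner should work on the real df, assuming all of its columns are numeric.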
Activity 9: Building a Regression Model to Predict Storefront Location Revenue
Import the data from location_rev.csv and view the first few rows:
import pandas as pd
df = pd.read_csv('location_rev.csv')
df.head()
Create a variable, X, with the predictors in it, and store the outcome (revenue) in a separate variable, y:
X = df[['num_competitors',
        'median_income',
        'num_loyalty_members',
        'population_density',
        'location_age']]
y = df['revenue']
Split the data into training and test sets. Use random_state = 100:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
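As a quick check of what this split does, the sketch below runs train_test_split on a toy array of 100 rows. The names X_toy and y_toy are hypothetical and stand in for the real predictors and outcome:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (made up for illustration)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# With no test_size given, scikit-learn holds out 25% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=100)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=100)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state is what makes the results in the remaining steps reproducible from run to run.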
Create a linear regression model and fit it on the training data:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Print out the model coefficients:
model.coef_
Print out the model intercept:
model.intercept_
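The raw coefficient array is easier to interpret when each value is paired with its column name. Here is a minimal sketch, assuming a toy dataset in which revenue is an exact linear function of two predictors; toy_X, toy_y, and the numbers 20000, -1500, and 800 are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy predictors (made up; not the real location data)
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    'num_competitors': rng.integers(0, 10, size=50),
    'location_age': rng.integers(1, 30, size=50),
})
# Noise-free outcome: revenue = 20000 - 1500 * competitors + 800 * age
toy_y = 20000 - 1500 * toy_X['num_competitors'] + 800 * toy_X['location_age']

toy_model = LinearRegression().fit(toy_X, toy_y)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns)
print(coef_table)
print(toy_model.intercept_)
```

Because the toy outcome has no noise, the fitted coefficients and intercept should match the values used to build it. On the real model, the equivalent call would be pd.Series(model.coef_, index=X.columns).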
Produce a prediction for a location that has 3 competitors; a median income of 30,000; 1,200 loyalty members; a population density of 2,000; and a location age of 10:
single_location = pd.DataFrame({
    'num_competitors': [3],
    'median_income': [30000],
    'num_loyalty_members': [1200],
    'population_density': [2000],
    'location_age': [10]
})
model.predict(single_location)
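A linear regression prediction is just the intercept plus the sum of each coefficient times its feature value. The sketch below checks this by hand on a tiny, made-up dataset; toy_X, toy_y, and new_location are hypothetical names and values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny invented dataset, for illustration only
toy_X = pd.DataFrame({
    'num_competitors': [1, 2, 3, 4, 5, 6],
    'location_age': [5, 3, 8, 1, 9, 4],
})
toy_y = pd.Series([42000, 37000, 39500, 30000, 36000, 29000])

toy_model = LinearRegression().fit(toy_X, toy_y)

# One new location, with the same column names and order as the training data
new_location = pd.DataFrame({'num_competitors': [3], 'location_age': [10]})

# Prediction from scikit-learn
sklearn_pred = toy_model.predict(new_location)[0]

# The same prediction computed by hand: intercept + coefficients . features
manual_pred = toy_model.intercept_ + np.dot(toy_model.coef_,
                                            new_location.iloc[0].to_numpy())
print(sklearn_pred, manual_pred)
```

The two printed values should agree up to floating-point rounding.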
Plot the model predictions versus the true values on the test data:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(model.predict(X_test), y_test)
plt.xlabel('Model Predictions')
plt.ylabel('True Value')
plt.plot([0, 100000], [0, 100000], 'r-')
plt.show()
Calculate the correlation between the model predictions and the true values of the test data:
from scipy.stats import pearsonr
pearsonr(model.predict(X_test), y_test)
The first number shows an extremely high correlation value (just over 0.9, where 1.0 would be a perfect correlation). The second number shows an extremely small p-value, indicating that it's very unlikely that this correlation is due to chance. Taken together, this indicates that our model is working very well on the test data.
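The two numbers returned by pearsonr can be unpacked into separate variables, and the correlation itself can be cross-checked against NumPy. This is a minimal sketch on synthetic values; true_vals and preds are invented stand-ins for y_test and model.predict(X_test):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic "true values" and noisy "predictions" (assumptions, not real results)
rng = np.random.default_rng(2)
true_vals = rng.normal(50000, 10000, size=100)
preds = true_vals + rng.normal(0, 3000, size=100)

# pearsonr returns the correlation coefficient and its p-value
r, p_value = pearsonr(preds, true_vals)

# Cross-check: the off-diagonal entry of NumPy's correlation matrix is the same r
r_numpy = np.corrcoef(preds, true_vals)[0, 1]
print(r, p_value, r_numpy)
```

Keeping r and p_value in named variables makes it easier to report them separately than reading them off the raw result.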