Chapter 6: Other Regression Techniques and Tools for Evaluation
Activity 10: Testing Which Variables are Important for Predicting Responses to a Marketing Offer
Import pandas, read in the data from offer_responses.csv, and use the head function to view the first five rows of the data:
    import pandas as pd

    df = pd.read_csv('offer_responses.csv')
    df.head()
Import train_test_split from sklearn and use it to split the data into a training and test set, using responses as the y variable and all others as the predictor (X) variables. Use random_state=10 for train_test_split:
    from sklearn.model_selection import train_test_split

    X = df[['offer_quality', 'offer_discount', 'offer_reach']]
    y = df['responses']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
Import LinearRegression and mean_squared_error from sklearn. Fit a model to the training data (using all of the predictors), get predictions from the model on the test data, and print out the calculated RMSE on the test data:
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # RMSE is the square root of the mean squared error
    print('RMSE with all variables: ' + str(mean_squared_error(y_test, predictions)**0.5))
Create X_train2 and X_test2 by dropping offer_quality from X_train and X_test. Train and evaluate the RMSE of the model using X_train2 and X_test2:
    X_train2 = X_train.drop('offer_quality', axis=1)
    X_test2 = X_test.drop('offer_quality', axis=1)

    model = LinearRegression()
    model.fit(X_train2, y_train)
    predictions = model.predict(X_test2)
    print('RMSE without offer quality: ' + str(mean_squared_error(y_test, predictions)**0.5))
Perform the same sequence of steps as in the previous step, but this time drop offer_discount instead of offer_quality:
    X_train3 = X_train.drop('offer_discount', axis=1)
    X_test3 = X_test.drop('offer_discount', axis=1)

    model = LinearRegression()
    model.fit(X_train3, y_train)
    predictions = model.predict(X_test3)
    print('RMSE without offer discount: ' + str(mean_squared_error(y_test, predictions)**0.5))
Perform the same sequence of steps, but this time dropping offer_reach:
    X_train4 = X_train.drop('offer_reach', axis=1)
    X_test4 = X_test.drop('offer_reach', axis=1)

    model = LinearRegression()
    model.fit(X_train4, y_train)
    predictions = model.predict(X_test4)
    print('RMSE without offer reach: ' + str(mean_squared_error(y_test, predictions)**0.5))
You should notice that the RMSE went up when offer_reach or offer_discount was removed from the model, but remained about the same when offer_quality was removed. This suggests that offer_quality isn't contributing to the accuracy of the model and could be safely removed to simplify the model.
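The drop-one-variable comparison above can also be written as a loop. The following is a minimal sketch, not the activity's official solution: the helper name rmse_without is hypothetical, and the data is a synthetic stand-in for offer_responses.csv, generated so that the response depends on offer_discount and offer_reach but not on offer_quality.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_without(X_train, X_test, y_train, y_test, dropped=None):
    """Fit a linear model with one column left out and return its test RMSE."""
    cols = [c for c in X_train.columns if c != dropped]
    model = LinearRegression().fit(X_train[cols], y_train)
    predictions = model.predict(X_test[cols])
    return mean_squared_error(y_test, predictions) ** 0.5


# Synthetic stand-in data (made up for illustration only):
# responses depend on offer_discount and offer_reach, not on offer_quality.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'offer_quality': rng.normal(size=n),
    'offer_discount': rng.normal(size=n),
    'offer_reach': rng.normal(size=n),
})
y = 3.0 * X['offer_discount'] + 2.0 * X['offer_reach'] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Baseline with every predictor, then one run per dropped column
results = {'(none)': rmse_without(X_train, X_test, y_train, y_test)}
for col in X.columns:
    results[col] = rmse_without(X_train, X_test, y_train, y_test, dropped=col)

for dropped, rmse in results.items():
    print('RMSE without ' + dropped + ': ' + str(rmse))
```

To run this against the real data, replace the synthetic X and y with the columns read from offer_responses.csv; the loop itself does not change.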
Activity 11: Using Lasso Regression to Choose Features for Predicting Customer Spend
Import pandas, use it to read the data in customer_spend.csv, and use the head function to view the first five rows of data:
    import pandas as pd

    df = pd.read_csv('customer_spend.csv')
    df.head()
Use train_test_split from sklearn to split the data into training and test sets, with random_state=100 and cur_year_spend as the y variable:
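    from sklearn.model_selection import train_test_split

    # Use every column except the first as a predictor
    # (this assumes cur_year_spend is the first column in the file)
    cols = df.columns[1:]
    X = df[cols]
    y = df['cur_year_spend']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)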
Import Lasso from sklearn and fit a lasso model (with normalize=True and random_state=10) to the training data:
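    from sklearn.linear_model import Lasso

    # Note: the normalize parameter was deprecated in scikit-learn 1.0 and
    # removed in 1.2, so this call requires an older scikit-learn version
    lasso_model = Lasso(normalize=True, random_state=10)
    lasso_model.fit(X_train, y_train)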
Get the coefficients from the lasso model, and store the names of the features that have non-zero coefficients along with their coefficient values in the selected_features and selected_coefs variables, respectively:
    coefs = lasso_model.coef_
    # Keep every non-zero coefficient, including negative ones
    selected_features = cols[coefs != 0]
    selected_coefs = coefs[coefs != 0]
Print out the names of the features with non-zero coefficients and their associated coefficient values using the following code:
    for coef, feature in zip(selected_coefs, selected_features):
        print(feature + ' coefficient: ' + str(coef))
From the output, we can see not only which variables are important, but also the effect that they have. For example, for each dollar a customer spent in the previous year, we can expect a customer to spend approximately $0.80 this year, everything else being equal.
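The selection of non-zero lasso coefficients can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the activity's solution: the column names and the data are made up, and because newer scikit-learn versions no longer accept normalize=True, the sketch standardizes features with StandardScaler in a pipeline instead. That scaling is not identical to what normalize=True did, so the coefficients are not directly comparable with the activity's output.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, for illustration only):
# one positive effect, one negative effect, and one unrelated column.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    'prev_year_spend': rng.normal(100, 20, size=n),
    'days_since_last_purchase': rng.normal(30, 10, size=n),
    'unrelated_noise': rng.normal(size=n),
})
y = (0.8 * X['prev_year_spend']
     - 0.5 * X['days_since_last_purchase']
     + rng.normal(scale=2.0, size=n))

# Standardize the features, then fit the lasso on the scaled values
pipeline = make_pipeline(StandardScaler(), Lasso(alpha=1.0, random_state=10))
pipeline.fit(X, y)

coefs = pipeline.named_steps['lasso'].coef_
mask = coefs != 0  # non-zero, so negative coefficients are kept too
selected_features = X.columns[mask]
selected_coefs = coefs[mask]

# The coefficients above are per standard deviation of each feature;
# dividing by the scaler's scale_ converts them back to per-unit effects.
scale = pipeline.named_steps['standardscaler'].scale_
per_unit_coefs = selected_coefs / scale[mask]

for feature, coef in zip(selected_features, per_unit_coefs):
    print(feature + ' coefficient (per unit): ' + str(coef))
```

Because the lasso penalty shrinks coefficients toward zero, the per-unit values printed here are somewhat smaller in magnitude than the 0.8 and -0.5 used to generate the data.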
Activity 12: Building the Best Regression Model for Customer Spend Based on Demographic Data
Import pandas, read the data in spend_age_income_ed.csv into a DataFrame, and use the head function to view the first five rows of the data:
    import pandas as pd

    df = pd.read_csv('spend_age_income_ed.csv')
    df.head()
Perform a train-test split, with random_state=10:
    from sklearn.model_selection import train_test_split

    X = df[['age', 'income', 'years_of_education']]
    y = df['spend']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
Fit a linear regression model to the training data:
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)
Fit two regression tree models to the data, one with max_depth=2 and one with max_depth=5:
    from sklearn.tree import DecisionTreeRegressor

    max2_tree_model = DecisionTreeRegressor(max_depth=2)
    max2_tree_model.fit(X_train, y_train)

    max5_tree_model = DecisionTreeRegressor(max_depth=5)
    max5_tree_model.fit(X_train, y_train)
Fit two random forest models to the data, one with max_depth=2, one with max_depth=5, and random_state=10 for both:
    from sklearn.ensemble import RandomForestRegressor

    max2_forest_model = RandomForestRegressor(max_depth=2, random_state=10)
    max2_forest_model.fit(X_train, y_train)

    max5_forest_model = RandomForestRegressor(max_depth=5, random_state=10)
    max5_forest_model.fit(X_train, y_train)
Calculate and print out the RMSE on the test data for all five models:
    from sklearn.metrics import mean_squared_error

    linear_predictions = model.predict(X_test)
    print('Linear model RMSE: ' + str(mean_squared_error(y_test, linear_predictions)**0.5))

    max2_tree_predictions = max2_tree_model.predict(X_test)
    print('Tree with max depth of 2 RMSE: ' + str(mean_squared_error(y_test, max2_tree_predictions)**0.5))

    max5_tree_predictions = max5_tree_model.predict(X_test)
    print('Tree with max depth of 5 RMSE: ' + str(mean_squared_error(y_test, max5_tree_predictions)**0.5))

    max2_forest_predictions = max2_forest_model.predict(X_test)
    print('Random Forest with max depth of 2 RMSE: ' + str(mean_squared_error(y_test, max2_forest_predictions)**0.5))

    max5_forest_predictions = max5_forest_model.predict(X_test)
    print('Random Forest with max depth of 5 RMSE: ' + str(mean_squared_error(y_test, max5_forest_predictions)**0.5))
We can see that, for this particular problem, a random forest with a max depth of 5 has the lowest test RMSE of the models we tried. In general, it's good to try a few different types of models and hyperparameter values to find a model that captures the relationships in the data well.
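When comparing several models in this way, it can be convenient to keep them in a dictionary and evaluate them in one loop. The sketch below assumes synthetic data with the same column names as the activity (the values and the relationship between spend and age are made up), so its RMSE figures say nothing about the real spend_age_income_ed.csv results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (for illustration only): spend varies
# non-linearly with age and weakly with income.
rng = np.random.default_rng(2)
n = 600
X = pd.DataFrame({
    'age': rng.uniform(18, 80, size=n),
    'income': rng.uniform(20000, 120000, size=n),
    'years_of_education': rng.integers(8, 21, size=n),
})
y = 100 * np.sin(X['age'] / 10) + 0.0005 * X['income'] + rng.normal(scale=5.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

models = {
    'Linear model': LinearRegression(),
    'Tree with max depth of 2': DecisionTreeRegressor(max_depth=2, random_state=10),
    'Tree with max depth of 5': DecisionTreeRegressor(max_depth=5, random_state=10),
    'Random Forest with max depth of 2': RandomForestRegressor(max_depth=2, random_state=10),
    'Random Forest with max depth of 5': RandomForestRegressor(max_depth=5, random_state=10),
}

# Fit each model on the training data and record its RMSE on the test data
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse_by_model[name] = mean_squared_error(y_test, predictions) ** 0.5
    print(name + ' RMSE: ' + str(rmse_by_model[name]))

best_name = min(rmse_by_model, key=rmse_by_model.get)
print('Lowest test RMSE: ' + best_name)
```

Adding another candidate, such as a different max_depth value, then only requires one more entry in the models dictionary.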