Chapter 4: The Bias-Variance Trade-off
Activity 4: Cross-Validation and Feature Engineering with the Case Study Data
Select out the features from the DataFrame of the case study data.
You can use the list of feature names that we've already created in this chapter. But be sure not to include the response variable, which would be a very good (but entirely inappropriate) feature:
features = features_response[:-1]
X = df[features].values
Make a train/test split using a random seed of 24:
X_train, X_test, y_train, y_test = train_test_split(X, df['default payment next month'].values, test_size=0.2, random_state=24)
We'll use this going forward and reserve this testing data as the unseen test set. This way, we can easily create separate notebooks with other modeling approaches, using the same training data.
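To illustrate why a fixed seed keeps the split reusable across notebooks, here is a minimal sketch on made-up data. The arrays X_demo and y_demo are assumptions for illustration only and are not the case study data. It checks that two calls with the same random_state select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the case study data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two splits made with the same seed...
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=24)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=24)

# ...select exactly the same rows for training and testing
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)  # training and testing features
```

Because the seed fully determines the shuffle, any notebook that repeats this call on the same data should recover the same training and testing rows.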
Instantiate the MinMaxScaler to scale the data, as shown in the following code:
from sklearn.preprocessing import MinMaxScaler
min_max_sc = MinMaxScaler()
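As a quick sketch of what this scaler does, the example below applies it to a toy array (X_toy is made up for illustration) and compares the result with the min-max formula computed by hand, column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with very different column ranges (illustrative only)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler_demo = MinMaxScaler()
X_scaled = scaler_demo.fit_transform(X_toy)

# Same result by hand: (x - column min) / (column max - column min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

Each column ends up on the range [0, 1], which puts features measured on different scales on a comparable footing before they reach the regularized model.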
Instantiate a logistic regression model with the saga solver and L1 penalty, and set max_iter to 1,000 to allow the solver enough iterations to find a good solution:
lr = LogisticRegression(solver='saga', penalty='l1', max_iter=1000)
Import the Pipeline class and create a Pipeline with the scaler and the logistic regression model, using the names 'scaler' and 'model' for the steps, respectively:
from sklearn.pipeline import Pipeline
scale_lr_pipeline = Pipeline(steps=[('scaler', min_max_sc), ('model', lr)])
Use the get_params and set_params methods to see how to view the parameters from each stage of the pipeline and change them:
scale_lr_pipeline.get_params()
scale_lr_pipeline.get_params()['model__C']
scale_lr_pipeline.set_params(model__C = 2)
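The naming rule behind these calls is that a step's parameters are exposed as the step name, two underscores, then the parameter name. The sketch below rebuilds a comparable pipeline under the hypothetical name demo_pipeline so that it runs on its own, then lists the keys that belong to the 'model' step and changes C.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-alone pipeline with the same step names as in the activity
demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='saga', penalty='l1', max_iter=1000)),
])

# Parameter keys for a step look like '<step name>__<parameter name>'
model_keys = [k for k in demo_pipeline.get_params() if k.startswith('model__')]
print(model_keys)

c_before = demo_pipeline.get_params()['model__C']  # LogisticRegression default
demo_pipeline.set_params(model__C=2)
c_after = demo_pipeline.get_params()['model__C']
print(c_before, c_after)

# The change is applied to the estimator object inside the pipeline
print(demo_pipeline.named_steps['model'].C)
```

Since set_params modifies the pipeline in place, the same pipeline object can be reused for each candidate value of C.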
Create a smaller range of C values to test with cross-validation, as these models will take longer to train and test with more data than our previous exercises; we recommend C = [10^2, 10, 1, 10^-1, 10^-2, 10^-3]:
C_val_exponents = np.linspace(2,-3,6)
C_vals = float(10)**C_val_exponents
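As a small check on this grid, the sketch below builds the C values from the exponents using the built-in float (the np.float alias has been removed from recent NumPy releases) and compares them with np.logspace, which should produce the same values in one call.

```python
import numpy as np

# Exponents 2, 1, 0, -1, -2, -3
C_val_exponents = np.linspace(2, -3, 6)
C_vals = float(10) ** C_val_exponents

# np.logspace builds the same grid directly (base 10 by default)
C_vals_logspace = np.logspace(2, -3, 6)

print(C_vals)
print(np.allclose(C_vals, C_vals_logspace))
```

Keeping C_val_exponents as a separate array remains convenient, since it can serve as the x axis when plotting scores against log10(C).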
Make a new version of the cross_val_C_search function, called cross_val_C_search_pipe. Instead of the model argument, this function will take a pipeline argument. The changes inside the function will be to set the C value using set_params(model__C = <value you want to test>) on the pipeline, replacing model with the pipeline for the fit and predict_proba methods, and accessing the C value using pipeline.get_params()['model__C'] for the printed status update.
The changes are as follows:
def cross_val_C_search_pipe(k_folds, C_vals, pipeline, X, Y):
    ##[…]
    pipeline.set_params(model__C = C_vals[c_val_counter])
    ##[…]
    pipeline.fit(X_cv_train, y_cv_train)
    ##[…]
    y_cv_train_predict_proba = pipeline.predict_proba(X_cv_train)
    ##[…]
    y_cv_test_predict_proba = pipeline.predict_proba(X_cv_test)
    ##[…]
    print('Done with C = {}'.format(pipeline.get_params()['model__C']))
Note
For the complete code, refer to http://bit.ly/2ZAy2Pr.
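For readers who want to see the pattern end to end without opening the notebook, the following is a simplified, hypothetical stand-in and not the book's function. The name c_search_sketch, the synthetic data from make_classification, the fold settings, and the returned arrays are all assumptions made for illustration. It shows only the three pipeline-specific pieces described above: set_params(model__C=...), fit, and predict_proba called on the pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def c_search_sketch(k_folds, C_vals, pipeline, X, y):
    """Hypothetical stand-in (not the book's code): ROC AUC per fold and C."""
    train_auc = np.empty((k_folds.get_n_splits(), len(C_vals)))
    test_auc = np.empty_like(train_auc)
    for c_idx, C in enumerate(C_vals):
        # Set C on the pipeline's 'model' step
        pipeline.set_params(model__C=C)
        for fold_idx, (tr, te) in enumerate(k_folds.split(X, y)):
            # Fitting the pipeline fits the scaler on the training fold only
            pipeline.fit(X[tr], y[tr])
            train_auc[fold_idx, c_idx] = roc_auc_score(
                y[tr], pipeline.predict_proba(X[tr])[:, 1])
            test_auc[fold_idx, c_idx] = roc_auc_score(
                y[te], pipeline.predict_proba(X[te])[:, 1])
        print('Done with C = {}'.format(pipeline.get_params()['model__C']))
    return train_auc, test_auc


# Synthetic data standing in for the case study training split
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=1)
pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='saga', penalty='l1', max_iter=1000)),
])
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
train_auc, test_auc = c_search_sketch(folds, [10.0, 1.0, 0.1], pipe, X_syn, y_syn)
print(test_auc.mean(axis=0))
```

One reason to pass the whole pipeline rather than a bare model is that the scaler is refit inside every fold, so the minimum and maximum of the held-out fold never influence the scaling of the training fold.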
Run this function as in the previous exercise, but using the new range of C values, the pipeline you created, and the features and response variable from the training split of the case study data. You may see warnings here, or in later steps, about the non-convergence of the solver; you could experiment with the tol or max_iter options to try and achieve convergence, although the results you obtain with max_iter = 1000 are likely to be sufficient. Here is the code to do this:
cv_train_roc_auc, cv_test_roc_auc, cv_test_roc = \
    cross_val_C_search_pipe(k_folds, C_vals, scale_lr_pipeline, X_train, y_train)
You will obtain the following output:
Done with C = 100.0
Done with C = 10.0
Done with C = 1.0
Done with C = 0.1
Done with C = 0.01
Done with C = 0.001
Plot the average training and testing ROC AUC across folds, for each C value, using the following code:
plt.plot(C_val_exponents, np.mean(cv_train_roc_auc, axis=0), '-o', label='Average training score')
plt.plot(C_val_exponents, np.mean(cv_test_roc_auc, axis=0), '-x', label='Average testing score')
plt.ylabel('ROC AUC')
plt.xlabel('log$_{10}$(C)')
plt.legend()
plt.title('Cross validation on Case Study problem')
np.mean(cv_test_roc_auc, axis=0)
You will obtain the following output:
You should notice that regularization does not impart much benefit here, as may be expected. While we are able to increase model performance over our previous efforts by using all the features available, it appears there is no overfitting going on. Instead, the training and testing scores are about the same. Instead of overfitting, it's possible that we may be underfitting. Let's try engineering some interaction features to see if they can improve performance.
Create interaction features for the case study data and confirm that the number of new features makes sense using the following code:
from sklearn.preprocessing import PolynomialFeatures
make_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = make_interactions.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_interact, df['default payment next month'].values,
    test_size=0.2, random_state=24)
print(X_train.shape)
print(X_test.shape)
You will obtain the following output:
(21331, 153)
(5333, 153)
From this you should see the new number of features is 153, which is 17 + "17 choose 2" = 17 + 136 = 153. The "17 choose 2" part comes from choosing all possible combinations of 2 features to interact from the possible 17.
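The arithmetic above can be checked in a few lines. This sketch counts the pairs two ways with the standard library and then cross-checks the column count against PolynomialFeatures on a placeholder array of ones with 17 columns (X_placeholder is an assumption for illustration, not the case study data).

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 17

# Pairwise interaction terms: every unordered pair of distinct features
n_pairs = comb(n_features, 2)
n_pairs_enumerated = sum(1 for _ in combinations(range(n_features), 2))

# Original features plus their pairwise interactions
n_total = n_features + n_pairs
print(n_pairs, n_total)

# Cross-check on a placeholder array with 17 columns
X_placeholder = np.ones((4, n_features))
make_interactions_demo = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False)
n_columns = make_interactions_demo.fit_transform(X_placeholder).shape[1]
print(n_columns)
```

With interaction_only=True the squared terms are left out, and with include_bias=False there is no constant column, which is why the count is 17 plus the number of pairs and nothing more.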
Repeat the cross-validation procedure and observe the model performance now; that is, repeat Steps 9 and 10. Note that this will take substantially more time, due to the larger number of features, but it will probably take only a few minutes.
You will obtain the following output:
So, does the average cross-validation testing performance improve with the interaction features? Is regularization useful?
Engineering the interaction features increases the best model testing score to about ROC AUC = 0.74 on average across the folds, from about 0.72 without including interactions. These scores happen at C = 100, that is, with negligible regularization. On the plot of training versus testing scores for the model with interactions, you can see that the training score is a bit higher than the testing score, so it could be said that some amount of overfitting is going on. However, we cannot increase the testing score through regularization here, so this may not be a problematic instance of overfitting. In most cases, whatever strategy yields the highest testing score is the best strategy.
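One way to read the best C off the cross-validation results programmatically is sketched below. The score array here is filled with random placeholder numbers, not results from the case study, and the name cv_test_roc_auc_demo is hypothetical; the sketch assumes the scores are arranged with one row per fold and one column per C value, consistent with averaging over axis=0 in the plotting step.

```python
import numpy as np

C_val_exponents = np.linspace(2, -3, 6)
C_vals = 10.0 ** C_val_exponents

# Placeholder scores with shape (n_folds, n_C_values): random numbers,
# NOT results from the case study
rng = np.random.default_rng(0)
cv_test_roc_auc_demo = rng.uniform(0.6, 0.8, size=(4, len(C_vals)))

# Average across folds, then pick the C with the highest mean testing score
mean_test_scores = np.mean(cv_test_roc_auc_demo, axis=0)
best_index = int(np.argmax(mean_test_scores))
best_C = C_vals[best_index]
print('Best C = {}, mean testing ROC AUC = {:.3f}'.format(
    best_C, mean_test_scores[best_index]))
```

Replacing the placeholder array with the testing scores returned by the cross-validation function would give the value of C to carry forward.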
We will reserve the step of fitting on all the training data for later, when we've tried other models in cross-validation to find the best model.