Feature selection attempts to reduce the size of the original dataset by subsetting the original features, shortlisting those with the highest predictive power. We might try to reason ahead of time about which feature selection method will work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline.
To begin, let's take a look at the subclass of feature selection modules that are reliant on statistical tests to select viable features from a dataset.
Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection: Pearson correlation coefficients and hypothesis testing (p-values).
Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the task is to evaluate and select features one at a time in order to create a better dataset for our machine learning pipeline.
We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method:
credit_card_default.corr()
The preceding code produces the following output:
As a continuation of the preceding table we have:
The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship.
It is worth noting that Pearson's correlation generally requires that each column be normally distributed (which we are not assuming here). We can also largely ignore this requirement because our dataset is large (over 500 samples is the commonly cited threshold).
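Before we lean on pandas entirely, here is a small optional sketch (not from the book) that reproduces a single Pearson coefficient by hand, just to make the formula concrete. The choice of LIMIT_BAL and the y_col name are ours for illustration, and we assume numpy is available alongside the credit_card_default dataframe:
import numpy as np
x = credit_card_default['LIMIT_BAL']
y_col = credit_card_default['default payment next month']
# Pearson's r = cov(x, y) / (std(x) * std(y)), using population (ddof=0) statistics
cov_xy = ((x - x.mean()) * (y_col - y_col.mean())).mean()
pearson_r = cov_xy / (x.std(ddof=0) * y_col.std(ddof=0))
print pearson_r                    # should match the LIMIT_BAL entry in the .corr() output
print np.corrcoef(x, y_col)[0, 1]  # the same value, computed by numpy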
The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible:
# using seaborn to generate heatmaps
import seaborn as sns
import matplotlib.style as style
# Use a clean stylization for our charts and graphs
style.use('fivethirtyeight')
sns.heatmap(credit_card_default.corr())
The heatmap generated will be as follows:
Note that the heatmap makes the most strongly correlated features stand out. That being said, we are, for the moment, concerned with each feature's correlation to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us.
Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies. We will be tackling this problem in our model-based selection methods.
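Although the book defers redundancy removal to the model-based methods, here is a rough, hedged sketch of the idea: one could scan the same correlation matrix for near-duplicate feature pairs. The .9 cutoff below is an arbitrary illustration, not a recommendation from the book:
# flag pairs of features whose absolute correlation exceeds an arbitrary .9 cutoff
corr_matrix = credit_card_default.corr().abs()
redundant_pairs = []
for i, col_a in enumerate(corr_matrix.columns):
    for col_b in corr_matrix.columns[i + 1:]:
        if corr_matrix.loc[col_a, col_b] > .9:
            redundant_pairs.append((col_a, col_b))
print redundant_pairs  # consecutive BILL_AMT columns, for example, tend to track each other closely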
Let's isolate the correlations between the features and the response variable, using the following code:
# just correlations between every feature and the response
credit_card_default.corr()['default payment next month']
LIMIT_BAL -0.153520
SEX -0.039961
EDUCATION 0.028006
MARRIAGE -0.024339
AGE 0.013890
PAY_0 0.324794
PAY_2 0.263551
PAY_3 0.235253
PAY_4 0.216614
PAY_5 0.204149
PAY_6 0.186866
BILL_AMT1 -0.019644
BILL_AMT2 -0.014193
BILL_AMT3 -0.014076
BILL_AMT4 -0.010156
BILL_AMT5 -0.006760
BILL_AMT6 -0.005372
PAY_AMT1 -0.072929
PAY_AMT2 -0.058579
PAY_AMT3 -0.056250
PAY_AMT4 -0.056827
PAY_AMT5 -0.055124
PAY_AMT6 -0.053183
default payment next month 1.000000
We can ignore the final row, as it is the response variable correlated perfectly with itself. We are looking for features with correlation coefficient values close to -1 or +1; these are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative).
Let's do this by first defining a pandas mask, which will act as our filter, using the following code:
# filter only correlations stronger than .2 in either direction (positive or negative)
credit_card_default.corr()['default payment next month'].abs() > .2
LIMIT_BAL False
SEX False
EDUCATION False
MARRIAGE False
AGE False
PAY_0 True
PAY_2 True
PAY_3 True
PAY_4 True
PAY_5 True
PAY_6 False
BILL_AMT1 False
BILL_AMT2 False
BILL_AMT3 False
BILL_AMT4 False
BILL_AMT5 False
BILL_AMT6 False
PAY_AMT1 False
PAY_AMT2 False
PAY_AMT3 False
PAY_AMT4 False
PAY_AMT5 False
PAY_AMT6 False
default payment next month True
Every False in the preceding pandas Series represents a feature whose correlation value lies between -.2 and .2 inclusive, while every True corresponds to a feature whose correlation is greater than .2 or less than -.2. Let's plug this mask into our pandas filtering, using the following code:
# store the features
highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2]
highly_correlated_features
Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5',
u'default payment next month'],
dtype='object')
The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating:
# drop the response variable
highly_correlated_features = highly_correlated_features.drop('default payment next month')
highly_correlated_features
Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object')
So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try it out with the help of the following code:
# only include the five highly correlated features
X_subsetted = X[highly_correlated_features]
get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)
# barely worse, but about 20x faster to fit the model
Best Accuracy: 0.819666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.01
Average Time to Score (s): 0.002
Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that fitting became roughly 20 times faster. Our model is able to learn almost as well with only five features as it could with the entire dataset, and it does so in a much shorter timeframe.
Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class.
We will call our class CustomCorrelationChooser, and it will have to implement both fit and transform logic, as shown in the following code:
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response series
        self.response = response
        # store the threshold that we wish to keep
        self.threshold = threshold
        # initialize a variable that will eventually
        # hold the names of the features that we wish to keep
        self.cols_to_keep = cols_to_keep

    def transform(self, X):
        # the transform method simply selects the appropriate
        # columns from the original dataset
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # create a new dataframe that holds both features and response
        df = pd.concat([X, self.response], axis=1)
        # store the names of columns that meet the correlation threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # only keep the columns in X; this, for example, removes the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self
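One design note worth calling out (our own observation, not spelled out in the excerpt): because the class inherits from BaseEstimator, it automatically gets get_params and set_params based on its __init__ signature, which is exactly what will let grid search tune the threshold later on. A quick illustrative check:
# BaseEstimator introspects __init__, so the threshold is exposed as a tunable parameter
ccc = CustomCorrelationChooser(response=y, threshold=.2)
print sorted(ccc.get_params().keys())  # ['cols_to_keep', 'response', 'threshold']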
Let's take our new correlation feature selector for a spin, with the help of the following code:
# instantiate our new feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)
ccc.cols_to_keep
['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']
Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code:
ccc.transform(X).head()
The preceding code produces the following table as the output:
We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code:
# instantiate our feature selector with the response variable set
ccc = CustomCorrelationChooser(response=y)
# make our new pipeline, including the selector
ccc_pipe = Pipeline([('correlation_select', ccc),
                     ('classifier', d_tree)])
# make a copy of the decision tree pipeline parameters
ccc_pipe_params = deepcopy(tree_pipe_params)
# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({'correlation_select__threshold': [0, .1, .2, .3]})
print ccc_pipe_params  # {'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# better than the original (by a little), and a bit faster on average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)
Best Accuracy: 0.8206
Best Parameters: {'correlation_select__threshold': 0.1,
'classifier__max_depth': 5}
Average Time to Fit (s): 0.105
Average Time to Score (s): 0.003
Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit). Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep:
# check the threshold of .1
ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)
# check which columns were kept
ccc.cols_to_keep
['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
It appears that our selector has decided to keep the five columns that we found, as well as two more: the LIMIT_BAL and PAY_6 columns. Great! This is the beauty of automated pipeline grid searching in scikit-learn. It allows our models to do what they do best and pick up on signals that we might not have intuited on our own.
Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values.
A hypothesis test is a statistical test that is used to figure out whether we can apply a certain conclusion to an entire population, given a data sample. Based on that sample, the test determines whether or not we should reject the null hypothesis in favor of an alternative one. We usually make this decision using a p-value (a non-negative decimal bounded above by 1), which we compare against our chosen significance level.
In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: this feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the feature holds some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic: we basically said that if a column's correlation with the response is too weak, then we behave as though the hypothesis that the feature has no relevance holds. If the correlation coefficient is strong enough, then we can reject that hypothesis in favor of the alternative hypothesis, that the feature does have some relevance.
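To make this concrete, here is a small hedged sketch (not from the book) of the same kind of test carried out by hand for a single feature, using scipy's one-way ANOVA. The choice of PAY_0 is ours for illustration, and we assume the X and y variables from earlier:
from scipy import stats
# split one feature's values into two groups: defaulters and non-defaulters
pay_0_default = X['PAY_0'][y == 1]
pay_0_no_default = X['PAY_0'][y == 0]
# null hypothesis: both groups share the same mean, i.e. PAY_0 carries no signal
f_statistic, p_value = stats.f_oneway(pay_0_default, pay_0_no_default)
print p_value  # a tiny p-value lets us reject the null hypothesis and keep PAY_0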
To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code:
# SelectKBest selects features according to the k highest scores of a given
scoring function
from sklearn.feature_selection import SelectKBest
# This models a statistical test known as ANOVA
from sklearn.feature_selection import f_classif
# f_classif allows for negative values, not all do
# chi2 is a very common classification criteria but only allows for
positive values
# regression has its own statistical tests
SelectKBest is basically just a wrapper that keeps a set number of features that are ranked highest according to some criterion. In this case, we will use the p-values of completed hypothesis tests as that ranking.
A p-value is a decimal between 0 and 1 that represents the probability that data as extreme as ours would occur by chance under the null hypothesis. Simply put, the lower the p-value, the better our chance of rejecting the null hypothesis. For our purposes, the smaller the p-value, the better the chance that the feature has some relevance to our response variable and should be kept.
The big takeaway from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. SelectKBest will rank the features by that p-value (the lower, the better) and keep only the best k (a human input) features. Let's try this out in Python.
Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values:
# keep only the best five features according to p-values of ANOVA test
k_best = SelectKBest(f_classif, k=5)
We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector:
# matrix after selecting the top 5 features
k_best.fit_transform(X, y)
# 30,000 rows x 5 columns
array([[ 2, 2, -1, -1, -2],
[-1, 2, 0, 0, 0],
[ 0, 0, 0, 0, 0],
...,
[ 4, 3, 2, -1, 0],
[ 1, -1, 0, 0, 0],
[ 0, 0, 0, 0, 0]])
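Notice that fit_transform hands back a bare NumPy array with no column names. As an aside not covered in the excerpt, SelectKBest exposes a get_support method that we could use to map the kept columns back to their names:
# boolean mask with one entry per original column; True marks a kept feature
mask = k_best.get_support()
print list(X.columns[mask])  # should line up with the lowest p-values we inspect next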
If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the fitted k_best selector:
# get the p-values of the columns
k_best.pvalues_
# make a dataframe of features and p-values
# sort that dataframe by p-value
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
# show the top 5 features
p_values.head()
The preceding code produces the following table as the output:
We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology:
# features with a low p value
p_values[p_values['p_value'] < .05]
The preceding code produces the following table as the output:
The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code:
# features with a high p value
p_values[p_values['p_value'] >= .05]
The preceding code produces the following table as the output:
These three columns have quite a high p-value. Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code:
k_best = SelectKBest(f_classif)
# make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best),
                          ('classifier', d_tree)])
select_k_best_pipe_params = deepcopy(tree_pipe_params)
# the 'all' option literally does nothing to subset the features
select_k_best_pipe_params.update({'k_best__k': range(1, 23) + ['all']})
print select_k_best_pipe_params  # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# comparable to our results with the CustomCorrelationChooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)
Best Accuracy: 0.8206
Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.102
Average Time to Score (s): 0.002
It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code:
k_best = SelectKBest(f_classif, k=7)
# the lowest 7 p-values match what our CustomCorrelationChooser chose before
# ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
p_values.head(7)
The preceding code produces the following table as the output:
They appear to be the same columns that were chosen by our other statistical method; it seems that our statistical approaches consistently pick these same seven columns for us.
There are other tests available besides ANOVA, such as chi2 for classification, and regression tasks have their own set of statistical tests. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
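As a hedged illustration of swapping in one of those alternatives (our own sketch, not taken from the book): chi2 only accepts non-negative values, so we would first need to scale the features into a non-negative range, for instance with MinMaxScaler. The k=7 below simply mirrors the value our grid search settled on:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# scale features into [0, 1] so that chi2 can handle them, then select and classify
chi2_pipe = Pipeline([('scale', MinMaxScaler()),
                      ('k_best', SelectKBest(chi2, k=7)),
                      ('classifier', d_tree)])
# this pipeline could be grid searched exactly like the f_classif version above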
Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and a worse pipeline overall, right? Let's make sure, using the following code:
# sanity check
# if we keep only the worst columns (every column except the best seven)
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]
# goes to show that selecting the wrong features will
# hurt us in predictive performance
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)
Best Accuracy: 0.783966666667
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.21
Average Time to Score (s): 0.002
Hence, by selecting every column except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average.
We statistically selected features from the dataset for our machine learning pipeline.
This article is an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book to get access to alternative techniques, such as the model-based method, to achieve optimum results from the machine learning application.