Feature selection attempts to reduce the size of the original dataset by subsetting the original features, shortlisting those with the highest predictive power. We might try to reason ahead of time about which feature selection method will work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline.
To begin, let's take a look at the subclass of feature selection modules that are reliant on statistical tests to select viable features from a dataset.
Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection: Pearson correlation coefficients and hypothesis testing (p-values).
Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the task is to evaluate and select features one at a time in order to create a better dataset for our machine learning pipeline.
We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method:
credit_card_default.corr()
The preceding code produces the following output:
As a continuation of the preceding table we have:
The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship.
It is worth noting that Pearson's correlation generally requires that each column be normally distributed (which we are not assuming here). We can also largely ignore this requirement because our dataset is large (over 500 samples is the commonly cited threshold).
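Before we lean on pandas entirely, here is a small optional sketch (not from the book) that reproduces a single Pearson coefficient by hand, just to make the formula concrete. The choice of LIMIT_BAL and the y_col name are ours for illustration, and we assume numpy is available alongside the credit_card_default dataframe:
import numpy as np
x = credit_card_default['LIMIT_BAL']
y_col = credit_card_default['default payment next month']
# Pearson's r = cov(x, y) / (std(x) * std(y)), using population (ddof=0) statistics
cov_xy = ((x - x.mean()) * (y_col - y_col.mean())).mean()
pearson_r = cov_xy / (x.std(ddof=0) * y_col.std(ddof=0))
print pearson_r                    # should match the LIMIT_BAL entry in the .corr() output
print np.corrcoef(x, y_col)[0, 1]  # the same value, computed by numpy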
The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible:
# using seaborn to generate heatmaps
import seaborn as sns
import matplotlib.style as style
# Use a clean stylization for our charts and graphs
style.use('fivethirtyeight')
sns.heatmap(credit_card_default.corr())
The heatmap generated will be as follows:
Note that the heatmap makes the most strongly correlated features stand out. That being said, we are, for the moment, concerned with each feature's correlation to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us.
Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies. We will be tackling this problem in our model-based selection methods.
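Although the book defers redundancy removal to the model-based methods, here is a rough, hedged sketch of the idea: one could scan the same correlation matrix for near-duplicate feature pairs. The .9 cutoff below is an arbitrary illustration, not a recommendation from the book:
# flag pairs of features whose absolute correlation exceeds an arbitrary .9 cutoff
corr_matrix = credit_card_default.corr().abs()
redundant_pairs = []
for i, col_a in enumerate(corr_matrix.columns):
    for col_b in corr_matrix.columns[i + 1:]:
        if corr_matrix.loc[col_a, col_b] > .9:
            redundant_pairs.append((col_a, col_b))
print redundant_pairs  # consecutive BILL_AMT columns, for example, tend to track each other closely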
Let's isolate the correlations between the features and the response variable, using the following code:
# just correlations between every feature and the response
credit_card_default.corr()['default payment next month']
LIMIT_BAL -0.153520
SEX -0.039961
EDUCATION 0.028006
MARRIAGE -0.024339
AGE 0.013890
PAY_0 0.324794
PAY_2 0.263551
PAY_3 0.235253
PAY_4 0.216614
PAY_5 0.204149
PAY_6 0.186866
BILL_AMT1 -0.019644
BILL_AMT2 -0.014193
BILL_AMT3 -0.014076
BILL_AMT4 -0.010156
BILL_AMT5 -0.006760
BILL_AMT6 -0.005372
PAY_AMT1 -0.072929
PAY_AMT2 -0.058579
PAY_AMT3 -0.056250
PAY_AMT4 -0.056827
PAY_AMT5 -0.055124
PAY_AMT6 -0.053183
default payment next month 1.000000
We can ignore the final row, as it is the response variable correlated perfectly with itself. We are looking for features with correlation coefficient values close to -1 or +1; these are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative).
Let's do this by first defining a pandas mask, which will act as our filter, using the following code:
# filter only correlations stronger than .2 in either direction (positive or negative)
credit_card_default.corr()['default payment next month'].abs() > .2
LIMIT_BAL False
SEX False
EDUCATION False
MARRIAGE False
AGE False
PAY_0 True
PAY_2 True
PAY_3 True
PAY_4 True
PAY_5 True
PAY_6 False
BILL_AMT1 False
BILL_AMT2 False
BILL_AMT3 False
BILL_AMT4 False
BILL_AMT5 False
BILL_AMT6 False
PAY_AMT1 False
PAY_AMT2 False
PAY_AMT3 False
PAY_AMT4 False
PAY_AMT5 False
PAY_AMT6 False
default payment next month True
Every False in the preceding pandas Series represents a feature whose correlation value lies between -.2 and .2 inclusive, while every True corresponds to a feature whose correlation is greater than .2 or less than -.2. Let's plug this mask into our pandas filtering, using the following code:
# store the features
highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2]
highly_correlated_features
Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5',
u'default payment next month'],
dtype='object')
The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating:
# drop the response variable
highly_correlated_features = highly_correlated_features.drop('default payment next month')
highly_correlated_features
Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object')
So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try it out with the help of the following code:
# only include the five highly correlated features
X_subsetted = X[highly_correlated_features]
get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)
# barely worse, but about 20x faster to fit the model
Best Accuracy: 0.819666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.01
Average Time to Score (s): 0.002
Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that fitting became roughly 20 times faster. Our model is able to learn almost as well with only five features as it could with the entire dataset, and it does so in a much shorter timeframe.
Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class.
We will call our class CustomCorrelationChooser, and it will have to implement both fit and transform logic, as shown in the following code:
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response series
        self.response = response
        # store the threshold that we wish to keep
        self.threshold = threshold
        # initialize a variable that will eventually
        # hold the names of the features that we wish to keep
        self.cols_to_keep = cols_to_keep

    def transform(self, X):
        # the transform method simply selects the appropriate
        # columns from the original dataset
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # create a new dataframe that holds both features and response
        df = pd.concat([X, self.response], axis=1)
        # store the names of columns that meet the correlation threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # only keep the columns in X; this, for example, removes the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self
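One design note worth calling out (our own observation, not spelled out in the excerpt): because the class inherits from BaseEstimator, it automatically gets get_params and set_params based on its __init__ signature, which is exactly what will let grid search tune the threshold later on. A quick illustrative check:
# BaseEstimator introspects __init__, so the threshold is exposed as a tunable parameter
ccc = CustomCorrelationChooser(response=y, threshold=.2)
print sorted(ccc.get_params().keys())  # ['cols_to_keep', 'response', 'threshold']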
Let's take our new correlation feature selector for a spin, with the help of the following code:
# instantiate our new feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)
ccc.cols_to_keep
['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']
Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code:
ccc.transform(X).head()
The preceding code produces the following table as the output:
We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code:
# instantiate our feature selector with the response variable set
ccc = CustomCorrelationChooser(response=y)
# make our new pipeline, including the selector
ccc_pipe = Pipeline([('correlation_select', ccc),
                     ('classifier', d_tree)])
# make a copy of the decision tree pipeline parameters
ccc_pipe_params = deepcopy(tree_pipe_params)
# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({'correlation_select__threshold': [0, .1, .2, .3]})
print ccc_pipe_params  # {'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# better than the original (by a little), and a bit faster on average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)
Best Accuracy: 0.8206
Best Parameters: {'correlation_select__threshold': 0.1,
'classifier__max_depth': 5}
Average Time to Fit (s): 0.105
Average Time to Score (s): 0.003
Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit). Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep:
# check the threshold of .1
ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)
# check which columns were kept
ccc.cols_to_keep
['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
It appears that our selector has decided to keep the five columns that we found, as well as two more: the LIMIT_BAL and PAY_6 columns. Great! This is the beauty of automated pipeline grid searching in scikit-learn. It allows our models to do what they do best and pick up on signals that we might not have intuited on our own.
Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values.
A hypothesis test is a statistical test that is used to figure out whether we can apply a certain conclusion to an entire population, given a data sample. Based on that sample, the test determines whether or not we should reject the null hypothesis in favor of an alternative one. We usually make this decision using a p-value (a non-negative decimal bounded above by 1), which we compare against our chosen significance level.
In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: this feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the feature holds some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic: we basically said that if a column's correlation with the response is too weak, then we behave as though the hypothesis that the feature has no relevance holds. If the correlation coefficient is strong enough, then we can reject that hypothesis in favor of the alternative hypothesis, that the feature does have some relevance.
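To make this concrete, here is a small hedged sketch (not from the book) of the same kind of test carried out by hand for a single feature, using scipy's one-way ANOVA. The choice of PAY_0 is ours for illustration, and we assume the X and y variables from earlier:
from scipy import stats
# split one feature's values into two groups: defaulters and non-defaulters
pay_0_default = X['PAY_0'][y == 1]
pay_0_no_default = X['PAY_0'][y == 0]
# null hypothesis: both groups share the same mean, i.e. PAY_0 carries no signal
f_statistic, p_value = stats.f_oneway(pay_0_default, pay_0_no_default)
print p_value  # a tiny p-value lets us reject the null hypothesis and keep PAY_0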
To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code:
# SelectKBest selects features according to the k highest scores of a given
scoring function
from sklearn.feature_selection import SelectKBest
# This models a statistical test known as ANOVA
from sklearn.feature_selection import f_classif
# f_classif allows for negative values, not all do
# chi2 is a very common classification criteria but only allows for
positive values
# regression has its own statistical tests
SelectKBest is basically just a wrapper that keeps a set number of features that are ranked highest according to some criterion. In this case, we will use the p-values of completed hypothesis tests as that ranking.
A p-value is a decimal between 0 and 1 that represents the probability that data as extreme as ours would occur by chance under the null hypothesis. Simply put, the lower the p-value, the better our chance of rejecting the null hypothesis. For our purposes, the smaller the p-value, the better the chance that the feature has some relevance to our response variable and should be kept.
The big takeaway from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. SelectKBest will rank the features by that p-value (the lower, the better) and keep only the best k (a human input) features. Let's try this out in Python.
Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values:
# keep only the best five features according to p-values of ANOVA test
k_best = SelectKBest(f_classif, k=5)
We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector:
# matrix after selecting the top 5 features
k_best.fit_transform(X, y)
# 30,000 rows x 5 columns
array([[ 2, 2, -1, -1, -2],
[-1, 2, 0, 0, 0],
[ 0, 0, 0, 0, 0],
...,
[ 4, 3, 2, -1, 0],
[ 1, -1, 0, 0, 0],
[ 0, 0, 0, 0, 0]])
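Notice that fit_transform hands back a bare NumPy array with no column names. As an aside not covered in the excerpt, SelectKBest exposes a get_support method that we could use to map the kept columns back to their names:
# boolean mask with one entry per original column; True marks a kept feature
mask = k_best.get_support()
print list(X.columns[mask])  # should line up with the lowest p-values we inspect next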
If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the fitted k_best selector:
# get the p-values of the columns
k_best.pvalues_
# make a dataframe of features and p-values
# sort that dataframe by p-value
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
# show the top 5 features
p_values.head()
The preceding code produces the following table as the output:
We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology:
# features with a low p value
p_values[p_values['p_value'] < .05]
The preceding code produces the following table as the output:
The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code:
# features with a high p value
p_values[p_values['p_value'] >= .05]
The preceding code produces the following table as the output:
These three columns have quite a high p-value. Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code:
k_best = SelectKBest(f_classif)
# make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best),
                          ('classifier', d_tree)])
select_k_best_pipe_params = deepcopy(tree_pipe_params)
# the 'all' option literally does nothing to subset the features
select_k_best_pipe_params.update({'k_best__k': range(1, 23) + ['all']})
print select_k_best_pipe_params  # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# comparable to our results with the CustomCorrelationChooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)
Best Accuracy: 0.8206
Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.102
Average Time to Score (s): 0.002
It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code:
k_best = SelectKBest(f_classif, k=7)
# the lowest 7 p-values match what our CustomCorrelationChooser chose before
# ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
p_values.head(7)
The preceding code produces the following table as the output:
They appear to be the same columns that were chosen by our other statistical method; it seems that our statistical approaches consistently pick these same seven columns for us.
There are other tests available besides ANOVA, such as chi2 for classification, and regression tasks have their own set of statistical tests. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
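As a hedged illustration of swapping in one of those alternatives (our own sketch, not taken from the book): chi2 only accepts non-negative values, so we would first need to scale the features into a non-negative range, for instance with MinMaxScaler. The k=7 below simply mirrors the value our grid search settled on:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# scale features into [0, 1] so that chi2 can handle them, then select and classify
chi2_pipe = Pipeline([('scale', MinMaxScaler()),
                      ('k_best', SelectKBest(chi2, k=7)),
                      ('classifier', d_tree)])
# this pipeline could be grid searched exactly like the f_classif version above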
Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and a worse pipeline overall, right? Let's make sure, using the following code:
# sanity check
# if we keep only the worst columns (every column except the best seven)
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]
# goes to show that selecting the wrong features will
# hurt us in predictive performance
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)
Best Accuracy: 0.783966666667
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.21
Average Time to Score (s): 0.002
Hence, by selecting every column except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average.
We statistically selected features from the dataset for our machine learning pipeline.
This article is an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book to get access to alternative techniques, such as the model-based method, to achieve optimum results from the machine learning application.