[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box]
In this article, we will look at different methods of selecting features from a dataset and discuss several types of feature selection algorithms, along with their implementation in Python using the scikit-learn (sklearn) library.
We explain the first three algorithms and their implementation briefly. We then discuss choosing important features (feature importance) in more detail, as it is a widely used technique in the data science community.
Statistical tests can be used to select those features that have the strongest relationships with the output variable.
The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
The following example uses the chi-squared (chi²) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset:
#Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
#Import chi2 for performing chi square test
from sklearn.feature_selection import chi2
#URL for loading the dataset
url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#We will select the features using chi square
test = SelectKBest(score_func=chi2, k=4)
#Fit the function for ranking the features by score
fit = test.fit(X, Y)
#Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
#Apply the transformation on to dataset
features = fit.transform(X)
#Summarize selected features
print(features[0:5,:])
You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age.
Scores for each feature:
[ 111.52  1411.887  17.605  53.108  2175.565  127.669  5.393  181.304]
Selected Features:
[[148. 0. 33.6 50. ]
[85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[89. 94. 28.1 21. ]
[137. 168. 43.1 33. ]]
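If you want to see which attribute names those scores and selected columns correspond to, a small sketch along the following lines (not part of the original listing; it reuses the fit object and names list defined above) can map them back:

#Map the chi-squared scores back to the attribute names
print(sorted(zip(fit.scores_, names[0:8]), reverse=True))
#List the names of the four selected columns
selected_names = [name for name, keep in zip(names[0:8], fit.get_support()) if keep]
print(selected_names)   #Expected: ['plas', 'test', 'mass', 'age']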
RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation.
The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent:
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression to use as the estimator for RFE
from sklearn.linear_model import LogisticRegression
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_)
After execution, we will get:
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with rank 1 in the ranking_ array.
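As a quick sketch (not in the original listing), the selected attribute names can be printed directly from the support_ mask produced above:

#Print the names of the attributes selected by RFE
selected = [name for name, chosen in zip(names[0:8], fit.support_) if chosen]
print("Selected attributes: %s" % selected)   #Expected: ['preg', 'mass', 'pedi']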
PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.
In the following example, we use PCA and select three principal components:
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize components
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)
You can see that the transformed dataset (three principal components) bears little resemblance to the source data:
Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
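Note that the listing above only fits PCA and prints the components; to actually obtain the compressed dataset, the fitted object can be applied to the input data, as in this short sketch (not part of the original code, reusing the fit object and X defined above):

#Project the input data onto the three fitted principal components
X_pca = fit.transform(X)
print(X_pca.shape)     #(768, 3) for the full Pima dataset
print(X_pca[0:3,:])    #First three rows in the compressed space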
Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail.
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy.
A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or the information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity. For a forest, the impurity decrease contributed by each feature can be averaged over all trees, and the features can then be ranked according to this measure.
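To make the averaging idea concrete, here is a minimal sketch (not from the original code; it uses sklearn's built-in iris dataset and a small demonstration forest purely as an assumption) showing that a forest's feature_importances_ is, up to normalization, the mean of the impurity-based importances of its individual trees:

#Illustrative sketch: a forest's feature_importances_ is (up to normalization)
#the average of the impurity-based importances of its individual trees
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

#Small demonstration forest on the built-in iris data (not the Otto dataset used below)
X_demo, y_demo = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
print(forest.feature_importances_)   #Importances reported by the forest
print(per_tree.mean(axis=0))         #Average of the per-tree importances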
Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset, which is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.
This dataset describes 93 obfuscated features of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are counts of different events of some kind.
The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).
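The article does not compute this metric explicitly, but as a hedged illustration, sklearn's log_loss can evaluate it given true labels and predicted class probabilities (the labels and probabilities below are made up for the example):

#Sketch of the competition metric: multiclass logarithmic loss
from sklearn.metrics import log_loss

#Hypothetical ground-truth labels and predicted probabilities for three classes
y_true = [0, 2, 1, 2]
y_prob = [[0.8, 0.1, 0.1],
          [0.2, 0.2, 0.6],
          [0.1, 0.7, 0.2],
          [0.3, 0.3, 0.4]]
print("Multiclass log loss: %.4f" % log_loss(y_true, y_prob))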
We will start with importing all of the libraries:
#Import the supporting libraries
#Import pandas to load the dataset from csv file
from pandas import read_csv
#Import numpy for array based operations and calculations
import numpy as np
#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
#Import SelectFromModel, sklearn's model-based feature selector
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)
Let's define a method to split our dataset into training and testing data; we will train our dataset on the training part and the testing part will be used for evaluation of the trained model:
#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy:
#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc
Now it is time to load the dataset. We will load the train.csv file, which contains more than 61,000 training instances. We will use 50,000 instances for our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:
#Load dataset as pandas data frame
data = read_csv('train.csv')
#Extract attribute names from the data frame
feat = data.keys()
feat_labels = feat.get_values()
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train,test = getTrainTestData(dataset, 0.7)
#Split data into input and output variable with selected features
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ",shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))
Let's take note of the data size here; our training set contains about 35,000 instances with 94 attributes, so it is quite large. Let's see:
Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB
As you can see, we have 35,000 rows and 94 columns in our dataset, which amounts to more than 26 MB of data (35,000 × 94 values stored as 8-byte floats is roughly 26.32 MB).
In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30 and the number of random features will be 7. Other hyperparameters will be the default of sklearn:
#Lets select the test data for model evaluation purpose
Xtest = test[:,0:94]
ytest = test[:,94]
#Create a random forest classifier with the following Parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
max_features=max_feat,
max_depth=max_depth,
min_samples_split= min_sample, random_state=0,
n_jobs=-1)
#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
#Lets Note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)- float(start)))
pre = clf.predict(Xtest)
Let's see how much time is required to train the model on the training dataset:
Execution time for building the Tree is: 2.913641
#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f"%(100*acc))
The accuracy of our model is:
Accuracy of model before feature selection is 98.82
As you can see, we are getting very good accuracy, classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 out of 15,000 instances into the correct classes.
So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can. Here, we will use feature importance to select features. As you know, in the tree-building process we use an impurity measure to choose each split; the attribute that yields the lowest impurity is chosen as the node in the tree. We can use a similar criterion for feature selection, giving more importance to features that produce less impurity, and this is exposed through the feature_importances_ attribute of sklearn's estimators. Let's find out the importance of each feature:
#Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)
As you can see here, each feature has a different importance based on its contribution to the final prediction.
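Before applying a threshold, it can help to view the attributes sorted by importance. Here is a minimal sketch (not in the original listing) that reuses feat_labels and the trained clf from above:

#Rank the attributes by importance, highest first
ranked = sorted(zip(feat_labels, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[0:10]:
    print("%s: %.4f" % (name, score))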
We will use these importance scores to rank our features; in the following part, we will select those features that have an importance greater than 0.01 for model training:
#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset:
#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
#Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ",shape)
Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)
Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26.32 MB to 5.60 MB, a reduction of almost 80% from the original.
In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set:
#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)- float(start)))
#Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))
Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97
We have achieved 99.97 percent accuracy with the modified dataset, which means we are now classifying 14,996 instances into the correct classes, whereas previously we classified only 14,823 instances correctly.
This is a huge improvement we have got with the feature selection process; we can summarize all the results in the following table:
| Evaluation criteria | Before feature selection | After feature selection |
| --- | --- | --- |
| Number of features | 94 | 20 |
| Size of dataset | 26.32 MB | 5.60 MB |
| Training time | 2.91 seconds | 1.71 seconds |
| Accuracy | 98.82 percent | 99.97 percent |
The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and the dimensionality of the dataset. After the reduction in dimensions, training time is shorter, and, in the end, the model is less prone to overfitting, achieving higher accuracy than before.
To summarize, in this article we explored four feature selection methods: univariate statistical tests, recursive feature elimination (RFE), principal component analysis (PCA), and feature importance.
If you found this post useful, do check out the book Ensemble Machine Learning to learn more about stacked generalization, among other techniques.