Selecting features for regression models
Regression models have a continuous target. The statistical techniques we used in the previous section are not appropriate for such targets. Fortunately, scikit-learn's feature_selection module provides several options for selecting features when building regression models. (By regression models here, I do not mean only linear regression models. I am referring to any model with a continuous target.) Two good options are selection based on F-tests and selection based on mutual information for regression. Let's start with F-tests.
F-tests for feature selection with a continuous target
The F-statistic is a measure of the strength of the linear correlation between a target and a single regressor. Scikit-learn has an f_regression scoring function, which returns F-statistics. We can use it with SelectKBest to select features based on that statistic.
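Before turning to real data, here is a minimal sketch of how f_regression and SelectKBest fit together. The synthetic dataset from make_regression, the number of features, and the choice of k=3 are all assumptions made for illustration; they are not taken from the wage data used in this section.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration only):
# 10 candidate features, 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# f_regression scores each feature separately against the target and
# returns one F-statistic and one p-value per feature
f_stats, p_values = f_regression(X, y)

# SelectKBest keeps the k features with the highest scores
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept
selected_idx = selector.get_support(indices=True)
print("F-statistics:", np.round(f_stats, 2))
print("Selected feature indices:", selected_idx)
print("Shape after selection:", X_selected.shape)
```

Because each feature is scored on its own, this approach can miss features that matter only in combination with others, and it only captures linear relationships with the target.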
Let's use F-statistics to select features for a model of wages. We use mutual information...