Selecting meaningful features
If we notice that a model performs much better on a training dataset than on the test dataset, this observation is a strong indicator of overfitting. As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, overfitting means the model fits the parameters too closely with regard to the particular observations in the training dataset, but does not generalize well to new data; we say that the model has a high variance. The reason for the overfitting is that our model is too complex for the given training data. Common solutions to reduce the generalization error are as follows:
- Collect more training data
- Introduce a penalty for complexity via regularization
- Choose a simpler model with fewer parameters
- Reduce the dimensionality of the data
Collecting more training data is often not applicable. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will learn about a...