Using learning curves
Another part of model optimization is determining the right amount of data to use. We want to use enough data to maximize performance, but not so much extra data that training takes longer and consumes more resources without improving results. Using the yellowbrick package, we can easily see how our model's performance changes as we increase the amount of data we use:
from yellowbrick.model_selection import LearningCurve

# plot average train and CV scores at increasing training set sizes
lc = LearningCurve(knn, scoring='neg_mean_absolute_error')
lc.fit(features, targets)
lc.show()
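Here, knn, features, and targets were defined earlier in the chapter. If you want to run the snippet on its own, a minimal self-contained sketch could look like the following (the synthetic regression data and the KNN settings are stand-in assumptions, not the book's house price data):

from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from yellowbrick.model_selection import LearningCurve

# stand-in data; replace with the house price features and targets
features, targets = make_regression(
    n_samples=1000, n_features=10, noise=10, random_state=42)
knn = KNeighborsRegressor(n_neighbors=5)

lc = LearningCurve(knn, scoring='neg_mean_absolute_error')
lc.fit(features, targets)
lc.show()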
We simply give the LearningCurve class our model, a scoring metric, and optionally other arguments. By default, it uses 3-fold CV. When we fit the learning curve and then show the results with lc.show(), we get the following:
Figure 14.2: The learning curves from our KNN model and house price data
The training score is the average score on the training sets from CV, while the CV score is the average score on the held-out validation sets.
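These averages come from repeated cross-validation at increasing training set sizes; yellowbrick builds on scikit-learn's learning_curve function for this. A rough sketch of computing the same averages directly (the cv=3 here mirrors the 3-fold default mentioned above) looks like this:

import numpy as np
from sklearn.model_selection import learning_curve

# same knn, features, and targets as in the snippet above
train_sizes, train_scores, cv_scores = learning_curve(
    knn, features, targets, cv=3,
    scoring='neg_mean_absolute_error')

# rows correspond to training sizes, columns to CV folds
mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)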