Scaling numeric columns
Recall that earlier in this chapter we saw initial successes with the FastTree decision tree and FastForest random forest model trainers. There are a few reasons for that.
First, decision trees and random forests are flexible and can generate a good set of rules for different datasets. Second, these algorithms are not adversely affected by unscaled data.
Our dataset is not currently scaled. That is, goals per player might range from 0 to 20 over the course of a season, while minutes played may range from 0 into the thousands.
Some machine learning algorithms struggle when different columns use different scales. These algorithms tend to focus on the very large positive or negative values in one column and pay less attention to columns whose values are smaller.
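To make the idea of scaling concrete, here is a minimal sketch of min-max scaling in plain C#. The sample values and the MinMaxScale helper are hypothetical and are not taken from this chapter's dataset or from ML.NET, which ships its own normalizing transforms. The sketch only shows the arithmetic: each column is mapped onto the range 0 to 1 using its own minimum and maximum.

```csharp
using System;
using System.Linq;

// Hypothetical sample values, chosen to mirror the ranges described above.
double[] goals = { 0, 5, 20 };
double[] minutes = { 0, 1500, 3000 };

// After scaling, both columns occupy the same 0-to-1 range.
double[] scaledGoals = MinMaxScale(goals);
double[] scaledMinutes = MinMaxScale(minutes);

Console.WriteLine("Goals:   " + string.Join(" | ", scaledGoals));
Console.WriteLine("Minutes: " + string.Join(" | ", scaledMinutes));

// Hypothetical helper: min-max scaling maps each value to
// (value - min) / (max - min) using the column's own range.
static double[] MinMaxScale(double[] values)
{
    double min = values.Min();
    double max = values.Max();
    double range = max - min;

    // Guard against a constant column, which would otherwise divide by zero.
    return values.Select(v => range == 0 ? 0.0 : (v - min) / range).ToArray();
}
```

With these sample values, 5 goals out of a 0 to 20 range and 1,500 minutes out of a 0 to 3,000 range both land in the lower half of the same 0 to 1 interval, so neither column dominates simply because its raw numbers are larger.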
Let’s see the problem in action by creating a pipeline from our basePipeline (Featurizer), mixed with a...