Encoding, Transforming, and Scaling Features
Our data cleaning efforts are often intended to prepare that data for use with a machine learning algorithm. Machine learning algorithms typically require some form of encoding of variables. Our models also often perform better with some form of scaling so that features with higher variability do not overwhelm the optimization. We show examples of that in this chapter and of how standardizing addresses the issue.
Machine learning algorithms typically require some form of encoding of variables. We almost always need to encode our features for algorithms to understand them correctly. For example, most algorithms cannot make sense of the values female or male, or know not to treat zip codes as ordinal. Although not typically necessary, scaling is often a very good idea when we have features with vastly different ranges. When we are using algorithms that assume a Gaussian distribution of our features, some form of transformation may be...