Feature Engineering
People come to me as a data scientist with their data. Then my job becomes part data-hazmat officer, part grief counselor.
–Anonymous
Chapter 6, Value Imputation, looked at filling in missing values. In Chapter 5, Data Quality, we touched on normalization and scaling, which adjust values to artificially fit certain numeric or categorical patterns. Both of those earlier topics come close to the subject of this chapter, but here we focus more directly on the creation of synthetic features based on raw datasets. Whereas imputation is a matter of making reasonable guesses about what missing values might be, feature engineering is about changing the representational form of data, in ways that are deterministic and often information-preserving (e.g., reversible). A simple example of a synthetic feature is the construction of BMI (body mass index) in the prior chapter.
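To make the idea of a synthetic feature concrete, here is a minimal sketch of deriving BMI from height and weight. The record layout, the field names `height_m` and `weight_kg`, and the helper `add_bmi` are assumptions made for illustration; they are not taken from the dataset used in the prior chapter. The only fixed ingredient is the standard formula, weight in kilograms divided by the square of height in meters.

```python
def add_bmi(records):
    """Return new records with a derived 'bmi' field (kg / m^2).

    Field names 'height_m' and 'weight_kg' are hypothetical.
    """
    return [
        # Keep every raw field and add the synthetic one alongside it
        {**rec, "bmi": rec["weight_kg"] / rec["height_m"] ** 2}
        for rec in records
    ]


# Made-up sample rows, purely for illustration
people = [
    {"name": "A", "height_m": 1.60, "weight_kg": 64.0},
    {"name": "B", "height_m": 2.00, "weight_kg": 80.0},
]

with_bmi = add_bmi(people)
for rec in with_bmi:
    print(rec["name"], round(rec["bmi"], 1))
```

The transformation is deterministic: the same inputs always yield the same BMI. Because the raw height and weight are kept next to the new field, nothing is lost. If only the BMI and one of the two raw measurements were retained, the other could still be recovered by rearranging the formula.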
There are many ways we might transform data. In a simple case, we...