Feature engineering
In Chapter 5, Advanced Model Building – Part I, we introduced some feature engineering concepts and discussed target encoding at length. In this section, we cover feature engineering in more depth. We organize the work into the following types of transformation:
- Algebraic transformations
- Features engineered from dates
- Simplifying categorical variables by combining categories
- Missing value indicator functions
- Target encoding categorical columns
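To make the first four items concrete, here is a minimal sketch in plain pandas rather than H2O-3, so it is not the code this chapter uses. The column names (loan_amount, income, issue_date, purpose) and the list of categories to keep are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four split-independent transformation types to a copy of df."""
    out = df.copy()

    # Missing value indicator: 1 where the raw value is absent, 0 otherwise.
    out["income_missing"] = out["income"].isna().astype(int)

    # Algebraic transformation: ratio of two numeric columns.
    # Rows with a missing income yield NaN here.
    out["loan_to_income"] = out["loan_amount"] / out["income"]

    # Features engineered from dates: extract year and month.
    issued = pd.to_datetime(out["issue_date"])
    out["issue_year"] = issued.dt.year
    out["issue_month"] = issued.dt.month

    # Simplify a categorical variable: keep selected levels, pool the rest.
    keep = {"car", "debt_consolidation"}  # hypothetical levels to keep
    out["purpose_simple"] = out["purpose"].where(out["purpose"].isin(keep), "other")

    return out


# Hypothetical toy data, for illustration only.
raw = pd.DataFrame({
    "loan_amount": [10000.0, 5000.0, 20000.0],
    "income": [50000.0, np.nan, 80000.0],
    "issue_date": ["2015-03-01", "2016-11-15", "2017-07-30"],
    "purpose": ["car", "wedding", "vacation"],
})
features = engineer_features(raw)
```

None of these steps looks at the response column, which is why they can be applied to the full dataset in one pass.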
The order of these transformations does not matter, except that target encoding should come last. Target encoding is the only transformation that requires the data to be split into training and test sets, because it is computed from the response. By saving it for the end, we can apply the other transformations to the entire dataset at once rather than separately to the training and test splits. We also introduce stratified sampling for splitting data in H2O-3. This has very little impact on our current use case but is important when data is highly imbalanced...
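As a rough illustration of what stratified sampling does, the following sketch splits a toy dataset so that the training and test sets preserve the class proportions of the response. It uses plain pandas rather than H2O-3 (which provides its own `stratified_split` method on `H2OFrame`); the helper name, the column names, and the 10% positive rate are assumptions made for this example.

```python
import pandas as pd


def stratified_train_test(df: pd.DataFrame, label: str,
                          test_frac: float = 0.2, seed: int = 42):
    """Split df so that each class of `label` contributes test_frac of its rows to the test set."""
    # Sample the same fraction of rows within each class of the label.
    test = df.groupby(label).sample(frac=test_frac, random_state=seed)
    # Everything not drawn for the test set becomes training data.
    train = df.drop(test.index)
    return train, test


# Hypothetical imbalanced data: 10 positives out of 100 rows.
data = pd.DataFrame({
    "x": range(100),
    "bad_loan": [1] * 10 + [0] * 90,
})
train, test = stratified_train_test(data, label="bad_loan")

# Both splits keep the 10% positive rate of the full dataset.
train_rate = train["bad_loan"].mean()
test_rate = test["bad_loan"].mean()
```

With a purely random split, a rare class can end up over- or under-represented in the test set by chance; sampling within each class removes that source of variation.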