We have our data and have installed the imbalanced-learn library, so we are ready to build our classifier. As mentioned earlier, the one-hot encoding techniques we are familiar with will not scale well with the high cardinality of our categorical features. In Chapter 8, Ensembles – When One Model is Not Enough, we briefly mentioned random trees embedding as a technique for transforming our features. It is an ensemble of totally random trees, where each sample of our data is represented according to the leaves of each tree it ends up in. Here, we are going to build a pipeline in which the data is transformed into a random trees embedding and scaled. Finally, a logistic regression classifier will be used to predict whether a click has occurred:
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.pipeline import Pipeline
from sklearn...
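The imports above are truncated in this excerpt. As a minimal, self-contained sketch of the pipeline just described, the following wires the three stages together; the synthetic data from `make_classification` is a hypothetical stand-in for the click dataset, and the hyperparameter values shown are illustrative, not the book's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the click data used in the chapter
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

clf = Pipeline(
    [
        # Each sample is encoded by the leaves it lands in across the
        # totally random trees, yielding a high-dimensional sparse matrix
        ('embedding', RandomTreesEmbedding(
            n_estimators=10, max_depth=3, random_state=42)),
        # MaxAbsScaler scales each feature to [-1, 1] without centering,
        # so the embedding's sparsity is preserved
        ('scaler', MaxAbsScaler()),
        # Logistic regression predicts whether a click occurred
        ('classifier', LogisticRegression(max_iter=1000)),
    ]
)

clf.fit(X, y)
predictions = clf.predict(X[:5])
```

Because `RandomTreesEmbedding` produces a sparse output and `MaxAbsScaler` does not center the data, the whole pipeline stays sparse until it reaches the classifier, which keeps memory usage manageable for high-cardinality inputs.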