Anticipating preferences with classification
In this tutorial, you will step into the role of a marketing analyst working for a mid-sized national consumer bank, offering services such as accounts, personal loans, and mortgages to around 300,000 customers in the country. The bank is currently trying to launch a new type of low-cost savings account, providing essential services and a pre-paid card that can be fully managed online. The product manager of this new account is not very pleased with how things are going and invites you to join a review meeting. You can see he is tense as he presents the outcome of a pilot telemarketing campaign run to support the launch. As part of this pilot, 10,000 people were randomly selected among the full bank customer base and were phoned by an outbound call center. The outcome was apparently not so bad: 1,870 of the contacted customers (19% of the total) signed up for a new account. However, the calculation of the Return On Investment (ROI) pulled the entire audience back to the unsettling reality. The average cost of attempting to contact a customer through a call center is $15 per person while the incremental revenue resulting from a confirmed sale is estimated to be, on average, $60. The math is simple: the pilot telemarketing campaign cost $150,000 and generated revenues amounting only to $112,200, implying a net loss of $37,800. Now it is clear why the product manager looked disappointed: repeating the same campaign on more customers would be financially devastating.
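The arithmetic behind the product manager's gloomy slide can be checked in a few lines. This is only a sketch of the figures quoted above; the variable names are ours, and expressing ROI as net result divided by cost is an assumption, since the text only reports the net loss.

```python
# Figures quoted in the pilot review meeting
contacted = 10_000        # customers phoned during the pilot
signed_up = 1_870         # customers who opened the new account
cost_per_call = 15        # average cost of one contact attempt, in dollars
revenue_per_sale = 60     # average incremental revenue per confirmed sale, in dollars

success_rate = signed_up / contacted
total_cost = contacted * cost_per_call
total_revenue = signed_up * revenue_per_sale
net_result = total_revenue - total_cost
roi = net_result / total_cost   # assumption: ROI expressed as net result over cost

print(f"Success rate: {success_rate:.1%}")
print(f"Cost: ${total_cost:,}  Revenue: ${total_revenue:,}  Net: ${net_result:,}")
print(f"ROI: {roi:.1%}")
```

The net result comes out negative by $37,800, matching the loss presented in the meeting.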
You timidly raise your hand and ask whether the outcomes of the pilot calls could be used to rethink the campaign target and improve the ROI of the marketing efforts. You explain that some machine learning algorithms might be able to predict whether a customer is willing or not to buy a product by learning from previous examples. As it normally happens in these cases, you instantly earn the opportunity to try what you suggested, and your manager asks you to put together a proposal on an ML way to support the launch of the new savings account.
You have mixed feelings about what just happened: on one hand, you are wondering whether you were a bit too quick in sharing the idea. On the other hand, you are very excited as you get to try leveraging algorithms to impact the business on such an important case. You are impatient to start and ask for all the available information related to the customers that were involved in the pilot. The file you receive (BankTelemarketing.csv) contains the following columns:
- Age: the age of the customer.
- Job: a string describing the job family of the customer, like blue-collar, management, student, unemployed, and retired.
- Marital: the marital status, which could be married, single, divorced, or unknown.
- Education: the highest education level reached to date by the customer, ranging from illiterate and basic.4y (4 years of basic education in total) to university.degree.
- Default: this tells us whether we know that the customer has defaulted due to extended payment delinquency or not. Only a few customers end up being marked as defaulted (yes): most of them either show a good rating history (no) or do not have enough history to be assigned to a category (unknown).
- Mortgage and Loan: tells us whether the user has ever requested a housing mortgage or a personal loan, respectively.
- Contact: indicates if the telephone number provided as a preferred contact method is a landline or a mobile phone.
- Outcome: a string recording the result of the call center contact during the pilot campaign. It can be yes or no, depending on whether the customer opened the new savings account or decided to decline the offer.
Before you get cracking, you have a chat with the product manager to get clear on what would be the most valuable outputs for the business given the situation:
- First of all, it would be very useful to understand and document what characteristics make a customer most likely to buy the new banking product. Given its novelty, it is not clear yet who will find its proposition particularly appealing. Having some more clues on this aspect can help to build more tailored campaigns, personalize their content, and—by doing so—transfer the learnings from the call center pilot to other types of media touchpoints.
- Given that the pilot covered only a relatively small subset of customers (around 3% of the total), it would be useful to identify "who else" to call within the other 97% to maximize the ROI of the marketing initiative. In fact, we can assume that the same features we found in our pilot dataset (such as age, job, marital status, and so on) are available for the entire customer database. If we were able to score the remaining customers in terms of their propensity to buy the product, we would be focusing our efforts on the most inclined ones and greatly improving the campaign's effectiveness. In other words, we should create a propensity model that will score current (and future) customers to enable better marketing targeting. We will use the propensity scores to "limit" the next marketing efforts to a selected subset of the total customer base where the percentage of people interested in the new product is higher than 19% (as it was in our pilot): by doing so, we would increase the ROI of our marketing efforts.
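To get a feel for how much better than 19% the selected subset needs to be, we can reuse the cost and revenue figures from the pilot. The sketch below assumes those two averages ($15 per call, $60 per sale) stay the same in the next campaign; the function name is ours.

```python
cost_per_call = 15      # from the pilot, assumed unchanged
revenue_per_sale = 60   # from the pilot, assumed unchanged

def expected_profit_per_call(success_rate):
    """Average profit of one call when a given share of calls ends in a sale."""
    return success_rate * revenue_per_sale - cost_per_call

# Success rate at which a call pays for itself
break_even_rate = cost_per_call / revenue_per_sale

print(f"Break-even success rate: {break_even_rate:.0%}")
print(f"Profit per call at the pilot rate: ${expected_profit_per_call(0.187):.2f}")
```

Under these assumptions, any subset of customers that converts at more than one sale in four calls turns the campaign profitable.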
From a machine learning standpoint, you need to create a machine able to predict, before you make the call, whether a consumer will or will not open a savings account. This is still a clear case of supervised learning, since you aim at predicting something based on previous examples (the pilot calls). In contrast with the Rome real estate case, where we had to predict a number (the rental price) using regression algorithms, here we need to predict the value of the categorical column Outcome. We will then need to implement classification algorithms, such as decision trees and random forest, which we are going to meet shortly. We are clear on the business need, the available data, and the type of machine learning route we want to take: we have all we need to start getting serious about this challenge. After creating a new workflow in KNIME, we load the data into it:
- Drag and drop the file BankTelemarketing.csv onto the blank workflow. After the CSV Reader node dialog appears, we can quickly check that all is in order and close the window by clicking on OK. Once executed, the output of the node (Figure 5.16) confirms that our dataset is ready to go:
Figure 5.16: The pilot campaign data: 10,000 customers through 8 features and for which we know the outcome of their call center contact
- As usual, we implement the node Statistics, to explore the characteristics of our dataset. After confirming its default configuration, we check the Top/bottom tab of its main view (press F10 or right-click and select View: Statistics View to open it). It seems that there are no missing values and that everything is in line with what we knew about the pilot campaign: the Outcome column shows 1,870 rows with yes, which is what the product manager mentioned in his presentation. We also notice that the Default column has only one row referring to a defaulted customer. This column might still be useful as it differentiates between customers who never defaulted and ones we don't have any certainty about, so we decide to keep it and move on:
Figure 5.17: The Top/bottom output of the Statistics node: only one person in this sample defaulted (good for everyone!)
- Since we are in the supervised learning scenario, we need to implement the usual partitioning/learn/predict/score structure in order to validate against the risk of overfitting. We start by adding the Partitioning node and connecting it downstream to the CSV Reader node. In its configuration dialog, we leave the Relative 70% size for the training partition and we decide to protect the distribution of the target variable Outcome in both partitions, selecting the Stratified sampling option. Additionally, we put a static number in the random seed box (you can put 12345 as you see in Figure 5.18) and tick the adjacent checkbox:
As a general rule, always perform a stratified sampling on the target variable of a classification. This will reduce the impact of imbalanced classes when learning and validating your model. There are other ways to restore a balance in the distribution of classes, such as under-sampling the majority class or over-sampling the minority one. One interesting approach is the creation of synthetic (and realistic) additional samples using algorithms like the Synthetic Minority Over-sampling Technique: check out the SMOTE node to learn more.
Figure 5.18: Performing a stratified sampling using the Partitioning node: this way, we ensure a fair presence of yes and no customers in each partition
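If you are curious about what stratified sampling does behind the scenes, the idea can be sketched in plain Python. This is not how the Partitioning node is implemented: it is a minimal illustration, with a hypothetical function name and toy rows that only reproduce the class counts of the pilot (1,870 yes and 8,130 no).

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_fraction=0.7, seed=12345):
    """Split rows so that each class keeps the same share in both partitions."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]           # copy, so the caller's data is untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Toy rows: only the target column, with the class counts of the pilot
rows = [{"Outcome": "yes"}] * 1870 + [{"Outcome": "no"}] * 8130
train, test = stratified_split(rows, "Outcome")

for name, part in (("train", train), ("test", test)):
    yes_share = sum(r["Outcome"] == "yes" for r in part) / len(part)
    print(f"{name}: {len(part)} rows, {yes_share:.1%} yes")
```

Because each class is split separately, both partitions end up with the same share of yes customers as the full dataset.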
Now that we have a training and test dataset readily available, we can proceed with implementing our first classification algorithm: decision trees. Let's get a hint of how it works.
Decision tree algorithm
Decision trees are simple models that describe a decision-making process. Have a look at the tree shown in Figure 5.19 to get an idea of how they work. Their hierarchical structure resembles an upside-down tree. The root on top corresponds to the first question: according to the possible answers, there is a split between two or more subsequent branches. Every branch can either lead to additional questions (and respective splits into more branches) or terminate in leaves, indicating the outcome of the decision:
Figure 5.19: How will you go to work tomorrow? A decision tree can help you make up your mind
Decision trees can be used to describe the process that assigns an entity to a class: in this case, we call it a classification tree. Think about a table where each entity (corresponding to a row) is described by multiple features (columns) and is assigned to one specific class, among different alternatives. For example, a classification tree that assigns consumers to multiple classes will answer the question to which class does the consumer belong?: every branching will correspond to different outcomes of a test on the features (like is the age of the consumer higher than 35? or is the person married?) while each terminal leaf will be one of the possible classes. Once you have defined the decision tree, you can apply it to all consumers (current and future). For every consumer in the table, you follow the decision tree: the features of the consumer will dictate which specific path to follow and result in a single leaf to be assigned as the class of the consumer.
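To make the idea of "following the tree" concrete, here is a minimal sketch. The tree, its tests (age above 35, married or not), and the class labels are purely hypothetical, built around the example questions mentioned above; internal nodes are dictionaries holding a test, and leaves are plain class labels.

```python
# Hypothetical classification tree: each internal node asks one question
tree = {
    "test": lambda c: c["age"] > 35,              # is the consumer older than 35?
    "yes": {
        "test": lambda c: c["married"],           # is the person married?
        "yes": "class A",
        "no": "class B",
    },
    "no": "class C",
}

def classify(node, consumer):
    """Walk down the tree until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](consumer) else node["no"]
    return node

print(classify(tree, {"age": 40, "married": True}))
print(classify(tree, {"age": 40, "married": False}))
print(classify(tree, {"age": 28, "married": True}))
```

Each consumer follows exactly one path, so the same function can be applied to every row of the table, current or future.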
There are many tree-based learning algorithms available for classification. They are able to "draw" trees by learning from labeled examples. These algorithms can find out the right splits and paths that end up with a decision model able to predict classes of new, unlabeled entities. The simplest version of a decision tree learning algorithm will proceed by iteration, starting from the root of the tree and checking what the "best possible" next split to make is so as to differentiate classes in the least ambiguous way. This concept will become clear by means of a practical example. Let's imagine that we want to build a decision tree in order to predict which drink fast-food customers are going to order (among soda, wine, or beer), based on the food menu they had (the delicious alternatives are pizza, burger, or salad) and the composition of the table (whether it is among kids, couples, or groups of adults). The dataset to learn from will look like the one shown in Figure 5.20: we have 36 rows, each referring to a previous customer, and three columns, one for each feature (Menu and Type) and the target class (the Favorite drink).
Figure 5.20: Drink preferences for 36 fast-food customers: can you predict their preferences based on their food menu and type?
Since we have only two features, the resulting decision tree can only have two levels, resulting in two alternative shapes: either the first split is by Menu and the second, at the level below, by Type, or the other way around. The learning algorithm will pick the split that makes the most sense by looking at the count of the items falling into each branch and checking which splits make the "clearest cut" among classes.
In this specific case, the alternative choices for the first split are the ones drawn in Figure 5.21: you can find the number of customers falling into each branch, separated by alternative class (beer, soda, or wine). Have a look at the number and ask yourself: between the Menu split on the left and the Type split on the right, which one is differentiating in the "purest" way among the three classes?
Figure 5.21: Which of these two alternative splits gives you the most help in anticipating the choice of drinks?
In this case, it seems that the Type split on the right is a no-brainer: kids are consistently going for sodas (with the exception of 2 customers who, hopefully, got served with alcohol-free beer), groups prefer beers, while couples go mainly with wine. The other alternative (split by Menu) is messier: for those having salad and, to some extent, burger, there is no such clear-cut choice of drinks. Our preference for the option on the right is guided by human intuition: an algorithm needs a more deterministic way to make a decision. Tree learning algorithms use, in fact, metrics to decide which splits are best to pick. One of these metrics is called the Gini index, or Impurity index. Its formula is quite simple:
$$I_G = 1 - \sum_{i=1}^{M} f_i^2$$
where $f_i$ is the relative frequency of the i-th class (it's in the % column in Figure 5.21), among the M possible classes. The algorithm will calculate the $I_G$ for each possible branching of a split and average the results out. The option with the lowest Gini index (meaning, with the least "impure" cut) will win among the others. In our fast-food case, the overall Gini index for the option on the left will be the average of the $I_G$ values calculated for each of its branches.
After averaging, we find that the Gini index for the left option is 0.60 while the one for the right option is 0.38. These metrics are confirming our intuition: the option on the right (the split by Type) is "purer" as demonstrated by the lower Gini index. Now you have all the elements to see how the decision tree learning algorithm works: it will iteratively calculate the average $I_G$ for all possible splits (at least one for each available feature), pick the one with the lowest index, and repeat the same at the levels below, for all possible branches, until it is not possible to split further. In the end, the leaves are assigned by just looking at where the majority of the known examples fall. For instance, take the branching on the right in Figure 5.21: if this were the last level of a tree, kids would be classified with soda, couples with wine, and groups with beer. You can see in Figure 5.22 the resulting full decision tree you would obtain by using the fast-food data we presented above:
Figure 5.22: Decision tree for classifying fast-food customers according to their favorite drink. In which path would you normally be?
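The Gini calculation described above is easy to reproduce. In the sketch below, the index is computed as one minus the sum of the squared class frequencies, which is the usual definition of Gini impurity. The only counts taken from the text are the ones for the Kids branch (10 sodas and 2 beers out of 12); the other two lists are illustrative extremes, not the values of Figure 5.21.

```python
def gini(counts):
    """Gini index of one branch, given the number of samples in each class."""
    total = sum(counts)
    return 1 - sum((count / total) ** 2 for count in counts)

def split_gini(branches):
    """Plain average of the branch indices, as described in the text."""
    return sum(gini(branch) for branch in branches) / len(branches)

kids = [10, 2, 0]      # soda, beer, wine: the Kids branch described in the text
pure = [12, 0, 0]      # illustrative: everybody picks the same drink
mixed = [4, 4, 4]      # illustrative: drinks evenly spread

print(f"Kids branch:  {gini(kids):.3f}")
print(f"Pure branch:  {gini(pure):.3f}")
print(f"Mixed branch: {gini(mixed):.3f}")
print(f"Average of pure and mixed: {split_gini([pure, mixed]):.3f}")
```

A perfectly pure branch scores 0, while three evenly spread classes score about 0.667, the worst case for three classes. Note that many implementations weight each branch by its number of samples when averaging; the plain average here simply follows the description above.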
By looking at the obtained decision tree, you will notice that not all branches at the top level incur further splits at the level below. Take the example of the Type=Kids branch on the top left: the vast majority of kids (10 out of 12) go for Soda. There are not enough remaining examples to make a meaningful further split by Menu, so the tree just stops there. On top of this basic stopping criterion, you can implement additional (and more stringent) conditions that limit the growth of the tree by removing less meaningful branches: these are called, quite appropriately I must say, pruning mechanisms. By pruning a decision tree, you end up with a less complex model: this is very handy to use when you want to avoid model overfitting. Think about this: if you have many features and examples, your tree can grow massively. Every combination of values might, in theory, produce a very specific path. Chances are that these small branches cover an insignificant case that just happened to be in the training set but has no general value: this is a typical case of overfitting that we want to avoid as much as possible. That is why, as you will soon see in KNIME, you might need to activate some of the pruning mechanisms to avoid overfitting when growing trees.
Let's make another consideration related to numeric features in decision trees. In the fast-food example, we only had nominal features, which make every split quite simple to imagine: every underlying branch covered a possible value of the categorical column. If you have a numeric column to be considered, the algorithm will check what the Gini index would be if you split your samples using a numeric threshold. The algorithm will try multiple thresholds and pick the best split that minimizes impurity. Let's imagine that in our example we had an additional feature, called Size, that counts the number of people sitting at each table. The algorithm will test multiple thresholds and will check what the Gini index would be if you divided your samples according to these conditions, which are questions like "is Size > 3?", "is Size > 5?", and "is Size > 7?". If one of these conditions is meaningful, the split will be made according to the numeric variable: all samples having Size lower than the threshold will go to the left branch, and all others to the right branch. The Gini indices resulting from all the thresholds on the numeric features will be compared across all other indices coming from the categorical variables as we saw earlier: at each step, the purest split will win, irrespectively of its type. This is how decision trees can cleverly mix all types of features when classifying samples.
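The threshold search on a numeric feature can be sketched as follows. The table sizes and drinks below are invented for illustration (the Size column does not exist in the fast-food data), and the candidate thresholds are the three mentioned above. As before, the two branch indices are combined with a plain average.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels, thresholds):
    """Try each 'is value > threshold?' question and keep the purest split."""
    best = None
    for threshold in thresholds:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        if not left or not right:
            continue                      # a split must send rows both ways
        score = (gini(left) + gini(right)) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: number of people at the table and the drink ordered
sizes = [2, 2, 3, 4, 6, 6, 8, 9]
drinks = ["wine", "wine", "wine", "soda", "beer", "beer", "beer", "beer"]

threshold, score = best_threshold(sizes, drinks, [3, 5, 7])
print(f"Best question: is Size > {threshold}? (average Gini {score:.2f})")
```

The winning score would then be compared with the indices obtained from the categorical features, and the purest split overall would be selected.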
Decision tree models can be extended to predict numbers and, so, become regression trees. In these trees, each leaf is labeled with a different value of the target variable. Normally, the value of the leaf is just the average of all the samples that ended up in such a leaf node, after going through a construction mechanism similar to the one for classification trees (with a measure of how spread out the target values are, such as their variance, typically playing the role of the Gini index). You can build regression trees in KNIME as well: have a look at the simple regression tree nodes in the repository.
Now that we know what decision trees are, let's grow one to classify our bank customers according to the outcome of the telemarketing campaign. We'll use a new node for that: the Decision Tree Learner.
Decision Tree Learner
This node (Analytics > Mining > Decision tree) trains a decision tree model for predicting nominal variables (classification). The most important fields to be set in its configuration dialog (see Figure 5.23) are:
- Class column: you need to specify your nominal target variable to be predicted.
- Quality measure: this is the metric used to decide how to make the splits. The default value is the Gini index we have encountered above. You can also select the information Gain ratio, which would tend to create more numerous and smaller branches. There is no good or bad choice, and in most cases both measures generate very similar trees: you can try them both and see which one produces the best results.
- Pruning method: you can use this selector to activate a robust pruning technique called MDL (Minimum Description Length) that removes the less meaningful branches and generates a balanced tree.
- Min number records per node: you can control the tree growth-stopping criterion by setting a minimum number of samples for allowing a further split. By default, this hyperparameter is set to 2: this means that no branch will be generated with fewer than 2 samples. As you increase this number, you will prune more branches and obtain smaller and smaller trees: this is an effective way for tuning the complexity of the trees and obtaining an optimal, well-fitted model. By activating the MDL technique in the earlier selector, you go the "easy way" as it will automatically guess the right level of pruning.
Figure 5.23: Configuration window of the Decision Tree Learner node: are you up for some pruning today?
The output of the node is the definition of the tree model, which can be explored by opening its main view (right-click on the node and select View: Decision Tree View). In Figure 5.24, you will find the KNIME output of the fast-food classification tree we obtained earlier (see, for comparison, Figures 5.22 and 5.21): at each node of the tree, you find the number of training samples falling into each value of the class. You can expand and collapse the branches by clicking on the circled + and – signs appearing at each split:
Figure 5.24: The fast-food classification tree, as outputted by the Decision Tree Learner node in KNIME. The gray rows correspond to the majority class
- Drag and drop the Decision Tree Learner node from the repository and connect the upper output of the Partitioning node (the training set) with it. Let's leave all the default values for now in its configuration (we will have the opportunity for some pruning later): the only selector to double-check is the one setting the Class column that in our case is Outcome. If you run the node and open its decision tree view (select the node and press F10), you will meet the tree you have just grown:
Figure 5.25: A first tree classifying bank customers by Outcome: this is just a partial view of the many levels and branches available
As you expand some of the branches, you realize that the tree is very wide and deep: Figure 5.25 shows an excerpt of what the tree might look like (depending on your random partitioning, you might end up with a different tree, which is fine). In this case, we noticed that the top split divided customers into mobile and landline users. This is what happened: the Gini index was calculated across all features and scored the lowest for Contact, making this the single most important variable to differentiate customers according to their Outcome. Let's see whether this tree is good enough and predict the outcomes in the test set.
Decision Tree Predictor
This node (Analytics > Mining > Decision tree) applies a decision tree model (provided as an input in the first port) to a dataset (second port) and returns the prediction for each input row. This node will not require any configuration and will produce a similar table to the one provided in the input with an additional column that includes the result of the classification.
- Let's implement the Decision Tree Predictor node and wire it in such a way that it gets as inputs the tree model outputted by the Decision Tree Learner node and the second outport of the Partitioning node, which is our test set. As you execute the node, you will find an output that includes the precious additional column called Prediction (Outcome).
At this point, we can finally assess the performance of the model by calculating the metrics used for classification. Do you remember the accuracy, precision, sensitivity measures, and confusion matrix we obtained in the cute dog versus muffin example? It's time to calculate these metrics by using the right node: Scorer.
Scorer
This node (Analytics > Mining > Scoring) calculates the summary performance scores of classification by comparing two nominal columns. The only step required for its configuration (Figure 5.26) is the selection of the columns to be compared: you should select the column carrying the observed (actual) values in the First Column dropdown, while predictions go in the Second Column selector. The node outputs the most important metrics for assessing a classification performance, namely: the Confusion Matrix, provided as a table in the first output (columns will refer to the predictions, while actual values will go as rows) and summary metrics such as Accuracy, Precision, and Sensitivity, which you can find in the second output of the node.
Some of the performance metrics for a classification will depend on which class you decide to be considered as Positive: have a look at Figure 4.8 in the previous chapter to get a refresher. In the second output of the Scorer node, you will find one row for every possible class: each row contains the metrics calculated under the assumption that one specific class is labeled as Positive and all the other classes are Negative.
Figure 5.26: The configuration window of the Scorer node: just select the columns to compare across
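To see what these outputs contain, the sketch below rebuilds a confusion matrix (actual values as rows, predictions as columns) and the per-class precision and sensitivity from two lists of labels. It is an illustration of the definitions, not the Scorer node's own code; the function name and the eight toy labels are made up.

```python
def score_classification(actual, predicted):
    """Confusion matrix, accuracy, and per-class precision and sensitivity."""
    classes = sorted(set(actual) | set(predicted))
    # matrix[a][p] counts rows whose actual class is a and predicted class is p
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1

    accuracy = sum(matrix[c][c] for c in classes) / len(actual)

    per_class = {}
    for c in classes:                      # treat class c as Positive in turn
        true_positives = matrix[c][c]
        predicted_as_c = sum(matrix[a][c] for a in classes)
        actually_c = sum(matrix[c].values())
        per_class[c] = {
            "precision": true_positives / predicted_as_c if predicted_as_c else None,
            "sensitivity": true_positives / actually_c if actually_c else None,
        }
    return matrix, accuracy, per_class

# Made-up labels, for illustration only
actual = ["yes", "yes", "no", "no", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]

matrix, accuracy, per_class = score_classification(actual, predicted)
print(matrix)
print(f"Accuracy: {accuracy:.2f}")
print(per_class)
```

Each entry of per_class corresponds to one row of the second output: the same predictions give different precision and sensitivity values depending on which class is taken as Positive.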
- We can now add the Scorer node (make sure you don't get confused and pick the Numeric Scorer node, which can only be used for regressions) to the workflow and connect it downstream to the Decision Tree Predictor. In the configuration window, we can leave everything as it is, just checking that we have Outcome as First Column and Prediction (Outcome) as Second Column. Execute the node and open its main view (F10 or right-click and select View: Confusion Matrix).
The output of the Scorer node (Figure 5.27) tells us that we get an accuracy level of 78.3%: out of 100 predictions, 78 of them turn out to be correct. The confusion matrix helps us understand whether the model can bring value to our business case:
Figure 5.27: The output of the node Scorer after our first classification: 78% accuracy is not bad as a starting point
In the case shown in Figure 5.27, we have 450 customers (180 + 270) in the test set that were predicted as interested in the account (Prediction (Outcome) = yes). Out of these, only 180 (40%, which corresponds to the precision of our model) were predicted correctly, meaning that these customers ended up buying the product. The number seems to be low, but it is already encouraging: the algorithm can help to find a subset of customers that are more likely to buy the product. If we indiscriminately called every customer, as we know from the pilot, we would have achieved a success rate of 19% while, by focusing on the (fewer) customers that the algorithm identified as potential buyers (Prediction (Outcome) = yes), the success rate would double and reach 40%.
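The comparison with the pilot baseline boils down to two divisions. The counts are the ones read from Figure 5.27; the ratio between the model's precision and the baseline rate is often called lift, a name we introduce here for convenience.

```python
# Counts read from the confusion matrix in Figure 5.27
true_positives = 180     # predicted yes and actually yes
false_positives = 270    # predicted yes but actually no

# Baseline: success rate of calling customers at random, as in the pilot
baseline_rate = 1_870 / 10_000

precision = true_positives / (true_positives + false_positives)
lift = precision / baseline_rate

print(f"Precision: {precision:.0%}")
print(f"Baseline:  {baseline_rate:.1%}")
print(f"Lift:      {lift:.2f}x")
```

A lift slightly above 2 is what "the success rate would double" means in numbers.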
Let's now think about what we can do to improve the results of the modeling. We remember that our decision tree was deep and wide: some of the branches were leading to very "specific" cases, which interested only a handful of examples in the training set. This doesn't look right: a decision tree that adapted so closely to the training set might produce high errors in future cases as it is not able to comprehend the essential patterns of general validity. We might be overfitting! Let's equip ourselves with a good pair of pruning shears: we can try to fix the overfitting by reducing the complexity of the tree, making some smart cuts here and there:
Sometimes, the Decision Tree Predictor node generates null predictions (red ? in KNIME tables, which caused the warning message you see at the top of Figure 5.27). This is a sign that the tree might be overfitted: its paths are too "specific" and do not encompass the set of values that require a prediction (this "pathology" is called No True Child). Besides taking care of the overfitting, one trick you can apply to solve the missing values is to open the PMMLSettings panel (second tab in the Decision Tree Learner configuration) and set No true child strategy to returnLastPrediction.
- Open the configuration dialog of the Decision Tree Learner and select MDL as the Pruning method. This is the simplest and quickest way to prune our tree: we could have also iterated through higher values of Min number records per node (give it a try to check how it works), but MDL is a safe approach to get quick improvements.
- Let's see if it worked. We don't need to change anything else, so let's just execute the Scorer node and open its main view to see what happened.
When you look at the results (Figure 5.28) you feel a thrill of excitement: things got better. The accuracy rose to 83% and, most importantly, the precision of the model greatly increased. Out of the 175 customers in the test set who are now predicted as Outcome=yes, 117 would have ended up actually buying the product. If we followed the recommendation of the model (which we can assume will keep a similar predictive performance on the customers we haven't called yet, that is, the remaining 97% of our customer base), the success rate of our marketing campaign would move to 67%, which is more than 3 times better than our initial baseline of 19%!
Figure 5.28: The output of the node Scorer after our tree pruning: the precision
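It is worth translating these counts back into dollars. The sketch below assumes that the average cost per call ($15) and revenue per sale ($60) measured in the pilot also apply to the customers in the test set, and compares calling only the predicted-yes customers before and after pruning. The function name is ours.

```python
COST_PER_CALL = 15       # pilot average, assumed to hold for the test set
REVENUE_PER_SALE = 60    # pilot average, assumed to hold for the test set

def campaign_result(calls, sales):
    """Success rate and net result of calling only the selected customers."""
    cost = calls * COST_PER_CALL
    revenue = sales * REVENUE_PER_SALE
    return {"success_rate": sales / calls, "net": revenue - cost}

unpruned = campaign_result(calls=450, sales=180)   # counts from Figure 5.27
pruned = campaign_result(calls=175, sales=117)     # counts quoted for Figure 5.28

for name, result in (("Unpruned tree", unpruned), ("Pruned tree", pruned)):
    print(f"{name}: success rate {result['success_rate']:.0%}, net ${result['net']:,}")
```

Under these assumptions, both selections would have been profitable on the test set, and the pruned tree earns slightly more while making far fewer calls.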
The model was previously overfitting and some pruning clearly helped. If you now open the tree view of the Decision Tree Learner node, you will find a much simpler model that can be explored and, finally, interpreted. You can expand all branches at once by selecting the root node (just left-click on it) and then clicking on Tree | Expand Selected Branch from the top menu. By looking at the tree, which might be similar to the one shown in Figure 5.29, we can finally attempt some interpretation of the model. Look at the different percentages of the yes category within each node: we found some buckets of customers that are disproportionately interested in our product:
Figure 5.29: An excerpt of the decision tree classifying bank customers by Outcome: students, retired, and 60+ customers using landlines are showing the most interest in our new savings account
For example, we find out that customers falling into these three segments:
- Mobile users who are students
- Mobile users who are retired
- Landline users who are 60+ years old
responded much more to our pilot campaign than all others, with more than 50% of the samples ending up opening a new savings account. We have a quick chat with the product manager and show these results to him. He is very excited about the findings and, after some thinking, he confirms that what the algorithm spotted makes perfect sense from a business standpoint. The new type of account has lower fixed costs than the others, which explains why its proposition proves more compelling to lower-income customers, such as students and the retired. Additionally, this account includes a free prepaid card, which is a great tool for students, who can get their balance topped up progressively, but also for older customers, who do not yet fully trust traditional credit cards and prefer keeping the risk of fraud under control. The product manager is very pleased with what you shared with him and does not stop thanking you: by having data-based evidence of the characteristics that make a customer more likely to buy his new product, he can now fine-tune the marketing concept, highlighting benefits and reinforcing the message to share with prospective customers.
The positive feedback you just received was invigorating and you want to quickly move to the second part of the challenge: building a propensity model able to "score" the 97% of the customers that have not been contacted yet. To do so, we will first need to introduce another classification algorithm particularly well suited for anticipating propensities: random forest.
Random forest algorithm
One approach used in machine learning to obtain better performance is ensemble learning. The idea behind it is very simple: instead of building a single model, you combine multiple base models together and obtain an ensemble model that, collectively, produces stronger results than any of the underlying models. If we apply this concept to decision trees, we will grow multiple models in parallel and obtain… a forest. However, if we run the decision tree algorithm we've seen in the previous pages on the same dataset multiple times, we will just obtain identical "copies" of the same tree. Think about it: the procedure we described earlier (with the calculation of the Gini index and the building of subsequent branches) is completely deterministic and will always produce the same outputs when using the same inputs. To encourage "diversity" across the base models, we need to force some variance in the inputs: one way to do so is to randomly sample subsets of rows and columns of our input dataset, and offer them as different training sets to independently growing base models. Then, we will just need to aggregate the results of the several base models into a single ensemble model. This is called Bagging, short for Bootstrap Aggregation, which is the secret ingredient that we are going to use to move from decision trees to random forests.
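The random sampling of rows and columns can be sketched in a few lines. The five rows below are invented and carry only three of the pilot's columns plus the target; the function name is hypothetical. Rows are drawn with replacement (the bootstrap part), so some may appear more than once and others not at all, while each tree sees only a random subset of the features.

```python
import random

def bootstrap_sample(rows, features, target, n_features, rng):
    """Build the training set for one base tree."""
    chosen = rng.sample(features, n_features)    # random subset of columns
    drawn = rng.choices(rows, k=len(rows))       # rows drawn with replacement
    sample = [{col: row[col] for col in chosen + [target]} for row in drawn]
    return chosen, sample

# Invented rows, for illustration only
rows = [
    {"Age": 63, "Mortgage": "yes", "Contact": "landline", "Outcome": "yes"},
    {"Age": 24, "Mortgage": "no", "Contact": "mobile", "Outcome": "yes"},
    {"Age": 41, "Mortgage": "yes", "Contact": "mobile", "Outcome": "no"},
    {"Age": 35, "Mortgage": "no", "Contact": "landline", "Outcome": "no"},
    {"Age": 52, "Mortgage": "yes", "Contact": "mobile", "Outcome": "no"},
]
features = ["Age", "Mortgage", "Contact"]

rng = random.Random(12345)
training_sets = [bootstrap_sample(rows, features, "Outcome", 2, rng) for _ in range(4)]

for number, (chosen, sample) in enumerate(training_sets, start=1):
    print(f"Tree #{number} learns from columns {chosen} on {len(sample)} sampled rows")
```

Each of the four training sets would then be handed to the decision tree algorithm, producing four trees that are free to differ from one another.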
To understand how it works, let's visualize it in a practical example: Figure 5.30 shows both a simple decision tree and a random forest (made of four trees) built on our bank telemarketing example:
Figure 5.30: A decision tree and random forest compared: with the forest you get a propensity score and higher accuracy
Thanks to a random sampling of rows and columns, we managed to grow four different trees, starting from the same initial dataset. Look at the tree on the bottom left (marked as #1 in the figure): it only had the Mortgage and the Contact columns available to learn from, as they were the ones randomly sampled in its case. Given the subset of rows that were offered to it (that were also randomly drawn as part of the bootstrap process), the model applies the decision tree algorithm and produces a tree that differs from all other base models (you can check the four trees at the bottom: they are all different). Given the four trees that make our forest, let's imagine that we want to predict the outcome for a 63-year-old retired customer, who has a mortgage and gets contacted by landline. The same customer will follow four different paths (one for each tree), which will lead to different outcomes. In this case, 3 trees out of 4 agree that the prediction should be yes. The resulting ensemble prediction will be made in a very democratic manner, by voting. Since the majority believes that this customer is a yes, the final outcome will be yes with a Propensity score of 0.75 (3 divided by 4).
The assumption we make is that the more trees that are in agreement with a customer being classified as yes, the "closer" the customer is to buying our product. Of course, we normally build many more trees than just four: the diversity of the different branching each tree displays will make our ensemble model more "sensitive" to the smaller nuances of feature combinations that can tell us something useful about the propensity of a customer. Every tree offers a slightly "different" point of view on how to classify a customer: by bringing all these contributions together, in a sort of decision crowdsourcing, we obtain more robust collective predictions. This is yet another proof of the universal value of diversity in life!
Although the propensity score is related to the probability that a classification is correct, they are not the same thing. We are still in the uncertain world of probabilistic models: even if 100% of the trees agree on a specific classification, you cannot be 100% sure that the classification is right.
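The voting mechanism described above fits in a few lines of Python. This is only a sketch: the function name ensemble_vote is hypothetical, and the four votes reproduce the 63-year-old retired customer of the example, where three trees out of four say yes.

```python
from collections import Counter


def ensemble_vote(tree_votes, positive_class="yes"):
    """Return the majority prediction and the share of trees voting for the positive class."""
    counts = Counter(tree_votes)
    # In case of a tie, the class that was counted first wins (an arbitrary choice).
    prediction = counts.most_common(1)[0][0]
    propensity = counts[positive_class] / len(tree_votes)
    return prediction, propensity


votes = ["yes", "yes", "no", "yes"]  # one vote per tree
prediction, propensity = ensemble_vote(votes)
print(prediction, propensity)  # yes 0.75
```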
Let's get acquainted with the KNIME node that can grow forests: meet the Random Forest Learner node.
Random Forest Learner
This node (Analytics > Mining > Decision Tree Ensemble > Random Forest > Classification) trains a random forest model for classification. At the top of its configuration window (Figure 5.31) you can select the nominal column to use as the target of the classification (Target Column). Then, in the column selector in the middle, you can choose which columns to use as features (the ones appearing on the Include box on the right): all others will be ignored by the learning algorithm. The option Save target distribution… will record the number of samples that fell into each leaf of the underlying tree models: although it is memory expensive, it can help to generate more accurate propensity scores, by means of the soft voting technique, which we will talk about later.
Toward the bottom of the window, you will also find a box that lets you choose how many trees you want to grow (Number of models). Lastly, you can decide to check a tick box (labeled as Use static random seed) that, similarly to what you found in the Partitioning node, lets you "fix" the initialization seed of the pseudo-random number generator used for the random sampling of rows and columns: in this case, given the same input and configuration parameters, you will always obtain the same forest:
Figure 5.31: Configuration window of the Random Forest Learner node: how many trees you want to see in the forest?
- Let's implement the Random Forest Learner node and connect the training set (the first output port of the Partitioning node) with its input: there is no harm in reusing the same training and test sets used for the decision tree learner. If we execute the node and open its main view (F10 or right-click and then select View: Tree Views), we will find a tree-like output, as in the case of the decision trees: however, this time, we have a little selector at the top that lets us scroll across all 100 trees of the forest.
Random forests are black box models as they are hard to interpret: going through 100 different trees would not offer us a hint for explaining how the predictions are made. However, there is a simple way to check which features proved to be most meaningful. Open the second outport of the Random Forest Learner node (right-click and click on Attribute statistics). The first column—called #splits (level 0)—tells you how many times that feature was selected as the top split of a tree. The higher that number, the more useful that feature has been in the learning process of the model.
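The idea behind the #splits (level 0) column can be pictured as a simple count. The sketch below assumes we already know which feature each tree used for its top split; the list of root features is made up for illustration and does not come from the actual forest.

```python
from collections import Counter

# Hypothetical top-split feature of each tree in a small forest of eight trees.
root_features = ["Age", "Job", "Age", "Contact", "Age", "Mortgage", "Job", "Age"]

# Count how many times each feature was chosen as the first split, most frequent first.
ranking = Counter(root_features).most_common()

for feature, n_splits in ranking:
    print(f"{feature}: chosen as top split in {n_splits} trees")
```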
Random Forest Predictor
This node (Analytics > Mining > Decision Tree Ensemble > Random Forest > Classification) applies a random forest model (which needs to be provided in the first gray input port) to a dataset (second port) and returns the ensemble prediction for each input row. As part of its configuration, you can decide whether you want to output the propensity scores for each individual class (Append individual class probabilities). If you tick the Use soft voting box, you enable a more accurate estimation of propensity: in this case, the vote of each tree will be weighted by a factor that depends on how many samples fell in each leaf during the learning process. The more samples a leaf has "seen," the more confident we can be about its estimation. To use this feature, you will have to select the option Save target distribution… in the Random Forest Learner node, which is upstream.
Figure 5.32: The configuration dialog of the Random Forest Predictor node. You can decide whether you want to see propensity scores or not.
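To get a feeling for the difference between plain voting and soft voting, consider the sketch below. It is only one possible reading of the description above: the leaf counts are made up, and pooling them across trees (so that leaves that "saw" more training samples weigh more) is an assumption for illustration, not a statement of how KNIME computes its scores.

```python
def hard_vote_propensity(leaf_counts):
    """One tree, one vote: share of trees whose reached leaf holds a majority of yes samples."""
    yes_votes = sum(1 for yes, no in leaf_counts if yes > no)
    return yes_votes / len(leaf_counts)


def soft_vote_propensity(leaf_counts):
    """Pool the leaf counts, so leaves that saw more training samples weigh more."""
    total_yes = sum(yes for yes, _ in leaf_counts)
    total = sum(yes + no for yes, no in leaf_counts)
    return total_yes / total


# Hypothetical (yes, no) training-sample counts of the leaf reached in each of four trees.
leaves = [(30, 10), (3, 1), (2, 6), (45, 15)]

hard = hard_vote_propensity(leaves)
soft = soft_vote_propensity(leaves)
print(f"hard voting: {hard:.2f}, soft voting: {soft:.2f}")
```

In this toy case, the second tree casts a full yes vote under plain voting even though its leaf has seen only four samples, while under the pooled version its influence is much smaller.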
- Drag and drop the Random Forest Predictor node onto the workflow and connect its inputs with the forest model outputted by the Random Forest Learner and with the test set, meaning the bottom outport of the Partitioning node. Configure the node by unticking the Append overall prediction confidence box, and ticking both the Append individual class probabilities (we need the propensity score) and the Use soft voting boxes. After you execute it, you will find at its output the test set enriched with the prediction, Prediction (Outcome), and the propensity scores by class. Specifically, the propensity of a customer being interested in our product is P (Outcome=yes).
- Implement a new Scorer node (for simplicity, you can copy/paste the one you used for the decision tree) and connect it downstream to the Random Forest Predictor. For its configuration, just make sure you select Outcome and Prediction (Outcome) in the first two drop-down menus. Execute it and open its main output view (F10).
The results of Scorer (Figure 5.33) confirm that, at least in this case, the ensemble model comes with better performance metrics. Accuracy has increased by a few decimal points and, most importantly (as it directly affects the ROI of our marketing campaigns), precision has reached 72% (open the Accuracy statistics outport to check it or compute it easily from the confusion matrix):
Figure 5.33: The Scorer node output for our random forest. Both accuracy and precision increased versus the decision tree: diversity helps
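If you want to compute accuracy and precision by hand from a confusion matrix, the sketch below shows the formulas. The four counts are made up and chosen only so that precision comes out at 72% on a 3,000-row test set: they are not the values shown in Figure 5.33. The break-even line uses the $15 cost per call and $60 revenue per sale from the business case.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted yes customers who actually said yes."""
    return tp / (tp + fp)


# Hypothetical confusion matrix counts for the positive class yes.
tp, fp, fn, tn = 180, 70, 380, 2370

CALL_COST = 15
SALE_REVENUE = 60
# A call pays off on average only if the success rate exceeds cost / revenue.
break_even_rate = CALL_COST / SALE_REVENUE

print(f"accuracy:  {accuracy(tp, fp, fn, tn):.2f}")
print(f"precision: {precision(tp, fp):.2f}")
print(f"break-even success rate: {break_even_rate:.2f}")
```

This is why precision matters so much for the ROI: every false positive is a call that costs $15 and brings nothing back.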
Now that we have confirmation that we have a robust model at hand, let's concentrate on the propensity score we calculated and see what we can do with it.
Open the output of the Random Forest Predictor node and sort the rows by decreasing level of propensity (click on the header of column P (Outcome=yes) and then on Sort Descending): you will obtain a view similar to the one shown in Figure 5.34:
Figure 5.34: The predictions generated by the random forest in descending order of propensity, P (Outcome=yes): the more we go down the list, the less interested customers (column Outcome) we find
At the top of the list, we have the customers in the test set that most decision trees identified as interested. In fact, if you look at the column Outcome, we find that most rows show a yes, proving that, indeed, these customers were very interested in the product (when called, they agreed to open the savings account). If you scroll down the list, the propensity will go down and you will start finding more and more no values in column Outcome. Now, let's think about the business case once again: now that we have a model able to predict the level of propensity, we could run it on the other 97% of customers that were not contacted as part of the pilot. If we then sorted our customer list by decreasing level of propensity (as we just did on the test set), we would obtain a prioritized list of the next people to call about our product. We would expect the first calls (the ones directed to the most inclined people) to end up with a very high success rate (like we noticed in the test set).
Then, little by little, the success rate will decay: more and more people will start saying no and, at some point, it will start to become counterproductive to make a call. So, the key question becomes: at what point should we "stop" to get the maximum possible ROI from the initiative? How many calls should we make? What is the minimum level of propensity, below which we should avoid attempting to make a sale? The exciting part of propensity modeling is that you can find an answer to these questions before making any call!
In fact, if we assume that the customers that were part of the pilot were a fair sample of the total population, then we can use our test set (which has not been "seen" by the training algorithm, so there is no risk of overfitting) as a base for simulating the ROI of a marketing campaign where we call customers by following a decreasing level of propensity. This is exactly what we are going to do right now: first, we sort the test set by decreasing level of propensity (the temporary sorting we did earlier did not impact the permanent order of the rows in the underlying table); then, we calculate the cumulative profit we would make by "going through the list," using the cost and revenue estimates shared by the product manager. Finally, we check at which level of propensity the profit is maximized, so that we have a good estimate of the number of calls that we will need to make in total to optimize the ROI. Let's get cracking!
- Implement a Sorter node and connect it at the output of the Random Forest Predictor node. We want to sort the customers in the test set by decreasing level of propensity, so select column P (Outcome=yes) and go for the Descending option.
- Implement a Rule Engine node to calculate the marginal profit we make on each individual customer. We know that every call we make costs us $15, irrespective of its outcome. We also know that every account opening brings an incremental revenue of $60. Hence, every customer that ends up buying the product (Outcome=yes) brings $45 of profit, while every other customer brings a profit of -$15. Let's create a column (we can call it Profit) that implements this simple logic, as shown in Figure 5.35:
Figure 5.35: The Rule Engine node for calculating the marginal profit for each individual customer
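The logic of the Profit column can be written down as a tiny Python function. This is a sketch of the rule only, not the Rule Engine syntax; the function name marginal_profit is hypothetical, and it assumes the Outcome column holds yes or no. As a sanity check, it is applied to the pilot figures (1,870 sales out of 10,000 calls).

```python
CALL_COST = 15      # dollars per contact attempt, whatever the outcome
SALE_REVENUE = 60   # incremental dollars for each account opened


def marginal_profit(outcome):
    """Profit made on one called customer; assumes outcome is 'yes' or 'no'."""
    revenue = SALE_REVENUE if outcome.strip().lower() == "yes" else 0
    return revenue - CALL_COST


# Sanity check against the pilot campaign: 1,870 sales out of 10,000 calls.
pilot_profit = 1_870 * marginal_profit("yes") + (10_000 - 1_870) * marginal_profit("no")
print(marginal_profit("yes"), marginal_profit("no"), pilot_profit)  # 45 -15 -37800
```

The last number matches the $37,800 net loss of the pilot, which confirms that the rule is consistent with the figures shared by the product manager.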
To calculate the cumulative profit we will need to use a new node, called Moving Aggregation.
Moving Aggregation
As the name suggests, this node (Other Data Types > Time Series > Smoothing) aggregates values on moving windows and calculates cumulative summarizations. To use a moving window, you will have to declare the Window length in terms of the number of rows to be considered and the Window type (meaning the direction of movement of the window in the table). For example, if you select 3 as the length and Backward as the type, the previous 3 rows will be aggregated together. If you want to aggregate by cumulating values from the first row to the last, you need to check the Cumulative computation box. Similarly to a Group By node, the Aggregation settings tab will let you select which columns should be aggregated and using which method:
Figure 5.36: Configuration dialog of the Moving Aggregation node: you can aggregate through moving windows or by progressively cumulating
- Implement the Moving Aggregation node and connect it downstream from the Rule Engine. Check the Cumulative computation box, double-click on the Profit column on the left, and select Sum as the aggregation method. Execute the node and open its outport view.
The Moving Aggregation node has cumulated the marginal profit generated by each customer. If we scroll the list (similar to the one displayed in Figure 5.37) and keep an eye on the last column, Sum(Profit), we notice that the profit peaks slightly before we reach the first third of the full list. When the P (Outcome=yes) propensity is near 0.23, we obtain a profit of around $8,200. This means that by calling only people above this level of propensity (called the Cutoff point), we maximize the ROI of our campaign.
Figure 5.37: The output of the Moving Aggregation node: it seems that we reach maximum profit when we call people having a propensity of around 0.23.
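The sequence we just built with the Sorter, Rule Engine, and Moving Aggregation nodes can be sketched in Python as follows. The eight (propensity, outcome) pairs are made up to keep the example readable, and the function name find_cutoff is hypothetical: on the real test set you would have around 3,000 rows.

```python
from itertools import accumulate


def find_cutoff(scored_customers, call_cost=15, sale_revenue=60):
    """Return (number of calls, propensity cutoff, peak cumulative profit)."""
    # Sorter: highest propensity first.
    ranked = sorted(scored_customers, key=lambda customer: customer[0], reverse=True)
    # Rule Engine: marginal profit of each call.
    profits = [(sale_revenue if outcome == "yes" else 0) - call_cost
               for _, outcome in ranked]
    # Moving Aggregation with cumulative computation: running sum of the profit.
    cumulative = list(accumulate(profits))
    # Position of the first peak of the cumulative profit curve.
    best_index = max(range(len(cumulative)), key=cumulative.__getitem__)
    return best_index + 1, ranked[best_index][0], cumulative[best_index]


# Hypothetical (propensity, actual outcome) pairs standing in for the scored test set.
customers = [
    (0.91, "yes"), (0.10, "no"), (0.62, "yes"), (0.35, "no"),
    (0.27, "yes"), (0.18, "no"), (0.05, "no"), (0.74, "no"),
]

calls, cutoff, peak_profit = find_cutoff(customers)
print(calls, cutoff, peak_profit)  # 5 0.27 105
```

In this toy list, the profit peaks after the fifth call: going further down the list would only add calls that cost more than they bring in.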
To make this concept clearer, let's visualize the changing profit by employing a line chart.
Line Plot (local)
This node (View > Local (Swing)) generates a line plot. The only configuration that might be needed is the box labeled No. of rows to display, which you can use to extend the limit of rows considered for creating the plot.
- Implement a Line Plot (local) node, extend the number of rows to display to at least 3,000 (the size of the test set), execute it, and open its view at once (Shift + F10). In the Column Selection tab, keep only Sum(Profit) on the right and remove all other columns.
The output of the chart (shown in Figure 5.38) confirms what we noticed in the table and makes it more evident: if we use the propensity score to decide the calling order of customers, our profit will follow the shape of the curve in the figure. We will start with a steep increase of profit (see the first segment on the left), as most of the first people we call (who are top prospects, given their high propensity score) will actually buy the product. Then, at around one-third of the list (when we know that the propensity score is near 0.23), we reach the maximum possible profit. After that, it will drop fast as we will encounter fewer and fewer interested customers. If we called all the people on the list, we would end up with a significant loss, as we have painfully learned as part of the pilot campaign:
Figure 5.38: The cumulative profit curve for our machine learning-assisted telemarketing campaign: we maximize the ROI at around one-third of the list sorted by propensity
Thanks to this simulation, we have discovered that if we limit our campaign to customers with a propensity score higher than 0.23 (which will be around one-third of the total population), we will maximize our profit. By scaling this result proportionally (our simulation covered only the test set, so 3,000 customers in total), we can estimate how much profit we would make if we applied our propensity model to the entire bank database. In this case, we would use the scores to decide who to call within the remaining 97% of the customer base. The overall "size of the prize" of a mass telemarketing campaign would be around $800,000 of profit if we were to call one-third of the bank's customers. Considering that it might not be viable to make so many calls, we might stop earlier in the list: in any case, we will make some considerable profit by following the list that our random forest can now generate. The simulation that we just did can be used as a tool for planning the marketing spend and sizing the right level of investment. The product manager and your boss are pleased with the great work you pulled together. You definitely proved that spotting (and following) the ML way can bring sizeable value to the business: in this case, you completely reversed the potential outcome of a marketing campaign. The heavy losses in the pilot can now be transformed into meaningful value, thanks to data, algorithms, and, most importantly, your expertise in leveraging them. It was a terrific result, and it took only 12 KNIME nodes (Figure 5.39) to put all of this together!
Figure 5.39: Full workflow for the bank telemarketing optimization
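The proportion behind the $800,000 estimate can be reproduced with a few lines of arithmetic. The sketch assumes the round figures quoted in the text (about 300,000 customers, 10,000 of them already called in the pilot, a 3,000-row test set, and a peak profit of around $8,200), so the result is an order of magnitude rather than a forecast.

```python
TOTAL_CUSTOMERS = 300_000   # approximate size of the bank's customer base
PILOT_SIZE = 10_000         # customers already contacted in the pilot
TEST_SET_SIZE = 3_000       # rows used in the simulation
PEAK_TEST_PROFIT = 8_200    # approximate peak of Sum(Profit) on the test set

remaining_customers = TOTAL_CUSTOMERS - PILOT_SIZE        # about 97% of the base
scale = remaining_customers / TEST_SET_SIZE               # how many "test sets" are left
projected_profit = PEAK_TEST_PROFIT * scale
calls_to_make = remaining_customers / 3                   # roughly one-third of the list

print(f"Customers still to score: {remaining_customers:,}")
print(f"Projected profit: ${projected_profit:,.0f}")
print(f"Calls needed: about {calls_to_make:,.0f}")
```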