Machine Learning Automation with TPOT

Chapter 1: Machine Learning and the Idea of Automation

In this chapter, we'll make a quick revision of the essential machine learning topics. Topics such as supervised machine learning are covered, alongside the basic concepts of regression and classification.

We will understand why machine learning is essential for success in the 21st century from various perspectives: those of students, professionals, and business users, and we will discuss the different types of problems machine learning can solve.

Further, we will introduce the concept of automation and understand how it applies to machine learning tasks. We will go over automation options in the Python ecosystem and compare their pros and cons. We will briefly introduce the TPOT library, and discuss its role in the modern-day automation of machine learning.

This chapter will cover the following topics:

Reviewing the history of machine learning
Reviewing automation
Applying automation to machine learning
Automation options for Python

Reviewing the history of machine learning

Just over 25 years ago (1994), a question was asked in an episode of The Today Show – "What is the internet, anyway?" It's hard to imagine that a couple of decades ago, the general population had difficulty defining what the internet is and how it works. Little did they know that we would have intelligent systems managing themselves only a quarter of a century later, available to the masses.

The concept of machine learning was introduced much earlier in 1949 by Donald Hebb. He presented theories on neuron excitement and communication between neurons (A Brief History of Machine Learning – DATAVERSITY, Foote, K., March 26, 2019). He was the first to introduce the concept of artificial neurons, their activation, and their relationships through weights.

In the 1950s, Arthur Samuel developed a computer program for playing checkers. The memory was quite limited at that time, so he designed a scoring function that attempted to measure every player's probability of winning based on the positions on the board. The program chose its next move using a MinMax strategy, which eventually evolved into the MinMax algorithm (A Brief History of Machine Learning – DATAVERSITY, Foote, K., March 26; 2019). Samuel was also the first one to come up with the term machine learning.

Frank Rosenblatt decided to combine Hebb's artificial brain cell model with the work of Arthur Samuel to create a perceptron. In 1957, a perceptron was planned as a machine, which led to building a Mark 1 perceptron machine, designed for image classification.

The idea seemed promising, to say at least, but the machine couldn't recognize useful visual patterns, which caused a stall in further research – this period is known as the first AI winter. There wasn't much going on with the perceptron and neural network models until the 1990s.

The preceding couple of paragraphs tell us more than enough about the state of machine learning and deep learning at the end of the 20th century. Groups of individuals were making tremendous progress with neural networks, while the general population had difficulty understanding even what the internet is.

To make machine learning useful in the real world, scientists and researchers required two things:

Data
Computing power

The first was rapidly becoming more available due to the rise of the internet. The second was slowly moving into a phase of exponential growth – both in CPU performance and storage capacity.

Still, the state of machine learning in the late 1990s and early 2000s was nowhere near where it is today. Today's hardware has led to a significant increase in the use of machine-learning-powered systems in production applications. It is difficult to imagine a world where Netflix doesn't recommend movies, or Google doesn't automatically filter spam from regular email.

But, what is machine learning, anyway?

What is machine learning?

There are a lot of definitions of machine learning out there, some more and some less formal. Here are a couple worth mentioning:

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed (What is Machine Learning? A Definition – Expert System, Expert System Team; May 6, 2020).
Machine learning is the concept that a computer program can learn and adapt to new data without human intervention (Machine Learning – Investopedia, Frankenfield, J.; August 31, 2020).
Machine learning is a field of computer science that aims to teach computers how to learn and act without being explicitly programmed (Machine Learning – DeepAI, Deep AI Team; May 17, 2020).

Even though these definitions are expressed differently, they convey the same information. Machine learning aims to develop a system or an algorithm capable of learning from data without human intervention.

The goal of a data scientist isn't to instruct the algorithm on how to learn, but rather to provide an adequately sized and prepared dataset to the algorithm and briefly specify the relationships between the dataset variables. For example, suppose the goal is to produce a model capable of predicting housing prices. In that case, the dataset should provide observations on a lot of historical prices, measured through variables such as location, size, number of rooms, age, whether it has a balcony or a garage, and so on.

It's up to the machine learning algorithm to decide which features are important and which aren't, ergo, which features have significant predictive power. The example in the previous paragraph explained the idea of a regression problem solved with supervised machine learning methods. We'll soon dive into both concepts, so don't worry if you don't quite understand it.

Further, we might want to build a model that can predict, with a decent amount of confidence, whether a customer is likely to churn (break the contract). Useful features would be the list of services the client is using, how long they have been using the service, whether the previous payments were made on time, and so on. This is another example of a supervised machine learning problem, but the target variable (churn) is categorical (yes or no) and not continuous, as was the case in the previous example. We call these types of problems classification machine learning problems.

Machine learning isn't limited to regression and classification. It is applied to many other areas, such as clustering and dimensionality reduction. These fall into the category of unsupervised machine learning techniques. These topics won't be discussed in this chapter.

But first, let's answer a question on the usability of machine learning models, and discuss who uses these models and in which circumstances.

In which sectors are the companies using machine learning?

In a single word – everywhere. But you'll have to continue reading to get a complete picture. Machine learning has been adopted in almost every industry in the last decade or two. The main reason is the advancements in hardware. Also, machine learning has become easier for the broader masses to use and understand.

It would be impossible to list every industry in which machine learning is used and to discuss further the specific problems it solves. The easier task would be to list the industries that can't benefit from machine learning, as there are far fewer of those.

We'll focus only on the better-known industries in this section.

Here's a list and explanation of the ten most common use cases of machine learning, both from the industry standpoint and as a general overview:

The finance industry: Machine learning is gaining more and more popularity in the financial sector. Banks and financial institutions can use it to make smarter decisions. With machine learning, banks can detect clients who most likely won't repay their loans. Further, banks can use machine learning methods to track and understand the spending patterns of their customers. This can lead to the creation of more personalized services to the satisfaction of both parties. Machine learning can also be used to detect anomalies and fraud through unexpected behaviors on some client accounts.
Medical industry: The recent advancements in medicine are at least partly due to advancements in machine learning. Various predictive methods can be used to detect diseases in the early stages, based on which medical experts can construct personalized therapy and recovery plans. Computer vision techniques such as image classification and object detection can be used, for example, to perform classification on lung images. These can also be used to detect the presence of a tumor based on a single image or a sequence of images.
Image recognition: This is probably the most widely used application of machine learning because it can be applied in any industry. You can go from a simple cat-versus-dog image classification to classifying the skin conditions of endangered animals in Africa. Image recognition can also be used to detect whether an object of interest is present in the image. For example, the automatic detection of Waldo in the Where's Waldo? game has roughly the same logic as an algorithm in autonomous vehicles that detects pedestrians.
Speech recognition: Yet another exciting and promising field. The general idea is that an algorithm can automatically recognize the spoken words in an audio clip and then convert them to a text file. Some of the better-known applications are appliance control (controlling the air conditioner with voice commands), voice dialing (automated recognition of a contact to call just from your voice), and internet search (browsing the web with your voice). These are only a couple of examples that immediately pop into mind. Automatic speech recognition software is challenging to develop. Not all languages are supported, and many non-native speakers have accents when speaking in a foreign language, which the ML algorithm may struggle to recognize.
Natural Language Processing (NLP): Companies in the private sector can benefit tremendously from NLP. For example, a company can use NLP to analyze the sentiments of online reviews left by their customers if there are too many to classify manually. Further, companies can create chatbots on web pages that immediately start conversations with users, which then leads to more potential sales. For a more advanced example, NLP can be used to write summaries of long documents and even segment and analyze protein sequences.
Recommender systems: As of late 2020, it's difficult to imagine a world where Google doesn't tailor the search results based on your past behaviors, Amazon doesn't automatically recommend similar products, Netflix doesn't recommend movies and TV shows based on the past watches, and Spotify doesn't recommend music that somehow flew under your radar. These are only a couple of examples, but it's not difficult to recognize the importance of recommender systems.
Spam detection: Just like it's hard to imagine a world where the search results aren't tailored to your liking, it's also hard to imagine an email service that doesn't automatically filter out messages about that now-or-never discount on a vacuum cleaner. We are bombarded with information every day, and automatic spam detection algorithms can help us focus on what's important.
Automated trading: Even the stock market is moving too fast to fully capture what's happening without automated means. Developing trading bots isn't easy, but machine learning can help you pick the best times to buy or sell, based on tons of historical data. If fully automated, you can watch how your money creates money while sipping margaritas on the beach. It might sound like a stretch to some of you, but with robust models and a ton of domain knowledge, I can't see why not.
Anomaly detection: Let's dial back to our banking industry example. Banks can use anomaly detection algorithms for various use cases, such as flagging suspicious transactions and activities. Lately, I've been using anomaly detection algorithms to detect suspicious behavior in network traffic with the goal of automatic detection of cyberattacks and malware. It is another technique applicable to any industry if the data is formatted in the right way.
Social networks: How many times has Facebook recommended you people you may know? Or YouTube recommended the video on the topic you were just thinking about? No, they are not reading your mind, but they are aware of your past behaviors and decisions and can predict your next move with a decent amount of confidence.

These are just a couple of examples of what machine learning can do – not an exhaustive list by any means. You are now familiar with a brief history of machine learning and know how machine learning can be applied to a wide array of tasks.

The next section will provide a brief refresher on supervised machine learning techniques, such as regression and classification.

Supervised learning

The majority of practical machine learning problems are solved through supervised learning algorithms. Supervised learning refers to a situation where you have an input variable (a predictor), typically denoted with X, and an output variable (what you are trying to predict), typically denoted with y.

There's a reason why features (X) are denoted with a capital letter and the target variable (y) isn't. In math terms, X denotes a matrix of features, and matrices are typically denoted with capital letters. On the other hand, y is a vector, and lowercase letters are typically used to denote vectors.

The goal of a supervised machine learning algorithm is to learn the function that can transform any input into the output. The most general math representation of a supervised learning algorithm is represented with the following formula:

Figure 1.1 – General supervised learning formula

We must apply one of two corrections to make this formula acceptable. The first one is to replace y with y-hat, as y generally denotes the true value, and y-hat denotes the prediction. The second correction we could make is to add the error term, as only then can we have the correct value of y on the other side. The error term represents the irreducible error – the type of error that can't be reduced by further training.

Here's how the first corrected formula looks:

Figure 1.2 – Corrected supervised learning formula (v1)

And here's the second one:

Figure 1.3 – Corrected supervised learning formula (v2)

It's more common to see the second one, but don't be confused by any of the formats – these formulas generally represent the same thing.

Supervised machine learning is called "supervised" because we have the labeled data at our disposal. You might have already picked this because of the feature and target discussion. This means that we have the correct answers already, ergo, we know which combinations of X yield the corresponding values of y.

The end goal is to make the best generalization from the data available. We want to produce the most unbiased model capable of generalizing on new, unseen data. The concepts of overfitting, underfitting, and the bias-variance trade-off are important to produce such a model, but they are not in the scope of this book.

As we've already mentioned, supervised learning problems are grouped into two main categories:

Regression: The target variable is continuous in nature, such as the price of a house in USD, the temperature in degrees Fahrenheit, weight in pounds, height in inches, and so on.
Classification: The target variable is a category – either binary (true/false, positive/negative, disease/no disease), or multi-class (no symptoms/mild symptoms/severe symptoms, school grades, and so on).

Both regression and classification are explored in the following sections.

Regression

As briefly discussed in the previous sections, regression refers to a phenomenon where the target variable is continuous. The target variable could represent a price, a weight, or a height, to name a few.

The most common type of regression is linear regression, a model where a linear relationship between variables is assumed. Linear regression further divides into a simple linear regression (only one feature), and multiple linear regression (multiple features).

Important note

Keep in mind that linear regression isn't the only type of regression. You can perform regression tasks with algorithms such as decision trees, random forests, support vector machines, gradient boosting, and artificial neural networks, but the same concepts still apply.

To make a quick recap of the regression concept, we'll declare a simple pandas.DataFrame object consisting of two columns – Living area and Price. The goal is to predict the price based only on the living space. We are using a simple linear regression model here just because it makes the data visualization process simpler, which, as the end result, makes the regression concept easy to understand:

The following is the dataset – both columns contain arbitrary and made-up values:

import pandas as pd 
df = pd.DataFrame({
    'LivingArea': [300, 356, 501, 407, 950, 782, 
                   664, 456, 673, 821, 1024, 900, 
                   512, 551, 510, 625, 718, 850],
    'Price': [100, 120, 180, 152, 320, 260, 
              210, 150, 245, 300, 390, 305, 
              175, 185, 160, 224, 280, 299]
})

To visualize these data points, we will use the matplotlib library. By default, the library doesn't look very appealing, so a couple of tweaks are made with the matplotlib.rcParams package:
```
import matplotlib.pyplot as plt 
from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 8
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
```
The following options make the charts larger by default, and remove the borders (spines) on the top and right. The following code snippet visualizes our dataset as a two-dimensional scatter plot:
```
plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200)
plt.title('Living area vs. Price (000 USD)', size=20)
plt.xlabel('Living area', size=14)
plt.ylabel('Price (000 USD)', size=14)
plt.show()
```
The preceding code produces the following graph:
Figure 1.4 – Regression – Scatter plot of living area and price (000 USD)
Training a linear regression model is most easily achieved with the scikit-learn library. The library contains tons of different algorithms and techniques we can apply on our data. The sklearn-learn.linear_model module contains the LinearRegression class. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. That's not something you would usually do in production environment, but is essential here to get a further understanding of how the model works:
```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df[['LivingArea']], df[['Price']])
preds = model.predict(df[['LivingArea']])
df['Predicted'] = preds
```
We've assigned the prediction as yet another dataset column, just to make data visualization simpler. Once again, we can create a chart containing the entire dataset as a scatter plot. This time, we will add a line that represents the line of best fit – the line where the error is smallest:
```
plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200, label='Data points')
plt.plot(df['LivingArea'], df['Predicted'], color='#040404', label='Best fit line')
plt.title('Living area vs. Price (000 USD)', size=20)
plt.xlabel('Living area', size=14)
plt.ylabel('Price (000 USD)', size=14)
plt.legend()
plt.show()
```
The preceding code produces the following graph:
Figure 1.5 – Regression – Scatter plot of living area and price (000 USD) with the line of best fit
As we can see, the simple linear regression model almost perfectly captures our dataset. This is not a surprise, as the dataset was created for this purpose. New predictions would be made along the line of best fit. For example, if we were interested in predicting the price of house that has a living space of 1,000 square meters, the model would make a prediction just a bit north of $350K. The implementation of this in the code is simple:
```
model.predict([[1000]])
>>> array([[356.18038708]])
```
Further, if you were interested in evaluating this simple linear regression model, metrics like R2 and RMSE are a good choice. R2 measures the goodness of fit, ergo it tells us how much variance our model captures (ranging from 0 to 1). It is more formally referred to as the coefficient of determination. RMSE measures how wrong the model is on average, in the unit of interest. For example, an RMSE value of 10 would mean that on average our model is wrong by $10K, in either the positive or negative direction.
Both the R2 score and the RMSE are calculated as follows:
```
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
rmse = lambda y, ypred: np.sqrt(mean_squared_error(y, ypred))
model_r2 = r2_score(df['Price'], df['Predicted'])
model_rmse = rmse(df['Price'], df['Predicted'])
print(f'R2 score: {model_r2:.2f}')
print(f'RMSE: {model_rmse:.2f}')
>>> R2 score: 0.97
>>> RMSE: 13.88
```

To conclude, we've built a simple but accurate model. Don't expect data in the real world to behave this nicely, and also don't expect to build such accurate models most of the time. The process of model selection and tuning is tedious and prone to human error, and that's where automation libraries such as TPOT come into play.

We'll cover a classification refresher in the next section, again on the fairly simple example.

Classification

Classification in machine learning refers to a type of problem where the target variable is categorical. We could turn the example from the Regression section in the classification problem by converting the target variable into categories, such as Sold/Did not sell.

In a nutshell, classification algorithms help us in various scenarios, such as predicting customer attrition, whether a tumor is malignant or not, whether someone has a given disease or not, and so on. You get the point.

Classification tasks can be further divided into binary classification tasks and multi-class classification tasks. We'll explore binary classification tasks briefly in this section. The most basic classification algorithm is logistic regression, and we'll use it in this section to build a simple classifier.

Note

Keep in mind that you are not limited only to logistic regression for performing classification tasks. On the contrary – it's good practice to use a logistic regression model as a baseline, and to use more sophisticated algorithms in production. More sophisticated algorithms include decision trees, random forests, gradient boosting, and artificial neural networks.

The data is completely made up and arbitrary in this example:

We have two columns – the first indicates a measurement of some sort (called Radius), and the second column denotes the classification (either 0 or 1). The dataset is constructed with the following Python code:

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Radius': [0.3, 0.1, 1.7, 0.4, 1.9, 2.1, 0.25, 
               0.4, 2.0, 1.5, 0.6, 0.5, 1.8, 0.25],
    'Class': [0, 0, 1, 0, 1, 1, 0, 
              0, 1, 1, 0, 0, 1, 0]
})

We'll use the matplotlib library once again for visualization purposes. Here's how to import it and make it a bit more visually appealing:

import matplotlib.pyplot as plt 
from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 8
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

We can reuse the same logic from the previous regression example to make a visualization. This time, however, we won't see data that closely resembles a line. Instead, we'll see data points separated into two groups. On the lower left are the data points where the Class attribute is 0, and on the right where it's 1:
```
plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200)
plt.title('Radius classification', size=20)
plt.xlabel('Radius (cm)', size=14)
plt.ylabel('Class', size=14)
plt.show()
```
The following graph is the output of the preceding code:
Figure 1.6 – Classification – Scatter plot between measurements and classes
The goal of a classification model isn't to produce a line of best fit, but instead to draw out the best possible separation between the classes.
The logistic regression model is available in the sklearn.linear_model package. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. Again, that's not something we will keep doing later on in the book, but is essential to get insights into the inner workings of the model at this point:
```
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df[['Radius']], df['Class'])
preds = model.predict(df[['Radius']])
df['Predicted'] = preds
```
We can now use this model to make predictions on an arbitrary number of X values, ranging from the smallest to the largest in the entire dataset. The range of evenly spaced numbers is obtained through the np.linspace method. It takes three arguments – start, stop, and the number of elements. We'll set the number of elements to 1000.
Then, we can make a line that indicates the probabilities for every value of X generated. By doing so, we can visualize the decision boundary of the model:
```
xs = np.linspace(0, df['Radius'].max() + 0.1, 1000)
ys = [model.predict([[x]]) for x in xs]
plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200, label='Data points')
plt.plot(xs, ys, color='#040404', label='Decision boundary')
plt.title('Radius classification', size=20)
plt.xlabel('Radius (cm)', size=14)
plt.ylabel('Class', size=14)
plt.legend()
plt.show()
```
The preceding code produces the following visualization:
Figure 1.7 – Classification – Scatter plot between measurements and classes and the decision boundary
Our classification model is basically a step function, which is understandable for this simple problem. Nothing more complex is needed to correctly classify every instance in our dataset. This won't always be the case, but more on that later.
A confusion matrix is one of the best methods for evaluating classification models. Our negative class is 0, and the positive class is 1. The confusion matrix is just a square matrix that shows the following:
- True negatives: The upper left number. These are instances that had the class of 0 and were predicted as 0 by the model.
- False negatives: The bottom left number. These are instances that had the class of 0, but were predicted as 1 by the model.
- False positives: The top right number. These are instances that had the class of 1, but were predicted as 0 by the model.
- True positives: The bottom right number. These are instances that had the class of 1 and were predicted as 1 by the model.
  Read the previous list as many times as necessary to completely understand the idea. The confusion matrix is an essential concept in classifier evaluation, and the later chapters in this book assume you know how to interpret it.
The confusion matrix is available in the sklearn.metrics package. Here's how to import it and obtain the results:
```
from sklearn.metrics import confusion_matrix
confusion_matrix(df['Class'], df['Predicted'])
```
Here are the results:

Figure 1.8 – Classification – Evaluation with a confusion matrix

The previous figure shows that our model was able to classify every instance correctly. As a rule of thumb, if the diagonal elements stretching from the bottom left to the top right are zeros, it means the model is 100% accurate.

The confusion matrix interpretation concludes our brief refresher on supervised machine learning methods. Next, we will dive into the idea of automation, and discuss why we need it in machine learning.

Reviewing automation

This section briefly discusses the idea of automation, why we need it, and how it applies to machine learning. We will also answer the age-old question of machine learning replacing humans in their jobs, and the role of automation in that regard.

Automation plays a huge role in the modern world, and in the past centuries it has allowed us to completely remove the human factor from dangerous and repetitive jobs. This has opened a new array of possibilities on the job market, where jobs are generally based on something that cannot be automated, at least at this point in time.

But first, we have to understand what automation is.

What is automation?

There are many syntactically different definitions out there, but they all share the same basic idea. The following one presents the idea in the simplest terms:

Automation is a broad term that can cover many areas of technology where human input is minimized (What is Automation? – IBM, IBM Team; February 28, 2020).

The essential part of the definition is the minimization of the human input. An automated process is entirely or almost entirely managed by a machine. Up to a couple of years back, machines were a great way to automate boring, routine tasks, and leave creative things to people. As you might guess, machines are not that great with creative tasks. That is, they weren't until recently.

Machine learning provides us with a mechanism to not only automate calculations, spreadsheet management, and expenses tracking, but also more cognitive tasks, such as decision making. The field evolves by the day and it's hard to say when exactly we can expect machines to take over some more creative jobs.

The concept of automation in machine learning is discussed later, but it's important to remember that machine learning can take automation to a whole other level. Not every form of automation is equal, and the generally accepted division of automation is into four levels, based on complexity:

Basic automation: Automation of the simplest tasks. Robotic Process Automation (RPA) is the perfect example, as its goal is to use software bots to automate repetitive tasks. The end goal of this automation category is to completely remove the human factor from the equation, resulting in faster execution of repetitive tasks without error.
Process automation: This uses and applies basic automation techniques to an entire business process. The end goal is to completely automate a business activity and leave humans to only give the final approval.
Integration automation: This uses rules defined by humans to mimic human behavior in task completion. The end goal is to minimize human intervention in more complex business tasks.
AI automation: The most complex form of automation. The goal is to have a machine that can learn and make decisions based on previous situations and the decisions made in those situations.

You now know what automation is, and next, we'll discuss why it is a must in the 21st century.

Why is automation needed?

Both companies and customers can benefit from automation. Automation can improve resource allocation and management, and can make the business scaling process easier. Due to automation, companies can provide a more reliable and consistent service, which results in a more consistent user experience. As the end result, customers are more likely to buy and spend more than if the service quality was not consistent.

In the long run, automation simplifies and reduces human activities and reduces costs. Further, any automated process is likely to perform better than the same process performed by humans. Machines don't get tired, don't have a bad day, and don't require a salary.

The following list shows some of the most important reasons for automation:

Time saving: Automation simplifies daily routine tasks by making machines do them instead of humans. As the end result, humans can focus on more creative tasks right from the start.
Reduced cost: Automation should be thought of as a long-term investment. It comes with some start-up costs, sure, but those are covered quickly if automation is implemented correctly.
Accuracy and consistency: As mentioned before, humans are prone to errors, bad days, and inconsistencies. That's not the case with machines.
Workflow enhancements: Due to automation, more time can be spent on important tasks, such as providing individual assistance to customers. Employees tend to be happier and deliver better results if their shift isn't made up solely of repetitive and routine tasks.

The difficult question is not "do you automate?" but rather, "when do you automate?" There are a lot of different opinions on this topic and there isn't a single right or wrong answer. Deciding when to automate depends on the budget you have available and on the opportunity cost (the decisions/investments you would be able to make if time was not an issue).

Automating anything you are good at and focusing on the areas that require improvement is a general rule of thumb for most companies. Even as an individual, there is a high probability that you are doing something on a daily or weekly basis that can be described in plain language. And if something can be described step by step, it can be automated.

But how does the concept of automation apply to machine learning? Are machine learning and automation synonymous? That's what we will discuss next.

Are machine learning and automation the same thing?

Well, no. But machine learning can take automation to a whole different level. Let's refer back to the four levels of automation discussed a few of paragraphs ago. Only the last one uses machine learning, and it is the most advanced form of automation.

Let's consider a single activity in our day as a process. If you know exactly how the process will start and end, and everything that will happen in between and in which order, then this process can be automated without machine learning.

Here's an example. For the last couple of months, you've been monitoring real-estate prices in an area you want to move to. Every morning you make yourself a cup of coffee, sit in front of a laptop, and go to a real estate website. You filter the results to see only the ads that were placed in the last 24 hours, and then enter the data, such as the location, unit price, number of rooms, and so on, into a spreadsheet.

This process takes about an hour of your day, which results in 30 hours per month. That is a lot. In 30 hours, you can easily read a book or take an online course to further develop your skills in some other area. The process described in this paragraph can be automated easily, without the need for machine learning.

Let's take a look at another example. You are spending multiple hours per day on the stock market, deciding what to buy and what to sell. This process is different from the previous one, as it involves some sort of decision making. The thing is, with all of the datasets available online, a skilled individual can use machine learning methods to automate the buy/sell decision-making process.

This is the form of automation that includes machine learning, but no, machine learning and automation are not synonymous. Each can work without the other.

The following sections discuss in great detail the role of automation in machine learning (not vice versa), and answer what we are trying to automate and how it can be achieved in the modern day and age.

Applying automation to machine learning

We've covered the idea of automation and various types of automation thus far, but what's the connection between automation and machine learning? What exactly is it that we are trying to automate in machine learning?

That's what this section aims to demystify. By the end of this section, you will know the difference between the terms automation with machine learning and automating machine learning. These two might sound similar at first, but are very different in reality.

What are we trying to automate?

Let's get one thing straight – automation of machine learning processes has nothing to do with business process automation with machine learning. In the former, we're trying to automate the machine learning itself, ergo automating the process of selecting the best model and the best hyperparameters. The latter refers to automating a business process with the help of machine learning; for example, making a decision system that decides when to buy or sell a stock based on historical data.

It's crucial to remember this distinction. The primary focus of this book is to demonstrate how automation libraries can be used to automate the process of machine learning. By doing so, you will follow the exact same approach, regardless of the dataset, and always end up with the best possible model.

Choosing an appropriate machine learning algorithm isn't an easy task. Just take a look at the following diagram:

Figure 1.9 – Algorithms in scikit-learn (source: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011)

As you can see, multiple decisions are required to select an appropriate algorithm. In addition, every algorithm has its own set of hyperparameters (parameters specified by the engineer). To make things even worse, some of these hyperparameters are continuous in nature, so when you add it all up, there are hundreds of thousands or even millions of hyperparameter combinations that you as an engineer should test for.

Every hyperparameter combination requires training and evaluation of a completely new model. Concepts such as grid search can help you avoid writing tens of nested loops, but it is far from an optimal solution.

Modern machine learning engineers don't spend their time and energy on model training and optimization – but instead on raising the data quality and availability. Hyperparameter tweaking can squeeze that additional 2% increase in accuracy, but it is the data quality that can make or break your project.

We'll dive a bit deeper into hyperparameters next and demonstrate why searching for the optimal ones manually isn't that good an idea.

The problem of too many parameters

Let's take a look at some of hyperparameters available for one of the most popular machine learning algorithms – XGBoost. The following list shows the general ones:

booster
verbosity
validate_parameters
nthread
disable_default_eval_metric
num_pbuffer
num_feature

That's not much, and some of these hyperparameters are set automatically by the algorithm. The problem lies within the further selection. For example, if you choose gbtree as a value for the booster parameter, you can immediately tweak the values for the following:

eta
gamma
max_depth
min_child_weight
max_delta_step
subsample
sampling_method
colsample_bytree
colsample_bylevel
colsample_bynode
lambda
alpha
tree_method
sketch_eps
scale_pos_weight
updater
refresher_leaf
process_type
grow_policy
max_leaves
max_bin
predictor
num_parallel_tree
monotone_constraints
interaction_constraints

And that's a lot! As mentioned before, some hyperparameters take in continuous values, which tremendously increases the total number of combinations. Here's the final icing on the cake – these are only hyperparameters for a single model. Different models have different hyperparameters, which makes the tuning process that much more time consuming.

Put simply, model selection and hyperparameter tuning isn't something you should do manually. There are more important tasks to spend your energy on. Even if there's nothing else you have to do, I'd prefer going for lunch instead of manual tuning any day of the week.

AutoML enables us to do just that, so we'll explore it briefly in the next section.

What is AutoML?

AutoML stands for Automated Machine Learning, and its primary goal is to reduce or completely eliminate the role of data scientists in building machine learning models. Hearing that sentence might be harsh at first. I know what you are thinking. But no – AutoML can't replace data scientists and other data professionals.

In the best-case scenario, AutoML technologies enable other software engineers to utilize the power of machine learning in their application, without the need to have a solid background in ML. This best-case scenario is only possible if the data is adequately gathered and prepared – a task that's not the specialty of a backend developer.

To make things even harder for the non-data scientist, the machine learning process often requires extensive feature engineering. This step can be skipped, but more often than not, this will result in poor models.

In conclusion, AutoML won't replace data scientists, rather just the contrary – it's here to make the life of data scientists easier. AutoML only automates model selection and tuning to the full extent.

There are some AutoML services that advertise themselves as fully automating even the data preparation and feature engineering jobs, but that's just by combining various features together and making something that is not interpretable most of the time. A machine doesn't know the true relationships between variables. That's the job of a data scientist to discover.

Automation options

AutoML isn't that new a concept. The idea and some implementations have been around for years, and are receiving positive feedback overall. Still, some fail to implement and fully utilize AutoML solutions in their organization due to a lack of understanding.

AutoML can't do everything – someone still has to gather the data, store it, and prepare it. This isn't a small task, and more often than not requires a significant amount of domain knowledge. Then and only then can automated solutions be utilized to their full potential.

This section explores a couple of options for implementing AutoML solutions. We'll compare one code-based tool written in Python, and one that is delivered as a browser application, meaning that no coding is required. We'll start with the code-based one first.

PyCaret

PyCaret has been widely used to make production-ready machine learning models with as little code as possible. It is a completely free solution capable of training, visualizing, and interpreting machine learning models with ease.

It has built-in support for regression and classification models and shows in an interactive way which models were used for the task, and which generated the best result. It's up to the data scientist to decide which model will be used for the task. Both training and optimization are as simple as a function call.

The library also provides an option to explain machine learning models with game-theoretic algorithms such as SHAP (Shapely Additive Explanations), also with a single function call.

PyCaret still requires a bit of human interaction. Oftentimes, though, the initialization and training process of a model must be selected explicitly by the user, and that breaks the idea of a fully-automated solution.

Further, PyCaret can be slow to run and optimize for a larger dataset. Let's take a look at a code-free AutoML solution next.

ObviouslyAI

Not all of us know how to develop machine learning models, or even how to write code. That's where drag and drop solutions come into play. ObviouslyAI is certainly one of the best out there, and is also easy to use.

This service allows for in-browser model training and evaluation, and can even explain the reasoning behind decisions made by a model. It's a no-brainer for companies in which machine learning isn't the core business, as it's pretty easy to start with and doesn't cost nearly as much as an entire data science team.

A big gotcha with services like this one is the pricing. There's always a free plan included, but in this particular case it's limited to datasets with fewer than 50,000 rows. That's completely fine for occasional tests here and there, but is a deal-breaker for most production use cases.

The second deal-breaker is the actual automation. You can't easily automate mouse clicks and dataset loads. This service automates the machine learning process itself completely, but you still have to do some manual work.

TPOT

The acronym TPOT stands for Tree-based Pipeline Optimization Tool. It is a Python library designed to handle machine learning tasks in an automated fashion.

Here's a statement from the official documentation:

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming (TPOT Documentation page, TPOT Team; November 5, 2019).

Genetic programming is a term that is further discussed in the later chapters. For now, just know that it is based on evolutionary algorithms – a special type of algorithm used to discover solutions to problems that humans don't know how to solve.

In a way, TPOT is your data science assistant. You can use it to automate everything boring in a data science project. The term "boring" is subjective, but throughout the book, we use it to refer to the tasks of manually selecting and tweaking machine learning models (read: spending days waiting for the model to tune).

TPOT can't automate the process of data gathering and cleaning, and the reason is obvious – a machine can't read your mind. It can, however, perform machine learning tasks on well prepared datasets better than most data scientists.

The following chapters discuss the library in great detail.

Filter reviews by

All

Amazon verified reviews

Anih John Jun 02, 2021

The author did an extensive work in demystifying ML from the basics to the ground up. Additionally, the auto ML tools in the Python ecosystem were discussed extensively as creating an ML model is not the end of the ML pipeline.More importantly, finding the best model suitable for your problem could be quite challenging. To solve this problem, the author provided deep insights about how TPOT (a python library) helps to solve this problem. Moreover, excellent tips were provided for model deployment with helpful code snippets to help with code development. If you are an data scientist or an ML professional or a researcher with a focus in AI looking to advance your skillset in ML and learn more, then I highly recommend this book for you.

Amazon Verified review

Praveen Kumar Venugopal Sep 27, 2021

This book dives deep into one of the most widely used AutoML package - TPOT with such deft and ease that any reader (novice/expert) will be pleasantly surprised. The book also covers every possible use case of the package starting from simple hands on with public datasets leading up to complex production deployment.Highlights:1) The author talks in a simple and relatable language, which makes the reader feel at ease. This is a very crucial aspect of any book aimed at explaining tools/packages , since an good understanding in the first few chapters will prevent confusion in the later ones2) I was genuinely surprised by the efforts put into explaining package installation/setup. The clarity was exceptional from simple python setup leading up to complex production scenario involving parallelization3) A special mention must be made to Chapter 5 - Parallel Training with TPOT and Dask. This chapter covers only the crucial aspects of parallelization and marries it to TPOT pipeline building. It is truly a work of art !!4) Chapter 8 is yet another excellent work. Model deployment is often not covered in many books in the AutoML field due to its perceived lack of applicability. Hence it was quite enlivening to see the author take the challenge head on and give a clear and concise pictureShortcomings:1) It is widely known that there are tens of open-source packages for AutoML and pipeline building. While the book does an excellent job in explaining TPOT, it performs poorly in highlighting its advantages over other packages. For instance, exporting the pipeline is a powerful feature in TPOT, which the author has merely touched upon without going into greater detail.2) In Chapter 6 , while the author tries to cover Neural Networks (Multi Layer Perceptrons) , there are multiple occurrences where the diagram is rather too simplistic and doesn't relate to the text content explaining it3) In the same chapter, the author also makes the mistake of unnecessarily diving into intricate details of forward pass, while completely ignoring back propagation. Strictly speaking neural nets is beyond the scope of this book, but any attempt to do so needs to be balanced, lest risk the reader being misleadThis book is a great source of information not just to learn TPOT, but to learn about AutoML pipelines (creation, tuning, deployment, maintenance) in general. The wide spectrum of use cases ranging from beginners' ride to production deployment covered in this book can be a great learning for readers

A P Aug 31, 2021

The author walks the reader through the process of discovering and optimizing machine learning pipelines, starting with a discussion of the methodologies utilized for identifying and optimizing machine learning pipelines. Afterward, he goes on to discuss regression and classification algorithms, illustrating his points with some good examples and offering step-by-step assistance throughout. In particular, I love how the results and output from TPOT are provided to the reader to know what to expect when they run the program themselves. The author also addresses parallel training, how to use TPOT in combination with Dask, and how to deploy models utilizing AWS Cloud providers, among other topics. It would be more effective, in my opinion, to provide a link to the webpage instead of describing each setting one at a time.

Josh Thompson May 14, 2021

Overview:Machine learning can be difficult because of the many decision which need to be made about data pre-processing, algorithms to use, and parameter settings. Automated machine learning (AutoML) seeks to automate this process thus eliminating much of the guesswork from the user. This book introduces the tree-based pipeline optimization tool or TPOT as an open-source and Python based solution for AutoML. By working through this book, users will become familiar with generating machine learning models automatically using TPOT.What I like:The first chapter of the book provides a nice introduction to machine learning and sets the stage for the need for automation.Chapter 2 takes a deep dive into TPOT providing the reader with the details of the algorithms used for discovering and optimizing machine learning pipelines. I found this overview to be complete and informative. I like the use of screenshots to show the reader what they should be seeing on the screen and they install and run the software.Chapters 3 & 4 take a closer look at regression and classification problems with some good examples and step by step instructions. The format is clear and should allow the reader to follow along with their own implementation of TPOT. I particularly like how the reader is shown the results and output from TPOT so they know what to expect as they run it themselves.Chapters 5-7 cover parallelizing TPOT and using TPOT to do neural network analysis. These are both important and useful chapters with so much focus on deep learning neural networks and the use of parallel computing to speed up machine learning analyses.The final two chapters cover various aspects of deploying a TPOT model for use in practice. It covers important topics such as deploying to the cloud and graphic-user interfaces (GUI). This is a great way to end the book and the examples given are both practical and useful.What I didn’t like:While the examples are very useful for learning how to run TPOT, they do not provide as much value for using TPOT to analyze real-world big data. For example, how do you scale TPOT to data with millions of rows and columns? How do you interpret TPOT models? This might make a great volume 2 for the book.What I would like to see:It would have been nice to see a final chapter with ideas for extending or improving TPOT. Such a chapter would give computer scientists some ideas for projects advance TPOT for more complex problems. What is missing? What still needs to be done? How can the algorithms be improved?

SB Jun 30, 2021

Machine Learning Automation with TPOT – ReviewFirst, I would like to congratulate Mr. Dario Radečić, the book's author, and I am delighted to review this cutting-edge technical book as part of the Artificial Intelligence and Machine Learning series (AIML).Overview:The author kept me on the edge of my seat and reading word for word from the beginning, and the way he organized each chapter was awesome. The author went from basics to advanced level concepts with clear along with excellent examples without getting bored, and a comprehensible way of each section from Idea of Automation, Practical Classification, Regression, and Neural Networks. At one level, things went into top gear and landed on the Deployed TPOT Model in Production, which was simply outstanding. I am sure every reader will take away a substantial amount of understanding and knowledge of the Machine Learning algorithm and the power pack of the TPOT. Perfectly author mentioned that “TPOT is your data science assistant. You can use it to automate everything. boring in a data science project.”What I like:Chapter 2,3 & 4"Deep Dive into TPOT" is a particularly important chapter with respect to the title of the book. Yes! Of course, the author has covered this topic well by giving the in-detail level of installing and configuring the TPOT library for standalone Python and Anaconda. If you follow the steps carefully, you will certainly get TPOT in your environment in a very few minutes. Followed by this, the author takes us to explore various datasets and all the ML processes on a touch basis and the crystal-clear implementation of the Regression and Classification model with a reasonable volume of datasets with appropriate features and detailed step by step coding was excellent and helpful, even those who are new to ML world and knowledgeable in ML space. And TPOT came into play as the final match-winning striker and exporting the optimized pipeline new python file for further usage was impressive. Even I have tried them and enjoyed the output.Chapter 5I would say "Parallel Training with TPOT and Dask" is an extra bounce for the readers and thank you very much for bringing those topics into the book for the benefit of Data Scientists and ML Engineers. The parallelism concept in Python reminds me of the threading concept in Java programming, and the sample set of examples and code was an excellent feed for the readers. Coming into Dask library usage and its advantage over data processing with sample code is notable.Chapter 6 & 7Another milestone in this book is the Crash Course in Neural Networks, in which the author quickly covered the theory of a single neuron, the theory of a single layer, and various activation functions in a rapid way. Using neural networks to classify handwritten digits in a sample program was extraordinary, Anyone can understand the author's approach and way of implementation. And the Neural Network Classifier with TPOT as well.What I didn’t like:With respect to TPOT Model Deployment and Using the Deployed TPOT Model in Production This book's chapters appear to be power packs, but they appear rushed, and if there were more details about different model implementations, I believe something similar to the earlier deep-dive would be helpful for readers. And I believe that the author might have thought of, it would be too much for the readers, resulting in a diluting of the rhythm.What I would like to see:I was expecting the training machine learning models with TPOT and Dask libraries with a reasonable data set, which was discussed in the initial chapters, would help readers to understand how to connect the dots. And one more thing is the goal is to create a file from an automation library, It would be great if we have detailed explanation/coverage of the optimized pipeline of a new python file for the readers; if we understood this, it would be an additional value-added as a take away from this book.Overall … I can give 4.5/5 for this. And all the absolute best for the author.

Machine Learning Automation with TPOT: Build, validate, and deploy fully automated machine learning models with Python

What do you get with a Packt Subscription?

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs