Just over 25 years ago, in 1994, a question was asked in an episode of The Today Show – "What is the internet, anyway?" It's hard to imagine that only a couple of decades ago, the general population had difficulty defining what the internet is and how it works. Little did they know that, only a quarter of a century later, we would have intelligent systems that manage themselves, available to the masses.
The concept of machine learning was introduced much earlier, in 1949, by Donald Hebb. He presented theories on neuron excitation and communication between neurons (A Brief History of Machine Learning – DATAVERSITY, Foote, K.; March 26, 2019). He was the first to introduce the concept of artificial neurons, their activation, and the relationships between them expressed through weights.
In the 1950s, Arthur Samuel developed a computer program for playing checkers. Memory was quite limited at that time, so he designed a scoring function that attempted to measure each player's probability of winning based on the positions on the board. The program chose its next move using a minimax strategy, which eventually evolved into the minimax algorithm (A Brief History of Machine Learning – DATAVERSITY, Foote, K.; March 26, 2019). Samuel was also the first to coin the term machine learning.
Frank Rosenblatt combined Hebb's model of artificial brain cells with Arthur Samuel's work to create the perceptron. In 1957, the perceptron was conceived as a machine rather than a program, which led to the construction of the Mark I Perceptron, a machine designed for image classification.
The idea seemed promising, to say the least, but the machine couldn't recognize useful visual patterns, which caused a stall in further research – a period now known as the first AI winter. There wasn't much going on with perceptrons and neural network models until the 1990s.
The preceding couple of paragraphs tell us more than enough about the state of machine learning and deep learning at the end of the 20th century. Small groups of individuals were making tremendous progress with neural networks, while the general population had difficulty understanding even what the internet was.
To make machine learning useful in the real world, scientists and researchers required two things:
- Large amounts of data
- More capable hardware
The first was rapidly becoming more available due to the rise of the internet. The second was slowly moving into a phase of exponential growth – both in CPU performance and storage capacity.
Still, the state of machine learning in the late 1990s and early 2000s was nowhere near where it is today. Today's hardware has led to a significant increase in the use of machine-learning-powered systems in production applications. It is difficult to imagine a world where Netflix doesn't recommend movies, or Google doesn't automatically filter spam from regular email.
But, what is machine learning, anyway?
What is machine learning?
There are a lot of definitions of machine learning out there, some more and some less formal. Here are a couple worth mentioning:
- Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed (What is Machine Learning? A Definition – Expert System, Expert System Team; May 6, 2020).
- Machine learning is the concept that a computer program can learn and adapt to new data without human intervention (Machine Learning – Investopedia, Frankenfield, J.; August 31, 2020).
- Machine learning is a field of computer science that aims to teach computers how to learn and act without being explicitly programmed (Machine Learning – DeepAI, Deep AI Team; May 17, 2020).
Even though these definitions are expressed differently, they convey the same information. Machine learning aims to develop a system or an algorithm capable of learning from data without human intervention.
The goal of a data scientist isn't to instruct the algorithm on how to learn, but rather to provide an adequately sized and prepared dataset to the algorithm and briefly specify the relationships between the dataset variables. For example, suppose the goal is to produce a model capable of predicting housing prices. In that case, the dataset should contain many historical observations – prices alongside descriptive variables such as location, size, number of rooms, age, and whether the property has a balcony or a garage.
It's up to the machine learning algorithm to decide which features are important and which aren't, ergo, which features have significant predictive power. The example in the previous paragraph described a regression problem solved with supervised machine learning methods. We'll soon dive into both concepts, so don't worry if you don't quite understand them yet.
Further, we might want to build a model that can predict, with a decent amount of confidence, whether a customer is likely to churn (break the contract). Useful features would be the list of services the client is using, how long they have been using the service, whether the previous payments were made on time, and so on. This is another example of a supervised machine learning problem, but the target variable (churn) is categorical (yes or no) and not continuous, as was the case in the previous example. We call these types of problems classification machine learning problems.
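To make the churn example more concrete, here's a minimal, hypothetical sketch of what such a labeled dataset could look like – every column name and value below is made up purely for illustration:

import pandas as pd

# Hypothetical churn dataset - column names and values are invented for illustration
customers = pd.DataFrame({
    'NumServices': [1, 3, 2, 5, 1],            # how many services the client uses
    'TenureMonths': [3, 48, 12, 60, 6],        # how long they've been a customer
    'LatePayments': [2, 0, 1, 0, 3],           # count of late payments
    'Churn': ['yes', 'no', 'no', 'no', 'yes']  # categorical target variable
})
print(customers)

A classification algorithm would learn which combinations of these features tend to precede a "yes" in the Churn column.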
Machine learning isn't limited to regression and classification. It also extends to other types of tasks, such as clustering and dimensionality reduction. These fall into the category of unsupervised machine learning techniques. These topics won't be discussed in this chapter.
But first, let's answer a question on the usability of machine learning models, and discuss who uses these models and in which circumstances.
In which sectors are companies using machine learning?
In a single word – everywhere. But you'll have to continue reading to get a complete picture. Machine learning has been adopted in almost every industry in the last decade or two. The main reason is the advancements in hardware. Also, machine learning has become easier for the broader masses to use and understand.
It would be impossible to list every industry in which machine learning is used and to discuss further the specific problems it solves. The easier task would be to list the industries that can't benefit from machine learning, as there are far fewer of those.
We'll focus only on the better-known industries in this section.
Here's a list and explanation of the ten most common use cases of machine learning, both from the industry standpoint and as a general overview:
- The finance industry: Machine learning is gaining more and more popularity in the financial sector. Banks and financial institutions can use it to make smarter decisions. With machine learning, banks can detect clients who most likely won't repay their loans. Further, banks can use machine learning methods to track and understand the spending patterns of their customers. This can lead to the creation of more personalized services to the satisfaction of both parties. Machine learning can also be used to detect anomalies and fraud through unexpected behaviors on some client accounts.
- Medical industry: The recent advancements in medicine are at least partly due to advancements in machine learning. Various predictive methods can be used to detect diseases in the early stages, based on which medical experts can construct personalized therapy and recovery plans. Computer vision techniques such as image classification and object detection can be used, for example, to perform classification on lung images. These can also be used to detect the presence of a tumor based on a single image or a sequence of images.
- Image recognition: This is probably the most widely used application of machine learning because it can be applied in any industry. You can go from a simple cat-versus-dog image classification to classifying the skin conditions of endangered animals in Africa. Image recognition can also be used to detect whether an object of interest is present in the image. For example, the automatic detection of Waldo in the Where's Waldo? game has roughly the same logic as an algorithm in autonomous vehicles that detects pedestrians.
- Speech recognition: Yet another exciting and promising field. The general idea is that an algorithm can automatically recognize the spoken words in an audio clip and then convert them to a text file. Some of the better-known applications are appliance control (controlling the air conditioner with voice commands), voice dialing (automated recognition of a contact to call just from your voice), and internet search (browsing the web with your voice). These are only a couple of examples that immediately pop into mind. Automatic speech recognition software is challenging to develop. Not all languages are supported, and many non-native speakers have accents when speaking in a foreign language, which the ML algorithm may struggle to recognize.
- Natural Language Processing (NLP): Companies in the private sector can benefit tremendously from NLP. For example, a company can use NLP to analyze the sentiments of online reviews left by their customers if there are too many to classify manually. Further, companies can create chatbots on web pages that immediately start conversations with users, which then leads to more potential sales. For a more advanced example, NLP can be used to write summaries of long documents and even segment and analyze protein sequences.
- Recommender systems: As of late 2020, it's difficult to imagine a world where Google doesn't tailor the search results based on your past behaviors, Amazon doesn't automatically recommend similar products, Netflix doesn't recommend movies and TV shows based on the past watches, and Spotify doesn't recommend music that somehow flew under your radar. These are only a couple of examples, but it's not difficult to recognize the importance of recommender systems.
- Spam detection: Just like it's hard to imagine a world where the search results aren't tailored to your liking, it's also hard to imagine an email service that doesn't automatically filter out messages about that now-or-never discount on a vacuum cleaner. We are bombarded with information every day, and automatic spam detection algorithms can help us focus on what's important.
- Automated trading: Even the stock market is moving too fast to fully capture what's happening without automated means. Developing trading bots isn't easy, but machine learning can help you pick the best times to buy or sell, based on tons of historical data. If fully automated, you can watch how your money creates money while sipping margaritas on the beach. It might sound like a stretch to some of you, but with robust models and a ton of domain knowledge, I can't see why not.
- Anomaly detection: Let's dial back to our banking industry example. Banks can use anomaly detection algorithms for various use cases, such as flagging suspicious transactions and activities. Lately, I've been using anomaly detection algorithms to detect suspicious behavior in network traffic with the goal of automatic detection of cyberattacks and malware. It is another technique applicable to any industry if the data is formatted in the right way.
- Social networks: How many times has Facebook recommended people you may know? Or has YouTube recommended a video on a topic you were just thinking about? No, they are not reading your mind – they are aware of your past behaviors and decisions and can predict your next move with a decent amount of confidence.
These are just a few examples of what machine learning can do – not an exhaustive list by any means. You are now familiar with a brief history of machine learning and know how it can be applied to a wide array of tasks.
The next section will provide a brief refresher on supervised machine learning techniques, such as regression and classification.
Supervised learning
The majority of practical machine learning problems are solved through supervised learning algorithms. Supervised learning refers to a situation where you have an input variable (a predictor), typically denoted with X, and an output variable (what you are trying to predict), typically denoted with y.
There's a reason why features (X) are denoted with a capital letter and the target variable (y) isn't. In math terms, X denotes a matrix of features, and matrices are typically denoted with capital letters. On the other hand, y is a vector, and lowercase letters are typically used to denote vectors.
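As a quick illustration of this convention, here's a minimal sketch – the feature names and values are made up for this example:

import pandas as pd

# X is a matrix of features (capital letter), y is a target vector (lowercase)
X = pd.DataFrame({
    'Size': [50, 80, 120],       # hypothetical features
    'Rooms': [2, 3, 4]
})
y = pd.Series([100, 150, 220])   # hypothetical target (price)

print(X.shape)  # (3, 2) - a matrix: rows are observations, columns are features
print(y.shape)  # (3,)   - a vector: one target value per observation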
The goal of a supervised machine learning algorithm is to learn a function that can transform inputs into outputs. The most general mathematical representation of a supervised learning algorithm is given by the following formula:
y = f(X)
Figure 1.1 – General supervised learning formula
We must apply one of two corrections to make this formula strictly correct. The first is to replace y with ŷ (y-hat), as y generally denotes the true value, while ŷ denotes the prediction. The second is to add an error term, as only then can the true value of y stand on the left-hand side. The error term represents the irreducible error – the type of error that can't be reduced by further training.
Here's how the first corrected formula looks:
ŷ = f(X)
Figure 1.2 – Corrected supervised learning formula (v1)
And here's the second one:
y = f(X) + ε
Figure 1.3 – Corrected supervised learning formula (v2)
It's more common to see the second one, but don't be confused by any of the formats – these formulas generally represent the same thing.
Supervised machine learning is called "supervised" because we have labeled data at our disposal. You might have already picked up on this from the feature and target discussion. It means that we already have the correct answers, ergo, we know which combinations of X yield the corresponding values of y.
The end goal is to make the best generalization from the data available. We want to produce the most unbiased model capable of generalizing to new, unseen data. The concepts of overfitting, underfitting, and the bias-variance trade-off are important for producing such a model, but they are beyond the scope of this book.
As we've already mentioned, supervised learning problems are grouped into two main categories:
- Regression: The target variable is continuous in nature, such as the price of a house in USD, the temperature in degrees Fahrenheit, weight in pounds, height in inches, and so on.
- Classification: The target variable is a category – either binary (true/false, positive/negative, disease/no disease), or multi-class (no symptoms/mild symptoms/severe symptoms, school grades, and so on).
Both regression and classification are explored in the following sections.
Regression
As briefly discussed in the previous sections, regression refers to a phenomenon where the target variable is continuous. The target variable could represent a price, a weight, or a height, to name a few.
The most common type of regression is linear regression, a model that assumes a linear relationship between the variables. Linear regression is further divided into simple linear regression (only one feature) and multiple linear regression (multiple features) – see the short sketch after the following note.
Important note
Keep in mind that linear regression isn't the only type of regression. You can perform regression tasks with algorithms such as decision trees, random forests, support vector machines, gradient boosting, and artificial neural networks, but the same concepts still apply.
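To illustrate multiple linear regression, here's a minimal sketch with two features – the column names and values below are arbitrary and exist only to show the idea of using more than one predictor:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Two hypothetical features instead of one - this makes it multiple linear regression
data = pd.DataFrame({
    'LivingArea': [300, 500, 700, 900],
    'Age': [30, 10, 5, 1],
    'Price': [100, 180, 250, 330]
})

model = LinearRegression()
model.fit(data[['LivingArea', 'Age']], data['Price'])

# One learned coefficient per feature, plus an intercept
print(model.coef_, model.intercept_)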
To make a quick recap of the regression concept, we'll declare a simple pandas.DataFrame object consisting of two columns – LivingArea and Price. The goal is to predict the price based only on the living area. We are using a simple linear regression model here just because it makes the data visualization process simpler, which, in turn, makes the regression concept easier to understand:
- The following is the dataset – both columns contain arbitrary and made-up values:
import pandas as pd
df = pd.DataFrame({
'LivingArea': [300, 356, 501, 407, 950, 782,
664, 456, 673, 821, 1024, 900,
512, 551, 510, 625, 718, 850],
'Price': [100, 120, 180, 152, 320, 260,
210, 150, 245, 300, 390, 305,
175, 185, 160, 224, 280, 299]
})
- To visualize these data points, we will use the matplotlib library. By default, the charts don't look very appealing, so a couple of tweaks are made with the matplotlib.rcParams package:
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 8
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
- The preceding options make the charts larger by default and remove the borders (spines) on the top and right. The following code snippet visualizes our dataset as a two-dimensional scatter plot:
plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200)
plt.title('Living area vs. Price (000 USD)', size=20)
plt.xlabel('Living area', size=14)
plt.ylabel('Price (000 USD)', size=14)
plt.show()
The preceding code produces the following graph:
Figure 1.4 – Regression – Scatter plot of living area and price (000 USD)
- Training a linear regression model is most easily achieved with the scikit-learn library. The library contains tons of different algorithms and techniques we can apply to our data. The sklearn.linear_model module contains the LinearRegression class. We'll use it to train the model on the entire dataset, and then to make predictions on the same dataset. That's not something you would usually do in a production environment, but it is essential here to get a further understanding of how the model works:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df[['LivingArea']], df[['Price']])
preds = model.predict(df[['LivingArea']])
df['Predicted'] = preds
- We've assigned the predictions to a new dataset column, just to make data visualization simpler. Once again, we can create a chart containing the entire dataset as a scatter plot. This time, we will add a line that represents the line of best fit – the line for which the error is smallest:
plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200, label='Data points')
plt.plot(df['LivingArea'], df['Predicted'], color='#040404', label='Best fit line')
plt.title('Living area vs. Price (000 USD)', size=20)
plt.xlabel('Living area', size=14)
plt.ylabel('Price (000 USD)', size=14)
plt.legend()
plt.show()
The preceding code produces the following graph:
Figure 1.5 – Regression – Scatter plot of living area and price (000 USD) with the line of best fit
- As we can see, the simple linear regression model captures our dataset almost perfectly. This is not a surprise, as the dataset was created for this purpose. New predictions would be made along the line of best fit. For example, if we were interested in predicting the price of a house with a living area of 1,000 square meters, the model would predict a price just a bit north of $350K. The implementation in code is simple:
model.predict([[1000]])
>>> array([[356.18038708]])
- Further, if you were interested in evaluating this simple linear regression model, metrics like R2 and RMSE are a good choice. R2 measures the goodness of fit, ergo it tells us how much variance our model captures (ranging from 0 to 1). It is more formally referred to as the coefficient of determination. RMSE measures how wrong the model is on average, in the unit of interest. For example, an RMSE value of 10 would mean that on average our model is wrong by $10K, in either the positive or negative direction.
Both the R2 score and the RMSE are calculated as follows:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
rmse = lambda y, ypred: np.sqrt(mean_squared_error(y, ypred))
model_r2 = r2_score(df['Price'], df['Predicted'])
model_rmse = rmse(df['Price'], df['Predicted'])
print(f'R2 score: {model_r2:.2f}')
print(f'RMSE: {model_rmse:.2f}')
>>> R2 score: 0.97
>>> RMSE: 13.88
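If you want to demystify these two metrics, both can also be computed by hand with numpy – a minimal sketch, assuming df still holds the Price and Predicted columns from the previous steps:

import numpy as np

y = df['Price'].to_numpy()
yhat = df['Predicted'].to_numpy()

# RMSE - square root of the average squared error
rmse_manual = np.sqrt(np.mean((y - yhat) ** 2))

# R2 - 1 minus the ratio of residual variance to total variance
r2_manual = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f'R2 score: {r2_manual:.2f}')  # matches the r2_score output above
print(f'RMSE: {rmse_manual:.2f}')    # matches the sklearn-based value above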
To conclude, we've built a simple but accurate model. Don't expect data in the real world to behave this nicely, and also don't expect to build such accurate models most of the time. The process of model selection and tuning is tedious and prone to human error, and that's where automation libraries such as TPOT come into play.
We'll cover a classification refresher in the next section, again with a fairly simple example.
Classification
Classification in machine learning refers to a type of problem where the target variable is categorical. We could turn the example from the Regression section into a classification problem by converting the target variable into categories, such as Sold/Did not sell.
In a nutshell, classification algorithms help us in various scenarios, such as predicting customer attrition, whether a tumor is malignant or not, whether someone has a given disease or not, and so on. You get the point.
Classification tasks can be further divided into binary classification tasks and multi-class classification tasks. We'll explore binary classification tasks briefly in this section. The most basic classification algorithm is logistic regression, and we'll use it in this section to build a simple classifier.
Note
Keep in mind that you are not limited only to logistic regression for performing classification tasks. On the contrary – it's good practice to use a logistic regression model as a baseline, and to use more sophisticated algorithms in production. More sophisticated algorithms include decision trees, random forests, gradient boosting, and artificial neural networks.
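To show how little the code changes when swapping algorithms, here's a minimal sketch that trains one of those more sophisticated alternatives, a decision tree, on a tiny made-up dataset – the values are arbitrary:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset - note the same fit/predict API as logistic regression
X = pd.DataFrame({'Radius': [0.2, 0.5, 1.6, 2.0]})
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier()
tree.fit(X, y)
print(tree.predict(X))  # [0 0 1 1]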
The data is completely made up and arbitrary in this example:
- We have two columns – the first indicates a measurement of some sort (called Radius), and the second denotes the classification (either 0 or 1). The dataset is constructed with the following Python code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Radius': [0.3, 0.1, 1.7, 0.4, 1.9, 2.1, 0.25,
0.4, 2.0, 1.5, 0.6, 0.5, 1.8, 0.25],
'Class': [0, 0, 1, 0, 1, 1, 0,
0, 1, 1, 0, 0, 1, 0]
})
- We'll use the matplotlib library once again for visualization purposes. Here's how to import it and make it a bit more visually appealing:
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 8
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
- We can reuse the same logic from the previous regression example to make a visualization. This time, however, we won't see data that closely resembles a line. Instead, we'll see the data points separated into two groups. On the lower left are the data points where the Class attribute is 0, and on the upper right are those where it's 1:
plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200)
plt.title('Radius classification', size=20)
plt.xlabel('Radius (cm)', size=14)
plt.ylabel('Class', size=14)
plt.show()
The following graph is the output of the preceding code:
Figure 1.6 – Classification – Scatter plot between measurements and classes
The goal of a classification model isn't to produce a line of best fit, but instead to draw out the best possible separation between the classes.
- The logistic regression model is available in the sklearn.linear_model package. We'll use it to train the model on the entire dataset, and then to make predictions on the same dataset. Again, that's not something we will keep doing later on in the book, but it is essential to get insights into the inner workings of the model at this point:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df[['Radius']], df['Class'])
preds = model.predict(df[['Radius']])
df['Predicted'] = preds
- We can now use this model to make predictions on an arbitrary number of X values, ranging from the smallest to the largest in the entire dataset. The range of evenly spaced numbers is obtained through the np.linspace method. It takes three arguments – start, stop, and the number of elements. We'll set the number of elements to 1000.
- Then, we can draw a line that indicates the predicted class for every generated value of X. By doing so, we can visualize the decision boundary of the model:
xs = np.linspace(0, df['Radius'].max() + 0.1, 1000)
ys = model.predict(xs.reshape(-1, 1))
plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200, label='Data points')
plt.plot(xs, ys, color='#040404', label='Decision boundary')
plt.title('Radius classification', size=20)
plt.xlabel('Radius (cm)', size=14)
plt.ylabel('Class', size=14)
plt.legend()
plt.show()
The preceding code produces the following visualization:
Figure 1.7 – Classification – Scatter plot between measurements and classes and the decision boundary
Our classification model is basically a step function, which is understandable for this simple problem. Nothing more complex is needed to correctly classify every instance in our dataset. This won't always be the case, but more on that later.
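The step shape comes from plotting hard class predictions. If you'd rather see the smooth S-shaped curve that logistic regression actually fits, you can plot the predicted probability of the positive class instead – a minimal sketch, assuming model, xs, and df from the previous steps:

# Probability of class 1 for every generated value of X
probs = model.predict_proba(xs.reshape(-1, 1))[:, 1]

plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200, label='Data points')
plt.plot(xs, probs, color='#040404', label='P(Class = 1)')
plt.xlabel('Radius (cm)', size=14)
plt.ylabel('Probability', size=14)
plt.legend()
plt.show()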
- A confusion matrix is one of the best methods for evaluating classification models. Our negative class is 0, and the positive class is 1. The confusion matrix is just a square matrix that shows the following:
  - True negatives (top left) – instances of class 0 correctly classified as 0
  - False positives (top right) – instances of class 0 incorrectly classified as 1
  - False negatives (bottom left) – instances of class 1 incorrectly classified as 0
  - True positives (bottom right) – instances of class 1 correctly classified as 1
- The confusion matrix is available in the sklearn.metrics package. Here's how to import it and obtain the results:
from sklearn.metrics import confusion_matrix
confusion_matrix(df['Class'], df['Predicted'])
Here are the results:
Figure 1.8 – Classification – Evaluation with a confusion matrix
The previous figure shows that our model classified every instance correctly. As a rule of thumb, if the off-diagonal elements stretching from the bottom left to the top right are zeros – that is, there are no false negatives and no false positives – the model is 100% accurate.
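As a quick sanity check, overall accuracy can be derived directly from the confusion matrix, since the correct classifications sit on the main diagonal – a minimal sketch, assuming df still holds the Class and Predicted columns:

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(df['Class'], df['Predicted'])

# Correct classifications on the main diagonal, divided by the total count
accuracy = np.trace(cm) / cm.sum()
print(f'Accuracy: {accuracy:.2f}')  # 1.00 for our perfectly separated data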
The confusion matrix interpretation concludes our brief refresher on supervised machine learning methods. Next, we will dive into the idea of automation, and discuss why we need it in machine learning.