Introduction to Machine Learning Using Python

The last chapter introduced you to the world of machine learning (ML). In this chapter, we will develop the ML foundations required for building and using automated ML (AutoML) platforms. It is not always clear how ML is best applied or what it takes to implement it. However, ML tools are becoming more straightforward to use, and AutoML platforms are making ML accessible to a broader audience. In the future, we can expect closer collaboration between people and machines.

The future of ML may require people to prepare data for its consumption and identify use cases for implementation. More importantly, people are needed to interpret the results and audit ML systems, checking whether they follow the right approach to solving a problem. The future looks pretty amazing, but we need to build that...

Technical requirements

All the code examples can be found in the Chapter 02 folder in GitHub.

Machine learning

The ideas behind machine learning date back decades. It was born from the theory that computers can learn without being explicitly programmed to perform specific tasks. The iterative aspect of ML is essential because machines need to adapt continually to new data. They need to learn from historical data, optimize for better computations, and generalize well enough to provide proper results.

We are all aware of rule-based systems, where a machine executes a set of predefined conditions and provides the results. How great would it be if machines learned these patterns by themselves, delivered the results, and explained the rules that they discovered? This is ML. It is a broad term for the various methods and algorithms that machines use to learn from data. As a branch of artificial intelligence (AI), ML algorithms are quite often used to discover...

Linear regression

Let's begin our triple W session with linear regression.

What is linear regression?

It is the most traditional and most widely used form of regression analysis. It has been studied rigorously and is used widely for practical purposes. Linear regression is a method for determining the relationship between a dependent variable (y) and one or more independent variables (x). This derived relationship can be used to predict an unknown y from observed x's. Mathematically, if x is an independent variable (commonly known as the predictor) and y is a dependent variable (also known as the target), the relationship is expressed as follows:

y = mx + b

Where m is the slope of the line, b is the intercept of the best-fit regression line, and...
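As a minimal sketch of how a best-fit line can be estimated, the following plain-Python example computes m and b with ordinary least squares. The choice of least squares, the function name fit_line, and the sample data are assumptions made for illustration; they are not taken from the chapter's code.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
predicted = m * 5.0 + b  # predict y for a new x of 5.0
print(m, b, predicted)
```

In practice, a library estimator would normally replace the hand-written fit; the sketch only makes the roles of m and b concrete.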

Important evaluation metrics – regression algorithms

Assessing the value of an ML model is a two-phase process. First, the model is evaluated for its statistical accuracy, that is, whether the statistical hypotheses are correct, whether the model performs well, and whether that performance holds for other independent datasets. This is accomplished using several model evaluation metrics. Then, the model is evaluated to see whether the results meet the business requirements and whether the stakeholders genuinely get insights or useful predictions out of it.

A regression model is evaluated based on the following metrics:

Mean absolute error (MAE): It is the average of the absolute values of the prediction errors. The prediction error is defined as the difference between the predicted and actual values. This metric gives an idea about the magnitude of the error. However, we cannot judge...
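As its name says, MAE takes the mean of the absolute prediction errors. A minimal sketch in plain Python follows; the function name and the sample values are illustrative assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average of |predicted - actual| over all observations."""
    errors = [abs(p - a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Made-up values for illustration only.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # absolute errors are 0.5, 0.0, and 1.0, so the mean is 0.5
```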

Logistic regression

Let's start again with the triple W for logistic regression. To reiterate the triple W method, we first ask what the algorithm is, then where it can be used, and finally by what method we can implement the model.

What is logistic regression?

Logistic regression can be thought of as an extension of linear regression. It fundamentally works like linear regression, but it is meant for discrete or categorical outcomes.
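One way to picture this is that a linear score, mx + b, is passed through the logistic (sigmoid) function to give a probability, which is then thresholded into a class. The sketch below assumes a single predictor; the coefficient values and the 0.5 threshold are hypothetical, chosen by hand rather than learned from data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, m, b, threshold=0.5):
    """Return (class label, probability) for a single predictor x."""
    probability = sigmoid(m * x + b)
    label = 1 if probability >= threshold else 0
    return label, probability


# Hypothetical coefficients, not fitted to any dataset.
m, b = 1.5, -3.0

label_low, p_low = predict_class(1.0, m, b)    # linear score is -1.5
label_high, p_high = predict_class(3.0, m, b)  # linear score is +1.5
print(label_low, round(p_low, 3), label_high, round(p_high, 3))
```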

Where is logistic regression used?

Logistic regression is applied in the case of discrete target variables such...

Important evaluation metrics – classification algorithms

Most of the metrics used to assess a classification model are based on the values that we get in the four quadrants of a confusion matrix. Let's begin this section by understanding what it is:

Confusion matrix: It is the cornerstone of evaluating a classification model (that is, a classifier). As the name suggests, the matrix can be confusing at first. Let's try to visualize the confusion matrix as two axes in a graph. The x axis label is Predicted, with two values: Positive and Negative. Similarly, the y axis label is Actual, with the same two values, Positive and Negative, as shown in the following figure. This matrix is a table that contains the counts of actual and predicted values produced by a classifier:

If we try to deduce information about each quadrant in the matrix:
- Quadrant...
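The four quadrant counts can be tallied directly from paired actual and predicted labels. The sketch below assumes the conventional quadrant names (true positive, false positive, false negative, and true negative) and uses made-up labels in which 1 marks the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four quadrants of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1  # predicted positive, actually positive
        elif p == positive:
            counts["FP"] += 1  # predicted positive, actually negative
        elif a == positive:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


# Made-up labels for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, accuracy)
```

Accuracy is shown here because it falls straight out of the diagonal counts; other classification metrics are built from the same four numbers.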

Decision trees

Decision trees are extensively used classifiers in the ML world because of their transparency in representing the rules that drive a classification or prediction. Let us ask the triple W questions of this algorithm to learn more about it.

What are decision trees?

Decision trees are arranged in a hierarchical tree-like structure and are easy to explain and interpret. They are relatively insensitive to outliers. A decision tree is created by recursive partitioning: the training data is split repeatedly into groups, with the objective of finding homogeneous, pure subgroups, that is, subgroups containing data of only one class.

Outliers are values that lie far away from other data points and distort the data distribution.
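To make the search for pure subgroups concrete, the sketch below scores candidate splits on one numeric feature and keeps the split whose subgroups are purest. Gini impurity is assumed here as the purity measure (it is one common choice, not necessarily the one used in the chapter's code), and the ages and labels are made up. A full tree would apply the same search recursively to each resulting subgroup.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, impurity)."""
    best = (None, gini(labels))  # baseline: no split at all
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        # Weighted average impurity of the two subgroups.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up data: age of an employee and whether they left the company.
ages = [22, 25, 30, 41, 45, 52]
left_company = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_threshold(ages, left_company)
print(threshold, impurity)  # splitting at 35.5 gives two pure subgroups
```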

...

Support Vector Machines

SVM is a supervised ML algorithm used primarily for classification tasks; however, it can be used for regression problems as well.

What is SVM?

SVM is a classifier that works on the principle of separating hyperplanes. Given a training dataset, the algorithm finds a hyperplane that maximizes the separation between the classes and uses the resulting partitions to predict new data. A hyperplane is a subspace with one dimension less than its ambient space. This means that a line is a hyperplane for a two-dimensional dataset.
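The sketch below shows only how a given hyperplane is used for prediction in two dimensions: a point is classified by the side of the line it falls on. It does not perform the optimization that finds the maximum-separation hyperplane, and the weights w and offset b are hypothetical values chosen by hand.

```python
import math


def decision_value(point, w, b):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def classify(point, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(point, w, b) >= 0 else -1


def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from the point to the hyperplane."""
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(decision_value(point, w, b)) / norm


# Hypothetical hyperplane: the line x1 + x2 = 4 in a two-dimensional space.
w, b = [1.0, 1.0], -4.0

side_a = classify([3.0, 3.0], w, b)  # score is +2, so class +1
side_b = classify([1.0, 1.0], w, b)  # score is -2, so class -1
dist = distance_to_hyperplane([3.0, 3.0], w, b)
print(side_a, side_b, dist)
```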

Where is SVM used?

SVM...

k-Nearest Neighbors

Before we build a KNN model for the HR attrition dataset, let us understand KNN's triple W.

What is k-Nearest Neighbors?

KNN is one of the most straightforward algorithms: it stores all available data points and predicts new data based on distance measures such as Euclidean distance. It can make predictions using the training dataset directly. However, it is more resource intensive at prediction time because it has no training phase and requires all the data to be present in memory to predict new instances.

Euclidean distance is calculated as the square root of the sum of the squared differences between the coordinates of two points.
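A minimal sketch of this idea in plain Python follows: compute the Euclidean distance from the new point to every stored point, take the k closest, and predict the most common label among them. The function names, the choice of k=3, and the data points are illustrative assumptions, not the chapter's HR attrition example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k stored points nearest to query."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature data for illustration only.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_labels = ["stay", "stay", "stay", "leave", "leave", "leave"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)
```

Note that every prediction scans the full stored dataset, which is why the method becomes costly as the data grows.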

...

Ensemble methods

Ensembling models is a robust approach to enhancing the performance of predictive models. It is a well-thought-out strategy that is very similar to a powerful word: TEAM! A task done by a team often leads to significant accomplishments.

What are ensemble models?

Likewise, in the ML world, an ensemble model is a team of models operating together to enhance the result of their work. Technically, ensemble models comprise several supervised learning models that are individually trained, and their results are merged in various ways to achieve the final prediction. This result typically has higher predictive power than the result of any of its constituent learning algorithms on its own.
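Majority voting is one simple way to merge the results of individually trained models. The sketch below assumes three binary classifiers whose predictions are already available; the predictions and actual labels are made up so that each model is wrong on a different sample.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Made-up labels: each hypothetical model makes exactly one mistake.
actual = [1, 0, 1, 1]
model_a = [1, 0, 1, 0]  # wrong on the last sample
model_b = [1, 1, 1, 1]  # wrong on the second sample
model_c = [0, 0, 1, 1]  # wrong on the first sample

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble == actual)
```

Using an odd number of binary models avoids tied votes. The benefit shown here depends on the models making different mistakes; models that err on the same samples would gain nothing from voting.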

Mostly, there are...

Comparing the results of classifiers

We have created six classification models on the HR attrition dataset. The following table summarizes the evaluation scores for each model:

The random forest model appears to be the winner among all six models, with 99% accuracy. We need not improve the random forest model further; instead, we should check whether it generalizes well to a new dataset and whether the results overfit the training dataset. One way to do this is cross-validation.

Cross-validation

Cross-validation is a way to evaluate the accuracy of a model on a dataset that was not used for training, that is, a sample of data unknown to the trained model. This helps check that the model generalizes to independent datasets when deployed in a production environment. One method is to divide the dataset into two sets: a train set and a test set. We demonstrated this method in our previous examples.

Another popular and more robust method is the k-fold cross-validation approach, where the dataset is partitioned into k subsamples of equal size, with k an integer greater than 1. In each round, k-1 subsamples are used to train the model and the remaining subsample is used to test it. This process is repeated k times, with each of the k subsamples used exactly once to test the model. The evaluation results are then averaged or combined...
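The partitioning step can be sketched in plain Python as follows. The function name k_fold_indices is hypothetical, the sketch does not shuffle the data, and when the sample count is not divisible by k it lets the first few folds hold one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must be greater than 1 and at most n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each index appears in exactly one test fold, so a model trained and scored once per fold yields k scores that can then be averaged.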

Clustering

We will begin this section with a question. How do we start learning a new algorithm or a machine learning method? We start with triple W. So, let's begin with that for the clustering method.

What is clustering?

Clustering is a technique for grouping similar data together, where each group has some unique characteristics that differ from those of other groups. Data can be clustered together using various methods. One of them is rule-based, where the groups are formed based on certain predefined conditions, such as grouping customers based on their age or industry. Another method is to use ML algorithms to cluster data together.
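The rule-based approach can be sketched in a few lines of plain Python. The age boundaries, group names, and customer records below are all made up for illustration.

```python
from collections import defaultdict


def age_group(age):
    """Assign a customer to a group using predefined age boundaries."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30 to 49"
    return "50 and over"


# Made-up customers as (name, age) pairs.
customers = [("Ana", 24), ("Ben", 37), ("Cho", 52), ("Dev", 29), ("Eli", 45)]

groups = defaultdict(list)
for name, age in customers:
    groups[age_group(age)].append(name)

print(dict(groups))
```

Here the boundaries are fixed in advance by a person; an ML clustering algorithm would instead derive the groups from the data itself.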

...

Summary

The journey through ML and its automation is a long one. The aim of this chapter was to familiarize ourselves with machine learning concepts and, most importantly, with scikit-learn and other Python packages, so that we can accelerate our learning in the next chapters. We created a linear regression model and six classification models, learned about clustering techniques, and compared the models with each other.

We used a single HR attrition dataset to create all the classifiers, and we observed many similarities in the code. The imported libraries are all similar except for the one used to instantiate the machine learning class. The data preprocessing code is repeated in every example. The machine learning technique changes based on the task and the type of the target attribute. Also, the evaluation methodology is the same across ML methods of the same type.

Do you think that...