Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Machine Learning Fundamentals

You're reading from   Machine Learning Fundamentals Use Python and scikit-learn to get up and running with the hottest developments in machine learning

Arrow left icon
Product type Paperback
Published in Nov 2018
Publisher
ISBN-13 9781789803556
Length 240 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Hyatt Saleh Hyatt Saleh
Author Profile Icon Hyatt Saleh
Hyatt Saleh
Arrow right icon
View More author details
Toc

Supervised and Unsupervised Learning

Machine learning is divided into two main categories: supervised and unsupervised learning.

Supervised Learning

Supervised learning consists of understanding the relation between a given set of features and a target value, also known as a label or class. For instance, it can be used for modeling the relationship between a person's demographic information and their ability to pay loans, as shown in the following table:

Figure 1.19: A table depicting the relationship between a person's demographic information and the ability to pay loans
Figure 1.19: A table depicting the relationship between a person's demographic information and the ability to pay loans

Models trained to foresee these relationships can then be applied to predict labels for new data. As we can see from the preceding example, a bank that builds such a model can then input data from loan applicants to determine if they are likely to pay back the loan.

These models can be further divided into classification and regression tasks, which are explained as follows.

Classification tasks are used to build models out of data with discrete categories as labels; for instance, a classification task can be used to predict whether a person will pay a loan. You can have more than two discrete categories, such as predicting the ranking of a horse in a race, but they must be a finite number.

Most classification tasks output the prediction as the probability of an instance to belong to each output label. The assigned label is the one with the highest probability, as can be seen in the following diagram:

Figure 1.20: An illustration of the working of a classification algorithm
Figure 1.20: An illustration of the working of a classification algorithm

Some of the most common classification algorithms are as follows:

  • Decision trees: This algorithm follows a tree-like architecture that simulates the decision process given a previous decision.
  • Naïve Bayes classifier: This algorithm relies on a group of probabilistic equations based on Bayes' theorem, which assumes independence among features. It has the ability to consider several attributes.
  • Artificial neural networks (ANNs): These replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture. They pass information to one another until a result is achieved.

Regression tasks, on the other hand, are used for data with continuous quantities as labels; for example, a regression task can be used for predicting house prices. This means that the value is represented by a quantity and not by a set of possible outputs. Output labels can be of integer or float types:

  • The most popular algorithm for regression tasks is linear regression. It consists of only one independent feature (x) whose relation with its dependent feature (y) is linear. Due to its simplicity, it is often overseen, even though it performs very well for simple data problems.
  • Other, more complex regression algorithms include regression trees and support vector regression, as well as ANNs once again.

In conclusion, for supervised learning problems, each instance has a correct answer, also known as a label or class. The algorithms under this category aim to understand the data and then predict the class of a given set of features. Depending on the type of class (continuous or discrete), the supervised algorithms can be divided into classification or regression tasks.

Unsupervised Learning

Unsupervised learning consists of modeling the model to the data, without any relationship with an output label, also known as unlabeled data. This means that algorithms under this category search to understand the data and find patterns in it. For instance, unsupervised learning can be used to understand the profile of people belonging to a neighborhood, as shown in the following diagram:

Figure 1.21: An illustration of how unsupervised algorithms can be used to understand the profiles of people
Figure 1.21: An illustration of how unsupervised algorithms can be used to understand the profiles of people

When applying a predictor over these algorithms, no target label is given as output. The prediction, only available for some models, consists of placing the new instance into one of the subgroups of data that has been created.

Unsupervised learning is further divided into different tasks, but the most popular one is clustering, which will be discussed next.

Clustering tasks involve creating groups of data (clusters) and complying with the condition that instances from other groups differ visibly from the instances within the group. The output of any clustering algorithm is a label, which assigns the instance to the cluster of that label:

Figure 1.22: A diagram showing clusters of multiple sizes
Figure 1.22: A diagram representing clusters of multiple sizes

The preceding diagram shows a group of clusters, each of a different size, based on the number of instances that belong to each cluster. Considering this, even though clusters do not need to have the same number of instances, it is possible to set the minimum number of instances per cluster to avoid overfitting the data into tiny clusters of very specific data.

Some of the most popular clustering algorithms are as follows:

  • k-means: This focuses on separating the instances into n clusters of equal variance by minimizing the sum of the squared distances between two points.
  • Mean-shift clustering: This creates clusters by using centroids. Each instance becomes a candidate for centroid to be the mean of the points in that cluster.
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN): This determines clusters as areas with a high density of points, separated by areas with low density.

In conclusion, unsupervised algorithms are designed to understand data when there is no label or class that indicates a correct answer for each set of features. The most common types of unsupervised algorithms are the clustering methods that allow you to classify a population into different groups.

You have been reading a chapter from
Machine Learning Fundamentals
Published in: Nov 2018
Publisher:
ISBN-13: 9781789803556
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image