Hands-On Machine Learning for Cybersecurity

What is machine learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Tom M. Mitchell

Machine learning is the branch of science that enables computers to learn, adapt, extrapolate patterns, and communicate with each other without being explicitly programmed to do so. The term dates back to 1959, when it was first coined by Arthur Samuel at the IBM Artificial Intelligence Labs. Machine learning has its foundations in statistics and now overlaps significantly with data mining and knowledge discovery. In the following chapters, we will go through many of these concepts using cybersecurity as the backdrop.

In the 1980s, machine learning gained much more prominence with the success of artificial neural networks (ANNs). Machine learning became widespread in the 1990s, when researchers started applying it to day-to-day problems. In the early 2000s, the internet and digitization poured fuel on this fire, and over the years companies like Google, Amazon, Facebook, and Netflix started leveraging machine learning to improve human-computer interaction even further. Voice recognition and face recognition systems have become our go-to technologies. More recently, artificially intelligent home automation products, self-driving cars, and robot butlers have sealed the deal.

The field of cybersecurity, during this same period, saw several massive cyber attacks and data breaches, both regular attacks and state-sponsored ones. Cyber attacks have become so big that criminals these days are not content with regular impersonations and account takeovers; they target massive industrial security vulnerabilities and try to achieve the maximum return on investment (ROI) from a single attack. Several Fortune 500 companies have fallen prey to sophisticated cyber attacks, spear phishing attacks, zero-day vulnerabilities, and so on. Attacks on Internet of Things (IoT) devices and the cloud have gained momentum. These cyber breaches seemed to outsmart human security operations center (SOC) analysts, and machine learning methods are needed to complement human effort. More and more threat detection systems now depend on these advanced intelligent techniques, and are slowly moving away from the signature-based detectors typically used in security information and event management (SIEM).

Problems that machine learning solves

The following list presents some of the problems that machine learning solves, by use case domain:

  • Face recognition: Face recognition systems identify people from digital images by recognizing facial features. They are similar to biometric systems and are used extensively in security systems, for example, face recognition to unlock phones. Such systems use three-dimensional recognition and skin texture analysis to verify faces.
  • Fake news detection: Fake news has been rampant, especially since the 2016 United States presidential election. To stop such yellow journalism and the turmoil created by fake news, detectors were introduced to separate fake news from legitimate news. The detectors use the semantic and stylistic patterns of the text in an article, the source of the article, and so on, to segregate fake news from legitimate news.
  • Sentiment analysis: Understanding the overall positivity or negativity of a document is important, as opinion is an influential parameter when making a decision. Sentiment analysis systems perform opinion mining to understand the mood and attitude of the customer.
  • Recommender systems: These systems assess the choices of a customer based on the personal history of previous choices made by that customer. Choices made by other, similar customers are another determining factor that influences such systems. Recommender systems are extremely popular and heavily used by industries to sell movies, products, insurance, and so on. In a way, recommender systems decide the go-to-market strategies for a company based on cumulative likes and dislikes.
  • Fraud detection systems: Fraud detection systems are created for risk mitigation and to keep fraud in check in the customer's interest. Such systems detect outliers in transactions and raise flags by measuring anomaly coefficients.
  • Language translators: Language translators are intelligent systems that can translate not just word by word but whole paragraphs at a time. Natural language translators use contextual information from multilingual documents to make these translations.
  • Chatbots: Intelligent chatbots are systems that enhance the customer experience by providing automatic responses when human customer service agents cannot respond. However, their activity is not limited to being a virtual assistant; they have sentiment analysis capabilities and can also make recommendations.

Why use machine learning in cybersecurity?

Legacy threat detection systems used heuristics and static signatures on large volumes of data logs to detect threats and anomalies. However, this meant that analysts needed to know what normal data logs looked like. The process included data being ingested and processed through the traditional extraction, transformation, and load (ETL) phase. The transformed data was read by machines and analyzed by analysts, who created signatures. The signatures were then evaluated by passing more data through them. An error in evaluation meant rewriting the rules. Signature-based threat detection techniques, though well understood, are not robust, since signatures need to be created on the go for ever larger volumes of data.

Current cybersecurity solutions

Today, signature-based systems are gradually being replaced by intelligent cybersecurity agents. Machine learning products are aggressive in identifying new malware, zero-day attacks, and advanced persistent threats. Insight from the immense amount of log data is aggregated by log correlation methods. Endpoint solutions have been very active in identifying peripheral attacks. New machine learning-driven cybersecurity products have been proactive in strengthening container systems such as virtual machines. The following diagram gives a brief overview of some machine learning solutions in cybersecurity:

In general, machine learning products are created to predict attacks before they occur, but given the sophisticated nature of these attacks, preventive measures often fail. In such cases, machine learning often helps to remediate in other ways, like recognizing the attack at its initial stages and preventing it from spreading across the entire organization.

Many cybersecurity companies are relying on advanced analytics, such as user behavior analytics and predictive analytics, to identify advanced persistent threats early in the threat life cycle. These methods have been successful in preventing data leakage of personally identifiable information (PII) and insider threats. Prescriptive analytics is another advanced machine learning solution worth mentioning from the cybersecurity perspective. Unlike predictive analytics, which predicts threats by comparing current threat logs with historic threat logs, prescriptive analytics is a more reactive process. Prescriptive analytics deals with situations where a cyber attack is already in play. It analyzes data at this stage to suggest which reactive measure would best fit the situation and keep the loss of information to a minimum.

Machine learning, however, has a downside in cybersecurity. Since the alerts generated need to be vetted by human SOC analysts, generating too many false alerts can cause alert fatigue. To prevent this issue of false positives, cybersecurity solutions also take insights from SIEM signals. The signals from SIEM systems are compared with the advanced analytics signals so that the system does not produce duplicate signals. Machine learning solutions in cybersecurity products thus learn from the environment to keep false signals to a minimum.

Data in machine learning

Data is the fuel that drives the machine learning engine. Data, when fed to machine learning systems, helps in detecting patterns and mining insights. This data can be in any form and arrive at any frequency from any source.

Structured versus unstructured data

Depending on the source of data and the use case at hand, data can either be structured, that is, easily mapped to identifiable column headers, or unstructured, that is, not mappable to any identifiable data model. A mix of unstructured and structured data is called semi-structured data. We will discuss the differing learning approaches to handling these two types of data later in the chapter.

Labelled versus unlabelled data

Data can also be categorized into labelled and unlabelled data. Data that has been manually tagged with headers and meaning is called labelled data. Data that has not been tagged is called unlabelled data. Both labelled and unlabelled data are fed into the machine learning phases described next. In the training phase, the ratio of labelled to unlabelled data is 60:40, and 40:60 in the testing phase. Unlabelled data is transformed into labelled data in the testing phase, as shown in the following diagram:

Machine learning phases

The general approach to solving a machine learning problem consists of a series of phases. These phases are consistent no matter what the source of data is. That is, be it structured or unstructured, the stages required to tackle any kind of data are as shown in the following diagram:

We will discuss each of the phases in detail as follows, with a minimal code sketch of the full flow after the list:

  • The analysis phase: In this phase, the ingested data is analyzed to detect patterns in the data that help create explicit features or parameters that can be used to train the model.
  • The training phase: Data parameters generated in the previous phases are used to create machine learning models in this phase. The training phase is an iterative process, where the data incrementally helps to improve the quality of prediction.
  • The testing phase: Machine learning models created in the training phase are tested with more data and their performance is assessed. In this stage, we test with data that has not been used in the previous phases. Model evaluation at this stage may or may not require parameter tuning.
  • The application phase: The tuned models are finally fed with real-world data at this phase. At this stage, the model is deployed in the production environment.
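To make these phases concrete, here is a minimal sketch of the training, testing, and application flow, written with scikit-learn and its bundled iris dataset purely as stand-in data (the dataset, model choice, and sample values are illustrative assumptions, not part of any particular cybersecurity pipeline):

# A minimal sketch of the training, testing, and application phases
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Keep a portion of the data unseen for the testing phase
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training phase: the model learns from the training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Testing phase: assess performance on data not used during training
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Application phase: the tuned model is fed a new, real-world sample
new_sample = [[5.0, 3.4, 1.5, 0.2]]
print("Predicted class:", model.predict(new_sample))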

Inconsistencies in data

In the training phase, a machine learning model may or may not generalize perfectly. This is due to inconsistencies in the data that we need to be aware of.

Overfitting

The production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
- Oxford Dictionary

Overfitting is the phenomenon in which the system fits the training data too closely. The system then shows a negative bias when treated with new data; in other words, the model performs badly. This is often because we feed only labelled data to our model, hence we need both labelled and unlabelled data to train a machine learning system.

The following graph shows that to prevent any model errors we need to select data in the optimal order:

Underfitting

Underfitting is another scenario where the model performs badly. It is a phenomenon where the performance of the model suffers because the model is not well trained. Such systems have trouble generalizing to new data.

For ideal model performance, both overfitting and underfitting can be prevented by performing some common machine learning procedures, like cross validation of the data, data pruning, and regularization of the data. We will go through these in much more detail in the following chapters after we get more acquainted with machine learning models.

Different types of machine learning algorithm

In this section, we will discuss the different types of machine learning systems and the most commonly used algorithms, with special emphasis on the ones that are more popular in the field of cybersecurity. The following diagram shows the different types of learning involved in machine learning:

Machine learning systems can be broadly categorized into two types: supervised approaches and unsupervised approaches, based on the types of learning they provide.

Supervised learning algorithms

Supervised learning is where a known dataset is used to classify or predict with data in hand. Supervised learning methods learn from labelled data and then use the insight to make decisions on the testing data.

Supervised learning has several subcategories of learning, for example:

  • Semi-supervised learning: This is the type of learning where the initial training data is incomplete. In other words, in this type of learning, both labelled and unlabelled data are used in the training phase.
  • Active learning: In this type of learning algorithm, the machine learning system actively queries the user and learns on the go. This is a specialized case of supervised learning.

Some popular examples of supervised learning are:

  • Face recognition: Face recognizers use supervised approaches to identify new faces. They extract information from a set of facial images provided during the training phase and use the insights gained after training to detect new faces.
  • Spam detection: Supervised learning helps distinguish spam emails in the inbox by separating them from legitimate emails, also known as ham emails. During this process, the training data enables learning, which helps such systems send ham emails to the inbox and spam emails to the Spam folder; a minimal code sketch follows:
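As a small illustration of the spam detection idea (the emails, labels, and model choice below are made up purely for illustration; they are not the book's dataset), a bag-of-words representation and a naive Bayes classifier from scikit-learn can separate ham from spam:

# Toy spam/ham classifier: a hedged sketch, not a production filter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at noon tomorrow",
          "claim your free reward today", "project status update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # bag-of-words features

classifier = MultinomialNB()
classifier.fit(X, labels)              # supervised: learns from labelled data

new_email = vectorizer.transform(["free prize waiting for you"])
print(classifier.predict(new_email))   # 1 -> route to the Spam folder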

Unsupervised learning algorithms

Unsupervised learning techniques are used where the initial data is not labelled. Insights are drawn by processing data whose structure is not known beforehand. These are more complex processes, since the system learns by itself without any intervention.

Some practical examples of unsupervised learning techniques are:

  • User behavior analysis: Behavior analytics uses unlabelled data about different human traits and human interactions. This data is then used to put each individual into different groups based on their behavior patterns.
  • Market basket analysis: This is another example where unsupervised learning helps identify the likelihood that certain items will always appear together. An example of such an analysis is shopping cart analysis, where chips, dips, and beer are likely to be found together in the basket, as shown in the following diagram; a brief clustering sketch follows this list:
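As a brief sketch of unsupervised grouping (the behavioral features and values below are invented for illustration), k-means can split users into behavior-based clusters without any labels:

# Grouping users by behavior with k-means: a minimal, made-up sketch
import numpy as np
from sklearn.cluster import KMeans

# Each row: [logins per day, megabytes uploaded per day]
user_behavior = np.array([[3, 10], [4, 12], [2, 9],          # ordinary users
                          [40, 500], [38, 480], [42, 510]])  # heavy users

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(user_behavior))   # cluster assignment for each user
print(kmeans.cluster_centers_)             # the centre of each behavior group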

Reinforcement learning

Reinforcement learning is a type of dynamic programming where the software learns from its environment to produce an output that will maximize the reward. Here the software requires no external agent but learns from the surrounding processes in the environment.

Some practical examples of reinforcement learning techniques are:

  • Self-driving cars: Self-driving cars exhibit autonomous motion by learning from the environment. The robust vision technologies in such a system are able to adapt to surrounding traffic conditions. Thus, when these technologies are amalgamated with complex software and hardware movements, they make it possible to navigate through traffic.
  • Intelligent gaming programs: DeepMind's artificially intelligent game-playing programs have been successful in learning a number of games in a matter of hours. Such systems use reinforcement learning in the background to quickly adapt game moves. The AlphaZero program was able to beat the well-known chess engine Stockfish with just four hours of training:

Another categorization of machine learning

Machine learning techniques can also be categorized by the type of problem they solve, such as classification, clustering, regression, dimensionality reduction, and density estimation techniques. The following diagram briefly presents definitions and examples of these systems:

In the following chapters, we will delve into the details of these techniques and their implementation with respect to cybersecurity problems.

Classification problems

Classification is the process of dividing data into multiple classes. Unknown data is ingested and divided into categories based on characteristics or features. Classification problems are an instance of supervised learning since the training data is labelled.

Web data classification is a classic example of this type of learning, where web content is categorized by models into its respective type, such as news, social media, advertisements, and so on, based on its textual content. The following diagram shows data classified into two classes:

Clustering problems

Clustering is the process of grouping data and putting similar data into the same group. Clustering techniques use a series of data parameters and go through several iterations before they can group the data. These techniques are most popular in the fields of information retrieval and pattern recognition. Clustering techniques are also popularly used in the demographic analysis of the population. The following diagram shows how similar data is grouped in clusters:

Regression problems

Regression is a statistical process for analyzing data that helps with both data classification and prediction. In regression, the relationship between two variables present in the data population is estimated by analyzing multiple independent and dependent variables. Regression can be of many types, such as linear regression, logistic regression, polynomial regression, lasso regression, and so on. An interesting use case for regression analysis is the fraud detection system. Regression is also used in stock market analysis and prediction:

Dimensionality reduction problems

Dimensionality reduction problems are machine learning techniques where high-dimensional data with multiple variables is represented with principal variables, without losing any vital data. Dimensionality reduction techniques are often applied to network packet data to reduce the volume of data to a manageable size. They are also used in the process of feature extraction, where it is impossible to model with high-dimensional data. The following screenshot shows high-dimensional data with multiple variables:

Density estimation problems

Density estimation problems are statistical learning methods used in machine learning to make estimations from dense data that is otherwise unobservable. Technically, density estimation is the technique of computing the probability density function. Density estimation can be applied to parametric and non-parametric data. Medical analysis often uses these techniques to identify symptoms related to diseases in a very large population. The following diagram shows the density estimation graph:

Deep learning

Deep learning is the form of machine learning where systems learn from examples. This is a more advanced form of machine learning. Deep learning is the study of deep neural networks and requires much larger datasets. Today, deep learning is the most sought-after technique. Some popular examples of deep learning applications include self-driving cars, smart speakers, home pods, and so on.

Algorithms in machine learning

So far we have dealt with different machine learning systems. In this section we will discuss the algorithms that drive them. The algorithms discussed here fall under one or many groups of machine learning that we have already covered.

Support vector machines

Support vector machines (SVMs) are supervised learning algorithms used in both linear and non-linear classification. SVMs operate by creating an optimal hyperplane in high-dimensional space. The separation created by this hyperplane defines the classes. SVMs need very little tuning once trained. They are used in high-performing systems because of the reliability they have to offer.

SVMs are also used in regression analysis and in ranking and categorization.

Bayesian networks

Bayesian networks (BNs) are probabilistic models that are primarily used for prediction and decision making. These are belief networks that use the principles of probability theory along with statistics. A BN uses a directed acyclic graph (DAG) to represent the relationships between variables and any other corresponding dependencies.

Decision trees

Decision tree learning is a predictive machine learning technique that uses decision trees. Decision trees make use of decision analysis to predict the value of the target. Decision trees are simple implementations of classification problems and are popular in operations research. Decisions are made from the output value predicted by the conditional variables.

Random forests

Random forests are extensions of decision tree learning. Here, several decision trees are collectively used to make predictions. Since this is an ensemble, they are stable and reliable. Random forests can go in-depth to make irregular decisions. A popular use case for random forests is the quality assessment of text documents.

Hierarchical algorithms

Hierarchical algorithms are a form of clustering algorithm. They are sometimes referred to as hierarchical clustering algorithms (HCAs). An HCA can either be bottom-up (agglomerative) or top-down (divisive). In the agglomerative approach, each observation starts in its own cluster, and smaller clusters are gradually merged to move up the hierarchy. The top-down divisive approach starts with a single cluster that is recursively broken down into multiple clusters.

Genetic algorithms

Genetic algorithms are meta-heuristic algorithms used in constrained and unconstrained optimization problems. They mimic the process of natural evolution and use these insights to solve problems. Genetic algorithms are known to outperform some traditional machine learning and search algorithms because they can withstand noise or changes in the input pattern.

Similarity algorithms

Similarity algorithms are predominantly used in the field of text mining. Cosine similarity is a popular algorithm primarily used to compare the similarity between documents. The inner product of two vectors in the document vector space identifies the amount of similarity between two documents. Similarity algorithms are used in authorship and plagiarism detection techniques.
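As a minimal sketch (the two sentences are made up), the cosine similarity between a pair of short documents can be computed with scikit-learn's vectorizer and pairwise metrics:

# Cosine similarity between two toy documents
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the attacker scanned the network",
        "the network was scanned by an attacker"]

vectors = CountVectorizer().fit_transform(docs)    # documents as vectors
print(cosine_similarity(vectors[0], vectors[1]))   # close to 1 means very similar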

ANNs

ANNs are intelligent computing systems that mimic the human nervous system. An ANN comprises multiple nodes, both input and output, connected through layers of hidden nodes. The complex relationships between these layers allow the system to learn, much as the human nervous system does.

The machine learning architecture

A typical machine learning system comprises a pipeline of processes that happens in a sequence for any type of machine learning system, irrespective of the industry. The following diagram shows a typical machine learning system and the sub-processes involved:

Data ingestion

Data is ingested from different sources, from real-time systems like IoT devices (CCTV cameras, for example), streaming media data, and transaction logs. Data that is ingested can also come from batch processes or non-interactive processes, like Linux cron jobs, Windows scheduler jobs, and so on. Single-feed data like raw text data, log files, and process data dumps are also taken in by data stores. Data from enterprise resource planning (ERP), customer relationship management (CRM), and operational systems is also ingested. Here we look at some data ingestors that are used in continuous, real-time, or batched data ingestion:

  • Amazon Kinesis: This is a cost-effective data ingestor from Amazon. Kinesis enables terabytes of real-time data to be stored per hour from different data sources. The Kinesis Client Library (KCL) helps to build applications on streaming data and further feeds other Amazon services, like Amazon S3, Redshift, and so on.
  • Apache Flume: Apache Flume is a dependable data collector used for streaming data. Apart from data collection, it is fault-tolerant and has a reliable architecture. It can also be used to aggregate and move data.
  • Apache Kafka: Apache Kafka is another open source message broker used in data collection. This high-throughput stream processor works extremely well for creating data pipelines. Its cluster-centric design helps in creating wicked fast systems.
Some other data collectors that are widely used in the industry are Apache Sqoop, Apache Storm, Gobblin, Data Torrent, Syncsort, and Cloudera Morphlines.

Data store

The raw or aggregated data from data collectors is stored in data stores, like SQL databases, NoSQL databases, data warehouses, and distributed systems, like HDFS. This data may require some cleaning and preparation if it is unstructured. The file format in which the data is received varies from database dumps and JSON files to Parquet files, Avro files, and even flat files. In distributed data storage systems, the data upon ingestion gets distributed across different file formats.

Some of the popular data stores available for use as per industry standards are:

  • RDBMS (relational database management system): RDBMSes are legacy storage options and are extremely popular in the data warehouse world. They store data while retaining the Atomicity, Consistency, Isolation, and Durability (ACID) properties. However, they suffer from downsides when it comes to the volume and velocity of data.
  • MongoDB: MongoDB is a popular NoSQL, document-oriented database. It has wide adoption in the cloud computing world. It can handle data in any format, whether structured, semi-structured, or unstructured. With a high code push frequency, it is extremely agile and flexible. MongoDB is inexpensive compared with other monolithic data storage options.
  • Bigtable: This is a scalable NoSQL database from Google. Bigtable is a part of the reliable Google Cloud Platform (GCP). It is seamlessly scalable, with a very high throughput. Being a part of GCP enables it to be easily plugged in behind visualization apps like Firebase. This is extremely popular among app makers, who use it to gather data insights. It is also used for business analytics.
  • AWS Cloud Storage Services: Amazon AWS offers a range of cloud storage services for IoT devices, distributed data storage platforms, and databases. AWS data storage services are extremely secure for any cloud computing component.

The model engine

A machine learning model engine is responsible for managing the end-to-end flows involved in making the machine learning framework operational. The process includes data preparation, feature generation, training, and testing a model. In the next sections we will discuss each of these processes in detail.

Data preparation

Data preparation is the stage where data cleansing is performed to check the consistency and integrity of the data. Once the data is cleansed, it is often formatted and sampled. The data is normalized so that all of it can be measured on the same scale. Data preparation also includes data transformation, where the data is either decomposed or aggregated.
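As a small sketch of the normalization step (the feature values below are invented), scikit-learn's MinMaxScaler rescales every column onto the same 0-1 scale:

# Bringing features onto the same scale during data preparation
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns on very different scales: bytes transferred and failed logins
raw = np.array([[1000000, 2],
                [5000000, 0],
                [250000, 7]])

scaler = MinMaxScaler()
print(scaler.fit_transform(raw))   # every column now lies between 0 and 1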

Feature generation

Feature generation is the process in which the data is analyzed and we look for patterns and attributes that may influence the model results. Features are usually mutually independent, and are generated from either raw data or aggregated data. The primary goals of feature generation are dimensionality reduction and improved performance.

Training

Model training is the phase in which a machine learning algorithm learns from the data in hand. The learning algorithm detects data patterns and relationships, and categorizes data into classes. The data attributes need to be properly sampled to attain the best performance from the models. Usually 70-80 percent of the data is used in the training phase.

Testing

In the testing phase, we validate the model that we built in the training phase. Testing is usually done with 20 percent of the data. Cross-validation methods help determine the model's performance. The performance of the model can then be tested and tuned.

Performance tuning

Performance tuning and error detection are the most important iterations for a machine learning system, as they help improve the performance of the system. Machine learning systems are considered to have optimal performance if the generalized function of the algorithm gives a low generalization error with a high probability. This is conventionally known as probably approximately correct (PAC) theory.

To compute the generalization error, which is the accuracy of classification or the error in the forecast of a regression model, we use the metrics described in the following sections.

Mean squared error

Imagine that, for a regression problem, we have the line of best fit and we want to measure the distance of each point from the regression line. Mean squared error (MSE) is the statistical measure that computes these deviations. MSE computes the error by finding the mean of the squares of each such deviation:

MSE = (1/n) Σ (Yᵢ − Ŷᵢ)², where i = 1, 2, 3, ..., n

Mean absolute error

Mean absolute error (MAE) is another statistical method that helps to measure the distance (error) between two continuous variables. A continuous variable can be defined as a variable that can take an infinite number of values. Though MAEs are more difficult to compute, they are considered better performing than MSE because they are independent of the square function, which gives larger errors a disproportionate influence: MAE = (1/n) Σ |Yᵢ − Ŷᵢ|, for i = 1, 2, ..., n.
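Both measures are easy to compute in code; the following is a small sketch with invented actual and predicted values, using scikit-learn's metrics module:

# Computing MSE and MAE for a toy set of predictions
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]   # observed values
y_pred = [2.5, 5.0, 4.0, 8.0]   # model predictions

print("MSE:", mean_squared_error(y_true, y_pred))    # mean of squared deviations
print("MAE:", mean_absolute_error(y_true, y_pred))   # mean of absolute deviations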

Precision, recall, and accuracy

Another measure of performance for classification problems is estimating the precision, recall, and accuracy of the model.

Precision is defined as the number of true positives as a proportion of all retrieved (positively labelled) instances: Precision = TP / (TP + FP).

Recall is the number of true positives identified as a proportion of the total number of true positives present in all relevant documents: Recall = TP / (TP + FN).

Accuracy measures how close the measured value is to the standard value: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Fake document detection is a real-world use case that illustrates this. For fake news detector systems, precision is the number of genuinely fake news articles detected out of the total number of documents flagged by the detector. Recall, on the other hand, measures the number of fake news articles retrieved out of the total number of fake news articles present. Accuracy measures the correctness with which such a system detects fake news. The following diagram shows the fake news detector system:
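For such a detector, all three measures can be computed directly from the true and predicted labels; the labels below are invented purely for illustration:

# Precision, recall, and accuracy for a toy fake news detector
from sklearn.metrics import precision_score, recall_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = fake article, 0 = legitimate
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # detector output

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))          # TP / (TP + FN)
print("Accuracy:", accuracy_score(y_true, y_pred))      # (TP + TN) / total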

How can model performance be improved?

Models with a low degree of accuracy and high generalization error need improvement to achieve better results. Performance can be improved either by improving the quality of the data, switching to a different algorithm, or tuning the current algorithm's performance with ensembles.

Fetching the data to improve performance

Fetching more data to train a model can lead to an improvement in performance. Lowered performance can also be due to a lack of clean data, hence the data needs to be cleansed, resampled, and properly normalized. Revisiting feature generation can also lead to improved performance. Very often, a lack of independent features within a model is a cause of its skewed performance.

Switching machine learning algorithms

A model's performance is often not up to the mark because we have not made the right choice of algorithm. In such scenarios, performing a baseline test with different algorithms helps us make a proper selection. Baseline testing methods include, but are not limited to, k-fold cross-validation.
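A brief sketch of such a baseline test follows, comparing two candidate algorithms with 5-fold cross-validation; the iris dataset is used here only as convenient stand-in data:

# Baseline comparison of two algorithms with k-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = [("logistic regression", LogisticRegression(max_iter=200)),
              ("decision tree", DecisionTreeClassifier(random_state=0))]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(name, "mean accuracy:", scores.mean())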

Ensemble learning to improve performance

The performance of a model can be improved by ensembling the performance of multiple algorithms. Blending forecasts and datasets can help in making correct predictions. Some of the most complex artificially intelligent systems today are a byproduct of such ensembles.
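One simple way to blend several algorithms is a voting ensemble; the following is a minimal sketch with scikit-learn's VotingClassifier, again using the iris dataset only as stand-in data:

# Blending three classifiers into a single voting ensemble
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=200)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("nb", GaussianNB())], voting="hard")

# Each member votes; the majority class wins
print("Ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())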

Hands-on machine learning

We have so far established that machine learning is used heavily in industry and in the field of data-driven research. So let's go through some machine learning tools that help to create machine learning applications with both small and large-scale data. The following flow diagram shows the various machine learning tools and languages that are currently at our disposal:

Python for machine learning

Python is the preferred language for developing machine learning applications. Though not the fastest, Python is extensively adopted by data scientists because of its versatility.

Python supports a wide range of tools and packages that enable machine learning experts to implement changes with much agility. Python, being a scripting language, is easy to adopt and code in. Python is also extensively used for graphical user interface (GUI) development.

Comparing Python 2.x with 3.x

Python 2.x is an older version compared to Python 3.x. Python 3.x was first released in 2008, while the last major Python 2.x release came out in 2010. Though it is perfectly fine to build applications with 2.x, it is worth mentioning that the 2.x line has not been developed any further beyond 2.7.

Almost every machine learning package in use has support for both the 2.x and 3.x versions. However, for the purpose of staying up to date, we will be using version 3.x in the use cases we discuss in this book.

Python installation

Once you have made a decision to install Python 2 or Python 3, you can download the latest version from the Python website at the following URL:

https://www.python.org/download/releases/

On running the downloaded file, Python is installed in the following directory, unless a different location is explicitly specified:

  • For Windows:
C:\Python2.x
C:\Python3.x
  • For macOS:
/usr/bin/python
  • For Linux:
/usr/bin/python
A Windows installation will require you to set the environment variables with the correct path.

To check the version of Python installed, you can run the following code:

import sys
print("Python version: {}".format(sys.version))

Python interactive development environment

The top Python interactive development environments (IDEs) commonly used for developing Python code are as follows:

  • Spyder
  • Rodeo
  • Pycharm
  • Jupyter

For development purposes, we will be using the IPython Jupyter Notebook due to its user-friendly interactive environment. Jupyter allows code to be shared easily and supports markdown. Jupyter is browser-based, thus supporting different types of imports, exports, and parallel computation.

Jupyter Notebook installation

To download Jupyter Notebook, it is recommended that you:

  • First download Python, either Python 2.x or Python 3.x, as a prerequisite for Jupyter Notebook installation.
  • Once the Python installation is complete, download Anaconda from the following link, depending on the operating system where the installation is being done. Anaconda is a package/environment manager for Python. By default, Anaconda comes with 150 packages, and another 250 open source packages can be installed along with it:
https://www.anaconda.com/download/
  • Jupyter Notebook can also be installed by running the following commands:
pip install --upgrade pip
pip3 install jupyter

If the user is on Python 2, pip3 needs to be replaced by pip.

After installation, you can just type jupyter notebook to run it. This opens Jupyter Notebook in the primary browser. Alternatively, you can open Jupyter from Anaconda Navigator. The following screenshot shows the Jupyter page:

Python packages

In this section, we discuss packages that form the backbone for Python's machine learning architecture.

NumPy

NumPy is a free Python package that is used to perform numerical computation tasks. NumPy is absolutely essential when doing statistical analysis or machine learning. NumPy contains sophisticated functions for linear algebra, Fourier transforms, and other numerical analysis. NumPy can be installed by running the following:

pip install numpy

To install this through Jupyter, use the following:

import sys
!{sys.executable} -m pip install numpy

SciPy

SciPy is a Python package that is built on top of the NumPy array object. SciPy contains an array of functions, such as integration, linear algebra, and signal and image processing functionality. Like NumPy, it can be installed with pip. NumPy and SciPy are generally used together.

To check the version of SciPy installed on your system, you can run the following code:

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

Scikit-learn

Scikit-learn is a free Python package that is also written in Python. Scikit-learn provides a machine learning library that supports several popular machine learning algorithms for classification, clustering, regression, and so on. Scikit-learn is very helpful for machine learning novices. Scikit-learn can be easily installed by running the following command:

pip install scikit-learn

To check whether the package is installed successfully, conduct a test using the following piece of code in Jupyter Notebook or the Python command line:

import sklearn

If the preceding import throws no errors, then the package has been installed successfully.

Scikit-learn requires two dependent packages, NumPy and SciPy, to be installed; we discussed their functionalities in the preceding sections. Scikit-learn comes with a few inbuilt datasets, like:

  • Iris data set
  • Breast cancer dataset
  • Diabetes dataset
  • The Boston house prices dataset and others

Other public datasets from libsvm and svmlight can also be loaded, as follows:

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

A sample script that uses scikit-learn to load data is as follows:

from sklearn.datasets import load_boston
boston = load_boston()

pandas

pandas is an open source package that provides easy-to-use data structures, most notably the data frame. These are powerful for data analysis and are used in statistical learning. A pandas data frame allows different data types to be stored alongside each other, unlike a NumPy array, where the same data type needs to be stored throughout.
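The following is a quick sketch of a data frame that holds string, integer, and float columns side by side (the column names and values are invented for illustration):

# A pandas DataFrame can mix data types column by column
import pandas as pd

events = pd.DataFrame({
    "source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "failed_logins": [1, 14, 0],
    "risk_score": [0.12, 0.87, 0.05]})

print(events.dtypes)                         # one dtype per column
print(events[events["failed_logins"] > 5])   # simple row filtering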

Matplotlib

Matplotlib is a package used for plotting and graphing purposes. It helps create visualizations in 2D space. Matplotlib can be used from the Jupyter Notebook, from web application servers, or from other user interfaces.

Let's plot a small sample of the iris data that is available in the sklearn library. The data has 150 data samples and the dimensionality is 4.

We import the sklearn and matplotlib libraries in our Python environment and check the data and the features, as shown in the following code:

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # gives the data size and dimensions
print(iris.feature_names)

The output can be seen as follows:

Output:
(150, 4) ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

We extract the first two dimensions and plot it on an X by Y plot as follows:

X = iris.data[:, :2]  # plotting the first two dimensions
y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2, figsize=(8, 6))
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()

We get the following plot:

Mongodb with Python

MongoDB can store unstructured data, is fast, and is capable of retrieving large amounts of data in a short time. MongoDB uses a JSON-like format to store data as documents. Thus, any data without a common schema can be stored. We will be using MongoDB in the next few chapters because of its distributed nature. MongoDB achieves fault tolerance by sharding the data across multiple servers. MongoDB generates a primary key as you store data.

Installing MongoDB

To install MongoDB on your Windows, macOS, or Linux systems, run the following steps:

  1. For a Windows or macOS system, download MongoDB from the download center at the following link:
https://www.mongodb.com/download-center
  2. On a Linux system, you can install it with:
sudo apt-get install -y mongodb-org
  3. MongoDB requires a separate data directory of its own where it can extract and store its contents upon installation.
  4. Finally, you can start the MongoDB service.

PyMongo

To use MongoDB from within Python, we will be using the PyMongo library. PyMongo contains tools that help you work with MongoDB. There are libraries that act as object data mappers for MongoDB; however, PyMongo is the recommended one.

To install PyMongo, you can run the following:

python -m pip install pymongo

Alternatively, you can use the following:

import sys
!{sys.executable} -m pip install pymongo

Finally, you can get started with using MongoDB by importing the PyMongo library and then setting up a connection with MongoDB, as shown in the following code:

import pymongo
connection = pymongo.MongoClient()

On creating a successful connection with MongoDB, you can continue with different operations, like listing the databases present and so on, as seen in the following code:

connection.database_names() #list databases in MongoDB

Each database in MongoDB contains data in containers called collections. You can retrieve data from these collections to pursue your desired operation, as follows:

selected_DB = connection["database_name"]
selected_DB.collection_names() # list all collections within the selected database 
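Beyond listing collections, you can also write and read documents. The following is a minimal sketch; the database and collection names are placeholders, and it assumes a MongoDB service is running locally:

# Inserting and querying documents with PyMongo (placeholder names)
import pymongo

connection = pymongo.MongoClient()
collection = connection["security_logs"]["login_events"]   # hypothetical names

collection.insert_one({"user": "alice", "failed_logins": 3})
for event in collection.find({"failed_logins": {"$gt": 1}}):   # query by field
    print(event)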

Setting up the development and testing environment

In this section we will discuss how to set up a machine learning environment. This starts with a use case that we are trying to solve, and once we have shortlisted the problem, we select the IDE where we will do the end-to-end coding.

We need to procure a dataset and divide the data into testing and training data. Finally, we finish the setup of the environment by importing the ideal packages that are required for computation and visualization.

Since we deal with cybersecurity-related machine learning use cases for the rest of this book, we choose our use case here from a different sector. We will go with the most generic example, that is, prediction of stock prices. We use a standard dataset with xx points and yy dimensions.

Use case

We come up with a use case that predicts an outcome from a given few features by creating a predictor that ingests a bunch of parameters and uses these to make the prediction.

Data

We can use multiple data sources, like audio, video, or textual data, to make such a prediction. However, we stick to a single textual data type. We use scikit-learn's default diabetes dataset to come up with a single machine learning model, that is, regression, for making the predictions and performing error analysis.

Code

We will use open source code available from the scikit-learn site for this case study. The link to the code is as follows:

http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

We will import the following packages:

  • matplotlib
  • numPy
  • sklearn

Since we will be using regression for our analysis, we import the linear_model, mean_squared_error, and r2_score libraries, as seen in the following code:

print(__doc__)
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

We import the diabetes data and perform the following actions:

  • List the dimension and size
  • List the features

The associated code for the preceding actions is:

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
print(diabetes.data.shape)   # gives the data size and dimensions
print(diabetes.feature_names)
print(diabetes.DESCR)

The data has 442 rows of data and 10 features. The features are:

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

To train the model we use a single feature, that is, the bmi of the individual, as shown:

# Use only one feature (bmi is at column index 2)
diabetes_X = diabetes.data[:, np.newaxis, 2]

Earlier in the chapter, we discussed the fact that selecting a proper training and testing set is integral. The last 20 items are kept for testing in our case, as shown in the following code:

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]   # everything except the last twenty items
diabetes_X_test = diabetes_X[-20:]    # the last twenty items in the array

Further we also split the targets into training and testing sets as shown:

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]   # everything except the last twenty items
diabetes_y_test = diabetes.target[-20:]    # the last twenty items

Next, we perform regression on this data to generate results. We use the training data to fit the model and then use the testing dataset to make predictions, as seen in the following code:

# Create linear regression object
regr = linear_model.LinearRegression()
#Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

We compute the goodness of fit by checking how large or small the errors are, computing the MSE and the variance score as follows:

# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

Finally, we plot the prediction using the Matplotlib graph, as follows:

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

The output graph looks as follows:
