You're reading from Python Machine Learning by Example Build intelligent systems using Python, TensorFlow 2, PyTorch, and scikit-learn

Product type Paperback

Published in Oct 2020

Publisher Packt

ISBN-13 9781800209718

Length 526 pages

Edition 3rd Edition

Languages

Python

Tools

PyTorch

Concepts

Machine Learning

Author (1):

Yuxi (Hayden) Liu

View More author details

Table of Contents (17) Chapters

Preface

1. Getting Started with Machine Learning and Python

2. Building a Movie Recommendation Engine with Naïve Bayes FREE CHAPTER

3. Recognizing Faces with Support Vector Machine

4. Predicting Online Ad Click-Through with Tree-Based Algorithms

5. Predicting Online Ad Click-Through with Logistic Regression

6. Scaling Up Prediction to Terabyte Click Logs

7. Predicting Stock Prices with Regression Algorithms

8. Predicting Stock Prices with Artificial Neural Networks

9. Mining the 20 Newsgroups Dataset with Text Analysis Techniques

10. Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling

11. Machine Learning Best Practices

12. Categorizing Images of Clothing with Convolutional Neural Networks

13. Making Predictions with Sequences Using Recurrent Neural Networks

14. Making Decisions in Complex Environments with Reinforcement Learning

15. Other Books You May Enjoy

16. Index

Installing software and setting up

As the book title says, Python is the language we will use to implement all machine learning algorithms and techniques throughout the entire book. We will also exploit many popular Python packages and tools such as NumPy, SciPy, TensorFlow, and scikit-learn. By the end of this kick-off chapter, make sure you set up the tools and working environment properly, even if you are already an expert in Python or might be familiar with some of those tools.

Setting up Python and environments

We will be using Python 3 in this book. As you may know, Python 2 will no longer be supported after 2020, so starting with or switching to Python 3 is strongly recommended. Trust me, the transition is pretty smooth. But if you're stuck with Python 2, you still should be able to modify the codes to work for you. The Anaconda Python 3 distribution is one of the best options for data science and machine learning practitioners.

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution (https://docs.anaconda.com/anaconda/packages/pkg-docs/, depending on your OS, or version 3.7, 3.6, or 2.7) includes more than 600 Python packages (as of 2020), which makes it very convenient. For casual users, the Miniconda (https://conda.io/miniconda.html) distribution may be the better choice. Miniconda contains the conda package manager and Python. Obviously, Miniconda takes much less disk space than Anaconda.

The procedures to install Anaconda and Miniconda are similar. You can follow the instructions from https://docs.conda.io/projects/conda/en/latest/user-guide/install/. First, you have to download the appropriate installer for your OS and Python version, as follows:

Figure 1.13: Installation entry based on your OS

Follow the steps listed in your OS. You can choose between a GUI and a CLI. I personally find the latter easier.

I was able to use the Python 3 installer, although the Python version in my system was 2.7 at the time I installed it. This is possible since Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. Similarly, the Miniconda installer installs a miniconda directory in your home directory.

Feel free to play around with it after you set it up. One way to verify that you have set up Anaconda properly is by entering the following command line in your terminal on Linux/Mac or Command Prompt on Windows (from now on, we will just mention terminal):

python

The preceding command line will display your Python running environment, as shown in the following screenshot:

Figure 1.14: Screenshot after running "python" in the terminal

If this isn't what you're seeing, please check the system path or the path Python is running from.

At the end of this section, I want to emphasize the reasons why Python is the most popular language for machine learning and data science. First of all, Python is famous for its high readability and simplicity, which makes it easy to build machine learning models. We spend less time in worrying about getting the right syntax and compilation and, as a result, have more time to find the right machine learning solution. Second, we have an extensive selection of Python libraries and frameworks for machine learning:

Data analysis	NumPy, SciPy, pandas
Data visualization	Matplotlib, Seaborn
Modeling	scikit-learn, TensorFlow, Keras

Table 1.5: Popular Python libraries for machine learning

The next step involves setting up some of these packages that we will use throughout this book.

Installing the main Python packages

For most projects in this book, we will be using NumPy (http://www.numpy.org/), scikit-learn (http://scikit-learn.org/stable/), and TensorFlow (https://www.tensorflow.org/). In the sections that follow, we will cover the installation of several Python packages that we will be mainly using in this book.

NumPy

NumPy is the fundamental package for machine learning with Python. It offers powerful tools including the following:

The N-dimensional array ndarray class and several subclasses representing matrices and arrays
Various sophisticated array functions
Useful linear algebra capabilities

Installation instructions for NumPy can be found at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively, an easier method involves installing it with pip in the command line as follows:

pip install numpy

To install conda for Anaconda users, run the following command line:

conda install numpy

A quick way to verify your installation is to import it into the shell as follows:

>>> import numpy

It has installed correctly if no error message is visible.

SciPy

In machine learning, we mainly use NumPy arrays to store data vectors or matrices composed of feature vectors. SciPy (https://www.scipy.org/scipylib/index.html) uses NumPy arrays and offers a variety of scientific and mathematical functions. Installing SciPy in the terminal is similar, again as follows:

pip install scipy

Pandas

We also use the pandas library (https://pandas.pydata.org/) for data wrangling later in this book. The best way to get pandas is via pip or conda:

conda install pandas

Scikit-learn

The scikit-learn library is a Python machine learning package optimized for performance as a lot of the code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy. Scikit-learn requires both NumPy and SciPy to be installed. As the installation guide in http://scikit-learn.org/stable/install.html states, the easiest way to install scikit-learn is to use pip or conda as follows:

pip install -U scikit-learn

TensorFlow

TensorFlow is a Python-friendly open source library invented by the Google Brain team for high-performance numerical computation. It makes machine learning faster and deep learning easier with the Python-based convenient frontend API and high-performance C++-based backend execution. Plus, it allows easy deployment of computation across CPUs and GPUs, which empowers expensive and large-scale machine learning. In this book, we will focus on CPU as our computation platform. Hence, according to https://www.tensorflow.org/install/, installing TensorFlow 2 is done via the following command line:

pip install tensorflow

There are many other packages we will be using intensively, for example, Matplotlib for plotting and visualization, Seaborn for visualization, NLTK for natural language processing, PySpark for large-scale machine learning, and PyTorch for reinforcement learning. We will provide installation details for any package when we first encounter it in this book.

Introducing TensorFlow 2

TensorFlow provides us with an end-to-end scalable platform for implementing and deploying machine learning algorithms. TensorFlow 2 was largely redesigned from its first mature version 1.0 and was released at the end of 2019.

TensorFlow has been widely known for its deep learning modules. However, its most powerful point is computation graphs, which algorithms are built on. Basically, a computation graph is used to convey relationships between the input and the output via tensors. For instance, if we want to evaluate a linear relationship, y = 3 * a + 2 * b, we can represent it in the following computation graph:

Figure 1.15: Computation graph for a y = 3 * a + 2 * b machine

Here, a and b are the input tensors, c and d are the intermediate tensors, and y is the output.

You can think of a computation graph as a network of nodes connected by edges. Each node is a tensor and each edge is an operation or function that takes its input node and returns a value to its output node. To train a machine learning model, TensorFlow builds the computation graph and computes the gradients accordingly (gradients are vectors providing the steepest direction where an optimal solution is reached). In the upcoming chapters, you will see some examples of training machine learning models using TensorFlow.

At the end, we highly recommend you go through https://www.tensorflow.org/guide/data if you are interested in exploring more about TensorFlow and computation graphs.