Using Python for machine learning
Python is one of the most popular programming languages for data science, and thanks to its very active developer and open-source community, a large number of useful libraries for scientific computing and machine learning have been developed.
Although the performance of interpreted languages, such as Python, for computation-intensive tasks is inferior to that of lower-level programming languages, extension libraries such as NumPy and SciPy have been developed that build upon lower-layer Fortran and C implementations for fast vectorized operations on multidimensional arrays.
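As a rough illustration of the vectorization mentioned above, the following sketch computes the same dot product once with a plain Python loop and once with NumPy. The array size, the random seed, and the function names are arbitrary choices made for this example, and the measured timings will depend on your machine.

```python
import timeit

import numpy as np

# Illustrative data: the size and seed are arbitrary choices for this sketch
rng = np.random.default_rng(seed=1)
x = rng.random(100_000)
w = rng.random(100_000)


def dot_loop(a, b):
    """Dot product via an explicit Python loop (runs in the interpreter)."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        total += a_i * b_i
    return total


def dot_vectorized(a, b):
    """Dot product delegated to NumPy's compiled routines."""
    return float(np.dot(a, b))


# Time five repetitions of each variant
loop_time = timeit.timeit(lambda: dot_loop(x, w), number=5)
vec_time = timeit.timeit(lambda: dot_vectorized(x, w), number=5)
print(f"Python loop: {loop_time:.4f} s")
print(f"NumPy:       {vec_time:.4f} s")
```

Both functions return the same value up to floating-point rounding; only the place where the loop runs (interpreter versus compiled code) differs.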
For machine learning programming tasks, we will mostly refer to the scikit-learn library, which is currently one of the most popular and accessible open-source machine learning libraries. In the later chapters, when we focus on a subfield of machine learning called deep learning, we will use the latest version of the PyTorch library, which specializes in training so-called deep neural network models very efficiently by utilizing graphics cards.
Installing Python and packages from the Python Package Index
Python is available for all three major operating systems—Microsoft Windows, macOS, and Linux—and the installer, as well as the documentation, can be downloaded from the official Python website: https://www.python.org.
The code examples provided in this book have been written for and tested in Python 3.9, and we generally recommend that you use the most recent version of Python 3 that is available. Some of the code may also be compatible with Python 2.7, but as the official support for Python 2.7 ended on January 1, 2020, and the majority of open-source libraries have already stopped supporting Python 2.7 (https://python3statement.org), we strongly advise that you use Python 3.9 or newer.
You can check your Python version by executing
python --version
or
python3 --version
in your terminal (or PowerShell if you are using Windows).
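The same check can also be made from within Python itself. The following is a minimal sketch that assumes Python 3.9 as the minimum version, in line with the recommendation above; the function name is made up for this example.

```python
import sys

# Minimum version assumed here: the Python 3.9 release used for this book
MIN_VERSION = (3, 9)


def python_is_recent_enough(min_version=MIN_VERSION):
    """Return True if the running interpreter is at least min_version."""
    return sys.version_info[:2] >= min_version


print("Running Python", sys.version.split()[0])
if not python_is_recent_enough():
    print("Consider upgrading to Python 3.9 or newer.")
```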
The additional packages that we will be using throughout this book can be installed via the pip installer program, which has been bundled with the standard Python installers since Python 3.4. More information about pip can be found at https://docs.python.org/3/installing/index.html.
After we have successfully installed Python, we can execute pip from the terminal to install additional Python packages:
pip install SomePackage
Already installed packages can be updated via the --upgrade flag:
pip install SomePackage --upgrade
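If several Python installations exist on one machine, a bare pip command may belong to a different interpreter than the one you intend to use. A common way around this is to invoke pip as a module of a specific interpreter. The sketch below only builds and prints such a command; the helper names are hypothetical, and SomePackage remains the placeholder from the text.

```python
import subprocess
import sys


def build_pip_command(package, upgrade=False):
    """Build a pip command tied to the interpreter that runs this script."""
    command = [sys.executable, "-m", "pip", "install", package]
    if upgrade:
        command.append("--upgrade")
    return command


def pip_install(package, upgrade=False):
    """Run the command; raises CalledProcessError if pip reports a failure."""
    subprocess.run(build_pip_command(package, upgrade), check=True)


# "SomePackage" is the placeholder name used in the text, not a real package,
# so the command is only printed here and not executed.
print(" ".join(build_pip_command("SomePackage", upgrade=True)))
```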
Using the Anaconda Python distribution and package manager
A highly recommended open-source package management system for installing Python in scientific computing contexts is conda, developed by Anaconda, Inc. (formerly Continuum Analytics). Conda is free and licensed under a permissive open-source license. Its goal is to help with the installation and version management of Python packages for data science, math, and engineering across different operating systems. Conda comes in different flavors, namely Anaconda, Miniconda, and Miniforge:
- Anaconda comes with many scientific computing packages pre-installed. The Anaconda installer can be downloaded at https://docs.anaconda.com/anaconda/install/, and an Anaconda quick start guide is available at https://docs.anaconda.com/anaconda/user-guide/getting-started/.
- Miniconda is a leaner alternative to Anaconda (https://docs.conda.io/en/latest/miniconda.html). Essentially, it is similar to Anaconda but without any packages pre-installed, which many people (including the authors) prefer.
- Miniforge is similar to Miniconda but community-maintained and uses a different package repository (conda-forge) from Miniconda and Anaconda. We found that Miniforge is a great alternative to Miniconda. Download and installation instructions can be found in the GitHub repository at https://github.com/conda-forge/miniforge.
After successfully installing conda through Anaconda, Miniconda, or Miniforge, we can install new Python packages using the following command:
conda install SomePackage
Existing packages can be updated using the following command:
conda update SomePackage
Packages that are not available through the default conda channel might be available via the community-supported conda-forge project (https://conda-forge.org), which can be specified via the --channel conda-forge flag. For example:
conda install SomePackage --channel conda-forge
Packages that are not available through the default conda channel or conda-forge can be installed via pip as explained earlier. For example:
pip install SomePackage
Packages for scientific computing, data science, and machine learning
Throughout the first half of this book, we will mainly use NumPy’s multidimensional arrays to store and manipulate data. Occasionally, we will make use of pandas, which is a library built on top of NumPy that provides additional higher-level data manipulation tools that make working with tabular data even more convenient. To augment your learning experience and to visualize quantitative data, which is often extremely useful for making sense of it, we will use the highly customizable Matplotlib library.
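To give a first impression of how NumPy arrays and pandas relate, the following small sketch stores a few made-up measurements in a two-dimensional array and then wraps the same data in a pandas DataFrame. The numbers and column names are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements: 3 samples (rows) x 2 features (columns)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

print(X.shape)         # dimensions of the array
print(X.mean(axis=0))  # column-wise means

# Hypothetical column names, chosen for this example only
df = pd.DataFrame(X, columns=["length", "width"])
print(df.describe())

# The tabular data can be converted back to a NumPy array
X_again = df.to_numpy()
```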
The main machine learning library used in this book is scikit-learn (Chapters 3 to 11). Chapter 12, Parallelizing Neural Network Training with PyTorch, will then introduce the PyTorch library for deep learning.
The version numbers of the major Python packages that were used to write this book are mentioned in the following list. Please make sure that the version numbers of your installed packages are, ideally, equal to these version numbers to ensure that the code examples run correctly:
- NumPy 1.21.2
- SciPy 1.7.0
- scikit-learn 1.0
- Matplotlib 3.4.3
- pandas 1.3.2
After installing these packages, you can double-check the installed version by importing the package in Python and accessing its __version__ attribute, for example:
>>> import numpy
>>> numpy.__version__
'1.21.2'
For your convenience, we included a python-environment-check.py script in this book’s complimentary code repository at https://github.com/rasbt/machine-learning-book so that you can check both your Python version and the package versions by executing this script.
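To illustrate the idea behind such a check, here is a minimal sketch that compares installed versions against the list above. It is not the script from the repository; the function names are made up for this example, and it relies on the importlib.metadata module from the standard library (available in Python 3.8 and newer).

```python
from importlib import metadata

# Versions listed in this section (keys are the distribution names on PyPI)
EXPECTED_VERSIONS = {
    "numpy": "1.21.2",
    "scipy": "1.7.0",
    "scikit-learn": "1.0",
    "matplotlib": "3.4.3",
    "pandas": "1.3.2",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(expected=EXPECTED_VERSIONS):
    """Return one status line per package in the expected mapping."""
    lines = []
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            lines.append(f"{package}: not installed (book used {wanted})")
        elif found == wanted:
            lines.append(f"{package} {found}: matches")
        else:
            lines.append(f"{package} {found}: differs (book used {wanted})")
    return lines


print("\n".join(report()))
```

A differing version is not necessarily a problem; the report is only meant as a first thing to look at when an example behaves unexpectedly.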
Certain chapters will require additional packages and will provide information about the installations. For instance, do not worry about installing PyTorch at this point. Chapter 12 will provide tips and instructions when you need them.
If you encounter errors even though your code matches the code in the chapter exactly, we recommend you first check the version numbers of the underlying packages before spending more time on debugging or reaching out to the publisher or authors. Sometimes, newer versions of libraries introduce backward-incompatible changes that could explain these errors.
If you do not want to change the package versions in your main Python installation, we recommend using a virtual environment for installing the packages used in this book. If you use Python without the conda manager, you can use the venv module to create a new virtual environment. For example, you can create and activate the virtual environment via the following two commands:
python3 -m venv /Users/sebastian/Desktop/pyml-book
source /Users/sebastian/Desktop/pyml-book/bin/activate
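Note that you need to activate the virtual environment every time you open a new terminal or PowerShell. The source command shown above applies to macOS and Linux; in PowerShell on Windows, the environment is activated by running the Scripts\Activate.ps1 script inside the environment directory instead. You can find more information about venv at https://docs.python.org/3/library/venv.html.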
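Since it is easy to forget the activation step, it can help to check from within Python whether a venv environment is in use. The following sketch relies on the fact that sys.prefix and sys.base_prefix differ inside such an environment; the function name is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True if the interpreter runs inside a venv environment."""
    # In a venv, sys.prefix points to the environment directory, while
    # sys.base_prefix still points to the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```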
If you are using Anaconda with the conda package manager, you can create and activate a virtual environment as follows:
conda create -n pyml python=3.9
conda activate pyml
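To confirm from within Python which conda environment is active, one option is to read the CONDA_DEFAULT_ENV environment variable. The sketch below assumes that conda's activation sets this variable, which may depend on how conda is configured on your system; the function name is made up for this example.

```python
import os


def active_conda_environment(environ=os.environ):
    """Return the active conda environment name, or None if unset."""
    # Assumption: "conda activate" sets CONDA_DEFAULT_ENV to the
    # name of the activated environment (for example, "pyml").
    return environ.get("CONDA_DEFAULT_ENV") or None


name = active_conda_environment()
if name:
    print(f"Active conda environment: {name}")
else:
    print("No conda environment detected")
```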