For most projects in this book we need scikit-learn (refer to, http://scikit-learn.org/stable/install.html) and matplotlib (refer to, http://matplotlib.org/users/installing.html). Both packages require NumPy, but we also need SciPy for sparse matrices as mentioned before. The scikit-learn library is a machine learning package, which is optimized for performance as a lot of the code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy. There are various ways to speed up the code, however they are out of scope for this book, so if you want to know more, please consult the documentation.
matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization. Seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. matplotlib and seaborn are mostly useful for the visualization for small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array, that can be one or multi-dimensional. This class also has several subclasses representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling.
In this book, we will use Python 3. As you may know, Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2 you should still be able to modify the example code to work for you. In my opinion, the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python.
The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version. Sometimes you can choose between a GUI and a command line installer. I used the Python 3 installer, although my system Python version is 2.7. This is possible since Anaconda comes with its own Python. On my machine the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html.
Alternatively install NumPy with pip as follows:
$ [sudo] pip install numpy
The command for Anaconda users is:
$ conda install numpy
To install the other dependencies, substitute NumPy by the appropriate package. Please read the documentation carefully, not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html.