Getting started
Assuming that you have already installed Python (everything at least as recent as 2.7 should be fine), we need to install NumPy and SciPy for numerical operations as well as Matplotlib for visualization.
Introduction to NumPy, SciPy, and Matplotlib
Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important because even the most advanced learning algorithm will be of no help to us if it never finishes. This may be simply because accessing the data is too slow. Or maybe its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python even in highly computation-intensive areas.
The answer is that in Python, it is very easy to offload number-crunching tasks to the lower layer in the form of a C or Fortran extension. That is exactly what NumPy and SciPy do (http://scipy.org/install.html). In this tandem, NumPy provides the support of highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, Matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.
Installing Python
Luckily, for all the major operating systems, namely Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and Matplotlib. If you are unsure about the installation process, you might want to install Enthought Python Distribution (https://www.enthought.com/products/epd_free.php) or Python(x,y) (http://code.google.com/p/pythonxy/wiki/Downloads), which come with all the earlier mentioned packages included.
Chewing data efficiently with NumPy and intelligently with SciPy
Let us quickly walk through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous Matplotlib package.
You will find more interesting examples of what NumPy can offer at http://www.scipy.org/Tentative_NumPy_Tutorial.
You will also find the book NumPy Beginner's Guide - Second Edition, Ivan Idris, Packt Publishing very valuable. Additional tutorial style guides are at http://scipy-lectures.github.com; you may also visit the official SciPy tutorial at http://docs.scipy.org/doc/scipy/reference/tutorial.
In this book, we will use NumPy Version 1.6.2 and SciPy Version 0.11.0.
Learning NumPy
So let us import NumPy and play a bit with it. For that, we need to start the Python interactive shell.
>>> import numpy
>>> numpy.version.full_version
'1.6.2'
As we do not want to pollute our namespace, we certainly should not do the following:
>>> from numpy import *
The numpy.array function would then shadow the array module that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
We just created an array in a similar way to how we would create a list in Python. However, NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprises so far.
We can now transform this array into a 2D matrix.
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
The fun starts when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible.
>>> b[1][0]=77
>>> b
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> a
array([ 0,  1, 77,  3,  4,  5])
In this case, we have modified the value 2 to 77 in b, and we can immediately see the same change reflected in a as well. Keep that in mind whenever you need a true copy.
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> c[0][0] = -99
>>> a
array([ 0,  1, 77,  3,  4,  5])
>>> c
array([[-99,   1],
       [ 77,   3],
       [  4,   5]])
Here, c and a are totally independent copies.
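The difference between a view and a copy can also be checked programmatically. The following is a minimal sketch, assuming a more recent NumPy than the 1.6.2 used in this book, since it relies on np.shares_memory; the names view and copy are chosen here for illustration only.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])

view = a.reshape((3, 2))         # reshape reuses the buffer of a
copy = a.reshape((3, 2)).copy()  # copy() allocates a buffer of its own

# np.shares_memory reports whether two arrays overlap in memory
# (assumption: a NumPy release that provides this function).
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

view[1, 0] = 77    # writes through to a
copy[0, 0] = -99   # leaves a untouched
print(a)           # [ 0  1 77  3  4  5]
```

Whenever an array is handed to code that modifies it in place, a check like this tells us whether the original data is at risk.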
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements.
>>> a*2
array([  0,   2, 154,   6,   8,  10])
>>> a**2
array([   0,    1, 5929,    9,   16,   25])

Contrast that to ordinary Python lists:

>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays we sacrifice the agility Python lists offer. Simple operations like adding or removing elements are more involved for NumPy arrays, because an array has a fixed size once it is created. Luckily, we have both at our disposal, and we will use the right one for the task at hand.
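To make this trade-off concrete, here is a small sketch contrasting a list with an array. It uses np.append and np.delete, which both return a new array rather than changing the original; the variable names are illustrative only.

```python
import numpy as np

values = [0, 1, 2]
values.append(3)              # a list grows in place

arr = np.array([0, 1, 2])
grown = np.append(arr, 3)     # allocates and returns a new, longer array
shrunk = np.delete(grown, 0)  # likewise returns a new array without index 0

print(values)  # [0, 1, 2, 3]
print(arr)     # [0 1 2]   (the original array is unchanged)
print(grown)   # [0 1 2 3]
print(shrunk)  # [1 2 3]
```

Because every such call copies the data, growing an array element by element inside a loop is usually better done by collecting the values in a list first and converting once at the end.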
Indexing
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows us to use arrays themselves as indices.
>>> a[np.array([2,3,4])]
array([77,  3,  4])
Conditions, too, are propagated to the individual elements, which gives us a very convenient way to access our data.
>>> a>4
array([False, False,  True, False, False,  True], dtype=bool)
>>> a[a>4]
array([77,  5])
This can also be used to trim outliers.
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is a special clip function for it, clipping the values at both ends of an interval with one function call as follows:
>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
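One difference between the two approaches is easy to miss: assigning through a Boolean mask changes the array in place, whereas clip returns a new array. A short sketch, starting again from the array that still contains the outlier 77 (the names mask and clipped are illustrative):

```python
import numpy as np

a = np.array([0, 1, 77, 3, 4, 5])

mask = a > 4             # Boolean array, one entry per element
clipped = a.clip(0, 4)   # new array; a itself is not modified

print(mask.sum())        # 2 elements are larger than 4
print(clipped)           # [0 1 4 3 4 4]
print(a)                 # [ 0  1 77  3  4  5]

a[mask] = 4              # in-place assignment through the mask
print(np.array_equal(a, clipped))  # True
```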
Handling non-existing values
The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. It will most likely contain invalid values, which we will mark as not being a real number using numpy.NAN as follows:
>>> c = np.array([1, 2, np.NAN, 3, 4]) # let's pretend we have read this from a text file
>>> c
array([  1.,   2.,  nan,   3.,   4.])
>>> np.isnan(c)
array([False, False,  True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1.,  2.,  3.,  4.])
>>> np.mean(c[~np.isnan(c)])
2.5
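The filtering step matters because a single NaN otherwise propagates through the whole computation. The sketch below shows this, and also uses np.nanmean, which is an assumption about the installed version: it is available in NumPy releases newer than the 1.6.2 used in this book. The lowercase spelling np.nan is used here as it works across NumPy versions.

```python
import numpy as np

c = np.array([1, 2, np.nan, 3, 4])

print(np.mean(c))                 # nan: one invalid value spoils the mean
print(np.mean(c[~np.isnan(c)]))   # 2.5, after masking out the NaN
print(np.nanmean(c))              # 2.5, NaN-aware mean (newer NumPy only)
```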
Comparing runtime behaviors
Let us compare the runtime behavior of NumPy with normal Python lists. In the following code, we will calculate the sum of the squares of all integers from 0 to 999 and see how much time the calculation takes. We do it 10000 times and report the total time so that our measurement is accurate enough.
import timeit

normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print("Normal Python: %f sec"%normal_py_sec)
print("Naive NumPy: %f sec"%naive_np_sec)
print("Good NumPy: %f sec"%good_np_sec)

Normal Python: 1.157467 sec
Naive NumPy: 4.061293 sec
Good NumPy: 0.033419 sec
We make two interesting observations. First, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we would expect it to be much faster, given that it is written as a C extension. One reason for this is that accessing individual elements from Python itself is rather costly. Second, only when we are able to apply algorithms inside the optimized extension code do we get speed improvements, and tremendous ones at that: using the dot() function of NumPy, we are more than 25 times faster than plain Python. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
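The listing above relies on xrange, which exists only in Python 2. The following sketch repeats the comparison under the assumption of a Python 3 interpreter, where range takes its place. It first checks that all three variants compute the same value and then times them; the repetition count of 1000 and the dictionary name timings are arbitrary choices, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

na = np.arange(1000)

# All three variants must agree before comparing their speed.
python_result = sum(x * x for x in range(1000))  # pure Python loop
naive_result = sum(na * na)   # built-in sum iterating over array elements
dot_result = na.dot(na)       # the whole loop runs inside NumPy
assert python_result == naive_result == dot_result

number = 1000  # repetitions per measurement
timings = {
    "Normal Python": timeit.timeit(
        lambda: sum(x * x for x in range(1000)), number=number),
    "Naive NumPy": timeit.timeit(lambda: sum(na * na), number=number),
    "Good NumPy": timeit.timeit(lambda: na.dot(na), number=number),
}

for label, seconds in timings.items():
    print("%s: %f sec" % (label, seconds))
```

The naive variant is slow because the built-in sum pulls each element out of the array as a separate Python object, which is exactly the costly element access described above.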
However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one datatype.
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')
If we try to use elements of different types, NumPy will do its best to coerce them to the most reasonable common datatype:
>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='|S8')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, set([1, 2, 3])], dtype=object)
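The same coercion happens, more quietly, when integers and floats are mixed. The sketch below illustrates this and shows astype as one way to request a datatype explicitly; the variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # the integers are upcast to floats

print(mixed.dtype)              # float64

# astype returns a converted copy; the source array keeps its datatype.
as_float = ints.astype(np.float64)
truncated = mixed.astype(np.int64)  # fractional parts are cut off

print(as_float.dtype)           # float64
print(truncated)                # [1 2 3]
```

Note that converting floats to integers silently discards the fractional part, so it is worth checking the dtype of an array before relying on its contents.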
Learning SciPy
On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transforms, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.
For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function; for example:
>>> import scipy, numpy
>>> scipy.version.full_version
'0.11.0'
>>> scipy.dot is numpy.dot
True
The diverse algorithms are grouped into the following toolboxes:
SciPy package | Functionality |
---|---|
cluster | Hierarchical clustering (cluster.hierarchy); vector quantization / K-Means (cluster.vq) |
constants | Physical and mathematical constants; conversion methods |
fftpack | Discrete Fourier transform algorithms |
integrate | Integration routines |
interpolate | Interpolation (linear, cubic, and so on) |
io | Data input and output |
linalg | Linear algebra routines using the optimized BLAS and LAPACK libraries |
maxentropy | Functions for fitting maximum entropy models |
ndimage | n-dimensional image package |
odr | Orthogonal distance regression |
optimize | Optimization (finding minima and roots) |
signal | Signal processing |
sparse | Sparse matrices |
spatial | Spatial data structures and algorithms |
special | Special mathematical functions such as Bessel or Jacobian |
stats | Statistics toolkit |
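As a brief taste of how these toolboxes are used, the following sketch imports two of them explicitly, the linear algebra and the optimization packages, and calls one routine from each. The example problems (a 2x2 linear system and the root of x*x - 2) are made up for illustration.

```python
import numpy as np
from scipy import linalg, optimize

# Linear algebra: solve the system 2x + y = 3, x + 3y = 5.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
solution = linalg.solve(A, rhs)
print(solution)   # approximately x = 0.8, y = 1.4

# Optimization: find the root of x*x - 2 in the interval [0, 2].
root = optimize.brentq(lambda x: x * x - 2.0, 0.0, 2.0)
print(root)       # approximately the square root of 2
```

Importing the toolbox by name, as done here, makes it obvious at the call site which part of SciPy a routine comes from.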
The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in later chapters.