Python for computational statistics and data science
Python is a widely used, general-purpose programming language and a good starting point for learning to program. It has a simple syntax and a large collection of libraries and APIs that we can use to extend our programs.
To use Python on your computer, download and install it from https://www.python.org/downloads/ if your OS does not already include it. After the installation completes, you can start the Python interpreter from Terminal, or from the Command Prompt on Windows, by typing the following command:
$ python
Tip
The $ sign represents the shell prompt and is not part of the command. Just type python on Terminal. This applies to Python 2.x.
Once you have executed the command, you should see the Python command prompt, as shown in the following screenshot:
If you installed Python 3, you usually run the program using the following command:
$ python3
You should see the Python 3 shell on your Terminal:
What's next?
There are many resources to help you learn how to write programs in Python. I recommend reading the Python documentation at https://www.python.org/doc/. You can also read Python books to accelerate your learning. This book does not cover the basics of the Python programming language.
Python libraries for computational statistics and data science
Python has a large community whose members help each other learn and share their work. Several community members have released open source libraries for computational statistics and data science, which we can use in our own work. We will use these libraries for our implementations.
The following are several Python libraries for statistics and data science.
NumPy
NumPy is a fundamental package for efficient scientific computing in Python. It handles N-dimensional arrays and provides tools for integrating C/C++ and Fortran code. It also provides features for linear algebra, Fourier transforms, and random number generation.
The official website for NumPy can be found at http://www.numpy.org.
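To make these capabilities concrete, the following is a minimal sketch that touches arrays, linear algebra, the Fourier transform, and random number generation. The matrix, signal, and seed values are arbitrary choices made for illustration, and the sketch assumes a NumPy version that provides np.random.default_rng (1.17 or later).

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector) built from Python lists
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear algebra: solve the linear system a . x = b
x = np.linalg.solve(a, b)

# Fourier transform of a short, hand-made signal
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print("solution:", x)
print("spectrum:", spectrum)
print("sample mean:", samples.mean())
```

Seeding the generator is a design choice that keeps the random draws repeatable between runs, which is helpful when comparing statistical results.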
Pandas
Pandas is a library for handling table-like structures called DataFrame objects. It offers powerful and efficient numerical operations similar to those of NumPy's array object.
Further information about pandas can be found at http://pandas.pydata.org.
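As a small illustration of working with a DataFrame, the sketch below builds a table from a dictionary and applies a few numerical operations to it. The column names (group, value, centered) and the data are invented for this example.

```python
import pandas as pd

# Build a small table; each dictionary key becomes a column
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Column-wise statistics work much like NumPy array operations
col_mean = df["value"].mean()

# Split the rows by group and compute the mean of each group
group_means = df.groupby("group")["value"].mean()

# Vectorized arithmetic creates a new column without an explicit loop
df["centered"] = df["value"] - col_mean

print(df)
print(group_means)
```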
SciPy
SciPy builds on the NumPy library. It contains functions for linear algebra, interpolation, integration, clustering, and so on.
The official website can be found at http://scipy.org/scipylib/index.html.
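The following sketch shows two of the areas mentioned above, integration and interpolation. The integrand and the sample points are arbitrary choices for illustration; interp1d is used here for brevity, although SciPy offers several other interpolation routines.

```python
import numpy as np
from scipy import integrate, interpolate

# Numerical integration of sin(x) over [0, pi]; the exact value is 2
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Linear interpolation between samples of y = x**2
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = xs ** 2
f = interpolate.interp1d(xs, ys)

# Evaluate halfway between the samples at x = 1 and x = 2
y_mid = float(f(1.5))

print("integral of sin on [0, pi]:", area)
print("interpolated value at 1.5:", y_mid)
```

Note that the interpolated value differs from the true value of 1.5 squared, because a straight line between two samples only approximates the curve.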
Scikit-learn
Scikit-learn is one of the most popular machine learning libraries for Python. It provides many functionalities, such as data preprocessing, classification, regression, clustering, dimensionality reduction, and model selection.
Further information about Scikit-learn can be found at http://scikit-learn.org/stable/.
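To show how several of these pieces fit together, here is a minimal sketch that combines preprocessing, classification, and a held-out evaluation on the iris dataset bundled with the library. The choice of logistic regression, the 25 percent test split, and the random_state value are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing (scaling) and a classifier into one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of held-out samples that are classified correctly
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Wrapping the scaler and the classifier in a pipeline means the scaling parameters are learned from the training rows only, which avoids leaking information from the test rows.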
Shogun
Shogun is a machine learning toolbox written in C++ that offers a Python interface. It focuses on large-scale kernel methods such as support vector machines (SVMs) and comes with a range of different SVM implementations.
The official website can be found at http://www.shogun-toolbox.org.
SymPy
SymPy is a Python library for symbolic mathematical computations. It has capabilities in calculus, algebra, geometry, discrete mathematics, quantum physics, and more.
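The official website can be found at http://www.sympy.org. An online interface for trying SymPy expressions, SymPy Gamma, is available at http://www.sympygamma.com.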
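The following sketch illustrates symbolic computation in calculus and algebra. The expressions are arbitrary examples; unlike the numerical libraries above, SymPy returns exact symbolic results.

```python
import sympy as sp

# Declare a symbolic variable
x = sp.symbols("x")

# Calculus: differentiate x**2 * sin(x) with respect to x
expr = x**2 * sp.sin(x)
derivative = sp.diff(expr, x)

# Calculus: definite integral of 2*x from 0 to 3
integral = sp.integrate(2 * x, (x, 0, 3))

# Algebra: solve x**2 - 5*x + 6 = 0 for x
roots = sp.solve(x**2 - 5 * x + 6, x)

print("derivative:", derivative)
print("integral:", integral)
print("roots:", roots)
```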
Statsmodels
Statsmodels is a Python module we can use to explore data, estimate statistical models, and perform statistical tests.
You can find out more about Statsmodels by visiting the official website at http://statsmodels.sourceforge.net.