You're reading from Learning pandas High performance data manipulation and analysis using Python

Product type Paperback

Published in Jun 2017

Publisher

ISBN-13 9781787123137

Length 446 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Author (1):

Michael Heydt

View More author details

Other Python libraries of value with pandas

pandas forms one small, but important, part of the data analysis and data science ecosystem within Python. As a reference, here are a few other important Python libraries worth noting. The list is not exhaustive, but outlines several you will likely come across..

Numeric and scientific computing - NumPy and SciPy

NumPy (http://www.numpy.org/) is the cornerstone toolbox for scientific computing with Python, and is included in most distributions of modern Python. It is actually a foundational toolbox from which pandas was built, and when using pandas you will almost certainly use it frequently. NumPy provides, among other things, support for multidimensional arrays with basic operations on them and useful linear algebra functions.

The use of the array features of NumPy goes hand in hand with pandas, specifically the pandas Series object. Most of our examples will reference NumPy, but the pandas Series functionality is such a tight superset of the NumPy array that we will, except for a few brief situations, not delve into details of NumPy.

SciPy (https://www.scipy.org/) provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.

Statistical analysis – StatsModels

StatsModels (http://statsmodels.sourceforge.net/) is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that Stats Models fully meets their needs for statistical computing and data analysis in Python.

Features include:

Linear regression models
Generalized linear models
Discrete choice models
Robust linear models
Many models and functions for time series analysis
Nonparametric estimators
A collection of datasets as examples
A wide range of statistical tests
Input-output tools for producing tables in a number of formats (text, LaTex, HTML) and for reading Stata files into NumPy and pandas
Plotting functions
Extensive unit tests to ensure correctness of results

Machine learning – scikit-learn

scikit-learn (http://scikit-learn.org/) is a machine learning library built from NumPy, SciPy, and matplotlib. It offers simple and efficient tools for common tasks in data analysis such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

PyMC - stochastic Bayesian modeling

PyMC (https://github.com/pymc-devs/pymc) is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large number of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness of fit, and convergence diagnostics.

Data visualization - matplotlib and seaborn

Python has a rich set of frameworks for data visualization. Two of the most popular are matplotlib and the newer seaborn.

Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.

pandas contains very tight integration with matplotlib, including functions as part of Series and DataFrame objects that automatically call matplotlib. This does not mean that pandas is limited to just matplotlib. As we will see, this can be easily changed to others such as ggplot2 and seaborn.

Seaborn

Seaborn (http://seaborn.pydata.org/introduction.html) is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for NumPy and pandas data structures and statistical routines from SciPy and StatsModels. It provides additional functionality beyond matplotlib, and also by default demonstrates a richer and more modern visual style than matplotlib.