You're reading from Learning pandas: High performance data manipulation and analysis using Python

Product type: Paperback
Published: Jun 2017
ISBN-13: 9781787123137
Length: 446 pages
Edition: 2nd Edition
Languages: Python
Tools: Pandas
Concepts: Data Analysis
Author: Michael Heydt

pandas is a Python library of high-level data structures and tools designed to help Python programmers perform powerful data analysis. Its ultimate purpose is to help you quickly discover information in data, where information means the underlying meaning of that data.

Wes McKinney began developing pandas in 2008, and it was open sourced in 2009. Today, pandas is supported and actively developed by a range of organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. Processing financial information poses many challenges, including the following:

Representing security data, such as a stock's price, as it changes over time
Matching the measurement of multiple streams of data at identical times
Determining the relationship (correlation) of two or more streams of data
Representing times and dates as first-class entities
Converting the sampling frequency of data, either up or down (resampling)
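
As a preview, the challenges listed above can be sketched in a few lines of pandas. This is a minimal illustration rather than a recommended workflow: the tickers AAA and BBB, the dates, and the prices are all made up for the example.

```python
import pandas as pd

# Hypothetical daily closing prices for two made-up tickers (illustrative data only).
# The two series deliberately start on different days.
dates_a = pd.date_range("2017-01-02", periods=6, freq="D")
dates_b = pd.date_range("2017-01-03", periods=6, freq="D")
stock_a = pd.Series([10.0, 10.5, 11.0, 11.5, 12.0, 12.5], index=dates_a, name="AAA")
stock_b = pd.Series([20.0, 20.4, 20.1, 20.9, 21.2, 21.0], index=dates_b, name="BBB")

# Matching streams at identical times: concat lines the series up on their dates,
# leaving NaN where one stream has no observation.
prices = pd.concat([stock_a, stock_b], axis=1)
aligned = prices.dropna()  # keep only dates present in both streams

# Dates as first-class entities: slice rows with date strings (both ends inclusive).
window = aligned.loc["2017-01-04":"2017-01-05"]

# Relationship between streams: correlation of the daily percentage returns.
returns = aligned.pct_change().dropna()
correlation = returns["AAA"].corr(returns["BBB"])

# Converting the sampling frequency down (3-day means) and back up (daily, forward filled).
downsampled = stock_a.resample("3D").mean()
upsampled = downsampled.resample("D").ffill()

print(prices)
print(window)
print(correlation)
print(downsampled)
print(upsampled)
```

Here `concat`, `dropna`, `pct_change`, `corr`, and `resample` are existing pandas methods; only the data and variable names are assumptions made for this sketch.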

To do this processing, a tool was needed that could retrieve, index, clean and tidy, reshape, combine, slice, and analyze both single- and multidimensional data, including heterogeneously typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features, such as the following:

Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
Intelligent data alignment using indexes and labels
Integrated handling of missing data
Facilities for converting messy data into orderly data (tidying)
Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
Flexible reshaping and pivoting of sets of data
Smart label-based slicing, fancy indexing, and subsetting of large datasets
Size mutability: columns can be inserted into and deleted from data structures
Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
High-performance merging and joining of datasets
Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
Highly optimized for performance, with critical code paths written in Cython or C
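
A few of these features can be sketched briefly. The following is a minimal illustration, assuming a small made-up dataset (the tickers, sectors, regions, and prices are hypothetical); it touches on index alignment, missing data handling, column insertion and deletion, split-apply-combine, merging, and hierarchical indexing.

```python
import pandas as pd

# Intelligent data alignment: values are matched by index label, not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels "a" and "d" appear in only one series, so they become NaN

# Integrated handling of missing data: replace NaN with a default value.
filled = total.fillna(0)

# A DataFrame holding heterogeneous column types (illustrative values only).
df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy"],
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price": [100.0, 50.0, 20.0, 30.0],
})

# Size mutability: insert columns, then delete one.
df["shares"] = [10, 5, 20, 8]
df["value"] = df["price"] * df["shares"]
del df["shares"]

# Split-apply-combine: total position value per sector.
by_sector = df.groupby("sector")["value"].sum()

# Merging and joining: attach a lookup table on a shared key column.
regions = pd.DataFrame({"sector": ["tech", "energy"], "region": ["west", "south"]})
merged = df.merge(regions, on="sector", how="left")

# Hierarchical indexing: two index levels in a two-dimensional structure.
indexed = df.set_index(["sector", "ticker"])
tech_rows = indexed.loc["tech"]  # selects every row under the "tech" label

print(filled)
print(by_sector)
print(merged)
print(tech_rows)
```

Each step uses one call per feature to keep the sketch short; real datasets would typically come from the file and database readers mentioned in the list rather than from inline literals.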

This robust feature set, combined with seamless integration with Python and other tools in the Python ecosystem, has led to the wide adoption of pandas. It is used in a variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the most preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This is very important: those familiar with Python, a more general-purpose programming language than R (which is more of a statistical package), gain many of the data representation and manipulation features of R while remaining entirely within an incredibly rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.