Learning pandas

You're reading from Learning pandas: High performance data manipulation and analysis using Python

Product type Paperback
Published in Jun 2017
ISBN-13 9781787123137
Length 446 pages
Edition 2nd Edition
Author: Michael Heydt

Table of Contents (16 chapters)

  Preface
  1. pandas and Data Analysis
  2. Up and Running with pandas
  3. Representing Univariate Data with the Series
  4. Representing Tabular and Multivariate Data with the DataFrame
  5. Manipulating DataFrame Structure
  6. Indexing Data
  7. Categorical Data
  8. Numerical and Statistical Methods
  9. Accessing Data
  10. Tidying Up Your Data
  11. Combining, Relating, and Reshaping Data
  12. Data Aggregation
  13. Time-Series Modelling
  14. Visualization
  15. Historical Stock Price Analysis

Introducing pandas

pandas is a Python library of high-level data structures and tools created to help Python programmers perform powerful data analysis. Its ultimate purpose is to help you quickly discover information in data, where information means the underlying meaning of that data.

Wes McKinney began developing pandas in 2008, and it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. Processing financial information poses many challenges, including the following:

  • Representing security data, such as a stock's price, as it changes over time
  • Matching the measurement of multiple streams of data at identical times
  • Determining the relationship (correlation) of two or more streams of data
  • Representing times and dates as first-class entities
  • Converting the period of samples of data, either up or down
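
Several of these challenges can be illustrated with a minimal sketch. The tickers AAA and BBB and all of the prices below are made up for illustration, and the calls shown are only one of several ways pandas can express these steps:

```python
import pandas as pd

# Hypothetical daily closing prices for two made-up tickers.
# The dates are first-class index values, and the two ranges only partly overlap.
dates_a = pd.date_range("2017-01-02", periods=6, freq="D")
dates_b = pd.date_range("2017-01-04", periods=6, freq="D")
stock_a = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 11.4], index=dates_a, name="AAA")
stock_b = pd.Series([20.0, 20.6, 20.4, 21.0, 21.5, 21.3], index=dates_b, name="BBB")

# Match the two streams at identical times: keep only dates present in both.
both = pd.concat([stock_a, stock_b], axis=1, join="inner")

# Determine the relationship (correlation) of the two streams on the shared dates.
corr = both["AAA"].corr(both["BBB"])

# Convert the sampling period down: daily values to 2-day means.
downsampled = stock_a.resample("2D").mean()

# Convert the sampling period up: daily values to 12-hour steps,
# carrying the last known price forward into the new slots.
upsampled = stock_a.resample(pd.Timedelta(hours=12)).ffill()

print(both)
print(corr)
print(downsampled)
print(upsampled)
```

The inner join is a deliberate choice here: it discards dates that only one stream has, so the correlation is computed over matched observations only.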

To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneous-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following:

  • Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
  • Intelligent data alignment using indexes and labels
  • Integrated handling of missing data
  • Facilities for converting messy data into orderly data (tidying)
  • Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
  • The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
  • Flexible reshaping and pivoting of sets of data
  • Smart label-based slicing, fancy indexing, and subsetting of large datasets
  • Size mutability: columns can be inserted into and deleted from data structures
  • Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
  • High-performance merging and joining of datasets
  • Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
  • Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
  • Highly optimized for performance, with critical code paths written in Cython or C
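
A few of the features listed above can be sketched in a handful of lines. This is an illustrative example only: the sectors, tickers, company names, and volumes are invented, and the variable names are hypothetical.

```python
import pandas as pd

# Intelligent data alignment: arithmetic matches values by index label,
# not by position. Labels found in only one Series produce NaN.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
total = s1 + s2

# Integrated handling of missing data: replace the NaN values with 0.0.
filled = total.fillna(0.0)

# Made-up trade data used for the remaining steps.
trades = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy"],
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "volume": [100, 150, 80, 120],
})

# Split-apply-combine: split rows by sector, sum each group, combine the results.
volume_by_sector = trades.groupby("sector")["volume"].sum()

# Merging and joining: a left join keeps every trade, even when the
# lookup table has no matching ticker (DDD has no name here).
names = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "name": ["Alpha", "Beta", "Gamma"],
})
merged = trades.merge(names, on="ticker", how="left")

# Size mutability: insert a derived column, then delete it again.
trades["volume_k"] = trades["volume"] / 1000
del trades["volume_k"]

print(total)
print(filled)
print(volume_by_sector)
print(merged)
```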

This robust feature set, combined with seamless integration with Python and other tools in the Python ecosystem, has led to wide adoption of pandas. It is used in a broad range of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This matters because those familiar with Python, a more general-purpose programming language than R (which is closer to a statistical package), gain many of R's data representation and manipulation features while remaining entirely within the rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.
