Learning pandas: High performance data manipulation and analysis using Python, Second Edition

eBook: $43.99
Paperback: $54.99
Subscription: free trial; renews at $19.99/month

What do you get with Print?

  • Instant access to your digital eBook copy whilst your print order is shipped
  • Paperback book shipped to your preferred address
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM free: read whenever, wherever, and however you want


pandas and Data Analysis

Welcome to Learning pandas! In this book, we will go on a journey that will see us learning pandas, an open source data analysis library for the Python programming language. The pandas library provides high-performance and easy-to-use data structures and analysis tools built with Python. pandas brings to Python many good things from the statistical programming language R, specifically data frame objects and R packages such as plyr and reshape2, and places them in a single library that you can use from within Python.

In this first chapter, we will take the time to understand pandas and how it fits into the bigger picture of data analysis. This will give the reader interested in pandas a feeling for its place in that bigger picture, rather than a complete focus on the details of using it. The goal is that, while learning pandas, you also learn why its features exist in support of performing data analysis tasks.

So, let's jump in. In this chapter, we will cover:

  • What pandas is, why it was created, and what it gives you
  • How pandas relates to data analysis and data science
  • The processes involved in data analysis and how they are supported by pandas
  • General concepts of data and analytics
  • Basic concepts of data analysis and statistical analysis
  • Types of data and their applicability to pandas
  • Other libraries in the Python ecosystem that you will likely use with pandas

Introducing pandas

pandas is a Python library containing high-level data structures and tools created to help Python programmers perform powerful data analysis. The ultimate purpose of pandas is to help you quickly discover information in data, where information is defined as the underlying meaning.

The development of pandas was begun in 2008 by Wes McKinney; it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. Processing financial information poses many challenges, the following being a few:

  • Representing security data, such as a stock's price, as it changes over time
  • Matching the measurement of multiple streams of data at identical times
  • Determining the relationship (correlation) of two or more streams of data
  • Representing times and dates as first-class entities
  • Converting the period of samples of data, either up or down

To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneously-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features, such as the following (a brief sketch after the list illustrates a few of them):

  • Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
  • Intelligent data alignment using indexes and labels
  • Integrated handling of missing data
  • Facilities for converting messy data into orderly data (tidying)
  • Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
  • The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
  • Flexible reshaping and pivoting of sets of data
  • Smart label-based slicing, fancy indexing, and subsetting of large datasets
  • Insertion and deletion of columns in data structures for size mutability
  • Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
  • High-performance merging and joining of datasets
  • Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
  • Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
  • Highly optimized for performance, with critical code paths written in Cython or C
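
To make a couple of these features concrete, here is a minimal sketch (using made-up values) of integrated indexing, automatic label-based alignment, and missing-data handling:

    import pandas as pd

    # Two Series with partially overlapping index labels
    s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
    s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

    # Arithmetic aligns values by label; labels present in only one
    # Series produce NaN, pandas' marker for missing data
    print(s1 + s2)
    # a     NaN
    # b    12.0
    # c    23.0
    # d     NaN

We will cover each of these behaviors in depth in later chapters.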

This robust feature set, combined with seamless integration with Python and other tools within the Python ecosystem, has given pandas wide adoption across many academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the most preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This is very important: those familiar with Python, a more generalized programming language than R (which is more of a statistical package), gain many of R's data representation and manipulation features while remaining entirely within an incredibly rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.

Data manipulation, analysis, science, and pandas

We live in a world in which massive amounts of data are produced and stored every day. This data comes from a plethora of information systems, devices, and sensors. Almost everything you do, and items you use to do it, produces data which can be, or is, captured.

This has been greatly enabled by the ubiquitous nature of services that are connected to networks, and by the great increases in data storage facilities; this, combined with the ever-decreasing cost of storage, has made capturing and storing even the most trivial of data effective.

This has led to massive amounts of data being piled up and ready for access. But this data is spread out all over cyberspace and cannot actually be referred to as information. It tends to be a collection of recorded events, whether financial transactions, your interactions with social networks, or the readings of your personal health monitor tracking your heartbeat throughout the day. This data is stored in all kinds of formats, is located in scattered places, and, beyond its raw nature, does not give much insight.

Logically, the overall process can be broken into three major areas of discipline:

  • Data manipulation
  • Data analysis
  • Data science

These three disciplines can and do have a lot of overlap. Where each ends and the others begin is open to interpretation. For the purposes of this book we will define each as in the following sections.

Data manipulation

Data is distributed all over the planet. It is stored in different formats. It has widely varying levels of quality. Because of this, there is a need for tools and processes for pulling data together into a form suitable for decision making. This requires many different tasks and capabilities from a tool that manipulates data in preparation for analysis. The features needed from such a tool include the following (see the sketch after this list for a taste of several of them):

  • Programmability for reuse and sharing
  • Access to data from external sources
  • Storing data locally
  • Indexing data for efficient retrieval
  • Alignment of data in different sets based upon attributes
  • Combining data in different sets
  • Transformation of data into other representations
  • Cleaning data from cruft
  • Effective handling of bad data
  • Grouping data into common baskets
  • Aggregation of data of like characteristics
  • Application of functions to calculate meaning or perform transformations
  • Query and slicing to explore pieces of the whole
  • Restructuring into other forms
  • Modeling distinct categories of data such as categorical, continuous, discrete, and time series
  • Resampling data to different frequencies
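
As that promised taste, the following hypothetical sketch groups data into common baskets and aggregates data of like characteristics:

    import pandas as pd

    # Hypothetical data: prices of securities tagged by sector
    df = pd.DataFrame({
        "sector": ["tech", "tech", "energy", "energy"],
        "price": [150.0, 300.0, 80.0, 95.0],
    })

    # Group into common baskets and aggregate like characteristics
    print(df.groupby("sector")["price"].mean())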

There are many data manipulation tools in existence. Each differs in support for the items on this list, how they are deployed, and how they are utilized by their users. These tools include relational databases (SQL Server, Oracle), spreadsheets (Excel), event processing systems (such as Spark), and more generic tools such as R and pandas.

Data analysis

Data analysis is the process of creating meaning from data. Data with quantified meaning is often called information; data analysis creates information from data through the application of data models and mathematics to find patterns. Data analysis often overlaps with data manipulation, and the distinction between the two is not always clear. Many data manipulation tools also contain analysis functions, and data analysis tools often provide data manipulation capabilities.

Data science

Data science is the process of using statistics and data analysis processes to create an understanding of phenomena within data. Data science usually starts with information and applies a more complex domain-based analysis to it. These domains span many fields, such as mathematics, statistics, information science, computer science, machine learning, classification, cluster analysis, data mining, databases, and visualization. Data science is multidisciplinary, and its methods of analysis are often very different from and specific to each domain.

Where does pandas fit?

pandas first and foremost excels in data manipulation. All of the needs itemized earlier will be covered in this book using pandas; this is the core of pandas and most of what we will focus on.

It is worth noting that pandas has a specific design goal: emphasizing data manipulation rather than being a complete analysis package.

pandas does, however, provide several features for performing data analysis. These capabilities typically revolve around descriptive statistics and functions required for finance, such as correlations.
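
For example, both of those capabilities are available directly on pandas objects; here is a brief sketch with made-up prices:

    import pandas as pd

    prices = pd.DataFrame({
        "AAPL": [150.0, 152.5, 151.0, 155.0],
        "MSFT": [300.0, 303.0, 299.5, 308.0],
    })

    print(prices.describe())                     # count, mean, std, quartiles, ...
    print(prices["AAPL"].corr(prices["MSFT"]))   # pairwise correlation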

Therefore, pandas itself is not a data science toolkit. It is more of a manipulation tool with some analysis capabilities. pandas explicitly leaves complex statistical, financial, and other types of analyses to other Python libraries, such as SciPy, NumPy, and scikit-learn, and leans upon graphics libraries such as matplotlib and ggvis for data visualization.

This focus is actually a strength of pandas over other languages such as R, as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere by the Python community.

The process of data analysis

The primary goal of this book is to thoroughly teach you how to use pandas to manipulate data. But there is a secondary, and perhaps no less important, goal of showing how pandas fits into the processes that a data analyst/scientist performs in everyday life.

One description of the steps involved in the process of data analysis is given on the pandas web site:

  • Munging and cleaning data
  • Analyzing/modeling
  • Organization into a form suitable for communication

This small list is a good initial definition, but it fails to cover the overall scope of the process and why many features implemented in pandas were created. The following expands upon this process and sets the framework for what is to come throughout this journey.

The process

The proposed process is one that will be referred to as The Data Process, and it consists of the steps described in the following sections.

This process sets up a framework for defining logical steps that are taken in working with data. For now, let's take a quick look at each of these steps in the process and some of the tasks that you as a data analyst using pandas will perform.

It is important to understand that this is not purely a linear process. It is best done in a highly interactive and agile/iterative manner.

Ideation

The first step in any data problem is to identify what it is you want to figure out. This is referred to as ideation: coming up with an idea of what we want to do and prove. Ideation generally relates to hypothesizing about patterns in data that can be used to make intelligent decisions.

These decisions are often within the context of a business, but also within other disciplines such as the sciences and research. The in-vogue thing right now is understanding the operations of businesses, as there are often copious amounts of money to be made in understanding data.

But what kinds of decisions are we typically looking to make? The following questions are among those most commonly asked:

  • Why did something happen?
  • Can we predict the future using historical data?
  • How can I optimize operations in the future?

This list is by no means exhaustive, but it does cover a sizable percentage of the reasons why anyone undertakes these endeavors. To get answers to these questions, one must be involved with collecting and understanding data relative to the problem. This involves defining what data is going to be researched, what the benefit is of the research, how the data is going to be obtained, what the success criteria are, and how the information is going to be eventually communicated.

pandas itself does not provide tools to assist in ideation. But once you have gained understanding of and skill in using pandas, you will naturally realize how it helps you formulate ideas, because you will be armed with a powerful tool you can use to frame many complicated hypotheses.

Retrieval

Once you have an idea, you must then find data to try and support your hypothesis. This data can come from within your organization or from external data providers. It is normally provided as archived data, or it can be provided in real time (although pandas is not well known as a real-time data processing tool).

Data is often very raw, even if obtained from sources that you have created or from within your organization. Being raw means that the data can be disorganized, in various formats, and erroneous; relative to supporting your analysis, it may be incomplete and need manual augmentation.

There is a lot of free data in the world, but much data is not free and actually costs significant amounts of money to obtain. Some is freely available through public APIs, and other data only by subscription. Data you pay for is often cleaner, but this is not always the case.

In either case, pandas provides a robust and easy-to-use set of tools for retrieving data from various sources in many different formats. pandas also gives us the ability not only to retrieve data, but also to provide an initial structuring of it via pandas data structures, without having to manually write the complex code that may be required in other tools or programming languages.
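
As an illustration, a single call to one of the pandas reader functions both retrieves data and gives it an initial structure as a DataFrame; this sketch uses a small inline CSV so that it is self-contained:

    import io
    import pandas as pd

    # In practice the source would be a file path, URL, or database;
    # pandas also provides read_excel, read_json, read_sql, and others
    csv_data = io.StringIO("date,close\n2017-01-03,116.15\n2017-01-04,116.02\n")
    df = pd.read_csv(csv_data, parse_dates=["date"])
    print(df)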

Preparation

During preparation, raw data is made ready for exploration. This preparation is often a very interesting process, as it is very frequently the case that the data is fraught with all kinds of issues related to quality. You will likely spend a lot of time, often a very non-trivial amount, dealing with these quality issues.

Why? Well, there are a number of reasons:

  • The data is simply incorrect
  • Parts of the dataset are missing
  • Data is not represented using measurements appropriate for your analysis
  • The data is in formats not convenient for your analysis
  • Data is at a level of detail not appropriate for your analysis
  • Not all the fields you need are available from a single source
  • The representation of data differs depending upon the provider

The preparation process focuses on solving these issues. pandas provides many great facilities for preparing, or as it is often referred to, tidying up, data. These facilities include intelligent means of handling missing data, converting data types, performing format conversions, changing frequencies of measurements, joining data from multiple sets, mapping/converting symbols into shared representations, and grouping data, among many others. We will cover all of these in depth.
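
To give an early flavor of that tidying, the following hypothetical sketch touches several of these facilities at once:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "symbol": ["aapl", "AAPL", "msft"],
        "price": [10.0, np.nan, 12.0],
    })

    df["symbol"] = df["symbol"].str.upper()              # map symbols to a shared representation
    df["price"] = df["price"].fillna(df["price"].mean()) # handle missing data
    df["price"] = df["price"].astype("float64")          # ensure a consistent data type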

Exploration

Exploration involves being able to interactively slice and dice your data to try and make quick discoveries. Exploration can include various tasks such as:

  • Examining how variables relate to each other
  • Determining how the data is distributed
  • Finding and excluding outliers
  • Creating quick visualizations
  • Quickly creating new data representations or models to feed into more permanent and detailed modeling processes

Exploration is one of the great strengths of pandas. While exploration can be performed in most programming languages, each has its own level of ceremony, that is, how much non-exploratory effort must be expended before actually getting to discoveries.

When used with the read-eval-print-loop (REPL) nature of IPython and/or Jupyter notebooks, pandas creates an exploratory environment that is almost free of ceremony. The expressiveness of the syntax of pandas lets you describe complex data manipulation constructs succinctly, and the result of every action you take upon your data is immediately presented for your inspection. This allows you to quickly determine the validity of the action you just took without having to recompile and completely rerun your programs.
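
For instance, a typical exploratory session is little more than a series of one-liners; a small self-contained sketch follows (in IPython or Jupyter, each expression's result is displayed immediately):

    import pandas as pd

    df = pd.DataFrame({"price": [95.0, 101.5, 98.0, 150.0]})

    df["price"].describe()     # how is the data distributed?
    df[df["price"] > 100]      # quickly slice out potential outliers
    df["price"].pct_change()   # a quick new representation of the data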

Modeling

In the modeling stage you formalize your discoveries found during exploration into an explicit explanation of the steps and data structures required to get to the desired meaning contained within your data. This is the model, a combination of both data structures as well as steps in code to get from the raw data to your information and conclusions.

The modeling process is iterative: through exploration of the data, you select the variables required to support your analysis, organize those variables for input to analytical processes, execute the model, and determine how well the model supports your original assumptions. It can include a formal modeling of the structure of the data, but can also combine techniques from various analytic domains such as (but not limited to) statistics, machine learning, and operations research.

To facilitate this, pandas provides extensive data modeling facilities. It is in this step that you will move more from exploring your data, to formalizing the data model in DataFrame objects, and ensuring the processes to create these models are succinct. Additionally, by being based in Python, you get to use its full power to create programs to automate the process from beginning to end. The models you create are executable.

From an analytic perspective, pandas provides several capabilities, most notably integrated support for descriptive statistics, which can get you to your goal for many types of problems. And because pandas is Python-based, if you need more advanced analytic capabilities, it is very easy to integrate with other parts of the extensive Python scientific environment.

Presentation

The penultimate step of the process is presenting your findings to others, typically in the form of a report or presentation. You will want to create a persuasive and thorough explanation of your solution. This can often be done using various plotting tools in Python and manually creating a presentation.

Jupyter notebooks are a powerful tool in creating presentations for your analyses with pandas. These notebooks provide a means of both executing code and providing rich markdown capabilities to annotate and describe the execution at multiple points in the application. These can be used to create very effective, executable presentations that are visually rich with pieces of code, stylized text, and graphics.

We will explore Jupyter notebooks briefly in Chapter 2, Up and Running with pandas.

Reproduction

An important part of research is sharing your work and making it reproducible. It is often said that if other researchers cannot reproduce your experiment and results, then you didn't prove a thing.

Fortunately for you, by having used pandas and Python, you will be able to easily make your analysis reproducible. This can be done by sharing the Python code that drives your analysis, along with the data.

Jupyter notebooks also provide a convenient means of packaging both the code and its results in a form that can be easily shared with anyone else who has a Jupyter Notebook installation. And there are many free and secure sharing sites on the internet that allow you to either create or deploy your Jupyter notebooks for sharing.

A note on being iterative and agile

Something very important to understand about data manipulation, analysis, and science is that it is an iterative process. Although there is a natural forward flow along the stages previously discussed, you will end up going forwards and backwards in the process. For instance, while in the exploration phase you may identify anomalies in the data that relate to data purity issues from the preparation stage, and need to go back and rectify those issues.

This is part of the fun of the process. You are on an adventure to solve your initial problem, all the while gaining incremental insights about the data you are working with. These insights may lead you to ask new questions, or more exact questions, or to realize that your initial questions were not the actual questions that needed to be asked. The process is truly a journey, and not necessarily just about the destination.

Relating the book to the process

The following gives a quick mapping of the steps in the process to where you will learn about them in this book. Do not fret if the steps that are earlier in the process are in later chapters. The book will walk you through this in a logical progression for learning pandas, and you can refer back from the chapters to the relevant stage in the process.

  • Ideation: Ideation is the creative process in data science; you need to have the idea. The fact that you are reading this book qualifies you, as you must be looking to analyze some data now, and will want to in the future.
  • Retrieval: Retrieval of data is primarily covered in Chapter 9, Accessing Data.
  • Preparation: Preparation of data is primarily covered in Chapter 10, Tidying Up Your Data, but it is also a common thread running through most of the chapters.
  • Exploration: Exploration spans Chapter 3, Representing Univariate Data with the Series, through Chapter 15, Historical Stock Price Analysis, so most of the chapters of the book. The most focused chapters for exploration are Chapter 14, Visualization, and Chapter 15, Historical Stock Price Analysis, in both of which we begin to see the results of data analysis.
  • Modeling: Modeling has its focus in Chapter 3, Representing Univariate Data with the Series, and Chapter 4, Representing Tabular and Multivariate Data with the DataFrame, as well as in Chapter 11, Combining, Relating, and Reshaping Data, through Chapter 13, Time-Series Modelling, with a specific focus on finance in Chapter 15, Historical Stock Price Analysis.
  • Presentation: Presentation is the primary purpose of Chapter 14, Visualization.
  • Reproduction: Reproduction flows throughout the book, as the examples are provided as Jupyter notebooks. By working in notebooks, you are by default using a tool for reproduction, and you have the ability to share notebooks in various ways.

Concepts of data and analysis in our tour of pandas

When learning pandas and data analysis you will come across many concepts in data, modeling and analysis. Let's examine several of these concepts and how they relate to pandas.

Types of data

Working with data in the wild, you will come across several broad categories of data that will need to be coerced into pandas data structures. They are important to understand, as the tools required to work with each type vary.

pandas is inherently used for manipulating structured data, but it provides several tools for facilitating the conversion of non-structured data into a form we can manipulate.

Structured

Structured data is any type of data that is organized as fixed fields within a record or file, such as data in relational databases and spreadsheets. Structured data depends upon a data model, which is the defined organization and meaning of the data and often how the data should be processed. This includes specifying the type of the data (integer, float, string, and so on), and any restrictions on the data, such as the number of characters, maximum and minimum values, or a restriction to a certain set of values.

Structured data is the type of data that pandas is designed to utilize. As we will see first with the Series and then with the DataFrame, pandas organizes structured data into one or more columns of data, each of a single and specific data type, and then a series of zero or more rows of data.
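
For example, a simple DataFrame with two typed columns and two rows of made-up data:

    import pandas as pd

    df = pd.DataFrame({
        "symbol": ["AAPL", "MSFT"],   # a column of strings
        "price": [150.0, 300.0],      # a column of floats
    })
    print(df.dtypes)   # one specific data type per column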

Unstructured

Unstructured data is data that is without any defined organization and which specifically does not break down into stringently defined columns of specific types. This can consist of many types of information such as photos and graphic images, videos, streaming sensor data, web pages, PDF files, PowerPoint presentations, emails, blog entries, wikis, and word processing documents.

While pandas does not manipulate unstructured data directly, it provides a number of facilities to extract structured data from unstructured sources. As a specific example that we will examine, pandas has tools to retrieve web pages and extract specific pieces of content into a DataFrame.
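
As a hedged sketch of that capability (the URL here is only a placeholder, and read_html additionally requires an HTML parser such as lxml to be installed):

    import pandas as pd

    # read_html returns a list of DataFrames, one per table found in the page
    tables = pd.read_html("https://example.com/page-with-tables.html")
    df = tables[0]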

Semi-structured

Semi-structured data fits in between structured and unstructured data. It can be considered a type of structured data, but it lacks a strict data model. JSON is a form of semi-structured data: while well-formed JSON has a defined format, there is no specific schema that is always strictly enforced. Much of the time, the data will be in a repeatable pattern that can be easily converted into structured data types like the pandas DataFrame, but the process may need some guidance from you to specify or coerce data types.
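
For example, a list of JSON-style records in a repeatable pattern converts directly, with optional type coercion; the records here are made up:

    import pandas as pd

    records = [
        {"symbol": "AAPL", "price": 150.0},
        {"symbol": "MSFT", "price": 300.0},
    ]
    df = pd.DataFrame.from_records(records)
    df["symbol"] = df["symbol"].astype("category")   # coerce a type where needed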

Variables

When modeling data in pandas, we will be modeling one or more variables and looking for statistical meaning among their values or across multiple variables. The term variable here is meant not in the sense of a programming language variable but in the sense of a statistical variable.

A variable is any characteristic, number, or quantity that can be measured or counted. A variable is so named because the value may vary between data units in a population and may change in value over time. Stock value, age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye color, and vehicle type are examples of variables.

There are several broad types of statistical variables that we will come across when using pandas:

  • Categorical
  • Continuous
  • Discrete

Categorical

A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Each of the possible values is often referred to as a level. Categorical variables in pandas are represented by Categoricals, a pandas data type which corresponds to categorical variables in statistics. Examples of categorical variables are gender, social class, blood types, country affiliations, observation time, or ratings such as Likert scales.
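
A brief sketch of a Categorical in pandas, using a hypothetical rating variable:

    import pandas as pd

    ratings = pd.Series(["low", "high", "medium", "low"], dtype="category")
    print(ratings.cat.categories)   # the levels: ['high', 'low', 'medium']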

Continuous

A continuous variable is a variable that can take on infinitely many (an uncountable number of) values. Observations can take any value between a certain set of real numbers. Examples of continuous variables include height, time, and temperature. Continuous variables in pandas are represented by either float or integer types (native to Python), typically in collections that represent multiple samplings of the specific variable.

Discrete

A discrete variable is a variable whose values are based on a count from a set of distinct whole values. A discrete variable cannot take a fractional value between any two counts. Examples of discrete variables include the number of registered cars, the number of business locations, and the number of children in a family, all of which measure whole units (for example, 1, 2, or 3 children). Discrete variables are normally represented in pandas by integers (or occasionally floats), again normally in collections of two or more samplings of a variable.

Time series data

Time series data is a first-class entity within pandas. Time adds an important extra dimension to samples of variables within pandas. Often variables are independent of the time they were sampled at; that is, the time at which they were sampled does not matter. But in many cases it does. A time series forms a sample of a discrete variable at specific time intervals, where the observations have a natural temporal ordering.

A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations that are further apart. Time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values rather than from future values.

A common scenario with pandas is financial data where a variable represents the value of a stock as it changes at regular intervals throughout the day. We often want to determine changes in the rate of change of the price at specific intervals. We may also want to correlate the price of multiple stocks across specific intervals of time.
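
A hypothetical sketch of these ideas, with a date range as a first-class index, the rate of change between samples, and a conversion of the sampling period:

    import pandas as pd

    idx = pd.date_range("2017-01-01", periods=6, freq="D")
    prices = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0, 107.0], index=idx)

    print(prices.pct_change())          # rate of change between samples
    print(prices.resample("2D").mean()) # convert the period of the samples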

This is such an important and robust capability in pandas that we will spend an entire chapter examining the concept.

General concepts of analysis and statistics

In this text, we will only approach the periphery of statistics and the technical processes of data analysis. But several analytical concepts are worth noting, some of which are implemented directly within pandas. Others rely on additional libraries such as SciPy, but you may also come across them while working with pandas, so an initial shout-out is valuable.

Quantitative versus qualitative data/analysis

Qualitative analysis is the scientific study of data that can be observed but cannot be measured. It focuses on cataloging the qualities of data. Examples of qualitative data can be:

  • The softness of your skin
  • How elegantly someone runs

Quantitative analysis is the study of actual values within data, with real measurements of items presented as data. Normally, these are values such as:

  • Quantity
  • Price
  • Height

pandas deals primarily with quantitative data, providing you with extensive tools for representing observations of variables. pandas does not provide for qualitative analysis, but it does let you represent qualitative information.

Single and multivariate analysis

Statistics, from a certain perspective, is the practice of studying variables, and specifically the observation of those variables. Much of statistics is based upon doing this analysis for a single variable, which is referred to as univariate analysis. Univariate analysis is the simplest form of analyzing data. It does not deal with causes or relationships and is normally used to describe or summarize data, and to find patterns in it.

Multivariate analysis is a modeling technique where there exist two or more output variables that affect the outcome of an experiment. Multivariate analysis is often related to concepts such as correlation and regression, which help us understand the relationships between multiple variables, as well as how those relationships affect the outcome.

pandas primarily provides fundamental univariate analysis capabilities. And these capabilities are generally descriptive statistics, although there is inherent support for concepts such as correlations (as they are very common in finance and other domains).

Other more complex statistics can be performed with StatsModels. Again, this is not per se a weakness of pandas, but a specific design decision to let those concepts be handled by other dedicated Python libraries.

Descriptive statistics

Descriptive statistics are functions that summarize a given dataset, typically where the dataset represents a population or a sample of a single variable (univariate data). They describe the dataset and form measures of central tendency, variability, and dispersion.

For example, the following are descriptive statistics:

  • The distribution (for example, normal, Poisson)
  • The central tendency (for example, mean, median, and mode)
  • The dispersion (for example, variance, standard deviation)

As we will see, the pandas Series and DataFrame objects have integrated support for a large number of descriptive statistics.
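
For instance, these measures are single method calls on a Series; the values below are made up:

    import pandas as pd

    s = pd.Series([1, 2, 2, 3, 4])
    print(s.mean(), s.median(), s.std())  # central tendency and dispersion
    print(s.describe())                   # a summary of many of these at once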

Inferential statistics

Inferential statistics differs from descriptive statistics in that inferential statistics attempts to infer conclusions from data instead of simply summarizing it. Examples of inferential statistics include:

  • t-test
  • chi square
  • ANOVA
  • Bootstrapping

These inferential techniques are generally deferred from pandas to other tools such as SciPy and StatsModels.

Stochastic models

Stochastic models are a form of statistical modeling that includes one or more random variables, typically using time series data. The purpose of a stochastic model is to estimate the chance that an outcome falls within a specific forecast, that is, to predict conditions for different situations.

An example of stochastic modeling is the Monte Carlo simulation. The Monte Carlo simulation is often used for financial portfolio evaluation by simulating the performance of a portfolio based upon repeated simulation of the portfolio in markets that are influenced by various factors and the inherent probability distributions of the constituent stock returns.

pandas gives us the fundamental data structure for stochastic models in the DataFrame, often combined with time series data, to get up and running with stochastic modeling. While it is possible to code your own stochastic models and analyses using pandas and Python, in many cases there are domain-specific libraries, such as PyMC, to facilitate this type of modeling.

Probability and Bayesian statistics

Bayesian statistics is an approach to statistical inference, derived from Bayes' theorem, a mathematical equation built off simple probability axioms. It allows an analyst to calculate any conditional probability of interest. A conditional probability is simply the probability of event A given that event B has occurred.

Therefore, in probability terms, the data events have already occurred and have been collected (since we know the probability). By using Bayes' theorem, we can then calculate the probability of various things of interest, given, or conditional upon, this already observed data.

Bayesian modeling is beyond the scope of this book, but again the underlying data models are well handled using pandas and then actually analyzed using libraries such as PyMC.

Correlation

Correlation is one of the most common statistics and is directly built into the pandas DataFrame. A correlation is a single number that describes the degree of relationship between two variables, and specifically between two sequences of observations of those variables.

A common example of using a correlation is to determine how closely the prices of two stocks follow each other as time progresses. If the changes move closely together, the two stocks have a high correlation; if there is no discernible pattern, they are uncorrelated. This is valuable information that can be used in a number of investment strategies.

The level of correlation of two stocks can also vary slightly with the time frame of the entire dataset, as well as the interval. Fortunately, pandas has powerful capabilities for us to easily change these parameters and rerun correlations. We will look at correlations in several places later in the book.
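
A brief sketch of both forms, a single correlation over the entire dataset and a rolling correlation over a moving window, using made-up observations:

    import pandas as pd

    a = pd.Series([1.0, 2.0, 3.0, 2.5, 3.5, 4.0])
    b = pd.Series([1.1, 1.9, 3.2, 2.4, 3.6, 4.2])

    print(a.corr(b))                    # one number for the entire dataset
    print(a.rolling(window=3).corr(b))  # correlation per three-sample window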

Regression

Regression is a statistical measure that estimates the strength of relationship between a dependent variable and a series of other variables. It can be used to understand the relationships between variables. An example in finance would be understanding the relationship between commodity prices and the stocks of businesses dealing in those commodities.

There was originally a regression model built directly into pandas, but it has been moved out into the StatsModels library. This shows a pattern common in pandas: concepts are often built into it, but as they mature, they are deemed to fit most effectively into other Python libraries. This is both good and bad: it is initially great to have a feature directly in pandas, but as you upgrade to new versions, its removal can break your code!
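
As a hedged illustration of the StatsModels side of this hand-off, here is a minimal ordinary least squares fit on made-up values:

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(model.params)   # intercept and slope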

Other Python libraries of value with pandas

pandas forms one small, but important, part of the data analysis and data science ecosystem within Python. As a reference, here are a few other important Python libraries worth noting. The list is not exhaustive, but it outlines several you will likely come across.

Numeric and scientific computing - NumPy and SciPy

NumPy (http://www.numpy.org/) is the cornerstone toolbox for scientific computing with Python, and is included in most distributions of modern Python. It is actually a foundational toolbox from which pandas was built, and when using pandas you will almost certainly use it frequently. NumPy provides, among other things, support for multidimensional arrays with basic operations on them and useful linear algebra functions.

The use of the array features of NumPy goes hand in hand with pandas, specifically the pandas Series object. Most of our examples will reference NumPy, but the pandas Series functionality is such a tight superset of the NumPy array that we will, except for a few brief situations, not delve into details of NumPy.

SciPy (https://www.scipy.org/) provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.

Statistical analysis – StatsModels

StatsModels (http://statsmodels.sourceforge.net/) is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Researchers across fields may find that StatsModels fully meets their needs for statistical computing and data analysis in Python.

Features include:

  • Linear regression models
  • Generalized linear models
  • Discrete choice models
  • Robust linear models
  • Many models and functions for time series analysis
  • Nonparametric estimators
  • A collection of datasets as examples
  • A wide range of statistical tests
  • Input-output tools for producing tables in a number of formats (text, LaTeX, HTML) and for reading Stata files into NumPy and pandas
  • Plotting functions
  • Extensive unit tests to ensure correctness of results

Machine learning – scikit-learn

scikit-learn (http://scikit-learn.org/) is a machine learning library built from NumPy, SciPy, and matplotlib. It offers simple and efficient tools for common tasks in data analysis such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

PyMC - stochastic Bayesian modeling

PyMC (https://github.com/pymc-devs/pymc) is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large number of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness of fit, and convergence diagnostics.

Data visualization - matplotlib and seaborn

Python has a rich set of frameworks for data visualization. Two of the most popular are matplotlib and the newer seaborn.

Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.

pandas contains very tight integration with matplotlib, including functions as part of Series and DataFrame objects that automatically call matplotlib. This does not mean that pandas is limited to just matplotlib. As we will see, this can be easily changed to others such as ggplot2 and seaborn.

Seaborn

Seaborn (http://seaborn.pydata.org/introduction.html) is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for NumPy and pandas data structures and statistical routines from SciPy and StatsModels. It provides additional functionality beyond matplotlib, and also by default demonstrates a richer and more modern visual style than matplotlib.

Summary

In this chapter, we went on a tour of the how and why of pandas, data manipulation/analysis, and science. This started with an overview of why pandas exists, what functionality it contains, and how it relates to concepts of data manipulation, analysis, and data science.

Then we covered a process for data analysis to set a framework for why certain functions exist in pandas. These include retrieving data, organizing and cleaning it up, doing exploration, and then building a formal model, presenting your findings, and being able to share and reproduce the analysis.

Next, we covered several concepts involved in data and statistical modeling. This included many common analysis techniques and concepts, so as to introduce you to them and make you more familiar with them when they are explored in more detail in subsequent chapters.

pandas is also a part of a larger Python ecosystem of libraries that are useful for data analysis and science. While this book will focus only on pandas, there are other libraries that you will come across and that were introduced so you are familiar with them when they crop up.

We are ready to begin using pandas. In the next chapter, we will ease ourselves into pandas, starting with obtaining a Python and pandas environment and an overview of Jupyter notebooks, and then getting a quick introduction to pandas Series and DataFrame objects before delving into them in more depth in subsequent chapters.


Key benefits

  • Get comfortable using pandas and Python as an effective data exploration and analysis tool
  • Explore pandas through a framework of data analysis, with an explanation of how pandas is well suited for the various stages in a data analysis process
  • A comprehensive guide to pandas with many clear and practical examples to help you get up and running with pandas

Description

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance. With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Who is this book for?

This book is ideal for data scientists, data analysts, Python programmers who want to plunge into data analysis using pandas, and anyone with a curiosity about analyzing data. Some knowledge of statistics and programming will be helpful to get the most out of this book but not strictly required. Prior exposure to pandas is also not required.

What you will learn

  • Understand how data analysts and scientists think about the processes of gathering and understanding data
  • Learn how pandas can be used to support the end-to-end process of data analysis
  • Use pandas Series and DataFrame objects to represent single and multivariate data
  • Slice and dice data with pandas, and combine, group, and aggregate data from multiple sources
  • Access data from external sources such as files, databases, and web services
  • Represent and manipulate time-series data and the many intricacies involved with this type of data
  • Visualize statistical information
  • Use pandas to solve several common data representation and analysis problems within finance

Product Details

Publication date: Jun 30, 2017
Length: 446 pages
Edition: 2nd
Language: English
ISBN-13: 9781787123137



Table of Contents (15 chapters)

  1. pandas and Data Analysis
  2. Up and Running with pandas
  3. Representing Univariate Data with the Series
  4. Representing Tabular and Multivariate Data with the DataFrame
  5. Manipulating DataFrame Structure
  6. Indexing Data
  7. Categorical Data
  8. Numerical and Statistical Methods
  9. Accessing Data
  10. Tidying Up Your Data
  11. Combining, Relating, and Reshaping Data
  12. Data Aggregation
  13. Time-Series Modelling
  14. Visualization
  15. Historical Stock Price Analysis

Customer reviews

Rating: 3.5 out of 5 (2 ratings): 50% five-star, 50% two-star
Terry Letsche Sep 06, 2019
Rating: 5/5
This book was truly brand-new looking, and it arrived promptly.
Amazon verified review
Stephen Brown Aug 01, 2017
Rating: 2/5
I have the McKinney book ("Python for Data Analysis", first edition) and was getting this hoping that it filled in some of the gaps in that one, especially with some more advanced grouping/aggregation.

What I liked:

  • You can get the kindle version super cheap if you buy the hard copy
  • A little more up-to-date in a few places than the McKinney book (although the second edition of that one is coming out in Fall 2017)

What I didn't like:

  • Lots of typos and poor editing. Just as one example, in both the kindle and printed version, most of the figures in Chapter 6 are off by one from where they should be, so if you see Text1: Figure 1, Text2: Figure 2, then Text1 explicitly refers to the contents of Figure 1, but you have to skip down to the next figure (Figure 2) to see the actual correct figure. There are lots of other examples, but I just gave up even making note of them after a while. How embarrassing!!!
  • Tons of whitespace and huge fonts that are obviously used as filler (apparently this is typical of Packt books; I tend to stick with O'Reilly if possible, so I didn't know that going in)
  • Much more expensive than McKinney while not really covering any topics better, and a few not as well

I tried to write a balanced review, but I was honestly more disappointed in this book than almost any other technical book I've purchased in the past 20 years, especially given how expensive it is. Do yourself a favor and just stick with the McKinney book, especially when the second edition comes out in a few months.
Amazon verified review
