Chapter 7. Data Analysis Application Examples
In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.
While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.
Why is preparation necessary at all? Because most useful data will come from the real world and will have deficiencies, contain errors, or be fragmentary.
There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.
Here are a few data preparation scenarios:
- A client hands you three files, each containing time series data about a single geological phenomenon, but the observed data is recorded on different intervals and uses different separators
- A machine learning algorithm can only work with numeric data, but your input only contains text labels
- You are handed the raw logs of a web server of an up-and-coming service, and your task is to make suggestions on a growth strategy, based on existing visitor behavior
Data munging
The arsenal of tools for data munging is huge, and while we will focus on Python we want to mention some useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and, as a consequence, has over the last four decades developed many high-performance and battle-tested tools for dealing with text. Some common tools are sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr), or displaying only parts of files (head, tail).
We want to demonstrate the power of these tools with a single example only.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the common log server format (you can download this data set from http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html):
$ cat epa-http.txt
wpbfl2-45.gate.net [29:23:56:12] "GET /Access/ HTTP/1.0" 200 2376
ebaca.icsi.net [30:00:22:20] "GET /Info.html HTTP/1.0" 200 884
For instance, we want to know how often certain users have visited our site.
We are interested in the first column only, since this is where the IP address or hostname can be found. After that, we need to count the number of occurrences of each host and finally display the results in a friendly way.
The sort | uniq -c stanza is our workhorse here: it sorts the data first, and uniq -c saves the number of occurrences along with the value. The sort -nr | head -15 is our formatting part; we sort numerically (-n) and in reverse (-r), and keep only the top 15 entries.
Putting it all together with pipes:
$ cut -d ' ' -f 1 epa-http.txt | sort | uniq -c | sort -nr | head -15
294 sandy.rtptok1.epa.gov
292 e659229.boeing.com
266 wicdgserv.wic.epa.gov
263 keyhole.es.dupont.com
248 dwilson.pr.mcs.net
176 oea4.r8stw56.epa.gov
174 macip26.nacion.co.cr
172 dcimsd23.dcimsd.epa.gov
167 www-b1.proxy.aol.com
158 piweba3y.prodigy.com
152 wictrn13.dcwictrn.epa.gov
151 nntp1.reach.com
151 inetg1.arco.com
149 canto04.nmsu.edu
146 weisman.metrokc.gov
With one command, we converted a sequential server log into an ordered list of the most common hosts that visited our site. We also see that there do not seem to be large differences in the number of visits among our top users.
There are more little helpful tools, of which the following are just a tiny selection:

- csvkit: a suite of utilities for working with CSV, the king of tabular file formats
- jq: a lightweight and flexible command-line JSON processor
- xmlstarlet: a tool that supports XML queries with XPath, among other things
- q: a tool that runs SQL on text files
Where the UNIX command line ends, lightweight languages take over. You might be able to get an impression from text only, but your colleagues might appreciate visual representations, such as charts or pretty graphs, generated by matplotlib, much more.
Python and its data tools ecosystem are much more versatile than the command line, but for first explorations and simple operations the effectiveness of the command line is often unbeatable.
Cleaning data
Most real-world data will have some defects and therefore will need to go through a cleaning step first. We start with a small file. Although this file contains only four rows, it will allow us to demonstrate the process up to a cleaned data set:
$ cat small.csv
22,6.1
41,5.7
18,5.3*
29,NA
Note that this file has a few issues. The lines that contain values are all comma-separated, but we have missing (NA) and probably unclean (5.3*) values. We can load this file into a data frame, nevertheless:
>>> import pandas as pd
>>> df = pd.read_csv("small.csv")
>>> df
22 6.1
0 41 5.7
1 18 5.3*
2 29 NaN
Pandas used the first row as the header, but this is not what we want:
>>> df = pd.read_csv("small.csv", header=None)
>>> df
0 1
0 22 6.1
1 41 5.7
2 18 5.3*
3 29 NaN
This is better, but instead of numeric values, we would like to supply our own column names:
>>> df = pd.read_csv("small.csv", names=["age", "height"])
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3*
3 29 NaN
The age column looks good, since Pandas already inferred the intended type, but the height column cannot be parsed into numeric values yet:
>>> df.age.dtype
dtype('int64')
>>> df.height.dtype
dtype('O')
If we try to coerce the height column into float values, Pandas will report an exception:
>>> df.height.astype('float')
ValueError: invalid literal for float(): 5.3*
We could use whatever value is parseable as a float and throw away the rest with the convert_objects method:
>>> df.height.convert_objects(convert_numeric=True)
0 6.1
1 5.7
2 NaN
3 NaN
Name: height, dtype: float64
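Note that convert_objects has since been deprecated and removed in newer versions of Pandas. If you are working with a recent release, pd.to_numeric with errors="coerce" achieves the same effect: everything parseable becomes a float, and the rest becomes NaN. A minimal sketch of the modern equivalent:

```python
import pandas as pd

# The same height values as in the example above
height = pd.Series(["6.1", "5.7", "5.3*", "NA"])

# errors="coerce" turns unparseable entries into NaN instead of raising
cleaned = pd.to_numeric(height, errors="coerce")
print(cleaned)
```

As with convert_objects, the unclean value 5.3* and the missing value NA both end up as NaN.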
If we know in advance the undesirable characters in our data set, we can augment the read_csv method with a custom converter function:
>>> remove_stars = lambda s: s.replace("*", "")
>>> df = pd.read_csv("small.csv", names=["age", "height"],
converters={"height": remove_stars})
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 NA
Now we can finally make the height column a bit more useful. We can assign it the updated version, which has the favored type:
>>> df.height = df.height.convert_objects(convert_numeric=True)
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 NaN
If we wanted to only keep the complete entries, we could drop any row that contains undefined values:
>>> df.dropna()
age height
0 22 6.1
1 41 5.7
2 18 5.3
We could use a default height, maybe a fixed value:
>>> df.fillna(5.0)
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 5.0
On the other hand, we could also use the average of the existing values:
>>> df.fillna(df.height.mean())
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 5.7
The last three data frames are complete and correct, depending on your definition of correct when dealing with missing values. In particular, the columns have the requested types and are ready for further analysis. Which of the data frames is best suited will depend on the task at hand.
Filtering
Even if we have clean and probably correct data, we might want to use only parts of it or we might want to check for outliers. An outlier is an observation point that is distant from other observations because of variability or measurement errors. In both cases, we want to reduce the number of elements in our data set to make it more relevant for further processing.
In this example, we will try to find potential outliers. We will use the Europe Brent Crude Oil Spot Price as recorded by the U.S. Energy Information Administration. The raw Excel data is available from http://www.eia.gov/dnav/pet/hist_xls/rbrted.xls (it can be found in the second worksheet). We cleaned the data slightly (the cleaning process is part of an exercise at the end of this chapter) and will work with the following data frame, containing 7160 entries, ranging from 1987 to 2015:
>>> df.head()
         date  price
0  1987-05-20  18.63
1  1987-05-21  18.45
2  1987-05-22  18.55
3  1987-05-25  18.60
4  1987-05-26  18.63
>>> df.tail()
            date  price
7155  2015-08-04  49.08
7156  2015-08-05  49.04
7157  2015-08-06  47.80
7158  2015-08-07  47.54
7159  2015-08-10  48.30
While many people know about oil prices – be it from the news or the filling station – let us forget anything we know about it for a minute. We could first ask for the extremes:
>>> df[df.price==df.price.min()]
            date  price
2937  1998-12-10    9.1
>>> df[df.price==df.price.max()]
            date   price
5373  2008-07-03  143.95
Another way to find potential outliers would be to ask for the values that deviate most from the mean. We can use the np.abs function to calculate the deviation from the mean first:

>>> np.abs(df.price - df.price.mean())
0       26.17137
1       26.35137
...
7157     2.99863
7158     2.73863
7159     3.49863
We can now compare this deviation against a multiple – we choose 2.5 – of the standard deviation:
>>> import numpy as np
>>> df[np.abs(df.price - df.price.mean()) > 2.5 * df.price.std()]
            date   price
5354  2008-06-06  132.81
5355  2008-06-09  134.43
5356  2008-06-10  135.24
5357  2008-06-11  134.52
5358  2008-06-12  132.11
5359  2008-06-13  134.29
5360  2008-06-16  133.90
5361  2008-06-17  131.27
5363  2008-06-19  131.84
5364  2008-06-20  134.28
5365  2008-06-23  134.54
5366  2008-06-24  135.37
5367  2008-06-25  131.59
5368  2008-06-26  136.82
5369  2008-06-27  139.38
5370  2008-06-30  138.40
5371  2008-07-01  140.67
5372  2008-07-02  141.24
5373  2008-07-03  143.95
5374  2008-07-07  139.62
5375  2008-07-08  134.15
5376  2008-07-09  133.91
5377  2008-07-10  135.81
5378  2008-07-11  143.68
5379  2008-07-14  142.43
5380  2008-07-15  136.02
5381  2008-07-16  133.31
5382  2008-07-17  134.16
We see that those few days in summer 2008 must have been special. Sure enough, it is not difficult to find articles and essays with titles like Causes and Consequences of the Oil Shock of 2007–08. We have discovered a trace to these events solely by looking at the data.
We could ask the above question for each decade separately. We first make our data frame look more like a time series:
>>> df.index = df.date
>>> del df["date"]
>>> df.head()
price
date
1987-05-20 18.63
1987-05-21 18.45
1987-05-22 18.55
1987-05-25 18.60
1987-05-26 18.63
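One detail worth noting: slicing with year strings, as in the next examples, requires the index to be a DatetimeIndex. If the date column was read in as plain strings, it should be converted with pd.to_datetime first. A sketch with a hypothetical three-row version of the oil price frame:

```python
import pandas as pd

# A tiny stand-in for the real oil price data set
df = pd.DataFrame({
    "date": ["1988-10-03", "1990-09-24", "2008-07-03"],
    "price": [11.60, 40.75, 143.95],
})

# Convert the string dates to a DatetimeIndex so that
# partial string slicing (e.g. by year) works
df.index = pd.to_datetime(df["date"])
del df["date"]

# Selects only the rows falling in 1988-1989
print(df["1988":"1989"])
```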
We could filter out the eighties:
>>> decade = df["1980":"1989"]
>>> decade[np.abs(decade.price - decade.price.mean()) > 2.5 * decade.price.std()]
price
date
1988-10-03 11.60
1988-10-04 11.65
1988-10-05 11.20
1988-10-06 11.30
1988-10-07 11.35
We observe that within the data available (1987–1989), the fall of 1988 exhibits a slight spike in oil prices. Similarly, during the nineties, we see a larger deviation in the fall of 1990:
>>> decade = df["1990":"1999"]
>>> decade[np.abs(decade.price - decade.price.mean()) > 5 * decade.price.std()]
price
date
1990-09-24 40.75
1990-09-26 40.85
1990-09-27 41.45
1990-09-28 41.00
1990-10-09 40.90
1990-10-10 40.20
1990-10-11 41.15
There are many more use cases for filtering data. Space and time are typical units: you might want to filter census data by state or city, or economic data by quarter. The possibilities are endless and will be driven by your project.
Merging data
The situation is common: you have multiple data sources, but in order to make statements about the content, you would rather combine them. Fortunately, Pandas' concatenation and merge functions abstract away most of the pain when combining, joining, or aligning data, and do so in a highly optimized manner as well.
In a case where two data frames have a similar shape, it might be useful to just append one after the other. Maybe A
and B
are products and one data frame contains the number of items sold per product in a store:
>>> df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df1
A B
0 1 4
1 2 5
2 3 6
>>> df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})
>>> df2
A B
0 4 7
1 5 8
2 6 9
>>> df1.append(df2)
A B
0 1 4
1 2 5
2 3 6
0 4 7
1 5 8
2 6 9
Sometimes, we won't care about the indices of the originating data frames:
>>> df1.append(df2, ignore_index=True)
A B
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
5 6 9
A more flexible way to combine objects is offered by the pd.concat function, which takes an arbitrary number of series, data frames, or panels as input. The default behavior resembles an append:
>>> pd.concat([df1, df2])
A B
0 1 4
1 2 5
2 3 6
0 4 7
1 5 8
2 6 9
The default concat operation appends both frames along the rows – or index, which corresponds to axis 0. To concatenate along the columns, we can pass in the axis keyword argument:
>>> pd.concat([df1, df2], axis=1)
A B A B
0 1 4 4 7
1 2 5 5 8
2 3 6 6 9
We can add keys to create a hierarchical index.
>>> pd.concat([df1, df2], keys=['UK', 'DE'])
A B
UK 0 1 4
1 2 5
2 3 6
DE 0 4 7
1 5 8
2 6 9
This can be useful if you want to refer back to parts of the data frame later. We use the ix indexer:
>>> df3 = pd.concat([df1, df2], keys=['UK', 'DE'])
>>> df3.ix["UK"]
A B
0 1 4
1 2 5
2 3 6
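Note that the ix indexer has been deprecated and removed in newer versions of Pandas; for label-based lookups, loc is the modern replacement. A sketch under that assumption:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})
df3 = pd.concat([df1, df2], keys=['UK', 'DE'])

# .loc selects the sub-frame belonging to the outer level
# of the hierarchical index, just as .ix did in older Pandas
print(df3.loc["UK"])
```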
Data frames resemble database tables. It is therefore not surprising that Pandas implements SQL-like join operations on them. What is positively surprising is that these operations are highly optimized and extremely fast:
>>> import numpy as np
>>> df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': range(4)})
>>> df1
key value
0 A 0
1 B 1
2 C 2
3 D 3
>>> df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': range(10, 14)})
>>> df2
key value
0 B 10
1 D 11
2 D 12
3 E 13
If we merge on key, we get an inner join. This creates a new data frame by combining the column values of the original data frames based on the join predicate; here, the key attribute is used:
>>> df1.merge(df2, on='key')
key value_x value_y
0 B 1 10
1 D 3 11
2 D 3 12
A left, right, or full join can be specified with the how parameter:
>>> df1.merge(df2, on='key', how='left')
key value_x value_y
0 A 0 NaN
1 B 1 10
2 C 2 NaN
3 D 3 11
4 D 3 12
>>> df1.merge(df2, on='key', how='right')
key value_x value_y
0 B 1 10
1 D 3 11
2 D 3 12
3 E NaN 13
>>> df1.merge(df2, on='key', how='outer')
key value_x value_y
0 A 0 NaN
1 B 1 10
2 C 2 NaN
3 D 3 11
4 D 3 12
5 E NaN 13
The following table shows the merge methods in comparison with SQL:

Merge Method | SQL Join Name | Description
---|---|---
left | LEFT OUTER JOIN | Use keys from the left frame only.
right | RIGHT OUTER JOIN | Use keys from the right frame only.
outer | FULL OUTER JOIN | Use a union of keys from both frames.
inner | INNER JOIN | Use an intersection of keys from both frames.
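This mapping can be checked directly. Passing indicator=True (available in newer Pandas versions) adds a _merge column showing which frame each key came from, which makes the join semantics visible. A small sketch with the frames from above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': range(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': range(10, 14)})

# An outer join keeps the union of keys; the _merge column
# reports whether a key came from the left, right, or both frames
merged = df1.merge(df2, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])
```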
Reshaping data
We saw how to combine data frames but sometimes we have all the right data in a single data structure, but the format is impractical for certain tasks. We start again with some artificial weather data:
>>> df
date city value
0 2000-01-03 London 6
1 2000-01-04 London 3
2 2000-01-05 London 4
3 2000-01-03 Mexico 3
4 2000-01-04 Mexico 9
5 2000-01-05 Mexico 8
6 2000-01-03 Mumbai 12
7 2000-01-04 Mumbai 9
8 2000-01-05 Mumbai 8
9 2000-01-03 Tokyo 5
10 2000-01-04 Tokyo 5
11 2000-01-05 Tokyo 6
If we want to calculate the maximum temperature per city, we could just group the data by city and then apply the max function:
>>> df.groupby('city').max()
date value
city
London 2000-01-05 6
Mexico 2000-01-05 9
Mumbai 2000-01-05 12
Tokyo 2000-01-05 6
However, if we have to bring our data into this form every time, we could be a little more effective by first creating a reshaped data frame, with the dates as the index and the cities as columns.
We can create such a data frame with the pivot function. The arguments are the index (we use date), the columns (we use the cities), and the values (which are stored in the value column of the original data frame):
>>> pv = df.pivot("date", "city", "value")
>>> pv
city        London  Mexico  Mumbai  Tokyo
date
2000-01-03 6 3 12 5
2000-01-04 3 9 9 5
2000-01-05 4 8 8 6
We can use the max function on this new data frame directly:
>>> pv.max()
city
London 6
Mexico 9
Mumbai 12
Tokyo 6
dtype: int64
With a more suitable shape, other operations become easier as well. For example, to find the maximum temperature per day, we can simply provide an additional axis argument:
>>> pv.max(axis=1)
date
2000-01-03 12
2000-01-04 9
2000-01-05 8
dtype: int64
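One caveat not shown above: pivot raises an error if an index/column pair occurs more than once. In that case, the related pivot_table function can be used instead; it resolves duplicates by aggregating them (with the mean, by default). A sketch with hypothetical duplicate readings:

```python
import pandas as pd

# Made-up data with two readings for London on the same day
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04"],
    "city": ["London", "London", "London"],
    "value": [6, 8, 3],
})

# pivot would raise here; pivot_table averages the duplicates
pv = df.pivot_table(index="date", columns="city", values="value")
print(pv)
```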
Data aggregation
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements of the number of sunshine hours per city and date:
>>> df.head()
   country     city        date  hours
0  Germany  Hamburg  2015-06-01      8
1  Germany  Hamburg  2015-06-02     10
2  Germany  Hamburg  2015-06-03      9
3  Germany  Hamburg  2015-06-04      7
4  Germany  Hamburg  2015-06-05      3
To view a summary per city, we use the describe function on the grouped data set:
>>> df.groupby("city").describe()
                       hours
city
Berlin     count  10.000000
           mean    6.000000
           std     3.741657
           min     0.000000
           25%     4.000000
           50%     6.000000
           75%     9.750000
           max    10.000000
Birmingham count  10.000000
           mean    5.100000
           std     2.078995
           min     2.000000
           25%     4.000000
           50%     5.500000
           75%     6.750000
           max     8.000000
On certain data sets, it can be useful to group by more than one attribute. We can get an overview of the sunny hours per country and date by passing in two column names:
>>> df.groupby(["country", "date"]).describe()
                             hours
country date
France  2015-06-01 count  5.000000
                   mean   6.200000
                   std    1.095445
                   min    5.000000
                   25%    5.000000
                   50%    7.000000
                   75%    7.000000
                   max    7.000000
        2015-06-02 count  5.000000
                   mean   3.600000
                   std    3.577709
                   min    0.000000
                   25%    0.000000
                   50%    4.000000
                   75%    6.000000
                   max    8.000000
...
UK      2015-06-07 std    3.872983
                   min    0.000000
                   25%    2.000000
                   50%    6.000000
                   75%    8.000000
                   max    9.000000
We can compute single statistics as well:
>>> df.groupby("city").mean()
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
Hamburg 5.5
Leipzig 5.0
London 4.8
Lyon 5.0
Manchester 5.2
Marseille 6.2
Munich 6.6
Nice 3.9
Paris 6.3
Finally, we can define any function to be applied to the groups with the agg method. The above could have been written in terms of agg like this:
>>> df.groupby("city").agg(np.mean)
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
...
But arbitrary functions are possible. As a last example, we define a custom function, which takes a series object as input and computes the difference between the smallest and the largest element:
>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s)))
hours
city
Berlin 10
Birmingham 6
Bordeax 10
Edinburgh 8
Frankfurt 9
Glasgow 10
Hamburg 10
Leipzig 9
London 10
Lyon 8
Manchester 10
Marseille 10
Munich 9
Nice 10
Paris 9
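As an aside, the same range statistic exists in NumPy as np.ptp ("peak to peak", i.e. max minus min), so the lambda above could also be expressed with a library function. A sketch with a hypothetical mini data set:

```python
import numpy as np
import pandas as pd

# Made-up sunshine hours for two cities
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg", "Hamburg"],
    "hours": [0, 10, 3, 8],
})

# np.ptp computes max - min per group,
# matching the custom lambda in the example above
ranges = df.groupby("city")["hours"].agg(np.ptp)
print(ranges)
```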
Grouping data
One typical workflow during data exploration looks as follows:

- You find a criterion that you want to use to group your data. Maybe you have GDP data for every country along with the continent, and you would like to ask questions about the continents.
- These questions usually lead to some function applications – you might want to compute the mean GDP per continent.
- Finally, you want to store this data for further processing in a new data structure.

We use a simpler example here. Imagine some fictional weather data about the number of sunny hours per day and city:
>>> df
          date    city  value
0   2000-01-03  London      6
1   2000-01-04  London      3
2   2000-01-05  London      4
3   2000-01-03  Mexico      3
4   2000-01-04  Mexico      9
5   2000-01-05  Mexico      8
6   2000-01-03  Mumbai     12
7   2000-01-04  Mumbai      9
8   2000-01-05  Mumbai      8
9   2000-01-03   Tokyo      5
10  2000-01-04   Tokyo      5
11  2000-01-05   Tokyo      6
The groups attribute returns a dictionary containing the unique groups and the corresponding values as axis labels:

>>> df.groupby("city").groups
{'London': [0, 1, 2], 'Mexico': [3, 4, 5],
 'Mumbai': [6, 7, 8], 'Tokyo': [9, 10, 11]}
Although the result of a groupby is a GroupBy object, not a DataFrame, we can use the usual indexing notation to refer to columns:

>>> grouped = df.groupby("city")
>>> grouped["value"].max()
city
London     6
Mexico     9
Mumbai    12
Tokyo      6
Name: value, dtype: int64
>>> grouped["value"].sum()
city
London    13
Mexico    20
Mumbai    29
Tokyo     16
Name: value, dtype: int64
We see that, according to our data set, Mumbai seems to be a sunny city. An alternative – and more verbose – way to achieve the above would be:

>>> df['value'].groupby(df['city']).sum()
city
London    13
Mexico    20
Mumbai    29
Tokyo     16
Name: value, dtype: int64
Summary
In this chapter, we have looked at ways to manipulate data frames, from cleaning and filtering to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy, and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.
The next chapter gives a brief overview of machine learning algorithms, which apply the results of data analysis to make decisions or build helpful products.
Practice exercises
Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.
Hint: There are many ways to do this. We used a small tool called xls2csv.py, and we were able to load the resulting CSV file with a helper method:
import datetime
import pandas as pd

def convert_date(s):
    parts = s.replace("(", "").replace(")", "").split(",")
    if len(parts) < 6:
        return datetime.date(1970, 1, 1)
    return datetime.datetime(*[int(p) for p in parts])

df = pd.read_csv("RBRTEd.csv", sep=',', names=["date", "price"],
                 converters={"date": convert_date}).dropna()
Take a data set that is important for your work – or if you do not have any at hand, a data set that interests you and that is available online. Ask one or two questions about the data in advance. Then use cleaning, filtering, grouping, and plotting techniques to answer your question.