Chapter 7. Data Analysis Application Examples
In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.
While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.
Why is preparation necessary at all? Because most useful data will come from the real world and will have deficiencies, contain errors, or be fragmentary.
There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.
Here are a few data preparation scenarios:
- A client hands you three files, each containing time series data about a single geological phenomenon, but the observed data is recorded on different intervals and uses different separators
- A machine learning algorithm can only work with numeric data, but your input only contains text labels
- You are handed the raw logs of a web server of an up-and-coming service, and your task is to make suggestions on a growth strategy, based on existing visitor behavior
Data munging
The arsenal of tools for data munging is huge, and while we will focus on Python we want to mention some useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and, as a consequence, has over the last four decades developed many high-performance and battle-tested tools for dealing with text. Some common tools are sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr), or displaying only parts of files (head, tail).
We want to demonstrate the power of these tools with a single example only.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the common log server format (you can download this data set from http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html):
$ cat epa-http.txt
wpbfl2-45.gate.net [29:23:56:12] "GET /Access/ HTTP/1.0" 200 2376
ebaca.icsi.net [30:00:22:20] "GET /Info.html HTTP/1.0" 200 884
For instance, we want to know how often certain users have visited our site.
We are interested in the first column only, since this is where the IP address or hostname can be found. After that, we need to count the number of occurrences of each host and finally display the results in a friendly way.
The sort | uniq -c stanza is our workhorse here: it sorts the data first, and uniq -c saves the number of occurrences along with the value. The sort -nr | head -15 is our formatting part; we sort numerically (-n) and in reverse (-r), and keep only the top 15 entries.
Putting it all together with pipes:
$ cut -d ' ' -f 1 epa-http.txt | sort | uniq -c | sort -nr | head -15
294 sandy.rtptok1.epa.gov
292 e659229.boeing.com
266 wicdgserv.wic.epa.gov
263 keyhole.es.dupont.com
248 dwilson.pr.mcs.net
176 oea4.r8stw56.epa.gov
174 macip26.nacion.co.cr
172 dcimsd23.dcimsd.epa.gov
167 www-b1.proxy.aol.com
158 piweba3y.prodigy.com
152 wictrn13.dcwictrn.epa.gov
151 nntp1.reach.com
151 inetg1.arco.com
149 canto04.nmsu.edu
146 weisman.metrokc.gov
With one command, we converted a sequential server log into an ordered list of the most common hosts that visited our site. We also see that there do not seem to be large differences in the number of visits among our top users.
There are more little helpful tools, of which the following are just a tiny selection:

- csvkit: a suite of utilities for working with CSV, the king of tabular file formats
- jq: a lightweight and flexible command-line JSON processor
- xmlstarlet: a tool that supports XML queries with XPath, among other things
- q: a tool that runs SQL on text files
Where the UNIX command line ends, lightweight languages take over. You might be able to get an impression from text only, but your colleagues might appreciate visual representations, such as charts or pretty graphs, generated by matplotlib, much more.
Python and its data tools ecosystem are much more versatile than the command line, but for first explorations and simple operations the effectiveness of the command line is often unbeatable.
Cleaning data
Most real-world data will have some defects and therefore will need to go through a cleaning step first. We start with a small file. Although this file contains only four rows, it will allow us to demonstrate the process up to a cleaned data set:
$ cat small.csv
22,6.1
41,5.7
18,5.3*
29,NA
Note that this file has a few issues. The lines that contain values are all comma-separated, but we have missing (NA) and probably unclean (5.3*) values. We can load this file into a data frame, nevertheless:
>>> import pandas as pd
>>> df = pd.read_csv("small.csv")
>>> df
22 6.1
0 41 5.7
1 18 5.3*
2 29 NaN
Pandas used the first row as the header, but this is not what we want:
>>> df = pd.read_csv("small.csv", header=None)
>>> df
0 1
0 22 6.1
1 41 5.7
2 18 5.3*
3 29 NaN
This is better, but instead of numeric values, we would like to supply our own column names:
>>> df = pd.read_csv("small.csv", names=["age", "height"])
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3*
3 29 NaN
The age column looks good, since Pandas already inferred the intended type, but the height column cannot be parsed into numeric values yet:
>>> df.age.dtype
dtype('int64')
>>> df.height.dtype
dtype('O')
If we try to coerce the height column into float values, Pandas will report an exception:
>>> df.height.astype('float')
ValueError: invalid literal for float(): 5.3*
We could use whatever value is parseable as a float and throw away the rest with the convert_objects method:
>>> df.height.convert_objects(convert_numeric=True)
0 6.1
1 5.7
2 NaN
3 NaN
Name: height, dtype: float64
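Note that convert_objects has since been deprecated and removed in newer versions of Pandas. If you are working with a recent release, pd.to_numeric with errors="coerce" achieves the same effect: everything parseable becomes a float, and the rest becomes NaN. A minimal sketch of the modern equivalent:

```python
import pandas as pd

# The same height values as in the example above
height = pd.Series(["6.1", "5.7", "5.3*", "NA"])

# errors="coerce" turns unparseable entries into NaN instead of raising
cleaned = pd.to_numeric(height, errors="coerce")
print(cleaned)
```

As with convert_objects, the unclean value 5.3* and the missing value NA both end up as NaN.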
If we know in advance the undesirable characters in our data set, we can augment the read_csv method with a custom converter function:
>>> remove_stars = lambda s: s.replace("*", "")
>>> df = pd.read_csv("small.csv", names=["age", "height"],
converters={"height": remove_stars})
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 NA
Now we can finally make the height column a bit more useful. We can assign it the updated version, which has the favored type:
>>> df.height = df.height.convert_objects(convert_numeric=True)
>>> df
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 NaN
If we wanted to only keep the complete entries, we could drop any row that contains undefined values:
>>> df.dropna()
age height
0 22 6.1
1 41 5.7
2 18 5.3
We could use a default height, maybe a fixed value:
>>> df.fillna(5.0)
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 5.0
On the other hand, we could also use the average of the existing values:
>>> df.fillna(df.height.mean())
age height
0 22 6.1
1 41 5.7
2 18 5.3
3 29 5.7
The last three data frames are complete and correct, depending on your definition of correct when dealing with missing values. In particular, the columns have the requested types and are ready for further analysis. Which of the data frames is best suited will depend on the task at hand.
Filtering
Even if we have clean and probably correct data, we might want to use only parts of it or we might want to check for outliers. An outlier is an observation point that is distant from other observations because of variability or measurement errors. In both cases, we want to reduce the number of elements in our data set to make it more relevant for further processing.
In this example, we will try to find potential outliers. We will use the Europe Brent Crude Oil Spot Price as recorded by the U.S. Energy Information Administration. The raw Excel data is available from http://www.eia.gov/dnav/pet/hist_xls/rbrted.xls (it can be found in the second worksheet). We cleaned the data slightly (the cleaning process is part of an exercise at the end of this chapter) and will work with the following data frame, containing 7160 entries, ranging from 1987 to 2015:
>>> df.head()
         date  price
0  1987-05-20  18.63
1  1987-05-21  18.45
2  1987-05-22  18.55
3  1987-05-25  18.60
4  1987-05-26  18.63
>>> df.tail()
            date  price
7155  2015-08-04  49.08
7156  2015-08-05  49.04
7157  2015-08-06  47.80
7158  2015-08-07  47.54
7159  2015-08-10  48.30
While many people know about oil prices – be it from the news or the filling station – let us forget anything we know about it for a minute. We could first ask for the extremes:
>>> df[df.price==df.price.min()]
            date  price
2937  1998-12-10    9.1
>>> df[df.price==df.price.max()]
            date   price
5373  2008-07-03  143.95
Another way to find potential outliers would be to ask for the values that deviate most from the mean. We can use the np.abs function to calculate the deviation from the mean first:

>>> np.abs(df.price - df.price.mean())
0       26.17137
1       26.35137
...
7157     2.99863
7158     2.73863
7159     3.49863
We can now compare this deviation against a multiple – we choose 2.5 – of the standard deviation:
>>> import numpy as np
>>> df[np.abs(df.price - df.price.mean()) > 2.5 * df.price.std()]
            date   price
5354  2008-06-06  132.81
5355  2008-06-09  134.43
5356  2008-06-10  135.24
5357  2008-06-11  134.52
5358  2008-06-12  132.11
5359  2008-06-13  134.29
5360  2008-06-16  133.90
5361  2008-06-17  131.27
5363  2008-06-19  131.84
5364  2008-06-20  134.28
5365  2008-06-23  134.54
5366  2008-06-24  135.37
5367  2008-06-25  131.59
5368  2008-06-26  136.82
5369  2008-06-27  139.38
5370  2008-06-30  138.40
5371  2008-07-01  140.67
5372  2008-07-02  141.24
5373  2008-07-03  143.95
5374  2008-07-07  139.62
5375  2008-07-08  134.15
5376  2008-07-09  133.91
5377  2008-07-10  135.81
5378  2008-07-11  143.68
5379  2008-07-14  142.43
5380  2008-07-15  136.02
5381  2008-07-16  133.31
5382  2008-07-17  134.16
We see that those few days in summer 2008 must have been special. Sure enough, it is not difficult to find articles and essays with titles like Causes and Consequences of the Oil Shock of 2007–08. We have discovered a trace to these events solely by looking at the data.
We could ask the above question for each decade separately. We first make our data frame look more like a time series:
>>> df.index = df.date
>>> del df["date"]
>>> df.head()
price
date
1987-05-20 18.63
1987-05-21 18.45
1987-05-22 18.55
1987-05-25 18.60
1987-05-26 18.63
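One detail worth noting: slicing with year strings, as in the next examples, requires the index to be a DatetimeIndex. If the date column was read in as plain strings, it should be converted with pd.to_datetime first. A sketch with a hypothetical three-row version of the oil price frame:

```python
import pandas as pd

# A tiny stand-in for the real oil price data set
df = pd.DataFrame({
    "date": ["1988-10-03", "1990-09-24", "2008-07-03"],
    "price": [11.60, 40.75, 143.95],
})

# Convert the string dates to a DatetimeIndex so that
# partial string slicing (e.g. by year) works
df.index = pd.to_datetime(df["date"])
del df["date"]

# Selects only the rows falling in 1988-1989
print(df["1988":"1989"])
```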
We could filter out the eighties:
>>> decade = df["1980":"1989"]
>>> decade[np.abs(decade.price - decade.price.mean()) > 2.5 * decade.price.std()]
price
date
1988-10-03 11.60
1988-10-04 11.65
1988-10-05 11.20
1988-10-06 11.30
1988-10-07 11.35
We observe that within the data available (1987–1989), the fall of 1988 exhibits a slight spike in oil prices. Similarly, during the nineties, we see a larger deviation in the fall of 1990:
>>> decade = df["1990":"1999"]
>>> decade[np.abs(decade.price - decade.price.mean()) > 5 * decade.price.std()]
price
date
1990-09-24 40.75
1990-09-26 40.85
1990-09-27 41.45
1990-09-28 41.00
1990-10-09 40.90
1990-10-10 40.20
1990-10-11 41.15
There are many more use cases for filtering data. Space and time are typical units: you might want to filter census data by state or city, or economic data by quarter. The possibilities are endless and will be driven by your project.
Merging data
The situation is common: you have multiple data sources, but in order to make statements about the content, you would rather combine them. Fortunately, Pandas' concatenation and merge functions abstract away most of the pain when combining, joining, or aligning data, and do so in a highly optimized manner as well.
In a case where two data frames have a similar shape, it might be useful to just append one after the other. Maybe A
and B
are products and one data frame contains the number of items sold per product in a store:
>>> df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df1
A B
0 1 4
1 2 5
2 3 6
>>> df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})
>>> df2
A B
0 4 7
1 5 8
2 6 9
>>> df1.append(df2)
A B
0 1 4
1 2 5
2 3 6
0 4 7
1 5 8
2 6 9
Sometimes, we won't care about the indices of the originating data frames:
>>> df1.append(df2, ignore_index=True)
A B
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
5 6 9
A more flexible way to combine objects is offered by the pd.concat function, which takes an arbitrary number of series, data frames, or panels as input. The default behavior resembles an append:
>>> pd.concat([df1, df2])
A B
0 1 4
1 2 5
2 3 6
0 4 7
1 5 8
2 6 9
The default concat operation appends both frames along the rows – or index, which corresponds to axis 0. To concatenate along the columns, we can pass in the axis keyword argument:
>>> pd.concat([df1, df2], axis=1)
A B A B
0 1 4 4 7
1 2 5 5 8
2 3 6 6 9
We can add keys to create a hierarchical index.
>>> pd.concat([df1, df2], keys=['UK', 'DE'])
A B
UK 0 1 4
1 2 5
2 3 6
DE 0 4 7
1 5 8
2 6 9
This can be useful if you want to refer back to parts of the data frame later. We use the ix indexer:
>>> df3 = pd.concat([df1, df2], keys=['UK', 'DE'])
>>> df3.ix["UK"]
A B
0 1 4
1 2 5
2 3 6
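Note that the ix indexer has been deprecated and removed in newer versions of Pandas; for label-based lookups, loc is the modern replacement. A sketch under that assumption:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})
df3 = pd.concat([df1, df2], keys=['UK', 'DE'])

# .loc selects the sub-frame belonging to the outer level
# of the hierarchical index, just as .ix did in older Pandas
print(df3.loc["UK"])
```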
Data frames resemble database tables. It is therefore not surprising that Pandas implements SQL-like join operations on them. What is positively surprising is that these operations are highly optimized and extremely fast:
>>> import numpy as np
>>> df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': range(4)})
>>> df1
key value
0 A 0
1 B 1
2 C 2
3 D 3
>>> df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': range(10, 14)})
>>> df2
key value
0 B 10
1 D 11
2 D 12
3 E 13
If we merge on key, we get an inner join. This creates a new data frame by combining the column values of the original data frames based on the join predicate; here, the key attribute is used:
>>> df1.merge(df2, on='key')
key value_x value_y
0 B 1 10
1 D 3 11
2 D 3 12
A left, right, or full join can be specified with the how parameter:
>>> df1.merge(df2, on='key', how='left')
key value_x value_y
0 A 0 NaN
1 B 1 10
2 C 2 NaN
3 D 3 11
4 D 3 12
>>> df1.merge(df2, on='key', how='right')
key value_x value_y
0 B 1 10
1 D 3 11
2 D 3 12
3 E NaN 13
>>> df1.merge(df2, on='key', how='outer')
key value_x value_y
0 A 0 NaN
1 B 1 10
2 C 2 NaN
3 D 3 11
4 D 3 12
5 E NaN 13
The following table shows the merge methods in comparison with SQL:

Merge Method | SQL Join Name | Description
---|---|---
left | LEFT OUTER JOIN | Use keys from the left frame only.
right | RIGHT OUTER JOIN | Use keys from the right frame only.
outer | FULL OUTER JOIN | Use a union of keys from both frames.
inner | INNER JOIN | Use an intersection of keys from both frames.
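This mapping can be checked directly. Passing indicator=True (available in newer Pandas versions) adds a _merge column showing which frame each key came from, which makes the join semantics visible. A small sketch with the frames from above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': range(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': range(10, 14)})

# An outer join keeps the union of keys; the _merge column
# reports whether a key came from the left, right, or both frames
merged = df1.merge(df2, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])
```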
Reshaping data
We saw how to combine data frames but sometimes we have all the right data in a single data structure, but the format is impractical for certain tasks. We start again with some artificial weather data:
>>> df
date city value
0 2000-01-03 London 6
1 2000-01-04 London 3
2 2000-01-05 London 4
3 2000-01-03 Mexico 3
4 2000-01-04 Mexico 9
5 2000-01-05 Mexico 8
6 2000-01-03 Mumbai 12
7 2000-01-04 Mumbai 9
8 2000-01-05 Mumbai 8
9 2000-01-03 Tokyo 5
10 2000-01-04 Tokyo 5
11 2000-01-05 Tokyo 6
If we want to calculate the maximum temperature per city, we could just group the data by city and then apply the max function:
>>> df.groupby('city').max()
date value
city
London 2000-01-05 6
Mexico 2000-01-05 9
Mumbai 2000-01-05 12
Tokyo 2000-01-05 6
However, if we have to bring our data into this form every time, we could be a little more effective by first creating a reshaped data frame, with the dates as the index and the cities as columns.
We can create such a data frame with the pivot function. The arguments are the index (we use date), the columns (we use the cities), and the values (which are stored in the value column of the original data frame):
>>> pv = df.pivot("date", "city", "value")
>>> pv
city        London  Mexico  Mumbai  Tokyo
date
2000-01-03 6 3 12 5
2000-01-04 3 9 9 5
2000-01-05 4 8 8 6
We can use the max function on this new data frame directly:
>>> pv.max()
city
London 6
Mexico 9
Mumbai 12
Tokyo 6
dtype: int64
With a more suitable shape, other operations become easier as well. For example, to find the maximum temperature per day, we can simply provide an additional axis argument:
>>> pv.max(axis=1)
date
2000-01-03 12
2000-01-04 9
2000-01-05 8
dtype: int64
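One caveat not shown above: pivot raises an error if an index/column pair occurs more than once. In that case, the related pivot_table function can be used instead; it resolves duplicates by aggregating them (with the mean, by default). A sketch with hypothetical duplicate readings:

```python
import pandas as pd

# Made-up data with two readings for London on the same day
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04"],
    "city": ["London", "London", "London"],
    "value": [6, 8, 3],
})

# pivot would raise here; pivot_table averages the duplicates
pv = df.pivot_table(index="date", columns="city", values="value")
print(pv)
```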
Data aggregation
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements of the number of sunshine hours per city and date:
>>> df.head()
   country     city        date  hours
0  Germany  Hamburg  2015-06-01      8
1  Germany  Hamburg  2015-06-02     10
2  Germany  Hamburg  2015-06-03      9
3  Germany  Hamburg  2015-06-04      7
4  Germany  Hamburg  2015-06-05      3
To view a summary per city, we use the describe function on the grouped data set:
>>> df.groupby("city").describe()
                       hours
city
Berlin     count  10.000000
           mean    6.000000
           std     3.741657
           min     0.000000
           25%     4.000000
           50%     6.000000
           75%     9.750000
           max    10.000000
Birmingham count  10.000000
           mean    5.100000
           std     2.078995
           min     2.000000
           25%     4.000000
           50%     5.500000
           75%     6.750000
           max     8.000000
On certain data sets, it can be useful to group by more than one attribute. We can get an overview of the sunny hours per country and date by passing in two column names:
>>> df.groupby(["country", "date"]).describe()
                             hours
country date
France  2015-06-01 count  5.000000
                   mean   6.200000
                   std    1.095445
                   min    5.000000
                   25%    5.000000
                   50%    7.000000
                   75%    7.000000
                   max    7.000000
        2015-06-02 count  5.000000
                   mean   3.600000
                   std    3.577709
                   min    0.000000
                   25%    0.000000
                   50%    4.000000
                   75%    6.000000
                   max    8.000000
...
UK      2015-06-07 std    3.872983
                   min    0.000000
                   25%    2.000000
                   50%    6.000000
                   75%    8.000000
                   max    9.000000
We can compute single statistics as well:
>>> df.groupby("city").mean()
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
Hamburg 5.5
Leipzig 5.0
London 4.8
Lyon 5.0
Manchester 5.2
Marseille 6.2
Munich 6.6
Nice 3.9
Paris 6.3
Finally, we can define any function to be applied to the groups with the agg method. The above could have been written in terms of agg like this:
>>> df.groupby("city").agg(np.mean)
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
...
But arbitrary functions are possible. As a last example, we define a custom function, which takes a series object as input and computes the difference between the smallest and the largest element:
>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s)))
hours
city
Berlin 10
Birmingham 6
Bordeax 10
Edinburgh 8
Frankfurt 9
Glasgow 10
Hamburg 10
Leipzig 9
London 10
Lyon 8
Manchester 10
Marseille 10
Munich 9
Nice 10
Paris 9
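As an aside, the same range statistic exists in NumPy as np.ptp ("peak to peak", i.e. max minus min), so the lambda above could also be expressed with a library function. A sketch with a hypothetical mini data set:

```python
import numpy as np
import pandas as pd

# Made-up sunshine hours for two cities
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg", "Hamburg"],
    "hours": [0, 10, 3, 8],
})

# np.ptp computes max - min per group,
# matching the custom lambda in the example above
ranges = df.groupby("city")["hours"].agg(np.ptp)
print(ranges)
```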
Grouping data
One typical workflow during data exploration looks as follows:

- You find a criterion that you want to use to group your data. Maybe you have GDP data for every country along with the continent, and you would like to ask questions about the continents.
- These questions usually lead to some function applications – you might want to compute the mean GDP per continent.
- Finally, you want to store this data for further processing in a new data structure.

We use a simpler example here. Imagine some fictional weather data about the number of sunny hours per day and city:
>>> df
          date    city  value
0   2000-01-03  London      6
1   2000-01-04  London      3
2   2000-01-05  London      4
3   2000-01-03  Mexico      3
4   2000-01-04  Mexico      9
5   2000-01-05  Mexico      8
6   2000-01-03  Mumbai     12
7   2000-01-04  Mumbai      9
8   2000-01-05  Mumbai      8
9   2000-01-03   Tokyo      5
10  2000-01-04   Tokyo      5
11  2000-01-05   Tokyo      6
The groups attribute returns a dictionary containing the unique groups and the corresponding values as axis labels:

>>> df.groupby("city").groups
{'London': [0, 1, 2], 'Mexico': [3, 4, 5],
 'Mumbai': [6, 7, 8], 'Tokyo': [9, 10, 11]}
Although the result of a groupby is a GroupBy object, not a DataFrame, we can use the usual indexing notation to refer to columns:

>>> grouped = df.groupby("city")
>>> grouped["value"].max()
city
London     6
Mexico     9
Mumbai    12
Tokyo      6
Name: value, dtype: int64
>>> grouped["value"].sum()
city
London    13
Mexico    20
Mumbai    29
Tokyo     16
Name: value, dtype: int64
We see that, according to our data set, Mumbai seems to be a sunny city. An alternative – and more verbose – way to achieve the above would be:

>>> df['value'].groupby(df['city']).sum()
city
London    13
Mexico    20
Mumbai    29
Tokyo     16
Name: value, dtype: int64
Summary
In this chapter, we have looked at ways to manipulate data frames, from cleaning and filtering to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy, and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.
The next chapter gives a brief overview of machine learning algorithms, which apply the results of data analysis to make decisions or build helpful products.
Practice exercises
Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.
Hint: There are many ways to do this. We used a small tool called xls2csv.py, and we were able to load the resulting CSV file with a helper method:
import datetime
import pandas as pd

def convert_date(s):
    parts = s.replace("(", "").replace(")", "").split(",")
    if len(parts) < 6:
        return datetime.date(1970, 1, 1)
    return datetime.datetime(*[int(p) for p in parts])

df = pd.read_csv("RBRTEd.csv", sep=',', names=["date", "price"],
                 converters={"date": convert_date}).dropna()
Take a data set that is important for your work – or if you do not have any at hand, a data set that interests you and that is available online. Ask one or two questions about the data in advance. Then use cleaning, filtering, grouping, and plotting techniques to answer your question.