Chapter 5. Time Series
Time series typically consist of a sequence of data points coming from measurements taken over time. This kind of data is very common and occurs in a multitude of fields.
A business executive is interested in stock prices, prices of goods and services or monthly sales figures. A meteorologist takes temperature measurements several times a day and also keeps records of precipitation, humidity, wind direction and force. A neurologist can use electroencephalography to measure electrical activity of the brain along the scalp. A sociologist can use campaign contribution data to learn about political parties and their supporters and use these insights as an argumentation aid. More examples for time series data can be enumerated almost endlessly.
Time series primer
In general, time series serve two purposes. First, they help us to learn about the underlying process that generated the data. Second, we would like to be able to forecast future values of the same or related series using existing data. When we measure temperature, precipitation or wind, we would like to learn about more complex things, such as the weather or the climate of a region and how various factors interact. At the same time, we might be interested in weather forecasting.
In this chapter we will explore the time series capabilities of Pandas. Apart from its powerful core data structures – the series and the DataFrame – Pandas comes with helper functions for dealing with time related data. With its extensive built-in optimizations, Pandas is capable of handling large time series with millions of data points with ease.
We will gradually approach time series, starting with the basic building blocks of date and time objects.
Working with date and time objects
Python supports date and time handling in the datetime and time modules from the standard library:
>>> import datetime
>>> datetime.datetime(2000, 1, 1)
datetime.datetime(2000, 1, 1, 0, 0)
Sometimes, dates are given or expected as strings, so a conversion from or to strings is necessary. Two functions handle this: strptime parses a string into a datetime object, and strftime formats a datetime object as a string:
>>> datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")
datetime.datetime(2000, 1, 1, 0, 0)
>>> datetime.datetime(2000, 1, 1, 0, 0).strftime("%Y%m%d")
'20000101'
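As a small illustration of the round trip, the two functions can be chained to convert a date string from one layout to another. The helper name reformat_date is hypothetical and only serves this sketch:

```python
from datetime import datetime


def reformat_date(text, fmt_in, fmt_out):
    """Parse `text` using `fmt_in`, then render it using `fmt_out`."""
    parsed = datetime.strptime(text, fmt_in)   # string -> datetime
    return parsed.strftime(fmt_out)            # datetime -> string


iso = reformat_date("2000/1/1", "%Y/%m/%d", "%Y-%m-%d")
compact = reformat_date("2000-01-01", "%Y-%m-%d", "%Y%m%d")
print(iso, compact)
```

Both directions need the exact format string, which is the friction the Pandas helpers are meant to reduce.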
Real-world data usually comes in all kinds of shapes and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas abstracts away a lot of the friction when dealing with strings representing dates or times. One of these helper functions is to_datetime:
>>> import pandas as pd
>>> import numpy as np
>>> pd.to_datetime("4th of July")
Timestamp('2015-07-04 00:00:00')
>>> pd.to_datetime("13.01.2000")
Timestamp('2000-01-13 00:00:00')
>>> pd.to_datetime("7/8/2000")
Timestamp('2000-07-08 00:00:00')
The last string can refer to August 7th or July 8th, depending on the region. To disambiguate this case, to_datetime can be passed the keyword argument dayfirst:
>>> pd.to_datetime("7/8/2000", dayfirst=True)
Timestamp('2000-08-07 00:00:00')
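When the layout of the incoming strings is known in advance, an explicit format string is another way to rule out the ambiguity. The following is a minimal sketch; the variable names are only illustrative:

```python
import pandas as pd

text = "7/8/2000"

# An explicit format removes any doubt about the field order.
day_first = pd.to_datetime(text, format="%d/%m/%Y")
month_first = pd.to_datetime(text, format="%m/%d/%Y")

print(day_first.date(), month_first.date())
```

The dayfirst flag is convenient for interactive work, while a fixed format makes the intended reading explicit in the code.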
Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:
>>> issubclass(pd.Timestamp, datetime.datetime)
True
This means they can be used interchangeably in many cases:
>>> ts = pd.to_datetime(946684800000000000)
>>> ts.year, ts.month, ts.day, ts.weekday()
(2000, 1, 1, 5)
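The integer passed to to_datetime above is interpreted as nanoseconds since January 1st, 1970. The following sketch shows what the subclass relationship means in practice by mixing a Timestamp with a plain datetime object; the variable names are only illustrative:

```python
import datetime
import pandas as pd

ts = pd.to_datetime(946684800000000000)  # nanoseconds since 1970-01-01
plain = datetime.datetime(1999, 12, 31)

# A Timestamp passes isinstance checks written for datetime objects ...
is_datetime = isinstance(ts, datetime.datetime)

# ... and can be mixed with them in arithmetic and comparisons.
gap = ts - plain
later = ts > plain

print(is_datetime, gap, later)
```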
Timestamp objects are an important part of the time series capabilities of Pandas, since timestamps are the building blocks of DatetimeIndex objects:
>>> index = [pd.Timestamp("2000-01-01"),
pd.Timestamp("2000-01-02"),
pd.Timestamp("2000-01-03")]
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts
2000-01-01 0.731897
2000-01-02 0.761540
2000-01-03 -1.316866
dtype: float64
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'],
dtype='datetime64[ns]', freq=None, tz=None)
There are a few things to note here: We create a list of timestamp objects and pass it to the series constructor as index. This list of timestamps gets converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just an index:
>>> ts = pd.Series(np.random.randn(len(index)), index=[
"2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts.index
Index([u'2000-01-01', u'2000-01-02', u'2000-01-03'], dtype='object')
However, the to_datetime function is flexible enough to be of help if all we have is a list of date strings:
>>> index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None, tz=None)
Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.
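If a series has already been built with plain string labels, its index can be converted afterwards. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

labels = ["2000-01-01", "2000-01-02", "2000-01-03"]
ts = pd.Series(np.arange(3.0), index=labels)
before = type(ts.index).__name__

# Replace the string labels with parsed timestamps in one step.
ts.index = pd.to_datetime(ts.index)
after = type(ts.index).__name__

print(before, "->", after)
```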
With to_datetime we are able to convert a variety of strings and even lists of strings into timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series and we have to generate sequences of timestamps at fixed intervals ourselves.
Pandas offers another great utility function for this task: date_range.
The date_range function helps to generate a fixed frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.
The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:
>>> pd.date_range(start="2000-01-01", periods=3, freq='H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00', '2000-01-01 02:00:00'], dtype='datetime64[ns]', freq='H', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='T')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00', '2000-01-01 00:02:00'], dtype='datetime64[ns]', freq='T', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='S')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:00:01', '2000-01-01 00:00:02'], dtype='datetime64[ns]', freq='S', tz=None)
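The different ways of pinning down a range can be compared directly. The sketch below builds the same three-day index from a start, from an end, and from both; the variable names are only illustrative:

```python
import pandas as pd

# Any two of start, end and periods pin down the range once freq is given.
by_start = pd.date_range(start="2000-01-01", periods=3, freq="D")
by_end = pd.date_range(end="2000-01-03", periods=3, freq="D")
by_both = pd.date_range(start="2000-01-01", end="2000-01-03", freq="D")

print(by_start.equals(by_end), by_start.equals(by_both))
```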
The freq parameter allows us to specify a multitude of options. Pandas has been used successfully in finance and economics, not least because it is really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:
>>> pd.date_range(start="2000-01-01", periods=3, freq='B')
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='B', tz=None)
The following table shows the available offset aliases, which can also be looked up in the Pandas documentation on time series under http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:
Alias | Description
---|---
B | Business day frequency
C | Custom business day frequency
D | Calendar day frequency
W | Weekly frequency
M | Month end frequency
BM | Business month end frequency
CBM | Custom business month end frequency
MS | Month start frequency
BMS | Business month start frequency
CBMS | Custom business month start frequency
Q | Quarter end frequency
BQ | Business quarter end frequency
QS | Quarter start frequency
BQS | Business quarter start frequency
A | Year end frequency
BA | Business year end frequency
AS | Year start frequency
BAS | Business year start frequency
BH | Business hour frequency
H | Hourly frequency
T | Minutely frequency
S | Secondly frequency
L | Milliseconds
U | Microseconds
N | Nanoseconds
Moreover, the offset aliases can be used in combination as well. Here, we are generating a datetime index with five elements, each one day, one hour, one minute and ten seconds apart:
>>> pd.date_range(start="2000-01-01", periods=5, freq='1D1h1min10s')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-02 01:01:10', '2000-01-03 02:02:20', '2000-01-04 03:03:30', '2000-01-05 04:04:40'], dtype='datetime64[ns]', freq='90070S', tz=None)
If we want to index data every 12 hours of our business time, which by default starts at 9 AM and ends at 5 PM, we simply prefix the BH alias with the multiple 12:
>>> pd.date_range(start="2000-01-01", periods=5, freq='12BH')
DatetimeIndex(['2000-01-03 09:00:00', '2000-01-04 13:00:00', '2000-01-06 09:00:00', '2000-01-07 13:00:00', '2000-01-11 09:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)
A custom definition of what a business hour means is also possible:
>>> from pandas.tseries.offsets import BusinessHour
>>> bh = BusinessHour(start='07:00', end='19:00')
We can use this custom business hour to build indexes as well:
>>> pd.date_range(start="2000-01-01", periods=5, freq=12 * bh)
DatetimeIndex(['2000-01-03 07:00:00', '2000-01-03 19:00:00', '2000-01-04 07:00:00', '2000-01-04 19:00:00', '2000-01-05 07:00:00', '2000-01-05 19:00:00', '2000-01-06 07:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)
Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:
>>> pd.date_range(start="2000-01-01", periods=5, freq='W-FRI')
DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28', '2000-02-04'], dtype='datetime64[ns]', freq='W-FRI', tz=None)
>>> pd.date_range(start="2000-01-01", periods=5, freq='WOM-2TUE')
DatetimeIndex(['2000-01-11', '2000-02-08', '2000-03-14', '2000-04-11', '2000-05-09'], dtype='datetime64[ns]', freq='WOM-2TUE', tz=None)
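A quick way to convince ourselves that an anchored frequency does what we expect is to inspect the weekday and day-of-month attributes of the resulting index. This is a minimal sketch; the variable names are only illustrative:

```python
import pandas as pd

fridays = pd.date_range(start="2000-01-01", periods=5, freq="W-FRI")
tuesdays = pd.date_range(start="2000-01-01", periods=5, freq="WOM-2TUE")

# dayofweek counts from Monday=0, so Friday is 4 and Tuesday is 1.
all_fridays = (fridays.dayofweek == 4).all()
all_tuesdays = (tuesdays.dayofweek == 1).all()

# The second Tuesday of a month always falls on day 8 through 14.
in_second_week = ((tuesdays.day >= 8) & (tuesdays.day <= 14)).all()

print(all_fridays, all_tuesdays, in_second_week)
```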
Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:
>>> s = pd.date_range(start="2000-01-01", periods=10, freq='BAS-JAN')
>>> t = pd.date_range(start="2000-01-01", periods=10, freq='A-FEB')
>>> s.union(t)
DatetimeIndex(['2000-01-03', '2000-02-29', '2001-01-01', '2001-02-28', '2002-01-01', '2002-02-28', '2003-01-01', '2003-02-28','2004-01-01', '2004-02-29', '2005-01-03', '2005-02-28', '2006-01-02', '2006-02-28', '2007-01-01', '2007-02-28','2008-01-01', '2008-02-29', '2009-01-01', '2009-02-28'], dtype='datetime64[ns]', freq=None, tz=None)
We see that 2000, 2005, and 2006 did not start on a weekday and that 2000, 2004, and 2008 were the leap years.
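Observations read off the merged index can be cross-checked with the standard library alone. The sketch below lists the years of the decade that began on a weekend and the years that were leap years:

```python
import calendar
import datetime

years = range(2000, 2010)

# weekday() returns 5 for Saturday and 6 for Sunday.
weekend_starts = [y for y in years if datetime.date(y, 1, 1).weekday() >= 5]
leap_years = [y for y in years if calendar.isleap(y)]

print(weekend_starts, leap_years)
```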
We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.
It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:
>>> index = pd.date_range(start='2000-01-01', periods=200, freq='B')
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> walk = ts.cumsum()
>>> walk.plot()
A possible output of this plot is shown in the following figure:
Just as with usual series objects, you can select parts and slice the index:
>>> ts.head()
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
2000-01-06 1.157041
2000-01-07 -0.427284
Freq: B, dtype: float64
>>> ts[0]
1.4641415817112928
>>> ts[1:3]
2000-01-04 0.103077
2000-01-05 0.762656
We can use date strings as keys, even though our series has a DatetimeIndex:
>>> ts['2000-01-03']
1.4641415817112928
Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:
>>> ts[datetime.datetime(2000, 1, 3)]
1.4641415817112928
Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:
>>> ts['2000-01-03':'2000-01-05']
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.datetime(2000, 1, 5)]
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.date(2000, 1, 5)]
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
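One detail worth keeping in mind is that slices by label include both endpoints, whereas slices by position follow the usual Python convention and exclude the end. The following sketch uses made-up values and the explicit loc and iloc accessors to make the difference visible:

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
ts = pd.Series(np.arange(len(index), dtype=float), index=index)

# Label-based slices include both endpoints ...
by_label = ts.loc["2000-01-03":"2000-01-05"]
# ... while position-based slices stop before the end position.
by_position = ts.iloc[0:2]

print(len(by_label), len(by_position))
```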
It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:
>>> ts['2000-02']
2000-02-01 0.277544
2000-02-02 -0.844352
2000-02-03 -1.900688
2000-02-04 -0.120010
2000-02-07 -0.465916
2000-02-08 -0.575722
2000-02-09 0.426153
2000-02-10 0.720124
2000-02-11 0.213050
2000-02-14 -0.604096
2000-02-15 -1.275345
2000-02-16 -0.708486
2000-02-17 -0.262574
2000-02-18 1.898234
2000-02-21 0.772746
2000-02-22 1.142317
2000-02-23 -1.461767
2000-02-24 -2.746059
2000-02-25 -0.608201
2000-02-28 0.513832
2000-02-29 -0.132000
To see all entries from March through May, inclusive:
>>> ts['2000-03':'2000-05']
2000-03-01 0.528070
2000-03-02 0.200661
...
2000-05-30 1.206963
2000-05-31 0.230351
Freq: B, dtype: float64
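The behavior of partial strings can be checked on a series with known contents. This is a minimal sketch with made-up values, again using the loc accessor:

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
ts = pd.Series(np.arange(len(index), dtype=float), index=index)

# A year-month string selects every entry that falls in that month.
february = ts.loc["2000-02"]
# Used as slice bounds, partial strings cover the whole bounding months.
spring = ts.loc["2000-03":"2000-05"]

print(len(february), spring.index[0].date(), spring.index[-1].date())
```

February 2000 had 21 business days, which matches the number of rows in the listing above.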
Time series can be shifted forward or backward in time. The index stays in place, the values move:
>>> small_ts = ts['2000-02-01':'2000-02-05']
>>> small_ts
2000-02-01 0.277544
2000-02-02 -0.844352
2000-02-03 -1.900688
2000-02-04 -0.120010
Freq: B, dtype: float64
>>> small_ts.shift(2)
2000-02-01 NaN
2000-02-02 NaN
2000-02-03 0.277544
2000-02-04 -0.844352
Freq: B, dtype: float64
To shift backwards in time, we simply use negative values:
>>> small_ts.shift(-2)
2000-02-01 -1.900688
2000-02-02 -0.120010
2000-02-03 NaN
2000-02-04 NaN
Freq: B, dtype: float64
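A common use of shift is to compare each value with its predecessor. The sketch below, with made-up values, also shows the freq argument of shift, which moves the index labels instead of the values:

```python
import pandas as pd

index = pd.date_range(start="2000-02-01", periods=4, freq="B")
s = pd.Series([1.0, 3.0, 6.0, 10.0], index=index)

# Subtracting the lagged series gives the change between consecutive entries.
change = s - s.shift(1)

# With freq given, the index labels move and the values stay attached to them.
moved = s.shift(1, freq="B")

print(change.tolist())
print(moved.index[0].date(), moved.iloc[0])
```

Because the labels move along with the values in the second case, no NaN entries are introduced.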
Resampling time series
Resampling describes the process of frequency conversion over time series data. It is a helpful technique in various circumstances, as it fosters understanding by grouping together and aggregating data. For example, from daily temperature data we can create a new time series that shows the average temperature per week or month. On the other hand, real-world data may not be taken at uniform intervals, and we may need to map observations onto uniform intervals or fill in missing values for certain points in time. These are the two main uses of resampling: binning and aggregation, and filling in missing data. Downsampling and upsampling occur in other fields as well, such as digital signal processing. There, the process of downsampling is often called decimation and performs a reduction of the sample rate. The inverse process is called interpolation, where the sample rate is increased. We will look at both directions from a data analysis angle.
Downsampling time series data
Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.
They are receiving data from the counter device every minute. Here are the hypothetical measurements for a day, beginning at 08:00, ending 600 minutes later at 18:00:
>>> rng = pd.date_range('4/29/2015 8:00', periods=600, freq='T')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 9
2015-04-29 08:01:00 60
2015-04-29 08:02:00 65
2015-04-29 08:03:00 25
2015-04-29 08:04:00 19
To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values and calculate the mean:
>>> ts.resample('10min').head()
2015-04-29 08:00:00 49.1
2015-04-29 08:10:00 56.0
2015-04-29 08:20:00 42.0
2015-04-29 08:30:00 51.9
2015-04-29 08:40:00 59.0
Freq: 10T, dtype: float64
In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter:
>>> ts.resample('10min', how='sum').head()
2015-04-29 08:00:00 442
2015-04-29 08:10:00 409
2015-04-29 08:20:00 532
2015-04-29 08:30:00 433
2015-04-29 08:40:00 470
Freq: 10T, dtype: int64
Or we can reduce the sampling interval even more by resampling to an hourly interval:
>>> ts.resample('1h', how='sum').head()
2015-04-29 08:00:00 2745
2015-04-29 09:00:00 2897
2015-04-29 10:00:00 3088
2015-04-29 11:00:00 2616
2015-04-29 12:00:00 2691
Freq: H, dtype: int64
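The calls above use the how keyword of the Pandas version this chapter was written against. In more recent Pandas releases, resample returns a deferred object and the aggregation is selected by calling a method on it. The following sketch assumes such a release and uses a made-up counter that reports exactly one visitor per minute, so the results are easy to verify:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("4/29/2015 8:00", periods=600, freq="min")
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)  # one visitor per minute

per_10min_mean = ts.resample("10min").mean()
per_10min_sum = ts.resample("10min").sum()
per_hour_sum = ts.resample("1h").sum()

print(len(per_10min_sum), per_10min_sum.iloc[0], per_hour_sum.iloc[0])
```

Ten hours of minute data collapse into 60 ten-minute bins or 10 hourly bins, and each bin holds the aggregate of the values that fall into it.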
We can ask for other things as well. For example, what was the highest per-minute count recorded within each hour:
>>> ts.resample('1h', how='max').head()
2015-04-29 08:00:00 97
2015-04-29 09:00:00 98
2015-04-29 10:00:00 99
2015-04-29 11:00:00 98
2015-04-29 12:00:00 99
Freq: H, dtype: int64
Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random sample for each hour:
>>> import random
>>> ts.resample('1h', how=lambda m: random.choice(m)).head()
2015-04-29 08:00:00 28
2015-04-29 09:00:00 14
2015-04-29 10:00:00 68
2015-04-29 11:00:00 31
2015-04-29 12:00:00 5
If you specify a function by string, Pandas uses highly optimized versions.
The built-in functions that can be used as argument to how are: sum, mean, std, sem, max, min, median, first, last, ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.
While in our airport this metric might not be that valuable, we can compute it nonetheless:
>>> ts.resample('1h', how='ohlc').head()
open high low close
2015-04-29 08:00:00 9 97 0 14
2015-04-29 09:00:00 68 98 3 12
2015-04-29 10:00:00 71 99 1 1
2015-04-29 11:00:00 59 98 0 4
2015-04-29 12:00:00 56 99 3 55
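To see what the four columns mean, it helps to compute them on data where the answer is obvious. The sketch below assumes a recent Pandas release, where the aggregation is a method on the resampled object, and uses a made-up counter that rises by one every minute:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("4/29/2015 8:00", periods=120, freq="min")
ts = pd.Series(np.arange(120), index=rng)  # steadily rising counts

# One row per hour: first, largest, smallest and last value in that hour.
bars = ts.resample("1h").ohlc()
first = bars.iloc[0]
second = bars.iloc[1]

print(bars)
```

For a rising series, open equals low and close equals high in every bin.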
Upsampling time series data
In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.
Let's start with hourly data for a single day:
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 30
2015-04-29 09:00:00 27
2015-04-29 10:00:00 54
2015-04-29 11:00:00 9
2015-04-29 12:00:00 48
Freq: H, dtype: int64
If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:
>>> ts.resample('15min').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 NaN
2015-04-29 08:30:00 NaN
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:
>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 30
2015-04-29 09:00:00 27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 27
2015-04-29 08:30:00 27
2015-04-29 08:45:00 27
2015-04-29 09:00:00 27
With the limit parameter, it is possible to control the number of missing values to be filled:
>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
Freq: 15T, dtype: float64
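In more recent Pandas releases, the fill strategy is chosen by calling a method on the resampled object instead of passing fill_method. The following sketch assumes such a release and uses three made-up hourly readings:

```python
import pandas as pd

rng = pd.date_range("4/29/2015 8:00", periods=3, freq="h")
ts = pd.Series([30, 27, 54], index=rng)

gaps = ts.resample("15min").asfreq()             # new slots are NaN
forward = ts.resample("15min").ffill()           # carry last value forward
limited = ts.resample("15min").ffill(limit=2)    # fill at most two slots

print(gaps.isna().sum(), forward.isna().sum(), limited.isna().sum())
```

Two hours of hourly data become nine quarter-hour slots. With a limit of two, the third slot after each measurement stays empty.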
If you want to adjust the labels during resampling, you can use the loffset keyword argument:
>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00 30
2015-04-29 08:20:00 30
2015-04-29 08:35:00 30
2015-04-29 08:50:00 NaN
2015-04-29 09:05:00 27
Freq: 15T, dtype: float64
There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.
We can ask Pandas to interpolate a time series for us:
>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00 30.00
2015-04-29 08:15:00 29.25
2015-04-29 08:30:00 28.50
2015-04-29 08:45:00 27.75
2015-04-29 09:00:00 27.00
Freq: 15T, dtype: float64
We saw the default interpolate method, a linear interpolation, in action. Pandas assumes a linear relationship between two existing points.
Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
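The arithmetic behind linear interpolation is simple enough to verify by hand. The sketch below assumes a recent Pandas release and two made-up hourly readings; each quarter-hour step covers one quarter of the difference between them:

```python
import pandas as pd

rng = pd.date_range("4/29/2015 8:00", periods=2, freq="h")
ts = pd.Series([30.0, 27.0], index=rng)

# Upsample first, leaving NaN in the new slots, then fill them linearly.
filled = ts.resample("15min").asfreq().interpolate()

# Each step covers a quarter of the hourly difference: (27 - 30) / 4.
step = (27.0 - 30.0) / 4

print(filled.tolist(), step)
```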
Time zone handling
While, by default, Pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time and do you know when the time zone is switched in those countries? Thankfully, Pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil:
>>> t = pd.Timestamp('2000-01-01')
>>> t.tz is None
True
To supply time zone information, you can use the tz keyword argument:
>>> t = pd.Timestamp('2000-01-01', tz='Europe/Berlin')
>>> t.tz
<DstTzInfo 'Europe/Berlin' CET+1:00:00 STD>
This works for ranges as well:
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz='Europe/London')
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08','2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')
Time zone objects can also be constructed beforehand:
>>> import pytz
>>> tz = pytz.timezone('Europe/London')
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz=tz)
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08', '2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')
Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize function helps to switch between time zone aware and time zone unaware objects:
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts.index.tz is None
True
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc.index.tz
<UTC>
To move a time zone aware object to other time zones, you can use the tz_convert method:
>>> ts_utc.tz_convert('Europe/Berlin').index.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize:
>>> ts_utc.tz_convert(None).index.tz is None
True
>>> ts_utc.tz_localize(None).index.tz is None
True
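Both calls remove the time zone, but they do not leave the same clock reading behind once the object is in a zone other than UTC. The following sketch on a single timestamp illustrates the difference; the variable names are only illustrative:

```python
import pandas as pd

naive = pd.Timestamp("2000-01-01 12:00")
utc = naive.tz_localize("UTC")            # attach a zone, clock reading unchanged
berlin = utc.tz_convert("Europe/Berlin")  # same instant, different clock reading

# Two ways back to a naive timestamp, with different results:
as_utc_clock = berlin.tz_convert(None)     # drops the zone after converting to UTC
as_local_clock = berlin.tz_localize(None)  # drops the zone, keeps local reading

print(berlin.hour, as_utc_clock.hour, as_local_clock.hour)
```

For the UTC series above both calls give the same result, because the local reading and the UTC reading coincide.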
Timedeltas
Along with the powerful timestamp object, which acts as a building block for the DatetimeIndex, there is another useful data structure, introduced in Pandas 0.15: the Timedelta. The Timedelta can serve as a basis for indices as well, in this case a TimedeltaIndex.
Timedeltas are differences in times, expressed in different units. The Timedelta class in Pandas is a subclass of datetime.timedelta from the Python standard library. As with other Pandas data structures, the Timedelta can be constructed from a variety of inputs:
>>> pd.Timedelta('1 days')
Timedelta('1 days 00:00:00')
>>> pd.Timedelta('-1 days 2 min 10s 3us')
Timedelta('-2 days +23:57:49.999997')
>>> pd.Timedelta(days=1,seconds=1)
Timedelta('1 days 00:00:01')
As you would expect, Timedeltas allow basic arithmetic:
>>> pd.Timedelta(days=1) + pd.Timedelta(seconds=1)
Timedelta('1 days 00:00:01')
Similar to to_datetime, there is a to_timedelta function that can parse strings or lists of strings into Timedelta structures or TimedeltaIndices:
>>> pd.to_timedelta('20.1s')
Timedelta('0 days 00:00:20.100000')
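Timedeltas become most useful in combination with timestamps. The sketch below shows the two basic operations, adding a duration to a point in time and taking the difference of two points in time, plus the list form of to_timedelta; the variable names are only illustrative:

```python
import pandas as pd

start = pd.Timestamp("2000-01-01")
delta = pd.Timedelta(days=1, seconds=1)

# Adding a Timedelta to a Timestamp yields another Timestamp.
end = start + delta
# Subtracting two Timestamps yields a Timedelta again.
elapsed = end - start

# A list of strings is parsed into a TimedeltaIndex.
steps = pd.to_timedelta(["1 days", "2 hours"])

print(end, elapsed.total_seconds(), type(steps).__name__)
```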
Instead of absolute dates, we could create an index of timedeltas. Imagine measurements from a volcano, for example. We might want to take measurements but index them from a given date, for example the date of the last eruption. We could create a timedelta index that has the last seven days as entries:
>>> pd.to_timedelta(np.arange(7), unit='D')
TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days', '5 days', '6 days'], dtype='timedelta64[ns]', freq=None)
We could then work with time series data, indexed from the last eruption. If we had measurements for many eruptions (from possibly multiple volcanoes), we would have an index that would make comparisons and analysis of this data easier. For example, we could ask whether there is a typical pattern that occurs between the third day and the fifth day after an eruption. This question would not be impossible to answer with a DatetimeIndex, but a TimedeltaIndex makes this kind of exploration much more convenient.
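The volcano scenario can be sketched in a few lines. The eruption date and the readings below are made up for illustration; the point is only that subtracting a reference timestamp from a DatetimeIndex yields a TimedeltaIndex, which can then be sliced by elapsed time:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for the week following an eruption.
eruption = pd.Timestamp("2000-01-01")
dates = pd.date_range(start=eruption, periods=7, freq="D")
readings = pd.Series(np.arange(7.0), index=dates)

# Re-index by elapsed time since the eruption instead of calendar date.
readings.index = readings.index - eruption

# Label slices on a TimedeltaIndex include both endpoints.
window = readings.loc[pd.Timedelta(days=3):pd.Timedelta(days=5)]

print(window)
```

Series from different eruptions that are re-indexed this way share the same labels and can be compared row by row.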
Time series plotting
Pandas comes with great support for plotting, and this holds true for time series data as well.
As a first example, let's take some monthly data and plot it:
>>> rng = pd.date_range(start='2000', periods=120, freq='MS')
>>> ts = pd.Series(np.random.randint(-10, 10, size=len(rng)), rng).cumsum()
>>> ts.head()
2000-01-01 -4
2000-02-01 -6
2000-03-01 -16
2000-04-01 -26
2000-05-01 -24
Freq: MS, dtype: int64
Since matplotlib is used under the hood, we can pass familiar parameters to plot, such as c for color, or title for the chart title:
>>> import matplotlib.pyplot as plt
>>> ts.plot(c='k', title='Example time series')
>>> plt.show()
The following figure shows an example time series plot:
We can overlay an aggregate plot over 2 and 5 years:
>>> ts.resample('2A').plot(c='0.75', ls='--')
>>> ts.resample('5A').plot(c='0.25', ls='-.')
The following figure shows the resampled 2-year plot:
The following figure shows the resampled 5-year plot:
We can pass the kind of chart to the plot method as well. The return value of the plot method is an AxesSubplot, which allows us to customize many aspects of the plot. Here we are setting the label values on the X axis to the year values from our time series:
>>> plt.clf()
>>> tsx = ts.resample('1A')
>>> ax = tsx.plot(kind='bar', color='k')
>>> ax.set_xticklabels(tsx.index.year)
Let's imagine we have four time series that we would like to plot simultaneously. We generate a matrix of 1000 × 4 random values and treat each column as a separate time series:
>>> plt.clf()
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
>>> df = df.cumsum()
>>> df.plot(color=['k', '0.75', '0.5', '0.25'], ls='--')
Summary
In this chapter we showed how you can work with time series in Pandas. We introduced two index types, the DatetimeIndex and the TimedeltaIndex, and explored their building blocks in depth. Pandas comes with versatile helper functions that take much of the pain out of parsing dates of various formats or generating fixed frequency sequences. Resampling data can help get a more condensed picture of the data, or it can help align various datasets of different frequencies to one another. One of the explicit goals of Pandas is to make it easy to work with missing data, which is also relevant in the context of upsampling.
Finally, we showed how time series can be visualized. Since matplotlib and Pandas are natural companions, we discovered that we can reuse our previous knowledge about matplotlib for time series data as well.
In the next chapter, we will explore ways to load and store data in text files and databases.
Practice exercises
Exercise 1: Find one or two real-world examples for data sets, which could – in a sensible way – be assigned to the following groups:
- Fixed frequency data
- Variable frequency data
- Data where frequency is usually measured in seconds
- Data where frequency is measured in nanoseconds
- Data where a TimedeltaIndex would be preferable
Exercise 2: Create various fixed frequency ranges:
- Every minute between 1 AM and 2 AM on 2000-01-01
- Every two hours for a whole week starting 2000-01-01
- An entry for every Saturday and Sunday during the year 2000
- An entry for every Monday of a month, if it was a business day, for the years 2000, 2001 and 2002