Working with time-series in Python
Python has a lot of libraries and packages for time-series, such as datetime, time, calendar, dateutil, and pytz, which can be highly confusing for beginners. At the same time, there are many different data types like date, time, datetime, tzinfo, timedelta, relativedelta, and more.
When it comes to using them, the devil is in the details. Just to name one example: many of these types are timezone-naive by default, meaning they carry no timezone information unless we attach it explicitly. You should feel reassured, however, knowing that to get started, familiarity with a small subset of these libraries and data types is enough.
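To make the timezone point concrete, here is a minimal sketch using only the standard library. The variable names are illustrative, and UTC is chosen purely as an example timezone.

```python
from datetime import datetime, timezone

# A naive datetime carries no timezone information.
naive = datetime(2021, 3, 24, 15, 48)

# An aware datetime has a tzinfo attached (UTC in this example).
aware = datetime(2021, 3, 24, 15, 48, tzinfo=timezone.utc)

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC

# Ordering comparisons between naive and aware datetimes raise a TypeError.
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

print(comparable)  # False
```

The practical consequence is that it pays to decide early whether a project works with naive or aware timestamps and to stick to that choice.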
Requirements
In this chapter, we'll use several libraries, which we can quickly install from the terminal (or similarly from Anaconda Navigator):
pip install -U dython scipy numpy pandas seaborn scikit-learn
We'll execute the commands from the Python (or IPython) terminal, but equally we could execute them from a Jupyter notebook (or a different environment).
It's a good start if we at least know datetime and pandas, two very prominent libraries, which we'll cover in the following two sections. We'll create basic objects and do simple manipulations on them.
Datetime
The date and datetime data types are not primitive types in Python the way that numbers (float and int), string, list, dictionary, tuple, or file are. To work with date and datetime objects, we have to import datetime, a library that is part of the Python Standard Library, the set of libraries that ships by default with CPython and other main Python distributions.
datetime comes with objects such as date, datetime, time, and timedelta, among others. The difference between datetime and date objects is that the datetime object includes time information in addition to a date.
To get a date, we can do this:
from datetime import date
To get today's date:
today = date.today()
To get some other date:
other_date = date(2021, 3, 24)
If we want a datetime object (a timestamp) instead, we can do this as well:
from datetime import datetime
now = datetime.now()
This will get the current timestamp. We can create a datetime for a specific date and time as well:
some_date = datetime(2021, 5, 18, 15, 39, 0)
some_date.isoformat()
We can get a string output in isoformat:
'2021-05-18T15:39:00'
isoformat, short for the ISO 8601 format, is an international standard for representing dates and times.
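As a small sketch of how the ISO format works in both directions, the following converts a datetime to a string and back again. It assumes Python 3.7 or later, where datetime.fromisoformat is available; the variable names are illustrative.

```python
from datetime import date, datetime

stamp = datetime(2021, 5, 18, 15, 39, 0)

# Serialize to an ISO 8601 string.
text = stamp.isoformat()
print(text)  # 2021-05-18T15:39:00

# Parse the string back into a datetime; the round trip is lossless.
restored = datetime.fromisoformat(text)
print(restored == stamp)  # True

# The date part alone can be extracted and formatted the same way.
print(stamp.date().isoformat())  # 2021-05-18
```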
We can also work with time differences using timedelta:
from datetime import timedelta
year = timedelta(days=365)
These timedelta objects can be added to date and datetime objects, and we can do arithmetic on a timedelta object itself, for example:
year * 10
This should give us the following output:
datetime.timedelta(days=3650)
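A short sketch of timedelta arithmetic with dates may help here. The dates are arbitrary examples. Note that a timedelta of 365 days is a fixed duration, not a calendar year, which shows up in leap years.

```python
from datetime import date, timedelta

start = date(2021, 3, 24)

# Adding a timedelta to a date gives another date.
one_week_later = start + timedelta(weeks=1)
print(one_week_later)  # 2021-03-31

# Subtracting two dates gives a timedelta.
gap = date(2021, 9, 1) - start
print(gap.days)  # 161

# 2020 is a leap year, so 365 days after 1 January is still in 2020.
leap_check = date(2020, 1, 1) + timedelta(days=365)
print(leap_check)  # 2020-12-31
```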
The datetime library can parse string inputs to date and datetime types and output these objects as strings:
from datetime import date
some_date = date.fromisoformat('2021-03-24')
Or:
some_date = date(2021, 3, 24)
We can format the output with string format options, for example like this:
some_date.strftime('%A %d. %B %Y')
This would give us:
'Wednesday 24. March 2021'
Similarly, we can read in a date or datetime object from a string, and we can use the same format options:
from datetime import datetime
dt = datetime.strptime('24/03/21 15:48', '%d/%m/%y %H:%M')
You can find a complete list of formatting options that you can use both for parsing strings and printing datetime objects here: https://strftime.org/.
A few important ones are listed in this table:
Format string | Meaning
%Y | Year as 4 digits
%y | Year as 2 digits
%m | Month as a number
%d | Day
%H | Hour as 2 digits
%M | Minute as 2 digits
Figure 2.2: Format strings for dates
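To see these format codes working in both directions, here is a minimal sketch that parses a string with one format and prints the result with another. The output format is an arbitrary choice for illustration.

```python
from datetime import datetime

# Parse: day/month/2-digit year, then hour:minute.
dt = datetime.strptime('24/03/21 15:48', '%d/%m/%y %H:%M')

# Format: 4-digit year first, using codes from the table.
text = dt.strftime('%Y-%m-%d %H:%M')
print(text)  # 2021-03-24 15:48
```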
It's useful to remember these strings with formatting options. For example, the format string for a US date (month first) separated by slashes would look like this:
'%m/%d/%Y'
pandas
We introduced the pandas library in the previous chapter. pandas is one of the most important libraries in the Python ecosystem for data science, used for data manipulation and analysis. Initially released in 2008, it has been a major driver of Python's success.
pandas comes with significant time-series functionality such as date range generation, frequency conversion, moving window statistics, date shifting, and lagging.
Let's go through some of these basics. We can create a time-series as follows:
import pandas as pd
pd.date_range(start='2021-03-24', end='2021-09-01')
This gives us a DatetimeIndex like this:
DatetimeIndex(['2021-03-24', '2021-03-25', '2021-03-26', '2021-03-27',
'2021-03-28', '2021-03-29', '2021-03-30', '2021-03-31',
'2021-04-01', '2021-04-02',
...
'2021-08-23', '2021-08-24', '2021-08-25', '2021-08-26',
'2021-08-27', '2021-08-28', '2021-08-29', '2021-08-30',
'2021-08-31', '2021-09-01'],
dtype='datetime64[ns]', length=162, freq='D')
We can also create a time-series as follows:
pd.Series(pd.date_range("2021", freq="D", periods=3))
This would give us a time-series like this:
0 2021-01-01
1 2021-01-02
2 2021-01-03
dtype: datetime64[ns]
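Note the difference between the two results: pd.date_range returns a DatetimeIndex, which means we can use it for indexing a dataset, while wrapping it in pd.Series gives a Series whose values have the dtype datetime64[ns].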
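The following sketch shows the two kinds of object side by side, under the assumption of a default (English) locale for the day names. A DatetimeIndex exposes date properties directly, while a Series of timestamps exposes them through the .dt accessor.

```python
import pandas as pd

idx = pd.date_range(start="2021-03-24", periods=4, freq="D")
print(type(idx).__name__)  # DatetimeIndex

# Date properties are available directly on the index.
names = list(idx.day_name())
print(names)  # ['Wednesday', 'Thursday', 'Friday', 'Saturday']

# On a Series of timestamps, the same properties live under .dt.
s = pd.Series(idx)
months = s.dt.month.tolist()
print(months)  # [3, 3, 3, 3]
```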
One of the most important functionalities is parsing to date or datetime objects from either strings or separate columns:
import pandas as pd
df = pd.DataFrame({'year': [2021, 2022],
                   'month': [3, 4],
                   'day': [24, 25]})
ts1 = pd.to_datetime(df)
ts2 = pd.to_datetime('20210324', format='%Y%m%d')
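We've created two objects: ts1 is a Series of two timestamps assembled from the year, month, and day columns, and ts2 is a single Timestamp parsed from a string.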
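Real data often contains a whole column of date strings, some of them malformed. As a hedged sketch with made-up input values, pd.to_datetime can parse such a column and, with errors="coerce", turn unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column; the last entry is deliberately invalid.
raw = pd.Series(["2021-03-24", "2021-04-25", "not a date"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Invalid entries show up as missing values (NaT).
missing = parsed.isna().tolist()
print(missing)  # [False, False, True]
```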
You can take a rolling window for calculations like this:
s = pd.Series([1, 2, 3, 4, 5])
s.rolling(3).sum()
Can you guess the result of this? If not, why don't you put this into your Python interpreter?
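If the mechanics of the window are unclear, this sketch on a different, made-up series may help. By default, a window only produces a value once it is full; the min_periods argument relaxes that requirement.

```python
import pandas as pd

demo = pd.Series([10, 20, 30, 40])

# Window of 2: the first position has only one value, so the result is NaN.
strict = demo.rolling(2).mean()
print(strict.tolist())  # [nan, 15.0, 25.0, 35.0]

# With min_periods=1, a partially filled window is allowed.
lenient = demo.rolling(2, min_periods=1).mean()
print(lenient.tolist())  # [10.0, 15.0, 25.0, 35.0]
```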
A time-series would usually be an index with a time object and one or more columns with numeric or other types, such as this:
import numpy as np
rng = pd.date_range('2021-03-24', '2021-09-01', freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
We can have a look at our time-series:
2021-03-24 -2.332713
2021-03-25 0.177074
2021-03-26 -2.136295
2021-03-27 2.992240
2021-03-28 -0.457537
...
2021-08-28 -0.705022
2021-08-29 1.089697
2021-08-30 0.384947
2021-08-31 1.003391
2021-09-01 -1.021058
Freq: D, Length: 162, dtype: float64
We can index these time-series datasets like any other pandas Series or DataFrame. ts[:2].index would give us:
DatetimeIndex(['2021-03-24', '2021-03-25'], dtype='datetime64[ns]', freq='D')
Interestingly, we can index directly with strings or datetime objects. For example, ts['2021-03-28':'2021-03-30'] gives us:
2021-03-28 -0.457537
2021-03-29 -1.089423
2021-03-30 -0.708091
Freq: D, dtype: float64
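String-based indexing also works with partial dates. The sketch below uses a series of integer counts (an assumption made so that the results are reproducible) over the same date range. Note that string slices on a DatetimeIndex include both endpoints.

```python
import pandas as pd

rng = pd.date_range('2021-03-24', '2021-09-01', freq='D')
counts = pd.Series(range(len(rng)), index=rng)

# A string slice includes both the start and the end date.
window = counts.loc['2021-03-28':'2021-03-30']
print(len(window))  # 3

# A partial string such as a year and month selects the whole month.
april = counts.loc['2021-04']
print(len(april))  # 30
```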
You can shift or lag the values in a time-series back and forward in time using the shift method. This changes the alignment of the data:
ts.shift(1)[:5]
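Because the values in ts are random, the effect of shift is easier to see on a small series with known values. This is a minimal sketch with made-up numbers.

```python
import pandas as pd

idx = pd.date_range("2021-03-24", periods=4, freq="D")
small = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# shift(1) moves each value one step later; the first slot becomes NaN.
lagged = small.shift(1)
print(lagged.tolist())  # [nan, 1.0, 2.0, 3.0]

# shift(-1) moves each value one step earlier; the last slot becomes NaN.
led = small.shift(-1)
print(led.tolist())  # [2.0, 3.0, 4.0, nan]

# A common use: day-over-day change as the series minus its lag.
change = small - lagged
print(change.tolist())  # [nan, 1.0, 1.0, 1.0]
```

The index stays the same in all three cases; only the values move relative to it.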
We can also change the resolution of time-series objects, for example to month-end frequency like this (recent pandas releases prefer the alias 'ME' over 'M'):
ts.asfreq('M')
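It is worth knowing that asfreq only picks the values that fall on the new frequency's timestamps, whereas resample aggregates everything in between. The sketch below contrasts the two on a made-up two-week series, using the weekly frequency 'W' (weeks ending on Sunday) as an example.

```python
import pandas as pd

# 14 days starting on a Monday, with values 0 through 13.
idx = pd.date_range("2021-03-22", periods=14, freq="D")
daily = pd.Series(range(14), index=idx)

# asfreq keeps only the observations that fall on each Sunday.
weekly_pick = daily.asfreq("W")
print(weekly_pick.tolist())  # [6, 13]

# resample groups each week and aggregates it, here with the mean.
weekly_mean = daily.resample("W").mean()
print(weekly_mean.tolist())  # [3.0, 10.0]
```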
Please note the difference between datetime and pd.DatetimeIndex. Even though they encode the same kind of information, they are different types and they might not always play well with each other. Therefore, I'd recommend always converting types explicitly when doing comparisons.
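One way to make such conversions explicit is sketched below. It wraps a standard-library datetime in a pd.Timestamp before comparing it against an index, and converts back with to_pydatetime. The variable names are illustrative.

```python
from datetime import datetime

import pandas as pd

py_dt = datetime(2021, 3, 24)
idx = pd.date_range("2021-03-24", periods=3, freq="D")

# Convert the standard-library datetime to a pandas Timestamp first.
as_timestamp = pd.Timestamp(py_dt)
mask = idx >= as_timestamp
print(mask.tolist())  # [True, True, True]

# Convert a pandas Timestamp back to a plain datetime when needed.
back = idx[0].to_pydatetime()
print(type(back) is datetime, back == py_dt)  # True True
```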
In the next section, let's go through a basic example of importing a time-series dataset, getting summary statistics, and plotting some variables.