A time series is a set of data collected sequentially over time. For example, think of any chart where the x axis is some measurement of time: anything from the number of stars in the universe since the Big Bang until today to the amount of energy released each nanosecond by a nuclear reaction. The data behind both is a time series. The chart in the weather app on your phone showing the expected temperature for the next 7 days? That’s also the plot of a time series.
In this book, we are mostly concerned with events on the human scales of years, months, days, and hours, but all of this is time series data. Predicting future values is the act of forecasting.
Forecasting the weather has been important to humans for millennia, particularly since the advent of agriculture. In fact, over 2,300 years ago, the Greek philosopher Aristotle wrote a treatise called Meteorology that contained a discussion of early weather forecasting. The application of the word forecast to weather prediction is credited to an English meteorologist of the mid-1800s, Robert FitzRoy, who had earlier achieved fame as the captain of HMS Beagle during Charles Darwin’s pioneering voyage.
However, time series data is not unique to weather. The field of medicine adopted time series analysis techniques with the 1901 invention of the first practical electrocardiogram (ECG) by the Dutch physician Willem Einthoven. The ECG produces the familiar pattern of heartbeats we now see on the machine next to a patient’s bed in every medical drama.
Today, one of the most discussed fields of forecasting is economics. There are entire television channels dedicated to analyzing trends in the stock market. Governments use economic forecasting to advise central bank policy, politicians use economic forecasting to develop their platforms, and business leaders use economic forecasting to guide their decisions.
In this book, we will be forecasting topics as varied as carbon dioxide levels in the atmosphere, the number of riders on Chicago’s public bike share program, the growth of the wolf population in Yellowstone, the sunspot cycle, local rainfall, and even Instagram likes on some popular accounts.
The problem with dependent data
So, why does time series forecasting require its own unique approach? From a statistical perspective, you might see a scatter plot of a time series with a relatively clear trend and attempt to fit a line using standard linear regression, the technique for fitting a straight line to data. The problem is that this violates a key assumption of linear regression: that the observations (more precisely, the errors around the fitted line) are independent of one another.
To illustrate time series dependence with an example, let’s say that a gambler is rolling an unbiased die. I tell you that they just rolled a 2 and ask what the next value will be. This data is independent; previous rolls have no effect on future rolls, so knowing that the previous roll was a 2 does not provide any information about the next roll.
However, in a different situation, let’s say that I call you from an undisclosed location somewhere on Earth and ask you to guess the temperature at my location. Your best bet would be to guess some average global temperature for that day. Now, imagine that I tell you that yesterday’s temperature at my location was 90°F. That provides a great deal of information to you because you intuitively know that yesterday’s temperature and today’s temperature are linked in some way; they are not independent.
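The contrast between the die and the temperature can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the die rolls are simulated, and the "temperatures" are a made-up series in which each day keeps 90% of the previous day’s deviation from 60°F plus some random noise (the numbers 60, 0.9, and 3 are arbitrary choices, not real weather data). For each series, the code compares two ways of guessing the next value: always guessing the overall average, or guessing whatever the previous value was.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 5000

# Independent data: rolls of a fair six-sided die.
rolls = [rng.randint(1, 6) for _ in range(N)]

# Dependent data (hypothetical): a daily temperature in deg F that keeps 90%
# of yesterday's deviation from 60 and adds fresh random noise each day.
temps = [60.0]
for _ in range(N - 1):
    temps.append(60.0 + 0.9 * (temps[-1] - 60.0) + rng.gauss(0, 3))


def mean_abs_error(series, use_previous):
    """Mean absolute error of guessing each value from either the previous
    value (use_previous=True) or the overall average (use_previous=False)."""
    average = sum(series) / len(series)
    errors = [
        abs(series[t] - (series[t - 1] if use_previous else average))
        for t in range(1, len(series))
    ]
    return sum(errors) / len(errors)


for name, series in [("die rolls", rolls), ("temperatures", temps)]:
    print(
        name,
        "| guess the average:", round(mean_abs_error(series, False), 2),
        "| guess the previous value:", round(mean_abs_error(series, True), 2),
    )
```

For the die, knowing the previous roll should give no advantage over guessing the average. For the simulated temperatures, yesterday’s value should be a far better guess than the average, which is what dependence means in practice.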
With independent data, you can randomly shuffle the order of the observations and, within a reasonable margin of error, any trend you measure will stay the same. With time series data, you cannot; shuffling destroys the trends. The order of the data matters because the observations are not independent. When data is dependent like this, a regression model can show statistical significance by random chance, even when there is no true relationship, much more often than your chosen significance level would suggest.
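One way to see why order matters is to measure how strongly each value is related to the one before it, then shuffle the data and measure again. The sketch below assumes a simulated random walk (each value is the previous value plus random noise) as a stand-in for a real time series; the helper name lag1_autocorrelation is my own and not part of any library.

```python
import random


def lag1_autocorrelation(series):
    """Correlation between each value and the value immediately before it."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical series: a random walk of 1,000 steps.
walk = [0.0]
for _ in range(999):
    walk.append(walk[-1] + rng.gauss(0, 1))

# Same values, but with the time order destroyed.
shuffled = walk[:]
rng.shuffle(shuffled)

print("original order:", round(lag1_autocorrelation(walk), 3))
print("shuffled order:", round(lag1_autocorrelation(shuffled), 3))
```

In the original order, the lag-1 autocorrelation should be close to 1; after shuffling, it should fall close to 0, even though the set of values has not changed at all.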
Because high values tend to follow high values and low values tend to follow low values, a time series is more likely to contain long runs of high or low values than independent data would. This, in turn, can make two unrelated series appear correlated far more often than two sets of independent data would.
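A small simulation can make this inflation of false positives concrete. The sketch below is an assumption-laden illustration, not a formal proof: it repeatedly generates two series that are unrelated by construction, tests whether their correlation is significant at the 5% level (for two variables, this is equivalent to testing the slope of a simple linear regression), and counts how often the test is fooled. It does this once for independent white noise and once for random walks, which are used here as a simple stand-in for dependent time series data. The critical value 1.984 is the two-sided 5% cutoff of Student’s t distribution with 98 degrees of freedom, so it is tied to the series length of 100.

```python
import math
import random

N = 100          # length of each simulated series
T_CRIT = 1.984   # two-sided 5% critical value of Student's t, N - 2 = 98 d.o.f.


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def white_noise(rng):
    """Independent draws: no value depends on any other."""
    return [rng.gauss(0, 1) for _ in range(N)]


def random_walk(rng):
    """Dependent draws: each value is the previous value plus noise."""
    total, out = 0.0, []
    for _ in range(N):
        total += rng.gauss(0, 1)
        out.append(total)
    return out


def false_positive_rate(make_series, trials=1000, seed=0):
    """Share of trials in which two unrelated series test as 'significantly'
    correlated at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = pearson_r(make_series(rng), make_series(rng))
        t = r * math.sqrt((N - 2) / (1 - r * r))
        hits += abs(t) > T_CRIT
    return hits / trials


print("white noise :", false_positive_rate(white_noise))
print("random walks:", false_positive_rate(random_walk))
```

For white noise, the false positive rate should land near the nominal 5%. For the random walks, it should be many times higher, even though every pair of series was generated independently of the other.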
The website Spurious Correlations by Tyler Vigen specializes in pointing out examples of seemingly significant, but utterly ridiculous, time series correlations. Here is one example:
Figure 1.1 – A spurious time series correlation (https://www.tylervigen.com/spurious-correlations)
Obviously, the number of people who drown in pools each year has nothing to do with the number of films Nicolas Cage appears in. Neither has any effect on the other. However, Vigen has shown that if you commit the fallacy of treating time series data as if it were independent, the two series do, in fact, appear to correlate significantly, by pure random chance. Coincidences like this are much more likely to turn up when dependence in time series data is ignored.
Now that you understand what exactly time series data is and what sets it apart from other datasets, let’s look at a few milestones in the development of models, from the earliest models up to Prophet.