First-order discrete Markov processes
A discrete first-order Markov process is one of the simplest dynamical systems we can study. Our time variable is discrete, and we have a finite number of states between which we can move from one timepoint to the next.
Since the possible states are discrete, we can label them using the integer values 1, 2, 3, and so on. We also use $i$ and $j$ to represent generic states. The state at time $t$ is denoted by the variable $X_t$, and so a trajectory of observations from timepoint $t=0$ to timepoint $t=T$ would be represented by the sequence $X_0, X_1, \ldots, X_T$. For example, if we have five possible states $\{1, 2, 3, 4, 5\}$ and we set $T=10$, then we may have the specific trajectory $X_0=3, X_1=4, X_2=4, X_3=3, X_4=4, X_5=3, X_6=5, X_7=1, X_8=2, X_9=2, X_{10}=4$. We can shorten the representation of the trajectory and say the observed trajectory is the sequence [3,4,4,3,4,3,5,1,2,2,4].
By taking the possible states that the system can be in to be a finite set, we simplify the mathematics considerably. However, since we can make that finite set as large as we want, we have considerable flexibility in the real-world systems to which we can apply first-order discrete Markov processes.
Variations of first-order Markov processes
We have used the words discrete and process to describe our Markov models. But be aware that Markov process is a very general term that applies to a variety of different model types. You will also see the term Markov chain used to describe the discrete-time discrete state-space models we study here. That is because we have a sequence or chain of states.
We can, of course, have different variants of a Markov process. The possible variants are the following:
1. Discrete time and discrete state space – what we study in the rest of this chapter
2. Continuous time and discrete state space
3. Discrete time and continuous state space
4. Continuous time and continuous state space

All of these are Markov processes, but the term Markov chain tends to be reserved for variants 1-3, where either the time variable or the state space is discrete. We will study discrete-time and discrete state-space Markov processes (Markov chains) in the rest of this chapter; for shorthand, we will refer to them just as discrete Markov processes.
A Markov process is a probabilistic model
What determines the value of $X_t$? The interesting aspect of a first-order discrete Markov process is that the trajectory is stochastic, not deterministic. In our previous example, where we had five possible states, if we are in state 4 at timepoint $t$ (that is, $X_t = 4$), then the state we observe at the next timepoint $t+1$ could be any of the five states 1, 2, 3, 4, or 5. In our example trajectory, we had $X_4 = 4$, and the state at the next timepoint, $X_5$, turned out to be 3. This makes a first-order discrete Markov process a probabilistic model, so we’ll need to introduce some probabilities.
In general, we consider $X_t$ to be a random variable. At timepoint $t$, the probability distribution from which $X_t$ is drawn we’ll denote as $\boldsymbol{\pi}_t$, meaning that $X_t \sim \boldsymbol{\pi}_t$ in the notation from Chapter 2. If we have a total of $N$ possible states, then $\boldsymbol{\pi}_t$ is a vector of $N$ probabilities. It gives us the proportions of observations we would expect to see in each state at timepoint $t$. We can write this vector out in long hand as an array:

$$\boldsymbol{\pi}_t = \left(\text{Prob}(X_t = 1), \text{Prob}(X_t = 2), \ldots, \text{Prob}(X_t = N)\right) = \left(\pi_t(1), \pi_t(2), \ldots, \pi_t(N)\right)$$

Eq. 2
On the right-hand side of Eq. 2, we have denoted the probabilities as $\pi_t(i)$ to indicate that the probability of each state $i$ can change over time. This is because in a first-order Markov process, the probability of being in state $j$ at the next timepoint $t+1$ changes according to which state $i$ we are currently in. In more formal notation, we write this as the following:

$$\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) = P_{ij}$$

Eq. 3
This means that we have a probability matrix $\mathbf{P}$, with matrix elements $P_{ij}$, with the probability $P_{ij}$ being the probability of being in state $j$ at timepoint $t+1$ given we were in state $i$ at timepoint $t$. Notice that the probabilities $P_{ij}$ do not depend upon the time $t$. They are static.
Because which state we have at timepoint $t$ determines the probabilities of which state we get at the next timepoint $t+1$, it means the state probability distribution $\boldsymbol{\pi}_t$ influences the state probability distribution $\boldsymbol{\pi}_{t+1}$. In other words, the state probability distribution changes over time. It evolves. In a moment, we’ll derive the equation that shows precisely how $\boldsymbol{\pi}_t$ evolves over time, but first, we’ll look at the probability matrix $\mathbf{P}$ in more detail.
The transition probability matrix
We can think of Eq. 3 as saying that the probability of going from state $i$ to state $j$ is $P_{ij}$. That is, Eq. 3 describes the probabilities of transitioning from state $i$ to state $j$. Consequently, we refer to the probabilities $P_{ij}$ as transition probabilities, and the matrix $\mathbf{P}$ as a transition matrix. You will also see it referred to as a transition probability matrix, a transition rate matrix, or just a plain rate matrix since it describes the rate (frequency) with which we transition from one state to another.
Properties of the transition probability matrix
Since the transition matrix elements are probabilities, they must satisfy the rules of probabilities and probability distributions. Firstly, this means the following:
$$0 \le P_{ij} \le 1 \quad \text{for all } i, j$$

Eq. 4
Secondly, the total probability of going from state $i$ to any state must be 1 because we must end up in some state after having been in state $i$. In math, this means the following:

$$\sum_{j=1}^{N} P_{ij} = 1 \quad \text{for all } i$$

Eq. 5
In matrix algebra, we can write this as follows:

$$\mathbf{P}\,\mathbf{1} = \mathbf{1}$$

Eq. 6

Here, $\mathbf{1}$ is a column vector with $N$ elements that are all 1. To make this more concrete, let’s look at an example.
Epidemic modeling with a first-order discrete Markov process
When modeling the progression of a disease outbreak or an epidemic, one of the standard models used by epidemiologists and mathematical modelers is the SIR model. The acronym SIR refers to the three states in the model: S = susceptible to infection, I = infected, and R = recovered from infection. In one timestep, a person can move between these states. For example, a person who is susceptible to infection may become infected, while a person who has been infected may recover from their infection. For a person who had been infected and recovered, we would expect them to have built up some immunity to future infections. It is not guaranteed that a recovered person won’t become susceptible in the future, but we would expect the probability of transitioning from a recovered state to a susceptible state to be small, certainly smaller than the probability of transitioning from a susceptible state to an infected state. Similarly, a person who is infected can’t go straight back to being susceptible – they would have to recover first. Likewise, a person can’t jump straight from being in a susceptible state to a recovered state – they must become infected first.

We’ll map our S, I, and R states to the integers 1, 2, and 3, respectively. So, the susceptible state is equivalent to a state value of 1; the infected state is equivalent to a state value of 2; and the recovered state is equivalent to a state value of 3. Given this mapping and the aforementioned considerations, the transition matrix may look like this:
$$\mathbf{P} = \begin{pmatrix} 0.9 & 0.1 & 0.0 \\ 0.0 & 0.8 & 0.2 \\ 0.05 & 0.0 & 0.95 \end{pmatrix}$$

Eq. 7
The transition matrix in Eq. 7 is a simplification. In real epidemic modeling, the transition probabilities would depend on the proportions of the population in each state. For example, the more people who are infected, the higher the probability that someone susceptible will become infected, because there is a higher probability of that susceptible person encountering an infected person. This means that in real-world epidemic modeling the transition probabilities are themselves dynamic. However, for simplicity of explanation, we will take the SIR transition probabilities to be static and we will use the matrix in Eq. 7.
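As a quick numerical sanity check of Eq. 4 to Eq. 6 (a sketch of ours, assuming the numpy package), we can confirm that every element of the matrix in Eq. 7 lies between 0 and 1 and that each of its rows sums to 1:

import numpy as np

# The SIR transition matrix from Eq. 7
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Eq. 4: every element is a probability, so it lies in [0, 1]
print(np.all((rate_matrix >= 0.0) & (rate_matrix <= 1.0)))

# Eq. 5 and Eq. 6: each row sums to 1, i.e. multiplying the matrix
# by a column vector of ones gives back the column vector of ones
ones = np.ones(3)
print(np.allclose(rate_matrix @ ones, ones))

Both checks print True, as they must for a valid transition probability matrix.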
The transition probability matrix is a network
The matrix in Eq. 7 looks like the weighted adjacency matrices we studied in the previous chapter on networks. Does this mean we can think of our states and the transition rate matrix as a network and use network diagrams to represent it? The answer is yes. In fact, Figure 11.1 shows a network diagram representation of the transition matrix in Eq. 7:

Figure 11.1: Network representation of the SIR transition rate matrix in Eq. 7
If we look at both Figure 11.1 and the transition rate matrix in Eq. 7, we notice several things:
- The network in Figure 11.1 has arcs from each node to itself. This is because there is a nonzero probability of staying in each state from one timepoint to the next.
- The transition rate matrix is not symmetric. This makes sense from the perspective of our SIR model but is a very general occurrence. It would be unusual to study a problem where the transition rate matrix was symmetric.
- There are matrix elements in the transition rate matrix that are zero. This means that the transition occurs with zero probability. It is effectively forbidden. For example, the transition from a susceptible state to a recovered state has zero probability because you can’t recover if you haven’t been infected, and so you must transition from susceptible to infected first before you can get to the recovered state.
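As an aside, we can draw this kind of network diagram directly from the transition matrix. The following sketch (ours, not necessarily how Figure 11.1 was produced, and it assumes the networkx and matplotlib packages are installed) treats the matrix in Eq. 7 as the weighted adjacency matrix of a directed graph:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# The SIR transition matrix from Eq. 7, viewed as the
# weighted adjacency matrix of a directed graph
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Build the directed graph; edges exist only where the
# transition probability is nonzero (self-loops included)
G = nx.from_numpy_array(rate_matrix, create_using=nx.DiGraph)
G = nx.relabel_nodes(G, {0: 'S', 1: 'I', 2: 'R'})

# Draw the nodes and edges, labeling each edge with its
# transition probability
pos = nx.circular_layout(G)
nx.draw_networkx(G, pos, node_color='lightblue', node_size=1500)
edge_labels = {(u, v): d['weight'] for u, v, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
plt.axis('off')
plt.show()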
Using the transition matrix to generate state trajectories
The transition rate matrix in Eq. 7 tells us how to evolve the state from one timepoint to the next. That means we can use Eq. 7 to generate an example sequence of states for an individual person. We can start the sequence in the susceptible state. This is the $X_0$ state. From there, the first row in the transition rate matrix in Eq. 7 tells us the probability distribution from which the next state is drawn. This is the $X_1$ state. Once we know the $X_1$ state, we can select the corresponding row in the rate matrix and use that to sample the $X_2$ state. How do we draw states from a probability distribution? We learned about this in Chapter 2. We can use the numpy.random.choice function to do this for us. In fact, we’ll now look at a code example doing just that.
Code example – using the transition rate matrix to evolve the state
The code example given next, and more, can be found in the Code_Examples_Chap11.ipynb
Jupyter notebook in the GitHub repository. We’re going to start from the susceptible state, and sample the next state using the probabilities from the first row of the rate matrix in Eq. 7. We’ll repeat a further nine times, each time using the probability distribution from the row of the rate matrix corresponding to the current state. At the end, we will have a sequence of 11 states. This will be an example trajectory from our SIR first-order Markov process and could represent the trajectory of an individual person:
import numpy as np

# Set the seed for the numpy random number generator.
# This ensures that we get reproducible results
np.random.seed(17335)

### Define our states
# Our states are "Susceptible" = index 0,
# "Infected" = index 1
# "Recovered" = index 2
state_map = {0:'S', 1:'I', 2:'R'}

### Define our rate matrix
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

### We'll start in the susceptible state and then sample
### our next state and repeat for 10 iterations
n_iter = 10

# Set initial state and its label
current_state = 0
states_sequence = [state_map[current_state]]

# Do 10 iterations
for i in range(n_iter):
    # Get transition probabilities when starting from
    # our current state
    next_state_probs = rate_matrix[current_state, :]
    # Use numpy function to sample an integer from [0,1,2]
    # with the specified transition probabilities
    next_state = np.random.choice(3, p=next_state_probs)
    # Update our current state to this new state
    # and append its label to the sequence
    current_state = next_state
    states_sequence.append(state_map[current_state])
Let’s look at the generated sequence:
print(states_sequence)
This gives the following output:
['S', 'S', 'S', 'S', 'S', 'I', 'R', 'R', 'R', 'R', 'R']
We can see that the individual here remained in the susceptible state for five timepoints, then became infected, but quickly recovered.
The preceding code example shows how to generate an example trajectory from a first-order discrete Markov process, but one person’s trajectory from an SIR model does not give us a complete picture of the epidemic. What is happening across the whole population of individuals? How are the proportions of people in the S, I and R states changing over time? We’ll answer that question in the next sub-section.
Evolution of the state probability distribution
We can calculate the distribution of states at the next timepoint just using the rules of probability (see Chapter 2). We have the following:

$$\text{Prob}(X_{t+1} = j) = \sum_{i=1}^{N} \text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) \text{Prob}(X_t = i)$$

Eq. 8
In Eq. 8, the notation $\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right)$ is the probability of transitioning from state $i$ to state $j$ and so is just our transition matrix probability $P_{ij}$. Eq. 8 says that the probability of ending up in state $j$ at timepoint $t+1$ is the sum of the probabilities of all the ways we can get to $j$; that is, a sum over all the starting states $i$ at timepoint $t$. Recall our shorthand notation $\pi_t(i)$ for $\text{Prob}(X_t = i)$ and $\boldsymbol{\pi}_t$ for the vector of probabilities $\left(\pi_t(1), \pi_t(2), \ldots, \pi_t(N)\right)$. In terms of $\boldsymbol{\pi}_t$ and our transition rate matrix $\mathbf{P}$, we can write Eq. 8 as follows:

$$\boldsymbol{\pi}_{t+1} = \boldsymbol{\pi}_t \mathbf{P}$$

Eq. 9
Eq. 9 tells us how the probability distribution over states evolves over time. It tells us this in the language of the linear algebra we learned in Chapter 3. Eq. 9 is our evolution equation for first-order discrete Markov processes.
We can iterate Eq. 9. The probability distribution $\boldsymbol{\pi}_t$ can also be written, using Eq. 9, as $\boldsymbol{\pi}_t = \boldsymbol{\pi}_{t-1}\mathbf{P}$. Plugging this back into Eq. 9, we get the following:

$$\boldsymbol{\pi}_{t+1} = \boldsymbol{\pi}_{t-1}\mathbf{P}^2$$

Eq. 10

Continuing this iteration, we get the following:

$$\boldsymbol{\pi}_{t} = \boldsymbol{\pi}_{0}\mathbf{P}^{t}$$

Eq. 11
Eq. 11 tells us what our probability distribution will be at any later timepoint $t$, given our starting distribution $\boldsymbol{\pi}_0$ and our rate matrix $\mathbf{P}$. This is true even if we start at $t=0$ in a definite state. If at $t=0$ we are in a susceptible state in our SIR example, then we just set $\boldsymbol{\pi}_0 = (1, 0, 0)$. This reflects the fact that at $t=0$, before any epidemic has begun, 100% of people are in a susceptible state, no one is infected, and no one has yet recovered from an infection. Note that due to the stochastic nature of the state evolution, even if we start off with a 100% probability of being in the susceptible state, at a later timepoint $t$ we will have a probability distribution that is spread out, and the probabilities of being infected or recovered will be nonzero.
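To make Eq. 11 concrete, here is a minimal sketch (ours, assuming the numpy package and the SIR rate matrix from Eq. 7) that computes $\boldsymbol{\pi}_t = \boldsymbol{\pi}_0\mathbf{P}^t$ directly for a few values of $t$:

import numpy as np

# The SIR transition matrix from Eq. 7
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Start with 100% of people in the susceptible state
pi_0 = np.array([1.0, 0.0, 0.0])

# Eq. 11: pi_t = pi_0 P^t, computed here using a matrix power
for t in [1, 5, 20]:
    pi_t = pi_0 @ np.linalg.matrix_power(rate_matrix, t)
    print(t, pi_t)

As $t$ grows, the printed distributions spread out across the infected and recovered states, exactly as described above.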
What happens over the long term? How much spreading out of the state probability distribution happens? What happens if we make $t$ big? What happens if we take $t \to \infty$? We’ll find out in the next sub-section.
Stationary distributions and limiting distributions
Let’s increase $t$ in Eq. 11. We will assume (for now) that as we increase $t$, the state probability distribution settles down and gets closer and closer to some particular set of values. This is the limiting distribution. We’ll use the notation $\boldsymbol{\pi}_\infty$ to represent this limiting distribution. By definition of taking $t \to \infty$, the limiting distribution $\boldsymbol{\pi}_\infty$ is mathematically defined from Eq. 11 as follows:

$$\boldsymbol{\pi}_\infty = \lim_{t \to \infty} \boldsymbol{\pi}_0 \mathbf{P}^{t}$$

Eq. 12
You’ll notice that we have written $\boldsymbol{\pi}_\infty$ on the left-hand side of Eq. 12 as though it doesn’t depend upon the starting distribution $\boldsymbol{\pi}_0$. That is because it doesn’t. No matter what starting distribution $\boldsymbol{\pi}_0$ we start with, if we iterate the evolution equation in Eq. 9 for an infinite amount of time, we will always end up with the distribution $\boldsymbol{\pi}_\infty$. This is not quite true in general, but it will be for many real-world situations, and it will certainly be true for all the examples we present here – see point 1 in the Notes and further reading section at the end of the chapter for more details.
With that caveat in place, can we unpack what $\boldsymbol{\pi}_\infty$ represents more intuitively? Mathematically, $\boldsymbol{\pi}_\infty$ is what we get if we iterate the evolution equation in Eq. 9 for an infinite amount of time. But this also means that $\boldsymbol{\pi}_\infty$ is close to the distribution we would get if we iterated Eq. 9 for a long but finite amount of time. In our SIR example, this means that $\boldsymbol{\pi}_\infty$ would be close to the distribution we would get at the end of a long epidemic. It is as though our system is approaching a stable stationary equilibrium. So, how do we identify this stationary equilibrium? This brings us to another concept and helps us address the question of whether we always have a limiting distribution.
The new concept we need is that of a stationary distribution. As the name suggests, a stationary distribution is a state probability distribution that doesn’t change in time. Because it doesn’t change in time, we’ll use the notation $\boldsymbol{\pi}$ to represent it. In comparison to $\boldsymbol{\pi}_t$, we have dropped the subscript $t$ because, by definition, the stationary distribution does not depend upon time. Because $\boldsymbol{\pi}$ doesn’t change with time, it means that when we apply the evolution equation in Eq. 9, we must get $\boldsymbol{\pi}$ back. In other words, the stationary state distribution satisfies the following equation:

$$\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$$

Eq. 13
This equation looks familiar. It is saying that when we operate on the vector $\boldsymbol{\pi}$ with the matrix $\mathbf{P}$, we get the vector $\boldsymbol{\pi}$. It is an eigenvector equation of the kind we met in Chapter 3. Formally, the stationary distribution $\boldsymbol{\pi}$ is a left eigenvector of the matrix $\mathbf{P}$ corresponding to an eigenvalue of 1. It is a left eigenvector because we are multiplying by $\mathbf{P}$ with $\boldsymbol{\pi}$ to the left of $\mathbf{P}$.
Another way of saying Eq. 13 is that we can only get a stationary distribution if the transition matrix has a left eigenvector with eigenvalue 1. This gives a mathematical way of identifying stationary distributions.
We can also take the transpose of Eq. 13 and get the following:

$$\mathbf{P}^{\mathsf{T}} \boldsymbol{\pi}^{\mathsf{T}} = \boldsymbol{\pi}^{\mathsf{T}}$$

Eq. 14

This means that the stationary distribution, or rather its transpose $\boldsymbol{\pi}^{\mathsf{T}}$, is also a right eigenvector of $\mathbf{P}^{\mathsf{T}}$ corresponding to an eigenvalue of 1.
But what is so special about stationary distributions? Well, let’s take our limiting distribution $\boldsymbol{\pi}_\infty$ and apply our evolution equation Eq. 9 to $\boldsymbol{\pi}_\infty$ one more time. What does $\boldsymbol{\pi}_\infty \mathbf{P}$ give us? Because $\boldsymbol{\pi}_\infty$ is what we got when we applied $\mathbf{P}$ to $\boldsymbol{\pi}_0$ an infinite number of times, applying $\mathbf{P}$ to $\boldsymbol{\pi}_\infty$ is the same as applying $\mathbf{P}$ to $\boldsymbol{\pi}_0$ infinity+1 times. But infinity+1 is still infinity, so it can’t have changed anything. This means applying $\mathbf{P}$ to $\boldsymbol{\pi}_\infty$ must give us back $\boldsymbol{\pi}_\infty$. In math, we have the following:

$$\boldsymbol{\pi}_\infty \mathbf{P} = \boldsymbol{\pi}_\infty$$

Eq. 15
Comparing Eq. 15 to Eq. 13 immediately tells us that our limiting distribution $\boldsymbol{\pi}_\infty$, if it exists, is a stationary distribution. This also means that stationary distributions are candidates for a limiting distribution. And we already know how to identify stationary distributions from the left eigenvectors of $\mathbf{P}$. The gotcha here is that a limiting distribution is a stationary distribution, but not all stationary distributions are limiting distributions. That is why we said the stationary distributions of $\mathbf{P}$ only give us candidates for the limiting distribution. It is possible for a transition matrix $\mathbf{P}$ to have stationary distributions but no limiting distribution.
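To see how this can happen, consider a small two-state example of our own (it is not one we use elsewhere in this chapter). Take the transition matrix

$$\mathbf{P} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Its stationary distribution is $\boldsymbol{\pi} = \left(\tfrac{1}{2}, \tfrac{1}{2}\right)$, since $\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$. However, if we start from $\boldsymbol{\pi}_0 = (1, 0)$, then Eq. 9 gives $\boldsymbol{\pi}_1 = (0, 1)$, $\boldsymbol{\pi}_2 = (1, 0)$, $\boldsymbol{\pi}_3 = (0, 1)$, and so on forever. The state probability distribution oscillates and never settles down, so this transition matrix has a stationary distribution but no limiting distribution.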
With that gotcha understood, let’s look at a code example. We’ll revisit our SIR example and calculate the left eigenvectors of the transition matrix $\mathbf{P}$ in Eq. 7, and we’ll iterate the evolution equation Eq. 9 a large number of times.
Code example – stationary and limiting distributions
The code example given next, and more, can be found in the Code_Examples_Chap11.ipynb
Jupyter notebook in the GitHub repository. Any stable distribution of the first-order Markov process is given by a right eigenvector of the transpose of the transition rate matrix corresponding to eigenvalue 1. We can use the numpy.linalg.eig
function, which computes the right eigenvectors and eigenvalues of a square matrix. We’ll assume we have already imported the numpy
package as np
and still have the rate matrix from our previous SIR code example in memory. First, we compute the right eigenvectors of the transpose of the rate matrix:
# Compute the right eigenvectors and eigenvalues of the
# transpose of our rate matrix
rate_eigen = np.linalg.eig(np.transpose(rate_matrix))
We can then look at the eigenvalues:
# The eigenvalues are held in the first element of the tuple
rate_eigen[0]
This gives the following output:
array([0.825+0.06614378j, 0.825-0.06614378j, 1.0+0.j])
The last eigenvalue is 1 (its imaginary part is zero). Therefore, we can use the last eigenvector as the stable distribution. It will be normalized to unit length, but we need its elements to sum to 1. But remember from Chapter 3 that any multiple of an eigenvector is still an eigenvector with the same eigenvalue, so we can just re-scale it to sum to 1:
# Extract the last eigenvector (we can drop the zero imaginary part)
pi_stable = np.real(rate_eigen[1][:, 2])
# Rescale the vector so its elements sum to 1
pi_stable /= np.sum(pi_stable)
Let’s look at the stable distribution:
# Print the stable distribution
pi_stable
This gives the following output:
array([0.28571429, 0.14285714, 0.57142857])
From this, we can see that for the stable distribution, 28.6% of people are in the susceptible state, 14.3% are in the infected state, and 57.1% are in the recovered state.
So, we have a stable distribution, but do we have a limiting distribution, and if so, is it the same as the stable distribution? To approximate any limiting distribution, we’ll start from a definite initial state; that is, a distribution that has only one nonzero value. We’ll then apply the transition rate matrix a large number (100) of times to get an approximation of the limiting distribution:
# Set the initial state distribution. We'll
# set it to a distribution representing 100% of people
# being in the susceptible state
current_distribution = np.array([1.0, 0.0, 0.0])

n_iter = 100
for i in range(n_iter):
    # Get the state distribution at the next timepoint by
    # multiplying by the transition rate matrix
    next_distribution = np.matmul(current_distribution, rate_matrix)
    current_distribution = next_distribution

current_distribution
This gives the following output for the approximation of the limiting distribution:
array([0.28571429, 0.14285715, 0.57142856])
By comparing our approximation of the limiting distribution to the stable distribution we obtained previously, we can see they are numerically identical (up to the precision of the calculation). Our limiting distribution is the same as our stable distribution. If our epidemic were to run for a long time without any further interventions, we would have 28.6% susceptible, 14.3% infected, and 57.1% recovered.
Having done that simple but instructive code example, we’ll now look at an interesting characteristic of first-order Markov processes: the fact that they are memoryless.
First-order discrete Markov processes are memoryless
Wait! What? What do we mean by saying our first-order Markov process is memoryless? The transition probabilities depend upon the current state. The next state is strongly influenced by what has come before. Surely that is memory?
Not quite. What we’re talking about here is conditional independence. Once we know (or specify) the current state, we have all the information we need to produce the next state (by sampling from the appropriate row of the transition matrix). The state preceding the current state has no influence on the next state once we have specified the current state. Given the current state, the next state is independent of the preceding state. That is what we mean by conditional independence. This property is also termed the Markov property and is what we mean when we say a system is Markovian.
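In the notation of this chapter, one standard way to write the Markov property as an equation (consistent with Eq. 3) is the following:

$$\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0\right) = \text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) = P_{ij}$$

That is, once we condition on the current state $X_t$, additionally conditioning on the earlier history changes nothing.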
In practical terms, what does this mean? In a first-order model, specifying the current state means we completely forget about the preceding state. That is what we mean by memoryless. It is a strength of Markov models, as it simplifies the mathematics. But it is also a weakness of Markov models, as it means that, beyond the current state, the state history has no effect on the next state. We’ll address this weakness in part in the next section, but we won’t be able to eliminate it entirely.
To finish this section on first-order discrete Markov processes, we’ll return to the fact that they are probabilistic models. If you remember from Chapter 5, a probabilistic model allows us to calculate things such as the likelihood of the data given the model and its parameters. We can do lots of useful things with the likelihood once we have a formula for it. So, what is the formula for the likelihood of an observed state sequence given a first-order discrete Markov model?
Likelihood of the state sequence
Imagine we have a state sequence for timepoints $t = 0, 1, \ldots, T$. This means we have a set of random variables $X_0, X_1, \ldots, X_T$ for which we got the following observed state values: $i_0, i_1, \ldots, i_T$. What is the likelihood of that sequence of observed state values? Remember from Chapter 5 that the likelihood is the probability of the data given the model parameters. In this case, the model parameters are the transition probabilities and the initial state distribution. Can we write the probability of the data in terms of these model parameters? To see that we can, we’ll first write the probability of the data as follows:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \text{Prob}(X_0 = i_0) \times \text{Prob}(X_1 = i_1 \mid X_0 = i_0) \times \text{Prob}(X_2 = i_2 \mid X_1 = i_1) \times \cdots \times \text{Prob}(X_T = i_T \mid X_{T-1} = i_{T-1})$$

Eq. 16
The right-hand side of Eq. 16 follows from the conditional independence property of the first-order Markov process. Once we know the state at time $t$, the probability distribution of the next state at time $t+1$ is entirely known and given by the appropriate row of the transition matrix. This means we can write the right-hand side of Eq. 16 as follows:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \pi_0(i_0) \prod_{t=0}^{T-1} P_{i_t i_{t+1}}$$

Eq. 17
In Eq. 17, $\boldsymbol{\pi}_0$ is the state probability distribution at $t = 0$. Eq. 17 has expressed the likelihood in terms of the model parameters, but we can simplify the expression in Eq. 17 further. Because many of the transitions we observe in the state sequence will be of the same type (for example, a susceptible-to-infected transition or an infected-to-recovered transition in our SIR epidemic example), we can combine them in Eq. 17. We then get the following:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \pi_0(i_0) \prod_{i,j} P_{ij}^{\,n_{ij}}$$

Eq. 18
Here, $n_{ij}$ is the number of times we observe a transition from state $i$ to state $j$ in the state sequence $i_0, i_1, \ldots, i_T$. The expression in Eq. 18 is very compact and easy to calculate, as all we need to do is count the number of observed transitions of each type.
Having obtained a simple expression for the likelihood of an observed state sequence, can we use it to do useful things, such as estimating the model parameters via maximum likelihood estimation (MLE)? The answer is yes, and so next, we’ll look at how to use Eq. 18 to estimate the transition probabilities via maximum likelihood.
MLE of the transition probabilities
The log-likelihood of the observed state sequence is given by taking the logarithm of Eq. 18. We get the following:
$$\ln \text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \ln \pi_0(i_0) + \sum_{i,j} n_{ij} \ln P_{ij}$$

Eq. 19
We want to maximize the log-likelihood in Eq. 19 with respect to each $P_{ij}$, subject to the constraints $\sum_{j} P_{ij} = 1$ for all $i$. We do this maximization using the differential calculus recapped in Chapter 1 and using Lagrange multipliers to impose the constraints. Differentiating Eq. 19 with respect to $P_{ij}$ and setting the derivative to zero, we get the following solution:

$$\hat{P}_{ij} = \frac{n_{ij}}{\lambda_i}$$

Eq. 20
In Eq. 20, $\lambda_i$ is the Lagrange multiplier for the constraint $\sum_{j} P_{ij} = 1$. The hat symbol on top of $\hat{P}_{ij}$ is there to remind us that this is an estimate of $P_{ij}$ and not the true value of $P_{ij}$ itself. The expression in Eq. 20 is very simple, and we can easily adjust $\lambda_i$ to match the constraint, to finally get the following:

$$\hat{P}_{ij} = \frac{n_{ij}}{\sum_{k=1}^{N} n_{ik}}$$

Eq. 21
The expression in Eq. 21 is very simple and very pleasing. It says that the maximum likelihood estimates of the transition probabilities are just the observed proportions of each transition type.
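As a minimal illustration of Eq. 21 (a sketch of ours, assuming the numpy package and reusing the states_sequence list generated in the earlier code example), we can count the observed transitions and then row-normalize the counts to get the maximum likelihood estimates:

import numpy as np

# Map the state labels back to integer indices
label_to_index = {'S': 0, 'I': 1, 'R': 2}
observed = [label_to_index[s] for s in states_sequence]

# Count the observed transitions n_ij
n_states = 3
transition_counts = np.zeros((n_states, n_states))
for current, following in zip(observed[:-1], observed[1:]):
    transition_counts[current, following] += 1

# Eq. 21: the MLE of P_ij is n_ij divided by the total number
# of observed transitions out of state i (avoiding division by
# zero for states that never appear in the sequence)
row_totals = transition_counts.sum(axis=1, keepdims=True)
P_hat = np.divide(transition_counts, row_totals,
                  out=np.zeros_like(transition_counts),
                  where=row_totals > 0)
print(P_hat)

With only 10 observed transitions the estimates will be crude, but the same counting procedure gives increasingly accurate estimates of the true transition matrix as the observed sequence gets longer.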
We have covered a lot in this section on first-order discrete Markov processes. That is because they are extremely useful for building simple models of dynamical processes. They are a workhorse model – used everywhere. Let’s recap what we have learned about them.
What we learned
In this section, we have learned the following:
- How first-order discrete Markov processes model how a system evolves from one state to another over time
- That the time variable in a first-order discrete Markov process is discrete; that is, it increases in steps
- That a first-order discrete Markov process is a probabilistic model
- That the probability distribution from which the next state is drawn depends only on the current state and is given by the corresponding row of the transition probability matrix
- That the transition matrix can be represented as a network (a graph)
- How to generate trajectories from a first-order discrete Markov process using the rate matrix
- About stable distributions of first-order discrete Markov processes and how they are given by left eigenvectors of the rate matrix corresponding to an eigenvalue of 1
- About limiting distributions of first-order discrete Markov processes
- That we don’t always have a limiting distribution, but if we do, it will correspond to a stable distribution of the Markov process
- That first-order discrete Markov processes are memoryless
- How to calculate the likelihood of an observed state sequence and how to use the likelihood formula to estimate the transition probabilities via maximum likelihood
Having learned about first-order discrete Markov processes, in the next section, we will build upon and extend the idea to having the transition probabilities depend upon a longer amount of history than just the preceding state. This introduces what are known as higher-order discrete Markov processes. Since we will be building upon the concepts and terminology of first-order Markov processes, a lot of the next section will be very similar, and so it will be a much shorter section.