First-order discrete Markov processes
A discrete first-order Markov process is one of the simplest dynamical systems we can study. Our time variable is discrete, and we have a finite number of states between which we can move from one timepoint to the next.
Since the possible states are discrete, we can label them using the integer values 1, 2, 3, and so on. We also use $i$ and $j$ to represent generic states. The state at time $t$ is denoted by the variable $X_t$, and so a trajectory of observations from timepoint $t=0$ to timepoint $t=T$ would be represented by the sequence $X_0, X_1, \ldots, X_T$. For example, if we have five possible states $\{1, 2, 3, 4, 5\}$ and we set $T=10$, then we may have the specific trajectory $X_0=3, X_1=4, X_2=4, X_3=3, X_4=4, X_5=3, X_6=5, X_7=1, X_8=2, X_9=2, X_{10}=4$. We can shorten the representation of the trajectory and say the observed trajectory is the sequence [3,4,4,3,4,3,5,1,2,2,4].
By taking the possible states that the system can be in to be a finite set, we simplify the mathematics considerably. However, since we can make that finite set as large as we want, we have considerable flexibility in the real-world systems to which we can apply first-order discrete Markov processes.
Variations of first-order Markov processes
We have used the words discrete and process to describe our Markov models. But be aware that Markov process is a very general term that applies to a variety of different model types. You will also see the term Markov chain used to describe the discrete-time discrete state-space models we study here. That is because we have a sequence or chain of states.
We can, of course, have different variants of a Markov process. The possible variants are the following:
1. Discrete time and discrete state space – what we study in the rest of this chapter
2. Continuous time and discrete state space
3. Discrete time and continuous state space
4. Continuous time and continuous state space

All of these are Markov processes, but the term Markov chain tends to be reserved for variants 1-3, where either the time variable or the state space is discrete. We will study discrete-time and discrete state-space Markov processes (Markov chains) in the rest of this chapter; for shorthand, we will refer to them just as discrete Markov processes.
A Markov process is a probabilistic model
What determines the value of $X_t$? The interesting aspect of a first-order discrete Markov process is that the trajectory is stochastic, not deterministic. In our previous example, where we had five possible states, if we are in state 4 at timepoint $t$ (that is, $X_t = 4$), then the state we observe at the next timepoint $t+1$ could be any of the five states 1, 2, 3, 4, or 5. In our example trajectory, we had $X_4 = 4$, and the state at the next timepoint, $X_5$, turned out to be 3. This makes a first-order discrete Markov process a probabilistic model, so we’ll need to introduce some probabilities.
In general, we consider $X_t$ to be a random variable. At timepoint $t$, the probability distribution from which $X_t$ is drawn we’ll denote as $\boldsymbol{\pi}_t$, meaning that $X_t \sim \boldsymbol{\pi}_t$ in the notation from Chapter 2. If we have a total of $N$ possible states, then $\boldsymbol{\pi}_t$ is a vector of $N$ probabilities. It gives us the proportions of observations we would expect to see in each state at timepoint $t$. We can write this vector out in long hand as an array:

$$\boldsymbol{\pi}_t = \left(\text{Prob}(X_t = 1), \text{Prob}(X_t = 2), \ldots, \text{Prob}(X_t = N)\right) = \left(\pi_t(1), \pi_t(2), \ldots, \pi_t(N)\right)$$

Eq. 2
On the right-hand side of Eq. 2, we have denoted the probabilities as $\pi_t(i)$ to indicate that the probability of each state $i$ can change over time. This is because in a first-order Markov process, the probability of being in state $j$ at the next timepoint $t+1$ changes according to which state $i$ we are currently in. In more formal notation, we write this as the following:

$$\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) = P_{ij}$$

Eq. 3
This means that we have a probability matrix $\mathbf{P}$, with matrix elements $P_{ij}$, with the probability $P_{ij}$ being the probability of being in state $j$ at timepoint $t+1$ given we were in state $i$ at timepoint $t$. Notice that the probabilities $P_{ij}$ do not depend upon the time $t$. They are static.
Because which state we have at timepoint $t$ determines the probabilities of which state we get at the next timepoint $t+1$, it means the state probability distribution $\boldsymbol{\pi}_t$ influences the state probability distribution $\boldsymbol{\pi}_{t+1}$. In other words, the state probability distribution changes over time. It evolves. In a moment, we’ll derive the equation that shows precisely how $\boldsymbol{\pi}_t$ evolves over time, but first, we’ll look at the probability matrix $\mathbf{P}$ in more detail.
The transition probability matrix
We can think of Eq. 3 as saying that the probability of going from state $i$ to state $j$ is $P_{ij}$. That is, Eq. 3 describes the probabilities of transitioning from state $i$ to state $j$. Consequently, we refer to the probabilities $P_{ij}$ as transition probabilities, and the matrix $\mathbf{P}$ as a transition matrix. You will also see it referred to as a transition probability matrix, a transition rate matrix, or just a plain rate matrix since it describes the rate (frequency) with which we transition from one state to another.
Properties of the transition probability matrix
Since the transition matrix elements are probabilities, they must satisfy the rules of probabilities and probability distributions. Firstly, this means the following:
$$0 \le P_{ij} \le 1 \quad \text{for all } i, j$$

Eq. 4
Secondly, the total probability of going from state $i$ to any state must be 1 because we must end up in some state after having been in state $i$. In math, this means the following:

$$\sum_{j=1}^{N} P_{ij} = 1 \quad \text{for all } i$$

Eq. 5
In matrix algebra, we can write this as follows:

$$\mathbf{P}\,\mathbf{1} = \mathbf{1}$$

Eq. 6

Here, $\mathbf{1}$ is a column vector with $N$ elements that are all 1. To make this more concrete, let’s look at an example.
Epidemic modeling with a first-order discrete Markov process
When modeling the progression of a disease outbreak or an epidemic, one of the standard models used by epidemiologists and mathematical modelers is the SIR model. The acronym SIR refers to the three states in the model: S = susceptible to infection, I = infected, and R = recovered from infection. In one timestep, a person can move between these states. For example, a person who is susceptible to infection may become infected, while a person who has been infected may recover from their infection. For a person who had been infected and recovered, we would expect them to have built up some immunity to future infections. It is not guaranteed that a recovered person won’t become susceptible in the future, but we would expect the probability of transitioning from a recovered state to a susceptible state to be small, certainly smaller than the probability of transitioning from a susceptible state to an infected state. Similarly, a person who is infected can’t go straight back to being susceptible – they would have to recover first. Likewise, a person can’t jump straight from being in a susceptible state to a recovered state – they must become infected first.

We’ll map our S, I, and R states to the integers 1, 2, and 3, respectively. So, the susceptible state is equivalent to a state value of 1; the infected state is equivalent to a state value of 2; and the recovered state is equivalent to a state value of 3. Given this mapping and the aforementioned considerations, the transition matrix may look like this:
$$\mathbf{P} = \begin{pmatrix} 0.9 & 0.1 & 0.0 \\ 0.0 & 0.8 & 0.2 \\ 0.05 & 0.0 & 0.95 \end{pmatrix}$$

Eq. 7
The transition matrix in Eq. 7 is a simplification. In real epidemic modeling, the transition probabilities would depend on the proportions of the population in each state. For example, the more people who are infected, the higher the probability that someone susceptible will become infected, because there is a higher probability of that susceptible person encountering an infected person. This means that in real-world epidemic modeling the transition probabilities are themselves dynamic. However, for simplicity of explanation, we will take the SIR transition probabilities to be static and we will use the matrix in Eq. 7.
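As a quick numerical sanity check of Eq. 4 to Eq. 6 (a sketch of ours, assuming the numpy package), we can confirm that every element of the matrix in Eq. 7 lies between 0 and 1 and that each of its rows sums to 1:

import numpy as np

# The SIR transition matrix from Eq. 7
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Eq. 4: every element is a probability, so it lies in [0, 1]
print(np.all((rate_matrix >= 0.0) & (rate_matrix <= 1.0)))

# Eq. 5 and Eq. 6: each row sums to 1, i.e. multiplying the matrix
# by a column vector of ones gives back the column vector of ones
ones = np.ones(3)
print(np.allclose(rate_matrix @ ones, ones))

Both checks print True, as they must for a valid transition probability matrix.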
The transition probability matrix is a network
The matrix in Eq. 7 looks like the weighted adjacency matrices we studied in the previous chapter on networks. Does this mean we can think of our states and the transition rate matrix as a network and use network diagrams to represent it? The answer is yes. In fact, Figure 11.1 shows a network diagram representation of the transition matrix in Eq. 7:

Figure 11.1: Network representation of the SIR transition rate matrix in Eq. 7
If we look at both Figure 11.1 and the transition rate matrix in Eq. 7, we notice several things:
- The network in Figure 11.1 has arcs from each node to itself. This is because there is a nonzero probability of staying in each state from one timepoint to the next.
- The transition rate matrix is not symmetric. This makes sense from the perspective of our SIR model but is a very general occurrence. It would be unusual to study a problem where the transition rate matrix was symmetric.
- There are matrix elements in the transition rate matrix that are zero. This means that the transition occurs with zero probability. It is effectively forbidden. For example, the transition from a susceptible state to a recovered state has zero probability because you can’t recover if you haven’t been infected, and so you must transition from susceptible to infected first before you can get to the recovered state.
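As an aside, we can draw this kind of network diagram directly from the transition matrix. The following sketch (ours, not necessarily how Figure 11.1 was produced, and it assumes the networkx and matplotlib packages are installed) treats the matrix in Eq. 7 as the weighted adjacency matrix of a directed graph:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# The SIR transition matrix from Eq. 7, viewed as the
# weighted adjacency matrix of a directed graph
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Build the directed graph; edges exist only where the
# transition probability is nonzero (self-loops included)
G = nx.from_numpy_array(rate_matrix, create_using=nx.DiGraph)
G = nx.relabel_nodes(G, {0: 'S', 1: 'I', 2: 'R'})

# Draw the nodes and edges, labeling each edge with its
# transition probability
pos = nx.circular_layout(G)
nx.draw_networkx(G, pos, node_color='lightblue', node_size=1500)
edge_labels = {(u, v): d['weight'] for u, v, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
plt.axis('off')
plt.show()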
Using the transition matrix to generate state trajectories
The transition rate matrix in Eq. 7 tells us how to evolve the state from one timepoint to the next. That means we can use Eq. 7 to generate an example sequence of states for an individual person. We can start the sequence in the susceptible state. This is the $X_0$ state. From there, the first row in the transition rate matrix in Eq. 7 tells us the probability distribution from which the next state is drawn. This is the $X_1$ state. Once we know the $X_1$ state, we can select the corresponding row in the rate matrix and use that to sample the $X_2$ state. How do we draw states from a probability distribution? We learned about this in Chapter 2. We can use the numpy.random.choice function to do this for us. In fact, we’ll now look at a code example doing just that.
Code example – using the transition rate matrix to evolve the state
The code example given next, and more, can be found in the Code_Examples_Chap11.ipynb
Jupyter notebook in the GitHub repository. We’re going to start from the susceptible state, and sample the next state using the probabilities from the first row of the rate matrix in Eq. 7. We’ll repeat a further nine times, each time using the probability distribution from the row of the rate matrix corresponding to the current state. At the end, we will have a sequence of 11 states. This will be an example trajectory from our SIR first-order Markov process and could represent the trajectory of an individual person:
import numpy as np

# Set the seed for the numpy random number generator.
# This ensures that we get reproducible results
np.random.seed(17335)

### Define our states
# Our states are "Susceptible" = index 0,
# "Infected" = index 1
# "Recovered" = index 2
state_map = {0:'S', 1:'I', 2:'R'}

### Define our rate matrix
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

### We'll start in the susceptible state and then sample
### our next state and repeat for 10 iterations
n_iter = 10

# Set initial state and its label
current_state = 0
states_sequence = [state_map[current_state]]

# Do 10 iterations
for i in range(n_iter):
    # Get transition probabilities when starting from
    # our current state
    next_state_probs = rate_matrix[current_state, :]
    # Use numpy function to sample an integer from [0,1,2]
    # with the specified transition probabilities
    next_state = np.random.choice(3, p=next_state_probs)
    # Update our current state to this new state
    # and append its label to the sequence
    current_state = next_state
    states_sequence.append(state_map[current_state])
Let’s look at the generated sequence:
print(states_sequence)
This gives the following output:
['S', 'S', 'S', 'S', 'S', 'I', 'R', 'R', 'R', 'R', 'R']
We can see that the individual here remained in the susceptible state for five timepoints, then became infected, but quickly recovered.
The preceding code example shows how to generate an example trajectory from a first-order discrete Markov process, but one person’s trajectory from an SIR model does not give us a complete picture of the epidemic. What is happening across the whole population of individuals? How are the proportions of people in the S, I and R states changing over time? We’ll answer that question in the next sub-section.
Evolution of the state probability distribution
We can calculate the distribution of states at the next timepoint just using the rules of probability (see Chapter 2). We have the following:

$$\text{Prob}(X_{t+1} = j) = \sum_{i=1}^{N} \text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) \text{Prob}(X_t = i)$$

Eq. 8
In Eq. 8, the notation $\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right)$ is the probability of transitioning from state $i$ to state $j$ and so is just our transition matrix probability $P_{ij}$. Eq. 8 says that the probability of ending up in state $j$ at timepoint $t+1$ is the sum of the probabilities of all the ways we can get to $j$; that is, a sum over all the starting states $i$ at timepoint $t$. Recall our shorthand notation $\pi_t(i)$ for $\text{Prob}(X_t = i)$ and $\boldsymbol{\pi}_t$ for the vector of probabilities $\left(\pi_t(1), \pi_t(2), \ldots, \pi_t(N)\right)$. In terms of $\boldsymbol{\pi}_t$ and our transition rate matrix $\mathbf{P}$, we can write Eq. 8 as follows:

$$\boldsymbol{\pi}_{t+1} = \boldsymbol{\pi}_t \mathbf{P}$$

Eq. 9
Eq. 9 tells us how the probability distribution over states evolves over time. It tells us this in the language of the linear algebra we learned in Chapter 3. Eq. 9 is our evolution equation for first-order discrete Markov processes.
We can iterate Eq. 9. The probability distribution $\boldsymbol{\pi}_t$ can also be written, using Eq. 9, as $\boldsymbol{\pi}_t = \boldsymbol{\pi}_{t-1}\mathbf{P}$. Plugging this back into Eq. 9, we get the following:

$$\boldsymbol{\pi}_{t+1} = \boldsymbol{\pi}_{t-1}\mathbf{P}^2$$

Eq. 10

Continuing this iteration, we get the following:

$$\boldsymbol{\pi}_{t} = \boldsymbol{\pi}_{0}\mathbf{P}^{t}$$

Eq. 11
Eq. 11 tells us what our probability distribution will be at any later timepoint $t$, given our starting distribution $\boldsymbol{\pi}_0$ and our rate matrix $\mathbf{P}$. This is true even if we start at $t=0$ in a definite state. If at $t=0$ we are in a susceptible state in our SIR example, then we just set $\boldsymbol{\pi}_0 = (1, 0, 0)$. This reflects the fact that at $t=0$, before any epidemic has begun, 100% of people are in a susceptible state, no one is infected, and no one has yet recovered from an infection. Note that due to the stochastic nature of the state evolution, even if we start off with a 100% probability of being in the susceptible state, at a later timepoint $t$ we will have a probability distribution that is spread out, and the probabilities of being infected or recovered will be nonzero.
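To make Eq. 11 concrete, here is a minimal sketch (ours, assuming the numpy package and the SIR rate matrix from Eq. 7) that computes $\boldsymbol{\pi}_t = \boldsymbol{\pi}_0\mathbf{P}^t$ directly for a few values of $t$:

import numpy as np

# The SIR transition matrix from Eq. 7
rate_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.05, 0.0, 0.95]])

# Start with 100% of people in the susceptible state
pi_0 = np.array([1.0, 0.0, 0.0])

# Eq. 11: pi_t = pi_0 P^t, computed here using a matrix power
for t in [1, 5, 20]:
    pi_t = pi_0 @ np.linalg.matrix_power(rate_matrix, t)
    print(t, pi_t)

As $t$ grows, the printed distributions spread out across the infected and recovered states, exactly as described above.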
What happens over the long term? How much spreading out of the state probability distribution happens? What happens if we make $t$ big? What happens if we take $t \to \infty$? We’ll find out in the next sub-section.
Stationary distributions and limiting distributions
Let’s increase $t$ in Eq. 11. We will assume (for now) that as we increase $t$, the state probability distribution settles down and gets closer and closer to some particular set of values. This is the limiting distribution. We’ll use the notation $\boldsymbol{\pi}_\infty$ to represent this limiting distribution. By definition of taking $t \to \infty$, the limiting distribution $\boldsymbol{\pi}_\infty$ is mathematically defined from Eq. 11 as follows:

$$\boldsymbol{\pi}_\infty = \lim_{t \to \infty} \boldsymbol{\pi}_0 \mathbf{P}^{t}$$

Eq. 12
You’ll notice that we have written $\boldsymbol{\pi}_\infty$ on the left-hand side of Eq. 12 as though it doesn’t depend upon the starting distribution $\boldsymbol{\pi}_0$. That is because it doesn’t. No matter what starting distribution $\boldsymbol{\pi}_0$ we start with, if we iterate the evolution equation in Eq. 9 for an infinite amount of time, we will always end up with the distribution $\boldsymbol{\pi}_\infty$. This is not quite true in general, but it will be for many real-world situations, and it will certainly be true for all the examples we present here – see point 1 in the Notes and further reading section at the end of the chapter for more details.
With that caveat in place, can we unpack what $\boldsymbol{\pi}_\infty$ represents more intuitively? Mathematically, $\boldsymbol{\pi}_\infty$ is what we get if we iterate the evolution equation in Eq. 9 for an infinite amount of time. But this also means that $\boldsymbol{\pi}_\infty$ is close to the distribution we would get if we iterated Eq. 9 for a long but finite amount of time. In our SIR example, this means that $\boldsymbol{\pi}_\infty$ would be close to the distribution we would get at the end of a long epidemic. It is as though our system is approaching a stable stationary equilibrium. So, how do we identify this stationary equilibrium? This brings us to another concept and helps us address the question of whether we always have a limiting distribution.
The new concept we need is that of a stationary distribution. As the name suggests, a stationary distribution is a state probability distribution that doesn’t change in time. Because it doesn’t change in time, we’ll use the notation $\boldsymbol{\pi}$ to represent it. In comparison to $\boldsymbol{\pi}_t$, we have dropped the subscript $t$ because, by definition, the stationary distribution does not depend upon time. Because $\boldsymbol{\pi}$ doesn’t change with time, it means that when we apply the evolution equation in Eq. 9, we must get $\boldsymbol{\pi}$ back. In other words, the stationary state distribution satisfies the following equation:

$$\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$$

Eq. 13
This equation looks familiar. It is saying that when we operate on the vector $\boldsymbol{\pi}$ with the matrix $\mathbf{P}$, we get the vector $\boldsymbol{\pi}$. It is an eigenvector equation of the kind we met in Chapter 3. Formally, the stationary distribution $\boldsymbol{\pi}$ is a left eigenvector of the matrix $\mathbf{P}$ corresponding to an eigenvalue of 1. It is a left eigenvector because we are multiplying by $\mathbf{P}$ with $\boldsymbol{\pi}$ to the left of $\mathbf{P}$.
Another way of saying Eq. 13 is that we can only get a stationary distribution if the transition matrix has a left eigenvector with eigenvalue 1. This gives a mathematical way of identifying stationary distributions.
We can also take the transpose of Eq. 13 and get the following:

$$\mathbf{P}^{\mathsf{T}} \boldsymbol{\pi}^{\mathsf{T}} = \boldsymbol{\pi}^{\mathsf{T}}$$

Eq. 14

This means that the stationary distribution, or rather its transpose $\boldsymbol{\pi}^{\mathsf{T}}$, is also a right eigenvector of $\mathbf{P}^{\mathsf{T}}$ corresponding to an eigenvalue of 1.
But what is so special about stationary distributions? Well, let’s take our limiting distribution $\boldsymbol{\pi}_\infty$ and apply our evolution equation Eq. 9 to $\boldsymbol{\pi}_\infty$ one more time. What does $\boldsymbol{\pi}_\infty \mathbf{P}$ give us? Because $\boldsymbol{\pi}_\infty$ is what we got when we applied $\mathbf{P}$ to $\boldsymbol{\pi}_0$ an infinite number of times, applying $\mathbf{P}$ to $\boldsymbol{\pi}_\infty$ is the same as applying $\mathbf{P}$ to $\boldsymbol{\pi}_0$ infinity+1 times. But infinity+1 is still infinity, so it can’t have changed anything. This means applying $\mathbf{P}$ to $\boldsymbol{\pi}_\infty$ must give us back $\boldsymbol{\pi}_\infty$. In math, we have the following:

$$\boldsymbol{\pi}_\infty \mathbf{P} = \boldsymbol{\pi}_\infty$$

Eq. 15
Comparing Eq. 15 to Eq. 13 immediately tells us that our limiting distribution $\boldsymbol{\pi}_\infty$, if it exists, is a stationary distribution. This also means that stationary distributions are candidates for a limiting distribution. And we already know how to identify stationary distributions from the left eigenvectors of $\mathbf{P}$. The gotcha here is that a limiting distribution is a stationary distribution, but not all stationary distributions are limiting distributions. That is why we said the stationary distributions of $\mathbf{P}$ only give us candidates for the limiting distribution. It is possible for a transition matrix $\mathbf{P}$ to have stationary distributions but no limiting distribution.
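To see how this can happen, consider a small two-state example of our own (it is not one we use elsewhere in this chapter). Take the transition matrix

$$\mathbf{P} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Its stationary distribution is $\boldsymbol{\pi} = \left(\tfrac{1}{2}, \tfrac{1}{2}\right)$, since $\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$. However, if we start from $\boldsymbol{\pi}_0 = (1, 0)$, then Eq. 9 gives $\boldsymbol{\pi}_1 = (0, 1)$, $\boldsymbol{\pi}_2 = (1, 0)$, $\boldsymbol{\pi}_3 = (0, 1)$, and so on forever. The state probability distribution oscillates and never settles down, so this transition matrix has a stationary distribution but no limiting distribution.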
With that gotcha understood, let’s look at a code example. We’ll revisit our SIR example and calculate the left eigenvectors of the transition matrix $\mathbf{P}$ in Eq. 7, and we’ll iterate the evolution equation Eq. 9 a large number of times.
Code example – stationary and limiting distributions
The code example given next, and more, can be found in the Code_Examples_Chap11.ipynb
Jupyter notebook in the GitHub repository. Any stable distribution of the first-order Markov process is given by a right eigenvector of the transpose of the transition rate matrix corresponding to eigenvalue 1. We can use the numpy.linalg.eig
function, which computes the right eigenvectors and eigenvalues of a square matrix. We’ll assume we have already imported the numpy
package as np
and still have the rate matrix from our previous SIR code example in memory. First, we compute the right eigenvectors of the transpose of the rate matrix:
# Compute the right eigenvectors and eigenvalues of the
# transpose of our rate matrix
rate_eigen = np.linalg.eig(np.transpose(rate_matrix))
We can then look at the eigenvalues:
# The eigenvalues are held in the first element of the tuple
rate_eigen[0]
This gives the following output:
array([0.825+0.06614378j, 0.825-0.06614378j, 1.0+0.j])
The last eigenvalue is 1 (its imaginary part is zero). Therefore, we can use the last eigenvector as the stable distribution. It will be normalized to unit length, but we need its elements to sum to 1. But remember from Chapter 3 that any multiple of an eigenvector is still an eigenvector with the same eigenvalue, so we can just re-scale it to sum to 1:
# Extract the last eigenvector (we can drop the zero imaginary part)
pi_stable = np.real(rate_eigen[1][:, 2])
# Rescale the vector so its elements sum to 1
pi_stable /= np.sum(pi_stable)
Let’s look at the stable distribution:
# Print the stable distribution
pi_stable
This gives the following output:
array([0.28571429, 0.14285714, 0.57142857])
From this, we can see that for the stable distribution, 28.6% of people are in the susceptible state, 14.3% are in the infected state, and 57.1% are in the recovered state.
So, we have a stable distribution, but do we have a limiting distribution, and if so, is it the same as the stable distribution? To approximate any limiting distribution, we’ll start from a definite initial state; that is, a distribution that has only one nonzero value. We’ll then apply the transition rate matrix a large number (100) of times to get an approximation of the limiting distribution:
# Set the initial state distribution. We'll
# set it to a distribution representing 100% of people
# being in the susceptible state
current_distribution = np.array([1.0, 0.0, 0.0])

n_iter = 100
for i in range(n_iter):
    # Get the state distribution at the next timepoint by
    # multiplying by the transition rate matrix
    next_distribution = np.matmul(current_distribution, rate_matrix)
    current_distribution = next_distribution

current_distribution
This gives the following output for the approximation of the limiting distribution:
array([0.28571429, 0.14285715, 0.57142856])
By comparing our approximation of the limiting distribution to the stable distribution we obtained previously, we can see they are numerically identical (up to the precision of the calculation). Our limiting distribution is the same as our stable distribution. If our epidemic were to run for a long time without any further interventions, we would have 28.6% susceptible, 14.3% infected, and 57.1% recovered.
Having done that simple but instructive code example, we’ll now look at an interesting characteristic of first-order Markov processes: the fact that they are memoryless.
First-order discrete Markov processes are memoryless
Wait! What? What do we mean by saying our first-order Markov process is memoryless? The transition probabilities depend upon the current state. The next state is strongly influenced by what has come before. Surely that is memory?
Not quite. What we’re talking about here is conditional independence. Once we know (or specify) the current state, we have all the information we need to produce the next state (by sampling from the appropriate row of the transition matrix). The state preceding the current state has no influence on the next state once we have specified the current state. Given the current state, the next state is independent of the preceding state. That is what we mean by conditional independence. This property is also termed the Markov property and is what we mean when we say a system is Markovian.
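In the notation of this chapter, one standard way to write the Markov property as an equation (consistent with Eq. 3) is the following:

$$\text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0\right) = \text{Prob}\left(X_{t+1} = j \,\middle|\, X_t = i\right) = P_{ij}$$

That is, once we condition on the current state $X_t$, additionally conditioning on the earlier history changes nothing.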
In practical terms, what does this mean? In a first-order model, specifying the current state means we completely forget about the preceding state. That is what we mean by memoryless. It is a strength of Markov models, as it simplifies the mathematics. But it is also a weakness of Markov models, as it means that, beyond the current state, the state history has no effect on the next state. We’ll address this weakness in part in the next section, but we won’t be able to eliminate it entirely.
To finish this section on first-order discrete Markov processes, we’ll return to the fact that they are probabilistic models. If you remember from Chapter 5, a probabilistic model allows us to calculate things such as the likelihood of the data given the model and its parameters. We can do lots of useful things with the likelihood once we have a formula for it. So, what is the formula for the likelihood of an observed state sequence given a first-order discrete Markov model?
Likelihood of the state sequence
Imagine we have a state sequence for timepoints $t = 0, 1, \ldots, T$. This means we have a set of random variables $X_0, X_1, \ldots, X_T$ for which we got the following observed state values: $i_0, i_1, \ldots, i_T$. What is the likelihood of that sequence of observed state values? Remember from Chapter 5 that the likelihood is the probability of the data given the model parameters. In this case, the model parameters are the transition probabilities and the initial state distribution. Can we write the probability of the data in terms of these model parameters? To see that we can, we’ll first write the probability of the data as follows:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \text{Prob}(X_0 = i_0) \times \text{Prob}(X_1 = i_1 \mid X_0 = i_0) \times \text{Prob}(X_2 = i_2 \mid X_1 = i_1) \times \cdots \times \text{Prob}(X_T = i_T \mid X_{T-1} = i_{T-1})$$

Eq. 16
The right-hand side of Eq. 16 follows from the conditional independence property of the first-order Markov process. Once we know the state at time $t$, the probability distribution of the next state at time $t+1$ is entirely known and given by the appropriate row of the transition matrix. This means we can write the right-hand side of Eq. 16 as follows:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \pi_0(i_0) \prod_{t=0}^{T-1} P_{i_t i_{t+1}}$$

Eq. 17
In Eq. 17, $\boldsymbol{\pi}_0$ is the state probability distribution at $t = 0$. Eq. 17 has expressed the likelihood in terms of the model parameters, but we can simplify the expression in Eq. 17 further. Because many of the transitions we observe in the state sequence will be of the same type (for example, a susceptible-to-infected transition or an infected-to-recovered transition in our SIR epidemic example), we can combine them in Eq. 17. We then get the following:

$$\text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \pi_0(i_0) \prod_{i,j} P_{ij}^{\,n_{ij}}$$

Eq. 18
Here, $n_{ij}$ is the number of times we observe a transition from state $i$ to state $j$ in the state sequence $i_0, i_1, \ldots, i_T$. The expression in Eq. 18 is very compact and easy to calculate, as all we need to do is count the number of observed transitions of each type.
Having obtained a simple expression for the likelihood of an observed state sequence, can we use it to do useful things, such as estimating the model parameters via maximum likelihood estimation (MLE)? The answer is yes, and so next, we’ll look at how to use Eq. 18 to estimate the transition probabilities via maximum likelihood.
MLE of the transition probabilities
The log-likelihood of the observed state sequence is given by taking the logarithm of Eq. 18. We get the following:
$$\ln \text{Prob}(X_0 = i_0, X_1 = i_1, \ldots, X_T = i_T) = \ln \pi_0(i_0) + \sum_{i,j} n_{ij} \ln P_{ij}$$

Eq. 19
We want to maximize the log-likelihood in Eq. 19 with respect to each $P_{ij}$, subject to the constraints $\sum_{j} P_{ij} = 1$ for all $i$. We do this maximization using the differential calculus recapped in Chapter 1 and using Lagrange multipliers to impose the constraints. Differentiating Eq. 19 with respect to $P_{ij}$ and setting the derivative to zero, we get the following solution:

$$\hat{P}_{ij} = \frac{n_{ij}}{\lambda_i}$$

Eq. 20
In Eq. 20, $\lambda_i$ is the Lagrange multiplier for the constraint $\sum_{j} P_{ij} = 1$. The hat symbol on top of $\hat{P}_{ij}$ is there to remind us that this is an estimate of $P_{ij}$ and not the true value of $P_{ij}$ itself. The expression in Eq. 20 is very simple, and we can easily adjust $\lambda_i$ to match the constraint, to finally get the following:

$$\hat{P}_{ij} = \frac{n_{ij}}{\sum_{k=1}^{N} n_{ik}}$$

Eq. 21
The expression in Eq. 21 is very simple and very pleasing. It says that the maximum likelihood estimates of the transition probabilities are just the observed proportions of each transition type.
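As a minimal illustration of Eq. 21 (a sketch of ours, assuming the numpy package and reusing the states_sequence list generated in the earlier code example), we can count the observed transitions and then row-normalize the counts to get the maximum likelihood estimates:

import numpy as np

# Map the state labels back to integer indices
label_to_index = {'S': 0, 'I': 1, 'R': 2}
observed = [label_to_index[s] for s in states_sequence]

# Count the observed transitions n_ij
n_states = 3
transition_counts = np.zeros((n_states, n_states))
for current, following in zip(observed[:-1], observed[1:]):
    transition_counts[current, following] += 1

# Eq. 21: the MLE of P_ij is n_ij divided by the total number
# of observed transitions out of state i (avoiding division by
# zero for states that never appear in the sequence)
row_totals = transition_counts.sum(axis=1, keepdims=True)
P_hat = np.divide(transition_counts, row_totals,
                  out=np.zeros_like(transition_counts),
                  where=row_totals > 0)
print(P_hat)

With only 10 observed transitions the estimates will be crude, but the same counting procedure gives increasingly accurate estimates of the true transition matrix as the observed sequence gets longer.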
We have covered a lot in this section on first-order discrete Markov processes. That is because they are extremely useful for building simple models of dynamical processes. They are a workhorse model – used everywhere. Let’s recap what we have learned about them.
What we learned
In this section, we have learned the following:
- How first-order discrete Markov processes model how a system evolves from one state to another over time
- That the time variable in a first-order discrete Markov process is discrete; that is, it increases in steps
- That a first-order discrete Markov process is a probabilistic model
- That the probability distribution from which the next state is drawn depends only on the current state and is given by the corresponding row of the transition probability matrix
- That the transition matrix can be represented as a network (a graph)
- How to generate trajectories from a first-order discrete Markov process using the rate matrix
- About stable distributions of first-order discrete Markov processes and how they are given by left eigenvectors of the rate matrix corresponding to an eigenvalue of 1
- About limiting distributions of first-order discrete Markov processes
- That we don’t always have a limiting distribution, but if we do, it will correspond to a stable distribution of the Markov process
- That first-order discrete Markov processes are memoryless
- How to calculate the likelihood of an observed state sequence and how to use the likelihood formula to estimate the transition probabilities via maximum likelihood
Having learned about first-order discrete Markov processes, in the next section, we will build upon and extend the idea to having the transition probabilities depend upon a longer amount of history than just the preceding state. This introduces what are known as higher-order discrete Markov processes. Since we will be building upon the concepts and terminology of first-order Markov processes, a lot of the next section will be very similar, and so it will be a much shorter section.