Probability theory

The title of this section may be a little bit pretentious as we are not going to learn probability theory in just a few pages, but that is not my intention. I just want to introduce a few general and important concepts that are necessary to better understand Bayesian methods, and that should be enough for understanding the rest of the book. If necessary, we will expand or introduce new probability-related concepts as we need them. For a detailed study of probability theory, I highly recommend the book, Introduction to Probability by Joseph K Blitzstein and Jessica Hwang. Another useful book could be Mathematical Theory of Bayesian Statistics by Sumio Watanabe, as the title says, the book is more Bayesian-oriented than the first, and also heavier on the mathematical side.

Interpreting probabilities

While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of probability. From a Bayesian perspective, a probability is a measure that quantifies the uncertainty level of a statement. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 x 10^-31 kg, or the probability of the 9^th of July of 1816 being a sunny day in Buenos Aires. Notice, for example, that life on Mars exists or does not exist; the outcome is binary, a yes-no question. But given that we are not sure about that fact, a sensible course of action is trying to find out how likely life on Mars is. Since this definition of probability is related to our epistemic state of mind, people often call it the subjective definition of probability. But notice that any scientific-minded person will not use their personal beliefs, or the information provided by an angel to answer such a question, instead they will use all the relevant geophysical data about Mars, and all the relevant biochemical knowledge about necessary conditions for life, and so on. Therefore, Bayesian probabilities, and by extension Bayesian statistics, is as subjective (or objective) as any other well-established scientific method we have.

If we do not have information about a problem, it is reasonable to state that every possible event is equally likely, formally this is equivalent to assigning the same probability to every possible event. In the absence of information, our uncertainty is maximum. If we know instead that some events are more likely, then this can be formally represented by assigning higher probability to those events and less to the others. Notice than when we talk about events in stats-speak, we are not restricting ourselves to things that can happen, such as an asteroid crashing into Earth or my auntie's 60^th birthday party; an event is just any of the possible values (or subset of values) a variable can take, such as the event that you are older than 30, or the price of a Sachertorte, or the number of bikes sold last year around the world.

The concept of probability it is also related to the subject of logic. Under Aristotelian or classical logic, we can only have statements that take the values of true or false. Under the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1, a false statement has probability of 0. We would assign a probability of 1 to the statement, There is Martian life, only after having conclusive data indicating something is growing, and reproducing, and doing other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars even when it is not. Related to this point is Cromwell's rule, stating that we should reserve the use of the prior probabilities of 0 or 1 to logically true or false statements. Interestingly enough, Richard Cox mathematically proved that if we want to extend logic to include uncertainty, we must use probabilities and probability theory. Bayes' theorem is just a logical consequence of the rules of probability, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic when dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense—people often used the term subjective.

To summarize, using probability to model uncertainty is not necessarily related to the debate about whether nature is deterministic or random at is most fundamental level, nor is related to subjective personal beliefs. Instead, it is a purely methodological approach to model uncertainty. We recognize most phenomena are difficult to grasp because we generally have to deal with incomplete and/or noisy data, we are intrinsically limited by our evolution-sculpted primate brain, or any other sound reason you could add. As a consequence, we use a modeling approach that explicitly takes uncertainty into account.

From a practical point of view, the most relevant piece of information from this section is that Bayesian's use probabilities as a tool to quantify uncertainty.

Now that we've discussed the Bayesian interpretation of probability, let's learn about a few of the mathematical properties of probabilities.

Defining probabilities

Probabilities are numbers in the [0, 1] interval, that is, numbers between 0 and 1, including both extremes. Probabilities follow a few rules; one of these rules is the product rule:

We read this as follows: the probability of and is equal to the probability of given , times the probability of . The expression represents the joint probability of and . The expression is used to indicate a conditional probability; the name refers to the fact that the probability of is conditioned on knowing . For example, the probability that the pavement is wet is different from the probability that the pavement is wet if we know (or given that) it's raining. A conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing does not provides us with information about , then . This will be true only if and are independent of each other. On the contrary, if knowing gives us useful information about , then the conditional probability could be larger or smaller than the unconditional probability, depending on whether knowing makes less or more likely. Let's see a simple example using a fair six-sided die. What is the probability of getting the number 3 if we roll the die, ? This is 1/6 since each of the six numbers has the same chance for a fair six-sided die. And what is the probability of getting the number 3 given that we have obtained an odd number, ? This is 1/3, because if we know we have an odd number, the only possible numbers are and each of them has the same chance. Finally, what is the probability of ? This is 0, because if we know the number is even, then the only possible ones are , and thus getting a 3 is not possible.

As we can see from this simple example, by conditioning on observed data we effectively change the probability of events, and with it, the uncertainty we have about them. Conditional probabilities are the heart of statistics, irrespective of your problem being rolling dice or building self-driving cars.

Probability distributions

A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events, such as for a die (we exclude undefined). A common and useful conceptualization in statistics is to think data is generated from some true probability distribution with unknown parameters. Then, inference is the process of finding out the values of those parameters using just a sample (also known as a dataset) from the true probability distribution. In general, we do not have access to the true probability distribution and thus we must resign ourselves and create a model in an attempt to somehow approximate that distribution. Probabilistic models are built by properly combining probability distributions.

Notice that, in general, we cannot be sure of whether our model is correct and thus we need to evaluate and criticize the models in order to gain confidence and convince others about our models being suitable for the problem we want to explore or solve.

If a variable, , can be described by a probability distribution, then we call a random variable. As a general rule, people use capital letters, such as , to indicate the objects random variable, and , to indicate an instance of that random variable. can be a vector and thus contains many elements or individual values . Let's see an example using Python; our true probability distribution will be a Normal (or Gaussian) distribution with means of and ; these two parameters completely and unambiguously define a Normal distribution. Using SciPy, we can define the random variable, , by writing stats.norm(μ, σ) and we can get an instance, , from it using the rvs (random variates) method. In the following example, we ask for three values:

μ = 0.
σ = 1.
X = stats.norm(μ, σ)
x = X.rvs(3)

You will notice that each time you execute this code (in stats-speak: every trial), you will get a different random result. Notice that once the values of the parameters of a distributions are known, the probability of each value is also known; what is random is the exact values we will get at each trial. A common misconception with the meaning of random is that you can get any possible value from a random variable or that all values are equally likely. The allowed values of a random variable and their probabilities are strictly controlled by a probability distribution, and the randomness arises only from the fact that we can not predict the exact values we will get in each trial. Every time we execute the preceding code, we will get three different numbers, but if we iterate the code thousands of times, we will be able to empirically check that the mean of the sampled values is around zero and that 95% of the samples will have values within the [-1.96, +1.96] range. Please do not trust me, use your Pythonic powers and verify it for yourself. We can arrive at this same conclusion if we study the mathematical properties of the Normal distribution.

A common notation used in statistics to indicate that a variable is distributed as a normal distribution with parameters and is:

In this context, when you find the (tilde) symbol, say is distributed as.

In many texts, it is common to see the normal distribution expressed in terms of the variance and not the standard deviation. Thus, people write

. In this book, we are going to parameterize the Normal distribution using the standard deviation, first because it is easier to interpret, and second because this is how PyMC3 works.

We will meet several probability distributions in this book; every time we discover one, we will take a moment to describe it. We started with the normal distribution because it is like the Zeus of the probability distributions. A variable, , follows a Gaussian distribution if its values are dictated by the following expression:

This is the probability density function for the Normal distribution; you do not need to memorize expression 1.3, I just like to show it to you so you can see where the numbers come from. As we have already mentioned, and are the parameters of the distribution, and by specifying those values we totally define the distribution; you can see this from expression 1.3 as the other terms are all constants. can take any real value, that is, , and dictates the mean of the distribution (and also the median and mode, which are all equal). is the standard deviation, which can only be positive and dictates the spread of the distribution, the larger the value of , the more spread out the distribution. Since there are an infinite number of possible combinations of and , there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family.

Mathematical formulas are concise and unambiguous and some people say even beautiful, but we must admit that meeting them for the first time can be intimidating, specially for those not very mathematically inclined; a good way to break the ice is to use Python to explore them:

mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 200)
_, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, 
                     sharey=True,
                     figsize=(9, 7), constrained_layout=True)
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)
        ax[i,j].plot(x, y)
        ax[i,j].plot([], label="μ = {:3.2f}\nσ = {:3.2f}".format(mu, 
                     sd), alpha=0)
        ax[i,j].legend(loc=1)
ax[2,1].set_xlabel('x')
ax[1,0].set_ylabel('p(x)', rotation=0, labelpad=20)
ax[1,0].set_yticks([])

Most of the preceding code is for plotting, the probabilistic part is performed by the y = stats.norm(mu, sd).pdf(x) line. With this line, we are evaluating the probability density function of the normal distribution, given the mu and sd parameters for a set of x values. The preceding code generates Figure 1.1. In each subplot we have a blue (dark gray) curve representing a Gaussian distribution with specific parameters, and , included in each subplot's legend:

Figure 1.1

Most of the figures in this book are generated directly from the code preceding them, many times even without a leading phrase connecting the code to the figure. This pattern should be familiar to those of you working with Jupyter Notebooks or Jupyter Lab.

There are two types of random variables: continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), and discrete variables can take only certain values (we can use Python integers to represent them). The Normal distribution is a continuous distribution.

Notice that in the preceding plot, the yticks are omitted; this is a feature not a bug. The reason to omit them is that, in my experience, those values do not add too much relevant information and could be potentially confusing to some people. Let me explain myself: the actual numbers on the y axis do not really matter—what is relevant is their relative values. If you take two values from , let's say and _, and you find that (two times higher in the plot), you can safely say the probability of the value of is twice the probability of . This is something that most people will understand intuitively, and fortunately is the correct interpretation. The only tricky part when dealing with continuous distributions is that the values plotted on the y axis are not probabilities, but probability densities. To get a proper probability, you have to integrate between a given interval, that is, you have to compute the area below the curve (for that interval). While probabilities cannot be greater than one, probability densities can, and is the total area under the probability density curve the one restricted to be 1. Understanding the difference between probabilities and probability densities is crucial from a mathematical point of view. For the practical approach used in this book, we can be a little bit more sloppy, since that difference is not that important as long as you get the correct intuition on how to interpret the previous plot in terms of relative values.

Independently and identically distributed variables

Many models assume that successive values of random variables are all sampled from the same distribution and those values are independent of each other. In such a case, we will say that the variables are independently and identically distributed (iid) variables for short. Using mathematical notation, we can see that two variables are independent if for every value of and .

A common example of non-iid variables are temporal series, where a temporal dependency in the random variable is a key feature that should be taken into account. Take, for example, the following data coming from http://cdiac.esd.ornl.gov. This data is a record of atmospheric CO₂ measurements from 1959 to 1997. We are going to load the data (including the accompanying code) and plot it:

data = np.genfromtxt('../data/mauna_loa_CO2.csv', delimiter=',')
plt.plot(data[:,0], data[:,1])
plt.xlabel('year')
plt.ylabel('$CO_2$ (ppmv)')
plt.savefig('B11197_01_02.png', dpi=300)

Figure 1.2

Each data point corresponds to the measured levels of atmospheric CO₂ per month. The temporal dependency of data points is easy to see in this plot. In fact, we have two trends here: a seasonal one (this is related to cycles of vegetation growth and decay), and a global one, indicating an increasing concentration of atmospheric CO₂.

Bayes' theorem

Now that we have learned some of the basic concepts and jargon from probability theory, we can move to the moment everyone was waiting for. Without further ado, let's contemplate, in all its majesty, Bayes' theorem:

Well, it's not that impressive, is it? It looks like an elementary school formula, and yet, paraphrasing Richard Feynman, this is all you need to know about Bayesian statistics.

Learning where Bayes' theorem comes from will help us to understand its meaning.

According to the product rule we have:

This can also be written as:

Given that the terms on the left are equal for equations 1.5 and 1.6, we can combine them and write:

And if we reorder 1.7, we get expression 1.4, thus is Bayes' theorem.

Now, let's see what formula 1.4 implies and why it is important. First, it says that is not necessarily the same as . This is a very important fact, one that is easy to miss in daily situations even for people trained in statistics and probability. Let's use a simple example to clarify why these quantities are not necessarily the same. The probability of a person being the Pope given that this person is Argentinian is not the same as the probability of being Argentinian given that this person is the Pope. As there are around 44,000,000 Argentinians alive and a single one of them is the current Pope, we have and we also have _.

If we replace with hypothesis and with data, Bayes' theorem tells us how to compute the probability of a hypothesis, , given the data, , and that's the way you will find Bayes' theorem explained in a lot of places. But, how do we turn a hypothesis into something that we can put inside Bayes' theorem? Well, we do it by using probability distributions. So, in general, our hypothesis is a hypothesis in a very, very, very narrow sense; we will be more precise if we talk about finding a suitable value for parameters in our models, that is, parameters of probability distributions. By the way, don't try to set to statements such as unicorns are real, unless you are willing to build a realistic probabilistic model of unicorn existence!

Bayes' theorem is central to Bayesian statistics, as we will see in Chapter 2, Programming Probabilistically using tool such as PyMC3 free ourselves of the need to explicitly write Bayes' theorem every time we build a Bayesian model. Nevertheless, it is important to know the name of its parts because we will constantly refer to them and it is important to understand what each part means because this will help us to conceptualize models, so let's do it:

: Prior
: Likelihood
: Posterior
: Marginal likelihood

The prior distribution should reflect what we know about the value of the parameter before seeing the data, . If we know nothing, like Jon Snow, we could use flat priors that do not convey too much information. In general, we can do better than flat priors, as we will learn in this book. The use of priors is why some people still talk about Bayesian statistics as subjective, even when priors are just another assumption that we made when modeling and hence are just as subjective (or objective) as any other assumption, such as likelihoods.

The likelihood is how we will introduce data in our analysis. It is an expression of the plausibility of the data given the parameters. In some texts, you will find people call this term sampling model, statistical model, or just model. We will stick to the name likelihood and we will call the combination of priors and likelihood model.

The posterior distribution is the result of the Bayesian analysis and reflects all that we know about a problem (given our data and model). The posterior is a probability distribution for the parameters in our model and not a single value. This distribution is a balance between the prior and the likelihood. There is a well-known joke: A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule. One excellent way to kill the mood after hearing this joke is to explain that if the likelihood and priors are both vague, you will get a posterior reflecting vague beliefs about seeing a mule rather than strong ones. Anyway, I like the joke, and I like how it captures the idea of a posterior being somehow a compromise between prior and likelihood. Conceptually, we can think of the posterior as the updated prior in the light of (new) data. In fact, the posterior from one analysis can be used as the prior for a new analysis. This makes Bayesian analysis particularly suitable for analyzing data that becomes available in sequential order. Some examples could be an early warning system for natural disasters that processes online data coming from meteorological stations and satellites. For more details, read about online machine learning methods.

The last term is the marginal likelihood, also known as evidence. Formally, the marginal likelihood is the probability of observing the data averaged over all the possible values the parameters can take (as prescribed by the prior). Anyway, for most of this book, we will not care about the marginal likelihood, and we will think of it as a simple normalization factor. We can do this because when analyzing the posterior distribution, we will only care about the relative values of the parameters and not their absolute values. You may remember that we mentioned this when we talked about how to interpret plots of probability distributions in the previous section. If we ignore the marginal likelihood, we can write Bayes' theorem as a proportionality:

Understanding the exact role of each term in Bayes' theorem will take some time and practice, and it will require we work through a few examples, and that's what the rest of this book is for.