Probabilities and uncertainty
While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the level of uncertainty about a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between 0 and 1; that is, in the absence of information, all values are equally likely and our uncertainty is at its maximum. If we know instead that coins tend to be balanced, then we may say that the probability of a coin landing heads is exactly 0.5, or around 0.5 if we admit that the balance is not perfect. If we now collect data, we can update these prior assumptions and hopefully reduce the uncertainty about the bias of the coin. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 × 10⁻³¹ kg, or the probability of the 9th of July of 1816 being a sunny day. Notice, for example, that the question of whether or not life exists on Mars has a binary outcome, but what we are really asking is: how likely is it to find life on Mars given our data and what we know about biology and the physical conditions on that planet? The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we cannot be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, it is sometimes referred to as the subjective definition of probability, which explains the label of subjective statistics often attached to the Bayesian paradigm.
Nevertheless, this definition does not mean that all statements should be treated as equally valid and that anything goes; this definition is about acknowledging that our understanding of the world is imperfect and conditioned on the data and models we have made. There is no such thing as a model-free or theory-free understanding of the world; even if it were possible to free ourselves from our social preconditioning, we would end up with a biological limitation: our brain, subject to the evolutionary process, has been wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and, in general, the best we can do is to make probabilistic statements about it. Notice that it does not matter if the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty.
Logic is about thinking without making mistakes. Under Aristotelian or classical logic, statements can only take the values true or false. Under the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1 and a false one has a probability of 0. We would assign a probability of 1 to life on Mars only after having conclusive data indicating that something is growing, reproducing, and doing other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Related to this point is Cromwell's rule, which states that we should reserve prior probabilities of 0 or 1 for logically true or false statements. Interestingly enough, Cox mathematically proved that if we want to extend logic to include uncertainty, we must use probabilities and probability theory. Bayes' theorem is just a logical consequence of the rules of probability, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic when dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang.
Probabilities are numbers in the interval [0, 1], that is, numbers between 0 and 1, including both extremes. Probabilities follow some rules; one of these rules is the product rule:
$$p(A, B) = p(A \mid B)\, p(B)$$
We read this as follows: the probability of A and B is equal to the probability of A given B, times the probability of B. The expression p(A, B) represents the joint probability of A and B. The expression p(A|B) is used to indicate a conditional probability; the name refers to the fact that the probability of A is conditioned on knowing B. For example, the probability that a pavement is wet is different from the probability that the pavement is wet if we know (or given that) it is raining. A conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing B does not provide us with information about A, then p(A|B) = p(A); that is, A and B are independent of each other. On the contrary, if knowing B gives us useful information about A, then the conditional probability could be larger or smaller than the unconditional probability, depending on whether knowing B makes A more or less likely.
Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon. Let's try to understand them from a different perspective. If we reorder the equation for the product rule, we get the following:
$$p(A \mid B) = \frac{p(A, B)}{p(B)}$$
Notice that a conditional probability is always greater than or equal to the joint probability. The reasons are that we do not condition on zero-probability events (this is implied in the expression) and that probabilities are restricted to the interval [0, 1], so dividing by p(B) can never make the result smaller. Why do we divide by p(B)? Knowing B is equivalent to saying that we have restricted the space of possible events to B and thus, to find the conditional probability, we take the favorable cases and divide them by the total number of events. It is important to realize that all probabilities are indeed conditional; there is no such thing as an absolute probability floating in a vacuum. There is always some model, assumption, or condition, even if we don't notice or know them. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe. In the same way, the probability of a coin landing heads or tails depends on our assumptions about the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.
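To make the pavement example concrete, here is a minimal sketch that computes a conditional probability from a joint table. The joint probabilities and the helper names (`marginal_a`, `marginal_b`, `conditional`) are hypothetical, chosen only for illustration.

```python
# Hypothetical joint probabilities p(A, B) for two binary events:
# A = state of the pavement, B = weather. The numbers are made up.
joint = {
    ("wet", "rain"): 0.18,
    ("wet", "no rain"): 0.07,
    ("dry", "rain"): 0.02,
    ("dry", "no rain"): 0.73,
}


def marginal_a(joint, a):
    """p(A=a): sum the joint over every value of B."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)


def marginal_b(joint, b):
    """p(B=b): sum the joint over every value of A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)


def conditional(joint, a, b):
    """p(A=a | B=b) = p(A=a, B=b) / p(B=b)."""
    return joint[(a, b)] / marginal_b(joint, b)


p_wet = marginal_a(joint, "wet")
p_rain = marginal_b(joint, "rain")
p_wet_given_rain = conditional(joint, "wet", "rain")

print(f"p(wet)        = {p_wet:.2f}")
print(f"p(wet | rain) = {p_wet_given_rain:.2f}")

# The product rule recovers the joint probability.
print(f"p(wet | rain) * p(rain) = {p_wet_given_rain * p_rain:.2f}")
```

With these made-up numbers, knowing that it is raining makes a wet pavement more likely than it is unconditionally, and the conditional probability is not smaller than the joint one.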
Probability distributions
A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes' theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models.
We will meet several probability distributions throughout the book; every time we discover one we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable x follows a Gaussian distribution if its values are dictated by the following formula:
$$p(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
In the formula, μ and σ are the parameters of the distribution. The first one can take any real value, that is, μ ∈ ℝ, and dictates the mean of the distribution (and also the median and mode, which are all equal). The second is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there are an infinite number of possible combinations of μ and σ values, there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:
[Figure: probability density curves of the Gaussian family for different values of μ and σ]
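The Gaussian formula is simple enough to evaluate directly. The following is a minimal sketch using only the Python standard library; it is not the book's listing, and the function name `gaussian_pdf` and the chosen values of μ and σ are illustrative assumptions. It prints the density at a few points for several members of the family.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Points at which to evaluate each density (arbitrary choice).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A few (mu, sigma) pairs, i.e. a few members of the Gaussian family.
params = [(0.0, 0.5), (0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]

print("x values:".ljust(22) + "  ".join(f"{x:+.1f} " for x in xs))
for mu, sigma in params:
    row = "  ".join(f"{gaussian_pdf(x, mu, sigma):.3f}" for x in xs)
    print(f"mu={mu:+.1f}, sigma={sigma:.1f}:".ljust(22) + row)
```

Reading across the rows, changing μ moves the location of the peak, while a larger σ lowers the peak and spreads the density over a wider range of x.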
A variable, such as x, that comes from a probability distribution is called a random variable. It is not that the variable can take any possible value. On the contrary, the values are strictly controlled by the probability distribution; the randomness arises from the fact that we cannot predict which value the variable will take, but only the probability of observing those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters μ and σ is as follows:
$$x \sim \mathcal{N}(\mu, \sigma)$$
The symbol ~ (tilde) is read as "is distributed as".
There are two types of random variables, continuous and discrete. Continuous random variables can take any value from some interval (we can use Python floats to represent them), and discrete random variables can take only certain values (we can use Python integers to represent them).
Many models assume that successive values of a random variable are all sampled from the same distribution and that those values are independent of each other. In such a case, we will say that the variables are independent and identically distributed, or iid variables for short. Using mathematical notation, we can say that two variables are independent if for every value of x and y:
$$p(x, y) = p(x)\, p(y)$$
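The independence condition can be checked mechanically for small discrete cases. The sketch below is only an illustration: the helper `is_independent` and the two toy joint distributions are assumptions made for this example, and exact fractions are used to avoid rounding issues in the comparison.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Return True if p(x, y) == p(x) * p(y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals: sum the joint over the other variable.
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(
        joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(xs, ys)
    )


# Two separate tosses of a fair coin: all four pairs are equally likely.
two_tosses = {(a, b): Fraction(1, 4) for a, b in product("HT", repeat=2)}

# A second "toss" that simply copies the first one: fully dependent.
copied = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}

print(is_independent(two_tosses))  # True
print(is_independent(copied))      # False
```

In the second case each variable on its own still looks like a fair coin, but knowing one value determines the other, so the joint probability does not factorize.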
A common example of non-iid variables is a time series, where the temporal dependency in the random variable is a key feature that should be taken into account. Take, for example, the following data coming from http://cdiac.esd.ornl.gov. This data is a record of atmospheric CO2 measurements from 1959 to 1997. We are going to load the data (included with the accompanying code) and plot it.
Each point corresponds to the measured levels of atmospheric CO2 per month. It is easy to see in this plot the temporal dependency of data points. In fact, we have two trends here, a seasonal one (this is related to cycles of vegetation growth and decay) and a global one indicating an increasing concentration of atmospheric CO2.
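The contrast between iid values and a time series can also be seen numerically. The sketch below does not use the CO2 record: it builds one series of independent Gaussian draws and one made-up series with a linear trend plus a 12-step seasonal cycle, and compares their lag-1 autocorrelation (the correlation between each value and the next one). The function name `lag1_autocorrelation`, the seed, and every number used to build the synthetic series are illustrative assumptions.

```python
import math
import random


def lag1_autocorrelation(values):
    """Sample correlation between consecutive values of a sequence."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_steps = 480            # arbitrary length: 40 cycles of 12 steps

# iid case: every value is an independent draw from the same Gaussian.
iid_series = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]

# Non-iid case: made-up upward trend + seasonal cycle + small noise.
trend_series = [
    0.1 * t + 3.0 * math.sin(2 * math.pi * t / 12) + rng.gauss(0.0, 0.3)
    for t in range(n_steps)
]

r_iid = lag1_autocorrelation(iid_series)
r_trend = lag1_autocorrelation(trend_series)

print(f"lag-1 autocorrelation, iid draws:        {r_iid:+.3f}")
print(f"lag-1 autocorrelation, trend + seasonal: {r_trend:+.3f}")
```

For the iid draws the autocorrelation should be close to zero, whereas for the trending series it should be close to one: each value carries a lot of information about the next.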
Bayes' theorem and statistical inference
Now that we have learned some of the basic concepts and jargon from statistics, we can move to the moment everyone was waiting for. Without further ado let's contemplate, in all its majesty, Bayes' theorem:
$$p(H \mid D) = \frac{p(D \mid H)\, p(H)}{p(D)}$$
Well, it is not that impressive, is it? It looks like an elementary school formula and yet, paraphrasing Richard Feynman, this is all you need to know about Bayesian statistics.
Learning where Bayes' theorem comes from will help us to understand its meaning. In fact, we have already seen all the probability theory necessary to derive it:
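- According to the product rule, we have the following:
  $$p(H, D) = p(H \mid D)\, p(D)$$
- This can also be written as follows:
  $$p(H, D) = p(D \mid H)\, p(H)$$
- Given that the terms on the left are equal, we can write the following:
  $$p(D \mid H)\, p(H) = p(H \mid D)\, p(D)$$
- And if we reorder it, we get Bayes' theorem:
  $$p(H \mid D) = \frac{p(D \mid H)\, p(H)}{p(D)}$$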
Now, let's see what this formula implies and why it is important. First, it says that p(H|D) is not necessarily the same as p(D|H). This is a very important fact, one that's easy to miss in daily situations even for people trained in statistics and probability. Let's use a simple example to clarify why these quantities are not necessarily the same. The probability of having two legs given that someone is a human is not the same as the probability of being a human given that someone has two legs. Almost all humans have two legs, except for people who have suffered from accidents or birth problems, but a lot of non-human animals, such as birds, also have two legs.
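The two-legs example can be worked through with numbers. The sketch below uses entirely hypothetical counts for a made-up population of animals, with H standing for "is a human" and D for "has two legs"; the helper name `bayes` is also just an illustrative choice. It computes p(D|H) directly from the counts and p(H|D) through Bayes' theorem, then checks the latter against the counts.

```python
# Hypothetical counts for a made-up population (illustrative only).
counts = {
    ("human", "two legs"): 990,
    ("human", "other"): 10,
    ("non-human", "two legs"): 3000,
    ("non-human", "other"): 6000,
}
total = sum(counts.values())

n_human = counts[("human", "two legs")] + counts[("human", "other")]
n_two_legs = counts[("human", "two legs")] + counts[("non-human", "two legs")]

p_h = n_human / total                                  # p(H): prior
p_d = n_two_legs / total                               # p(D): evidence
p_d_given_h = counts[("human", "two legs")] / n_human  # p(D|H): likelihood


def bayes(likelihood, prior, evidence):
    """Bayes' theorem: p(H|D) = p(D|H) * p(H) / p(D)."""
    return likelihood * prior / evidence


p_h_given_d = bayes(p_d_given_h, p_h, p_d)

# Same quantity computed directly from the counts, as a sanity check.
p_h_given_d_direct = counts[("human", "two legs")] / n_two_legs

print(f"p(two legs | human) = {p_d_given_h:.3f}")
print(f"p(human | two legs) = {p_h_given_d:.3f}")
print(f"direct count check  = {p_h_given_d_direct:.3f}")
```

With these made-up counts, almost every human has two legs, yet only about a quarter of the two-legged animals are human, so the two conditional probabilities differ widely.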
If we replace H with hypothesis and D with data, Bayes' theorem tells us how to compute the probability of a hypothesis H given the data D, and that's the way you will find Bayes' theorem explained in a lot of places. But, how do we turn a hypothesis into something that we can put inside Bayes' theorem? Well, we do that using probability distributions so, in general, our H will be a hypothesis in a very narrow sense. What we will be really doing is trying to find parameters of our models, that is, parameters of probability distributions. So maybe, instead of hypothesis, it is better to talk about models and avoid confusion. And by the way, don't try to set H to statements such as "unicorns are real", unless you are willing to build a realistic probabilistic model of unicorn existence!
Since Bayes' theorem is central and we will use it over and over again, let's learn the names of its parts:
- p(H): Prior
- p(D|H): Likelihood
- p(H|D): Posterior
- p(D): Evidence
The prior distribution should reflect what we know about the value of some parameter before seeing the data D. If we know nothing, like Jon Snow, we will use flat priors that do not convey too much information. In general, we can do better, as we will learn through this book. The use of priors is why some people still think Bayesian statistics is subjective, even though priors are just another assumption that we make when modeling and hence are just as subjective (or objective) as any other assumption, such as likelihoods.
The likelihood is how we will introduce data in our analysis. It is an expression of the plausibility of the data given the parameters.
The posterior distribution is the result of the Bayesian analysis and reflects all that we know about a problem (given our data and model). The posterior is a probability distribution for the parameters in our model and not a single value. This distribution is a balance of the prior and the likelihood. There is a joke that says: "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule." One way to kill the mood after hearing this joke is to explain that if the likelihood and prior are both vague, you will get a posterior reflecting vague beliefs about seeing a mule rather than strong ones. Anyway, the joke captures the idea of a posterior being somehow a compromise between prior and likelihood. Conceptually, we can think of the posterior as the updated prior in the light of the data. In fact, the posterior of one analysis can be used as the prior of a new analysis after collecting new data. This makes Bayesian analysis particularly suitable for analyzing data that becomes available in sequential order. Some examples could be early warning systems for disasters that process online data coming from meteorological stations and satellites. For more details, read about online machine learning methods.
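The idea that one analysis's posterior can serve as the next analysis's prior can be sketched with the coin example. The code below is a minimal illustration, not a method prescribed by the text: it represents the unknown probability of heads (called `theta` here, an assumed name) on a grid of candidate values, starts from a flat prior, and multiplies by the likelihood of each toss. The toss sequences are made up.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def update(prior, grid, tosses):
    """Posterior over the grid: prior times the likelihood of each toss."""
    posterior = list(prior)
    for toss in tosses:
        posterior = [
            p * (theta if toss == "H" else 1.0 - theta)
            for p, theta in zip(posterior, grid)
        ]
    return normalize(posterior)


# Candidate values for the probability of heads: 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
flat_prior = normalize([1.0] * len(grid))

first_batch = "HHTH"     # made-up data
second_batch = "THHHTH"  # more made-up data, collected later

# Sequential analysis: the first posterior becomes the second prior.
posterior_1 = update(flat_prior, grid, first_batch)
posterior_2 = update(posterior_1, grid, second_batch)

# Single analysis of all the data at once.
posterior_all = update(flat_prior, grid, first_batch + second_batch)

max_diff = max(abs(a - b) for a, b in zip(posterior_2, posterior_all))
most_probable = grid[posterior_all.index(max(posterior_all))]

print(f"largest difference between the two posteriors: {max_diff:.1e}")
print(f"most probable grid value for theta: {most_probable}")
```

Up to floating-point rounding, updating in two steps and updating once with all the data give the same posterior, which is what makes sequential analysis coherent.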
The last term is the evidence, also known as the marginal likelihood. Formally, the evidence is the probability of observing the data averaged over all the possible values the parameters can take. Anyway, for most of the book, we will not care about the evidence, and we will think of it as a simple normalization factor. This will not be problematic since we will only care about the relative values of the parameters and not their absolute ones. If we ignore the evidence, we can write Bayes' theorem as a proportionality:
$$p(H \mid D) \propto p(D \mid H)\, p(H)$$
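The averaging in the definition of the evidence can be written out explicitly. As a sketch, assuming the parameters represented by H are continuous (for discrete parameters the integral becomes a sum), and using H' only as a dummy integration variable:

```latex
p(D) = \int p(D \mid H)\, p(H)\, dH
\qquad \Longrightarrow \qquad
p(H \mid D) = \frac{p(D \mid H)\, p(H)}{\int p(D \mid H')\, p(H')\, dH'}
```

Because the integral runs over every possible value of the parameters, the denominator is a single number that does not depend on H. Dividing by it rescales the product of likelihood and prior so that the posterior integrates to 1, without changing the relative values across parameters.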
Understanding the exact role of each term will take some time and will also require some examples, and that's what the rest of the book is for.