1.7 Single-parameter inference
Now that we know what Bayesian statistics is, let’s learn how to do Bayesian statistics with a simple example. We are going to begin inferring a single, unknown parameter.
1.7.1 The coin-flipping problem
The coin-flipping problem, or the BetaBinomial model if you want to sound fancy at parties, is a classical problem in statistics and goes like this: we toss a coin several times and record how many heads and tails we get. Based on this data, we try to answer questions such as, is the coin fair? Or, more generally, how biased is the coin? While this problem may sound dull, we should not underestimate it.
The coin-flipping problem is a great example to learn the basics of Bayesian statistics because it is a simple model that we can solve and compute with ease. Besides, many real problems consist of binary, mutually exclusive outcomes such as 0 or 1, positive or negative, odds or evens, spam or ham, hotdog or not a hotdog, cat or dog, safe or unsafe, and healthy or unhealthy. Thus, even when we are talking about coins, this model applies to any of those problems. To estimate the bias of a coin, and in general, to answer any questions in a Bayesian setting, we will need data and a probabilistic model. For this example, we will assume that we have already tossed a coin several times and we have a record of the number of observed heads, so the data-gathering part is already done. Getting the model will take a little bit more effort. Since this is our first model, we will explicitly write Bayes’ theorem and do all the necessary math (don’t be afraid, I promise it will be painless) and we will proceed very slowly. From 2 onward, we will use PyMC and our computer to do the math for us.
The first thing we will do is generalize the concept of bias. We will say that a coin with a bias of 1 will always land heads, one with a bias of 0 will always land tails, and one with a bias of 0.5 will land heads half of the time and tails half of the time. To represent the bias, we will use the parameter θ, and to represent the total number of heads for several tosses, we will use the variable Y . According to Bayes’ theorem, we have to specify the prior, p(θ), and likelihood, p(Y |θ), we will use. Let’s start with the likelihood.
1.7.2 Choosing the likelihood
Let’s assume that only two outcomes are possible—heads or tails—and let’s also assume that a coin toss does not affect other tosses, that is, we are assuming coin tosses are independent of each other. We will further assume all coin tosses come from the same distribution. Thus the random variable coin toss is an example of an independent and identically distributed (iid) variable. I hope you agree that these are very reasonable assumptions to make for our problem. Given these assumptions, a good candidate for the likelihood is the Binomial distribution:
This is a discrete distribution returning the probability of getting y heads (or, in general, successes) out of N coin tosses (or, in general, trials or experiments) given a fixed value of θ.
Figure 1.10 shows nine distributions from the Binomial family; each subplot has its legend indicating the values of the parameters. Notice that for this plot, I did not omit the values on the y-axis. I did this so you can check for yourself that if you sum the height of all bars, you will get 1, that is, for discrete distributions, the height of the bars represents actual probabilities.
Figure 1.10: Nine members of the Binomial family
The Binomial distribution is a reasonable choice for the likelihood. We can see that θ indicates how likely it is to obtain a head when tossing a coin. This is easier to see when N = 1 but is valid for any value of N, just compare the value of θ with the height of the bar for y = 1 (heads).
1.7.3 Choosing the prior
As a prior, we will use a Beta distribution, which is a very common distribution in Bayesian statistics and looks as follows:
If we look carefully, we will see that the Beta distribution looks similar to the Binomial except for the first term. Γ is the Greek uppercase gamma letter, which represents the gamma function, but that’s not really important. What is relevant for us is that the first term is a normalizing constant that ensures the distribution integrates to 1. We can see from the preceding formula that the Beta distribution has two parameters, α and β. Figure 1.11 shows nine members of the Beta family.
Figure 1.11: Nine members of the Beta family
I like the Beta distribution and all the shapes we can get from it, but why are we using it for our model? There are many reasons to use a Beta distribution for this and other problems. One of them is that the Beta distribution is restricted to be between 0 and 1, in the same way our θ parameter is. In general, we use the Beta distribution when we want to model the proportions of a Binomial variable. Another reason is its versatility. As we can see in Figure 1.11, the distribution adopts several shapes (all restricted to the [0,1] interval), including a Uniform distribution, Gaussian-like distributions, and U-like distributions.
As a third reason, the Beta distribution is the conjugate prior to the Binomial distribution (which we are using as the likelihood). A conjugate prior of a likelihood is a prior that, when used in combination with a given likelihood, returns a posterior with the same functional form as the prior. Untwisting the tongue, every time we use a Beta distribution as the prior and a Binomial distribution as the likelihood, we will get a Beta as the posterior distribution. There are other pairs of conjugate priors; for example, the Normal distribution is the conjugate prior to itself. For many years, Bayesian analysis was restricted to the use of conjugate priors. Conjugacy ensures mathematical tractability of the posterior, which is important given that a common problem in Bayesian statistics ends up with a posterior we cannot solve analytically. This was a deal breaker before the development of suitable computational methods to solve probabilistic methods. From Chapter 2 onwards, we will learn how to use modern computational methods to solve Bayesian problems, whether we choose conjugate priors or not.
1.7.4 Getting the posterior
Let’s remember that Bayes’ theorem says the posterior is proportional to the likelihood times the prior. So, for our problem, we have to multiply the Binomial and the Beta distributions:
We can simplify this expression by dropping all the terms that do not depend on θ and our results will still be valid. Accordingly, we can write:
Reordering it, and noticing this has the form of a Beta distribution, we get:
Based on this analytical expression, we can compute the posterior. Figure 1.12 shows the results for 3 priors and different numbers of trials. The following block of code shows the gist to generate Figure 1.12 (omitting the code necessary for plotting).
Code 1.6
n_trials = [0, 1, 2, 3, 4, 8, 16, 32, 50, 150]
n_heads = [0, 1, 1, 1, 1, 4, 6, 9, 13, 48]
beta_params = [(1, 1), (20, 20), (1, 4)]
x = np.linspace(0, 1, 2000)
for idx, N in enumerate(n_trials):
y = n_heads[idx]
for (α_prior, β_prior) in beta_params:
posterior = pz.Beta(α_prior + y, β_prior + N - y).pdf(x)
Figure 1.12: The first subplot shows 3 priors. The rest show successive updates as we get new data
On the first subplot of Figure 1.12, we have zero trials, thus the three curves represent our priors:
The Uniform prior (black): This represents all the possible values for the bias being equally probable a priori.
The Gaussian-like prior (dark gray): This is centered and concentrated around 0.5, so this prior is compatible with information indicating that the coin has more or less about the same chance of landing heads or tails. We could also say this prior is compatible with the knowledge that coins are fair.
The skewed prior (light gray): This puts most of the weight on a tail-biased outcome.
The rest of the subplots show posterior distributions for successive trials. The number of trials (or coin tosses) and the number of heads are indicated in each subplot’s legend. There is also a black dot at 0.35 representing the true value for θ. Of course, in real problems, we do not know this value, and it is here just for pedagogical reasons. Figure 1.12, can teach us a lot about Bayesian analysis, so grab your coffee, tea, or favorite drink, and let’s take a moment to understand it:
The result of a Bayesian analysis is a posterior distribution – not a single value but a distribution of plausible values given the data and our model.
The most probable value is given by the mode of the posterior (the peak of the distribution).
The spread of the posterior is proportional to the uncertainty about the value of a parameter; the more spread out the distribution, the less certain we are.
Intuitively, we are more confident in a result when we have observed more data supporting that result. Thus, even when numerically = = 0.5, seeing four heads out of eight trials gives us more confidence that the bias is 0.5 than observing one head out of two trials. This intuition is reflected in the posterior, as you can check for yourself if you pay attention to the (black) posterior in the third and sixth subplots; while the mode is the same, the spread (uncertainty) is larger in the third subplot than in the sixth subplot.
Given a sufficiently large amount of data, two or more Bayesian models with different priors will tend to converge to the same result. In the limit of infinite data, no matter which prior we use, all of them will provide the same posterior.
Remember that infinite is a limit and not a number, so from a practical point of view, we could get practically equivalent posteriors for a finite and relatively small number of data points.
How fast posteriors converge to the same distribution depends on the data and the model. We can see that the posteriors arising from the black prior (Uniform) and gray prior (biased towards tails) converge faster to almost the same distribution, while it takes longer for the dark gray posterior (the one arising from the concentrated prior). Even after 150 trials, it is somehow easy to recognize the dark gray posterior as a different distribution from the two others.
Something not obvious from the figure is that we will get the same result if we update the posterior sequentially as if we do it all at once. We can compute the posterior 150 times, each time adding one more observation and using the obtained posterior as the new prior, or we can just compute one posterior for the 150 tosses at once. The result will be exactly the same. This feature not only makes perfect sense, but it also leads to a natural way of updating our estimations when we get new data, a situation common in many data-analysis problems.
1.7.5 The influence of the prior
From the preceding example, it is clear that priors can influence inferences. That’s fine – priors are supposed to do that. Maybe it would be better to not have priors at all. That would make modeling easier, right? Well, not necessarily. If you are not setting the prior, someone else will be doing it for you. Sometimes this is fine – default priors can be useful and have their place – but sometimes it is better to have more control. Let me explain.
We can think that every (statistical) model, Bayesian or not, has some kind of prior, even if the prior is not set explicitly. For instance, many procedures typically used in frequentist statistics can be seen as special cases of a Bayesian model under certain conditions, such as flat priors. One common way to estimate parameters is known as maximum likelihood; this method avoids setting a prior and works just by finding the single value maximizing the likelihood. This value is usually notated by adding a little hat on top of the name of the parameter we are estimating, such as . Contrary to the posterior estimate, which is a distribution, is a point estimate, a number. For the coin-flipping problem, we can compute it analytically:
If you go back to Figure 1.12, you will be able to check for yourself that the mode of the black posterior (the one corresponding to the uniform/flat prior) agrees with the values of , computed for each subplot. This is not a coincidence; it is a consequence of the fact that setting a Uniform prior and then taking the mode of the posterior is equivalent to maximum likelihood.
We cannot avoid priors, but if we include them in our analysis, we can get some potential benefits. The most direct benefit is that we get a posterior distribution, which is a distribution of plausible values and not only the most probable ones. Having a distribution can be more informative than a single-point estimate, as we saw the width of the distribution is related to the uncertainty we have for the estimate. Another benefit is that computing the posteriors means to average over the prior. This can lead to models that are more difficult to overfit and more robust predictions [Wilson and Izmailov, 2022].
Priors can bring us other benefits. Starting in the next chapter, we are going to use numerical methods to get posteriors. These methods feel like magic, until they don’t. The folk theorem of statistical computing states, ”When you have computational problems, often there’s a problem with your model” [Gelman, 2008]. Sometimes a wise choice of prior can make inference easier or faster. It is important to remark that we are not advocating for setting priors specifically to make inference faster, but it is often the case that by thinking about priors, we can get faster models.
One advantage of priors, one that is sometimes overlooked, is that having to think about priors can force us to think a little bit deeper about the problem we are trying to solve and the data we have. Sometimes the modeling process leads to a better understanding by itself irrespective of how well we end and fit the data or make predictions. By being explicit about priors, we get more transparent models, meaning they’re easier to criticize, debug (in a broad sense of the word), explain to others, and hopefully improve.