You're reading from Training Systems Using Python Statistical Modeling: Explore popular techniques for modeling your data in Python

Product type: Paperback
Published in: May 2019
Publisher: Packt
ISBN-13: 9781838823733
Length: 290 pages
Edition: 1st Edition
Languages: Python
Tools: Pandas
Concepts: Machine Learning
Author (1): Curtis Miller

Bayesian analysis for proportions

In this section, we'll revisit inference for proportions, but from a Bayesian perspective. We will look at Bayesian methods for analyzing proportions of success in a group. This includes talking about computing credible intervals, and the Bayesian version of hypothesis testing for both one and two samples.

Conjugate priors are a class of prior probability distributions common in Bayesian statistics. A conjugate prior is a prior distribution such that the posterior distribution belongs to the same family of probability distributions as the prior. For binary data, the beta distribution is a conjugate prior. It is defined only on the (0, 1) interval and is specified by two parameters. If we observe M successes in N trials, then the posterior distribution is the prior distribution with M added to its first parameter and N - M added to its second parameter. As more data arrives, this concentrates the distribution around the observed sample proportion.

Conjugate priors for proportions

So, let's see this in action. For data that takes values of either 0 or 1, we're going to use the beta distribution as our conjugate prior. The notation that is used to refer to the beta distribution is B(α, β).

α - 1 can be interpreted as the number of imaginary prior successes, and β - 1 as the number of imaginary prior failures. In other words, the prior behaves as if you had added that many imaginary successes and failures to your dataset.

If α = β = 1, then we interpret this as being no prior successes or failures; therefore, every probability of success, θ, is equally likely in some sense. This is referred to as an uninformative prior. Let's now implement this using the following steps:

First, we're going to import the beta function from scipy.stats; this is the beta distribution. In addition to this, we will import the numpy library and the matplotlib library, as follows:
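A minimal sketch of these imports, assuming SciPy, NumPy, and Matplotlib are installed; the aliases `np` and `plt` are the usual conventions rather than anything the text specifies:

```python
# scipy.stats.beta is the beta *distribution* object
# (not the beta function, which lives in scipy.special)
from scipy.stats import beta
import numpy as np
import matplotlib.pyplot as plt
```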

We're then going to plot the function and see how it looks, using the following code:
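One way the plotting step might look for the B(1, 1) prior is sketched below. The grid size, axis labels, and variable names are illustrative assumptions, not the author's listing:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Grid of candidate values for theta on [0, 1]
x = np.linspace(0, 1, 201)

# Density of the B(1, 1) prior at each grid point
density = beta.pdf(x, 1, 1)

fig, ax = plt.subplots()
ax.plot(x, density)
ax.set_ylim(0, 2)
ax.set_xlabel("theta")
ax.set_ylabel("prior density")
ax.set_title("B(1, 1) prior")
```

Calling `plt.show()` afterwards displays the figure in an interactive session.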

This results in the following output:

So, if we plot the beta distribution when α = 1 and β = 1, we end up with a uniform distribution. In some sense, each value of θ is equally likely.

Now, we will use a=3 and b=3, to indicate two imaginary successes and two imaginary failures, which gives us the following output:
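Rather than redrawing the figure, the following sketch checks numerically where the B(3, 3) density peaks. The grid and variable names are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0, 1, 201)

# B(3, 3): two imaginary successes and two imaginary failures
density = beta.pdf(x, 3, 3)

# Grid point at which the prior density is largest
mode = x[np.argmax(density)]
print(f"prior mode: {mode:.2f}, prior mean: {beta.mean(3, 3):.2f}")
```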

Now, our prior distribution biases our estimate toward 0.5; in other words, the prior belief is that success is as likely as failure.

Given a sample of size N with M successes, if the prior is a beta distribution with the parameters (α, β), then the posterior distribution will be B(α + M, β + N - M). So, let's reconsider an earlier example; we have a website with 1,126 visitors. 310 clicked on an ad purchased by a sponsor, and we want to know what proportion of individuals will click on the ad in general.

So, we're going to use the prior distribution B(3, 3). This means that the posterior distribution will be a beta distribution with the first parameter 3 + 310 = 313 and the second parameter 3 + (1,126 - 310) = 819. This is what the prior distribution and posterior distribution look like when plotted against each other:

The blue represents the prior distribution, and red represents the posterior distribution.
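The conjugate update and the comparison plot could be sketched as follows. The helper `beta_posterior` is a hypothetical name introduced here for illustration; only the colors and the numbers come from the text:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta


def beta_posterior(a, b, successes, n):
    """Posterior beta parameters after `successes` out of `n` trials (hypothetical helper)."""
    return a + successes, b + n - successes


prior_a, prior_b = 3, 3
post_a, post_b = beta_posterior(prior_a, prior_b, successes=310, n=1126)

x = np.linspace(0, 1, 1001)
fig, ax = plt.subplots()
# Prior in blue, posterior in red, matching the description in the text
ax.plot(x, beta.pdf(x, prior_a, prior_b), color="blue", label="prior B(3, 3)")
ax.plot(x, beta.pdf(x, post_a, post_b), color="red",
        label=f"posterior B({post_a}, {post_b})")
ax.set_xlabel("theta")
ax.set_ylabel("density")
ax.legend()
```

With more than a thousand observations, the posterior is far narrower than the prior, so the data dominates the two imaginary successes and failures.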

Credible intervals for proportions

Bayesian statistics doesn't use confidence intervals but credible intervals instead. We specify a probability, and that will be the probability that the parameter of interest lies in the credible interval. For example, there's a 95% chance that θ lies in its 95% credible interval. We compute credible intervals by computing the quantiles from the posterior distribution of the parameter, so that the chosen proportion of the posterior distribution lies between these two quantiles.

So, I've already gone ahead and written a function that will compute credible intervals for you. You give this function the number of successes, the total sample size, the first and second parameters of the prior, and the credibility (or chance of containing θ) of the interval. You can see the entire function as follows:
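A minimal sketch of what such a function might look like is given below. The name `beta_credible_interval`, its parameter names and defaults, and the choice of an equal-tailed interval are assumptions, not the author's code:

```python
from scipy.stats import beta


def beta_credible_interval(successes, n, a=1.0, b=1.0, credibility=0.95):
    """Equal-tailed credible interval for a proportion under a B(a, b) prior."""
    if not 0 < credibility < 1:
        raise ValueError("credibility must be strictly between 0 and 1")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    # Conjugate update: add successes and failures to the prior parameters
    post_a = a + successes
    post_b = b + n - successes

    # Leave the same posterior probability in each tail
    tail = (1 - credibility) / 2
    lower = beta.ppf(tail, post_a, post_b)
    upper = beta.ppf(1 - tail, post_a, post_b)
    return lower, upper
```

An equal-tailed interval is only one convention; a highest-density interval would be slightly different for a skewed posterior.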

We can use this function to compute credible intervals for our data.

So, we have a 95% credible interval based on the uninformative prior, as follows:
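As a sketch, the same equal-tailed interval can be obtained directly from SciPy's `beta.interval`, assuming the B(1, 1) prior and the 310-out-of-1,126 data from the text:

```python
from scipy.stats import beta

successes, n = 310, 1126

# The B(1, 1) prior gives the posterior B(1 + 310, 1 + 816)
lower, upper = beta.interval(0.95, 1 + successes, 1 + n - successes)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```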

Therefore, we believe that θ will be between 25% and 30%, with a 95% probability.

The next one is the same interval when we have a different prior—that is, the one that we actually used before and is the one that we plotted:
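To see how little the prior matters with this much data, the following sketch computes the 95% interval under both priors. The dictionary `intervals` is an illustrative name:

```python
from scipy.stats import beta

successes, n = 310, 1126

intervals = {}
for a, b in [(1, 1), (3, 3)]:
    # Equal-tailed 95% interval from the posterior B(a + M, b + N - M)
    intervals[(a, b)] = beta.interval(0.95, a + successes, b + n - successes)
    lower, upper = intervals[(a, b)]
    print(f"prior B({a}, {b}): ({lower:.4f}, {upper:.4f})")
```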

The interval hasn't changed very much, but still, this is going to be our credible interval.

The last one is the credible interval when we increase the level of credibility (the probability of containing the true parameter) to 0.99:
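A short sketch comparing the 95% and 99% intervals, assuming the B(3, 3) prior used for the plotted posterior:

```python
from scipy.stats import beta

post_a, post_b = 3 + 310, 3 + 1126 - 310  # posterior B(313, 819)

ci_95 = beta.interval(0.95, post_a, post_b)
ci_99 = beta.interval(0.99, post_a, post_b)

width_95 = ci_95[1] - ci_95[0]
width_99 = ci_99[1] - ci_99[0]
print(f"95% width: {width_95:.4f}, 99% width: {width_99:.4f}")
```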

Since this probability is higher, the interval must be wider, which is exactly what we see, although it's not that much wider.

Bayesian hypothesis testing for proportions

Unlike classical statistics, where we say a hypothesis is either right or wrong, Bayesian statistics holds that every hypothesis is true, with some probability. We don't reject hypotheses, but simply ignore them if they are unlikely to be true. For one sample, computing the probability of a hypothesis can be done by considering what region of possible values of θ correspond to the hypothesis being true, and using the posterior distribution of θ to compute the probability that θ is in that region.

In this case, we need to use what's known as the cumulative distribution function (CDF) of the posterior distribution. This is the probability that a random variable is less than or equal to a quantity, x. So, what we want is P(θ > 0.3 | D), the probability that θ is greater than 0.3 given the data D, because we are testing the website administrator's claim that at least 30% of visitors to the site click on the ad.

So, we will evaluate the CDF at 0.3, the value that corresponds to the administrator's claim. The CDF itself gives the probability that θ is at most 0.3, so we subtract it from 1 to get the probability that more than 30% of visitors click on the ad. The following shows how we use the CDF function:
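A minimal sketch of this calculation, assuming the B(3, 3) prior from earlier (the text does not say which prior was used at this step):

```python
from scipy.stats import beta

# Posterior under the assumed B(3, 3) prior: B(313, 819)
post_a, post_b = 3 + 310, 3 + 1126 - 310

# P(theta > 0.3 | D) = 1 - P(theta <= 0.3 | D)
prob_claim = 1 - beta.cdf(0.3, post_a, post_b)
print(f"P(theta > 0.3 | D) = {prob_claim:.4f}")
```

SciPy's `beta.sf(0.3, post_a, post_b)` computes the same upper-tail probability directly.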

What we end up with is a very small probability, so it's likely that the administrator is incorrect.

Now, while there's a small probability, I would like to point out that this is not the same thing as a p value. A p value says something completely different; a p value should not be interpreted as the probability that the null hypothesis is true, whereas, in this case, this can be interpreted as the probability that the hypothesis we asked about is true. This is the probability that θ is greater than 0.3, given the data that we saw.

Comparing two proportions

Sometimes, we may want to compare two proportions from two populations. Crucially, we will assume that they are independent of each other. It's difficult to analytically compute the probability that one proportion is less than another, so we often rely on Monte Carlo methods, otherwise known as simulation or random sampling.

We randomly generate the two proportions from their respective posterior distributions, and then track how often one is less than the other. We use the frequency we observed in our simulation to estimate the desired probability.

So, let's see this in action; we have two parameters: θ_A and θ_B. These correspond to the proportion of individuals who click on an ad from format A or format B. Users are randomly assigned to one format or the other, and the website tracks how many viewers click on the ad in the different formats.

516 visitors saw format A and 108 of them clicked on the ad. 510 visitors saw format B and 144 of them clicked on the ad. We use the same prior for both θ_A and θ_B, which is B(3, 3). Consequently, the posterior distribution for θ_A will be B(111, 411) and for θ_B, it will be B(147, 369). This results in the following output:
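The two posterior parameter pairs can be checked with a short sketch. The dictionary layout and variable names are illustrative assumptions:

```python
from scipy.stats import beta

prior_a, prior_b = 3, 3
data = {"A": (108, 516), "B": (144, 510)}  # format: (clicks, visitors)

posteriors = {}
for name, (clicks, visitors) in data.items():
    # Conjugate update: clicks are successes, non-clicks are failures
    a, b = prior_a + clicks, prior_b + visitors - clicks
    posteriors[name] = (a, b)
    print(f"theta_{name}: B({a}, {b}), posterior mean {beta.mean(a, b):.3f}")
```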

We now want to know the probability of θ_A being less than θ_B, which is difficult to compute analytically. We can randomly simulate θ_A and θ_B, and then use that to estimate this probability. So, let's randomly simulate one θ_A, as follows:

Then, randomly simulate one θ_B, as follows:
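The two single draws might be sketched as follows. This uses NumPy's random generator rather than whatever call the author used, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

theta_a = rng.beta(111, 411)  # one draw from the posterior of theta_A
theta_b = rng.beta(147, 369)  # one draw from the posterior of theta_B
print(theta_a, theta_b, theta_a < theta_b)
```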

Finally, we're going to do 1,000 simulations by computing 1,000 θ_A values and 1,000 θ_B values, as follows:
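A sketch of the full Monte Carlo estimate is shown below, again with NumPy's generator and an arbitrary seed, so the exact count will differ from run to run and from the figure quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
n_sims = 1000

# Independent draws from each posterior
theta_a = rng.beta(111, 411, size=n_sims)
theta_b = rng.beta(147, 369, size=n_sims)

# Fraction of simulations in which theta_A < theta_B
a_less_than_b = theta_a < theta_b
estimate = a_less_than_b.mean()
print(f"{a_less_than_b.sum()} of {n_sims} draws; estimate {estimate:.3f}")
```

Increasing `n_sims` reduces the Monte Carlo error of the estimate at the cost of more computation.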

This is what we end up with; here, we can see how often θ_A is less than θ_B: θ_A was less than θ_B in 996 of the 1,000 simulations. So, what proportion is this? Well, it is 0.996; this is an estimate of the probability that θ_A is less than θ_B. Given this, it seems highly likely that visitors are more likely to click on the ad in format B than in format A.