Basic probability for machine learning
Probability provides information about the likelihood of an event occurring. In this field, there are several key terms that are important to understand:
- Trial or experiment: An action that results in a certain outcome with a certain likelihood
- Sample space: This encompasses all potential outcomes of a given experiment
- Event: This denotes a subset of the sample space
Therefore, in technical terms, probability is a measure of the likelihood of an event occurring when an experiment is conducted.
In this very simple case, where all outcomes are equally likely, the probability of event A is the number of outcomes in A divided by the total number of possible outcomes. For example, in flipping a fair coin, there are two outcomes with the same chance: heads and tails. The probability of getting heads is therefore 1/(1+1) = ½.
In order to calculate the probability, given an event, A, with n outcomes and a sample space, S, the probability of event A is calculated as

$$P(A) = \sum_{x \in A} P(x)$$

where $x$ represents the outcomes in A. Assuming all results of the experiment have equal probability, and the selection of one does not influence the selection of others in subsequent rounds (meaning they are statistically independent), then

$$P(A) = \frac{n}{|S|}$$

where $|S|$ is the total number of outcomes in the sample space.
Hence, the value of a probability ranges from 0 to 1, and since the sample space embodies the complete set of potential outcomes, P(S) = 1.
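As a minimal sketch of this counting rule, the following Python snippet computes the probability of heads for the fair-coin example above; the set names are illustrative:

```python
from fractions import Fraction

# Sample space for a single flip of a fair coin.
sample_space = {"heads", "tails"}

# Event A: the coin shows heads.
event_a = {"heads"}

# With equally likely outcomes, P(A) = |A| / |S|.
p_a = Fraction(len(event_a), len(sample_space))
print(p_a)  # 1/2
```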
Statistical independence
In the realm of statistics, two events are defined as independent if the occurrence of one event doesn’t influence the likelihood of the other event’s occurrence. To put it formally, events A and B are independent precisely when P(A and B) = P(A)P(B), where P(A) and P(B) are the respective probabilities of events A and B happening.
Consider this example to clarify the concept of statistical independence: imagine we possess two coins, one fair (an equal chance of turning up heads or tails) and the other biased, showing heads with probability 3/4. If we flip the fair coin and the biased coin, these two events are statistically independent because the outcome of one coin flip doesn’t alter the probability of the other coin turning up heads or tails. Specifically, the likelihood of both coins showing heads is the product of the individual probabilities: (1/2) * (3/4) = 3/8.
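The product rule can also be checked empirically. Here is a brief simulation sketch of the two coins above, assuming, as in the example, that the biased coin shows heads with probability 3/4; the estimated frequency of two heads should approach 3/8:

```python
import random

TRIALS = 1_000_000
both_heads = 0

for _ in range(TRIALS):
    fair = random.random() < 0.5     # fair coin: P(heads) = 1/2
    biased = random.random() < 0.75  # biased coin: P(heads) = 3/4 (assumed)
    if fair and biased:
        both_heads += 1

# Independence implies the estimate is close to (1/2) * (3/4) = 0.375.
print(both_heads / TRIALS)
```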
Statistical independence is a pivotal concept in statistics and probability theory, frequently leveraged in machine learning to outline the connections between variables within a dataset. By comprehending these relationships, machine learning algorithms can better spot patterns and deliver more precise predictions. We will describe the relationship between different types of events in the following:
- Complementary event: The complementary event to A, signified as A’, encompasses all potential outcomes in the sample space not included in A. A and A’ are mutually exclusive and together cover the entire sample space, so P(A’) = 1 - P(A).
- Union and intersection: The union of A and B, signified as A ∪ B, encompasses all outcomes that belong to A, to B, or to both, while the intersection, A ∩ B, encompasses the outcomes that belong to both A and B. The addition rule relates them: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
- Mutually exclusive: When two events have no shared outcomes, they are viewed as mutually exclusive. In other words, if A and B are mutually exclusive events, then P(A ∩ B) = 0. It follows from the addition rule that, since A and B are disjoint events, P(A ∪ B) = P(A) + P(B).
- Independent: Two events are deemed independent when the occurrence of one doesn’t impact the occurrence of the other. If A and B are two independent events, then P(A ∩ B) = P(A)P(B).
Next, we are going to describe the discrete random variable, its distribution, and how to use it to calculate the probabilities.
Discrete random variables and their distribution
A discrete random variable refers to a variable that can assume a finite or countably infinite number of potential outcomes. Examples of such variables might be the count of heads resulting from a coin toss, the tally of cars crossing a toll booth within a specific time span, or the number of blonde-haired students in a classroom.
The probability distribution of a discrete random variable assigns a certain likelihood to each potential outcome the variable could adopt. For instance, in the case of a coin toss, the probability distribution assigns a 0.5 probability to both 0 and 1, representing tails and heads, respectively. For the car toll booth scenario, the distribution could be assigning a probability of 0.1 to no cars passing, 0.3 to one car, 0.4 to two cars, 0.15 to three cars, and 0.05 to four or more cars.
A graphical representation of the probability distribution of a discrete random variable can be achieved through a probability mass function (PMF), which correlates each possible outcome of the variable to its likelihood of occurrence. This function is usually represented as a bar chart or histogram, with each bar signifying the probability of a specific value.
The PMF is bound by two key principles:
- It must be non-negative across all potential values of the random variable
- The total sum of probabilities for all possible outcomes should equate to 1
The expected value of a discrete random variable offers an insight into its central tendency, computed as the probability-weighted average of its possible outcomes. It is signified as E[X], with X representing the random variable:

$$E[X] = \sum_{x} x\, P(X = x)$$
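As a short illustration, the following sketch encodes the toll-booth PMF described earlier, checks the two PMF principles, and computes the expected value; collapsing “four or more cars” to exactly 4 is a simplifying assumption made only for this example:

```python
# PMF for the toll-booth example; keys are car counts, values are probabilities.
# "Four or more cars" is treated as exactly 4 here (a simplifying assumption).
pmf = {0: 0.10, 1: 0.30, 2: 0.40, 3: 0.15, 4: 0.05}

# Principle 1: the PMF must be non-negative everywhere.
assert all(p >= 0 for p in pmf.values())

# Principle 2: the probabilities must sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-9

# Expected value: the probability-weighted average of the outcomes.
expected = sum(x * p for x, p in pmf.items())
print(expected)  # 1.75
```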
Probability density function
The probability density function (PDF) is a tool used to describe the distribution of a continuous random variable. It can be used to calculate the probability of a value falling within a specific range. In simpler terms, it helps determine the chances of a continuous variable, X, having a value within the interval [a, b], or in statistical terms,

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
For continuous variables, the probability of any single exact value occurring is always 0, in contrast to discrete variables, which can assign nonzero probabilities to distinct values. PDFs provide a way to estimate the likelihood of a value falling within a given range instead of at a single value.
For example, you can use a PDF to find the chances of the next IQ score measured falling between 100 and 120.
Figure 2.2 – Probability density function for IQ from 100–120
To ascertain the distribution of a discrete random variable, one can provide either its PMF or its cumulative distribution function (CDF). For continuous random variables, we primarily utilize the CDF, as it is well defined for them. The PMF, however, is not suitable for these variables because P(X = x) equals 0 for all x in the set of real numbers, given that X can assume any real value between a and b. Therefore, we typically define the PDF instead. The PDF resembles the concept of mass density in physics, signifying the concentration of probability; its unit is probability per unit length. To get a grasp of the PDF, let’s analyze a continuous random variable, X, and define the function $f_X(x)$ as follows:

$$f_X(x) = \lim_{\Delta \to 0^+} \frac{P(x < X \le x + \Delta)}{\Delta}$$

if the limit exists.
The function provides the probability density at a given point, x. This is equivalent to the limit of the ratio of the probability of the interval (x, x + Δ] to the length of the interval as that length approaches 0.
Let’s contemplate a continuous random variable, X, possessing an absolutely continuous CDF, denoted as $F_X(x)$. If $F_X$ is differentiable at x, the function $f_X(x)$ is referred to as the PDF of X:

$$f_X(x) = \frac{dF_X(x)}{dx}$$

assuming $F_X$ is differentiable at $x$.
For example, let’s consider a continuous random variable, X, uniformly distributed on the interval [a, b]. Its PDF is given by:

$$f_X(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

which is 0 for any x outside the bounds.
By using integration, the CDF can be obtained from the PDF:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$
Additionally, we have

$$P(a < X \le b) = F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt$$
So, if we integrate over the entire real line, we will get 1:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
Explicitly, when integrating the PDF across the entire real number line, the result should equal 1. This signifies that the area beneath the PDF curve must equate to 1, or P(S) = 1, which holds true for the uniform distribution. The PDF signifies the density of probability; thus, it must be non-negative, but it can exceed 1 (for example, the uniform density on [0, 0.5] equals 2 on that interval).
Consider a continuous random variable, X, with PDF represented as $f_X(x)$. The following properties apply:

- $f_X(x) \ge 0$, for all real x
- $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
- $P(a \le X \le b) = \int_a^b f_X(x)\,dx$
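To make the IQ example concrete, here is a sketch that computes P(100 ≤ X ≤ 120) with SciPy. It assumes IQ scores follow a normal distribution with mean 100 and standard deviation 15, a common convention that is not stated in the text:

```python
from scipy.stats import norm

# Assumed model: IQ ~ Normal(mean=100, standard deviation=15).
iq = norm(loc=100, scale=15)

# P(100 <= X <= 120) is the integral of the PDF over [100, 120],
# which equals F(120) - F(100) in terms of the CDF.
p = iq.cdf(120) - iq.cdf(100)
print(p)  # roughly 0.41
```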
Next, we’ll move on to cover maximum likelihood.
Maximum likelihood estimation
Maximum likelihood is a statistical approach used to estimate the parameters of a probability distribution. The objective is to identify the parameter values that maximize the likelihood of observing the data, essentially determining the parameters most likely to have generated the data.
Suppose we have a random sample, $X = (x_1, x_2, \ldots, x_n)$, from a population with a probability distribution, $f(x; \theta)$, where θ is a vector of parameters. The likelihood of observing the sample, X, given the parameters, θ, is defined as the joint probability of observing all of the data points:

$$L(\theta; X) = f(x_1, x_2, \ldots, x_n; \theta)$$

In the case of independent and identically distributed observations, the likelihood function can be expressed as the product of the univariate density functions, each evaluated at the corresponding observation:

$$L(\theta; X) = \prod_{i=1}^{n} f(x_i; \theta)$$
The maximum likelihood estimate (MLE) is the parameter vector value that offers the maximum value for the likelihood function across the parameter space.
In many cases, it’s more convenient to work with the natural logarithm of the likelihood function, referred to as the log-likelihood. The log-likelihood reaches its peak at the same parameter vector value as the likelihood function, and the conditions required for a maximum (or minimum) are obtained by setting the derivatives of the log-likelihood with respect to each parameter to 0. If the log-likelihood is differentiable with respect to the parameters, these conditions yield a set of equations that can be solved numerically to derive the MLE.

One common scenario where MLE significantly impacts ML model performance is linear regression. When building a linear regression model, MLE is often used to estimate the coefficients that define the relationship between input features and the target variable. MLE finds the coefficient values that maximize the likelihood of observing the given data under the assumed linear regression model, improving the accuracy of the predictions.
The MLEs of the parameters, θ, are the values that maximize the likelihood function. In other words, the MLEs are the values of θ that make the observed data, X, most probable.
To find the MLEs, we typically take the natural logarithm of the likelihood function, as it is often easier to work with the logarithm of a product than with the product itself:

$$\ell(\theta) = \ln L(\theta; X) = \sum_{i=1}^{n} \ln f(x_i; \theta)$$
The MLEs are determined by equating the partial derivatives of the log-likelihood function with respect to each parameter to 0 and then solving these equations for the parameters:

$$\frac{\partial \ell(\theta)}{\partial \theta_1} = 0, \quad \frac{\partial \ell(\theta)}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial \ell(\theta)}{\partial \theta_k} = 0$$
where k is the number of parameters in θ. The goal of a maximum likelihood estimator is to find $\hat{\theta}$ such that

$$\hat{\theta} = \arg\max_{\theta} L(\theta; X)$$
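As a numerical illustration of these equations, the following sketch estimates the mean and standard deviation of normally distributed data by minimizing the negative log-likelihood; the synthetic data, seed, and starting values are arbitrary choices for this example:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # synthetic sample

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # keep the scale parameter in its valid region
    # The log-likelihood is the sum of log densities over the observations.
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Maximizing the likelihood is the same as minimizing its negative logarithm.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)  # close to the true values 5.0 and 2.0
```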
Once the MLEs have been found, they can be used to make predictions about the population based on the sample data. Maximum likelihood is widely used in many fields, including psychology, economics, engineering, and biology. It serves as a potent tool for understanding the connections among variables and for predicting outcomes based on observed data. As an example, we will now build a word predictor using maximum likelihood estimation.
Next, we introduce the problem of word autocompletion, also known as word prediction, a feature in which an application predicts the next word a user is typing. The aim is to save time and make typing easier by predicting what the user is likely to type next based on their previous inputs and other contextual factors. Word prediction can be found in various forms in many applications, including search engines, text editors, and mobile device keyboards.
Given a group of words that the user typed, how would we suggest the next word?
If the words were The United States of, then it would be trivial to assume that the next word would be America. However, what about finding the next word for How are? One could suggest several next words.
There usually isn’t just one clear next word. Thus, we’d want to suggest the most likely word or perhaps even the most likely words. In that case, we would be interested in suggesting a probabilistic representation of the possible next words and picking the next word as the one that is most probable.
The maximum likelihood estimator provides us with that precise capability. It can tell us which word is most probable given the previous words that the user typed.
In order to calculate the MLE, we need to calculate the probability function of all word combinations. We can do that by processing large texts and counting how many times each combination of words exists.
Consider reviewing a large corpus of text that has the following occurrences:
|  | “you” | “they” | “those” | “the” | Any other word |
| --- | --- | --- | --- | --- | --- |
| “how are …” | 16 | 14 | 0 | 100 | 10 |
| not “how are …” | 200 | 100 | 300 | 1,000 | 30,000 |

Table 2.1 – Sample of n-gram occurrences in a document
For instance, there are 16 occurrences in the text where the sequence “how are you” appears. There are 140 sequences of length three that start with the words “how are.” That is calculated as:

$$16 + 14 + 0 + 100 + 10 = 140$$
There are 216 sequences of length three that end with the word “you”. That is calculated as:

$$16 + 200 = 216$$
Now, let’s suggest a formula for the most likely next word.
Based on the common maximum likelihood estimation for the probabilistic variable $w_3$ (the next word), the formula would be to find a value of $w_3$ that maximizes

$$P(w_3 \mid w_1, w_2)$$

where $w_1$ and $w_2$ are the words the user has already typed.
However, this common formula has a few characteristics that wouldn’t be advantageous to our application.
Consider the next formula, which has specific advantages that are necessary for our use case. It is the maximum likelihood formula for parametric estimation, meaning estimating deterministic parameters. It suggests finding a value of $w_3$ that maximizes

$$P(w_1, w_2 \mid w_3)$$
$w_3$ is by no means a deterministic parameter; however, this formula suits our use case, as it reduces common-word bias by emphasizing contextual fit and adjusts for word specificity, thus enhancing the relevance of our predictions. We will elaborate more on these traits in the conclusion of this exercise.
Let’s rewrite this formula in a form that is easy to calculate from the table counts:

$$P(w_1, w_2 \mid w_3) = \frac{\#(w_1\,w_2\,w_3)}{\#(\ast\,\ast\,w_3)}$$

where $\#(w_1\,w_2\,w_3)$ counts the length-three sequences consisting of $w_1$, $w_2$, and $w_3$, and $\#(\ast\,\ast\,w_3)$ counts all length-three sequences that end with $w_3$.
In our case, $w_1$ is “how” and $w_2$ is “are.”
There are five candidates for the next word; let’s calculate the probability for each of them:
- P(“how”, “are” | “you”) = 16 / (200 + 16) = 16/216 = 2/27
- P(“how”, “are” | “they”) = 14 / (100 + 14) = 14/114 = 7/57
- P(“how”, “are” | “those”) = 0 / 300 = 0
- P(“how”, “are” | “the”) = 100 / (1,000 + 100) = 100/1100 = 1/11
- P(“how”, “are” | any other word) = 10 / (30,000 + 10) = 10/30010 = 1/3001
Out of all the options, the highest value of probability is 7/57 and it is achieved when “they” is the next word.
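The calculation above is easy to reproduce in code. The following sketch encodes the counts from Table 2.1 and picks the candidate word with the highest likelihood; the dictionary keys are simply labels for the table’s columns:

```python
# Counts of length-three sequences from Table 2.1.
how_are = {"you": 16, "they": 14, "those": 0, "the": 100, "other": 10}
not_how_are = {"you": 200, "they": 100, "those": 300, "the": 1000, "other": 30000}

# P("how", "are" | w3) = #("how are w3") / #(sequences ending in w3)
scores = {
    w3: how_are[w3] / (how_are[w3] + not_how_are[w3])
    for w3 in how_are
}

best = max(scores, key=scores.get)
print(best, scores[best])  # they 0.1228... (= 7/57)
```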
Note that the intuition behind this maximum likelihood estimator is having the suggested next word make the words that the user typed most likely. One could wonder: why not take the word that is most probable given the first two words, that is, the original maximum likelihood formula for probabilistic variables? From the table, we see that given the words “how are,” the most frequent third word is “the,” with a probability of 100/140. However, this approach wouldn’t take into account the fact that the word “the” is extremely prevalent altogether, as it is the most frequently used word in the text in general. Thus, its high frequency isn’t due to its relationship to the first two words; it is simply a very common word overall. The maximum likelihood formula we chose takes that into account.
Bayesian estimation
Bayesian estimation is a statistical approach that involves updating our beliefs or probabilities about a quantity of interest based on new data. The term “Bayesian” refers to Thomas Bayes, an 18th-century statistician who first developed the concept of Bayesian probability.
In Bayesian estimation, we start with prior beliefs about the quantity of interest, which are expressed as a probability distribution. These prior beliefs are updated as we collect new data. The updated beliefs are represented as a posterior distribution. The Bayesian framework provides a systematic way of updating prior beliefs with new data, taking into account the degree of uncertainty in both the prior beliefs and the new data.
The posterior distribution is calculated using Bayes’ theorem, which is the fundamental equation of Bayesian estimation. Bayes’ theorem states that

$$P(\Theta \mid X) = \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)}$$
where Θ is the quantity of interest, X is the new data, P(Θ|X) is the posterior distribution, P(X|Θ) is the likelihood of the data given the parameter value, P(Θ) is the prior distribution, and P(X) is the marginal likelihood or evidence.
The marginal likelihood is calculated as follows:

$$P(X) = \int P(X \mid \Theta)\, P(\Theta)\, d\Theta$$
where the integral is taken over the entire space of Θ. The marginal likelihood is often used as a normalizing constant, ensuring that the posterior distribution integrates to 1.
In Bayesian estimation, the choice of prior distribution is important, as it reflects our beliefs about the quantity of interest before collecting any data. The prior distribution can be chosen based on prior knowledge or previous studies. If no prior knowledge is available, a non-informative prior can be used, such as a uniform distribution.
Once the posterior distribution is calculated, it can be used to make predictions about the quantity of interest. As an example, the posterior distribution’s mean can serve as a point estimate, whereas the posterior distribution itself can be employed to establish credible intervals. These intervals represent the probable range within which the true value of the target quantity resides.
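As a closing sketch, here is a Bayesian update computed in closed form for a coin’s heads probability, using a conjugate Beta prior with a binomial likelihood; the prior parameters and observed counts below are illustrative assumptions:

```python
from scipy.stats import beta

# Prior belief about the heads probability: Beta(1, 1), i.e., uniform.
a_prior, b_prior = 1.0, 1.0

# New data (illustrative): 7 heads observed in 10 flips.
heads, flips = 7, 10

# With a Beta prior and binomial likelihood, Bayes' theorem yields a
# Beta posterior in closed form: Beta(a + heads, b + tails).
a_post = a_prior + heads
b_post = b_prior + (flips - heads)
posterior = beta(a_post, b_post)

print(posterior.mean())          # point estimate: 8/12, about 0.667
print(posterior.interval(0.95))  # 95% credible interval
```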