Basic probability for machine learning

Probability provides information about the likelihood of an event occurring. In this field, there are several key terms that are important to understand:

  • Trial or experiment: An action that results in a certain outcome with a certain likelihood
  • Sample space: This encompasses all potential outcomes of a given experiment
  • Event: This denotes a non-empty portion of the sample space

Therefore, in technical terms, probability is a measure of the likelihood of an event occurring when an experiment is conducted.

In the simplest case, the probability of an event A consisting of a single outcome is the chance of that outcome divided by the total chance of all possible outcomes. For example, when flipping a fair coin, there are two equally likely outcomes: heads and tails. The probability of heads is therefore 1/(1+1) = 1/2.

More generally, given an event, A, with n outcomes and a sample space, S, the probability of event A is calculated as

$$P(A) = \sum_{i=1}^{n} P(E_i)$$

where $E_1, \ldots, E_n$ represent the outcomes in A. Assuming all outcomes of the experiment are equally likely, and the selection of one does not influence the selection of others in subsequent rounds (meaning they are statistically independent), then

$$P(A) = \frac{\text{No. of outcomes in } A}{\text{No. of outcomes in } S}$$

Hence, the value of a probability ranges from 0 to 1, and since the sample space embodies the complete set of potential outcomes, P(S) = 1.
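As a minimal illustration of this counting rule, the following Python sketch computes P(A) for a simple example event (an even roll of a fair six-sided die); the event and sample space here are arbitrary illustrative choices:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event A: the roll is even
event_a = {2, 4, 6}

# With equally likely outcomes, P(A) = (no. of outcomes in A) / (no. of outcomes in S)
p_a = Fraction(len(event_a), len(sample_space))
print(p_a)         # 1/2
print(float(p_a))  # 0.5
```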

Statistically independent

In the realm of statistics, two events are defined as independent if the occurrence of one event doesn’t influence the likelihood of the other event’s occurrence. To put it formally, events A and B are independent precisely when P(A and B) = P(A)P(B), where P(A) and P(B) are the respective probabilities of events A and B happening.

Consider this example to clarify the concept of statistical independence: imagine we possess two coins, one fair (an equal chance of turning up heads or tails) and the other biased (showing heads with probability 3/4). If we flip the fair coin and the biased coin, these two events are statistically independent because the outcome of one coin flip doesn’t alter the probability of the other coin turning up heads or tails. Specifically, the likelihood of both coins showing heads is the product of the individual probabilities: (1/2) * (3/4) = 3/8.
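To make the two-coin example concrete, a short sketch can compute the joint probability directly and check it with a rough simulation; the random seed and trial count are arbitrary choices made only for this illustration:

```python
import random

p_heads_fair = 1 / 2     # fair coin
p_heads_biased = 3 / 4   # biased coin from the example above

# Independence: the joint probability is the product of the marginals
print(p_heads_fair * p_heads_biased)  # 0.375 == 3/8

# Rough Monte Carlo check
random.seed(0)
trials = 100_000
both_heads = sum(
    (random.random() < p_heads_fair) and (random.random() < p_heads_biased)
    for _ in range(trials)
)
print(both_heads / trials)  # close to 0.375
```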

Statistical independence is a pivotal concept in statistics and probability theory, frequently leveraged in machine learning to describe the connections between variables within a dataset. By understanding these relationships, machine learning algorithms can better spot patterns and deliver more precise predictions. In the following, we describe the relationships between different types of events (a short code check follows the list):

  • Complementary event: The complementary event to A, denoted A’, consists of all potential outcomes in the sample space that are not included in A. Note that A and A’ are mutually exclusive and together cover the whole sample space, so:

$$P(A') = 1 - P(A)$$

  • Union and intersection: The union of A and B, denoted A ∪ B, contains every outcome that belongs to A, to B, or to both, while the intersection, denoted A ∩ B, contains only the outcomes shared by A and B. In general, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
  • Mutually exclusive: When two events have no shared outcomes, they are viewed as mutually exclusive. In other words, if A and B are mutually exclusive events, then P(A ∩ B) = 0. It then follows from the addition rule of probability that, since A and B are disjoint events:

$$P(A \cup B) = P(A) + P(B)$$

  • Independent: Two events are deemed independent when the occurrence of one doesn’t impact the occurrence of the other. If A and B are two independent events, then

$$P(A \cap B) = P(A) \cdot P(B)$$
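The sketch below checks these three relations on a toy sample space of two fair coin flips; the specific events used are illustrative only:

```python
from fractions import Fraction

# Two fair coin flips; for example, "HT" means heads then tails
sample_space = {"HH", "HT", "TH", "TT"}

def prob(event):
    """P(event) assuming equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

A = {"HH", "HT"}   # first flip is heads
B = {"HH", "TH"}   # second flip is heads
C = {"TT"}         # both flips are tails

# Complementary event: P(A') = 1 - P(A)
assert prob(sample_space - A) == 1 - prob(A)

# Mutually exclusive: A and C share no outcomes, so P(A or C) = P(A) + P(C)
assert A & C == set()
assert prob(A | C) == prob(A) + prob(C)

# Independent: P(A and B) = P(A) * P(B)
assert prob(A & B) == prob(A) * prob(B)
print("all relations hold")
```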

Next, we are going to describe the discrete random variable, its distribution, and how to use it to calculate probabilities.

Discrete random variables and their distribution

A discrete random variable refers to a variable that can assume a finite or countably infinite number of potential outcomes. Examples of such variables might be the count of heads resulting from a coin toss, the tally of cars crossing a toll booth within a specific time span, or the number of blonde-haired students in a classroom.

The probability distribution of a discrete random variable assigns a certain likelihood to each potential outcome the variable could adopt. For instance, in the case of a coin toss, the probability distribution assigns a 0.5 probability to both 0 and 1, representing tails and heads, respectively. For the car toll booth scenario, the distribution could be assigning a probability of 0.1 to no cars passing, 0.3 to one car, 0.4 to two cars, 0.15 to three cars, and 0.05 to four or more cars.

A graphical representation of the probability distribution of a discrete random variable can be achieved through a probability mass function (PMF), which correlates each possible outcome of the variable to its likelihood of occurrence. This function is usually represented as a bar chart or histogram, with each bar signifying the probability of a specific value.

The PMF is bound by two key principles:

  • It must be non-negative across all potential values of the random variable
  • The total sum of probabilities for all possible outcomes should equate to 1

The expected value of a discrete random variable offers an insight into its central tendency, computed as the probability-weighted average of its possible outcomes. This expected value is signified as E[X], with X representing the random variable.
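As an illustration, the following sketch encodes the toll-booth distribution described above as a PMF and computes E[X]; for simplicity, the “4 or more cars” category is collapsed to the single value 4, an assumption made only for this example:

```python
# PMF of the toll-booth example from the text; "4 or more cars" is
# collapsed to the single value 4 purely for illustration.
pmf = {0: 0.10, 1: 0.30, 2: 0.40, 3: 0.15, 4: 0.05}

# The two PMF principles: non-negative values that sum to 1
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Expected value E[X]: probability-weighted average of the outcomes
expected_value = sum(x * p for x, p in pmf.items())
print(expected_value)  # 1.75
```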

Probability density function

The probability density function (PDF) is a tool used to describe the distribution of a continuous random variable. It can be used to calculate the probability of a value falling within a specific range. In simpler terms, it helps determine the chances of a continuous variable, X, having a value within the interval [a, b], or in statistical terms,

$$P(a < X < b)$$

For continuous variables, the probability of any single value occurring is always 0, in contrast to discrete variables, which can assign nonzero probabilities to distinct values. PDFs therefore provide a way to estimate the likelihood of a value falling within a given range rather than at a single point.

For example, you can use a PDF to find the chances of the next IQ score measured falling between 100 and 120.

Figure 2.2 – Probability density function for IQ from 100–120

To specify the distribution of a discrete random variable, one can provide either its PMF or its cumulative distribution function (CDF). For continuous random variables, we primarily utilize the CDF, as it is well defined. However, the PMF is not suitable for these types of variables because P(X = x) equals 0 for every real x, given that X can assume any real value in its range. Therefore, we typically define the PDF instead. The PDF resembles the concept of mass density in physics, signifying the concentration of probability; its unit is probability per unit length. To get a grasp of the PDF, let’s analyze a continuous random variable, X, and define the function $f_X(x)$ as follows:

$$f_X(x) = \lim_{\Delta \to 0^{+}} \frac{P(x < X \le x + \Delta)}{\Delta}$$

provided that the limit exists.

The function $f_X(x)$ provides the probability density at a given point, x. This is equivalent to the limit of the ratio of the probability of the interval (x, x + Δ] to the length of the interval as that length approaches 0.

Let’s consider a continuous random variable, X, possessing an absolutely continuous CDF, denoted as $F_X(x)$. If $F_X(x)$ is differentiable at x, the function $f_X(x)$ is referred to as the PDF of X:

$$f_X(x) = \lim_{\Delta \to 0^{+}} \frac{F_X(x + \Delta) - F_X(x)}{\Delta} = \frac{d F_X(x)}{dx} = F'_X(x)$$

assuming $F_X(x)$ is differentiable at $x$.

For example, let’s consider a continuous uniform random variable, X, with the uniform $U(a, b)$ distribution. Its PDF is given by:

$$f_X(x) = \frac{1}{b - a} \quad \text{if } a < x < b$$

which is 0 for any x outside the bounds.

By using integration, the CDF can be obtained from the PDF:

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$$

Additionally, we have

$$P(a < X \le b) = F_X(b) - F_X(a) = \int_{a}^{b} f_X(u)\, du$$

So, if we integrate over the entire real line, we will get 1:

$$\int_{-\infty}^{\infty} f_X(u)\, du = 1$$

Explicitly, when integrating the PDF across the entire real number line, the result should equal 1. This signifies that the area beneath the PDF curve must equal 1, or P(S) = 1, which indeed holds for the uniform distribution. The PDF signifies the density of probability; thus, it must be non-negative, and, unlike a probability, its value can exceed 1.

Consider a continuous random variable, X, with its PDF represented as $f_X(x)$. The following properties apply:

$f_X(x) \ge 0$, for all real x

$$\int_{-\infty}^{\infty} f_X(u)\, du = 1$$
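The following sketch verifies these properties numerically for the uniform example above using scipy.stats; the endpoints a = 2 and b = 5 are arbitrary illustrative values:

```python
from scipy.integrate import quad
from scipy.stats import uniform

a, b = 2.0, 5.0
dist = uniform(loc=a, scale=b - a)   # SciPy parameterizes U(a, b) as loc=a, scale=b-a

# The PDF is 1/(b - a) inside (a, b) and 0 outside; it is never negative
print(dist.pdf(3.0))   # 0.333...
print(dist.pdf(6.0))   # 0.0

# P(a' < X <= b') = F_X(b') - F_X(a')
print(dist.cdf(4.0) - dist.cdf(2.5))  # 0.5

# Integrating the PDF over the whole real line gives 1
area, _ = quad(dist.pdf, -10, 10, points=[a, b])
print(round(area, 6))  # 1.0
```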

Next, we’ll move on to cover maximum likelihood.

Maximum likelihood estimation

Maximum likelihood is a statistical approach used to estimate the parameters of a probability distribution. The objective is to identify the parameter values that maximize the likelihood of observing the data, essentially determining the parameters most likely to have generated the data.

Suppose we have a random sample, $X = \{X_1, \ldots, X_n\}$, from a population with a probability distribution $f(x \mid \theta)$, where θ is a vector of parameters. The likelihood of observing the sample, X, given the parameters, θ, is defined as the product of the individual probabilities of observing each data point:

$$L(\theta \mid X) = f(X \mid \theta)$$

When the observations are independent and identically distributed, the likelihood function can be expressed as the product of the univariate density functions, each evaluated at the corresponding observation:

$$L(\theta \mid X) = f(X_1 \mid \theta)\, f(X_2 \mid \theta) \cdots f(X_n \mid \theta)$$

The maximum likelihood estimate (MLE) is the parameter vector value that offers the maximum value for the likelihood function across the parameter space.

In many cases, it’s more convenient to work with the natural logarithm of the likelihood function, referred to as the log-likelihood. The log-likelihood peaks at the same parameter vector value as the likelihood function, and the conditions required for a maximum (or minimum) are obtained by setting the derivatives of the log-likelihood with respect to each parameter to 0. If the log-likelihood is differentiable with respect to the parameters, these conditions yield a set of equations that can be solved numerically to derive the MLE. One common scenario where MLE significantly impacts ML model performance is linear regression. When building a linear regression model, MLE is often used to estimate the coefficients that define the relationship between the input features and the target variable: it finds the coefficient values that maximize the likelihood of observing the given data under the assumed linear regression model, improving the accuracy of the predictions.
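As a brief sketch of the linear regression case just mentioned: under the usual Gaussian noise assumption, maximizing the likelihood of the coefficients is equivalent to minimizing the squared error, so the MLE can be recovered with an ordinary least-squares solve. The data below is simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y = 2.0 * x + 1.0 + Gaussian noise
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=200)

# Under a Gaussian noise assumption, maximizing the likelihood of the
# coefficients is equivalent to minimizing the squared error, so the MLE
# can be obtained with an ordinary least-squares solve.
X = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(slope, intercept)  # close to 2.0 and 1.0
```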

The MLEs of the parameters, θ, are the values that maximize the likelihood function. In other words, the MLEs are the values of θ that make the observed data, X, most probable.

To find the MLEs, we typically take the natural logarithm of the likelihood function, as it is often easier to work with the logarithm of a product than with the product itself:

$$\ln L(\theta \mid X) = \ln f(X_1 \mid \theta) + \ln f(X_2 \mid \theta) + \ldots + \ln f(X_n \mid \theta)$$

The MLEs are determined by equating the partial derivatives of the log-likelihood function with respect to each parameter to 0 and then solving these equations for the parameters:

$$\frac{\partial \ln L(\theta \mid X)}{\partial \theta_1} = 0$$

$$\frac{\partial \ln L(\theta \mid X)}{\partial \theta_2} = 0$$

...

$$\frac{\partial \ln L(\theta \mid X)}{\partial \theta_k} = 0$$

where k is the number of parameters in θ. The goal of a maximum likelihood estimator is to find θ such that

$$\theta(x) = \arg\max_{\theta} L(\theta \mid x)$$
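The following sketch applies this argmax formulation numerically to a simple case, estimating the heads probability of a simulated biased coin by minimizing the negative log-likelihood; the simulated data and optimizer settings are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)   # simulated coin flips with true p = 0.7

def negative_log_likelihood(p):
    # ln L(p | X) = sum_i ln f(x_i | p) for Bernoulli observations
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)       # numerical MLE
print(data.mean())    # closed-form MLE for a Bernoulli parameter: the sample mean
```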

Once the MLEs have been found, they can be used to make predictions about the population based on the sample data. Maximum likelihood is widely used in many fields, including psychology, economics, engineering, and biology. It serves as a potent tool for comprehending the connections among variables and for predicting outcomes based on observed data. As an example, we will now build a word predictor using maximum likelihood estimation.

Word autocompletion, also known as word prediction, is a feature in which an application predicts the next word a user is typing. The aim of word prediction is to save time and make typing easier by predicting what the user is likely to type next based on their previous inputs and other contextual factors. Word prediction can be found in various forms in many applications, including search engines, text editors, and mobile device keyboards.

Given a group of words that the user typed, how would we suggest the next word?

If the words were The United States of, then it would be trivial to assume that the next word would be America. However, what about finding the next word for How are? One could suggest several next words.

There usually isn’t just one clear next word. Thus, we’d want to suggest the most likely word or perhaps even the most likely words. In that case, we would be interested in suggesting a probabilistic representation of the possible next words and picking the next word as the one that is most probable.

The maximum likelihood estimator provides us with that precise capability. It can tell us which word is most probable given the previous words that the user typed.

In order to calculate the MLE, we need to calculate the probability function of all word combinations. We can do that by processing large texts and counting how many times each combination of words exists.
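In practice, these counts can be gathered by sliding a window over tokenized text. A minimal sketch (the toy corpus here is invented purely for illustration) might look like this:

```python
from collections import Counter

text = "how are you today and how are they doing and how are you feeling"
tokens = text.split()

# Count every trigram (sequence of three consecutive words) in the corpus
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
print(trigram_counts[("how", "are", "you")])   # 2
print(trigram_counts[("how", "are", "they")])  # 1
```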

Consider reviewing a large corpus of text that has the following occurrences:

|                  | “you” | “they” | “those” | “the” | Any other word |
|------------------|-------|--------|---------|-------|----------------|
| “how are …”      | 16    | 14     | 0       | 100   | 10             |
| not “how are …”  | 200   | 100    | 300     | 1,000 | 30,000         |

Table 2.1 – Sample of n-grams occurrences in a document

For instance, there are 16 occurrences in the text where the sequence “how are you” appears. There are 140 sequences that have a length of three that start with the words “how are.” That is calculated as:

$$16 + 14 + 0 + 100 + 10 = 140$$

There are 216 sequences that have a length of three and that end with the word “you”. That is calculated as:

$$16 + 200 = 216$$

Now, let’s suggest a formula for the most likely next word.

Based on the common maximum likelihood estimation for the probabilistic variable $W_3$, the formula would be to find a value for $W_3$ that maximizes:

$$P(W_3 \mid W_1, W_2)$$

However, this common formula has a few characteristics that wouldn’t be advantageous to our application.

Consider the next formula, which has specific advantages that are necessary for our use case. It is the maximum likelihood formula for parametric estimation, that is, for estimating deterministic parameters. It suggests finding a value for $W_3$ that maximizes:

$$P(W_1, W_2 \mid W_3)$$

$W_3$ is by no means a deterministic parameter; however, this formula suits our use case because it reduces common-word bias by emphasizing contextual fit and adjusts for word specificity, thus enhancing the relevance of our predictions. We will elaborate more on these traits in the conclusion of this exercise.

Let’s rewrite this formula to make it easier to calculate:

$$P(W_1, W_2 \mid W_3) = \frac{P(W_1, W_2, W_3)}{P(W_3)}$$

In our case, $W_1$ is “how” and $W_2$ is “are.”

There are five candidates for the next word; let’s calculate the probability for each of them:

  • P(“how”, “are” | “you”) = 16 / (200 + 16) = 16/216 = 2/27
  • P(“how”, “are” | “they”) = 14 / (100 + 14) = 14/114 = 7/57
  • P(“how”, “are” | “those”) = 0 / 300 = 0
  • P(“how”, “are” | “the”) = 100 / (1,000 + 100) = 100/1,100 = 1/11
  • P(“how”, “are” | any other word) = 10 / (30,000 + 10) = 10/30,010 = 1/3,001

Out of all the options, the highest probability is 7/57, which is achieved when “they” is the next word.

Note that the intuition behind this maximum likelihood estimator is that the suggested next word should make the words that the user typed most likely. One could wonder, why not take the word that is most probable given the first two words, that is, the original maximum likelihood formula for probabilistic variables? From the table, we see that given the words “how are,” the most frequent third word is “the,” with a probability of 100/140. However, this approach wouldn’t take into account the fact that the word “the” is extremely prevalent altogether, as it is the most frequently used word in the text in general. Thus, its high frequency isn’t due to its relationship with the first two words; it is simply a very common word overall. The maximum likelihood formula we chose takes that into account.
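Putting the Table 2.1 numbers into code, the following sketch scores each candidate word with the chosen likelihood formula and returns the best suggestion; the dictionary and function names are illustrative only:

```python
# Counts from Table 2.1: occurrences of "how are <w>" and of "<w>" at the
# end of any other length-3 sequence.
counts_after_how_are = {"you": 16, "they": 14, "those": 0, "the": 100, "other": 10}
counts_elsewhere     = {"you": 200, "they": 100, "those": 300, "the": 1000, "other": 30000}

def score(word):
    # P("how", "are" | word) = count("how are word") / count of all
    # length-3 sequences ending in word
    total_ending_in_word = counts_after_how_are[word] + counts_elsewhere[word]
    return counts_after_how_are[word] / total_ending_in_word if total_ending_in_word else 0.0

scores = {w: score(w) for w in counts_after_how_are}
print(scores)                       # "they" scores 14/114 = 7/57, the highest
print(max(scores, key=scores.get))  # 'they'
```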

Bayesian estimation

Bayesian estimation is a statistical approach that involves updating our beliefs or probabilities about a quantity of interest based on new data. The term “Bayesian” refers to Thomas Bayes, an 18th-century statistician who first developed the concept of Bayesian probability.

In Bayesian estimation, we start with prior beliefs about the quantity of interest, which are expressed as a probability distribution. These prior beliefs are updated as we collect new data. The updated beliefs are represented as a posterior distribution. The Bayesian framework provides a systematic way of updating prior beliefs with new data, taking into account the degree of uncertainty in both the prior beliefs and the new data.

The posterior distribution is calculated using Bayes’ theorem, which is the fundamental equation of Bayesian estimation. Bayes’ theorem states that

$$P(\Theta \mid X) = \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)}$$

where Θ is the quantity of interest, X is the new data, P(Θ|X) is the posterior distribution, P(X|Θ) is the likelihood of the data given the parameter value, P(Θ) is the prior distribution, and P(X) is the marginal likelihood or evidence.

The marginal likelihood is calculated as follows:

$$P(X) = \int P(X \mid \Theta) \cdot P(\Theta)\, d\Theta$$

where the integral is taken over the entire space of Θ. The marginal likelihood is often used as a normalizing constant, ensuring that the posterior distribution integrates to 1.

In Bayesian estimation, the choice of prior distribution is important, as it reflects our beliefs about the quantity of interest before collecting any data. The prior distribution can be chosen based on prior knowledge or previous studies. If no prior knowledge is available, a non-informative prior can be used, such as a uniform distribution.

Once the posterior distribution is calculated, it can be used to make predictions about the quantity of interest. As an example, the posterior distribution’s mean can serve as a point estimate, whereas the posterior distribution itself can be employed to establish credible intervals. These intervals represent the probable range within which the true value of the target quantity resides.
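As a small worked example of this update, the sketch below uses a conjugate Beta prior with a binomial likelihood (coin flips), for which the posterior is available in closed form; the prior parameters and data are illustrative choices:

```python
from scipy.stats import beta

# Prior belief about a coin's heads probability: Beta(2, 2), mildly centered at 0.5
prior_a, prior_b = 2, 2

# New data: 7 heads in 10 flips
heads, tails = 7, 3

# With a Beta prior and a binomial likelihood, the posterior is Beta(a + heads, b + tails)
posterior = beta(prior_a + heads, prior_b + tails)

print(posterior.mean())          # point estimate of the heads probability
print(posterior.interval(0.95))  # 95% credible interval
```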
