Learning Bayesian Models with R

Chapter 1. Introducing the Probability Theory

Bayesian inference is a method of learning about the relationship between variables from data, in the presence of uncertainty, in real-world problems. It is one of the frameworks of probability theory. Any reader interested in Bayesian inference should have a good knowledge of probability theory to understand and use Bayesian inference. This chapter covers an overview of probability theory, which will be sufficient to understand the rest of the chapters in this book.

It was Pierre-Simon Laplace who first proposed a formal definition of probability with mathematical rigor. This definition is called the Classical Definition and it states the following:

	The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible.
	--Pierre-Simon Laplace, A Philosophical Essay on Probabilities

What this definition means is that, if a random experiment can result in $N$ mutually exclusive and equally likely outcomes, the probability of the event $A$ is given by:

$$P(A) = \frac{N_A}{N}$$

Here, $N_A$ is the number of occurrences of the event $A$.

To illustrate this concept, let us take the simple example of rolling a die. If the die is fair, then all the faces have an equal chance of showing up when it is rolled, so the probability of each face showing up is 1/6. However, when one rolls the die 100 times, the faces will not come up in exactly equal proportions of 1/6, due to random fluctuations. The estimate of the probability of each face is the number of times the face shows up divided by the number of rolls. As the number of rolls becomes very large, this ratio gets close to 1/6.
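The die example can be tried out numerically. The following is a minimal Python sketch (standard library only) that simulates a fair die and reports how far the observed relative frequencies stray from 1/6. The function name `face_frequencies`, the seed, and the roll counts are illustrative assumptions, not part of the original text.

```python
import random
from collections import Counter


def face_frequencies(n_rolls, seed=0):
    """Roll a simulated fair die n_rolls times; return each face's relative frequency."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


for n_rolls in (100, 100_000):
    freqs = face_frequencies(n_rolls)
    # Largest gap between an observed frequency and the ideal value 1/6.
    worst_gap = max(abs(f - 1 / 6) for f in freqs.values())
    print(n_rolls, round(worst_gap, 4))
```

With more rolls, the largest gap should shrink, which is the long-run behavior the paragraph describes.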

In the long run, this classical definition treats the probability of an uncertain event as the relative frequency of its occurrence. This is also called a frequentist approach to probability. Although this approach is suitable for a large class of problems, there are cases where this type of approach cannot be used. As an example, consider the following question: Is Pataliputra the name of an ancient city or a king? In such cases, we have a degree of belief in various plausible answers, but it is not based on counts in the outcome of an experiment (in the Sanskrit language Putra means son, therefore some people may believe that Pataliputra is the name of an ancient king in India, but it is a city).

Another example is, What is the chance of the Democratic Party winning the election in 2016 in America? Some people may believe it is 1/2 and some people may believe it is 2/3. In this case, probability is defined as the degree of belief of a person in the outcome of an uncertain event. This is called the subjective definition of probability.

One of the limitations of the classical or frequentist definition of probability is that it cannot address subjective probabilities. As we will see later in this book, Bayesian inference is a natural framework for treating both frequentist and subjective interpretations of probability.

Probability distributions

In both classical and Bayesian approaches, a probability distribution function is the central quantity, which captures all of the information about the relationship between variables in the presence of uncertainty. A probability distribution assigns a probability value to each measurable subset of outcomes of a random experiment. The variable involved could be discrete or continuous, and univariate or multivariate. Although people use slightly different terminologies, the commonly used probability distributions for the different types of random variables are as follows:

Probability mass function (pmf) for discrete numerical random variables
Categorical distribution for categorical random variables
Probability density function (pdf) for continuous random variables

One of the well-known distribution functions is the normal or Gaussian distribution, which is named after Carl Friedrich Gauss, a famous German mathematician and physicist. It is also known by the name bell curve because of its shape. The mathematical form of this distribution is given by:

$$N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Here, $\mu$ is the mean or location parameter and $\sigma$ is the standard deviation or scale parameter ($\sigma^2$ is called the variance). The following graphs show what the distribution looks like for different values of location and scale parameters:

One can see that as the mean changes, the location of the peak of the distribution changes. Similarly, when the standard deviation changes, the width of the distribution also changes.
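The effect of the two parameters can also be checked numerically. This is a small Python sketch assuming the standard Gaussian density with mean `mu` and standard deviation `sigma`; the function name `normal_pdf` and the parameter values are illustrative choices.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical (mu, sigma) pairs chosen only for illustration.
for mu, sigma in [(0.0, 1.0), (2.0, 1.0), (0.0, 2.0)]:
    peak = normal_pdf(mu, mu, sigma)            # the density is highest at x = mu
    one_sd = normal_pdf(mu + sigma, mu, sigma)  # density one standard deviation away
    print(mu, sigma, round(peak, 4), round(one_sd / peak, 4))
```

Changing `mu` moves the peak without changing its height, while doubling `sigma` halves the peak height and spreads the curve out.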

Many natural datasets follow a normal distribution because, according to the central limit theorem, the mean of a large number of independent random variables is approximately normally distributed. This holds irrespective of the form of the distribution of the individual variables, as long as they have finite mean and variance and are all drawn from the same original distribution. The normal distribution is also very popular among data scientists because, in many statistical inference problems, theoretical results can be derived if the underlying distribution is normal.
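The central limit theorem can be illustrated with a quick simulation. The Python sketch below averages draws from a uniform distribution, which is far from bell-shaped, and checks that the averages behave roughly like a normal variable. The sample sizes, the seed, and the name `sample_means` are assumptions made for this example.

```python
import random
import statistics


def sample_means(n_per_mean, n_means, seed=1):
    """Return n_means averages, each taken over n_per_mean uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.random() for _ in range(n_per_mean))
        for _ in range(n_means)
    ]


means = sample_means(n_per_mean=50, n_means=5000)
# A uniform(0, 1) variable has mean 1/2 and variance 1/12, so the mean of 50
# draws should have standard deviation sqrt(1 / (12 * 50)).
expected_sd = (1 / (12 * 50)) ** 0.5
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)
print(round(statistics.fmean(means), 3), round(statistics.stdev(means), 4))
print(round(expected_sd, 4), round(within_one_sd, 3))  # a normal gives about 0.68
```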

Now, let us look at the multidimensional version of the normal distribution. If the random variable is an N-dimensional vector, $x$ is denoted by:

$$x = [x_1, x_2, \ldots, x_N]$$

Then, the corresponding normal distribution is given by:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

Here, $\mu$ corresponds to the mean (also called location) and $\Sigma$ is an N x N covariance matrix (also called scale).

To get a better understanding of the multidimensional normal distribution, let us take the case of two dimensions. In this case, $x = [x_1, x_2]$ and the covariance matrix is given by:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

Here, $\sigma_1^2$ and $\sigma_2^2$ are the variances along the $x_1$ and $x_2$ directions, and $\rho$ is the correlation between $x_1$ and $x_2$. A plot of the two-dimensional normal distribution for fixed values of $\sigma_1^2$, $\sigma_2^2$, and $\rho$ is shown in the following image:

If $\rho = 0$, then the two-dimensional normal distribution will be reduced to the product of two one-dimensional normal distributions, since $\Sigma$ would become diagonal in this case. The following 2D projections of the normal distribution for the same values of $\sigma_1^2$ and $\sigma_2^2$, first with a high value of $\rho$ and then with $\rho = 0$, illustrate this case:

The high correlation between x and y in the first case forces most of the data points along the 45 degree line and makes the distribution more anisotropic; whereas, in the second case, when the correlation is zero, the distribution is more isotropic.
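Both observations can be verified directly from the density. The Python sketch below assumes the standard bivariate normal density written in terms of the two standard deviations and the correlation. The function names and the numerical values of `sigma1`, `sigma2`, and `rho` are hypothetical, picked only for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """One-dimensional normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Two-dimensional normal density with correlation rho (requires |rho| < 1)."""
    z1 = (x1 - mu1) / sigma1
    z2 = (x2 - mu2) / sigma2
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm


sigma1, sigma2 = 3.0, 2.0  # illustrative values, not taken from the text

# With rho = 0 the joint density equals the product of the two marginals.
joint = bivariate_normal_pdf(0.5, -1.0, 0.0, 0.0, sigma1, sigma2, 0.0)
product = normal_pdf(0.5, 0.0, sigma1) * normal_pdf(-1.0, 0.0, sigma2)
print(math.isclose(joint, product))

# With a high rho, a point where x1 and x2 deviate in the same direction is far
# more likely than one where they deviate in opposite directions.
same = bivariate_normal_pdf(sigma1, sigma2, 0.0, 0.0, sigma1, sigma2, 0.8)
opposite = bivariate_normal_pdf(sigma1, -sigma2, 0.0, 0.0, sigma1, sigma2, 0.8)
print(same > opposite)
```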

We will briefly review some of the other well-known distributions used in Bayesian inference here.

Expectations and covariance

Having known the distribution of a set of random variables $x = \{x_1, x_2, \ldots, x_N\}$, what one would typically be interested in for real-life applications is estimating the average values of these random variables and the correlations between them. These are computed formally using the following expressions:

$$E[x_i] = \int x_i \, P(x_1, x_2, \ldots, x_N) \, dx_1 \, dx_2 \cdots dx_N$$

$$\mathrm{cov}(x_i, x_j) = E\big[(x_i - E[x_i])(x_j - E[x_j])\big]$$

For example, in the case of the two-dimensional normal distribution, if we are interested in finding the correlation between the variables $x_1$ and $x_2$, it can be formally computed from the joint distribution using the following formula:

$$\rho = \frac{1}{\sigma_1 \sigma_2} \int\!\!\int (x_1 - \mu_1)(x_2 - \mu_2) \, N(x \mid \mu, \Sigma) \, dx_1 \, dx_2$$
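In practice these integrals are often approximated by averages over samples. The Python sketch below draws correlated pairs from a two-dimensional normal distribution and estimates the means, the covariance, and the correlation from the draws. The construction of the correlated pair from two independent standard normal variables is a standard technique rather than something taken from the text, and all parameter values are hypothetical.

```python
import math
import random


def sample_bivariate_normal(n, mu1, mu2, sigma1, sigma2, rho, seed=2):
    """Draw n pairs (x1, x2) from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu1 + sigma1 * z1
        # Mixing z1 into x2 induces correlation rho between x1 and x2.
        x2 = mu2 + sigma2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        pairs.append((x1, x2))
    return pairs


def mean_cov_corr(pairs):
    """Sample estimates of E[x1], E[x2], cov(x1, x2), and the correlation."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    var1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    var2 = sum((b - m2) ** 2 for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    return m1, m2, cov, cov / math.sqrt(var1 * var2)


# Illustrative parameters: the true covariance is rho * sigma1 * sigma2 = 4.8.
pairs = sample_bivariate_normal(20_000, 1.0, -1.0, 3.0, 2.0, 0.8)
m1, m2, cov, corr = mean_cov_corr(pairs)
print(round(m1, 2), round(m2, 2), round(cov, 2), round(corr, 3))
```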

Binomial distribution

A binomial distribution is a discrete distribution that gives the probability of obtaining a given number of heads in n independent trials, where each trial has one of two possible outcomes, heads or tails, with the probability of heads being p. Each of the trials is called a Bernoulli trial. The functional form of the binomial distribution is given by:

$$P(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Here, $P(k; n, p)$ denotes the probability of having k heads in n trials. The mean of the binomial distribution is given by np and the variance is given by np(1-p). Have a look at the following graphs:

The preceding graphs show the binomial distribution for two values of n, 100 and 1000, with p = 0.7. As you can see, when n becomes large, the binomial distribution becomes sharply peaked. It can be shown that, in the large n limit, a binomial distribution can be approximated using a normal distribution with mean np and variance np(1-p). This is a characteristic shared by many discrete distributions: in the large n limit, they can be approximated by some continuous distribution.
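The mean, the variance, and the normal approximation can all be checked by direct summation. This Python sketch assumes the usual binomial probability mass function and uses n = 100 and p = 0.7 as in the graphs; the function names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent trials with P(heads) = p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mu, sigma):
    """One-dimensional normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


n, p = 100, 0.7
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(round(mean, 6), round(variance, 6))  # compare with n*p and n*p*(1-p)

# Normal approximation evaluated at the mean k = n*p = 70.
approx = normal_pdf(70, n * p, math.sqrt(n * p * (1 - p)))
print(round(pmf[70], 4), round(approx, 4))
```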

Beta distribution

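The Beta distribution, denoted by $\mathrm{Beta}(x \mid \alpha, \beta)$, is a function of a power of $x$ and of its reflection $(1 - x)$, and is given by:

$$\mathrm{Beta}(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$

Here, $\alpha, \beta > 0$ are parameters that determine the shape of the distribution function and $B(\alpha, \beta)$ is the Beta function given by the ratio of Gamma functions: $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta)$.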

The Beta distribution is a very important distribution in Bayesian inference. It is the conjugate prior probability distribution (which will be defined more precisely in the next chapter) for binomial, Bernoulli, negative binomial, and geometric distributions. It is used for modeling the random behavior of percentages and proportions. For example, the Beta distribution has been used for modeling allele frequencies in population genetics, time allocation in project management, the proportion of minerals in rocks, and heterogeneity in the probability of HIV transmission.
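As a quick sanity check on the Beta density, the following Python sketch evaluates it using the Gamma-function form of the normalizing constant and integrates it numerically over the interval (0, 1). The shape values 2 and 5 are arbitrary, and the comparison of the mean with alpha / (alpha + beta) relies on a standard result that is not stated in the text.

```python
import math


def beta_pdf(x, alpha, beta):
    """Beta density on (0, 1), normalized by B(alpha, beta)."""
    b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / b


alpha, beta = 2.0, 5.0  # illustrative shape parameters
steps = 10_000
midpoints = [(i + 0.5) / steps for i in range(steps)]  # midpoint rule on (0, 1)

total = sum(beta_pdf(x, alpha, beta) for x in midpoints) / steps
mean = sum(x * beta_pdf(x, alpha, beta) for x in midpoints) / steps
print(round(total, 6), round(mean, 6), round(alpha / (alpha + beta), 6))
```

Because the density lives on (0, 1), it is a natural fit for quantities such as proportions, which is why the numerical mean lands between 0 and 1.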

Gamma distribution

The Gamma distribution, denoted by $\mathrm{Gamma}(x \mid \alpha, \beta)$, is another common distribution used in Bayesian inference. It is used for modeling waiting times, such as survival times. Special cases of the Gamma distribution are the well-known Exponential and Chi-Square distributions.

In Bayesian inference, the Gamma distribution is used as a conjugate prior for the inverse of variance of a one-dimensional normal distribution or parameters such as the rate ($\lambda$) of an exponential or Poisson distribution.

The mathematical form of a Gamma distribution is given by:

$$\mathrm{Gamma}(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$$

Here, $\alpha$ and $\beta$ are the shape and rate parameters, respectively (both take values greater than zero). There is also a form in terms of the scale parameter $\theta = 1/\beta$, which is common in econometrics. Another related distribution is the Inverse-Gamma distribution, which is the distribution of the reciprocal of a variable that is distributed according to the Gamma distribution. It is mainly used in Bayesian inference as the conjugate prior distribution for the variance of a one-dimensional normal distribution.
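The relationship to the Exponential distribution and the two parameterizations can be checked in a few lines. This Python sketch assumes the shape and rate form of the Gamma density; the function names and test values are illustrative.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density in the shape/rate parameterization (x > 0)."""
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def gamma_pdf_scale(x, shape, scale):
    """Same density written with a scale parameter, scale = 1 / rate."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale**shape)


def exponential_pdf(x, rate):
    """Exponential density with the given rate."""
    return rate * math.exp(-rate * x)


x, rate = 1.7, 0.5  # arbitrary evaluation point and rate
# Shape 1 turns the Gamma density into the Exponential density.
print(math.isclose(gamma_pdf(x, 1.0, rate), exponential_pdf(x, rate)))
# The rate and scale forms describe the same distribution.
print(math.isclose(gamma_pdf(x, 3.0, rate), gamma_pdf_scale(x, 3.0, 1.0 / rate)))
```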

Dirichlet distribution

The Dirichlet distribution is a multivariate analogue of the Beta distribution. It is commonly used in Bayesian inference as the conjugate prior distribution for multinomial distribution and categorical distribution. The main reason for this is that it is easy to implement inference techniques, such as Gibbs sampling, on the Dirichlet-multinomial distribution.

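The Dirichlet distribution of order $K \geq 2$ is defined over an open $(K-1)$-dimensional simplex as follows:

$$\mathrm{Dir}(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$

Here, $x_1, x_2, \ldots, x_{K-1} > 0$, $x_1 + x_2 + \cdots + x_{K-1} < 1$, and $x_K = 1 - x_1 - \cdots - x_{K-1}$.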
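To see what a draw from a Dirichlet distribution looks like, the Python sketch below uses a standard construction that is not described in the text: draw one Gamma variable per component and normalize so the components sum to one. The concentration values in `alphas` are hypothetical, and the comparison with `alpha_i / sum(alphas)` relies on the known formula for the Dirichlet mean.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    # random.gammavariate takes (shape, scale); a scale of 1.0 is used here.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(3)
alphas = [2.0, 3.0, 5.0]  # illustrative concentration parameters
samples = [sample_dirichlet(alphas, rng) for _ in range(20_000)]

first = samples[0]
print([round(v, 3) for v in first], round(sum(first), 6))

# Component averages should be close to alpha_i / sum(alphas).
averages = [sum(s[i] for s in samples) / len(samples) for i in range(len(alphas))]
print([round(a, 3) for a in averages])
```

Each sample is a vector of positive proportions that sums to one, which is why the distribution pairs naturally with multinomial and categorical likelihoods.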

Wishart distribution

The Wishart distribution is a multivariate generalization of the Gamma distribution. It is defined over symmetric non-negative definite matrix-valued random variables. In Bayesian inference, it is used as the conjugate prior for the inverse of the covariance matrix (the precision matrix) of the normal distribution. This parallels the Gamma distribution, which, as discussed earlier, is used as a conjugate prior for the inverse of the variance of the one-dimensional normal distribution.

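The mathematical definition of the Wishart distribution is as follows:

$$W(X \mid V, n) = \frac{|X|^{(n - p - 1)/2} \exp\left(-\frac{1}{2} \mathrm{tr}(V^{-1} X)\right)}{2^{np/2} \, |V|^{n/2} \, \Gamma_p(n/2)}$$

Here, $|X|$ denotes the determinant of the matrix $X$ of dimension $p \times p$, $V$ is the $p \times p$ scale matrix, $n \geq p$ is the degrees of freedom, and $\Gamma_p$ is the multivariate Gamma function.

A special case of the Wishart distribution is when $p = V = 1$, which corresponds to the well-known Chi-Square distribution function with $n$ degrees of freedom.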
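The one-dimensional case makes the link to the Gamma and Chi-Square distributions concrete. The Python sketch below assumes the standard form of the Wishart density and restricts it to p = 1, where the matrix and the scale are positive scalars. The function names and test values are illustrative.

```python
import math


def wishart_pdf_1d(x, v, n):
    """Wishart density for p = 1, where x and the scale v are positive scalars."""
    return x ** ((n - 2) / 2) * math.exp(-x / (2 * v)) / (
        2 ** (n / 2) * v ** (n / 2) * math.gamma(n / 2)
    )


def gamma_pdf(x, shape, rate):
    """Gamma density in the shape/rate parameterization."""
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def chi_square_pdf(x, k):
    """Chi-Square density with k degrees of freedom."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


x, v, n = 2.5, 1.5, 4  # arbitrary evaluation point, scale, degrees of freedom
# For p = 1 the Wishart density is a Gamma density with shape n/2, rate 1/(2v).
print(math.isclose(wishart_pdf_1d(x, v, n), gamma_pdf(x, n / 2, 1 / (2 * v))))
# Setting v = 1 as well gives the Chi-Square density with n degrees of freedom.
print(math.isclose(wishart_pdf_1d(x, 1.0, n), chi_square_pdf(x, n)))
```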

Wikipedia gives a list of more than 100 useful distributions that are commonly used by statisticians (reference 1 in the Reference section of this chapter). Interested readers should refer to this article.

Exercises

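1. By using the definition of conditional probability, show that any multivariate joint distribution of N random variables $[x_1, x_2, \ldots, x_N]$ has the following trivial factorization:

$$P(x_1, x_2, \ldots, x_N) = P(x_1 \mid x_2, \ldots, x_N) \, P(x_2 \mid x_3, \ldots, x_N) \cdots P(x_{N-1} \mid x_N) \, P(x_N)$$

2. The bivariate normal distribution is given by:

$$N(x \mid \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

Here:

$$x = [x_1, x_2], \quad \mu = [\mu_1, \mu_2], \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

By using the definition of conditional probability, show that the conditional distribution $P(x_1 \mid x_2)$ can be written as a normal distribution of the form $N(x_1 \mid \mu_{1|2}, \sigma_{1|2})$ where $\mu_{1|2} = \mu_1 + \rho \frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and $\sigma_{1|2} = \sigma_1 \sqrt{1 - \rho^2}$.

3. By using explicit integration of the expression in exercise 2, show that the marginalization of the bivariate normal distribution will result in a univariate normal distribution.

4. In the following table, a dataset containing the measurements of petal and sepal sizes of 15 different Iris flowers is shown (taken from the Iris dataset, UCI machine learning dataset repository). All units are in cm: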

Sepal Length	Sepal Width	Petal Length	Petal Width	Class of Flower
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa
5	3.6	1.4	0.2	Iris-setosa
7	3.2	4.7	1.4	Iris-versicolor
6.4	3.2	4.5	1.5	Iris-versicolor
6.9	3.1	4.9	1.5	Iris-versicolor
5.5	2.3	4	1.3	Iris-versicolor
6.5	2.8	4.6	1.5	Iris-versicolor
6.3	3.3	6	2.5	Iris-virginica
5.8	2.7	5.1	1.9	Iris-virginica
7.1	3	5.9	2.1	Iris-virginica
6.3	2.9	5.6	1.8	Iris-virginica
6.5	3	5.8	2.2	Iris-virginica

Answer the following questions:

What is the probability of finding flowers with a sepal length more than 5 cm and a sepal width less than 3 cm?
What is the probability of finding flowers with a petal length less than 1.5 cm, given that the petal width is equal to 0.2 cm?
What is the probability of finding flowers with a sepal length less than 6 cm and a petal width less than 1.5 cm, given that the class of the flower is Iris-versicolor?
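One way to approach these questions is to treat the 15 rows as the whole sample space and count matching rows. The Python sketch below encodes the table and defines a small helper for empirical and conditional probabilities. The helper name `probability` and its signature are assumptions made for this example, and the counting approach is only one possible reading of the exercise.

```python
from fractions import Fraction

# (sepal length, sepal width, petal length, petal width, class), sizes in cm
IRIS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
    (4.6, 3.1, 1.5, 0.2, "Iris-setosa"),
    (5.0, 3.6, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
    (5.5, 2.3, 4.0, 1.3, "Iris-versicolor"),
    (6.5, 2.8, 4.6, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    (6.3, 2.9, 5.6, 1.8, "Iris-virginica"),
    (6.5, 3.0, 5.8, 2.2, "Iris-virginica"),
]


def probability(event, given=lambda row: True, rows=IRIS):
    """Empirical P(event | given): matching rows over rows that satisfy given."""
    pool = [row for row in rows if given(row)]
    return Fraction(sum(1 for row in pool if event(row)), len(pool))


q1 = probability(lambda r: r[0] > 5 and r[1] < 3)
q2 = probability(lambda r: r[2] < 1.5, given=lambda r: r[3] == 0.2)
q3 = probability(
    lambda r: r[0] < 6 and r[3] < 1.5,
    given=lambda r: r[4] == "Iris-versicolor",
)
print(q1, q2, q3)
```

Conditioning simply shrinks the pool of rows before counting, which mirrors the definition of conditional probability.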

Description

Bayesian Inference provides a unified framework to deal with all sorts of uncertainties when learning patterns from data using machine learning models and using them to predict future observations. However, learning and implementing Bayesian models is not easy for data science practitioners due to the level of mathematical treatment involved. Also, applying Bayesian methods to real-world problems requires high computational resources. With the recent advances in computation and the several open source packages available in R, Bayesian modeling has become more feasible to use for practical applications today. Therefore, it would be advantageous for all data scientists and engineers to understand Bayesian methods and apply them in their projects to achieve better results. Learning Bayesian Models with R starts by giving you comprehensive coverage of the Bayesian Machine Learning models and the R packages that implement them. It begins with an introduction to the fundamentals of probability theory and R programming for those who are new to the subject. Then the book covers some of the important machine learning methods, both supervised and unsupervised, implemented using Bayesian Inference and R. Every chapter begins with a theoretical description of the method, explained in a very simple manner. Then, relevant R packages are discussed and some illustrations using data sets from the UCI Machine Learning repository are given. Each chapter ends with some simple exercises for you to get hands-on experience of the concepts and R packages discussed in the chapter. The last chapters are devoted to the latest developments in the field, specifically Deep Learning, which uses a class of Neural Network models that are currently at the frontier of Artificial Intelligence. The book concludes with the application of Bayesian methods on Big Data using the Hadoop and Spark frameworks.

Who is this book for?

This book is for statisticians, analysts, and data scientists who want to build a Bayes-based system with R and implement it in their day-to-day models and projects. It is mainly intended for Data Scientists and Software Engineers who are involved in the development of Advanced Analytics applications. To understand this book, it would be useful if you have basic knowledge of probability theory and analytics and some familiarity with the programming language R.

What you will learn

Set up the R environment

Create a classification model to predict and explore discrete variables

Get acquainted with Probability Theory to analyze random events

Build Linear Regression models

Use Bayesian networks to infer the probability distribution of decision variables in a problem

Model a problem using Bayesian Linear Regression approach with the R package BLR

Use Bayesian Logistic Regression model to classify numerical data

Perform Bayesian Inference on massively large data sets using the MapReduce programs in R and Cloud computing

Hugo Jan 11, 2016

This book is good: it reads clearly, with information structured in good steps, plus exercises and references; these last two are very useful when you want more detailed information. Statistics isn't easy, at least for me, but I could learn the advantages of Bayesian inference. I think that the first chapters are an essential introduction to the subject and the tools to work with, but what you really want comes in modules: first you get an understanding of the use and capabilities of the principles of Bayesian inference, after that you get a notion of Bayesian methods and R, then you start to use both in machine learning. Machine Learning has many uses, so I think that the applicability of the book tends to infinity. I really liked that the author gives the basics of Bayesian neural networks in chapter 8, talking about deep belief networks and their advantages, and, as with all the other subjects, he gives good references to go deep and learn for sure. You will understand how structured the book is when you reach the last chapter and see how much you've learned and the complexity your projects can achieve; the chapters are like the steps of a staircase. My experience reading this was good because I feel that I've learned, and the exercises make me work with the material. Bayesian inference is different from classical statistics, and you can use it to solve your project needs. I certainly recommend this book; it is hard to find such information explained as well as in this book.

Duncan W. Robinson Nov 06, 2015

Learning Bayesian Models with R is a great book for those who want a hands-on approach to Bayesian data analysis and modeling using R software. It’s really not easy to find books that provide a good introduction to Bayesian theory and methodology, tell you how this information can be used (what you can do with knowledge of this methodology), and give useful examples with R code. My favorite sections were chapters 5 through 8; these detail Bayesian regression models, Bayesian classification models, and, my favorite, Bayesian neural networks (I mean, how cool is that?). For those who program in Python, I also recommend Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference.

Perry Nally Jan 02, 2016

A great book about machine learning and big data processing using Bayesian statistical algorithms. I have to say that this book is not for the faint of heart. But if you want to learn something very useful in the decades to come, then this is for you, faint-hearted included. It is an intensely mathematical read, but there's no other way to portray such elegance. This book presents the equations needed without leaving anything out. There are study sections at the end of each chapter to help you verify your understanding, which is a nice addition. I do recommend thoroughly understanding the first two chapters before moving on to the rest of the book as they contain critical statistical logic that is needed to understand the mathematical models used in the rest of the book. It's a great book and can be used as a resource for artificial intelligence and big data. It's also well organized with short clear details.

Dimitri Shvorob Mar 05, 2016

Let me be frank: I don't like Packt, a publisher that saves money on editing and graphic design, and just keeps churning out un-edited, ugly books by authors who could not, or did not, go with a proper publisher. They give free e-books to reviewers, and many feel obliged to return the favor by posting super-positive, but detail-free "reviews", which don't mention any alternatives, and sometimes name-check a book's key terms, but in an odd way that suggests that they don't really know what those mean. Just look at the reviews, and ratings, of this book's fan Dipanjan Sarkar, for example. I have seen many Packt books, and many Dipanjans, and I am annoyed. Anyway, this is not a five-star book. It is not a typical Packt book, in that Packt publishes IT books, and this is a formula-ridden statistics book that would be more at home in the catalog of an academic publisher like Springer or CRC-Hall. It starts with a concise if dry survey of Bayesian basics, and then surveys several Bayesian methods, implemented by specific R packages - "arm", "BayesLogit", "lda", "brnn", "bgmm", "darch". I see good things about the first part; the second one, on the other hand, came across as not very clearly written and too superficial to be useful: for example, I simply failed to understand the author's explanation of the Bayesian logit, and that should not have been complicated. The decision to go with specific R packages is understandable, but the failure to even mention BUGS, JAGS or STAN - the popular, general tools - is not. I don't understand who the book is for. The people who can handle the integrals, so to speak, will find the relevant R packages on their own. The less technical readers, on the other hand, will be put off by the academic style. My suggestion to the latter is "Doing Bayesian Data Analysis" by Kruschke. There is also a good, accessible book which uses Python rather than R, by Davidson-Pilon. UPD. With the benefit of a little more life experience, I would say: don't spend your time on *any* R book. Python is the way to go.

Vincent Jun 06, 2016

The book provides a quick review of all the main things you need to know when running Bayesian analyses in R. It will not make you a Bayesian wizard, but it could serve as a quick introduction to Bayesian analyses in R. A point of criticism is that in some places I felt that the author could have provided a bit more guidance on interpretation. For example, chapter 5 explains how to fit a Bayesian model and how to simulate the posterior distribution, but does not devote a single line to explaining how a user should interpret and use that simulation compared with the model coefficients. Further, chapter 5 states that smaller confidence intervals in the Bayesian model are a major benefit, but when I compare the actual predictions with reference values the Bayesian model actually performs marginally worse than ordinary least squares regression. The author should really make more effort to explain why smaller confidence intervals are worth the reduction in actual model quality. Code examples are simply a log of the author's command line entries, including odd repetitions. Further, the code has some poor programming habits, e.g. using the attach(data) function is not meaningful if you already pass the data argument to the model.

Learning Bayesian Models with R: Become an expert in Bayesian Machine Learning methods using R and apply them to solve real-world big data problems
