Bayesian Network Fundamentals

  • 25 min read
  • 10 Aug 2015

In this article by Ankur Ankan and Abinash Panda, the authors of Mastering Probabilistic Graphical Models Using Python, we'll cover the basics of random variables, probability theory, and graph theory. We'll also look at Bayesian models and the independencies they encode.

A graphical model is essentially a way of representing a joint probability distribution over a set of random variables in a compact and intuitive form. There are two main types of graphical models: directed and undirected. We generally use a directed model, also known as a Bayesian network, when there is mostly a causal relationship between the random variables. Graphical models also give us tools to operate on these models to find conditional and marginal probabilities of variables, while keeping the computational complexity under control.


Probability theory

To understand the concepts of probability theory, let's start with a real-life situation. Let's assume we want to go for an outing on a weekend. There are a lot of things to consider before going: the weather conditions, the traffic, and many other factors. If the weather is windy or cloudy, then it is probably not a good idea to go out. However, even if we have information about the weather, we cannot be completely sure whether to go or not; hence we have used the words probably or maybe. Similarly, if it is windy in the morning (or at the time we took our observations), we cannot be completely certain that it will be windy throughout the day. The same holds for cloudy weather; it might turn out to be a very pleasant day. Further, we are not completely certain of our observations. There are always some limitations in our ability to observe, and sometimes these observations can even be noisy. In short, uncertainty or randomness is the innate nature of the world. Probability theory provides us with the necessary tools to study this uncertainty, and it helps us reason about outcomes that are unlikely yet still possible.

Random variable

Probability deals with the study of events. Intuitively, we can say that some events are more likely than others, but to quantify the likelihood of a particular event, we require probability theory. It helps us predict the future by assessing how likely each outcome is.

Before going deeper into probability theory, let's first get acquainted with its basic terminology and definitions. A random variable is a way of representing an attribute of an outcome. Formally, a random variable X is a function that maps the set of possible outcomes Ω to some set E, which is represented as follows:

X : Ω → E

As an example, let us consider the outing example again. To decide whether or not to go, we may consider the skycover (to check whether it is cloudy or not). Skycover is an attribute of the day. Mathematically, the random variable skycover (X) is interpreted as a function that maps the day (Ω) to its skycover value (E). So, when we say the event X = 40.1, it represents the set of all days {ω} such that f_skycover(ω) = 40.1, where f_skycover is the mapping function. Formally speaking, the event is the set {ω : f_skycover(ω) = 40.1}.
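To make this concrete, here is a tiny Python sketch (the day names and skycover values are made up for illustration) that treats skycover as a function from outcomes, that is, days, to values, and builds the event X = 40.1 as a set of days:

# A minimal sketch: a random variable as a function from outcomes to values.
# The days and skycover percentages below are made-up illustrative data.
skycover = {
    'day1': 40.1,
    'day2': 70.0,
    'day3': 40.1,
    'day4': 10.5,
}

# X maps an outcome (a day) to its skycover value.
def X(day):
    return skycover[day]

# The event "X = 40.1" is the set of all days mapped to the value 40.1.
event = {day for day in skycover if X(day) == 40.1}
print(event)  # {'day1', 'day3'} (set ordering may vary)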

Random variables can be either discrete or continuous. A discrete random variable can take only a finite number of values. For example, the random variable representing the outcome of a coin toss can take only two values, heads or tails, and hence it is discrete. A continuous random variable, on the other hand, can take an infinite number of values. For example, a variable representing the speed of a car can take any real value.

For any event whose outcome is represented by some random variable (X), we can assign a value to each of the possible outcomes of X, representing how probable that outcome is. This is known as the probability distribution of the random variable and is denoted by P(X).

For example, consider a set of restaurants. Let X be a random variable representing the quality of food in a restaurant. It can take values from the set {good, bad, average}. P(X) represents the probability distribution of X: say P(X = good) = 0.3, P(X = average) = 0.5, and P(X = bad) = 0.2. This means there is a 30 percent chance of a restaurant serving good food, a 50 percent chance of it serving average food, and a 20 percent chance of it serving bad food.
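As a small illustration in plain Python (not pgmpy), this distribution can be written as a dictionary whose values sum to 1:

# P(X): the probability distribution over the quality of food in a restaurant.
P_quality = {'good': 0.3, 'average': 0.5, 'bad': 0.2}

# The probabilities over all possible outcomes must sum to 1.
assert abs(sum(P_quality.values()) - 1.0) < 1e-9
print(P_quality['good'])  # 0.3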

Independence and conditional independence

In most situations, we are more interested in looking at multiple attributes at the same time. For example, to choose a restaurant, we won't be looking just at the quality of food; we might also want to look at other attributes, such as the cost, location, size, and so on. We can have a probability distribution over a combination of these attributes as well. This type of distribution is known as a joint probability distribution. Going back to our restaurant example, let the random variable for the quality of food be represented by Q and the cost of food by C. Q can have three categorical values, namely {good, average, bad}, and C can have the values {high, low}. So, the joint distribution P(Q, C) would have probability values for all the combinations of states of Q and C. P(Q = good, C = high) will represent the probability of a pricey restaurant with good quality food, while P(Q = bad, C = low) will represent the probability of an inexpensive restaurant with bad quality food.

Let us consider another random variable representing an attribute of a restaurant: its location L. The cost of food in a restaurant is affected not only by the quality of food but also by the location (generally, a restaurant located in a very good location is more costly than one in a not-so-good location). From our intuition, we can say that the probability of a restaurant being costly given that it is at a very good location in the city is different (generally higher) from simply the probability of a restaurant being costly. Likewise, the probability of a restaurant being cheap given that it is at a prime location of the city is different (generally lower) from simply the probability of it being cheap. Formally speaking, P(C = high | L = good) will be different from P(C = high), and P(C = low | L = good) will be different from P(C = low). This indicates that the random variables C and L are not independent of each other.

These attributes or random variables need not always be dependent on each other. For example, the quality of food doesn't depend upon the location of the restaurant. So, P(Q = good | L = good) or P(Q = good | L = bad) would be the same as P(Q = good); that is, our estimate of the quality of food of the restaurant will not change even if we know its location. Hence, these random variables are independent of each other.

In general, two random variables X and Y can be considered independent of each other, if:

P(Y | X) = P(Y)

They may also be considered independent if:

P(X, Y) = P(X) P(Y)

We can easily derive this conclusion. We know the following from the chain rule of probability:

P(X, Y) = P(X) P(Y | X)

If Y is independent of X, that is, if X ⊥ Y, then P(Y | X) = P(Y). Then:

P(X, Y) = P(X) P(Y)

Extending this result to multiple variables, we can easily conclude that a set of random variables are independent of each other if their joint probability distribution is equal to the product of the probabilities of each individual random variable.
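As a small illustration in plain Python (the numbers are illustrative), we can check independence numerically by comparing a joint distribution with the product of its marginals:

from itertools import product

# Illustrative marginals for quality (Q) and location (L).
P_Q = {'good': 0.3, 'average': 0.5, 'bad': 0.2}
P_L = {'good': 0.6, 'bad': 0.4}

# A joint distribution, built here as the product of the marginals.
P_QL = {(q, l): P_Q[q] * P_L[l] for q, l in product(P_Q, P_L)}

# Q and L are independent exactly when P(Q, L) = P(Q) P(L) for every combination.
independent = all(abs(P_QL[q, l] - P_Q[q] * P_L[l]) < 1e-9
                  for q, l in product(P_Q, P_L))
print(independent)  # True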

Sometimes, however, the variables might not be independent of each other. To make this clearer, let's add another random variable, the number of people visiting the restaurant, N. Let's assume that, from our experience, we know that the number of people visiting depends only on the cost of food at the restaurant and its location (generally, fewer people visit costly restaurants). Does the quality of food Q affect the number of people visiting the restaurant? To answer this question, let's look at the random variables affecting N: the cost C and the location L. As C is directly affected by Q, we can conclude that Q affects N. However, let's consider a situation in which we already know that the restaurant is costly, that is, C = high, and ask the same question: does the quality of food affect the number of people coming to the restaurant? The answer is no. The number of people coming depends only on the price and location, so if we know that the cost is high, we can conclude that fewer people will visit, irrespective of the quality of food. Hence, N is conditionally independent of Q once the cost and location are known, written N ⊥ Q | C, L.

This type of independence is called conditional independence.
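To see what conditional independence means numerically, here is a plain-Python sketch with made-up numbers. It builds a joint distribution over Q, C, and N in which N depends only on C (the location L is left out to keep the sketch short), and then checks that P(N | C, Q) does not change when Q changes:

from itertools import product

# Illustrative distributions, made up for this sketch.
P_Q = {'good': 0.3, 'average': 0.5, 'bad': 0.2}          # quality of food
P_C_given_Q = {                                           # cost given quality
    'good':    {'high': 0.8, 'low': 0.2},
    'average': {'high': 0.6, 'low': 0.4},
    'bad':     {'high': 0.1, 'low': 0.9},
}
P_N_given_C = {                                           # footfall given cost only
    'high': {'many': 0.4, 'few': 0.6},
    'low':  {'many': 0.7, 'few': 0.3},
}

# Joint distribution P(Q, C, N) = P(Q) P(C | Q) P(N | C).
P_QCN = {(q, c, n): P_Q[q] * P_C_given_Q[q][c] * P_N_given_C[c][n]
         for q, c, n in product(P_Q, ('high', 'low'), ('many', 'few'))}

def P_N_given_CQ(n, c, q):
    """P(N = n | C = c, Q = q) computed from the joint distribution."""
    numerator = P_QCN[q, c, n]
    denominator = sum(P_QCN[q, c, n2] for n2 in ('many', 'few'))
    return numerator / denominator

# N is conditionally independent of Q given C: the value of q does not matter.
print(round(P_N_given_CQ('many', 'high', 'good'), 6))  # 0.4
print(round(P_N_given_CQ('many', 'high', 'bad'), 6))   # 0.4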

Installing tools

Let's now see some coding examples using pgmpy, to represent joint distributions and independencies. Here, we will mostly work with IPython and pgmpy (and a few other libraries) for coding examples. So, before moving ahead, let's get a basic introduction to these.

IPython

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, which offers enhanced introspection, rich media, additional shell syntax, tab completion, and a rich history. IPython provides the following features:

  • Powerful interactive shells (terminal and Qt-based)
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media
  • Support for interactive data visualization and use of GUI toolkits
  • Flexible and embeddable interpreters to load into one's own projects
  • Easy-to-use and high performance tools for parallel computing

You can install IPython using the following command:

>>> pip3 install ipython

To start the IPython command shell, you can simply type ipython3 in the terminal. For more installation instructions, you can visit http://ipython.org/install.html.

pgmpy

pgmpy is a Python library to work with Probabilistic Graphical models. As it's currently not on PyPi, we will need to build it manually. You can get the source code from the Git repository using the following command:

>>> git clone https://github.com/pgmpy/pgmpy

Now, cd into the cloned directory, switch to the branch for the version used here, and build it with the following code:

>>> cd pgmpy
>>> git checkout book/v0.1
>>> sudo python3 setup.py install

For more installation instructions, you can visit http://pgmpy.org/install.html.

With both IPython and pgmpy installed, you should now be able to run the examples.

Representing independencies using pgmpy

To represent independencies, pgmpy has two classes, namely IndependenceAssertion and Independencies. The IndependenceAssertion class is used to represent individual assertions of the form (X ⊥ Y) or (X ⊥ Y | Z). Let's see some code to represent assertions:

# Firstly we need to import IndependenceAssertion
In [1]: from pgmpy.independencies import IndependenceAssertion
# Each assertion is in the form of [X, Y, Z] meaning X is
# independent of Y given Z.
In [2]: assertion1 = IndependenceAssertion('X', 'Y')
In [3]: assertion1
Out[3]: (X _|_ Y)

Here, assertion1 represents that the variable X is independent of the variable Y. To represent conditional assertions, we just need to add a third argument to IndependenceAssertion:

In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z')
In [5]: assertion2
Out [5]: (X _|_ Y | Z)

In the preceding example, assertion2 represents (X ⊥ Y | Z), that is, X is independent of Y given Z.

IndependenceAssertion also allows us to represent assertions over multiple variables, such as (X ⊥ Y, Z). To do this, we just need to pass a list of random variables as arguments:

# Passing a list of variables as the second argument
In [6]: assertion3 = IndependenceAssertion('X', ['Y', 'Z'])
In [7]: assertion3
Out[7]: (X _|_ Y, Z)

Moving on to the Independencies class: an Independencies object is used to represent a set of assertions. Often, in the case of Bayesian or Markov networks, we have more than one independence assertion corresponding to a given model, and to represent these assertions we generally use an Independencies object. Let's take a few examples:

In [8]: from pgmpy.independencies import Independencies
# There are multiple ways to create an Independencies object, we
# could either initialize an empty object or initialize with some
# assertions.
 
In [9]: independencies = Independencies() # Empty object
In [10]: independencies.get_assertions()
Out[10]: []
 
In [11]: independencies.add_assertions(assertion1, assertion2)
In [12]: independencies.get_assertions()
Out[12]: [(X _|_ Y), (X _|_ Y | Z)]

We can also directly initialize Independencies in these two ways:

In [13]: independencies = Independencies(assertion1, assertion2)
In [14]: independencies = Independencies(['X', 'Y'],
                                         ['A', 'B', 'C'])
In [15]: independencies.get_assertions()
Out[15]: [(X _|_ Y), (A _|_ B | C)]

Representing joint probability distributions using pgmpy

We can also represent joint probability distributions using pgmpy's JointProbabilityDistribution class. Let's say we want to represent the joint distribution over the outcomes of tossing two fair coins. In this case, the probability of each of the four possible outcomes would be 0.25, which is shown as follows:

In [16]: from pgmpy.factors import JointProbabilityDistribution as Joint
In [17]: distribution = Joint(['coin1', 'coin2'],
                              [2, 2],
                              [0.25, 0.25, 0.25, 0.25])

Here, the first argument is a list of the names of the random variables. The second argument is a list of the number of states of each random variable. The third argument is a list of probability values, assuming that the first variable changes its state the slowest. So, the preceding distribution represents the following:

In [18]: print(distribution)
+---------+---------+------------------+
| coin1   | coin2   |   P(coin1,coin2) |
+---------+---------+------------------+
| coin1_0 | coin2_0 |   0.2500         |
+---------+---------+------------------+
| coin1_0 | coin2_1 |   0.2500         |
+---------+---------+------------------+
| coin1_1 | coin2_0 |   0.2500         |
+---------+---------+------------------+
| coin1_1 | coin2_1 |   0.2500         |
+---------+---------+------------------+

We can also conduct independence queries over these distributions in pgmpy:

In [19]: distribution.check_independence('coin1', 'coin2')
Out[19]: True

Conditional probability distribution

Let's take an example to understand conditional probability better. Let's say we have a bag containing three apples and five oranges, and we want to randomly take fruits out of the bag, one at a time, without replacing them. Let the random variables X1 and X2 represent the outcomes of the first try and the second try respectively. As there are three apples and five oranges in the bag initially, P(X1 = apple) = 3/8 and P(X1 = orange) = 5/8. Now, let's say that in our first attempt we got an orange. We cannot simply state the probability of getting an apple or an orange in our second attempt; the probabilities in the second attempt depend on the outcome of our first attempt, and therefore we use conditional probability to represent such cases. In the second attempt, we have the following probabilities that depend on the outcome of our first try: P(X2 = apple | X1 = orange) = 3/7, P(X2 = orange | X1 = orange) = 4/7, P(X2 = apple | X1 = apple) = 2/7, and P(X2 = orange | X1 = apple) = 5/7.

The Conditional Probability Distribution (CPD) of two variables X1 and X2 can be represented as P(X2 | X1), the probability of X2 given X1, that is, the probability of X2 after the event X1 has occurred and we know its outcome. Similarly, we can have P(X1 | X2), the probability of X1 after having an observation for X2.
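As a quick check of the numbers above, the following plain-Python sketch computes these probabilities by simple counting, drawing from a bag of three apples and five oranges without replacement:

from fractions import Fraction

def draw_probabilities(apples, oranges):
    """Return P(apple) and P(orange) for a single draw from the bag."""
    total = apples + oranges
    return Fraction(apples, total), Fraction(oranges, total)

# First draw: 3 apples and 5 oranges in the bag.
print(*draw_probabilities(3, 5))  # 3/8 5/8

# Second draw, given the first draw was an orange (one orange removed).
print(*draw_probabilities(3, 4))  # 3/7 4/7

# Second draw, given the first draw was an apple (one apple removed).
print(*draw_probabilities(2, 5))  # 2/7 5/7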

The simplest representation of a CPD is the tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of states of the random variables and the probabilities corresponding to those states. Let's consider the earlier restaurant example.

Let's begin by representing the marginal distribution of the quality of food, Q. As we mentioned earlier, it can take three values. For example, P(Q) can be represented in tabular form as follows:

Quality    P(Q)
Good       0.3
Normal     0.5
Bad        0.2

Similarly, let's say P(L) is the probability distribution of the location of the restaurant. It can be represented in tabular form as follows:

Location    P(L)
Good        0.6
Bad         0.4

As the cost of restaurant C depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C, given Q and L:

                   Location: Good               Location: Bad
Quality            Good    Normal   Bad         Good    Normal   Bad

Cost = High        0.8     0.6      0.1         0.6     0.6      0.05
Cost = Low         0.2     0.4      0.9         0.4     0.4      0.95

Representing CPDs using pgmpy

Let's first see how to represent the tabular CPD using pgmpy for variables that have no conditional variables:

In [1]: from pgmpy.factors import TabularCPD
 
# For creating a TabularCPD object we need to pass three
# arguments: the variable name, its cardinality (that is, the number
# of states of the random variable), and the probability values
# corresponding to each state.
In [2]: quality = TabularCPD(variable='Quality',
                             variable_card=3,
                             values=[[0.3], [0.5], [0.2]])
In [3]: print(quality)
+----------------+-----+
| ['Quality', 0] | 0.3 |
+----------------+-----+
| ['Quality', 1] | 0.5 |
+----------------+-----+
| ['Quality', 2] | 0.2 |
+----------------+-----+
In [4]: quality.variables
Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0),
                                 State(var='Quality', state=1),
                                 State(var='Quality', state=2)])])
 
In [5]: quality.cardinality
Out[5]: array([3])
 
In [6]: quality.values
Out[6]: array([0.3, 0.5, 0.2])

You can see here that the values of the CPD are stored as a 1D array instead of the 2D array that we passed as an argument: pgmpy internally stores the values of a TabularCPD as a flattened numpy array.

In [7]: location = TabularCPD(variable='Location',
                              variable_card=2,
                              values=[[0.6], [0.4]])
In [8]: print(location)
+-----------------+-----+
| ['Location', 0] | 0.6 |
+-----------------+-----+
| ['Location', 1] | 0.4 |
+-----------------+-----+

However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable:

In [9]: cost = TabularCPD(
                     variable='Cost',
                     variable_card=2,
                     values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],
                             [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]],
                     evidence=['Q', 'L'],
                     evidence_card=[3, 2])
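As a quick sanity check, independent of how pgmpy stores the table, every column of a conditional probability table must sum to 1: for each combination of the evidence states, the probabilities of all the states of Cost add up to 1. A minimal numpy sketch:

import numpy as np

# The same values that were passed to the Cost CPD above.
cost_values = np.array([[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],
                        [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]])

# Every column (one per combination of evidence states) must sum to 1.
print(cost_values.sum(axis=0))   # [1. 1. 1. 1. 1. 1.]
assert np.allclose(cost_values.sum(axis=0), 1.0)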

Graph theory

The second major framework for the study of probabilistic graphical models is graph theory. Graphs are the skeleton of PGMs, and are used to compactly encode the independence conditions of a probability distribution.

Nodes and edges

The foundation of graph theory was laid by Leonhard Euler when he solved the famous Seven Bridges of Konigsberg problem. The city of Konigsberg, set on both sides of the Pregel river, included two islands that were connected to each other and to the mainland by seven bridges. The problem was to find a walk that crosses each of the bridges exactly once.

To visualize the problem, let's think of the graph in Fig 1.1:


Fig 1.1: The Seven Bridges of Konigsberg graph

Here, the nodes a, b, c, and d represent the land masses and are known as the vertices of the graph. The line segments connecting the land masses, the two bridges between a and b, the two bridges between b and c, and the bridges cd, da, and bd, are known as the edges of the graph. So, we can think of the problem of crossing all the bridges once in a single walk as tracing along all the edges of the graph without lifting our pencil.

Formally, a graph G = (V, E) is an ordered pair of finite sets. The elements of the set V are known as the nodes or the vertices of the graph, and the elements of the set E are the edges or the arcs of the graph. The number of nodes, or the cardinality of G, denoted by |V|, is known as the order of the graph. Similarly, the number of edges, denoted by |E|, is known as the size of the graph. Here, we can see that the Konigsberg city graph shown in Fig 1.1 is of order 4 and size 7.

In a graph, we say that two vertices u, v ∈ V are adjacent if (u, v) ∈ E. In the City graph, for example, a and b are adjacent because there is an edge between them. Also, for a vertex v ∈ V, we define the neighbors of v as N(v) = {u ∈ V | (u, v) ∈ E}. In the City graph, we can see that b and d are the neighbors of c. Similarly, a, b, and c are the neighbors of d.

We define an edge to be a self loop if the start vertex and the end vertex of the edge are the same. More formally, any edge of the form (u, u), where u ∈ V, is a self loop.

Until now, we have been talking only about graphs whose edges don't have a direction associated with them, which means that the edge (u, v) is the same as the edge (v, u). These types of graphs are known as undirected graphs. Similarly, we can think of a graph whose edges have a sense of direction associated with them. For these graphs, the edge set E would be a set of ordered pairs of vertices. These types of graphs are known as directed graphs. In the case of a directed graph, we also define the indegree and the outdegree of a vertex. For a vertex v ∈ V, we define its outdegree as the number of edges originating from the vertex v, that is, deg+(v) = |{u ∈ V : (v, u) ∈ E}|. Similarly, the indegree is defined as the number of edges that end at the vertex v, that is, deg-(v) = |{u ∈ V : (u, v) ∈ E}|.
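To make these definitions concrete, here is a minimal plain-Python sketch (the vertices and edges are arbitrary illustrative data, not the City graph) that represents a directed graph as a list of ordered pairs and computes the indegree and outdegree of a vertex:

# A small directed graph as a set of vertices and a list of ordered (u, v) edges.
V = {'a', 'b', 'c', 'd'}
E = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]

def outdegree(v, edges):
    """Number of edges originating from v."""
    return sum(1 for u, w in edges if u == v)

def indegree(v, edges):
    """Number of edges ending at v."""
    return sum(1 for u, w in edges if w == v)

print(outdegree('a', E), indegree('c', E))  # 2 2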

Walks, paths, and trails

For a graph G = (V, E) and u, v ∈ V, we define a u - v walk as an alternating sequence of vertices and edges, starting with u and ending with v. In the City graph of Fig 1.1, one example of an a - d walk is W : a, (a, b), b, (b, c), c, (c, d), d.

If there aren't multiple edges between the same vertices, then we can simply represent a walk by its sequence of vertices, as in the case of the Butterfly graph shown in Fig 1.2, where we can have a walk W : a, c, d, c, e:


Fig 1.2: Butterfly graph, an undirected graph

A walk with no repeated edges is known as a trail. For example, the walk a, b, c, b, d in the City graph, which crosses the two different bridges between b and c, is a trail. Also, a walk with no repeated vertices, except possibly the first and the last, is known as a path. For example, the walk a, b, c, d in the City graph is a path.

Also, a graph is known as cyclic if there are one or more paths that start and end at the same node. Such paths are known as cycles. Similarly, if there are no cycles in a graph, it is known as an acyclic graph.

Bayesian models

In most real-life cases, when we represent or model some event, we have to deal with a lot of random variables. Even if we consider all of them to be discrete, there would still be an exponentially large number of values in the joint probability distribution. Dealing with such a huge amount of data would be computationally expensive (and, in some cases, even intractable), and it would also require a huge amount of memory to store the probability of each combination of states of these random variables.

However, in most cases, many of these variables are marginally or conditionally independent of each other. By exploiting these independencies, we can reduce the number of values we need to store to represent the joint probability distribution.

For instance, in the previous restaurant example, the joint probability distribution over the four random variables that we discussed (that is, quality of food Q, location of the restaurant L, cost of food C, and the number of people visiting N, which we take to have two states) would require us to store 23 independent values: 3 × 2 × 2 × 2 = 24 combinations, minus one because the probabilities must sum to 1. By the chain rule of probability, we know the following:

P(Q, L, C, N) = P(Q) P(L|Q) P(C|L, Q) P(N|C, Q, L)

Now, let us try to exploit the marginal and conditional independencies between the variables to make the representation more compact. Let's start by considering the independence between the location of the restaurant and the quality of food there. As these two attributes are independent of each other, P(L|Q) would be the same as P(L); therefore, we need to store only one parameter to represent it. From the conditional independence that we saw earlier, we know that N ⊥ Q | C, L. Thus, P(N|C, Q, L) would be the same as P(N|C, L), which needs only four parameters. Therefore, we now need only 2 + 1 + 6 + 4 = 13 parameters to represent the whole distribution.

We can conclude that exploiting independencies helps in the compact representation of joint probability distribution. This forms the basis for the Bayesian network.

Representation

A Bayesian network is represented by a Directed Acyclic Graph (DAG) and a set of Conditional Probability Distributions (CPDs) in which:

  • The nodes represent random variables
  • The edges represent dependencies
  • For each of the nodes, we have a CPD

In our previous restaurant example, the nodes would be as follows:

  • Quality of food (Q)
  • Location (L)
  • Cost of food (C)
  • Number of people (N)

As the cost of food was dependent on the quality of food (Q) and the location of the restaurant (L), there will be an edge each from Q → C and L → C. Similarly, as the number of people visiting the restaurant depends on the price of food and its location, there would be an edge each from L → N and C → N. The resulting structure of our Bayesian network is shown in Fig 1.3:


Fig 1.3: Bayesian network for the restaurant example
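As a rough sketch of how this structure might be written with pgmpy, assuming the BayesianModel class from pgmpy.models (the exact class name and API can differ between pgmpy versions), the edges of Fig 1.3 can be passed directly as a list:

# A sketch only: a BayesianModel built from the directed edges of Fig 1.3.
# The class name and API are assumptions and may vary across pgmpy versions.
from pgmpy.models import BayesianModel

restaurant_model = BayesianModel([('Q', 'C'),   # quality of food -> cost
                                  ('L', 'C'),   # location -> cost
                                  ('L', 'N'),   # location -> number of people
                                  ('C', 'N')])  # cost -> number of people

print(restaurant_model.nodes())
print(restaurant_model.edges())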

Factorization of a distribution over a network

Each node in our Bayesian network for restaurants has a CPD associated with it. For example, the CPD for the cost of food in the restaurant is P(C|Q, L), as it depends only on the quality of food and the location. For the number of people, it would be P(N|C, L). So, we can generalize that the CPD associated with each node is P(node | Par(node)), where Par(node) denotes the parents of the node in the graph. Assuming some probability values, we finally get a network as shown in Fig 1.4:


Fig 1.4: Bayesian network of restaurant along with CPDs

Let us go back to the joint probability distribution of all these attributes of the restaurant again. Considering the independencies among variables, we concluded as follows:

P(Q,C,L,N) = P(Q)P(L)P(C|Q, L)P(N|C, L)
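To make the factorization concrete, here is a small plain-Python sketch that evaluates one entry of the joint distribution from the individual factors. The values for P(Q), P(L), and P(C | Q, L) come from the tables above, while the value for P(N | C, L) is an assumed, illustrative number, since that table is not given here:

# Evaluating one entry of the joint distribution via the factorization
# P(Q, L, C, N) = P(Q) P(L) P(C | Q, L) P(N | C, L).
p_Q_good = 0.3        # P(Q = good), from the P(Q) table
p_L_good = 0.6        # P(L = good), from the P(L) table
p_C_high = 0.8        # P(C = high | Q = good, L = good), from the P(C | Q, L) table
p_N_high = 0.6        # assumed value for P(N = high | C = high, L = good)

p_joint = p_Q_good * p_L_good * p_C_high * p_N_high
print(round(p_joint, 4))  # 0.0864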

So now, looking at the Bayesian network (BN) for the restaurant, we can say that for any Bayesian network, the joint probability distribution P(X1, X2, ..., Xn) over all its random variables {X1, X2, ..., Xn} can be represented as follows:

P(X1, X2, ..., Xn) = P(X1 | Par(X1)) P(X2 | Par(X2)) ... P(Xn | Par(Xn))

This is known as the chain rule for Bayesian networks.

Also, we say that a distribution P factorizes over a graph G if P can be encoded as follows:

P(X1, X2, ..., Xn) = P(X1 | ParG(X1)) P(X2 | ParG(X2)) ... P(Xn | ParG(Xn))

Here, ParG(X) denotes the parents of X in the graph G.

Summary

In this article, we saw how we can represent a complex joint probability distribution using a directed graph and a conditional probability distribution associated with each node, which is collectively known as a Bayesian network.
