NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Yee Whye Teh is a professor in the Department of Statistics at the University of Oxford and also a research scientist at DeepMind. He works on statistical machine learning, focussing on Bayesian nonparametrics, probabilistic learning, and deep learning.

This article walks our readers through Yee’s keynote speech at NIPS 2017. The keynote explores the interface between two perspectives on machine learning, Bayesian learning and deep learning, through questions such as: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding, be sure to watch the complete keynote address by Yee Whye Teh on the NIPS Facebook page. All images in this article come from Yee’s presentation slides and do not belong to us.

The history of machine learning shows growth in both model complexity and model flexibility. Theory-led models have started to lose their shine, because machine learning is at the forefront of what could be called the data revolution, or the rise of data-led models. As opposed to theory-led models, data-led models try not to impose too many assumptions on the processes being modeled. They are instead highly flexible nonparametric models that can capture complexity, but they require large amounts of data to operate.

On the model flexibility side, various approaches have been explored over the years. We have kernel methods, Gaussian processes, Bayesian nonparametrics, and now deep learning as well. The community has also developed ever more complex frameworks, both graphical and programmatic, to compose large, complex models from simpler building blocks.

In the 1990s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is probabilistic Torch, which brings together ideas from both Bayesian learning and deep learning.

On one hand we have Bayesian learning, which treats learning as inference in a probabilistic model. On the other hand we have deep learning, which views learning as the optimization of functions parameterized by neural networks. In recent years there has been an explosion of exciting research at the interface of these two popular approaches, resulting in increasingly complex and exciting models.

What is the Bayesian theory of learning?

Bayesian learning describes an ideal learner who interacts with the world in order to learn its state, which is given by θ. The learner makes some observations x about the world and, in the Bayesian context, reasons about them through a model. This model is a joint distribution over both the unknown state of the world θ and the observation x.

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-0

The model consists of a prior distribution p(θ) over the state and a conditional distribution p(x | θ) of the observation given the state (the likelihood). Combining the two through Bayes' rule gives the reverse conditional distribution p(θ | x), also known as the posterior, which describes the totality of the agent's knowledge about the world after seeing x. This posterior can also be used to predict future observations and act accordingly.

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-1
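As a minimal sketch of this update, assume a toy world with three possible states θ (the bias of a coin) and a uniform prior. The function name, the states, and the numbers are illustrative assumptions and do not come from the talk. The posterior is the prior times the likelihood, renormalized by the marginal probability of the observation:

```python
def posterior(prior, likelihood, x):
    """Return p(theta | x) for a discrete prior and a likelihood table."""
    # Joint p(theta, x) = p(theta) * p(x | theta) for the observed x
    joint = {theta: prior[theta] * likelihood[theta][x] for theta in prior}
    # Marginal p(x): sum the joint over every possible state
    evidence = sum(joint.values())
    # Bayes' rule: divide by p(x) so the posterior sums to one
    return {theta: p / evidence for theta, p in joint.items()}


# Hypothetical example: theta is the unknown probability that a coin lands heads
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
likelihood = {theta: {"H": theta, "T": 1 - theta} for theta in prior}
post = posterior(prior, likelihood, "H")
print(post)
```

After observing a single head, the posterior places more mass on the larger values of θ than the uniform prior did.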

Issues associated with Bayesian learning

1. Rigidity:

Learning can be wrong if the model is wrong
Not all prior knowledge can be encoded as a joint distribution
Simple analytic forms are limiting for conditional distributions

2. Scalability:

The posterior is usually intractable to compute, so approximations have to be made, which introduces trade-offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable.

To address these issues, the speaker highlights some of his recent projects, which showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or, in reverse, where Bayesian ideas are applied to neural networks (i.e. Bayesian Deep learning).

Deep Bayesian learning: Deep learning assists Bayesian learning

Deep learning can improve Bayesian learning in the following ways:

Improve the modeling flexibility by using neural networks in the construction of Bayesian models
Improve the inference and scalability of these methods by parameterizing the posterior using neural networks
Amortizing inference over multiple runs

These can be seen in the following projects showcased by Yee:

Concrete VAEs (Variational Autoencoders)

FIVO: Filtered Variational Objectives

Concrete VAEs

What are VAEs?

All the qualities mentioned above, i.e. improving modeling flexibility, improving inference and scalability, and amortizing inference over multiple runs by using neural networks, can be seen in a class of deep generative models known as VAEs (Variational Autoencoders).

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-2 Fig: Variational Autoencoders

VAEs include latent variables that describe the contents of a scene, e.g. objects and pose. The relationship between these latent variables and the pixels has to be highly complex and nonlinear. So, in short, VAEs use neural networks to parameterize both the generative distribution and the variational posterior distribution, which allows for much more flexible modeling.

The key that makes VAEs work is the reparameterization trick:

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-3

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-4 Fig: Adding reparameterization to VAEs
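The following is a minimal sketch of the reparameterization trick for a single Gaussian latent variable, not the VAE from the talk. It assumes a toy objective f(z) = z² and hypothetical function names. Writing z = μ + σ·ε with ε drawn from a fixed standard normal lets the gradient with respect to μ and σ pass through the sample by the chain rule:

```python
import random


def sample_reparameterized(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps."""
    eps = rng.gauss(0.0, 1.0)  # noise does not depend on mu or sigma
    return mu + sigma * eps, eps


def reparameterized_gradients(mu, sigma, n_samples, rng):
    """Monte Carlo gradients of E[z^2] with respect to mu and sigma."""
    d_mu = 0.0
    d_sigma = 0.0
    for _ in range(n_samples):
        z, eps = sample_reparameterized(mu, sigma, rng)
        df_dz = 2.0 * z  # derivative of the toy objective f(z) = z^2
        d_mu += df_dz * 1.0  # dz/dmu = 1
        d_sigma += df_dz * eps  # dz/dsigma = eps
    return d_mu / n_samples, d_sigma / n_samples


rng = random.Random(0)
d_mu, d_sigma = reparameterized_gradients(1.0, 0.5, 20000, rng)
print(d_mu, d_sigma)
```

For this toy objective E[z²] = μ² + σ², so the exact gradients are 2μ and 2σ, and the Monte Carlo estimates should land close to those values.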

The reparameterization trick works for the continuous latent variables in VAEs, but many models naturally include discrete latent variables. Yee suggests applying reparameterization to a continuous relaxation of the discrete latent variables as a workaround.

This brings us to the concept of Concrete VAEs.

The name Concrete stands for CONtinuous relaxation of disCRETE distributions.

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-5

The density of this distribution can also be calculated in closed form.

The Concrete distribution provides a reparameterization trick for discrete variables, and its density helps in calculating the KL divergence that is needed for variational inference.
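A minimal sketch of drawing one Concrete sample is shown below, assuming the common construction that adds Gumbel noise to the logits and applies a tempered softmax. The function name, the logits, and the temperatures are illustrative assumptions:

```python
import math
import random


def sample_concrete(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a Concrete distribution."""
    scores = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        scores.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, tempered logits
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [math.log(p) for p in (0.2, 0.3, 0.5)]  # hypothetical class probabilities
soft = sample_concrete(logits, 1.0, rng)   # smooth point on the simplex
sharp = sample_concrete(logits, 0.1, rng)  # closer to a one-hot vector
print(soft, sharp)
```

The sample is a differentiable function of the logits given the noise, which is what makes the reparameterization trick applicable. Lower temperatures push samples towards one-hot vectors at the cost of noisier gradients.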

FIVO: Filtered Variational Objectives

FIVO extends VAEs towards models for sequential and time series data. It is built upon another extension of VAEs known as the Importance Weighted Autoencoder (IWAE), a generative model with an architecture similar to that of the VAE, but which uses a strictly tighter log-likelihood lower bound.

Variational lower bound: nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-7

Rederivation from importance sampling: nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-8

Better to use multiple samples: nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-9
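To make the effect of multiple samples concrete, here is a minimal sketch on a toy model chosen so that the true log-likelihood is known: z ~ N(0, 1) and x | z ~ N(z, 1), hence p(x) = N(x; 0, 2). The model, the deliberately crude proposal (the prior itself), and the function names are assumptions for illustration, not the setup from the talk:

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def iwae_bound(x, k, n_repeats, rng):
    """Estimate E[log (1/k) sum_i w_i] with importance weights w_i."""
    total = 0.0
    for _ in range(n_repeats):
        log_w = []
        for _ in range(k):
            z = rng.gauss(0.0, 1.0)  # proposal q(z | x) is the prior (assumption)
            # log w = log p(x, z) - log q(z | x) = log p(x | z) for this proposal
            log_w.append(log_normal_pdf(x, z, 1.0))
        # log-mean-exp of the weights, computed stably
        top = max(log_w)
        total += top + math.log(sum(math.exp(v - top) for v in log_w) / k)
    return total / n_repeats


rng = random.Random(0)
x = 1.5
exact = log_normal_pdf(x, 0.0, 2.0)         # true log p(x)
bound_1 = iwae_bound(x, 1, 2000, rng)       # k = 1 is the ordinary variational bound
bound_50 = iwae_bound(x, 50, 2000, rng)     # k = 50 importance samples
print(bound_1, bound_50, exact)
```

With one sample the estimate sits well below the true log-likelihood, while with fifty samples it should move much closer to it, which is the tightening that the slides describe.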

With Importance Weighted Autoencoders we can use multiple samples, which gives a tighter lower bound, and optimizing this tighter bound should lead to better learning. Let’s have a look at the FIVO objectives:

We can use any unbiased estimator of the marginal probability p(X): nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-10
The tightness of the bound is related to the variance of the estimator.
For sequential models, we can use particle filters, which produce an unbiased estimator of the marginal probability. They can also have much lower variance than importance samplers.
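The sketch below shows how a bootstrap particle filter produces such an estimate, assuming a toy linear Gaussian state space model (z_t = a·z_{t-1} + noise, x_t = z_t + noise). The model, its parameters, and the function names are illustrative assumptions and are not the models used in FIVO. The product of the average weights at each step estimates p(X), and its logarithm is the quantity a filtered objective would maximize:

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def particle_filter_log_marginal(xs, n_particles, rng, a=0.9, q=1.0, r=1.0):
    """Bootstrap particle filter estimate of log p(x_1, ..., x_T)."""
    # Initial particles from the assumed prior z_1 ~ N(0, 1)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    log_marginal = 0.0
    for t, x in enumerate(xs):
        if t > 0:
            # Propagate each particle through the transition z_t = a z_{t-1} + noise
            particles = [a * z + rng.gauss(0.0, math.sqrt(q)) for z in particles]
        # Weight particles by the observation likelihood p(x_t | z_t)
        weights = [normal_pdf(x, z, r) for z in particles]
        # The average weight estimates p(x_t | x_1, ..., x_{t-1})
        log_marginal += math.log(sum(weights) / n_particles)
        # Multinomial resampling focuses particles on plausible states
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return log_marginal


rng = random.Random(0)
xs = [0.3, -0.5, 1.2, 0.8, -0.1]  # hypothetical observations
estimate = particle_filter_log_marginal(xs, 2000, rng)
print(estimate)
```

Resampling at every step is the simplest choice. It is what keeps the variance lower than plain importance sampling over whole sequences, although practical implementations often resample only when the weights become too uneven.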

Bayesian Deep learning:

The Bayesian approach to deep learning gives us counterintuitive and surprising ways to make deep learning scalable. To explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named the Posterior Server.

The Posterior server

The posterior server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable.

This project focuses on distributed learning, where both the data and the computations can be spread across the network.

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-12

The figure above shows a number of workers, each of which communicates with the parameter server, which effectively maintains the authoritative copy of the parameters of the network.

At each iteration, each worker obtains the latest copy of the parameters from the server, computes a gradient update based on its own data, and sends it back to the server, which then applies it to the authoritative copy. Communication across the network tends to be slower than the computation done on each worker. Hence, one might take multiple gradient steps on a worker in each iteration before it sends the accumulated update back to the parameter server.
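A toy, single-process sketch of this loop is given below. It assumes a one-dimensional parameter, a squared-error loss, synchronous rounds, and a server that averages the workers' accumulated updates. All of these choices and the function names are simplifying assumptions rather than the system described in the talk:

```python
def local_update(theta, shard, learning_rate, local_steps):
    """Run several gradient steps on one worker and return the accumulated change."""
    local = theta  # the worker starts from the server's latest copy
    for _ in range(local_steps):
        # Gradient of the mean of 0.5 * (local - x)^2 over the worker's shard
        grad = sum(local - x for x in shard) / len(shard)
        local -= learning_rate * grad
    return local - theta


def parameter_server_round(theta, shards, learning_rate, local_steps):
    """One synchronous round: every worker reports, the server averages."""
    updates = [local_update(theta, s, learning_rate, local_steps) for s in shards]
    return theta + sum(updates) / len(updates)


shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical data on three workers
theta = 0.0
for _ in range(50):
    theta = parameter_server_round(theta, shards, 0.1, 5)
print(theta)
```

In this simple convex setting with equally sized shards the server settles on the overall data mean. The difficulties discussed next arise when workers run asynchronously on non-convex neural network losses.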

The problem is that the parameters on the worker quickly get out of sync with the authoritative copy on the parameter server. This leads to stale updates, which inject noise into the system, so we often need frequent synchronization across the network for the algorithm to learn in a stable fashion. The main idea in the Bayesian context is that we don't want just a single parameter setting, we want a whole distribution over parameters.

This relaxes the need for frequent synchronization across the network and hopefully leads to algorithms that are robust to less frequent communication.

nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-13

Each worker constructs its own tractable approximation to its own likelihood function and sends this information to the posterior server, which then combines these approximations to form the full posterior or an approximation of it. Further, the approximations are based on the statistics of a sampling algorithm that runs locally on that worker.
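The combining step can be sketched in a setting where everything is exactly Gaussian. The sketch assumes an unknown mean θ with a Gaussian prior and Gaussian observation noise of known variance, so each worker's likelihood is itself Gaussian in θ and the server only has to add natural parameters. The function names and data are hypothetical. For neural networks the local approximations are not exact and have to be refined iteratively:

```python
def local_natural_params(data, noise_var):
    """Gaussian form of one worker's likelihood in theta, as natural parameters."""
    precision = len(data) / noise_var          # how strongly this shard pins down theta
    precision_times_mean = sum(data) / noise_var
    return precision, precision_times_mean


def combine(prior_params, worker_params):
    """Server side: multiply Gaussians by adding their natural parameters."""
    precision = prior_params[0] + sum(p for p, _ in worker_params)
    shift = prior_params[1] + sum(h for _, h in worker_params)
    return shift / precision, 1.0 / precision  # posterior mean and variance


shards = [[1.0, 2.0], [3.0], [2.0, 2.0, 2.0]]  # hypothetical data on three workers
prior = (1.0 / 100.0, 0.0)                     # assumed prior N(0, 10^2) on theta
messages = [local_natural_params(s, 1.0) for s in shards]
post_mean, post_var = combine(prior, messages)
print(post_mean, post_var)
```

Each worker sends only two numbers, and the result matches the posterior that a single machine holding all the data would compute.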

The actual algorithm combines a variational algorithm, Stochastic Gradient EP, with Markov chain Monte Carlo (MCMC) on the workers themselves. The variational part handles communication across the network, whereas the MCMC part samples from the posterior to construct the statistics that the variational part needs. For scalability, the sampler is a stochastic gradient Langevin algorithm, a simple generalization of SGD that injects additional noise in order to sample from the posterior. nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh-img-14
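A minimal sketch of stochastic gradient Langevin dynamics is shown below, assuming a toy problem (inferring the mean of Gaussian data under a Gaussian prior) and a constant step size for simplicity. The function name, data, and hyperparameters are illustrative assumptions. Each step is a minibatch gradient step on the log posterior plus Gaussian noise whose variance equals the step size:

```python
import math
import random


def sgld_samples(data, n_steps, step_size, batch_size, rng, prior_var=10.0):
    """Sample theta from p(theta | data) for x_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        batch = [rng.choice(data) for _ in range(batch_size)]
        grad_prior = -theta / prior_var
        # Rescale the minibatch gradient so it estimates the full-data gradient
        grad_likelihood = (n / batch_size) * sum(x - theta for x in batch)
        # Injected noise is what turns the SGD step into a sampler
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_likelihood) + noise
        samples.append(theta)
    return samples


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(100)]  # hypothetical observations
samples = sgld_samples(data, 6000, 1e-3, 10, rng)
kept = samples[1000:]                             # discard burn-in
estimate = sum(kept) / len(kept)
exact_mean = sum(data) / (len(data) + 1.0 / 10.0) # closed-form posterior mean
print(estimate, exact_mean)
```

Removing the noise term recovers plain SGD, which would converge to a single point instead of exploring the posterior.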

To experiment with this server, densely connected neural networks with 500 ReLU units were trained on the MNIST dataset. You can find a detailed walkthrough of these experiments in the keynote video.

This interface between Bayesian learning and deep learning is a very exciting frontier. Researchers have brought the management of uncertainty into deep learning, and flexibility and scalability into Bayesian modeling. Yee concludes with two questions for the audience to think about.

Does being Bayesian in the space of functions make more sense than being Bayesian in the space of parameters?
How should we deal with uncertainties under model misspecification?