The motive of this article aims to bring our readers to Yee’s keynote speech at the NIPS 2017. Yee’s keynote ponders deeply on the interface between two perspectives on machine learning: Bayesian learning and Deep learning by exploring questions like: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding of this novel approach, be sure to watch the complete keynote address by Yee Whye Teh on NIPS facebook page. All images in this article come from Yee’s presentation slides and do not belong to us.
The history of machine learning has shown a growth in both model complexity and in model flexibility. The theory led models have started to lose their shine. This is because machine learning is at the forefront of a revolution that could be called as data led models or the data revolution. As opposed to theory led models, data-led models try not to impose too many assumptions on the processes that have to be modeled and are rather superflexible non-parametric models that can capture the complexities but they require large amount of data to operate.
On the model flexibility side, we have various approaches that have been explored over the years. We have kernel methods, Gaussian processes, Bayesian nonparametrics and now we have deep learning as well. The community has also developed evermore complex frameworks both graphical and programmatic to compose large complex models from simpler building blocks.
In the 90’s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is probabilistic Torch, which brings together ideas from both the probabilistic Bayesian learning and deep learning.
On one hand we have Bayesian learning, which deals with learning as inference in some probabilistic models. On the other hand we have deep learning models, which view learning as optimization functions parametrized by neural networks. In recent years there has been an explosion of exciting research at this interface of these two popular approaches resulting in increasingly complex and exciting models.
Bayesian learning describes an ideal learner as one who interacts with the world in order to know its state, which is given by θ. He/she makes some observations about the world by deducing a model in Bayesian context. This model is a joint distribution of both the unknown state of the world θ and the observation about the world x.
The model consists of prior distribution and marginal distribution, combining which gives a reverse conditional distribution also known as posterior, which describes the totality of the agent's knowledge about the world after he/she sees x. This posterior can also be used for predicting future observations and act accordingly.
2. Scalability:
Intractable to compute this posterior and approximations have to be made, which then introduces trade offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable.
To address these issues, the speaker highlights some of his recent projects which showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or in the reverse applying Bayesian ideas to Neural Networks ( i.e. Bayesian Deep learning)
Deep learning can improve Bayesian learning in the following ways:
These can be seen in the following projects showcased by Yee:
All the qualities mentioned above, i.e. improving modeling flexibility, improving inference and scalability, and empathizing inference over multiple runs by using neural networks can be seen in a class of deep generative models known as VAE (Variational Autoencoders).
Fig: Variational Autoencoders
VAEs include latent variables that describe the contents of a scene i.e objects, pose. The relationship between these latent variables and the pixels have to be highly complex and nonlinear. So, in short, VAEs are used to parameterize generative and variable posterior distribution that allows for greater scope flexible modeling.
The key that makes VAEs work is the reparameterization trick
Fig: Adding reparameterization to VAEs
The reparameterization trick is crucial to the continuous latent variables in the VAEs. But many models naturally include discrete latent variables. Yee suggests application of the reparameterization on the discrete latent variables as a work around.
CONtinuous relaxation of disCRETE distributions.Also, the density can be further calculated:
This concrete distribution is the reparameterization trick for discrete variables which helps in calculating the KL divergence that is needed for variational inference.
FIVO extends VAEs towards models for sequential and time series data. It is built upon another extension of VAEs known as Importance Weighted Autoencoder, a generative model with a similar as that of the VAE, but which uses a strictly tighter log-likelihood lower bound.
Variational lower bound:
Rederivation from importance sampling:
Better to use multiple samples:
Using Importance Weighted Autoencoders we can use multiple sampling, with which we can get a tighter lower bound and optimizing this lower bound should lead to better learning. Let’s have a look at the FIVO objectives:
We can use any unbiased estimator p(X) of marginal probabilityTightness of bound related to variance of estimatorFor sequential models, we can use particle filters which produce unbiased estimator of marginal probability. They can also have much lower variance than importance samplers.
Bayesian approach for deep learning gives us counterintuitive and surprising ways to make deep learning scalable. In order to explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named, The posterior server.
The posterior server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable.
This project focuses on Distributed learning, where both the data and the computations can be spread across the network.
The figure above shows that there are a bunch of workers and each communicates with the parameter server, which effectively maintains the authoritative copy of the parameters of the network.
At each iteration, each worker obtains the latest copy of the parameter from the server, computes the gradient update based on its data and sends it back to the server which then updates it to the authoritative copy. So, communications on the network tend to be slower than the computations that can be done on the network. Hence, one might consider multiple gradient steps on each iteration before it sends the accumulated update back to the parameter server.
The problem is that the parameter and the worker quickly get out of sync with the authoritative copy on the parameter server. As a result, this leads to stale updates which allow noise into the system and we often need frequent synchronizations across the network for the algorithm to learn in a stable fashion. The main idea here in Bayesian context is that we don't just want a single parameter, we want a whole distribution over them.
This will then relax the need for frequent synchronizations across the network and hopefully lead to algorithms that are robust to last frequent communication.
Each worker is simply going to construct its own tractable approximation to his own likelihood function and send this information to the posterior server which then combines these approximations together to form the full posterior or an approximation of it. Further, the approximations that are constructed would be based on the statistics of some sampling algorithms that happens locally on that worker.
The actual algorithm includes a combination of the variational algorithms, Stochastic Gradient EP and the Markov chain Monte Carlo on the workers themselves. So the variational part in the algorithm handles the communication part in the network whereas the MCMC part handles the sampling part that is posterior to construct the statistics that the variational part needs. For scalability, a stochastic gradient Langevin algorithm which is a simple generalization of the SGT, which includes additional injected noise, to sample from posterior noise.
To experiment with this server, it was trained densely connected neural networks with 500 reLU units on MNIST dataset. You can have a detailed understanding of these examples in the keynote video.
This interface between Bayesian learning and deep learning is a very exciting frontier. Researchers have brought management of uncertainties within deep learning. Also, flexibility and scalability in Bayesian modeling. Yee concludes with two questions for the audience to think about.