Yoshua Bengio is a Canadian computer scientist known for his work on artificial neural networks and deep learning. His main research ambition is to understand the principles of learning that yield intelligence. Since 1999 he has co-organized the Learning Workshop with Yann LeCun, with whom he also created the International Conference on Learning Representations (ICLR). He has also organized or co-organized numerous other events, principally the deep learning workshops and symposia at NIPS and ICML since 2007.
This article discusses TwinNet (Twin Networks), a method for training an RNN to better represent the future in its internal state, by supervising its hidden states with those of a second RNN that models the sequence in reverse order.
Recurrent Neural Networks (RNNs) are the basis of state-of-the-art models for generating sequential data such as text and speech, and are usually trained by teacher forcing, which corresponds to optimizing one-step-ahead prediction. Because this training objective contains no explicit bias toward planning, the model may prefer to focus on the most recent tokens instead of capturing the subtle long-term dependencies that contribute to global coherence. Local correlations are usually stronger than long-term dependencies and thus end up dominating the learning signal. As a result, samples from RNNs tend to exhibit local coherence but lack meaningful global structure.
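To make "one-step-ahead prediction" concrete, here is a minimal sketch of teacher forcing for an RNN language model. PyTorch is used purely for illustration; the vocabulary and layer sizes are arbitrary assumptions, not details from the paper.

```python
# Minimal sketch of teacher forcing: the ground-truth prefix is fed as input
# and the model is trained to predict the next token at every step.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 128, 256  # illustrative sizes

embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss()

def teacher_forcing_loss(tokens):
    # tokens: (batch, seq_len) integer token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    hidden_states, _ = rnn(embed(inputs))      # (batch, seq_len-1, hidden_dim)
    logits = out_proj(hidden_states)           # (batch, seq_len-1, vocab_size)
    return criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
```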
Recent efforts to address this problem have involved augmenting RNNs with external memory, with unitary or hierarchical architectures, or with explicit planning mechanisms. Parallel efforts aim to prevent overfitting on strong local correlations by regularizing the states of the network, for example by applying dropout or penalizing various statistics. In this context, the paper proposes TwinNet, a simple method for regularizing a recurrent neural network that encourages it to model those aspects of the past that are predictive of the long-term future.
The paper presents a simple technique that enables generative recurrent neural networks to plan ahead. A backward RNN is trained to generate the given sequence in reverse order, and the states of the forward model are regularized to predict the cotemporal states of the backward model. The paper empirically shows that this approach achieves a 9% relative improvement on a speech recognition task, as well as a significant improvement on a COCO caption generation task.
Overall, the model is driven by the following intuitions:
(a) The backward hidden states contain a summary of the future of the sequence.
(b) To predict the future more accurately, the model will have to form a better representation of the past.
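As a rough illustration of how this pairing could look in practice, the sketch below extends the teacher-forcing example above with a twin regularizer: a forward and a backward RNN are each trained with teacher forcing, and an affine map is penalized for failing to predict the cotemporal backward state from the forward state. The layer names, the weighting, and the choice to detach the backward states are assumptions made here for illustration, not necessarily the paper's exact configuration.

```python
# Minimal PyTorch-style sketch of a TwinNet-like regularizer (illustrative only;
# sizes, the affine map g, twin_weight, and the detach() are assumed choices).
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, twin_weight = 10000, 128, 256, 0.5

embed = nn.Embedding(vocab_size, emb_dim)
fwd_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
bwd_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
fwd_out = nn.Linear(hidden_dim, vocab_size)
bwd_out = nn.Linear(hidden_dim, vocab_size)
g = nn.Linear(hidden_dim, hidden_dim)   # affine map from forward to backward state
criterion = nn.CrossEntropyLoss()

def twinnet_loss(tokens):
    # tokens: (batch, seq_len) integer token ids

    # Forward RNN: standard teacher-forced next-token prediction.
    h_f, _ = fwd_rnn(embed(tokens[:, :-1]))
    loss_f = criterion(fwd_out(h_f).reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))

    # Backward RNN: the same objective on the reversed sequence.
    rev = torch.flip(tokens, dims=[1])
    h_b, _ = bwd_rnn(embed(rev[:, :-1]))
    loss_b = criterion(bwd_out(h_b).reshape(-1, vocab_size),
                       rev[:, 1:].reshape(-1))

    # Twin penalty: pair each forward state with the backward state that is used
    # to predict the same token from the opposite direction, and penalize the
    # squared L2 distance between g(forward state) and that backward state.
    h_b_aligned = torch.flip(h_b, dims=[1])           # back to forward time order
    twin = ((g(h_f[:, :-1]) - h_b_aligned[:, 1:].detach()) ** 2).sum(-1).mean()

    return loss_f + loss_b + twin_weight * twin
```

Note that the backward network serves only as a training-time regularizer; generation still uses the forward model alone.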
The paper also demonstrates the success of the TwinNet approach experimentally, through several conditional and unconditional generation tasks that include speech recognition, image captioning, language modelling, and sequential image generation.
Overall Score: 21/30
Average Score: 7/10
The reviewers stated that the paper presents a novel approach to regularizing RNNs and reports results on different datasets, indicating a wide range of applications. However, based on the reported results, they said that further experimentation and a more extensive hyperparameter search are needed. Overall, the paper is detailed, the method is simple to implement, and positive empirical results support the described approach.
The reviewers also pointed out a few limitations, including:
The effect of TwinNet as a regularizer could be examined against other regularization strategies for comparison.