
4 ways to enable continual learning in neural networks

  • 6 min read
  • 28 Nov 2017

Of late, deep learning has been one of the driving forces behind most technological breakthroughs happening around the globe. Whether it is machine translation, automatic recognition and sorting of images, smartphone interaction, or automated medicine and healthcare, deep learning is the power source behind them all.

Neural networks, the building blocks of deep learning models, are now set on the path towards imitating the human brain. But to achieve this, they face a roadblock: the ability to learn tasks sequentially without forgetting. This shortcoming is known as catastrophic forgetting. Humans, too, tend to forget old information, but only gradually; with neural networks the phenomenon occurs at a catastrophic rate, hence the name. Several powerful architectures and algorithms have been proposed to enable continual learning in neural networks.

A few of them are discussed below:

Long Short-Term Memory Networks

A Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) designed to address the vanishing gradient problem. It contains an explicit memory unit called a cell, embedded in the network. As the name implies, LSTMs can remember information for longer durations.

An LSTM follows the RNN architecture, but unlike a standard RNN it has four interacting neural network layers. The cell state runs straight down the entire architecture and stores values, and these stored values remain untouched as further learning happens. New information can be added to the cell state and old information removed, regulated by three gates. The gates output values between 0 (pass nothing) and 1 (pass everything), and they are responsible for protecting and controlling the cell state.

If this sounds complex, here is a simpler way to think about it: the gates are the decision-makers in an LSTM. They decide what information to discard and what to store. Based on how the gates filter the cell state, the LSTM generates its output.
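To make the gating idea concrete, here is a minimal sketch of a single LSTM step in Python. It assumes the weight matrices `W`, `U` and biases `b` for each gate are already initialised placeholders (they are not taken from any particular library); it only illustrates how the gates filter the cell state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative; W, U, b are assumed to be
    pre-initialised dicts of weights keyed by gate name)."""
    # Forget gate: values near 0 erase parts of the old cell state,
    # values near 1 let them pass through untouched.
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])
    # Input gate decides which new candidate values enter the cell.
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])
    # Output gate filters the cell state to produce the hidden state.
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])
    c = f * c_prev + i * g   # updated cell state (the long-term memory)
    h = o * np.tanh(c)       # emitted output for this step
    return h, c
```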

LSTMs are used as a fundamental component by top multinational firms (Google, Amazon, Microsoft) for applications such as speech recognition, smart assistants, and feature enhancement.

Elastic Weight Consolidation Algorithm

Synaptic consolidation is the human brain's approach to long-term learning. The Elastic Weight Consolidation (EWC) algorithm takes inspiration from this mechanism to address catastrophic interference. A neural network, like the brain, is made up of many connections among its neurons. EWC evaluates how important each connection is to previously learned tasks; that is, it assigns a weight to each connection based on its importance to the older tasks.

In EWC, the weight attached to each connection for a new task is anchored to its old value by an elastic spring whose stiffness reflects the connection's importance, hence the name Elastic Weight Consolidation. Through this interplay of weights and connections, the EWC algorithm lets a neural network learn new tasks without overwriting information from prior tasks, while keeping the additional computational cost low.
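As a rough sketch of how this looks in code, the PyTorch snippet below (purely illustrative) adds a quadratic penalty that pulls each parameter back towards the value it had after the previous task, weighted by a diagonal Fisher-information estimate. The `fisher` and `old_params` dictionaries are assumed to have been computed and stored after training on the old task; `lam` plays the role of the spring stiffness.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Quadratic EWC-style penalty: each parameter is pulled back towards
    its value after the previous task, with a per-parameter 'stiffness'
    given by the diagonal Fisher information (assumed precomputed)."""
    loss = torch.tensor(0.0)
    for name, param in model.named_parameters():
        loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return (lam / 2.0) * loss

# During training on the new task (sketch):
# total_loss = task_loss + ewc_penalty(model, fisher, old_params)
# total_loss.backward(); optimizer.step()
```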

The EWC algorithm has been used to learn multiple Atari games sequentially. Using EWC, the game agent learned to play one game and then transferred what it had learnt to a new game, and it was able to play several games in succession.

Differentiable Neural Computer

DeepMind's Differentiable Neural Computer (DNC) is a memory-augmented neural network (MANN), a combination of a neural network and an external memory system. DNCs can store complex data the way computers do, all the while learning from examples like neural networks. They can not only parse complex data structures such as trees and graphs but also learn to form their own data structures. When a DNC was shown a graph data structure, for example a map of the London Underground, it learnt to write a description of the graph and answer questions about it. Surprisingly, a DNC can also answer questions about your family tree!

The DNC has a controller, which one may think of as a computer processor. The controller is responsible for three simple tasks:

  • taking an input
  • reading from and writing to memory
  • producing an interpretable output

Memory here refers to locations where vectors of information are stored. The controller performs read and write operations on this memory. With every new piece of information, it can either:

  • choose to write to a completely new, unused location
  • write to a used location based on the information the controller is searching for
  • not perform the write operation at all

The controller can also decide to free locations that are no longer needed. As for reading, it can read from multiple memory locations at once. Memory can also be searched on the basis of several criteria, such as content or temporal links. The information read can then be produced as answers in context to the questions asked.

Simply put, memory enables the DNCs to make decisions about how they allocate, store, and retrieve memory to produce relevant and interpretable answers.
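A full DNC is well beyond a short snippet, but the content-based lookup at its heart can be sketched in a few lines of Python. The example below (illustrative only, with toy sizes) compares a read key against every memory row using cosine similarity and returns a soft, differentiable blend of the rows, which is how memory-augmented networks "search by content".

```python
import numpy as np

def content_based_read(memory, key, strength=1.0):
    """Content-based addressing sketch: compare a read key against every
    memory row and return a weighted blend of the rows (a 'soft' lookup)."""
    # Cosine similarity between the key and each memory location.
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = (memory @ key) / norms
    # A softmax turns similarities into read weights over locations.
    weights = np.exp(strength * similarity)
    weights /= weights.sum()
    # The read vector is the weighted sum of memory rows.
    return weights @ memory

memory = np.random.randn(16, 8)   # 16 locations, 8-dimensional vectors (toy sizes)
key = np.random.randn(8)          # what the controller is searching for
read_vector = content_based_read(memory, key)
```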

Progressive Neural Networks

Neural networks have limited ability to transfer knowledge across domains. Progressive neural networks act as training wheels towards developing continual learning systems. They work at each layer of the network to incorporate prior knowledge and to decide whether to reuse old computations or learn new ones, making them immune to catastrophic forgetting.

Progressive networks essentially use adapters to make connections between columns, where a column is the group of layers trained for a particular task. When the network has to learn a new task, an extra column is added and the weights of the first column are frozen, eliminating catastrophic forgetting. The outputs of the layers in the original column become additional inputs to the corresponding layers in the new column. As more tasks are added, the number of columns grows, and the adapters then have to deal with the dimensionality explosion that can result.
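As a rough sketch of this column-and-adapter arrangement, the PyTorch snippet below adds a second column for a new task on top of a frozen first column. The first column is assumed, purely for illustration, to expose a layer called `layer1` of matching size; the `adapter` is the lateral connection that carries its activations into the new column.

```python
import torch
import torch.nn as nn

class ProgressiveSecondColumn(nn.Module):
    """Sketch of a second column added for a new task. The first column's
    weights are frozen; its hidden activations feed laterally, via an
    adapter, into the new column. Sizes are illustrative only."""
    def __init__(self, frozen_column, in_dim=32, hidden=64, out_dim=10):
        super().__init__()
        self.frozen_column = frozen_column
        for p in self.frozen_column.parameters():
            p.requires_grad = False            # old task's weights stay fixed
        self.layer1 = nn.Linear(in_dim, hidden)
        self.adapter = nn.Linear(hidden, hidden)   # lateral connection
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        with torch.no_grad():
            old_h = self.frozen_column.layer1(x)   # reuse prior knowledge
        new_h = torch.relu(self.layer1(x) + self.adapter(old_h))
        return self.out(new_h)
```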

A progressively enhanced neural network was successful in playing the Labyrinth 3D maze game. The neural network progressively learnt new mazes by using information it received from previous mazes.

Conclusion

Memory-augmented neural networks have wide applications in robotic process automation, self-driving cars, natural language understanding, chatbots, next-word prediction, and more. Neural networks are also being used for time-series prediction, notably for AR and VR technologies, video analytics, and studying financial markets.

With the advancements happening in the field of continual learning, a deep learning neural network that fully emulates the human brain may not be far off.