Active Learning : An approach to training machine learning models efficiently

Training a machine learning model to give accurate results requires crunching huge amounts of labelled data in it. Data being naturally unlabelled, need ‘experts’ who can scan through the data and tag them with correct labels. To perform topic-specific data labelling, for example, classifying diseases based on their type, would definitely require a doctor or someone with a medical background to label the data. Getting such topic-specific experts to label data can get difficult and quite expensive. Also, doing this for many machine learning projects is impractical. Active learning can help here.

What is Active Learning

Active learning is a type of semi-supervised machine learning, which aids in reducing the amount of labeled data required to train a model. In active learning, the model focuses only on data that the model is confused about and requests the experts to label them. The model later trains a bit more on the small amount of labeled data, and repeats the same for such confusing data labeling.

Active learning, in short, prioritizes confusing samples that need labeling. This enables models to learn faster, and allows experts to skip labeling data that is not a priority, and to provide the model with the most useful information on the confused samples.

This in turn can fetch great machine learning models, as active learning can reduce the number of labels required to collect from experts.

Types of Active learning

An active learning environment includes a learner (the model being trained), huge amount of raw and unlabelled data, and the expert (the person/system labelling the data). The role of the learner is to choose which instances or examples should be labelled. The learner’s goal is to reduce the number of labeled examples needed for an ML model to learn. On the other hand, the expert on receiving the data to be labelled, analyzes the data to determine appropriate labels for it.

There are three types of Active learning scenarios.

Query Synthesis - In such a scenario, the learner constructs examples, which are further sent to the expert for labeling.

Stream-based active learning - Here, from the stream of unlabelled data, the learner decides the instances to be labelled or choose to discard them.

Pool-based active learning - This is the most common scenario in active learning. Here, the learner chooses only the most informative or best instances and forwards them to the expert for labelling.

Some Real-life applications of Active learning

Natural Language Processing (NLP): Most of the NLP applications require a lot of labelled data such as POS (Parts-of-speech) tagging, NER (Named Entity Recognition), and so on. Also, there is a huge cost incurred in labelling this data. Thus, using active learning can reduce the amount of labelled data required to label.

Scene understanding in self-driving cars: Active learning can also be used in detecting objects, such as pedestrians from a video camera mounted on a moving car,a key area to ensure safety in autonomous vehicles. This can result in high levels of detection accuracy in complex and variable backgrounds.

Drug designing: Drugs are biological or chemical compounds that interact with specific ‘targets’ in the body (usually proteins, RNA or DNA) with an aim to modify their activity. The goal of drug designing is to find which compounds bind to a particular target. The data comes from large collections of compounds, vendor catalogs, corporate collections, and combinatorial chemistry. With active learning, the learner can find out the compounds that are active (binds to target) or inactive.

Active learning is still being researched using different deep learning algorithms such as CNNs and LSTMs, which act as learners in order to improve their efficiency. Also, GANs (Generative Adversarial Networks) are being implemented in the active learning framework. There are also some research papers that try to learn active learning strategies using meta-learning.