Evolution of NLP approaches
Let us discuss how NLP has evolved over the past two decades. Most recently, this evolution has been driven largely by the Transformer architecture. This architecture did not emerge out of thin air; it grew out of earlier neural NLP approaches into an attention-based encoder-decoder architecture that continues to evolve. Over the past decade, the Transformer architecture and its variants have gained popularity thanks to the following developments:
- Contextual word embeddings thanks to self-attention
- Attention mechanisms, which overcome the problem of forcing input sentences to encode all information into one context vector
- Better subword tokenization algorithms for handling unseen words or rare words
- Injecting additional memory tokens into sentences, such as the paragraph ID in Doc2vec or a Classification ([CLS]) token in Bidirectional Encoder Representations from Transformers (BERT)
- Parallelizable architectures that make for faster training and fine-tuning
- Model compression (distillation, quantization, and so on)
- TL capabilities: DL models can be easily adapted to new tasks or languages
- Cross-lingual, multilingual, and multitasking learning capabilities
- Multimodal training
For many years, traditional NLP approaches such as n-gram language models, TF-IDF-based models, BM25 information retrieval models, and one-hot encoded document-term matrices were used to solve a range of NLP tasks, including sequence classification, language generation, and machine translation. However, these traditional approaches struggle with sparsity, rare words, and long-term dependencies. To address these challenges, DL-based approaches such as RNNs, CNNs, and FFNNs, along with several variants, were developed.
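The sparsity problem mentioned above can be illustrated with a minimal sketch. The three-document corpus is made up for illustration, and the weighting is a plain, unsmoothed TF-IDF variant (term frequency times log(N / document frequency)); real libraries usually apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "transformers use attention",
]
tokenized = [d.split() for d in docs]

# One feature per distinct word: the vector length grows with the vocabulary.
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(docs)

def tfidf_vector(doc):
    """Unsmoothed TF-IDF vector of one tokenized document over the whole vocabulary."""
    counts = Counter(doc)
    return [
        (counts[w] / len(doc)) * math.log(n_docs / df[w]) if w in counts else 0.0
        for w in vocab
    ]

vectors = [tfidf_vector(doc) for doc in tokenized]

# Share of entries that are exactly zero across all document vectors.
zero_share = sum(v == 0.0 for vec in vectors for v in vec) / (len(vectors) * len(vocab))
print(f"vocabulary size: {len(vocab)}, zero entries: {zero_share:.0%}")
```

Even in this tiny corpus, most entries of each vector are zero, and every new word adds another dimension.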
A document vector created by a TF-IDF model has one feature per vocabulary term, so it can easily have more than 30,000 features, most of them zero. In 2013, word2vec, a two-layer FFNN word-encoder model, mitigated this curse of dimensionality by generating compact and dense representations of words called word embeddings. This early model was able to create static word embeddings quickly and efficiently by converting unsupervised textual data into supervised data (self-supervised learning), that is, by predicting target words from their neighboring words. Meanwhile, GloVe, another widely used model, argued that count-based models can be better than neural models. It leverages both global and local statistics of a corpus to learn embeddings based on word-word co-occurrence statistics.
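To make the self-supervised idea concrete, the following sketch shows how raw text can be turned into (context, target) training pairs in the style of a CBOW objective. The function name `cbow_pairs`, the window size, and the example sentence are assumptions for illustration; this shows only the data preparation step, not the word2vec training procedure itself.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs from a list of tokens.

    For each position, the target is the word at that position and the
    context is up to `window` words on each side of it.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip single-token inputs that have no context
            pairs.append((context, target))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

No manual labels are needed: the labels (target words) come from the text itself, which is what makes the setup self-supervised.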
Word embeddings have proven effective for some syntactic and semantic tasks, as demonstrated in the following figure, which illustrates how the offsets between terms in the embedding space allow for vector-oriented reasoning. For example, we can capture gender, a semantic relation, as the offset between the terms Man and Woman (Man -> Woman). We can then arithmetically estimate the vector of Actress by adding this offset to the vector of the term Actor. Likewise, we can capture syntactic relationships, such as plural forms. For instance, if the vectors of Actor, Actors, and Actress are given, we can estimate the vector of Actresses:
Figure 1.2 – Word embeddings offset for relationship extraction
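The offset arithmetic in the figure can be sketched with a few hand-made vectors. These three-dimensional vectors are not learned embeddings; they are constructed by hand so that the axes roughly correspond to gender, plurality, and an "acting" topic, which is an idealization that real embeddings only approximate.

```python
import math

# Hand-made toy vectors (NOT learned embeddings).
# Axes are roughly: [gender, plurality, "acting" topic].
emb = {
    "man":       [1.0, 0.0, 0.0],
    "woman":     [-1.0, 0.0, 0.0],
    "actor":     [1.0, 0.0, 1.0],
    "actress":   [-1.0, 0.0, 1.0],
    "actors":    [1.0, 1.0, 1.0],
    "actresses": [-1.0, 1.0, 1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(vec, exclude=()):
    """Return the vocabulary word whose vector is most similar to `vec`."""
    candidates = (w for w in emb if w not in exclude)
    return max(candidates, key=lambda w: cosine(vec, emb[w]))

# Semantic relation: Actor + (Woman - Man) should land near Actress.
gender_offset = sub(emb["woman"], emb["man"])
est_actress = add(emb["actor"], gender_offset)
print(nearest(est_actress, exclude={"man", "woman", "actor"}))

# Syntactic relation: Actress + (Actors - Actor) should land near Actresses.
plural_offset = sub(emb["actors"], emb["actor"])
est_actresses = add(emb["actress"], plural_offset)
print(nearest(est_actresses, exclude={"actor", "actors", "actress"}))
```

As is common in analogy tests, the query words themselves are excluded from the nearest-neighbor search.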
The representational power of word embeddings made them a fundamental component of DL models such as RNNs and CNNs. Recurrent and convolutional architectures started to be used as encoders and decoders in sequence-to-sequence (seq2seq) problems, where each token is represented by an embedding. The main challenge with these early models was polysemous words (words with more than one meaning). Because a single fixed representation is assigned to each word, its different senses are ignored, which is an especially severe problem for sentence semantics.
Unlike static word embeddings, pioneering neural network models such as Universal Language Model Fine-tuning (ULMFiT) and Embeddings from Language Models (ELMo) were able to encode sentence-level information and alleviate polysemy issues. This gave rise to a new concept: contextual word embeddings.
The ULMFiT and ELMo approaches were based on Long Short-Term Memory (LSTM) networks, a variant of RNN. They also exploited the concept of pre-training and fine-tuning, which enables TL: a model is first pre-trained on a general task with general textual datasets and then fine-tuned on a target task with supervision. This was a significant development because TL, which had previously been successful in image processing, was now being applied effectively in the field of NLP as well.
In the meantime, the idea of an attention mechanism (Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2015) made a strong impression in the NLP field and achieved significant success, especially in seq2seq problems. Earlier methods would pass only the last state (known as a context vector or thought vector) obtained from the entire input sequence to the output sequence, without any alignment or selection between the two. Thanks to the attention mechanism, certain parts of the input can be associated with certain parts of the output.
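The difference from a single context vector can be sketched as follows. For brevity, this example scores each encoder state with a plain dot product, which is a simplification: Bahdanau et al. use a small learned network for scoring. The function names and the toy numbers are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the resulting context vector.

    Scores use a dot product (a simplification of the learned scoring
    function in the original attention paper).
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is a weighted average of ALL encoder states,
    # rather than just the last one.
    context = [
        sum(w * state[j] for w, state in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context

# One hidden state per input token (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

A new set of weights is computed at every decoding step, so each output token can focus on different input tokens.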
In 2017, the Transformer-based encoder-decoder model was introduced and found to be successful due to its innovative use of the attention mechanism. The design discards RNN recurrence entirely and relies only on attention mechanisms combined with feed-forward layers (Vaswani et al., Attention Is All You Need, 2017). It overcame many difficulties that other approaches faced and has become a new paradigm. Throughout this book, you will explore how Transformer-based models work.
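As a rough sketch of attention without recurrence, the following example computes scaled dot-product self-attention, softmax(QK^T / sqrt(d_k))V, over a short sequence. To stay small, it sets Q = K = V = X, that is, it omits the learned projection matrices, multiple heads, and positional encodings of the actual Transformer; the input values are arbitrary.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    Simplification: no learned projections, heads, or positional encodings.
    """
    d_k = len(X[0])
    outputs = []
    # Each position is computed independently of the others' outputs,
    # so there is no sequential (recurrent) dependency between steps.
    for query in X:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in X
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, X))
            for j in range(d_k)
        ])
    return outputs

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token vectors (toy values)
Y = self_attention(X)
for row in Y:
    print([round(v, 3) for v in row])
```

Because no output row depends on another output row, all positions can be computed in parallel, which is one reason these architectures train faster than recurrent ones.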