Leveraging DL
For decades, we have witnessed successful neural architectures, especially for word and sentence representation. Neural network-based language models have been effective in addressing feature representation and language modeling problems because they make it possible to train more advanced neural architectures on large datasets, which enables the learning of compact, high-quality representations of language.
In 2013, the word2vec model introduced a simple and effective architecture for learning continuous word representations that outperformed other models on a variety of syntactic and semantic language tasks, such as sentiment analysis, paraphrase detection, and relation extraction. The low computational complexity of word2vec has also contributed to its popularity. Thanks to the success of these embedding methods, word embedding representations have gained traction and are now widely used in modern NLP models.
Word2vec and similar models learn word embeddings through a prediction-based neural architecture built around nearby-word prediction. This approach differs from traditional BoW methods, which rely on count-based techniques for capturing distributional semantics. The authors of the GloVe model addressed the question of whether count-based or prediction-based methods are better for distributional word representations and claimed that the two approaches are not significantly different. FastText is another widely used model that incorporates subword information by representing each word as a bag of character n-grams, with each n-gram assigned its own vector. Words are then represented as the sum of their subword vectors, an idea first introduced by H. Schütze in 1993. This allows FastText to compute word representations even for unseen (or rare) words and to learn the internal structure of words, such as suffixes and affixes, which is particularly useful for morphologically rich languages. Likewise, modern Transformer architectures also incorporate subword information, using various subword tokenization methods such as WordPiece, SentencePiece, or Byte-Pair Encoding (BPE).
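As a quick illustration of this out-of-vocabulary behavior, the following sketch trains word2vec and FastText on a toy corpus with gensim. It is a minimal sketch assuming gensim 4.x; the corpus and hyperparameters are purely illustrative, not tuned.

```python
# Minimal sketch (assumption: gensim 4.x installed); toy corpus and
# hyperparameters are illustrative only.
from gensim.models import Word2Vec, FastText

corpus = [
    ["the", "cat", "is", "sad"],
    ["the", "dog", "is", "happy"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Prediction-based word embeddings (skip-gram).
w2v = Word2Vec(sentences=corpus, vector_size=32, window=2,
               min_count=1, sg=1, epochs=50)

# FastText adds character n-grams (min_n..max_n), so it can compose vectors
# for words it never saw during training.
ft = FastText(sentences=corpus, vector_size=32, window=2, min_count=1,
              min_n=3, max_n=5, epochs=50)

print("cattish" in w2v.wv)    # False: word2vec has no vector for unseen words
print(ft.wv["cattish"][:5])   # FastText builds one from character n-grams
```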
Let us quickly discuss popular RNN models.
Considering the word order with RNN models
Traditional BoW models do not consider word order because they treat all words as individual units and place them in a bag. RNN models, on the other hand, learn the representation of each token (word) by cumulatively incorporating the information of the previous tokens, and the hidden state at the final token ultimately provides the representation of the entire sentence.
Like other neural network models, RNN models process tokens produced by a tokenization algorithm that breaks down raw text into atomic units, known as tokens. These tokens are then associated with numeric vectors, called token embeddings, which are learned during training. Alternatively, we can use well-known word-embedding algorithms, such as word2vec or FastText, to generate these token embeddings in advance.
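The following is a minimal PyTorch sketch of this token-to-embedding step. The vocabulary, dimensions, and the commented-out pre-trained initialization are illustrative assumptions:

```python
# Minimal sketch of turning tokens into trainable embeddings with PyTorch
# (toy vocabulary and dimensions are assumptions for illustration).
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "cat": 2, "is": 3, "sad": 4, ".": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Alternatively, copy pre-trained word2vec/FastText vectors into the table:
# pretrained = torch.randn(len(vocab), 8)   # stand-in for real pre-trained vectors
# embedding.weight.data.copy_(pretrained)

token_ids = torch.tensor([[vocab[t] for t in ["the", "cat", "is", "sad", "."]]])
token_embeddings = embedding(token_ids)       # shape: (1, 5, 8)
print(token_embeddings.shape)
```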
Here is a simple illustration of an RNN architecture for the sentence The cat is sad., where x0 is the vector embedding of The, x1 is the vector embedding of cat, and so forth. Figure 1.3 illustrates an RNN being unfolded into a full deep neural network (DNN). Unfolding means that we associate a layer with each word. For the The cat is sad. sequence, we deal with a sequence of five tokens. The hidden state in each layer acts as the memory of the network and is shared between the steps. It encodes information about what happened in all previous timesteps and in the current timestep. This is represented in the following diagram:
Figure 1.3 – An RNN architecture
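To make the unfolding concrete, here is a minimal PyTorch sketch that steps an RNN cell over five token embeddings, carrying the hidden state forward at each timestep. The embeddings and dimensions are random stand-ins, not values from the figure:

```python
# Minimal sketch of the unfolded RNN in Figure 1.3: one step per token,
# with the hidden state carried forward (dimensions are illustrative).
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 8, 16
rnn_cell = nn.RNNCell(input_size=embedding_dim, hidden_size=hidden_dim)

# Stand-ins for x0..x4, the embeddings of "The cat is sad ."
token_embeddings = torch.randn(5, embedding_dim)

h = torch.zeros(1, hidden_dim)           # initial hidden state
for x_t in token_embeddings:             # one "layer" per token when unfolded
    h = rnn_cell(x_t.unsqueeze(0), h)    # h encodes all tokens seen so far

sentence_representation = h              # final hidden state summarizes the sentence
print(sentence_representation.shape)     # torch.Size([1, 16])
```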
The following are some advantages of an RNN architecture:
- Variable-length input: The capacity to work on variable-length input, no matter the size of the input sentence. We can feed the network sentences of 3 or 300 words without changing any parameters.
- Caring about word order: It processes the sequence word by word in order, taking word position into account.
- Suitable for working in various modes: We can train a machine translation model or a sentiment analysis model using the same recurrence paradigm. Both architectures would be based on an RNN:
- One-to-many mode: An RNN can be redesigned in a one-to-many configuration for language generation or music generation (e.g., a word -> its definition in a sentence)
- Many-to-one mode: It can be used for text classification or sentiment analysis (e.g., a sentence as a list of words -> a sentiment score); a sketch of this mode is shown after this list
- Many-to-many mode: Many-to-many models are used to solve encoder-decoder problems such as machine translation, question answering, and text summarization, or sequence labeling problems such as Named Entity Recognition (NER)
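As referenced in the many-to-one item above, the following is a minimal PyTorch sketch of a many-to-one RNN classifier. The vocabulary size, dimensions, and class count are illustrative assumptions:

```python
# Minimal many-to-one sketch: an RNN sentiment classifier mapping a sequence
# of token IDs to a single prediction (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embedding_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len), any seq_len
        embedded = self.embedding(token_ids)             # (batch, seq_len, embedding_dim)
        _, last_hidden = self.rnn(embedded)               # (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))    # one prediction per sequence

model = RNNSentimentClassifier()
logits = model(torch.randint(0, 1000, (4, 12)))           # 4 sentences of 12 tokens each
print(logits.shape)                                       # torch.Size([4, 2])
```

The same recurrent core can be rearranged for one-to-many or many-to-many settings by changing what is fed in and what is read out at each timestep.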
The disadvantages of an RNN architecture are listed here:
- Long-term dependency problem: When we process a very long document and try to link terms that are far apart, the model must also attend to and encode all the irrelevant terms that lie between them.
- Prone to exploding or vanishing gradient problems: When working on long documents, updating the weights with respect to the very first words becomes very difficult, which can make the model untrainable due to the vanishing gradient problem.
- Hard to apply parallelizable training: Parallelization breaks the main problem down into smaller problems and executes the solutions at the same time, but an RNN follows a classic sequential approach. Each timestep strongly depends on the previous ones, which makes parallelization impossible.
- Slow computation for long sequences: An RNN can be quite efficient for short text problems, but it processes long documents very slowly, in addition to suffering from the long-term dependency problem.
Although an RNN can theoretically attend to information from many timesteps back, in the real world, long documents and long-term dependencies are very hard to handle, since long sequences unfold into many deep layers. These problems have been addressed by many studies, some of which are outlined here:
- Hochreiter and Schmidhuber. Long Short-Term Memory. 1997.
- Bengio et al. Learning long-term dependencies with gradient descent is difficult. 1994.
- Cho et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014.
LSTMs and gated recurrent units
LSTM networks (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs) (Cho et al., 2014) are types of RNNs that are designed to address the problem of long-term dependencies. One key feature of LSTMs is the cell state, a horizontal line running along the top of the LSTM unit that is controlled by specialized gates handling forget, insert, and update operations. The complex structure of an LSTM is shown in the following diagram:
Figure 1.4 – An LSTM unit
The design is able to decide the following:
- What kind of information we will store in the cell state
- Which information will be forgotten or deleted
In the original RNN, in order to learn the state of any token, the network recurrently processes the entire state of the previous tokens. Carrying all of this information from earlier timesteps leads to vanishing gradient problems, which make the model untrainable. You can think of it this way: the difficulty of carrying information increases exponentially with each step. For example, let’s say that carrying information across one previous token costs 2 units of difficulty; across 2 previous tokens, it would be 4 units, and across 10 previous tokens, it becomes 1,024 units of difficulty.
The gate mechanism in an LSTM allows the architecture to skip unrelated tokens at a given timestep or to remember long-range states when learning the current token’s state. The GRU is similar to an LSTM in many ways, the main difference being that a GRU does not use a cell state. Instead, the architecture is simplified by transferring the functionality of the cell state to the hidden state, and it includes only two gates: an update gate and a reset gate. The update gate determines how much information from the previous and current timesteps is pushed forward. This feature helps the model keep relevant information from the past, which also minimizes the risk of a vanishing gradient problem. The reset gate detects irrelevant data and makes the model forget it.
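The following sketch contrasts PyTorch’s built-in LSTM and GRU layers to show the structural difference just described: the LSTM returns both a hidden state and a cell state, while the GRU returns only a hidden state. The dimensions are illustrative:

```python
# Minimal sketch: LSTM carries a hidden state AND a cell state; GRU keeps
# only a hidden state (dimensions are illustrative).
import torch
import torch.nn as nn

x = torch.randn(2, 7, 16)                 # (batch, seq_len, input_size)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

lstm_out, (h_lstm, c_lstm) = lstm(x)      # hidden state and cell state
gru_out, h_gru = gru(x)                   # hidden state only

print(h_lstm.shape, c_lstm.shape)         # torch.Size([1, 2, 32]) twice
print(h_gru.shape)                        # torch.Size([1, 2, 32])

# The simplified gating of the GRU also shows up in its parameter count:
print(sum(p.numel() for p in lstm.parameters()),
      sum(p.numel() for p in gru.parameters()))
```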
In a seq2seq pipeline with traditional RNNs, all the information in the input had to be compressed into a single representation, which can make it difficult for the output tokens to focus on specific parts of the input tokens. Bahdanau’s attention mechanism was first applied to RNNs, has been shown to be effective in addressing this issue, and has become a key concept in the Transformer architecture. We will examine how the Transformer architecture uses attention in more detail later in the book.
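As a rough sketch of the additive (Bahdanau-style) scoring idea, the following PyTorch snippet lets a single decoder state weight every encoder state instead of relying on one compressed vector. All tensors, dimensions, and layer names here are illustrative assumptions rather than the original implementation:

```python
# Minimal sketch of additive (Bahdanau-style) attention: the decoder state
# scores every encoder state (all tensors and sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, seq_len = 32, 7
encoder_states = torch.randn(1, seq_len, hidden_dim)   # one state per input token
decoder_state = torch.randn(1, hidden_dim)             # current output-side state

W_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
W_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
v = nn.Linear(hidden_dim, 1, bias=False)

# score(s_t, h_i) = v^T tanh(W_dec s_t + W_enc h_i)
scores = v(torch.tanh(W_dec(decoder_state).unsqueeze(1) + W_enc(encoder_states)))
weights = F.softmax(scores, dim=1)                # how much to attend to each token
context = (weights * encoder_states).sum(dim=1)   # weighted summary of the input
print(weights.squeeze(-1), context.shape)
```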
Contextual word embeddings and TL
Traditional word embeddings such as word2vec were important components of many neural language understanding models. However, using these embeddings in a language modeling task results in the same representation for the word bank regardless of the context in which it appears. For example, with traditional vectors, the word bank would have the same embedding whether it refers to a financial institution or the side of a river. Context is important for humans as well as machines to understand the meaning of a word.
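The following toy PyTorch sketch makes this context-blindness explicit: the embedding lookup for bank returns exactly the same vector in both sentences. The vocabulary and random vectors are purely illustrative:

```python
# Minimal sketch of why static embeddings are context-blind: the lookup for
# "bank" ignores the surrounding words (toy vocabulary, random vectors).
import torch
import torch.nn as nn

vocab = {"she": 0, "deposited": 1, "money": 2, "at": 3, "the": 4,
         "bank": 5, "he": 6, "sat": 7, "on": 8, "river": 9}
embedding = nn.Embedding(len(vocab), 16)

sent1 = ["she", "deposited", "money", "at", "the", "bank"]
sent2 = ["he", "sat", "on", "the", "river", "bank"]

ids1 = torch.tensor([vocab[t] for t in sent1])
ids2 = torch.tensor([vocab[t] for t in sent2])
bank1 = embedding(ids1)[sent1.index("bank")]   # "bank" as a financial institution
bank2 = embedding(ids2)[sent2.index("bank")]   # "bank" as the side of a river

print(torch.equal(bank1, bank2))               # True: the context is ignored
```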
Along these lines, ELMo and ULMFiT were the pioneering models that achieved contextual word embeddings and efficient TL in NLP. They achieved strong results on a variety of language understanding tasks, such as question answering, NER, relation extraction, and so forth. These contextual embeddings capture both the meaning of a word and the context in which it appears. Unlike traditional word embeddings, which use a static representation for each word, these models use bidirectional LSTMs to encode a word by considering the entire sentence in which it appears.
They demonstrated that pre-trained word embeddings can be reused for other NLP tasks once the pre-training has been completed. ULMFiT, in particular, was successful in applying TL to NLP tasks. At that time, TL was already commonly used in computer vision, but existing NLP approaches still required task-specific modifications and training from scratch. ULMFiT introduced an effective TL method that can be applied to any NLP task and also demonstrated techniques for fine-tuning a language model. The ULMFiT process consists of three stages. The first stage is the pre-training of a general domain language model on a large corpus to capture general language features at different layers. The second stage involves fine-tuning the pre-trained language model on a target task dataset using discriminative fine-tuning to learn task-specific features. The final stage involves fine-tuning the classifier on the target task with gradual unfreezing. This approach maintains low-level representations while adapting to high-level ones.
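The following is a minimal plain-PyTorch sketch of two of these ideas, discriminative learning rates and gradual unfreezing. The model, layer groups, learning rates, and staging are illustrative assumptions and not the original fastai implementation:

```python
# Minimal sketch of ULMFiT-style fine-tuning ideas in plain PyTorch:
# discriminative learning rates and gradual unfreezing (all values are
# illustrative assumptions, not the original implementation).
import torch.nn as nn
import torch.optim as optim

model = nn.ModuleDict({
    "embedding": nn.Embedding(1000, 64),
    "encoder":   nn.LSTM(64, 128, batch_first=True),
    "head":      nn.Linear(128, 2),
})

# Discriminative fine-tuning: lower layers get smaller learning rates.
optimizer = optim.Adam([
    {"params": model["embedding"].parameters(), "lr": 1e-4},
    {"params": model["encoder"].parameters(),   "lr": 5e-4},
    {"params": model["head"].parameters(),      "lr": 1e-3},
])

# Gradual unfreezing: start with only the head trainable, then unfreeze one
# layer group per stage (a training loop would run between these stages).
for param in list(model["embedding"].parameters()) + list(model["encoder"].parameters()):
    param.requires_grad = False                 # stage 1: head only

for param in model["encoder"].parameters():
    param.requires_grad = True                  # stage 2: head + encoder

for param in model["embedding"].parameters():
    param.requires_grad = True                  # stage 3: everything
```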
Now, we finally come to our main topic—Transformers!