Recalling traditional NLP approaches
Although traditional models are rapidly being superseded, they can still shed light on innovative designs. The most important of these is the distributional method, which remains in use. Distributional semantics is a theory that explains the meaning of a word by analyzing its distributional evidence rather than relying on predefined dictionary definitions or other static resources. This approach suggests that words that frequently occur in similar contexts tend to have similar meanings. For example, words such as dog and cat often occur in similar contexts, suggesting that they may have related meanings. The idea was first proposed by Zellig S. Harris in his 1954 work, Distributional Structure. One of the benefits of using a distributional approach is that it allows researchers to track the semantic evolution of words over time, or across different domains or senses, which is a task that is not possible using dictionary definitions alone.
For many years, traditional approaches to NLP have relied on Bag-of-Words (BoW) and n-gram language models to understand words and sentences. BoW approaches, also known as vector space models (VSMs), represent words and documents using one-hot encoding, a sparse representation method. These one-hot encoding techniques have been used to solve a variety of NLP tasks, such as text classification, word similarity, semantic relation extraction, and word-sense disambiguation. On the other hand, n-gram language models assign probabilities to sequences of words, which can be used to calculate the likelihood that a given sequence belongs to a corpus or to generate random sequences based on a given corpus.
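The BoW idea can be sketched in a few lines of plain Python. The toy corpus and helper names below are illustrative, not from a specific library; each document becomes a count vector over a fixed vocabulary:

```python
# Minimal Bag-of-Words sketch: each document becomes a vector of term
# counts over a fixed vocabulary (toy corpus; names are illustrative).
docs = [
    "the dog chased the cat",
    "the cat sat on the mat",
]

# Build the vocabulary from the whole corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count occurrences of each vocabulary term in the document."""
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(term, 0) for term in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 1, 1, 0, 0, 0, 2], [1, 0, 0, 1, 1, 1, 2]]
```

Note that each vector has one dimension per vocabulary word, so most entries are zero for any single document; this is the sparsity problem discussed later in this section.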
Along with these models, the Term Frequency (TF) and the Inverse Document Frequency (IDF) metrics are often used to assign importance to a term. The TF is simply the raw count of a term in a document, while the Document Frequency (DF) of a term is the number of documents in which it appears. The IDF, typically computed as the logarithm of the total number of documents divided by the DF, reduces the weight of high-frequency words, such as stop words and function words, which have little discriminatory power in understanding the content of a document. The discriminatory power of a term also depends on the domain – for example, in a collection of articles about DL, the word network is likely to appear in almost every document and therefore carries little information there.
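A minimal sketch of this weighting, using the common tf * log(N/df) variant (libraries such as scikit-learn apply smoothed variants of the same formula; the documents here are a toy example):

```python
import math

# TF-IDF sketch: tf = raw count; idf = log(N / df), one common variant.
# The corpus below is a toy example with pre-tokenized documents.
docs = [
    ["network", "training", "loss"],
    ["network", "layers", "network"],
    ["data", "training", "network"],
]
N = len(docs)

def df(term):
    """Number of documents containing the term."""
    return sum(term in doc for doc in docs)

def tf_idf(term, doc):
    tf = doc.count(term)            # raw term frequency
    idf = math.log(N / df(term))    # down-weights ubiquitous terms
    return tf * idf

# "network" appears in every document, so its IDF -- and weight -- is 0
print(tf_idf("network", docs[1]))   # 2 * log(3/3) = 0.0
print(tf_idf("layers", docs[1]))    # 1 * log(3/1) ≈ 1.0986
```

This reproduces the domain effect described above: a term that occurs in every document of a collection, like network in a DL corpus, ends up with zero weight.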
Some of the advantages and disadvantages of a TF-IDF-based BoW model are listed as follows:
| Advantages | Disadvantages |
| --- | --- |
| Simple and easy to implement | Produces very high-dimensional, sparse vectors |
| Interpretable: each dimension corresponds to a term | Ignores word order and context |
| Requires no training phase | Cannot capture semantic similarity between related words such as dog and cat |
Table 1.1 – Advantages and disadvantages of a TF-IDF BoW model
Using a BoW approach to represent a small sentence can be impractical because it involves representing each word in the dictionary as a separate dimension in a vector, regardless of whether it appears in the sentence or not. This can result in a high-dimensional vector with many zero cells, making it difficult to work with and requiring a large amount of memory to store.
Latent semantic analysis (LSA) has been widely used to overcome the dimensionality problem of the BoW model. It is a linear method that captures pairwise correlations between terms. LSA-based probabilistic methods can still be viewed as having a single hidden layer of topic variables, whereas current DL models include multiple hidden layers, with billions of parameters. Moreover, Transformer-based models have shown that they can discover latent representations much better than such traditional models.
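The core of LSA is a truncated singular value decomposition (SVD) of the term-document matrix. The following sketch, with an illustrative toy matrix, shows how keeping only the top-k singular vectors maps documents into a low-dimensional latent "topic" space:

```python
import numpy as np

# LSA sketch: factor a toy term-document count matrix with SVD and keep
# the top-k singular vectors as latent "topic" dimensions.
# Rows = terms, columns = documents (counts are illustrative).
A = np.array([
    [2, 1, 0, 0],   # "network"  (docs 0-1 are about DL)
    [1, 2, 0, 0],   # "training"
    [0, 0, 3, 1],   # "recipe"   (docs 2-3 are about cooking)
    [0, 0, 1, 3],   # "flour"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # latent topics to keep
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T    # documents in topic space

# Documents about the same subject land close together in the k-dim space
print(doc_embeddings.round(2))
```

After truncation, the two DL documents end up much closer to each other than to either cooking document, even though their raw count vectors share no dimensions with nonzero values in common beyond their own block.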
The traditional pipeline for NLP tasks begins with a series of preparation steps, such as tokenization, stemming, noun phrase detection, chunking, and stop-word elimination. After these steps are completed, a document-term matrix is constructed using a weighting schema, with TF-IDF being the most popular choice. This matrix can then be used as input for various machine learning (ML) pipelines, including sentiment analysis, document similarity, document clustering, and measuring the relevancy score between a query and a document. Similarly, terms can be represented as a matrix and used as input for token classification tasks, such as named entity recognition and semantic relation extraction. The classification phase typically involves the application of supervised ML algorithms, such as support vector machines (SVMs), random forests, logistic regression, naive Bayes, and ensemble methods (boosting or bagging).
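The stages of this pipeline can be sketched end to end in plain Python. The corpus, stop-word list, and query below are illustrative; the pipeline tokenizes, removes stop words, builds a TF-IDF document-term matrix, and scores query-document relevancy with cosine similarity:

```python
import math

# Sketch of the classical pipeline: tokenize -> remove stop words ->
# TF-IDF document-term matrix -> cosine similarity for relevancy scoring.
STOP_WORDS = {"the", "a", "is", "of", "on", "for"}

def preprocess(text):
    """Tokenize by whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

corpus = [
    "the network is trained on the data",
    "the recipe calls for flour",
    "deep network layers learn features",
]
docs = [preprocess(d) for d in corpus]
N = len(docs)
vocab = sorted({t for d in docs for t in d})

def tfidf_vector(tokens):
    vec = []
    for term in vocab:
        tf = tokens.count(term)
        dfreq = sum(term in d for d in docs)
        idf = math.log(N / dfreq) if dfreq else 0.0
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

matrix = [tfidf_vector(d) for d in docs]
query = tfidf_vector(preprocess("network layers"))
scores = [cosine(query, row) for row in matrix]
print(scores)  # the two network documents outrank the recipe one
```

In practice, each stage would be replaced by a library component (for example, a proper stemmer and a learned classifier instead of raw cosine ranking), but the data flow is the same.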
Language modeling and generation
Traditional approaches to language generation tasks often rely on n-gram language models, also known as Markov processes. These are stochastic models that estimate the probability of a word (event) based on a subset of previous words. There are three main types of n-gram models: unigram, bigram, and n-gram (generalized). Let us look at these in more detail:
- Unigram models assume that all words are independent and do not form a chain. The probability of a word in the vocabulary is simply estimated by dividing its count by the total number of words in the corpus.
- Bigram models, also known as first-order Markov processes, estimate the probability of a word based on the previous word. This probability is calculated by the ratio of the joint probability of two consecutive words to the probability of the first word.
- N-gram models, also known as (n-1)th-order Markov processes, estimate the probability of a word based on the previous n-1 words.
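The bigram case above can be sketched with raw counts on a toy corpus (maximum-likelihood estimation, no smoothing; the corpus is illustrative):

```python
from collections import Counter

# Bigram language model sketch: estimate P(w2 | w1) from raw counts
# on a toy corpus (maximum-likelihood estimate, no smoothing).
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("the", "cat"))  # 2 of 3 occurrences of "the" precede "cat"
```

This is exactly the ratio described above: the joint count of two consecutive words divided by the count of the first word. Real systems add smoothing (for example, Laplace or Kneser-Ney) so that unseen bigrams do not receive zero probability.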
We have now reviewed the paradigms underlying traditional NLP models in this Recalling traditional NLP approaches section. Next, we will discuss how neural language models have impacted the field of NLP and how they address the limitations of traditional models.