A word-based language model defines a probability distribution over sequences of words. Given a sequence of words of length m (for example, a sentence), it assigns a probability P(w1, ... , wm) to the full sequence of words. We can use these probabilities as follows:
- To estimate the likelihood of different phrases in NLP applications.
- As a generative model to create new text, since a word-based language model can compute the likelihood of a given word following a sequence of words.
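The generative use above can be sketched in a few lines. The following is a minimal illustration, assuming a hypothetical hand-crafted next-word distribution (a real model would learn these probabilities from data); the table `NEXT_WORD` and the function `sample_next` are names invented for this example.

```python
import random

# Hypothetical conditional distribution P(word | history) for a tiny
# vocabulary; a trained language model would provide these probabilities.
NEXT_WORD = {
    ("the",): {"cat": 0.6, "dog": 0.4},
}

def sample_next(history, rng):
    """Sample the next word given the preceding words."""
    dist = NEXT_WORD[tuple(history)]
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs, k=1)[0]

rng = random.Random(0)
print(sample_next(["the"], rng))  # prints either "cat" or "dog"
```

Repeatedly appending the sampled word to the history and sampling again yields new text, which is exactly how a language model is used generatively.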
Inferring the probability of a long sequence, say w1, ..., wm, directly is typically infeasible. Instead, we can expand the joint probability P(w1, ..., wm) with the chain rule of joint probability (Chapter 1, The Nuts and Bolts of Neural Networks):

P(w1, ..., wm) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wm|w1, ..., wm-1)
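The chain-rule factorization can be made concrete with a short sketch. This is a toy illustration, assuming hand-picked conditional probabilities stored in a lookup table (`TABLE`, `cond_prob`, and `sequence_probability` are names invented here, not from the text):

```python
def sequence_probability(words, cond_prob):
    """Multiply P(w_i | w_1, ..., w_{i-1}) over the sequence (chain rule)."""
    prob = 1.0
    for i, word in enumerate(words):
        history = tuple(words[:i])
        prob *= cond_prob(history, word)
    return prob

# Hypothetical conditional probabilities: (history, word) -> probability.
TABLE = {
    ((), "the"): 0.4,
    (("the",), "cat"): 0.3,
    (("the", "cat"), "sat"): 0.5,
}

def cond_prob(history, word):
    return TABLE.get((history, word), 0.0)

p = sequence_probability(["the", "cat", "sat"], cond_prob)
print(p)  # 0.4 * 0.3 * 0.5 ≈ 0.06
```

Note how each factor conditions on the entire preceding history: the longer the sequence, the longer the histories become, which is precisely why these conditional probabilities are hard to estimate from data.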
The probability of the later words given the earlier words would be especially difficult to estimate from the data. That's why this...