Dealing with sequential data
To produce good next-token predictions, a language model needs to be able to consider a sizeable context, reaching back many words or even sentences.
To demonstrate this, consider the following text:
A solitary tiger stealthily stalks its prey in the dense jungle. The underbrush whispers as it attacks, concealing its advance toward an unsuspecting fawn.
The second sentence in this example contains two pronouns, "it" and "its," both referring back to the tiger introduced in the previous sentence, many words earlier. Without seeing the first sentence, however, you would likely assume that "it" refers to the underbrush, which could lead to a very different sentence ending, such as this one:
The underbrush whispers as it sways gently in the soft breeze.
This shows that long-range context matters for language modeling and next-token prediction. You can construct examples of arbitrary length in which resolving a pronoun depends on context that appears much earlier in the text.
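To make this concrete, here is a minimal sketch in plain Python (using a naive whitespace tokenizer rather than a real subword tokenizer, and window sizes chosen purely for illustration). It builds the context a model would condition on when predicting the token that follows "it" in the example above, for two different context-window sizes. With a short window, the antecedent "tiger" is simply not visible to the model:

```python
# Illustrative sketch: how the context-window size determines what a model
# "sees" when predicting the next token. No ML libraries required.

text = (
    "A solitary tiger stealthily stalks its prey in the dense jungle. "
    "The underbrush whispers as it attacks, concealing its advance "
    "toward an unsuspecting fawn."
)

# Naive whitespace tokenization; real language models use subword tokenizers.
tokens = text.split()

def context_for(tokens, position, window):
    """Return the tokens a model would see when predicting tokens[position]."""
    start = max(0, position - window)
    return tokens[start:position]

# Locate the pronoun "it" in the second sentence.
pronoun_pos = tokens.index("it", tokens.index("whispers"))

# Predict the token that comes right after "it", with two window sizes.
for window in (4, 16):
    ctx = context_for(tokens, pronoun_pos + 1, window)
    print(f"window={window:2d}: antecedent visible: {'tiger' in ctx}")
    print("  context:", " ".join(ctx))
```

With a window of 4 tokens, the context is only "underbrush whispers as it", so nothing tells the model what "it" refers to; with a window of 16 tokens, "tiger" falls inside the context and the pronoun can be resolved.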