This section discusses the data preparation and text preprocessing steps required before the text can be fed into the model as input. How we prepare the data depends on how we intend to model it, which in turn depends on how we intend to use it.
Preparing and cleansing data
Getting ready
The language model will be statistical: given an input sequence of text, it predicts the probability of each word that could come next. The predicted word is then fed back into the model as input to generate the following word.
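The predict-then-feed-back loop can be illustrated with a minimal sketch. It is not the model built in this chapter: a simple count-based table stands in for the model, the function names (train_counts, next_word_probs, generate) and the toy corpus are hypothetical, and always picking the most probable word (greedy selection) is an assumption made for brevity.

```python
from collections import Counter, defaultdict


def train_counts(tokens, context_len):
    """Count which word follows each context of `context_len` words."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts


def next_word_probs(counts, context):
    """Turn the raw counts for one context into a probability distribution."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # context never seen in the training text
    return {word: n / total for word, n in followers.items()}


def generate(counts, seed, n_words):
    """Append the most probable next word, then reuse it as part of the input."""
    context_len = len(seed)
    output = list(seed)
    for _ in range(n_words):
        # The last `context_len` words, including any predicted ones, form the input.
        probs = next_word_probs(counts, output[-context_len:])
        if not probs:
            break  # stop when the context is unknown
        output.append(max(probs, key=probs.get))
    return output


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat ran off".split()
counts = train_counts(corpus, context_len=2)
generated = generate(counts, seed=["sat", "on"], n_words=3)
print(" ".join(generated))
```

Each pass through the loop in generate uses the words produced so far as the next input, which is the feedback step described above. A neural model would replace the count table, but the loop around it would look much the same.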
A key decision is how long the input sequences should be. They need to be long enough for the model to learn the context of the words it has to predict. This input length will also define the...