Key steps in NLP preprocessing
The NLP field has developed a well-established set of techniques for preprocessing text. The key steps are tokenization, lowercase conversion, stop word removal, punctuation removal, stemming, and lemmatization. These steps clean and normalize the text so that it is suitable for modeling and further analysis. Let’s look at each of them in detail.
Tokenization
While we read a sentence as a sequence of individual words, a computer sees it as a single, undivided string. Tokenization is the process of splitting a string into a list of tokens, such as words and punctuation marks. For example, one line of the song “Theme from New York, New York” that we used in Chapter 2, Text Representation, is: “I want to be a part of it, New York, New York.” After tokenization, it becomes a list:
['I', 'want', 'to', 'be', 'a', 'part', 'of', 'it', ',', 'New', 'York', ',', 'New...
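To make the idea concrete, here is a minimal sketch of a tokenizer built only on Python's standard `re` module. The `tokenize` function name and the regular expression are illustrative assumptions, not the method this chapter relies on; dedicated NLP libraries provide more robust tokenizers.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens.

    Hypothetical helper for illustration only.
    """
    # \w+     matches a run of letters, digits, or underscores (a word).
    # [^\w\s] matches one character that is neither a word character
    #         nor whitespace, so each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)


line = "I want to be a part of it, New York, New York."
tokens = tokenize(line)
print(tokens)
```

With this pattern, each comma and the final period become separate tokens, so the result starts with the same tokens as the list above. The pattern is deliberately simple: a contraction such as "don't" would be split at the apostrophe, which is one reason library tokenizers are usually preferred in practice.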