Essentials of NLP
Language has evolved alongside humanity. The development of language allowed better communication between people and tribes. Written language, which began as cave paintings and later developed into characters, allowed information to be distilled, stored, and passed on from generation to generation. Some would even say that the hockey-stick curve of human advancement is due to this ever-accumulating cache of stored information. As this trove grows, the need for computational methods to process and distill it becomes more acute.
In the past decade, many advances were made in image and speech recognition. Advances in Natural Language Processing (NLP) are more recent, though computational methods for NLP have been an area of research for decades. Processing textual data requires many different building blocks upon which advanced models can be built, and some of these building blocks can themselves be quite challenging. This chapter and the next focus on these building blocks and on the problems that can be solved with them using simple models.
In this chapter, we will focus on the basics of pre-processing text and build a simple spam detector. Specifically, we will learn about the following:
- The typical text processing workflow
- Data collection and labeling
- Text normalization, including case normalization, text tokenization, stemming, and lemmatization
- Modeling datasets with normalized text
- Vectorizing text
- Modeling datasets with vectorized text
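As a rough preview of several items in this list, the following minimal sketch lowercases and tokenizes two made-up messages, applies a deliberately crude suffix-stripping step, and turns the result into count vectors. It uses only the Python standard library. The function names `normalize`, `toy_stem`, and `vectorize` and the sample messages are illustrative assumptions, not part of any library, and `toy_stem` is a stand-in for a real stemming algorithm rather than an implementation of one.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def toy_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(docs):
    """Build a sorted vocabulary and one count vector per document."""
    tokenized = [[toy_stem(t) for t in normalize(d)] for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical sample messages, made up for illustration.
docs = ["Win a FREE prize now!", "Are we meeting for lunch?"]
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors)
```

Each vector has one position per vocabulary term, so the two messages become fixed-length numeric inputs that a simple classifier could consume.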
Let's start by getting to grips with the text processing workflow most NLP models use.