Summary
Text preprocessing is a critical step in NLP modeling. Careful preprocessing removes noise from the input text, which generally improves the quality of an NLP model's output. It is a fundamental step that experienced NLP data scientists pay close attention to.
This chapter presented three popular Python libraries for text preprocessing: spaCy, Gensim, and NLTK. For each library, we demonstrated how to perform tokenization, stop-word removal, punctuation removal, stemming, and lemmatization, and we discussed each library's strengths.
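As a recap of how these steps chain together, here is a minimal sketch that uses only the Python standard library. It does not reproduce the spaCy, Gensim, or NLTK code from the chapter. The stop-word list, the suffix list, and every function name below are hypothetical and chosen for illustration, and the suffix stripper is a toy stand-in for a real stemmer such as Porter's. Lemmatization is left out because it needs a dictionary, which is what the libraries provide.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real libraries ship far larger ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Toy suffix list for illustration only; this is not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def remove_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the steps in order: tokenize, drop punctuation, drop stop words, stem."""
    tokens = tokenize(text)
    tokens = remove_punctuation(tokens)
    tokens = remove_stop_words(tokens)
    return [toy_stem(t) for t in tokens]


print(preprocess("The cats jumped, and the dogs are barking in the garden."))
# prints ['cat', 'jump', 'dog', 'bark', 'garden']
```

The order matters in this sketch: stop words are removed before stemming so that the lookup runs against the original word forms. In practice, the library implementations covered in this chapter are preferable to hand-rolled rules like these.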
In the next chapter, we will introduce the concept of latent semantics and latent semantic analysis (LSA).