Summary
In this chapter, you learned the fundamental concepts of NLP, an important subfield of machine learning, including tokenization, stemming and lemmatization, and PoS tagging. We also explored three powerful NLP packages and worked on some common tasks using NLTK and spaCy. Then we continued with the main project, exploring the 20 newsgroups data. We began by extracting features with tokenization techniques and went through text preprocessing, stop word removal, and lemmatization. We then performed dimensionality reduction and visualization with t-SNE, which suggested that count vectorization is a reasonable representation of text data. Finally, we moved on to a more modern representation technique, word embedding, and illustrated how to use a pre-trained embedding model.
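As a compact reminder of the count vectorization and t-SNE steps, here is a minimal sketch using scikit-learn. It is not the chapter's exact code: the toy corpus, the helper name `embed_2d`, and the parameter values are assumptions chosen so the example runs without downloading the 20 newsgroups data, and lemmatization is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Toy corpus standing in for the 20 newsgroups posts (an assumption for illustration).
docs = [
    "The rocket launch was delayed by the space agency",
    "NASA plans a new mission to the moon",
    "The orbit of the satellite was adjusted",
    "Astronauts returned from the space station",
    "The graphics card renders images quickly",
    "A new driver improves 3D rendering performance",
    "The image viewer supports many file formats",
    "Polygon shading depends on the graphics pipeline",
]


def embed_2d(texts, max_features=500, perplexity=3, seed=42):
    """Count-vectorize texts (English stop words removed) and project them to 2D with t-SNE."""
    vectorizer = CountVectorizer(stop_words="english", max_features=max_features)
    # t-SNE is given a dense array here; this is fine for a tiny corpus.
    counts = vectorizer.fit_transform(texts).toarray()
    # Perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=seed)
    return vectorizer, tsne.fit_transform(counts)


vectorizer, points = embed_2d(docs)
print(points.shape)
```

Each row of `points` is the 2D position of one document and can be passed to a scatter plot, colored by topic label. On a corpus this small the layout is not meaningful; the sketch only shows how the pieces fit together.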
We had some fun mining the 20 newsgroups data using dimensionality reduction as an unsupervised approach. In the next chapter, we’ll continue our unsupervised learning journey, specifically...