Summary
In this chapter, you learned how to preprocess textual data, as well as nominal and ordinal categorical data, using state-of-the-art NLP techniques.
You can now build a classical NLP pipeline that performs stop-word removal, lemmatization and stemming, and n-gram extraction, and that counts term occurrences using a bag-of-words model. We used SVD to reduce the dimensionality of the resulting feature vectors and to generate a lower-dimensional topic encoding. One important refinement of the count-based bag-of-words model is to weight terms by their relative frequencies rather than by raw counts. You learned about the tf-idf function and can use it to compute the importance of a word in a document relative to the rest of the corpus.
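As a refresher, here is a minimal sketch of that pipeline, assuming scikit-learn (the exact library, sample documents, and parameters here are illustrative, not the chapter's own code); lemmatization and stemming would typically be applied beforehand with a library such as NLTK or spaCy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Illustrative toy corpus, not from the chapter
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "The stock market rallied today.",
]

# Bag-of-words with stop-word removal and unigrams/bigrams,
# weighted by tf-idf instead of raw term counts
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Truncated SVD (latent semantic analysis) reduces the sparse
# term matrix to a low-dimensional topic encoding
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(X)
print(topics.shape)  # (3, 2): one 2-dimensional topic vector per document
```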
In the following section, we looked at Word2Vec and GloVe, which provide pre-trained dictionaries of numeric word embeddings. You can now easily reuse a pre-trained word embedding in commercial NLP applications, with considerable gains in accuracy due to the semantic information the embeddings encode.
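A minimal sketch of reusing pre-trained GloVe vectors is shown below, here via gensim's downloader (an assumed setup; the chapter may load the embedding files differently):

```python
import gensim.downloader as api

# Downloads 50-dimensional GloVe vectors on first use (~66 MB)
glove = api.load("glove-wiki-gigaword-50")

print(glove["python"].shape)               # (50,): the vector for one word
print(glove.most_similar("king", topn=3))  # semantically close words
```

Because the vectors are simply looked up by word, they can replace a count-based encoding in an existing pipeline with very little code.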
Finally, we finished the chapter by looking...