Summary
In this chapter, we learned about the basic forms of text representation: the BoW, Bag-of-N-grams, and TF-IDF methods. The advantage of BoW is its simplicity. The Bag-of-N-grams method enhances BoW because it captures phrases. TF-IDF can enhance BoW by measuring the importance of a word in a document relative to the entire corpus: words that appear frequently in a document but rarely across the corpus receive high scores in the TF-IDF vector. The common disadvantage of BoW, Bag-of-N-grams, and TF-IDF is that they create very sparse matrices. Also, they do not take into consideration the order of words in a document. In this chapter, we also learned how to perform BoW and TF-IDF in Gensim, scikit-learn, and NLTK.
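As a quick recap, the following is a minimal sketch of the three representations using scikit-learn; the toy corpus is made up for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A tiny illustrative corpus (not from the chapter's dataset)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag-of-Words: each document becomes a vector of raw word counts
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)          # sparse document-term matrix
print(bow.get_feature_names_out())

# Bag-of-N-grams: counting unigrams and bigrams captures short phrases
ngrams = CountVectorizer(ngram_range=(1, 2))
ngram_matrix = ngrams.fit_transform(corpus)

# TF-IDF: words frequent in a document but rare across the corpus score high
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.toarray().round(2))
```

Note how all three matrices are sparse and ignore word order, which is exactly the limitation discussed above.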
As we become more hands-on with text, we'll need to handle words in uppercase and lowercase, as well as documents containing punctuation, numbers, and special characters. We'll also need to distinguish meaningful words from common words and annotate them with grammatical notations...