Text Representation
A computer operates on zeros and ones, and algorithms operate on numerical values. A computer cannot directly understand texts such as the plays of William Shakespeare or the books of Leo Tolstoy, so raw text must be converted to numerical values before a computer can process it. This conversion is the first step in NLP.
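To make the idea of turning text into numbers concrete, here is a minimal sketch in plain Python. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace), and the helper names `build_vocabulary` and `to_count_vector` are hypothetical, chosen for illustration only; they are not part of Gensim, scikit-learn, or NLTK.

```python
from collections import Counter


def build_vocabulary(texts):
    """Give each distinct lowercase token an integer index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():  # assumption: whitespace tokenization
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_count_vector(text, vocab):
    """Represent a text as a list of token counts, one position per vocabulary entry."""
    counts = Counter(text.lower().split())
    # Dictionaries preserve insertion order, so positions match the vocabulary indices.
    return [counts.get(token, 0) for token in vocab]


texts = ["to be or not to be", "to err is human"]
vocab = build_vocabulary(texts)
vectors = [to_count_vector(text, vocab) for text in texts]

print(vocab)    # {'to': 0, 'be': 1, 'or': 2, 'not': 3, 'err': 4, 'is': 5, 'human': 6}
print(vectors)  # [[2, 2, 1, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1]]
```

Each text becomes a fixed-length list of numbers that an algorithm can work with. Real libraries handle tokenization, punctuation, and large vocabularies far more carefully than this sketch does.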
In this chapter, we will learn about the basic text representation methods: Bag-of-Words (BoW), Bag-of-N-grams, and TF-IDF. The chapter is aimed at absolute NLP beginners, and it shows how to code these methods with Gensim, scikit-learn, and NLTK. We will cover the following topics:
- What text representation is
- The transition from one-hot encoding to Bag-of-Words to Bag-of-N-grams
- What TF-IDF is
- How to perform Bag-of-Words (BoW) and TF-IDF encoding in Gensim
- The real-world applications of BoW and TF-IDF
By the end of this chapter, you will be able to describe the BoW, Bag-of-N-grams, and TF-IDF methods...