Building a simple bag-of-words model
In this section, we will look at a surprisingly simple concept to tackle the shortcomings of label encoding for textual data using a technique called bag-of-words, which will build a foundation for a simple NLP pipeline. Don't worry if these techniques look too simple when you read through them; we will gradually build on top of them with tweaks, optimizations, and improvements to build a modern NLP pipeline.
A naïve bag-of-words model using counting
In this section, the main concept that we will build is the bag-of-words model. It is a very simple concept; that is, it involves modeling any document as a collection of words that appear in a given document with the frequency of each word. Hence, we throw away sentence structure, word order, punctuation marks, and more and reduce the documents to a raw count of words. Following this, we can vectorize this word count into a numeric vector representation, which can then be used for ML...