In this chapter, we took baby steps toward understanding the math involved in representing text data as numbers based on some simple heuristics. We explored the Bag-of-Words (BoW) model and built it using the CountVectorizer API provided by the sklearn module. After looking into the limitations associated with CountVectorizer, we tried mitigating them with TfidfVectorizer, which down-weights terms that occur across many documents and gives relatively higher weight to less frequently occurring, more informative terms. We saw that these methods are purely based on lexical analysis and consequently ignore features such as the semantics of words, the co-occurrence of words, and the position of words in a document, among others.
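As a quick refresher, the following is a minimal sketch of both vectorizers applied to a toy corpus. The corpus and variable names are illustrative rather than taken from the chapter's own examples, and it assumes a recent version of scikit-learn that provides get_feature_names_out:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, for illustration only
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats make great pets",
]

# Bag-of-Words: each document becomes a vector of raw term counts
count_vec = CountVectorizer()
bow_matrix = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: common terms such as "the" are down-weighted, while rarer,
# more informative terms receive relatively higher weights
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
print(tfidf_matrix.toarray().round(2))
```

Note that both vectorizers produce sparse matrices with one row per document and one column per vocabulary term; only the weighting of the entries differs.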
The study of these vectorization methods was followed by using the resulting vectors to measure similarity or dissimilarity between documents, with cosine similarity as the measure, which captures the angle between two vectors in n-dimensional space, as illustrated in the sketch below. Finally, we looked into one...
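To make the cosine similarity computation concrete, here is a minimal sketch using an illustrative corpus (again, not the chapter's own example). Cosine similarity is the dot product of two vectors divided by the product of their norms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents
docs = [
    "the stock market rallied today",
    "stocks rallied in the market today",
    "the recipe calls for two eggs",
]

# Vectorize with TF-IDF, then compute pairwise cosine similarity:
# cos(theta) = (A . B) / (||A|| * ||B||)
vectors = TfidfVectorizer().fit_transform(docs)
similarity_matrix = cosine_similarity(vectors)

# Values near 1 mean two documents point in nearly the same direction;
# values near 0 mean they share little vocabulary
print(similarity_matrix.round(2))
```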