In NLP, a very common pipeline can be subdivided into the following steps:
- Collecting documents into a corpus
- Tokenizing, removing stopwords (articles, prepositions, and so on), and stemming (reducing each word to its root form)
- Building a common vocabulary
- Vectorizing the documents
- Classifying or clustering the documents
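As a concrete reference, the following sketch chains these steps together using NLTK and a recent version of scikit-learn; the toy corpus, the Porter stemmer, and the English stopword list are only illustrative choices, and any equivalent tools can be substituted.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# The tokenizer and the stopword list require these NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Step 1: collect the documents into a corpus (toy example)
corpus = [
    'The cat sat on the mat.',
    'Dogs and cats are common household pets.'
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(document):
    # Step 2: tokenize, drop stopwords and punctuation, and stem
    tokens = word_tokenize(document.lower())
    return ' '.join(stemmer.stem(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

processed = [preprocess(doc) for doc in corpus]

# Steps 3 and 4: build a common vocabulary and vectorize the documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed)

print(vectorizer.get_feature_names_out())
print(X.toarray())

# Step 5: X (a document-term matrix) can now be fed to any
# classifier or clustering algorithm
```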
This pipeline is called the Bag-of-Words model and will be discussed in this chapter. A fundamental assumption is that the order of the words in a sentence is not important. In fact, when defining a feature vector, as we're going to see, the measures taken into account are always related to term frequencies, and they are therefore insensitive to the local position of each element. From some viewpoints this is a limitation, because in natural language the internal order of a sentence is necessary to preserve its meaning; however, there are many...