Extracting useful information from text-based data is no easy task. For a basic application such as document classification, a common approach to feature extraction is bag of words (BoW), in which the frequency of occurrence of each word is used as a feature for training the classifier. We will briefly discuss BoW in the following section, along with the tf-idf approach, which is intended to reflect how important a word is to a document in a collection or corpus.
Traditional NLP
Bag of words
BoW is mainly used for categorizing documents, and it is also used in computer vision. The idea is to represent a document as a bag (a multiset) of its words, keeping how often each word occurs while disregarding grammar and word order.
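To make the idea concrete, here is a minimal sketch of BoW in plain Python. It assumes lowercasing and whitespace splitting as a stand-in for real tokenization, and the function name bag_of_words and the sample sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    # Simplifying assumption: lowercase + whitespace split stands in for
    # a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]

    # The vocabulary is every distinct word seen across all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing word counts as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents; the last two differ only in word order.
docs = ["the cat sat on the mat", "the dog sat", "sat dog the"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)
for vector in vectors:
    print(vector)
```

The last two sample documents contain the same words in a different order, so they map to the same count vector, which is exactly the sense in which BoW disregards word order. In practice a library vectorizer would likely replace this hand-rolled version.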
After the preprocessing...