Preprocessing – similarity measured as similar number of common words
As we have seen previously, the bag-of-words approach is both fast and robust. However, it is not without challenges. Let's dive directly into them.
Converting raw text into a bag-of-words
We do not have to write custom code for counting words and representing those counts as a vector. Scikit's CountVectorizer does the job very efficiently and also has a very convenient interface. Scikit's functions and classes are imported via the sklearn package as follows:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
The parameter min_df determines how CountVectorizer treats words that are not used frequently (minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that value will be dropped. If it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The parameter max_df works in...