As we have seen earlier, the bag-of-words approach is both fast and robust. It is, however, not without challenges. Let's dive directly into them.
Preprocessing – measuring similarity by the number of common words
Converting raw text into a bag of words
We do not have to write custom code for counting words and representing those counts as a vector. scikit-learn's CountVectorizer not only does the job efficiently but also provides a very convenient interface:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
The min_df parameter determines how CountVectorizer treats words that occur only rarely (its minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that value will be dropped. If it is set to a fraction, all words occurring in less than that fraction of the overall dataset will be dropped.
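As a quick sketch of this behavior (the two toy sentences and the variable names below are made up purely for illustration), we can fit the vectorizer on a tiny corpus, inspect the vocabulary it learned, and then refit with min_df=2 to watch the rarely occurring words being dropped:

>>> content = ["How to format my hard disk", "Hard disk format problems"]
>>> X = vectorizer.fit_transform(content)   # learn the vocabulary and count the words
>>> sorted(vectorizer.vocabulary_)          # with min_df=1, every word is kept
['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
>>> print(X.toarray())                      # one row per document, one column per word
[[1 1 1 1 1 0 1]
 [1 1 1 0 0 1 0]]
>>> stricter = CountVectorizer(min_df=2).fit(content)
>>> sorted(stricter.vocabulary_)            # words appearing in fewer than two documents are dropped
['disk', 'format', 'hard']

Leaving min_df at 1 simply retains every word; raising it is a cheap way to prune typos and very rare terms before they inflate the feature space.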