Building stories
Simhash should be used to detect near-duplicate articles only. Extending our search to a 3-bit or 4-bit difference becomes terribly inefficient (3-bit difference requires 5,488 distinct queries to Cassandra while 41,448 queries will be needed to detect up to 4-bit differences) and seems to bring much more noise than related articles. Should the user want to build larger stories, a typical clustering technique must be applied then.
Building term frequency vectors
We will start grouping events into stories using a KMeans algorithm, taking the articles' word frequencies as input vectors. TF-IDF is simple, efficient, and a proven technique to build vectors out of text content. The basic idea is to compute a word frequency that we normalize using the inverse document frequency across the dataset, hence decreasing the weight on common words (such as stop words) while increasing the weight of words specific to the definition of a document. Its implementation is part of the...