Jaccard similarity for large sets with MinHash
The Bloom filter is a probabilistic data structure to determine whether an item is a member of a set. While comparing user or item similarities, what we are usually interested in is the intersection between sets, as opposed to their precise contents. MinHash is a technique that enables a large set to be compressed in such a way that we can still perform the Jaccard similarity on the compressed representations.
Let's see how it works with a reference to two of the most prolific raters in the MovieLens dataset. Users 405 and 655 have rated 727 and 675 movies respectively. In the following code, we extract their ratings and convert them into sets before passing to Incanter's jaccard-index
function. Recall that this returns the ratio of movies they've both rated out of all the movies they've rated:
(defn rated-items [user-ratings id] (->> (get user-ratings id) (map :item))) (defn ex-7-22 [] (let [ratings ...