Dimensionality reduction
What algorithms such as MinHash and LSH aim to do is reduce the quantity of data that must be stored without compromising on the essence of the original. They're a form of compression and they define helpful representations that preserve our ability to do useful work. In particular, MinHash and LSH are designed to work with data that can be represented as a set.
In fact, there is a whole class of dimensionality-reducing algorithms that will work with data that is not so easily represented as a set. We saw, in the previous chapter with k-means clustering, how certain data could be most usefully represented as a weighted vector. Common approaches to reduce the dimensions of data represented as vectors are principle component analysis and singular-value decomposition. To demonstrate these, we'll return to Incanter and make use of one of its included datasets: the Iris dataset:
(defn ex-7-27 [] (i/view (d/get-dataset :iris)))
The previous code should return...