Analytic Signatures
In Chapter 4, Dimension Reduction, we discussed dimension reduction: methods that express data succinctly in ways that give us insight into the data. A hash function, discussed previously, is another way to accomplish dimension reduction. Hash functions are effective for many purposes, including the file verification use case we discussed. In that scenario, we wanted to determine whether two scripts were exactly the same. Even a slight difference in the data, such as changing the word "take" to "make," could completely corrupt the intended message, so exactness was required.
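As a minimal sketch of the exactness property described above, the following Python snippet hashes two messages that differ only in the word "take" versus "make." SHA-256 from the standard library is an assumption here, since the text does not say which hash function was used; the two messages and the helper name sha256_hex are hypothetical and chosen only for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of `text` as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical messages that differ by a single letter.
original = "Please take the last train home."
altered = "Please make the last train home."

digest_original = sha256_hex(original)
digest_altered = sha256_hex(altered)

# Count the hex positions at which the two digests disagree.
differing_positions = sum(
    a != b for a, b in zip(digest_original, digest_altered)
)

print("original:", digest_original)
print("altered: ", digest_altered)
print("digests identical:", digest_original == digest_altered)
print("differing hex positions:", differing_positions, "of", len(digest_original))
```

Comparing digests in this way answers only one question: are the two inputs exactly the same? The digests of two nearly identical messages look unrelated, so the comparison says nothing about how similar the inputs are.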
In other cases, we may want to make meaningful comparisons between datasets without requiring that they be exactly identical. Consider the case of detecting copyright violations. Suppose that a website hosts images submitted by its users. It wants to ensure that users are not submitting images that are protected by copyright....