Now that we've mastered how to generate cryptographic hashes, let's work on generating fuzzy hashes. We'll discuss a few techniques for similarity analysis and walk through a basic example of how ssdeep and spamsum use rolling hashing to generate more resilient signatures.
It may go without saying that the most accurate approach to similarity analysis is to compare the byte content of two files, side by side, and look for differences. While we can accomplish this with command-line tools or a difference analysis tool (such as kdiff3), the approach only works at a small scale. Once we move from comparing two small files to comparing many small files, or even a few medium-sized files, we need something more efficient. This is where signature generation comes into play.
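To make the side-by-side comparison concrete, here is a minimal sketch of a byte-level comparison in Python. The function names `count_byte_differences` and `byte_similarity` are hypothetical helpers introduced for illustration, and the scoring rule (bytes present in only one input count as differences) is an assumption rather than a standard metric.

```python
def count_byte_differences(data_a: bytes, data_b: bytes) -> int:
    """Count the positions at which two byte strings differ.

    Assumption: bytes that exist in only one input (because the inputs
    have different lengths) each count as one difference.
    """
    # zip() stops at the shorter input, so this covers the shared prefix length.
    differing = sum(1 for a, b in zip(data_a, data_b) if a != b)
    # Any leftover bytes in the longer input have nothing to match against.
    return differing + abs(len(data_a) - len(data_b))


def byte_similarity(data_a: bytes, data_b: bytes) -> float:
    """Return the fraction of matching positions, from 0.0 to 1.0."""
    longest = max(len(data_a), len(data_b))
    if longest == 0:
        # Two empty inputs are treated as identical.
        return 1.0
    return 1.0 - count_byte_differences(data_a, data_b) / longest


if __name__ == "__main__":
    original = b"hello world"
    modified = b"hello w0rld"
    print(count_byte_differences(original, modified))
    print(byte_similarity(original, modified))
```

To compare files, we could read each one with `open(path, "rb").read()` and pass the results to these functions. Note that the comparison is positional: a single byte inserted near the start of a file shifts everything after it, so nearly every later position is reported as different. It also needs the full content of both files for every pair we compare, which is part of why it does not scale.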
To generate a signature, we must have a few...