From a machine learning point of view, raw text is useless. Only once we transform it into meaningful numbers can we feed it into machine learning algorithms such as clustering. The same holds for more mundane operations on text, such as similarity measurement.
Measuring the relatedness of posts
How not to do it
One text similarity measure is the Levenshtein distance, also known as the edit distance. Let's say we have two words, machine and mchiene. The distance between them is the minimum number of single-character edits (insertions, deletions, or substitutions) necessary to turn one word into the other. In this case, the edit distance is two: starting from mchiene, we have to add an a after the m and delete the first e. This algorithm is...
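To make the edit distance concrete, here is a minimal sketch of the standard dynamic-programming formulation in plain Python. The function name levenshtein and its row-by-row layout are illustrative assumptions and are not taken from the text; the sketch only reproduces the machine/mchiene example above.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the first j characters of b. Initially that prefix is empty,
    # so reaching b[:j] takes j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("mchiene", "machine"))  # prints 2
```

The nested loops visit every pair of character positions, so the running time grows with the product of the two string lengths.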