In this recipe, we want to decide whether two strings are similar, given a representation of each. We'll improve the way strings are represented so that comparisons between them become more meaningful. But first, we'll establish a baseline using more traditional string comparison algorithms.
We'll do the following: given a dataset of paired string matches, we'll first try out different functions for measuring string similarity, then a bag-of-characters representation, and finally a Siamese neural network (also called a twin neural network) that reduces the dimensionality of the string representation. The twin network will learn a latent similarity space of strings based on character n-gram frequencies.
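To make the first two steps concrete, here is a minimal sketch using only the Python standard library. It is an illustration under our own assumptions rather than the recipe's actual implementation: the helper names (`levenshtein`, `char_ngrams`, `cosine_similarity`, `ngram_similarity`), the choice of bigrams (`n=2`), and the example strings are all hypothetical. It shows two traditional baselines (edit distance and `difflib`'s matching ratio) next to a character n-gram count representation compared by cosine similarity.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (n=2 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(x: Counter, y: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * y[gram] for gram, count in x.items())
    norm_x = sqrt(sum(count * count for count in x.values()))
    norm_y = sqrt(sum(count * count for count in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # at least one string is too short to yield n-grams
    return dot / (norm_x * norm_y)


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Case-insensitive similarity of two strings via n-gram counts."""
    return cosine_similarity(char_ngrams(a.lower(), n), char_ngrams(b.lower(), n))


if __name__ == "__main__":
    # Hypothetical example pair; a real run would iterate over the paired dataset.
    left, right = "Jonathan Smith", "Jon Smyth"
    print("edit distance:   ", levenshtein(left, right))
    print("difflib ratio:   ", SequenceMatcher(None, left, right).ratio())
    print("bigram cosine:   ", ngram_similarity(left, right))
```

The n-gram count vectors produced by a function like `char_ngrams` are the kind of input that a twin network could later project into a lower-dimensional space; the fixed similarity functions above only serve as a baseline to compare against.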