Unsupervised learning with co-localization
The first layers of the digit classifier trained in Chapter 2, Classifying Handwritten Digits with a Feedforward Network as an encoding function to represent the image in an embedding space, as for words:
It is possible to train unsurprisingly the localization network of the spatial transformer network by minimizing the hinge loss objective function on random sets of two images supposed to contain the same digit:
Minimizing this sum leads to modifying the weights in the localization network, so that two localized digits become closer than two random crops.
Here are the results: