A localization network
In Spatial Transformer Networks (STN), instead of applying the classification network directly to the input image, the idea is to add a module that preprocesses the image — cropping, rotating, and scaling it to fit the object — in order to assist classification:

Spatial Transformer Networks
For that purpose, STNs use a localization network to predict the affine transformation parameters and process the input:

Spatial transformer networks
In Theano, differentiation through the affine transformation is automatic: we simply have to connect the localization network to the input of the classification network through the affine transformation.
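To make the transformation concrete, here is a minimal NumPy sketch of what the spatial transformer module computes: a grid generator maps output coordinates through the predicted affine matrix theta, and a bilinear sampler reads the input image at the resulting source coordinates. The function names (`affine_grid`, `bilinear_sample`) are illustrative, not part of Lasagne's API; in the real network, Theano expresses these same operations symbolically so that gradients flow back into the localization net.

```python
import numpy as np

def affine_grid(theta, height, width):
    # theta: (2, 3) affine matrix predicted by the localization net.
    # Build a regular target grid in normalized coordinates [-1, 1].
    xs = np.linspace(-1, 1, width)
    ys = np.linspace(-1, 1, height)
    x_t, y_t = np.meshgrid(xs, ys)
    grid = np.stack([x_t.ravel(), y_t.ravel(), np.ones(height * width)])
    # Source coordinates are the affine image of the target grid.
    src = theta @ grid  # shape (2, height*width)
    return src[0].reshape(height, width), src[1].reshape(height, width)

def bilinear_sample(img, x_s, y_s):
    h, w = img.shape
    # Map normalized coordinates back to pixel indices.
    x = (x_s + 1) * (w - 1) / 2
    y = (y_s + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    wx, wy = x - x0, y - y0
    # Weighted average of the four neighboring pixels.
    return ((1 - wx) * (1 - wy) * img[y0, x0]
            + wx * (1 - wy) * img[y0, x0 + 1]
            + (1 - wx) * wy * img[y0 + 1, x0]
            + wx * wy * img[y0 + 1, x0 + 1])

# Sanity check: the identity transform leaves the image unchanged.
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
img = np.arange(16.0).reshape(4, 4)
x_s, y_s = affine_grid(theta_id, 4, 4)
warped = bilinear_sample(img, x_s, y_s)
```

Because every step is a differentiable function of theta, the classification loss can be backpropagated through the sampler into the localization network's parameters.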
First, we create a localization network, similar to the MNIST CNN model, to predict the six parameters of the affine transformation:
l_in = lasagne.layers.InputLayer((None, dim, dim))
l_dim = lasagne.layers.DimshuffleLayer(l_in, (0, 'x', 1, 2))
l_pool0_loc = lasagne.layers.MaxPool2DLayer(l_dim, pool_size=(2, 2))
l_dense_loc = mnist_cnn.model(l_pool0_loc, input_dim...
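One detail worth noting (an assumption here, since the snippet above is truncated): a common STN practice is to initialize the final layer that produces the six parameters with zero weights and an identity bias, so that at the start of training the module applies no transformation at all. A tiny NumPy sketch of that idea, with a hypothetical fan-in of 128:

```python
import numpy as np

# Hypothetical final dense layer of the localization net:
# zero weights plus an identity bias yield the identity transform.
fan_in = 128
W = np.zeros((fan_in, 6))                 # weights initialized to zero
b = np.array([1., 0., 0., 0., 1., 0.])    # bias = flattened identity affine

h = np.random.randn(fan_in)               # features from earlier layers
theta = (h @ W + b).reshape(2, 3)         # predicted affine parameters
# theta == [[1, 0, 0], [0, 1, 0]] regardless of the input features
```

Starting from the identity lets the classifier train on unmodified images while the localization net gradually learns useful crops and rotations.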