Relation networks consist of two important functions: an embedding function and a relation function. The embedding function extracts features from the input. If our input is an image, we can use a convolutional network as the embedding function, which gives us the feature vector (embedding) of the image. If our input is text, we can use an LSTM network to get the embedding of the text. Let's say we have a support set containing three classes, {lion, elephant, dog}, as shown below:
And let's say we have a query image, as shown in the following diagram, and we want to predict the class of this query image:
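The setup described so far can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the network itself. The names `embed`, `relation`, `INPUT_DIM`, and `EMBED_DIM` are hypothetical, and the "images" are short lists of random numbers. A random linear map with a ReLU stands in for the convolutional embedding network. The relation function here scores a concatenated pair of embeddings, which follows the usual relation network design rather than anything stated above. Because the weights are untrained, the predicted class is arbitrary; the point is only how data flows through the two functions.

```python
import math
import random

random.seed(0)

INPUT_DIM = 8   # length of a flattened toy "image" (assumption)
EMBED_DIM = 4   # size of the embedding vector (assumption)

# Random weights standing in for a trained convolutional embedding network.
EMBED_W = [[random.uniform(-1, 1) for _ in range(INPUT_DIM)]
           for _ in range(EMBED_DIM)]
# Random weights standing in for a trained relation network.
RELATION_W = [random.uniform(-1, 1) for _ in range(2 * EMBED_DIM)]


def embed(image):
    """Toy embedding function: a linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, image)))
            for row in EMBED_W]


def relation(support_embedding, query_embedding):
    """Toy relation function: squash a score for the concatenated pair."""
    pair = support_embedding + query_embedding  # list concatenation
    logit = sum(w * v for w, v in zip(RELATION_W, pair))
    return 1.0 / (1.0 + math.exp(-logit))


# One toy image per class in the support set, plus one query image.
support_set = {
    name: [random.random() for _ in range(INPUT_DIM)]
    for name in ("lion", "elephant", "dog")
}
query_image = [random.random() for _ in range(INPUT_DIM)]

# Embed the query once, then score it against each support image.
query_embedding = embed(query_image)
scores = {
    name: relation(embed(image), query_embedding)
    for name, image in support_set.items()
}
predicted_class = max(scores, key=scores.get)
print(scores, predicted_class)
```

In a real model, both functions would be trained neural networks, and the class with the highest relation score would be the prediction for the query image.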
First, we take each image from the support set and pass it to the embedding function to extract its features. Since our support set has images, we can use a convolutional network as our embedding...