Relation networks consist of two important functions: an embedding function, denoted by $f_\varphi$, and a relation function, denoted by $g_\phi$. The embedding function extracts features from the input. If the input is an image, we can use a convolutional network as the embedding function, which gives us the feature vector (embedding) of the image. If the input is text, we can use an LSTM network to get the embedding of the text. Let's say we have a support set containing three classes, {lion, elephant, dog}, as shown below:
[Figure: support set with the three classes lion, elephant, and dog]
And let's say we have a query image, $x_j$, as shown in the following diagram, and we want to predict the class of this query image:
[Figure: query image]
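To make the embedding function concrete, here is a minimal sketch of a convolution-style feature extractor applied to a support set and a query image. It is an illustration under stated assumptions, not the network used in practice: the random 8x8 arrays stand in for real images, the four 3x3 filters are untrained, and the names `conv2d_valid`, `embed`, `support_set`, and `query_image` are hypothetical. A real embedding function would be a trained multi-layer convolutional network.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


def embed(image, kernels):
    """Toy embedding: convolution, ReLU, then global average pooling per kernel."""
    features = [np.maximum(conv2d_valid(image, k), 0.0).mean() for k in kernels]
    return np.array(features)


# Assumptions for illustration only: random arrays stand in for images,
# and the filters are random rather than learned.
kernels = rng.normal(size=(4, 3, 3))
support_set = {name: rng.random((8, 8)) for name in ("lion", "elephant", "dog")}
query_image = rng.random((8, 8))

# One embedding vector per support class, plus one for the query image.
support_embeddings = {name: embed(img, kernels) for name, img in support_set.items()}
query_embedding = embed(query_image, kernels)
```

Each image is mapped to a vector with one entry per filter, so the support images and the query image end up in the same feature space, which is what allows them to be compared later.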
First, we take each image, $x_i$, from the support set and pass it to the embedding function, $f_\varphi$, to extract the features. Since our support set has images, we can use a convolutional network as our embedding...