Understanding capsules
In traditional CNNs, we define different filters that run over the entire image. The 2D matrices produced by each filter are stacked on top of one another to constitute the output of a convolutional layer. Subsequently, we perform the max pooling operation to find the invariance in activities. Invariance here implies that the output is robust to small changes in the input as the max pooling operation always picks up the max activity. As mentioned previously, max pooling results in the valuable loss of information and is unable to represent the relative orientation of different objects to others in the image.
Capsules, on the other hand, encode all of the information of the objects they are detecting in a vector form as opposed to a scalar output by a neuron. These vectors have the following properties:
- The length of the vector indicates the probability of an object in the image.
- Different elements of the vector encode different properties of the object. These properties...