The second stage of Faster R-CNN accepts the feature maps from the first stage, as well as the list of RoIs. For each RoI, fully connected layers are applied to obtain class predictions and bounding box refinement information. The operations are represented here:
Figure 5.12: Architecture summary of Faster R-CNN
Step by step, the process is as follows:
- Accept the feature maps and the RoIs from the RPN step. The RoIs are generated in the coordinate system of the original image, so they are converted into the coordinate system of the feature maps. In our example, the stride of the CNN is 16, so their coordinates are divided by 16.
- Resize each RoI to a fixed size so that it fits the input of the fully connected layers.
- Apply the fully connected layer. It is very similar to the final layers of any convolutional network, and it yields a feature vector for each RoI.
- Apply two different fully connected layers to the feature vector. One handles the classification (called cls) and the other handles the refinement of the RoI (called rgs).
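The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the reference implementation: the weights are random, and the 7 × 7 pooled size, the hidden size of 32, the three object classes plus one background class, and the helper names `roi_to_feature_coords` and `roi_max_pool` are all assumptions made for the example. Only the stride of 16 comes from the text. The resizing step is done here with a simple max pooling over a grid, which is one possible way to obtain a fixed-size input.

```python
import numpy as np

STRIDE = 16      # stride of the CNN, as in the text
POOL_SIZE = 7    # assumed side of the fixed-size grid fed to the fully connected layer


def roi_to_feature_coords(roi, stride=STRIDE):
    """Convert an (x1, y1, x2, y2) box from image pixels to feature map cells."""
    x1, y1, x2, y2 = roi
    # Floor the top-left corner and ceil the bottom-right corner,
    # then make sure the box spans at least one cell in each direction.
    fx1, fy1 = int(np.floor(x1 / stride)), int(np.floor(y1 / stride))
    fx2, fy2 = int(np.ceil(x2 / stride)), int(np.ceil(y2 / stride))
    return fx1, fy1, max(fx2, fx1 + 1), max(fy2, fy1 + 1)


def roi_max_pool(feature_map, box, out_size=POOL_SIZE):
    """Resize one RoI of an (H, W, C) feature map to (out_size, out_size, C)."""
    x1, y1, x2, y2 = box
    region = feature_map[y1:y2, x1:x2, :]
    h, w, c = region.shape
    # Bin edges that split the region into an out_size x out_size grid.
    ys = np.linspace(0, h, out_size + 1)
    xs = np.linspace(0, w, out_size + 1)
    pooled = np.zeros((out_size, out_size, c), dtype=feature_map.dtype)
    for i in range(out_size):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)
        for j in range(out_size):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            # Keep the strongest activation of each bin, channel by channel.
            pooled[i, j] = region[y_lo:y_hi, x_lo:x_hi].max(axis=(0, 1))
    return pooled


rng = np.random.default_rng(0)
num_classes, channels, hidden = 3, 8, 32   # hypothetical sizes

# Stand-in for the first-stage output: a 480 x 640 image with stride 16
# gives a 30 x 40 feature map.
feature_map = rng.standard_normal((30, 40, channels)).astype(np.float32)
roi = (64, 32, 320, 256)                   # one RoI in image coordinates

# Step 1: convert the RoI to feature map coordinates.
box = roi_to_feature_coords(roi)
# Step 2: resize the RoI to a fixed size.
pooled = roi_max_pool(feature_map, box)
# Step 3: fully connected layer (with ReLU) producing a feature vector.
w_fc = rng.standard_normal((POOL_SIZE * POOL_SIZE * channels, hidden)) * 0.01
features = np.maximum(pooled.reshape(-1) @ w_fc, 0.0)
# Step 4: two heads, one for classification and one for box refinement.
w_cls = rng.standard_normal((hidden, num_classes + 1)) * 0.01  # +1: assumed background class
w_rgs = rng.standard_normal((hidden, 4 * num_classes)) * 0.01  # assumed 4 offsets per class
cls_scores = features @ w_cls
box_deltas = features @ w_rgs
```

Whatever the size of the RoI, `pooled` always has the same shape, which is what allows a single fully connected layer to process every RoI.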