In deep convolutional neural networks, the depth of a layer determines the level of the feature maps it produces, and this feature hierarchy has driven major improvements across many visual tasks.
However, multi-person pose estimation still suffers from the trade-off between low-level feature maps, which preserve spatial detail, and high-level feature maps, which carry semantic information. Since channel information with different characteristics can complement and reinforce one another, the researchers propose the Channel Shuffle Module (CSM) to explicitly model the interdependencies between the low-level and high-level feature maps.
Image Source: Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information
Let the pyramid features extracted from the ResNet backbone be denoted Conv-2∼5 (as shown in the figure). Conv-3∼5 are first upsampled to the same resolution as Conv-2, and the resulting feature maps are concatenated together.
A channel shuffle operation is then performed on the concatenated features to fuse the complementary channel information across the different levels. The shuffled features are then split and separately downsampled back to their original resolutions, yielding the features denoted C-Conv-2∼5.
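The channel shuffle operation itself can be implemented as a reshape-transpose-reshape, in the style popularized by ShuffleNet. Here is a minimal PyTorch sketch for illustration (the group count is an assumption, not a value taken from the paper):

```python
import torch

def channel_shuffle(x, groups):
    """Shuffle channels by grouping them, transposing the group and
    per-group channel dimensions, and flattening back.
    Assumes the channel count is divisible by `groups`."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel dims
    return x.view(n, c, h, w)                 # flatten back to (n, c, h, w)
```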
Next, a 1×1 convolution is applied to further fuse C-Conv-2∼5, producing the shuffled features denoted S-Conv-2∼5. Finally, the shuffled feature maps S-Conv-2∼5 are concatenated with the original pyramid feature maps Conv-2∼5 to obtain the final enhanced pyramid feature representations.
These enhanced pyramid feature maps thus contain both the information from the original pyramid features and the fused cross-channel information from the shuffled pyramid feature maps.
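Putting these steps together, a simplified sketch of the whole module might look as follows, reusing the `channel_shuffle` helper above. The channel width, bilinear resampling, and group count are assumptions for illustration; the paper's exact configuration may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelShuffleModule(nn.Module):
    """Simplified CSM sketch: upsample, concatenate, shuffle, split,
    downsample, fuse with 1x1 convs, and concatenate with the originals."""

    def __init__(self, channels=256, groups=4):
        super().__init__()
        self.groups = groups
        # one 1x1 conv per pyramid level to fuse the shuffled features
        self.fuse = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(4))

    def forward(self, conv2, conv3, conv4, conv5):
        feats = (conv2, conv3, conv4, conv5)
        # 1) upsample Conv-3~5 to Conv-2's resolution and concatenate
        ups = [conv2] + [F.interpolate(f, size=conv2.shape[-2:], mode='bilinear',
                                       align_corners=False) for f in feats[1:]]
        # 2) shuffle channels across levels
        mixed = channel_shuffle(torch.cat(ups, dim=1), self.groups)
        # 3) split back into four levels and downsample to the original
        #    resolutions -> C-Conv-2~5
        c_convs = [F.interpolate(s, size=f.shape[-2:], mode='bilinear',
                                 align_corners=False)
                   for s, f in zip(torch.chunk(mixed, 4, dim=1), feats)]
        # 4) 1x1 convs produce the shuffled features S-Conv-2~5
        s_convs = [conv(c) for conv, c in zip(self.fuse, c_convs)]
        # 5) concatenate S-Conv-2~5 with the original Conv-2~5 per level
        return [torch.cat([s, f], dim=1) for s, f in zip(s_convs, feats)]
```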
Building on the enhanced pyramid feature representations described above, the researchers introduce the Attention Residual Bottleneck (ARB), which strengthens the feature responses in both the spatial and the channel-wise context.
Image Source: Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information
The figure shows the schema of the original Residual Bottleneck alongside the Spatial, Channel-wise Attention Residual Bottleneck, which is composed of a spatial attention block and a channel-wise attention block; the dashed links indicate identity mappings. The ARB learns the spatial attention weights β and the channel-wise attention weights α.
Using the whole feature maps indiscriminately leads to sub-optimal results because irrelevant regions contribute noise, whereas the spatial attention mechanism highlights the task-related regions in the feature maps.
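To make the idea concrete, here is a rough PyTorch sketch of an attention residual bottleneck. The attention sub-networks are simplified assumptions (spatial attention as a 1×1 convolution with a sigmoid, channel-wise attention as SENet-style global pooling), not the paper's exact design:

```python
import torch
import torch.nn as nn

class SCARB(nn.Module):
    """Sketch of a Spatial, Channel-wise Attention Residual Bottleneck.
    order='sc' applies spatial then channel-wise attention (SCARB);
    order='cs' reverses them (CSARB)."""

    def __init__(self, channels, reduction=16, order='sc'):
        super().__init__()
        self.order = order
        mid = channels // 4
        # standard bottleneck: 1x1 -> 3x3 -> 1x1
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        # spatial attention beta: one weight per spatial location
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        # channel-wise attention alpha: global pooling -> bottleneck MLP
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bottleneck(x)
        if self.order == 'sc':
            out = out * self.spatial(out)  # beta: highlight task-related regions
            out = out * self.channel(out)  # alpha: reweight channels
        else:
            out = out * self.channel(out)
            out = out * self.spatial(out)
        return self.relu(out + x)          # identity mapping (the dashed link)
```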
The team evaluates the models on the challenging COCO keypoint benchmark, training them on the COCO dataset, which includes 57K images and 150K person instances, with no extra data involved. Ablation studies are validated on the COCO minival dataset, and the final results are reported on the COCO test-dev dataset against published state-of-the-art results. The experiments use the official evaluation metric, OKS-based AP (average precision), where OKS (object keypoint similarity) measures the similarity between the ground-truth pose and the predicted pose.
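For reference, OKS for a single person is defined over the keypoints $i$, where $d_i$ is the Euclidean distance between the predicted and ground-truth keypoint, $s$ is the object scale, $k_i$ is a per-keypoint constant, $v_i$ is the visibility flag, and $\delta$ is the indicator function:

```latex
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}
```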
In the Spatial, Channel-wise Attention Residual Bottleneck (SCARB) experiment, the team explores the effect of the order in which the spatial attention and the channel-wise attention are applied within the Attention Residual Bottleneck, i.e., SCARB (spatial first) versus CSARB (channel-wise first).
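In terms of the sketch above, the two variants differ only in the order flag, as in this hypothetical usage:

```python
scarb = SCARB(channels=256, order='sc')   # spatial attention first, then channel-wise
csarb = SCARB(channels=256, order='cs')   # channel-wise attention first, then spatial
out = scarb(torch.randn(1, 256, 64, 48))  # e.g. a 64x48 pyramid feature map
```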
To learn more, check out the paper, Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information.