Recent years have witnessed a surge in demand for applications of visual object recognition, for instance, in self-driving cars and content-based image search. This demand has been fueled by the astonishing progress of convolutional networks (CNNs), whose state-of-the-art models may even have surpassed human-level performance. However, most of these are complex models with high computational demands at inference time. In real-world applications, computation is never free; it directly translates into power consumption, which should be minimized for environmental and economic reasons.
Ideally, all systems should automatically use small networks when test images are easy or computational resources are limited and use big networks when test images are hard or computation is abundant.
To enable resource-efficient image recognition, the authors aim to develop CNNs that slice the computation and process these slices one by one, stopping the evaluation once the CPU time is depleted or the classification is sufficiently certain. Unfortunately, CNNs learn the data representation and the classifier jointly, which leads to two problems: intermediate classifiers alter the internal representation on which later classifiers depend, and early layers lack the coarse-scale features needed for accurate classification.
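To make the early-exit idea concrete, here is a minimal PyTorch-style sketch of budgeted inference. The names `budgeted_predict`, `stages`, `classifiers`, and `budget_s` are illustrative assumptions, not the authors' API; the stopping rule (confidence threshold or elapsed time) mirrors the behavior described above.

```python
import time

import torch


def budgeted_predict(stages, classifiers, x, budget_s, threshold=0.9):
    """Evaluate network slices one by one, exiting as soon as the time
    budget is spent or the prediction is sufficiently certain.

    stages/classifiers are assumed to be matched lists of modules; this
    is a sketch, not the paper's exact procedure.
    """
    start = time.time()
    logits, h = None, x
    for stage, clf in zip(stages, classifiers):
        h = stage(h)                  # process the next slice of computation
        logits = clf(h)               # intermediate prediction from this slice
        confidence = torch.softmax(logits, dim=1).max().item()
        if confidence >= threshold or (time.time() - start) >= budget_s:
            break                     # budget depleted or sufficiently certain
    return logits
```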
The authors propose a novel network architecture that addresses both problems through careful design changes, allowing for resource-efficient image classification.
The model is based on a multi-scale convolutional neural network similar to the neural fabric, but with dense connections and a classifier at each layer. This novel architecture, called the Multi-Scale DenseNet (MSDNet), addresses both of the problems described above (intermediate classifiers altering the internal representation, and the lack of coarse-scale features in early layers) to enable resource-efficient image classification.
The network attaches a cascade of intermediate classifiers throughout its depth. The first problem is addressed through dense connectivity: by connecting all layers to all classifiers, features are no longer dominated by the most imminent early exit, and the trade-off between early and later classification can be handled elegantly as part of the loss function.
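One plausible form of such a loss is a weighted sum of per-classifier cross-entropy terms, sketched below; `cascade_loss` and `weights` are hypothetical names introduced here for illustration, and the specific weighting scheme is an assumption rather than the paper's exact formulation.

```python
import torch.nn.functional as F


def cascade_loss(all_logits, target, weights):
    """Weighted sum of cross-entropy losses over every intermediate
    classifier; the weights encode the early-vs-late trade-off."""
    return sum(w * F.cross_entropy(logits, target)
               for w, logits in zip(weights, all_logits))
```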
The second problem is addressed by adopting a multi-scale network structure: at each layer, features of all scales (fine to coarse) are produced. The coarse features facilitate good classification early on, while the fine, low-level features only become useful after several more layers of processing.
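The sketch below illustrates one way a layer can maintain features at every scale, with each coarser scale fed both by a same-scale convolution and by a strided convolution from the next-finer scale. `MultiScaleBlock`, `channels`, and `num_scales` are illustrative names; for brevity it assumes equal channel width per scale and sums feature maps instead of using the dense concatenation the architecture actually relies on.

```python
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    """One layer that keeps feature maps at several scales (finest first).

    Each coarser scale combines a same-scale convolution with a strided
    convolution applied to the next-finer scale; a sketch, not MSDNet.
    """
    def __init__(self, channels, num_scales=3):
        super().__init__()
        self.same = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_scales))
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_scales - 1))

    def forward(self, feats):
        # feats: list of feature maps, one per scale, finest first
        out = [self.same[0](feats[0])]
        for s in range(1, len(feats)):
            out.append(self.same[s](feats[s]) + self.down[s - 1](feats[s - 1]))
        return out
```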
Overall Score: 25/30
Average Score: 8.33
The reviewers found the approach natural and effective, with good results. They found the presentation clear and easy to follow, and the structure of the network clearly justified. They found the use of dense connectivity to avoid the performance loss caused by early-exit classifiers interesting. They appreciated the results and found them quite promising, with 5x speed-ups at the same or better accuracy than previous models. However, some reviewers suggested that the results for the more efficient DenseNet* be moved into the main paper.