The architecture we will use for classification is called MobileNet. It is a convolutional model designed to run on mobile devices. Introduced in 2017 in the paper MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, by Andrew G. Howard et al., it uses a special kind of convolution to reduce both the number of parameters and the amount of computation needed to generate predictions.
MobileNet uses depthwise separable convolutions. In practice, this means the architecture alternates between two types of convolutions, as illustrated in the sketch after this list:
- Pointwise convolutions: These are just like regular convolutions, but with a 1 × 1 kernel. The purpose of pointwise convolutions is to combine the different channels of the input. Applied to an RGB image, they compute a weighted sum of all channels at each pixel.
- Depthwise convolutions: These are like regular convolutions, but they do not combine channels. The role of depthwise convolutions is to filter the content of the input: each input channel is convolved with its own spatial kernel, so spatial features are extracted channel by channel without mixing information across channels.
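To make the two operations concrete, here is a minimal sketch of one depthwise separable block written with Keras, following the depthwise-then-pointwise ordering used in MobileNet. The kernel size, the 32-filter output, and the 224 × 224 input are illustrative choices, not values taken from the architecture tables of the paper:

```python
import tensorflow as tf

def depthwise_separable_block(x, out_channels):
    """One depthwise separable convolution: a depthwise 3x3 filter
    followed by a pointwise 1x1 convolution that mixes channels."""
    # Depthwise: one 3x3 spatial filter per input channel, no channel mixing.
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise: 1x1 convolution that combines channels into out_channels outputs.
    x = tf.keras.layers.Conv2D(filters=out_channels, kernel_size=1)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x

# Example: apply one block to a batch of 224x224 RGB images.
inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = depthwise_separable_block(inputs, out_channels=32)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

To see where the savings come from, ignore biases and batch normalization and consider a 3 × 3 convolution mapping 64 channels to 128 channels: a standard convolution needs 3 × 3 × 64 × 128 = 73,728 weights, while the depthwise separable version needs only 3 × 3 × 64 + 64 × 128 = 8,768, roughly an 8× reduction in both parameters and multiplications.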