Introduced by Min Lin and others in their influential Network in Network (NIN) paper in 2013, the idea of having a CNN composed of sub-network modules was adapted and fully exploited by the Google team. As previously mentioned and shown in Figure 4.4, the basic inception modules they developed are composed of four parallel layers—three convolutions with filters of size 1 × 1, 3 × 3, and 5 × 5, respectively, and one max-pooling layer with stride 1. The advantages of this parallel processing, with the results concatenated together after, are numerous.
As explained in the Motivation sub-section, this architecture allows for the multiscale processing of the data. The results of each inception module combine features of different scales, capturing a wider range of information. We do not have to choose which kernel size may be the best (such a choice would require several iterations of training...