Now that we've got that sorted, it's time to dive into the really fun stuff: how do we train these architectures? Do we need a completely new algorithm for training and optimization? No! We can still use backpropagation and gradient descent to calculate the error, differentiate it with respect to the weights in the preceding layers, and update those weights to get us as close to the global optimum as possible.
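To make that concrete, here is a minimal sketch using PyTorch (not part of the original example; the layer sizes, learning rate, and mean-squared-error loss are illustrative choices) showing that training a convolutional layer uses exactly the same backpropagation-plus-gradient-descent loop we would use for a feedforward network:

```python
import torch
import torch.nn as nn

# A single convolutional layer: 1 input channel, 1 output channel,
# a 2x2 kernel, no padding, and no bias (to keep the example minimal).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, bias=False)

# A 3x3 input (batch of 1, 1 channel) and a matching 2x2 target.
x = torch.randn(1, 1, 3, 3)
target = torch.randn(1, 1, 2, 2)

optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()           # clear gradients from the previous step
    output = conv(x)                # forward pass: 3x3 input -> 2x2 output
    loss = loss_fn(output, target)  # scalar error
    loss.backward()                 # backpropagation: d(loss)/d(kernel weights)
    optimizer.step()                # gradient descent update of the kernel
```

The only thing that changes relative to a fully connected layer is *how* the gradients with respect to the kernel weights are computed, which is what we'll unpack next.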
But before we go any further, let's go through how backpropagation works in CNNs, particularly with kernels. Let's revisit the example from earlier in this chapter, where we convolved a 3 × 3 input with a 2 × 2 kernel, which looked as follows:
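Written out in general terms (the symbols $X$ for the input, $W$ for the kernel, and $O$ for the output are placeholders we're choosing here for that example), the setup is:

$$
X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}, \quad
W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \quad
O = X * W = \begin{bmatrix} o_{11} & o_{12} \\ o_{21} & o_{22} \end{bmatrix}
$$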
We expressed each element in the output matrix as follows:
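Using the same placeholder notation, and assuming a valid convolution (no padding, stride 1), each output element is a weighted sum over the 2 × 2 patch of the input that the kernel covers:

$$
\begin{aligned}
o_{11} &= w_{11}x_{11} + w_{12}x_{12} + w_{21}x_{21} + w_{22}x_{22} \\
o_{12} &= w_{11}x_{12} + w_{12}x_{13} + w_{21}x_{22} + w_{22}x_{23} \\
o_{21} &= w_{11}x_{21} + w_{12}x_{22} + w_{21}x_{31} + w_{22}x_{32} \\
o_{22} &= w_{11}x_{22} + w_{12}x_{23} + w_{21}x_{32} + w_{22}x_{33}
\end{aligned}
$$

Notice that each kernel weight $w_{ij}$ appears in every output element, so when we backpropagate, the gradient of the error with respect to $w_{ij}$ is a sum of contributions from each $o_{ij}$.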
We should remember from Chapter 7, Feedforward Networks, where we introduced backpropagation, that we...