Way back in Chapter 7, Training a CNN From Scratch, we used convolutions to slide a window over regions of an image. This allowed us to learn important local visual features, regardless of where in the picture those features appeared, and then hierarchically learn more and more complex features as our network got deeper. We typically used a 3 x 3 or 5 x 5 filter on a 2D or 3D image. If you're feeling rusty on convolutional layers and how they work, you may want to review Chapter 7, Training a CNN From Scratch.
It turns out that we can use the same strategy on a sequence of words. Here, our 2D matrix is the output of an embedding layer. Each row represents a word, and the elements of that row form its word vector. Continuing with the preceding example, we would have a 10 x...
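The idea can be sketched with a minimal NumPy example (the sequence length, embedding size, and window width here are illustrative assumptions, not values from the text): unlike an image filter, a text convolution filter spans the full embedding dimension and slides along only one axis, the word axis.

```python
import numpy as np

# Toy "embedding layer output": 10 words, each represented by a
# 4-dimensional word vector. One row per word.
seq_len, embed_dim = 10, 4
rng = np.random.default_rng(0)
embedded = rng.standard_normal((seq_len, embed_dim))

# The filter covers a window of 3 consecutive words and the entire
# embedding width, so it can only slide downward over the rows.
window = 3
kernel = rng.standard_normal((window, embed_dim))

# "Valid" 1D convolution along the word axis: one activation per
# window position, giving seq_len - window + 1 = 8 outputs.
feature_map = np.array([
    np.sum(embedded[i:i + window] * kernel)
    for i in range(seq_len - window + 1)
])
print(feature_map.shape)  # (8,)
```

In a Keras model this is what a `Conv1D` layer stacked on top of an `Embedding` layer computes, with many such filters in parallel.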