Before we detail how the network learns the weights of Winput, Wrec, and V, let's try to get a broad understanding of how a basic RNN works. The general idea is that Winput controls which features of the input make it into the hidden state, while Wrec controls which features stay in the hidden state from one time step to the next.
Let's use two specific examples: classifying a violent video and classifying a dance video.
As a gunshot can be quite sudden, it may occupy only a few frames among all the frames of the video. Ideally, the network will learn Winput so that when x<t> contains the information of a gunshot, the concept of a violent video is added to the state. Moreover, Wrec (defined in the previous equation) must be learned in a way that prevents the concept of violence from disappearing from the state. This way, even if the gunshot appears only in the first few frames, the video would still be classified as violent...
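To make the roles of Winput and Wrec concrete, here is a minimal sketch of a one-unit RNN. It assumes the common update rule h<t> = tanh(Winput · x<t> + Wrec · h<t-1>), which may differ in detail from the equation given earlier. The function name `run_rnn`, the scalar weights, and the toy input are hypothetical and chosen by hand for illustration; nothing here is learned.

```python
import numpy as np

def run_rnn(frames, w_input, w_rec):
    """Return the hidden state after each frame for a 1-unit RNN.

    Assumed update rule (illustrative): h_t = tanh(w_input * x_t + w_rec * h_prev).
    """
    h = 0.0  # the state starts empty
    states = []
    for x in frames:
        h = np.tanh(w_input * x + w_rec * h)
        states.append(float(h))
    return states

# Toy "video": a gunshot feature (1.0) in the first frame only, then nothing.
frames = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# A large recurrent weight keeps the "violent" concept in the state.
retained = run_rnn(frames, w_input=2.0, w_rec=2.0)

# A small recurrent weight lets the concept fade within a few frames.
forgotten = run_rnn(frames, w_input=2.0, w_rec=0.1)

print("retained: ", [round(h, 3) for h in retained])
print("forgotten:", [round(h, 3) for h in forgotten])
```

In both runs, Winput writes the gunshot into the state at the first frame. Only the run with the larger Wrec still carries a strong signal at the last frame, which is what a classifier reading the final state would need in order to label the video as violent.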