To introduce RNNs, we will use the example of video recognition. A video is composed of N frames. The naive method to classify a video would be to apply a CNN to each frame, and then take the average of the outputs.
While this would provide decent results, it does not reflect the fact that some parts of the video are more important than others. Moreover, the important parts do not always take more frames than the meaningless ones. The risk of averaging the output would be to lose important information.
To circumvent this problem, an RNN is applied to all the frames of the video, one after the other, from the first one to the last one. The main attribute of RNNs is adequately combining features from all the frames in order to generate meaningful results.