Introduction
We encounter many kinds of data in our day-to-day lives. Some of it has temporal dependencies (dependencies over time), and some does not. For example, a single image is self-contained: it carries all the information it is meant to convey. Audio and video, however, depend on time; a sample taken at a single fixed point in time conveys little or none of their meaning. The input a model needs therefore depends on the problem being solved. If the model has to detect a particular person in a frame, a single image is sufficient input. If it has to detect that person's actions, it needs a stream of images, contiguous in time. The actions can be understood only by analyzing these images together, not independently.
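As a toy illustration of the difference between a per-frame task and a temporal task, consider the minimal Python sketch below. Everything in it is an assumption made for illustration: each "frame" is a hypothetical one-dimensional row of pixels in which 1 marks the person, and the function names locate_person and detect_action are invented here, not taken from any library.

```python
# Hypothetical toy data: each frame is a 1-D row of pixels; 1 marks the person.
frames = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]


def locate_person(frame):
    """Per-frame task: a single image is enough to find the person."""
    return frame.index(1)


def detect_action(clip):
    """Temporal task: compares the person's position across consecutive frames."""
    if len(clip) < 2:
        # A single frame has no temporal context, so no motion can be inferred.
        return "unknown"
    positions = [locate_person(frame) for frame in clip]
    # Change in position between each pair of neighbouring frames.
    steps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    if all(step > 0 for step in steps):
        return "moving right"
    if all(step < 0 for step in steps):
        return "moving left"
    return "unclear"


print(locate_person(frames[0]))   # position found from one frame alone
print(detect_action(frames[:1]))  # one frame: the action cannot be determined
print(detect_action(frames))      # the full sequence reveals the direction
```

Under these assumptions, locating the person works on any single frame, while the direction of movement becomes available only when at least two frames are analyzed together.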
While watching a movie, a particular scene makes sense because we know its context: we remember what happened earlier in the movie and use it to understand the current scene. This...