Consider a scenario where we are transcribing an image of handwritten text. In this case, we would be dealing with both image data and sequential data, as the content in the image needs to be transcribed sequentially.
In a traditional pipeline, we would hand-craft the solution: for example, we might slide a window across the image (where the window is roughly the size of an average character) so that the window detects each character in turn and outputs the characters it detects with high confidence.
However, in this approach, the size of the window and the number of positions we slide it to are hand-crafted by us, which turns this into a feature-engineering (feature-generation) problem.
A more end-to-end approach would be to extract features by passing the image through a CNN and then feed those features as inputs to the successive time steps of a recurrent network, which transcribes the characters in sequence.
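A minimal sketch of this idea is shown below, assuming PyTorch (the source names no framework) and an arbitrary architecture of our own choosing: a small CNN reduces a text-line image to a feature map, each column of which is treated as one time step fed to a bidirectional LSTM that emits per-time-step character logits (with an extra class reserved for a CTC-style blank).

```python
import torch
import torch.nn as nn

class CNNRNNTranscriber(nn.Module):
    """Hypothetical sketch: a CNN extracts per-column features from a
    text-line image; each column becomes one RNN time step."""
    def __init__(self, num_chars, hidden=128):
        super().__init__()
        # CNN: shrink the height dimension, keep width as the sequence axis
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # For a 32-pixel-tall input, the feature map is 8 rows of 64 channels
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=hidden,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_chars + 1)  # +1 for a blank class

    def forward(self, x):            # x: (batch, 1, 32, width)
        f = self.cnn(x)              # (batch, 64, 8, width / 4)
        b, c, h, w = f.shape
        # Flatten each column into one feature vector per time step
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (batch, time, feat)
        out, _ = self.rnn(f)
        return self.fc(out)          # per-time-step character logits

model = CNNRNNTranscriber(num_chars=26)
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # (2, 32, 27): 2 images, 32 time steps, 27 classes
```

The key design point is the `permute`/`reshape` step: it converts the CNN's spatial width axis into the RNN's time axis, so no hand-crafted window size or stride is needed; the network learns the features end to end.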