- If both CNNs and LSTMs can model spatially correlated data, what makes LSTMs particularly better?
Nothing in general, other than the fact that LSTMs have memory. But in certain applications, such as NLP, a sentence is processed sequentially, and words at the beginning, middle, and end of it can refer to one another, sometimes several at a time. BiLSTMs, which read the sequence both forward and backward, can model these long-range dependencies more readily than a CNN. A CNN may eventually learn to do the same, but it typically takes longer to get there.
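To make the bidirectional idea concrete, here is a minimal sketch of a BiLSTM text classifier in Keras; the vocabulary size, sequence length, layer widths, and the binary sentiment task are all hypothetical choices for illustration, not a prescribed setup:

```python
from tensorflow.keras import layers, models

vocab_size = 10_000  # hypothetical vocabulary size
seq_len = 128        # hypothetical maximum sentence length in tokens

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),
    # Bidirectional runs one LSTM over the sentence left-to-right and
    # another right-to-left, so every position sees both the words that
    # came before it and the words that come after it.
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),  # e.g., binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The `Bidirectional` wrapper is what gives each position access to context from both ends of the sentence, which is exactly the behavior described above.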
- Does adding more recurrent layers make the network better?
No. It can make things worse. A good rule of thumb is to stack no more than three recurrent layers in a row in an encoder model, unless you are a researcher experimenting with something new.
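As a sketch of what such a stacked encoder looks like, here is an example in Keras kept to the suggested three-layer ceiling; the input shape and layer sizes are hypothetical:

```python
from tensorflow.keras import layers, models

seq_len, n_features = 100, 8  # hypothetical input shape

# Intermediate recurrent layers must return full sequences so the next
# layer receives one vector per time step; only the last layer returns
# a single summary vector (the encoding).
encoder = models.Sequential([
    layers.Input(shape=(seq_len, n_features)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32, return_sequences=True),
    layers.LSTM(16),  # third and final recurrent layer
])
encoder.summary()
```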
- What other applications are there for LSTMs?
Audio processing and classification, image denoising, image super-resolution, and text summarization, among others.