In the previous sections, we saw how RNNs can learn patterns in many different kinds of time sequences. In this section, we will look at how these models can be applied to the problem of recognizing and understanding speech. We will give a brief overview of the speech recognition pipeline and a high-level view of how neural networks can be used in each part of it.
Speech recognition
Speech recognition pipeline
Speech recognition tries to find the most probable word sequence (the transcription) given the acoustic observations:
transcription = argmax(P(words | audio features))
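As a toy illustration of the argmax above, the following sketch picks the most probable transcription from a small set of candidates. The candidate sentences, their probabilities, and the `transcribe` helper are all invented for illustration; a real recognizer would have to compute these scores from the audio and search a far larger space of word sequences.

```python
# Hypothetical values of P(words | audio features) for a single utterance.
# These numbers are made up; they are not the output of any real model.
candidate_posteriors = {
    "recognize speech": 0.62,
    "wreck a nice beach": 0.30,
    "recognize peach": 0.08,
}


def transcribe(posteriors):
    """Return the word sequence with the highest posterior probability."""
    # max() with key=posteriors.get performs the argmax over the candidates.
    return max(posteriors, key=posteriors.get)


transcription = transcribe(candidate_posteriors)
print(transcription)
```

Enumerating every candidate like this is only feasible for a toy example, since the number of possible word sequences grows rapidly with the length of the utterance.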
This probability function is typically modeled in different parts (note that the normalizing term P(audio features) is...