Recognizing and understanding spoken language is a challenging problem because speech data is complex and highly varied. Several technologies have been deployed in the past to recognize spoken words. Most of those approaches were limited in scope: they could not handle a wide variety of words, accents, and tones, or aspects of spoken language such as pauses between words. Prevalent modeling techniques for speech recognition include Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Long Short-Term Memory networks (LSTM), and Connectionist Temporal Classification (CTC).
In this chapter, we shall learn about various options for speech-to-text and about the prebuilt model from Google's TensorFlow team, using the Speech Commands Dataset. We shall cover the following...