Using speech-to-text
Speech-to-text, also known as speech recognition, is a technology that converts speech into text, either in real time or in batch mode. Recent advances in machine learning have led to state-of-the-art systems that can understand natural speech in many languages. Deep neural networks have proven very effective for speech recognition, and current systems have error rates between 3% and 5%, depending on the task. As a point of reference, humans achieve similar error rates when asked to transcribe recorded audio. Deep neural networks work so well for this task because of the data’s compositional nature: waveforms can be cut into phonemes, which are the building blocks of words, and words can then be combined to create sentences. We saw a similar concept in the Understanding CNN section of Chapter 8, Detecting Hateful and Offensive Language. Processing an image using a convolutional neural network...