Neural network architecture
The network used for this example has three modules:
- A feature extraction module that processes the audio clips into feature vectors
- A deep neural network module that takes a frame of feature vectors as input and produces softmax probabilities for each keyword
- A posterior handling module that combines the frame-level posterior scores into a single score for each keyword
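As an illustration of the third module, the sketch below turns frame-level softmax outputs into one score per keyword. The text does not specify the combination rule, so the rule used here (a trailing moving average followed by a maximum over time) is an assumption, and the names smooth_posteriors, keyword_scores, and window are hypothetical.

```python
import numpy as np

def smooth_posteriors(posteriors, window=3):
    """Average each frame with up to `window - 1` preceding frames.

    posteriors: array of shape (num_frames, num_keywords).
    """
    num_frames = posteriors.shape[0]
    smoothed = np.empty_like(posteriors, dtype=float)
    for t in range(num_frames):
        start = max(0, t - window + 1)
        smoothed[t] = posteriors[start:t + 1].mean(axis=0)
    return smoothed

def keyword_scores(posteriors, window=3):
    """One score per keyword: the peak of its smoothed posterior track.

    Assumption: smoothing + max is only one plausible combination rule.
    """
    return smooth_posteriors(posteriors, window).max(axis=0)

# Four frames, two keywords (each row is a softmax output).
frame_posteriors = np.array([[0.1, 0.9],
                             [0.8, 0.2],
                             [0.9, 0.1],
                             [0.7, 0.3]])
scores = keyword_scores(frame_posteriors, window=2)
```

With window=2, the scores come out at roughly 0.85 for the first keyword and 0.9 for the second. Smoothing before taking the maximum keeps a single noisy frame from dominating the score.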
Feature extraction module
To reduce the computational load, the incoming audio signal is first run through a voice-activity detection (VAD) system, which divides it into speech and non-speech parts. The voice-activity detector uses a 30-component diagonal-covariance GMM. The input to this model is 13-dimensional PLP features together with their deltas and double deltas. The output of the GMM is passed to a state machine (SM) that performs temporal smoothing.
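The following sketch shows how the detector's input and scoring could look: 13 static coefficients stacked with deltas and double deltas give 3 x 13 = 39 values per frame, which are then scored under a diagonal-covariance GMM. It is a sketch under two assumptions. Deltas are approximated with simple finite differences (np.gradient) rather than a regression window, and the PLP front end and trained GMM parameters are taken as given. The function names add_deltas and diag_gmm_loglik are hypothetical.

```python
import numpy as np

def add_deltas(static):
    """Stack static features with deltas and double deltas.

    static: (num_frames, dim) array, e.g. dim = 13 PLP coefficients.
    Returns an array of shape (num_frames, 3 * dim).
    Assumption: finite differences stand in for regression-based deltas.
    """
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])

def diag_gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    x: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = x[:, None, :] - means[None, :, :]                          # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    a = np.log(weights) + log_comp
    # Log-sum-exp over components, shifted by the max for stability.
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

# Placeholder features and parameters, for shape checking only.
rng = np.random.default_rng(0)
features = add_deltas(rng.normal(size=(50, 13)))        # (50, 39)
K, D = 30, features.shape[1]
loglik = diag_gmm_loglik(features,
                         np.full(K, 1.0 / K),            # uniform weights
                         rng.normal(size=(K, D)),        # random means
                         np.ones((K, D)))                # unit variances
```

The random parameters only exercise the shapes. In a real detector the weights, means, and variances would come from training on labelled speech and non-speech frames.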
The output of this GMM-SM module is the signal segmented into speech and non-speech parts.
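The temporal smoothing done by the state machine can be pictured with a minimal two-state sketch. The text does not describe the actual state machine, so the rule below (switch state only after min_run consecutive frames disagree with the current state) is an assumption, as are the names smooth_decisions, speech_segments, and min_run.

```python
def smooth_decisions(raw, min_run=3):
    """Smooth per-frame speech/non-speech decisions with a two-state machine.

    raw: iterable of bools (True = frame classified as speech).
    The state flips only after `min_run` consecutive contrary frames.
    """
    state = False   # start in the non-speech state
    run = 0         # consecutive frames disagreeing with `state`
    out = []
    for is_speech in raw:
        if is_speech != state:
            run += 1
            if run >= min_run:
                state = is_speech
                run = 0
        else:
            run = 0
        out.append(state)
    return out

def speech_segments(labels):
    """Convert smoothed labels into (start, end) frame ranges, end exclusive."""
    segments, start = [], None
    for t, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = t
        elif not is_speech and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments

raw = [bool(v) for v in [0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]]
smoothed = smooth_decisions(raw, min_run=3)
print(speech_segments(smoothed))  # [(6, 13)]
```

The isolated speech frame at index 1 and the isolated non-speech frame at index 8 are both absorbed, leaving one speech segment. The cost of this rule is a delay of min_run frames at each boundary.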
The speech parts of the signal are further processed to generate the features....