The DS2 architecture is composed of many layers of recurrent connections, convolutional filters, and non-linearities; it also relies on a specific variant of batch normalization applied to the RNN layers, as shown here:
To learn from datasets with large amounts of data, the DS2 model's capacity is increased by adding depth. The architectures comprise up to 11 layers of bidirectional recurrent and convolutional layers. To optimize these models successfully, batch normalization for RNNs and a novel optimization curriculum called SortaGrad are used.
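The SortaGrad curriculum mentioned above trains on utterances in order of increasing length during the first epoch, then shuffles in subsequent epochs, which stabilizes early optimization. A minimal sketch of that batching policy, assuming the dataset is simply a list of (utterance, transcript) pairs and utterance length stands in for audio duration:

```python
import random

def sortagrad_batches(dataset, batch_size, epoch):
    """Yield minibatches under a SortaGrad-style curriculum:
    epoch 0 processes utterances shortest-first; later epochs shuffle.
    `dataset` is a list of (utterance, transcript) pairs."""
    if epoch == 0:
        # First epoch: sort by utterance length (a proxy for difficulty).
        ordered = sorted(dataset, key=lambda pair: len(pair[0]))
    else:
        # Later epochs: the usual random order.
        ordered = list(dataset)
        random.shuffle(ordered)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]

# Toy data: four utterances of lengths 7, 3, 9, 5 frames.
data = [([0.1] * n, "t" * n) for n in (7, 3, 9, 5)]
first_epoch = list(sortagrad_batches(data, batch_size=2, epoch=0))
```

In epoch 0 the shortest utterances are seen first; the function names and data layout here are illustrative, not from the DS2 codebase.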
The training data is a set of pairs of an input sequence x(i) and its transcript y(i), and the goal of the RNN layers is to learn the mapping between x(i) and y(i):
training set X = {(x(1), y(1)), (x(2), y(2)), . . .}
utterance = x(i)
label = y(i)
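The notation above can be made concrete with a toy training set: each utterance x(i) is a sequence of acoustic feature frames, and each label y(i) is its character transcript. This is an illustrative sketch only; the feature values and transcripts are invented, and in DS2 the length mismatch between frames and characters is handled by the CTC loss.

```python
# Hypothetical toy training set X = {(x(1), y(1)), (x(2), y(2))}:
# each x(i) is a list of per-frame feature vectors, each y(i) a transcript.
X = [
    ([[0.2, 0.1], [0.4, 0.3], [0.0, 0.5]], "hi"),
    ([[0.6, 0.2], [0.1, 0.9]], "no"),
]

for i, (x_i, y_i) in enumerate(X, start=1):
    # Input length (frames) and label length (characters) generally differ.
    print(f"utterance {i}: {len(x_i)} frames -> transcript {y_i!r}")
```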
The spectrogram of power normalized...