Dividing the task into such systems has worked successfully and has powered many commercial speech-to-speech translation products, including Google Translate.
In 2016, interest in end-to-end models for speech translation grew when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation.
In 2017, the Google AI team demonstrated that such end-to-end models can outperform cascade models. Recently, many approaches for improving end-to-end speech-to-text translation models have been proposed.
Translatotron demonstrates that a single sequence-to-sequence model can directly translate speech from one language into another. Also, it doesn’t rely on an intermediate text representation in either language, as required in cascaded systems. It is based on a sequence-to-sequence network that takes source spectrograms as input and then generates spectrograms of the translated content in the target language.
Translatotron also makes use of two separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms and a speaker encoder, which is used to maintain the source speaker’s voice in the synthesized translated speech.
During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts while it generates target spectrograms. During inference, however, no transcripts or other intermediate text representations are used.
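As a rough illustration of what a multitask objective of this kind can look like, the sketch below combines a main spectrogram loss with two auxiliary transcript losses. The function name, the `aux_weight` parameter, and the simple weighted sum are assumptions made for illustration only; the article does not describe how Translatotron actually weights or combines its losses.

```python
def multitask_loss(spectrogram_loss, source_transcript_loss,
                   target_transcript_loss, aux_weight=1.0):
    """Hypothetical training objective: main loss plus weighted auxiliary losses.

    spectrogram_loss: loss on the generated target spectrograms (main task).
    source_transcript_loss / target_transcript_loss: losses of the auxiliary
        transcript predictions, used only while training.
    aux_weight: assumed scalar that scales the auxiliary terms.
    """
    auxiliary = source_transcript_loss + target_transcript_loss
    return spectrogram_loss + aux_weight * auxiliary


# During training, all three terms contribute to the objective.
training_loss = multitask_loss(2.0, 0.5, 0.25)

# Setting aux_weight to 0 leaves only the spectrogram term, mirroring the
# fact that no transcripts are involved at inference time.
spectrogram_only = multitask_loss(2.0, 0.5, 0.25, aux_weight=0.0)
```

The auxiliary terms only shape training; the spectrogram output path does not depend on them, which is why they can be dropped entirely at inference.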
The engineers at Google AI validated Translatotron’s translation quality by measuring the BLEU (bilingual evaluation understudy) score, computed with text transcribed by a speech recognition system.
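To give a feel for what a BLEU score measures, here is a minimal sketch of a simplified, single-reference, sentence-level BLEU in plain Python. It is not the evaluation pipeline Google AI used: the function names, the whitespace tokenization, and the lack of smoothing are all assumptions chosen to keep the example short. In the setup described above, the candidate text would be the speech recognizer's transcript of the translated audio.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1]."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: a missing n-gram order zeroes the score
        log_precisions.append(math.log(matched / total))

    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference_text = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference_text))
print(simple_bleu("the cat sat on the mat today", reference_text))
```

An exact match scores highest, and extra or wrong words lower the n-gram precisions. Production evaluations typically use a corpus-level BLEU with standardized tokenization rather than a per-sentence score like this one.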
The results lag behind a conventional cascade system, but the engineers have demonstrated the feasibility of end-to-end direct speech-to-speech translation.
Translatotron can retain the original speaker’s vocal characteristics in the translated speech by incorporating a speaker encoder network. This makes the translated speech sound more natural and less jarring. According to the Google AI team, Translatotron can in some cases give a more accurate translation than the baseline cascade model.
The engineers concluded that Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language and can retain the source speaker’s voice in the translated speech.
To learn more, check out the blog post by Google AI.