Microsoft’s new neural text-to-speech service lets machines speak like people

Microsoft has released a production system that performs text-to-speech (TTS) synthesis using deep neural networks. The voices it produces are hard to distinguish from recordings of human speech.

Neural text-to-speech synthesis significantly reduces the ‘listening fatigue’ that comes with interacting with AI systems. It gives those systems a human-like, natural-sounding voice, which makes interaction with chatbots and virtual assistants more engaging. The Microsoft team demonstrated this neural-network-powered text-to-speech system at the Microsoft Ignite conference in Orlando, Florida, this week.

Neural text-to-speech can also convert digital text such as e-books into audiobooks and enhance in-car navigation systems. Deep neural networks help overcome two limits of traditional text-to-speech systems: matching the patterns of stress and intonation in spoken language, called prosody, and synthesizing the units of speech into a computer voice.

Neural TTS

Traditional text-to-speech systems generally break prosody down into separate linguistic analysis and acoustic prediction steps that are governed by independent models. This usually results in muffled, buzzy voice synthesis. Neural networks, by contrast, perform prosody prediction and voice synthesis simultaneously, which results in a more fluid and natural-sounding voice.

Microsoft uses the computational power of Azure to offer real-time streaming, which makes the service useful for situations such as interacting with a chatbot or virtual assistant. The TTS capability is served from the Azure Kubernetes Service to ensure high scalability and availability.

The text-to-speech service is currently available only in preview. The preview comes with two pre-built neural text-to-speech voices in English: Jessa and Guy. Microsoft plans to make more languages available soon. It will also offer customization services in 49 languages for customers who want to build branded voices optimized for their specific needs.
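
For readers wondering how a client might ask for one of the pre-built voices, here is a minimal sketch. It assumes the request body is written in SSML, the W3C Speech Synthesis Markup Language, which is a common input format for TTS services. The function name build_ssml and the voice identifier "example-neural-voice" are hypothetical placeholders, not names taken from the announcement. The real identifiers for Jessa and Guy, along with the endpoint and authentication details, would have to come from Microsoft's documentation.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the W3C SSML and XML specifications.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, voice_name: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document that selects one voice.

    voice_name is whatever identifier the TTS service uses for a voice;
    the value passed in the usage example below is a made-up placeholder.
    """
    # Serialize SSML elements without a prefix (as the default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice_name})
    # ElementTree escapes characters such as & and < when serializing.
    voice.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical voice identifier, for illustration only.
    print(build_ssml("Hello from a neural voice.", "example-neural-voice"))
```

The sketch only builds the markup. Sending it to the service and playing back the streamed audio are left out because those details are not covered in this article.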

For more information, check out the official Microsoft Blog post.