Barriers fall as Microsoft’s speech and language technologies exit the lab

Will Lewis and Amanda Song face each other in front of dual monitors showing real-time translation as they speak.

Microsoft has incorporated world-class systems for translating between Chinese, German and English, which are based on groundbreaking research, into its publicly available translation technologies, the company announced on Tuesday.

The new translation technology is one of several advances that have recently moved out of Microsoft’s research labs and into the hands of consumers.

These technologies “are making this world a better place,” said Xuedong Huang, a technical fellow in Microsoft Cloud and AI who leads the Speech and Language group.

The new translation systems for Chinese, German and English, for example, are based on pioneering research in machine translation that used advanced deep neural networks to achieve human parity in translating news articles from Chinese to English. Microsoft’s computer engineers took that research system and adapted it to the company’s suite of translation technologies available via Azure Cognitive Services, including the Microsoft Translator app and the Presentation Translator plug-in for PowerPoint.

The team plans to apply the technology to additional languages supported by Microsoft Translator over the coming months.

Huang’s team also recently updated a speech recognition system for English that is available via Speech Services from Azure Cognitive Services. The capability is adapted from a research system that achieved human parity in transcribing recorded telephone conversations, and benchmark testing has shown it is “second to none,” noted Huang.

To give voice to these words and languages, Huang’s team has developed and made available via preview in Azure Cognitive Services a neural text-to-speech synthesis system that turns text into digital voices that are nearly indistinguishable from recordings of people. The technology can be used, for example, to make interactions with chatbots and virtual assistants more natural and engaging, convert digital texts such as e-books into audiobooks and enhance in-car navigation.

The speech and language group also recently previewed a new audio-visual prototype device that leverages a breakthrough in so-called vision-enhanced far-field speech recognition to produce accurate transcriptions even when people are not speaking directly into a microphone. The far-field technology is available to developers via the Speech Devices SDK.

Although this is still a research project, Huang said Microsoft’s prototype device that leverages this technology could enable more digitization of meetings. Real-time translation, for example, lets people who speak different languages converse naturally without needing to hold a device close to their mouths. The system also generates real-time transcripts with each speaker automatically identified. These transcripts are searchable, allowing people who were unable to attend the meeting to find out who said what.

“This is going to enhance productivity and efficiency,” said Huang.


John Roach writes about Microsoft research and innovation.