Microsoft Cognitive Services push gains momentum

The machine-learned smarts that enable Microsoft’s Skype Translator, Bing and Cortana to accomplish tasks such as translating conversations, compiling knowledge and understanding the intent of spoken words are increasingly finding their way into third-party applications that people use every day.

These advances in the democratization of artificial intelligence are coming in part from Microsoft Cognitive Services, a collection of 25 tools that allow developers to add features such as emotion and sentiment detection, vision and speech recognition, and language understanding to their applications with zero expertise in machine learning.

“Cognitive Services is about taking all of the machine learning and AI smarts that we have in this company and exposing them to developers through easy-to-use APIs, so that they don’t have to invent the technology themselves,” said Mike Seltzer, a principal researcher in the Speech and Dialog Research Group at Microsoft’s research lab in Redmond, Washington.

“In most cases, it takes a ton of time, a ton of data, a ton of expertise, and a ton of compute to build a state-of-the-art machine-learned model,” he explained.

Take one of the tools that deals with speech recognition, for example. Seltzer and his colleagues have spent more than a decade developing algorithms that enable Microsoft’s speech recognition technology to perform robustly in noisy environments as well as with the jargon, dialects and accents of specific user groups and settings.

The same flexible technology is now available to developers of third-party applications via the Custom Speech Service, a Cognitive Service that Microsoft released to public preview on Tuesday.

Two other Cognitive Services, the Content Moderator and Bing Speech API, will be moving to general availability next month, the company noted. The Content Moderator allows users to quarantine and review data such as images, text or videos to filter out unwanted material, such as potentially offensive language or pictures. The Bing Speech API converts audio into text, understands intent and converts text back to speech.

Andrew Shuman, corporate vice president, Microsoft AI and Research.

Cognitive Services that enable developers to apply intelligence to visual data such as pictures and video are being used by customers to enhance their services. For example, business intelligence company Prism Skylabs used the Computer Vision API in its Prism Vision application, which helps organizations search through closed-circuit and security camera footage for specific events, items and people.

The entire collection of Cognitive Services stems from a drive within Microsoft to make its artificial intelligence and machine learning expertise widely accessible to the development community to create delightful and empowering experiences for end users, said Andrew Shuman, corporate vice president of products for Microsoft’s AI and Research organization.

“Being able to have software now that observes people, listens, reacts and is knowledgeable about the physical world around them provides an excellent breakthrough in terms of making interfaces more human, more natural, more easy to understand and thus far more impactful in lots of different scenarios,” he said.

“This era that we are coming into is really an era of enhancing and bringing about more computer capabilities for more people in more interesting ways.”

Storytelling experience
Take Alexander Mejia, for example. Growing up, he always rushed to try the latest games with the latest graphics and technological innovations, chasing the buzz that comes with better sounds and resolution and new ways to convert bodily twitches into action on the screen.

In recent years, while working as a creative director in the gaming industry, the buzz from new experiences fizzled – doubling of computing power failed to result in a doubling of gaming excitement. “What is the next thing?” he asked. “What is the technology leap that is going to allow for new experiences that will wow the gamers?”

The questioning led to a demonstration of the latest generation of virtual reality technology. He strapped on the headgear and was taken for a wild ride on a roller coaster. The adrenaline rush returned. The experience, he said, was visceral.

“You believe that things are real when in the virtual world,” he said. “What would happen if we put a person in front of you? Would you try to talk to him?”

The idea blossomed into a business plan. Mejia founded his own company, Human Interact, to develop virtual reality storytelling experiences. The company’s premier title, Starship Commander, gives players control over the narrative as they zip around space at faster-than-light speed and speak with virtual characters at every turn.

To achieve realistic, fast-paced action, Mejia and his colleagues required accurate and responsive speech recognition.

“You’ve got to make it so that anytime anybody says anything, [the speech recognition engine] is going to understand them and run them down the right path in the script,” he explained. “And that,” he added, “is the magic of Microsoft Cognitive Services.”

Creating a custom speech model
Modern speech recognition technology hinges on machine-learned statistical models that harness the power of cloud computing and massive troves of data to convert snippets of sound into text that is an accurate transcription of the spoken words.

An acoustic model, for example, is a classifier that labels short snippets of audio as one of a number of phonemes, or sound units, in a given language. The labels are combined with those from the neighboring snippets to predict what word in the target language is being spoken, Seltzer explained. That prediction is guided by a dictionary that contains every word in the target language broken down to its phonemes.
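To make the role of the dictionary concrete, here is a minimal Python sketch. It assumes the acoustic model has already produced one phoneme label per audio snippet. The three-word pronunciation table, the ARPAbet-style labels and the function names are hypothetical, and collapsing runs of repeated labels is a simplification chosen for illustration, not a description of Microsoft's decoder.

```python
from itertools import groupby

# Hypothetical pronunciation dictionary: word -> phoneme sequence.
PRONUNCIATIONS = {
    "ball": ("B", "AO", "L"),
    "fall": ("F", "AO", "L"),
    "bolt": ("B", "OW", "L", "T"),
}


def collapse_frames(frame_labels):
    """Merge runs of identical per-snippet labels into one phoneme each."""
    return tuple(label for label, _ in groupby(frame_labels))


def lookup_words(frame_labels):
    """Return the dictionary words whose phonemes match the collapsed labels."""
    phonemes = collapse_frames(frame_labels)
    return [word for word, pron in PRONUNCIATIONS.items() if pron == phonemes]


# Assumed classifier output: one label per short snippet of audio.
frames = ["B", "B", "AO", "AO", "AO", "L", "L"]
print(collapse_frames(frames))
print(lookup_words(frames))
```

A production recognizer scores many competing label sequences probabilistically rather than requiring an exact match, but the lookup step plays the same role: it ties sequences of sound units back to real words.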

Meanwhile, a language model further refines the prediction by weighing how common every predicted word is in the target language. When the recognizer grapples with similar sounding words, the higher probability goes to the more common word. These models also consider context to make more robust predictions. “If the previous words are ‘The player caught the,’” explained Seltzer, “then ‘ball’ is going to be more likely than ‘fall.’”
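Seltzer's "ball" versus "fall" example can be sketched in a few lines of Python. Every number below is invented for illustration, and the table keyed on the two preceding words is an assumed stand-in for a real language model. The sketch shows only the general idea of adding a language-model score to an acoustic score.

```python
import math

# Hypothetical acoustic log-scores: the audio alone slightly favors "fall".
ACOUSTIC_LOGP = {"ball": math.log(0.45), "fall": math.log(0.55)}

# Hypothetical language model: P(word | two preceding words).
LM_PROB = {
    ("caught", "the"): {"ball": 0.20, "fall": 0.001},
}

UNSEEN_PROB = 1e-6  # assumed floor for words the table has never seen


def rescore(candidates, context, lm_weight=1.0):
    """Combine acoustic and language-model log-scores; return the best word."""
    history = tuple(context[-2:])
    lm = LM_PROB.get(history, {})
    scores = {
        word: acoustic + lm_weight * math.log(lm.get(word, UNSEEN_PROB))
        for word, acoustic in candidates.items()
    }
    return max(scores, key=scores.get), scores


acoustic_only = max(ACOUSTIC_LOGP, key=ACOUSTIC_LOGP.get)
best, scores = rescore(ACOUSTIC_LOGP, ["the", "player", "caught", "the"])
print(acoustic_only, best)
```

With these made-up numbers the acoustic score alone picks "fall", while the combined score picks "ball", because the context makes "ball" far more probable.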

The acoustic model that powers Microsoft’s state-of-the-art speech recognition engine is a deep neural network, a classifier inspired by theories about how pattern recognition occurs in the human brain. The model is trained on thousands of hours of audio using advanced algorithms that run in the cloud.

Microsoft’s speech recognition system recently hit a milestone when it recognized words in a conversation as accurately as a person. The milestone was achieved on a standardized test, or benchmark, that has been used by researchers in academia and industry for more than 20 years.

“Now, if you take that same system and put it in a noisy factory and it had never seen noisy factory speech, it would not do a good job,” Seltzer said. “That is where the Custom Speech Service comes in.”

The service allows a developer to customize the acoustic and language models to the sounds of the noisy factory floor and the jargon of factory workers. The acoustic model, for example, can be trained to recognize speech amid the din of hydraulics and drills and the language model updated to give priority weighting to jargon specific to the factory, such as nuts, bolts and car parts.

Beneath the hood, the Custom Speech Service leverages an algorithm that shifts Microsoft’s existing speech recognizer to the developer-supplied data. By starting from models that have been trained on massive troves of data, the amount of application-specific data required is greatly reduced. In cases where the developer’s data is insufficient, the recognizer falls back on the existing models.
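The article does not describe the adaptation algorithm itself, so the following Python sketch substitutes a deliberately simple technique, linear interpolation of word probabilities, to illustrate the two behaviors mentioned above: shifting a general model toward developer-supplied data, and falling back on the general model when that data is too thin. The vocabulary, weights and threshold are all assumptions.

```python
from collections import Counter


def unigram_probs(words):
    """Estimate word probabilities from raw counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def adapt(base, domain_words, min_domain_words=5, domain_weight=0.7):
    """Blend a base model with domain data, or fall back to the base model.

    min_domain_words and domain_weight are illustrative assumptions.
    """
    if len(domain_words) < min_domain_words:
        return dict(base)  # too little data: keep the existing model
    domain = unigram_probs(domain_words)
    vocab = set(base) | set(domain)
    return {
        w: domain_weight * domain.get(w, 0.0)
        + (1 - domain_weight) * base.get(w, 0.0)
        for w in vocab
    }


# Hypothetical general-purpose model and factory-floor transcript.
BASE = {"the": 0.5, "ball": 0.3, "bolt": 0.1, "torque": 0.1}
factory = ["bolt", "torque", "bolt", "the", "bolt", "torque"]

adapted = adapt(BASE, factory)
print(BASE["bolt"], round(adapted["bolt"], 2))
```

In this toy setup the factory jargon gains probability after adaptation, while a transcript containing only a word or two leaves the base model untouched.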

“The basic idea is that the more focused the systems can be, the better they will perform,” Seltzer said. “The job of the Custom Speech Service is to let you focus the system on the data that you care about.”

Customized for virtual reality
Starship Commander, the premier title from Human Interact, takes place in a science fiction world that contains invented words and place names. When Mejia trained the Custom Speech Service on these keywords and phrases, he found the system made half as many errors as the open-source speech-to-text software he used to build an early prototype of the virtual reality experience.

Mejia then turned to Microsoft’s Language Understanding Service to address another concern – understanding the intent of what the gamers say.

“There are a lot of different ways to say ‘let’s go,’” he explained. “There is, ‘let’s go; autopilot; get me out of here; let’s go faster than light; engage the hyper-drive.’ These are all different things people say to get going in our game, especially in the heat of the moment, because sometimes you don’t have very much time before something bad happens.”

The Language Understanding Service, which is currently in public preview, allows developers to train a machine-learned classifier to understand the intent of natural language. Developers upload a sample of the kinds of things users might say and tag each utterance with an intent.

On the backend, the service harnesses more than a decade of research on how to train classifiers with a limited dataset, explained Hussein Salama, director of Microsoft’s Advanced Technology Lab in Cairo, Egypt, which is leading the development of the service.

“Usually one needs an expert on machine learning to select the right technology and provide the right sets of data and to train the classifiers and then to evaluate them,” he said. “With the Language Understanding Service, we have simplified this. Provide a few utterances, a few examples of phrases with the intent, and then the Language Understanding Service can start training a model with good accuracy for that intent.”
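The workflow Salama describes, a handful of example phrases tagged with an intent, can be mimicked with a toy Python classifier. This sketch is not the Language Understanding Service or its API. It matches new utterances to intents by word overlap (cosine similarity over word counts), the "fire" intent and its examples are hypothetical, and the "go" examples are taken from Mejia's quote above.

```python
import math
import re
from collections import Counter

# A few tagged utterances per intent, as a developer might supply.
EXAMPLES = {
    "go": ["let's go", "autopilot", "get me out of here",
           "let's go faster than light", "engage the hyper-drive"],
    "fire": ["fire the lasers", "shoot them", "open fire"],  # hypothetical
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# "Training": pool the word counts of each intent's examples.
CENTROIDS = {
    intent: Counter(tok for phrase in phrases for tok in tokenize(phrase))
    for intent, phrases in EXAMPLES.items()
}


def classify(utterance):
    """Return the best-matching intent, or None if nothing overlaps."""
    vec = Counter(tokenize(utterance))
    scores = {intent: cosine(vec, c) for intent, c in CENTROIDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify("get us out of here now"))
print(classify("open fire on them"))
```

A word-overlap matcher like this one only handles phrasings that share vocabulary with the examples. A trained classifier of the kind the service provides is meant to generalize beyond that.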

For Starship Commander, the customization worked seamlessly, learning from the examples how to infer intent from natural language commands that were not part of the training data. “It’s shockingly scary how good it understands things that you never even trained it for,” Mejia said. “It is an AI.”

John Roach writes about Microsoft research and innovation. Follow him on Twitter.