The AI Blog

With reinforcement learning, Microsoft brings a new class of AI solutions to customers

Someone looking to book a vacation online today might have very different preferences than they did before the COVID-19 pandemic.

Instead of flying to an exotic beach, they might feel more comfortable driving locally. With limited options for dining out, having a full kitchen might be essential. Motel rooms or cabins might be more appealing than hotels with shared lobbies.

Countless companies use online recommendation engines to show customers products and experiences that match their interests. And yet, traditional machine learning models that predict what people might prefer are often based on data from past experience. That means they aren’t necessarily able to pick up on quickly changing consumer preferences unless they are retrained with new data.

Personalizer, which is part of Azure Cognitive Services within the Azure AI platform, takes a different approach to machine learning called reinforcement learning, in which AI agents interact with and learn from their environment in real time.

The technique was once used primarily in research labs. Now it is making its way into more Microsoft products and services, from Azure Cognitive Services that developers can plug into apps and websites to autonomous systems that engineers can use to refine manufacturing processes. Azure Machine Learning is also previewing cloud-based reinforcement learning offerings for data scientists and machine learning professionals.

“We’ve come a long way in the last two years when we had a lot of proof of concept projects within Microsoft and deployments with a couple of customers,” said Rafah Hosn, senior director at Microsoft Research’s New York lab. “Now we are really progressing nicely into things that can be packaged and shrink wrapped and pointed to a particular set of problems.”

Rafah Hosn, senior director at Microsoft Research Lab – New York City. Photo courtesy of Microsoft.

Z-Tech, the technology hub of Anheuser-Busch InBev, is using Personalizer to deliver tailored recommendations in an online marketplace to better serve small grocery stores across Mexico. Other Microsoft customers and partners are employing reinforcement learning to detect production anomalies and develop robots that can adjust to unpredictable real-world conditions — with models that can learn from environmental cues, expert feedback or customer behavior in real time.

After Microsoft began using Personalizer on its homepage to contextually personalize the products displayed to each visitor, the company saw a 19-fold increase in engagement with the products that Personalizer chose. The company has also used Personalizer internally to select the right offers, products and content across Windows, the Edge browser and Xbox. These scenarios have produced up to a 60% lift in engagement across billions of personalizations each month.

Microsoft Teams has also used reinforcement learning to find the optimal jitter buffer for video meetings, trading millisecond-scale delays in delivering information for better connection continuity. Azure, meanwhile, is exploring reinforcement learning-based optimization to help determine when to reboot or remediate virtual machines.

Because reinforcement learning models learn from instantaneous feedback, they can quickly adapt to changing or unpredictable circumstances. Once the COVID-19 pandemic hit, some companies had no idea what to expect as people’s purchasing and travel behaviors changed overnight, said Jeff Mendenhall, a Microsoft principal program manager for Personalizer.

“All of their historic modeling and expert knowledge went out the window,” Mendenhall said. “But with reinforcement learning, Personalizer can update the model every minute if needed to learn and respond to what actual user behaviors are right now.”
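Why fresh feedback matters when behavior shifts overnight can be illustrated with a minimal sketch, assuming a single offer whose click rate drops abruptly (all numbers are invented). A constant step-size update weights recent feedback more heavily than old feedback, so its estimate follows the shift, while a plain average over all history stays anchored to the past. This is a generic textbook technique, not a description of how Personalizer updates its models.

```python
import random

def recency_weighted_estimate(feedback, step_size):
    """Track a click rate with a constant step size (exponential recency weighting).

    Each new observation moves the estimate a fixed fraction of the way toward it,
    so older feedback fades and the estimate can follow a shift in behavior.
    """
    estimate = 0.0
    trace = []
    for clicked in feedback:
        estimate += step_size * (clicked - estimate)
        trace.append(estimate)
    return trace

def sample_average(feedback):
    """Weight all past feedback equally, for comparison."""
    return sum(feedback) / len(feedback)

rng = random.Random(0)
# Hypothetical feedback: an offer is clicked 80% of the time, then behavior
# changes abruptly and it is clicked only 20% of the time.
before = [1 if rng.random() < 0.8 else 0 for _ in range(500)]
after = [1 if rng.random() < 0.2 else 0 for _ in range(500)]
feedback = before + after

trace = recency_weighted_estimate(feedback, step_size=0.05)
print(f"recency-weighted, just before the shift: {trace[499]:.2f}")
print(f"recency-weighted, at the end:            {trace[-1]:.2f}")
print(f"sample average over everything:          {sample_average(feedback):.2f}")
```

A larger `step_size` reacts faster but produces noisier estimates; a smaller one is steadier but slower to notice change.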

In reinforcement learning, an AI agent learns largely by trial and error. It tests out different actions in either a real or simulated world and gets a reward when the actions achieve a desired result — whether that’s a customer hitting the button to book a vacation reservation or a robot successfully unloading an unwieldy bag of coins.

Training an AI agent through reinforcement learning is similar to teaching a puppy a trick, Hosn said. The puppy gets a treat when it makes decisions that yield a desired result and learns to repeat the actions that earn the most treats. In complex real-world scenarios, however, exploring the vast universe of possible actions and finding an optimal sequence of decisions is far harder.
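The "most treats" idea corresponds to what is usually called a multi-armed bandit. Below is a minimal epsilon-greedy sketch, assuming three hypothetical actions with hidden reward probabilities: most of the time the agent repeats the action with the best average reward so far, and occasionally it tries a random one so it can still discover better options.

```python
import random

def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions     # times each action has been tried
    values = [0.0] * n_actions   # running average reward ("treats") per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                         # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])   # exploit
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental mean: nudge the estimate toward the newest reward.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical environment: the third action earns a treat 80% of the time.
values, counts = epsilon_greedy([0.2, 0.5, 0.8])
print("estimated rewards:", [round(v, 2) for v in values])
print("times chosen:     ", counts)
```

With `epsilon=0` the agent never explores and locks onto the first action it tried, which is one reason exploration is such an important research topic.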

At the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) this week, Microsoft researchers presented 17 research papers that mark significant progress in addressing some of the field’s biggest challenges. By investing in reinforcement learning teams across its network of Microsoft Research labs, the company says it is developing a portfolio of approaches to tackle different problems and exploring multiple paths to potential breakthroughs.

John Langford, partner research manager at Microsoft Research Lab – New York City. Photo by John Brecher.

Those teams have focused on developing a robust understanding of reinforcement learning’s foundational elements and creating practical solutions for customers — not just novelty demonstrations, researchers say.

They’ve spent a lot of time figuring out which scenarios reinforcement learning is well-suited to solve, as well as probing the technical underpinnings to understand why something works and how to repeat it, said John Langford, a partner research manager at Microsoft Research Lab – New York.

“Right now there’s a big gap between one-off applications where you can get PhDs to grind really hard and figure out a way to make it work as opposed to developing a routinely useful system that can be used over and over again,” Langford said.

“All of our reinforcement learning research at Microsoft really falls into two big buckets — how can we solve challenges that customers are bringing to us and what are the foundations we can use to build replicable, reliable solutions?” he said.

A different approach to machine learning

Reinforcement learning takes a fundamentally different approach from supervised learning, a more common machine learning technique in which models learn to make predictions from labeled training examples they've been fed.

If a person is trying to learn French, exposing themselves to French text, grammar rules and vocabulary is closer to a supervised learning approach, said Raluca Georgescu, a research software engineer working on Project Paidia in the Microsoft Research Cambridge UK lab.

With a reinforcement learning approach, they would go to France and learn by talking to people. They’d be penalized with puzzled looks if they say the wrong thing and they’d get rewarded with a croissant if they order it correctly, she said.

A reinforcement learning agent learns by interacting with its environment, either in the real world or in a simulated environment that allows it to safely explore different options. It takes an action and waits to see whether it results in a positive or negative outcome, based on a reward system that's been established. Once that feedback is received, the model learns whether the decision was good or bad and updates itself accordingly.

It's a very simple form of learning that's widespread in the natural world, said Langford.

“Even worms can do reinforcement learning — they can learn to go towards things and avoid things based on some feedback,” Langford said. “That ability to learn at a very basic level from your environment is something that is super natural for us but in machine learning it’s a bit more tricky and delicate and requires more thought than supervised learning.”
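The act, observe a reward, update loop described above is the core of most reinforcement learning algorithms. Here is a minimal tabular Q-learning sketch, assuming a made-up five-cell corridor where only reaching the right end earns a reward. Because the reward arrives only at the end, the agent has to learn a sequence of decisions, and each update passes part of that reward back to earlier steps through the discount factor `gamma`.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4
GOAL = N_STATES - 1   # reaching the rightmost cell ends the episode with reward 1
MOVES = (-1, +1)      # action 0 moves left, action 1 moves right

def step(state, action):
    """Environment: apply a move, clamp to the corridor, return feedback."""
    next_state = min(max(state + MOVES[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore occasionally (or when tied); otherwise pick the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Update toward the reward plus the discounted value of what comes next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(GOAL)]
print(policy)
```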

The new papers presented at NeurIPS this week offer significant contributions in three key research areas: batch reinforcement learning, strategic exploration given rich observations and representation learning. Taken together, researchers say, these breakthroughs aim to boost the efficiency of models and expand the scope of problems that reinforcement learning can solve.
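Batch (or offline) reinforcement learning works from interactions that were already logged instead of exploring live. A closely related building block is off-policy evaluation: estimating how a new policy would have performed using data collected by an older one. The sketch below uses inverse propensity scoring (IPS), a standard estimator, on invented logs from a uniformly random logging policy. It is a generic illustration, not the method of any particular NeurIPS paper.

```python
import random

def collect_logs(n, reward_probs, seed=0):
    """Simulate logs from a uniformly random logging policy (invented data)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    logs = []
    for _ in range(n):
        action = rng.randrange(n_actions)
        propensity = 1.0 / n_actions   # probability the logging policy chose this action
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        logs.append((action, propensity, reward))
    return logs

def ips_value(logs, target_prob):
    """Estimate a new policy's average reward from old logs.

    target_prob(action) is the probability the new policy would pick `action`.
    Reweighting each logged reward by target_prob / propensity corrects for the
    mismatch between the logging policy and the policy being evaluated.
    """
    total = sum(target_prob(a) / p * r for a, p, r in logs)
    return total / len(logs)

def always_third(action):
    """Candidate policy: always choose action 2."""
    return 1.0 if action == 2 else 0.0

def uniform(action):
    """The logging policy itself: each of three actions with probability 1/3."""
    return 1.0 / 3

logs = collect_logs(20000, reward_probs=[0.2, 0.5, 0.8])
print(f"always-third policy: {ips_value(logs, always_third):.3f}")
print(f"uniform policy:      {ips_value(logs, uniform):.3f}")
```

In this made-up setup the true values are 0.8 and 0.5, so the estimates can be checked. IPS only works when every action the new policy might take had a nonzero probability under the logging policy.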


From research labs to real-world products

Personalizer, the first Azure Cognitive Service to be built on reinforcement learning, grew out of a close collaboration between Microsoft researchers and Azure product experts. They wanted to help developers easily serve the right content to the right users at the right time without requiring a deep knowledge of machine learning.

Metrics Advisor, a new Azure Cognitive Service now available in public preview, also uses reinforcement learning to incorporate feedback and make models more adaptive to a customer’s dataset, which helps detect more subtle anomalies in sensors, production processes or business metrics.

Personalizer automatically selects what to show someone looking at a website or what question a chatbot should ask next to drive a desired business or experience outcome. That could be getting a person to commit to healthier eating habits or to try a new gaming experience. The agent learns through trial and error which content is most helpful or persuasive to different kinds of users.

In trying to make a video recommendation, for instance, what someone prefers to watch might be driven by what time of day it is, whether they’re sitting at home or moving around, or how much battery their device has left. Personalizer learns from the choices or actions that customers with similar characteristics have made.
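Choosing content from context like this is commonly framed as a contextual bandit problem. The sketch below is a minimal tabular version, assuming two invented context features (time of day, and whether the viewer is at home or on a mobile device), three hypothetical videos and made-up click probabilities. It keeps a separate reward estimate for each context and action pair, so viewers who share a context effectively pool their feedback. It is not Personalizer's implementation or API; production systems typically generalize across features with a model rather than a lookup table.

```python
import random
from collections import defaultdict

ACTIONS = ["news clip", "short comedy", "full-length film"]  # hypothetical content

def true_click_prob(context, action):
    """Hidden viewer behavior the agent must discover (invented for illustration)."""
    time_of_day, location = context
    if location == "mobile":
        return 0.7 if action == "short comedy" else 0.1
    if time_of_day == "evening":
        return 0.6 if action == "full-length film" else 0.2
    return 0.5 if action == "news clip" else 0.15

def learn_preferences(steps=20000, epsilon=0.1, seed=1):
    rng = random.Random(seed)
    counts = defaultdict(int)
    values = defaultdict(float)   # average reward keyed by (context, action)
    contexts = [(t, loc) for t in ("morning", "evening") for loc in ("home", "mobile")]
    for _ in range(steps):
        context = rng.choice(contexts)          # a visitor arrives with some context
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)        # explore
        else:
            action = max(ACTIONS, key=lambda a: values[(context, a)])  # exploit
        reward = 1.0 if rng.random() < true_click_prob(context, action) else 0.0
        key = (context, action)
        counts[key] += 1
        values[key] += (reward - values[key]) / counts[key]
    # Best-known action for each context after learning.
    return {c: max(ACTIONS, key=lambda a: values[(c, a)]) for c in contexts}

learned = learn_preferences()
for context, video in learned.items():
    print(context, "->", video)
```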

Z-Tech, the technology hub of Anheuser-Busch InBev, has deployed Azure Personalizer, which uses reinforcement learning, to deliver tailored recommendations to small grocery stores across Mexico. Photo courtesy of Z-Tech.

Z-Tech, the technology hub of the multinational drink and brewing company AB InBev, started using Personalizer this fall to deliver tailored recommendations to mom and pop grocery stores in Mexico placing orders through the MiMercado online marketplace. It has seen a nearly 100% uplift in click-through rates for the personalized products and a 67% increase in converting customer interest into orders.

“As we were learning about the capabilities of the Azure platform, Personalizer came up as something that was very leading edge and very innovative and solved a need for us,” said Luiz Gondim, global chief technology officer for Z-Tech, which aims to bring data-driven solutions to small- and medium-sized businesses.

In the past, the featured products on MiMercado were the same for all customers. Z-Tech was interested in using AI to make personalized and more useful recommendations for an individual corner store selling everything from beer and baking supplies to potato chips and pet food.

Personalizer had two differentiating benefits, said Richard Sheng, global director of data science and analytics for Z-Tech.

“Reinforcement learning models by their very nature generally require less data because they use the current context to generate recommendations and learn through user feedback,” he said. “And having the models already developed and wrapped into an API that we can use in this plug-and-play way was very helpful.”

Microsoft autonomous systems researchers collaborated with scientists and engineers from Sber to use reinforcement learning to develop robotic technologies for unloading heavy collection bags of coins from mobile carts so they can be counted and repackaged. Sber is a global financial services and technology company that operates SberBank, the largest bank in Russia and Central and Eastern Europe and one of the leading financial institutions worldwide.

In a recently published paper describing those results, the researchers explain why manipulating unstable coin bags, whose center of gravity constantly shifts, is a harder robotics problem than gripping solid objects. It's the kind of scenario that's commonplace in the physical world but that robots relying on traditional control systems or neural networks struggle to master, said Albert Efimov, vice president for research and innovation at SberBank.

“We saw an opportunity to really advance the science and use reinforcement learning to teach a machine to perform a very difficult process,” said Efimov. “The bag has an unpredictable and amorphous shape, and even humans have to think for a minute about how to handle it. For a robot to do this is a big deal.”

The Sber and Microsoft team used deep reinforcement learning and machine teaching techniques to first train the AI agent in a simulated environment, where it could explore different strategies and learn what worked best. Once deployed in real-world working conditions, the robotic system was able to successfully unload the coin bags on the first try 95% of the time.

Researchers from Microsoft’s Cambridge-UK lab and Ninja Theory are exploring how reinforcement learning could help develop AI agents that can collaborate with human players in video games. Note: Image is not representative of final game gameplay or visuals.

In Project Paidia, researchers in Microsoft Research's Cambridge-UK lab are collaborating with Ninja Theory, an Xbox game studio. The goal is to drive state-of-the-art research in reinforcement learning that can enable new applications in modern video games and to develop AI agents that can learn to collaborate with human players.

Agents that use reinforcement learning could better anticipate behaviors and react to nuances, enabling effective collaboration with human players who are creative, unpredictable and have different styles of play, said Katja Hofmann, a principal researcher who leads a team focused on deep reinforcement learning in gaming and other application areas at Microsoft Research's Cambridge-UK lab. Bots developed with current technologies struggle to navigate those complexities and simply don't react in quite the same way that people do.

Principal researcher Katja Hofmann leads a team that focuses on deep reinforcement learning in gaming and other applications at Microsoft’s Cambridge – UK research lab. Photo courtesy of Microsoft.

Video games like Ninja Theory’s Bleeding Edge, which requires characters with different personalities and superpowers to team up to score points and defeat opponents, offer a helpful test bed for developing AI agents that can use reinforcement learning to coordinate actions and react appropriately to new situations through a series of rewards.

“Having a bot that can genuinely collaborate with human players is considered impossible with traditional game AI technology, so that creates a really nice space for us,” Hofmann said. “If we can demonstrate how to do this in gaming, it’s a first step towards demonstrating how we can create agents outside gaming that can work collaboratively with humans in other ways.”

The Project Paidia research team and others across Microsoft helped Azure Machine Learning understand what heavy users of reinforcement learning actually need in terms of infrastructure and compute power.

They've developed tools that let people experiment with the technology, including a demo in which people can play a simple game with a reinforcement learning agent to see how it reacts, as well as Azure Machine Learning sample notebooks for creating an agent that can navigate a lava maze in Minecraft.

Large companies in the industrial, manufacturing and financial services fields that employ data scientists with reinforcement learning expertise are now using Azure Machine Learning’s reinforcement learning offerings introduced earlier this year to efficiently spin up and manage training processes in the cloud, said Keiji Kanazawa, Microsoft principal program manager.

“For customers who are doing large-scale trial and error, the value of the cloud is that they can do it massively,” he said. “Our tools allow customers to focus on what they’re trying to do with reinforcement learning and their goals and the structure of the rewards and all the compute just happens on the back end.”

Top image: Researchers from Microsoft and Sber used reinforcement learning to develop a robotic technology that can unload unwieldy bags of coins from mobile carts. Photo courtesy of Sber.


Jennifer Langston writes about Microsoft research and innovation.