Are you talking to me? Azure AI brings iconic characters to life with Custom Neural Voice

Have you ever wished you could leap into your favorite cartoon and interact with characters like Bugs Bunny who entertain you onscreen?

Welcome to the AT&T Experience Store in Dallas, where a life-size, high-definition Bugs Bunny greets you by name and tells you he needs your help to find several golden carrots hidden throughout the store. Thanks to 5G, augmented reality, artificial intelligence and a Custom Neural Voice created with Microsoft Azure AI technology, Bugs follows your directions to navigate the store in search of carrots, chatting with you in real time.

The technology that makes such a conversation flow naturally is the neural text-to-speech capability within Speech, an Azure Cognitive Service, and it is now generally available.

“One of the things we hear from our customers is they like the idea of communicating with their customers through speech,” said Eric Boyd, corporate vice president for Azure AI Platform at Microsoft. “Speech has been very robotic over the years. Neural voice is a big leap forward to make it sound really natural.”

For AT&T, the immersive Bugs Bunny experience was an opportunity to delight customers while demonstrating the capabilities of their 5G cellular network. The network makes it possible for Bugs to appear in HD quickly and move around the room seamlessly.

“We’re trying to prove to consumers that there is something to 5G that makes it different and better than a 4G network,” said Jay Cary, vice president of 5G product and mobility innovation for AT&T. “It has massive computing power, higher speeds and lower latency. This felt like a really amazing way to bring the potential of the network and the technology to life.”

Bugs Bunny is the first animated character AT&T has brought to life with Custom Neural Voice, but it likely won’t be the last. Cary becomes quite animated himself as he talks about the possibilities: characters coming to life from the cereal box, reading you stories, watching cartoons alongside you or showing you around the neighborhood.

“We love that idea of blending the physical environment and the virtual environment,” he said.

To create the custom voice, an approved Bugs voice actor came into the studio to record about 2,000 phrases and lines, with guidance from the Microsoft team, Cary said.

The Warner Bros. team – “the Bugs Bunny experts,” Cary calls them – then worked with the Microsoft team to iterate on the voice, making sure it accurately reflects Bugs Bunny’s personality and all his inflections.

“We wanted to make sure it really represented what Bugs felt like in the real world,” Cary said. “It feels like a natural speed, real-life conversation you might have with a friend. It feels very real.”

Unreal transparency

A conversation with Bugs Bunny might feel real, but everyone knows that it isn’t – because Bugs is a fictional character. That’s an important distinction, and one that Microsoft is careful to protect in every application of the technology. That’s a key reason Custom Neural Voice is limited access, meaning interested customers must apply and be approved by Microsoft to use the technology. In this case, general availability means it is ready for production and available in more Azure cloud regions, not that it is available to the general public.

While many uses for Custom Neural Voice involve a fictional character, sometimes a customer wants the voice to be a real person, such as an author reading their own book. Even in those cases, it is important that people know the voice is synthetic, which is why Microsoft includes a disclosure requirement in its contract.

“We require customers to make very clear it’s a synthetic voice or, when it’s not immediately obvious in context, that they explicitly disclose it’s synthetic in a way that’s perceivable by users and not buried in terms,” said Sarah Bird, Responsible AI lead for Cognitive Services within Azure AI.

Another fictional voice that neural text-to-speech is bringing to life is Flo, the longtime brand icon for Progressive Insurance.

To bring voice conversation capabilities to its Flo chatbot, Progressive Insurance created a synthetic voice using Custom Neural Voice. Image courtesy of Progressive Insurance.

A few years ago, the company launched a Flo chatbot in Facebook Messenger, complete with the sunny personality and quirky witticisms that customers have come to expect from the salesperson character played by Stephanie Courtney in TV ads since 2008. When the company started to explore the potential of using a voice conversation to interact with customers, Flo was the natural choice.

“One of Progressive’s core interest areas is we want to make our brand and products available wherever and whenever people want,” said Matt White, technology and innovation manager in Progressive’s acquisition experience group. “That’s why we put Flo in Facebook Messenger, and that’s why we started to explore what’s possible with voice and smart speakers.”

Progressive was already using Azure AI technology to power the chatbot, and it made sense to layer the neural text-to-speech service on top, White said.

The general availability of Custom Neural Voice includes technical controls to help prevent misuse of the service. As part of the voice recording script a customer submits to create the custom voice, the voice actor makes a statement acknowledging that they understand the technology and are aware that the customer is having a Custom Neural Voice made. That recording is compared with the training data using speaker verification technology to make sure the voices match before a customer can begin training the voice. Microsoft also contractually requires customers to get consent from voice talent.
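The voice-matching step described above can be pictured with a minimal sketch. The article does not say how Microsoft's speaker verification works internally, so everything here is an assumption for illustration: each recording is imagined to have already been reduced to a numeric "speaker embedding," and the hypothetical `voices_match` helper compares embeddings by cosine similarity against an arbitrary threshold of 0.75. The vectors are made-up toy values, not real audio features.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embedding must not be all zeros")
    return dot / (norm_a * norm_b)


def voices_match(consent_embedding, training_embeddings, threshold=0.75):
    """Hypothetical check: every training clip must resemble the consent clip.

    The threshold is an illustrative assumption, not a documented value.
    """
    if not training_embeddings:
        raise ValueError("no training recordings to compare")
    return all(
        cosine_similarity(consent_embedding, emb) >= threshold
        for emb in training_embeddings
    )


# Toy embeddings (invented numbers standing in for real speaker features).
consent = [0.9, 0.1, 0.3]
same_speaker = [[0.85, 0.15, 0.35], [0.92, 0.05, 0.28]]
other_speaker = [[0.1, 0.9, 0.2]]

print(voices_match(consent, same_speaker))   # training may proceed
print(voices_match(consent, other_speaker))  # training is blocked
```

In this sketch a single mismatched clip is enough to block training, which mirrors the idea that the consent statement and the training data must come from the same voice.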

“We did a number of studies and had interactions with the voice acting industry and ethicists in the field to come up with sets of guidelines and ways we want to make sure this technology is used,” Boyd said.

A commitment to responsibility

Contractual terms, limiting access to approved customers and performing speaker verification on audio files are three ways Microsoft is safeguarding against misuse of the technology. Bird’s role within Microsoft is to help develop protocols and support teams in responsibly developing features and products within Azure Cognitive Services, as well as empowering customers to use them responsibly.

“We really want to demonstrate how we can create these technologies that have this positive impact while making sure that we’re not causing harm in the world,” Bird said.

Microsoft conducts impact assessments to determine potential risks. Once risks have been identified, features and processes are created to address them. In the case of Custom Neural Voice, such safeguards include the review process for each potential use case, a code of conduct and the verification comparing voice talent acknowledgement files against training audio files.

Bird said the team is also working on a way to embed a digital watermark within a synthetic voice to indicate that the content was created with an Azure Custom Neural Voice.

Such technical and policy features are in line with Microsoft’s commitment to responsible AI. That commitment includes Transparency Notes, which communicate the purposes, capabilities and limitations of an AI system.

“As creators of this technology, we have an obligation to make sure it’s used responsibly,” Boyd said. “We take responsible AI very seriously; it’s one of our core tenets. And we’re careful with the partners we work with in making sure they follow the guidelines.”

Building a custom voice

So how do a bunch of recorded phrases become a natural-sounding voice that can say anything?

Recordings are used to create a font of sounds, or phonemes. It’s somewhat similar to a font on a computer containing letters and characters that you combine to make words and sentences.

But neural text-to-speech goes way beyond piecing together sounds to form words.

“The real technology breakthrough is the efficient use of deep learning to process the text to make sure the prosody and pronunciation is accurate,” said Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services. “The prosody is what the tone and duration of each phoneme should be. We combine those in a seamless way so they can reproduce the voice that sounds like the original person.”

Listen to a demonstration of a Custom Neural Voice, created with Huang and his team at Microsoft. Image courtesy of Scott Eklund/Red Box Pictures.

Deep learning is a subset of machine learning, in which machines are taught to learn and analyze data in a similar way to humans. “Deep” refers to the depth of the layers of neural networks, which are inspired by our understanding of how the brain works. These layers upon layers of neural networks work together to perform complex tasks quickly, mapping sequences of data together and learning from each task. In general, adding layers lets a network capture more complex patterns, which tends to produce better results.

In neural text-to-speech, one neural network converts input text into an acoustic sequence, encoding and decoding and predicting prosody, while another neural network converts that acoustic sequence into speech. Between the two, there are about 50 layers.

Because the two neural networks can simultaneously predict the right prosody and synthesize the voice, the result is a more natural-sounding voice.
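To make the two-stage data flow concrete, here is a toy sketch. It is not a neural network and not Microsoft's implementation: both stages are hand-written stand-ins, and the names `text_to_acoustic` and `acoustic_to_waveform`, the pitch formula and the frame durations are all invented for illustration. The only thing it is meant to show is the shape of the pipeline, in which text becomes an intermediate acoustic sequence (here, a pitch and a duration per letter) and that sequence becomes audio samples.

```python
import math

SAMPLE_RATE = 16000  # samples per second; an arbitrary choice for this sketch


def text_to_acoustic(text):
    """Stage 1 stand-in: map text to a list of (pitch_hz, duration_s) frames.

    A real system predicts prosody with a neural network; this toy rule
    simply assigns each letter a fixed duration and a letter-based pitch.
    """
    frames = []
    for ch in text.lower():
        if "a" <= ch <= "z":
            pitch = 120.0 + 5.0 * (ord(ch) - ord("a"))
            frames.append((pitch, 0.08))
        elif ch == " ":
            frames.append((0.0, 0.05))  # a short silence between words
    return frames


def acoustic_to_waveform(frames):
    """Stage 2 stand-in: turn each frame into sine-wave samples."""
    samples = []
    for pitch, duration in frames:
        count = int(round(duration * SAMPLE_RATE))
        for i in range(count):
            if pitch == 0.0:
                samples.append(0.0)
            else:
                samples.append(math.sin(2 * math.pi * pitch * i / SAMPLE_RATE))
    return samples


frames = text_to_acoustic("hi doc")
waveform = acoustic_to_waveform(frames)
print(len(frames), "frames ->", len(waveform), "samples")
```

The output of this toy would sound like a string of beeps. The naturalness the article describes comes from replacing both hand-written rules with trained networks.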

Of course, not everyone needs a custom voice built just for them. Microsoft also has more than 120 prebuilt neural voices in more than 50 languages for customers who want to quickly add read-aloud functionality or give a voice to a chatbot.    
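For readers curious what requesting a prebuilt voice looks like, text-to-speech services commonly accept Speech Synthesis Markup Language (SSML), a W3C standard in which a `voice` element names the voice to use. The sketch below only builds such a document with Python's standard library; it does not call any service. The helper name `build_ssml` is hypothetical, and the default voice name `en-US-JennyNeural` is used as an example of a prebuilt neural voice, so check the current voice list before relying on it.

```python
import xml.etree.ElementTree as ET

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice_name="en-US-JennyNeural", lang="en-US"):
    """Return an SSML string asking for `text` to be read by `voice_name`.

    Special characters in `text` are escaped by ElementTree on output.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NAMESPACE, "xml:lang": lang},
    )
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    voice.text = text
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Welcome back! Your order shipped & is on its way.")
print(ssml)
```

The resulting string would then be handed to a speech synthesis client, which is outside the scope of this sketch. A custom voice would be selected in a similar way, by name, once it has been approved and deployed.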

‘Unlocking people’s creative potential’

At its core, Custom Neural Voice is a creative technology, Bird said. She’s most excited about its possibilities in education, such as in reading books or teaching a new language.

Microsoft worked with a nonprofit organization in Beijing, China, using Custom Neural Voice and a team of volunteers to generate AI audio content to be donated to the Beijing Hongdandan Visually Impaired Service Center, which provides resources for people who are blind or have low vision.

Duolingo, a language learning company, is using Custom Neural Voice as part of its effort to personalize language learning by introducing a cast of characters within the learning platform. The diverse group of nine includes Lily, a deadpan, moody teen, and Junior, a precocious youngster who’s too smart for his own good.

The company went through hundreds of iterations of characters, aiming for them to reflect the user base of cultures around the world while aligning visually with Duo, the app’s longstanding main character.  

“Duolingo is used around the world, and we want people to feel connected to and engaged with the app,” said Duolingo CTO Severin Hacker. 

Duolingo used Custom Neural Voice to help bring nine new characters to life within the language learning platform. Image courtesy of Duolingo.

The shape and other design aspects of each character informed its personality, and they all share some elements with Duo: a unique body shape, detached feet, big eyes and a simple construction. Giving voice to the characters was the final touch in an extensive character creation process. 

“Voice is very important when learning a language,” Hacker said. “It was particularly important for us as a language learning app that we expose our learners to authentic voices and accents, and we’re able to do that with this technology.”  

The company has been working with voice actors to create custom voice fonts for each character. Last year, Duolingo introduced Lily’s voice in English and Spanish, and Junior’s in English. Eventually, all nine characters will be featured in English, Spanish, French, German and Japanese. Language learners can expect to hear from new characters including Bea, a type-A world traveler, and Vikram, a devoted husband and pastry chef, later this year.  

Custom Neural Voice can also be used to create a custom voice font that doesn’t directly mimic an existing person or character. 

“We have the ability to create composite voices and experiment with creating voices that would never really exist by bringing the best of different backgrounds together,” Bird said. “This is technology that is unlocking people’s creative potential.” 

Bird and Boyd believe that Custom Neural Voice technology will open doors for deeper engagement, whether that’s through entertainment, information or education.  

“One of the really exciting things about AI is that we’re constantly surprised by the ways that you can use it that are way beyond what we originally envisioned,” Boyd said. “It’s just really exciting to see what people can do with it.”  

John Roach contributed to this post.  

Top image: Visitors to the AT&T Experience Store in Dallas can interact with Bugs Bunny and other characters in augmented reality. Bugs speaks to customers using a synthetic voice created using Custom Neural Voice, a capability within Azure Cognitive Services. LOONEY TUNES and all related characters and elements © & ™ Warner Bros. Entertainment Inc. (s21).