Talking with your hands: How Microsoft researchers are moving beyond keyboard and mouse

Kfir Karmon imagines a world in which a person putting together a presentation can add a quote or move an image with a flick of the wrist instead of a click of a mouse.

Jamie Shotton envisions a future in which we can easily interact in virtual reality much like we do in actual reality, using our hands for small, sophisticated movements like picking up a tool, pushing a button or squeezing a soft object in front of us.

And Hrvoje Benko sees a way in which those types of advances could be combined with simple physical objects, such as a few buttons on a piece of wood, to recreate complex, immersive simulators – replacing expensive hardware that people use today for those purposes.

Microsoft researchers are looking at a number of ways in which technology can start to recognize detailed hand motion — and engineers can put those breakthroughs to use in a wide variety of fields.

The ultimate goal: Allowing us to interact with technology in more natural ways than ever before.

“How do we interact with things in the real world? Well, we pick them up, we touch them with our fingers, we manipulate them,” said Shotton, a principal researcher in computer vision at Microsoft’s Cambridge, UK, research lab. “We should be able to do exactly the same thing with virtual objects. We should be able to reach out and touch them.”

This kind of technology is still evolving. But the computer scientists and engineers who are working on these projects say they believe they are on the cusp of making hand and gesture recognition tools practical enough for mainstream use, much like many people now use speech recognition to dictate texts or computer vision to recognize faces in photos.

That’s a key step in Microsoft’s broader goal to provide more personal computing experiences by creating technology that can adapt to how people move, speak and see, rather than asking people to adapt to how computers work.

“If we can make vision work reliably, speech work reliably and gesture work reliably, then people designing things like TVs, coffee machines or any of the Internet of Things gadgets will have a range of interaction possibilities,” said Andrew Fitzgibbon, a principal researcher with the computer vision group at the UK lab.

That will be especially important as computing becomes more ubiquitous and increasingly anticipates our needs, as opposed to responding to our commands. To make these kinds of ambient computing systems truly work well, experts say, they must be able to combine all our senses, allowing us to easily communicate with gadgets using speech, vision and body language together – just like we do when communicating with each other.

The team working on hand tracking in Microsoft’s UK lab includes Tom Cashman (top left, standing), Andrew Fitzgibbon, Lucas Bordeaux, John Bronskill, (bottom row) David Sweeney, Jamie Shotton, Federica Bogo. Photo by Jonathan Banks.

Smooth, accurate and easy

To accomplish part of that vision, Fitzgibbon and other researchers believe the technology must track hand motion precisely and accurately, using as little computing power as possible. That will allow people to use their hands naturally and with ease, and allow consumer gadgets to respond accordingly.

It’s easier said than done, in large part because the hand itself is so complex. Hands can rotate completely around, and they can do things like ball up into a fist, which means the fingers disappear and the tool needs to make its best guess as to where they’ve gone and what they are doing. Also, a hand is obviously smaller than an entire body, so there’s more detailed motion to track.

The computer vision team’s latest advances in detailed hand tracking, which are being unveiled at two prestigious academic research conferences this summer, combine new breakthroughs in methods for tracking hand movement with an algorithm dating back to the 1940s – when computing power was less available and a lot more expensive. Together, they create a system that can track hands smoothly, quickly and accurately – in real time – but can run on a regular consumer gadget.

“We’re getting to the point that the accuracy is such that the user can start to feel like the avatar hand is their real hand,” Shotton said.

The system, still a research project for now, can track detailed hand motion with a virtual reality headset or without it, allowing the user to poke a soft, stuffed bunny, turn a knob or move a dial.

What’s more, the system lets you see what your hands are doing, fixing a common and befuddling disconnect that happens when people are interacting with virtual reality but can’t see their own hands.

From dolphins to detailed hand motion

The project, called Handpose, relies on a wealth of basic computer vision research. For example, a research project that Fitzgibbon and his colleague Tom Cashman worked on years earlier, looking at how to make 2D images of dolphins into 3D virtual objects, proved useful in developing the Handpose technology.

The researchers say that’s an example of how a long-term commitment to this kind of research can pay off in unexpected ways.

Although hand movement recognition isn’t being used broadly by consumers yet, Shotton said that he thinks the technology is now getting good enough that people will start to integrate it into mainstream experiences.

“This has been a research topic for many, many years, but I think now is the time where we’re going to see real, usable, deployable solutions for this,” Shotton said.

A virtual sense of touch

The researchers behind Handpose say they have been surprised to find that a lack of haptics – or the sense of actually touching something – isn’t as big a barrier as they expected when people test systems like theirs, which let people manipulate virtual objects with their hands.

That’s partly because of how they are designing the virtual world. For example, the researchers created virtual controls that are thin enough that you can touch your fingers together to get an experience of touching something hard. They also developed sensory experiences that allow people to push against something soft and pliant rather than hard and unforgiving, which appears to feel more authentic.

The researchers say they also notice that other senses, such as sight and sound, can convince people they are touching something real when they are not – especially once the systems are good enough to work in real time.

Andy Wilson, left, and Hrvoje Benko are among the researchers working on haptic retargeting. Photo by Jeremy Mashburn.

Still, Benko, a senior researcher in the natural interaction group at Microsoft’s Redmond, Washington, lab, noted that as virtual reality gets more sophisticated, it may become harder to trick the body into immersing itself in the experience without having anything at all to touch.

Benko said he and his lab colleagues have been working on ways to use limited real-world objects to make immersive virtual reality experiences seem more like what humans expect from the real world.

“There’s some value in haptics and so we’re trying to understand what that is,” said Andy Wilson, a principal researcher who directs Microsoft Research’s natural interaction group.

But that doesn’t mean the entire virtual world needs to be recreated. Eyal Ofek, a senior researcher in the natural interaction group, said people can be fooled into believing things about a virtual world if that world is presented with enough cues to mimic reality.

For example, let’s say you want to build a structure using toy blocks in a virtual environment. Using the haptic retargeting research project the Microsoft team created, one building block could be used over and over again, with the virtual environment shifting to give the impression you are stacking those blocks higher and higher even as, in reality, you are placing the same one on the same plane.

The same logic could be applied to a more complex simulator, using just a couple of simple knobs and buttons to recreate a complex system for practicing landing an airplane or other complex maneuvers.

“A single physical object can now simulate multiple instances in the virtual world,” Ofek said.
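
The article does not spell out how the retargeting is computed. As a rough illustration only, the toy-block example can be sketched by gradually offsetting where the hand is drawn, so that the real hand arrives at the one physical block at the same moment the drawn hand arrives at a virtual block somewhere else. The function name `retarget_hand`, the coordinates and the linear blend are all assumptions made for this sketch, not the Microsoft team's implementation.

```python
import math

def retarget_hand(real_hand, start, physical_block, virtual_block):
    """Return where to draw the hand, given where the real hand is.

    Hypothetical sketch: the drawn hand is shifted by a growing share of
    the gap between the physical block and the virtual block.
    """
    d_start = math.dist(real_hand, start)           # distance already travelled
    d_block = math.dist(real_hand, physical_block)  # distance still to go
    total = d_start + d_block
    # 0.0 at the starting point, 1.0 when the real hand touches the block.
    progress = d_start / total if total > 0 else 1.0
    return tuple(
        h + progress * (v - p)
        for h, v, p in zip(real_hand, virtual_block, physical_block)
    )

# One real block on the table, 40 cm in front of the starting hand position.
start = (0.0, 0.0, 0.0)
physical_block = (0.4, 0.0, 0.0)

# The third virtual block in a stack sits 10 cm higher than the real one.
third_virtual_block = (0.4, 0.10, 0.0)

# When the real hand touches the real block, the drawn hand is on the
# third virtual block, even though the real block never left the table.
drawn = retarget_hand(physical_block, start, physical_block, third_virtual_block)
```

Because the offset grows smoothly along the reach, the drawn hand never jumps; the sketch ignores everything a real system would need to handle, such as how large an offset people can tolerate before they notice it.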

The language of gesture

Let’s say you’re talking to a colleague over Skype and you’re ready to end the call. What if, instead of using your mouse or keyboard to click a button, you could simply make the movement of hanging up the phone?

Need to lock your computer screen quickly? What if, instead of scrambling to close windows and hit keyboard shortcuts, you simply reach out and mimic the gesture of turning a key in a lock?

Researchers and engineers in Microsoft’s Advanced Technologies Lab in Israel are investigating ways in which developers could create tools that would allow people to communicate with their computer using the same kind of hand gestures they use in everyday life.

The goal of the research project, called Project Prague, is to provide developers with basic hand gestures, such as the one that switches a computer off. It also aims to make it easy for developers to create customized gestures for their own apps or other products, with very little additional programming or expertise.

The system, which uses machine learning to recognize motions, works with a retail 3D camera.

“It’s a super easy experience for the developers and for the end user,” said Karmon, a principal engineering manager who is the project’s lead.

To build the system, the researchers recorded millions of hand images and then used that data set to train the technology to recognize every possible hand pose and motion.

Eyal Krupka, a principal applied researcher and head of the lab’s computer vision and machine learning research, said the technology then uses hundreds of micro artificial intelligence units, each analyzing a single aspect of the user’s hand, to accurately interpret each gesture.

The end result is a system that doesn’t just recognize a person’s hand, but also understands that person’s intent.
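
The article does not say how those units are built or combined. Purely as a toy illustration of the idea of many small checks that each look at one aspect of the hand, the sketch below treats each unit as a yes/no test and a gesture as a sequence of poses. Every name, feature and threshold here is hypothetical and is not taken from Project Prague.

```python
from typing import Callable, Dict, List

Hand = Dict[str, float]        # hypothetical per-frame hand measurements
Unit = Callable[[Hand], bool]  # one small check on one aspect of the hand

# Three tiny units, each looking at a single property (made-up thresholds).
def fingers_pinched(hand: Hand) -> bool:
    return hand["pinch_gap_cm"] < 2.0

def wrist_neutral(hand: Hand) -> bool:
    return abs(hand["wrist_roll_deg"]) < 15.0

def wrist_turned(hand: Hand) -> bool:
    return hand["wrist_roll_deg"] > 60.0

def pose_matches(hand: Hand, units: List[Unit]) -> bool:
    """A pose is recognized only if every unit assigned to it agrees."""
    return all(unit(hand) for unit in units)

def gesture_detected(frames: List[Hand], poses: List[List[Unit]]) -> bool:
    """True if the frames pass through each pose in order."""
    step = 0
    for hand in frames:
        if step < len(poses) and pose_matches(hand, poses[step]):
            step += 1
    return step == len(poses)

# "Turn a key in a lock": pinch with a level wrist, then pinch with a turned wrist.
TURN_KEY = [[fingers_pinched, wrist_neutral], [fingers_pinched, wrist_turned]]

turning = [
    {"pinch_gap_cm": 1.0, "wrist_roll_deg": 0.0},
    {"pinch_gap_cm": 1.0, "wrist_roll_deg": 30.0},
    {"pinch_gap_cm": 1.0, "wrist_roll_deg": 70.0},
]
open_hand = [{"pinch_gap_cm": 9.0, "wrist_roll_deg": r} for r in (0.0, 30.0, 70.0)]

locked = gesture_detected(turning, TURN_KEY)      # pinch held while the wrist turns
ignored = gesture_detected(open_hand, TURN_KEY)   # same turn, but no pinch
```

In a trained system the checks would be learned from data rather than written by hand; the sketch only shows how many narrow judgments could add up to a decision about intent.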

Adi Diamant, who directs the Advanced Technologies Lab, said that when people think about hand and gesture recognition, they often think about ways it can be used for gaming or entertainment. But he also sees great potential for using gesture for everyday work tasks, like designing and giving presentations, flipping through spreadsheets, editing e-mails and browsing the web.

People could also use gestures for more creative tasks, like creating art or making music.

Diamant said these types of experiences are only possible because of advances in fields including machine learning and computer vision, which have allowed his team to create a system that gives people a more natural way of interacting with technology.

“We chose a project that we knew was a tough challenge because we knew there was a huge demand for hand gesture,” he said.

Allison Linn is a senior writer at Microsoft. Follow her on Twitter.