Microsoft researchers win ImageNet computer vision challenge

Microsoft researchers on Thursday announced a major advance in technology designed to identify the objects in a photograph or video, showcasing a system whose accuracy meets and sometimes exceeds human-level performance.

Microsoft’s new approach to recognizing images also took first place in several major categories of image recognition challenges Thursday, beating out many other competitors from academic, corporate and research institutions in the ImageNet and Microsoft Common Objects in Context challenges.

Like many other researchers in this field, Microsoft relied on a method called deep neural networks to train computers to recognize the images. Their system was more effective because it allowed them to use extremely deep neural nets, which are as much as five times deeper than any previously used.

The researchers say even they weren’t sure this new approach was going to be successful – until it was.

Kaiming He. Photo by Microsoft.

“We even didn’t believe this single idea could be so significant,” said Jian Sun, a principal research manager at Microsoft Research who led the image understanding project along with teammates Kaiming He, Xiangyu Zhang and Shaoqing Ren in Microsoft’s Beijing research lab.

The major leap in accuracy surprised others as well. Peter Lee, a corporate vice president in charge of Microsoft Research’s NExT labs, said he was shocked to see such a major breakthrough.

“It sort of destroys some of the assumptions I had been making about how the deep neural networks work,” he said.

The contests, organized by researchers from top universities and corporations, have in the past few years become a leading barometer of success in this exploding field.

In the ImageNet challenge, the Microsoft team won first place in all three categories it entered: classification, localization and detection. Its system was better than the other entrants by a large margin.

In the Microsoft Common Objects in Context challenge, also known as MS COCO, the Microsoft team won first place for image detection and segmentation. The MS COCO project was originally funded by Microsoft and started as a collaboration between Microsoft and a few universities, but it is now run by academics outside of Microsoft.

A long research path and a recent breakthrough

Computer scientists have for decades been trying to train computer systems to do things like recognize images and comprehend speech, but until recently those systems were plagued with inaccuracies.

Then, about five years ago, researchers hit upon the idea of using a technology called neural networks, which are inspired by the biological processes of the brain. The neural networks themselves weren’t new, but the method of using them was – and it resulted in big leaps in accuracy in image recognition.

The system also proved very successful for recognizing speech, and it’s been the basis for the real-time translation capability in Skype Translator.

Neural networks are built in a series of layers. Theoretically, more layers should lead to better results, but in practice one big challenge has been that the signals vanish as they pass through each layer, eventually leading to difficulties in training the whole system.
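To make the vanishing-signal problem concrete, here is a toy sketch rather than anything the researchers ran. It assumes each layer scales the training signal by a fixed factor below 1. The value 0.25 is an illustrative assumption (it is the largest possible slope of a sigmoid activation), and the function name gradient_scale is hypothetical.

```python
def gradient_scale(num_layers, per_layer_factor=0.25):
    """Return how much a training signal shrinks after passing through
    num_layers layers, assuming every layer multiplies it by the same
    factor. The 0.25 default is an illustrative assumption, not a figure
    from the article."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale


if __name__ == "__main__":
    # Depths mentioned in the article: 8, roughly 30, and 152 layers.
    for depth in (8, 30, 152):
        print(depth, gradient_scale(depth))
```

Under this simplified assumption the signal shrinks exponentially with depth, which is why simply stacking more layers tends to make training harder rather than easier.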

Sun said researchers were excited when they could successfully train a “deep neural network” system with eight layers three years ago, and thrilled when a “very deep neural network” with 20 to 30 layers delivered results last year.

But he and his team thought they could go even deeper. For months, they toyed with various ways to add more layers and still get accurate results.

After a lot of trial and error, the researchers hit on a system they dubbed “deep residual networks.”

The deep residual net system they used for the ImageNet contest has 152 layers – five times more than any past system – and it uses a new “residual learning” principle to guide the network architecture designs.

Residual learning reformulates the learning procedure and redirects the information flow in deep neural networks. That helped the researchers solve the accuracy problem that has traditionally dogged attempts to build extremely deep neural networks.
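The article does not spell out the mechanics, but the commonly described form of a residual block is that each block adds a learned correction F(x) to its own input x, so its output is x + F(x). The sketch below illustrates that idea in plain Python under that assumption. The names residual_block, zero_residual and stack_blocks are hypothetical, and the "layers" are stand-in functions, not trained networks.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the input is carried forward unchanged through a
    shortcut, and the layer only contributes a correction on top of it."""
    correction = residual_fn(x)
    return [xi + ci for xi, ci in zip(x, correction)]


def zero_residual(x: Vector) -> Vector:
    """A stand-in layer that has learned nothing: its correction is zero."""
    return [0.0 for _ in x]


def stack_blocks(x: Vector, residual_fns: List[Callable[[Vector], Vector]]) -> Vector:
    """Pass x through a sequence of residual blocks, one after another."""
    for fn in residual_fns:
        x = residual_block(x, fn)
    return x


if __name__ == "__main__":
    signal = [1.0, -2.0, 0.5]
    # 152 blocks whose corrections are all zero leave the signal intact.
    print(stack_blocks(signal, [zero_residual] * 152))
```

In this sketch a block that contributes nothing simply passes its input through, so adding more blocks does not by itself erase the signal. That is one plausible reading of how the information flow is "redirected"; the team's actual architecture has details the article does not cover.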

Transfer of knowledge

One key advantage of neural networks is that they get better at one task when they are also trained on another. For example, with Skype Translator, a neural network that is designed to translate from English to German gets better at translating German once it has been trained for the additional task of translating Chinese.

Sun said his team saw similar results when they tested their residual neural networks in advance of the two competitions. After researchers used the system for the classification tasks in the ImageNet challenge, they found that it was significantly better at three other tasks: detection, localization and segmentation.

“What we learned from our extremely deep networks is so powerful and generic that it can substantially improve many other vision tasks,” Sun said.

The researchers believe they would see a similar effect if they used the same principle for other problems, such as speech recognition.

They are already using these new advances to help improve the tools in Microsoft Project Oxford, which help developers build more intelligent apps for things like speech and image recognition. They also are working closely with Microsoft’s product groups to include the best image understanding in existing or future Microsoft products and services.

None of this means that computers are getting smarter than humans, in a general way. The researchers say what it shows is that computers are getting very good at very narrow tasks, like identifying images in a database.

Still, that has big implications for how computers could eventually help people in any number of ways, like recognizing the difference between a tree and a car in your side view mirror or the frustrating task of sorting through photos for specific things, like a great picture of your dog.

“We don’t believe we’re anywhere close to the limit of the ultimate improvement in data classification accuracy for any of these tasks,” Lee said.

Allison Linn is a senior writer at Microsoft Research. Follow her on Twitter.