Microsoft researchers release graph that helps machines conceptualize


To most computers, the word “Jaguar” printed on an otherwise blank screen is simply a string of characters.

It’s different for people. You see a word associated with a big cat, a large mammal. Given the context of valet parking, it might also bring to mind a luxury brand that is similar to Mercedes and BMW.

Put another way, you have a collection of ideas, or concepts, of what “Jaguar” means and the mental agility to use context to infer which concept the writer of the word intended to convey.

On Tuesday, a team of scientists from Microsoft Research Asia, Microsoft’s research lab in Beijing, China, announced the public release of technology designed to help computers conceptualize in a humanlike fashion.

From left, Lei Ji, Jun Yan and Dawei Zhang of Microsoft Research Asia were key players in the development of Microsoft Concept Graph. (Photo credit: Microsoft.)

The Microsoft Concept Graph, as it is known, is a massive graph of concepts – more than 5.4 million and growing – that machine-learning algorithms are culling from billions of web pages and years’ worth of anonymized search queries.

“We want to provide machines some commonsense, high-level concepts” so that they can better understand, and process, human communication, says Jun Yan, a senior research manager at Microsoft Research Asia, who is working on the project.

Knowledge graphs such as this one are a major component of ongoing efforts in industry and academia to computationally simulate human thinking, which computer scientists argue is a hallmark of true artificial intelligence.

“The limitation of computers is that they do not have commonsense knowledge or semantics. They can only understand the characters of words,” Yan explains. “But with humans it is different. Humans have a lot of background knowledge to understand things.”

Conceptual computing

The research behind the Microsoft Concept Graph has been ongoing for six years. The technology has potential applications that range from keyword advertising and search enhancement to the development of human-like chatbots.

For example, in traditional search advertising, a luxury car company buys a list of keywords related to products it wants to sell, such as various models of sport utility vehicles, or SUVs, Yan explains. When those models are queried, the engine surfaces an ad for the car company.

Using data from the Microsoft Concept Graph, the keyword sales team can also suggest that the car company buy related keywords, such as “upmarket SUV,” “top crossover” and potentially hundreds more.


“This is an opportunity to earn more revenue from the advertiser, and for the advertiser to reach a larger audience,” Yan says.

Daxin Jiang, a China-based principal development manager with Microsoft’s search engine Bing, has collaborated with the Concept Graph team for three years to incorporate conceptualization techniques to improve the ranking and relevance of search results.

For example, the graph recognizes certain phrases as single entities. When “Microsoft Research Asia” is queried, Bing ranks documents with the phrase “Microsoft Research Asia” higher than documents where “Microsoft,” “Research” and “Asia” are separated by additional words or punctuation.

His group is also leveraging the Concept Graph for question answering. For example, the graph can answer the question “What are the Asian developing countries?”

“The Concept Graph scans through web pages and extracts instances that belong to concepts,” Jiang explains. “‘Asian developing countries’ is a concept and China, India, etc., are all instances for this concept.”
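As a rough illustration of how extracted instance-concept pairs could answer such a question, the sketch below inverts a handful of pairs into a concept-to-instances lookup. The function names, the two sample pairs and the "What are the ..." matching rule are assumptions made for this example; they do not describe how Bing implements question answering.

```python
from collections import defaultdict

# Hypothetical sample of (instance, concept) pairs, taken from the quote above.
PAIRS = [
    ("China", "asian developing countries"),
    ("India", "asian developing countries"),
]

def build_index(pairs):
    """Group instances under the concept they belong to."""
    index = defaultdict(list)
    for instance, concept in pairs:
        index[concept.lower()].append(instance)
    return index

def answer(question, index):
    """Answer questions of the form 'What are the <concept>?' by lookup."""
    prefix = "what are the "
    q = question.strip().rstrip("?").lower()
    if not q.startswith(prefix):
        return []  # this toy matcher only understands one question shape
    return index.get(q[len(prefix):], [])

index = build_index(PAIRS)
print(answer("What are the Asian developing countries?", index))
```

A production system would need far more robust question parsing; the point here is only that a concept acts as a key whose values are its instances.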

Learning conceptualization

To create the Microsoft Concept Graph, Yan and colleagues trained a machine-learning algorithm to search through the database of indexed web pages and search queries for word associations linked together by basic, common speech patterns including the phrases “such as” and “is a.”

For example, if a web page contains the text “an animal, such as a dog,” the algorithm selects “animal” as a candidate concept for the instance “dog,” Yan explains. The text “Microsoft is a technology company” results in the instance “Microsoft” paired with the concept “technology company.”
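A minimal sketch of this kind of pattern matching is shown below. The two regular expressions and the function name `extract_pairs` are assumptions chosen for illustration; the article does not describe the actual pattern set or implementation, and real text would need proper noun-phrase parsing rather than single-word matches.

```python
import re

# Illustrative patterns only: "<concept>, such as <instance>" and
# "<instance> is a <concept>". The concept in SUCH_AS is naively taken
# to be the single word right before "such as".
SUCH_AS = re.compile(r"(\w+),? such as (?:(?:a|an|the) )?(\w+)")
IS_A = re.compile(r"(\w+) is an? ([\w ]+?)(?:[.,;]|$)")

def extract_pairs(text):
    """Return candidate (instance, concept) pairs found in the text."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        pairs.append((m.group(2), m.group(1)))
    for m in IS_A.finditer(text):
        pairs.append((m.group(1), m.group(2)))
    return pairs

print(extract_pairs("an animal, such as a dog"))
print(extract_pairs("Microsoft is a technology company."))
```

Because the matcher is this naive, it will also produce wrong candidates from ambiguous sentences, which is why a later filtering step is needed.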

The algorithm also performs a statistical analysis to weed out rare or incorrect instance-concept pairs that arise from semantic ambiguity.

For example, on the first pass, the sentence “domestic animals other than dogs such as cats” produces two results: “cat is a dog” and “cat is a domestic animal,” which are both derived from the pattern “such as.”

As the algorithm processes more and more pages of text, it learns that “cat is a domestic animal” is more frequent than “cat is a dog.” When the frequency difference between the two ambiguous meanings crosses a defined threshold, the algorithm weeds out “cat is a dog.”
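The filtering step can be sketched as a comparison of how often each competing pair has been seen across the corpus. The counts, the `min_ratio` threshold and the function name below are hypothetical; the article says only that a "defined threshold" on the frequency difference is used, not what it is or how it is computed.

```python
from collections import Counter

# Hypothetical corpus-wide counts for the two competing readings.
counts = Counter({
    ("cat", "domestic animal"): 500,
    ("cat", "dog"): 3,
})

def resolve_ambiguity(candidates, counts, min_ratio=10.0):
    """Keep candidates whose count is within min_ratio of the most frequent one.

    A candidate is dropped when the best candidate has been seen at least
    min_ratio times as often, or when the candidate has never been seen.
    Expects a non-empty list of candidates.
    """
    best = max(counts[c] for c in candidates)
    return [c for c in candidates
            if counts[c] > 0 and best / counts[c] < min_ratio]

candidates = [("cat", "dog"), ("cat", "domestic animal")]
print(resolve_ambiguity(candidates, counts))
```

With these made-up counts, "cat is a domestic animal" survives and "cat is a dog" is discarded.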


“We only keep the frequently mentioned things by different people on different webpages,” Yan says. “That way we have confidence in the instance and concept pair.”

Humans, too, are recruited to look over segments of the data for erroneous pairs, which helps improve the quality of the graph.

The result is millions of concepts, ranging from the common “cities” and “musicians” to the rare “wedding dress designers” and “acid blocking heartburn drugs.”

Each concept is linked to a set of instances and described by attributes such as person, thing and object as well as relationships such as located in, friend of and president of.

Tagging model

Along with the Microsoft Concept Graph, the researchers released a related technology called the Microsoft Concept Tagging Model, which automatically maps instances to concepts with a probability score, giving machines a humanlike ability to conceptualize.

The model is based on a machine-learning algorithm that weights, or scores, matches for a given instance-concept pair. In this way, the most computationally useful concept, a so-called basic-level concept, is ranked highest.

For example, the instance “Microsoft” automatically maps to the concepts “company,” “software company” and “largest OS vendor.” Both “company” and “largest OS vendor” are highly related to Microsoft, but “software company” is the most useful, and thus highest ranked, concept.


Microsoft is certainly a “company,” but so too are ExxonMobil and McDonald’s, which have little else in common with Microsoft. By contrast, “largest OS vendor” applies only to Microsoft. “Software company” is a concept that relates to Microsoft as well as to similar companies such as IBM, Adobe and Oracle.

In other words, “software company” is specific without being too specific; it is general enough to be related to several other instances, which makes it useful for semantic computation such as performing searches or answering questions.
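One way to capture "specific without being too specific" in numbers is to multiply two probabilities: how likely the concept is given the instance, and how likely the instance is given the concept. The sketch below does that with made-up counts. Both the scoring formula and the figures are assumptions for illustration; the article does not state how the tagging model actually computes its scores.

```python
# Hypothetical co-occurrence counts n(instance, concept).
PAIR_COUNTS = {
    ("Microsoft", "company"): 1000,
    ("Microsoft", "software company"): 600,
    ("Microsoft", "largest OS vendor"): 20,
}

# Hypothetical total counts n(concept) across all instances.
CONCEPT_TOTALS = {
    "company": 1_000_000,       # very broad: Microsoft is a tiny share
    "software company": 6_000,  # moderately broad
    "largest OS vendor": 20,    # applies only to Microsoft
}

def rank_concepts(instance, pair_counts, concept_totals):
    """Rank concepts by P(concept | instance) * P(instance | concept)."""
    own = {c: n for (e, c), n in pair_counts.items() if e == instance}
    total = sum(own.values())
    scores = {c: (n / total) * (n / concept_totals[c]) for c, n in own.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_concepts("Microsoft", PAIR_COUNTS, CONCEPT_TOTALS))
```

Under these assumed counts, "company" is penalized for being too broad and "largest OS vendor" for being rarely mentioned, so "software company" ranks first.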

The accuracy of the model increases as it incorporates the context of surrounding words.

For example, for the sentence, “I want to eat an apple,” the tagging model gives the “fruit” concept more weight, as a person is unlikely to eat the well-known technology company. The weighting is reversed for “I want to visit Apple” since “visit” is more likely associated with “technology company.”

“Based on the context of previous terms, we can distinguish the detail of the concept to further filter out irrelevant concepts,” Yan explains. “When you see ‘eat apple’ we know the high probability thing is the fruit.”
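The effect of context can be sketched as a simple reweighting: start with a prior over concepts for the instance, then multiply by how strongly each context word is associated with each concept. All of the probabilities, table names and the `tag` function below are hypothetical, and this naive scheme is only a stand-in; the article does not describe the model's actual mechanism.

```python
# Hypothetical prior P(concept | instance) with no context.
PRIOR = {"apple": {"fruit": 0.5, "technology company": 0.5}}

# Hypothetical association strengths P(context word | concept).
CONTEXT = {
    "eat": {"fruit": 0.9, "technology company": 0.001},
    "visit": {"fruit": 0.01, "technology company": 0.3},
}

def tag(instance, context_words, unseen=1e-3):
    """Return normalized concept weights for an instance given context words."""
    scores = dict(PRIOR[instance])
    for word in context_words:
        likelihoods = CONTEXT.get(word)
        if likelihoods is None:
            continue  # words with no known association leave weights unchanged
        for concept in scores:
            scores[concept] *= likelihoods.get(concept, unseen)
    total = sum(scores.values())
    return {concept: s / total for concept, s in scores.items()}

print(tag("apple", "i want to eat an apple".split()))
print(tag("apple", "i want to visit apple".split()))
```

With these assumed numbers, "eat" pushes nearly all of the weight to "fruit," while "visit" shifts it to "technology company."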

Model release

The public release of the Microsoft Concept Graph and Microsoft Concept Tagging Model is intended to support research on natural language understanding for technologies such as search engines, chatbots and other artificial intelligence systems, according to Yan.

“We want to encourage more people to utilize our fundamental service,” he says.

Yanghua Xiao, an associate professor of computer science at Fudan University in Shanghai, China, for example, is using the graph in his research on enabling machines to understand human language, including natural language questions.

Take, for example, the question: “How many people are there in New York?” which is about the population of a city.

“Whatever the city is, say Shanghai or London, they share the same semantic template,” he notes. “The Concept Graph, which contains facts like ‘New York is a city’ can help us build the template so that the machine can understand the question with the template and answer the question with exact answers.”
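A minimal sketch of the template idea: replace a known instance in the question with a placeholder for its concept, so that questions about different cities collapse to the same form. The small lookup table, the `$concept` placeholder syntax and the function name are assumptions for illustration, not a description of Xiao's system.

```python
# Hypothetical slice of "is a" facts from the graph.
IS_A = {"New York": "city", "Shanghai": "city", "London": "city"}

def to_template(question):
    """Replace the first known instance with a '$concept' placeholder.

    Returns (template, instance). Longer instance names are tried first
    so that multi-word names are matched as a whole.
    """
    for instance in sorted(IS_A, key=len, reverse=True):
        if instance in question:
            return question.replace(instance, "$" + IS_A[instance]), instance
    return question, None

print(to_template("How many people are there in New York?"))
print(to_template("How many people are there in Shanghai?"))
```

Both questions map to the template "How many people are there in $city?", and the extracted instance tells the system which city's population to look up.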

The Microsoft Concept Graph and Microsoft Concept Tagging Model are available to download for research purposes. The current release includes the core version of concept data in English mined from billions of web pages and search queries.

Future releases will include conceptualization with context for understanding short and long texts as well as support for Chinese.


John Roach writes about Microsoft research and innovation.