Microsoft researchers detect lung-cancer risks in web search logs


Smoking cigarettes is the leading cause of lung cancer, the most common cause of cancer death in the world. But nearly 20 percent of lung-cancer diagnoses are made in people who are non-smokers. That means that, in addition to smoking, geographic, demographic and genetic factors play a role in the devastating disease.

A project from Microsoft’s research labs is exploring the feasibility of using anonymized web search data to learn more about lung-cancer risk factors and provide early warning to people who are candidates for disease screening.

The findings, published Thursday in JAMA Oncology, extend research that team members published last June on the feasibility of using the text of questions people ask search engines to predict diagnoses of pancreatic cancer. The machine-learning method builds on patterns found in the search queries.

“Here, we are not just looking at the text of the queries; we also consider the locations that people are in when they issue these queries and we tie that back to contextual risk factors linked to those locations,” says study co-author Ryen White, chief technology officer for health intelligence at Microsoft Health in Redmond, Washington.

For example, the model developed by the researchers determines the ZIP code where the search was issued and correlates the location data with maps from the U.S. Geological Survey to determine environmental levels of radon gas, a known lung-cancer risk. Census data reveal the average age of homes in each region, which is relevant as older homes are poorly ventilated and thus can trap radon.

Knowing ZIP codes also helps the researchers infer users’ socioeconomic status and race, providing additional clues on cancer risk. According to the Centers for Disease Control and Prevention, people living below the poverty level have higher rates of smoking than the general population, and death rates for people with cancer are highest among black Americans.

In addition, the model uses algorithms to determine searchers’ likely gender and age from patterns of queries. Searches from the same mobile device within hours of each other from ZIP codes separated by hundreds, or thousands, of miles could indicate air travel.
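The air-travel signal described above can be illustrated with a small sketch. This is not the researchers' code: the `Search` record, the function names, and the distance, time and speed cutoffs are all hypothetical, chosen only to show how two searches from one device might be flagged as implying a flight. It assumes each ZIP code has already been mapped to a latitude and longitude.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Search:
    """One search event from a single device (hypothetical record layout)."""
    hours: float  # time of the search, in hours since an arbitrary start
    lat: float    # latitude of the ZIP code centroid (assumed precomputed)
    lon: float    # longitude of the ZIP code centroid (assumed precomputed)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points on Earth."""
    earth_radius_miles = 3958.8
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_miles * asin(sqrt(a))


def suggests_air_travel(a, b, min_miles=300.0, max_hours=12.0, min_mph=150.0):
    """Return True if two searches are far apart in space but close in time.

    All three cutoffs are illustrative assumptions, not values from the study.
    """
    gap = abs(b.hours - a.hours)
    # Identical timestamps give no usable speed estimate, so treat as no signal.
    if gap == 0 or gap > max_hours:
        return False
    dist = haversine_miles(a.lat, a.lon, b.lat, b.lon)
    # Require both a long distance and a speed faster than typical driving.
    return dist >= min_miles and dist / gap >= min_mph
```

For instance, a search near Seattle followed six hours later by one near New York would be flagged, while two searches a few miles apart, or the same cross-country pair two days apart, would not.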

Taken together, these data “allow us to discover new risk factors, things that might not have been thought of in the past that might actually be important,” White says. “We looked at air travel, for example, as one of the factors that might be tied to a higher likelihood.”

Ryen White, chief technology officer for Microsoft Health and an information retrieval expert. (Photography by Scott Eklund/Red Box Pictures)

The findings are associations, not evidence of a cause, emphasizes study co-author Eric Horvitz, technical fellow and managing director of Microsoft’s research lab in Redmond. But, he adds, they can suggest directions for future clinical studies on lung cancer.

Take plane travel, for example. Horvitz says that although it was useful in their predictive models, the researchers have yet to confirm a causal connection between plane travel and lung cancer. “However, the result frames a hypothesis that can be pursued and studied. Same with how radon gas and older homes link up,” he says.

To develop the model, Horvitz and White identified so-called experiential queries such as “I have just been diagnosed with lung cancer,” which are then followed up with behaviors that provide evidence of a recent diagnosis, such as multiple queries on treatment options and side effects.

The model then looks back in time at the anonymized logs for searches that might signal a pending diagnosis. These include searches about symptoms such as hoarseness and others that provide evidence of known and potential risk factors such as cigarette use, locations linked to elevated radon levels and frequent long-distance travel.

The researchers ran the model on the anonymized logs of nearly 5 million searchers and found that it can identify 1.5 percent to nearly 40 percent of searchers a year in advance of when they will input queries consistent with a lung-cancer diagnosis. The percentages vary as the sensitivity of the model is shifted to limit false positive rates from 1 in 100,000 to 1 in 1,000. The approach performs more effectively for searchers identified as high risk, such as those living in a ZIP code with elevated radon levels.
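The trade-off between false positives and the share of cases caught can be made concrete with a short sketch. It assumes, hypothetically, that a model assigns each searcher a risk score and that a searcher is flagged when the score exceeds a threshold. The function names and the scoring setup are illustrative and are not taken from the study.

```python
def threshold_for_fpr(negative_scores, max_fpr):
    """Pick a score threshold so that at most max_fpr of negatives are flagged.

    negative_scores: scores for searchers who never showed a diagnosis.
    A searcher is flagged when their score is strictly above the threshold.
    """
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we can afford to flag at this false positive rate.
    allowed = int(max_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged
    # Only scores strictly above ranked[allowed] are flagged: at most `allowed`.
    return ranked[allowed]


def recall_at(positive_scores, threshold):
    """Fraction of true cases whose score is strictly above the threshold."""
    flagged = sum(score > threshold for score in positive_scores)
    return flagged / len(positive_scores)
```

Tightening `max_fpr` raises the threshold, so fewer true cases clear it. That is the same pattern the article reports: the share of searchers identified falls as the allowed false positive rate drops from 1 in 1,000 to 1 in 100,000.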

The research, explains Horvitz, who holds both a Ph.D. and an M.D. from Stanford University, is part of a broader and ongoing effort to use the vast aggregations of data compiled from human interactions with the web to help advance clinical medicine.

“People tend to whisper their health concerns into search engines on a regular basis,” he says. “This kind of data can serve as a complement to more formal clinical information.”

The research, he adds, “shows promise for identifying new clinically relevant findings in multiple areas of healthcare.”

The researchers are still discussing how the research might eventually be used. For example, White says, at some point in the future people might consent to having relevant information and inferences from web search logs and other data streams shared with their doctor.

At this point, Horvitz notes, this is purely research. But with publication of the findings in the medical literature, the work could stimulate interest by clinical researchers and inform the development of future screening systems that can catch cancers earlier in their progression.

“The first step,” he says, “is to see if these kinds of things are feasible.”


John Roach writes about Microsoft research and innovation.