Aiming advances in AI at biomedical search
Tomorrow marks one year since COVID-19 was declared a pandemic by the World Health Organization. At the time, the virus had already spread to 114 countries and had killed more than 4,000 people worldwide. Washington was the first state in the United States to face an outbreak, one that raged a couple of miles from my home in Kirkland. The world was engulfed in fear.
In confronting the pandemic, researchers in biology, medicine and epidemiology raced to better understand SARS-CoV-2 and to develop approaches to treat COVID-19. Their insights and findings generated an information explosion: Tens of thousands of research articles were published in the first several months. To date, nearly a half-million articles have been published. And this recent work stands on a massive foundation of hundreds of thousands of relevant publications going back decades, including findings on other viruses in the coronavirus family, investigation of immune responses to respiratory infections and advances with RNA-based vaccines.
Recognizing that tools for navigating information would play a critical role in the fight against the virus, teams at Microsoft were inspired to think more deeply about search and retrieval for biomedical information, spurring new projects. Today, we’re announcing the availability of Microsoft Biomedical Search, a research prototype that enables searchers to query biomedical literature with natural language queries rather than keywords.
Microsoft Biomedical Search builds on several threads of work. Early in the pandemic, we collaborated with colleagues at the National Library of Medicine (NLM), the Allen Institute for AI, the White House Office of Science and Technology Policy (OSTP) and other organizations to build the COVID-19 Open Research Dataset (CORD-19). This freely available resource of machine-readable content about the coronavirus group of viruses has stimulated numerous research projects worldwide on biomedical search and scientific visualization.
In another effort, we pursued a better understanding of emerging questions and information needs by reaching out to scientists and clinicians at the front lines of the pandemic. As part of this work, we engaged with colleagues at the Cleveland Clinic on the challenges of information seeking. The Cleveland Clinic team brought together a diverse group of biologists and clinicians to create a list of questions representative of their emerging information needs, both for scientific research at the frontiers of understanding and for the care of COVID-19 patients.
With Microsoft Biomedical Search, we’ve brought together large biomedical literature datasets, representative query workloads and advances in large-scale neural language models to enhance biomedical search. In particular, we’ve pursued capabilities that allow searchers to better specify what they mean with the precision of natural language.
Microsoft Biomedical Search surfaces the most relevant results from more than 20 million documents from CORD-19, Microsoft Academic, PubMed and PubMed Central and relies on three interrelated AI efforts: PubMedBERT, MetaAdaptRank and SaliencyMeasure models.
PubMedBERT is a large-scale language model that was pre-trained on biomedical text rather than on a mix of general-domain language and domain-specific language. The model was pre-trained from scratch with 3 billion words specific to biomedicine. We have found that the model outperforms all prior language models on biomedical natural language processing applications.
MetaAdaptRank helps to accurately determine relevance by alleviating common problems associated with the ranking of research search results. Information retrieval systems often fail to identify all relevant information because queries and documents use different terms to describe the same concept. For example, this mismatch can happen when a searcher is unfamiliar with new terminology. MetaAdaptRank can learn the semantics of specialized domains to more accurately rank results even for topics or keywords for which information is scarce.
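To make the term-mismatch problem concrete, here is a minimal Python sketch of a purely lexical relevance score. It is not MetaAdaptRank or any part of Microsoft Biomedical Search; the function names, the query and the two document titles are all made up for illustration. It only shows why matching on shared words can miss a relevant document that describes the same concept with different terminology.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, document):
    """Fraction of query terms that appear verbatim in the document."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(document)) / len(query_terms)


# Hypothetical query and document titles, invented for this example.
query = "heart attack treatment"
doc_same_terms = "New treatment options after a heart attack"
doc_other_terms = "Therapy for acute myocardial infarction"

print(overlap_score(query, doc_same_terms))   # every query term is present
print(overlap_score(query, doc_other_terms))  # same topic, no shared terms
```

The second title is about the same clinical topic as the query, yet it shares no words with it, so a keyword-only ranker would score it at zero. Closing that gap is the kind of problem a ranker that learns domain semantics is meant to address.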
SaliencyMeasure uses reinforcement learning to predict the likely future importance of a scientific paper, which helps with balancing the ranking of older and recent publications or authors rather than relying solely on citations.
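As a rough illustration of why relying solely on citations can disadvantage recent work, the following Python sketch compares a citation-only ordering with one that blends citations and a predicted-importance signal. Everything here is an assumption for the sake of the example: the `Paper` class, the `blended_score` function, the equal weighting and the numbers are invented, and the sketch involves no reinforcement learning. It does not describe how SaliencyMeasure works.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    citations: int
    predicted_importance: float  # hypothetical model output in [0, 1]


def blended_score(paper, max_citations, weight=0.5):
    """Mix normalized citation count with a predicted-importance signal."""
    citation_part = paper.citations / max_citations if max_citations else 0.0
    return (1 - weight) * citation_part + weight * paper.predicted_importance


# Made-up papers and values, for illustration only.
papers = [
    Paper("Older, heavily cited review", 900, 0.30),
    Paper("Recent preprint", 12, 0.95),
    Paper("Established study", 300, 0.50),
]

max_c = max(p.citations for p in papers)
by_citations = sorted(papers, key=lambda p: p.citations, reverse=True)
by_blend = sorted(papers, key=lambda p: blended_score(p, max_c), reverse=True)

print([p.title for p in by_citations])
print([p.title for p in by_blend])
```

With citations alone, the recent preprint ranks last because it has had little time to be cited. With the blended score it moves ahead of the established study, while the heavily cited review stays on top.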
We’re excited to put Microsoft Biomedical Search in the hands of biomedical research scientists, clinicians, epidemiologists and public health experts. There’s certainly much more to do and more to learn to help scientists navigate the explosion of information in biomedicine. We welcome feedback from the biomedical community to help us make further progress.