In order for scientists to make breakthroughs that could help lead to cures for pediatric cancers, researchers around the world need to be able to easily share and collaborate on genomic data. That’s why, in 2010, computational biologist Jinghui Zhang and her team at St. Jude Children’s Research Hospital in Memphis started uploading anonymized genomes of their patients’ healthy and cancerous cells to public data repositories.
“We realized that it was very hard for people to download the data and use the data for their research because of the sheer size and volume of the data,” said Zhang. “So, St. Jude started to seriously explore other ways to facilitate data sharing with the global research community.”
That led to a collaboration with members of the genomics group in Microsoft’s research organization. At the time, Microsoft was beginning work on a cloud-based computational pipeline to align billion-piece puzzles of raw genomic data with reference genomes, and then identify where the aligned and reference genomes differ, an analytical technique known as alignment and variant calling.
On Wednesday, Microsoft announced the general availability of the Microsoft Genomics service, the publicly available offering that is the result of Microsoft’s initial work in this area.
Variants are what make individuals unique. They are markers for traits ranging from physical attributes to disease susceptibility. Variants are also the fuel for so-called genome-wide association studies that allow researchers to home in on what the variants mean. The more genomic data researchers can access and analyze, the more precisely they can tease apart the complexity of biology and make progress toward curing diseases such as cancer.
Zhang’s team worked with the Microsoft researchers on the development of the genome alignment and variant calling pipeline in partnership with DNAnexus, a secure, cloud-based platform for managing genomic data, which runs on Microsoft Azure.
To date, the collaborators have processed a half petabyte of genomic data and stored it on Azure for analysis. For perspective, a half petabyte of data would fill 750,000 standard CD-ROM discs.
The St. Jude genomic data analyzed through the pipeline and stored in the cloud is the foundation for a data-sharing platform that the research hospital is building with DNAnexus and Microsoft. The goal is to enable researchers from around the world to collaborate in the search for cures to pediatric cancers, which are diagnosed in about 175,000 children age 14 and under around the world each year.
“For us to be able to test on real-world data like that and be able to work side-by-side with these teams was an amazing opportunity,” said Geralyn Miller, who directs the genomics group in Microsoft’s research organization.
Good data made easy
The Microsoft Genomics service is part of Healthcare NExT, a Microsoft initiative that aims to accelerate healthcare innovation through artificial intelligence (AI) and cloud computing.
In genomics, the road to realizing these goals begins with clean and accurate data.
“We know we need to have good data and if we can make it very, very easy for people to have good data, then we can bring the biological information to analytical tools in the cloud and, hopefully, make people much more productive and improve their discovery rate,” said Bob Davidson, a principal software architect in Microsoft’s genomics group.
The Microsoft Genomics service, he explained, is an essential element in next-generation AI machinery that will drive breakthroughs in understanding and treating diseases such as cancers with precision medicine. For example, by analyzing genomic data from a patient’s healthy and tumor tissues, a physician will be able to select the treatment that will be most effective based on comparison with data from other cancer patients, including treatments and outcomes.
A common pipeline for processing genomic data helps reduce artifacts and noise that can cloud the data, and that results in a stronger signal for the AI-powered elements of precision medicine, noted Miller.
“We are commoditizing this step,” she said. “We are going to make it simple for people to do it and what is going to come out the other end is data that is consistent.”
‘A perfect cloud workload’
The opportunity to commoditize the alignment and variant calling phase of genome sequencing, called secondary analysis, arose as the cost for sequencing a single human genome plummeted from $100 million in 2001 to less than $1,000 today, which is in the range of other routine medical tests. Industry experts expect the sub-$1,000 genome will spur a sequencing rush – by 2025, they predict more than 100 million human genomes will be sequenced.
And that presents another problem that Microsoft and DNAnexus are well-positioned to solve.
A single human genome takes up 100 gigabytes of storage space. As more and more genomes are sequenced, storage needs will grow from gigabytes to petabytes to exabytes. By 2025, an estimated 40 exabytes of storage capacity will be required for human genomic data. An exabyte is approximately 1,000 petabytes, or enough data to fill 1.5 billion standard CD ROM discs.
“Genomic data is really big data,” noted Miller. “And it is really compute intensive.” Processing a single human genome requires several hundred core hours. Computer processing units, or CPUs, on modern laptop computers typically have four cores. Data centers, by contrast, have hundreds of thousands of cores, which “make genomics a perfect cloud workload.”
What’s more, handling of genomic data involves a laundry list of legal and ethical requirements to maintain data privacy and security. Microsoft has Azure datacenters spread around the world, and Microsoft Genomics is currently offered in the U.S., Western Europe and Southeast Asia. The Microsoft Genomics service is ISO-certified, meaning it meets certain international standards for security, privacy and quality. It’s also covered by Microsoft under the HIPAA Business Associate Agreement, which assures that companies manage personal health information responsibly. It follows the security and privacy principles outlined in the Microsoft Trust Center.
Ecosystem of partners
DNAnexus, deployed on Azure, is the genomic data management company working with St. Jude Children’s Research Hospital on its data-sharing platform. DNAnexus will integrate the Microsoft Genomics service as well as other genomic analytical and visualization tools, providing researchers with an interface to access tools and diverse datasets in a collaborative and secure ecosystem.
“We are most successful when our scientists engage with the customer’s scientists to understand their scientific problem, and then work on porting their workflow onto the platform. They run some trials, and then we are off to the races,” said Richard Daly, chief executive officer of DNAnexus. “In this case, our team worked closely with St. Jude and Microsoft to determine the specific requirements and translate that into tailored solutions.”
Miller, Davidson and their colleagues in the genomics research group at Microsoft view the Microsoft Genomics service as the first of many tools to come that are ready for integration by a growing ecosystem of partners on Azure including DNAnexus. For example, one ongoing discussion, noted Miller, centers on another challenge St. Jude faces: how to share and collaborate on different types of data produced from different organizations with different tools?
“The thing that is different about Microsoft Genomics is the tie to research,” said Miller. “We have the domain expertise to be able to go off and do experiments and to bring these ideas out of the lab and into the world.”
Related:
- Learn more about the Microsoft Genomics service
- Learn about pediatric cancer research at St. Jude Children’s Research Hospital
- Find out more about DNAnexus
- Jude Children’s Research Hospital, Microsoft and DNAnexus join forces to fuel scientific discoveries
John Roach writes about Microsoft research and innovation. Follow him on Twitter.