A collaboration between computer scientists and biologists from research institutions across the United States is yielding a set of computational tools that increase efficiency and accuracy when deploying CRISPR, a gene-editing technology that is transforming industries from healthcare to agriculture.
CRISPR is a nano-sized sewing kit that can be designed to cut and alter DNA at a specific point in a specific gene.
The technology may lead, for example, to breakthrough applications such as modifying cells to combat cancer or producing high-yielding, drought-tolerant crops such as wheat and corn.
Elevation, the newest tool released by the team, uses a branch of artificial intelligence known as machine learning to predict so-called off-target effects when editing genes with the CRISPR system.
Although CRISPR shows great promise in a number of fields, one challenge is that many genomic regions are similar to one another, which means the nano-sized sewing kit can accidentally go to work on the wrong gene and cause unintended consequences, the so-called off-target effects.
“Off-target effects are something that you really want to avoid,” said Nicolo Fusi, a researcher at Microsoft’s research lab in Cambridge, Massachusetts. “You want to make sure that your experiment doesn’t mess up something else.”
Fusi and former Microsoft colleague Jennifer Listgarten, together with collaborators at the Broad Institute of MIT and Harvard, University of California Los Angeles, Massachusetts General Hospital and Harvard Medical School, describe Elevation in a paper published Jan. 10 in the journal Nature Biomedical Engineering.
Elevation and a complementary tool for predicting on-target effects called Azimuth are publicly available for free as a cloud-based end-to-end guide-design service running on Microsoft Azure as well as via open-source code.
Using the computational tools, researchers can input the name of the gene they want to modify and the cloud-based search engine will return a list of guides that researchers can sort by predicted on-target or off-target effects.
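As a rough illustration of that sorting step, the Python sketch below ranks a handful of guides by either score. It does not use the actual service: the `Guide` class, the `sort_guides` function, the sequences and the score values are all invented for the example, and it assumes that a higher on-target score and a lower off-target score are better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guide:
    sequence: str      # 20-nucleotide guide sequence (made up for this example)
    on_target: float   # hypothetical on-target score; assumed higher is better
    off_target: float  # hypothetical off-target score; assumed lower is better


def sort_guides(guides, by="on_target"):
    """Return a new list of guides, best first, for the chosen score."""
    if by == "on_target":
        return sorted(guides, key=lambda g: g.on_target, reverse=True)
    if by == "off_target":
        return sorted(guides, key=lambda g: g.off_target)
    raise ValueError(f"unknown sort key: {by!r}")


# Invented example data, not output from any real tool.
guides = [
    Guide("GACCTTGAACGTTAGCCATG", on_target=0.62, off_target=0.30),
    Guide("TTGACCGATAGCTAGGCTAA", on_target=0.81, off_target=0.55),
    Guide("CCATGGTTAACGGATCCTGA", on_target=0.47, off_target=0.08),
]

best_on = sort_guides(guides, by="on_target")[0]
best_off = sort_guides(guides, by="off_target")[0]
```

In this toy data the guide with the strongest on-target score is not the one with the least off-target activity, which is the kind of trade-off a sortable list lets researchers weigh.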
Nature as engineer
The CRISPR gene-editing system is adapted from a natural virus-fighting mechanism. Scientists discovered it in the DNA of bacteria in the late 1980s and figured out how it works over the course of the next several decades.
“The CRISPR system was not designed, it evolved,” said John Doench, an associate director at the Broad Institute who leads the biological portions of the research collaboration with Microsoft.
CRISPR stands for “clustered regularly interspaced short palindromic repeats,” which describes a pattern of repeating DNA sequences in the genomes of bacteria separated by short, non-repeating spacer DNA sequences.
The non-repeating spacers are copies of DNA from invading viruses, which molecular messengers known as RNA use as a template to recognize subsequent viral invasions. When an invader is detected, the RNA guides the CRISPR complex to the virus and dispatches CRISPR-associated (Cas) proteins to snip and disable the viral gene.
In 2012, molecular biologists figured out how to adapt the bacterial virus-fighting system to edit genes in organisms ranging from plants to mice and humans. The result is the CRISPR-Cas9 gene editing technique.
The basic system works like this: Scientists design synthetic guide RNA to match a DNA sequence in the gene they want to cut or edit and set it loose in a cell with the CRISPR-associated protein scissors, Cas9.
Today, the technique is widely used as an efficient and precise way to understand the role of individual genes in everything from people to poplar trees, as well as to change genes to do everything from fighting diseases to growing more food.
“If you want to understand how gene dysfunction leads to disease, for example, you need to know how the gene normally functions,” said Doench. “CRISPR has been a complete game changer for that.”
An overarching challenge for researchers is to decide what guide RNA to choose for a given experiment. Each guide is roughly 20 nucleotides; hundreds of potential guides exist for each target gene in a knockout experiment.
In general, each guide has a different on-target efficiency and a different degree of off-target activity.
The collaboration between the computer scientists and biologists is focused on building tools that help researchers search through the guide choices and find the best one for their experiments.
Several research teams have designed rules to determine where off-targets are for any given gene-editing experiment and how to avoid them. “The rules are very hand-made and very hand-tailored,” said Fusi. “We decided to tackle this problem with machine learning.”
To tackle the problem, Fusi and Listgarten trained a so-called first-layer machine-learning model on data generated by Doench and colleagues. These data reported on the activity for all possible target regions with just one nucleotide mismatch with the guide.
Then, using publicly available data that was previously generated by the team’s Harvard Medical School and Massachusetts General Hospital collaborators, the machine-learning experts trained a second-layer model that refines and generalizes the first-layer model to cases where there is more than one mismatched nucleotide.
The second-layer model is important because off-target activity can occur with far more than just one mismatch between guide and target, noted Listgarten, who joined the faculty at the University of California at Berkeley on Jan. 1.
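The two-layer idea can be pictured with a small Python sketch. This is not the team's model: `first_layer_score` stands in for a learned single-mismatch predictor and simply returns a made-up value that depends on position, and `second_layer_score` combines those values with hypothetical weights rather than trained ones. It only shows how scores for single mismatches might feed a second stage that handles sites with several mismatches.

```python
import math


def mismatches(guide, site):
    """Return (position, guide_nt, site_nt) for every mismatched position."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return [(i, g, s) for i, (g, s) in enumerate(zip(guide, site)) if g != s]


def first_layer_score(position, guide_len=20):
    """Placeholder for a learned single-mismatch model.

    Returns a made-up activity value in (0, 1) that depends only on the
    mismatch position. A real model would be trained on measured data.
    """
    return 1.0 - 0.8 * (position + 1) / guide_len


def second_layer_score(guide, site, weights=(1.0, -0.3), bias=2.0):
    """Combine single-mismatch scores into one score for a multi-mismatch site.

    The weights and bias are hypothetical; a real second layer would learn
    how to combine the first-layer outputs from multi-mismatch data.
    """
    found = mismatches(guide, site)
    singles = [first_layer_score(pos, len(guide)) for pos, _, _ in found]
    log_product = sum(math.log(s) for s in singles)
    z = bias + weights[0] * log_product + weights[1] * len(found)
    return 1.0 / (1.0 + math.exp(-z))  # squash to a value between 0 and 1


guide = "GACCTTGAACGTTAGCCATG"
one_mismatch = "T" + guide[1:]                              # differs at position 0
two_mismatches = one_mismatch[:10] + "A" + one_mismatch[11:]  # also differs at position 10

score_one = second_layer_score(guide, one_mismatch)
score_two = second_layer_score(guide, two_mismatches)
```

With these invented numbers, the site with two mismatches scores lower than the site with one, but the point of a learned second layer is that the real relationship is estimated from data rather than fixed by hand.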
Finally, the team validated their two-layer model on several other publicly available datasets as well as a new dataset generated by collaborators affiliated with Harvard Medical School and Massachusetts General Hospital.
Some model features are intuitive, such as a mismatch between the guide and the target nucleotide sequence, noted Listgarten. Others reflect unknown properties encoded in DNA that are discovered through machine learning.
“Part of the beauty of machine learning is if you give it enough things it can latch onto, it can tease these things out,” she said.
Off-target scores
Elevation provides researchers with two kinds of off-target scores for every guide: an individual score for each potential off-target region and a single overall summary score for the guide.
The individual scores are machine-learning-based probabilities, one for every region of the genome where something bad could happen. For every guide, Elevation returns hundreds to thousands of these off-target scores.
For researchers trying to determine which of potentially hundreds of guides to use for a given experiment, these individual off-target scores alone can be cumbersome, noted Listgarten.
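To see why a single guide can collect so many individual scores, consider a simplified search for look-alike regions. The Python sketch below slides a window along a DNA string and keeps every stretch that differs from the guide by at most a few nucleotides. It is a toy under stated assumptions: `find_candidate_sites` is a hypothetical helper, the "genome" is a short invented string, only one strand is scanned, and any other requirement a real search would apply is ignored.

```python
def find_candidate_sites(guide, genome, max_mismatches=3):
    """Return (start, site, n_mismatches) for each window close to the guide."""
    k = len(guide)
    hits = []
    for start in range(len(genome) - k + 1):
        site = genome[start:start + k]
        # Count positions where the window and the guide disagree.
        n = sum(1 for g, s in zip(guide, site) if g != s)
        if n <= max_mismatches:
            hits.append((start, site, n))
    return hits


guide = "GACCTTGAACGTTAGCCATG"
variant = "GACCTAGAACGTTAGCCATG"  # same as the guide except at position 5

# Invented sequence: the intended target, then a near-identical region.
genome = "TTTT" + guide + "AAAA" + variant + "CC"

hits = find_candidate_sites(guide, genome, max_mismatches=1)
```

Each hit other than the intended target is a place where an off-target score would be needed; across a full genome, the number of such regions grows quickly as more mismatches are tolerated.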
The summary score is a single number that lumps the off-target scores together to provide an overview of how likely the guide is to disrupt the cell over all its potential off-targets.
“Instead of a probability for each point in the genome, it is what’s the probability I am going to mess up this cell because of all of the off-target activities of the guide?” said Listgarten.
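One simple way to picture such a summary is sketched below in Python. It is not Elevation's aggregation method, which the article does not spell out. The sketch assumes, purely for illustration, that each off-target region acts independently, so the chance that at least one of them is hit is one minus the product of the chances that each is missed. The function name `summary_score` and the input values are hypothetical.

```python
import math


def summary_score(off_target_probs):
    """Probability that at least one off-target site is hit.

    Assumes the sites are independent, which is a simplification made
    for this example only.
    """
    for p in off_target_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - math.prod(1.0 - p for p in off_target_probs)


# Invented per-site probabilities for one guide.
per_site = [0.02, 0.10, 0.05]
overall = summary_score(per_site)
```

Even when every individual probability is small, the combined figure rises as the list of potential off-targets gets longer, which is why a single number per guide is easier to compare than hundreds of separate ones.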
End-to-end guide design
Writing in Nature Biomedical Engineering, the collaborators describe how Elevation works in concert with a tool they released in 2016 called Azimuth that predicts on-target effects.
The complementary tools provide researchers with an end-to-end system for designing experiments with the CRISPR-Cas9 system, helping researchers select a guide that achieves the intended effect, such as disabling a gene, and reduce mistakes such as cutting the wrong gene.
“Our job,” said Fusi, “is to get people who work in molecular biology the best tools that we can.”
In addition to Listgarten, Fusi and Doench, project collaborators include Michael Weinstein from the University of California Los Angeles, Benjamin Kleinstiver, Keith Joung and Alexander A. Sousa from Harvard Medical School and Massachusetts General Hospital, and Melih Elibol, Luong Hoang, Jake Crawford and Kevin Gao from Microsoft Research.
- Read about Elevation in Nature Biomedical Engineering.
- Read about Azimuth in Nature Biotechnology.
- Check out CRISPR.ML, the end-to-end guide design service.
- Learn how machine learning and CRISPR are being used to understand relationships between genes in Nature Biotechnology.
- Read: Molecular biology meets computer science tools in new system for CRISPR
- Follow Nicolo Fusi, Jennifer Listgarten and John Doench on Twitter.
John Roach writes about Microsoft research and innovation. Follow him on Twitter.