Sifting for gold sounds thrilling, right? But finding big, valuable nuggets can be a long and tedious process. The same might be true of teasing through reams of DNA data to reveal changes that contribute to cancer and other diseases, as well as to find the best treatments and advance cures.
Scientists at St. Jude Children’s Research Hospital work behind the scenes to speed up this genomic sifting in ways that could spell a windfall for children at the hospital and around the world. By developing new computerized tools that essentially act like sieves, researchers can quickly and cheaply analyze huge amounts of DNA data and share the information with others across the globe.
In recent years, St. Jude scientists have developed high-tech tools with snazzy names, such as ProteinPaint, a genomic visualization engine; PeCanPIE, which sifts through millions of genetic variations to find those involved in inherited cancers; CREST, which uses next-generation sequencing data to detect genomic structural variations at base-pair resolution; CONSERTING, which finds DNA duplications and deletions; and other data analysis and visualization tools.
When children arrive at St. Jude with cancer diagnoses, scientists may genetically sequence samples of their tumors. This testing accomplishes several key goals, precisely identifying the cancer type and helping clinicians make better treatment choices. Crucially, however, the data is also combined with information from other patients to spur breakthroughs that might help far more than one child.
“If you think about the cancer as the endpoint of the disease, we already get a huge amount of information on what is happening at the endpoint,” explains Xiaotu Ma, PhD, of St. Jude Computational Biology, who recently created a new research tool. “We immediately provide every patient with DNA analysis from Day One because it’s needed for their tumor diagnosis. But after you start a treatment, you want to know how the cancer is behaving. For that you really need ultra-deep sequencing.”
Mathematical tool unearths machine errors
These research tools seem complex on the surface, yet the ideas behind them are deceptively simple. The often-catchy names of these tools are also at odds with the serious nature of what they reveal.
Ma’s new mathematical tool, called SequencErr, is a prime example. He and his St. Jude colleagues devised the first method to identify and measure errors caused by ultra-deep sequencing machines, which can root out cancer cells hiding among millions of normal cells in patients’ tumor samples.
But these high-tech machines aren’t perfect, and sometimes they actually introduce errors into the decoding process. Because the two strands of a DNA fragment fit together like puzzle pieces on a string, a machine that reads the same fragment from both directions should get results that agree. Ma’s tool knows a mistake has occurred if the two readings differ.
“Whenever there is a mismatch within this forward and reverse read, we know it must be from the sequencer,” explains Ma, who published a report on SequencErr in the January 2021 issue of Genome Biology.
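The idea Ma describes can be illustrated with a minimal Python sketch. It is not the SequencErr code. It assumes, for simplicity, that the forward and reverse reads cover exactly the same stretch of DNA and have the same length, and the function names are hypothetical. The reverse read is flipped back onto the forward strand, and any position where the two readings disagree is counted as a likely machine error.

```python
# Illustrative sketch only; not the SequencErr implementation.
# Assumption: both reads span the same fragment end to end.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the sequence as it would read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_read_pair_mismatches(forward_read, reverse_read):
    """List (position, forward_base, reverse_base) where the two reads disagree."""
    reverse_on_forward = reverse_complement(reverse_read)
    if len(forward_read) != len(reverse_on_forward):
        raise ValueError("this sketch expects reads of equal length")
    mismatches = []
    for pos, (fwd, rev) in enumerate(zip(forward_read, reverse_on_forward)):
        if "N" in (fwd, rev):  # skip bases the machine could not call
            continue
        if fwd != rev:
            mismatches.append((pos, fwd, rev))
    return mismatches


def estimate_error_rate(read_pairs):
    """Fraction of compared bases that disagree across all read pairs."""
    compared = 0
    mismatched = 0
    for forward_read, reverse_read in read_pairs:
        reverse_on_forward = reverse_complement(reverse_read)
        compared += sum(
            1 for fwd, rev in zip(forward_read, reverse_on_forward)
            if "N" not in (fwd, rev)
        )
        mismatched += len(find_read_pair_mismatches(forward_read, reverse_read))
    return mismatched / compared if compared else 0.0


# A clean pair agrees everywhere; the second pair has one disagreement.
print(find_read_pair_mismatches("ACGTAC", "GTACGT"))  # []
print(find_read_pair_mismatches("ACGTAC", "GTACGA"))  # [(0, 'A', 'T')]
```

A mismatch found this way shows only that one of the two readings is wrong, not which one. Real sequencing data would also need alignment and quality filtering, which this sketch leaves out.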
Discovering these machine errors could lead to big payoffs. Ma hopes that SequencErr — offered for free to researchers worldwide — will help doctors find cancer cells that might otherwise escape treatment in patients who have already undergone therapy.
“The tool will help us measure remaining cancer cells and determine if more therapy is needed to prevent relapse,” he says.
M2A mimics brain process
Another new St. Jude research tool, dubbed M2A, cuts to the deepest secrets of genomics. Created by Xiang Chen, PhD, of St. Jude Computational Biology, M2A uses a machine-learning approach called deep learning to improve the study of how people’s behavior and environment change the way genes work, a field known as epigenetic research.
Putting a new twist on a long-used technique, M2A boosts the value of computers in science — adding to scientists’ cancer research toolkit — by simulating the way human brains explore information.
A report recently published in the journal Genome Biology describes how M2A provides a framework for using information on DNA methylation, a process that can lead to conditions such as cancer or heart disease, to understand gene promoters, pieces of DNA that turn a gene on or off.
DNA methylation data are typically used to provide biomarkers that can boost tumor detection. Biomarkers — substances measurable in blood, urine or tissue that may signal disease — can also further whittle down tumor traits that point toward targeted treatments. Scientists worldwide can use the fast, free and reliable M2A tool through St. Jude Cloud, an online data-sharing and collaboration platform.
The researchers have rigorously tested their approach on a variety of adult and pediatric cancers, including leukemias and solid tumors. It takes only about 15 minutes to upload DNA methylation data and receive results tailored to specific histones, the proteins around which DNA is wound.
“What we have done with M2A is create a method for integrating DNA methylation information around promoters to make it more readily interpretable,” Chen explains.
CoNGA tangoes with virus data
Like the dance it brings to mind, CoNGA is a new tool that can pull together two quite different sets of information in one analysis.
Developed by Stefan Schattgen, PhD, and Paul Thomas, PhD, of St. Jude Immunology, and Phil Bradley, PhD, of Fred Hutchinson Cancer Research Center, the algorithm may help scientists find new ways to boost people’s immune response to sometimes-devastating viruses. The term algorithm refers to instructions that tell computers how to transform a set of facts into useful information.
CoNGA combines data from the immune system’s nearly limitless T-cell receptors, which zero in on invaders such as viruses and tumors, with data from cells with similar gene expression. Thomas compares CoNGA to a mapping process. He says the tool determines whether groups of cells sit in the same neighborhoods in both the T-cell receptor and gene-expression “spaces,” “meaning they’re functionally sort of super-related.”
“We can assign little neighborhoods of cells based on those distances in the gene-expression space and do the same thing in the T-cell receptor sequence space,” Thomas explains. “And this gets kind of cool because you form these neighborhoods where every cell has the same group of neighbors.”
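The neighborhood comparison Thomas describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the CoNGA code: the function names are hypothetical, gene expression is reduced to a short list of numbers per cell, and a simple count of differing letters between equal-length receptor sequences stands in for whatever receptor distance the real tool uses. For each cell, the sketch finds its nearest neighbors in each space and counts how many neighbors the two lists share.

```python
# Toy illustration only; not the CoNGA implementation.
import math


def nearest_neighbors(items, distance, k):
    """For each item, return the set of indices of its k closest other items."""
    neighbors = []
    for i, item in enumerate(items):
        others = [j for j in range(len(items)) if j != i]
        # Sort by distance; break ties by index so results are repeatable.
        others.sort(key=lambda j: (distance(item, items[j]), j))
        neighbors.append(set(others[:k]))
    return neighbors


def expression_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.dist(a, b)


def receptor_distance(a, b):
    """Assumed stand-in: number of differing letters in equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch expects sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def neighborhood_overlap(expression_profiles, receptor_sequences, k):
    """Per cell, count neighbors shared between the two spaces."""
    gex = nearest_neighbors(expression_profiles, expression_distance, k)
    tcr = nearest_neighbors(receptor_sequences, receptor_distance, k)
    return [len(g & t) for g, t in zip(gex, tcr)]


# Four made-up cells: two pairs that are close in both spaces.
profiles = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
receptors = ["CASSL", "CASSV", "CAWGQ", "CAWGE"]
print(neighborhood_overlap(profiles, receptors, k=1))  # [1, 1, 1, 1]
```

Cells with a high overlap count are the ones whose receptor sequence and gene activity point the same way. Deciding whether an overlap is larger than chance would take a statistical test that this sketch does not attempt.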
Thomas, whose team has developed many techniques for understanding huge sets of data, tested CoNGA on cells from individuals with diverse infection histories, including those who had been infected with Epstein-Barr virus (EBV). This virus, which causes mononucleosis, “is much more complicated than flu or coronavirus, with many different stages to its lifecycle,” Thomas explains.
“What CoNGA was able to show us was that different T-cell populations had distinct gene-expression profiles specific to each EBV lifecycle stage,” he says. A report on this work recently appeared in the journal Nature Biotechnology.
Defined populations of cells common to the flu or EBV, for example, can also be entered into CoNGA to automatically generate patterns. These linkages may lead to discoveries that tease out how the immune system battles these viruses, which can lead to new ways to improve the immune response.
“I think CoNGA will become a standard method of analyzing these datasets,” Thomas says. “Labs everywhere can do it, and we’ve made CoNGA totally open-source and free.”
From Promise, Summer 2021