Word to the wise: Don’t let the genomic data fool you


Brian Abraham, PhD, stresses skeptical analysis as a means to find the true answers in gigabytes of genomic data.

“The first principle is that you must not fool yourself, and you are the easiest person to fool.” Richard Feynman, PhD

The late Richard Feynman, PhD, was a theoretical physicist, but computational biologist Brian J. Abraham, PhD, is a disciple.

“Feynman was right that you need to be on guard to prevent fooling yourself in research. It seems to be especially true with anything involving statistics, computation or high-throughput experiments,” Abraham said. “You can always find evidence supporting your favorite model somewhere in hundreds of gigs of data, so I’ve found the need to tether analyses to reality first with really unexciting-sounding questions.”

That means taking a hard look at the quality of the data, the study methodology or how results of initial analyses square with existing scientific literature before leaping forward into the unknown.

So, when collaborators sent him 350 gigabytes of genome sequencing data, Abraham kept Feynman’s advice and his own list of questions in mind as he dove in.

Mining data for gold

The hard drive came from the laboratory of Leonard Zon, MD, of Boston Children’s Hospital and the Dana-Farber/Boston Children’s Cancer and Blood Disorders Center, brought by his trainee, Avik Choudhury, PhD. The data came from a series of high-throughput sequencing experiments that tracked gene expression changes in blood cells as they matured from bone marrow.

The project built on a longstanding collaboration between Zon and Richard Young, PhD, of the Whitehead Institute for Biomedical Research and the Massachusetts Institute of Technology. Young is a powerhouse in gene expression research and Abraham’s postdoctoral mentor. The Zon and Young laboratories had recently identified specific proteins as key players in blood cell development and gene expression.

“I started by asking very basic questions about where proteins sit on DNA and what they do to regulate the genes driving developing blood cells,” Abraham explained. He built databases of where gene-regulating proteins land on DNA and of how those proteins correlated with blood gene expression levels, something he describes as “real basic stuff.”

The work then pivoted to characterizing the maturation-driving transcription factors in more detail, which would soon highlight something unexpected. Transcription factors regulate gene expression by binding small pieces of DNA called enhancers, where they serve as on-off switches that activate or suppress gene activity.

In studying these proteins, the research team eventually turned to genome-wide association studies (GWASes) to compare the genome regions they were interested in with regions implicated in genetic traits. These studies scan the genomes of large numbers of people to find variations in DNA associated with traits or diseases. GWASes typically focus on single-nucleotide polymorphisms (SNPs), which are DNA variations that involve a single chemical base.

The team focused on genetic variations associated with seven red blood cell traits, including cell size, number and hemoglobin levels. Such variations in red blood cell traits are linked to different mortality rates. The team reasoned that some disease-associated SNPs might affect how transcription factors bind DNA, or that the SNPs’ relevance to these traits could help prioritize genome regions for further study.

“Initially all I was able to show in the analysis is that SMAD1, a signaling transcription factor, collaborates with GATA1 and GATA2, known master transcription factors, in driving red blood cell development,” Abraham said. “But, to advance, we needed some other lens to view these data through, and genetic diseases and traits have proven useful for focusing similar research.”

Small DNA variations play an outsized role

A few years earlier, researchers on the West Coast reported that SNPs often occurred in non-protein-coding regions of the genome, specifically in transcriptional enhancers. With that in mind, Abraham took another look at the data. “That’s when we realized there was something unusual about the enhancers in this study,” he said.

SNPs in this study frequently occurred in a subset of enhancers that bind two types of transcription factors: master transcription factors and signaling transcription factors. Master transcription factors regulate the fundamental identities of cells. Signaling transcription factors control how cells sense and respond to their environment.

The team called enhancers that bound both master and signaling transcription factors “transcriptional signaling centers.” To the researchers’ surprise, SNPs altered the binding of signaling transcription factors far more often than the binding of master transcription factors. The variations prevented signaling transcription factors from optimally binding certain enhancers. That meant genes regulated by those transcription factors were not expressed, or switched on, in response to external cues that drive red blood cell maturation.

“Entire dissertations have been written on the biological effects of individual trait-associated variants,” Abraham said. “So observing this pattern, signaling transcription factors’ DNA landing sites being especially impacted by variants, required that we integrate many datasets and data types from a variety of biological systems.”

Master vs. signaling transcription factors

The finding seemed counterintuitive, but it held up under every skeptical statistical analysis.

“Master transcription factors establish cell identity, and the signaling transcription factors have been seen as simply tweaking how a cell responds to its environment,” Abraham said. “You might think that a change that leads to disease or changes the fate of cells probably involves weighty, master transcription factors.”

But cells must sense and respond to their cellular environment through signaling transcription factors that deliver information to the genome. “This research suggests the source is just a small genetic variant that, over the course of a lifetime, varies the response to external signals just enough to cause problems,” Abraham said. That also raises hope that researchers will be able to develop compounds to tweak that activity to avoid illness or enhance drug response.

“This conclusion was not something we could have come to without systematic, incremental, skeptical analysis,” Abraham explained. “We found examples of this and examples of that, but it was really distilling the patterns from these cases and distrusting the most interesting observations the most that made the difference in this research.”

About the author

Mary Powers is a former member of the Strategic Communications, Education and Outreach Department at St. Jude Children’s Research Hospital.
