Vast sequencing project begins to unlock human genome’s secrets — by deciphering other mammals’ DNA

Despite decades of advancements in genomics, we still don’t know what most of our DNA does. But an ambitious international research collaboration is providing new answers about how genetics shapes human health and disease, with help from an unlikely source — a menagerie of mammals.

The findings, reported in a set of 11 studies published on Thursday in the journal Science, come out of the Zoonomia Project, which compared the genomes of 240 mammalian species. The list of sequenced creatures reads a bit like the passenger manifest of Noah’s Ark: Amazon river dolphins, greater mouse-eared bats, fat-tailed dwarf lemurs, horses, humans, and more.

Researchers found stretches of DNA common to these animals that remained largely unchanged across 100 million years of evolution — a telling indicator that these sequences have an important function. The scientists estimated that a minimum of 10.7% of the human genome is functional, on the higher end of estimates of 3% to 12% from previous studies.


Most of this so-called constrained DNA does not code for the production of proteins — the building blocks and machinery of cells — and roughly half of it is in regions of the genome that researchers don’t understand at all. But the studies offer early hints that mutations in these evolutionarily conserved regions could play a key role in disease, such as certain brain cancers.

The authors say the findings underscore the power of comparative genomics, a field focused on examining the genomes of many species to understand everything from human health to how species evolved and which ones are at risk of extinction. Kerstin Lindblad-Toh, who started the Zoonomia Project in 2015, said at a press briefing that flagging constrained regions in existing, human-centric databases could help scientists better understand whether a mutation is likely to be important, which could help doctors diagnose disease.


“If we can insert evolutionary constraint as a metric in all of these ways that scientists are already trying to decipher [genetics], that’s very important,” said Lindblad-Toh, who is director of vertebrate genomics at the Broad Institute of MIT and Harvard.

The new findings come almost exactly 20 years after the end of the Human Genome Project, which took 13 years to complete and cost $2.7 billion. Since then, advancements in sequencing technology have allowed researchers to decode DNA more quickly, accurately, and cheaply than ever before. Researchers are shattering record times for sequencing, and are close to being able to read a whole genome for just $100. Just last week, scientists and doctors from around the globe gathered in San Diego to talk about how sequencing could be used to routinely screen newborn babies for genetic disease — and to make sure infants quickly get the right treatments.

There’s just one problem: We still don’t know what most of the genome does. Only about 1% to 2% of our DNA codes for proteins. That at first led scientists to believe the rest was essentially junk, though this is an increasingly outdated view as researchers have learned more about how non-coding regions can control levels of gene activation. But there’s still a lot of genetic variation that we don’t understand.

That’s where African elephants, Père David’s deer, and thirteen-lined ground squirrels come in handy. By looking across the tree of life, researchers can see which DNA regions have changed and which ones haven’t. If a region of the genome has stayed the same, that’s a good sign it plays an important role, since natural selection wouldn’t stop the accumulation of mutations in regions that don’t have a function.
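The logic can be sketched with a toy example. The snippet below assumes a made-up alignment of four short sequences (the species labels and bases are invented for illustration, not Zoonomia data) and scores each aligned position by the fraction of species that share the most common base. Constraint metrics used in real comparative genomics model mutation rates along an evolutionary tree; this sketch only counts identical bases.

```python
from collections import Counter

# Hypothetical toy alignment: invented sequences, not real genomic data.
ALIGNMENT = {
    "human":    "ACGTACGA",
    "mouse":    "ACGTTCGA",
    "dog":      "ACGTGCGA",
    "elephant": "ACGTCCGT",
}

def column_identity(alignment):
    """For each aligned position, return the fraction of species
    that carry the most common base at that position."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = [s[i] for s in seqs]
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(seqs))
    return scores

def fully_conserved_positions(alignment):
    """Return the 0-based positions where every species has the same base."""
    return [i for i, score in enumerate(column_identity(alignment)) if score == 1.0]

scores = column_identity(ALIGNMENT)
conserved = fully_conserved_positions(ALIGNMENT)
```

In this toy data, the positions where all four sequences agree are the candidates for constraint, while the position where every species differs looks free to change.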

Roughly half of the samples used in the Zoonomia Project come from San Diego Zoo Wildlife Alliance, the organization that runs the San Diego Zoo and owns a repository of 10,000 cell lines from more than 1,100 species and subspecies.

“It turned out we were a gold mine for them,” said Oliver Ryder, the organization’s director of genetics.

Researchers used software to compare the genetic sequences of all 240 species by aligning matching regions. Unlike in some past efforts, scientists avoided using the human genome as a reference for comparison. This less anthropocentric approach allowed them to include genetic data from regions missing in people, broadening the set of sequences analyzed.

Roughly 80% of constrained sequences they identified are in regions that don’t code for proteins, and half aren’t included in public research databases that catalog sequences with known functions.

And yet these regions nonetheless seem to be important. Scientists found that they play a big role in the predictive power of polygenic risk scores, which calculate a person’s chances of having a trait or disease based on the combined effects of numerous genetic variants. Researchers found that constrained regions made outsized contributions to the accuracy of risk scores for everything from blood pressure to immune cell counts to bone density and body mass index.
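At its simplest, a polygenic risk score is a weighted sum: for each variant, multiply the number of copies a person carries by that variant's estimated effect, then add everything up. The sketch below assumes invented variant names, weights, and a hypothetical set of variants labeled as constrained. It shows only the arithmetic and how a score might be restricted to a subset of variants; it does not reproduce the analysis in the studies.

```python
# Hypothetical per-variant effect sizes (weights); invented for illustration.
EFFECT_SIZES = {"variant_a": 0.5, "variant_b": -0.25, "variant_c": 0.125}

# Hypothetical set of variants that fall in evolutionarily constrained regions.
CONSTRAINED = {"variant_a", "variant_c"}

def polygenic_score(genotype, effect_sizes, only=None):
    """Weighted sum of allele counts (0, 1, or 2 copies per variant).

    If `only` is given, variants outside that set are skipped.
    Variants missing from the genotype count as 0 copies.
    """
    total = 0.0
    for variant, weight in effect_sizes.items():
        if only is not None and variant not in only:
            continue
        total += weight * genotype.get(variant, 0)
    return total

# A hypothetical person's allele counts.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

full_score = polygenic_score(person, EFFECT_SIZES)
constrained_score = polygenic_score(person, EFFECT_SIZES, only=CONSTRAINED)
```

Here the full score is 0.5 × 2 − 0.25 × 1 + 0.125 × 0 = 0.75, and the part coming from the variants labeled as constrained is 1.0.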

“It does illustrate … that at least some non-coding regions are incredibly important,” said Shawn Baker, a genomics consultant with more than 20 years of experience in the field who was not involved in the study. “This is an area that is under-explored.”

Researchers also used constrained regions to identify genes that may drive the growth of two brain cancers that mostly affect children, pilocytic astrocytoma and medulloblastoma. One such gene is BMP4, which controls the growth of neural stem cells. The level of activation of these genes was associated with how long patients survived with these tumors.

There were plenty of other new findings across the 11 papers. Researchers identified regions of the genome that might explain why certain animals can sniff out the faintest of scents or hunker down and hibernate during the winter. In one study, they even analyzed DNA from the taxidermied remains of Balto to understand what genetic factors might have allowed the famed Siberian Husky to lead a sled dog team across Alaska’s treacherous tundra in 1925 to bring vials of diphtheria antitoxin to a remote village in dire need of the medicine.

These findings and others were all powered by DNA sequencers sold by Illumina, a genomics juggernaut that controls about 80% of the market. The company’s machines read tiny bits of DNA, and software then stitches that information together, an approach known as short-read sequencing.
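To give a feel for what stitching short reads together means, here is a minimal sketch that assumes three invented, error-free reads and merges them greedily wherever the end of one read matches the start of another. The function names and the minimum-overlap setting are hypothetical. Production tools use far more elaborate methods and have to cope with sequencing errors and repeated sequences, so this is a toy, not how any real pipeline works.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no such overlap is at least `min_len` long."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least `min_overlap` bases."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_overlap)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs

# Invented reads that tile a made-up 15-base sequence.
READS = ["ATGCCGTA", "CGTAAGCT", "AGCTTGA"]
assembled = greedy_assemble(READS)
```

With these reads, each neighboring pair shares four bases, so the three fragments collapse into a single sequence.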

“We didn’t need perfection; we needed a large number of species to compare,” Lindblad-Toh said.

But she notes that the sequencing world has changed a lot since researchers first generated this data more than five years ago. Companies such as Oxford Nanopore and Pacific Biosciences have developed increasingly accurate and affordable sequencers based on long-read sequencing, which reads much larger chunks of the genome and can allow researchers to make sense of complicated regions where certain sequences are flipped around or duplicated. Even Illumina is jumping into this space.

Long reads have already allowed researchers to fill in bits of the human genome missed by the Human Genome Project, and the authors of the current studies say they’re eager to see how the technology could add to their recent findings.

Source: STAT