Opinion: Needed for national security and competitiveness: a federal biodata infrastructure

An executive order from the Biden administration aims to build a robust bioeconomy “in a manner that benefits all Americans and the global community and maintains United States technological leadership and economic competitiveness.” The executive order acknowledges the importance of biodata to growing the U.S. bioeconomy and calls for “a biological data initiative.”

That’s easier said than done. I see significant challenges that must be addressed ahead of May 10, the deadline by which several U.S. government agencies are required to report to the White House the biological data sources they deem critical to U.S. national interests. By then, agencies must also share their plans to fill data gaps and reduce security risks to biological data repositories, and outline the authorities, resources, and actions each agency needs to support the data initiative.

Efforts to explore and map the genomes and molecular processes that govern biological organisms are the modern data equivalent of the crude maps used by 15th-century seafarers exploring uncharted waters. Just as better maps increased economic and military power centuries ago, accurate information about how biological processes operate at the molecular, individual, population, and ecosystem scales will empower U.S. leadership in biomanufacturing and synthetic biology.


It will also enable the country to effectively compete with China’s aggressive biotech strategy.

Specifically, the federal government should establish a federated and distributed model that identifies and connects as many public and private or proprietary genomic and biological databases as possible. This should be accompanied by an update of standards and practices, as well as a call to the international community to establish new standards for biodata use and collection.


Biodata matter to U.S. national and economic security

The U.S. and Chinese governments both recognize that biotechnology is critical to national security and economic competitiveness. The U.S. bioeconomy already accounts for about 5% of the country’s gross domestic product ($960 billion) and is rapidly expanding. Biomanufacturing is becoming a major mode of production for a wide range of industries, from pharmaceuticals and industrial chemicals to food and fuels. Biotechnologies can also play important roles in ensuring supply chain resiliency, mitigating climate change, and restoring damaged ecosystems.

National security concerns about biology have historically focused on bioweapons developed by individuals, terrorists, and state actors. Today, the biggest biotech threat may be the loss of U.S. economic competitiveness, stemming from a failure to transition the country’s enormous advantages in biological research into the infrastructure needed to grow the bioeconomy.

The world is in the middle of a technology transformation. Biology — understanding how living things operate — is converging with the digital world. Biology is written in code, but instead of ones and zeroes it is written in As, Ts, Gs, and Cs, the nucleic acid sequences of DNA. Reading, writing, and editing this code will eventually have an even larger impact than the digital revolution.

Biological data fuel growth of the bioeconomy

Data are the essential fuel of the bioeconomy.

The quantity and variety of biodata are driven by new methods for observing and measuring biological processes, and new biotechnology applications, including synthetic biology and biomanufacturing. Genomics, which involves deciphering the sequence of base pairs in the DNA of an individual, generates especially large amounts of data. A single human genome consists of more than 3 billion base pairs, the equivalent of 200 gigabytes of data.
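As a rough back-of-envelope check on these figures, the sketch below estimates storage for one human genome under two assumptions that are mine rather than drawn from any official source: a finished sequence packed at 2 bits per base, and raw sequencing output at 30-fold coverage stored as one byte per base plus one byte per quality score. The coverage depth and byte counts are illustrative only; real file sizes vary with instrument, file format, and compression.

```python
# Back-of-envelope storage estimate for one human genome.
# All constants below are illustrative assumptions, not measured values.

BASE_PAIRS = 3_000_000_000   # approximate genome length cited in the text
BITS_PER_BASE = 2            # A, C, G, T can each be encoded in 2 bits
COVERAGE = 30                # assumed sequencing depth (hypothetical)
BYTES_PER_READ_BASE = 2      # assumed: 1 byte for the base + 1 for its quality score


def packed_sequence_gb(base_pairs: int = BASE_PAIRS) -> float:
    """Size in gigabytes of the finished sequence at 2 bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / 1e9


def raw_reads_gb(base_pairs: int = BASE_PAIRS, coverage: int = COVERAGE) -> float:
    """Size in gigabytes of uncompressed raw reads at the assumed coverage."""
    return base_pairs * coverage * BYTES_PER_READ_BASE / 1e9


if __name__ == "__main__":
    print(f"packed sequence: {packed_sequence_gb():.2f} GB")
    print(f"raw reads:       {raw_reads_gb():.0f} GB")
```

Under these assumptions the finished sequence takes 0.75 GB, while the raw reads come to about 180 GB, the same order of magnitude as the 200-gigabyte figure. Most of the data volume in genomics, in other words, comes from the measurement process rather than from the genome itself.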

Since 1982, the U.S. has maintained a national repository of DNA sequences called GenBank, managed by the National Center for Biotechnology Information within the National Institutes of Health. Japan and Europe manage similar gene banks, and actively collaborate with the U.S. by sharing data submissions and releases. The Chinese government established a national gene bank in 2016, which is operated by the BGI Group. The amount of genetic data collected by these entities is incredibly large. As of 2022, GenBank’s library contained 19.6 trillion base pairs from more than 2.9 billion nucleotide sequences for more than 500,000 formally described species.

Governments are not the only entities holding significant collections of biological data. Universities, hospitals, publicly held companies, and private ones are leveraging biodata for both research and commercial purposes. For example, BGI Group is assembling a collection of blood samples, genetic data, and other medical information on millions of women from across the globe as part of its prenatal genetic screening product offerings. Other companies are assembling proprietary genomic and biological databases for synthetic biology projects in manufacturing and agriculture.

Collecting, curating, standardizing, and maintaining such large datasets is a huge task. Genomics is essentially a comparative exercise: the larger and more varied the available library, the more useful it is. But that holds only if the datasets are genuinely comparable, meaning that submissions are accurate, consistently entered, and maintained according to established standards.

Genomic studies and other biotech efforts are now benefiting from advanced analytics based on machine learning and artificial intelligence. In addition to needing large datasets, these approaches also need comprehensive taxonomies and high-quality data labels. Even then, errors will inevitably appear as these datasets scale and are copied and shared. This makes error correction another key design requirement for new, updated, or federated collections of biological data.
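One common way to catch errors introduced when records are copied or shared is to store a checksum with each record and re-verify it on receipt. The sketch below is a minimal illustration using SHA-256 from Python's standard library; the record layout, accession string, and function names are hypothetical and do not reflect how GenBank or any other repository actually stores data. A checksum only detects changes made after deposit; it cannot tell whether the original submission was accurate.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a SHA-256 hex digest of a normalized nucleotide sequence."""
    normalized = "".join(sequence.split()).upper()  # drop whitespace, unify case
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def make_record(accession: str, sequence: str) -> dict:
    """Bundle a sequence with its checksum (hypothetical record layout)."""
    return {
        "accession": accession,
        "sequence": sequence,
        "sha256": sequence_checksum(sequence),
    }


def is_intact(record: dict) -> bool:
    """True if the stored checksum still matches the stored sequence."""
    return sequence_checksum(record["sequence"]) == record["sha256"]


if __name__ == "__main__":
    original = make_record("EXAMPLE0001", "ATGGCGTACGTTAGC")
    # Simulate a single-base error introduced while copying the record.
    corrupted = dict(original, sequence="ATGGCGTACGTTAGG")
    print(is_intact(original))   # checksum matches
    print(is_intact(corrupted))  # checksum no longer matches
```

Detection of this kind is only the first step. Correcting an error still requires a trusted copy or a curator's judgment, which is one reason shared standards matter for federated collections.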

The security of biological databases is a major concern. Whenever data from individuals are collected and stored, their privacy must be protected. And even as human biodata and associated applications are generating new ethical concerns and legal requirements, the data need to remain accessible to researchers and innovators. A coordinated U.S. effort to marshal disparate biological data sources is needed to strike the right balance between utility and security.

Through much of the 21st century, the failure of commercial software and internet companies to build in security and privacy protections sufficient to prevent cybercrimes has been costly. An opportunity now exists to build adaptive security and resilience into bioeconomy data systems from the beginning, and to leverage U.S. expertise in artificial intelligence to create biodata systems that not only detect but also anticipate security risks.

The federal government’s role

I see three roles for the federal government in unlocking the potential of public and private biological data to maintain and extend U.S. leadership in the global bioeconomy.

Constructing a comprehensive database for all biological applications is unrealistic. The first thing the U.S. should do is establish a federated and distributed model that identifies and connects as many biological data collections as possible. This should be accompanied by an update of standards and practices for entering, accessing, and storing new data in major collections such as GenBank. The U.S. should also convene an international effort to establish biodata standards and best practices in the design and use of biodata collections.

Second, the executive branch needs to create an advisory board to help shape the initial design and operational principles of the country’s biodata infrastructure. While this group should represent a range of institutions and industries that the national biodata infrastructure must serve, it should be small enough to assemble and provide guidance quickly, and to help inform plans offered by federal agencies, which are due in May.

Third, the administration should establish a more long-lived committee, which I’ll call the Biological Data Infrastructure and Security Consortium, to consider the priorities and design of an adaptive biological data infrastructure that creates feedback loops between the public and private sectors and is responsive to the dynamic needs of the bioeconomy.
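The federated and distributed model proposed in the first of these roles can be pictured with a small sketch: each collection keeps its own data and answers queries itself, while a thin coordinating layer sends the same question to every collection and merges the answers. Everything here is hypothetical, including the collection names, the accession strings, and the `federated_search` function; it illustrates the general idea, not a design proposed by the executive order or any agency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    source: str      # which collection answered
    accession: str   # that collection's own identifier
    organism: str


# A collection is modeled as a function from an organism name to matching hits.
Collection = Callable[[str], List[Hit]]


def make_collection(name: str, records: Dict[str, str]) -> Collection:
    """Wrap an in-memory {accession: organism} table as a searchable collection."""
    def search(organism: str) -> List[Hit]:
        return [
            Hit(name, accession, org)
            for accession, org in records.items()
            if org.lower() == organism.lower()
        ]
    return search


def federated_search(collections: List[Collection], organism: str) -> List[Hit]:
    """Send the same query to every collection and merge the answers."""
    hits: List[Hit] = []
    for search in collections:
        # The data stay with their owners; only query results are returned.
        hits.extend(search(organism))
    return hits


if __name__ == "__main__":
    public = make_collection(
        "public-archive", {"PA1": "Escherichia coli", "PA2": "Homo sapiens"}
    )
    private = make_collection("company-db", {"CX9": "Escherichia coli"})
    for hit in federated_search([public, private], "escherichia coli"):
        print(hit.source, hit.accession)
```

The sketch works only because both collections agree on how an organism is named. That shared vocabulary is exactly what common standards for entering and labeling data would have to supply at scale.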

The Biden administration’s bioeconomy executive order accurately identifies biological data as a critical component of this emerging economy. Establishing an effective biodata infrastructure that enables rapid advances in both the life sciences and biotechnologies could give the U.S. and its allies a significant competitive advantage. It is important to recognize both the magnitude of this challenge and the essential role that the private sector must play in the design and operation of such a biological data infrastructure.

To be insufficiently ambitious in this undertaking risks U.S. biotech innovation becoming lost at sea, with significant consequences for the bioeconomy.

Tara O’Toole is a senior fellow and former executive vice president at In-Q-Tel, a nonprofit strategic investor that serves and powers the national security interests and capabilities of the U.S. intelligence community and its allies. She formerly served as Under Secretary of Science and Technology at the U.S. Department of Homeland Security as the principal advisor to the Secretary on matters related to science and technology.

Source: STAT