Alphabet firm DeepMind releases massive database of 3D protein structures

With the advent of cheap genetic sequencing, the world of biology has been flooded with 2D data. Now, artificial intelligence is pushing the field into three dimensions.

On Thursday, Alphabet-owned AI outfit DeepMind announced it has used its highly accurate deep learning model AlphaFold2 to predict the 3D structure of 350,000 proteins — including nearly every protein expressed in the human body — from their amino acid sequences. Those predictions, reported in Nature and released to the public in the AlphaFold Protein Structure Database, are a powerful tool to unravel the molecular mechanisms of the human body and deploy them in medical innovations.

“This resource we’re making available, starting at about twice as many predictions as there are structures in the Protein Data Bank, is just the beginning,” said John Jumper, lead AlphaFold researcher at DeepMind, in a press call. The company intends to continue adding predicted structures to the database.

“When we reach the scale of 100 million predictions that cover all proteins, we’re really starting to talk about transformative uses,” he said.

One of those transformations may come in the database’s application to drug discovery. In an uncommon move, DeepMind has chosen to make the database, released in partnership with the European Molecular Biology Laboratory, freely and openly available for any use.

“So we hope, actually, that drug discovery and pharma will use it,” DeepMind CEO Demis Hassabis said during the call.

DeepMind’s predictions could be of interest to AI-driven drug companies looking to hone their models, biotech startups hoping to expand their list of target proteins, and even companies engineering new designer enzymes.

“Whenever there’s a breakthrough, I think rising tides lift all boats. And this opens up a super exciting era in structure-driven drug design,” said Abraham Heifets, CEO of AI-driven drug discovery company Atomwise, which uses its own library of computationally inferred protein structures to find molecules that selectively bind with proteins involved in disease. “Having better information on the shape of a protein is how you design a molecule that fits into that protein really well, to shut down or arrest that disease process.”

DeepMind had committed to opening up its work in November, after AlphaFold2 took home the top prize in the protein-folding prediction contest CASP, in what was hailed as a solution to the long-standing protein folding problem. But in the seven months since then, structural biologists got antsy waiting for the groundbreaking work to go public. As STAT reported last week, DeepMind raced to publish its open source code and methods in Nature, just as a group at the University of Washington published their own attempt at replicating AlphaFold’s approach in Science.

With the database adding so many new structural predictions, researchers ranging from drug developers to basic scientists will have a lot of new material to work with. “We’ll look through it very quickly to see if there are proteins we’re interested in that are suddenly enabled by this new dataset,” said Heifets.

Jumper thinks the new tool will remove a difficult choice that plagues some biologists: If a protein structure isn’t available, they could spend lots of time and money on physical experiments to figure it out (which still might not pan out), or they could simply go without and focus on functional studies. “Suddenly, the access to structures is going to increase dramatically,” he told STAT. “I think that’s really going to change how scientists approach these biological questions.”

Still, these aren’t plug-and-play structures: They’re predictions, and they come with caveats that scientists will have to consider.

“Me as a biochemist, I’d like to understand is this a good model or not? What about this algorithm is confident or not?” said Frank von Delft, who leads protein crystallography at the University of Oxford’s Centre for Medicines Discovery. “I think that will be the key. Can you tell me, ‘Yeah, I kind of nailed it, and this one I’m struggling to nail, but this one is easy to get right’?”

To answer that question, DeepMind built measures into its predictions to help researchers determine whether to rely on the structures for their work. “Preparing the predictions has actually only been a small part of this work,” DeepMind’s Kathryn Tunyasuvunakool, lead author on the paper, said in the call. “Perhaps even more effort has gone into providing both local and global confidence metrics.”

Across the board, AlphaFold2 predicted 58% of amino acids in the human proteome — all the proteins expressed by the human body — with confidence, and 35.7% with a very high degree of confidence. At that level, the model could nail not just the backbone of the protein, but the orientation of its side chains. The degree of confidence required will depend on how scientists are using the prediction. “If you were looking at, say, the active site of an enzyme, you would want the residues involved to be in that highest confidence bracket,” said Tunyasuvunakool, “but actually there’s an awful lot of utility even in the next-highest confidence bracket.”
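For readers who want to see how such a breakdown might be computed for a single protein, here is a minimal Python sketch. It rests on assumptions not stated in this article: that the per-residue confidence score (called pLDDT in the public database) is stored in the B-factor column of the downloadable PDB-format files, and that the brackets are cut at 90, 70 and 50, following the database's color legend. The function names are hypothetical, and how scores exactly on a boundary are binned is a guess.

```python
from collections import Counter

# Assumed bracket floors (exclusive), taken from the public database legend.
BRACKETS = [(90.0, "very high"), (70.0, "confident"), (50.0, "low")]
LABELS = ["very high", "confident", "low", "very low"]


def bracket(score: float) -> str:
    """Map one per-residue confidence score (0-100) to a bracket label."""
    for floor, label in BRACKETS:
        if score > floor:
            return label
    return "very low"


def residue_scores(pdb_text: str) -> list[float]:
    """Read one score per residue from PDB-format text.

    Assumes the score sits in the B-factor field (columns 61-66) and
    takes it from each residue's alpha-carbon (CA) ATOM record.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_fractions(scores: list[float]) -> dict[str, float]:
    """Return the share of residues that falls in each bracket."""
    if not scores:
        return {label: 0.0 for label in LABELS}
    counts = Counter(bracket(s) for s in scores)
    return {label: counts[label] / len(scores) for label in LABELS}
```

Calling `confidence_fractions(residue_scores(text))` on the contents of one downloaded structure file would give that protein's own version of the proteome-wide percentages quoted above. A researcher studying an enzyme's active site could instead look up `bracket(score)` for just the residues involved.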

“It is kind of overwhelming what they can do,” said Arne Elofsson, a bioinformatician at Stockholm University.

The AlphaFold database doesn’t spell doom for experimental biologists, those who painstakingly determine the physical structure of proteins using methods like X-ray crystallography, cryo-electron microscopy, and nuclear magnetic resonance spectroscopy. For many applications, there will be a need to validate the structures proposed by these models, said Elofsson.

But as predicted structures become more accepted, the AlphaFold database could change the way structural biology prioritizes its work — and even what it considers its gold standard.

“Normally in CASP we assume the experiment is the gold standard, and if you disagree you’re wrong,” said John Moult, a computational biologist at the University of Maryland who founded the contest. “And with DeepMind some of the time that’s true, but quite a lot of the time not true.” In other words, there’s room for error in the physical experiments used to determine protein structure — and with a highly accurate prediction model, a computer could in some cases do the job better. “So I think that there’s a lot to sort out there: When is a detail actually computationally better than the corresponding experimental result?”

That will be a philosophical question for the field to confront over time, especially as AlphaFold’s approach continues to develop. DeepMind made massive gains between its first entry in 2018’s CASP competition, with AlphaFold1, and AlphaFold2 in 2020. “This is sort of v2.1 in a way, and we expect there will be more improvements over time,” said Hassabis, adding that DeepMind may update the database as more experimental protein structures are solved or as the computational model continues to be developed.

As the database expands, so too could the set of structures that could be applied to drug discovery. “A thing that people don’t really know or think about is that there’s 20,000 human genes, but only 4% of those have ever had a drug approved by the FDA,” said Heifets. “So we have many more protein targets that we could go after than we’ve ever had medicine brought to bear against.” DeepMind has established a partnership with the Drugs for Neglected Diseases Initiative to develop approaches for Chagas disease and leishmaniasis.

But there are also uses for the database that are as yet unseen. “AlphaFold is a paradigm change in the level of accuracy which biologists can now expect, which will unlock other applications,” Pushmeet Kohli, DeepMind’s head of AI for science, told STAT. “Which is why we wanted to make AlphaFold broadly accessible: So the community would not just leverage it for existing applications, like in drug discovery, but other applications they might not even have been thinking about until now.”

Source: STAT