Artificial intelligence (AI) has solved one of biology’s greatest challenges: predicting how proteins curl up from a linear chain of amino acids into 3D shapes that allow them to perform the tasks of life. Today, leading structural biologists and organizers of a biennial protein folding competition announced the achievement of researchers at DeepMind, a UK-based AI company. They say the DeepMind method will have far-reaching effects, including a dramatic acceleration in the creation of new drugs.
“What the DeepMind team has accomplished is fantastic and will change the future of structural biology and protein research,” says Janet Thornton, director emeritus of the European Bioinformatics Institute. “This is a 50-year-old problem,” adds John Moult, a structural biologist at the University of Maryland, Shady Grove, and co-founder of the Critical Assessment of Protein Structure Prediction (CASP) competition. “I never thought I would see this in my life.”
The human body uses tens of thousands of different proteins, each a chain of tens to many hundreds of amino acids. The sequence of those amino acids determines how the myriad pushes and pulls between them give rise to proteins’ complex 3D shapes, which in turn determine how they function. Knowing those shapes helps researchers devise drugs that can lodge in the pockets and crevices of proteins. And being able to synthesize proteins with a desired structure could speed the development of enzymes that make biofuels and break down waste plastic.
For decades, researchers have deciphered the 3D structures of proteins using experimental techniques such as X-ray crystallography or cryo-electron microscopy (cryo-EM). But such methods can take months or years and don’t always work. Structures have been resolved for only about 170,000 of the more than 200 million proteins discovered in life forms.
In the 1960s, researchers realized that if they could work out all of the individual interactions within a protein’s sequence, they could predict its 3D shape. However, with hundreds of amino acids per protein and countless ways in which each pair of amino acids can interact, the number of possible structures per sequence was astronomical. Computational scientists jumped on the problem, but progress has been slow.
In 1994, Moult and colleagues launched CASP, which takes place every two years. Participants are given amino acid sequences for about 100 proteins whose structures are unknown. Some groups calculate a structure for each sequence, while other groups determine it experimentally. The organizers then compare the computational predictions with the lab results and give the predictions a global distance test (GDT) score. Scores above 90 on the zero-to-100 scale are considered comparable to experimental methods, Moult says.
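To make the scoring idea concrete, here is a minimal sketch of a GDT-style calculation. It assumes the commonly cited GDT_TS variant, which averages the percentage of residues lying within 1, 2, 4 and 8 angstroms of their experimental positions, and it assumes the two structures have already been superimposed. The real CASP evaluation also searches for the best superposition at each cutoff, which this sketch does not attempt. The function name `gdt_ts` and the toy coordinates are illustrative, not taken from CASP.

```python
import math

# Distance cutoffs in angstroms for the commonly cited GDT_TS variant (assumed here).
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, experimental, cutoffs=CUTOFFS):
    """Simplified GDT_TS on a 0-100 scale.

    Both arguments are equal-length lists of (x, y, z) positions, one per
    residue, assumed to be superimposed already (a simplification).
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Per-residue distance between predicted and experimental positions.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each cutoff, then the average of those fractions.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy four-residue example with made-up coordinates (not real protein data).
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

score = gdt_ts(predicted, experimental)
print(f"GDT_TS = {score:.2f}")  # prints "GDT_TS = 56.25"
```

In the toy example the four residues are off by 0.5, 1.5, 3 and 9 angstroms, so one, two, three and three of them fall within the four cutoffs, giving an average of 56.25. A prediction identical to the experimental structure scores 100.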
Even in 1994, predicted structures for small, simple proteins could match experimental results. But for larger, challenging proteins, the GDT scores of the calculations were about 20, “a complete catastrophe,” says Andrei Lupas, a CASP judge and evolutionary biologist at the Max Planck Institute for Developmental Biology. By 2016, competing groups had achieved scores of about 40 for the most difficult proteins, mostly by drawing insights from known structures of proteins closely related to the CASP targets.
When DeepMind first entered in 2018, its algorithm, called AlphaFold, relied on this comparative strategy. But AlphaFold also included a computational approach called deep learning, in which the software is trained on huge data sets – in this case the sequences and structures of known proteins – and learns to recognize patterns. DeepMind won handily, beating the competition by an average of 15% on each structure and achieving GDT scores of up to about 60 for the most difficult targets.
But the predictions were still too crude to be useful, says John Jumper, AlphaFold’s head of development at DeepMind. “We knew how far we were from biological relevance.” To do better, Jumper and his colleagues combined deep learning with an “attention algorithm” that mimics the way someone puts together a jigsaw puzzle: first connecting pieces into small clumps – in this case clusters of amino acids – and then looking for ways to merge the clumps into a bigger whole. Working on a modest computer network with 128 processors, they trained the algorithm on all of the roughly 170,000 known protein structures.
And it worked. AlphaFold achieved a median GDT score of 92.4 across the target proteins in this year’s CASP. For the most challenging proteins, AlphaFold scored a median of 87, which is 25 points above the next best predictions. It even excelled at solving structures of proteins wedged in cell membranes, which are central to many human diseases but notoriously difficult to solve with X-ray crystallography. Venki Ramakrishnan, a structural biologist at the Medical Research Council Laboratory of Molecular Biology, calls the result “an astonishing advance in the problem of protein folding.”
All groups in this year’s competition improved, Moult says. But with AlphaFold, Lupas says, “The game has changed.” In fact, the organizers worried that DeepMind might be cheating in some way. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team had tried every trick in the book to get an X-ray crystal structure of the protein. “We couldn’t solve it.”
But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their X-ray data; within half an hour they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They couldn’t possibly have cheated on this. I don’t know how they do it.”
As a condition of entering CASP, DeepMind – like all groups – agreed to reveal enough details about its method so that other groups could recreate it. That will be a boon to experimentalists, who can use accurate structure predictions to understand opaque X-ray and cryo-EM data. It could also enable drug designers to quickly figure out the structure of any protein in new and dangerous pathogens like SARS-CoV-2, an important step in the search for molecules to block them, Moult says.
Still, AlphaFold does not get everything right yet. In the contest, it faltered noticeably on one protein, an amalgam of 52 small repeating segments that distort one another’s positions as they assemble. Jumper says the team now wants to train AlphaFold to solve such structures, as well as those of complexes of proteins that work together to carry out key functions in the cell.
Although one major challenge has fallen, others will no doubt emerge. “This is not the end of anything,” Thornton says. “It is the start of many new things.”