Every two years there’s a big challenge competition in predicting protein folding. That is. . .well, a hard problem. Protein chains have (in theory) an incomprehensibly large number of possible folded states, but many actual proteins just manage to arrange themselves properly either alone or with a few judicious bumps from chaperones. It’s been clear for many decades that there are many energetic factors in play that allow them to accomplish these magic tricks, which are a bit like watching piles of hinged lumber spontaneously restack themselves into functional boats, wagons, and treehouses. But knowing that amide bond angles, pi-stacking interactions, hydrogen bonding, hydrophobic surfaces, steric clashes, and all the rest are all important, while a good start, is a long way from being able to calculate them and assess their relative importance for any given case.
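To put a rough number on "incomprehensibly large", here is a back-of-the-envelope sketch in the spirit of the classic Levinthal estimate. Every constant in it is an illustrative assumption rather than a measured value: three coarse conformations per residue, each residue varying independently, and a chain that could somehow try 10^13 conformations per second. The function names are made up for this example.

```python
# Back-of-the-envelope count of chain conformations (Levinthal-style estimate).
# All three constants are illustrative assumptions, not measured values.
STATES_PER_RESIDUE = 3                 # assumed coarse conformations per residue
SAMPLES_PER_SECOND = 1e13              # assumed conformations tried per second
SECONDS_PER_YEAR = 3600 * 24 * 365.25


def conformation_count(n_residues: int, states: int = STATES_PER_RESIDUE) -> int:
    """Number of chain conformations if every residue varies independently."""
    return states ** n_residues


def years_to_enumerate(n_residues: int, rate: float = SAMPLES_PER_SECOND) -> float:
    """Years needed to visit every conformation once at the assumed rate."""
    return conformation_count(n_residues) / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (50, 100, 150):
        print(f"{n} residues: {conformation_count(n):.2e} states, "
              f"{years_to_enumerate(n):.2e} years")
```

Under those assumptions a 100-residue chain has about 5 x 10^47 conformations, and visiting each one once would take on the order of 10^27 years. Real proteins fold in far less time than that, so whatever they are doing, it is not a blind search.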
The CASP (Critical Assessment of protein Structure Prediction) contests have been run since 1994. I wrote about the 2018 one here, with particular attention to the Google-backed AlphaFold effort. Now the 2020 CASP results are in, and AlphaFold seems to have improved its standing even more. There are several divisions to the competition: “regular targets”, where the teams are given the plain amino acid sequence of proteins whose structures have been determined (but not publicly released), multimeric targets (for protein complexes), refinement targets (where teams try to refine an existing structural model to make it fit the experimental data better) and contact predictions. AlphaFold made their push this year in what is always the largest and most contested of these, the regular targets group.
This year’s press release is rather different from the others. It announces, basically, that an AI-based solution has been found, and that’s the latest AlphaFold version. Out of a list of 100 or so proteins in the free-modeling challenge, it predicted the structures of two-thirds of them to a level of accuracy that would be within the range of experimental error. Again, these are single proteins (not the multimeric complexes or the other categories, where AlphaFold did not participate), but that is really a substantial achievement. Their 2018 results were good (and better than anyone had achieved in previous CASP rounds), but these are much better still. Here are the results in that regular targets category, and you can see that the AlphaFold team largely blew everyone else away (that tall bar on the far left).
I’m impressed. We’re not up to “guaranteed protein structure for whatever you put in”, but getting that level of structural accuracy on that many varied proteins is something that has just never been done before. I will be very interested to hear from the AlphaFold people about what improvements they feel were most important. As it is, such computations tend to use a variety of techniques: straight-out calculation of those energetic factors mentioned above (when necessary) along with searching for similarities to known protein sequences and structures to get a leg up. Improved methods to run such “prior art” searches reliably are a big area as well; they are nontrivial.
So some of the improvement is due to the ever-increasing number of protein structures that we have solved experimentally, and the improved application of that data to new protein sequences. Some of it is due to better ways to search through and apply the lessons from those previous structures (and better ways to be sure that you’re picking the right lessons to learn!) And some of it is due to the sheer increases in computational power that we have at our disposal, of course, although it has to be noted that you cannot just compute your way out of problems like this one if you don’t have some solid ideas about where you’re going and how you’re going to find a path forward.
It’s not that we have completely achieved a fundamental understanding of all the energetic processes and tradeoffs in folding any given protein. While we’re closer to that than ever before, we also have shortcuts that allow us to table those fundamental problems and arrive at a solution by analogy to things we already know that proteins do (whatever their reasons might be for doing it!) And that means that the accuracy of such calculations is only going to improve as we continue to solve more protein structures (and to improve the tools for using them). Decades ago, people probably expected eventual progress in the protein folding problem to come more from the fundamental-understanding side, but AI programs can be extraordinarily good at the “Hey, you know what, I’ve seen something kind of like that before” approach, and the results speak for themselves.
X-ray and NMR protein structures are continuing to flow into the databases, of course. And I would expect the recent improvements in cryo-electron microscopy to add plenty of material for such efforts. Cryo-EM will also add a lot of multimeric protein complexes to that particular data pile. Predicting those complexes will be the next big challenge, one with huge relevance to the way that proteins tend to perform their functions inside living cells. Onward!
Second update: for those wondering about what this means for drug discovery, let me send you here.