
In Silico

Protein Folding, 2020

Every two years there’s a big challenge competition in predicting protein folding. That is. . .well, a hard problem. Protein chains have (in theory) an incomprehensibly large number of possible folded states, but many actual proteins just manage to arrange themselves properly either alone or with a few judicious bumps from chaperones. It’s been clear for many decades that there are many energetic factors in play that allow them to accomplish these magic tricks, which are a bit like watching piles of hinged lumber spontaneously restack themselves into functional boats, wagons, and treehouses. But knowing that amide bond angles, pi-stacking interactions, hydrogen bonding, hydrophobic surfaces, steric clashes, and all the rest are all important, while a good start, is a long way from being able to calculate them and assess their relative importance for any given case.

The CASP (Critical Assessment of protein Structure Prediction) contests have been run since 1994. I wrote about the 2018 one here, with particular attention to the Google-backed AlphaFold effort. Now the 2020 CASP results are in, and AlphaFold seems to have improved its standing even more. There are several divisions to the competition: “regular targets”, where the teams are given the plain amino acid sequence of proteins whose structures have been determined (but not publicly released), multimeric targets (for protein complexes), refinement targets (where teams try to refine an existing structural model to make it fit the experimental data better) and contact predictions. AlphaFold made their push this year in what is always the largest and most contested of these, the regular targets group.

This year’s press release is rather different from the others. It announces, basically, that an AI-based solution has been found, and that’s the latest AlphaFold version. Out of a list of 100 or so proteins in the free-modeling challenge, it predicted the structures of two-thirds of them to a level of accuracy that would be within the range of experimental error. Again, these are single proteins (not the multimeric complexes or the other categories, where AlphaFold did not participate), but that is really a substantial achievement. Their 2018 results were good (and better than anyone had achieved in previous CASP rounds), but these are much better still. Here are the results in that regular targets category, and you can see that the AlphaFold team largely blew everyone else out (that tall bar on the far left).

I’m impressed. We’re not up to “guaranteed protein structure for whatever you put in”, but getting that level of structural accuracy on that many varied proteins is something that has just never been done before. I will be very interested to hear from the AlphaFold people about what improvements they feel were most important. As it is, such computations tend to use a variety of techniques: straight-out calculation of those energetic factors mentioned above (when necessary) along with searching for similarities to known protein sequences and structures to get a leg up. Improved methods to run such “prior art” searches reliably are a big area as well; they are nontrivial.

So some of the improvement is due to the ever-increasing number of protein structures that we have solved experimentally, and the improved application of that data to new protein sequences. Some of it is due to better ways to search through and apply the lessons from those previous structures (and better ways to be sure that you’re picking the right lessons to learn!) And some of it is due to the sheer increases in computational power that we have at our disposal, of course, although it has to be noted that you cannot just compute your way out of problems like this one if you don’t have some solid ideas about where you’re going and how you’re going to find a path forward.

It’s not that we have completely achieved a fundamental understanding of all the energetic processes and tradeoffs in folding any given protein. While we’re closer to that than ever before, we also have shortcuts that allow us to table those fundamental problems and arrive at a solution by analogy to things we already know that proteins do (whatever their reasons might be for doing it!) And that means that the accuracy of such calculations is only going to improve as we continue to solve more protein structures (and to improve the tools for using them). Decades ago, people probably expected eventual progress in the protein folding problem to come more from the fundamental-understanding side, but AI programs can be extraordinarily good at the “Hey, you know what, I’ve seen something kind of like that before” approach, and the results speak for themselves.

X-ray and NMR protein structures are continuing to flow into the databases, of course. And I would expect the recent improvements in cryo-electron microscopy to add plenty of material for such efforts. Cryo-EM will also add a lot of multimeric protein complexes to that particular data pile. That will be the next big challenge, one with huge relevance to the way that proteins tend to perform their functions inside living cells. Onward!

Update: here’s Nature on this, here’s Science, and here’s the New York Times. Lots of coverage out there!

Second update: for those wondering about what this means for drug discovery, let me send you here.

40 comments on “Protein Folding, 2020”

  1. Marko says:

    Trebek : ” The answer is : Multi-dimensional jig-saw puzzles. ”

    Derek : ” What computers do for fun and relaxation. “

    1. Crocodile Chuck says:

      “Answer the problem in the form of a question”

      1. Marko says:

        Sorry , you’re right. Should’ve been ” What DO computers do for fun and relaxation? “

  2. Konrad Koehler says:

    It is more than data and raw cpu power. It is also “deep learning” algorithms.

    1. comper says:

      Other teams used deep learning and the same data but didn’t do nearly so well.

      I’m waiting to see the full analysis and to see it tested in the hands of others before getting too excited. This field has been dogged by overhyped claims (plus some scientific fraud) for years.

  3. KW says:

    The bitter lesson, by Rich Sutton

    “We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach. […] We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.”

    1. Kaleberg says:

      That’s not 100% true. As Rodney Brooks pointed out, machine vision performs a convolution that relies on our knowledge that object appearance is invariant with regard to position and scale. There’s probably a lot more assumption going into this model than it seems, and not all of it is obvious.

      1. Ernie says:

        Transformer networks are learning to do what CNNs do without being explicitly structured to do so. AIAYN.

  4. theasdgamer says:

    In the news…

    Major hit on Corman-Drosten:

    1. c says:

      “The use of lockdowns and belief in unproven NPIs coupled with the appearance of the ‘casedemic’ phenomena enabled by PCR testing has encouraged governments worldwide to intimidate their populations into compliance with increasingly bizarre and illogical restrictions.”

      I think the cart might be pulling the horse on this report.

      1. Marko says:

        ” I think the cart might be pulling the horse on this report. ”

        Yes , but I don’t think it’s a horse. More like a jackass.

  5. luysii says:

    Let’s see how the algorithms work on an unstructured protein (without telling the participants)

  6. AQR says:

    Impressive results! I have a couple of questions:
    Are the AlphaFold calculations performed without human intervention or do the researchers “guide” the computer along the way?
    If the calculations are done without human intervention, is it possible to determine, after the fact, how AlphaFold derived the final structure?

  7. c says:

    Partly (mostly?) funded by advertisements on cute cat videos!

  8. c says:

    Fun comment from a blog post written after the 2018 debut of AlphaFold (here:

    “What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode.”

    I expect more embarrassment in the future.

    1. Nat says:

      I’ve been reading comments along the lines of “silly scientists, why don’t they just use computers to solve cancer” for two decades and I’m sure it’s been going on for much longer than that. The more general form of this complaint is “if they’d used this technology that didn’t exist ten years ago they’d be done by now”. It requires zero effort or awareness to fling simplistic claims like this about, but somehow when deployed at the correct time they get you a reputation as a hard-nosed prognosticator and speaker-of-truth-to-power.

      1. c says:

        Okay? That doesn’t really address the reality that apparently a small team of computer scientists from an advertising company can do what thousands of very well paid biochemists cannot.

        It’s not like Machine Learning didn’t exist 10 years ago (it did) and it’s not like this model required a large amount of compute to generate (it didn’t, certainly not on the scale of pharma revenues).

        Seems it’s much more like a failed “hard-nosed prognosticator” to wave this away as if there isn’t some underlying embarrassing problem going on here.

        1. Nat says:

          That small team of exceptionally-well-paid computer scientists has not actually discovered any drugs yet. (Neither have I, but it’s not either of our jobs.) Let’s check back in a few years when they’ve had a chance to go through Phase III trials.

          Derek, isn’t there an early 1980s magazine cover relevant to this? Maybe Byte?

          1. c says:

            My own bias here is that (at least in my experience) research wings of pharma orgs (and academic groups) are packed with “scientists” (actually just poorly trained engineers with inflated egos) whose main skill is the ability to rattle off lists of illustrious names associated with cross coupling and total synthesis.

            It’s pretty clear to me that we shouldn’t be training and promoting the young brains that have spent 5-10 years mixing liquids, walking down to the NMR queue, and waiting for crystals to form in some vial in the back of a fume hood.

            These big orgs are relying on people that are only capable of doing research at a pace and quality that could have been done 50 years ago. The career scientists entrenched in these places should be ashamed when some advertising company sweeps away one of their core problems as a side project. Why hasn’t a percentage of the group been retrained on modern science and put to work with 10x the budget (that would have been swallowed by some cryo NMR or crystallography department anyways)?

          2. sPh says:

            Talk about a blast from the past: I remember playing around with that system in the WashU Engineering Computing Lab (ECL) back in the… a long time ago. It was capable of generating some pretty good graphics for an affordable computer (affordable compared to a multimillion dollar Evans & Sutherland graphics system) and it appeared in many popular science and trade press articles but I don’t recall any significant publications driven by it (other than the computer engineering dept that built it). Eventually it was part of the tours for prospective undergraduates but turned off otherwise.

        2. Jerry says:

          The problem is that the problem itself is way more than figuring out how proteins fold. There’s a long distance between doing well in a competition that focuses on very low level processes and developing safe real world applications and therapies.

          This is like concluding that, since Boston Dynamics has working cross-terrain robots, we need to immediately shut down all all-terrain vehicle production. And Boston Dynamics is comparatively much further along than this result is.

        3. Kaleberg says:

          Google’s second largest business after advertising is selling web hosting and computation. Everyone knew that making any serious progress towards solving protein folding was going to take more advanced computational techniques and a lot more computer power. Google had to build an extremely powerful system in order to run AlphaFold. They had to design specialized processors and develop computer languages for programming them. No pharma company was going to be able to assemble the team needed to do this. Look at the team Apple had to build for its latest Macbook M1 processor. Expecting a pharma company to do this is like expecting the Weather Channel to build a supercomputer and write a program to produce reliable 60 day forecasts.

          In ten years we’ll laugh at the early BetaFold bloopers, maybe even here at In The Pipeline. People who care about protein structure will run the KappaFold app on their smart phone and wonder what all the fuss was about.

    2. Derek Lowe says:

      I remember this one. But it greatly overestimates the importance of predicting protein structure to drug discovery, for one thing. I may need to do a follow-up post on that issue. . .

      1. John Wayne says:


        Do you mean the cures for dementia, cancer and pain are not going to come out of the other end?

    3. burt says:

      It’s no shame to be beaten in algo design by Alphabet. These guys are simply the best computer scientists in the world. A young friend of mine graduated from a top CS program. Google came by and made offers to the top of the class. Everyone accepted.

      1. John Wayne says:

        Google has enough money to do whatever it cares to do, and treat its people well. All of those of us in pharma and biotech research have probably never had such a blank slate, clear attainable goal and huge assets at our disposal. Plus, we all get reorganized into new companies every few years. It’s amazing we get anything done.

    4. JIA says:

      I’m a bit puzzled by the original comment dissing pharma, AND the responses. Yes, Google computer scientists who specialize in algorithm development and machine learning have handily beaten computational employees at pharma companies in this specific task of in silico prediction of protein structure from sequence.

      But — Why is this a surprise? And also, who cares?? Why get defensive, pharma peeps?

      As pointed out by another commenter, the “best and brightest” in comp sci aren’t going to pharma — they can’t pay nearly enough to compete with Google et al! And a computational biologist in pharma has a much broader job description than just protein folding algorithms, so no surprise they aren’t experts.

      But the idea that this means Google has beaten pharma “at their own game” — namely, drug discovery and development — is laughable. Predicting protein structures isn’t even step 1 in drug discovery, it’s more like step 0.01. Or even step-minus-one. Plenty of antibody drugs have been discovered without knowing the folded structure of the protein target! You don’t need a structure to do an immunization and get usable antibodies.

      So — Google wins the algorithmic computational challenge, pharma loses. The field is advancing! Great. Now hopefully Google will license those algorithms to pharma so they can get a bit of help with one small aspect of the job of drug discovery and development.

    5. Hugo says:

      The theoretical groups in Big Pharma have been just a decoration in their labs for most of their existence. They never represent more than a few out of 100 PhDs in a drug design lab. Nobody risks billions on a computer generated drug solution. So far.

  9. anon the II says:

    Somebody needs to sit c down and explain it to him. It’s the powder that counts. If you ain’t got the powder, you ain’t got doodly squat. I don’t care for his attitude.

    I think the problem here is that we all thought that if you solved the folding problem, then you could solve the folding problem with a small molecule stuck inside using the same methods. We forgot to tell Google about that part. Unfortunately, there’s no database for those pieces of the puzzle.

    1. Some idiot says:

      I had a thought along similar lines…

      Yes, this is a fantastic result. I’ll be honest and say that I thought it would take another 10 years or so to get to this point.

      But (and not belittling this result whatsoever), the really fascinating one is (as you say) predicting how the protein structure will change when you put something in it (or on the surface), and where the “thing” wants to end up. Because this is what really understanding interactions is all about.

      And the thing that would _really_ impress me is when they can do this, and then also deliver _understanding_ at the same time… To me, this is the intellectual goal. It may be that at some level this is “unobtainable” or “not understandable” by our minds. But I hope not…

      Ok, enough pocket philosophy from me for the day…!

  10. DKinDK says:

    Are papers and more test cases coming? I’ve recently started a PhD that involves the homology modeling of a troublesome protein in order to start structure based VS campaigns and I would prefer to just… have the protein structure without beating my head into the computer any longer.

  11. Kevin says:

    Hi Derek,

    How close are we to fine tuning protein structure of enzymes and other important macro molecules? That seems like the next frontier for innovation in biochemical and biotech markets. Does this AI put us closer?

  12. Barry says:

    Anfinsen’s dogma in its simple form is wrong. Chaperones exist (and are vital) because not all proteins will fold correctly if just expressed into the cytosol. Heck, if the dogma were right, a clever annealing sequence could un-cook eggs.

  13. Ken says:

    Does anyone know if these algorithms work incrementally? As I understand it, the protein starts folding as it is emitted from the ribosome, so there might be benefit in handling the problem that way; how do the first ten residues fold, then add the next few to that structure, and so on.

    1. Barry says:

      But of course many proteins undergo post-translational modification. The first ten AAs expressed may be clipped off in the final version. Yes, maybe their role was exactly to initiate the folding in some cases.

  14. just sayin' says:

    if these computers are so smart, why don’t they figure out a cure for covid?

  15. Jim Hartley says:

    The documentary on AlphaGo, constructed by DeepMind (since acquired by Google), is worth watching.

  16. anon says:

    I guess it makes sense that a bunch of chemists and pharma folks would think of the importance of this work in drug discovery and thus discount it, but I don’t think that’s why it’s important. In my experience protein structures have been great to explain what happened, but pretty useless for designing the next drug. What’s important here is gaining understanding of a fundamental biological process (folding) and the understanding it will help unlock about enzyme mechanisms and catalysis. Further, one will be able to model large numbers of structural variants with improved accuracy to decide which ones to make to explore the enzymology. I don’t know whether amazing new applications will follow this technological advance, but if it is like most step changes in technology there will be applications and I won’t successfully predict what they will be. Being able to predict structure with confidence will be, I’m pretty sure, a Big Deal for something. It may or may not be for drug discovery; there’s no evidence that structure determination is or ever has been the rate-limiting step in drug discovery.
