Here’s a new preprint that goes a long way toward telling us what we need to know about coronavirus antibodies and Spike protein mutations. It’s from Jesse Bloom’s group at the Fred Hutchinson center in Seattle, and it’s another one of those experiments that you could only do with modern molecular biology (and modern bioinformatics).
This is what I mean by that: as we have all had to learn, the Spike protein is what recognizes and binds to the human ACE2 protein to start off the process of viral infection with this disease. And a particular region of the Spike, the receptor-binding domain (RBD), is the business end of it all. That’s a stretch of about 200 amino acids (depending on where you draw the line), and it’s the exact piece that binds to ACE2. We have a good idea of how it does that (lots of structural biology by this point), but looking at those structures and figuring out what any changes to the amino acid sequence might do, well. . .that’s not so easy. You would be a lot better off with empirical data, which is what this new paper provides.
The authors look at a 201-amino-acid range in the RBD. We already know what the canonical form is like, so that leaves us with 19 other standard amino acids that might be substituted in those positions (obviously, some of them are a lot more likely to arise than others, because of the triplet code from DNA/RNA to amino acids, but in theory you have 19 variations). That gives you 3,819 possible single-point mutant proteins, so they made them all (see appendix below). Well, nearly all – they were able to express 3,804 of them on the surface of yeast cells and determine their binding to ACE2. That’s done via incubation with fluorescently labeled ACE2 protein, cell sorting, and deep sequencing.
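For anyone who wants to check the arithmetic, here is a small sketch of how such a single-point library is counted. It is only an illustration: the ten-residue sequence is a made-up placeholder, not the actual RBD, and the function name `single_point_mutants` is mine, not anything from the paper.

```python
# The 20 canonical amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_point_mutants(wild_type):
    """Yield (position, original, replacement, mutant) for every single substitution.

    Positions are 1-based. Hypothetical helper for illustration only.
    """
    for i, original in enumerate(wild_type):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                mutant = wild_type[:i] + replacement + wild_type[i + 1:]
                yield i + 1, original, replacement, mutant


# Placeholder sequence (an assumption, NOT the real RBD), kept short so it can be listed.
toy_wild_type = "MKTAYIAKQR"
mutants = list(single_point_mutants(toy_wild_type))
print(len(mutants))  # 10 positions x 19 alternatives = 190

# The same count for a 201-residue stretch, as described in the post.
rbd_length = 201
library_size = rbd_length * (len(AMINO_ACIDS) - 1)
print(library_size)  # 3819

# Fraction of the library that was successfully expressed and measured.
print(f"{3804 / library_size:.1%}")  # 99.6%
```

So the 3,804 variants that were measured cover all but 15 of the possible single substitutions.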
The results pass the sanity check (which is good!). For example, mutations that produce a “stop codon” cannot really produce well-folded protein, and so it proved – none of those bound to ACE2. Meanwhile, the mutations with very closely related amino acids (leucine/isoleucine/valine, that sort of thing) clustered together. Most changes are about the same or slightly for the worse, which fits with one’s protein experiences as well, but overall the mutational tolerance is rather high: 46% of the mutants bind to ACE2 at least as well as the wild-type protein. And there are indeed some that bind even more tightly to ACE2.
The data have been aggregated into some useful heat maps and visualizations, and what you can see is that there are some areas that seem very tolerant indeed to mutation, while others are severely constrained (just what you’d expect). On top of that, the group looked at how well these various proteins expressed, and that’s a whole different set of constraints, because some of these mutants have trouble folding well or producing stable proteins once they’re made (see at right, figure from the paper).
Here’s a key point, though: none of those tighter-binding mutants have so far appeared in the wild. Everything that has shown up in sequencing from patients falls into the “same or a bit worse” category, so we can say, thus far, that there does not appear to be selection showing up for tighter-binding (and presumably more infectious) mutant forms of the coronavirus. There’s one (V367F) that appears to be better for protein expression, but it doesn’t seem to be spreading around more because of that. That doesn’t mean it can’t do such a thing, of course – the constantly replicating virus could stumble into a variation that it hasn’t explored yet that provides some kind of real fitness advantage, and that could happen any time. But it doesn’t seem to have happened yet.
And keep in mind, “fitness” is a complicated word. There are a lot of factors at work – the paper specifically mentions the way that these proteins are glycosylated as an example. That’s surely an important process, and one that we don’t understand very well across the mutational landscape. Also remember, there’s more to viral entry than just tight binding to ACE2 – the next step, where the viral envelope and the cell membrane start to fuse, has to happen, too. You could imagine some kinds of tight binding that actually keep that from happening; the mechanism gets stuck at a too-tight stage in the wrong position to move on. So this is a valuable but rough look at the list of mutations, and there will be more as we gain more understanding. On the positive side, the paper took a selection of these RBD proteins and expressed them in a lentivirus model, and the ability of those mutants to infect cells via ACE2 correlated well with the data the primary assay produced. So we’ve at least gotten a look at one important part of the story.
The RBD is of course the target of many neutralizing antibodies against the virus – both the ones that patients are raising themselves from their own immune responses, and the ones that are being selected out as potential monoclonal therapies. The paper notes that the part of the RBD that directly contacts ACE2 is in fact more constrained than most of the rest of the sequence, so that’s good (fewer possibilities for mutational escape), and that none of the antibodies that have been characterized so far have epitopes (binding regions) that are as constrained as the key RBD area, either. That suggests that there’s room for “epitope focusing” to make them even better and less prone to be evaded.
And the RBD region is what almost all of the vaccines are targeting as an antigen to raise an immune response, so this paper has direct bearing on them as well. Perhaps you would want such a vaccine to produce one of the variations that maintain ACE2 binding but have better expression (which likely means better protein stability), for example. Another good idea might be to focus on the RBD regions that are tightly constrained, since immunity raised to those would seem to have the best chance to deal with whatever comes down the chute in the future. This is really valuable work, and part of a huge worldwide effort that has produced an extraordinary amount of information in a very short time. This is how we’re going to beat this thing: sheer knowledge.
Appendix: As mentioned, these are the single-point mutations, one by one. Note that if you want to talk about every possible variant at every position of such a protein, all at the same time, then you’re asking “How many different 201-amino-acid proteins are there using the 20 canonical residues?” That, unfortunately, is 20 to the 201st power, and even with that number – which is Rather Large – we have totally ignored the possibility of multiple three-dimensional folded structures and the related question of varying internal cysteine bridges. At any rate, a hypothetical complete library of merely all the 9-amino-acid proteins (20 to the ninth power) comes to a mere 512 billion compounds (update: I hosed up this number earlier, fixed now). So the “all the 201-AA proteins” library is right out. Where would you put it?
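The numbers in this appendix are easy to reproduce, since Python integers don't overflow. This is just a sketch of the arithmetic, with variable names of my own choosing:

```python
n_residues = 20  # canonical amino acids

# Every possible 9-residue sequence.
nine_mers = n_residues ** 9
print(f"{nine_mers:,}")  # 512,000,000,000

# Every possible 201-residue sequence: too long to print usefully,
# so report how many decimal digits it has instead.
full_library = n_residues ** 201
digits = len(str(full_library))
print(digits)  # 262

# Compare with the single-point library, which changes one position at a time.
single_point_library = 201 * (n_residues - 1)
print(single_point_library)  # 3819
```

A 262-digit number of proteins versus a four-digit one: that is the difference between varying everything at once and varying one position at a time.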