The coronavirus outbreak has been accompanied by a huge amount of sequencing data, as well it should be. Nextstrain.org is a great place to see this in action: region by region, the spread of the infection can be tracked, often with enough detail to say where the virus must have come in from and how many different starting points it’s had. That all depends on how many different strains are detected in the first place, of course, and in regions (like the US!) where we’re not even doing enough basic RT-PCR swab tests to know the prevalence of the virus as a whole, we’re surely missing a lot of information about deeper things like viral sequence, the number of different mutations, and how they’re distributed. GISAID is another large repository of such data, and it’s growing day by day.
As you can see from these sites, there are a lot of random single-nucleotide changes that have popped up. It’s important to realize that overall, the mutation rate of this virus is not particularly high (in line with other coronaviruses, actually). But they do accumulate; there’s a fearful amount of viral replication going on out there, and not all of it goes perfectly. A single patient may have several mutational strains going at the same time as the virus replicates, and there’s been a report of a person who turned out to be infected simultaneously by two strains with different geographic origins, which is bad luck. Take a look at the Nextstrain diversity panel here to get an idea of what’s been found across the sequence (reproduced below). These are total events, the number of times mutations have been seen, and remember, because of the triplet genetic code, some of these single-nucleotide variations are still going to lead to the same amino acid in the resulting proteins:
Here’s a general look at how that genome is organized. ORF (open reading frame) 1a/b (the big orange bar and green bar in the graphic above) encodes nonstructural enzymes that are involved in replication (and in protein processing to enable that replication – this is the polyprotein mentioned in this post). Then you get into some structural pieces: you have the S region, which codes for the notorious Spike protein (the one that gives the virus its distinct appearance and that interacts with human ACE2 for cellular entry); ORF3a, a structural protein that is near the spike and likely modulates the human inflammation response (as it seems to in the earlier SARS coronavirus); the genes for the envelope protein (E), the membrane protein (M), and the nucleocapsid protein (N); an RNA polymerase enzyme; and a few others. You can see above that there are known mutations scattered through the whole sequence, with some spots showing more than others. It’s worth comparing these in terms of entropy: how divergent are these mutations? Here’s that plot:
Note that the baseline goes down a lot; most of the mutations in the first count are trivial ones that don’t even change what amino acid gets coded for, or produce something quite similar. That second-biggest peak, for example, right at the end of ORF1a at the ORF1b border, doesn’t look so impressive when you see the entropy number and take into account what these mutations are coding for. But you can see that there are certainly some entropic spikes, like that big one right in the middle of the S protein. There are several ways to think about this: that’s a position in the RNA sequence that’s more likely to lead to something new, and it may also be that a relatively larger number of different amino acid residues are accommodated in the resulting protein. And since we’re looking at the sequences of viruses that have successfully infected humans, that tells you that most all of the ones we’re seeing must still be functional. So that big diversity line in the middle of the Spike protein may not be coding for a crucial residue for human infection – but if you’re developing a vaccine that sets off antibodies to the Spike protein, you would want to be sure that this diversity doesn’t throw things off, and that those antibodies will still recognize the variations. Overall, the relative lack of diverse mutations in the regions shown (like pretty much all the rest of the Spike) could be a sign that weirdo changes there just aren’t tolerated, and generally don’t produce a viable and/or infectious virus.
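To make the difference between "number of mutation events" and "entropy" a bit more concrete, here is a minimal Python sketch. It assumes that the entropy in question is the ordinary Shannon entropy of the states observed at one position (the choice of natural log is my assumption), and the codon samples are invented for illustration, not taken from any real alignment. Codons are written in the DNA alphabet, as sequence files usually are.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def shannon_entropy(observations):
    """Shannon entropy (natural log) of the states observed at one position."""
    counts = Counter(observations)
    total = sum(counts.values())
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


def codon_site_entropies(codons):
    """Return (codon-level entropy, amino-acid-level entropy) for one site."""
    codon_level = shannon_entropy(codons)
    amino_acid_level = shannon_entropy(CODON_TABLE[c] for c in codons)
    return codon_level, amino_acid_level


# Hypothetical samples: ten sequenced genomes, two of which carry a variant.
silent_site = ["CTT"] * 8 + ["CTC"] * 2    # Leu -> Leu, a synonymous change
changed_site = ["GAT"] * 8 + ["GGT"] * 2   # Asp -> Gly, a nonsynonymous change

print("silent site: ", codon_site_entropies(silent_site))
print("changed site:", codon_site_entropies(changed_site))
```

Both made-up sites show the same amount of variation at the nucleotide level, but only the second one has any entropy left once you translate to amino acids, which is the sort of thing that pulls the baseline down in the second plot.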
The historical example that this inevitably calls to mind is from World War II: mathematician Abraham Wald was given the job of analyzing the patterns of damage seen in returning combat planes, with an eye to where armoring could be improved. The initial idea was that the areas with the most holes were perhaps getting hit often and should be shored up – but Wald pointed out the survivorship bias problem: these places actually indicated where a plane could take damage and still be able to return. He believed that the distribution of projectiles was probably fairly even, meaning that regions on the aircraft where no shell holes were ever detected were probably the crucial ones to armor! (Note: the accounts of this have been embellished over time, but the fundamental story is accurate – see this post at the American Mathematical Society about the math behind Wald’s work, and note especially the postscript). In those plots above, we are seeing the places that you can shoot through the coronavirus genome and still return a working pathogen.
We were just talking about mutations that still produce a functional virus – what about the ones that produce something worse? Worse for us, I mean? That’s where the WWII airplane analogy breaks down a bit; shooting off parts of the wing probably never produced a gain of function. But then combat aircraft are under selection pressure only through human mediation, while viruses are on their own.
We already know that the receptor-binding domain of the Spike protein is one of the more variable regions – that’s the peak that you see in those plots above in the S gene. And that’s because it’s so important for viral entry into cells – without which there is no replication at all. There’s some evidence that this might be undergoing some positive selection; there are several subtle signs in the way these mutations are coming on that make this a possibility (see that last preprint link for more details). And positive selection means what you think it means: a change that is an actual advantage for the virus and gives the new form an edge on the existing strains. In general, we could be uncomfortable with its products. They could tend to the “easier to catch, faster to multiply” end of things, although you could also imagine a “causes less trouble and fewer symptoms” variety also getting selected for, which is what tends to happen in the long run with pathogens (see the discussion of attenuated viruses here). Unfortunately, that latter phenotype can develop through selection both on the pathogen and on the hosts, that is to say us, and we’d rather keep the viral thumb off that scale. But it happens: a good part of the European population descends from people who didn’t die during the Black Death, and you can still see it in their genes.
Tracking viral mutations as an epidemic spreads has its odd features. For one thing, “founder effects” and population bottlenecks are well recognized in all species as having very noticeable evolutionary consequences. If a population grows out from a small cohort or if a once-larger population goes through a contraction to a small number of individuals, the loss of genetic diversity leaves a lasting mark. And that’s pretty much what happens every time a virus jumps from one person to another! As mentioned above, the study of viral genomes is all about survivorship bias. It doesn’t take many viral particles to infect someone under favorable transmission routes (such as, in this case, inhaling a small floating droplet coughed out by someone else). So a virus spreading through a population may be going through a long series of consecutive founder events/bottlenecks, punctuated by bursts of replication once a new host is infected, each time with new possibilities for random-error mutations and selection pressure from the host’s defenses. Factor in the extreme speed with which the viruses can replicate when they get the chance, and you have a literal example of the phrase “evolution in action”, right in front of your eyes. All that bottlenecking can be more of a factor than selection in the host, because the number of variants that show up in a single infected person will surely not all make the leap to the next host.
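For those who like to see this sort of thing run, here is a toy Python simulation of that series of bottlenecks. Everything in it is an assumption made for illustration: the function names, the burst size, the mutation rate, and the bottleneck size are invented numbers, there is no selection at all, and variants are just integer labels. It is a sketch of the idea, not a model of this virus.

```python
import random


def replicate(founders, burst_size, mutation_rate, rng, next_label):
    """Grow a within-host population from the founders.

    Each new particle copies a randomly chosen founder; with probability
    mutation_rate the copy is imperfect and becomes a brand-new variant.
    """
    population = []
    for _ in range(burst_size):
        variant = rng.choice(founders)
        if rng.random() < mutation_rate:
            variant = next_label
            next_label += 1
        population.append(variant)
    return population, next_label


def simulate_chain(n_hosts, bottleneck_size, burst_size, mutation_rate, seed=0):
    """Follow a transmission chain of hosts, one bottleneck per transmission.

    Returns a list of (variants within host, variants transmitted) per host.
    """
    rng = random.Random(seed)
    founders = [0] * bottleneck_size  # the original variant is labelled 0
    next_label = 1
    history = []
    for _ in range(n_hosts):
        population, next_label = replicate(
            founders, burst_size, mutation_rate, rng, next_label
        )
        # Only a handful of particles make the leap to the next host.
        founders = rng.sample(population, bottleneck_size)
        history.append((len(set(population)), len(set(founders))))
    return history


# Hypothetical parameters: 10 hosts, 3 particles transmitted each time.
for within_host, transmitted in simulate_chain(10, 3, 5000, 0.01):
    print(f"variants in host: {within_host:3d}   variants passed on: {transmitted}")
```

The thing to notice is that however many variants turn up inside one host, the number handed on can never exceed the bottleneck size, so most of what arises in any one infection is simply lost at the next transmission.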
Now let’s discuss a new preprint in light of all this. This paper looks in detail at 11 mutated forms of the SARS-CoV-2 virus and goes further to functionally characterize them a bit, such as how easily they infect cells in culture. That’s crucial information, and we’re starting to accumulate enough of it to draw some conclusions. The 11 mutations are shown below – these are laid out just like the Nextstrain graphs above, and now you know your way around the coronavirus genome a bit and can see some things that are going on. (Those who really want to dive into the architecture should go here!) These mutations were determined by deep sequencing with a huge number of reads, which is possible partly because the total viral sequence just isn’t that big (although the coronaviruses have the biggest genomes of all the RNA viruses in general):
These are from patients in Hangzhou fairly early on in the epidemic, which is good. To be honest, we still don’t have as good a profile as we would want of the early mutational diversity of the virus as it got going in Wuhan; people were generally too busy to do a lot of deep sequencing. The 11 mutants shown above were the ones that got further study; the paper identified 33 mutations in all, and it’s notable that 19 of these were still novel as of a March 24th check by the authors in the GISAID database. Among the interesting things in these 11 sequences is that two of the mutants described appear to be foundational to some of the strains now spread across the rest of the world. And take a look at the ZJU-11 genome – it has four mutations just in the ORF7b gene, three of which are consecutive (!). That gene codes for an accessory (non-structural) protein that ends up stuck in the Golgi apparatus of the cell, which for the non-biologists is not pronounced as it is in the Phish song, and I’m not sure if anyone knows quite what the viral protein is doing down there.
This team infected Vero-E6 cells (a standard cell model) with all 11 of the strains above and watched for differences in viral load by RT-PCR, along with electron microscopy of the cells themselves. There were few if any differences at the 1, 2, and 4-hour marks after introduction of the viruses. But at 8 hours, the ZJU-6, ZJU-7, ZJU-9, ZJU-10, and ZJU-11 strains all had a higher viral load as compared to the others. At the 24 hour mark, all of them had a noticeably higher viral load still, except for ZJU-2 and ZJU-7, which had not kept up. The ZJU-10 and ZJU-11 had taken off more drastically than the others by this point. (Update: fixed this paragraph and the one following, since I had mixed the paper’s RT-PCR cycle threshold values liberally with the viral load numbers last night).
ZJU-1, whose sequence fits more with a cluster of mutations found mostly in Europe, had 19 times the viral load of ZJU-2 and ZJU-8, which are more in the Seattle/Washington state clade – these differences were already becoming apparent at 24 hours and were statistically significant (reproducibly so) at 48. And when you compare the top and bottom-performing strains, ZJU-10 had 270 times the viral load of ZJU-2 at the 24-hour mark! So there are noticeable differences in the cell assay, but the question is how these might translate to human infections. The authors note that ZJU-11, the other top performer with one of the highest viral load numbers at 24 and at 48 hours, turned out to be very bad news for the patient that it was isolated from, who tested positive for 45 days (!) and took the longest of all of the patients in this study to be discharged from the hospital. Recall that this one had a heavy mutational signature in the ORF7b gene; trying to see if this sort of thing correlates with slow recovery in humans (and if so, how) would be very worthwhile.
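Since RT-PCR cycle thresholds and viral load numbers are easy to mix up, here is a small Python sketch of how the two relate. It rests on the textbook assumption that the amount of template doubles with every PCR cycle (an efficiency of 1.0), so a lower cycle threshold means more starting material. The function names and the example Ct values are hypothetical and are not taken from the paper; only the 19-fold and 270-fold figures come from the discussion above.

```python
import math


def fold_difference(ct_a, ct_b, efficiency=1.0):
    """Ratio of starting template in sample A relative to sample B.

    Assumes each cycle multiplies the template by (1 + efficiency);
    a lower Ct means the signal crossed the threshold sooner.
    """
    return (1.0 + efficiency) ** (ct_b - ct_a)


def cycles_for_fold(fold, efficiency=1.0):
    """How many cycles of Ct difference correspond to a given fold change."""
    return math.log(fold) / math.log(1.0 + efficiency)


# Hypothetical Ct values, chosen only to show the direction of the effect.
print(fold_difference(ct_a=20.0, ct_b=28.0))

# The fold differences quoted above, expressed as Ct gaps under this assumption.
for fold in (19, 270):
    print(f"{fold}-fold is about {cycles_for_fold(fold):.1f} cycles")
```

Under that perfect-doubling assumption, a difference of a few hundred fold in viral load is only around eight cycles on the RT-PCR readout, which is why swapping the two kinds of numbers makes such a mess.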
So there appear to be real differences in cell infection assay data even when you just look at 11 variants of the coronavirus. One of the references above, posted on bioRxiv back on March 17, had already been looking at this at the zoomed-in level of binding of the Spike protein’s receptor-binding domain to human ACE2 protein. That could be one measure of infectivity, but I would argue that cell infection data are closer to reality (albeit a mixture of different effects). There’s always the question of which cell lines you choose to infect, though, and I suspect we’ll see more investigations along those lines to make sure that we’re not getting fooled that way. You can bet that more attempts to correlate viral sequences with such cell assays (and with patient outcomes) are underway as we speak. We need to know if there are nastier varieties out there and how such things might be spreading, of course, and these data are going to have to inform the research groups working on vaccines and monoclonal antibody treatments. Molecular biology, structural biology, cell biology – these disciplines and more are going to reveal the coronavirus’ secrets and tell us how best to fight back.