Here’s another for the “things we just didn’t realize” file. This article is a nice look at “miniproteins” (also known as micropeptides), small but extremely important species that we’ve mostly missed out on due to both our equipment and our own biases in looking at the data. Other recent overviews are here, here, and here. I should note that the literature on this topic is rather shaggy – it’s been developing for years under a number of different names, some of which are also claimed by other fields of research, and those reviews represent people trying to get a view of the whole landscape.
We’re talking proteins with fewer than 100 amino acids (and all the way down to single digits?), and these were excluded, as genomes began to be sequenced and annotated, from the standard definition of what an open reading frame was. That brings up another distinction for these species: they’re not carved off of some larger known protein, but rather are truly coded for at these short lengths. It seems likely that these things have shown up evolutionarily through mutations that formed stop or start codons somewhere in the genome, with the resulting proteins finding a use and then being conserved. They’re all over the place, genomically, with some of them showing up in regions that didn’t look as though they coded for anything at all. We’re going to have to rethink some of our ideas about what genes look like and just how many of them there are.
But there’s a lot to sort through before we get to that point. Those ORF cutoffs were put in because it was thought that there must be mostly just noise down in those small lengths, and there still is plenty of that. For example, the article mentions that the yeast genome has about 6000 ORFs for proteins of at least 100 residues, but if you open up the criteria to everything below 100, you have 260,000 more (!). It doesn’t seem likely (or even possible) that most of that list is real or functional, but the great majority of those can be illusory and still leave you with a lot of new proteins to look into. Finding and validating these things is not always straightforward: you can do RNA-seq experiments and find a lot of short mRNAs, but not all of those are being turned into proteins. Ribo-seq, where you gum up translation in the cell and look for RNA sequences that are in the act of feeding into the ribosome, is probably stronger evidence, but you can’t count on seeing a particular sequence when you do that, either. Combining such data with LC/MS validation that the proteins really exist, along with looking across species to see if similar things are found in other genomes, gives you more confidence.
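To make the effect of that length cutoff concrete, here is a minimal sketch of a naive ORF scan. It is not the annotation pipeline used for yeast or any other genome: it assumes a simple definition (the first ATG in a frame through to the next in-frame stop codon, on both strands), and the names `find_orfs` and `min_codons` are made up for illustration. The only point is that a single threshold decides what gets counted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Naive six-frame ORF scan (illustrative assumption, not a real annotator).

    An ORF here runs from the first ATG seen in a frame to the next in-frame
    stop codon. Returns tuples of (strand, start, end, n_codons), where
    n_codons excludes the stop and coordinates on the "-" strand are relative
    to the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs


# A toy 5-codon ORF: invisible at the classic cutoff, found once it is lowered.
toy = "ATG" + "GCT" * 4 + "TAA"
print(find_orfs(toy, min_codons=100))  # []
print(find_orfs(toy, min_codons=2))    # [('+', 0, 18, 5)]
```

Run over a real genome with the threshold dropped, a scan like this will return an enormous list of candidates, most of them presumably spurious, which is why the Ribo-seq, mass spec, and conservation evidence matters so much.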
The article goes into a number of examples over the last few years where such proteins have been found as regulatory species, inhibitors of other protein activity, venom components, and more. Their size can make them peculiarly suitable for such functions – just large enough to bind to some structural cleft or surface in larger proteins. Below about 50 amino acids you start to lose the ability to form more complex protein structures, so many of these will probably fall into the “disordered protein” category, which is already pretty large and important.
For a while, it looked like these small proteins were mostly a prokaryotic thing, but by now it seems clear that they’re all over the place and that we just haven’t been paying proper attention to them. You’d figure that there must be miniproteins that are important for binding to RNA species, with implications for cotranslational expression mechanisms, noncoding RNA function, etc. And it wouldn’t surprise me if miniproteins turn out to be involved in intracellular condensate formation and behavior as well, both by such RNA-binding mechanisms and others. (In fact, I think we’re already seeing that, since the P-bodies discussed here as affected by the “NoBody” microprotein are now recognized to be such condensates.) These seem pretty likely to blend into the whole RNA world for these reasons (indeed, the transcripts for a number of miniproteins were at first identified as long noncoding RNAs until they turned out to be not so noncoding).
The whole story illustrates how we don’t find what we’re not looking for. (The recent glyco-RNA discovery is another example of this.) And that means that we have to constantly check our assumptions, particularly the assumption that we know what’s important enough to look for!