Just how are things organized in a living cell? What’s next to what, in three dimensions? That is, of course, a really hard question to answer, but we’re going to have to be able to answer it in a lot of contexts (and at high resolution) if we’re ever going to understand what’s going on down there. There’s a new paper out that has a completely different way of dealing with the problem when it comes to nucleic acid sequences, and it’s very thought-provoking.
Right off the bat, I have to mention something remarkable about it. If you go to the end, where many journals have a section about what contributions the various co-authors made to a manuscript, you will find that Joshua Weinstein of the Broad Institute had the original idea for the research, and then carried out every single experiment in the paper. Now that’s rare! Keep that in mind as we go into just what the work involves; it’s a tour de force of chemical biology.
The technique is named “DNA microscopy”, but you’re probably going to have to expand your definition of microscopy if you’re going to use that phrase. Here’s how it works, illustrated by the first example in the paper. It involves two different types of cells, mixed together in culture: MDA-MB-231 cells that are expressing green fluorescent protein (GFP), and BT-549 cells that are expressing red fluorescent protein (RFP). Now, if you wanted to see these, you’d stick them under a fluorescence microscope, and there they would be: red cells and green ones, clearly distinct. But what if your transcripts or proteins don’t glow? That’s why this GFP/RFP example is the demo; you can use fluorescence as a reality check, but the technique described doesn’t depend on optical methods whatsoever. (Like I said, you’re going to have to use a more inclusive definition of “microscopy”). The hope is that it will distinguish the two fluorescent protein transcripts from two controls (GAPDH and ACTB), which are naturally expressed in both cell lines.
Here goes. You fix and permeabilize the cells, and then generate complementary DNAs (cDNAs) for each of the four transcripts: GFP, RFP, GAPDH, and ACTB. Each of these gets a Unique Molecular Identifier (UMI) attached along with its primers – these UMIs are randomized DNA 29-mers, which is long enough that you can be mathematically sure that each one tagging something in the cell is a unique event. And there are two kinds of cDNA, “beacons” and “targets”. The ACTB cDNAs are the beacons (universally expressed), and the others are the targets (the difference between the two is found in artificial sequence adapters assigned to the primers that anneal to each one). The reason for this will – I think – become clear in a minute.
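As a back-of-the-envelope check on that “mathematically sure” claim, here’s a toy Python sketch (my own illustration, not code from the paper) of drawing randomized 29-mer UMIs and seeing that collisions just don’t happen at realistic scales:

```python
import random

BASES = "ACGT"

def random_umi(length=29):
    """Draw one randomized UMI: a uniform random DNA 29-mer."""
    return "".join(random.choice(BASES) for _ in range(length))

# There are 4^29 ~ 2.9e17 possible 29-mers, so even with millions of
# tagged molecules, the chance of two drawing the same label is
# negligible for all practical purposes.
umis = {random_umi() for _ in range(100_000)}
print(len(umis))  # no collisions: the set keeps all 100,000 entries
```

(The birthday-problem math: with 10^5 labels from 2.9 × 10^17 possibilities, the expected number of collisions is on the order of 10^-8.)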
The next step is overlap-extension PCR, which with the right primers on the right ends will end up splicing (concatenating) two DNA sequences with the insertion of a new stretch of DNA between them. The beacons and targets are designed with those primers so that they will concatenate with each other, and not with themselves: everything the OEPCR gets its teeth into is a beacon-target interaction. And the middle of each of the overlap-extension primers, that new added sequence, has 10 randomized nucleotides in it, so that each new concatenation event gives you a completely new 20-mer sequence in there. That serves as a marker, a Unique Event Identifier (UEI), addition of which gets rid of a lot of potential sources of error and mixed data. When you sequence the concatenated DNA after this OEPCR step, you get sequences that have the unique identifier from the beacon, the unique identifier from the target, and that unique event identifier in between them.
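The structure of a read after this step can be sketched out in code. This is my own schematic simulation of the concatenation, with the lengths taken from the description above (29-mer UMIs, a 20-mer UEI assembled from two 10-nucleotide randomized stretches), not the paper’s actual primer designs:

```python
import random

BASES = "ACGT"

def rand_seq(n):
    """Uniform random DNA n-mer."""
    return "".join(random.choice(BASES) for _ in range(n))

def concatenate(beacon_umi, target_umi):
    """Simulate one overlap-extension event: the beacon and target UMIs
    spliced together with a fresh randomized 20-mer (the UEI) between
    them, 10 random nucleotides contributed by each primer."""
    uei = rand_seq(10) + rand_seq(10)
    return beacon_umi + uei + target_umi

beacon, target = rand_seq(29), rand_seq(29)
read = concatenate(beacon, target)

# Parsing a sequenced read back into its three parts by position:
b, uei, t = read[:29], read[29:49], read[49:]
assert (b, t) == (beacon, target) and len(uei) == 20
```

Every concatenation event mints a new UEI, so even the same beacon–target pair reacting twice gives two distinguishable records.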
What does that give you? Well, here’s the key to the whole idea: the OEPCR reaction is spitting out copies of concatenated DNA, but the chances of it finding anything to splice together (a beacon and a target) depend on the spatial proximity of the two. The number of those UEIs that form depends on co-localization of the beacons and targets, and gives you a readout on the distance between the actual physical points where each of those UMIs began to get amplified. The copied cDNA diffuses out in a three-dimensional cloud, and these clouds overlap with other clouds (beacons and targets) to greater or lesser degrees depending on how close they were to start with. So when you quantify all the UEI sequences that come out of the experiment, you’re getting them in order of how close the original cDNA molecules were before things started getting amplified.
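A crude Monte-Carlo sketch (my own, with made-up parameters, and in one dimension for simplicity) shows why the UEI count reads out distance: model each diffusion cloud as a Gaussian around its source, and count how often a beacon copy and a target copy land close enough to react.

```python
import random

random.seed(0)  # deterministic for illustration

def uei_count(distance, sigma=1.0, trials=20_000):
    """Two Gaussian diffusion clouds (beacon and target copies) centered
    `distance` apart; count the trials where one copy from each cloud
    lands within an arbitrary reaction range of the other."""
    hits = 0
    for _ in range(trials):
        beacon_copy = random.gauss(0.0, sigma)
        target_copy = random.gauss(distance, sigma)
        if abs(beacon_copy - target_copy) < 0.1:  # close enough to concatenate
            hits += 1
    return hits

near, far = uei_count(0.5), uei_count(3.0)
print(near, far)  # nearby clouds yield many times more UEIs
```

With these parameters the close pair racks up UEIs roughly an order of magnitude faster than the distant one, which is exactly the signal the reconstruction feeds on.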
It’s a bit like working out distances and positions from cell-phone towers, or by distributing RFID sensors inside some building and getting a physical location for some labeled person or object by reading off all the distance data. Only you’re doing it at biomolecular scale. I salute Josh Weinstein for thinking of this; it’s not an obvious idea by any means from the front end, only after you hear it explained and think about it for a while. There are certainly proximity-based methods out there in chemical biology (photoaffinity labeling of small molecules, ChIP-Seq for chromatin structure, mass spec crosslinking for protein interactions, etc.), but this is a much finer-grained way of dealing with such information.
The working-out-spatial-positions part is sheer calculation, once you get all that sequence data. It’s a major part of the paper, and a major amount of work, but I’m not going to dive into all the details, since it exceeds my pay grade as a data scientist. What I can say, though, is that the data are first divided into smaller subsets to check how well the experiment worked locally – that is, how well did the information about those UMIs actually get reflected in the UEI sequence data? That analysis then got built on and extended to larger scales, giving you a nonlinear probability function for getting those UEI hits given the three-dimensional arrangements that were present. This is a crucial part of the experiment, of course, and that’s why it was done with GFP and RFP to start with, because you have a completely orthogonal way (fluorescence) to check how the model’s optimization is going. And what emerged pretty quickly was reassuring: the dimensionality of the data on the local scale was low – that is, simple distances and events really did seem to have been encoded into the whole data set, rather than providing some sort of massive gemisch. Their eventual technique, spectral maximum likelihood estimation (sMLE), is based on modeling the chances of each UEI formation as a Gaussian probability based on proximity and reaction rates, with a host of ingenious refinements that I am only marginally qualified to comment on (so I won’t!).
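To make the Gaussian-proximity idea concrete, here’s a stripped-down toy of my own devising – nothing like the full sMLE machinery, which jointly optimizes thousands of positions at once. If the expected UEI count between two points falls off as a Gaussian in their separation, then a single observed count can be inverted straight back into a distance estimate:

```python
import math

# Toy forward model (a simplification, not the paper's full model):
# expected UEIs between a beacon and target a distance d apart.
S = 1.0     # effective diffusion length scale (arbitrary units)
A = 1000.0  # expected UEI count at zero separation

def expected_ueis(d):
    """Gaussian falloff of UEI formation with distance."""
    return A * math.exp(-((d / S) ** 2))

def distance_from_ueis(n):
    """Invert the forward model: estimate distance from a UEI count."""
    return S * math.sqrt(math.log(A / n))

# Round-trip check: simulate counts at known distances, then recover them.
estimates = {d: distance_from_ueis(expected_ueis(d)) for d in (0.3, 1.0, 2.0)}
print(estimates)  # each estimate matches its true distance
```

The real problem is much harder – counts are noisy, amplification rates vary, and all the pairwise distances have to be reconciled into one consistent 3D arrangement – but this is the kernel of the likelihood being maximized.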
The ACTB and GAPDH signals were distributed throughout the data, as they should be, while the GFP and RFP signals were mutually exclusive (as they should be, since they were in totally different cells!) Shown at right is output from the data on a pretty large scale, and you can see that the RFP cells and the GFP cells are clearly distinguished (and note that the actual output is in three dimensions and can be rotated, zoomed in on, etc.) The paper goes on to apply the technique to 20 more transcripts that had been reported as naturally differing between the two cell types, and found that it could indeed recapitulate these differences (and that they correlated with the GFP and RFP data as well).
As I said earlier, this really is an ingenious idea, and it has clear similarities to super-resolution fluorescence microscopy: both techniques are reconstructions of stochastic events that give you a picture exceeding the resolution limits that seemed to be in place before. But they get there by very different means. Here’s the paper:
Optical super-resolution microscopy relies on the quantum mechanics of fluorescent energy decay. DNA microscopy, however, relies entirely on thermodynamic entropy. The moment we tag biomolecules with UMIs in the DNA microscopy protocol, the sample gains spatially stratified and chemically distinguishable DNA point sources. This tagging process thereby introduces a spatial chemical gradient across the sample that did not previously exist. Once these point sources begin to amplify by PCR and diffuse, this spatial gradient begins to disappear. This entropic homogenization of the sample is what enables different UMI diffusion clouds to interact and UEIs to form.
You could call it “diffusion entropy microscopy” too, I guess. But you’re not held back by the physics of light penetrating a sample. There are other advantages: for one, you can pick out different sequence variations (down to single nucleotides!) in the transcripts, via those long unique sequences in the starting cDNAs, giving you a direct imaging window into somatic variation. But the biggest advantage is that the whole thing depends only on techniques that everyone is already doing – PCR and sequencing – and it leverages the huge progress in the speed, efficiency, and cost of those processes. What you need is the software on the back end of the process, to handle all the data you generate, and that you can get right here. Seeing this technology applied to tumor samples, to highly differentiated and organized things like neuronal tissue, to the effects of environmental or pharmacological stresses, et very much cetera, is going to be fun to watch!