Analytical Chemistry

A Completely New Way to Picture DNA in Cells

Just how are things organized in a living cell? What’s next to what, in three dimensions? That is, of course, a really hard question to answer, but we’re going to have to be able to answer it in a lot of contexts (and at high resolution) if we’re ever going to understand what’s going on down there. There’s a new paper out that has a completely different way of dealing with the problem when it comes to nucleic acid sequences, and it’s very thought-provoking.

Right off the bat, I have to mention something remarkable about it. If you go to the end, where many journals have a section about what contributions the various co-authors made to a manuscript, you will find that Joshua Weinstein of the Broad Institute had the original idea for the research, and then carried out every single experiment in the paper. Now that’s rare! Keep that in mind as we go into just what the work involves; it’s a tour de force of chemical biology.

The technique is named “DNA microscopy”, but you’re probably going to have to expand your definition of microscopy if you’re going to use that phrase. Here’s how it works, illustrated by the first example in the paper. It involves two different types of cells, mixed together in culture: MDA-MB-231 cells that are expressing green fluorescent protein (GFP), and BT-549 cells that are expressing red fluorescent protein (RFP). Now, if you wanted to see these, you’d stick them under a fluorescence microscope, and there they would be: red cells and green ones, clearly distinct. But what if your transcripts or proteins don’t glow? That’s why this GFP/RFP example is the demo; you can use fluorescence as a reality check, but the technique described doesn’t depend on optical methods whatsoever. (Like I said, you’re going to have to use a more inclusive definition of “microscopy”.) The hope is that it will distinguish the two fluorescent proteins from two controls (GAPDH and ACTB), which are naturally expressed in both cell lines.

Here goes. You fix and permeabilize the cells, and then introduce complementary DNAs (cDNAs) for each of the four proteins: GFP, RFP, GAPDH, and ACTB. Each of these has a Unique Molecular Identifier (UMI) with primers on it as well; these UMIs are randomized DNA 29-mers, and that’s long enough that you can be mathematically sure that if one of them binds to something in the cell, it’s a unique event. And there are two kinds of cDNA, “beacons” and “targets”. The ACTB cDNAs are the beacons (universally expressed), and the others are the targets (the difference between the two is found in artificial sequence-adapters assigned to the primers that are annealing to each one). The reason for this will (I think) become clear in a minute.
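To put a rough number on that "mathematically sure" claim, here is a minimal sketch using the standard birthday-problem approximation. The molecule counts are made-up illustrations, not figures from the paper, and the function names are my own.

```python
import math

def umi_space(length: int = 29) -> int:
    """Number of distinct random DNA sequences of a given length (4 bases per position)."""
    return 4 ** length

def collision_probability(n_molecules: int, length: int = 29) -> float:
    """Birthday-problem approximation: chance that any two of n random UMIs are identical."""
    space = umi_space(length)
    # 1 - exp(-n(n-1) / (2N)); expm1 keeps precision when the probability is tiny
    return -math.expm1(-n_molecules * (n_molecules - 1) / (2 * space))

print(f"{umi_space():.3e} possible 29-mers")
for n in (10**4, 10**6):  # hypothetical numbers of tagged molecules
    print(n, collision_probability(n))
```

With about 2.9 x 10^17 possible 29-mers, even a million tagged molecules leave the odds of a single repeated UMI in the parts-per-million range, under the assumption that the sequences really are uniformly random.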

The next step is overlap-extension PCR (OEPCR), which with the right primers on the right ends will end up splicing (concatenating) two DNA sequences with the insertion of a new stretch of DNA between them. The beacons and targets are designed with those primers so that they will concatenate with each other, and not with themselves: everything the OEPCR gets its teeth into is a beacon-target interaction. And the middle of each of the overlap-extension primers, that new added sequence, has 10 randomized nucleotides in it, so that each new concatenation event gives you a completely new 20-mer sequence in there. That serves as a marker, a Unique Event Identifier (UEI), addition of which gets rid of a lot of potential sources of error and mixed data. When you sequence the concatenated DNA after this OEPCR step, you get sequences that have the unique identifier from the beacon, the unique identifier from the target, and that unique event identifier in between them.
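As a sketch of what that sequencing readout amounts to, here is a toy parser. It assumes a simplified read layout (29-base beacon UMI, 20-base UEI, 29-base target UMI, back to back with no adapters or gene sequence), which is an illustration only and not the paper's actual read structure; the function names are hypothetical.

```python
from collections import defaultdict

UMI_LEN, UEI_LEN = 29, 20  # lengths taken from the description above

def split_read(read: str) -> tuple[str, str, str]:
    """Split a simplified read into (beacon UMI, UEI, target UMI).

    Assumed toy layout: the three fields sit back to back, nothing else in the read.
    """
    if len(read) != 2 * UMI_LEN + UEI_LEN:
        raise ValueError("unexpected read length for the toy layout")
    beacon = read[:UMI_LEN]
    uei = read[UMI_LEN:UMI_LEN + UEI_LEN]
    target = read[UMI_LEN + UEI_LEN:]
    return beacon, uei, target

def count_events(reads):
    """Count distinct UEIs per (beacon, target) pair.

    Reads sharing a UEI are treated as PCR copies of one concatenation event.
    """
    ueis = defaultdict(set)
    for read in reads:
        beacon, uei, target = split_read(read)
        ueis[(beacon, target)].add(uei)
    return {pair: len(seen) for pair, seen in ueis.items()}
```

The point of counting distinct UEIs rather than raw reads is that amplification copies of a single event collapse to one, so the count reflects how many separate beacon-target encounters took place.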

What does that give you? Well, here’s the key to the whole idea: the OEPCR reaction is spitting out copies of concatenated DNA, but the chances of it finding anything to splice together (a beacon and a target) depend on the spatial proximity of the two. The number of those UEIs that form depends on co-localization of the beacons and targets, and gives you a readout on the distance between the actual physical points where each of those UMIs began to get amplified. The copied cDNA diffuses out in a three-dimensional cloud, and these clouds overlap with other clouds (beacons and targets) to greater or lesser degrees depending on how close they are to start with. So when you quantify all the UEI sequences that come out of the experiment, you’re getting them in order of how close the original cDNA molecules were before things started getting amplified.
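Here is a minimal sketch of that distance-to-count relationship, assuming each diffusion cloud is an isotropic Gaussian with the same width. Under that assumption the overlap of two clouds falls off as exp(-d^2 / (4 sigma^2)). The `sigma` and `rate` values are arbitrary placeholders, not parameters from the paper.

```python
import math

def expected_ueis(distance: float, sigma: float = 1.0, rate: float = 1000.0) -> float:
    """Expected UEI count for a beacon and a target `distance` apart.

    Assumes two isotropic Gaussian clouds, each with standard deviation `sigma`;
    `rate` is the (hypothetical) count for two molecules at the same point.
    """
    return rate * math.exp(-distance ** 2 / (4 * sigma ** 2))

def distance_from_count(count: float, sigma: float = 1.0, rate: float = 1000.0) -> float:
    """Invert the toy model: estimate separation from an observed UEI count."""
    if not 0 < count <= rate:
        raise ValueError("count must be in (0, rate] for this toy model")
    return 2 * sigma * math.sqrt(math.log(rate / count))

for d in (0.0, 1.0, 2.0, 4.0):
    print(d, round(expected_ueis(d), 1))
```

Real data are noisier than this, of course, since each count is a random draw and many pairs constrain each position at once, which is where the heavier statistics come in.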

It’s a bit like working out distances and positions from cell-phone towers, or by distributing RFID sensors inside some building and getting a physical location for some labeled person or object by reading off all the distance data. Only you’re doing it at biomolecular scale. I salute Josh Weinstein for thinking of this; it’s not an obvious idea by any means from the front end, only after you hear it explained and think about it for a while. There are certainly proximity-based methods out there in chemical biology (photoaffinity labeling of small molecules, ChIP-Seq for chromatin structure, mass spec crosslinking for protein interactions, etc.), but this is a much finer-grained way of dealing with such information.

The working-out-spatial-positions part is sheer calculation, once you get all that sequence data. And it’s a major part of the paper, and a major amount of work, but I’m not going to dive into all the details of it, since it exceeds my pay grade as a data scientist. What I can say, though, is that the data are first divided into smaller subsets to check how well the experiment worked locally – that is, how well did the information about those UMIs actually get reflected in the UEI sequence data? That analysis then got built on and extended to larger scales, giving you a nonlinear probability function of getting those UEI hits given the three-dimensional arrangements that were present. This is a crucial part of the experiment, of course, and that’s why it was done with GFP and RFP to start with, because you have a completely orthogonal way (fluorescence) to check how that model’s optimization is going. And what emerged pretty quickly was reassuring: that the dimensionality of the data on the local scale was low – that is, simple distances and events really did seem to have been encoded into the whole data set, rather than providing some sort of massive gemisch. Their eventual technique, spectral maximum likelihood estimation (sMLE), is based on modeling the chances of each UEI formation as a Gaussian probability based on proximity and reaction rates, with a host of ingenious refinements that I am only marginally qualified to comment on (so I won’t!)
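To be clear, what follows is not the paper's sMLE. It is only a toy illustration of the likelihood idea underneath it: score a candidate arrangement of beacons and targets by how probable the observed UEI counts would be, assuming Poisson-distributed counts with a Gaussian proximity term. The positions, counts, `sigma`, and `rate` are all invented for the example.

```python
import math
from itertools import product

def log_likelihood(counts, beacons, targets, sigma=1.0, rate=100.0):
    """Poisson log-likelihood of UEI counts given candidate 3D positions.

    counts[i][j] is the number of UEIs linking beacon i and target j.
    The expected count is rate * exp(-d^2 / (4 sigma^2)), an assumed toy model.
    """
    total = 0.0
    for i, j in product(range(len(beacons)), range(len(targets))):
        d2 = sum((a - b) ** 2 for a, b in zip(beacons[i], targets[j]))
        log_lam = math.log(rate) - d2 / (4 * sigma ** 2)  # log of expected count
        k = counts[i][j]
        total += k * log_lam - math.exp(log_lam) - math.lgamma(k + 1)
    return total

# Made-up example: two beacons, two targets, counts roughly matching this layout
beacons = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
targets = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
counts = [[78, 29], [37, 61]]

swapped = [targets[1], targets[0]]  # a wrong guess: the two targets exchanged
print(log_likelihood(counts, beacons, targets))
print(log_likelihood(counts, beacons, swapped))
```

The layout that generated the counts scores higher than the swapped one, and an actual reconstruction searches over all positions for the arrangement that maximizes a score of this general kind.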

The ACTB and GAPDH signals were distributed throughout the data, as they should be, while the GFP and RFP signals were mutually exclusive (as they should be, since they were in totally different cells!) Shown at right is output from the data on a pretty large scale, and you can see that the RFP cells and the GFP cells are clearly distinguished (and note that the actual output is in three dimensions and can be rotated, zoomed in on, etc.) The paper goes on to apply the technique to 20 more transcripts that had been reported as naturally differing between the two cell types, and found that they could indeed recapitulate these (and that they correlated with the GFP and RFP data as well).

As I said earlier, this really is an ingenious idea, and it has similarities to super-resolution fluorescence microscopy (in that both techniques are reconstructions of stochastic events that give you a picture in the end that exceeds the resolution limits that seemed in place before). But they get there by different means. Here’s the paper:

Optical super-resolution microscopy relies on the quantum mechanics of fluorescent energy decay. DNA microscopy, however, relies entirely on thermodynamic entropy. The moment we tag biomolecules with UMIs in the DNA microscopy protocol, the sample gains spatially stratified and chemically distinguishable DNA point sources. This tagging process thereby introduces a spatial chemical gradient across the sample that did not previously exist. Once these point sources begin to amplify by PCR and diffuse, this spatial gradient begins to disappear. This entropic homogenization of the sample is what enables different UMI diffusion clouds to interact and UEIs to form.

You could call it “diffusion entropy microscopy” too, I guess. But you’re not held back by the physics of light penetrating a sample. There are other advantages: for one, you can pick out different sequence variations (down to single nucleotides!) in the transcripts, via those long unique sequences in the starting cDNAs, giving you a direct imaging window into somatic variation. But the biggest advantage is that the whole thing just depends on techniques that everyone is already doing (PCR and sequencing), and it leverages the huge progress in the speed, efficiency, and cost of those processes. What you need is the software on the back end of the process, to handle all the data you generate, and that you can get right here. Watching this technology get applied to tumor samples, to highly differentiated and organized things like neuronal tissues, to the effects of environmental or pharmacological stresses, et very much cetera, is going to be fun!


14 comments on “A Completely New Way to Picture DNA in Cells”

  1. luysii says:

    It is a remarkable achievement. But that still doesn’t get my chemist’s head around what’s going on in the nucleus with its 10 micron (100,000 Angstroms) diameter.

    It’s even more remarkable when you try to put the nucleus into terms physically relevant to you.

    The nucleus is a very crowded place, filled with DNA, proteins packing up DNA, proteins patching up DNA, proteins opening up DNA to transcribe it etc. Statements like this produce no physical intuition of the sizes of the various players (to me at least). How do you go from the 1 Angstrom hydrogen atom, the 3.4 Angstrom thickness per nucleotide (base) of DNA, the roughly 20 Angstrom diameter of the DNA double helix, to any intuition of what it’s like inside a spherical nucleus with a diameter of 10 microns?

    How many bases are in the human genome? It depends on who you read — but 3 billion (3 * 10^9) is a lowball estimate — Wikipedia has 3.08, some sources have 3.4 billion. 3 billion is a nice round number. How physically long is the genome? Put the DNA into the form seen in most textbooks — e.g. the double helix. Well, an Angstrom is one ten billionth (10^-10) of a meter, and multiplying it out we get

    3 * 10^9 (bases/genome) * 3.4 * 10^-10 (meters/base) = 1 (meter).

    The diameter of a typical nucleus is 10 microns (10 one millionths of a meter == 10 * 10^-6 = 10^-5 meter). So we’ve got to fit the textbook picture of our genome into something 1/100,000 its length. We’ll definitely have to bend it like Beckham.

    As a chemist I think in Angstroms, as a biologist in microns and millimeters, but as an American I think in feet and inches. To make this stuff comprehensible, think of driving from New York City to Seattle. It’s 2840 miles or 14,995,200 feet (according to one source on the internet). Now we’re getting somewhere. I know what a foot is, and I’ve driven most of those miles at one time or other. Call it 15 million feet, and pack this length down by a factor of 100,000. It’s 150 feet, half the size of a (US) football field.

    Next, consider how thick DNA is relative to its length. 20 Angstroms is 20 * 10^-10 meters or 2 nanoMeters (2 * 10^-9 meters), so our DNA is 500 million times longer than it is thick. What is 1/500,000,000 of 15,000,000 feet? Well, it’s 3% of a foot which is .36 of an inch, very close to 3/8 of an inch. At least in my refrigerator that’s a pair of cooked linguini twisted around each other (the double helix in edible form). The twisting is pretty tight, a complete turn of the two strands every 35.36 angstroms, or about 1 complete turn every 1.5 thicknesses, more reminiscent of fusilli than linguini, but fusilli is too thick. Well, no analogy is perfect. If it were, it would be a description. One more thing before moving on.

    How thinly should the linguini be sliced to split it apart into the constituent bases? There are roughly 6 bases/thickness, and since the thickness is 3/8 of an inch, about 1/16 of an inch. So relative to driving from NYC to Seattle, just throw a base out the window every 1/16th of an inch, and you’ll be up to 3 billion before you know it.

    You’ve been so good following to this point that you get tickets for 50 yardline seats in the superdome. You’re sitting far enough back so that you’re 75 feet above the field, placing you right at the equator of our 150 foot sphere. The north and south poles of the sphere are over the 50 yard line, halfway between the two sides. You are about to watch the grounds crew pump 15,000,000 feet of linguini into the sphere. Will it burst? We know it won’t (or we wouldn’t exist). But how much of the sphere will the linguini take up?

    The volume of any sphere is 4/3 * pi * radius^3. So the volume of our sphere of 10 microns diameter is 4/3 * 3.14 * 5 * 5 * 5 = 523 cubic microns. There are 10^18 cubic microns in a cubic meter. So our spherical nucleus has a volume of 523 * 10^-18 cubic meters. What is the volume of the DNA cylinder? Its radius is 10 Angstroms or 1 nanoMeter. So its volume is 1 meter (length of the stretched out DNA) * pi * 10^-9 * 10^-9 square meters = 3.14 * 10^-18 cubic meters (or 3.14 cubic microns, since a cubic micron is 10^-6 * 10^-6 * 10^-6 = 10^-18 cubic meters).

    Even though it’s 15,000,000 feet long, the volume of the linguini is only 3.14/523 of the sphere. Plenty of room for the grounds crew who begin reeling it in at 60 miles an hour. Since they have 2840 miles of the stuff to reel in, we’ll have to come back in a few days to watch the show. While we’re waiting, we might think of how anything can be accurately located in 2840 miles of linguini in a 150 foot sphere.

    1. Great reply. As a biologist/bioinformatician I would recommend reading the extant literature on the highly regulated 3D genome architecture, 4D Nucleome and the use of methods such as Hi-C (chromosome conformation capture). I question the stochastic nature of these interactions.

      1. luysii says:

        Gerald Higgins: Thanks! It was fun writing it.

        The post is the first in a series of 6 on the subject dealing with the nucleus — later posts deal with nucleosomes, pol II, untwisting the linguini to transcribe it etc. etc.

        Here is a link to the second post in the series — they are all linked

        Each post has a link to the next

  2. Anon says:

    Author Contributions

    J.A.W. conceived the project. J.A.W. performed all experiments and analysis. J.A.W., A.R., and F.Z. helped guide the project and wrote the manuscript.
    Declaration of Interests

    The authors are co-inventors on patent applications filed by the Broad Institute related to this work…

    So one guy has the idea and does all the work and analysis, while these other guys who made next to no intellectual contribution share in the IPR as co-inventors? I would say the fact that they co-authored a paper which states that they did next to nothing proves that they *cannot* be counted as co-inventors!

    1. Scott says:

      It’s possible that the idea for how to do the process may have come from one of the co-authors, and the other one did a lot of translating from whatever JAW wrote into a more easily understandable thing for the paper. JAW just did the legwork on the experiments proper. Or, one of the coauthors did a lot of the math. Something like that.

      There’s nothing wrong with including someone as a co-inventor when you wouldn’t have had the idea without their input. In fact, I’d argue that they SHOULD be recognized as co-inventor if you wouldn’t have had the idea without their input!

      As it is, Derek’s description is something I can just barely wrap my brain around, up until that line about ‘finding position from cellphone towers’ and then it all clicks.

      1. Derek Lowe says:

        Well, that last line makes me happy!

    2. Anonymous says:

      The issue of inventorship is rarely raised by the USPTO and then only in an odd circumstance. Real issues of inventorship are most commonly settled in court.

  3. Jonathan says:

    Really cool idea! So, anything that can be tagged with a DNA/RNA transcript can be visualized? It will be interesting to watch this technology evolve and see where it really excels over immunofluorescence.

  4. Anonymous says:

    Just a lowly chemist trying to understand this in a very short time and incomplete read. The DNA 29-mers are being covalently labeled with fluorescent proteins to create DNA-protein chimeric DNA-GFP DNA-RFP … kinds of things? That labeling doesn’t persist beyond the first doubling (DNA is replicated, not the protein label).

    I better understand a stretch of DNA linked to the GENE for GFP, but that isn’t detectably green until the GFP protein is expressed. If expressed GFP is what I am seeing in the fluorescence microscope what is that telling me about the location of the DNA in the cell itself?

    If it’s supposed to be informing about the location of the DNA within the cell, which is bigger and more influential to the diffusion / dynamics? The DNA or the actual protein tag? A 29-mer DNA is around 29 x 660 g/mol = 19,140 Daltons. GFP is ~27,000 Daltons. Or are the beacons / targets fusing with much larger pieces of DNA?

    Are they really trying to spatially localize DNA within a single cell or just showing that they can label DNA in a mixed population of cells and move on from there? Perhaps to localizing specific DNAs in a population of very similar cells (cancerous and non-cancerous).

    Maybe I should stick to chemistry. 🙂

    1. Gareth says:

      No, the technique is all about localizing the transcribed RNA (not DNA in this case, despite the name) for the proteins, and using the concatenated sequences to infer their positions relative to nearby RNA molecules. The colors are just a way to verify the results from the sequencing data. It’s a way to ask “which individual RNA molecules are close to each other” on a large enough scale to build a map.

  5. charlesj says:

    It seems somewhat related to the idea of using DNA barcodes to map brain connectivity, which I think goes back to Zador et al. 2012. But the idea of using the concentration gradients to measure distances via probability of ligation is new (at least to me) and really cool.

  6. T says:

    “then introduce complementary DNAs (cDNAs) for each of the four proteins”. This is confusing; what does cDNA have to do with the proteins? Shouldn’t that be “complementary DNAs (cDNAs) for each of the four genes”?

    1. SP123 says:

      Right, this is a big distinction: the technique localizes the transcripts, not the proteins. AFAICT it ignores the biology of protein trafficking and localization. Wouldn’t you expect the mRNA to be mostly around the ER and ribosomes? It’s useful for distinguishing positions of cell types with different transcript signatures, but does it tell you much about the subcellular structure?

Comments are closed.