Does that Gene Sequence Mean What We Think It Means?

As DNA sequencing becomes more and more important in a clinical setting, we’re going to have to make sure that we’re getting the sequences we think we are, and that we know what to do with them. Drug companies are looking to be more precise with their clinical trial populations and physicians are looking to do more personalized medicine, but the data need to be solid, and solidly annotated, for any of these things to work.

Here’s a new paper on this in Genome Medicine, from the (cleverly named) “Genome in a Bottle” (GIAB) initiative, which is trying to set standards for next-generation sequencing. There’s a lot of detail in it that will be of interest to gene sequencing specialists, but the take-home can be summed up in one sentence: “We observe that the accuracy of a variant call depends on the genomic region, variant type, and read depth, and varies by analytical pipeline.”

That’s because getting really accurate sequence data is not exactly a push-the-button item, or at least not yet. The regions we’re most interested in sequencing – the ones that we’re sure code for proteins – are only one or two per cent of the genome. There’s a lot of repetitive stuff and scrambled-looking chunks of old genes, etc. in between those, and the genes themselves are laced with introns, deletions, insertions, and transpositions. And those latter things are just the ones we want to be able to have an exact fix on! Just to make things even more enjoyable, some of the genes that are of the most interest are members of large families of very closely related sequences, and there are (inevitably) errors in some (or all) of the published reference sequences. So being sure that you’re reading the right thing in the right way can be a challenge at this level of detail.

This paper shows both false negatives and false positives in the existing data sets. For example, the BRCA2 gene (famously associated with breast cancer) has a particular variant called rs80359760. It is described in several databases as pathogenic, but based on the GIAB’s consensus sequence, this is not accurate. It’s actually fine, but there may well be patients whose doctors don’t realize that. In general, it appears that a significant number of disease-relevant variations occur outside the really high-confidence parts of the genome, so there’s plenty of work to be done. . .
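
The GIAB sentence quoted above says that accuracy depends on genomic region and variant type. Here is a minimal sketch of what that kind of stratified comparison looks like, assuming a pipeline's calls are checked against a consensus truth set. Everything in it is hypothetical: the positions, the region labels, and the function names are made up for illustration and are not taken from the paper or from any real benchmarking tool. Read depth could be added as another part of the stratum key in the same way.

```python
from collections import defaultdict

# Toy, made-up data: (chrom, pos, ref, alt) -> region label.
# None of these variants or labels come from the GIAB paper.
TRUTH = {
    ("chr1", 100, "A", "G"): "high_confidence",
    ("chr1", 250, "C", "T"): "high_confidence",
    ("chr1", 900, "G", "GA"): "repeat",
    ("chr1", 950, "TCA", "T"): "repeat",   # the toy pipeline misses this one
}
CALLS = {
    ("chr1", 100, "A", "G"): "high_confidence",
    ("chr1", 250, "C", "T"): "high_confidence",
    ("chr1", 900, "G", "GA"): "repeat",
    ("chr1", 975, "A", "C"): "repeat",     # not in the truth set
}


def variant_type(ref, alt):
    """Classify a variant as a single-nucleotide variant or an indel."""
    return "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"


def stratified_accuracy(truth, calls):
    """Return precision and recall per (region, variant type) stratum."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    # Every call is either confirmed by the truth set or a false positive.
    for key, region in calls.items():
        stratum = (region, variant_type(key[2], key[3]))
        counts[stratum]["tp" if key in truth else "fp"] += 1
    # Every truth variant that was not called is a false negative.
    for key, region in truth.items():
        if key not in calls:
            counts[(region, variant_type(key[2], key[3]))]["fn"] += 1
    result = {}
    for stratum, c in counts.items():
        called = c["tp"] + c["fp"]
        real = c["tp"] + c["fn"]
        result[stratum] = {
            "precision": c["tp"] / called if called else None,
            "recall": c["tp"] / real if real else None,
        }
    return result


if __name__ == "__main__":
    for stratum, metrics in sorted(stratified_accuracy(TRUTH, CALLS).items()):
        print(stratum, metrics)
```

On this toy data the high-confidence SNVs come out perfect, while the repeat regions pick up both a missed indel and a spurious SNV, which is the pattern the paper is warning about: a single overall accuracy number would hide where the trouble is.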


14 comments on “Does that Gene Sequence Mean What We Think It Means?”

  1. Don’t dream that the protein coding regions are universally easy either – there are nasty simple sequence repeats (such as the trinucleotide repeats in the Huntington’s Disease coding region), plus a number of genes have very similar paralogs that can throw off variant calling.

  2. SP says:

    Right, here’s another example of a beast (also affects a very small population so fewer samples available):
    Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing.

  3. Anon says:

    This is interesting. Given that many important gene variants occur with very low frequency, just one mistake can make the difference between a correlation being deemed significant or not.

  4. PS says:

    These seem mostly technical issues that can be solved by being more careful in carrying out and interpreting HTS. The annotation errors will eventually (hopefully soon) be flagged. The more serious problem is how we predict the phenotypic outcome of sequence variants and where we search for them. Are silent mutations harmless? How often do missense mutations affect protein function? How about intronic variants? How many variants in regulatory sequences and unannotated exons are we missing by doing exome sequencing rather than whole genome sequencing?

  5. Luysii says:

    Then there’s always the problem of correlating a gene variant with a disease. Now that sequencing is easier to do on large numbers of people, correlation is somewhat easier, but 5 or so years ago, only people with a given disorder were sequenced (i.e. without adequate controls) and a lot of ‘causative mutations’ were found to be anything but that. This was particularly true for ion channel mutations and epilepsy.

    For details please see —

    1. HTSguy says:

      Apparently, commonly recognized pathogenic gene variants are actually less penetrant (i.e. they raise the person’s probability of disease by less) than is commonly assumed:

  6. chiz says:

    And even if we do have accurate sequence data there is still the problem of correctly annotating it. There is a growing body of evidence that we have several tens of thousands and, probably, several hundred thousand non-coding genes. Software doesn’t know how to find most of them. We are almost certainly missing genes for peptides, due to their shortness, and there are hints that the software packages could be missing a few full-blown coding genes too.

  7. Argon says:

    Timely post, coming after the ‘snake oil’ one of yesterday. This is clearly not the same but the promise of genome-based, individual medicine certainly has a long way to go in most cases. And it’s reminiscent of the problem with wildly different gene expression measurements that hit the fan some years ago.

  8. Jam says:

    This is precisely why the FDA prohibited 23andMe from returning “health reports” based on variants: the annotation and interpretation of genetic variants is “unstable” and changes over time.

    There is a female Dutch researcher whose name I can’t recall who gives a periodically updated talk about her experience with 23andMe (she was an early scientific adviser and received the service on a complimentary basis). To paraphrase: one report would indicate that she was at risk of heart disease. Then in an update a few months later heart disease would be nowhere to be found but she was now purportedly at risk of osteoporosis. Then in the next update her “risks” would completely change yet again. The results were obviously unreliable and could not serve as a basis for customers and their doctors to take any action.

    This problem is widely recognized and there are many efforts to address it, both within institutions that want to use genetic testing as a routine part of diagnosis and treatment and in public annotation projects that want to enable the same.

    [Last fall 23andMe was approved to resume providing genetic results on a very restricted basis limited to a small number of inherited diseases for which prospective parents might be carriers. Such tests have been available for a long time elsewhere.]

    [One other qualifier: 23andMe doesn’t perform sequencing or variant calling, which are among the subjects of this post – they do genotyping, which is less open-ended. But interpretation of their genotyping results is subject to the same underlying data issues discussed in the post.]

    1. I think Argon and Jam have good points about the challenges in this area moving forward. I’m a precision medicine fan, but at the same time I think there’s been too much hype and an unfortunate lack of nuance in how the concept of using genetic data to guide behavior and treatment has been presented to the public. There’s been both an overemphasis on the “genes are destiny” meme and a relative lack of attention to the other environmental and historical factors that lead to a person’s current health, and to what genetic penetrance means. Beyond the technical challenges described in this GIAB paper, there are a whole bunch of biological unknowns as well.

      1. UudonRock says:

        Currently I am working in a clinical lab that is researching genetic sequencing on lymphoma patients. There most assuredly are barriers to overcome. There are sequence mutations that we have identified as regular occurrences in specific subsets. For us this is laying the groundwork for a new clinical diagnostic tool, but the intent isn’t to replace existing methods, and certainly not to create a holistic approach to medicine. To the best of our ability we can find the genetic markers that are prerequisite to cancer in anyone, but that doesn’t mean everyone will have it. Currently histology and cytometry are the preferred diagnostic methods for our patients. Neither is perfect by itself and both can miss important markers, or simply fail with Hodgkin’s and Large B-Cell subtypes. The hope is this will augment existing methods to provide a more accurate diagnosis for clinicians and pathologists. We are a very long way away from replacing existing technology or morphologic examination.

    2. Drug Developer says:

      Out of curiosity I did the 23andMe profiling a few years ago, just before they embargoed the prediction part, and I haven’t seen the volatility of predictions over time that you say the Dutch researcher reports. I suppose I only really look at the 31 “4-star” risk findings out of 122 total, and discount all the lower-power findings, though. (For me: 3 Elevated risks, 12 Decreased risks, 16 Normal.)

    3. Andre says:

      The Dutch researcher’s name is Cecile Janssens. She is professor of Translational Epidemiology at the Department of Epidemiology at Emory University, Atlanta, USA. Her research concerns the translation of genomics research to applications in clinical and public health practice. Her work focuses on the prediction of multifactorial diseases (e.g. diabetes, cardiovascular disease, asthma) using genetic risk models and on the assessment of the predictive ability and utility of genetic testing. Here is the link to her home page: , in case you want to read more about her.

      I saw a talk of hers on the very same topic at the 2014 Re(act) Rare Disease Conference in Basel, Switzerland. It was an excellent talk highlighting how time-consuming proper genetic counselling can become, if you take your job and responsibility towards the patient seriously. Keep in mind that most patients typically do not have a degree in a biomedical discipline…

  9. D says:

    And, dare I mention, alternative splicing? There are an awful lot of folk out there who contact database sources and, when they’re told about the various transcripts, ask “yeah, but which is the real one?” Folk are really wedded to the one gene, one transcript, one translation model. And a lot of the interesting stuff is in things like regulatory motifs, regulatory RNAs, etc, quite apart from the genes themselves. What a mess we are!