As DNA sequencing becomes more and more important in a clinical setting, we’re going to have to make sure that we’re getting the sequences we think we are, and that we know what to do with them. Drug companies are looking to be more precise with their clinical trial populations and physicians are looking to do more personalized medicine, but the data need to be solid, and solidly annotated, for any of these things to work.
Here’s a new paper on this in Genome Medicine, from the (cleverly named) “Genome in a Bottle” (GIAB) initiative, which is trying to set standards for next-generation sequencing. There’s a lot of detail in it that will be of interest to gene-sequencing specialists, but the take-home can be summed up in one sentence: “We observe that the accuracy of a variant call depends on the genomic region, variant type, and read depth, and varies by analytical pipeline.”
That’s because getting really accurate sequence data is not exactly a push-the-button item, or at least not yet. The regions we’re most interested in sequencing – the ones that we’re sure code for proteins – are only one or two per cent of the genome. There’s a lot of repetitive stuff and scrambled-looking chunks of old genes, etc. in between those, and the genes themselves are laced with introns, deletions, insertions, and transpositions. And those latter things are just the ones we want to be able to have an exact fix on! Just to make things even more enjoyable, some of the genes that are of the most interest are members of large families of very closely related sequences, and there are (inevitably) errors in some (or all) of the published reference sequences. So being sure that you’re reading the right thing in the right way can be a challenge at this level of detail.
This paper shows both false negatives and false positives in the existing data sets. For example, the BRCA2 gene (famously associated with breast cancer) has a particular variant called rs80359760. It is described in several databases as pathogenic, but based on the GIAB’s consensus sequence, this is not accurate. The variant is actually fine, but there may well be patients whose doctors don’t realize that. In general, it appears that a significant number of disease-relevant variations occur outside the really high-confidence parts of the genome, so there’s plenty of work to be done. . .
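For anyone who wants to see what the false-positive/false-negative bookkeeping looks like, here’s a minimal sketch in Python. It compares a set of variant calls against a trusted reference set and breaks the results down by variant type, which is one of the factors the paper says accuracy depends on. To be clear, this is not the GIAB pipeline: the function names, the crude SNV/indel split, and the toy coordinates are all made up for illustration, and it assumes both sets already describe each variant in exactly the same way (in real data, reconciling different representations of the same variant is a big part of the job).

```python
from collections import defaultdict


def variant_type(ref, alt):
    """Crude classification: single-base substitution vs. everything else."""
    return "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"


def stratified_accuracy(truth, calls):
    """Tally TP/FP/FN per variant type and derive precision and recall.

    truth, calls: sets of (chrom, pos, ref, alt) tuples. Matching is exact,
    so both sets are assumed to use the same variant representation.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})

    # A call found in the truth set is a true positive, otherwise a false positive.
    for v in calls:
        outcome = "TP" if v in truth else "FP"
        counts[variant_type(v[2], v[3])][outcome] += 1

    # A truth variant that was never called is a false negative.
    for v in truth - calls:
        counts[variant_type(v[2], v[3])]["FN"] += 1

    report = {}
    for vtype, c in counts.items():
        called = c["TP"] + c["FP"]
        real = c["TP"] + c["FN"]
        report[vtype] = {
            **c,
            "precision": c["TP"] / called if called else None,
            "recall": c["TP"] / real if real else None,
        }
    return report


# Invented toy data: positions and alleles are placeholders, not real variants.
truth = {
    ("chr13", 100, "A", "G"),
    ("chr13", 200, "C", "T"),
    ("chr13", 300, "AT", "A"),
    ("chr13", 400, "G", "GCC"),
}
calls = {
    ("chr13", 100, "A", "G"),
    ("chr13", 200, "C", "T"),
    ("chr13", 300, "AT", "A"),
    ("chr13", 500, "T", "TA"),
}

report = stratified_accuracy(truth, calls)
for vtype in sorted(report):
    print(vtype, report[vtype])
```

In this toy case the substitutions all match, while the indels pick up one false positive and one false negative. The same tally could be split by genomic region or read depth by swapping in a different grouping function.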