The Limits of Big Data

I fear that mentioning the phrase “Big Data” in the first sentence of a blog post will make half the potential readers suddenly remember that they have podiatrist appointments or something. But that’s the only way to approach this article at Wired. After all, the title is “The Cure For Cancer is Data – Mountains of Data”.

But this is a more realistic look than most of these articles. The problem is that when we use a term like “Big”, the natural tendency is to think, OK, really really large, got it, and to assume that once you have something that has to be considered really large, you’ve clearly reached the goal and can start getting things to happen. “Really large”, though, is one of those concepts that just keep on going, and our brains are notoriously poor at picturing and manipulating concepts in that range. Here’s what happened over the course of this project:

In their search for these “resilient individuals,” (Eric) Schadt and his team amassed a pool of genetic data from 600,000 people, then the largest such genetic study ever conducted, with data assembled from a dozen sources (23andMe, the Beijing Genomics Institute, and the Broad Institute of MIT and Harvard, most notably). But in searching the 600,000 genomes, the researchers found potentially resilient individuals for only eight of the 170 diseases they were targeting. The study size was too small. By calculating the frequency of the disease-causing mutations in the population, Schadt and his team came to believe that the number of subjects they’d need to be useful wasn’t 600,000—it was more on the order of 10 million.
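For a rough feel of why 600,000 genomes can come up short while ten million might not, here is a back-of-the-envelope sketch. It is not the calculation Schadt's team did: the per-person frequency below is a made-up number chosen purely for illustration, and the function names are hypothetical. It only assumes that a given kind of resilient individual turns up independently at some fixed frequency in the population.

```python
import math


def expected_count(cohort_size: int, frequency: float) -> float:
    """Expected number of qualifying individuals in a cohort."""
    return cohort_size * frequency


def prob_at_least_one(cohort_size: int, frequency: float) -> float:
    """Chance that a cohort contains at least one qualifying individual.

    This is 1 - (1 - f)^N, evaluated with log1p/expm1 so that very small
    frequencies do not get lost to floating-point rounding.
    """
    return -math.expm1(cohort_size * math.log1p(-frequency))


# ASSUMPTION: an illustrative frequency of one in a million.
# This number is NOT taken from the article or from the study.
ASSUMED_FREQUENCY = 1e-6

if __name__ == "__main__":
    for cohort in (600_000, 10_000_000):
        print(
            f"cohort={cohort:>10,}  "
            f"expected={expected_count(cohort, ASSUMED_FREQUENCY):.2f}  "
            f"P(at least one)={prob_at_least_one(cohort, ASSUMED_FREQUENCY):.4f}"
        )
```

Under that assumed frequency, a 600,000-person cohort has a bit less than an even chance of containing a single such person, while a ten-million-person cohort almost certainly contains several. Rarer variants push the required numbers up accordingly.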

Schadt is now founding a company called Sema4 that will try to expand into this level of genomic information, figuring that the number of competitors will be small and that there may well be a business model once they’re up to those kinds of numbers (the data will be free to academic and nonprofit researchers). Handling information on that scale certainly is a problem, but as the article makes clear, the bigger problem is just getting information on that scale. How do you convince ten million people, from appropriately diverse genetic backgrounds, to have their genomes completely sequenced and give them to you?

“There are companies today that claim access to millions of patient records,” Schadt explains. “But from the standpoint of what we intend to do, the data is meaningless. It’s often inaccurate, incomplete, and not easily linked across systems. Plus, that data doesn’t typically include access to DNA or to the genomic data generated on their DNA.” To take the example of the Resilience Project, it wasn’t simply that the universe of data was too small—it was also that the 600,000 genomes were governed under a hash of various consenting arrangements. If something vital was discovered, hundreds of thousands of participants could not be recontacted or tracked, making the data useless from a practical research standpoint.

What the article doesn’t go on to lay out, though, is how all this is going to lead to any cures, for cancer or anything else. That’s actually the hard part; rounding up the ten million genomes will seem comparatively straightforward. One way to go about it is (as described above) to look for people who, from what we know, should have some sort of genomically-driven disease but don’t. What compensatory mutations do they have, and how are these protective? That won’t be easy, because everyone has their own collection of mutations, and there’s no guarantee that any of them will leap out as being biochemically plausible. There’s also a reasonable chance that no single mutation will turn out to be the answer by itself – it may be an ensemble, working together. And what if none of them are the answer? You could be looking at an environmental effect that’s not going to be in the DNA sequence at all, or present very subtly as a sort of bounce-shot mechanism. This is still very valuable work, and you can learn a great deal from “human genetic knockouts” that can’t really be learned any other way, but it’s far from straightforward.

You’re also unlikely to find cancer cures like this, at least not directly. Cancer is a disease of cellular mutations, and it shows up after something, more likely several things, have gone wrong in a single cell. The main way that a person’s background DNA sequence will prove useful is if they have something going on with their DNA repair systems, cellular checkpoints, or the other mechanisms that actually guard against mutations and uncontrolled cell division, and those are almost certainly going to manifest themselves as greater susceptibility to tumor formation. (A number of these mutations are already known.) I’m not aware of any mutations that go the other way and seem to confer a greater resistance to carcinogenesis – finding such things would be rather difficult. The efforts to sequence especially long-lived people are about the best idea I have in that line, and that’s not going to be very straightforward, either, for the reasons mentioned above.

But let’s say that you really do identify Protein X as a possible mechanism to cancel out or ameliorate Disease Y. Things have still only just begun. Now you have to see how possible it is, mechanistically, to target this protein as a therapeutic – how “druggable” it is. Your best hope is that it’s an enzyme or receptor whose lowered activity confers the beneficial effect, because we drug-discovery types are at our best when we’re throwing wrenches into the gears to stop some part of the machinery from working. That we can sometimes do. Making a specific protein work better, on the other hand, is extremely rare. There are a lot of disease-associated proteins that are considered more or less undruggable because they fail this step – or, more accurately, because we fail this step and can’t come up with a way to make anything work.

There’s an early scene in Brideshead Revisited where Charles Ryder, in the army during World War II, is looking at a much younger officer under him named Hooper, finding him a bit baffling and frightening. He tries substituting “Hooper” for the word “Youth” in various slogans and phrases to see if they still hold up – Hooper Hostels, the International Hooper Movement, etc., and finds it a pretty severe test. The equivalent, when you’re hearing about some new technique that could provide breakthroughs in human disease, is to wedge the word “Alzheimer’s” in there, and see if it still makes sense.

It’s a severe test as well. All sorts of genomic searches have been done with Alzheimer’s in mind, and (as far as I know) the main things that have been found are the various hideous mutations in amyloid processing that lead to early-onset disease (see the intro to this paper), and the connection with ApoE4 lipoprotein. Neither of these explains the prevalence of Alzheimer’s in the general population; there is no genetic smoking gun for Alzheimer’s, because if there were one, it would have been found by now. What you can get are some clues. The amyloid mutations are some of the strongest evidence for the whole amyloid hypothesis of the disease, but there’s still plenty of argument about how relevant these are to the regular form of it. Developing animal models of Alzheimer’s based on these mutations has been fraught with difficulty. And the ApoE4 correlation has led to a lot of hypotheses, some of which are difficult or impossible to put to the test, and others that remain unproven over twenty years after the initial discovery.

I’m sure that Eric Schadt and his people have a realistic picture of what they’re up to, but a lot of other people outside of biomedical research might read some of these Big Data articles and get the wrong idea. The point is that Big Data will only help you insofar as it leads to Big Understanding, and if you think the data collection and handling are a rate-limiting step, wait until you get to that one. I know it says that ye shall know the truth, and the truth shall make you free (a motto compelling enough that it’s in the lobby of the CIA’s headquarters), but in this kind of research, it’s more like ye shall sort of know parts of the truth, and they will confuse you thoroughly. It goes on like that for quite a while, usually. Big Data efforts will help, but they will not suddenly throw open the repair manual. There is no repair manual. It’s up to us to write it.

40 comments on “The Limits of Big Data”

  1. AmILloyd says:

    I remember a ‘Big Data’ session in a drug discovery Gordon Conference once. There was a token speaker from IBM who was involved in using supercomputers for crunching data. The first thing he said was, “You don’t have a Big Data problem.” That suddenly burst everyone’s bubble. His point was that airline data, weather data, traffic data, hospital emergency data; all of these are Big Data. Most drug discovery isn’t. Big Data is defined not just by volume but by speed and heterogeneity. For most cases in drug discovery, Big Data has just become a fancy buzzword to impress the investors and public.

    1. AndyD says:

      Very true! “For most cases in drug discovery, Big Data has just become a fancy buzzword to impress the investors and public.” To this I would add that it is also a good buzz term for empty suits and corporate IT gasbags to impress upper management. I know of one large British pharma company where the term “Big Data” has become synonymous with BS because it has been so liberally spouted by such types. I had the dubious pleasure once of listening to a director-level Big Data “expert” spewing about the “four Vs” of Big Data, just Google that term if you wish to know more about these Vs. It was painfully obvious that the same guy knew nothing about drug discovery or IT but was well versed in the required jargon and buzzwords. If “Big Data” is to be of any use in the pharmaceutical R&D setting, then teams will need good scientists, programmers and statisticians. What it certainly doesn’t need is the empty suit types who currently dominate in pharma IT.

      1. Derek Lowe says:

        Oy. IBM seems to be responsible for that “Four V” stuff. Sounds like a Mao-era slogan to me, but a lot of those things tend to hit me that way.

    1. Milkshake says:

      Another problem is GIGO – poor quality of data entering the data system will lead to worthless output. I have a friend who joined a prominent cancer research center that runs lots of clinical trials. Their plan is to create a massive database by entering every bit of patient clinical data that would be searchable against the genetic profiles of the individual patients – the utility of this database for finding the subset of patients having a certain mutation that correlates with good treatment outcome is obvious. But one of the issues that came up was that the people taking the samples for RNA analysis may not fully appreciate how finicky and unstable the material is – it is lots of work that has to be done right, otherwise RNAs degrade and you won’t get useful results. The people who do this work may not be the best paid ones; they are just following a protocol that someone else had set up for them. If the protocol is flawed or not suitable for the particular tissue, the obtained data will be noisy if not meaningless, and would pollute the fancy database.

  2. watcher says:

    Another in-vogue, next-cure-for-everything catchphrase. Yes, it might be useful for some applications or as guidance for new NCEs, but it won’t cure all the ills of individuals, the public, or Pharma.

  3. Emjeff says:

    What will cure cancer is big thinking, not big data.

    1. Kelvin says:

      Make that quality thinking and quality data, but not big anything – apart from big change from what we’re doing now.

  4. Kelvin Stott says:

    Big Data = Big Noise, Small Signal.

    Expect a long and expensive wild goose chase following spurious correlations before people finally wake up.

    1. a nonny mouse says:

      We all seem to think that the bigger the database, the better the understanding, but nobody thinks of the flip side. More sources of data, more confusing conclusions.

      1. Kelvin says:

        Indeed. Also, many more degrees of freedom (n) gives 2^n potential correlations (hypotheses), so a p-value of 0.05 would give 0.05 * 2^n spurious correlations by chance alone. Therefore you would need a p-value of 0.05 / 2^n to get 95% confidence in any one correlation. There just aren’t enough people on the planet to get that.

  5. john adams says:

    Having been the “victim” of an earlier incarnation of Eric’s fantasy that genetics will identify ALL disease targets/cure ALL diseases (i.e., by actually developing, to no good end, modulators of several such targets), all I can say is “good luck” !

  6. ___ says:

    Dan Sarewitz wrote over the summer: “If mouse models are like looking for your keys under the street lamp, big data is like looking all over the world for your keys because you can–even if you don’t know what they look like or where you might have dropped them or whether they actually fit your lock.”

  7. Barry says:

    I’m inclined to stop reading any article as soon as it points toward “the cure for cancer” in the singular. We don’t yet know how many diseases cancer is. The nightmare is that it will turn out to be a large family of individually rare diseases, few of which are common enough to repay a drug discovery/development program.

    1. DanielT says:

      Barry, isn’t all the evidence to date that this is exactly what “cancer” is? About the best we can hope for is we find enough targets and treatments that we can mix-and-match on an individual basis.

  8. Charles U. Farley says:

    Is this the comments section where old med chemists gripe about those kids with their newfangled techniques and different ways of approaching traits, and how they just don’t get it?
    Yes, yes it is! 😉

    1. K. says:

      I suggest any youngsters here study statistics and learn about multiple hypothesis testing before they waste their careers and our budgets chasing big white noise.

      1. Charles U. Farley says:

        Some of us youngsters are biostatisticians. Please. If you don’t know about corrections for multiple tests, you’re not seriously in the big data / genomics / call it whatever you want business!
        Understanding and working with large genomic data sets involves a lot more than lecturing about Bonferroni, Holm, or Hochberg.
        I’m trolling you “old farts” somewhat, obviously. All these big tools are just after the same thing we’re all after- actionable drug targets. New century, new tools every year, same goal at the end of the day.

        1. Dana says:

          “the same thing we’re all after- actionable drug targets”. So let’s start talking about the tools after we get the darn targets; it’s the constant hyping of the tools without actually getting anywhere that’s getting really hard to stomach.

    2. john adams says:

      We spent tens of millions of dollars doing mouse crosses to identify “causal” genes (gene products) for disease, not a SINGLE one of which proved to be causal when evaluated using (in most cases excellent) pharmaceutical tools ! Just sayin’ that I’m NOT convinced we (society) should try again using HUNDREDS of millions of dollars using a slight variation on the theme…

  9. Hap says:

    Getting lots of bad data doesn’t help – even if your methods give reliable results based on input, if much of the data is slapdash (“look at my CV!”) then the results are going to be worthless (or you won’t know what ones are worthless and what ones aren’t). If people were consistently collecting good data, this would be just hard, but it looks worse than that.

    1. Fool Me Once says:

      This is also true for “irrelevant data” or “inadequate data”.

      An infinite supply of answers to other people’s questions offers no guarantee that it contains the answer to your question. The appeal of big data is that, given enough random facts, the answer to every problem can be found. You just have to winnow your way to Revelation. But if the right preliminary questions have yet to be asked, or the right technology has yet to be invented, then the answer to *your* question will not yet exist in any database, big data or no.

  10. steve says:

    This reminds me of the arguments against sequencing the human genome. No one could ever understand it! There are far too many genes for it to ever make sense! There is way too much junk DNA to make it worth sequencing the whole thing! Well, it didn’t cure cancer but it sure advanced the field. Junk DNA turned out not to be junk and there was a whole lot of information in non-coding sequences. Big Data may not cure cancer but I’m sure it will turn up some surprises that wouldn’t have been seen looking at Small Data (which I suppose requires at least a magnifying glass…). Let’s make a deal. We won’t assume that everyone that touts a new field is an idiot if they won’t claim that it will solve all problems. Generally these things follow the Gartner hype cycle and eventually reach a reasonable equilibrium.

    1. K. says:

      Fair enough, deal!

      When one pushes an extreme opinion (overhype or nay-say), I try to push the extreme opposite view, just to strike a balance. Usually the truth lies somewhere in between. 😉

    2. ADCchem says:

      With all the money, time, presentations, publications and general gyrations performed sequencing the DNA of cancer patients have we really learned anything actionable? And one could argue we never will because cancer by definition has hundreds of dependent mutations. I mean you could argue the IHC has had a bigger impact.

    3. Hap says:

      I don’t think I’d say “Don’t do it” so much as “Don’t promise what you can’t deliver.” The problem is that at the moment the deep trawl through the big cancer data pool would cost enough that the only way to free the money to do it is to promise the moon. At some point, though, you run out of honesty credits to spend in this way.

  11. Charles U. Farley says:

    Well said, steve

  12. Big data has the property that, more or less by definition, you can’t understand where the answer came from. You’ve turned some algorithms loose on what is, by definition, too much data to get your hands around.

    The great benefit, to empty suits, is that the algorithms can be tweaked and fixed until they produce the correct answer, and there’s no way to check their work. Big Data! It’s perfect for justifying whatever strategy you’ve already decided on, and there’s always something else to blame the failures on. Unpredictable market forces! Etc.

    Unfortunately, if you’re actually trying to cure disease there is a way to check the work. It’s just no fun for the patients.

    1. tangent says:

      Sounds like you’re conflating a couple of things, or at least I think the distinction is worth more of a look. #1 Whether you overfit the data and get a model that’s spurious. #2 Whether your model is valid within the universe of your data, but it doesn’t translate to real results in the clinic.

      Thing #1 is a pitfall that’s extra risk for “big data” work because of what you mention, the model is allowed to be complex and not human-comprehensible. It’s harder than just saying “evaluate on data that you held out of the training, duh”… but it’s not that much harder. You can manage.

      Thing #2 is not really specific to “bigness”. This comes from problems like, what you’re studying isn’t actually a single thing and you don’t have a handle on it. It’s a mix of a thousand things and your patients are different than your sample. Big data, small data, any kind of data, it’s all useless unless you are measuring something real and repeatable. “GIGO” is a half century old.

      Okay, I will admit, problem #2 does crop up from Big Data because Big Data gives people ambitions. They think they can solve a problem that nobody actually understands well enough, and anyway they don’t need to talk to the domain experts.

    2. D Square says:

      It looks like we are confusing biology research with drug discovery.
      Of course, this is not such a surprise when many organizations have been letting go their more experienced drug hunters.

  13. Josh says:

    There are two different issues discussed in this post. First, in Eric’s case, he is starting with a large amount of data and looking for problems for which some subset of that large amount of data can provide some understanding. There are no experimental design tricks that can be used for an ill-posed problem. His work is exploratory in nature which isn’t a bad thing.

    This is very different from the second issue, which is whether a target, once known, is druggable. There are currently machine learning approaches to efficiently yield answers to the second problem. There are active machine learning approaches which seek to direct experimentation to build very accurate models of experimental outcomes. This is done by iteratively selecting only the most informative experiments. The result is a highly accurate model that can be used to predict which (if any) compounds have desirable performance characteristics, built from relatively few experiments being executed.

    That being said, to determine with confidence which targets are the best to hit for a specific disease is still a very difficult problem, as you have the challenge of mapping experimental results in model systems to the clinical results (which all too often do not match up nicely).

    ***Shameless plug: Active learning for drug discovery is the specialty of my company.***

  14. Scott says:

    “Getting lots of bad data doesn’t help”
    Yeah, I remember one of the previous times someone attempted to apply Big Data to an array of 1000 ‘Known Druglike Chemicals’ that was discussed on this blog.

    The Big Data analysts failed the first rule of statistics: You only get usable data when you compare like with like.

    They didn’t compare similar doses between all the different chemicals. *facepalm* They also didn’t compare an array of doses of each individual chemical. *double facepalm*

    Possibly worst of all, they failed to ensure that what was in the bottle actually matched what was on the label of the bottle, but that’s a different discussion entirely.

  15. Smaller Luke says:

    Big Data has a few well-known limits.

    http://i.imgur.com/k6K1rVG.jpg

    Such as using contractions.

  16. Arne says:

    The problem with big data is that if the effect sizes were big enough to be important, they would be obvious without computers and statistics.

  17. insilicoconsulting says:

    I work with clinical and non clinical big data in my present role. Most data is from insurance claims and EHR. CPRD and the like are decent sources of such data. Indeed, they do not contain genomic data and are expensive to boot.

    There are indeed quality and data access issues, but that does not mean that leveraging big data analytics techniques, e.g. spreading data and computations across many nodes, is not advantageous in many situations. After all, bioinformaticians have been looking at parallel computing for a long time now. This is a similar use case for distributed analysis, provided there is something worth analyzing.

    In the case of SNPs, for example, or just any other genetic variation, if a significant part of the population does not contain a SNP or haplotype then big data approaches can’t solve it for you. But exploring the chemical universe, for example, is a perfectly good big data scenario.

  18. insilicoconsulting says:

    Where big data helps, e.g. in speech recognition, is in understanding nuances such as accents. Statistical models such as Markov chains take you far, but they start failing when it comes to understanding accents. Something similar might be the case for large-scale genomic/proteomic biological data.

  19. Anon says:

    What is the point of fitting more and more variables to more and more data to test more and more potential correlations, when half the raw data can’t be reproduced anyways?

  20. MoMo says:

    My many German mathematician and gene-jockey colleagues once summed up Big Data and even the Human Genome Project in these simple terms:

    Big Data- it is Scheisse!

  21. Walther White says:

    I’ve worked with Big Data before, and found that it was largely GIGO (garbage in, garbage out). Worse, the “garbage” is essentially noise that drowns out any useful data. IMHO, small well-curated and well-validated data sets provide better insights for drug discovery than mountains of, well, crap.

  22. Insofe says:

    Great Article. I think huge database of our data better to find out disadvantages but nobody thinks about it. Huge sources of data more confusing results. I think big data analysis is simple and Big Data efforts will help but not suddenly and requires huge statistical analysis. Yes, it might be useful for Pharmaceutical or public.
