
The Dark Side

Too Much Wasted Time

Here’s a note of warning sounded in a lead editorial in Nature:

. . .I think that, in two decades, we will look back on the past 60 years — particularly in biomedical science — and marvel at how much time and money has been wasted on flawed research. . .

. . .many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known). My generation and the one before us have done little to rein these in.

The author (Dorothy Bishop) has, unfortunately, some very good points. The good news, though (as she mentions) is that reproducibility is finally being taken more seriously. The publication bias referred to is the one against negative results: too many such experiments find no place in the literature. Now, there are of course a lot of reasons for an experiment to fail, many of which might have no relationship to the hypothesis being investigated. But when a well-designed study of reasonable size fails to produce any results (or to confirm an expected result) that would be worth knowing about. The literature as it stands, though, has a pronounced tilt towards positive news, and you never know just how large the invisible halo of dark results might be. And the better and more impressive the journal, the higher that bias probably is.

Low statistical power is easy enough to understand – well, you’d think. But the literature is also full of underpowered experiments that not only can’t really support their own conclusions but probably can’t support any conclusions at all. As Bishop notes, “researchers have often treated statisticians who point this out as killjoys”. Valuable negative results are one thing, but studies that are too small from the start don’t even rise to that level. And it’s important to realize that “too small” is a sliding scale. A study done with six mice might have been far more acceptable if it were done with twenty or thirty, and at the other end, a genome-wide association study across one hundred thousand people could well be too tiny to trust. It’s all about effect size; it always has been. If your sample size is not appropriate for that effect size, you are wasting your time and everyone else’s.
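To put rough numbers on that sliding scale, here is a minimal sketch of the textbook normal-approximation sample-size formula for comparing two group means. The function name `n_per_group` and the defaults (two-sided alpha of 0.05, 80% power) are assumptions chosen for illustration, not anything taken from Bishop's editorial, and an exact t-test calculation would ask for slightly more subjects.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sample comparison.

    effect_size is the standardized difference (Cohen's d): the difference
    in means divided by the common standard deviation. Uses the normal
    approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, from very large to very small.
    for d in (1.5, 0.5, 0.02):
        print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions a very large effect (d = 1.5) needs only a handful of animals per group, a moderate one (d = 0.5) needs dozens, and a tiny one (d = 0.02) needs tens of thousands, which is why the same head count can be ample in one study and hopeless in another.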

P-hacking is another scourge. And sadly, it’s my impression that while some people realize that they’re doing it, others just think that they’re, y’know, doing science and that’s how it’s done. They think that they’re looking for the valuable part of their results, when they’re actually trying to turn an honest negative result into a deceptively positive one (at best) or just kicking through the trash looking for something shiny (at worst). I wasn’t aware of the example that Bishop cites of a paper that helped blow the whistle on this in psychology. Its authors showed that what were considered perfectly ordinary approaches to one’s data could be used to show that (among other things) listening to Beatles songs made the study participants younger. And I mean “statistically significantly younger”. As they drily termed it, “undisclosed flexibility” in handling the data lets you prove pretty much anything you want.
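One form of that flexibility is easy to simulate: measure several outcomes, then report whichever one clears p < 0.05. The sketch below is an illustration under stated assumptions, not the analysis from the paper Bishop cites. Both groups are drawn from the same distribution, so every "hit" is a false positive; the function names, the group size of 20, and the use of a z-test with known variance are all choices made here for simplicity.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def null_p_value(rng, n):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1)."""
    mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
    mean_b = sum(rng.gauss(0, 1) for _ in range(n)) / n
    z = (mean_a - mean_b) / (2 / n) ** 0.5  # known unit variance in each group
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(n_outcomes, n_studies=2000, n=20, alpha=0.05, seed=0):
    """Fraction of null studies that look 'significant' when the analyst
    keeps only the best of n_outcomes independent outcome measures."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(null_p_value(rng, n) for _ in range(n_outcomes))
        if best_p < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        expected = 1 - 0.95 ** k  # analytic rate for k independent tests
        print(f"{k} outcome(s): simulated {false_positive_rate(k):.3f}, "
              f"expected about {expected:.3f}")
```

With a single pre-specified outcome the false positive rate stays near the nominal 5%, but choosing the best of five independent outcomes pushes it past 20% even though nothing real is there to find.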

And the last category, hypothesizing ex post facto, is a related error that not everyone realizes can be an error. To me, that’s a dancing-on-the-edge thing: your results may in fact be telling you something that you didn’t realize and suggesting your next experiment. But if you think that, then you had better set up that next experiment before you think about publishing. Deciding that your first round of results is worth broadcasting, now that you look at them from this nicer angle over here, is almost certainly a mistake. At best, you’re going to be putting true-but-suboptimal stuff into the literature, results that could have been presented in a much more convincing and useful form. And at worst, you’re just trying to turn garbage into gold again. Or not even gold.

The good news, as mentioned, is that people are actually more alert to these problems. The well-publicized efforts to reproduce key studies are a very visible sign, and the ability of sites like PubPeer and other social media outlets to critique papers more quickly and openly is another. The incentives for shoddy work are still there, though, in many cases. The key for longer-term change will be to find ways to reduce those – without that, it’s going to be uphill all the way.

40 comments on “Too Much Wasted Time”

  1. NJBiologist says:

    “A study done with six mice might have been far more acceptable if it were done with twenty or thirty…”

    There’s a transition point somewhere between a dozen and fifteen mice. Once you cross that transition, it’s time to have a thoughtful and serious conversation about your effect size *and your observed variability.* There will be studies that need to cross this threshold, but the wheat:chaff ratio starts to get out of hand at that group size. Some of the effect sizes will, on thoughtful consideration, reflect modest-to-meaningless effects; some of the observed variabilities will reflect experimental issues (uncontrolled variables, maybe?) or other reasons to rethink the design.

    1. Jim Hartley says:

      One of the early demonstrations that the transfer of immune cells could control cancer was this paper in 1971 (open access). From the abstract: “A group of 1125 mice was inoculated …Groups of 25 mice randomly selected from the original 1125 mice …”

      Deckers PJ, Edgerton BW, Thomas BS, Pilch YH. The adoptive transfer of
      concomitant immunity to murine tumor isografts with spleen cells from
      tumor-bearing animals. Cancer Res. 1971 Jun;31(6):734-42. PubMed PMID: 5088484.

      1. NJBiologist says:

        I’m not a cancer expert, but that study sounds overpowered, given that they appear to have been distinguishing between incidence rates of 87-100% (control) and 0-8% (manipulation).

        In fact, regulatory carcinogenicity studies typically have groups of 50 mice/rats. Based on limited reading, it sounds like the rationale is that enumeration by organ/tumor leads to many outliers, even with that group size. (I’ve read two of these studies in my career, and one had to examine the second control group due to the first control group having incidence of tumors in one organ below the lab historical range.)

        I may not have emphasized this enough in my original post: there are valid reasons to go to large group sizes; I’m just arguing for caution before going there.

    2. johnnyboy says:

      Sorry but there’s no such transition point between 12 and 15. 5 mice may be perfectly fine to show an effect, if the effect is large. 20 mice might not be enough. It all depends on effect variability.
      One should be aware that regulatory toxicity studies (the ones that allow a compound to go into human trials) are often run on groups of 3 to 5, and this is considered perfectly acceptable by FDA et al.

      1. NJBiologist says:

        * effect size and variability, not just variability

        I’ve run some of those n=3 to 5 regulatory tox studies. I’ve watched as a n=3 maximum tolerated dose turned into a non-tolerated dose with one or more mortalities at n=5. (Try calculating binomial probabilities for one outcome with an underlying probability of 0.1 with 6 or 10 trials: the predicted outcome is no observations at 3m+3f, and at least one observation at 5m+5f. The range of probabilities over which this happens isn’t huge, but it’s big enough to periodically cause trouble.)

        Look at the raw data for those studies with small numbers, and look at the impact of outliers. Small n studies contribute to the reproducibility issue.
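The binomial arithmetic in the comment above can be checked directly. This is a minimal sketch assuming independent animals that each have the same probability of the adverse outcome; the function names are made up for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least_one(n, p):
    """Probability of seeing one or more events in n independent trials."""
    return 1 - binom_pmf(0, n, p)


def zero_event_crossover(n):
    """Per-animal probability at which 'no events' stops being the more
    likely outcome, i.e. where (1 - p) ** n equals 0.5."""
    return 1 - 0.5 ** (1 / n)


if __name__ == "__main__":
    p = 0.1
    for n in (6, 10):  # 3m+3f versus 5m+5f
        print(f"n = {n}: P(at least one event) = {prob_at_least_one(n, p):.3f}")
    print(f"flip window: {zero_event_crossover(10):.3f} to {zero_event_crossover(6):.3f}")
```

With p = 0.1 the chance of at least one event is a little under one half at n = 6 and about 0.65 at n = 10, so the more likely result flips from "nothing seen" to "something seen". Under these assumptions the flip happens for per-animal probabilities between roughly 0.07 and 0.11.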

  2. Lambchops says:

    On the bright side there is now more of a drive (for clinical trials at least) to clamp down on publication bias by ensuring registered trials publish their results. On the less bright side, although pharma companies and some public institutes are upping their game in this regard there’s still a long way to go as the following articles highlight:

  3. Hap says:

    The hypothesizing after the fact sounds a lot like the “maybe this works in a subset” that sounds like the bane of pharmaceutical studies, and even in those, they bother to get separate data in that subset before they say they have an effect (or rather, mostly before they say it doesn’t work in that subset). It seems like if you have a new hypothesis, you need to test the new hypothesis before you actually claim it to be true. Shouldn’t editors/reviewers/authors already know this?

  4. dearieme says:

    Is it any wonder that public faith in science is in decline?

    I’ve just made that “fact” up; I have no idea whether it is true. But it ought to be; my faith has certainly declined a long way for, probably, three reasons.

    (i) Coming across scientific crooks in my own career, and being warned against other, terribly distinguished, workers as being crooks.

    (ii) Pseudo-science – I have in mind particularly String Theory, the hoped-for solution to the fact (a real one this time) that fundamental physics has been stuck going nowhere since the seventies. Alas, the predictions of String Theory can’t be tested by experiment or observation so it’s not really science at all. I take it that this intellectual adventure has been a waste of some very high grade talent.

    (iii) Fake science. I have two examples in mind. (a) The Goebbels warming scam. (b) The diet/CVD/statins nexus.

    On the other hand there’s still stuff that can take my breath away – recently, for instance, the ability to use ancient DNA to untangle some of human pre-history.

    The great thing to remember, I suppose, is that the lean years may yet be followed by fat years. May be.

    1. tim Rowledge says:

      Yeah, that Goebbels warming thing – bad idea. He’d smell something awful

    2. dearieme says:

      It’s just occurred to me: another cause for optimism. Alzheimer’s is caused by plaque – or not. At least the relevant people seem to be discussing the possibility that they’ve been on the wrong track. I have no idea who’s right but I am encouraged by the fact of investigation and discussion. By contrast, the cholesterol-is-killing-you mob seem determined not to wonder whether they might be wrong.

    3. debunker says:

      “coming across scientific crooks”
      I had the “opportunity” if you could call it that, to play quite a few small startups over the years in pursuit of a stock option payout.
      I can honestly say every one of them was exactly that..nothing more than scientific crooks looking to bilk unsuspecting investors. Almost all of them ended up with class action lawsuits from shareholders and they all had what were called “prominent” people behind them.
      What always astounded me was how stupid all the “esteemed” scientists were that actually bought into the fraud, either that or they were so desperate for a paycheck they’d play along with anything.

      1. Max says:

        I’ve been on the other side of that. I revamped several critical assays at a startup I had joined to ensure we had sufficient effect sizes to move drug candidates forward. That meant assay throughput took a real hit, but I refused to budge. Worse, I realized much too late that I wasn’t supposed to be honest with potential partners, clients etc about the throughput or limitations of the assays when doing the dog and pony. Needless to say, my career then took the hit.

    4. NMH says:

      I thought statins worked to lower heart attacks, doesn’t sound like fake science to me. It’s just that how they work is not clear.

      Global warming is fake? Last I heard the average temps are on the rise. Trying to shade it by branding it with a Nazi? How Donald of you.

      1. AM says:

        Climate data is extraordinarily p-hacked and raw data is typically ‘normalized’ before being released to the public. If you talk to a serious scientist in the field, they’ll tell you people on both ends are crazy and we can’t reliably say anything because the sample size is too small. Climate data (like diet science) is pretty much the poster child of papers making conclusions with data they really shouldn’t.

        1. Professor Electron says:

          Care to post some references to peer-reviewed journals to support your opinions?

  5. anon says:

    A long ago advisor told me “if you think you need statistics what you really need is a better experiment”. That advice has actually held up pretty well over the decades. Almost all of the time crappy looking results that are “statistically significant” are wrong.

    1. Anon3 says:

      Statistically significant has never meant meaningful.

      1. anon says:

        we know that now, it was gospel when I was a kid.

      2. MATHJ says:

        Statistically significant means it is not random.

    2. Diver Dude says:

      If you need statistics, you did the wrong experiment.
      Ernest Rutherford

      1. anon3 says:

        There must be some context for that quote? If you’re dealing with populations, you want some kinda measure for how different the two populations are (or aren’t). For instance, z’ scores for HTS. They can look different to the eye…but how do you quantify which assay condition is better without statistical analysis?

        1. loupgarous says:

          Rutherford’s high standards for proof in his own and his associates’ research may be the context for that quotation. Statistics are generally called for in physics when an hypothesized effect is difficult to distinguish from activity explainable by alternate explanations. His own hallmark observations such as alpha scattering were clear and didn’t require recourse to statistical analysis.

          Rutherford’s record for scientific prophecy was famously called into question by Leo Szilard, who doggedly recruited Britain, then America into a nuclear arms race with Germany based on Szilard’s disagreement with Rutherford that large, industrial-scale use of nuclear energy was “moonshine”.

          The sort of high-energy subatomic research which Rutherford discounted the need for has shown statistically convincing evidence for the existence of the Higgs boson – the nature of that evidence requires statistical analysis. If Rutherford actually said “If your experiment needs statistics, you ought to have done a better experiment” (there’s disagreement on that point), the consensus of subatomic physics begs to disagree.

          1. Barry says:

            maybe someone here remembers the precise quote to the effect that:
            If a Nobel laureate in physics opines something can be done, take him seriously. If he opines something can’t be done, do the experiment anyway?

          2. loupgarous says:

            @Barry That’s close to Arthur C. Clarke’s First Law:

            “When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.”

          3. Chris Phoenix says:

            Arthur C. Clarke: “If an elderly but distinguished scientist says that something is possible, he is almost certainly right; but if he says that it is impossible, he is very probably wrong.”

            (I can’t reply enough levels deep to answer the post I’m responding to.)

  6. Nick K says:

    These problems will be greatly ameliorated or even solved once we can overcome the “publish or perish” culture which reigns in our Universities.

    1. Hap says:

      Yes, but I’m not sure how to do that, since Deans can count much more easily than they can read. (as John Wayne quoted in an earlier thread). I know “lack of imagination” is not a logical counterargument, however.

      Funders can yank on the authors, but giving power to funders over both publishing and funding seems to be concentrating a lot of power in a few hands which probably won’t end well.

      1. loupgarous says:

        There are several fields of scientific endeavour in which the power of large institutional funders to support or deprecate hypotheses dovetails with “publish or perish” to distort and even retard science.

        It’s not a new thing – John Clark, in his history of chemical rocket propulsion Ignition!, talks the chlorine pentafluoride being stalled by the initial refusal of an ARPA project manager to continue funding that work because “Lawton claimed he made ClF5 and we know that’s impossible.” Only what we’d now call “creative research financing” supported work that proved the existence of ClF5, after which

        “Then, about March 1962, Dr. Thompson scraped up some company R and D money, and told Lawton that he’d support two chemists for three months, doing anything that Lawton wanted them to do. Maya was put back on the job, and with Dave Sheehan’s help, managed to make enough “A” to get an approximate molecular weight. It was 127 — as compared with the calculated value of 130.5.

        Armed with this information, Lawton went back to ARPA and pleaded with Dick Holtzman, Mock’s lieutenant. Holtzman threw him out of the office. By this time it was the middle of 1962.

        At this time Lawton had an Air Force research program, and he decided, in desperation, to use their program —and money —to try to solve the problem. The catch was that the AF program didn’t allow for work on interhalogens, but apparently he figured that if he succeeded all would be forgiven. (In the old Royal Spanish Army there was a decoration awarded to a general who won a battle fought against orders. Of course, if he lost it, he was shot.) Pilipovitch was Lawton’s Responsible Scientist by that time, and he put Dick Wilson on the job.

        The next problem was to explain all this to the Air Force. It wasn’t easy. When Rocketdyne’s report got to Edwards Air Force Base in January 1963 the (bleep) hit the fan. Don McGregor, who had been monitoring Lawton’s program, was utterly infuriated, and wanted to kill him —slowly. Forrest “Woody” Forbes wanted to give him a medal. There was a fabulous brouhaha, people were shifted around from one job to another, and it took weeks for things to settle down. Lawton was forgiven, Dick Holtzman apologized handsomely for ARPA and gave Lawton a new contract, and relative peace descended upon the propellant business.”

        I don’t think Dr. Lawton would fare nearly so well these days. Big funders just don’t have the sense of proportion and humor that ARPA’s Dr. Holtzman had back then.

        1. loupgarous says:

          First paragraph ought to have read:

          “It’s not a new thing – John Clark, in his history of chemical rocket propulsion Ignition!, talks about the discovery and reproduction of the synthesis of chlorine pentafluoride being stalled by the initial refusal of an ARPA project manager to continue funding that work because “Lawton claimed he made ClF5 and we know that’s impossible.””

  7. JB says:

    I remember a scientist from NIST who visited our lab once and gave a talk on the incredibly complex task of trying to identify sources of variability in an experiment and how we can write protocols to reduce it. He showed data where a simple MTT cell viability assay was used to test nanoparticle toxicity. The same assay, reagents and cells were sent to labs across the country. The results were shockingly inconsistent. NIST spent a lot of time turning what was once a simple 1 page protocol for an MTT assay into a 7 page protocol. It worked very well at getting more reproducibility for the assay. But if professional scientists can’t even get simple MTT assays to run consistently, then why do we think the vast majority of other biology that is several orders of magnitude more complex will ever be repeatable? An MTT assay should be work for an 8th grader, yet papers these days use cells that might take over a year to engineer and require a volume’s worth of steps to accomplish.

    1. yf says:

      actually MTT is simply a reagent for a read-out. The real variables of the viability assay are cell density, time, culture plate etc….

    2. NJBiologist says:

      “But if professional scientists can’t even get simple MTT assays to run consistently, then why do we think the vast majority of other biology that is several orders of magnitude more complex will ever be repeatable?”

      That’s an excellent question. I don’t have a solid answer, but I’ve wondered if the more complicated systems have more/better/more effective homeostatic mechanisms than cells. As yf notes, there are a lot of details of things like cell passaging that will change results. But as long as you don’t let the colony room get too far off the standard temperatures, the rats will stay pretty close to 37degC, to pick one example.

      Needless to say, this argument doesn’t do anything for you if your line of work involves one of those homeostatic mechanisms….

  8. One issue is that universities (and funders) still do not classify p-hacking, HARKing and outcome switching as research misconduct, even though they harm patients and public health (apart from being scientifically unsound):

    As long as the only response to these practices is tut-tutting, I fear little will change.

    On the positive side, the UK is now planning to monitor whether all trials run in the country are registered and reported:

    Hopefully, other countries will take note.

    1. aairfccha says:

      Without HARK you can very well go on a wild goose chase like Galen’s theory of juices or Aristotle’s laws of motion.

    2. Hap says:

      I think people assume misconduct is akin to actual fraud, rather than bad or self-serving practices, so framing those as scientific misconduct wouldn’t really be accurate. Someone doesn’t need to be evil to not be good at their job. It’s just that the definition of the job changes with viewpoint.

      On the other hand, if someone makes lots of questionable or unreliable papers, there isn’t a particularly good reason to fund them, and if they can’t get funded, they’ll either learn to do better (or to hack better) or go away. I worry about how much power that puts in the hands of funders, but it’s their money, and you get what you pay for and encourage.

  9. debunker says:

    So long as there are gobs of money to be made flipping IPOs with junk science, it will continue.
    Unfortunately the above pretty much describes the majority of the industry.

  10. Emjeff says:

    “And the last category, hypothesizing ex post facto, is a related error that not everyone realizes can be an error. To me, that’s a dancing-on-the-edge thing: your results may in fact be telling you something that you didn’t realize and suggesting your next experiment. But if you think that, then you had better set up that next experiment before you think about publishing.”

    This is exactly where the public health people go off the rails. Few realize that all of the big public health wins (fluoride decreases dental caries, smoking causes cancer, DDT weakens egg shells) were supported by reams of experimental data. Back then, it was well-known that causality is a very tricky thing, I suppose. Now, there are thousands of M.P.H.’s poring through electronic datasets looking for “associations” that they can publish, with the always amusing “We don’t know why x causes y, it needs more study” mantra making everything ok. It’s not ok, because 1) those “more studies” are never done, and 2) this is the reason that no one believes epi studies anymore.

    1. Hap says:

      I think I’m less inclined to believe dietary stuff (other than I’m eating too much and not enough vegetables and too much starch and sugar) because nutrition advice is “sometimes wrong, but always certain”. Too much certainty on stuff which isn’t powerful enough to justify the certainty means that I can’t think you know what should be taken as certain and what shouldn’t (other than “what you’re saying” which doesn’t help).

  11. freezyundead says:

    One thing that gets neglected in all discussions about reproducibility IMHO is the fact that the incentive structure (in academia at least) rewards premature publication, while sound, reproducible work isn’t rewarded at best, or punished at worst.

    Let me explain. The ultimate result of academic science is the publication. We may kid ourselves into believing that it’s actually about understanding nature or promoting knowledge, but if you take a cold hard look at it, it’s just the paper. Publications inform everything, from funding over tenure track over student influx, and indirectly even the odds of future publications. Thus, scientists are incentivised to publish as much as possible, as soon as possible. Anything they can bullsh*t past a journal editor is by definition good enough. Some academics have higher moral standards than others, but ultimately, publish or perish, amirite?

    Once a manuscript can feasibly be negotiated past an editor, any further work devoted, for example, towards ensuring reproducibility has no obvious benefit. At the VERY best, it might open avenues into higher impact journals, but that’s a bit of a stretch.

    On the other hand, what are the odds that someone really replicates the study and calls you out on it? Sure, that happens, but as long as your results LOOK somewhat convincing and don’t draw criticism straight away, there’s a decent chance it’ll be accepted as gospel and the world moves on.

    Now, let’s look at efforts to actually go ahead and replicate someone else’s work to see if it holds water. Unless you are a very rich and bored academic, tell me truly, why would you do that? At best, you can’t reproduce their findings, write up a paper about it, and try and get it published. Not only will such a paper most likely go into a lower impact journal, and won’t give you much visibility; it can be pretty hard to get it published in the first place, as this delightful series of Retraction Watch posts explains:

    At worst you spend money and resources on the replication attempt, learn that yes, indeed, the result checks out. And then what? What are the chances to publish that and translate it into further funding etc?

    I hope you see now how academia (industry may be different to a degree) does nothing to promote reproducibility while doing everything to incentivise hasty, premature, or straight out fraudulent science?

    What can be done about that? Honestly, I don’t know. I think to a certain degree, this is just how it works. We can of course change how research output and quality are measured, move away from taking the peer-reviewed article as the be-all and end-all of scientific progress. We can improve review mechanisms. We can change the way replication studies are handled and valued in the scientific community. Do I have all the solutions? No. Will any of that happen any time soon? Maybe. But until we recognise that a lack of reproducibility is not a bug but a feature of our current system, I don’t think we will make big strides.

  12. Margaret says:

    Academics’ livelihoods are often directly dependent on achieving said publication. At the end of a grueling PhD on the barest excuse for a stipend, a publication in a good peer-reviewed journal can be the difference between a job and unemployment. Funding fewer academics, but paying them better and giving them more secure jobs, may help.
