Here’s a note of warning sounded in a lead editorial in Nature:
. . .I think that, in two decades, we will look back on the past 60 years — particularly in biomedical science — and marvel at how much time and money has been wasted on flawed research. . .
. . .many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known). My generation and the one before us have done little to rein these in.
The author (Dorothy Bishop) has, unfortunately, some very good points. The good news, though (as she mentions), is that reproducibility is finally being taken more seriously. The publication bias referred to is the one against negative results: too many such experiments find no place in the literature. Now, there are of course a lot of reasons for an experiment to fail, many of which might have no relationship to the hypothesis being investigated. But when a well-designed study of reasonable size fails to produce any results (or to confirm an expected result), that would be worth knowing about. The literature as it stands, though, has a pronounced tilt towards positive news, and you never know just how large the invisible halo of dark results might be. And the better and more impressive the journal, the higher that bias probably is.
Low statistical power is easy enough to understand – well, you’d think. But the literature is also full of underpowered experiments that not only can’t really support their own conclusions but probably can’t support any conclusions at all. As Bishop notes, “researchers have often treated statisticians who point this out as killjoys”. Valuable negative results are one thing, but studies that are too small from the start don’t even rise to that level. And it’s important to realize that “too small” is a sliding scale. A study done with six mice might have been far more acceptable if it were done with twenty or thirty, and at the other end, a genome-wide association study across one hundred thousand people could well be too tiny to trust. It’s all about effect size; it always has been. If your sample size is not appropriate for that effect size, you are wasting your time and everyone else’s.
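To put rough numbers on that sliding scale, here’s a quick back-of-the-envelope sketch (mine, not from Bishop’s editorial) using the power calculator in statsmodels for a plain two-group comparison. It just shows how steeply the required group size climbs as the expected effect shrinks:

```python
# Required sample size per group for a two-sample t-test at 80% power and
# alpha = 0.05, as the expected effect size (Cohen's d) gets smaller.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for d in (0.8, 0.5, 0.2):  # large, medium, small effects
    n_per_group = power_calc.solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"Cohen's d = {d}: about {n_per_group:.0f} per group")
```

The large-effect case comes out somewhere in the mid-twenties per group, and the small-effect case close to four hundred. Six animals per arm only buys you 80% power if the effect you’re chasing is enormous.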
P-hacking is another scourge. And sadly, it’s my impression that while some people realize that they’re doing it, others just think that they’re, y’know, doing science and that’s how it’s done. They think that they’re looking for the valuable part of their results, when they’re actually trying to turn an honest negative result into a deceptively positive one (at best) or just kicking through the trash looking for something shiny (at worst). I wasn’t aware of the example that Bishop cites of a paper that helped blow the whistle on this in psychology. Its authors showed that what were considered perfectly ordinary approaches to one’s data could be used to show that (among other things) listening to Beatles songs made the study participants younger. And I mean “statistically significantly younger”. As they drily termed it, “undisclosed flexibility” in handling the data lets you prove pretty much anything you want.
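You don’t need a psychology dataset to see how fast this degrades things. Here’s a toy simulation (my own illustration, not the one from the paper Bishop cites): measure ten outcomes under a true null, t-test each one, and report whichever came out “significant”:

```python
# Toy p-hacking demo: there is no real effect anywhere, but reporting only
# the best of ten outcomes inflates the nominal 5% false-positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_outcomes, n_per_group = 2000, 10, 20

hacked_hits = 0
for _ in range(n_studies):
    # "Treatment" and "control" come from exactly the same distribution.
    treated = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    p_values = [stats.ttest_ind(t, c).pvalue for t, c in zip(treated, control)]
    if min(p_values) < 0.05:  # keep only the shiniest result
        hacked_hits += 1

print(f"'Significant' finding in {hacked_hits / n_studies:.0%} of pure-noise studies")
```

That comes out to roughly forty percent of studies showing something “significant” even though every individual test was done by the book; all the mischief is in which test gets reported.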
And the last category, hypothesizing ex post facto, is a related error that not everyone realizes can be an error. To me, that’s a dancing-on-the-edge thing: your results may in fact be telling you something that you didn’t realize and suggesting your next experiment. But if you think that, then you had better set up that next experiment before you think about publishing. Deciding that your first round of results is worth broadcasting, now that you look at them from this nicer angle over here, is almost certainly a mistake. At best, you’re going to be putting true-but-suboptimal stuff into the literature, results that could have been presented in a much more convincing and useful form. And at worst, you’re just trying to turn garbage into gold again. Or not even gold.
The good news, as mentioned, is that people are actually more alert to these problems. The well-publicized efforts to reproduce key studies are a very visible sign, and the ability of sites like PubPeer and other social media outlets to critique papers more quickly and openly is another. The incentives for shoddy work are still there, though, in many cases. The key to longer-term change will be finding ways to reduce those incentives – without that, it’s going to be uphill all the way.