As you’ll have heard, the Reproducibility Project has come out with data on their attempts to replicate a variety of studies (100 of them, from three different journals) in experimental psychology. The numbers are not good:
The mean effect size (r) of the replication effects (M_r = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M_r = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
Right there you can see why the reporting on these results has been all over the map, because that paragraph alone assumes more statistical fluency than you could extract from an auditorium full of headline writers. So you can pick your headline – everything from “total crisis in all of science” to “not much of a big deal at all” is available if you dig around some.
These results have actually been publicly available for some months now – here’s a post on this blog from back in May. But their official publication has been the occasion for plenty of publicity, good and bad, accurate and confused (not to mention confusing). So as a chemist, as a practitioner of what’s supposed to be a reasonably reproducible field of study, what’s my own take?
First off, it’s true that the new replications do not prima facie invalidate the original studies. But they sure don’t shore them up the way one would have liked. As the passage quoted above shows, the strength of the original results was diluted by pretty much every measure. But if you go through this study by study, it was not an across-the-board effect. As Science 2.0 points out, some of what we’re seeing here is an effect of relying on P values. The papers whose results were just barely significant were much more likely to not reproduce well, and that, folks, is exactly what you’d expect.
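Here’s a quick way to see why that’s expected: a throwaway Monte Carlo sketch in Python, under completely made-up assumptions (a 70/30 mix of null and real effects, two-group t-tests with 30 subjects per arm – none of these numbers come from the replication project itself). The point is the pattern, not the exact figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mix of studies: 70% test a null effect, 30% a real one
# (Cohen's d = 0.8), with 30 subjects per group.  All illustrative numbers.
n_studies, n_per_group = 20000, 30
true_effects = rng.choice([0.0, 0.8], size=n_studies, p=[0.7, 0.3])

def run_study(d):
    """Simulate one two-group experiment and return its two-sided P value."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(d, 1.0, n_per_group)
    return stats.ttest_ind(treated, control).pvalue

original_p = np.array([run_study(d) for d in true_effects])
replication_p = np.array([run_study(d) for d in true_effects])

# Look only at "published" originals (P < 0.05), split by how convincingly
# they cleared the bar.
barely = (original_p < 0.05) & (original_p > 0.01)
strongly = original_p <= 0.01

print("Replication rate, original P between 0.01 and 0.05:",
      round(float(np.mean(replication_p[barely] < 0.05)), 2))
print("Replication rate, original P at or below 0.01:     ",
      round(float(np.mean(replication_p[strongly] < 0.05)), 2))
```

With assumptions like these, the originals that just squeaked under P = 0.05 replicate noticeably less often than the ones that cleared P = 0.01 comfortably, simply because a larger share of them were flukes to begin with.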
There’s a tendency – perhaps especially in experimental psychology, but probably more widespread – to regard something that drags itself over the 0.05 P value line as (ta-daa!) “significant”, while anything that doesn’t is consigned to the outer darkness. But a study’s results can end up on the sunny side of that street by chance – that’s the whole thing that a statistical treatment is trying to tell you, and P values are probably not the best way to do it. Many people look at a P value of 0.05 or better and say “95% chance that these results are valid”, but that’s just wrong. In reality, under realistic assumptions about how many of the hypotheses being tested are actually true, there’s something like a one in three chance that such a result is not valid. And that’s if everything is sufficiently powered (enough cases, data points, and replicates) – skimp on that and you’re heading right into the swamp. And that’s if everything is conducted perfectly – throw in some human bias and some “P-hacking” and you’ve got, well, what we have.
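If you want to see where a figure like “one in three” can come from, here’s the back-of-the-envelope arithmetic, with loudly made-up inputs: suppose only 10% of the hypotheses people test are actually true, and suppose the studies that test them run at 80% power.

```python
# A rough illustration of why "P < 0.05" does not mean "95% chance the
# result is real".  The inputs below (prior, power, alpha) are illustrative
# assumptions, not figures from the replication study.

alpha = 0.05   # significance threshold
power = 0.80   # probability of detecting a real effect
prior = 0.10   # assumed fraction of tested hypotheses that are actually true

true_positives = prior * power            # real effects that reach P < 0.05
false_positives = (1 - prior) * alpha     # null effects that reach P < 0.05 by chance

false_discovery_rate = false_positives / (true_positives + false_positives)
print(f"Fraction of 'significant' results that are false: {false_discovery_rate:.0%}")
# -> about 36% with these assumptions, i.e. roughly one in three
```

Change the assumed prior or the power and that fraction moves around, but under numbers like these a “significant” result carries nowhere near 95% certainty.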
So some of what we’re seeing comes under the heading of “These results were never as good as people thought they were, and the statistics have been trying to tell us that all along”. There’s still room for a good dose of “Some of these results have been hyped up”, too, of course. But that doesn’t have to mean fakery. Richard Feynman was one hundred per cent right when he said that you are the easiest person to fool when it comes to your own work and your own ideas. One of the things that no one disputes about human psychology is that there are constant unconscious biases towards confirming what we think we know and/or what we would like to be true. File it under “human nature”. That’s why we do double-blind trials in clinical research, because otherwise the results would be hopelessly tilted by human expectations.
At the very least, I think it’s safe to say that a lot of the experimental psychology literature is insufficiently reliable. The situation is probably even worse than this study indicates, too, because (1) these were papers from the upper rank of journals, (2) the replication work was done with unusual attention to detail, and (3) it was done, whenever possible, with the cooperation of the original authors. So these figures are probably an upper bound; it’s only going to get worse from there.
How to fix this? Journals should insist on more statistical rigor, for one thing – better powered studies would help quite a bit. And both readers and authors should realize the limits of the emphasis on “statistical significance”, but I’m not holding my breath on that one. That point has been hammered on over and over, vividly and publicly, and things have gone on pretty much as usual. Too many people need and want to publish papers, and too many people are used to the statistical regimes we already have. You’ll notice that these two suggestions (and pretty much every other suggestion that would actually improve the literature) have a common feature. Back during the educational reform debates in 1960s England, Kingsley Amis famously said “More will mean worse”. The logical contrapositive holds true here: Better Will Mean Less.
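To put a rough number on “better powered”, here’s the standard Fisher-z approximation for how many subjects it takes to detect a correlation at 80% power. The target r values are illustrative, though r ≈ 0.2 is in the neighborhood of the mean replication effect size quoted above.

```python
# Back-of-the-envelope sample size for 80% power to detect a correlation,
# via the Fisher z approximation.  The r values below are illustrative.
from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect correlation r with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

for r in (0.2, 0.3, 0.4):
    print(f"r = {r}: about {n_for_correlation(r)} subjects")
# roughly 194, 85, and 47 subjects, respectively
```

Getting a couple of hundred subjects for every modest effect costs real time and money, which brings us straight to the trade-off.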
If you want better, more reproducible papers, you’re going to have fewer of them. Shorter publication lists, fewer journals, and especially fewer lower-tier journals. The number of papers that are generated now cannot be maintained under more reproducible conditions, and that’s not just true for the experimental psychologists. I think that this would be a good thing, for several reasons, but it’s important to realize that this is what we’re asking for. Committees that judge hiring candidates by how many papers they have, journals whose interest is a steady flow of submissions – whether you realize it or not, and whether you care or not, your priorities are not the same as those of the people trying to improve the scientific literature.