The Scientific Literature

Thoughts on Reproducibility

As you’ll have heard, the Reproducibility Initiative has come out with data on their attempts to replicate a variety of studies (98 of them from three different journals) in experimental psychology. The numbers are not good:

The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

Right there you can see why the reporting on these results has been all over the map, because that paragraph alone assumes more statistical fluency than you could extract from an auditorium full of headline writers. So you can pick your headline – everything from “total crisis in all of science” to “not much of a big deal at all” is available if you dig around some.

These results have actually been publicly available for some months now – here’s a post on this blog from back in May. But their official publication has been the occasion for plenty of publicity, good and bad, accurate and confused (not to mention confusing). So as a chemist, as a practitioner of what’s supposed to be a reasonably reproducible field of study, what’s my own take?

First off, it’s true that the new replications do not prima facie invalidate the original studies. But they sure don’t shore them up the way one would have liked. As that section above shows, the strength of the original results was diluted by pretty much every measure. But if you look at this study by study, this was not an across-the-board effect. As Science 2.0 points out, some of what we’re seeing here is an effect of relying on P values. The papers whose results were just barely significant were much more likely to not reproduce well, and that, folks, is exactly what you’d expect.
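To see why the barely significant results are the shaky ones, here is a minimal sketch in Python. It is not taken from the replication paper. It assumes a two-sided z-test, a replication with the same sample size as the original, and (optimistically) that the true effect is exactly as large as the one originally observed. The function name `replication_probability` is only an illustrative label.

```python
from statistics import NormalDist

# Standard normal distribution, used for z-test quantiles and tail areas.
_STD_NORMAL = NormalDist()


def replication_probability(p_original: float, alpha: float = 0.05) -> float:
    """Chance that a same-sized replication is significant in the same direction.

    Assumption (hypothetical, for illustration): the true effect equals the
    effect observed in the original study, and both studies use a two-sided
    z-test with the same sample size.
    """
    # z statistic that corresponds to the original two-sided P value.
    z_observed = _STD_NORMAL.inv_cdf(1.0 - p_original / 2.0)
    # z statistic needed for the replication to reach significance.
    z_critical = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # The replication z statistic is distributed Normal(z_observed, 1).
    return 1.0 - _STD_NORMAL.cdf(z_critical - z_observed)


for p in (0.05, 0.01, 0.001):
    print(f"original P = {p}: replication probability = "
          f"{replication_probability(p):.2f}")
```

Under those assumptions, an original result sitting right at P = 0.05 has only an even chance of coming up significant again, and the odds improve only as the original P value gets smaller. Since published effect sizes tend to be overestimates, the real-world odds are likely lower still.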

There’s a tendency – perhaps especially in experimental psychology, but probably more widespread – to regard something that drags itself over the 0.05 P value line as (ta-daa!) “significant”, while anything that doesn’t is consigned to the outer darkness. But a study’s results can end up on the sunny side of that street by chance – that’s the whole thing that a statistical treatment is trying to tell you, and P values are probably not the best way to do it. Many people look at a P value of 0.05 or better and say “95% chance that these results are valid”, but that’s just wrong. In reality, under plausible assumptions about how often the hypotheses being tested are true to begin with, there’s about a one in three chance that those results are not valid. And that’s if everything is sufficiently powered (enough cases, data points, and replicates) – skimp on that and you’re heading right into the swamp. And that’s if everything is conducted perfectly – throw in some human bias and some “P-hacking” and you’ve got, well, what we have.
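The “one in three” figure depends on things the P value itself says nothing about: how many of the hypotheses being tested are true in the first place, and how well powered the studies are. Here is a rough sketch of that arithmetic in Python. The inputs (10% of tested hypotheses true, 80% power) are assumptions chosen for illustration, not numbers from the replication project, and `false_positive_risk` is a made-up name.

```python
def false_positive_risk(prior_true: float, power: float,
                        alpha: float = 0.05) -> float:
    """Fraction of 'significant' results that are actually false positives.

    prior_true: assumed share of tested hypotheses that are really true.
    power:      assumed chance a study detects a real effect.
    alpha:      significance threshold (chance a null effect passes it).
    """
    # Real effects that are correctly flagged as significant.
    true_positives = prior_true * power
    # Null effects that cross the threshold by chance.
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Illustrative assumption: 1 in 10 tested hypotheses is true, 80% power.
print(round(false_positive_risk(prior_true=0.10, power=0.80), 2))  # 0.36

# Same threshold, but half of the tested hypotheses are true.
print(round(false_positive_risk(prior_true=0.50, power=0.80), 3))  # 0.059
```

The same 0.05 threshold gives very different answers depending on the starting odds, which is why “P < 0.05” cannot be read as “95% likely to be right”. Drop the power below 80% and the first number gets worse.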

So some of what we’re seeing comes under the heading of “These results were never as good as people thought they were, and the statistics have been trying to tell us that all along”. There’s still room for a good dose of “Some of these results have been hyped up”, too, of course. But that doesn’t have to mean fakery. Richard Feynman was one hundred per cent right when he said that you are the easiest person to fool when it comes to your own work and your own ideas. One of the things that no one disputes about human psychology is that there are constant unconscious biases towards confirming what we think we know and/or what we would like to be true. File it under “human nature”. That’s why we do double-blind trials in clinical research, because otherwise the results would be hopelessly tilted by human expectations.

At the very least, I think it’s safe to say that a lot of the experimental psychology literature is insufficiently reliable. The situation is probably even worse than this study indicates, too, because (1) these were papers from the upper rank of journals, (2) the replication work was done with unusual attention to detail, and (3) it was done, whenever possible, with the cooperation of the original authors. So these figures are probably a best case; it’s just going to get worse from there.

How to fix this? Journals should insist on more statistical rigor, for one thing – better powered studies would help quite a bit. And both readers and authors should realize the limits of the emphasis on “statistical significance”, but I’m not holding my breath on that one. That point has been hammered on over and over, vividly and publicly, and things have gone on pretty much as usual. Too many people need and want to publish papers, and too many people are used to the statistical regimes we already have. You’ll notice that these two suggestions (and pretty much every other suggestion that would actually improve the literature) have a common feature. Back during the educational reform debates in 1960s England, Kingsley Amis famously said “More will mean worse”. The logical contrapositive holds true here: Better Will Mean Less.

If you want better, more reproducible papers, you’re going to have fewer of them. Shorter publication lists, fewer journals, and especially fewer lower-tier journals. The number of papers that are generated now cannot be maintained under more reproducible conditions, and that’s not just true for the experimental psychologists. I think that this would be a good thing, for several reasons, but it’s important to realize that this is what we’re asking for. Committees that judge hiring candidates based on how many papers they have, journals whose interest is a steady flow of submissions – whether you realize it or not, or whether you care or not, your priorities are not the same as those of the people trying to improve the scientific literature.

22 comments on “Thoughts on Reproducibility”

  1. Ash (Wavefunction) says:

    This is one of the reasons such meta analyses are key: it’s important to ask what the P value for getting a certain set of P values is for instance.

  2. HT says:

    It would be worthwhile to look at the Bayesian interpretation of results from the Reproducibility Initiative (link in my handle). The big picture isn’t significantly different, but it provides more details than just results reproduced or not for a particular study. It makes so much sense that I feel that all similar attempts in future should adopt the Bayesian analysis.

  3. David Stone says:

    There’s also the issue that not all p-values are created equal, especially for small numbers of study participants. And then there’s the effect size, which is arguably more important for psychology and social science research. Not sure if it was originally posted here or elsewhere, but I came across the “Dance of the p Values” a while back, which covers the issue pretty well:

  4. anon says:

    Psychology is not science.

  5. Anon says:

    The human species has always had a problem with putting too much emphasis on quantity vs quality, simply because quantity is more easily measured. The same applies to extending life expectancy rather than quality of life in healthcare. We prefer to fill the world with low-quality noise just so we can have more, more, more!

  6. anon1 says:

    Psychology deals with the ultimate complexity: the human mind. Taking the glass-half-full view, since 36% are reproducible in psychology, perhaps we will get more reproducible results in the cancer reproducibility project. Most of the cancer papers deal with cells and mice, presumably less complex systems than the human brain.

  7. Eric says:

    Is this really very surprising? Investigators and journals like to publish significant findings, so negative studies are systemically underpublished. Perhaps this shouldn’t be the case but to be honest I’d never take the time to read through a journal filled with negative studies. The papers that do get published are inherently biased toward significance and there will inevitably be some studies that arrived at significance through statistical chance (even in well-powered studies). When efforts are made to replicate these studies, many of these false positives will fail to repeat. Furthermore, statistical chance will rear its head again because the repeat studies are also subject to the vagaries of fate.

    We shouldn’t expect 100% repeatability. 40% seems a little low, but I don’t know what the ‘optimal’ number should be. Personally I’d rather see the same study repeated by 2 different labs than have one lab spend an inordinate amount of time and resources repeating a study in triplicate to ensure that they got the ‘right’ answer. That, of course, means publishing data that sometimes won’t be reproduced by others.

  8. Mivil1 says:

    There are issues here that go beyond P-values and effect sizes. In clinical studies in which the end-points are subjective (psychiatry, psychology, pain, etc), reproducibility is notoriously difficult. Placebo effects can be enormous (30-40% responses). In cancer, antibiotics, many cardiovascular indications, there are objective responses, i.e., the tumor shrank, the pathogen was cleared, blood pressure was lowered. They are not the patient-reported outcomes that are found in subjective indications, i.e., how is your pain on a 1-10 rating scale, how is your depression, etc. These results should not be so surprising. In fact, as I understand it, many approved anti-depressants have more failed than successful trials and many approved pain drugs fail to beat placebo in multiple studies. It’s not just the stats!

  9. luysii says:

    Amusing that the percentage (36) of the studies that were reproducible is very close to the percentage of the placebo effect.

    Later today, I hope to have a post up concerning my inability to note the lack of reproducibility of an important study (tissue plasminogen activator for stroke) into the literature — this was in the bad old pre-blogosphere days when all discussion of results was tightly limited by the published journals.

  10. Scifanz says:

    “If you want better, more reproducible papers, you’re going to have fewer of them.” Not necessarily. What needs to change is the incentive system for scientists. Divorce publication completely from peer review by adopting a pre-print system across the board. More research would be published as pre-prints, but only a few would be accepted for publication in journals: the ones that survive scrutiny on the pre-print server. If no one reads or comments on the pre-print, no one cares, and the pre-print should not become a “paper”.
    Hiring committees and grant-awarding agencies should also favor quality over quantity, and this quality should be assessed not only by the “rank” of the journal but by the support the paper receives in the community (positive comments and citations).

  11. David says:

    Not sure how this hasn’t been posted yet

  12. tcmJOE says:

    I think at this point pretty much every graduate student should have to read the book “Statistics Done Wrong” early in their career.

  13. Luysii says:

    This is the post concerning my inability to note the lack of reproducibility of an important study (tissue plasminogen activator for stroke) into the literature — this was in the bad old pre-blogosphere days when all discussion of results was tightly controlled by the published journals.

  14. luysii says:

    Just click on Luysii above and you’ll get to the post

  15. Gunter Hartel says:

    I think that part of the problem is that P-values are often misunderstood. It is natural to expect that 95% of studies that are significant at alpha=5% should be ‘truly’ significant. But that is not what the p-value means. The proportion of truly significant results among a set of results with p-values < 0.05 depends on the "population" rate of true results. It's like a screening test with 95% specificity. The proportion of those who test positive that are truly positives depends on the prevalence of the condition in the population. When a condition is rare a large proportion of the test positives are actually false positives.
    There is a very nice exposition of the problem on
    Effect sizes also can be exaggerated especially in under-powered studies, because only the unusually large observed effects make it to significance.

    All of this is true even when everything is done 100% correctly and all the assumptions are met. When we also consider the effects of publication bias, unadjusted multiple testing, post-hoc hypothesis tweaking, stepwise model selection, numerous sub-group analyses, multiple endpoints, etc it is maybe actually surprising how many of the results could be replicated. I assume that these were randomised experimental interventional studies. If we consider the many observational, correlational studies we might find even less reproducibility.

  16. tangent says:

    Gunter Hartel said “I assume that these were randomised experimental interventional studies.” Not all!

    The article says “100 experimental and correlational studies”, and by my quick read, they never do break out their results by the class of the original study. It’s not in their list of “indicators”.

    The study names are at their OSF page, so somebody could read and classify them. But I’m surprised the authors didn’t. Yes, I have what I must admit to be a strong preconceived notion that the correlational studies will fare much more poorly.

  17. DM says:

    I think the issue here is one of science literacy. Not just from the perspective of the general public but also for us scientists!

    I am a behavioural scientist and I think that whilst it is nice to get a result that is “statistically” significant, the larger question to be asked is if the result is “actually” significant! It would seem to me that the media tends to report a one-liner sound bite where often the entire discussion section is needed to appreciate what an often short results section means!

  18. Anon says:

    For sure this is not acceptable. We need something better than P values, or at least a better understanding of what they mean, as well as hidden underlying assumptions, such as potential systemic errors that are not represented in the original data.

  19. Anon says:

    PS. Perhaps the reproducibility rate itself would be a better metric than P values, since that is ultimately what we care about?

  20. Excellent blog Eric. I’m not surprised. Sadly this is also tarnishing science and the scientific process. Several mentors in graduate school stressed that good science is hypothesis driven and that the best scientists perform experiments to disprove their hypothesis before getting overly excited. Unfortunately, as you point out, the publish or perish mentality is driving a lot of this.

  21. Poinsy says:

    Yet another pertinent XKCD comic (title “Trouble for Science”):

  22. stonebits says:

    I came across this post

    It makes the point that the “same study” conducted in different locations just isn’t the same study. I thought the example given: studying housing selection in a commuter school vs a non commuter school was especially salient
