There’s clearly something wrong with the way that statistics get handled and interpreted in scientific studies, and there have been plenty of warnings about it. But change in this area is a hard thing to bring about. BioCentury has a good interview with someone who can tell you about that, John Ioannidis, of scientific reproducibility fame. He’s recently set off a lot of comment with a proposal to lower the threshold for “significance” to p < 0.005, and as you’d imagine, there are some strong opinions on that. His estimate is that this would reclassify about a third of published biomedical results as merely “suggestive” rather than significant, and his take on this is, basically, “good riddance”. According to Ioannidis, a lot of those results are false positives anyway, and a lot of the ones that are real are not actually useful. Here’s the root of the problem:
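It’s worth running the numbers to see why. Here’s a back-of-the-envelope positive-predictive-value calculation, a minimal sketch in Python; the 10% prior and 80% power figures are illustrative assumptions of mine, not numbers from the interview:

```python
# Back-of-the-envelope PPV: of the results that reach "significance",
# what fraction reflect a real effect?
# Illustrative assumptions (not from the interview): 10% of tested
# hypotheses are actually true, and every study has 80% power.
prior_true = 0.10
power = 0.80

for alpha in (0.05, 0.005):
    true_pos = prior_true * power          # real effects, correctly flagged
    false_pos = (1 - prior_true) * alpha   # null effects, flagged anyway
    ppv = true_pos / (true_pos + false_pos)
    print(f"alpha = {alpha}: PPV = {ppv:.0%}")

# alpha = 0.05:  PPV = 64% (about a third of the "hits" are false)
# alpha = 0.005: PPV = 95%
```

Under those assumptions, the stricter threshold clears out most of the false positives, which is the general shape of the argument for 0.005.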
BC: What is the correct way to think about a p-value?
JI: A p-value is the probability, or the chance, that you’ll see a result as extreme as the one that you see if the null hypothesis is true and if there is no bias. Many people say a p-value is the chance of the null hypothesis being wrong, of some effect being there. This completely ignores the fact that you have these two ‘ifs’ that are required to interpret the p-value. So it is the chance of seeing such an extreme result, or something even more extreme, if the null hypothesis is true and if we have no bias.
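That distinction is easy to see in a simulation. In this minimal sketch (the two-sample t-test and the sample sizes are arbitrary choices for illustration), the null hypothesis is true in every single experiment, and p < 0.05 still shows up right on schedule:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30

# Both groups are drawn from the same distribution, so the null
# hypothesis is true by construction, and there is no bias.
false_hits = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_hits += 1

print(f"{false_hits / n_experiments:.1%} of null experiments hit p < 0.05")
# Prints roughly 5.0% -- the p-value describes the data given the null;
# it says nothing about how often the null itself is true.
```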
And he’s upfront about the fact that his suggested new threshold is not the final answer, but just an interim step towards larger reforms. He wants to see better experimental and trial designs (Bayesian and traditional), larger sample sizes, and more attention paid to potential sources of bias and to effect size. In fact, if people just go wild for the new, stricter p-values, it’ll be a loss:
BC: Don’t you have a concern that by moving to a smaller p-value you are reinforcing the fetish about p-values and driving people to create enormous data sets? With data sets that are big enough, you can get very small p-values, but that doesn’t mean the results mean anything.
JI: Absolutely. And this is why 80% of the literature should not be using p-values, maybe 90% of the literature. With huge amounts of data, with big data, using a p-value means nothing, because everything can be statistically significant at 0.05 or even at 0.005.
But the question is, how do you change the mentality of people who are almost automatically hooked on using a single magic number? It’s not that the better alternative approaches, like using effect sizes and confidence intervals or using Bayesian statistics, are recent. They have been out there for a long time, but their adoption, even though it is accelerating, has not been what we want.
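The big-data point is easy to demonstrate, too. In this sketch (the sample size and the deliberately trivial effect are invented for illustration), a difference far too small to care about sails under even the stricter 0.005 threshold, while the effect size and confidence interval make its irrelevance plain:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000_000        # "big data": two million subjects per group
true_diff = 0.005    # a difference of 0.005 standard deviations

a = rng.normal(0.0, 1.0, n)
b = rng.normal(true_diff, 1.0, n)

_, p = stats.ttest_ind(a, b)
diff = b.mean() - a.mean()
d = diff / np.sqrt((a.var() + b.var()) / 2)   # Cohen's d, pooled
se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

print(f"p = {p:.1e}")          # typically lands far below 0.005
print(f"Cohen's d = {d:.4f}")  # ~0.005: essentially no effect at all
print(f"95% CI for the difference: "
      f"[{diff - 1.96*se:.4f}, {diff + 1.96*se:.4f}]")
```

The p-value alone would call this a solid finding; the effect size and interval show there is nothing there worth having.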
As stated above, he thinks that the p-value is the right tool for the job about 20% of the time, whereas Bayesian measures could be useful half the time, maybe even more. One instant objection to the idea of lowering p-value cutoffs is that it will make everything much more expensive and time-consuming (clinical trials, first and foremost), but his contention is that it doesn’t have to be that way. Better designs and more appropriate measures could take the place of a lot of the brute-force bigger-sample approach, and you get the impression that he’s deliberately pushing people towards those by making it too expensive to do things the old way. “We have some very strong tools that we don’t apply”, says Ioannidis, and he’s going to try to make people apply them, one way or another. . .
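As a taste of the Bayesian alternative he keeps pointing to, here’s a conjugate beta-binomial comparison of two hypothetical response rates (all counts invented), which yields the direct probability statement that people keep mistaking the p-value for:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical trial outcomes (invented numbers): responders / enrolled
control_resp, control_n = 18, 60
treat_resp, treat_n = 27, 60

# Beta(1, 1) priors updated with the observed counts, then compared
# by Monte Carlo sampling from the two posteriors
post_control = rng.beta(1 + control_resp, 1 + control_n - control_resp, 100_000)
post_treat = rng.beta(1 + treat_resp, 1 + treat_n - treat_resp, 100_000)

delta = post_treat - post_control
print(f"P(treatment is better) = {(delta > 0).mean():.1%}")
print(f"95% credible interval for the difference: "
      f"[{np.quantile(delta, 0.025):.3f}, {np.quantile(delta, 0.975):.3f}]")
```

No magic cutoff, just a probability and a range, which is exactly the kind of output he’d rather see people arguing over.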