Time for another “watch those statistics” post. I did one about this time last year, and I could do one every couple of months, to be honest. Here’s a good open-access paper from the Royal Society on the problem of p-values, and why there are so many lousy studies out there in the literature. The point is summed up here:
If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time. If, as is often the case, experiments are underpowered, you will be wrong most of the time.
True, true, and true. If you want to keep the false discovery rate down to below 5%, the paper says, you should be going for p<0.001. And just how many studies, of all kinds, across all fields, hit that standard? Not too damn many, which means that the level of false discovery out there is way north of 5%.
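The arithmetic behind those numbers is worth seeing. Here's a minimal sketch (my own, not from the paper) of the standard false-discovery-rate calculation: assume some fraction of tested hypotheses are real effects, pick a significance threshold, and assume a given statistical power. The parameter values below are illustrative assumptions, not figures from the paper.

```python
def false_discovery_rate(prior_true=0.1, alpha=0.05, power=0.8):
    """Fraction of 'significant' results that are actually false positives.

    prior_true: assumed fraction of tested hypotheses that are real effects
    alpha:      significance threshold (p-value cutoff)
    power:      probability of detecting a real effect when it exists
    """
    false_positives = (1 - prior_true) * alpha  # null effects that slip under alpha
    true_positives = prior_true * power         # real effects correctly detected
    return false_positives / (false_positives + true_positives)

# With 10% of hypotheses true, p < 0.05, and decent power:
print(false_discovery_rate(0.1, 0.05, 0.8))    # ≈ 0.36, i.e. wrong over a third of the time

# Underpowered (power = 0.2): wrong most of the time
print(false_discovery_rate(0.1, 0.05, 0.2))    # ≈ 0.69

# Tightening to p < 0.001 brings the rate down near 1%
print(false_discovery_rate(0.1, 0.001, 0.8))   # ≈ 0.01
```

Run the numbers yourself with whatever prior you find plausible; the point survives a lot of fiddling with the assumptions.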
(This paper) deals only with the very simplest ideal case. We ask how to interpret a single p-value, the outcome of a test of significance. All of the assumptions of the test are true. The distributions of errors are precisely Gaussian and randomization of treatment allocations was done perfectly. The experiment has a single pre-defined outcome. The fact that, even in this ideal case, the false discovery rate can be alarmingly high means that there is a real problem for experimenters. Any real experiment can only be less perfect than the simulations discussed here, and the possibility of making a fool of yourself by claiming falsely to have made a discovery can only be even greater than we find in this paper.
The author of this piece is David Colquhoun, a fact that some people will have guessed already, because he’s been beating on this topic for many years now. (I’ve linked to some of his prickly opinion pieces before.) He’s not saying something that a lot of people want to hear, but I think it’s something that more people should realize. A 95% chance of being right, across the board, would be a high standard to aim for, possibly too high for research to continue at a useful pace. But current standards are almost certainly too low, and we especially need to look out for this problem in studies of large medical significance.
Update: what this post needed was this graphic from XKCD!