
Clinical Trials

Underpowered And Overinterpreted

Time for another “watch those statistics” post. I did one about this time last year, and I could do one every couple of months, to be honest. Here’s a good open-access paper from the Royal Society on the problem of p-values, and why there are so many lousy studies out there in the literature. The point is summed up here:

If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time. If, as is often the case, experiments are underpowered, you will be wrong most of the time.


True, true, and true. If you want to keep the false discovery rate down to below 5%, the paper says, you should be going for p<0.001. And just how many studies, of all kinds, across all fields, hit that standard? Not too damn many, which means that the level of false discovery out there is way north of 5%.
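
For the curious, here's a quick back-of-the-envelope simulation of that arithmetic. This is my own sketch, not anything from the paper, and it bakes in two assumptions you may quibble with: only one in ten of the hypotheses being tested is actually true, and the experiments are run at roughly 80% power (numbers in the same ballpark as the paper's headline example).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_exp, prior_true = 20_000, 0.10   # assume only 1 in 10 tested effects is real
    n, d = 17, 1.0                     # ~80% power for a two-sample t-test at p<0.05

    is_real = rng.random(n_exp) < prior_true
    pvals = []
    for real in is_real:
        a = rng.normal(0, 1, n)
        b = rng.normal(d if real else 0, 1, n)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    pvals = np.array(pvals)

    for alpha in (0.05, 0.001):
        hits = pvals < alpha
        fdr = np.mean(~is_real[hits])   # fraction of "discoveries" that are false
        print(f"p<{alpha}: {hits.sum()} discoveries, false discovery rate = {fdr:.2f}")

With those assumptions, roughly a third of the p<0.05 "discoveries" are false, while tightening the threshold to p<0.001 brings the false discovery rate down to a few percent, at the cost of missing a lot of real effects.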

(This paper) deals only with the very simplest ideal case. We ask how to interpret a single p-value, the outcome of a test of significance. All of the assumptions of the test are true. The distributions of errors are precisely Gaussian and randomization of treatment allocations was done perfectly. The experiment has a single pre-defined outcome. The fact that, even in this ideal case, the false discovery rate can be alarmingly high means that there is a real problem for experimenters. Any real experiment can only be less perfect than the simulations discussed here, and the possibility of making a fool of yourself by claiming falsely to have made a discovery can only be even greater than we find in this paper.

The author of this piece is David Colquhoun, a fact that some people will have guessed already, because he's been beating on this topic for many years now. (I've linked to some of his prickly opinion pieces before). He's not saying something that a lot of people want to hear, but I think it's something that more people should realize. A 95% chance of being right, across the board, would be a high standard to aim for, possibly too high for research to continue at a useful pace. But current standards are almost certainly too low, and we especially need to look out for this problem in studies of large medical significance.
Update: what this post needed was this graphic from XKCD!

50 comments on “Underpowered And Overinterpreted”

  1. M Bower says:

    Link above is to an old favorite on this topic. Irreproducible results plague us, it seems.

  2. M Bower says:

    This link is even better. Dance of the P values.

  3. Morten G says:

    Peter Kenny also had a nice paper on how you can’t bin values and then calculate a p-value. Or it was a correlation coefficient. Whichever. Basically what they do in every single study that uses BMI (intuitively it makes better sense when you realise that the statistical methods implicitly assume that the data in each bin is normally distributed and it rarely is).
    Anyway, isn't this the kind of thing that software like IBM's Watson might be able to tidy up? Find studies with questionable statistical approaches and flag them?

  4. Carl 'SAI' Mitchell says:

    “And just how many studies, of all kinds, across all fields, hit that standard?”
    The particle physics standard for a discovery is p

  5. Wavefunction says:

    A great book about the depredations that the p-value visits on us is “The Cult of Statistical Significance” by McCloskey and Ziliak.

  6. Carl 'SAI' Mitchell says:

    Apparently this has HTML formatting, didn’t realize that.
    “And just how many studies, of all kinds, across all fields, hit that standard?”
    The particle physics standard for a discovery is p < 0.0000003 (5 standard deviations). The standard model is incredibly accurate. Sadly, it's so complex to calculate that it can't be used directly for chemistry.
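    For anyone who wants to check that figure: the 5-sigma threshold corresponds to the one-sided upper tail area of a Gaussian (a quick sketch, assuming the one-sided convention particle physicists typically use).

        from scipy.stats import norm

        # Probability of a Gaussian fluctuation at least 5 standard deviations above the mean
        print(norm.sf(5.0))   # ~2.9e-7, i.e. roughly the quoted 0.0000003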

  7. Zippy says:

    His statistics primer Lectures on Biostatistics is a classic. It's been out of print for decades, but is now available online for free on his website.

  8. Neo says:

    How many of those praising this article have recommended rejection of a paper that used p=0.05 as the threshold for significance?
    More importantly, how many of you have opposed the publication of a paper that tests its hypothesis on the same data set that was used to generate it?
    This is the real problem. I think it is much more interesting to live with uncertainty than to live with answers that are probably wrong.

  9. m says:

    Another giddy commentary on the scientific method based on a simplistic delineation of empiricism and plausibility.
    It must frustrate the statistics evangelists that science is progressing rapidly despite these highly compromised standards.
    No data or conclusion is an island; explanations and their relationships to other findings matter. Tired of these sanctimonious straw men.

  10. Semichemist says:

    Anyone care to dumb this down for someone who is only vaguely familiar with statistics? If you use a test to determine that there is a 5% chance your hypothesis is wrong, how does that translate to being wrong 30+% of the time?

  11. to semichemist says:

    there is a 5% chance your hypothesis is wrong… if the effect is real. But that the effect is real itself is not 100% and is often much less hence the false discovery rate is much higher than 5%
    also read the paper.

  12. a. nonymaus says:

    In chemistry, can we claim statistical validity by virtue of the large number of molecules that we are using in the reaction? If I do a reaction at the micromole scale, I’m testing those reaction conditions on ca. 10^18 molecules. I doubt there will ever be a clinical trial with that number of subjects. Admittedly, there is residual uncertainty in the other variables such as temperature (let alone the per-molecule definition of temperature) and reaction time.

  13. johnnyboy says:

    @10: To oversimplify it a bit: it’s a matter of doing multiple tests over the long run. Your p

  14. Pete says:

    'The Cult of Statistical Significance' by McCloskey and Ziliak (recommended by Ash) is a good read, and Steve Ziliak kicked off the 2013 Computer Aided Drug Design GRC with a most excellent harangue. As we pointed out in our 'Inflation of correlation in pursuit of drug-likeness' (thanks for flagging this up, Morten), the most anemic of trends can acquire eye-watering statistical significance when powered by enough data. A number of highly cited studies present statistical significance as evidence for the strength of a trend (e.g. by showing mean values with standard error bars) and, given some of what gets published in journals that should know better, one could be forgiven for wondering why compound attrition is not an even bigger problem than it is.
    I've linked my BrazMedChem 2014 presentation ('Accident and misadventure in property-based molecular design'), which addresses both correlation inflation and the deficiencies of ligand efficiency metrics, as the URL for this comment; it provides links to both the correlation inflation study and 'Ligand efficiency metrics considered harmful'.

  15. jonnyboy says:

    @10: to continue: Your p

  16. Pmd says:

    All very true. But people have made these points before and keep making them…and no one does anything about it. How do we ensure this translates into actual change in the way results are evaluated and reported?

  17. Pete says:

    @16 Pmd, I have wondered this myself. In the drug-likeness and compound quality field, there appears to be a 'magic circle' in which the 'experts' cite each others' articles, and there seems to be an unwritten rule that one does not make reference to the odor even when it is overpowering. Something that might make a difference would be if the editors of journals that have published flawed studies either persuaded the authors to publish errata or, if the authors proved intransigent, published health warnings for the studies in question.

  18. Sam Adams the Dog says:

    Not addressed by the article is the consequence of “making a fool out of yourself”, aka “being wrong”.
    If the net effect of false positives is to cause many healthy individuals to take a drug, whether that is a good or bad thing from the point of view of health depends on the side effects of the drug.
    Suppose, for example, that many individuals not susceptible to stroke are told to take an aspirin a day, as are the majority of individuals who are susceptible. Suppose, too, that the treatment really is effective for the susceptible.
    There is little lost by being wrong here — given that individuals who experience bad things when they take aspirin likely know that ahead of time and will avoid the treatment.

  19. zmil says:

    @to semichemist
    “there is a 5% chance your hypothesis is wrong… if the effect is real. But that the effect is real itself is not 100% and is often much less hence the false discovery rate is much higher than 5%”
    That's backwards. The p-value assumes the null hypothesis is true; it says nothing directly about your hypothesis, or about the reality of the effect. A p-value of 0.05 means there's a 5% chance of getting data at least as extreme as what you saw, assuming the null hypothesis is correct.
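    A small illustration of that distinction (a sketch using two-sample t-tests on pure noise): when the null hypothesis really is true, p-values below 0.05 still turn up about 5% of the time. That 5% is a statement about the data given the null, not about the probability that a given hypothesis is right.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        # 10,000 experiments in which the null is exactly true (both groups drawn from the same distribution)
        p = np.array([stats.ttest_ind(rng.normal(0, 1, 20), rng.normal(0, 1, 20)).pvalue
                      for _ in range(10_000)])
        print(np.mean(p < 0.05))   # about 0.05, as the test promises under the null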

  20. just a tip says:

    @15 jonnyboy, try it again using "p less than" rather than the less-than sign.

  21. MarkySparky says:

    If you only consider the statistics, it makes sense. But if you consider the cost/benefit ratio of arbitrary statistical thresholds, then I submit you are missing the larger point of research.
    There is little in this world so tiresome as a statistical pedant, IMHO.

  22. Robb says:

    I disagree the standard of p=0.05 is too low… for what it’s intended.
    Journals grew out of letters exchanged between scientists. “Hey Kelvin, I tried this and got this interesting result. Can you give it a go and see what you find? There’s a good chap.” The p=0.05 level is a good triage threshold in that spirit: an interesting result worthy of more study, preferably by some independent investigators. That’s also what most journal articles describe, and SHOULD describe. Journals are meant to disseminate interesting results in a timely manner, not hard and fast truth.
    Should we use 0.05 as the standard for phase III trials? Of course not. The FDA already recognizes this in a clumsy way, at least in my field, by generally requiring at least two “positive” trial results.

  24. Hap says:

    Except it isn't just statistics – the large numbers of studies that can't be replicated (that "march of science" 9 was talking about) indicate that something is broken in the way people understand their results. The result is that schools (and ultimately, we) pay lots of money to employ people who don't get real results or can't tell the difference (and who teach others to do the same), and other people spend lots of money on snipe hunts. This makes drugs more expensive and taxes higher, with no gain for the people they were intended to help. If you don't use statistics, how do you intend to validate that results are real and not the product of lots of effort and random chance?
    It also seems disingenuous to claim significance for your research based on statistics but then complain about statistical pedants when you don't know exactly what those statistics really mean.

  25. Anonymous says:

    I’m happy to accept that p values are crap. But what metric should be used instead? What are the best value and formula to measure statistical reliability of the hypothesis?

  26. johnnyboy says:

    @20 yes thank you, another fun Corante quirk.
    @10 so here we go again: the argument is not over a single p "less than" 0.05 test, which actually gives the probability of seeing a difference this large by chance when there is no real effect (and therefore not exactly the same as how you understand it). His point is that the 0.05 value is used over and over, both within a single article for all the different tests done, and by extension across the entire literature. If you do 1000 statistical tests at that p level, combined with a statistical power of 0.8 (which depends on sample size and degree of treatment effect, and is a fair estimate of power used in real life), you end up with something like a 30% chance that any given "significant" result is wrong. You'll have to read his article to understand the calculations, but it's pretty simple to follow. And that is when all the assumptions on which adequate statistical testing depends are met (random sampling, normal distributions, etc.), which is often not the case. That leads him to conclude that easily more than 30% of the results based on p less than 0.05 in the whole literature could be errors.
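    To make that arithmetic concrete, here is a minimal sketch of the calculation (it assumes, as Colquhoun's headline example does, that only 10% of the hypotheses being tested correspond to real effects; that prevalence is the extra ingredient beyond the p threshold and the power):

        # 1000 tests, 10% real effects, 80% power, significance threshold p < 0.05
        n_tests, prevalence, power, alpha = 1000, 0.10, 0.80, 0.05

        true_positives = n_tests * prevalence * power          # 80 real effects detected
        false_positives = n_tests * (1 - prevalence) * alpha   # 45 false alarms
        fdr = false_positives / (false_positives + true_positives)
        print(f"false discovery rate = {fdr:.2f}")              # 0.36, i.e. roughly a third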

  27. johnnyboy says:

    @25: the point is not that p values are crap. The point is that a p value of 0.05 is not stringent enough and leads to too many findings being interpreted as real when they are only due to chance. To avoid this the significant p level should be much lower than 0.05. I’m no statistician but I’ve done my share of research and data interpretation, and from a purely gut-level feeling, I would tend to agree that 0.05 is not enough and lets in too much potential garbage. I don’t know that I would insist on 0.001 as the threshold, especially in life sciences where variability can be great and multiple confounding factors will mess with your data, but in general I like to see something below 0.01 before I start believing in it.

  28. db says:

    @18:
    I disagree that there is little to be lost. The consequences are measured not only in terms of side effects of the drug, but in other ways as well. Consider the economic inefficiency of people spending money on drugs they don’t need that provide no effects, good or ill. Consider the opportunity cost of alternative activities they could afford. The opportunity cost of time spent acquiring the drugs and taking them may not be insignificant.
    Furthermore, consider the opportunity cost of tying up manufacturing lines to produce the excess drugs consumed without effect. Consider the incorrect market signals that would indicate drug companies should put more or less effort into such drugs due to skewed sales figures.

  29. johnnyboy says:

    And since Derek likes xkcd’s cartoons, here’s one that sums up the issue nicely:
    http://imgs.xkcd.com/comics/significant.png

  30. Sam Adams the Dog says:

    @20 It all comes down to cases. I believe (nearly) everyone would agree that for the instance I have quoted, the inclusion of non-susceptible individuals in the group taking the drug would be cost-effective as well as health-neutral.
    Obviously, that won’t always be the case, and I would not propose ignoring the possibility of long-term effects for any newly released drug, especially one to be taken over a long term.

  31. Sam Adams the Dog says:

    Sorry — I meant @28, not @20.

  32. Anonymous says:

    It’s always easy for the chemists to harp on P values when they don’t really have a handle on the sheer quantity of time or resources it would take to get p

  33. Pete says:

    It's worth thinking about the different ways in which P values and confidence intervals get used in medicinal chemistry. We might analyze a large data set and find a correlation coefficient for X with Y of 0.2, with a 99.999% confidence interval of 0.1 to 0.3. We would conclude that there is a statistically significant correlation, but in reality this correlation is still very weak even though it is statistically significant. Put another way, X only explains a small fraction of the variance in Y, which means that most of the variance in Y is explained by other factors. If somebody proposes that you use guidelines, then you should always ask about the strength of the trend(s) upon which the guidelines are based, because this tells you how rigidly you should adhere to the guidelines. The significance of a trend is determined both by the strength of the trend and by the size of the sample, and so you can always make a trend more significant by drawing larger samples from the population. If somebody tries to impress you with sample size, you might consider asking why they needed such a large sample to achieve statistical significance. Two relatively high profile articles which focused more on trend significance than on trend strength are JMC 51:817–834 (2008) and JMC 52:6752–6756 (2009); I have linked 'Inflation of correlation in the pursuit of drug-likeness', and the criticism there of the first article is relevant to this discussion.
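    As a quick illustration of that point with simulated data (not any of the data sets from those papers): a correlation of about 0.2 explains only ~4% of the variance, yet with a big enough sample the p-value becomes vanishingly small.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n = 5_000                                                # a "big data" sized sample
        x = rng.normal(size=n)
        y = 0.2 * x + np.sqrt(1 - 0.2**2) * rng.normal(size=n)   # true correlation ~0.2

        r, p = stats.pearsonr(x, y)
        print(f"r = {r:.2f}, r^2 = {r*r:.3f}, p = {p:.1e}")
        # r ~ 0.2 explains ~4% of the variance in y, but p is astronomically small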

  34. A Nony Mouse says:

    #33: “If somebody tries to impress you with sample size, you might consider asking why they needed such a large sample to achieve statistical significance.”
    I thought the whole point of larger sample sizes was to find statistically significant relationships since it’s hard to ever find these in small samples even if they exist.

  35. Hap says:

    Science isn't "some $#%^ my Ph.D. says"; you're actually supposed to back up what you say with evidence that it means something. If doing so is too expensive, then perhaps you should stick to statements you can back up, or find something else to do.

  36. Pete says:

    A Nony Mouse (#34), the primary reason that we do data analysis is to inform decision making. The weaker the trend, the less useful it is as an aid to decision making. If the trend is strong, you can achieve statistical significance with a smaller sample, and it's a good idea to think in terms of experimental design (bomb on a grid) when sampling. If you need to draw a large sample to achieve statistical significance, then the trend is by definition weak and therefore of limited value for prediction by itself. A number of uncorrelated weak trends can be used together, and this is the basis of multivariate modelling.

  37. Anonymous says:

    Still no suggestions for an alternative to p values?
    Seems that people here aren’t interested in solving problems, but just like moaning about them.

  38. Frosty says:

    Now, for the equivalent of farting in church, apply the conclusions of this paper to the field of climatology.

  39. Max says:

    Anyone who has been in research for 40+ years, or, as in my case, has been trained by people who have, will know that p-values are a recent requirement for publication in decent journals. In honesty, who cares if something is

  40. Dr. Guntram Shatterhand says:

    Slightly off topic, but can anyone suggest a biomedical science publishing-oriented discussion forum that isn’t entirely focused on retractions or misconduct?
    Specifically I am looking for a place to discuss generalities related to publishing from an industry perspective, preferably a forum that allows the use of pseudonymous posts (like this one) and discussions that don’t get into project specifics or other trade secrets (I want to publish, not get fired).
    The question I’m immediately interested in discussing is the tension we face in early mid-stage drug development where we’re pretty much limited to publishing on lab methods, descriptions of failed drugs, or out of date science that has no IP value.
    As a consequence, much of what we publish never gets read, cited, or otherwise credited by our peers (especially at annual review time). Another consequence is that much of the data we generate is considered insufficiently deep to warrant academic interest, even if it contains plenty of useful nuggets of information.
    Does anyone here have suggestions for determining the least publishable unit for pre-clinical Pharma data and picking a target journal without doing salami science or publishing dross?
    Thanks.

  41. mike says:

    I know I will be bashed for saying this, but in our company a change in culture towards statistical rigour happened thanks to Six Sigma training for a dozen scientists, and the presence of a few belts in management. I agree that perhaps a strict Six Sigma approach is not applicable to early R&D and discovery in general, but the toolbox is just a set of standard stats plus process analysis and optimisation, which is a great way of teaching evidence-based decision making.

  42. johnnyboy says:

    @mike: interesting, actually. How does the increased statistical rigor manifest itself, concretely?

  43. Vicki says:

    Sam Adams the Dog:
    One problem here is that the usual way someone finds out they are susceptible to bad side effects is by experiencing them. The person who already knows they are allergic to aspirin or penicillin will avoid them, but if more people take a drug, there will be more allergic reactions, ranging from minor rashes to anaphylactic shock or kidney failure.
    I can decide to take the risk of serious side effects, but I want to have as much good information as possible: how likely is this drug to be helpful, in what ways, and how likely is it to be harmful, in what ways.

  44. Pete says:

    Responding to the challenge ("Seems that people here aren't interested in solving problems, but just like moaning about them") from Anonymous 37, I would like to point out that one solution to the problem is to quantify the size of an effect (e.g. using Cohen's d) in addition to its significance. When used appropriately, there is nothing inherently wrong with a P value, but at the same time it is an error to interpret it as a measure of effect size. We made a number of recommendations in the conclusion of 'Inflation of correlation in the pursuit of drug-likeness' (linked as the URL for this comment), which I'll list here:
    1. Data sets should be made available as supplemental material.
    2. A purely graphical representation of data is inadmissible as evidence that a relationship between one pair of variables is stronger than that between another pair of variables.
    3. Provided that all data are in-range, a correlation coefficient for X and Y or coefficient of determination for the fit of Y to X should always be presented to support any assertion of a strong relationship between X and Y.
    4. For data sets partitioned into bins, observation of strong relationship between average values of Y and/or X is inadmissible as evidence of a strong relationship between X and Y.
    5. For data sets partitioned into bins, each average value should be accompanied by a measure (e.g. standard deviation; inter-quartile range) of the spread of the distribution that is independent of sample size.
    6. For data sets partitioned into bins, it should be demonstrated that inferences drawn from analysis are independent of the binning scheme.

  45. M Bower says:

    @37 – the alternative to p values is to use confidence intervals, which give a feeling for the size of the effect, if present.

  46. MIMD says:

    What does this paper imply about the use of “Big Data” in medicine for, say, comparative effectiveness research, esp. when the data used is nearly entirely uncontrolled (e.g., from myriad electronic medical record systems)?

  47. NJBiologist says:

    @44 Pete–In principle, I like your suggestions. In practice, I've found that using something like Cohen's d leads to recommendations for sample sizes that are well below my comfort level. For example, I spent a fairly miserable year trying to replicate a coworker's results determined using n=5 rats; based on Cohen's d, this was a thoroughly reasonable sample size. However, the effects only seemed to be reproducible with larger n–say, 10 rats. I suspect this could be due to distribution issues: I haven't yet found an effect size calculation that accounts for, say, kurtosis.
    So I guess my suggestion is to see if your work is reproducible, and if not, start making changes–including increasing group size.
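    A rough way to see the replication problem in numbers (a simulation sketch under textbook assumptions of normal, equal-variance groups, which as noted above may themselves be optimistic): even for a genuinely large effect, a small group size means a repeat experiment will often fail to reach p<0.05.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        def power(n_per_group, d, n_rep=10_000, alpha=0.05):
            # Simulated chance that a two-sample t-test reaches p < alpha for a true effect of size d
            hits = sum(stats.ttest_ind(rng.normal(0, 1, n_per_group),
                                       rng.normal(d, 1, n_per_group)).pvalue < alpha
                       for _ in range(n_rep))
            return hits / n_rep

        # Cohen's d = 1 is a large effect, yet small groups replicate it poorly
        print("n=5 per group: ", power(5, 1.0))    # roughly 0.3
        print("n=10 per group:", power(10, 1.0))   # roughly 0.55
        print("n=17 per group:", power(17, 1.0))   # roughly 0.8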

  48. Nick K says:

    A very minor point: DC’s name is spelled “Colquhoun”, and pronounced “Cohoon”.

  49. J. says:

    I think this is just another instance of the base rate fallacy, because in very few cases is there actually anything interesting to be found. The p=0.001 suggestion for a 5% false discovery rate would indicate that there are about 20 in 1000 cases where the null hypothesis is false. Thankfully, Wikipedia has an example with just the right numbers for p=0.05:
    https://en.wikipedia.org/wiki/Base_rate_fallacy#Example_2
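    Spelling out that back-of-the-envelope estimate (a sketch that assumes essentially perfect statistical power, which is what makes the numbers come out as stated; lower power pushes the false discovery rate back up):

        # Base-rate arithmetic: ~20 real effects per 1000 tests, p < 0.001 threshold, power ~1
        n_cases, prevalence, alpha, power = 1000, 0.02, 0.001, 1.0

        true_pos = n_cases * prevalence * power          # 20 real effects found
        false_pos = n_cases * (1 - prevalence) * alpha   # ~1 false alarm
        print(false_pos / (false_pos + true_pos))        # ~0.047, i.e. about 5% false discoveries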
