I’d like to recommend this article from *Nature* (which looks to be open access). It details the problems with using *p*-values for statistics, and it’s simultaneously interesting and frustrating to read. The frustrating part is that the points it makes have been made many times before, but to little or no effect. *P*-values don’t mean what a lot of people think they mean, and what meaning they have can be obscured by circumstances. There really should be better ways for scientists to communicate the statistical strength of their results:

One result is an abundance of confusion about what the P value means. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking, 'How much of an effect is there?', not 'Is there an effect?'”

The article has some suggestions about what to do, but seems guardedly pessimistic about the likelihood of change. The closer you look at it, though, the more our current system looks like an artifact that was never meant to be used in the way we’re using it.

It’s worth taking a look at this statistician’s response to that article. Short version — there aren’t great, easy alternatives to p values. Understanding the alternatives can be hard, and misusing them — just like p values — is easy. Their favored partial solution is for researchers to understand more statistics, which (writing as someone who almost has his biostatistics degree) sounds good to me, except that it would also be nice if people ate more vegetables and drove more carefully, and how’s that coming along?

There have been many, many, many articles similar to this over the years. And I agree, what does statistical significance even mean? It doesn’t mean that your results are repeatable and it doesn’t mean that your results are important. You can make almost anything statistically significant if you just keep testing and make your sample sizes larger. Unfortunately, try arguing, when you submit a journal article, that p values and statistical significance shouldn’t matter as much as a large observed effect. No one will publish it if you don’t have *, **, or *** on your bar graphs.

Also, how many ANOVAs out there are wrong that were used to show statistical significance on a set of data? The major assumption people make with an ANOVA test is that the data are normally distributed. Let’s be honest, how many people check to see if their data really are normally distributed before picking a test to show p less than 0.05?
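To make that concrete, here is a quick sketch of checking normality before committing to ANOVA, with a rank-based fallback. Python with scipy is used purely for illustration, and the data below are simulated, skewed samples (an assumption for the example, not anything from the thread):

```python
# Hedged sketch: check each group for normality before running ANOVA.
# The data are simulated lognormal (i.e., skewed) samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.lognormal(mean=0.0, sigma=0.5, size=40) for _ in range(3)]

# Shapiro-Wilk: a small p-value here suggests the data are NOT normal
looks_normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)

if looks_normal:
    stat, p = stats.f_oneway(*groups)   # classic one-way ANOVA
else:
    stat, p = stats.kruskal(*groups)    # rank-based Kruskal-Wallis instead
print(f"looks_normal={looks_normal}, p={p:.3f}")
```

(Using a normality test as a gatekeeper is itself debatable, but it beats never checking at all.)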

According to Wikipedia, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true”.

Is that wrong? If not, how does it differ from how people tend to mis-interpret it?

@anonymous,

The mistake commonly made is thinking that the p-value is also a measure of the likelihood the underlying hypothesis is true (or false). It’s not.
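A back-of-the-envelope calculation makes the distinction vivid. The numbers below (a 10% prior chance of a real effect, 80% power) are purely illustrative assumptions, not figures from the article:

```python
# Illustrative assumption: only 10% of tested hypotheses are real effects,
# and studies have 80% power. Then p < 0.05 is far from "95% sure".
prior = 0.10   # assumed fraction of hypotheses that are actually true
power = 0.80   # assumed chance a real effect reaches p < 0.05
alpha = 0.05   # chance a true null reaches p < 0.05 anyway

true_hits = prior * power          # real effects that come up significant
false_hits = (1 - prior) * alpha   # nulls that come up significant
posterior = true_hits / (true_hits + false_hits)
print(f"P(real effect | p < 0.05) = {posterior:.2f}")  # prints 0.64
```

So under these made-up but not crazy base rates, more than a third of “significant” findings are false alarms, which is exactly the headache-versus-brain-tumour point from the article.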

I guess that depends whether we are talking about objective or subjective probabilities, but philosophically, what’s the difference since all we know is what we know?

I’ve seen a lot of this argument over the years and I am absolutely open to change. However, all I read is statisticians telling me how wrong p-values are or arguing amongst themselves. Where is the practical solution that replaces what I do already?

How many statistical tests does the garden-variety pharmacologist really need anyway? Is this thing really bigger than that thing? How about that thing? Maybe I need to compare some count data every once in a while.

Clearly explain a better way to do this that still allows data collection on real world time scales, and I have zero problem in doing it.

^ “like!”

Maybe use the same metric but just call it a “bs-Value”?

One important thing to note is that statistically significant differences may not always translate to real-world significance. For instance, with two cancer drugs, one of which gives a p value of 0.005 in a clinical trial while the other gives 0.002, the balance may still tip in favor of the second drug, depending on the severity of the disease and the current options for therapy.

A great book – although a little too polemical – about the misuse of p values and others measures of statistical significance is “The Cult of Statistical Significance”.

I hope no one is looking just at the P values. If the first drug extends life in the treatment group by 2 years (p less than 0.005) and the other by 1 month (p less than 0.002), then you’d better be taking the first one.

At the last Gordon CADD meeting there was a lot of discussion on this issue. I’m always flabbergasted at the extent of misunderstanding regarding p values. And I was blown away by the degree to which, historically, p less than 0.05 has become some kind of magic threshold.

A few points you learn in an intro stats course

1. Large samples are often treated as having a ‘normal’ distribution (via the central limit theorem), but that is an assumption worth checking

2. Effect size should always be taken into account

http://en.wikipedia.org/wiki/Cohen%27s_d#Cohen.27s_d

For a t-test you use Cohen’s d. It’s well known that large samples will ‘look’ significant, so you always need to run another calculation for effect size. Why someone would publish a paper without doing this is a bit crazy.

3. The non-parametric version of the t-test is the Wilcoxon signed-rank test for paired data or the Mann–Whitney U test for independent data. These tests use rank orders and therefore avoid issues of non-normal distributions or unequal variance between groups (which Levene’s test checks)

It seems well known how to handle effect size and lack of a normal distribution. All the calculations can be done in R in seconds.

Intro stats courses using R are offered free online. There’s no excuse to publish something and not at least try to do it correctly.
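For anyone without R handy, the same habit, a p-value paired with an effect size, looks like this in Python (scipy/numpy assumed; the data are simulated, with a deliberately tiny real difference and a large sample):

```python
# Sketch: always pair the t-test p-value with Cohen's d.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(50.0, 10.0, size=20000)
b = rng.normal(49.0, 10.0, size=20000)

t, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd       # Cohen's d

u, p_mw = stats.mannwhitneyu(a, b)          # rank-based alternative
print(f"p = {p:.2g}, Cohen's d = {d:.2f}")  # tiny p, but d is only ~0.1
```

The point is visible at a glance: the p-value is microscopic, but a d around 0.1 is a small effect by any conventional benchmark.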

#8: As far as the ‘BS-value’ goes: if you know what you’re doing, how could anyone ‘BS’ you?

When we do urinalysis in our preclinical studies, the results from those tests are also p values.

Anonymous @3:

Let’s take as a simple example a test conducted by flipping a coin 10 times. If we take as the null hypothesis that the coin is unbiased, then if I get 10 “heads” in flipping, the p-value of the experiment is p = (1/2)^10, less than 0.001.

Many researchers in the world are doing what is analogous to flipping a coin in many different ways. So, someone may end up getting 10 consecutive “heads” even if the coin is unbiased. And the same researcher may stop flipping and report the result, rather than flipping the coin 100 more times.
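For the record, the arithmetic in the coin example is a one-liner (Python/scipy, for illustration; the two-sided version doubles the one-sided 1/1024):

```python
# Exact binomial test: 10 heads out of 10 flips, null = fair coin.
from scipy.stats import binomtest

res = binomtest(k=10, n=10, p=0.5, alternative='two-sided')
print(f"p = {res.pvalue:.4f}")  # 2 * (1/2)**10, i.e. about 0.0020
```

With thousands of labs effectively flipping coins, a few will hit this threshold by chance, which is exactly the commenter’s point about selective reporting.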

One difficulty is that we use the word ‘significant’ in more than one way. So really the answer to the question, “Is this result significant?” has two parts. One is the magnitude of the effect compared to the control. Is a 1.2-fold stimulation significant? That depends on the experimental system. The second part is the statistical significance. If p is less than 0.2 then the result is not significant in the second sense. The phrase “the result is statistically significant” does not mean that the result itself is significant.

^ Right, it’s a function of the difference between population means, as well as lack of overlap between population tails.

“an artifact that was never meant to be used in the way we’re using it” describes much of civilization.

Science News ran an article (Odds Are, It’s Wrong) about this problem in the March 27, 2010 issue.

#17: “If p is less than 0.2 then the result is not significant in the second sense.” It depends on how much less. p = .000000001 is less than .2 but is highly significant. The correct statement is that if p is greater than .05 the result is not significant.

don’t usually comment here but thought i might direct your readers to a classic on the same subject – Cohen’s “The Earth Is Round (p < .05)”. might be a little dated, and aimed towards social scientists, but the analysis is excellent.

ultimately the solution to this issue is to abjure the use of frequentist statistics, when possible, in favor of bayesian hypothesis testing.

The simple solution would be to say:

“Down with p less than 0.05!”

If p less than 0.005 is used instead, many problems would disappear, and p less than 0.001 would be even better.

Even if you take at face value the common assumption that p = 0.05 means a 95% probability of truth, the results are completely unacceptable: Nature has about 15 papers per week, 750 per year. p = 0.05 would mean that about 38 of them are wrong. Even just using this assumption, Nature should have a p limit of 0.001.

Alternatives? One would need to be very lazy not to find alternatives. One example:

http://www.indiana.edu/~kruschke/BEST/

@23 thanks! Very interesting.

Although I think it is a bit of a stretch to expect someone who just needs to compare two means to 1) read the Journal of Experimental Psychology, 2) be able to assess the relative merits of this arcane website’s method against others, and 3) install an R module to carry out the actual analysis.

Not a biologist, so perhaps I’m not entitled to an opinion on this.

But when I look at epidemiological studies, I ignore all those that fail to give both a reasonable P value *and* a relative risk of at least 3.

Because systematics are at least as important as random errors.

Besides, that second criterion means I can skip reading about 99% of all epidemiological studies, and go do something more interesting, like watch paint dry.

One of the biggest reasons to beware of p-values is when looking at large clinical outcome trials. The large N in those studies acts as a “black hole”, so that in the limit, everything is significant. I have a nice example of this I show students – two populations, one with a mean of 49, the other with a mean of 50. Given a large enough sample size, you can make the p-value for the comparison of the two populations sink into the basement (and I am talking about do-able samples of 5-10K – about the size of a typical cardiovascular outcome trial). A Bayesian analysis will not show this – a larger sample size will simply make the two distributions look more and more the same. The bottom line is that, no matter what, we must use our brains to interpret data; there is no “automated” or single “theory of everything” calculation that you can do to declare something significant or not.
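That classroom demonstration is easy to reproduce. A sketch (Python with scipy; the means of 49 and 50 come from the comment above, the SD of 10 is an assumption, and all data are simulated):

```python
# The "black hole": the same 49-vs-50 gap; p collapses as N grows
# while the standardized effect size stays roughly constant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (100, 1000, 10000):
    a = rng.normal(49.0, 10.0, size=n)
    b = rng.normal(50.0, 10.0, size=n)
    t, p = stats.ttest_ind(a, b)
    d = (b.mean() - a.mean()) / 10.0   # effect size barely moves
    print(f"n={n:>5}  p={p:.1e}  d={d:.2f}")
```

By n = 10,000 per arm the p-value is deep in the basement while d sits near 0.1, a small effect throughout.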

Here’s a basic stat lesson to better understand what p is and how it can be misleading

H0 (null hypothesis) – a statement about an attribute/model equation that provides the prediction. Generally what you want to disprove.

Ex: the coin is fair (50/50), A = 0, A > B

H1 (alternative hypothesis) – the direct opposite of H0. A low p is taken as evidence for this (assuming the experiment is done correctly).

Ex: the coin is not 50/50, A != 0, A not greater than B

Bad ex: the coin always flips heads, A = 2, A = B

So for @15, 10 heads will produce a low p, which indicates that such data would be unlikely if H0 were true (the coin was fair), and so supports H1 (the coin is not fair).

However, here are some possible issues even if p is low

Bad H1: a low p can only support the exact opposite of H0, and it is hard to prove you captured all other possibilities (not-B does not imply A; it might be C. Not B-or-C does not imply A; it might be D, etc.)

True but the amount is not relevant: typical coins are not actually fair – the heads side is slightly heavier and thus slightly more likely (50.5/49.5 or something)

Multiple (and hidden) trials: 95% confidence means that if you run 20 different sets, odds are 1 will still come out with a low p

Extraneous vars: A + B + C explain X. But so do just A + B

Too many vars: A + B + … Z can “explain” complete randomness.

And for statistics in general:

Bad method for experiment/collection of data: the coin is fair, but the flipping is not – you can predict the flip from how much it spins and how high you drop it (ex: buttered toast “always” falls face down)

Bad sampling: randomly choosing through phone numbers – but the population that has phones (and will answer them for a survey) is biased.

Confounding/hidden variables: A > B because A -> C and C > B

Note: most of these are for ALL statistical methods. Garbage in, garbage out. Or rather- biased assumptions often prove themselves. There are other tests to reduce certain faults, but none to eliminate them and none so simple as p value.
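Of the issues listed above, the “multiple (and hidden) trials” one is the easiest to verify by simulation: with 20 comparisons on pure noise, the chance of at least one p < 0.05 is 1 - 0.95^20, about 64%. A sketch (Python/scipy, all data simulated):

```python
# Simulate a lab running 20 independent null comparisons per "paper"
# and count how often at least one comes out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
trials, hits = 1000, 0
for _ in range(trials):
    ps = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
          for _ in range(20)]
    hits += min(ps) < 0.05   # any false positive in this batch?
print(f"at least one false positive in {hits / trials:.0%} of runs")
```

Under independence the analytic answer is 1 - 0.95**20 ≈ 0.64, so the simulation should land near 64%.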

Here’s a look at these issues from a regulatory perspective:

http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003658.pdf

Whether you use a parametric or non-parametric statistical test, and whether you set your p threshold at 0.05, 0.07, or 0.001, the question remains for any “statistically significant” effect: is the effect also of *practical* significance?

If a weight-loss pill causes a statistically significant loss of 0.5 +/- 0.2 kg (95% C.I.), would you buy it?

If activity X decreases the life expectancy of a healthy person by 4 +/- 2 weeks, should you make a point of avoiding X?

If a new logP model increases prediction accuracy by 0.05 +/- 0.03 log unit over the model you currently use, should you replace your current model?

One of the themes of the book *The Cult of Statistical Significance* (mentioned by another commenter) is that “size matters”. That is, the magnitude of an effect is at least as consequential as its statistical significance, and by focusing exclusively on the latter, we can fool ourselves into thinking that unimportant effects are important or vice versa.

@24, Rasmus has a very nice and convenient web app linked from Kruschke’s website if you are not familiar with R for this type of BEST data analysis. BTW, Kruschke’s book *Doing Bayesian Data Analysis* provides more fundamental details. It is fun to read, even for a scientist or engineer.