There’s clearly something wrong with the way that statistics get handled and interpreted in scientific studies, and there have been many warnings. But change in this area is a hard thing to bring about. *Biocentury* has a good interview with someone who can tell you about that, John Ioannidis, of scientific reproducibility fame. He’s recently set off a lot of comment with a proposal to lower the threshold for “significance” to p < 0.005, and as you’d imagine, there are some strong opinions on that. His estimate is that this would shift about a third of the published biomedical results to “suggestive” rather than significant, and his take on this is, basically “good riddance”. According to Ioannidis, a lot of those are false positives anyway, and a lot of the ones that are real are not really useful. Here’s the root of the problem:

BC: What is the correct way to think about a p-value?

JI: A p-value is the probability or the chance that you’ll see a result that is so extreme as the one that you see if the null hypothesis is true and if there is no bias. Many people say a p-value is the chance of the null hypothesis being wrong, of some effect being there. This completely ignores the fact that you have these two ‘ifs’ that are required to interpret the p-value. So it is the chance of seeing such an extreme result, or something even more extreme, if the null hypothesis is true and if we have no bias.
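To make that definition concrete, here is a minimal sketch using a coin-flip example (the coin, the counts, and the function name `binomial_p_value` are illustrative assumptions, not anything from the interview). The null hypothesis is "the coin is fair", and the p-value is simply the chance of a result at least as extreme as the one observed if that null is true and nothing is biasing the flips.

```python
from math import comb

def binomial_p_value(heads: int, flips: int) -> float:
    """One-sided p-value: P(at least `heads` heads in `flips` fair tosses)."""
    # Under the null (fair coin, no bias) every sequence has probability 2**-flips,
    # so we only need to count the sequences that are as extreme or more extreme.
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips

# Hypothetical observation: 60 heads in 100 flips.
p = binomial_p_value(60, 100)
print(f"P(>= 60 heads in 100 fair flips) = {p:.4f}")
```

Note what the number does not say: it is not the probability that the coin is fair, and it is only meaningful if both "ifs" hold.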

And he’s upfront about the fact that his suggested new threshold is not the final answer, but just an interim step towards larger reforms. He wants to see better experimental and trial designs (Bayesian and traditional), larger sample sizes, and more attention paid to potential sources of bias and to effect size. In fact, if people just go wild for the new, stricter p-values, it’ll be a loss:

BC: Don’t you have a concern that by moving to a smaller p-value you are reinforcing the fetish about p-values and driving people to create enormous data sets? With data sets that are big enough, you can get very small p-values, but that doesn’t mean the results mean anything.

JI: Absolutely. And this is why 80% of the literature should not be using p-values, maybe 90% of the literature. With huge amounts of data, with big data, using a p-value means nothing, because everything can be statistically significant at 0.05 or even at 0.005.

But the question is, how do you change the mentality of people who are kind of automatically hooked to using a single magic number? It’s not that the alternative approaches that are better, like using effect size and confidence intervals or using Bayesian statistics, are recent. They have been out there for a long time, but their adoption, even though it is accelerating, has not been what we want.
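The big-data point is easy to check with a back-of-the-envelope calculation. The sketch below assumes a two-sample z-test with unit variance and an observed standardized difference of exactly 0.02, a size most people would call negligible; both the test and the numbers are illustrative choices, not something from the interview.

```python
from math import erfc, sqrt

def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test, assuming the observed
    standardized mean difference equals `effect_size` exactly."""
    z = effect_size * sqrt(n_per_group / 2)
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))

# The same tiny effect at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p(0.02, n):.3g}")
```

The effect never changes; only the sample size does. With a million subjects per arm, the p-value clears 0.05 and 0.005 alike by a wide margin.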

As stated above, he thinks that p-value is the right tool for the job about 20% of the time, whereas Bayesian measures could be useful half the time, maybe even more. One instant objection to the idea of lowering p-value cutoffs is that it will make everything much more expensive and time-consuming (clinical trials, first and foremost), but his contention is that it doesn’t have to be that way. Better designs and more appropriate measures could cancel a lot of the brute-force bigger-sample stuff – and you get the impression that he’s deliberately pushing people towards those by making it too expensive to do it the old way. “We have some very strong tools that we don’t apply”, says Ioannidis, and he’s going to try to make people apply them, one way or another. . .

Easy for him to say, since he does not have to look for trends in, for example, human populations, and only get grants if those trends exist (at p < 0.05).

You gotta pump those numbers up, those are rookie numbers.

You cannot fix the reproducibility problem without fundamentally changing the incentive structure of modern science.

“You cannot fix the reproducibility problem without fundamentally changing the incentive structure of modern science.”

This. If the ability to pay my mortgage depends on getting p of less than 0.05, and p greater than 0.05 means I’m a terrible scientist and a failure, then the outcome will always be as it is.

Yeah. Right. So let’s do that. Let’s “fundamentally change the incentive structure of modern science”. You start. I’ll just be in the corner, finishing up this cold fusion stuff.

I saw a presentation a while back where the presenter (a statistician) strongly suggested that the “first pass” at a statistical analysis should be a manual one. Pile up your data, put it in a big flat file, and LOOK at it. No stats packages for the initial stage of the process. See if anything pops out at you. If it doesn’t, you might not have anything useful.

Makes you wonder if the big stats packages need an “I’m not really a statistician” lockout to keep regular scientists from using all of the obscure filters and other junk in order to find a “significant” result.

Or in the words of Ernest Rutherford: “If your experiment needs statistics, you ought to have done a better experiment”

As my now-deceased PhD mentor would say, “No one needed statistics to know that penicillin worked on the pneumococcus.”

I think this is a very standard approach. Look at box plots, graphs of different varieties, and see if the data tell you anything. At least that’s what I do. Only after something interesting *appears* to be there, do you then attempt to quantify it with a p-value. But it needs to meet an eyeball and common sense test first.

This is a great point. We were looking at mosquito feeding rates, about as noisy a data set to be found this side of SETI. We consulted with a mathematician who offered similar advice: Separate your “results” into bins, and then plot it to see if you have something that looks like a normal distribution. It takes 5 minutes from scratch with Excel. Obviously (?) you do more formal tests for distribution later, but you should be able to pass a simple gut check for distribution before you start torturing your data in GraphPad’s Iron Maiden…
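For anyone without Excel to hand, the same five-minute gut check can be sketched in a few lines of Python. The feeding rates below are made up purely for illustration, and `bin_counts` is a hypothetical helper name; the only point is to bin the values and look at the shape before reaching for a formal test.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Made-up feeding rates, not real data.
rates = [2.1, 2.4, 2.9, 3.0, 3.1, 3.3, 3.4, 3.6, 3.8, 4.0, 4.1, 4.7, 5.2]
for i, count in enumerate(bin_counts(rates, 5)):
    print(f"bin {i}: {'#' * count}")
```

If the row of hashes looks nothing like a hump, that is worth knowing before any test that assumes normality.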

In other words, if you’re going to take a journey of a thousand steps, for god’s sake, make sure the first one is going in the right direction…

-t

So, if a p-value is the right tool 20% of the time and Bayesian measures could be useful half the time, then what is the correct tool for the other 30% of the time?

You are ignoring the possibility of overlap where either p-value or Bayesian measures can be useful. In the worst case, other tools may be needed half the time.

The other 30-50% of experiments fall into three bins:

1) More elaborate or unusual testing is a valid and justifiable approach to meet the needs of a particular, well-thought-out experiment. This bin is probably smaller than most people think.

2) The nature of the experiment is such that the result is qualitative and pretty unambiguous, and trying to hang a statistical score on it is wasting time. (Is the band present or absent in a Western blot? Does the IHC show a major change in localization? Etc.) These data often support or are supported by more quantitative results.

3) The experiment was poorly designed, and the question of appropriate statistical tests is moot. Someone set things up to get to p <0.05 on a readout as quickly and cheaply as possible; or someone built a study without thinking in advance about the analysis at all.

A professor I took a math course from a few years ago was in a panic because Harvard Medical School was going to require that all applicants had taken a statistics course (meaning that all medical schools would probably follow). People able to teach it were thin on the ground and he asked me if I knew any. The one I knew worked at a local hospital and demurred.

I think this is a good idea particularly for the medical literature.

It would be a very useful study to take a list of approved drugs generally accepted as useful and see what the distribution of their p-values for efficacy was in, say, Phase-III clinical trials.

People probably need to be looking more at size of effect because even the most anemic of trends acquires impressive levels of significance when powered by enough data. In the area of drug-likeness & compound quality, p-values are sometimes used because continuous data has been transformed to categorical data prior to ‘analysis’ (which may consist of ‘expert’ comment on pictures). This practise (also known as binning) is specifically discouraged by current J Med Chem. I have linked a practical guide for making correlations look stronger than they are as the URL for this comment.

So if using lower p-values really does result in higher quality clinical data (and presumably fewer late-stage failures when more patients are involved and more robust data are generated), then there will be no need to “sell” the idea of tightening up p-criteria. Clinical organizations will naturally flock to the idea as a way of lowering pan-portfolio development costs, and driving better fast-to-fail decision-making, right? Anyone? Bueller?

I remember attending the NHLBI Cardiovascular Regeneration meeting a few years ago. This is when sticking bone marrow cells into the hearts of cardiac patients was all the rage (based on a single Nature paper using a mouse model that was never reproduced, but that’s another story). Speaker after speaker stood up and gave their clinical trial results showing statistically significant increases in the left ventricular ejection fraction (LVEF). Finally, an old German doctor stood up and in heavily accented English said: “Statistically significant, clinically insignificant”. The silence that followed was deafening because he hit the nail on the head – everyone was looking to publish and none of these doctors considered the fact that the small difference had no real-world benefit to their patients. I like Anders’ quote from Rutherford – your experiments (and clinical trials) should only look at large effects and not need statistics to demonstrate a difference.

The old German doctor was talking about size of effect…

He was saying that you can have a treatment that reaches statistical significance but is absolutely meaningless in terms of patient treatment. In contrast, this type of result needs no statistics whatsoever. That’s the difference.

https://www.sciencedaily.com/releases/2018/04/180416101413.htm

You don’t need to be an old German dude to know this. It’s taught to everyone who takes a few data science courses.

People just want to publish. They know the data is meaningless, they just don’t care or conveniently ignore reality.

If clinical significance were a criterion, I assume many cancer drugs would go out of the window.

I think the reliance on p-values stems in part from the fact that the process is rather straightforward to perform and interpret. “I have this sort of data, I therefore run it through this defined statistical test, and I get out a p-value where *X* means *this* and *Y* means *that*.” The only uncertainty is finding the statistical test which matches your type of data.

Compare that with the alternatives, which are often just a whole bunch of nebulousness. “How do you interpret this data?” “Well …” say the statisticians, hemming and hawing, “it really depends.” And, true, it may depend, but there’s often very little guidance on how to parse out the depending. It comes off as seeming like statistics is not a logical process with defined rules, but rather an ad hoc, bespoke, artisan endeavor, where statistical interpretations need to be lovingly hand crafted by skilled experts steeped in occult knowledge. (I exaggerate for effect.)

A good example of this is Bayesian approaches, where you’re immediately confronted by the choice of prior. “How do you choose the prior?” “Well …”, say the statisticians, looking uncomfortable, “that’s a good question.” Followed by a bunch of hand waving, hedging and nebulousness. — (I’ve finally come to understand that, if your results are decent at all, it shouldn’t really matter which (reasonable) prior you use. But whenever Bayesian methods are discussed much ado is made about priors, so good luck finding anyone who actually comes out and says that if the choice of priors actually matters, you’re doing something wrong.)

When confronted with the nebulousness and uncertainty of the alternatives, is it any wonder that people reach for the comfort of the straightforwardness of p-values?
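The parenthetical about priors can be illustrated with the simplest Bayesian model there is: a success rate with a Beta prior. The three priors and the trial counts below are assumptions chosen for illustration, and this is only a sketch of the general idea, not a claim about any particular analysis.

```python
def posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Three "reasonable" priors (illustrative choices).
priors = {"uniform": (1, 1), "Jeffreys": (0.5, 0.5), "sceptical": (5, 5)}

# The same 80% observed success rate at three sample sizes.
for successes, trials in ((8, 10), (80, 100), (800, 1000)):
    means = {
        name: posterior_mean(successes, trials, a, b)
        for name, (a, b) in priors.items()
    }
    spread = max(means.values()) - min(means.values())
    rounded = {name: round(value, 3) for name, value in means.items()}
    print(f"n={trials}: {rounded} spread={spread:.3f}")
```

With ten observations the priors visibly disagree; with a thousand they are nearly indistinguishable, which is the sense in which the choice should not matter much once the data are decent.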

Is it nebulous because there is no unambiguous answer (to determine the significance of a result), or because methods to find an unambiguous answer have not been made clear (or as clear as possible)? The point of science is to prevent you from fooling yourself, so fooling yourself into feeling like you know something when you don’t is a failure.

Of course, if fooling yourself with statistics is necessary for funding and hiring decisions (or at least that it can’t or won’t be distinguished from actual results by the relevant authorities), well, then science isn’t really relevant anymore. If being hired and fired and doing actual science are (at best) on skew planes, then there’s a problem that changing statistics won’t help.

This will not be news to many readers of this blog, but it’s worth remembering that the standard at FDA (at least in CDER, the small-molecule arm) has long been even more conservative than the p < 0.005 advocated by Ioannidis. The regulations call for "adequate and well-controlled trials [note the plural]," and getting p < 0.05 twice, with both results in the same tail of the distribution, means p < 0.00125.
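The arithmetic behind that 0.00125 figure, spelled out as a quick sketch. It assumes the two trials are independent and that the null hypothesis holds in both; the variable names are just labels for the steps.

```python
alpha_two_sided = 0.05
one_tail = alpha_two_sided / 2              # 0.025 in each tail of one trial
both_in_given_tail = one_tail ** 2          # both trials land in one named tail
both_in_same_tail = 2 * both_in_given_tail  # either tail, as long as they agree

print(f"both in a specific tail: {both_in_given_tail:.6f}")
print(f"both in the same tail:   {both_in_same_tail:.6f}")
```

If only the favourable direction counts, the chance under the null is half of that again, 0.000625.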

As the disease-splitters continue to triumph over the disease-lumpers, this will have to change. Suppose that p < 0.00125 for a clinically-trivial result in a 600-patient trial. Suppose further that what really happened is that the treatment was terrific in 1% of the patients, but useless in the rest, and you can figure out what was special about the 1%. You could then redo the trial, screening another 600 patients and enrolling 6. Now try to get statistically significant results out of that.

While I agree with a lower p threshold (to remove lots of chaff), sadly, you can’t solve p-hacking by changing p. In an era of copious data, you can always find more variables to test until one hits. (And expecting non-statistical scientists to become proficient in statistics is a pipe-dream.)

A better, incremental proposal would be to encourage more validation procedures such as training/test sets, cross-validation, and randomization tests – procedures that are well within the capability of most scientists, are more interpretable, and gently encourage collecting “enough” samples to leave a lot out for a test set.
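As a rough illustration of how little machinery a randomization test needs, here is a minimal sketch for a difference in group means. The function name, the default of 2000 shuffles, and the fixed seed are all illustrative assumptions; a real analysis would pick the test statistic and the number of shuffles to suit the experiment.

```python
import random

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided randomization test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Two clearly separated made-up groups.
print(permutation_p_value(list(range(1, 11)), list(range(11, 21))))
```

The logic is the whole interpretation: if the labels did not matter, how often would shuffling them produce a difference at least this big?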

Well that’s obviously why it’s best not to report any error bars and don’t bother to verify your new method by an established traditional method. Cuz who cares, you’re getting a paper brah!!

“But the question is, how do you change the mentality of people who are kind of automatically hooked to using a single magic number? It’s not that the alternative approaches that are better, like using effect size and confidence intervals or using Bayesian statistics, are recent. They have been out there for a long time, but their adoption, even though it is accelerating, has not been what we want.”

Well, who’s to blame for that? Who’s been teaching frequentist statistics for the last 50 years, and stating emphatically that a number designed to be used informally (along with the means, as well as mechanistic data) has suddenly become the most important outcome of a study? Who’s been incredibly resistant to Bayesian methods, only loosening up a bit over the last 10 years? Look in the mirror, statisticians, and you will see the problem.

The statisticians I’ve worked with have spent years trying to push people away from using p-values, towards reporting confidence intervals and effect sizes. One used to have a favourite point that your null hypothesis was almost certainly falsifiable at *some level*, the question is whether it’s a difference that matters or not (clinical versus statistical significance).

Dropping p thresholds for significance isn’t really the answer. Ioannidis’s big reveal was that if you assume a certain proportion of failed studies then given a particular significance criterion you will have a proportion of failed studies passing as significant; this came as a shock to nobody who actually understood statistics. But simply reducing the threshold is basically Bonferroni correction, it will also kill the power of any study that is trying to falsify a false null hypothesis (from a non-statistician’s point of view: a study that should succeed). Bayes isn’t necessarily the answer either, it often requires you to make decisions to get the answer, meaning your result packages up more assumptions.

If you start doing power calculations based on p=0.005 many small studies will be killed off. You may think this is a good thing, but not everything is a clinical trial of a medicinal product. Some things are pilots, some things are MSc psychology projects, one size does not fit all. Reporting confidence intervals is a very good way to start summarising the uncertainty of estimates even if it’s not perfect. Particularly you still need to consider p if you have to get into multiple comparisons correction…or start to try to explain covariances. (I’ve seen papers that only reported p-values for model parameters, not even the effect size estimates, what use is this to anybody?)

That went very italic! It was meant to stop after “at some level”. Let this be a warning to everyone to remember to close your tags…
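The power point can be put in numbers with a rough sketch. It assumes a two-sided, two-sample z-test under a normal approximation, with a standardized effect of 0.5 and 64 subjects per group; those figures are illustrative choices, not taken from the comment above.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test (normal approximation;
    the negligible chance of rejecting in the wrong direction is ignored)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)

# Same study design, two significance thresholds.
for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: power ~ {approx_power(0.5, 64, alpha):.2f}")
```

Under these assumptions a design with comfortable power at 0.05 becomes close to a coin flip at 0.005, unless the sample size goes up to compensate.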

Effect size is a great tool when you have a real breakthrough new drug in a disease area where the standard of care is poor. For example in the field of hepatitis C, the regimens used before 2010 gave you about 50% cure rate, and now with the new drugs they are >95%. You pretty much don’t need stats when the regimen works on nearly every patient.

Tell it to the animal welfare and funding agencies. In my experience, any proposal with alpha <0.01 is viewed as extravagant.

What, no one has invoked Mark Twain yet? “There are three kinds of lies. Lies, *damned* lies, and statistics.” (Note that when Twain wrote that line, damned lies were fighting words, bordering on fight-to-the-death words.)

Statistics is also right up there with radiological control math. In the US Navy, there are 4 different career fields that work on the reactor.

– RCDiv is Reactor Controls Division, you ask them what 2+2 equals and they’ll tell you “4.0000”.

– MDiv is the Mechanical Division, you ask them what 2+2 equals and they’ll tell you “4”. MDiv is in charge of the pipes and turbines.

– EDiv is the Electrical Division, you ask them what 2+2 equals and they’ll tell you “about 4”. EDiv is in charge of the generators attached to the turbines.

– RLDiv is the Reactor Laboratories Division, you ask them what 2+2 equals and they’ll ask you, “Well, what do you *want* it to equal?”. RLDiv is in charge of tracking how much radiation everyone has received.

And who is paying for all the animals? And who is changing the minds of regulatory organisations when you suddenly ask for five or ten times the number of animals?

Exactly. On top of that, the ability to perform the experiment even if those things could be overcome would be next to impossible. The solution to my mind is to view all papers as suggestive. If repeated by other groups, then we can think about the finding to be “true”. That should be where the replication comes in (I’d argue that’s a much better test of reproducibility anyway). It means a shift in our thinking, moving away from assigning credit before we know something is true. No NYT reports until we have a greater consensus and less hero-worship of people for flashy findings.

I think we’ve lost sight a little bit of what a research paper should be. In my opinion, each individual paper doesn’t need to be an unimpeachable entity. Instead, the conclusions of the field as a whole over time are what we should be focused on; that’s what needs to be accurate. By actually being interested in and promoting the self-correcting nature of science (as opposed to now, where it’s suppressed in various forms) we could get back on track. This means giving credit in a different, but I think fairer, fashion. If you’ve performed your experiments as reported, and you’re not excluding any data that doesn’t fit your narrative, you’ve run controls etc., then it shouldn’t be a problem to be wrong. Over time the truth will come out. It’s only starting to be real when someone else also finds the same thing.

I fully agree! There has been way too much press about the reproducibility crisis. I don’t know why we would expect every paper to be 100% reproducible. When multiple labs demonstrate the same general principle then we can interpret it as a ‘true’ finding. Changing statistical reporting norms wouldn’t affect this.

I recall a lab that had very consistent and intriguing cell based data for years that other groups struggled to reproduce. It turns out they always got their serum from the same source (actually one specific horse!). A better description of the methods would have clarified this early on, better statistics wouldn’t have made a difference. Either way, the lack of replication should have been a clear indication that something was amiss.

Exactly, there are so many variables that can modulate an outcome, that there can be very legitimate reasons for discrepancies that do not involve cheating. Sometimes people are sloppy, sure, but I don’t believe that’s such a huge issue as long as they do like you say and report what they did accurately.

Serum is notorious for this kind of thing. But these things can be interesting biologically too, even if it appears as just a technical issue. There was an instance a few years back where two groups found differences in the amount of a particular T cell population (Th17 cells) in an in vitro differentiation assay, even though they followed the “exact same protocol”. Turns out that one group used RPMI and the other used IMDM as their culture media, which they hadn’t really considered to be all that important (this wasn’t them being dim, I’d say at the time most immunologists weren’t that focused on the media composition). IMDM actually has lots more tryptophan than RPMI, which can be converted to an agonist of the aryl hydrocarbon receptor (AhR) by light, like when it sits in your cold room. Guess what, it turns out that AhR plays a big role in the biology of Th17 cells (It’s Veldhoen M, Journal of Experimental Medicine, 2009) and so the breakdown of tryptophan in their IMDM was the variable that explained the difference in results. What appeared to be a technical anomaly actually turned into something really interesting biologically when followed up (unfortunately they didn’t tell the story exactly as above in the paper; I’ve patched it together from how it was presented at conferences).

That’s a very interesting observation – thanks for putting it in here!

This surprises me a lot. I do a lot of neuroscience primary culture, and the base medium is one of the most debated things.

Bla-I don’t want to slander the entire field of immunology, perhaps there were groups who were keyed in on this! But I think typically there were 2-3 standard media that didn’t seem to modulate T cell differentiation assays very much and were widely used. People typically were focussed more on the conditioning factors they were adding (cytokines etc.). Since that paper, and concurrent with the explosion of work on the effects of metabolism on the quality of immune responses, other factors like the level of glucose have also been shown to be important. It’s funny because as a group immunologists tend to be obsessive about serum source/batches, so it’s certainly odd to be so lax re media composition.

Derek-No problem. I’ve learned so much from this blog that I’m happy to be able to have some knowledge transfer in the opposite direction, however small it is!

Metabolic requirement in immune system is largely context dependent. Cysteine is another one that is critical for purified T cells but not PBMC (due to presence of monocytes etc.). Now, my solution to this complex situation is to add MEM to all medium I use for stimulation.

This is interesting. We are currently growing the same T-cell clones in labs in different countries, and see marked differences in cell growth. Can you elaborate a bit more on this area?

The great thing about p-values is that you can teach them to a fifth grader. The bad thing about p-values is that they don’t teach anything better to PhDs.

As others have mentioned, with a big enough data dredge you can get extremely high statistical significance for extremely small effect sizes, for example the California Coffee causes Cancer kerfuffle of a couple of weeks ago.