
Clinical Trials

Too Big to Be True

I recently wrote a column for Chemistry World on the concept of effect size – the readership there comes from all sorts of chemistry, so it’s perhaps not as familiar a concept, and I thought it worth highlighting. (Briefly, effect size is the difference between the means of your treatment group and control group, divided by the standard deviation – a “standardized” difference between the two. A small clinical trial is likely to reach statistical significance only for things that have a rather large effect size, while a large trial, on the other hand, can at times reach significance for effects small enough to make no real-world difference.)
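As a minimal illustration of that parenthetical (my sketch, not part of the column), here is the standard two-sample effect size (Cohen's d) and a demonstration of how a tiny true effect is invisible to a small trial but "statistically significant" in a huge one, using SciPy's t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cohens_d(treatment, control):
    """Effect size: difference in means over the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(treatment, ddof=1) +
                         (n2 - 1) * np.var(control, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(treatment) - np.mean(control)) / pooled_sd

# A tiny true effect (d = 0.05, trivial in practice)...
small_t = rng.normal(0.05, 1, 40)       # small trial: 40 per arm
small_c = rng.normal(0.00, 1, 40)
big_t = rng.normal(0.05, 1, 200_000)    # huge trial: 200,000 per arm
big_c = rng.normal(0.00, 1, 200_000)

for t, c, label in [(small_t, small_c, "n=40 per arm"),
                    (big_t, big_c, "n=200,000 per arm")]:
    p = stats.ttest_ind(t, c).pvalue
    print(f"{label}: d = {cohens_d(t, c):+.3f}, p = {p:.3g}")
```

The huge trial will reliably return p far below 0.05 for an effect one-twentieth of a standard deviation wide – statistically real, practically meaningless.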

Here’s an excellent blog post on the idea, illustrated by an example that you may have heard about. There was a study a few years ago that seemed to show that judges handed down stiffer sentences right before lunch. The authors ascribed this to hunger, irritability, a desire to wrap things up, etc. But as that post shows, the effect size that the paper found is impossibly huge for a psychological effect:

If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just like manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal – you would already know it exists.

There are other good examples given, along with links to papers that have tried to refute the “hungry judges” story in general. To not enough avail – it still gets trotted out as an example of the interesting and surprising findings that social science and psychology can provide. The point here, though, is that we should be wary of things that look too interesting and surprising, and also look for other causes when we find them. (In this case, one possibility is courtroom scheduling, where complicated cases are scheduled early, while plea bargains, mandatory sentences, and other more open-and-shut items get fitted in before lunch as time allows).

The more startling the result (positive or negative), the more it needs to be interrogated. We have better chances, in biomedical research trials, of producing profound effects, but we still need to be open to all sorts of possible explanations for them. . .


23 comments on “Too Big to Be True”

  1. Pennpenn says:

    Nevertheless, you’d better hope your judge isn’t cranky or hungry when he’s handing down a judgement on you…

  2. Anonymous says:

    The Implicit-Association Test has been taking a lot of heat lately: the IAT is alleged to be a quick, easy way to measure how implicitly biased individual people are (racial bias, gender bias, etc.). Despite the large numbers of data points, critics have pointed out numerous problems with the studies and the statistical analyses, and a failure to provide access to almost 20 years of raw data. I do not see “effect size” on the Wikipedia entry, but the differences the test is supposed to measure are minuscule and not reproducible.

    BUT, it’s psychology research and careers and reputations are at stake so I suppose we should just let it all slide … NOT.

  3. tangent says:

    Bless you for writing about effect size! Next can you do a column on Bayesian belief theory?

    You draw a line from “large effect size” to “less plausible a priori”, which certainly has to be true as the effect size goes to infinity, but it’s not really a direct line. In some studies you really do find multiple sigmas of separation. In others, d = 0.2 would be shocking. Identifying “implausibly large” takes a sense of the context of related research, and there’s no shortcut.

    Well, it’s a fair point that some effects would have been familiar and well known to common sense. But that’s something of a – what’s the analogy, not a strawman – it’s a valid example but it’s atypical, so discussion based on it has limited applicability.

    1. NJBiologist says:

      Indeed–if Irving Langmuir were still around, he’d protest that it’s the very small effect sizes that suggest the presence of pathological science. (Although there’s probably a safe middle ground between the infinitesimal and the infinite….)

  4. Synthoj says:

    There was a Stipendiary Magistrate called Lincoln Hallinan in Cardiff, Wales, who had a reputation for enjoying a liquid lunch; his judgements after lunch were so bizarre and harsh that he was lampooned in the satirical magazine Private Eye as Lunchtime O’Hallinan. The Court Clerk had to remind him that 25 years was well outside the guidelines for burglary, that sort of thing. So, an outlier on the hungry-judge effect.

  5. Jacob says:

    Moderately related is the Winner’s Curse: even if an effect is real, the first people to discover it will almost certainly overestimate the effect size.

  6. Isidore says:

    I appeared in court only once, to request dismissal of a speeding ticket, some 20 years ago. That was in North Carolina where it was possible to plead guilty but have a case dismissed by pointing to extenuating circumstances or to one’s previous unblemished record (it was called a “prayer for judgment” if memory serves), in my case to no traffic tickets ever until that point. It was right after lunch break, and the judge joked that with a full stomach he felt magnanimous.

  7. Kyle MacDonald says:

    Even more than the Bayesian logical objections, this business with effect size is to me the most damning criticism of NHST. To take a line from Andrew Gelman, pretty much no effect is truly zero, because (stealing from Tobler now) everything is related to everything else. Get a large enough sample and you can pretty much always reject what Gelman calls the “straw-man null” of zero effect, unless you’re studying something like ESP. A small trial studying a small effect, on the other hand, can only reach significance by overestimating that effect. Neither small nor large trials are going to help you if “nonzero effect, p<0.05” is your standard of success.
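That last point – that an underpowered trial which does reach significance must have overestimated the effect (Gelman's "Type M error") – is easy to demonstrate with a simulation. This is my illustration, not the commenter's; the specific numbers (true d = 0.2, 20 subjects per arm) are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, trials = 0.2, 20, 5000  # small true effect, small trial

# Run many identical underpowered trials; keep only the "significant" ones.
significant_estimates = []
for _ in range(trials):
    t = rng.normal(true_d, 1, n)
    c = rng.normal(0.0, 1, n)
    if stats.ttest_ind(t, c).pvalue < 0.05:
        pooled_sd = np.sqrt((np.var(t, ddof=1) + np.var(c, ddof=1)) / 2)
        significant_estimates.append((t.mean() - c.mean()) / pooled_sd)

power = len(significant_estimates) / trials
print(f"power: {power:.0%}")
print(f"mean |d| among significant results: "
      f"{np.mean(np.abs(significant_estimates)):.2f} (true d = {true_d})")
```

With these numbers, only about one trial in ten reaches significance, and those that do report an effect several times the true size – significance filtering alone manufactures the inflation.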

  8. You wrote that a small study is not likely to reach statistical significance unless there is a large effect size. On a related point, large effect sizes in small studies are likely the result of noise rather than a genuinely large underlying effect.

    1. Kyle MacDonald says:

      Pleasant and unsurprising to see fellow Gelman readers in the comments on this post.

    2. dstar says:

      Isn’t that another way of saying that small studies are worthless?

      Because from what I understand from my late wife’s statistics classes, they are.

  9. Emjeff says:

    The other issue here is the “black hole” effect – very small differences in means can be made to be statistically significant by large sample sizes. This is the “dirty little secret” of p-values, and companies understand this very well.

    1. Roger Moore says:

      This is why it’s important to distinguish between statistical and practical significance. With a large enough sample size, it’s possible to find effects that are statistically significant but practically meaningless.

  10. tlp says:

    That’s a problem with sensational information: once it’s out, it’s much harder to make everyone forget. Do I need to bring up the vaccination/autism link?

  11. milkshaken says:

    I went through a college where grades were decided mostly by oral exams, where you had to prepare and give a brief presentation on one or two randomly chosen subjects from a test question list. The time slots just before lunch were the last to fill, since everyone knew the professors were grouchier before lunch, whereas after lunch they became more relaxed and tolerant.

    1. DH says:

      Did they “know” this based on empirical data, or did they just assume it? You could come up with a plausible counter-argument that just before lunch there would be fewer challenging questions because the professors were eager to get out of there and go eat.

      1. milkshaken says:

        Nah, all the questions were on the list and you picked their numbers out of a hat. Grouchiness was anecdotal, but people noticed being cut off after 5 minutes with a C grade, or being asked to come back a second time (second oral exam failure = you failed the class).

  12. Bagger Vance says:

    If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal – you would already know it exists.

    LOL, come on, everyone knows that noticing things is “racist”, “xenophobic”, “sexist” or any number of other terms designed to convince us all that “five” is the correct response to “2 + 2”.

    Welcome to [Current Year]!

    1. tangent says:

      Nah, determinedly not noticing things is how people mostly do racism/etc. nowadays.

  13. 👌 says:

    If I had a nickel for every time I’ve heard that!

  14. eyesoars says:

    Maybe the study effect is impossibly huge; it’s not clear what ‘65% favorable’ means. If, perhaps, it means 65% of sentences were more lenient than criteria x, y, and z for each decision, it seems possible. If it means deciding between the state and the accused, perhaps less likely.

    However, I can, and perhaps should, say from experience as a performer that there are few crowds worse to perform for than a hungry crowd that knows it will be fed when the performance ends.

  15. Wavefunction says:

    When your sample size is small, you have a higher likelihood of seeing extreme effects on either side.

  16. Sean fearsalach says:

    Not exactly a new discovery

    “The hungry judges soon the sentence sign,
    And wretches hang that jurymen may dine”
    (Pope, The Rape of the Lock, 1712)
