
Gotta Be a Conclusion In Here Somewhere

A couple of years ago, I wrote about how far too much of human nutrition research was unfit to draw conclusions from. This new story does nothing to make a person more confident in the field: it’s a detailed look at the lab of Brian Wansink at Cornell, where he holds an endowed chair. He’s the former head of the Center for Nutrition Policy and Promotion at the USDA, the author of a long list of scientific publications as well as popular books, and his work is widely quoted when the topic of human behavior around food comes up. And it appears more and more like most (all?) of that work is in trouble.

This has been building for a few months. During 2017, Wansink had several papers retracted, and the reanalysis excerpted below appears to be one of the things that set it all off. This is the sort of abstract that will ruin a person’s whole day:

We present the initial results of a reanalysis of four articles from the Cornell Food and Brand Lab based on data collected from diners at an Italian restaurant buffet. On a first glance at these articles, we immediately noticed a number of apparent inconsistencies in the summary statistics. A thorough reading of the articles and careful reanalysis of the results revealed additional problems. The sample sizes for the number of diners in each condition are incongruous both within and between the four articles. In some cases, the degrees of freedom of between-participant test statistics are larger than the sample size, which is impossible. Many of the computed F and t statistics are inconsistent with the reported means and standard deviations. In some cases, the number of possible inconsistencies for a single statistic was such that we were unable to determine which of the components of that statistic were incorrect. . .The attached Appendix reports approximately 150 inconsistencies in these four articles, which we were able to identify from the reported statistics alone. . . 

But actually, the trouble began with a post on Wansink’s own blog. He described the pizza work as initially appearing to be a “failed study” with “null results”, but went on to describe how a grad student in his group (at his urging) kept going back over the data until she began finding “solutions that held up”. That raises the eyebrow, Spock-style, because you’re supposed to design a study to answer some specific question. Rooting around in the data post hoc to see what turns up, although tempting, is a dangerous way to work. That’s because if you keep rearranging, testing, breaking down and putting together over and over, you can generally find something that comes out looking as if it were significant. But that doesn’t mean it is. (Update: as Alex Tabarrok points out, there is a very germane XKCD for this!)

If you’re going to try to “torture the data until they confess”, as the saying goes, then what you really have to do is take that interesting trend you seem to have spotted and design another study specifically to test for it. If you’re on to something, you’ll get a stronger signal in the numbers – but most of the time, unfortunately, you’re not on to something. You can chase this sort of stuff for a long time, watching it evaporate in front of you, and the larger the original data set, the greater the chance of this happening. It’s especially dangerous with notoriously fuzzy readouts like field studies of human behavior – this stuff is hard enough with cells growing in containers or mice in uniform cages, so imagine what it’s like to work with data you collected down at the pizza buffet. The reproducibility crisis in social science is driven, in large part, by the fact that humans are horrendously hard to work with as objects of study.
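
To put a number on that, here’s a minimal simulation sketch (Python, with invented diners and subgroups that have nothing to do with the actual buffet data): give a couple of hundred imaginary diners an outcome with no real effect in it at all, slice them into enough post-hoc subgroups, and some slice will usually come out “significant” anyway.

```python
# Minimal sketch: purely simulated data with no real effect anywhere.
# Slicing a null dataset into enough post-hoc subgroups will usually
# turn up at least one "significant" comparison by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_diners, n_subgroups = 200, 20   # e.g. age band, gender, seating area...

# Outcome (say, slices eaten) drawn identically for everyone, so by
# construction no subgroup truly differs from any other.
outcome = rng.normal(loc=4.0, scale=1.5, size=n_diners)
labels = rng.integers(0, n_subgroups, size=n_diners)

spurious = []
for g in range(n_subgroups):
    in_group = outcome[labels == g]
    rest = outcome[labels != g]
    t, p = stats.ttest_ind(in_group, rest)
    if p < 0.05:
        spurious.append((g, round(p, 3)))

print("'Significant' subgroups found in pure noise:", spurious)
```

Run that a few times with different seeds and you’ll average roughly one spurious hit per twenty comparisons, which is exactly what a 0.05 threshold promises; it just doesn’t feel that way when the hit arrives with a ready-made story attached.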

As you can see, though, turning around and designing more tightly controlled follow-up studies is not what Wansink did. Instead, the grad student’s work turned directly into the four papers mentioned in that abstract – from what I can see, what should have been preliminary conclusions to be tested again became the conclusions, the whole points, of four new papers. Which is one way to bulk up the publication list. That list has now been the subject of a lot of scrutiny, and this new article is not going to damp any of that down, either:

Now, interviews with a former lab member and a trove of previously undisclosed emails show that, year after year, Wansink and his collaborators at the Cornell Food and Brand Lab have turned shoddy data into headline-friendly eating lessons that they could feed to the masses.

In correspondence between 2008 and 2016, the renowned Cornell scientist and his team discussed and even joked about exhaustively mining datasets for impressive-looking results. They strategized how to publish subpar studies, sometimes targeting journals with low standards. And they often framed their findings in the hopes of stirring up media coverage to, as Wansink once put it, “go virally big time.”

Oh boy. The article goes on to detail just those things, and it’s grim reading. Grim for more than one reason, though – as the piece describes Wansink and his co-authors looking for topics that would bring in attention and funding, worrying about numbers that didn’t quite reach publishable significance thresholds and wondering if there could be some way to push them across, and submitting papers, after rejection, to progressively less demanding journals just to get them published. . .well, a lot of readers may find themselves squirming in their chairs a bit.

The “p-hacking” and data-grinding that went on in Wansink’s lab really do appear to be beyond what responsible researchers should engage in. Those are the real sins here, because thanks to these papers there are a lot of conclusions out in the literature that are just wrong (or at best, claimed to be proven when they aren’t). But once past the outright misconduct, some of the other activity described is all too familiar, and seeing it all mixed together in a “Can you believe this stuff?” article makes for uncomfortable reading. It’s worth thinking about what a lot of other labs’ internal emails might look like if published at Buzzfeed. But at least their results stand up. They’d better.

45 comments on “Gotta Be a Conclusion In Here Somewhere”

  1. Cameron Beaudreault says:

    Speaking of p-hacking, there’s a 2016 paper in Nature that found evidence that astrocyte scars are necessary both for limiting the extent of CNS trauma and for permitting CNS axon regrowth. I’ve long suspected the paper of p-hacking, because they used ANOVA with the Newman-Keuls method to test for significance. Internet descriptions of Newman-Keuls say that it is a non-conservative test, more likely to make Type I errors than Tukey’s range test. I’m not a statistician, so I cannot speak on this with authority, but is it appropriate to use non-conservative tests with biological data, as found in this kind of paper? doi:10.1038/nature17623

    1. NJBiologist says:

      There is a staggeringly wide range of answers to the question “what’s the appropriate test statistic for this set of data”–often honestly arrived at, with some level of thought, and possibly with an origin lost to memory. (Think grad school.) I don’t usually use NK or SNK, so I can’t really speak for that one, but it’s common to see different stats used.

      If you’re worried about the “torture the data until they confess” thing, the best warning sign is probably consistency. Does the lab use a different post hoc test for every study? Do they usually use one–say, Tukey, or Dunnett’s–but this data set got NK? Both of those would be worrisome.

      1. kriggy says:

        Shouldn’t you get the same results no matter which method you use? I don’t have much idea how statistics works, but I would suppose that if your results are really significant, it will show no matter which method is used.

        1. NJBiologist says:

          Short answer: yes for huge differences, no for small differences.

          If you’ve got a big effect in a well-powered study, different test statistics should all tell you your data are showing statistical significance.

          If your data clock in at p ~ 0.06, there are very good odds you can find a statistic that will put your data at p ~ 0.04.
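
          A quick sketch with simulated numbers (Tukey’s HSD via statsmodels; not anyone’s real data) shows how that happens: the same borderline comparison can clear 0.05 as a lone, uncorrected t-test and miss it once the whole family of comparisons is corrected for.

          ```python
          # Sketch on simulated data: an uncorrected pairwise t-test vs.
          # Tukey's HSD over the same three groups. Depending on the draw,
          # the borderline comparison can sit just under 0.05 in one and
          # just over it in the other -- which is the whole temptation.
          import numpy as np
          from scipy import stats
          from statsmodels.stats.multicomp import pairwise_tukeyhsd

          rng = np.random.default_rng(42)
          groups = {
              "control": rng.normal(10.0, 2.0, 20),
              "dose_a":  rng.normal(10.2, 2.0, 20),
              "dose_b":  rng.normal(11.3, 2.0, 20),  # modest difference
          }

          # Lone pairwise t-test, control vs. dose_b, no correction:
          t, p = stats.ttest_ind(groups["control"], groups["dose_b"])
          print(f"uncorrected t-test p = {p:.3f}")

          # Tukey's HSD corrects for the family of all pairwise comparisons:
          values = np.concatenate(list(groups.values()))
          labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
          print(pairwise_tukeyhsd(values, labels, alpha=0.05))
          ```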

        2. amature says:

          If it’s going to make any real difference in how the results are interpreted, you should run more than one test and/or model. These numbers can be generated at the click of a button. Then you let the readers decide if it’s ‘real’.

    2. arcya says:

      Honestly nearly every paper has AT LEAST one incorrectly used statistical test – partly because choosing the correct test for the circumstances is quite difficult (and not everyone agrees!) and partly because most of those papers are written by grad students or multiple authors. We rarely see the statistical errors that go the other way (incorrect test leads to erroneously “not significant” conclusions) because no one really publishes their null data.

      This is why the hallmark of a good paper – and there are a lot of bad ones – is that the conclusion is supported by multiple experiments and methods, such that even if any one result is wrong, the overall thesis is still likely to be right.

  2. Hap says:

    My advisor used to say that if something wasn’t published, it didn’t exist. If it’s not published, no one else knows what you did, so other people can’t try to reproduce it or show that it can’t be reproduced (or needs to be better defined). I don’t think trying to publish what you did is a problem. What is a problem is when you try to make something publishable that isn’t – when you don’t care whether you’ve actually found something but only that you’ve published something. Publishing is a way of communicating and preserving knowledge. If you don’t actually have any knowledge (other than “I really want to be published” or “I need some more grant money and want agencies to give it to me” or “I want tenure and don’t care how many grad students I have to stand on to do it”) then you shouldn’t be publishing as if you do.

    1. Ian Malone says:

      At the core of that is the insistence on only accepting positive (non-null) results. If a study is unpublishable because it didn’t work out then nobody knows about it. It’s also important studies should be sufficiently powered to detect the effect they’re looking for, but the funding models don’t always encourage that either. A well-powered null result should be persuasive and publishable. (Also, few results can ever truly be null, just have vanishingly small effect sizes, so estimates of effect size and confidence are important and more useful than just p values. At least some journals have been requiring that for a number of years now.)

  3. Wavefunction says:

    “In some cases, the number of possible inconsistencies for a single statistic was such that we were unable to determine which of the components of that statistic were incorrect”

    Ouch. In other words, there were so many things wrong that we couldn’t say what was wrongest.

  4. DCRogers says:

    Sadly, the worst p-hacking can occur even when all the statistics in the paper are perfect.

    I have reviewed papers where the data is a table of descriptors, then used to build a model, and the model shows great statistics. So far so good, right?

    But looking closely, it’s obvious that these descriptors were cherry-picked from a larger, and un-commented-on, set. Was it just a few removed? Hundreds? Thousands?

    It’s an ugly place to be, as a reviewer, because you’re tempted to downgrade/reject work based upon something that happened outside of your direct knowledge.

    Anyhow, for cases where I wanted to reject, there were usually enough other problems in the paper to hang my rejection on (though I would gently raise my data-table fears in my review, not as an accusation, but as “friendly advice”).

    1. Imaging guy says:

      You are absolutely right. When you are building regression models (aka multivariable analysis/regression analysis/machine learning/deep learning), you use hundreds of descriptors/variables/parameters (thousands in the case of machine learning), and the software will remove many of them iteratively until you are left with variables that produce a statistically significant result. This is a form of p-hacking. The model obtained with the leftover variables must be tested on independent datasets. As John von Neumann said, “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk”.
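
      A bare-bones sketch of that iterative whittling, run on pure noise (hypothetical variable names, plain forward selection rather than any particular package’s routine), shows why the independent test set matters:

      ```python
      # Sketch: greedy forward selection on pure noise. Every candidate
      # "descriptor" is random, yet the loop will usually keep a few with
      # p < 0.05 -- a model that an independent dataset will not confirm.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      n_obs, n_vars = 100, 200
      X = rng.normal(size=(n_obs, n_vars))  # 200 candidate descriptors, all noise
      y = rng.normal(size=n_obs)            # outcome unrelated to any of them

      selected = []
      for _ in range(5):                    # greedily add up to five "best" variables
          best_p, best_j = 1.0, None
          for j in range(n_vars):
              if j in selected:
                  continue
              fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
              p = fit.pvalues[-1]           # p-value of the newly added variable
              if p < best_p:
                  best_p, best_j = p, j
          if best_p >= 0.05:
              break
          selected.append(best_j)

      print("'Significant' noise variables kept:", selected)
      ```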

      1. Nanosomething says:

        See “Drawing an elephant with four complex parameters”, American Journal of Physics 78, 648 (2010)

        They also show a fifth parameter can be used to control the shape of the trunk, allowing wiggling.

        1. truthortruth says:

          Love it.

      2. amature says:

        For each additional variable/combination you hypothesize about and test, you should lower your p-value threshold. I think it’s p/n, where p is 5% and n is the number of hypotheses you have. So you need to write out all the variable combos you think are important BEFORE looking at the results.

        Of course, machine learning and deep learning and ridge regression are different. There you start with as many variables as you like, and you get rid of the useless ones. But the purpose of machine learning isn’t to get a p-value. Instead you fit a ‘training’ set, and then see how well the model works on a ‘test’ set. There is no p-value.
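
        A small sketch of both points, with made-up numbers (ridge regression via scikit-learn here, which is just an assumed tool, not a reference to any particular study):

        ```python
        # Sketch of the two ideas above, with illustrative numbers only.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        # 1) Bonferroni-style correction: with n pre-specified hypotheses,
        #    each individual test must clear alpha / n, not alpha itself.
        alpha, n_hypotheses = 0.05, 12
        per_test_threshold = alpha / n_hypotheses    # ~0.0042 instead of 0.05

        # 2) Ridge regression judged by held-out performance, not p-values.
        rng = np.random.default_rng(7)
        X = rng.normal(size=(300, 50))
        y = 2.0 * X[:, 0] + rng.normal(size=300)     # only one column really matters
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        model = Ridge(alpha=1.0).fit(X_train, y_train)
        print(f"per-test threshold: {per_test_threshold:.4f}")
        print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
        ```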

  5. Eugene Fisher says:

    I’m sure the email discussions were taken “out of context” just as the CRU emails were.

  6. anon says:

    The strangest thing about the Wansink case is that he didn’t seem to understand that what he was doing wasn’t OK. He didn’t really try to hide his p-hacking, and when people complained about it immediately after his infamous blog post, he seemed completely clueless. Even after being pointed to that XKCD, he still didn’t seem to get it.

    1. hmm says:

      Yep, from that blog post (and the first addendum to it) it seems like he was convinced that NOT doing it his way was wrong. The sad asymmetry of the situation is that the post-doc who said ‘no’ will never be rewarded. It’s a lose-lose situation.

      1. Harrison says:

        There have been several high-profile cases where grad students and post-docs did not feel comfortable with what was being asked of them. When they went through the appropriate chain of command, nothing would happen, or worse, they would suffer negative consequences. This is particularly unfortunate, as a grad student has likely taken a statistics class (or an ethics class) more recently than the PI. Their objection should be treated as a clear warning, not brushed aside with “but that’s how we’ve always done things.”

  7. Emjeff says:

    I think some blame has to be placed at the feet of the reviewers of these manuscripts. Surely, a reviewer should have noticed that the degrees of freedom > N, right? Or, perhaps no one in the nutrition world knows any statistics at all…

    1. tt says:

      True… but with journal shopping, they would likely just keep submitting until they found one with bad reviewers. There are enough forums for publishing crap that the check of peer review is meaningless.

  8. MoMo says:

    You are picking on a guy who makes a living observing America’s eating behavior and whose claim to scientific fame is soup bowls that constantly refill themselves. We should all be in the lab making new-generation SSRIs or water-soluble propofol to treat his PTSD and his lack of concern over statistics.

    Must be a slow day in the lab.

    This is Psychology – statistics don’t matter.

    1. CheMystery says:

      I think the point is more about scientific rigor in the era of Fake News. It does not reflect well on those of us working on “really important” questions when a member of our tribe is sullying the name of Science. If we scientists are the last bastion of truth, we cannot tolerate such bad actors.

  9. In Vivo Veritas says:

    It’s brutal. Cornell faculty, students and post-docs should be outside his lab with pitchforks & torches. He’s undercutting the legitimacy of his own field and his own university.

  10. Synthon says:

    My brother was a professor of medical statistics and a statistical reviewer for the BMJ. He was often asked by medics to look at a paper they had written proving something or other and “just check the stats”. When he reported that the work proved nothing because the trial had been ill conceived, he was usually ignored and the work was published unchanged in a lesser journal that did not use a statistical referee.
    The fields of psychology and nutrition need much more rigour.

  11. luysii says:

    Back in 2015 the dietary guidelines shifted yet again. Cholesterol is no longer bad.

    Shades of Woody Allen and “Sleeper”. It’s life imitating art.

    Sleeper is one of the great Woody Allen movies from the 70s. Woody plays Miles Monroe, the owner of (what else?) a health food store who through some medical mishap is frozen in nitrogen and is awakened 200 years later. He finds that scientific research has shown that cigarettes and fats are good for you. A McDonald’s restaurant is shown with a sign “Over 795 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 Served”

    Seriously then, should you believe any dietary guidelines? In my opinion you shouldn’t. In particular I’d forget the guidelines for salt intake (unless you actually have high blood pressure in which case you should definitely limit your salt). People have been fighting over salt guidelines for decades, studies have been done and the results have been claimed to support both sides.

    So what’s a body to do? Well here are 4 things which are pretty solid (which few docs would disagree with, myself included)

    1. Don’t smoke
    2. Don’t drink too much (over 2 drinks a day), or too little (no drinks). Study after study has shown that mortality is lowest with 1 – 2 drinks/day
    3. Don’t get fat — by this I mean fat (Body Mass Index over 30) not overweight (Body Mass Index over 25). The mortality curve for BMI in this range is pretty flat. So eat whatever you want, it’s the quantities you must control.
    4. Get some exercise — walking a few miles a week is incredibly much better than no exercise at all — it’s probably half as good as intense workouts — compared to doing nothing.

    Not very sexy, but you’re very unlikely to find anyone telling you the opposite 50 years from now.

    1. Jim Hartley says:

      For a stimulating look at why people die, this video of Richard Peto from the NIH archives is hard to beat.

      1. steve says:

        I believe that Richard Peto is actually at Oxford

    2. Pennpenn says:

      I’m always hugely dubious about the whole “drinking alcohol in moderation is good for you!” set of claims, feels like people trying to get a correlation to excuse drinking alcohol because it’s socially acceptable but utterly stupid on face value. Oh yay, drink something that impairs your mental faculties, what a brilliant idea!

      Then again, I currently feel like I’d rather cut out my tongue than drink that bilge-fluid so maybe I’m biased.

  12. dearieme says:

    Well here are 4 things which are pretty solid (which few docs would disagree with, myself included)

    1. Don’t smoke AGREED.
    2. Don’t drink too much (over 2 drinks a day), or too little (no drinks). Study after study has shown that mortality is lowest with 1 – 2 drinks/day CORRELATION BUT IS IT CAUSE?
    3. Don’t get fat — by this I mean fat (Body Mass Index over 30) not overweight (Body Mass Index over 25). The mortality curve for BMI in this range is pretty flat. THE OVERWEIGHT OUTLIVE THE “NORMAL”. EVEN CLASS I OBESE PATIENTS OUTLIVE THE “NORMAL”, i.e. 30 – 35 outlive 18.5 – 25. CLASS II OUTLIVE THE UNDERWEIGHT. ALSO “CORRELATION BUT IS IT CAUSE?” Source:

    1. luysii says:

      Dearieme — you are quite correct. The quote was lifted from something written in 2015. You are also right about better data — here’s a link — to a post about BMI and mortality —

      The relevant work is Flegal’s — here’s a partial summary from the post

      A great paper 5 years ago by Katherine Flegal analyzed nearly 3 million people with 270,000 deaths reported in a variety of studies —

      The problem is that the lowest mortality didn’t occur in those with normal weight (BMI < 25); it was lowest in the overweight group — not by much (6%) — and second lowest in the mildly obese (BMI 30 – 35). Over a BMI of 35 it was 20% higher.

      Naturally this did not sit well with people who’d staked their research careers on telling people to lose weight. There is a truly hilarious article describing a meeting at Harvard to discuss the paper. Here’s a link. It’s worth reading in its entirety, particularly for a graph it contains.

      Your point about CORRELATION BUT IS IT CAUSE is particularly interesting, because EVERY statement made by docs is based on this sort of reasoning (smoking/increased mortality, alcohol consumption and mortality, body weight and mortality, etc.). I don’t think we’re anywhere close to going from correlation to a correct understanding of the mechanisms of causation. We simply don’t understand cellular biology, organismal biology, or the effects of environment well enough to be sure.

      1. Chris Phoenix says:

        BMI is probably a bad measure anyway, because it’s based on the square of height, and body mass should probably be proportional to the cube of height.

  13. CatCube says:

    You can’t discuss post-hoc rooting around in data sets looking for correlations without a link to Tyler Vigen’s Spurious Correlations website, so I’ll post the link here:

    My favorite is that US per-capita consumption of mozzarella correlates with the number of civil engineering doctorates awarded, with r = 0.959.
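
    For anyone who wants to see how little it takes, here’s a sketch with two made-up series that merely share an upward drift (not the real mozzarella or doctorate figures):

    ```python
    # Sketch: two unrelated, invented series that both drift upward over
    # the same years show an impressive Pearson r -- which is about all
    # most of the Spurious Correlations plots amount to.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    years = np.arange(2000, 2015)
    cheese = 9.0 + 0.20 * (years - 2000) + rng.normal(0, 0.10, len(years))
    phds = 480 + 15.0 * (years - 2000) + rng.normal(0, 8.0, len(years))

    r, p = stats.pearsonr(cheese, phds)
    print(f"r = {r:.3f}, p = {p:.2g}")   # shared trend, no causal link at all
    ```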

    1. Me says:

      Ben Goldacre published a chapter on post-hoc analysis in one of his books – I think it was Bad Science. Someone ‘proved’ that a med (I think in a cancer trial) had a statistically significant benefit in patients born on a Thursday, or some such. Fun with numbers!

    2. “My favorite is the US per-capita consumption of mozzarella is correlated to the number of civil engineering doctorates awarded with an r=0.959”

      Based on doing my graduate studies at a college with a strong engineering school (Caltech), and observing the habits of the denizens at first hand, I question whether this is actually spurious. ;)

  14. K says:


    We don’t get enough of it

    1. oldnuke says:


      “We don’t get enough of it”

      Ask anyone on death row.


  15. Kevin H says:

    My interpretation of the XKCD wasn’t that it was about p-hacking and HARKing, so much as it was about the limitations of significance testing if you just do enough experiments, with perhaps an implicit nod to the file-drawer effect (only the p<0.05 result gets published–or gets media attention).

    I guess I always assumed that the cartoon scientists in Randall Munroe's world would be basically honest and (however grudgingly) hardworking. They were complaining about the loss of their valuable Minecraft time because they had to do twenty new experiments – one for each M&M colour – instead of just slapping together a dubious post-hoc subgroup analysis of their original data set.

    (And the mouseover for the comic suggests that they went back and tried to replicate their result….)

    Still, I suppose it is ambiguous, and there's something to be learned (or at least, something that should be learned) either way.

    1. Chris Phoenix says:

      It could also be a reference to studies showing that prayer improves health outcomes. Quite literally, they study 20-ish outcomes and report the one that improves.

  16. Emjeff says:

    When we design clinical studies in industry, we also decide how we are going to analyze the data – in fact, the design and the analysis plan go hand-in-hand. The statistical analysis plan is written along with the protocol, and is so detailed that the number and kinds of tables and figures are pre-specified. If you think about it, you can’t really design a study without knowing how you plan to analyze the data. Obviously, this guy collected data without the foggiest idea of how he was going to analyze it.

  17. Li says:

    Science magazine just bemoaned the fact that authors submitting papers on A.I. often don’t include the information needed for anyone else to replicate the work (and so can’t provide it to them). WHAT?!?! And they’ve been wringing their hands for some time over the many authors who don’t deposit the data they agreed to. Apparently, the communities won’t support this kind of honesty and transparency. Not that it isn’t a good thing when some charlatan like this guy is caught, but I reject the idea that his peers aren’t part of the problem. It’s like blaming only Harvey Weinstein (and his Board of Directors) for the systematic sexual exploitation that his company and MANY of his employees engaged in. #NotOneGuy.

  18. Anonymous says:

    A major Harvard psychology program has also been the target of criticism. The “implicit association test” (IAT) from the Banaji lab is based on minuscule differences in subject response times, is not very reproducible (the same subjects generate different results upon retesting), and has a slew of other controversial problems. Although the researchers said they were going to release their raw data for others to analyze statistically, I don’t know if they have actually done so.

    (Not reproducible? How many of you had HR require you to take the Myers-Briggs Personality Test? Same deal: different results on different days. But HR needs to justify their existence somehow. M-B is more psychology tripe, but the Myers-Briggs INDUSTRY [it’s like a franchise thing] is probably more profitable than R&D anyway.)

  19. David Edwards says:

    From Derek’s coverage of this issue above:

    Rooting around in the data post hoc to see what turns up, although tempting, is a dangerous way to work. That’s because if you keep rearranging, testing, breaking down and putting together over and over, you can generally find something that comes out looking as if it were significant. But that doesn’t mean it is.

    Which is as succinct an encapsulation of the perils of the whole “data mining” business as one could wish for. Not that it will stop corporations and intelligence agencies from continuing to data mine, the former in the hope of making fat wads of cash for the C-suites, the latter in the hope of finding an easy target to pursue. Regardless, of course, of whether the resulting cubes even point to something resembling a genuine correlation, let alone a cause-and-effect linkage. But it’s a bit of a shock to see data mining this careless turning up in what is supposed to be scientific research. This is the sort of crap I associate with marketing types pretending to be computer scientists, in the hope that they’ll be first on the promotion ladder when the C-suites start collecting extra yachts, the sort of activity that gives proper software developers a bad name.

    Trouble is, as Derek has ruefully commented previously in other posts, this sort of approach is becoming fashionable among get-rich-quick merchants, who think they can do med-chem better than trained chemists. I might not be a trained med-chem specialist, but even I know better than to ignore their advice, just because someone with a polished PowerPoint presentation and a smooth line of patter says he can do better using his (untested) super-duper AI system.

  20. oliver says:

    Two thoughts:
    Psychology is not science!!!
    Wansink delenda est!
