A couple of years ago, I wrote about how far too much of human nutrition research was unfit to draw conclusions from. This new story does nothing to make a person more confident in the field: it’s a detailed look at the lab of Brian Wansink at Cornell, where he holds an endowed chair. He’s the former head of the Center for Nutrition Policy and Promotion at the USDA, the author of both a long list of scientific publications and popular books, and his work is widely quoted whenever the topic of human behavior around food comes up. And it appears more and more likely that most (all?) of that work is in trouble.
This has been building for a few months. During 2017, Wansink had several papers retracted, and this appears to be one of the things that started it all off. This is the sort of abstract that will ruin a person’s whole day:
We present the initial results of a reanalysis of four articles from the Cornell Food and Brand Lab based on data collected from diners at an Italian restaurant buffet. On a first glance at these articles, we immediately noticed a number of apparent inconsistencies in the summary statistics. A thorough reading of the articles and careful reanalysis of the results revealed additional problems. The sample sizes for the number of diners in each condition are incongruous both within and between the four articles. In some cases, the degrees of freedom of between-participant test statistics are larger than the sample size, which is impossible. Many of the computed F and t statistics are inconsistent with the reported means and standard deviations. In some cases, the number of possible inconsistencies for a single statistic was such that we were unable to determine which of the components of that statistic were incorrect. . . The attached Appendix reports approximately 150 inconsistencies in these four articles, which we were able to identify from the reported statistics alone. . .
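For context, the kind of check the reanalysis describes can be run from published summary statistics alone. Here is a minimal sketch (in Python, with made-up numbers rather than anything from the actual papers) of recomputing an independent-samples t statistic from reported means, standard deviations, and group sizes; if the recomputed value or its degrees of freedom disagree with what the paper prints, something is off.

```python
import math

def two_sample_t(m1, sd1, n1, m2, sd2, n2):
    """Recompute a pooled-variance (Student's) t statistic from summary stats."""
    # Pooled variance across the two independent groups
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Return the t statistic and its degrees of freedom (n1 + n2 - 2)
    return (m1 - m2) / se, n1 + n2 - 2

# Hypothetical summary statistics, NOT the actual pizza-buffet data:
t, df = two_sample_t(m1=3.4, sd1=1.1, n1=43, m2=2.9, sd2=1.2, n2=46)
print(f"recomputed t({df}) = {t:.2f}")  # compare against the value reported in the paper
```

A reported degrees-of-freedom figure larger than the total number of diners, or a t value that can’t be reproduced from the reported means and SDs, is exactly the sort of inconsistency the abstract is cataloguing.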
But actually, the trouble began with a post on Wansink’s own blog. He described the pizza work as initially appearing to be a “failed study” with “null results”, but went on to describe how a grad student in his group (at his urging) kept going back over the data until she began finding “solutions that held up”. That raises the eyebrow, Spock-style, because you’re supposed to design a study to answer some specific question. Rooting around in the data post hoc to see what turns up, although tempting, is a dangerous way to work. That’s because if you keep rearranging, testing, breaking down and putting together over and over, you can generally find something that comes out looking as if it were significant. But that doesn’t mean it is. (Update: as Alex Tabarrok points out, there is a very germane XKCD for this!)
If you’re going to try to “torture the data until they confess”, as the saying goes, then what you really have to do is take that interesting trend you seem to have spotted and design another study specifically to test for it. If you’re on to something, you’ll get a stronger signal in the numbers – but most of the time, unfortunately, you’re not on to something. You can chase this sort of stuff for a long time, watching it evaporate in front of you, and the larger the original data set, the greater the chance of this happening. It’s especially dangerous with notoriously fuzzy readouts like field studies of human behavior – this work is hard enough with cells growing in containers or mice in uniform cages, so imagine what it’s like to work with data you collected down at the pizza buffet. The reproducibility crisis in social science is driven, in large part, by the fact that humans are horrendously hard to work with as objects of study.
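To make that concrete, here is a small simulation (a sketch in Python, not anything drawn from Wansink’s data) of what happens when you keep slicing a dataset that contains nothing but noise: test enough arbitrary subgroups and a comparison or two will clear p < 0.05 purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise: 200 "diners" with a random outcome, plus 20 unrelated binary groupings
n_diners, n_splits = 200, 20
outcome = rng.normal(size=n_diners)
splits = rng.integers(0, 2, size=(n_splits, n_diners))

false_positives = 0
for split in splits:
    a, b = outcome[split == 0], outcome[split == 1]
    _, p = stats.ttest_ind(a, b)   # two-sample t-test on a meaningless grouping
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_splits} comparisons came out 'significant' by chance")
# With 20 independent looks at random data, about one is expected to clear p < 0.05.
```

Nothing in this dataset is real, yet the post hoc slicing still delivers “publishable” p-values – which is why an interesting trend found this way needs its own follow-up study before anyone believes it.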
As you can see, though, turning around and designing more tightly controlled follow-up studies is not what Wansink did. Instead, the grad student’s work turned directly into the four papers mentioned in that abstract – from what I can see, what should have been preliminary conclusions to be tested again became the conclusions, the whole point, of four new papers. Which is one way to bulk up the publication list. That list has now been the subject of a lot of scrutiny, and this new article is not going to damp any of that down, either:
Now, interviews with a former lab member and a trove of previously undisclosed emails show that, year after year, Wansink and his collaborators at the Cornell Food and Brand Lab have turned shoddy data into headline-friendly eating lessons that they could feed to the masses.
In correspondence between 2008 and 2016, the renowned Cornell scientist and his team discussed and even joked about exhaustively mining datasets for impressive-looking results. They strategized how to publish subpar studies, sometimes targeting journals with low standards. And they often framed their findings in the hopes of stirring up media coverage to, as Wansink once put it, “go virally big time.”
Oh boy. The article goes on to detail just those things, and it’s grim reading. Grim for more than one reason, though – as the piece describes Wansink and his co-authors looking for topics that would bring in attention and funding, worrying about numbers that didn’t quite reach publishable significance thresholds and wondering if there could be some way to push them across, and submitting papers, after rejection, to progressively less demanding journals just to get them published. . .well, a lot of readers may find themselves squirming in their chairs a bit.
The “p-hacking” and data-grinding that went on in Wansink’s lab really do appear to be beyond what responsible researchers should engage in. Those are the real sins here, because thanks to these papers there are a lot of conclusions out there in the literature that are just wrong (or at best, not proven right, although claimed to be). But once past the outright misconduct, some of the other activity described is all too familiar, and seeing it all mixed together in a “Can you believe this stuff?” article makes for uncomfortable reading. It’s worth thinking about what a lot of other labs’ internal emails might look like if published at BuzzFeed. But at least their results stand up. They’d better.