
Clinical Trials

Studies Show? Not So Fast.

Yesterday’s post on yet another possible Alzheimer’s blood test illustrates, yet again, that understanding statistics is not a strength of most headline writers (or most headline readers). I’m no statistician myself, but I have a healthy mistrust of numbers, since I deal with the little rotters all day long in one form or another. Working in science will do that to you: every result, ideally, is greeted with the hearty welcoming phrase of “Hmm. I wonder if that’s real?”
A constant source of material for the medical headline folks is the steady flow of observational studies. Eating broccoli is associated with this. Chocolate is associated with that. Standing on your head is associated with something else. When you see these sorts of stories in the news, you can bet, quite safely, that you’re not looking at the result of a controlled trial – one cohort eating broccoli while hanging upside down from their ankles, another group eating it while being whipped around on a carousel, while the control group gets broccoli-shaped rice puffs or eats the real stuff while being duct-taped to the wall. No, it’s hard to get funding for that sort of thing, and it’s not so easy to round up subjects who will stay the course, either. Those news stories are generated by people who’ve combed through large piles of data, from other studies, looking for correlations.
And those correlations are, as far as anyone can tell, usually spurious. Have a look at the 2011 paper by Young and Karr to that effect (here’s a PDF). If you go back and look at the instances where observational effects in nutritional studies have been tested by randomized, controlled trials, the track record is not good. In fact, it’s so horrendous that the authors state baldly that “There is now enough evidence to say what many have long thought: that any claim coming from an observational study is most likely to be wrong.”
They draw the analogy between scientific publications and manufacturing lines, in terms of quality control. If you just inspect the final product rolling off the line for defects, you’re doing it the expensive way. You’re far better off breaking the whole flow into processes and considering each of those in turn, isolating problems early and fixing them, so you don’t make so many defective products in the first place. In the same way, Young and Karr have this to say about the observational study papers:

Consider the production of an observational study: Workers – that is, researchers – do data collection, data cleaning, statistical analysis, interpretation, writing a report/paper. It is a craft with essentially no managerial control at each step of the process. In contrast, management dictates control at multiple steps in the manufacture of computer chips, to name only one process control example. But journal editors and referees inspect only the final product of the observational study production process and they release a lot of bad product. The consumer is left to sort it all out. No amount of educating the consumer will fix the process. No amount of teaching – or of blaming – the worker will materially change the group behaviour.

They propose a process control for any proposed observational study that looks like this:
Step 0: Data are made publicly available. Anyone can go in and check it if they like.
Step 1: The people doing the data collection should be totally separate from the ones doing the analysis.
Step 2: All the data should be split, right at the start, into a modeling group and a group used for testing the hypothesis that the modeling suggests.
Step 3: A plan is drawn up for the statistical treatment of the data, but using only the modeling data set, and without the response that’s being predicted.
Step 4: This plan is written down, agreed on, and not modified as the data start to come in. That way lies madness.
Step 5: The analysis is done according to the protocol, and a paper is written up if there’s one to be written. Note that we still haven’t seen the other data set.
Step 6: The journal reviews the paper as is, based on the modeling data set, and they agree to do this without knowing what will happen when the second data set gets looked at.
Step 7: The second data set gets analyzed according to the same protocol, and the results of this are attached to the paper in its published form.
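The heart of Steps 2 through 7 is the held-out data set, and the protection it buys is easy to simulate. Here is a hypothetical sketch (all names and numbers are invented, not from Young and Karr): fishing through many candidate predictors in one half of the data always turns up a "best" correlation, which then shrivels when checked against the untouched half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data: 1000 subjects, 20 candidate predictors,
# and an outcome that is pure noise (no real effect exists anywhere).
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)

# Step 2: split once, up front, into a modeling set and a held-out test set.
split = 500
X_model, y_model = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Modeling phase: pick the predictor most correlated with the outcome in the
# modeling set (a stand-in for whatever fishing the analysts choose to do).
corrs = [abs(np.corrcoef(X_model[:, j], y_model)[0, 1]) for j in range(20)]
best = int(np.argmax(corrs))
print(f"best predictor in modeling set: #{best}, |r| = {corrs[best]:.3f}")

# Step 7: the pre-registered check. Does that same predictor hold up in the
# untouched data? With pure noise, it usually will not.
r_test = np.corrcoef(X_test[:, best], y_test)[0, 1]
print(f"same predictor in test set:     |r| = {abs(r_test):.3f}")
```

The best-of-twenty correlation in the modeling half looks respectable purely by selection; the held-out half reports what is actually there.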
Now that’s a hard-core way of doing it, to be sure, but wouldn’t we all be better off if something like this were the norm? How many people would have the nerve, do you think, to put their hypothesis up on the chopping block in public like this? But shouldn’t we all?

21 comments on “Studies Show? Not So Fast.”

  1. luysii says:

    There’s nothing wrong with observational studies (that’s how thalidomide teratogenicity was found), but they should ALWAYS result in a controlled trial before acting on them. Here’s a particularly horrible example of relying on observational studies. For a few more examples, please see:
    [ Science vol. 297 pp. 325 – 326 ’02 ] During the planning study for the Women’s Health Initiative, some argued that it was UNETHICAL to deny some women the benefit of hormones and give them a placebo. The basis for this was that 3 different meta-analyses concluded that estrogen replacement therapy decreases the risk of coronary heart disease by 35 – 50% — these were all meta-analyses of observational studies and not prospective and randomized.
    The reason the HERS study was funded was that Wyeth couldn’t get the FDA to approve hormone replacement therapy as a treatment to prevent cardiovascular disease. So Wyeth funded HERS to prove their case.
    More work from the Women’s Health Initiative trial in 16,608 women showed increased risk of stroke, dementia, and global cognitive decline. In addition, there was no benefit against mild cognitive impairment. This was published in the 28 May ’03 JAMA. The present work extends the initial findings to a follow-up of 5.6 years. The rate of stroke was 31% higher; 80% of the strokes were ischemic. The increased risk was seen in all categories of baseline stroke risk. 40 of the 61 women diagnosed with dementia were in the hormone (Prempro) group, in a subgroup of 4,532 of the 16,608 women in the study. The references are all JAMA vol. 289 pp. 2663 – 2672, 2651 – 2662, 2673 – 2684, 2717 – 2719 ’03.
    So what was the problem? Why were the results so different from what was expected? The women taking hormones in the 50 observational studies were (1) thinner, (2) better educated, (3) concerned enough about their health and vigor to take hormones — it is well known that compliers with medication — even placebo medication — have better outcomes than noncompliers, believe it or not — and (4) smoked less.
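    The confounding luysii describes can be reproduced in a few lines. This is a hypothetical toy model (every number invented): a "health consciousness" variable both makes a subject likelier to take hormones and independently lowers her event risk, while the hormones themselves do nothing at all.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical confounder: overall health consciousness. It raises the
# chance of taking hormones AND independently lowers cardiac risk.
health = rng.normal(size=n)
takes_hormones = rng.random(n) < 1 / (1 + np.exp(-health))

# True model: hormones have NO effect; risk depends on health alone.
risk = 1 / (1 + np.exp(2 + health))
event = rng.random(n) < risk

# Naive observational comparison: event rate among users vs. non-users.
rate_users = event[takes_hormones].mean()
rate_nonusers = event[~takes_hormones].mean()
print(f"event rate, hormone users: {rate_users:.3f}")
print(f"event rate, non-users:     {rate_nonusers:.3f}")
```

    The users come out looking protected even though, by construction, the hormones do nothing; the healthier-user confounder does all the work, which is exactly what a randomized trial breaks.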

  2. Carmen says:

    I’m a big fan of HealthNewsReview, a watchdog site for health stories. Their reviewers are a mix of MDs and PhDs, many with journalism chops too. They have a list of criteria for evaluating story quality that includes “Does the story seem to grasp the quality of the evidence?” (Click the link in my name for the full list).
    Of course, none of this helps the folks who have to generate 22 articles a day and use a handy stack of press releases as a crutch.

  3. Lisa Balbes says:

    One of my favorite books ever – “How to Lie with Statistics”. The examples are dated now, but it should be required reading for anyone who wants to be an informed citizen.

  4. Ryan Powers says:

    Whenever I see these sorts of stories on news sites, the xkcd Jellybean comic comes to mind.

  5. Helical Investor says:

    The plan you put out seems sensible, but in many instances the available data are limited, so binning into separate sets may not be practical. That is changing, though, especially as electronic medical records gain traction. Data in one silo can and should be blindly compared (after the initial analysis) to data in others.
    I fully expect that ‘outcomes’ data on a wide array of different therapeutics and their use will be compared with populations in different insurance programs.

  6. This is often done in machine learning; see, for example, the website Kaggle or, for a more biomedical example, the DREAM challenges.

  7. David Stone says:

    I see someone’s already posted the Jelly Bean XKCD comic. These are also worth taking a look at:
    Spurious correlations
    The origin of cell phones
    Anyone covering health news for a media outlet should be required to read and understand those first!

  8. Anonymous says:

    Remember that study that claimed that people have psychic powers? And how the results weren’t replicable?
    That was an interesting one. A good test of how to handle this kind of thing.
    The thing is, though, it’s only because the results were so unusual (they could imply the need to expand our theories of fundamental physics, and to reevaluate the way we think about biology and human behavior) that the thing was put up to so much scrutiny. I know the whole “extraordinary claims require extraordinary evidence” thing, but one has to wonder what percentage of all papers’ conclusions are just as wrong, but have never had to stand up to any scrutiny merely because the results seem reasonable.

  9. bluefoot says:

    A couple of years ago at a conference, one of presenters was talking about how great their model was. I asked what the differences were between the training data set and their test set…..and they had used the same data. The scientist wasn’t exactly junior, but didn’t see anything wrong with doing it the way they did. Sometimes I despair.
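    The failure mode bluefoot ran into is easy to demonstrate. In this hypothetical sketch (all sizes invented), a plain least-squares fit with many noise features looks excellent when scored on its own training data and collapses on data it has never seen:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 60 samples, 25 pure-noise features, pure-noise outcome.
X = rng.normal(size=(60, 25))
y = rng.normal(size=60)

X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

# Ordinary least squares via numpy.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(X, y, beta):
    # Fraction of outcome variance "explained" by the fitted model.
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Scored on its own training data, the model looks impressive...
print(f"R^2 on training data: {r2(X_train, y_train, beta):.2f}")
# ...scored on held-out data, it falls apart (often going negative).
print(f"R^2 on held-out data: {r2(X_test, y_test, beta):.2f}")
```

    With nearly as many free parameters as training points, a high training score is guaranteed whether or not anything real is being modeled; only the held-out score is informative.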

  10. lynn says:

    @David Stone – thanks for mentioning Spurious Correlations site. Of course I’ve always been skeptical of correlations, but it’s often hard to impart that skepticism to people who fall for them. But now I can go to Spurious Correlations and pick out a few graphs to show the gullible.

  11. Esteban says:

    The sad reality is that those receiving funding for observational studies are desperate to publish a positive result at the end, so cannot afford to lay all of their cards on the table upfront. Instead, they churn the data looking at various endpoints/associations, find at least one with a p-value below .05 (unadjusted for the multiple looks at the data of course), concoct a story as to why the effect is scientifically plausible, then write it up for publication. If it was a very expensive study, they will dig up multiple such stories and publish multiple times.
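    The endpoint-churning Esteban describes works with depressing reliability, and a short simulation shows why. Assume a hypothetical study with 40 endpoints, none of which has any real effect, so each test statistic is just standard-normal noise:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def p_value(z):
    # Two-sided p-value for a standard-normal test statistic.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 40 endpoints, no true effects anywhere: the z-statistics are pure noise.
n_endpoints = 40
z_stats = rng.normal(size=n_endpoints)
p_vals = [p_value(z) for z in z_stats]

hits = [i for i, p in enumerate(p_vals) if p < 0.05]
print(f"'significant' endpoints at p < 0.05: {len(hits)} of {n_endpoints}")

# With 40 independent looks, about two false positives are expected
# (0.05 * 40), and the chance of at least one is 1 - 0.95**40.
print(f"chance of at least one hit: {1 - 0.95**n_endpoints:.2f}")
```

    Roughly an 87% chance of at least one publishable "finding" from pure noise, before anyone concocts the story to go with it.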

  12. Jack Scannell says:

    It is not clear to me that the sensible concerns expressed here, or in the Young and Karr paper, necessarily reflect an observational vs. experimental distinction.
    A high ratio of published false positives to true positives is a consequence of factors such as relatively lax thresholds for rejecting null hypotheses (e.g., p < 0.05 rather than p < 0.00001), uncorrected multiple comparisons, low experimental power (i.e., a low true positive detection rate), shifting the analytical goalposts once the data are in, the desire to publish stuff that looks interesting, and a genuine rarity of true positives, among other things.
    Most of these factors apply to experimental studies too, as Derek’s blog has pointed out over the years (e.g., Begley & Ellis on cancer work in 2012, Prinz et al. 2011, Perrin on mouse models in 2014).
    The beauty of experiments over observations, of course, is that you can believe that your false results prove causation.
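    The arithmetic behind that false-positive-to-true-positive ratio is easy to make concrete. A minimal sketch (the prior, power, and alpha values below are invented for illustration, not taken from any of the cited papers):

```python
def ppv(prior, power, alpha):
    """Positive predictive value: the fraction of 'significant' findings
    that are true positives, given the prior rate of true effects,
    the test's power, and the significance threshold alpha."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

# If only 1 in 10 tested hypotheses is real and power is 50%, then at
# alpha = 0.05 roughly half of all published positives are false...
print(f"PPV at alpha = 0.05: {ppv(0.10, 0.5, 0.05):.2f}")
# ...while tightening alpha to 1e-5 makes positives nearly all real.
print(f"PPV at alpha = 1e-5: {ppv(0.10, 0.5, 1e-5):.2f}")
```

    This is why the rarity of true positives and the laxness of the threshold matter as much as anything done at the bench: the same test applied to a field with few real effects mostly publishes noise.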

  13. matt says:

    The Young and Karr paper has it right, IMO.
    Interesting to go back to the comments on the AD blood test and pull out Feynman’s Cargo Cult lecture: this follows precisely, I think, his “magic sauce” separating the science-make-believers from the actual science-makers. Feynman’s advice to the students was to have integrity and set up these controls to avoid fooling themselves; Young and Karr, I believe rightly, suggest the process be changed so that everything is in the open.
    In other words, rather than the bank (and customers) strongly urging employees to be honest and not steal, strong controls are put in place and everything is done in the open.
    I’d guess the National Academies, NIH, NSF, and the other major funding agencies would have to drive this for the work done for them, and that would generate enough attention to swing many of the publications that care about their reputation into some form of lip service for this model.
    But there are major professions (psychology being one, nutrition being another–probably many of the same culprits Feynman mentioned) which I think would be profoundly upset at having their routine disturbed (collect data, fish for surprising conclusion, publish, $$$!).

  14. Esteban says:

    @12,13: I agree that there is no reason to think this is only a problem with observational studies.

  15. Darren says:

    Another xkcd comic springs to mind.

  16. Oblarg says:

    That nutritional “science” generates almost nothing but false results is already known to any scientifically literate person. It is the same issue that we have with “big data” (though strangely that one seems to get more of a pass from people who really ought to know better – maybe it sounds more impressive?): it is statistically impossible to generate true results by looking at massive data sets confounded by large numbers of unknown variables without solid motivation for the effects being investigated.
    As good as the proposed measures would be for cleaning up the literature (though, sadly, I think they’re likely infeasible, as they’d spell the demise of essentially the entire field and a lot of careers – people’s livelihoods, unfortunately, depend on peddling this crap), they’d have no real effect on public gullibility. No one who isn’t a scientist can be bothered to look at the data were it public, nor would they have the tools to understand what conclusions to draw from it even if they did. Hell, if we can’t expect people who spend time studying “science” at actual universities to understand why you can’t double-dip on your data sets, what hope is there for the general public?
    The root cause of all of this is statistical illiteracy on the part of almost everyone. The only way to fix this is significant investment in practical math education for all students. It is a travesty that anyone graduates high school without a working knowledge of statistics, when it is probably the single most important tool needed to be an informed human being. Until we address this, no amount of reformation in our publication methods will do a thing.

  17. db says:

    It is similar to the idea that open source software inherently is more secure because it’s code can be inspected by anyone. Of course it can be, but so few people have both the capability and time to actually do so, that the ideal is far from the reality.

  18. db says:

    Autocorrect again shows its shortcomings. It’s really shameful.

  19. There’s also the Science News Cycle, according to PhD Comics.

  20. Vader says:

    As a mere interested layman (in the biological sciences; I’m a qualified scholar regarding world-shattering giant lasers), I simply apply a number of filters to these studies.
    1. Is the relative risk less than 3? Ignore for at least the first three or four papers.
    2. Is the study retrospective? Ignore.
    3. Is the sample size less than three digits? Ignore.
    I end up not feeling obligated to read very many observational papers this way.

Comments are closed.