Looking Back at Sutent’s Development: Problems or Not?

Here’s a report on the preclinical development of Sutent (sunitinib). The authors did a retrospective analysis of all the published studies and found a number of problems. It should be noted that these are definitely not limited to just this drug – in fact, the authors explicitly extend their conclusions to a lot of preclinical oncology research:

This systematic review and meta-analysis revealed that many common practices, like randomization, were rarely implemented. None of the published studies used ‘blinding’, whereby information about which animals are receiving the drug and which animals are receiving the control is kept from the experimenter, until after the test; this technique can help prevent any expectations or personal preferences from biasing the results. Furthermore, most tumors were tested in only one model system, namely, mice that had been injected with specific human cancer cells. This makes it difficult to rule out that any anti-cancer activity was in fact unique to that single model.

. . .evidence (suggests) that the anti-cancer effects of sunitinib might have been overestimated by as much as 45% because those studies that found no or little anti-cancer effect were simply not published. Though it is known that the anti-cancer activity of the drug increases with the dose given in both human beings and animals, an evaluation of the effects of all the published studies combined did not detect such a dose-dependent response.

The poor design and reporting issues identified provide further grounds for concern about the value of many preclinical experiments in cancer. . .

To be sure, animal models in oncology are already known to be of limited use. You’d think that being able to shrink tumors in a mouse model would be pretty predictive of the ability to do the same in a human, especially if these are human-derived tumor lines, but that isn’t quite the case. To use the framework laid out in this paper, that’s a problem with construct validity, a general disconnect between the model and the clinic. (More on this below.) The two other main problem areas are internal validity (bias or improperly accounted-for variation in the experiments themselves) and external validity (whether a result generalizes at all, or is instead being driven by factors peculiar to one particular model).

The authors find problems in all three areas. On external validity, testing in different species and different sorts of tumors would have helped, and there was a general trend towards smaller effect sizes as the models became more diverse. On internal validity, too many studies didn’t address dose-response, and (as mentioned in the quote above) randomization and blinding were a lot less common than they should have been. It should be noted, though, that when these practices did show up, they didn’t seem to have much influence on the effect sizes, so they may (at least here) be less important than they look.
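
To make that dose-response point a little more concrete: at the meta-analytic level, one common way to look for it is to regress each study’s effect size on its dose, weighting by precision, and ask whether the slope differs from zero. Here’s a minimal sketch of that kind of calculation, with made-up numbers rather than anything from the paper, and not necessarily the exact method the authors used:

```python
# A minimal, illustrative dose-response check at the meta-analytic level:
# inverse-variance-weighted regression of per-study effect size on log dose.
# The numbers below are made up for illustration, not taken from the paper.
import numpy as np

dose   = np.array([10.0, 20.0, 40.0, 40.0, 80.0, 80.0])   # mg/kg in each hypothetical study
effect = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 1.2])         # e.g. standardized mean differences
var    = np.array([0.10, 0.08, 0.12, 0.09, 0.15, 0.11])   # variance of each effect estimate

# Weighted least squares: effect ~ intercept + slope * log(dose),
# weighting each study by the inverse of its variance.
w = 1.0 / var
X = np.column_stack([np.ones_like(dose), np.log(dose)])
XtW = X.T * w                          # equivalent to X.T @ diag(w)
beta = np.linalg.solve(XtW @ X, XtW @ effect)
cov = np.linalg.inv(XtW @ X)
slope, slope_se = beta[1], np.sqrt(cov[1, 1])

# A slope indistinguishable from zero means the pooled studies show no
# dose-dependent response, despite the drug's known dose dependence.
print(f"dose-response slope = {slope:.3f} +/- {slope_se:.3f}")
```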

Publication bias varied quite a bit. For renal cell carcinoma, things actually looked pretty good, with thorough coverage relative to the effect size. (And that, clinically, is one of the drug’s major indications.) Looking at the published studies, though, you’d think that it was going to be of much more use for (say) small-cell lung cancer, where clinically it turned out to have little impact. Overall, as the authors say, “That all malignancy types tested showed statistically significant anti-cancer activity strains credulity”. And while that’s true, I wouldn’t rule out the general crappiness of the preclinical models pitching in on this as well: construct validity is a much larger problem in this field than outside observers tend to believe it is.
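
For what it’s worth, a common quick check for this kind of publication bias is a funnel-plot asymmetry test such as Egger’s regression: regress each study’s standardized effect on its precision, and an intercept well away from zero suggests that small studies with weak or null results are missing from the literature. Here’s a minimal sketch with hypothetical numbers, not the actual sunitinib dataset (and not necessarily the authors’ exact method):

```python
# Minimal sketch of Egger's regression test for funnel-plot asymmetry.
# Hypothetical per-study numbers, not the actual sunitinib dataset.
import numpy as np
from scipy import stats

effect = np.array([1.4, 1.1, 0.9, 1.6, 1.2, 0.8, 1.5])        # per-study effect sizes
se     = np.array([0.50, 0.35, 0.20, 0.55, 0.30, 0.15, 0.60])  # their standard errors

# Regress the standardized effect (effect / SE) on precision (1 / SE).
z = effect / se
precision = 1.0 / se
X = np.column_stack([np.ones_like(precision), precision])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)

# Standard error and two-sided p-value for the intercept.
resid = z - X @ beta
n, k = X.shape
sigma2 = resid @ resid / (n - k)
cov = sigma2 * np.linalg.inv(X.T @ X)
intercept, intercept_se = beta[0], np.sqrt(cov[0, 0])
p = 2 * stats.t.sf(abs(intercept / intercept_se), df=n - k)

# An intercept well away from zero (small p) points to funnel-plot asymmetry,
# i.e. small studies with weak or null results may be under-represented.
print(f"Egger intercept = {intercept:.2f}, p = {p:.3f}")
```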

The authors aren’t having the xenograft explanation, though:

One explanation for our findings is that human xenograft models, which dominated our meta-analytic sample, have little predictive value, at least in the context of receptor tyrosine kinase inhibitors. This is a possibility that contradicts other reports (Kerbel, 2003; Voskoglou-Nomikos et al., 2003). We disfavor this explanation in light of the suggestion of publication bias; also, xenografts should show a dose–response regardless of whether they are useful clinical models.

Should they? You’d hope so, but why does that have to be the case? And when you look at those two references they cite, you find this in the Kerbel paper:

It is not uncommon for new anti-cancer drugs or therapies to show highly effective, and sometimes even spectacular anti-cancer treatment results using transplantable tumors in mice. . .Unfortunately, such preclinical results are often followed by failure of the drug/therapy in clinical trials, or, if the drug is successful, it usually has only modest efficacy results, by comparison.

The Voskoglou-Nomikos one found that xenograft models were basically worthless for breast and colon cancer, but were predictive in non-small-cell lung and ovarian cancer as long as sufficiently broad panels of different cell lines were used. If you picked fewer cell lines and jumped the wrong way (and there’s no good way to know a priori which way that is), you were out of luck there, too. So I don’t see how these two reports so blatantly contradict the idea that xenografts – particularly xenografts during the period of sunitinib’s development – have (had) little predictive value. You could just as easily cite those two against this new paper’s claim here. And although they don’t cite it in this context, this paper’s own reference 28 (this paper, from the same era) has this to say about xenografts:

For 39 agents with both xenograft data and Phase II clinical trials results available, in vivo activity in a particular histology in a tumour model did not closely correlate with activity in the same human cancer histology, casting doubt on the correspondence of the pre-clinical models to clinical results.

So actually, I think the weight of the evidence is that the preclinical data for sunitinib look bad because the state of preclinical research in cancer, especially back at the time, was bad. Construct validity was the single biggest problem that this meta-analysis seems to have identified, and I really think you can chalk that up to “damn xenografts, again”. Internal validity should have been better, but (as noted above) didn’t seem to influence the results all that much, and external validity is wrapped up in the xenograft problem as well. Finally, publication bias certainly seems to have been present, but (in the indication where sunitinib was eventually approved, renal cell carcinoma), things looked far more representative, as compared to indications that were abandoned.

If this paper wants to come to the conclusion that preclinical oncology needs improvement, I absolutely agree. But I can’t endorse (or not yet) the idea that sunitinib’s particular case was hurt by anything more than those same limits.

Update: more on this from Retraction Watch.

12 comments on “Looking Back at Sutent’s Development: Problems or Not?”

  1. petros says:

    From simply reading some of the comments made by the authors of this paper, they come across as naive. Their comments on the xenograft models’ lack of predictivity are hardly surprising, but what are the alternatives? Would it be ethical, let alone cost-effective, to try to develop colonies of animals in which tumours develop?

  2. WD says:

    In other news…
    WSJ reports “FDA Puts Zafgen Trial on Partial Hold After Patient Death”
    http://www.wsj.com/articles/zafgen-says-fda-puts-trial-on-partial-hold-after-patient-death-1444997283

  3. bhip says:

    How often are cancer cells that turn out to be immune to the drug’s effect in 96-well plates then transplanted into mice? They aren’t.
    Xenograph models are primarily PK/PD exercises for “in vitro active” drugs (again, in certain cell lines) that can give you data on plasma protein shifts & potentially active metabolites.

  4. Z-squared says:

    @petros:

    If a model has no predictive value, then what is the point of using it? I have struggled with this question many times – our tendency as researchers is to go ahead and use a model we know has little value because it is better to do SOMETHING than NOTHING, but in the end I think we just end up hurting ourselves.

  5. Tim R says:

    @Z-squared (“If a model has no predictive value, then what is the point of using it?”)

    The reason that these models are used is that not everyone thinks they have zero predictive value. Drugs have been successfully discovered and developed using them.

  6. John Hood says:

    It is well known that fast-growing xenografts (colon and NSCLC in particular) are particularly sensitive to anti-angiogenic compounds. If you have ever performed the study, you would recognize that the reason is simple geometry and physiology. There is a very limited vascular network for a tumor sitting subQ on a mouse flank to recruit, and anything that disrupts that is very powerful. Many colon xenografts spontaneously necrose and involute because of that. In contrast, the human liver (where most CRC metastases land) is awash in nutrient-rich blood, and angiogenesis is hardly needed.

    We need to dig deeper than superficial generalizations like “xenografts aren’t predictive” and look at the details of why, lest we throw out the baby with the bathwater. In this case, you can accurately say that xenografts are over-predictive in certain tumor types with certain therapeutic modalities, and given the prevalence of those tumors and therapies, that can skew the trend for all tests.

  7. johnnyboy says:

    An animal model will always be bad if you don’t understand its limitations and how to use it. Thinking that “a drug that shrinks a colon carcinoma cell-culture derived xenograft will be effective in a patient with colon carcinoma” is not understanding the limitations of the model. I thought that this kind of thinking had died 20 years ago, but apparently some researchers are still outraged that culture-derived xenografts are not 100% predictive.

  8. Z-squared says:

    @ Tim R

    Your point that drugs have been discovered using these models is a good one, but it does not necessarily mean that the model really had “predictive value”. The problem is that sometimes we just get lucky, and that leads us to think that the model has much more predictive value than it actually does…and for some reason there is something about human nature that makes it just plain HARD to let go of that initial result even after being confronted with additional data that should make us substantially de-value the model.

    I agree with John Hood’s comments that understanding why / when the models are predictive for some subtypes but not others is an extremely valuable undertaking and should allow researchers to weight the data accordingly.

    I should also mention that this problem of latching onto models that may not be as predictive as we think is something that extends far beyond xenografts and drug discovery. I’ve worked in a few different areas of chemistry and there just seems to be a basic tendency to over-value the positive results (simple confirmation bias, I suppose) from our model studies simply because we do not have a better alternative, even if deep down we probably know better.

  9. milkshake says:

    I am on the Sutent patent, although I did not make the parent compound, because I joined the project late. What I remember, though, is that there were worrying signs of organ tox in animals (adrenal necrosis), which raised questions about a narrow safety margin. Also, SUGEN had previously made one expensive blunder with SU6668, which worked in animals but failed in the clinic due to very high plasma protein binding (which pushed the concentration of free drug below the therapeutic level). I was told that with that compound, the plasma protein binding was measured for bovine serum, dog serum, etc. – but not for human plasma.

    Sometimes biotech companies bringing their first drug into the clinic lack sufficient clinical development experience. I believe that, at a more experienced company, the very first drug we had, SU5416, would not have been brought all the way into late clinical trials only to fail for lack of efficacy, from low systemic exposure due to terrible insolubility. Big pharma would have recognized the problem and started the prodrug effort far earlier than we did (by the time we finally got a stable IV infusion prodrug, the 5416 project had been canned).

  10. ScreeningAgnostic says:

    I agree this paper is either very naive or disingenuous.
    First, in their assumption that xenograft models are the sole decision-making test for drug discovery and approval, and that we in the business actually take the results as gospel.
    Second, that all these studies actually had anything to do with the clinical development and approval of Sutent. A large proportion of the studies were published by academic labs, and post-2010, and therefore have no relevance to the drug’s approval.
    Third, that xenograft studies are intended to be unbiased screening assays for efficacy, rather than model systems for studying drug effects. Of course there’s a publication bias – we don’t just pick models at random and test them in animals with no expectation of efficacy – that would be frankly unethical.
    p.s. XenografT with a T (pet peeve).

  11. johnnyboy says:

    It’s essentially the same problem with animal tox studies. People who are a bit naive assume that the presence of a toxicity in animals means the same toxicity in humans, and the same goes for the absence of toxicity. Anyone who is actually involved in tox knows that the real purpose of animal tox is not to accurately predict human tox, but essentially to determine whether the drug is safe enough to go into phase 1 or not.

  12. HT says:

    Before we continue the discussion, can we agree that the authors (Henderson et al.) “disfavor” the idea that xenograft models are non-predictive? The authors “favor” publication bias and experimental design as the reasons for poor correlation between preclinical results and clinical observations. In fact, the Nature commentary on the article also focused on the design and reporting, without mentioning the models.
    It is one thing to have flawed models, and another to have flawed experiments based on perfectly good models. So for those who are trying to defend the models, perhaps you could redirect the arguments to the proper source?
