
Biological News

A First Look at Reproducibility in Cancer Biology

I’ve been waiting for these results: the Reproducibility Project (Cancer Biology) has been going back over several prominent papers in the field, seeing how well their results hold up. This follows a similar effort in experimental psychology, where the results were mixed. The hopes here were to reproduce 50 high-profile studies, but as this Nature News article details, those ambitions had to be scaled back. They’re doing 29 papers, and the first seven now have results (five of them are publishing in eLife today).

Those results are. . .mixed. Again. Two of the five seemed to reproduce fairly well, although they have their own problems: a paper from Science Translational Medicine on cimetidine as an antitumor agent worked, but the statistics were not as good, and one from Cell on BET bromodomain inhibition and cMyc, while it also worked, had substantial differences in the respective control groups. Another two yielded results that are basically uninterpretable compared to the original work – they might have worked, they might not, but something appears to have gone off along the way. One of these is a PNAS paper on SIRP-alpha as a cancer target, where the replication study didn’t find the reported level of tumor inhibition, but instead statistically non-significant growth with a few spontaneous remissions that confounded the statistics. The other is a Nature paper on a gene (PREX2) whose mutations seem to accelerate tumor growth, but in the reproduced study everything grew much too fast for differences to show up. Finally, though, one of the five (a Science paper on a peptide, iRGD, that aids in doxorubicin penetration of tumors) just seems not to have reproduced at all.

That last paper is from the lab of Erkki Ruoslahti at Sanford Burnham, and as that Nature News article reports, he’s nonplussed, to say the least. Ruoslahti says that his work has actually been reproduced in at least ten other labs around the world, and if he’s right about that, then that’s food for thought, too. I can find a number of papers that have used the iRGD peptide to enhance tumor drug delivery, so Ruoslahti may well have a point. In which case, quis custodiet ipsos custodes: why didn’t it reproduce for the reproducers?

When you get right down to it, though, none of these five papers did what someone outside the sciences might have hoped for – that is, reproduced pretty much as written. Working scientists (and especially working biologists) know that that’s a high expectation, which is why the first two papers are still in the “substantially reproduced” category. There are a lot of variables in this sort of work, and not all of them are even known to the authors themselves, by any means. The Reproducibility Project tells the labs redoing the papers that they have to follow things exactly as they were originally reported, but anyone who’s tried following complex papers of this kind will have had to mess around with the conditions along the way, not that that always works, either.

So what this is telling us so far is that (1) getting tumor biology papers to repeat exactly is very difficult, and (2) the original papers themselves are probably not providing as many experimental details as they could. Neither of these conclusions will be controversial for anyone in the field. What we don’t know yet is the fundamental nonreproducibility rate. There are papers that can’t be replicated because the cells were treated a little differently, or the buffer conditions had to be changed, or the antibodies used for detection were wonky, and so on and very much so on. Those will come right if you’re just willing to mess with them long enough. Then there are papers that can’t be reproduced because their fundamental results are invalid. No amount of tinkering will fix them. But as it’s currently being run, the Reproducibility Project is not going to tell us how many of those there are.

It’s already telling us that the literature in the field is probably inadequate in many respects and that there are a lot more factors at work than you might have guessed from reading the original papers. It’s good to have proof of that, but honestly, we knew it already.


36 comments on “A First Look at Reproducibility in Cancer Biology”

  1. johnnyboy says:

    Blimey. Might as well pack it in and go fishing.

  2. dearieme says:

    “the original papers themselves are probably not providing as many experimental details as they could”: in an age of cheap electronic storage there really should be an answer to that rather important complaint.

    1. Bagnar says:

Storage might be cheap, but time is still the same.
How long would it take to describe any experiment properly, including all parameters?

In the chemical experiments I run on a daily basis, I expect the reader, the reviewer, or anyone reading my lab notebook to be competent in organic synthesis. Some details might then be deliberately omitted.

If I had to describe precisely every colour change, every temperature shift, etc., my notebook would be bigger than the Game of Thrones books.

So, where is the limit between giving enough detail and spending too much time on it?

      1. anon says:

You must be joking. Your notebook and procedure are not for a degree or a reward. They are for the people who come after you, so that they can follow your work, check it, and build upon it. I often see in the literature “reagents were combined in ….”. How did you combine them? Did you dissolve one and add the other as a solid? What is the order of addition? How long did it take? Did you just dump the whole thing into the flask? You mention the color change, for example. Maybe your reaction worked because of that color change. Maybe it was a metal salt/impurity that made it work. If you don’t report these details, no one can repeat your work. And then we get those people with “special hands”. Nobody is special.

        1. Bagnar says:

I was being a bit ironic, but it didn’t come across very well in my comment.

Brief example.
I have a reaction running at -78°C.
I have a two-necked flask, with an argon line and a septum on one neck and a thermometer on the other.
My reaction is maintained at the same temperature during the addition of a random Grignard to my ketone.
Honestly, do you report in your notebook that you used this specific glassware? Do you mention that the temperature inside rose from -78 to -72°C at one point? Do you mention that the reaction took two hours and thirty-two minutes, or just 2.5 hours? And so on.

Indeed, order of addition, colour changes, and much more are mentioned. But I haven’t seen any chemist note in the notebook that the solution is light pink after two minutes, pink after 3, light orange after 5, etc.

Only the key elements are important, especially those parameters that really influence the desired experiment.
Following my previous example, who cares who your neighbour at the bench was that day… But it may influence your experiment, if that precise neighbour washed your glassware and some impurities remained…

So, yes, it is easy to record tons of details, but it is time-consuming and it may dilute the key points. The line may be really thin between enough and not enough supporting information.

          I deeply apologize if I appear pretentious in the first place.

          1. Paul says:

Perhaps it is because I am a process chemist rather than a medicinal chemist, but yes, I do record the specific glassware (a 100-mL 4-necked round-bottomed flask equipped with a cooling bath, mag stirrer, football mag stir bar, gas inlet adapter, thermocouple, and a septum), and temperature changes (the temperature rose from -78 °C at 1339 to -72 °C at 1341). With respect to time recording, I usually just put down the time and calculate elapsed time later. I even record color changes every minute or two if need be. I’ve always taken the approach that my lab book contains ALL the detail. I can always assess which details are important and which are trivial later. Many times, things I thought were trivial turned out later to be the vital clue. One former colleague of mine used to create a left-side column where he wrote down the time that everything occurred. Was this overkill? Possibly, but he never had to guess how long anything took or when it occurred. Our notebooks were not appreciably longer than others’. I’ve always figured that the company was more than happy to get us more notebooks. I never found this level of detail to be time-consuming, and I never found any key points being diluted by the trivial.

          2. loupgarous says:

I have to agree with Paul on this. Lab books are where you capture everything. Writing the experiment up in a paper is where you decide what to include (after considerable thought about your reasons for doing so). Color changes – and I shouldn’t have to explain this to chemists – indicate changes in constituents. Temperature changes can show a number of possible things you might want to report in a paper, once you’ve proven what they are.

            A lab book ought to capture what happens in an experiment.

  3. Anon2 says:

    So here’s the dream. You stand up at the bench and load a vial of your new drug target – you’ve discovered that inhibition of Whateverase II or a ligand for the Type IV Whazzat receptor would be a good candidate for modifying some disease. You load up a few reagents, and your speedy, capable physics goes to work fitting useful conformations of all the molecules in your company’s collection into the active site of the protein. When it’s finished with that – it doesn’t take that long, you know – it will go on to the current commercially available set of small molecules and do the same for them. If you want more, your brain has a function to enumerate new structures that it has reason to believe would be potent hits. Come back in a little while and the whole list will be rank-ordered for you.

    I guess I should stipulate that you’re also young, extremely well-paid, and ferociously good-looking, and that Stripebutt, your rainbow-colored pet unicorn, is looking over your shoulder and whinnying appreciatively while you get all this done. Because sometimes it looks like Stripey’s going to make an appearance before that process ever does, pesky unicorn droppings and all – we’ve been trying to realize something like this for decades, and anyone who tells you that we’re there is trying to sell you something.


    But we’re still listening for those unicorn hoofbeats in the distance. . .

    1. Anon2 says:

      In reference to yesterday’s post, if you missed it. Yesterday, we conclusively proved that computational chemistry doesn’t work. Today, we proved that medicinal chemistry doesn’t either!

  4. PS says:

    I can find a number of papers that have used the iRGD peptide to enhance tumor drug delivery, so Ruoslahti may well have a point. In which case, quis custodiet ipsos custodes: why didn’t it reproduce for the reproducers?

Yeah, because self-perpetuating bovine excrement has never been observed before in the cancer literature. The story of Runx3 as a tumor suppressor in GI cancer is a nice example. Yoram Groner’s group had to make three mouse strains and use five different antibodies to prove that Runx3 is not expressed in the GI tract, hence it cannot act as a tumor suppressor there. By the time Groner published his data, the original Cell paper had been cited over 200 times. Now the number of citations has risen to over 1000 according to Google Scholar, and I assure you that most of them don’t say “Oh, what a crock that Runx3-as-tumor-suppressor paper was”.

  5. Andy Extance says:

What if you have an approach that works 50% of the time, for example? Wouldn’t replication get pretty complex then? Take, for example, the work of Richard Lenski: he has 12 strains of E. coli that are continuously evolving, but one of them is showing uniquely interesting behaviour. What would it take to replicate that experiment?

  6. Mike B. says:

I don’t think non-biologists understand just *how complex* biology really is, because they’ve never done simple things like culture a flask of cells, for starters. I mean, yes, they understand that biology is hard and complicated, but do they actually understand, on an experimental level, what we’re dealing with? A lot of irreproducibility in biological/biomedical science comes from the complex nature of biology itself, not necessarily ‘bad science’ or people doing dishonest things. For example, cells have signaling networks that can sense things like how crowded they are. These networks upregulate and downregulate hundreds, if not thousands, of different genes. Additionally, some of the transcription factors involved in those mechanical stress-sensing pathways have secondary functions, such as regulating the biogenesis of micro RNAs. Even if you happened to follow another lab’s protocol to a T and used the exact same stock of cells, can you ever be 100% certain that you both cultured your cells to the same densities every single time over the course of a year’s worth of experiments? No? Oops, looks like your gene and micro RNA profiles are going to shift, simply because controlling for things like number of culture passages and densities is next to impossible.

And that opens up a whole other can of worms: how can you be certain your cells are epigenetically identical to those of the lab whose work you’re trying to replicate? Even if you obtained a cell stock from that lab, your epigenetic profile could shift differently for other, unknown reasons. And that’s the tip of the iceberg; we now know of phenomena like the epitranscriptome, where methylation of RNAs plays a role in gene expression regulation, yet we still don’t know in great detail what regulates RNA methylation. How do you know your epitranscriptomes aren’t shifting differently either?

    1. Dr CNS says:

      @Mike B.

      I suggest expanding your comment to biologists and non-biologists alike. It has nothing to do with your field of training, but rather with how you perform as a scientist.

      In the recent past we have created a narrative of science that is driven by simplicity – and science is anything but simple.

    2. Eric says:

      Agreed. If you go one step further and move into in vivo models the complexity increases again.

    3. David Mellor says:

Your example of cell density is an excellent example of a potential hidden moderator that could affect the outcome. If the original author knew that cell density was potentially important, then it should be included in the paper. If the original author was “lucky” and the experiment happened to be run at the optimal cell density and the replication did not control for that, then it will take a long time to figure out that cell density is an important factor. The fact is that that discovery process cannot happen without exact replications, whose methods and analyses are specified in advance, and whose results are published regardless of outcome. This is the Registered Reports model, and it is critical to pursue as the hidden moderators become more problematic (because all of the easy-to-detect effects have already been detected).

  7. Kendall Square Postdoc says:

In the supplement of the iRGD Science paper, they say that the synthetic peptides were “prepared as described” and reference a paper that does not provide any synthetic/preparation details for the iRGD peptide. (The citation is the wrong number; I’m guessing they meant the Cancer Cell iRGD paper from 2009, not reference 3 in the Science paper, which is just an explanation of what the single-letter amino acid codes stand for.)

I know the Ruoslahti lab aren’t chemists, but that Science paper was a very surprising result, so it would have been nice to provide more details than that. It’s not a complicated peptide, but details would have been useful to people trying to replicate the results (e.g. how was the cyclization done? Is the C-terminus a carboxyl? Purified in 0.1% TFA?).

  8. HTSguy says:

In that same issue of eLife, there is a discussion of what it means to be reproducible. Using their conceptual framework, as a drug-discovery scientist I’m much more interested in whether the phenomenon I’m targeting is “conceptually replicable” than in whether it’s “directly replicable”, because the former indicates a more general phenomenon that is not strictly dependent upon a particular set of experimental conditions (i.e. I think it has a better chance of being observed in a diseased human).

    1. Eric says:

I agree. I’m actually far more interested in having a similar, but not identical, study confirm the phenomenon in a different lab. It tells me that it’s a phenomenon that might be useful for drug development. If it only works in one cell line with one set of reagents under very specific conditions, then I’d hazard a guess it’s probably not broadly applicable.

  9. Eric says:

    All of the recent press about the reproducibility crisis has made me wonder – what fraction of the studies should we expect to replicate? It’s certainly not 100%. With a p-value of 0.05 a type 1 error would be expected about 1 out of 20 times and those studies would not be anticipated to replicate.

    Furthermore, when the replicating lab performs the follow-up study they also run the risk of a type 1 error – but more likely they could get a type 2 error and claim the study wasn’t reproducible.

    This doesn’t even address the issue of minor experimental variations between labs that increases the variability. There must be some statisticians reading the blog that can answer this question. If we ignore outright fraud and assume that published work is valid, what is a realistic expectation for the rate of successful replication?
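Eric's question has a simple back-of-envelope answer under textbook assumptions. The sketch below is illustrative only and not from the post: it assumes replications run at 80% power (a conventional figure, and the one the Reproducibility Project itself cites), a 5% alpha, and treats "fraction of published positives that are real" as a free parameter.

```python
# Back-of-envelope expected replication rate.
# Assumptions (not data from the project): replications have 80% power,
# alpha = 0.05, and some fraction of published positives reflect real effects.

def expected_replication_rate(true_fraction, alpha=0.05, power=0.80):
    """P(a replication reaches significance) for a random published positive.

    true_fraction: fraction of published positive results that are real
    effects; the rest are type 1 errors in the original study.
    """
    # Real effects replicate at the replication's power;
    # false positives "replicate" only at the alpha level.
    return true_fraction * power + (1 - true_fraction) * alpha

# Even if 90% of published positives were real, only ~73% of
# perfect-fidelity replications would be expected to hit significance.
print(expected_replication_rate(0.9))  # 0.725
print(expected_replication_rate(0.5))  # 0.425
```

So even before between-lab variability enters the picture, a replication rate well below 100% is the expected outcome, which is roughly Eric's point.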

    1. johnnyboy says:

      More than 2 out of 7, perhaps ?

  10. nitrosonium says:

i agree. if it takes all day, one full notebook and your hand nearly falls off from writing, include every single possible detail (color, temperature, time, volumes, order of addition, CAS numbers, reagent lot numbers, etc.) for all reactions. even for do-over reactions. label and cross-label unambiguously between notebooks and spectra. assume nothing.

as a synthesis guy i cringe every time i read: “…was worked up in the usual fashion”.

  11. Anon says:

    “The hopes here were to reproduce 50 high-profile studies, but as this Nature News article details, those ambitions had to be scaled back. They’re doing 29 papers…”

    So technically, that’s already 21 papers that *couldn’t* be reproduced, before they even report on the others.

  12. Anon says:

    By definition, no experiment (neither conditions nor results) can be reproduced *exactly*, since they can only be repeated separately, at different times. So the whole question/premise is wrong.

    The question we *should* be asking is: Do we reach the same fundamental conclusions with similar statistical significance by repeating experiments as best we can?

    Thus, we should only focus on drawing the same conclusions, not on getting the same data/results.

  13. Cancer Immunologist says:

For the CD47 Willingham paper, the “reproducibility project” only looked at a single figure in the paper and did not bother to attempt to reproduce the rest of it. They admit technical problems in their own findings (spontaneous rejections of the tumors in the control-treated mice) that greatly complicate their results. So their inability to reproduce the experiment could well be due to their own ineptitude, not mistakes by the Weissman group.

    Weissman’s work has been reproduced by several other labs now, including in syngeneic models (see work from Dick, Ploegh/Garcia, Van der Berg and others). Soon enough we’ll have data from the phase Ib combination phase of the anti-CD47 trial, the only “reproducibility” that matters.

  14. bacillus says:

Regardless of which results are “correct”, if the models are so fickle as to be essentially impossible to fully replicate, then under what conditions should the NME be administered in the clinic to an outbred individual with various co-morbidities, taking a dozen or so other medications, not to mention gender, age, etc. differences? It seems to me that you’d be just as likely to be successful in the clinic by foregoing the animal testing in this particular branch of biomedicine.

  15. Mike says:

    I’ve heard quite a few people credit Einstein with the definition of insanity as ‘doing the same thing over and over and expecting a different result’ (I’m unsure of the veracity of the quote). Any experimental biologist can tell you that true insanity is believing that you can control all the relevant variables in order to do anything more than once. As other commenters noted, the results that are worth following into discovery / development are likely those that are robust enough to reproduce the same general conclusions despite being replicated under somewhat different conditions.

  16. steve says:

    I remember attending a Gordon conference a long time ago when a group from Jackson labs couldn’t reproduce results from another lab (I think it was at Penn) on something to do with oocytes. Each tried multiple times and it kept coming up negative in one lab and positive in the other. Being adults they decided that each would reproduce the experiment in the other’s lab. Same results. They ended up identifying the water as the variable that made it work in one lab but not the other. I’m sure there are many, many variants of this happening all over science.

    One big problem is that most academic labs do not have appropriate (or any) QA/QC, proper protocols, etc. that you would find in a company lab. Plus most experiments are done by graduate students or post-docs who aren’t rigorously trained. Why would anyone be surprised then at the low level of reproducibility of academic labs?

  17. Bryan says:

    Important note from the eLife editorial accompanying the papers (link in handle): “The experiments in the Reproducibility Project are typically powered to have an 80% probability of reproducing something that is true: this means that if we attempt to repeat three experiments from a paper, there is only a ~50% chance that all three experiments will yield significant p values, even if the original study was reproducible. Therefore, we cannot place the bar so high that the replications need to hit a significant p value in every experiment.”
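The editorial's arithmetic is worth making explicit: with 80% power per experiment and independence between experiments (both assumptions stated in the quote, not verified here), the chance of three-for-three significant results is 0.8 cubed.

```python
# The eLife editorial's point in one line: at 80% power per experiment,
# the probability that all three replications of a genuinely reproducible
# paper reach significance (assuming independence) is 0.8^3, i.e. ~51%.
power = 0.80
p_all_three = power ** 3
print(round(p_all_three, 3))  # 0.512
```

Which is why, as Bryan quotes, requiring a significant p value in every single experiment would set the bar unreasonably high even for perfectly sound original work.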

  18. DanielT says:

    I think we are kind of missing the point in focusing on reproducibility and not on robustness. What we want are not results that can be reproduced if you hold every experimental detail constant, but results that work when things are highly variable. A result that only works in one mouse strain, with one cancer line, under one treatment protocol is effectively useless in the real world.

    What we want are results that are reproduced when almost nothing is held constant because those are the results that will translate to practical treatments. Let’s focus on robustness.

  19. TX raven says:

    so, if a branch of experimental biology cannot be reproduced for legitimate reasons, and in spite of our greatest efforts, should it be considered a scientific discipline?

  20. Nile says:

    With reference to the Lenski work, the *specific way* that one of his lines is doing something interesting isn’t reproducible.

    Lenski’s observations, however, are impeccable. And his *reported finding* isn’t a validation of a detailed prediction about that specific line. He set out to demonstrate that mutation and evolution can be observed, investigated, and validated right down to the molecular level, and he did.

    I am certain that he would have had the integrity to report a negative finding, although I am less sure that he would have got it published.

    And so we turn to the question under investigation in those cancer trials:

    “Is this compound effective against tumours or not, and can we demonstrate a ‘yes’ to high degree of statistical confidence?”

    That’s a yes or a no, and we count uncertain results as a ‘No’.

    It is alarming that these results aren’t holding up. My gut feeling is that the majority of these failures are down to a lack of time and resources, and maybe a need for the ‘laying on of hands’ by the original researcher: they are not intrinsically irreproducible, just difficult, exactly as Dr. Lowe says.

    I doubt that many will have the glaring errors of methodology and calculation that emerged from reproducibility studies in psychology: but some will.

    Yes, some of them will, and some of them will have no glaring error but they will indeed be intrinsically irreproducible because the results are not real. The researchers simply did not see what they thought they saw. Among other issues, one in twenty of all the conclusions ‘supported’ by a crude 95% confidence test are intrinsically irreproducible.

    Meanwhile, there’s a case to be made for keeping score of reproducibility ‘fails’ that only turn out right after extensive and expensive effort: individuals or groups that demonstrate a pattern need some oversight on their record-keeping and their publication practices.

    …And it’s not that they’re sloppy, it’s just that extra procedures are needed for some compounds and some processes. And we need to start collating records and getting systematic about that.

    I can draw the obvious analogy with ‘dangerous’ chemistry – yes, here on the home of Things I Won’t Work With! – in that those exciting chemicals are not unworkable: they need extra work, and careful attention to special procedures in order to work on them safely. And at the end of that, there are compounds that some people shouldn’t work on, and most would prefer not to, and a clear understanding of what you need to do if you have to.

    … And this, in turn and returning from our analogy, applies to reproducibility.

I’ll start by naming Curcumin the Chlorine Trifluoride of reproducibility, and raise you ‘any study of Alzheimer’s’ as the ‘Radiochemistry laboratory’ of reproducibility, where we’ll learn, with hindsight, that we should’ve applied a lot of extra controls and monitoring protocols when we did it all for the first time.

And yeah, some of this will involve filling up enough notebooks to act as an effective radiation shield in a for-real ‘hot lab’.

  21. simpl says:

Regarding research that changed my life: at junior high school, we were asked in an English lesson to explain “Like a candle-flame where salt is sprinkled” from Browning’s poem on the Pied Piper of Hamelin.
The literati quoted some encyclopedia (it flickers orange), and the scientifically minded reported that the flame goes out.
That day, the scientific half of the class was miffed, as the teacher sided with the literature. Afterwards, I was more aware that getting the details right is important.

    1. tangent says:

      Yes, when I’ve sprinkled salt on a flame it colored yellow-orange. You sprinkled salt on your flame and it went out?

      Just how many g / sec and g / m^2 / sec were you sprinkling here? Experimental details do matter!

  23. dvizard says:

    > There are papers that can’t be replicated because the cells were treated a little differently, or the buffer conditions had to be changed, or the antibodies used for detection were wonky, and so on and very much so on.

    But isn’t that the fundamental problem? I mean, assuming you are doing science to gain some insights about nature, or to find a potential pathway for a disease, rather than to get a paper through the peer review process successfully: then what good is a study if it isn’t robust? Doesn’t it basically mean that whatever result was published, if it is dependent on such narrow conditions, is basically meaningless for anything besides publication?

Comments are closed.