
The Dark Side

How Many Doctored Papers Are Out There?

Just how much crap is out there in the scientific literature? “Quite a bit” comes the answer from anyone with real experience of it, but that’s not too quantitative. Here, though, is an analysis by one (perfectly respectable) journal of its own output, and the results are. . .well, they range from “pretty bad” to “honestly, I expected even worse”, depending on your level of cynicism.

Molecular and Cellular Biology and its parent organization (the American Society for Microbiology) went back over the papers published in the journal from 2009-2016 (960 total, 120 random papers per year), looking for doctored/duplicated images (which is still one of the easiest ways to spot sloppiness and fraud). The procedure they used seems effective, but it does not scale very well: basically, the first step was to have Elisabeth Bik look at each paper (here’s an interview with her and the other co-authors). She seems to have a very good eye for image problems, and as an amateur astronomer, I can tell you that she would have made a very effective comet or supernova hunter for the exact same reasons. “Cuts and beautifications” were not scored as problematic – there needed to be outright duplications and/or serious alterations.

What they found was 59 papers with clear duplications, and in each case the authors were contacted:

The 59 instances of inappropriate image duplications led to 42 corrections, 5 retractions and 12 instances in which no action was taken (Table 1). The reasons for not taking action included origin from laboratories that had closed (2 papers), resolution of the issue in correspondence (4 papers), and occurrence of the event more than six years earlier (6 papers), consistent with ASM policy and Federal regulations established in 42 CFR § 93.105 for pursuing allegations of research misconduct. Of the retracted papers, one contained multiple image issues such that a correction was not an appropriate remedy, and for another retracted paper, the original and underlying data was not available, but the study was sufficiently sound to allow resubmission of a new paper for consideration, which was subsequently published.

Interestingly, this paper also records the amount of time all this took, and it’s substantial – at least 6 hours of staff time per paper, involving hundreds of emails overall and a lot of back-and-forthing. As usual, cleaning something up takes a lot more time than the act of making it messy in the first place. To that point, the journal introduced pre-publication screening of images in 2013, and the incidence of trouble did indeed decline notably starting in that year. (They didn’t tell Elisabeth Bik when the policy was introduced, so as not to bias her).

As those figures show, the good news is that many of the duplicated images appear to be sheer carelessness, and could be fixed. But nearly 10% of the papers flagged (5 of 59) had to be pulled completely. Extrapolating from this experience (and that of two other journals previously studied) leads to a rough estimate that the 2009-2016 PubMed literature database (nearly 9 million items) should have about 35,000 papers removed from it completely (and, of course, that means that a lot more papers in it still need to be fixed up). Overall, the number of junk papers can be described as “small but still significant”, and there’s no reason to have them cluttering up the literature.
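For readers who want to check that extrapolation, here is the back-of-the-envelope version using only the figures quoted in this post. The paper’s own estimate also folds in data from the two other journals mentioned, so treat this as a sanity check on the order of magnitude, not their actual calculation:

```python
# Back-of-envelope check of the extrapolation quoted above, using only the
# numbers given in this post (not the paper's exact model).

screened = 960            # papers screened in MCB, 2009-2016
flagged = 59              # papers with clear image duplications
retracted = 5             # papers pulled outright

pubmed_items = 8_800_000  # "nearly 9 million" PubMed items, 2009-2016 (approximate)

flag_rate = flagged / screened        # fraction with problematic images
retract_rate = retracted / screened   # fraction retraction-worthy in this sample

print(f"flagged: {flag_rate:.1%}, retracted: {retract_rate:.1%}")
# The ~35,000 figure implies a retraction-worthy rate in the same ballpark:
print(f"implied rate for 35,000 of {pubmed_items:,} items: {35_000 / pubmed_items:.2%}")
```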

Increased screening in the editorial phase seems to be worth the effort – it adds time, but not nearly as much as the time it takes to go back and fix things later. (And that fits with another time-honored piece of advice, that if you don’t want shit to land on you, then do not allow it to rise in the first place). This is, as the authors note, more of a recent problem due to the proliferation of the digital tools needed to mess around in this way – and, to be fair, these tools also allow for faster, easier honest mistakes to be made as well. And it admits of modern solutions, too – software to catch image duplications has been (and is being) worked on by several groups, and should obviate the need to clone Elisabeth Biks.
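As a rough illustration of what that kind of software does, here is a generic perceptual-hashing sketch, not any particular group’s tool; the filenames are made up, and real screening systems handle rotation, cropping, partial overlap and contrast changes that this does not:

```python
# A minimal sketch of perceptual-hash-based duplicate screening.
# Requires Pillow (pip install Pillow). Filenames below are hypothetical.
from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: compare adjacent pixels of a downscaled grayscale image."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    px = list(img.getdata())
    bits = []
    for row in range(hash_size):
        for col in range(hash_size):
            left = px[row * (hash_size + 1) + col]
            right = px[row * (hash_size + 1) + col + 1]
            bits.append(left > right)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

h1 = dhash("figure2_panelA.png")
h2 = dhash("figure5_panelC.png")

# Identical or near-identical panels give a very small Hamming distance;
# a cutoff of ~10 of 64 bits is a reasonable starting point for flagging.
if hamming(h1, h2) <= 10:
    print("Possible duplicated panel -- flag for manual inspection")
```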

That takes us back around to the question in the first paragraph, about how much crap is out there. Papers with clearly fraudulent images in them are obviously in that category, but there are many other less obvious ways that papers can be fraudulent. So I would call that 35,000 estimate a likely undercount, even given that there are many papers in PubMed that don’t have images of this sort in them.

But beyond fraud, there is the larger universe of papers that are basically honest but are simply no good – statistically underpowered studies, nonreproducible procedures, inadequate descriptions, conclusions that don’t necessarily follow from the data presented. The literature has always had these things in it. Poor-quality work has not been waiting on image-editing programs to make it possible; we all come with the necessary software pre-installed between our ears. Clearing out the frauds is an obvious first step, but it’s also (unfortunately) the easiest. The other stuff is, as it’s always been, on the readers to look out for.

Which brings up one last point, which has been made here and in other places before. In these days of modern times, as the Firesign Theatre used to put it, some of those customers of the scientific literature are not human beings. Machine-learning software has great promise for analyzing the huge pile of knowledge that we’ve generated, but such algorithms are easily poisoned by the garbage in/garbage out problem. Data curation is and always will be a crucial step in getting any machine learning effort to yield useful conclusions, and studies like this one just remind us that curating the biomedical literature is no simple thing.

37 comments on “How Many Doctored Papers Are Out There?”

  1. Chad Irby says:

    The biggest understated problem is probably plain old statistical abuse.

    Starting in the 1980s, when microcomputers became powerful enough to do actual statistical work, programs like SPSS started showing up in people’s offices instead of being run on minicomputers and mainframes.

    Nowadays, it’s incredibly easy to fiddle with variations on stats to find just the right curve-fitting (whether that approach is actually statistically valid or not). Instead of paying someone to do the real statistical analysis, too many researchers just sort of hand-wave the math until it looks right.

    It also lets you find correlations in large datasets that nobody would have noticed a few decades back. Bad correlations, not surprisingly good ones.

    1. Wintermute says:

      Chad Irby: I’m continually reminded of the hilarious site “Spurious Correlations”, which lets you look for significant correlations and determine R values in a HUGE dataset ranging from “number of murders by steam and heated gasses in the United States, by year” to “number of Nicolas Cage movies released by year”.

      Some of my favorites from the front page are a 99.79% three-way correlation (R = .9979) between US government spending on science, space and technology, the number of barred attorneys in the state of Georgia, and US suicides by strangulation or hanging; a 99.26% correlation between per capita cheese consumption in the US and the divorce rate in the state of Maine; a 95.86% correlation between the US per capita consumption of Mozzarella cheese and the number of civil engineering PhDs awarded in the US; and a -93.7% inverse correlation between the number of works of visual art copyrighted in the US and the number of females in New York State killed in accidents involving tripping or slipping.

      Just goes to show that if you collect data and dig hard enough, you WILL find something publishable. Even if that something is that the most important way to prevent drowning accidents would be to get Nicolas Cage to stop acting.
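The point in the two comments above is easy to reproduce from scratch: generate a pile of completely unrelated random “annual” series, scan every pair, and an impressive-looking correlation falls out by chance alone. A minimal sketch, with arbitrary sizes chosen only for illustration (nothing here comes from the site itself):

```python
# Sketch: search a pile of unrelated random time series for the best-looking
# pairwise correlation. With enough series, an impressive R shows up by chance.
import numpy as np

rng = np.random.default_rng(0)
n_series, n_years = 200, 10          # 200 unrelated "trends", 10 annual points each
data = rng.normal(size=(n_series, n_years))

best_r, best_pair = 0.0, None
for i in range(n_series):
    for j in range(i + 1, n_series):
        r = np.corrcoef(data[i], data[j])[0, 1]
        if abs(r) > abs(best_r):
            best_r, best_pair = r, (i, j)

n_pairs = n_series * (n_series - 1) // 2
print(f"best |R| among {n_pairs} pairs of pure noise: {abs(best_r):.3f}")
# With ~20,000 pairs of short random series, the "winner" typically has |R| > 0.9,
# which is the same arithmetic that powers the Spurious Correlations site.
```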

  2. Emjeff says:

    I agree that the stats abuse is a huge problem, and comes in large part from the availability of statistical software. The other more subtle, but more serious problem is that researchers (academics in particular) often do not design the analysis at the same time they design the study. Before you do any experiment, you should have a clear idea about how you are going to analyze the data, as the design and the analysis should go hand-in-hand. Instead, what happens far too often is that the study is designed and executed, with only a vague idea of how the data are to be analyzed. This can result in a lot of nonsense being generated.

    There is also a serious problem with data-dredging in the biomedical sciences, and no one is more guilty of this than epidemiology. Today’s epi practitioner plows through large datasets, finding correlations galore, and the words “Type 1 error” are never uttered. This is why so much junk is published yearly about what food/drink/exercise is good/bad/fatal for humans. No one listens anymore, and it gives Science a terrible reputation. Looking through databases for correlations is not and will never be “hypothesis-generating” – one must do actual experiments and come up with potential mechanisms for effects first.
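A minimal sketch of the Type 1 error problem described above, assuming nothing but noise: test one outcome against a hundred unrelated “exposures” and significant-looking hits arrive on schedule unless the number of looks is accounted for. All numbers here are arbitrary illustration choices:

```python
# Sketch: family-wise Type 1 error when dredging many null associations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_exposures = 500, 100    # 100 unrelated "foods", one outcome, all noise

outcome = rng.normal(size=n_subjects)
pvals = []
for _ in range(n_exposures):
    exposure = rng.normal(size=n_subjects)     # no real relationship to the outcome
    _, p = stats.pearsonr(exposure, outcome)
    pvals.append(p)

hits = sum(pv < 0.05 for pv in pvals)
bonf = sum(pv < 0.05 / n_exposures for pv in pvals)   # Bonferroni-corrected threshold
print(f"'significant' at p<0.05: {hits} of {n_exposures} (expect ~5 by chance alone)")
print(f"surviving Bonferroni correction: {bonf}")
```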

    1. AQR says:

      Isn’t “looking through databases for correlations” the goal of Big Data?

    2. zero says:

      > This is why so much junk is published yearly about what food/drink/exercise is good/bad/fatal for humans. No one listens anymore, and it gives Science a terrible reputation.

      This is important. Food and dieting advice is probably the most visible and relatable interface between ‘science’ and average people, and it’s a crapshow. Breathless articles certainly don’t help; journalistic mistreatment of the real results (when there even are real results) erodes trust in the news media as well as in science.

  3. spaget says:

    Journals could also use a common law enforcement technique; once you identify a bad actor (accidental or otherwise) go back and check their other submissions (and pass along the name to other publishers so they can check that author in their publications).

    1. cynic says:

      There’s no incentive for publishers to do so. Retractions require a lot of work and unpleasant communications, and they do not necessarily increase the journal’s impact factor. And, you know, you can always blame the peer reviewers for not spotting the errors. So why bother?

      1. T says:

        Editor here. I can’t speak for others, but in my department we always pursue allegations of misconduct, and as you suggest we also check the author’s previous work within the journal and our sister journals. Passing information to other publishers, however, is much trickier. Since there is no body to report/pass these cases to that has the authority to run a proper investigation and then officially “convict/acquit”, accusing these people to outside parties is essentially libel. In my experience, the people who really hold these things up are the research institutions, which have a clear motivation to bury potentially embarrassing incidents. Since there is no “research police force” and editors clearly have neither the qualifications nor the mandate to run official investigations of these cases ourselves, our only option in tricky cases is to contact the author’s institution and request an investigation, and such assistance is often far from forthcoming/satisfactory.

  4. Dominic Ryan says:

    Poor analysis without fraudulent intent has a range of motivations as well. I suspect a fair chunk of it is just plain sloppiness resulting from too little training of the right kind in their background (especially graduate work).

    But I’m sure some of it comes from a genuine belief that they are sitting on a golden nugget, the starting point of a revolution, the Nobel just down the road. I wonder, just how often does scraping the bottom of the data barrel actually produce a breakthrough result? Is anyone aware of any? My bet is that it is exceedingly rare.

    What do you think the balance is between sloppy and wishful?

    One might hope that in science, wishful dredging at the cost of violating statistics would be much less frequent than plain sloppiness, but human emotions don’t operate on such logic – otherwise the lotteries would go out of business.


  5. BernYeePixelClods says:

    It’s time to hold these reptiles accountable, and no one escapes the Bik Inquisition! Match that, AI! How much US grant money was spent on these data? I want my tax dollars back. Donald T, you seeing this?

    Ow, my eyes!

  6. Uncle Al says:

    If potential loss is less than potential gain, it is not a punishment – it is a business plan. Diversity is intolerant of ethics.

  7. Bob Seevers says:

    The thing that always griped me was not so much bogus papers as ones that presented the data in a way that made things look better than they actually were. I’m thinking of papers on novel synthetic techniques that report GC yields rather than what I can actually end up with in a flask. It’s lovely that your super new bond forming reaction went 99% to completion, but how about telling me the truth that 10-15% of that will be lost on work-up?

    1. John Wayne says:

      Don’t get me started about the shift from reporting enantiomeric excess (e.e.) to enantiomeric ratio (e.r.)

      1. asymcat says:

        it’s easier to calculate kcal/mol from er than from ee.
        does it matter if a reaction gives you 90% or 80% ee?
        both mixtures wouldn’t be all that pure anyway.

      2. AVS-600 says:

        As someone who used to do enantioselective chemistry (and reported e.e. in papers), enantiomeric excess is an annoying artifact of the way polarimetry measurements worked. Now that analytical chiral HPLC is de rigueur, e.r. is just as easy to determine from the raw data as e.e. The main reason people don’t report e.r. is because they want to avoid being perceived as juicing their numbers. But in terms of the “intuitiveness” of the values, no one reports diastereomeric excess or other forms of isomeric excess, and there’s a reason for that.

        1. bruce says:

          My vote is for er. Whenever I see an ee the first thing I do is work out the er in my head because it’s a more concrete thing to picture. “40% ee” doesn’t mean anything to me; a 70:30 ratio means something.

          1. Derek Freyberg says:

            But forget the 70:30 ratio – what really matters is how much of the desired isomer you actually got (oh, assuming you can separate the two). If the answer is 50%, you haven’t got an enantioselective reaction.
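For anyone outside synthetic chemistry following this sub-thread: ee and er carry the same information, and the conversion (plus the free-energy difference alluded to above in kcal/mol) is simple arithmetic. A small sketch, assuming straightforward kinetic control at room temperature:

```python
# Sketch: converting ee to er and to the free-energy difference behind the
# selectivity (assumes simple kinetic control at T = 298 K).
import math

R = 1.987e-3   # gas constant, kcal/(mol*K)
T = 298.0      # K

def ee_to_er(ee):
    """e.g. ee = 0.40 -> (70.0, 30.0), i.e. a 70:30 ratio."""
    return 100 * (1 + ee) / 2, 100 * (1 - ee) / 2

def ddg_kcal(ee):
    """Free-energy difference (kcal/mol) implied by the major:minor ratio."""
    major, minor = ee_to_er(ee)
    return R * T * math.log(major / minor)

for ee in (0.40, 0.80, 0.90, 0.99):
    major, minor = ee_to_er(ee)
    print(f"{ee:.0%} ee = {major:.1f}:{minor:.1f} er, ddG ~ {ddg_kcal(ee):.2f} kcal/mol")
```

Run as written, 40% ee comes out as the 70:30 ratio mentioned above, and the gap between 80% and 90% ee is only a few tenths of a kcal/mol, which is the point about er being the more natural scale for energetics.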

  8. Anonymous says:

    Prior replies have already hit on some of my favorite points. As an undergrad, I worked in a lab that shared a full-time PhD statistician who kept us from making gross errors of analysis. I was measuring binding curves and getting very good correlation coefficients, with a few points falling off the curve. We had a graphing calculator, and one of the senior staff members showed me how to “improve” the fit: he threw in an exponential, then some higher order terms, and maybe even a sine or cosine until all the points fell on the wobbly line. He was making the point that the simple physical/mathematical model was reasonable, that my data was pretty good, and that throwing in more parameters and terms made no sense in reality and was not justified.

    In addition to SPSS, R is available to anyone who wants to play the role of statistician and make pretty pictures even when they have no or incorrect knowledge of the data and physical phenomena they are attempting to model and illustrate.

    Regarding fraud vs poor analysis (and estimating “intent”): when someone is informed that their poor analysis is incorrect and they publish it anyway, they have crossed over the line to fraud.

    1. tangent says:

      “he threw in an exponential, then some higher order terms, and maybe even a sine or cosine until all the points fell on the wobbly line”

      Oh man, I haven’t thought of this Apple 2 game in a lot of years:
      http://www.mobygames.com/game/apple2/algebra-arcade
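The “keep adding terms until the line hits every point” lesson in the comment above is easy to demonstrate: as the polynomial degree climbs, the error on the fitted points always drops, while the fit to the true underlying relationship typically gets worse. A throwaway sketch with made-up data:

```python
# Sketch: overfitting a handful of points by adding polynomial terms.
# Training error always improves; agreement with the true relationship does not.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 12)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)   # truly linear, plus noise

x_dense = np.linspace(0, 1, 50)
y_true = 2.0 * x_dense                             # noise-free truth for comparison

for degree in (1, 3, 6, 9):
    coeffs = np.polyfit(x, y, degree)
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_err = np.mean((np.polyval(coeffs, x_dense) - y_true) ** 2)
    print(f"degree {degree}: error on fitted points {fit_err:.4f}, "
          f"error vs. the underlying line {true_err:.4f}")

# The high-degree "wobbly line" passes close to every measured point, yet
# typically describes the real relationship worse than the straight line does.
```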

  9. RBW says:

    I don’t remember if the book ‘How not to be wrong’ by mathematician Jordan Ellenberg has been mentioned here. A great read, and there’s a whole chapter on how statistical analysis is abused in biology.

  10. Big Richard says:

    I like how there was no action taken in six cases due to “occurrence of the event more than six years earlier.” Maybe this is because people only keep lab books for five years, but it is garbage. Nice to know there won’t even be a correction or a note on these papers, which will persist. But hey, who cites papers from more than 6 years ago? That never happens…

  11. Nick Danger, 3rd eye says:

    After all… how can you be In two places at once
    when you’re not anywhere at all ?
    Why, you probably still believe that pigs live in trees
    and that faithful Rover is nothing more
    than a pet lying by the doggy door….

  12. Crocodile Chuck says:

    All Hail Marx & Lennon!

  13. Scott says:

    The computer science and social science papers are filled with statistics so wrong that even this Human Resources major can call them out.

    First Rule of Statistics: Compare Like with Like, only one point of difference is allowed. Any paper that fails this test is junk.

    1. Chad Irby says:

      My rule for comparisons:
      “If it takes three or more qualifiers to tell me how great or awful something is, then it’s not that great or awful.”

  14. David Edwards says:

    Another issue is the need to have one or more publications to show, to boost one’s chances of securing the next research grant. The dark side of reward by results.

  15. Chris Phoenix says:

    Not quite on-topic for the post, but related to a lot of the comments:

    Let’s take a group of 7 particular people who have died, and do a statistical study of all causes of death for that group. If we find one cause of death that is statistically significant vs. a particular group of 24 other people, let’s publish it! In Nature!

    Of course Nature will publish it – why wouldn’t they? It’s a very interesting topic – reasons not to fly into deep space! Yes, the 7 people are the lunar astronauts who had died, and the 24 people are the LEO astronauts who had died, and 3/7 of the lunar astronauts died of cardiovascular causes (for a particular, somewhat broad definition). With N that low, and four causes of death, it would be hard _not_ to find a statistically significant increase in one of the causes.

    Also, the paper had some stuff about biochemistry, and a rodent study, to distract from the statistical flim-flammery.

    https://www.nature.com/articles/srep29901
    https://xkcd.com/882/
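One way to put a number on the multiple-comparisons worry in the comment above: simulate two groups of the same sizes (7 and 24 deaths) with identical underlying cause-of-death rates, and count how often at least one of the four causes comes out “significant” anyway. The equal 25% split across causes is an assumption made purely for illustration, not anything from the paper:

```python
# Sketch: with tiny groups and several causes of death, how often does pure
# chance hand you at least one "significant" cause? Group sizes follow the
# comment above (7 lunar vs. 24 LEO deaths); equal cause probabilities are
# assumed just to illustrate the multiple-testing effect under the null.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(3)
n_a, n_b, n_causes, trials = 7, 24, 4, 5000

false_alarms = 0
for _ in range(trials):
    a = rng.multinomial(n_a, [1 / n_causes] * n_causes)   # same underlying rates
    b = rng.multinomial(n_b, [1 / n_causes] * n_causes)
    for k in range(n_causes):
        table = [[a[k], n_a - a[k]], [b[k], n_b - b[k]]]
        _, p = fisher_exact(table)
        if p < 0.05:
            false_alarms += 1
            break

print(f"at least one 'significant' cause in {false_alarms / trials:.0%} of null runs")
```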

    1. Anonymous says:

      I don’t really understand this example. What would you do, send another n = 100 people to the Moon? Or keep sending them and not worry about space radiation because, hey, those data are from n = 7, so let’s keep observing for the next 50 years until we are sure it’s a real problem? And Scientific Reports is not Nature. Same publisher, but big difference.

      1. zero says:

        We would conclude that there is insufficient evidence to draw a conclusion. There is evidence to suggest that exposure to GCR *might* increase the risks of cardiovascular and ocular issues later in life. There is also evidence to suggest that astronauts live long, healthy lives even if they have gone into deep space and landed on another body. (One would expect that anyway from a cohort in peak physical and mental condition for the majority of their lives.)

        Astronauts are scientists, and they have the power to refuse a mission if they believe the benefits do not justify the risks. A paper like the one cited is of no benefit and may cause undue harm to public opinion of manned space exploration. We should not make policy decisions on the basis of this paper. More information is needed to justify a halt or change in procedures; until that is forthcoming we should proceed with our existing ample supplies of caution (and courage).

    2. Anonymous says:

      The following comment was previously posted to Pipeline, Pyridine Doesn’t Do What You Think It Does, 12-January-2017. Years ago (80s? 90s?), there “was a multi-year study of factory workers exposed to benzene (in measurable, work-place monitored non-zero concentrations) and a matched cohort that was not exposed to benzene in their normal lives. The factory workers lived longer and with fewer health problems than the cohort yet the study concluded that benzene is highly toxic and will kill you.

      This was in the days of Usenet newsgroups and sci.chem. (Anybody else remember sci.chem? We should hold a reunion!) One of the regular participants was known for his knowledge, sarcasm, subtlety and bluntness, vitriol and humor. I believe it was he who posted about the benzene exposure study:

      “The study proves that benzene will kill you. It kills you by making you live longer.””

      (No, it was not posted by Archimedes Plutonium.)

    3. loupgarous says:

      Low n in that population means low statistical power. P-values would be unholy high.

  16. Tourettes of Chemistry says:

    CDC catalyzed numerous reports in 2016 on suicides – a cohort sorting error has now led to a full retraction – one cannot script this stuff:

    https://www.cdc.gov/mmwr/volumes/67/wr/mm6725a7.htm

    And the current budget request adds some perspective:

    https://www.cdc.gov/budget/documents/fy2019/cdc-overview-factsheet.pdf

    Who is reviewing?

  17. eugene says:

    Now can Elisabeth Bik please go over all those JACS catalysis papers, and find the ones where they report great selectivity and yields, but the postdoc actually just put a dirty reaction on the HPLC in order to get 10 mg for a nice NMR? Oh, there is no way to determine that unless a few people in industry (since one is not enough) think it’s useful, try to repeat it, magically get in touch with each other, and are mad enough to care about raising a stink? Well, that should be good news for a few high-output groups, and the postdocs’ careers.

    A tip to the aspiring academic in a high-flying methods group, though: try to stay away from simple stuff like palladium acetate and some salts and easily purchasable pyridine ligands. Maybe try a bespoke catalyst that takes two steps to make. That increases the activation barrier for all that bad stuff happening by about 10 kcal per annoyed industry person.

    1. If you got separation problems, I feel bad for you son. I got 99 problems but a column ain't one. says:

      I’m sorry you feel that way, but I can understand your worries. I think this might be destroying our field. Our non-high flying method development group reports what we get out of the actual Schlenk tube (which leads to unimpressed reviewers). We also make special pre-catalysts, but our aim is never to achieve a high entry barrier to reproduction. In fact, I think reporting negative scope evaluation results in the SI should become a standard.

      1. eugene says:

        I do buy specialty catalysts from time to time for my reactions. Usually the ones for sale are from people confident in their chemistry. I think that to start solving the problem, reviewers need to put less emphasis on yields (and maybe even selectivity) and more on the novelty and usefulness of a reaction. After all, yields can be optimized for a specific substrate often much better than the conditions that the methods group used who published the article.

        In my last paper I reported quite a large, I feel, negative scope in the SI as well. I think this is useful and is not done enough.

        My comment is mostly based on a very real incident with a top ten methods development group, whose name everyone would know if I mentioned it. It really was a case of just making up yields and selectivities and presenting what was probably an HPLC purified compound. When caught, the group just did a huge correction, but the paper which was earlier accepted in Jackass on the strength of said selectivities and yields was not retracted. And the postdoc got rewarded with his own academic position. He might have gotten the position before the correction, but in any case, he is still a professor. Maybe his papers are more trustworthy now that he’s “made it”, but I can’t really trust his work and that of his previous advisor and I feel like they are hogging the grants and an academic position of someone who is more honest and careful. So this behavior works and it does pay. I would be surprised if it was not much more widespread.

  18. Greg says:

    For real: statistical analysis as provided by a colleague from another department.
    Correlation coefficient of a mechanically measured value as compared to a certified lab result.

    “R value = 0.28. Let’s do away with the lab testing; we have a good straight line that we can use to determine the actual value.”

    Yep, the “scatter” chart looked like the freckles on my face. Somewhat random.

    It’s amazing what can be achieved with Stats and a little knowledge.

    Have a good weekend 🙂

  19. retracted says:

    But does retraction actually have that big of an impact? I know of several papers that were retracted for “spectral errors” or other small edits.

    You could also be like the crazy-pants UT professor who decided to rescind someone’s doctorate over a spectral misassignment…
