Machine Learning Deserves Better Than This

This is an excellent overview at Stat on the current problems with machine learning in healthcare. It’s a very hot topic indeed, and has been for some time. There has especially been a flood of manuscripts during the pandemic, applying ML/AI techniques to all sorts of coronavirus-related issues. Some of these have been pretty far-fetched, but others are working in areas that everyone agrees that machine learning can be truly useful, such as image analysis.

How about coronavirus pathology as revealed in lung X-ray data? This new paper (open access) reviewed hundreds of such reports and focused on 62 papers and preprints on this exact topic. On closer inspection, none of these is of any clinical use at all. Every single one of the studies falls into clear methodological errors that invalidate their conclusions. These range from failures to reveal key details about the training and experimental data sets, to not performing robustness or sensitivity analyses of their models, not performing any external validation work, not showing any confidence intervals around the final results (or not revealing the statistical methods used to compute any such), and many more.
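To make the confidence-interval point concrete, here is a minimal sketch of one standard way to put an interval around a reported accuracy, the Wilson score interval for a binomial proportion. The function name and the example numbers (90 correct out of 100 test images) are made up for illustration and are not taken from the reviewed papers.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    The default z corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical result: 90 of 100 held-out images classified correctly.
low, high = wilson_interval(90, 100)
print(f"accuracy 0.90, 95% interval [{low:.3f}, {high:.3f}]")
```

With a test set that small, the interval is wide enough that a headline "90% accuracy" tells you a lot less than it seems to.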

A very common problem was the (unacknowledged) risk of bias right up front. Many of these papers relied on public collections of radiological data, but these have not been checked to see if the scans marked as COVID-19 positive patients really were (or if the ones marked negative really were negative). It also needs to be noted that many of these collections are very light on actual COVID scans compared to the whole database, which is not a good foundation to work from, either, even if everything actually is labeled correctly by some miracle. Some papers used the entire dataset in such cases, while others excluded images using criteria that were not revealed, which is naturally a further source of unexamined bias.
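The imbalance problem is easy to illustrate. The sketch below computes the accuracy of a "model" that always predicts the most common label, a baseline any real classifier should be compared against. The 50-in-1,000 split is a hypothetical collection, not a figure from any of the datasets discussed.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical collection: 50 COVID-positive scans among 1,000 images.
labels = ["covid"] * 50 + ["other"] * 950
label, accuracy = majority_baseline(labels)
print(f"always predicting {label!r} scores {accuracy:.1%} accuracy")
```

A classifier that never flags a single COVID case would look excellent on accuracy alone here, which is why per-class metrics matter on data like this.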

In all AI/ML approaches, data quality is absolutely critical. “Garbage in, garbage out” is turbocharged to an amazing degree under these conditions, and you have to be really, really sure about what you’re shoveling into the hopper. “We took all the images from this public database that anyone can contribute to and took everyone’s word for it” is, sadly, insufficient. For example, one commonly used pneumonia dataset turns out to be a pediatric collection of patients between one and five, so comparing that to adults with coronavirus infections is problematic, to say the least. You’re far more likely to train the model to recognize children versus adults.

That point is addressed in this recent preprint, which shows how such radiology analysis systems are vulnerable to this kind of short-cutting. That’s a problem for machine learning in general, of course: if your data include some actually-useless-but-highly-correlated factor for the system to build a model around, it will do so cheerfully. Why wouldn’t it? Our own brains pull stunts like that if we don’t keep a close eye on them. That paper shows that ML methods too often pick up on markings around the edges of the actual CT and X-ray images if the control set came from one source or type of machine and the disease set came from another, just to pick one example.
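One cheap sanity check for this kind of shortcut, sketched here under the assumption that each image comes with a record of where it came from: list every source whose images all carry the same label. The source names and the `(source, label)` record layout are hypothetical.

```python
from collections import defaultdict


def single_label_sources(records):
    """Return the sources whose images all share one label.

    `records` is an iterable of (source, label) pairs.
    """
    labels_by_source = defaultdict(set)
    for source, label in records:
        labels_by_source[source].add(label)
    return sorted(s for s, seen in labels_by_source.items() if len(seen) == 1)


# Hypothetical metadata for a pooled training set.
records = [
    ("hospital_a", "covid"),
    ("hospital_a", "covid"),
    ("public_set_b", "normal"),
    ("public_set_b", "normal"),
    ("hospital_c", "covid"),
    ("hospital_c", "normal"),
]
print(single_label_sources(records))
```

Any source that shows up in that list is one where "which scanner or archive did this come from" is a perfect stand-in for the diagnosis, and a model is free to learn that instead of the pathology.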

To return to the original Nature paper, remember, all this trouble is after the authors had eliminated (literally) hundreds of other reports on the topic, for insufficient documentation. They couldn’t even get far enough to see if something had gone wrong, or how, because these other papers did not provide details of how the imaging data were pre-processed, how the training of the model was accomplished, how the model was validated, or how the final “best” model was selected at all. These fall into Pauli’s category of “not even false”. A machine learning paper that does not go into such details is, for all real-world purposes, useless. Unless you count “putting a publication on the CV” as a real-world purpose, and I suppose it is.

But if we want to use these systems for some slightly more exalted purposes, we have to engage in a lot more tire-kicking than most current papers do. I have a not-very-controversial prediction: in coming years, virtually all of the work that’s being published now on such systems is going to be deliberately ignored and forgotten about, because it’s of such low quality. Hundreds, thousands of papers are going to be shoved over into the digital scrap heap, where they most certainly belong, because they never should have been published in the state that they’re in. Who exactly does all this activity benefit, other than the CV-padders and the scientific publishers?

42 comments on “Machine Learning Deserves Better Than This”

  1. Derek Jones says:

    You have completely misunderstood the purpose of machine learning in academia (based on my experience in software engineering). Machine learning provides a means for people who don’t know anything about a subject to publish papers in the field. All that is needed is some data (it does not have to be much, see second link below), some button pressing, the ability to convincingly spout techno-babble, and getting lucky with reviewers.

    Use of tiny datasets:

    1. dearieme says:

      “Machine learning provides a means for people who don’t know anything about a subject to publish papers in the field”

      So it’s just like getting a PhD in Economics.

      1. Steve says:

        Well said.

  2. John M says:

    I work on improving this situation within the cancer center where I am employed. I regularly use the results of this PubMed query to illustrate the point of this post and the STAT article.

    ((“machine learning”[Title/Abstract] OR “artificial intelligence”[Title/Abstract] OR AI[Title/Abstract] OR NLP[Title/Abstract])) AND (cancer[Title/Abstract] OR oncology[Title/Abstract])

    Today it returns 8661 results, including 1336 published in 2021! But there are incredibly few ML algorithms in daily clinical use in cancer, for all the reasons cited.

  3. Philip says:

    Derek, there does need to be a standard for all image processing AI studies.

    DICOM info is just too tempting for the AI to ignore. Think about it for COVID-19: the AI would get age and sex, and that is a good start that I would not want my image processing AI to take advantage of. To prevent this, all of the images must be converted to the same format (PNG would be my choice, but any lossless 16 bit per channel format would work). Make sure no metadata from the DICOM file makes it into the PNG file.

    For a first pass, I would only use data from the same model of X-ray, CT or MRI devices. That way the AI could not just look at say resolution and determine that the device used in the COVID-19 ward is where COVID-19 patients’ images come from.

    An example of how things could go wrong. The clinic where I work has two OCT devices (Zeiss Cirrus 5000 and 6000) that produce images that look identical to me, but the output is slightly different. One is close to the retina clinic and one is not. If we feed images from both into an AI training session for finding early signs of AMD, it would use the subtle differences between the devices to help it determine who was destined to be diagnosed with AMD.

    I think AI for many diagnostic devices is going to be important in the future, but we are going to have to be very careful about how they are trained.

    Normally I am out of my league on this blog. Not so much this time. I have been working with computer image processing for decades going back to programming Kontron systems back in the mid 80’s. I have only played with AI for a couple of projects, so I am not an AI image processing expert.

    1. a s says:

      It’s very difficult to hide information from ML models, but one thing you can do to avoid bias is to provide all those categories during training and then train it to ignore them – eg instead of producing the most accurate classification for the training set, you have a goal that the classification must be the same for all values of the age/sex fields.

      That way, even if it finds a way to detect them in the image, it won’t use it for anything.
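A rough way to check after the fact whether a model's output varies with such a field is to compare the rate of positive predictions across groups. The sketch below is only that post-hoc check, not the constrained training objective described above, and the group names and predictions are invented. A gap is a prompt to look closer rather than proof of bias, since real disease rates can differ between groups.

```python
from collections import defaultdict


def positive_rate_gap(predictions, groups):
    """Return (largest gap in positive-prediction rate, per-group rates)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical predictions (1 = positive) for two age groups.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["adult"] * 4 + ["child"] * 4
gap, rates = positive_rate_gap(predictions, groups)
print(rates, f"gap = {gap:.2f}")
```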

    2. Todd Knarr says:

      One thing any ML system needs to have done is training on known-result data sets. You feed it a set of data from across all of your sources where you know whether each data point is positive or negative, and the training criterion is that the model’s classification must match the known result. Alternate that with other training sets. Force the model to constantly recalibrate its parameters against known results.

      In software we call this “testing”. We feed in data where the correct results are known, and if the output differs from the expected correct result we know the code has a bug in it that needs fixed. ML models are software, they should be subject to the same testing regime.
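As a minimal sketch of what such a test can look like for a binary classifier: compare predictions against known labels and report sensitivity and specificity separately. This assumes the labeled cases were held out and never used in training (an assumption added here, and the whole point of the exercise). The labels and predictions are invented.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical held-out cases with known results.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```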

  4. Neo says:

    There is shoddy work in every research area. There is a lot of that too in big pharma:

    It is easy to hide when you are a pharmaceutical industry researcher. If only journals did their job and enforced reproducibility, we would not get so much snobbism coming from that direction, I am sure.

    1. Patrick says:

      Yes, it’s very easy to hide as a corporate researcher – especially from all those downstream users of your work.

      Fake the assay results? No problem – no one else ever uses what the chemists make. The biologists won’t notice the compound doesn’t bind, and certainly the rats won’t.

      A harsh truth is that it is dramatically easier to hide ongoing fraud or incompetence in academia, because a great deal of work is funded without a real world downstream user. It just goes in to the void, and continues to be funded anyway. (There are many reasons for this, a lot of them not even bad – it’s impossible to tell for sure in advance what will be useful.)

      But in industry? It’s a hell of a lot harder to persistently do work no one else uses. And if you lie or just do shit work, you’ll usually get found out.

      Get off your high horse and admit the real problems in academic research, rather than insulting industrial researchers. It doesn’t invalidate the enterprise to recognize it has flaws.

    2. CMCguy says:

      I am not sure the Forbes article points to “shoddy work” or any “big secret” in the pharmaceutical industry where common knowledge for those in the field indicates there are many drugs that have been developed that are found to work effectively but not necessarily by the targeted or proposed MOA. It just reinforces how far we still are from understanding highly complex biological systems and how to impact those systems with administration of man-made agents. While seemingly sound science will be the starting point in discovery programs there nearly always appears to be a serendipity element that comes in to play to go from idea to actual beneficial treatment. Part of that process is determining that “Yes it Works!” excitement coupled with a frustratingly “Not sure how and not apparently by our proposed pathway” that requires hard and quality work to execute and then will/can not be kept hidden.

    3. medchemgeek says:

      You didn’t REALLY think you were going to convince anyone of anything by linking to an article by that font of all human knowledge, Alex Z, did you?

  5. Derek Jones says:

    Philip, I’m sure that at some point in the future AI will be capable of answering all kinds of questions. However, I think this is many years in the future.

    The current AI successes have been the result of massive computational power being available to anyone with a credit card, with a suitable spending limit (previously the upfront cost of building the necessary system deterred most), and the availability of lots of domain specific data (e.g., cat pictures).

    Piekniewski’s blog provides a refreshing dose of reality in the AI field:

  6. Sergey Page says:

    You use Google search every day, yet Google also “[fails] to reveal key details about the training and experimental data sets, to not performing robustness or sensitivity analyses of their models, not performing any external validation work, not showing any confidence intervals around the final results (or not revealing the statistical methods used to compute any such), and many more.”. It’s still a useful tool, even for doctors. What am I missing? Why the double standard?

    1. metacelsus says:

      Google isn’t making life-or-death decisions about patient healthcare.

      (At least, not yet.)

    2. Judgy McJudgeface says:

      Google is a search tool.
      Of course its results are (or should be) judged to a higher standard than results published in a peer reviewed journal.
      If they aren’t then what’s the point of a peer reviewed journal?

      1. Judgy McJudgeface says:

        I got that phrasing the wrong way round. Results in a peer reviewed journal should obviously be judged to a higher standard than Google.

        1. stewart says:

          It’s all right; I interpreted the original as sarcasm.

    3. sort_of_knowledgeable says:

      The millions of users that use the Google search engine every day are the external validation method. If the search doesn’t turn up something they expect, they stop using it.

      Also the problems are usually easier. Search for “how to fix door bell” and articles are found that work.

      Search for something harder like “cure Alzheimer’s disease” and the results may start becoming less reliable.

  7. Thoryke says:

    It has always depressed me that I saw more rigorous work done to ensure clear coding criteria and achieving good inter-rater reliability in the Psych and English departments at Carnegie Mellon than I have seen in many a published science paper. When you try and run a meta-analysis and realize that out of 84 studies of a phenomenon, only 4 have actually covered all the common bases necessary for a rational assessment, ….aaaaagh.

  8. Markus Sitzmann says:

    This really starts sounding like a verbatim copy of the outcome of the previous ML/AI hype about 25 years ago in the late 90s. And like any hype in data science, it depends on the quality of data, which quite often is a real problem. And yes, “putting a publication on the CV” as a real-world purpose :-).

  9. John Wayne says:

    A friend of mine recently attended a ML/AI conference online. I asked her what she thought. After a pause she said, “It all seemed quite hopeless.”

    Still makes me laugh.

    Don’t get me wrong, as scientists we should definitely be doing this sort of work. It doesn’t seem to be worth much yet, but that will change.

    1. Jerry says:

      Given all the real world applications expert systems are already being used for (job candidate evaluation and parole determination serve as existing frightening examples) the real tragedy is how much damage is done before people get a true handle on these things (assuming we ever get a handle on it at all….)

  10. tally ho says:

    Derek – thanks for pivoting from Covid-19 topics to “things that smell” 🙂

  11. En Passant says:

    For example, one commonly used pneumonia dataset turns out to be a pediatric collection of patients between one and five, so comparing that to adults with coronavirus infections is problematic, to say the least. You’re far more likely to train the model to recognize children versus adults.

    FWIW, even generally AI is out of my league, but one thing seems intuitively obvious: without fiducial markers on images, using AI to recognize anything is not going to work well. Facial recognition AI works in part because there are naturally occurring fiducial markers in facial images. For example, the distance between the centers of eye pupils.

    I don’t have the first clue what naturally occurring fiducial markers might be present in, say, chest x-ray images. But I would expect there would be some that physiologists could define.

    At least the presence of natural markers would allow scaling, so that the actual size of the subject could be eliminated and the AI could train on image characteristics relevant to whatever it is looking for.

    But, as I said, AI diagnosis of clinical images is light years out of my league.

    1. Philip says:

      AI for medical image diagnosis is here:

      More will come with time. I suspect AI diagnosis will come from the device manufacturers. As I stated in an earlier post, limiting training to one device makes things easier.

    2. Patrick says:

      I think there are plenty of parameters you could extract from a chest x-ray to guess age, gender etc as long as the imaging setup is standardised – something like the distance between the humeral heads would be a good analogue to using the pupils in facial recognition.

      The main problem seems to be that the authors of these papers are too lazy to do even the most basic sanity checks on their data. As Derek says, getting some data (any data, from god knows where) and running it through an algorithm is easy. The hard graft is generating a trustworthy dataset in the first place…

  12. Daniel Barkalow says:

    Have they tried looking for COVID-related pneumonia in post-mortem Atlantic salmon?

    1. eub says:

      Ha, I was wondering when that would come in!

  13. sgcox says:

    I will believe in AI power when one of them analyses the others and comes to the conclusion that they all are hokum. And then tweets it.

    1. John Wayne says:


    2. Ken says:

      Predicted (of course) by xkcd, at

      1. Some idiot says:

        The mouse-over on that one is (as usual) the icing on the cake… The guy is a global treasure…!

  14. LeeH says:

    There’s a lot wrong with how machine learning models are described in the literature, and what claims are made. My biggest pet peeve is a disregard for experimental error, such as model metrics reported to 4 significant figures using data with one or two sig figs (like IC50s). Machine learning people just do not generally have backgrounds where they have generated data themselves, and therefore do not have an appreciation for the limitations and pitfalls of that data.

    That being said, academic machine learning papers are usually not about the production of industrially useful models. They’re about providing evidence about the potential of the application of a method. If an imaging company wants to make use of that technology, it’s their responsibility to do all of the due diligence in order to embed that method into their product.

    I’m not excusing lazy or incomplete paper writing and refereeing. And the journals are chock full of trivial papers. But it’s too easy to ding papers that don’t have particular analyses, or don’t have all of the quality controls that you think are necessary. For instance, confidence limits are not really pertinent to image classification, where you’re just trying to assign a probability that an image belongs in the positive or negative bucket. Sensible cross-validation, for instance, is more important.

    All models are wrong. Some are useful. Doesn’t Google correctly identify most pictures of you as you? Do you think they vetted Fluffy’s cat photos?

  15. ye74992 says:

    We have been here before…
    How not to develop a quantitative structure-activity or structure-property relationship (QSAR/QSPR)
    J C Dearden, M T D Cronin, K L E Kaiser
    DOI: 10.1080/10629360902949567
    ..and the same points apply if you are doing some fancy schmancy ML or just linear regression.
    LeeH’s point about data is well made. It seems so many skipped the lessons on precision vs accuracy, and the software vendors are among them.

  16. BayesianFrequentist says:

    This sounds like a terrible thing to say, but I think part of the problem is that the community has developed so many powerful open-source software packages for ML (keras, sklearn, pytorch, tensorflow, caret), that it has become too easy for people with no idea what they are doing to “train” models.

    I have seen some really terrible ML papers published in biology journals — papers that used small training sets with P >>>> N and did not take even elementary precautions to ensure their model would generalize — and it’s definitely a little depressing. Maybe that is just a symptom of the broader state of scientific publishing though, I don’t know.

    ML is definitely a powerful tool for bioinformatics, DeepMind’s success with AlphaFold demonstrates that. But you have to spend the time and effort to properly evaluate a model, and as you say think carefully about what you are using to train it. Apparently an alarming number of people either don’t realize that is important, or don’t care because they are much too focused on getting a quick paper to pad their resume (not sure which is worse).

  17. ML guy says:

    There is a huge economic incentive to publish. Having a couple of ML papers puts a fresh PhD in the front of the queue for jobs with ~200K or more in salary. Especially if one of the papers makes it to one of the prestigious ML conferences. It has been studied and concluded that due to the volume of submissions and lack of reviewers the acceptance at these conferences is random. It is purely a numbers game.

  18. Your article is amazing
    Everything well explained
    I am also a blogger
    I know it’s hard job but you did it very well
    Keep it up

  19. Dan Elton says:

    It’s really quite shameful the crap that gets published in the ML field, even in top outlets like NeurIPS and ICML. Journals have lowered their standards to basically zero to capitalize on the hype and increase their impact factor. It’s a big frenzy of crap papers and lots of citations all around which journals love because it boosts their impact factors.

    One of the problems is there’s so many new entrants to the field that many of the reviewers are newbies (undergrads or PhD students) with very low standards for scientific rigor. Many of them are CS majors who don’t think very rigorously (no offense!) or know how to or think of doing statistical significance tests when making comparisons.

    A good summary of the issues can be found in this paper from a few years ago:

    I have also blogged about the deluge of crappy “AI for COVID” papers (focus on Radiology)

    It’s really terrible how much time and taxpayer money is being wasted on this crap. A side point is that a lot of US taxpayer money spent on producing these papers is in effect being used to subsidize the training of Chinese PhD students and postdocs who more and more are leaving the country after graduation due to enticements by the CCP (for instance faculty positions, etc). In such scenarios taxpayers get no value and the CCP gets a gift for their dreams of a totalitarian police state and global dominance.

  20. Dr Scott says:

    Being an emergency physician and pandemic playbook planner, I consider myself more of an Ai translator trying to sit at the intersection of hype vs reality in health care and Ai; I am way more encouraged than discouraged these days on the role of Ai in optimizing health outcomes in the very near future. However, way way way too many “Image recognition of Sars-Covid 2 pneumonia on CT” papers the past 18 months and nobody ever talked to front line physicians to see if that was really a problem we needed solved.
    I try to temper my enthusiasm with pointed articles like this one:

    and this one:

    “Ai is hard…and that’s okay”

  21. Marawan Ahmed says:

    I would also say that everybody should be suspicious when the models’ accuracies exceed the expected success rate of a human expert by a significant margin. In the end, if the model learns what the physician learns, then they should get the same performance, a little faster and more democratized, that is all. If the margin is too high, then this is a hallmark that the models learn something else, i.e. spurious training. We should remember that the labels given to the model to train with are actually assigned by these experts.

  22. Anonymous says:

    For anyone who wants to learn more about the pitfalls of machine learning in radiology applications, I highly recommend the blog of Luke Oakden-Rayner, a radiologist and machine learning researcher. He was sounding the alarm on ML (and CT) covid diagnosis back in March of last year:

  23. TallDave says:

    gigo still gigo
