
In Silico

How To Deal With Machine Learning Papers

Here’s a very useful article in JAMA on how to read a paper that proposes a machine-learning diagnostic model. It’s especially good on that topic, but it’s also worth going over for the rest of us who may not be diagnosing patients but who would like to evaluate new papers that claim an interesting machine-learning result. I would definitely recommend reading it, along with this one on appropriate controls in the field. The latter is a bit more technical, but it has some valuable suggestions for people running such models, and you can check to see whether those have been implemented in any given paper. Edit: I should definitely mention Pat Walters’ perspective on this, too!

The new article has a pretty clear basic introduction to the ML field, and frankly, if you take it on board you’ll already be able to at least sound more knowledgeable than the majority of your colleagues. That’s the not-so-hidden secret of the whole ML field as applied to biomedical and chemical knowledge: there are some people who understand it pretty well, a few people who understand it a bit, and a great big massive crowd of people who don’t understand it at all. So here’s your chance to move into the “understand it a bit” classification, which for now, and probably for some time to come, will still be a relatively elite category (!)

As you’d imagine, most of the diagnostic applications for ML involve image processing. That’s widely recognized as an area where these techniques have made significant progress, for several good reasons. Conceptually, we already had the visual cortex to point the way as an example of multilayered neural-net processing. A key advantage is the relative data-richness of images themselves, and it’s especially useful that they come packaged in standardized digital formats: grids of pixels, each already assigned numerical values in some standard color space. There has also been a massive amount of time and money spent on developing the image-recognition field, not least for defense and security applications, and that investment has had a big influence over the years.

All that work has also exposed some of the pitfalls of image recognition – see this recent article for a quick overview. Every deep-learning algorithm has vulnerabilities, just as our own visual processing system does (thus optical illusions). And you have to be alert to the ways in which your snazzy new software might be seeing the equivalent of lines that wiggle when they’re actually stationary, or is missing the equivalent of that person in the gorilla suit weaving in between the basketball players. One characteristic of neural-network models can be brittleness: they work pretty well until they abruptly don’t, and although you would really like to know when that happens, the model may be constitutively unable to tell you that.

Consider what is probably the absolute worst-case “adversarial image attack” for a given system – one where someone knows the ins and outs of just how it was developed and trained, and (more specifically) knows the various weights assigned to parameters during that training and optimization. With such data in hand, you can produce bespoke images that specifically address vulnerabilities in the algorithm, and such images are simply not detectable as altered by the human eye. The example shown (from this page at Towards Data Science) is a “projected gradient descent” attack against the ResNet50 model – the perturbations in the middle panel were specifically aimed at its workings (and have been magnified by a factor of 100 just so you can see what they’re like). As you will note, the resulting image is indistinguishable by inspection from the starting one, but the program is now more confident that the bird is a waffle iron than it ever was that the image contained a bird at all. The potential problems of such adversarial attacks on medical imaging are already a subject of discussion, but errors do not need to rise to the deliberate bird-versus-waffle-iron level to be troublesome.
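
For readers who want to see the mechanics, here is a minimal sketch of a projected-gradient-descent-style attack in the same spirit. Everything in it (the toy linear model, the data, the parameter values) is a hypothetical stand-in, not ResNet50, but the attack loop is the real idea: take signed gradient steps that increase the loss, and keep the total perturbation clipped to an imperceptibly small per-pixel budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in "model": a linear two-class classifier over a flattened
# 8x8 "image". The attack loop is the same idea as PGD against a deep
# network, minus the deep network.
W = rng.normal(size=(64, 2))

def logits(x):
    return x @ W

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

def loss_grad(x, true_class):
    # Gradient of the softmax cross-entropy loss w.r.t. the input pixels.
    p = softmax(logits(x))
    p[true_class] -= 1.0          # dL/dz for softmax cross-entropy
    return W @ p                  # chain rule back to the input

def pgd_attack(x, true_class, eps=0.05, step=0.01, iters=20):
    """Take signed gradient-ascent steps on the loss, projecting the
    perturbation back into an L-infinity ball of radius eps each time."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad(x_adv, true_class))
        x_adv = x + np.clip(x_adv - x, -eps, eps)   # the projection step
    return x_adv

x = rng.normal(size=64)
cls = int(np.argmax(logits(x)))           # the model's original call
x_adv = pgd_attack(x, cls)
conf_before = softmax(logits(x))[cls]
conf_after = softmax(logits(x_adv))[cls]  # lower, for a near-identical input
```

In the genuine attack the gradient comes from backpropagating through the full network; the projection step is what keeps the altered image indistinguishable by eye.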

The JAMA paper recommends that you ask yourself (or perhaps a new paper’s authors!) several questions when you see an interesting new ML diagnostic method. For one thing, how good is the reference set? “Good” can and should be measured in several ways – size, real-world fidelity, coverage of the expected usage space, inclusion of deliberately difficult or potentially misleading images, etc. Note that when IBM’s attempt at using its Watson software for cancer diagnosis failed, one reason advanced for that wipeout was that the cases it trained up on were synthetic ones produced for its benefit (although to be sure, there were probably many other reasons besides).

Another question to ask is two-pronged: do the results make sense, or are they perhaps a bit too counterintuitive? Counterintuitive and right is a wonderful combination, but counterintuitive and wrong is a lot easier to achieve. On the other side, are the results just too darn perfect? That’s a warning sign, too, perhaps of an overfitted model that has learned to deal perfectly with the peculiarities of its favorite data set, but will not do so well when presented with others. And then there are the ever-present questions of repeatability and reproducibility: if you feed the same data into the system, do you get the same answer every time? And can other people get it to work as well?
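
The “too darn perfect” warning sign is easy to demonstrate on synthetic data. Here is a minimal sketch (all of the data is made up for illustration): an interpolating polynomial fits its training points essentially perfectly, and that perfection is exactly what should make you suspicious once you evaluate it anywhere else.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical assay: the true relationship is a simple straight line,
# measured with a little noise.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=8)
x_test = np.linspace(0.05, 0.95, 50)     # new points on the same true line
y_test = 2.0 * x_test

def rmse(coeffs, x, y):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-7 polynomial through 8 points fits the training data
# essentially exactly: the "too darn perfect" model.
overfit = np.polyfit(x_train, y_train, deg=7)
# A straight-line fit matches the actual relationship.
honest = np.polyfit(x_train, y_train, deg=1)

train_err = rmse(overfit, x_train, y_train)   # essentially zero
test_err = rmse(overfit, x_test, y_test)      # noticeably worse off-grid
honest_err = rmse(honest, x_test, y_test)
```

Re-running this with the same seed gives the same numbers every time, which is the repeatability half of the question; reproducibility is whether someone else’s independent implementation agrees.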

The “adversarial controls” paper linked to in the first paragraph (Chuang and Keiser) also recommends a similar reality-check approach, and also recommends seeing if other models converge on the same answers. If not, that’s a sign that one or more of them (all?) are reacting to extraneous patterns that have nothing to do with the issue at hand. They also strongly suggest that people generating such models deliberately try to break them: take out some part (or parts, one at a time) that you would think are crucial and check to make sure that they really are. That’s what led to this situation that I blogged about a year ago, when the substitution of random data for an ML model’s parameters did not seem to degrade its “performance”. If the authors of an ML system you’re interested in haven’t done things like this, then you should try them yourself, by all means.
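
A minimal sketch of that kind of deliberate breakage, on made-up data and assuming nothing fancier than a nearest-centroid classifier: the control is to scramble the training labels and retrain. If the “performance” survives the scrambling, the model was never using the relationship you cared about.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up two-class data in which the features genuinely carry the signal.
n = 400
labels = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, 5)) + 1.5 * labels[:, None]

def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Nearest-centroid classifier: about the simplest model there is."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - c1, axis=1) <
            np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return (pred == y_te).mean()

half = n // 2
real_acc = centroid_accuracy(features[:half], labels[:half],
                             features[half:], labels[half:])

# The control: scramble the training labels and retrain. Any apparent
# skill that survives this is coming from somewhere it shouldn't.
scrambled = rng.permutation(labels[:half])
control_acc = centroid_accuracy(features[:half], scrambled,
                                features[half:], labels[half:])
```

The same template works for the feature-side control: replace a supposedly crucial input column with random numbers and confirm that performance actually degrades.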

So we’re all going to have to sharpen up our game, because this topic is definitely not going away. I know that there’s a blizzard of hype out there right now, but don’t use that as an excuse to dismiss the field or ignore it for now. The whole machine learning/deep learning field is moving along very briskly and producing real results, and there is absolutely no reason to think that this won’t continue. Underestimating it is just as big a mistake as overestimating it: avoid both.

14 comments on “How To Deal With Machine Learning Papers”

  1. John Wayne says:

    A fellow chemist made an interesting point the other day – we have been trying to use computers to help us design drugs for decades. The average medicinal chemist has a pretty good perspective on when and how these tools are useful for getting things done. My take is that the tools are slowly getting better, while the marketing is experiencing a bull market.

  2. MattF says:

    Another questionable validation strategy is to look at a dozen different tests and derive a single score from them. Questions to ask: do these tests measure different things, and what if the goal is not one-dimensional?

  3. Wavefunction says:

    It’s a really helpful guide. I would also strongly recommend Gary Marcus’s outstanding recent critique of deep learning, “Rebooting AI”. Marcus takes the entire field to task and gives plenty of examples of how deep learning systems are brittle and fail to generalize, falling down at simple tasks at which human intelligence excels. Marcus exhorts researchers to do two things: first, to step away from deep learning as the only viable strategy, and second, to actually learn from neuroscientists how the brain works and try to incorporate those insights, instead of just chasing high fits to the data. Unfortunately, deep learning seems to be going the same way that string theory did, with its most vocal proponents proclaiming it to be the best – and only – game in town.

    1. Ryan Huyck says:

      It makes sense why deep learning has become such a big deal. It can do one thing that other AI approaches cannot really do: find discriminating features in the data on its own. For other algorithms, the data scientist must supply the features the model will use to make decisions, and that is both a time-consuming and a difficult task. For instance, what features of the face should you define for an algorithm to achieve reliable facial recognition? With deep learning, you simply give it a big input data set of faces, tell it what the answer should be for each one, and it figures out the rest. This has produced deep learning systems that discover novel features in data missed by humans, and it provides a way to find multidimensional features well beyond what a human brain can wrap itself around.

      Any solution that hopes to supersede deep learning will probably have to match that ease and power of automatic feature discovery.

      You are very right that AI solutions do not accurately model how the brain works. Deep learning and other neural networks boil down to matrix and vector multiplication with non-linear transformation at each step — fundamentally a different thing from the pattern recognition state machine that is the organic brain where even a single neuron can learn and dynamically adapt to give specific output patterns for many different inputs.
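
      That “matrix and vector multiplication with a non-linear transformation” description is meant literally. A two-layer forward pass, with made-up shapes and weights, is nothing more than:

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-layer network's entire forward pass, spelled out.
# Shapes and values are arbitrary, just for illustration.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # layer 1: 4 inputs -> 3 units
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # layer 2: 3 units -> 2 outputs

def relu(z):
    return np.maximum(z, 0.0)       # the non-linear transformation

def forward(x):
    h = relu(x @ W1 + b1)           # matrix multiply, then nonlinearity
    return h @ W2 + b2              # and again

y = forward(rng.normal(size=4))
```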

      1. Wavefunction says:

        Indeed. One of the things Marcus says in his book is that it’s worth supplementing deep learning systems with pre-programmed knowledge about relationships between entities, basic laws of physics, and so on, to make them smarter. He’s not saying the approach is useless; he’s saying it’s overhyped, that its capabilities are massively overrated, and that successes in narrowly defined domains are often wildly overgeneralized (mostly by the media, although researchers often contribute as well).

  4. Jacob says:

    Massive overhype of new technology seems to be the norm in medicine, and ML is another instance of this. One hopes that adversarial examples like these won’t be common in actual practice. Sadly, proprietary diagnostics are pretty common, so I wouldn’t expect AI systems to break that trend. In theory that should be okay, as long as performance is validated and there are QC checks in place.

    It should also be noted when reviewing performance statistics that one should compare against a human reference (or other appropriate standard of care). I’ve seen studies showing radiologists have 80-90% agreement, so an AI with 95% sensitivity is pretty damn good. It can be challenging to train an AI to do better than the quality of the labelling of the data it’s trained on, but it’s possible. If nothing else, an AI that’s as good as the best humans might be a lot better than the average human.

    I’m also reminded of the Anil Potti affair: that didn’t involve deep learning (though it could plausibly be called ML), and it was absolutely insane that such bad science made it to a clinical trial.

    1. Hap says:

      If they get used in anything important, algorithms will be a target for other companies (in business) or for nations. Image recognition is likely going to be a part of nations attempting to monitor citizens and other people and so images that break the algorithms will likely be a part of countersurveillance.

  5. pi boson says:

    Someone not skilled in the art will have a hard time evaluating an ML/DL paper critically. I would encourage them to reach out to those who are for insight, but this too may prove challenging.

    Ask your questions, but don’t expect an answer that completely resolves them. Be diligent and persistent, but don’t simply resist the technology: the scientists and engineers working on it are doing their due diligence, though they also answer to corporate interests that may be more compelling.

  6. enl says:

    (I am not in the ML field recently, but follow moderately closely and have been in CS, at one time AI, since the 1970’s) It was telling to me about a year or two ago when I went to a Shannon Lecture by a leader in the field who was pretty much unable to explain adversarial nets or even explain convolutional nets. I don’t mean unable to explain the techniques, but unable to explain WHY they work and why they fail.

    It is not a failing of the speaker, but a failure in understanding across the field. It is developing, but there is a long, long way to go before life-or-death results will be trustworthy, in particular when there might be an intelligent adversary in the mix.

    I recently had my first experience with ML employment screening at one of my employers (considering a well-known commercial service). I found it to be about as valid as the obsession with blood type that some Japanese employers seem to have. We ran several current employees through as if they were prospective hires. A long-standing, successful, and productive employee was rated do-not-hire. One who is soon to be let go with cause was rated highly. Most of the ratings were pretty much where performance would indicate, but the system rejected a strong employee and, worse, highly rated a major liability (I can’t give details, but a MAJOR liability on an axis the provider nominally assesses; it came to light not long after the test run).

    There is a long way to go here.

  7. CADD says:

    For our field in particular you need to pay very close attention to proper separation of training and test sets and having a test set that’s fit for what you want to predict. Simply removing a few compounds at random at a time from the training set is not good enough to say if the AI is able to make predictions beyond what it knows, or is just able to regurgitate the information it was given. Recognising that a picture is a picture of another cat is all very well, but we want to predict which of a million forms of cat yet to evolve would be a better mouse-catcher, and that is something totally different.
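
    A minimal sketch of why the random-removal split flatters the model, using hypothetical “scaffold” clusters and a pure-memorization nearest-neighbor model (all names and numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical compound set: 10 "scaffolds", 20 close analogs each.
# The property depends only on the scaffold, so memorization looks
# brilliant whenever a same-scaffold analog sits in the training set.
scaffold = np.repeat(np.arange(10), 20)
features = scaffold[:, None] + rng.normal(scale=0.1, size=(200, 4))
prop = scaffold % 2                       # the property to predict

def nn_accuracy(train_idx, test_idx):
    """1-nearest-neighbor 'model': pure memorization of the training set."""
    correct = 0
    for i in test_idx:
        d = np.linalg.norm(features[train_idx] - features[i], axis=1)
        correct += int(prop[train_idx[np.argmin(d)]] == prop[i])
    return correct / len(test_idx)

# Random split: nearly every test compound has same-scaffold analogs
# in the training set, so memorization scores almost perfectly.
idx = rng.permutation(200)
random_split_acc = nn_accuracy(idx[:150], idx[150:])

# Scaffold split: whole scaffolds are held out, so the model must
# actually generalize, and it can't.
train_idx = np.where(scaffold < 8)[0]
test_idx = np.where(scaffold >= 8)[0]
scaffold_split_acc = nn_accuracy(train_idx, test_idx)
```

    Only the held-out-scaffold number says anything about predicting beyond what the model has already seen.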


    I’m not in the medical field but have experience in CS and trading algorithms. A common problem with trading algorithms is over-optimization: the algorithm looks great statistically in testing but is brittle in production. Out-of-sample data is needed to give greater confidence in the algorithm’s results. Good ML development follows the same process and uses out-of-sample data.

    1. Icefox says:

      The ML term for what you observe is “overfitting”, and is very easy to do. It’s also not hard to detect… IF one tries.

  9. loupgarous says:

    Until the way ML models work is thoroughly documented, so that we can be confident that what an ML program “sees” is what is actually there, the field is interesting but not ready to replace physicians.

    It might be useful in the way spell-checking is, as a tool to backstop physicians – radiologists and other physicians who must constantly interpret images, and are specifically trained to do so.

    There, the user should already know what the image connotes, so when the ML tool tells the (say) radiologist that (say) there’s disease progression since the patient’s last scan that she missed, the radiologist has to re-examine her work. At worst, a little of the radiologist’s time is wasted, but potentially an error in interpretation might be drawn to the radiologist’s attention if the ML’s read on an image differs from hers.

    Yes, there’s a need for that. My oncologist has caught radiologists in errors of interpretation (or at least, unintentional errors in wording) in reports made on my nuclear medicine scans on at least two separate occasions. In one case metastatic spread of disease wasn’t reported, and on a more recent case disease progression in existing tumors was described that didn’t really exist.

    So there’s a job ML can do, warts and all. Just as long as radiologists are strictly trained on what “pseudovalidation by computer” means. There have been fatal mistakes when computer displays on radiotherapy machines were misinterpreted, used incorrectly, or simply reported a narrower and less damaging radiation beam than was actually delivered to the patient.
