
Chemical News

Machine Learning: Be Careful What You Ask For

Let the machine learning wars commence! That’s my impression on reading over the situation I’m detailing today, at any rate. This one starts with this paper in Science, a joint effort by the Doyle group at Princeton and Merck, which used ML techniques to try to predict the success of Buchwald-Hartwig coupling reactions. The idea was to look at the robustness of the reaction in the presence of isoxazoles, which can poison the catalytic cycle. A priori, this is the sort of place you’d want to have some help, because that coupling is notorious for its sensitivity to conditions and functional groups, and I blogged about this work at the time. I found the results interesting and worth following up on, although not (of course) the “For God’s sake, just tell me what conditions to use” software that everyone optimizing these reactions dreams of.

But there’s been a response to the paper, from Kangway Chuang and Michael Keiser of UCSF, and they’re opening the hood and kicking the tires:

We applied the classical method of multiple hypotheses to investigate alternative explanations for the observed machine learning model performance. The experiments in this study explore the effect of four reaction parameters— aryl halide, catalyst, base, and additive—with all combinations exhaustively generated through 4608 different reactions. This complete combinatorial layout provides an underlying structure to the data irrespective of any chemical knowledge. Correspondingly, we posited the alternative hypothesis that the machine learning algorithms exploit patterns within the underlying experimental design, instead of learning solely from meaningful chemical features.

That’s a very good point. The ML techniques will exploit whatever data you give them (and whatever arrangements might be inside those numbers), and you have to be sure that you realize what you’re shoveling into the hopper. For example, the UCSF group stripped out the chemical features from each molecule in the paper’s data set and replaced them with random strings of digits. We’re talking dipole moments, NMR shifts, calculated electrostatics, and all the other parameters that went into the original model: all replaced by random numbers drawn from a Gaussian distribution.
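To make concrete what that replacement does on a complete combinatorial design, here’s a toy sketch in plain Python (hypothetical additive names and yields, nothing from the actual dataset): if yield tracks additive identity, then even a single random Gaussian “barcode” per additive lets a model memorize per-group mean yields.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical setup: yield depends only on which additive is present.
true_mean = {"isox_A": 80.0, "isox_B": 55.0, "isox_C": 30.0, "isox_D": 10.0}

# "Adversarial control": replace each additive's chemical descriptors
# with a single random Gaussian number -- a chemically meaningless barcode.
barcode = {a: random.gauss(0, 1) for a in true_mean}

# 100 reactions per additive, with a little experimental noise.
data = [(barcode[a], m + random.gauss(0, 3))
        for a, m in true_mean.items() for _ in range(100)]
random.shuffle(data)
train, test = data[:300], data[300:]

# A "model" that just memorizes the mean training yield per barcode.
groups = defaultdict(list)
for x, y in train:
    groups[x].append(y)
predict = {x: sum(ys) / len(ys) for x, ys in groups.items()}

rmse = (sum((predict[x] - y) ** 2 for x, y in test) / len(test)) ** 0.5
print(f"RMSE with random barcodes: {rmse:.1f}")  # close to the noise floor
```

The barcode carries no chemistry at all; it’s just a relabeling of the additive identity, which is exactly the kind of hidden structure a flexible model can exploit.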

What happens when you turn the machine learning algorithms on that pile of noise? Well. . .the UCSF folks say that you get a model whose predictions are nearly as good as the original. See the plots at right: oh dear. Keiser et al. have been advocating what they call “adversarial controls” for machine learning (see this recent paper in ACS Chem. Bio.), and you can see their point. There are further problems. The original paper had done the analysis while holding out one of the three 1536-well plates and then running the derived ML algorithm across that one as a cross-check, getting a root-mean-square error of 11.3%, versus 7.8%, which at the time I called “not quite as good, but on the right track”. But it turns out that if you run things the two other ways, holding out each of the other two plates in turn, you get even higher RMSE values (17.3% and 22%), which is not encouraging, to put it mildly.
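Why the other two folds matter is easy to show with a toy leave-one-plate-out loop (made-up plate offsets and a trivial mean predictor, not the paper’s data or model): if the plates differ systematically, the error you report depends strongly on which plate you happened to hold out.

```python
import random

random.seed(1)

# Three hypothetical plates with a systematic per-plate offset in yield.
offsets = {"plate1": 0.0, "plate2": 5.0, "plate3": 12.0}
yields = {p: [50.0 + off + random.gauss(0, 4) for _ in range(200)]
          for p, off in offsets.items()}

# Leave-one-plate-out: train a trivial mean predictor on two plates,
# evaluate on the held-out third, and repeat for every choice of plate.
rmses = {}
for held_out in yields:
    train = [y for p, ys in yields.items() if p != held_out for y in ys]
    mean = sum(train) / len(train)
    errs = [(y - mean) ** 2 for y in yields[held_out]]
    rmses[held_out] = (sum(errs) / len(errs)) ** 0.5

for p, r in sorted(rmses.items()):
    print(f"hold out {p}: RMSE {r:.1f}")
```

A single lucky split can flatter any model; reporting all the folds is the cheap insurance being asked for here.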

As mentioned in that earlier blog post, the original paper had found that descriptors of the isoxazole additives seemed to be the most important features of the ML algorithm. But the UCSF response found that this is probably an artifact as well. They tried shuffling all the data (yields versus chemical features), and unfortunately found that the model still tells you that the isoxazole additives are the most important thing:

In 100 trials of this randomized-data test, additive features were nonetheless consistently identified as most important, and consistently occupied 9 of the top 10 by rank (Fig. 2, C and D). These results indicate that apparently high additive feature importances cannot be distinguished from hidden structure within the dataset itself.
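One plausible way sheer dataset structure can inflate a factor’s apparent importance, sketched with made-up level counts (not the actual descriptor sets): a factor with more levels soaks up more variance purely by chance, so even on shuffled, pure-noise yields it ranks as “most important”.

```python
import random
from collections import defaultdict

random.seed(2)

n = 960
few = [random.randrange(3) for _ in range(n)]    # factor with 3 levels (say, bases)
many = [random.randrange(24) for _ in range(n)]  # factor with 24 levels (say, additives)
y = [random.gauss(0, 1) for _ in range(n)]       # shuffled "yields": pure noise

def variance_explained(groups, y):
    # R^2 of predicting each point by its group's mean --
    # a crude stand-in for a feature-importance score.
    by_g = defaultdict(list)
    for g, yi in zip(groups, y):
        by_g[g].append(yi)
    means = {g: sum(v) / len(v) for g, v in by_g.items()}
    ybar = sum(y) / len(y)
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - means[g]) ** 2 for g, yi in zip(groups, y))
    return 1 - ss_res / ss_tot

r2_few = variance_explained(few, y)
r2_many = variance_explained(many, y)
print(f"3-level factor:  R2 = {r2_few:.3f}")
print(f"24-level factor: R2 = {r2_many:.3f}")  # larger, on pure noise
```

On noise, the expected R² of a k-level grouping is roughly (k-1)/n, so the many-level factor wins the importance ranking without containing any information.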

It’s important to note that this response paper does not say that the isoxazole features aren’t important. They may well be. It’s just that the experimental design of the earlier paper isn’t enough to establish that point. Likewise, it’s not saying that you can’t use machine learning and such experimental parameters to work out reaction conditions: you probably can. But unless you carefully curate your input data and run these kinds of invalidating-hypothesis challenges on your model, you can’t be sure whether you’ve done that.

Now, this response paper has been published with a further response from the Doyle group. In it, they take the point that their out-of-sample evaluation wasn’t sufficient to validate their ML model. But it also seems to be talking past the UCSF response a bit:

That a ML algorithm can be built with random barcodes or reagent labels does not mean that these or the chemical descriptors are meaningless, nor does it invalidate any structure-activity relationship present in a chemical descriptor model. Performing a Y-randomization test—a well-accepted control for interrogating the null hypothesis that there is no structure-activity relationship in the data—on the chemical descriptor model results in an average crossvalidated R2 value of –0.01, demonstrating that the model encodes meaningful information.

As mentioned, the UCSF paper is not saying that such descriptors are meaningless at all – just that the published ML model doesn’t seem to be getting anything meaningful out of them in this case. And I think that they might respond that the Y-randomization test is picking up just what they did with their shuffled-data experiments: information that was imbedded in the structure of the data set, but which doesn’t necessarily have anything to do with what the experimenters intended. The rest of the Princeton group’s response, to my eyes, is a reworking of the original data to show what might have been, running the model on a number of variations of the test sets and using alternatives to the random-forest model. But that would seem to make the point that ML models are in fact rather sensitive to such factors (especially when given the test of running them on random or scrambled data as an alternative), and that this was largely missed in the first paper.

I’m not going to stand on the sidelines shouting for a fight, though. This whole episode has (I hope) been instructive, because machine learning is not going away. Nor should it. But since we’re going to use it, we all have to make sure that we’re not kidding ourselves when we do so. The larger our data sets, the better our models – but the larger our data sets, the greater the danger of irrelevant patterns hiding in those numbers, patterns that we never intended to be there but which the ML algorithms will seize on in their relentless way and incorporate into their models. I think that the adversarial tests proposed by the UCSF group make a lot of sense, and that machine-learning results that can’t get past them need to be put back in the oven at the very least. Our biggest challenge, given the current state of the ML field, is to avoid covering the landscape with stuff that’s plausible-sounding but quite possibly irrelevant.

35 comments on “Machine Learning: Be Careful What You Ask For”

  1. JB says:

    “information that was imbedded in the structure of the data set,”
    Are we talking about things like positional effects in the plates? Any high-throughput screener would immediately look for that but chemists might not think about it.
    On the larger topic there’s a great twitter feed, which I can’t find at the moment, of people giving examples of ML algorithms giving unexpected results because the rules weren’t set stringently enough, so the ML exploited whatever loopholes it could to meet the goal criteria. One that I recall is an AI designing physical features of hypothetical “animals” to see if it could make the fastest theoretical body shape. It ended up making really tall skinny animals, and in the test they would win because they’d fall over and have the largest linear velocity as they were doing so.

    1. Hap says:

      Maybe the choices of compounds? What functionality people include in the experiments and what they don’t might suggest what they think is likely to be important – it’s hard to forget all the stuff you’ve spent your life learning and structuring your experiences around.

      The disturbing one was lifeforms that didn’t use energy to birth but needed energy to live so spent their lives sitting around, reproducing, and eating the spawn for food. I’m pretty sure that wasn’t what they meant, unless they were trying to come up with horror movie scripts or a novelization of “A Modest Proposal”.

    2. 42 says:

      Here’s the master list of AI gaming the system.
      One of my favorites is an ML model designed to distinguish between poisonous and edible mushrooms; it simply realized that the data was being fed to it in alternating order.

      1. JAB says:

        As a long time mushroom picker, I have to laugh, except for the danger to which it might expose folks.

      2. metaphysician says:

        Okay, that master list is hilarious. The best thing about it is how most of the “cheating” solutions are actually perfectly valid answers. . . just not the ones we asked for.

        ( Remember, when programming machine learning, always include a test that discourages human extinction. 😉 )

        1. sgcox says:

          No, the best entry is down the list: “In an artificial life simulation where survival required energy but giving birth had no energy cost, one species evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children).”
          Sounds eerily familiar…

  2. Curious Wavefunction says:

    In some sense this debate goes to the heart of what a model is and reminds me of some of the arguments in computational chemistry where someone builds a ten parameter model with topological indices giving a good correlation with biological activity and then someone else points out that a two parameter model with just logP and molecular weight would do as well.

    The fact is that almost no models are unique, and most ML models should not be considered unique or minimalist either. As long as they are intuitive and help investigators vary chemically interpretable parameters for getting better results, I don’t see why they should be discarded. This exchange also reminds me of that hilarious list of AI/ML models that learnt something completely absurd which was posted recently: for instance, when asked to maximize its velocity, a self-driving car just started furiously spinning in place.

    On the other hand, one of the advantages of feature detection is precisely that it can separate out random or unexpected effects from the ones we think are operating, so in one sense this paper and its critique show exactly how the system should be working. I am glad these guys published the comparison.

    1. anon1 says:

      Just a couple of points. ML models do not exist to provide intuitive insight; they are purely empirical. Unlike a multiple regression, where you want the fewest variables to improve your p values, there is no ‘significance’ with an ML model. Instead, with an ML model, you want to avoid overfitting, and to do this you need a training set and a test set. But there is no measure of significance. Arguing that the SMILES strings aren’t needed is, in some ways, contrary to what ML is all about…ML will decide what it cares about. Of course, as a chemist, you always want an intuitive model, and a model where the SMILES strings are of no importance is…not inspiring.

      1. Wavefunction says:

        There are presumably intuitive correlations hidden in data that ML models could ferret out (not because the ML model knows chemistry but simply because there are too many parameters for humans to tease out relevant ones). The problem is the old correlation-causation problem, so while intuitive correlations might exist, it’s still important not to call them causal.

      2. Ian Malone says:

        One active area of research (in imaging rather than chemistry) is algorithms that produce some form of interpretation in their output, e.g. providing saliency maps that show the areas the algorithm used to make its decisions.

        Sadly this (let’s hope it formats right):
        But it also seems to be talking past the UCSF response a bit:

        That a ML algorithm can be built with random barcodes or reagent labels does not mean that these or the chemical descriptors are meaningless, nor does it invalidate any structure-activity relationship present in a chemical descriptor model.

        Is not too unusual, it seems not uncommon in ML to retreat from a grand claim into “well we found a little signal”, while quietly ignoring most of the signal you originally reported was spurious.

  3. John D. says:

    Machine learning, genetic algorithms, and other black box techniques by their very nature are notorious for using any and all “information that was imbedded in the structure of the data set, but which doesn’t necessarily have anything to do with what the experimenters intended.”

    I just read a few days ago a very similar lesson learned when a team from Duke University tried to “evolve a radio”:

    “However, the team noticed some interesting emergent behavior. The algorithm tended to reward amplification behavior from the circuit, leading to many configurations that oscillated poorly, but amplified ambient noise. In the end, the algorithm developed circuit configurations that acted as a radio, picking up and amplifying signals from the surrounding environment, rather than oscillating on their own. The evolutionary algorithm took advantage of the interaction between not only the circuit elements, but effects such as the parasitic capacitance introduced by the switching matrix and appeared to use the PCB circuit traces as an antenna.”

    Translated from electronicese, the algorithm, directed to generate 25 kHz “somehow”, found nearby ambient sources that qualified and promptly amplified them, using ANY parts around it, including its own materials. The circuit almost certainly wouldn’t work if it were put in any other location, but nobody realized you had to add that condition to the algorithm.

    The original paper is at:

  4. Sans sheriff says:

    I think it’s remarkable how the computers learned to randomly screen without any knowledge of chemistry, much like many of my grad school lab mates!!

    1. fajensen says:

      These things don’t learn :).

      Most ML algorithms “simply” map a huge pile of input parameters onto straight lines drawn inside of a “hyperspace”, which is basically a coordinate system with many more dimensions than there are categories of input data. The clever bit about the training process is figuring out the number of dimensions and the line-fitting. Then the trained algorithm performs a weighted linear regression on all of the linear fits to map from hyperspace into the, usually fewer, predicted output values. Sometimes people will use parabolic curves in hyperspace also.

      ML is fun and somewhat amazing but these things are really just statistics with a lot of the horrible maths and model building automated and hidden away.

      Dumb as a doorknob they are, and one would be leery of using them for important things, like self-driving cars, because when one does not understand the underlying model, one will not know what the limitations / assumptions are and where the model will break down until it suddenly does.

      The very first practical instance of machine learning was probably grandad’s Kalman Filter!

      This instance of ML is so old and so well understood that we trust it to fly our planes, run power plants, and even guide our cruise missiles to Moscow.

      Here is a neat description of it:
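A one-dimensional Kalman filter really is just a few lines. Here’s an illustrative sketch (toy constants, estimating a single constant value from noisy measurements; nothing like real avionics code):

```python
import random

random.seed(4)

true_value = 10.0   # the quantity we want to estimate
meas_std = 2.0      # measurement noise (assumed known)
R = meas_std ** 2   # measurement variance

estimate, P = 0.0, 100.0   # initial guess and its (large) uncertainty
for _ in range(50):
    z = true_value + random.gauss(0, meas_std)  # noisy measurement
    K = P / (P + R)                  # Kalman gain: how much to trust z
    estimate += K * (z - estimate)   # blend current estimate with measurement
    P = (1 - K) * P                  # uncertainty shrinks after each update

print(f"estimate after 50 updates: {estimate:.2f}")  # converges toward 10
```

The full filter adds a process model and a prediction step, but the update above is the statistical heart of it: a running, uncertainty-weighted average.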

  5. CheMystery says:

    In any machine learning paper I come across these days, I do a control-F for “y-rand” and if I don’t see it, then I do not believe. Y-randomization is essential. This paper – from 2003 – outlines the issue of proper model validation quite well. The problem is all these neophytes jumping onto the AI-ML gravy train without knowing the well-established pitfalls and limitations. Oh, the importance of knowing the literature….

    1. anon1 says:

      Surely they randomized the samples before splitting them up into the train and test sets? I don’t think the order you put the data into the system matters, at least not with a random forest (maybe with a neural network it would matter more). But the dividing of the train and test sets is very important.

      1. DH says:

        That’s not what CheMystery means by y-randomization (also known as y-scrambling), which is scrambling the response column while leaving the other columns (the x columns) the same. This breaks the connection between y and the x’s. You then build a model from the scrambled data and if the model looks “good” (high r-squared, low RMSE), then that’s a red flag.
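For the curious, y-scrambling is only a few lines of code. A toy sketch (simple linear data and an ordinary least-squares fit, standing in for real descriptors and a real model):

```python
import random

random.seed(3)

# Toy data with a genuine x -> y relationship.
x = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

def ols_r2(x, y):
    # Fit y = slope*x + intercept by least squares; return R^2 of the fit.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

r2_real = ols_r2(x, y)

# y-scrambling: break the x -> y pairing, refit, and compare.
y_scrambled = y[:]
random.shuffle(y_scrambled)
r2_scrambled = ols_r2(x, y_scrambled)

print(f"R2 on real data:      {r2_real:.2f}")       # high
print(f"R2 on scrambled data: {r2_scrambled:.2f}")  # near zero, as it should be
```

If the scrambled R² had stayed high, the “signal” would be coming from somewhere other than the x–y relationship, which is exactly the red flag described above.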

  6. anon1 says:

    Not that anyone cares what I think, but I just read the ‘response’ from UCSF. The Princeton paper is crap. There’s no arguing about that or any point in trying. First, they have very few samples for ML, and strangely, they held one plate out as the test set, and yet somehow never did the same with the other two?? It takes MINUTES of work to do this. And they never held out the chemical structures to make sure that they made any difference (10 minutes of work, maybe)…which is a pretty amazing oversight for a Science paper. Not good…I don’t think the right people reviewed this paper.

    I’m thinking this is a great way to publish fun papers. Wait till organic chemists publish crap like this, and with a few hours of work, prove them wrong…and get a science paper out of it.

    1. Ned Naylor says:

      The real giveaway that this was bogus was when MacMillan didn’t immediately steal it and claim it as a natural collaboration

    2. sprinkles says:

      FYI — the first author is an organic chemist by training.

      It turns out it’s possible to gain proficiency in both fields. 😉

      1. anon1 says:

        I def agree. But in this case the PI has no expertise (I assume), and so they aren’t able to properly check the student’s work.

  7. tlp says:

    C’mon, nobody mentions it, but the original authors deserve some credit for uploading the exact code to GitHub; that’s what made the response paper possible at all.

    1. peter says:

      I would suggest that this should be the minimum expectation: how else can the results be replicated at all?

      1. tlp says:

        And yet it’s not such a common practice. Sometimes because the algorithm or database is proprietary, sometimes because everything is done semi-manually, sometimes because ‘it’s just common sense’.

  8. useless molecule says:

    The worst part is that all that shit is published in Science.

  9. AlloG says:

    Yo Isoxazoles! Don’t need no AI, E or other consonants – Dey work in oxacillin and sulfamethoxazole so I added one to a cyclopentanoperhydrophenanthrene ring once- the hair on my legs Is now 6 inches long- Take that ML!

  10. another guy says:

    It reminds me of an article I read in the 90’s about a neural network system trained to detect the presence of enemy tanks from surveillance photographs. It used a training set based on photographs of various enemy tanks hiding in bushes, driving through mud, from different angles, etc. They also used a control set of photographs which as you would expect were shot in similar surroundings but without the tanks. The neural network had a very high positive predictive value for detecting tanks as long as the photographs were selected from the same batch that included the training samples. Otherwise the system failed miserably and reported no tanks when it was obvious a tank was squarely in the middle of the photograph. It turned out the neural network was associating a higher brightness value for photographs that included tanks and it was solely discriminating on that alone. Simply showing the system a white card would be enough for it to claim a tank was present. I don’t have the reference for this article but if I come across it I will post it.

    1. anon1 says:

      No pictures from the training set should be included in the test set…

      But further, it’s common practice to take pictures included in the training set and then randomly distort them by adding noise, rotating them, changing the brightness, etc. In this way, a training set starting with 10,000 pictures can be easily exploded out to 10-100x more…It’s also common to change all the pixels to 1 or 0, and/or shades of grey, and all sorts of other crazy stuff.

  11. Anonymous says:

    Paywall on the Science paper. Grrrrrrrrrrrrr.

    Derek wrote: “… replaced by random numbers across a Gaussian distribution. What happens when you turn the machine learning algorithms on that pile of noise? Well. . .the UCSF folks say that you get a model whose predictions are nearly as good as the original…”

    That reminds me something Fritz Menger published many years ago.
    “Origin of high predictive capabilities in transition-state modeling.”
    FM Menger, MJ Sherrod, JACS, 1990, 112(22), 8071-8075.
    Abstract: “Rates of 15 acid-catalyzed lactonizations, calculated by transition-state modeling, have been reported to correlate well with experimental rates (r= 0.95). It is now found that the ability of transition-state modeling to predict rates depends less on the accurate portrayal of a transition state than on how closely the associated parameters coincide with a family of parameter sets that happens to give a good correlation. Success of transition-state modeling relies on unrealistic force constants (eg, stretching parameters for partial bonds at 50% of ground-state equivalents) arbitrarily and fortuitously assigned to the transition structures. When these force constants are replaced by ab initio derived parameters, the correlation degenerates into scatter. Rate correlations near unity are achievable with “nonsense” force fields created by our FUDGIT software. Caution is advised in deducing from transition-state modeling any notions …”

    I liked the name of their program “FUDGIT”. FUDGIT = Fudge It = twisting things to make them fit (e.g., adjusting data to get the desired correlation).

    I don’t remember how that controversy played out back then. Anybody?
    Comparison to the current case at hand?

    1. Wavefunction says:

      Fred Menger. He was on my PhD committee. Fred was famous for taking down theoretical results by constructing simple experimental systems. In this particular case he was taking on Houk’s claims of using molecular mechanics to model TSs. As I understand it, the criticism was spot on. Later, when Houk visited our department for a seminar and Fred was in the audience, Houk started by showing a picture of two mud wrestlers and wryly remarked that that was the way it looked when he first visited. All forgiven, forgotten and laughed at by that point.

      1. Anonymous says:

        I don’t know Fred, but I often heard those who know him refer to him as Fritz (his nickname?) in regular conversations and group meetings. But I was always a fan of Menger. He did experiment VS theory as well as theory VS experiment. He was one of those who tried to challenge Breslow’s “negative rate constants” (experimental results) which exposed how difficult it can be to challenge publications. He and Haim had to fight to get their analyses published. (Menger had some other zingers in his career.)

        The “mud wrestlers” resolving their differences, “all forgiven” can work out OK in the academic realm of chemical reactions, but when AI or ML is misapplied to therapeutics or diagnostics it can lead to tragic results.

        Come to think of it, when basic research funding, even on chemical reactions, is diverted by faulty AI / ML guided decision making, it can lead to delays in obtaining improved med chem, Dx and Tx outcomes.

        It can also have effects on careers. Menger and Haim were both tenured and secure in their positions and still encountered considerable trouble trying to rebut the Breslow papers in print. Junior faculty or non-faculty might have been labeled troublemakers and seen their careers or funding come to a halt. If an industrial chemist questions a Management Directive to obey the AI / ML program, he can be marginalized or terminated.

        (Another example: The first disclosures of self-replicating molecules turned out to be unjustified, i.e., incorrect, but I don’t recall any successful publications challenging those claims. I did hear numerous challenges at seminars. Ultimately, acknowledgement of the errors and that the first systems were NOT self-replicating was buried in a single sentence in the back of a review article (Acta Chem Scand, I think). No retractions or other published corrections. OTOH, the students challenging him at seminars were never heard from again.)

  12. Imaging guy says:

    Machine learning is basically multivariable regression analysis, supervised learning, modelling, “systems biology”, or whatever you call it (1). It involves collecting many independent variables/parameters and building a model/algorithm to predict whatever dependent variables you want to predict. These independent variables may be metabolites, plasma proteins, mRNAs (“gene expression signatures”), single nucleotide polymorphisms/SNPs (“polygenic risk scores”/PRS), pixel/voxel intensities from image datasets (e.g. fMRI, histological images), molecular descriptors, answers from psychological questionnaires, or business and economic variables. Dependent variables may be the risk of developing diseases, the prognosis of patients given certain drugs, the areas of the brain involved in carrying out certain tasks, or the chance of success of companies and economies (“quant guys” from hedge funds).

    It is not true that multiple regression relies on p values. The need for training and testing datasets in multiple regression was already stated 30 or 40 years ago. The problem with all these ML/multiple regression methods is overfitting (i.e. the inverse problem). Billions of dollars have been spent on multiple regression/modeling in life science, health care, quantitative structure-activity relationships, psychology, sociology, econometrics, finance, marketing and other areas, and very little of anything useful ever comes out. Machine learning is just the flavor of the month.

    This is what Enrico Fermi told Freeman Dyson: “How many arbitrary parameters did you use for your calculations?” “I thought for a moment about our cut-off procedures and said, ‘Four.’ He said, ‘I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.’” (2)
    1) Big Data and Machine Learning in Health Care
    2) “A meeting with Enrico Fermi”
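The von Neumann quip is easy to demonstrate: give a model as many free parameters as data points and it fits anything perfectly while learning nothing. A sketch with five random points and a five-parameter (degree-four) polynomial via Lagrange interpolation:

```python
import random

random.seed(5)

# Five random "observations" -- pure noise, no underlying law at all.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [random.gauss(0, 1) for _ in xs]

def fitted(x):
    # Degree-4 Lagrange interpolating polynomial: 5 parameters, 5 points.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The fit is exact on the training points...
train_err = max(abs(fitted(x) - y) for x, y in zip(xs, ys))
print(f"max training error: {train_err:.1e}")

# ...but it has learned nothing: values between and beyond the points
# are meaningless extrapolations of memorized noise.
print(f"'prediction' at x = 5: {fitted(5.0):.1f}")
```

A perfect training fit on random data is the elephant; an ML model scoring well on random barcodes is its chemical cousin.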

      1. Wavefunction says:

        When I asked Dyson about that story, he told me that was roughly when he decided that his skillset was much broader and that he probably didn’t have it in him to be a physicist like Fermi. We are all fortunate he made that decision.

        1. loupgarous says:

          Dyson certainly contributed more to Los Alamos, the JASON Group and the nation as a mathematician than he might have as a physicist. He collaborated well with physicists, anyway.

          Most readers of this blog know about Project Orion, which Dyson and Theodore Taylor conceived. However, the JASON report “Tactical Nuclear Weapons in Southeast Asia” he co-authored with Steven Weinberg and two others prevented nuclear weapons from being used as a shortcut out of nasty political messes. Dyson’s analysis of the limited returns from use of tactical nuclear detonations in Southeast Asian terrain was central to that study.

  13. Earl Boebert says:

    Well, as we used to say about AI/ML: “If it works, we call it pattern recognition.”

    As many examples cited here show, the problem is determining whether the pattern the robot has learned to recognize is the pattern you’re interested in or something it thinks looks better.

    As an aside, my old outfit did a very successful tank recognizer. The trick is to look for corners. It was hard-wired, not “taught.”
