Let the machine learning wars commence! That’s my impression on reading over the situation I’m detailing today, at any rate. This one starts with this paper in Science, a joint effort by the Doyle group at Princeton and Merck, which used ML techniques to try to predict the success of Buchwald-Hartwig coupling reactions. The idea was to look at the robustness of the reaction in the presence of isoxazoles, which can poison the catalytic cycle. A priori, this is the sort of place you’d want to have some help, because that coupling is notorious for being finicky, and I blogged about this work at the time. I found the results interesting and worth following up on, although not (of course) the “For God’s sake, just tell me what conditions to use” software that everyone optimizing these reactions dreams of.
But there’s been a response to the paper, from Kangway Chuang and Michael Keiser of UCSF, and they’re opening the hood and kicking the tires:
We applied the classical method of multiple hypotheses to investigate alternative explanations for the observed machine learning model performance. The experiments in this study explore the effect of four reaction parameters—aryl halide, catalyst, base, and additive—with all combinations exhaustively generated through 4608 different reactions. This complete combinatorial layout provides an underlying structure to the data irrespective of any chemical knowledge. Correspondingly, we posited the alternative hypothesis that the machine learning algorithms exploit patterns within the underlying experimental design, instead of learning solely from meaningful chemical features.
That’s a very good point. The ML techniques will exploit whatever data you give them (and whatever arrangements might be inside those numbers), and you have to be sure that you realize what you’re shoveling into the hopper. For example, the UCSF group stripped out the chemical features from each molecule in the paper’s data set and replaced them with random strings of digits. We’re talking dipole moments, NMR shifts, calculated electrostatics, and all the other parameters that went into the original model: all replaced by random numbers drawn from a Gaussian distribution.
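The spirit of that control is easy to sketch. Here’s a minimal, purely illustrative version using scikit-learn on synthetic data (none of the numbers, names, or models here come from either paper): train the same model once on features that genuinely carry signal and once on same-shaped Gaussian noise, and compare. If the noise model scores nearly as well, the model is leaning on structure in the data set rather than on the features themselves.

```python
# Illustrative adversarial control: swap real descriptors for Gaussian
# noise of the same shape and see whether the model still "learns".
# Synthetic data throughout -- nothing here is from the original papers.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_reactions, n_descriptors = 400, 20

# Stand-in for real chemical descriptors (dipoles, NMR shifts, etc.)
X_chem = rng.normal(size=(n_reactions, n_descriptors))
# Synthetic "yields" that genuinely depend on a few of the descriptors
y = X_chem[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=n_reactions)

# The adversarial control: same shape, pure noise
X_noise = rng.normal(size=(n_reactions, n_descriptors))

real_score = cross_val_score(RandomForestRegressor(random_state=0),
                             X_chem, y, cv=3).mean()
noise_score = cross_val_score(RandomForestRegressor(random_state=0),
                              X_noise, y, cv=3).mean()

print(f"cross-validated R2, real descriptors: {real_score:.2f}")
print(f"cross-validated R2, random features:  {noise_score:.2f}")
# Here the features really do carry signal, so the noise model collapses.
# The UCSF result was the troubling opposite: noise did nearly as well.
```

In this toy setup the gap between the two scores is large, which is what you want to see; the worry raised by the UCSF group is precisely the case where that gap nearly vanishes.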
What happens when you turn the machine learning algorithms on that pile of noise? Well . . . the UCSF folks say that you get a model whose predictions are nearly as good as the original. See the plots at right: oh dear. Keiser et al. have been advocating what they call “adversarial controls” for machine learning (see this recent paper in ACS Chem. Bio.), and you can see their point. There are further problems. The original paper had done the analysis while holding out one of the three 1536-well plates and then running the derived ML algorithm across that one as a cross-check, getting a root-mean-square error of 11.3%, versus 7.8%, which at the time I called “not quite as good, but on the right track”. But it turns out that if you run things the two other ways, training on two plates and leaving out a different one, you get even higher RMSE values (17.3% and 22%), which is not encouraging, to put it mildly.
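That rotate-the-held-out-plate evaluation is a standard grouped cross-validation, and it’s worth seeing why the choice of held-out plate matters. A hedged sketch on synthetic data (the plate effects and numbers are invented for illustration, not taken from the papers): give each plate its own systematic offset, hold out each plate in turn, and watch the error swing.

```python
# Illustrative "leave one plate out" evaluation with LeaveOneGroupOut.
# Synthetic data: each plate gets a systematic offset, so the error
# depends heavily on which plate you hold out -- the effect that made
# the original single-plate cross-check look better than it was.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
plates = np.repeat([0, 1, 2], n // 3)          # three plates of reactions
# Yields depend on one feature plus a hidden per-plate offset
y = X[:, 0] + 2.0 * plates + rng.normal(scale=0.1, size=n)

rmses = []
for train, test in LeaveOneGroupOut().split(X, y, groups=plates):
    model = RandomForestRegressor(random_state=0).fit(X[train], y[train])
    rmse = mean_squared_error(y[test], model.predict(X[test])) ** 0.5
    rmses.append(rmse)
    print(f"held-out plate {plates[test][0]}: RMSE = {rmse:.2f}")
# The spread across folds is the warning sign: a single favorable
# train/test split can make the model look much better than it is.
```

The lesson generalizes: reporting only one of the possible held-out splits, when the others are available, leaves the most important question unanswered.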
As mentioned in that earlier blog post, the original paper had found that descriptors of the isoxazole additives seemed to be the most important features of the ML algorithm. But the UCSF response found that this is probably an artifact as well. They tried shuffling all the data (yields versus chemical features), and unfortunately found that the model still tells you that the isoxazole additives are the most important thing:
In 100 trials of this randomized-data test, additive features were nonetheless consistently identified as most important, and consistently occupied 9 of the top 10 by rank (Fig. 2, C and D). These results indicate that apparently high additive feature importances cannot be distinguished from hidden structure within the dataset itself.
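One well-known way this artifact can arise, sketched here on synthetic data (the feature names are illustrative only): impurity-based random-forest importances are biased toward features with many distinct values, so a continuous descriptor can top the importance ranking even when the labels are pure shuffled noise.

```python
# Illustrative shuffled-label control: fit a random forest to labels with
# NO real relationship to the features, then look at the importances.
# The continuous feature (many candidate split points) tends to dominate
# anyway -- one mechanism for spurious "most important feature" results.
# Synthetic data; feature names are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 500
additive_descriptor = rng.normal(size=n)                 # ~n unique values
base_code = rng.integers(0, 3, size=n).astype(float)     # 3 levels
halide_code = rng.integers(0, 2, size=n).astype(float)   # 2 levels
X = np.column_stack([additive_descriptor, base_code, halide_code])

y_shuffled = rng.permutation(rng.normal(size=n))  # labels carry no signal

model = RandomForestRegressor(random_state=0).fit(X, y_shuffled)
for name, imp in zip(["additive", "base", "halide"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")
# "additive" tops the ranking despite there being nothing to learn.
```

This is why the shuffled-data control is so valuable: any feature that still looks “most important” after the labels are randomized is telling you about the structure of the inputs, not about chemistry.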
It’s important to note that this response paper does not say that the isoxazole features aren’t important. They may well be. It’s just that the experimental design of the earlier paper isn’t enough to establish that point. Likewise, it’s not saying that you can’t use machine learning and such experimental parameters to work out reaction conditions: you probably can. But unless you carefully curate your input data and run these kinds of invalidating-hypothesis challenges on your model, you can’t be sure whether you have.
Now, this response paper has been published with a further response from the Doyle group. In it, they accept the point that their out-of-sample evaluation wasn’t sufficient to validate their ML model. But it also seems to be talking past the UCSF response a bit:
That a ML algorithm can be built with random barcodes or reagent labels does not mean that these or the chemical descriptors are meaningless, nor does it invalidate any structure-activity relationship present in a chemical descriptor model. Performing a Y-randomization test—a well-accepted control for interrogating the null hypothesis that there is no structure-activity relationship in the data—on the chemical descriptor model results in an average crossvalidated R2 value of –0.01, demonstrating that the model encodes meaningful information.
As mentioned, the UCSF paper is not saying that such descriptors are meaningless at all – just that the published ML model doesn’t seem to be getting anything meaningful out of them in this case. And I think that they might respond that the Y-randomization test is picking up just what they did with their shuffled-data experiments: information that was embedded in the structure of the data set, but which doesn’t necessarily have anything to do with what the experimenters intended. The rest of the Princeton group’s response, to my eyes, is a reworking of the original data to show what might have been, running the model on a number of variations of the test sets and using alternatives to the random-forest model. But that would seem to make the point that ML models are in fact rather sensitive to such factors (especially when given the test of running them on random or scrambled data as an alternative), and that this was largely missed in the first paper.
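For readers who haven’t run one, the Y-randomization test the Doyle group cites is simple to sketch. Here’s a hedged, synthetic-data version (nothing here reproduces either group’s actual model or numbers): cross-validate once on the real labels and once on shuffled labels, and expect the shuffled score to sit near zero.

```python
# Illustrative Y-randomization: shuffle the response variable, re-run
# cross-validation, and check that the score collapses to ~0. A passing
# test says the model isn't memorizing noise; as the UCSF response notes,
# it does NOT rule out the model exploiting non-chemical structure in the
# experimental design. Synthetic data throughout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = X[:, 0] + rng.normal(scale=0.2, size=300)

r2_real = cross_val_score(RandomForestRegressor(random_state=0),
                          X, y, cv=3).mean()
r2_shuffled = cross_val_score(RandomForestRegressor(random_state=0),
                              X, rng.permutation(y), cv=3).mean()
print(f"cross-validated R2, real labels:     {r2_real:.2f}")
print(f"cross-validated R2, shuffled labels: {r2_shuffled:.2f}")
```

The key caveat is in the comments: Y-randomization and the UCSF-style feature-randomization probe different failure modes, which is why passing one does not answer the other.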
I’m not going to stand on the sidelines shouting for a fight, though. This whole episode has (I hope) been instructive, because machine learning is not going away. Nor should it. But since we’re going to use it, we all have to make sure that we’re not kidding ourselves when we do so. The larger our data sets, the better our models – but the larger our data sets, the greater the danger of irrelevant patterns lurking in those numbers, patterns that we never intended to be there and that the ML algorithms will seize on in their relentless way and incorporate into their models. I think that the adversarial tests proposed by the UCSF group make a lot of sense, and that machine-learning results that can’t get past them need to be put back in the oven at the very least. Our biggest challenge, given the current state of the ML field, is to avoid covering the landscape with stuff that’s plausible-sounding but quite possibly irrelevant.