Chemical News

Calculate Your Way Out of Bad Yields

I wrote a little while back about a brute-force approach to finding metal-catalyzed coupling conditions. These reactions have a lot of variables in them and can be notoriously finicky about what combination of these will actually give decent amounts of product. At the same time, it appears that almost any given metal-catalyzed coupling reaction is capable of being optimized, if you care enough. So this is a good field for both miniaturized reaction searching (as in the link above) and for machine learning (as in this new paper).

It’s going after what’s often an even more challenging case, the C-N Buchwald-Hartwig coupling. That one gets used a lot in medicinal chemistry (we like nitrogen atoms), but it’s also a well-known beast when it comes to variation in yield. You can honestly never be quite sure that it’s going to work well when you set it up on a new system for the first time, because there are a lot of conditions (solvent, catalyst, additives) to pick from, and often very little guidance about which region of reaction space is going to be most fruitful. This work, a Princeton/Merck collaboration, is an attempt to calculate a way out of the woods.

It’s important to note that the two approaches mentioned in the first paragraph are not an either/or case. Far from it. If you’re going to do machine learning, you’re going to need a lot of reliable data (positive and negative) to feed into the model, and how better to generate that than in an automated, high-throughput setup? That’s especially true as you start adding in more reaction variables: every one of those can increase the complexity of the machine learning model in an exponential fashion, and stuff gets out of control quickly under those conditions. So the question is, can ML provide something useful here, and can it outperform the simpler-to-implement regression models?

You’d definitely want robotic help up front: the data set was the yields from 4,608 coupling reactions (15 aryl or heteroaryl halides, 4 ligands, 3 bases, and 23 isoxazole additives). About 30% of the reactions gave no product at all – valuable fodder for the model – and the rest were spread in a wide range from “pretty darn good” to “truly crappy”. Just what you need. At this point, you need to figure out just what you’re going to tell the system about these reactions (a nontrivial decision). Too few (or wrongly chosen) parameters, and you’ll never get a useful model. Too many, and you risk an overfitted model that looks good from a distance but is both hard to implement and still not robust. In this case, the Spartan program is used to provide atomic, molecular, and vibrational descriptors of the substrate(s), catalysts, bases, and additives. So what goes in are electrostatics at various atoms, vibrational modes (frequencies and intensities), dipole moments, electronegativity, surface areas, and so on.

At this point, the team took 70% of the data as a training set to see if they could predict the outcomes in the other 30%, and ran the data through a whole set of possibilities: linear and polynomial regression for starters, then k-nearest neighbors, Bayes-generalized linear models, layered neural networks, random-forest techniques, etc. Only when they got to those last two did they start to see real predictive value, and random-forest seems to have been the clear winner. Its predictions, even using only 5% of the data, were better than linear-regression models tuned up on the whole 70%. But even so, there were limits. “Activity cliffs” are always a problem in these reactions, and those are hard to pick up (even with 4,608 reactions to train on!). The wider the range of chemical/reaction space you’re trying to cover with your model, the better the chance you’re going to miss these things (and the better the chance you’re going to end up with an overfitted model).
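For readers who want to see the shape of that comparison, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data are synthetic (one made-up descriptor that controls an activity cliff, two that are pure noise), the forest is a tiny hand-rolled one rather than whatever software the authors used, and the regression baseline is a single-descriptor least-squares line. It trains on 70% of the toy reactions and compares root-mean-square errors on the held-out 30%.

```python
import random
from statistics import fmean

def make_reactions(n, rng):
    """Synthetic stand-in for yield data (NOT from the paper).
    Descriptor 0 drives an activity cliff; descriptors 1 and 2 are noise."""
    rows = []
    for _ in range(n):
        x = [rng.random(), rng.random(), rng.random()]
        y = (80.0 if x[0] < 0.5 else 10.0) + rng.gauss(0, 3)
        rows.append((x, min(100.0, max(0.0, y))))  # keep yields in 0..100
    return rows

def rmse(observed, predicted):
    """Root-mean-square error, in yield percentage points."""
    return fmean((o - p) ** 2 for o, p in zip(observed, predicted)) ** 0.5

def sse(ys):
    """Sum of squared deviations from the mean: the split criterion."""
    mean = fmean(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow_tree(rows, rng, depth=0, max_depth=4, min_leaf=5):
    """Regression tree: a leaf is a float, a split is a 4-tuple."""
    ys = [y for _, y in rows]
    if depth == max_depth or len(rows) < 2 * min_leaf:
        return fmean(ys)
    best = None
    # A random forest considers a random subset of descriptors at each split.
    for f in rng.sample(range(len(rows[0][0])), 2):
        for t in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= t]
            right = [r for r in rows if r[0][f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return fmean(ys)
    _, f, t, left, right = best
    return (f, t,
            grow_tree(left, rng, depth + 1, max_depth, min_leaf),
            grow_tree(right, rng, depth + 1, max_depth, min_leaf))

def predict_tree(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def grow_forest(rows, rng, n_trees=15):
    """Each tree is grown on a bootstrap resample of the training rows."""
    return [grow_tree([rng.choice(rows) for _ in rows], rng)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    return fmean(predict_tree(tree, x) for tree in forest)

def fit_line(rows, feature=0):
    """Ordinary least squares on a single descriptor (the linear baseline)."""
    xs = [x[feature] for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

rng = random.Random(0)
reactions = make_reactions(200, rng)
rng.shuffle(reactions)
cut = len(reactions) * 7 // 10          # 70% train, 30% test
train, test = reactions[:cut], reactions[cut:]

forest = grow_forest(train, rng)
slope, intercept = fit_line(train)

observed = [y for _, y in test]
forest_rmse = rmse(observed, [predict_forest(forest, x) for x, _ in test])
linear_rmse = rmse(observed, [slope * x[0] + intercept for x, _ in test])
print(f"random forest RMSE: {forest_rmse:.1f}  linear RMSE: {linear_rmse:.1f}")
```

The toy cliff is the whole point: a straight line has to smear the sudden drop in yield across the full descriptor range, while a tree can put a split right at the edge. Real reaction data are far messier than this, so take it as a picture of the idea and nothing more.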

One way out of that is to restrict your question to a narrower region of potential reactions (and thus get more fine detail into the model). But that, naturally, makes it less useful: after all, the dream is a Buchwald-Hartwig Box, where you walk up, draw the structures of your two reactants (no matter what they might be), and it pauses for a moment and spits out a set of good reaction conditions. That’s. . .still a little ways off. But I do have to say that this current effort beats both human intuition (insofar as anyone has any about these reactions) and other attempts to calculate likely starting points. So it is a real improvement, and the finding that random-forest techniques outperformed everything else is worth building on, too.

The group did try putting in some additives that were not in the training set at all, to see how the model would handle them. And the answer is “not bad”: the root-mean-square error in predicted yields for the original test set versus reference set was about 8%, and the RMSE for the new additive set was about 11%. So it was not quite as good, but on the right track. (By comparison, the single-layer neural network model had about a 10% error on the original set, while the other methods came in at 15 to 16%.)
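That out-of-sample test is a different animal from a random 70/30 split: every reaction containing a held-out additive has to stay out of training, so the model never sees those additives at all. A minimal sketch of such a split follows, with the caveat that the function name and the labels are hypothetical and no yields are included, since none are being invented here.

```python
import random

def split_by_additive(reactions, n_held_out, seed=0):
    """Hold out every reaction that uses one of n_held_out randomly chosen additives."""
    additives = sorted({r["additive"] for r in reactions})
    held_out = set(random.Random(seed).sample(additives, n_held_out))
    train = [r for r in reactions if r["additive"] not in held_out]
    test = [r for r in reactions if r["additive"] in held_out]
    return train, test, held_out

# Hypothetical labels only: 8 additives crossed with 3 halides.
reactions = [
    {"additive": f"additive_{a}", "halide": f"halide_{h}"}
    for a in range(1, 9)
    for h in range(1, 4)
]

train, test, held_out = split_by_additive(reactions, n_held_out=2)
seen_in_training = {r["additive"] for r in train}
print(sorted(held_out), len(train), len(test))
```

A random row-wise split would let the same additive show up on both sides, which tends to flatter the model. Splitting by additive is the harsher and more honest question.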

The results suggested a look at the final random-forest procedure to see if there was anything that could be learned about the mechanism (a tall order, from what I know about the field). Some of the most important descriptors were things like electrostatic charge on the isoxazole additive’s atoms, its calculated LUMO, and its carbon NMR shifts. That suggests that the additive’s electrophilic behavior was important, but those parameters taken by themselves weren’t enough to produce any kind of useful model on their own. Still, a check of isoxazoles from both ends of the scale suggested that the more electrophilic ones were capable of undergoing a side reaction with the Pd catalyst that reduced yields, which may well be what the model was picking up on.

Overall, I’d say that this paper does show (as have some others) that machine-learning techniques are going to help us out eventually in predicting reaction conditions, even if that day has not quite arrived. When you get down to it, the parameters going into this model are not particularly complicated, so it may well be a good sign that it’s worked as well as it has. You’d have to think that larger data sets and new inputs are only going to make these things perform better (after plenty of human effort and care, of course), and that’s leaving aside the general possibility of improved algorithms. I think we’ll get there, but we’re not there now.

25 comments on “Calculate Your Way Out of Bad Yields”

  1. QuantumChemist says:

    I have not read the paper yet, but if they have only done calculations on the reactants, then there is certainly room for improvement. There are a number of research groups working on more or less automating the search for transition states and also potentially mapping all thermodynamically and kinetically accessible products and complexes. I would bet that feeding a bunch of reasonably accurate barrier heights into the machine learning models would make a big difference. Of course it would also increase computational cost, but core-hours and teraflops are only going to get cheaper.

    1. ClassicalChemist says:

      Isn’t that a bit of a catch-22?

      If you are able to calculate accurate barrier heights then you wouldn’t need to do further work to predict rates.

      If you are not able to calculate accurate barrier heights then you do need help predicting rates but you only have bad barrier data to give to the computer.

    2. Design Monkey says:

      A little problem with QuantumChemist’s proposal is that currently those transition state energy calculations mostly tend to be in the category “not even wrong”. Completely bonkers, just produced with a high-science approach and teraflops.

      1. QuantumChemist says:

        There are a couple reasons why someone might get awful barrier heights:

        1. Got the wrong TS geometry, i.e. not the barrier you were looking for

        2. The chosen electronic structure method is not adequate for the system or the basis set is too small: I consider this mostly a matter of getting people away from obsolete methods and bad habits (like blindly using B3LYP with a small basis set), and also getting people to use software beside Gau$$ian. In many areas it lacks state of the art methods, the density fitting feature is a bad joke, no DF-MP2, no DF-CCSD, no F12 methods, no localized orbital/linear scaling methods, piss poor multithreading for CCSD, no SAPT, lack of recent DFT functionals, and so on. Using the right tool for the job is important. Barrier heights are not as much of a problem with state of the art tools, as they once were. Still not perfect, but the last 5 years have been a steady flow of improvements. Transition metals are still a pain in the arse, but we are getting there.

        3. Solvent effects: I am going to concede this, current solvent models do suck, and the calculation of accurate solvation energies is one of the big unsolved problems.

  2. At least the first author wasn’t bumped to second author this time by another group muscling in on his project, like in the first nickel-photoredox Science paper

    1. anon says:

      It’s your fault if you let the other person muscle you around.

  3. John Wayne says:

    This is a nice bit of work, but I have to admit I’m surprised that it made it into Science.

    1. real says:

      I didn’t want to comment on my surprise, as it’s a bit cheesy to knock other papers for getting into the top journals; but really, Science? And is this the ultimate of synthetic chemistry today, generating a random forest ML classification that an undergrad CS major could easily do?

      1. tlp says:

        In a way, ML hinted at a mechanistic hypothesis for why yield drops for certain additives, i.e. a new(ish) type of oxidative addition of Pd(0) to the N-O bond. One should have quite some experience in organic chemistry to recognize such a hint. So one could argue that it’s a first example of joint human-ML insight into the mechanism of a reaction.

        1. tlp says:

          OK, after more careful reading I agree, it is surprising that the paper appeared in Science (by the way SI is as interesting to read as the paper itself).
          The random forest regression the authors have built uses only one (ONE) aniline as a substrate, p-toluidine. So the whole paper is pretty much about optimizing Buchwald-Hartwig coupling with p-toluidine, with no comment on whether the results will hold if you take, say, m-toluidine or p-methoxyaniline. Given the variability of yields even in their not-so-diverse library of substrates, the general applicability of the found regression is dubious.
          Probably reviewers were ML/AI people, not organic chemists.
          Anyway I have to say kudos to authors for sharing the code!

          1. Me says:

            So all the arguments about over-optimising the chemistry are also applicable to the computational method then? I was wondering that based on Derek’s report but don’t have access to the journal.
            Sometimes being right is annoying.

      2. Anonymous says:

        What I really want to see (either in something like this, or the retrosynthesis things Derek has posted) is somebody “closing the loop” on the feedback. As the network is learning, it drives one of these automated chemistry setups and incorporates that data, hunting around for optimizations.

        1. anon says:

          A lot of smart people are actively working on it. It’ll be here soon.

        2. Vectrex says:

          Those kinds of systems are called “active learning” systems, and they’re quite a hot topic in research right now – with how data-hungry these models are, it would be really helpful if they could actively identify what the most effective kinds of additional data would be. I haven’t seen anything particularly compelling on that front yet, and certainly not for chemistry experiments in particular, but it’s definitely a thing that people recognize would be useful but haven’t yet figured out how to do effectively.

    2. anon says:

      Machine learning ? Check.
      AI ? Check.
      Superstars ? Check.

      1. tlp says:

        who are superstars here?

        1. Mister B. says:

          Zinedine Zidane ! (It refers to a song you can find on youtube, called “Zinedine Zidane Superstar” ) 😉

      2. anon says:

        Since when did doyle become a superstar? Is there any series of work that defines doyle’s career so far?

  4. real says:

    It’s always random forest.

  5. AQR says:

    The article is behind a paywall so I haven’t read it, but according to Derek’s summary, they were able to use ML to predict the yields of the 30% of the reaction set that wasn’t used as the training set. That’s interesting, but I would think that the real test of an exercise like this would be to use these data to identify conditions for a reaction in the training set that would provide a higher yield than any of the examples on which the machine was trained.

  6. Anon says:

    What is the erratum about? I am lost.

  7. Text Selector says:

    Why can’t I select text on any of the Pipeline blog pages. I’ve tried with two different browsers. Is this deliberate? I can’t even select text when writing in this reply box.

    Note: my reason for wanting to select text from a post is to look up words I don’t know. A reason for selecting text when writing is to check spelling.

    1. tlp says:

      Selection is actually working but you can’t see a selection box 🙂 looks like a funny bug

      1. anon says:

        Not surprised since Science Mag doesn’t even have an SSL (Firefox). They should pay a 10 year old to get their license. Embarrassing.

  8. Me gold says:

    Hey, why do anything other than a brute force ( churn an burn ) approach when you gotta couple of lab rats workin for u that dont have much ahead of em except a few years o’ lab work and an early grave. Small price to pay $$$$$$, for me life! Me a mentor?, lol, that is 50 years out of date.
