
Chemical News

The Machines Rise a Bit More

Here’s a new paper in Nature on computer-generated synthesis of natural products. More formally, you’d call it retrosynthesis, since the thought process in organic chemistry tends to work backwards when you have a particular target that you’re trying to make: “OK, this part could be made from something like this. . .and that, you could make by condensing two pieces sort of like these. . .”

You work back to more accessible starting materials, based on the transformations that you know about or can picture being feasible. For more simple molecules, it’s the kind of thing you ask sophomore students to do on one of the higher-point-value questions at the end of the test. But for larger and more complex ones, it can be a great deal of work. The “decision tree” about what pathways to use to build up a tricky structure can be huge, and the relative advantages of each are not always obvious. Some of the things that we chemists do value, though, are brevity (fewer steps are almost always better), high yields in each step (because even 90% yield per step will whittle your material away surprisingly quickly), use of readily available/inexpensive reagents and materials (especially important to industrial chemists, obviously), reproducibility (no one goes in and tries to reproduce a 35-step total synthesis for the heck of it, but if you ran step 26 fourteen times and it only gave you a decent conversion once, that’s bad form), and what we all call “elegance”.
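To put a number on that yield point, here is a quick back-of-the-envelope sketch. It assumes a purely linear sequence in which every step runs at the same yield (real routes vary step to step, and convergent routes do better); the function name is just for illustration.

```python
def overall_yield(per_step: float, steps: int) -> float:
    """Overall yield of a linear sequence where every step has the same yield."""
    return per_step ** steps


# How much material survives a linear route run at 90% per step?
for steps in (5, 10, 20, 35):
    print(f"{steps:2d} steps at 90%: {overall_yield(0.90, steps):.1%} overall")
```

Under those assumptions, ten steps at 90% leave you with about 35% of your material, and a 35-step route leaves about 2.5%.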

That last one is hard to define, but one aspect of it is the “I didn’t see that coming” factor, where parts of a complex molecule are assembled more quickly and surely than you would have pictured, through some nonobvious path. Compression is another aspect, and that can mean something as simple as doing more than one step in the same flask without having to do the whole work-up-the-reaction-isolate-and-purify-the-product thing every single time. Past that, it’s getting the most out of every chemical step, something like “This hydroxy group is what makes the nucleophile come in from this direction in this step, and in the next one it’s also going to be what sets off the rearrangement that fixes two more chiral centers at the same time”. Getting things like that to work ain’t easy. Actually just seeing them in the first place isn’t, either. If you know organic chemistry well enough, the reaction to a great synthesis really does have an aesthetic component (as overplayed as that part is in the writing of some practitioners over the years). It’s like watching a talented writer land the last lines of a poem, managing to finish its point while also suggesting new meanings that apply to the earlier stanzas, and simultaneously making a subtle but deliberate unexpected reference to some other work of art that illuminates the whole poem from a different direction still.

So from all that, you can tell that coming up with such synthetic proposals is (for many organic chemists) something that they see as a unique and central part of their discipline. And that’s why attempts to automate it grate on some people’s nerves. Imagine how Petrarch might have reacted to a brochure for a Sonnet-o-Matic. If your horse is high enough, you can regard such software as an offense against Nature and against your honor, and even if you’re a bit closer to the ground it might occur to you that part of your recognized job is now under some form of assault.

I’ve written about such programs a few times over the years. There are several commercial software packages out there already, and a number of competing approaches. I think it’s fair to say that none of them have taken over the world, but it’s also fair to say that they’re being taken seriously. There are particular advantages to a computational approach to retrosynthesis that are harder to realize with one’s brain: avoiding a thicket of process patents, for example, or not even considering using reagents and starting materials that are not on some particular list. And that’s not even mentioning the difficulties of keeping up with the literature itself – as I’m fond of saying, a retrosynthesis program can learn new chemistry every evening, while most of us can’t keep up that pace.

This latest work is from the people who came up with the Chematica program, and it has some interesting insights into what happened when the authors tried to push the software into more challenging natural product chemical space. There were, they report, many instances where the program knew all the individual steps that could go into such a synthetic route, but still failed to find one. They had to make a number of modifications to make it work more strategically – for example, being willing to admit a step that made things temporarily more complex for a bigger synthetic payoff a step or two later, or looking for opportunities to accomplish more than one chemical step at a time. Not all of these are at that strategic level, I should add – one extension was directly having the software recognize about a hundred useful and well-precedented functional-group interconversions and sequences that had shown up in human-driven total synthesis over the years.
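The “temporarily more complex for a bigger payoff later” idea can be illustrated with a toy search. To be clear, this is not Chematica’s algorithm, which the post does not describe: the molecule labels, the complexity scores, and the table of allowed moves are all invented for illustration. The sketch just compares a planner that only looks one step back with one that looks two steps back.

```python
# Toy model only: labels, scores, and moves are made up for illustration.
COMPLEXITY = {"target": 10, "A": 9, "B": 12, "C": 4, "start": 1}

# Hypothetical retrosynthetic moves: molecule -> precursors one step back.
MOVES = {"target": ["A", "B"], "A": [], "B": ["C"], "C": ["start"], "start": []}


def best_reachable(mol: str, depth: int) -> int:
    """Lowest complexity reachable from `mol` in at most `depth` steps."""
    best = COMPLEXITY[mol]
    if depth > 0:
        for nxt in MOVES[mol]:
            best = min(best, best_reachable(nxt, depth - 1))
    return best


def plan(mol: str, lookahead: int) -> list[str]:
    """Walk backwards from `mol`, each time taking the move whose best outcome
    within `lookahead` steps is lowest; stop when no move promises a gain."""
    route = [mol]
    while MOVES[mol]:
        score, nxt = min((best_reachable(n, lookahead - 1), n) for n in MOVES[mol])
        if score >= COMPLEXITY[mol]:
            break  # nothing within the horizon is simpler than where we are
        mol = nxt
        route.append(mol)
    return route


print(plan("target", lookahead=1))  # one-step view: takes A and gets stuck
print(plan("target", lookahead=2))  # two-step view: accepts B to reach C
```

With a one-step horizon the planner takes the slightly simpler intermediate A and dead-ends there. With a two-step horizon it accepts the more complex intermediate B, because B opens the way to C and then to the simple starting material.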

The analogies to chess playing come to mind whether you want them to or not: you’re getting the software to handle the idea of sacrificing a piece to gain better position or better prospects in the end game, or loading it with particular lines of play that have proven useful and forcing it to take those into account. And these analogies work so well because organic synthesis is itself a game, played on a very large board with very complex rules, and with the added complexity of new pieces and moves being discovered from time to time. That’s why we like it so much.

In the end, the authors assembled a set of natural product syntheses from the literature, all of which we can presume to have been worked out by various sorts of humans. And they mixed these with a set (done on broadly similar sorts of molecules) generated by the souped-up version of Chematica. They sent these around to a number of experienced chemists and did a sort of Turing test, asking people if they could tell which routes were from the humans and which were from the machines. You can try the same experiment – start from the beginning of the Supplementary Information file and make your calls. The answer key comes after the syntheses are laid out.

What I can tell you is that no, it appears that the experts couldn’t really tell the difference. And that says something about Chematica, but I fear that it also says something about organic synthesis. None of these syntheses, neither the known human ones nor the machine-generated ones, are going to trigger a major aesthetic experience for anyone. The natural product structures are fair ones, but they’re generally not complicated enough for something really elegant or surprising to occur. That makes their synthesis, even when performed by humans, a bit more of a mechanical exercise than it would have been at one time. We know a lot more chemistry than we did in R. B. Woodward’s day, and what he often had to invent, we now use as a matter of course. Whole classes of ring systems and functional group combinations have been worked on to the point that we have pretty reasonable ideas of how you might produce them. And while those aren’t always going to work in practice, enough of them will (and there are enough alternatives for the steps that don’t) that the resulting synthesis falls into the “Yeah, sure, why not?” category, rather than “Whoa, look at that”.

No software is yet producing “Whoa, look at that” syntheses. But let’s be honest: most humans aren’t, either. The upper reaches of organic synthesis can still produce such things – and the upper stratum of organic chemists can still produce new and startling routes even to less complex molecules. But seeing machine-generated synthesis coming along in its present form just serves to point out that it’s not so much that the machines are encroaching onto human territory as that some of the human work has gradually become more mechanical.

22 comments on “The Machines Rise a Bit More”

  1. Luysii says:

    Retrosynthetic analysis and Moliere

    Chapter 30 of Clayden, Greeves et al. concerns retrosynthetic analysis, but what in the world does this have to do with Moliere? Well, he wrote a play called Le Bourgeois Gentilhomme back in 1670 and played the central character, Monsieur Jourdain, himself in its first performance (before King Louis XIV). Jean Baptiste Lully, one of the best composers of the time (Bach hadn’t been born yet) wrote the score for it and also played a role. M. Jourdain was a wealthy bourgeois gentilhomme who wanted to act like those thought better (e.g. the nobility) at the time. So he hired various teachers to teach him fencing, dancing and philosophy. The assembled notables watching the play thought it was a riot (did not the French invent the term nouveau riche?). He was taught the difference between poetry and prose, and was astounded to find that he’d been speaking prose all his life.

    So it is with retrosynthetic analysis and yours truly. Back in ’60 – ’62 we studied the great syntheses that had been done to learn from the masters (notably Woodward). Watching him correctly place 5 asymmetric centers in a 6 membered ring of reserpine was truly inspiring. Even though Corey had just joined the department, the terms retrosynthetic analysis and synthon were nowhere to be found. The term is almost a tautology, no-one would think of synthesizing something by making an even more complicated molecule and then breaking it down to the target. So synthetic chemists have been speaking retrosynthetic analysis from day 1 without knowing it.

    1. Oudeis says:

      Lully! Been a while since I heard that name. I’m sure you’re familiar with the story of his death, but for the broader audience…

      Lully didn’t conduct with a little baton. He used a big staff. One day, carried away by the music he was conducting, he smashed his own foot with the ridiculous thing. Really smashed it–gangrene set in. The doctors told him he needed it amputated, but no! If it were amputated, he could not dance!

      Neither America nor the twenty-first century has a monopoly on egotistical idiots who refuse sound medical advice. And isn’t that about the Frenchest way to die you ever heard?

      1. Luysii says:

        Wow. Never heard of it. Thanks. Musicians are an unusual lot.

      2. Dams says:

        First of.. Lully was not French but Italian..
        Second of.. what is a “French way to die” is that not a bit prejudice not to say racis…??

        1. Sulphonamide says:

          Depends on whether it is considered racist to look on (grudgingly) admiringly at the perceived foibles of a nation? To imply that a nation places art and beauty over (one’s own) life itself is (potentially) rather a compliment…and (hopefully) can still be taken thus.

          Loved the comments on the poet, a breath of fresh Covid-free air to receive our daily dose of such analogies (and the tale about Lully).

        2. Oudeis says:

          Saying Lully was not French is a bit like saying Alexander Hamilton wasn’t an American.

          As for prejudice and racism, I’ll refer you to Sulphonamide’s comment.

  2. Interesting; I wonder if this software takes into account enzyme-aided reactions; Much biosynthesis of interesting natural products involves lots of unique enzymes.

  3. Jonathan says:

    Great post, Derek. This is an interesting development in automated retrosynthesis.

    It’s also the opposite of machine learning, right? The authors described an extensive process of expert-coded transformations, considering everything from functional group incompatibilities to literature precedent for diastereoselectivity.

    I wonder if there will be a point of diminishing returns for adding new synthetic rules. It doesn’t seem to have slowed them down yet though.

    1. John Hasler says:

      Sounds more like an expert system than machine learning. Surely the marketing literature calls it AI, though.

    2. Just another chemist says:

      It is supervised learning (https://en.wikipedia.org/wiki/Supervised_learning) which is a subset of machine learning. The rules are analogous to the training data in a simple ML model

  4. Per-Ola Norrby says:

    I’ve seen software produce “Whoa, look at that”. When the central benzene is formed by a Diels-Alder-fragmentation instead of serial EAS, it can have that effect. It’s a good example of a sequence that can be hard to see if you didn’t train specifically on it, but it’s in the literature, so an unbiased machine finds it.

  5. Chris Phoenix says:

    Typo, I think: “new and starting routes” -> startling

  6. James Reader says:

    It is incredible how low Nature has gone publishing such a thinly veiled piece of advertisement for a commercial software. Scientifically, the work is not reproducible, as the description of the software is deliberately obscure and it is completely unclear how many structures were tried and how selectively the results were reported. Big shame.

  7. Matt says:

    My biggest fear with increasingly automated software like this is that it will prevent humans from learning and practising the simple things which will in turn prevent them from gaining the hidden insights that are needed to later make unexpected discoveries.

    1. John Wayne says:

      Agreed

    2. Joe Q. says:

      Yes, and this is true across pretty much every field where “operator skill” is required.

  8. Uncle Al says:

    Managerial chemistry makes useful molecules. After a few turns of that crank, we are CO2 reducers, cellulose pyrolyzers, and hydrogen liberators in a brave new world. LCAO is a heuristic, DFT is the real thing. In which domain do you dream?

    Making stuff is good! That said, architecture is about building tents. Frank Gehry is a warning.

  9. Kaleberg says:

    This kind of software reminds me of the early assaults on symbolic integration. Integration was long taught as a combination of algebraic methods, parametric, by parts, by series, by the book. There was a whole section of the CRC handbook full of components and techniques one could apply. Early software just copied human techniques, but over several decades more advanced approaches were discovered that worked in more general cases. Retrosynthesis is massively more complex, and seems to be entering the early stages of automation. Who knows how much more sophisticated the software will be in 50 years?

    Macsyma did a pretty good job with integration by 1970. In 2020, its successors do a much better job, more quickly and more robustly. Calculus students still learn how to integrate by hand, though working mathematicians and engineers are more than glad to turn the job over to Mathematica or the like. For those at the cutting edge, trying to determine the metrics of some numerical space, Mathematica is sort of a pocket calculator able to grind out algebraic solutions to critical subproblems but incapable of dealing with the subtleties of the axiom of choice for example.

    I’m really glad to see this work, and I’m really glad that the synthesis software is no longer producing high chemical comedy, or, at least, producing comedy more as the exception than the rule. My only real concern is that so little of this work is being done in the public domain as opposed to by private parties. If Macsyma had been private, profit oriented and closed source, it would have strangled mathematical software in its crib, and we would all have paid for it.

    1. Derek Lowe says:

      Thanks for that comment! I had wondered on and off how such programs handled these tasks and the interface between symbolic mathematics and computation in general (exact solutions, as opposed to grinding out finer and finer approximations with numerical methods). You prompted me to go learn a bit about the Risch algorithm (which I have placed in my mental picture of things as a sort of super-Laplace Swiss army knife), holonomic functions (man, Taylor series just keep on coming through), and so on.

      1. Unintelligible says:

        Nice to hear such kind words about good ol’ Macsyma, and Derek’s appreciation for the classic symbolic math toolbox. But nowadays there’s also some novel machine-learning approaches being tried: see for example “Deep Learning for Symbolic Mathematics” at https://arxiv.org/pdf/1912.01412.pdf
        These programs seem a bit magical: they treat integration as a language translation task, succeeding with unreasonable effectiveness.

        1. Derek Lowe says:

          That’s an unnerving preprint indeed!

  10. Anil Lele says:

    Scientists make digital breakthrough in chemistry that could revolutionize the drug industry-
    https://www.cnbc.com/2020/10/24/how-a-digital-breakthrough-could-revolutionize-drug-industry.html?__source=sharebar|linkedin&par=sharebar
