While working on my talk on robotics and artificial intelligence, I was sent a link to this paper (PDF) which I thought was worth a look. It’s from a team at the University of Münster, and what they’re trying to do is look for patterns in the entire corpus of synthetic reactions. They’ve used data from Reaxys to create a database of over eight million binary chemical reactions, representing 14.4 million different species, and they’re trying to use this to predict new reactions.
This high-throughput reaction prediction (HTRP) idea has been tried several times before, and the paper provides a useful review of these attempts. Broadly, they have used some sort of rule-based expert-system framework, an attempt to work out a logic or grammar of chemical reactivity to extrapolate with, or outright machine-learning techniques. (These don't exclude each other, and there have been approaches that combine them). Success has been mixed – well, very mixed, to be honest. I think it's fair to say that the times it's seemed to work, it's been on areas of limited applicability, and when these ideas have been applied in a more general fashion, they haven't necessarily worked very well.
This new paper, though, does seem to go further, but it also shows the limits of the current state of the art. As a number of other theoretical approaches to organic chemistry have done, they're shifting the world of organic chemistry over into a graph-theory problem. From this perspective, discovering new reactions becomes a search for new nodes and edges in the graph. Studies of other sorts of networks give some techniques for this, but I'm going to long-jump over the math involved (the paper has more than you want). But the underlying reasoning is easy to understand – let's say you have two molecules, A and B, and both of them are known to react with a partner C to give some new product in each case. The program notes all these similarities, and searches for cases where compound A reacts with yet another molecule type D. Since A and B have been classed as having a similar reactivity pattern (they both reacted usefully with C), the program hypothesizes that B will do something with D as well, even though no examples like this may be in the database (there are similarity cutoffs in the computational steps to keep too many things from coming out the other end of the process). The system can also predict what it thinks the new product of this reaction would be, based on the A-C reaction products.
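The A/B/C/D reasoning above can be sketched in a few lines of code. To be clear, this is a toy illustration and not the paper's actual method, which runs over millions of reactions with network-theoretic similarity measures and cutoffs; the compound names, the reaction table, and the `min_shared` threshold here are all made up for the example.

```python
# Toy sketch of similarity-based reaction prediction: if X and Y share
# known reaction partners, hypothesize that X will also react with Y's
# other partners. (Illustrative only; not the paper's algorithm.)

# Hypothetical known binary reactions: (reactant_1, reactant_2) -> product
known_reactions = {
    ("A", "C"): "A-C adduct",
    ("B", "C"): "B-C adduct",
    ("A", "D"): "A-D adduct",
}

def partners(compound, reactions):
    """Set of compounds known to react with `compound`."""
    out = set()
    for (x, y) in reactions:
        if x == compound:
            out.add(y)
        elif y == compound:
            out.add(x)
    return out

def predict_new_reactions(reactions, min_shared=1):
    """Hypothesize an X + Z reaction whenever X shares at least
    `min_shared` known partners with some Y that already reacts with Z."""
    compounds = {c for pair in reactions for c in pair}
    predictions = set()
    for x in compounds:
        for y in compounds:
            if x == y:
                continue
            # similarity proxy: overlap in known reaction partners
            shared = partners(x, reactions) & partners(y, reactions)
            if len(shared) < min_shared:
                continue
            # Y's other partners become candidate partners for X
            for z in partners(y, reactions) - partners(x, reactions) - {x}:
                pair = tuple(sorted((x, z)))
                if pair not in reactions and pair[::-1] not in reactions:
                    predictions.add(pair)
    return predictions

# A and B share partner C, and A also reacts with D, so the program
# hypothesizes a B + D reaction:
print(predict_new_reactions(known_reactions))  # {('B', 'D')}
```

In the real system, "similarity" is of course far more sophisticated than raw partner overlap, and the predicted product would be generated by analogy to the known A-D product; but the hypothesize-by-shared-neighbors structure is the same.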
They tested this approach by using everything up to 2013 to predict reactions, and the set of reactions published since then to check their results. Looking at 180,000 randomly selected reactions, the predictions were correct about 67% of the time. For the most part, the failed predictions are still reasonable – the examples given include structures where there are two sites of reactivity, and the predicted product came from reaction at only one of the two sites, while the actual reaction goes all the way and hits both. As you might expect, the accuracy improves linearly with the size of the graph/data set. Existing expert-system approaches, when applied to the same data set, give lower accuracy.
Of course, predicting some of these reactions is not much of a feat. To isolate the more interesting predictions, they removed all the ones that could have been predicted with a rule-based method from the pre-2013 literature, and looked at what was left over. Looking at 13,000 randomly selected reactions from that set, the accuracy was about 35%. You can look at that figure two ways – organic chemists looking to find new reactions may see this as “two-thirds wrong”, and therefore more likely than not to waste their time if they were to use it as a guide for discovery. Others might be more impressed, because even that 35% figure is one that (by definition) can’t be reached by existing rules-based methods.
I can think of several ways to use these predictions, though. One (as above) would be to look through them for what could be useful transformations that haven't been described, with an eye to bringing them into practice. You could also keep in mind this line from the paper: “Our model does fail to predict reactions of a molecule in positions where it has not been activated before, or where a mechanistic consideration is necessary to predict the outcome”, and use this as a guide in the other direction, to see what sorts of reactions would break into completely new space were they to be discovered. I wonder if the starting graph could be compared with the one containing the predicted nodes and edges, to see if there are areas of organic synthesis that are unusually rich in predictable (but unknown) transformations, or unusually poor in them. Depending on your point of view, either area might have something to recommend it.