The phrase “automatic chemical design” will generally get my attention, especially when it’s applied to drug-like molecules. And that’s one of the key parts of this paper, from researchers at Harvard, Toronto, and Cambridge. From what I can see, they’re trying to come up with a new technique for generating potential new chemical structures – for example, to do virtual screening on. Most of the paper discusses their methods for encoding (and decoding) numerical representations of chemical structures in a way that lets new ones be generated quickly. This process is also, in theory, taking some molecular properties into account:
To enable molecular design, the chemical structures encoded in the continuous representation of the autoencoder need to be correlated to the target properties that need to be optimized. Therefore, based on the autoencoder results, we train a third model to predict molecular properties based on the latent representation of a molecule. To propose promising new candidate molecules, latent vectors of encoded molecules are moved in the direction most likely to improve the desired attribute and these new candidate vectors are decoded.
This is not a crazy thing to do – in fact, many people have described ways to do it, and this paper is presenting itself as an improvement on those. At the same time, I’m not sure if coming up with possible new structures is a rate-limiting step for anyone, although I’d gladly be corrected on that one. This work reminds me a bit of the efforts by the Reymond group to determine all the possible molecular arrangements below a certain number of heavy atoms (such as the GDB-13 set). This paper is not a from-the-ground-up effort like that work, but is rather an attempt to say “Given this particular molecule (or this set of molecules), how can we use these structures as seeds in order to computationally explore chemical space?”
Update: several readers have pointed out that I’m missing a key point of the paper – the handling of what are essentially discrete variables of chemical structure as continuous ones, which makes it computationally possible to slide along various axes towards desired properties. So I wanted to mention that here, to make sure it doesn’t get lost.
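To make that "sliding" idea concrete, here is a toy sketch of my own, not the paper's model. It assumes a two-dimensional latent space and a made-up linear property predictor (the names `predict_property` and `slide`, and all the numbers, are hypothetical). The point is only that once a structure is a vector of real numbers, you can nudge it along the gradient of a predicted property.

```python
# Toy illustration only: a 2-D "latent space" and an invented linear
# property predictor. Nothing here reproduces the paper's actual networks.

def predict_property(z, weights, bias):
    """Hypothetical predictor: a linear function of the latent vector z."""
    return sum(w * x for w, x in zip(weights, z)) + bias


def slide(z, weights, step_size=0.1, n_steps=5):
    """Move z along the predictor's gradient and record every point visited.

    For a linear predictor the gradient with respect to z is simply the
    weight vector, so each step raises the prediction by the same amount.
    """
    path = [list(z)]
    for _ in range(n_steps):
        z = [x + step_size * w for x, w in zip(z, weights)]
        path.append(list(z))
    return path


weights, bias = [1.0, -2.0], 0.5   # assumed values, chosen for illustration
start = [0.0, 0.0]                 # stand-in for an encoded seed molecule
path = slide(start, weights)
scores = [predict_property(z, weights, bias) for z in path]

for z, s in zip(path, scores):
    print([round(x, 2) for x in z], round(s, 2))
```

What the sketch leaves out is the decoder. In the real method each point along the path has to be turned back into a structure, and nothing in the arithmetic above guarantees that the result is a molecule a chemist would want to make.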
There’s some unfortunate coverage of this paper at Technology Review, summed up by the headline (“Software Dreams Up New Molecules in Quest For Wonder Drugs”). The article suffers from its handling of the point I just raised, since it claims that “Pharmaceutical research tends to rely on software that exhaustively crawls through giant pools of candidate molecules using rules written by chemists, and simulations that try to identify or predict useful structures”, which isn’t quite the case. Equally unfortunate is what happens when you start looking at the output of this process, since it becomes clear that none of the authors are from the departments of chemistry at any of the mentioned institutions. Figure 4 in the paper shows about 65 compound variations starting from aspirin, but by my own count, about 14 of them are either not at all drug-like (acid chlorides, anhydrides, cyclopentadienes, aziridines) or chemically implausible (a fluorocyclobutadiene, a diaminocycloheptatriene). Running structures such as these through any virtual screening effort is a waste of time, and what worries me is that aspirin is about as innocuous a starting point as you could imagine, and this method still produced about 20% craziness.
Looking through some of the other outputs in the appendix of the paper does not inspire more confidence. To begin with, there are a lot of three- and four-membered rings in there, many of them very unlikely indeed, and that reminds me of the graph-theory work mentioned above. The GDB data sets, it’s often forgotten, had to first be purged of well over 99% of their generated chemical frameworks because they were weird concatenations of small rings, and I think that the current program is exhibiting tendencies toward small-ring-forming of its own. To be sure, it also seems to like cycloheptatrienes and cyclooctatetraenes a lot more than most people do, so the ring-generating problems may be deeper.
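Flagging the small rings, at least, is not hard to do automatically. Below is a minimal sketch in plain Python, assuming simple SMILES strings (organic-subset atoms, bracket atoms, branches, and ring-closure labels only). The function names are my own, and the SMILES in the usage lines are my own renderings of the compound classes mentioned, not structures copied from the paper. A real workflow would use a proper cheminformatics toolkit rather than a hand-rolled parser like this one.

```python
from collections import deque


def parse_smiles(smiles):
    """Build an atom adjacency list and the list of ring-closure bonds.

    Deliberately minimal: bond symbols, charges and stereo marks are ignored.
    """
    adj = []          # adj[i] is the set of atoms bonded to atom i
    closures = []     # bonds created by ring-closure labels
    open_rings = {}   # ring label -> atom index where the ring was opened
    stack = []        # atoms at which a branch was opened
    prev = None
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch == ".":
            prev = None                      # start of a disconnected fragment
        elif ch.isdigit() or ch == "%":
            if ch == "%":                    # two-digit label such as %12
                label = smiles[i + 1:i + 3]
                i += 2
            else:
                label = ch
            if label in open_rings:
                other = open_rings.pop(label)
                adj[prev].add(other)
                adj[other].add(prev)
                closures.append((prev, other))
            else:
                open_rings[label] = prev
        elif ch == "[" or ch.isalpha():
            if ch == "[":
                i = smiles.index("]", i)     # treat a bracket atom as one atom
            elif smiles[i:i + 2] in ("Cl", "Br"):
                i += 1
            adj.append(set())
            new = len(adj) - 1
            if prev is not None:
                adj[prev].add(new)
                adj[new].add(prev)
            prev = new
        i += 1
    return adj, closures


def ring_size(adj, a, b):
    """Size of the smallest ring containing bond a-b (BFS that avoids the bond)."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if u == a and v == b:
                continue                     # do not use the closure bond itself
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == b:
                    return dist[v] + 1       # atoms in ring = path edges + 1
                queue.append(v)
    return None


def small_rings(smiles, max_size=4):
    """Sorted sizes of rings with at most max_size atoms (empty list if none)."""
    adj, closures = parse_smiles(smiles)
    sizes = [ring_size(adj, a, b) for a, b in closures]
    return sorted(s for s in sizes if s is not None and s <= max_size)


print(small_rings("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(small_rings("FC1=CC=C1"))              # a fluorocyclobutadiene
print(small_rings("C1CN1"))                  # aziridine
```

Every ring must contain at least one ring-closure bond, so checking the smallest ring through each closure is enough to catch any three- or four-membered ring. It says nothing about whether the ring is a reasonable one, of course, which is the part that takes a chemist.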
Other problems are immediately apparent if these are supposed to be (even vaguely) drug-like molecules. As with the aspirin-derived structures, there are a lot of reactive and/or unstable molecules in the outputs. The program seems to have no problem with enamines, hemiaminals, enol ethers, and several other labile groups, but there are even bigger problems. I have reproduced at right some (but by no means all) of the problematic structures that appear. It is not going too far, I think, to characterize software that proposes such compounds as defective. No organic chemist could have looked at these without raising the alarm – this stuff is not, by many standards, publishable at all. When the authors do show this work to someone in the field, it will not go well. In fact, this blog post is an example of just such an encounter, and no, it isn’t going well. We’re getting into deepfryer cow cow territory here.
As mentioned, I don’t find the idea behind this paper to be intrinsically wrong or anything of the sort. But what the authors are trying to do is harder than it looks. The molecules that are generated by this method have too many examples like the ones shown to be taken seriously, and on the other end of the spectrum, there are also too many that might be described as “Have you tried adding an isopentyl ether to it?”. No one’s in great need of that set, either. I do still encourage the authors to proceed with this work, if that’s what they’re doing, but I also strongly urge them to consult some actual chemists along the way. And don’t talk to anyone at Technology Review for a while, either.