Generating Crazy Structures

I feel like a dose of good ol’ organic chemistry this morning, and a (virtual) meeting I attended yesterday gave me a paper to talk about that delivers some. I was speaking with a local group of modelers and computational chemists (BAGIM), and MIT’s Connor Coley was there presenting some of his group’s work.

I had missed this preprint from him and Wenhao Gao, which I’d like to highlight today. It’s on the broad subject of using computational methods to come up with plausible small molecules against a given biological target, which is one of those things that everyone would like to do and no one can quite be sure that they’re able to manage yet. There are plenty of ways to approach the problem, and these range from “indistinguishable from trivial” all the way to “indistinguishable from magic”.

As you’d guess, the magic-powers end of the scale tends to be both computationally intensive and wildly variable in its results, and this is the place to say what I often say about such things. That is, I’m a short-term pessimist and a long-term optimist. I see no reasons why we shouldn’t be able – eventually – to compute our way to fresh drug structures. But at the same time, I think that we have to get a lot better at it than we are now. This is a really hard problem, as has been abundantly demonstrated for several decades now, and it’s hard for a lot of hard reasons. Let’s list a few!

First, the compounds themselves can range over a potential chemical space that is far too large for the human mind to comprehend, and far too large to calculate through in any direct fashion. I mean that in the most literal sense – our solar system has not been around long enough to furnish us the time to evaluate all those possible structures. And that’s just the structures as received. In real life, they can often adopt all sorts of three-dimensional conformations, and their interactions with protein targets can make them choose some that you wouldn’t have expected. That’s because those interactions are many and various themselves, including but not limited to hydrogen bonds, interactions between pi-electron clouds, van der Waals interactions, halogen bonds, and many more. Our ability to model these computationally is. . .evolving, let’s say. And the proteins that such compounds bind to have a range of possible pockets and surfaces that are just as wide-ranging as the small-molecule structures are, and that includes their conformational flexibility as well. Add in the effects of key water molecules around these sites, which can be contacted or displaced to completely change the thermodynamics of ligand binding, and you have plenty to deal with.
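
To put rough numbers on the "not enough time" claim, here is a back-of-the-envelope sketch. Both inputs are assumptions and do not come from the preprint: the size of drug-like chemical space is taken as 10^60 (an often-quoted order-of-magnitude estimate), and the evaluation rate of a billion structures per second is a hypothetical throughput picked for illustration.

```python
# Back-of-the-envelope: how much of chemical space could have been enumerated?
# ASSUMPTIONS (not from the post or the preprint): roughly 1e60 drug-like
# molecules, and a hypothetical rate of one billion structures per second.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SOLAR_SYSTEM_AGE_YEARS = 4.6e9   # approximate age of the solar system
CHEMICAL_SPACE_SIZE = 1e60       # assumed order-of-magnitude estimate
STRUCTURES_PER_SECOND = 1e9      # hypothetical throughput


def fraction_covered(age_years, rate_per_second, space_size):
    """Fraction of the space evaluated if the run had started at time zero."""
    evaluated = age_years * SECONDS_PER_YEAR * rate_per_second
    return evaluated / space_size


frac = fraction_covered(
    SOLAR_SYSTEM_AGE_YEARS, STRUCTURES_PER_SECOND, CHEMICAL_SPACE_SIZE
)
print(f"Fraction of space covered: {frac:.1e}")
```

Under those assumptions the covered fraction comes out on the order of 10^-34, and changing either input by several orders of magnitude does not rescue the brute-force approach.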

Speaking of those thermodynamics, surprisingly small energy differences will determine if a given molecule binds to its target or not – differences that are very likely within the error bars of many of your calculations. Those calculations, in turn, will tell you little or nothing about the on-rates and off-rates of that binding process, and that can be yet another huge factor in whether or not you have an interesting compound. Dealing with such molecular dynamics is another kettle of hostile, uncooperative fish and can call on computing resources that will strain the largest systems you can possibly access without needing to make contact with intelligent space aliens.
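
How small is "surprisingly small"? The standard relation between binding free energy and the dissociation constant, ΔG = RT ln(Kd), gives a feel for it. The sketch below is only that textbook relation evaluated at 298.15 K; the specific error sizes fed into it (0.5, 1 and 2 kcal/mol) are illustrative picks, not figures from the preprint.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def fold_change_in_kd(delta_delta_g_kcal, temperature_k=298.15):
    """Fold change in Kd produced by a given shift in binding free energy."""
    return math.exp(delta_delta_g_kcal / (R_KCAL * temperature_k))


def energy_for_fold_change(fold, temperature_k=298.15):
    """Free-energy shift (kcal/mol) corresponding to a fold change in Kd."""
    return R_KCAL * temperature_k * math.log(fold)


# Illustrative error sizes, chosen for this example only.
for error_kcal in (0.5, 1.0, 2.0):
    fold = fold_change_in_kd(error_kcal)
    print(f"{error_kcal:.1f} kcal/mol error -> about {fold:.1f}-fold in Kd")

print(f"10-fold in Kd = {energy_for_fold_change(10):.2f} kcal/mol")
```

A tenfold difference in affinity corresponds to only about 1.4 kcal/mol at room temperature, so a calculation that is off by a kilocalorie or two can easily confuse a good binder with a mediocre one.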

But here’s where I flip around and say again that none of these look like insoluble problems. There seem to be no mathematical or chemical reasons why we can’t get a better handle on them, given sufficient time, money, effort, and ingenuity. It breaks no laws of physics to be able to calculate the binding of a compound to a protein target well enough to determine its affinity as accurately as an assay in a few microliters of solvent can, and to do that over and over with a varied list of compound structures isn’t impossible, either. Just really hard.

One hard limit you do bang into though, as mentioned, is the sheer size of chemical space. We need algorithmic help in navigating that, and that’s the subject of the preprint linked above. It’s talking about “generative” methods to navigate through such space, an idea I last spoke about here. The concept is pretty easy to describe but pretty tricky to implement: you allow the software to build out molecules from some starting point, accepting and rejecting particular lines of inquiry so as to focus the computational resources on the richer seams of structural space. That is, you’re generating new structures in a deliberate fashion instead of just (impossibly) running through every variation possible.
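
The accept-and-reject loop described above can be sketched with a toy genetic algorithm. This is not the code of any of the algorithms in the preprint: the "molecules" are just strings over a made-up alphabet, and the scorer, the hidden target and every parameter value are hypothetical stand-ins chosen so the example runs on its own.

```python
import random

ALPHABET = "CNOSFP"          # toy "building blocks", not real chemistry
HIDDEN_TARGET = "CCNCCOCC"   # hypothetical optimum known only to the scorer


def black_box_score(candidate):
    """Toy fitness: number of positions that match the hidden target."""
    return sum(a == b for a, b in zip(candidate, HIDDEN_TARGET))


def mutate(candidate, rng):
    """Change one randomly chosen position to a random building block."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def generate(start, generations=200, population_size=20, keep=5, seed=0):
    """Grow a population from `start`, keeping only top scorers each round."""
    rng = random.Random(seed)
    population = [start]
    for _ in range(generations):
        children = [
            mutate(rng.choice(population), rng) for _ in range(population_size)
        ]
        # Parents stay in the pool, so the best score can never get worse.
        pool = set(population) | set(children)
        # Sort by score (descending), then alphabetically so ties are stable.
        ranked = sorted(pool, key=lambda c: (-black_box_score(c), c))
        population = ranked[:keep]
    return population[0]


best = generate("SSSSSSSS")
print(best, black_box_score(best))
```

The loop only ever sees the number that the scorer hands back. Nothing in it knows whether a candidate is chemically sensible, which is the weakness the rest of this post is about.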

These models need some breakfast in order to get out of bed, computationally speaking. One way is to train them on a set of molecules in order to generate some hypotheses about the chemical spaces in which they reside. If you don’t have any of those or don’t want to introduce possible biases that way, you can try letting some software loose to try to score newly generated molecules against a “black-box” function, but you are then running a larger risk of going on a computational snipe hunt. All sorts of approaches to both of these methods have been proposed.

But what sorts of molecules then get generated? In 2016, I took a look at a compound-generating effort that led to some pretty funky structures, and what I take away from the current preprint is that the funky beat continues. The new paper evaluates three generative algorithms (SMILES LSTM, SMILES GA, and Graph GA) with fourteen multi-property objective functions (which is how you try to convert a chemical structure into a numerical fitness score). These are the techniques used in the “Guacamol” paper on de novo compound design, proposed there as a benchmark for such efforts. Gao and Coley show, for example, that if you use the SMILES GA algorithm with the MPO that’s derived from the structure of the known EGFR drug osimertinib, the structure shown below is a top-scoring suggestion.
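
For readers who have not met a multi-property objective before, here is a minimal sketch of the general idea: squash each property into a 0-to-1 score and combine the scores, here with a geometric mean. The property names, targets and widths are hypothetical, and the real GuacaMol objectives are defined in that paper and are more involved than this.

```python
import math


def gaussian_modifier(value, target, sigma):
    """Map a raw property value to a score in [0, 1] that peaks at `target`."""
    return math.exp(-0.5 * ((value - target) / sigma) ** 2)


def geometric_mean(scores):
    """Combine component scores; any zero component zeroes the total."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def mpo_score(props):
    """Toy multi-property objective over precomputed (hypothetical) properties."""
    components = [
        props["similarity_to_reference"],  # assumed to be in [0, 1] already
        gaussian_modifier(props["logp"], target=3.0, sigma=1.0),
        gaussian_modifier(props["tpsa"], target=90.0, sigma=20.0),
    ]
    return geometric_mean(components)


# Made-up property values for two imaginary candidates.
on_target = {"similarity_to_reference": 1.0, "logp": 3.0, "tpsa": 90.0}
off_target = {"similarity_to_reference": 0.6, "logp": 5.5, "tpsa": 40.0}
print(mpo_score(on_target), mpo_score(off_target))
```

A score built this way contains no term for whether anyone could make the compound, so an optimizer is free to chase the number into structures that no chemist would draw.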

That’s clearly not a very useful suggestion. Admittedly, it’s the worst example in the paper, but that computational combination did not generate an actual make-able compound in any of its top 100 molecules. Looking through the other examples, all sorts of odd P, N, O and S functionality appear, ranging from inadvisable down to nutball – endoperoxides, hypochlorites, aromatic phosphorus rings, weirdo phosphines, four-membered rings with three nitrogen atoms in them, and so on.

So this preprint is attempting to impose some order on the situation with various suggestions for synthesizability calculations, with particular attention to MIT’s previously reported ASKCOS system in a paper I blogged about here. There are other ways of doing this, which are compared in the paper – for example, you could send the structures to a retrosynthesis program and see what it makes of them, but that’s a time and computing-cycle intensive way to go. Anyway, the take-home, at least for me, is that sometimes you get a pretty reasonable list of compounds and sometimes you don’t. The systems that train up on a set of known compounds tend to give you more realistic lists, whereas the “goal-oriented” ones that use a black-box function and bootstrap their way along tend to give you the wilder ones.

When you get one of those crazy lists, applying the synthesizability screen throws out so many compounds that the ones that are left may not even be anything that the algorithms found very interesting in the first place. On the other hand, sometimes you get lists right off the bat whose synthesizability scores are actually better than the one you get (68%) by running the ChEMBL database past the ASKCOS software – and remember, ChEMBL is made up exclusively of compounds that are actually known to have been made and tested. You could also apply some synthesizability rules up front, so weirdo molecules aren’t even generated in the first place, but how heavy a hand to use there is another question.
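
The after-the-fact screen can be sketched in a few lines. The `is_synthesizable` predicate is a hypothetical stand-in for whatever check is used (a retrosynthesis run, a learned score with a cutoff, a set of substructure rules), and the candidate names, fitness values and verdicts are invented for the example.

```python
def synthesizable_fraction(candidates, is_synthesizable):
    """Share of candidates that pass the synthesizability check."""
    if not candidates:
        return 0.0
    passed = sum(1 for name, _ in candidates if is_synthesizable(name))
    return passed / len(candidates)


def filter_then_rank(candidates, is_synthesizable, top_k=3):
    """Drop candidates that fail the check, then return the best survivors."""
    survivors = [(n, s) for n, s in candidates if is_synthesizable(n)]
    return sorted(survivors, key=lambda pair: pair[1], reverse=True)[:top_k]


# Invented data: (candidate name, fitness score from the generative model).
scored = [
    ("cand_A", 0.95),
    ("cand_B", 0.93),
    ("cand_C", 0.90),
    ("cand_D", 0.41),
    ("cand_E", 0.38),
]
makeable = {"cand_D", "cand_E"}  # hypothetical verdicts from the screen


def toy_check(name):
    """Stand-in for a real synthesizability assessment."""
    return name in makeable


print(synthesizable_fraction(scored, toy_check))
print(filter_then_rank(scored, toy_check))
```

In this made-up list the three highest-scoring candidates all fail the screen, so the survivors are ones the optimizer scored poorly, which is the situation described above.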

So we don’t have a perfect answer yet, naturally. But this preprint is a good start on the problem, and shows how large that problem is under some conditions. And it’s funny to see this play out, because long-time drug discoverers will have deja vu feelings about having had just these sorts of interactions with their modeling colleagues over the years. “The model says that you should do X.” “Well, we can’t do X – what else does the model say?” “Nothing that looks anything as good as if you could do X.” And so on. Good to know that this well-established iteration has now been automated!

41 comments on “Generating Crazy Structures”

  1. Ezra Abrams says:

    great post, thanks

    “halogen bonds”

    learn something new every day

  2. Rue baby says:

    You can make that compound really easily using mellatio-redoxx

    1. Magrinho says:

      Mellatio-redox?! Whoa, let’s keep it clean here!

  3. Barry says:

    I’ve been peripherally involved in selection of “diverse small-molecule” compounds to purchase, as well as to synthesize. Any algorithm that’s looking for diversity is going to violate a med. chemist’s sensibilities. I had to manually discard things like organomercurials that scored well for “diversity”

    1. Some idiot says:


      Well, I am sure that it would have scored _quite_ well on diversity…! I guess the problem was the diversity/insanity tradeoff…


  4. myma says:

    People have been working on this for a long time, at least 20-25+ years in my own personal experience. Actually it is a bit like whack-a-mole. It usually pops up when a new physicist devises some brand new scoring algorithm (faster!), tries to explore chemical space (bigger!), and then is surprised when the structures look ugly as hell and unsynthesizable to chemists. So then they put in rules and subroutines to make them not-quite-so-ugly, and reiterate. Then some years later they get bored or run out of VC money or something, and the next round of fresh physicists pops up with Their Own faster bigger algorithm, and hey wow what ugly molecules!

  5. confused says:

    Can you even do three bonds to iodine?

      1. confused says:

        Huh. Wow. Thanks!

        Hmm well I guess if chlorine can do three or five bonds in ClF3 or ClF5 it’s not really that surprising… Just looks really weird in an organic structure.

    1. anon says:

      iodine tri-, penta- and heptafluorides exist: You can even use IF5 as a solvent (now, doesn’t that sound like fun!)

      1. confused says:

        Heptafluoride… Now that’s kind of amazing. Thanks.

    2. AVS-600 says:

      There are reagents that are commonly used that involve three+ bonds to iodine, but, you have to have fairly electronegative substituents for those to be stable. Assuming you’d be able to do that with an I-H bond is… optimistic.

    3. David Edwards says:

      Iodine atoms will allow one, three, five or seven bonds in the right circumstances.

      An example of a compound with five bonds to an iodine atom is Dess-Martin Periodinane:

      This reagent, though somewhat expensive, is routinely used in certain oxidation reactions.

  6. Tim says:

    “four-membered rings with three nitrogen atoms in them” Please forgive my ignorance, but are there such things, and if so, have they already been covered in Things I Won’t Work With?

    1. Derek Lowe says:

      None that I know about! I would assume that they are deeply unhappy compounds even should they be shown to exist.

      1. Aleksei Besogonov says:

        Hey, azidoazide azide exists (well, usually briefly)!

        1. Marko says:

          That’s a five-member with four Ns , but I take your point. And , according to youtube , you can make it in your garage and live to tell the tale.

    2. David E. Young, MD says:

      You probably know this, and I am certainly not the most qualified person here to explain this, but some of the compounds with a lot of nitrogen atoms close together… explode. Think TNT (trinitrotoluene). There are a couple of new targeted anti-cancer drugs in pill form that have a large number of nitrogen atoms in them (they might not be explosive, but then again…) and I ask the drug rep… “what happens if I toss a bottle of these pills in a furnace?” and they give me a funny look. I am not certain that it is a funny question, but of course the pharma rep would have no idea if the compound is explosive.

      1. Barry says:

        While explosive drugs are uncommon (nitroglycerin/glyceryl trinitrate (GTN) excepted), once a drug candidate has passed from Research to Development, someone is responsible for assessing each synthetic intermediate for such hazard. Sandoz had an aryl diazonium salt intermediate explode in the 70s, launching the (1.5″ thick) steel lid of the (1,000 liter) reactor through the roof, across the highway, into a parking lot on the other side. Since then they have a lab which subjects all intermediates to heat, and strikes them with hammers to preclude such surprises.

        1. theasdgamer says:

          Your comment reminded me of a bugs bunny cartoon where he was testing bombs to discover duds.

  7. Some idiot says:

    “Dealing with such molecular dynamics is another kettle of hostile, uncooperative fish”

    Classic! One of the reasons I love reading this blog…!


  8. M says:

    As a computational chemist for many years at a big pharma company, I never had the type of back and forth mentioned here. I always presented my ideas to my medicinal chemist colleagues as hypotheses to test, both with what the model predicted were the right choices, but also what the model predicted *shouldn’t* work. I never presented ideas that were synthetically infeasible either. Just because I was a “modeler” didn’t mean I had to forget any organic (or other) chemistry I had learned. The result was that I had a really good relationship with all of my medicinal chemistry colleagues.

    1. myma says:

      Yes, it does work out sometimes! The first difference is that you call yourself out as a chemist first by training and computational modeller second. The second difference is that it sounds like you work with your models and your med chem customers to learn, understand, and refine. It isn’t a black box, it is about quality of the data in, quality out, and honest interpretation.
      I may sound harsh to AI in my comment above, but there is a shed load of hype in AI for drug discovery. Last week, I came across a start-up with bucu VC funding that claimed its models are “98.3% predictive”. This sort of extreme language does a disservice to those computational chemists who actually add real value to projects.

  9. LeeH says:

    A common issue with academic exercises is that they push the “de Novo” aspect. So the question they try to answer is “what happens if we ignore what’s already been done, and just apply our method”. In this case, the simple application of a few cheminformatics filtering rules would remove the vast majority of structures suggested by the generative method (as in this case, Num_Boron = 0, or Num_Boron <=1 if you were in a Millennium mood). I think Connor would have been better served had he shown an example structure that was esoteric, but not outside the bounds of normal industry parameters. To his credit, he wasn't holding back in trying to illustrate the difficulty in tuning the generative method to give something diverse, but not too diverse.

    1. Anonymous says:

      If that means no borons, not the best example:


      1. LeeH says:

        Hence the <= 1 and the reference to Millennium.

  10. Marko says:

    “…. you allow the software to build out molecules from some starting point, accepting and rejecting particular lines of inquiry so as to focus the computational resources on the richer seams of structural space. That is, you’re generating new structures in a deliberate fashion instead of just (impossibly) running through every variation possible.”

    To what extent does the further development of quantum computing allow us to explore the last option , and how far are we away from that technology. I have no clue , but I’m hoping someone here might.

    In other words , is the work described above simply laying the groundwork for inevitable computational developments to come ?

    1. MPK says:

      It does not (yet?). Advantage of quantum vs classical computing (“quantum supremacy”) applied to comp chem hasn’t been shown yet. So far, very simple Hartree-Fock systems can be done and there are multiple attempts at approximations to coupled clusters or large CI. But still very far from anything practical.

      1. Marko says:

        OK. So , a ways off yet. Thanks.

  11. A.I. says:

    Less boron, more cow bell.

  12. That BH2.N.IH(NH)(SH) cluster made me wince out loud

  13. sPh says:

    Isn’t the point of these efforts to propose molecules that knowledgeable people would not, because their previous experience leads them to set aside certain categories or paths? As I have aged through my chosen technology path I have become aware that (1) I can often solve problems faster and more accurately than my younger employees whose brains are clearly now faster than mine, because my greater depth of knowledge and understanding based on years of practice lets me get to the root of the problem more surely than the fast-brains (2) when real creativity is needed, or there is no obvious path, I have to step aside and let the people with less experience try because my depth of knowledge will constrain my sight and keep me on the paths I already know. Is it possible that the phrases ‘that molecule clearly won’t work’ and ‘even if that did something it could never be manufactured’ may be constraining?

    note: I am not a physicist!

  14. Insilicoconsulting says:

    1. It’s important to define what’s meant by chemical space. It can be defined in terms of thousands of dimensions and different types of dimensions, from simple measurable properties to fields.

    2. Fitness functions will need to incorporate synthetic feasibility, or have two loss functions, one for SAR and one for feasibility (and maybe one for DMPK).

    To reach long term goals organic/medchem chemists need to work very closely with ML/compchem ones.

  15. Kevin says:

    I confess to being just a tiny bit disappointed to see a post titled “Generating Crazy Structures” that didn’t lead to a new article from Klapötke’s Things I Won’t Work With insanity factory.

    Mention of a four-membered three-nitrogen ring is a bit of a consolation prize, I guess.

  16. Marek Vokac says:

    Well, this is not exactly a new subject either. I just have to include an excerpt from the ever-inspiring Ignition by John D. Clark, page 171, as follows:

    The Air Force has always had more money than sales resistance,
    and they bought a one-year program (probably for something in the
    order of a hundred or a hundred and fifty thousand dollars) and in
    June of 1961 Hawkins and Summers punched the “start” button and
    the machine started to shuffle IBM cards. And to print out structures
    that looked like road maps of a disaster area, since if the compounds
    depicted could even have been synthesized, they would have, infallibly,
    detonated instantly and violently. The machine’s prize contribution to
    the cause of science was the structure H-C=C-N(OF)-N(OF)-H, to
    which it confidently attributed a specific impulse of 363.7 seconds,
    precisely to the tenth of a second, yet. The Air Force, appalled, cut
    the program off after a year, belatedly realizing that they could have
    got the same structure from any experienced propellant man (me,
    for instance) during half an hour’s conversation, and at a total cost
    of five dollars or so. (For drinks. I would have been afraid even to
    draw the structure without at least five Martinis under my belt.)

    Like Derek says above… we should in principle be able to figure this out, but the parameter and solution spaces are astronomical in size and my guess is that classical computation methods won’t hack it.

  17. luysii says:

    ” . . . the proteins that such compounds bind to have a range of possible pockets and surfaces that are just as wide-ranging as the small-molecule structures are, and that includes their conformational flexibility as well.”

    You couldn’t ask for a better example than Cell vol. 182 pp. 1574 – 1588 ’20 — It has the cryoEM structures of LSD and NBOMe (a designer hallucinogen) bound to the very well known target the serotonin 2A receptor (5HT2A). The target and its binding site has been known for years, and yet the two compounds bind in completely different orientations. Even worse NBOMe produces a new pocket in 5HT2A to which part of it binds. So you can’t even use the presumably ‘known’ binding site for computation. Yet another reason drug development is very very hard. For details please see —

  18. exGlaxoid says:

    Why not just have the computer look at fragments of things that exist in reality and then put them together into larger pieces, as those compounds might have a chance of being made? De novo molecules may sound great, but are akin to monkeys on typewriters. By analogy, I’d rather use a program that randomly mixes real words and then looks for good grammar to come up with new phrases that might mean something than just random textual noise.

    The first part is to remove those fragments that are not practical, like mercury compounds, explosives, radioactives, etc., as well as pentavalent carbon, then come up with simple rules for stable compounds, and no trans-cyclohexenes, etc. Given chemical space, that would still leave way more compounds than we can ever make. But maybe not as many as have been theoretically patented, based on some patents that say “any molecule with a carbon and 4 substituents which can range from H to any combination of up to 1000 elements.”

  19. Thomas Lumley says:

    Reminds me of this passage from ‘Ignition’:
    ‘The Air Force has always had more money than sales resistance, and they bought a one-year program (probably for something in the order of a hundred or a hundred and fifty thousand dollars) and in June of 1961 Hawkins and Summers punched the “start” button and the machine started to shuffle IBM cards. And to print out structures that looked like road maps of a disaster area, since if the compounds depicted could even have been synthesized, they would have, infallibly, detonated instantly and violently.’

  20. Walther White says:

    Nice to get an update on what’s happening in this area. I remember working with “growing” ligands using genetic algorithms against a protein pocket–computationally cheap, but the results were invariably “brick” compounds or chemically impossible chimeras. One alternative I used was Markush enumeration of variable cores, followed by conformational searches (you don’t want to bother scoring strained molecules) to generate leads for docking or ROCS. Generated some pretty good leads this way.

  21. A Nonny Mouse says:

    Once did some work for one of these “De Novo” companies when they were down to their last £100K having blown £23 million.

    The structures were already pre-screened to remove the absolute dross, but what was left wasn’t much better (lots of exo-methylenes, I recall). Managed to make about a dozen, but none of them showed up any activity for what they were designed for. They tried to continue selling their software, but that was eventually abandoned as well.
