
David Weininger and Chemical Names

David Weininger passed on last week, and you probably have to be into chemoinformatics for that name to immediately register. He came up with the SMILES notation for chemical structures, though, so that should make his contributions clear. Here’s an excellent appreciation by Anthony (“Ant”) Nicholls that will really give you a sense of the guy; I very much recommend it.

One thing that may not be appreciated is just how much of a big deal SMILES really is. The reason is that chemical structures, although excellent and meaningful representations (which is why we chemists get so ticked off when art directors mangle them) are actually pretty unwieldy for computers to handle. Until the 1950s or so, that probably wasn’t much of a problem, but it became clear that there needed to be ways to turn structures into some sorts of numerical or alphabetical forms so that they could be dealt with by software.

Natural language, as usual, wasn’t much help. For the non-chemists in the crowd, you’ve surely seen chemical names written out in English or whatever your native language might be, all the way from “methane” on up. Those names are systematic, and can be converted back and forth from structures, but they can be pretty unwieldy. For example, here’s the systematic name for testosterone: (8R,9S,10R,13S,14S,17S)-17-hydroxy-10,13-dimethyl-1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a]phenanthren-3-one. Not very enjoyable, and it gets a lot worse from there. IUPAC nomenclature like that tends to assume that polycyclic systems are aromatic by default, and then reduces the bonds as needed, thus that “dodecahydro” part. It’s also hard to compare bits of a structure with that naming system, since the root of the name (and the numbering scheme that goes along with it) can totally flip over with even fairly minor changes in structure.

[Image: SMILES example using ciprofloxacin, from Wikipedia]

An early way to do this was Wiswesser line notation, which I noted to my amazement was first worked out in 1949. I remember seeing it once in a while in the 1980s, in grad school, but I never really learned it. One problem with it that became apparent as the years went on was that it was difficult to write software that could read it easily. So in the 1980s, Weininger worked up the SMILES format, which is much more friendly in that regard (although not as compact). For an explanation of how it works, I can’t do better than the Wikipedia graphic at right, which uses the antibiotic ciprofloxacin. What you can see is that the molecule’s rings are broken apart, and it’s named as branches off of a resulting chain. There are ways to note what sort of bond connects the atoms, the stereochemistry (if applicable), positive and negative charges, even isotopes.
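
To make the “break the rings, then read off the branches” idea concrete, here’s a minimal Python sketch of a reader for a small subset of SMILES. It’s an illustration only, not Weininger’s or Daylight’s code: the function name `parse_smiles` and the token pattern are made up for this example, and it ignores stereochemistry, charges, isotopes, and two-digit ring closures. Parentheses push and pop a branch point, and a repeated digit closes a ring by bonding back to the atom where that digit first appeared.

```python
import re

# Tokens for a small SMILES subset: bracket atoms (kept as opaque symbols),
# two-letter halogens, one-letter atoms (upper = aliphatic, lower = aromatic),
# bond symbols, branch parentheses, and single-digit ring closures.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")


def parse_smiles(smiles):
    """Return (atoms, bonds) for a SMILES string in the supported subset.

    atoms is a list of atom symbols; bonds is a list of (i, j, symbol)
    tuples indexing into atoms. Simplification: implicit bonds are recorded
    as "-", even between aromatic atoms.
    """
    atoms, bonds = [], []
    prev = None          # atom that the next atom will bond to
    pending_bond = None  # explicit bond symbol waiting for its second atom
    branch_stack = []    # atoms to return to when a branch closes
    open_rings = {}      # ring digit -> (atom index, bond symbol at opening)

    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok = match.group()
        pos = match.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError("unbalanced ')'")
            prev = branch_stack.pop()
        elif tok in "-=#:":
            pending_bond = tok
        elif tok.isdigit():
            if tok in open_rings:
                # Second appearance of the digit: close the ring.
                other, opening_bond = open_rings.pop(tok)
                bonds.append((other, prev, pending_bond or opening_bond or "-"))
            else:
                # First appearance: remember where the ring was opened.
                open_rings[tok] = (prev, pending_bond)
            pending_bond = None
        else:
            # Any other token is an atom.
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending_bond or "-"))
            pending_bond = None
            prev = index
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


# One SMILES string for ciprofloxacin.
cipro = "N1CCN(CC1)C(C(F)=C2)=CC(=C2C4=O)N(C3CC3)C=C4C(=O)O"
cipro_atoms, cipro_bonds = parse_smiles(cipro)
print(len(cipro_atoms), "atoms,", len(cipro_bonds), "bonds")
```

Fed that ciprofloxacin string, the sketch finds 24 heavy atoms and 27 bonds, which is what four rings should give you: one bond per atom after the first, plus one extra for each ring closure.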

And as you can imagine, there are a lot of different SMILES strings that are equally possible for any complex molecule. Any given software package should be able to come up with the same representation every time you give it the same molecule, but it’s not necessarily the one that another SMILES generator might return. (When you convert it back to a chemical structure, though, you should get the same thing from either). These text strings can be handled for all sorts of purposes, as you’d imagine, and modern chemoinformatics wouldn’t be possible without something like this.
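
Here’s a toy illustration of why one molecule has many SMILES strings, in Python and only a sketch under loud assumptions: it handles acyclic molecules with single bonds only, the names `write_smiles` and `toy_canonical` are invented for this example, and the “canonical” rule (sort the branches, then take the alphabetically smallest string over all starting atoms) is a stand-in, not the Daylight algorithm or any real package’s. The writer just walks the molecule depth-first from whichever atom you tell it to start with.

```python
def write_smiles(atoms, neighbors, start, canonical=False):
    """Depth-first SMILES writer for an acyclic, single-bonded molecule.

    atoms: list of atom symbols; neighbors: dict of atom index -> list of
    bonded atom indices. With canonical=True the branches at each atom are
    sorted, so the output no longer depends on the neighbor ordering.
    """
    def visit(atom, parent):
        parts = [visit(n, atom) for n in neighbors[atom] if n != parent]
        if canonical:
            parts.sort()
        # Every branch except the last is wrapped in parentheses.
        branches = "".join(f"({p})" for p in parts[:-1])
        return atoms[atom] + branches + (parts[-1] if parts else "")
    return visit(start, None)


def toy_canonical(atoms, neighbors):
    """Toy canonical form: smallest sorted-branch string over all starts."""
    return min(write_smiles(atoms, neighbors, s, canonical=True)
               for s in range(len(atoms)))


# Isopropanol (heavy atoms only), numbered two different ways.
atoms_a = ["C", "C", "O", "C"]
bonds_a = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
atoms_b = ["O", "C", "C", "C"]
bonds_b = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

# Same molecule, different string for each starting atom.
for start in range(len(atoms_a)):
    print(write_smiles(atoms_a, bonds_a, start))

# The toy canonical rule gives one answer for both numberings.
print(toy_canonical(atoms_a, bonds_a), toy_canonical(atoms_b, bonds_b))
```

Starting from each of the four heavy atoms of isopropanol in turn gives `CC(O)C`, `C(C)(O)C`, `OC(C)C`, and `CC(C)O`, all valid and all the same molecule. The toy rule picks `C(C)(C)O` no matter how the atoms were numbered on input, which is the job a real canonicalization routine does for molecules with rings, double bonds, and everything else this sketch leaves out.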

There are other “structure to text string” systems out there, naturally. One of the most common, besides SMILES (which is basically everywhere there are chemicals and computers) is InChI, the International Chemical Identifier. One big difference between the two is that InChI aims to give a unique representation for every molecule – there’s only one way to do it for each molecule you put in. That’s also the case for Chemical Abstracts Numbers, of course, but CAS numbers, though short, are arbitrary and tell you absolutely nothing about the structure or anything else. InChI is trying to be both unique and comprehensive at the same time, which is a tall order. Attempts have been made to bring InChI and SMILES together as well, at least to make them more freely interconvertible.

With practice, you can sort of eyeball out structures from either SMILES or InChI, or at least differences between two structures. But they aren’t meant for us – they’re meant for machines. And given how we rely on those machines, I’m glad that they work for them.

22 comments on “David Weininger and Chemical Names”

  1. Anon says:

    Put simply, SMILES is able to compress 3D information into 1D information (a linear string of text), very efficiently and without losing anything.

    1. Wolf-D. Ihlenfeldt says:

      Hardly without losing anything. There is no way to store 2D or 3D information in SMILES. You can only get the original connectivity back, but no 3D coordinates, or 2D layout. SLN (Sybyl line notation) is actually a much cleaner and more extensible design, but alas it never got the market share of SMILES.

      1. Anon says:

        Isn’t the connectivity enough to tell you all about the 3D structure? Maybe not exact stereochemistry, but all the bond lengths and angles are highly predictable.

        1. Wolf-D. Ihlenfeldt says:

          No. Stereochemistry is actually the part which works out pretty well (SMILES even supports square planar, trigonal bipyramidal and octahedral stereochemistry), but for most applications where you bother to store a 3D structure you want a specific rotamer/conformation, and that cannot be reconstructed.

          1. Anon says:

            Well the rotamer conformations are also easy to figure out by energy calculations – unless they are dynamic, in which case they are artificially constrained (given as fixed when they are not) in any case.

  2. Falanx says:

    Unfortunately, the CAS system *does* have several examples where multiple different numbers exist for the exact same compound, usually in inorganic chem.

  3. Wavefunction says:

    There’s a great description of both Weininger and Anthony in Ed Regis’s book “The Info Mesa” about the 1990s Santa Fe startup scenario.

  4. Wolf-D. Ihlenfeldt says:

    “Any given software package should be able to come up with the same representation every time you give it the same molecule”

    This is generally not the case. Rather, the most straightforward way to generate SMILES is to traverse the atoms in the order they are numbered or stored internally, and using this approach the same molecule yields different SMILES depending on how it was input, even within the same software package. Extra canonicalization is usually disabled by default since it eats performance, and canonic SMILES are exchangeable only within a single implementation anyway.

    Weininger himself already proposed a method to generate a unique SMILES string independent of the atom and bond numbering/sequence. Unfortunately (and probably intentionally – Daylight was once enormously profitable with their database software and it was not in their interest that other software could reproduce their database keys) the original Daylight SMILES canonicalization algorithm was only published in a very cursory and incomplete fashion, and thus every software package had to invent its own solution to arrive at a canonic SMILES string, making them completely non-interoperable.

    Also, almost every cheminformatician has some choice words about the definitely weird definition of aromaticity implied in SMILES, which is generally incompatible with more mainstream definitions, and requires tedious special-purpose re-implementation of aromaticity detection and resolution code in every SMILES package.

  5. LeeH says:

    Dave long ago gave me my first introduction to SMILES, a quick tutorial, while he rolled a cigarette using tobacco from his 70’s-style leather stash bag on his belt. I remember it vividly.

    On a more technical note, SMILES is a great way to store structures compactly, but it does not handle enhanced stereochemistry at all, making it inappropriate for warehouse applications.

  6. AndyM says:

    David Weininger will be missed! I’ll never forget the first time I met him. It was an ACS meeting in Chicago, circa 2002. As ACS meetings usually go there wasn’t much exciting, except for encountering a very cool dude that looked like Moses (that would be David), fresh from an audience with Pharaoh (i.e. “let my people go”), and witnessing a pack of rabid academicians getting schooled on electrostatics by an “outsider” (that would be Anthony). The introduction to David came by way of Andrew Grant, who was in deep discussion with David on a cheminformatics topic. My all too brief conversation with David was exhilarating – no further need for caffeination for the remainder of the afternoon ACS sessions. David covered topics of chemical diversity, “druggable” chemical diversity, astrophysics, aeronautics and how to calculate the minimal fuel needed to successfully transit the Pacific Ocean (from Hawaii to California), flying a vintage Italian aircraft. I departed with an infusion of chutzpah, genius, and an invitation to Star Parties at his backyard observatory in Santa Fe. Not too many scientists/innovators get to the “mountain top” – David was one of them. The cheminformatics community has benefited greatly from his vision and creativity and it will continue to do so for many years to come. Thanks David – CAVU! Ceiling And Visibility Unlimited….

  7. Metacelsus says:

    How would SMILES handle the stereochemistry in something like BINOL or a helicene?

    1. Wolf-D. Ihlenfeldt says:

      It cannot – at least not in any of the Daylight definition incarnations.

  8. Peter S. Shenkin says:

    My favorite quote from the Daylight theory manual: ‘The “aromaticity” designation as used here is not intended to imply anything about the reactivity, magnetic resonance spectra, heat of formation, or odor of substances.’ It’s clear that what’s foremost is the ability to canonicalize a structure even if the user should enter a specific Kekulé structure.

    There are indeed a few places where one might take issue with the choices made, but the operative word is “few”. More problematically, for these “few” cases, different packages disagree on whether a given ring is aromatic. For example, there is at least one widely used package (not Daylight) in which the central four-membered ring in biphenylene is considered aromatic.

  9. Argon says:

    Just a basic question… Which notations are the most ‘friendly’ for algorithmic, similarity searches?

    1. Wolf-D. Ihlenfeldt says:

      Does not matter. Similarity searches are generally not performed on notation strings (though there are creative though theoretically rather unfounded approaches to directly use SMILES notation word fragments for that – but that is definitely not mainstream). The normal method is to compute some bit vector from a decoded structure, and then use a bitvector comparison algorithm. What representation the structure was decoded from is then irrelevant.

  10. Peter Kenny says:

    We shouldn’t forget SMARTS for specifying substructure when remembering Dave. The power of SMARTS notation is that it allows you to impose a view of chemistry (e.g. atom types; tautomeric preferences) on databases of chemical structures in a transparent manner. In my view, SMARTS notation is even more powerful than SMILES notation. I have linked a blog post on SMARTS as the URL for this comment.

  11. Pick says:

    At least it’s not one of guliuo superti furga’s ideas. The guy is dumb eurotrash.

    1. drsnowboard says:

      I think you’ve strayed into the wrong comments section? Insults is next door…

  12. Project Osprey says:

    On the subject of SMILES and databases, here’s a chemical structure search engine which runs off the SMILES strings in Wikipedia. Sort of a poor man’s SciFinder, but to my eyes much more useful than similar open access databases (ChemSpider or PubChem) inasmuch as it tells you what the various compounds are and is much faster.

    http://www.cheminfo.org/wikipedia/

  13. Anon says:

    Personally, I’d like to commend our dear author for keeping the ‘o’ in chemoinformatics! Too long have I laboured under the lesser ‘cheminformatics’ banner.

    And my sincere thanks to the late David Weininger, for without SMILES, so much of chemoinformatics and drug discovery would have been harder. And probably duller.

  14. Julia Yerkov Kline says:

    I am not a chemist, but I met Dave while refurbishing the molecule sculpture placed in front of the Daylight company in Santa Fe, together with Steve Kline, a sculptor from New Orleans. Dave Weininger and Steve Kline had known each other since the time Dave and Dawn lived in New Orleans. We stayed at Dave’s house at the top of the hill and had a fun time visiting The Black Hole surplus yard in Los Alamos; we spent late nights outdoors, with long talks under the open sky, a time I will never forget. I will miss Dave Weininger always.

  15. JohnH says:

    I worked with Dave in the mid 90s and we covered a lot of ground on “difficult” chemical structure representation. His approach was that if I could make a valid information argument about an approach then it would be put into the software otherwise not. The discussions normally took place in the late evening after a joyful meal and walking around under the stars in Santa Fe until we reached resolution. He was a truly inspirational person who enriched the lives of those who knew him.
