In Silico

Odd Structures, Subjected to Powerful Computations

Here’s a paper in Nature Chemistry on computational simulation of GPCR activation, using the beta-2 receptor as a model. I’m writing as someone who’s worked on GPCRs, who is interested in such mechanisms, but who is not a computational chemist. And as such, I have some real reservations about the paper. Are my misgivings well-founded or not?
What this team at Stanford has done is a massive amount of molecular dynamics work, attempting to capture details of the conformational changes that have to be taking place during receptor activation. They present plots illustrating the movements of key residues versus time for agonists, antagonists, and inverse agonists, at a very fine level of detail. I am not competent to judge the fitness of their MD software, their use of Markov state models, the methods by which they reduce 3,000 of those MSMs down to a ten-state model of the receptor dynamics, and so on. I’d be glad to hear from people who are – for now, I’ll assume that all this has been done to a high standard.
But the paper does get into some areas that I feel able to question. Fellow chem-blogger Wavefunction’s Twitter account first alerted me to the feature that I find most disturbing. Take a look at these structures. They’re supposed to be from the GPCR Ligand and Decoy Database, published here, which this paper used as a source of more agonists, antagonists, and comparison compounds. But when I search the beta-2 files from that server (both agonists and antagonists, ligands and decoys), I can’t find any compounds with the left-hand sides of compounds 2 or 3. They’re not in the ZINC database, either, as far as I can tell. (A minor point is that the current paper refers to them as catecholamines, but that’s not right, either: a catechol is a dihydroxybenzene, and there’s an extra methylene in these. Such structures can be found in beta-adrenergic ligands, but they’re not catecholamines).
Then there’s the more serious matter of that hemiaminal. That’s not the usual pharmacophore for an adrenergic receptor, which has amine and OH on adjacent carbons, and it’s not even really a stable group under most conditions. That was the first thing that struck me when I saw the structures – how are these things GPCR ligands, when I wouldn’t even be sure that they’re stable in buffer?
Exacycle plot
So I don’t know where these compounds come from, or if some mistake has been made along the way, but that’s how it looks to me after a bit of digging around. I took a look through the literature for structures like these, and I did find some in an old Theravance patent, WO03042164 (see compounds 16 through 26). Looking over the patent procedures, though, I think that those structures are a mistake. The claimed chemical matter (and the synthetic procedures used in the rest of the patent) are all directed to making traditional beta-hydroxy-amino structures. The patent has a run of these hemiaminal things, supposedly made by the same coupling procedures that make the real beta-receptor ligands in the rest of the patent, and that’s not going to work. My guess at the moment is that some beta-receptor ligand database has been corrupted by inclusion of these structures, which may well have propagated from this application or others in the series. At any rate, I don’t see, at the moment, how these things are beta-receptor agonists, and I would have to say that running molecular dynamics simulations on them is not the best use of computing cycles.
And that brings up one last problem I have with this paper, which may be a minor one (or maybe not). The title is “Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways”. I can’t help but think of something Bill James (the baseball statistics guy) wrote back in the 1980s. He was doing one of his exercises to try to predict performance, and said that he’d done a “computer projection”. But then he backed up, and said that he figured that this phrase was probably going to disappear from use in the coming years, because the only way to do these things was with a computer, and you didn’t say that you’d done a “pencil projection” or something. The tool would become so common that it would disappear into the background. To a large extent, that’s just what’s happened – but it was a sufficiently novel thought back then that I found it striking.
So I have to wonder, perhaps unfairly, if the “Google Exacycle” part is there to bring in some more attention. It’s true that the cloud-computing aspect of this work did allow the authors to do a lot more than the usual MD simulation, and it may well be a difference of kind rather than just a difference of degree (again, I’d be glad to hear from computational folks on this point). But it can’t help sounding cutting-edge, can it?
Update: Both Stuart Cantrill of Nature Chemistry and lead author Vijay Pande have shown up in the comments section, and I appreciate both of them coming by. These issues are being looked at – more later as details become clear.
Second Update: Prof. Pande says in the comments that this is indeed a drawing error. The correct structures are more reasonable beta-receptor ligands, and those are the ones that were docked. The paper is being corrected.

45 comments on “Odd Structures, Subjected to Powerful Computations”

  1. anonao says:

    3 authors are from Google, so that could explain the fact that the google cloud is in the title.
    It’s also an advert for the google cloud to try to attract researchers to use their infrastructure (competition versus Amazon cloud, mainly, and Microsoft Azure)

  2. Wavefunction says:

    As someone who considers himself a structure-aware computational chemist I am always disappointed when such mistakes crop up since they then breed suspicion and, more importantly, detract from the real value of a paper. I have a lot of respect for the authors of that paper and I actually think that MSMs constitute one of the most significant advances in theoretical protein folding over the last twenty years, potentially allowing us to capture kinetic rates, transition barriers and rare events. This makes such chemical errors even more unfortunate since, instead of appreciating the methodology used in the paper, they only make chemists cringe and focus on one part of the study.
    The lesson here is for computational chemists to have chemical structures vetted by organic chemists or those with a background in organic chemistry. Even if your paper is mainly about methodology don’t let it unnecessarily get derailed by elementary errors in structural drawings. Otherwise your good work runs the risk of being ignored, all because of mistakes that any competent organic chemist would have been able to notice within ten seconds of eyeballing.
    As far as the Google Exacycle part is concerned, sure it was included partly to bring in more attention, but it’s also quite true that it demonstrates the kind of computing power needed to conduct such large-scale, fine-grained studies of proteins as complex as GPCRs. These guys were studying tens of thousands of trajectories with average lengths of a few ns each, and that’s exactly the kind of problem which would benefit from a massively parallel computing platform like the exacycle.

  3. anon the II says:

    Your analysis sounds like an old episode of Columbo. You know the authors are responsible for killing science, but you’re gonna go into minute detail and give them every opportunity to work their way out of this mess. But just as the show is about to end, they bring out the aminals and everybody knows that they did it and they’re led away in handcuffs as the credits roll.

  4. Anon says:

    I think the simplest explanation here is that there may be a drawing error.
    That would be consistent with their catecholamine nomenclature.

  5. David Borhani says:

    Pretty funny, but actually pretty sad. The journal is Nature Chemistry, after all. Reviewers and editors asleep at the wheel?
    And the corresponding author is in Stanford’s Department of Chemistry.
    Looking beyond what one hopes are just graphical and typographical errors (did they really dock molecules 2, 3, and 4?), saligenins are perfectly respectable partial β-agonists (think albuterol/salbutamol and salmeterol). Oh, and by the way folks, 1, not 4, is the ethanolamine.
    And how about that structure 8? Don’t let it near trace acid, it might not hang around long!
    @Derek: my comment yesterday on the previous blog entry is stuck in review. Could you please release it? Thanks!

  6. Chris Ing says:

    This study consumed 450 million core hours of simulation. At about 4 cents an hour per core, this would’ve cost $18 million to run on Google’s Compute Engine. Anyone who works on HPC knows that you don’t do serious computation on cloud computing platforms. It’s more of an advertisement for Google, in general.
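
The arithmetic behind that estimate can be checked in a couple of lines. This is only a sketch: the flat rate of 4 cents per core hour is the commenter's own approximation, and the function name is made up for illustration.

```python
# Figures quoted in the comment above; the flat per-core rate is the
# commenter's estimate, not an official price.
CORE_HOURS = 450_000_000
CENTS_PER_CORE_HOUR = 4  # assumed flat rate, no discounts


def cloud_cost_dollars(core_hours: int, cents_per_core_hour: int) -> int:
    """Return the total cost in whole dollars, working in integer cents."""
    return core_hours * cents_per_core_hour // 100


if __name__ == "__main__":
    # Prints the total with thousands separators.
    print(f"${cloud_cost_dollars(CORE_HOURS, CENTS_PER_CORE_HOUR):,}")
```

Working in integer cents keeps the result exact; at these inputs it matches the $18 million quoted.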

  7. Derek Lowe says:

    #2 Wavefunction –
    That’s the thing – this paper should have had just that sort of look taken at it during the review process, but something appears to have broken down.

  8. anon the II says:

    @ Wavefunction
    The fact that you have to construct the term “structure-aware computational chemist” is kinda scary. My sense is that there shouldn’t be any other kind. Unfortunately, there are.

  9. 3Anon says:

    I am curious about your rights to the data. Google is notorious for getting to peek at your data or trends within it.
    Gmail is free because they are able to track your conversation and watch your key words (which they will use for advertising against you).
    Google search tracks your searching habits and is able to do so even if you clear your cookies/ISOs.
    Google’s model is to provide a service for free or at a reduced cost and use the user’s input to generate their financial returns (mostly through advertising right now). My bet is that the Exacycle terms of service are not going to give you 100% ownership of the data and will still give Google an extra set of eyes.
    That said, I warn others about adopting such terms as Google is entering the Biotech/Pharma world through Calico and its various venture backed programs. Other than outright knowing the molecule of interest, if a member of my team logs in and keeps checking a specific domain, Google parties will have a good idea of what I am thinking of targeting. They will be able to know what models I am looking at, how I analyze the data, etc.

  10. Wavefunction says:

    #8: Most computational chemists I have encountered generally come from one of two backgrounds – organic/medicinal or physical chemistry. A few others come from biochemistry or computer science. The ones who come from either p-chem or CS have to be most careful at interpreting the results in terms of known physical-organic principles. Some time back I wrote a post on becoming a modeler in a drug discovery setting which captures some of these points (linked to in my handle).

  11. Pete says:

    Maybe they were worried about catechols hitting PAINS filters and the results being declared invalid by the self-appointed arbiters of molecular good taste? anon the II (#8), I hope that anyone calling him/herself a chemist would be structurally aware.

  12. pc says:

    off topic but this should be of interest to folks here –
    Potti scandal at Duke

  13. OldLabRat says:

    I hope those are drawing errors. Otherwise, how would one model a racemic or diastereomeric mixture in 3D? I haven’t had a chance to read the paper, but presumably the authors modeled all stereoisomers. So why not draw the stereoisomer that was modeled for the plot results? As far as I can tell, there’s no way to draw any conclusions about the papers validity.
    Definitely a failure of the journal editors to find qualified reviewers.

  14. DCRogers says:

    Almost certainly drawing errors; one hopes they were only misdrawn for the paper, not for the modeling.
    @12: my biggest headache reviewing statistical chemistry papers was confirming that the author did not perform variable selection (with knowledge of the dependent variable) prior to validation. Data tables with independent columns named like: DESC_1, DESC_2, DESC_7, DESC_12… always made me question what bathtub the missing descriptors were quietly drowned in so that Q^2 could stay afloat.

  15. Stu says:

    We’re looking into the issues raised by Derek’s post – stay tuned.

  16. Just maybe says:

    My bet is on drawing errors. Maybe they are phenethylamines rather than benzylamines.

  17. VijayPande says:

    Thanks for identifying these issues. We’re looking into what’s the issue here and our suspicion is that there’s an issue with the drawing of the example structures. We’ll post more after we’ve gotten to look at the points you raised in detail.
    However, despite the unfortunate example structures shown in a main figure of the paper, this issue does not detract from our main point in this section. The main aim is to show that these new **protein** structures (along the different pathways for activation identified from the MD/MSMs, where all of the big computation was used) can be used –– in the context of virtual screening –– to utilize unique binding site configurations that can select unique ligand types, which may not have been discovered by conventional virtual screens.

  18. Biff says:

    Another beef I have with the paper: how does a “simulation” actually “reveal ligand modulation of GPCR activation pathways,” anyway? It’s a simulation, dammit!
    Wouldn’t the proper title be, “Simulated ligand modulation of GPCR activation pathways?” I guess that would make it sound less worthy of being in a Nature publication.
    Further, (1) making a simulation “cloud-based” and (2) performing it specifically on the Google Exacycle platform contribute NOTHING to the scientific merit of the study. The fact that both are featured so prominently in the title suggests to me that our friends at Nature Publications may be blurring the line between a scientific article and so-called “content marketing.”

  19. Moody Blue says:

    @4 Anon & @16 Just may be,
    I agree you both will win your bets! The structures should be phenethylamines instead of benzylamines. The -CH2OH group on the aromatic ring should simply be an -OH group, making it a catechol. It’s a matter of a misplaced methylene group! In other words, just move the methylene from -CH2OH and insert it between nitrogen and the hemiaminal carbon!! That would give derivatives of epinephrine, a well known beta-AR agonist!

  20. schinderhannes says:

    This is quite obviously a very embarrassing drawing error!
    The key question is when it happened, though!
    Was it early on? Were these structures fed into the cloud for computing? (I bet the cloud can do almost everything but doesn’t hydrolyze hemiaminals (or condense them to imines), but rather accepts them happily as valid structures).
    This would make the results worthless.
    Or was it in the drawing for the manuscript?
    Embarrassing for authors and reviewers, but no further harm…..

  21. canman says:

    Multiple people have screwed up. Publishing rubbish structures like these should be inexcusable. This is chemistry, apparently done by people with `chemist` in some part of their title. Unfortunately it does look like it’s bordering on advertising. Perhaps we can all help out here to lower the prestige of these high impact journals by NOT CITING PAPERS LIKE THESE.

  22. Anonymous says:

    Having used the dataset of Gatica et al. we found out that for the receptor that we were benchmarking there were a lot of structures that were “junk”. Probably because the original structures taken from the Glida were never validated.
    A more reliable way is to take Ligands from chembl and look up the papers.

  23. anon the II says:

    A more reliable way is for people who know structural chemistry to do structural chemistry analysis and those who don’t know an aminal from an animal to go work in zoos.
    No wait, that’s probably not a good idea either.

  24. anonao says:

    to #22. And did you publish the fact that the data was wrong somewhere? Would be good for other people who use the data from Gatica et al. or from Glida not to make the same mistakes (may not be accepted in journal, but sites like slideshare or figshare would do).

  25. Anonymous says:

    oh, man, does it take hours to find it out?!

  26. BariTony says:

    I had a similar situation occur with my thesis work. Shortly before writing my dissertation on apolipoproteins, I noticed during the analysis of one of my last MD simulations that there had obviously been a major mistake. I was terrified to tell my thesis advisor because I knew the only way to correct the problem was to re-run a month-long MD simulation while I was in the final months of grad school. I did, and my advisor thanked me – said that I could have easily hidden the mistake and graduated without anyone knowing!
    The upshot is, while it may be embarrassing, it’s not necessarily catastrophic. If the authors followed best practices (ie, did a good job with version control while keeping every script they used to run the simulations and analyze the results), it should be straightforward to make the corrections to the structures and re-run the simulations.
    Dr Pande, as to your assessment that these errors shouldn’t affect the results, that may definitely be the case. That’s been my experience with some simulations in the past as well. However, that doesn’t mean that you shouldn’t go back and re-run the simulations if there’s been a problem and report on the results. If not as a sanity check, then simply as a matter of following good computational practices.

  27. anonao says:

    #25 I suppose you are replying to me. Apparently sharing is not really something people are doing. Instead of presenting that there are errors you think it’s better to keep it for yourself and then complain if someone makes a mistake, or just let other people waste some of their time doing the curation. #openscience is not ready yet
    But maybe that is what some industry people do: spend their time on something and then keep it secret, so other people will have to waste their time doing it again.

  28. RM says:

    Biff@18 – Similar beefs could be made with the titles of *most* published studies:
    “How does this study reveal anything about the native? It’s a fluorescently tagged protein, dammit!”
    “How does this study reveal anything about how the protein works in cells? It’s in vitro, dammit!”
    “How does this study reveal anything about how this system works in humans? It’s in a rodent, dammit!”
    “How does this study reveal anything about how the drug works in real patients? It’s in a carefully selected cohort, dammit!”
    Maybe you have broader issues about paper titles, but I just want to point out that it’s really not something that’s simulation/computation specific, unless you have a personal axe to grind.
    You get persnickety about those qualifications, and you quickly wedge yourself into a corner where no experiment can be said to reveal anything about anything, and paper titles all end up like “A particular fluorescently tagged variant of a particular mouse XYZ4 splice variant (previously proposed to play a role in stem cell transcription), when expressed from yeast cells, co-elutes from a particular binding column with a region of ABC6 (expressed in bacteria) which in another paper was proposed to be the regulatory domain, at least in our hands and under the particular conditions outlined in the methods section of this paper.” As opposed to something more pithy like “Stem-cell regulatory transcription factor XYZ4 interacts with the regulatory domain of the ABC6 proto-oncogene”

  29. Anonymous says:

    No, #25 was somebody else.
    I (#22) didn’t publish these results.
    Nevertheless, all I was trying to say is that you should always check the source of your data being it from Glida, Gatica etc. E.g. if you look at Glida Desipramine is listed as a beta-2 agonist, but if you search for the assay it says;
    Not Active (inhibition

  30. VijayPande says:

    Here’s a followup from my post yesterday. We have verified that this is indeed a drawing issue. The three ligands Dr. Lowe brought up from top to bottom are CID 44216210, CID 44213610, CID 44209282. In the pictures, there is just a missing carbon between the nitrogen and carbon with an alcohol group (the hemiaminal that is unstable) and the -NH group should be a positively charged amino group at pH=7. We will work with Nature Chemistry on a correction and are grateful to Dr. Lowe for pointing out this problem.
    I would like to stress that this was a drawing issue solely: we double checked yesterday that we did dock the correct structures. And it’s also worth pointing out that the MSM simulations did not include these molecules (they came from docking after we predicted these structures), i.e. even if this wasn’t a drawing issue, these odd structures were not involved in the “powerful [MSM] calculations” (suggesting a correction to this blog post’s title may also be in order).
    Also, to address the question raised in the comments regarding why the cloud aspect is significant: running molecular dynamics simulations on the timescales we did (hundreds of microseconds) in the cloud is itself a major algorithmic achievement. Most people would have thought this to be impossible (due to hardware scaling limitations) and could only be done on specialized hardware (such as Anton). Doing this in the cloud is also not just an algorithmic challenge –– this novel methodology also opens the doors to others’ ability to run similar calculations, since cloud resources are so ubiquitous, readily accessible even just for short periods of time, and relatively inexpensive (compared to specialized hardware).

  31. anonymous says:

    @30, Prof. Pande:
    CID 44216210 appears to specify stereochemistry at only one of the two chiral centers*. Which diastereomer was docked (both)?
    CID 44213610 appears to not specify any stereochemistry at the one chiral center*. Which enantiomer was docked (both)?
    CID 44209282 appears to not specify any stereochemistry at the one chiral center*. Which enantiomer was docked (both)?
    CID 44216210 and CID 44209282 are presumably partial agonists; for CID 44213610, it is unclear from the published data whether it is a partial or full agonist. How are partial vs. full agonism reflected in the results of the paper?
    * Perhaps a topic for another post, Derek?

  32. Biff says:

    RM@28: Sounds like I hit a nerve.
    Maybe if we had a little more rigor in describing things, we’d get more reproducible data, as opposed to the all too common case of “X upregulates Y” getting published by one group and “X downregulates Y” by another.
    Maybe we’re so used to using shorthand language in our narrow fields that it becomes too cumbersome to get input from people in complementary fields who might catch our obvious mistakes when our work tiptoes into their subject matter.
    Maybe we prefer devoting a third or more of our title text to marketing our funding source than to accurately summarizing our conclusions. Quite to the contrary of your histrionic examples, my proposed title was actually **both shorter and more accurate** than the original.
    Maybe we’re just comfortable reducing rigor among friends. Maybe Potti and Nevins (see Derek’s next post) started, innocently enough, that way.

  33. Henry Rzepa says:

    Surely a text book case for adoption of FAIR data. If you do not know what that is, check out

  34. Henry (#33) alerted me to this and I’m pleased to say I spotted what was wrong very quickly.
    This cries out for machine refereeing of data-rich papers as is mandatory in much crystallography. If the compounds had been submitted as machine-readable connection tables then a quick lookup of many databases would have flagged an error. Yet chemistry as a discipline continues to publish as paper rather than semantically.
    More generally I suspect that masses of compchem suffers from incorrect input. Comp chemists almost never publish the numeric results of their calculations (log files, etc.) although this is trivial. Until data is properly published this will continue.
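
As a rough illustration of what such a machine check could look like, here is a minimal pure-Python sketch that scans a connection table for the motif at issue: a carbon singly bonded to both a nitrogen and a hydroxyl oxygen. The table format (a list of heavy-atom symbols plus `(i, j, order)` bond tuples), the function name, and the two toy fragments are all assumptions made up for this example; they are not the paper's compounds, and a real pipeline would use a cheminformatics toolkit and a database lookup rather than this crude filter.

```python
from collections import defaultdict


def find_hemiaminal_carbons(atoms, bonds):
    """Return indices of carbons bearing, via single bonds only, both N and a hydroxyl O.

    atoms: list of element symbols for heavy atoms (hydrogens are implicit).
    bonds: list of (i, j, order) tuples using 0-based atom indices.
    """
    neighbours = defaultdict(list)
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    flagged = []
    for idx, element in enumerate(atoms):
        if element != "C":
            continue
        # Only consider carbons with all-single bonds; this skips C=O and C=N carbons.
        if any(order != 1 for _, order in neighbours[idx]):
            continue
        has_nitrogen = any(atoms[n] == "N" for n, _ in neighbours[idx])
        # Treat an oxygen with exactly one heavy-atom neighbour as a hydroxyl.
        has_hydroxyl = any(
            atoms[n] == "O" and len(neighbours[n]) == 1 for n, _ in neighbours[idx]
        )
        if has_nitrogen and has_hydroxyl:
            flagged.append(idx)
    return flagged


# Toy fragment, hypothetical: CH3-NH-CH(OH)-CH3 (N and OH on the same carbon).
HEMIAMINAL_ATOMS = ["C", "N", "C", "O", "C"]
HEMIAMINAL_BONDS = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (2, 4, 1)]

# Toy fragment, hypothetical: CH3-NH-CH2-CH(OH)-CH3 (N and OH on adjacent carbons).
AMINO_ALCOHOL_ATOMS = ["C", "N", "C", "C", "O", "C"]
AMINO_ALCOHOL_BONDS = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (3, 5, 1)]

if __name__ == "__main__":
    print(find_hemiaminal_carbons(HEMIAMINAL_ATOMS, HEMIAMINAL_BONDS))
    print(find_hemiaminal_carbons(AMINO_ALCOHOL_ATOMS, AMINO_ALCOHOL_BONDS))
```

The first fragment is flagged at its central carbon and the second is not. The filter says nothing about ring membership or substituents on the nitrogen, so a hit is a prompt for a chemist to look, not a verdict.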

  35. anon the II says:

    I think most of the computational type people reading this blog are missing the point. That includes 33 and 34. Derek didn’t spot the error “based on his considerable experience of these kinds of molecules”. He spotted the error because he has just a little experience with synthetic organic chemistry. That’s all he needed. To a normal, run of the mill synthetic chemist, those aminals look like blood sprayed all over the bathroom, like a bomb going off on the street, like a banshee screaming during a moment of silence. The fact that this was published with a boatload of authors and a few reviewers and sat there in the literature for over a year before he wrote about it says something about the state of computational chemistry and who’s doing it. It’s just been all downhill since Clark Still went away. I’m guessing it’s just too much to ask for someone to know a little organic chemistry and be comfortable with a Unix prompt.

  36. Henry Rzepa says:

    Re #35.
    Hemiaminals (for that is the key to the issues surrounding these systems) require someone to know more than merely a little organic chemistry.
    I have just done a search of 750,000 or so known crystal structures for this motif. It might come as a surprise that there are 33 good quality crystal examples of this sub-structure known, ie R-NH-CH(R’)-OH.
    Knowing that eg this motif might be too unstable or otherwise inappropriate in the context described in the original article takes a little more skill than merely being a synthetic organic chemist.
    Equally important is that the supporting information that was provided with this article could have included at least some examples of the molecules, but no 2D or 3D coordinates were supplied.

  37. Derek Lowe says:

    #36, Henry R. –
    I’ll bet that all, or nearly all, of those 33 are cyclic structures (probably with the nitrogen in the ring), or have the nitrogen acylated (or both). Those sorts of things I’ve seen, but I had never, ever, seen a structure like the ones drawn in this paper advanced as an actual stable compound.
    So one might need to know a bit more than a little organic chemistry, but it’s not exotic knowledge, either. Any decent synthetic organic chemist should be able to look at those structures and go “Hmm? Didn’t think those were stable.” They might well do the same at the cyclic or acylated forms, only to check and find out that yes, those can hold together more. But not these. The more experience you have looking at available molecules, the stronger your feelings will be about this, so experience does play a part. But sophomore organic chemistry teaches people that these sorts of structures are unstable intermediates.

  38. cdsouthan says:

    Following on from #31 picking out the CIDs, it is interesting to walk these out to the source and to the same-connectivity (isomer) structures for the CIDs that the journal (presumably with author check?) usefully indexed.
    The primary depositor for these three was the GLIDA database, but unfortunately you cannot track back to their extraction source reference for provenance.
    CID 44216210 is “flat”, but one of the resolved forms, CID 68976760, maps to Pfizer’s WO2004100950. The other two “flats” are singletons (i.e. no one has submitted the resolved forms to PubChem), but as Derek established via a different route (SciFinder?) these map back to WO9631466 or the same family (via SureChem and IBM).

  39. maybe it should have spent more than a week in review says:

    no, really, maybe it should have spent more than a week in review. maybe somebody should have actually reviewed it instead of just rubber stamping this advertisement? Nature Chemistry should issue a retraction if they want to preserve their good name. Haha.

  40. Matthew K says:

    @28 (RM) – yes indeed, and this set of sceptical filters should pop up like image tagging whenever a competent bioscientist looks at a study in any model system, hell, in any system.
    My proudest “dad moments” are when my 8 and 12 year old daughters watch Mythbusters and keep up a steady stream of alternative explanations, failures of control conditions, etc. I’ve never taught them this but they must have acquired it by osmosis.
    I’d go so far as to say that the automatic uncertainty tagging of any “fact” is the essential mindset of empirical science – a probabilistic framework of explanations, which responds to perturbation and disruption by shifting weights and equilibria, instead of catastrophic failure. And the job is to tweak and nudge at this framework by any means available.

  41. Henry Rzepa says:

    #37 Derek,
    If the query R-NH-CH(R’)-OH is constrained to R=sp3 only, the hits reduce to 2. If the N-C bond is also specified as acyclic, one hit is obtained (doi: ). If only the N-C bond is specified as acyclic, but R=sp2, then 5 hits are obtained. So one molecule out of about 750,000 is indeed not a high frequency (and one might try to find out what it is about that single hit that does stabilize it, or whether indeed it is a valid analysis!).
    One might also ask whether these sorts of reality checks should be a ROUTINE part of any refereeing process involving the review of molecular structures. I fancy this sort of check is likely to be very very rare amongst referees. And it only takes about 2 minutes (if you have access to the database, which not all do have).

  42. sgcox says:

    Henry, that paper is about peptides, there is no such structure in peptides ! I am 100% sure there was an error in description and one carbonyl in the peptide bond was mistakenly assigned as single bond.

  43. Henry Rzepa says:

    The correction supplied by the authors is: “In the pictures, there is just a missing carbon between the nitrogen and carbon with an alcohol group (the hemiaminal that is unstable)”.
    So it’s clearly not a carbonyl group that was intended, but a missing atom (one which in fact appears elsewhere). My gripe was with the supporting information that could have provided a checksum for the structures, but did not.

  44. sgcox says:

    No, I refer to the paper you mentioned in post 41 as an example of the improbable structure existing in the crystallography database, 1 in 750,000. It is an error. There should be a simple peptide bond there instead.

Comments are closed.