
Drug Assays

How Do Chemists (Think That They) Judge Compounds?

There’s an interesting paper out in PLoS One, called “Inside the Mind of a Medicinal Chemist”. Now, that’s not necessarily a place that everyone wants to go – mine is not exactly a tourist trap, I can tell you – but the authors are a group from Novartis, so they knew what they were getting into. The questions they were trying to answer on this spelunking expedition were:

1) How and to what extent do chemists simplify the problem of identifying promising chemical fragments to move forward in the discovery process? 2) Do different chemists use the same criteria for such decisions? 3) Can chemists accurately report the criteria they use for such decisions?

They took 19 lucky chemists from the Novartis labs and asked them to go through 8 batches of 500 fragments each and select the desirable compounds. For those of you outside the field, that is, unfortunately, a realistic test. We often have to work through lists of this type, for several reasons: “We have X dollars to spend on the screening collection – which compounds should we buy?” “Which of these compounds we already own should still be in the collection, and which should we get rid of?” “Here’s the list of screening hits for Enzyme Y: which of these look like useful starting points?” I found myself just yesterday going through about 350 compounds for just this sort of purpose.
They also asked the chemists which of a set of factors they used to make their decisions. These included polarity, size, lipophilicity, rings versus chains, charge, particular functional groups, and so on. Interestingly, once the 19 chemists had made their choices (and reported the criteria they used in doing so), the authors went through the selections using two computational classification algorithms, semi-naïve Bayesian (SNB) and Random Forest (RF). This showed that most of the chemists actually used only one or two categories as important filters, a result that ties in with studies in other fields on how experts in a given subject make decisions. Reducing the complexity of a multifactorial problem is a key step for the human brain to deal with it; how well this reduction is done (trading accuracy for speed) is what can distinguish an expert from someone who’s never faced a particular problem before.
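To make that "one or two important filters" finding concrete, here's a rough sketch of how such a single-factor analysis might look. This is not the paper's actual SNB or Random Forest code; it's a minimal stand-in that asks, for each descriptor, how well a single-threshold rule reproduces a chemist's yes/no picks. The descriptor names, values, and pick data below are invented for illustration.

```python
# Hedged sketch (not the paper's SNB/RF implementation): find the single
# descriptor whose best one-split rule most closely reproduces a chemist's
# selections. All data below are hypothetical.

def best_threshold_accuracy(values, picks):
    """Best accuracy achievable by a rule 'pick iff value <= t' or 'pick iff value > t'."""
    best = 0.0
    for t in sorted(set(values)):
        for rule in (lambda v: v <= t, lambda v: v > t):
            correct = sum(rule(v) == p for v, p in zip(values, picks))
            best = max(best, correct / len(picks))
    return best

def dominant_descriptor(fragments, picks):
    """Return (name, accuracy) of the descriptor whose best one-split rule
    agrees most often with the chemist's selections."""
    names = fragments[0].keys()
    scores = {n: best_threshold_accuracy([f[n] for f in fragments], picks)
              for n in names}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Hypothetical fragments: molecular weight, cLogP, polar surface area.
fragments = [
    {"mw": 150, "clogp": 1.2, "tpsa": 20},
    {"mw": 210, "clogp": 2.8, "tpsa": 60},
    {"mw": 180, "clogp": 0.5, "tpsa": 70},
    {"mw": 260, "clogp": 3.5, "tpsa": 15},
    {"mw": 120, "clogp": 3.0, "tpsa": 55},
    {"mw": 240, "clogp": 1.0, "tpsa": 30},
]
# A chemist who (knowingly or not) filters mainly on size: picks iff MW <= 200.
picks = [f["mw"] <= 200 for f in fragments]

name, acc = dominant_descriptor(fragments, picks)
print(name, acc)  # size ("mw") comes out as the dominant descriptor
```

A revealed preference falls out of the data even if the chemist never mentions size; the paper's classifiers do something analogous, just with richer models over many more descriptors.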
But the chemists in this sample didn’t all zoom in on the same factors. One chemist showed a strong preference away from the compounds with a higher polar surface area, for example, while another seemed to make size the most important descriptor. The ones using functional groups to pick compounds also showed some individual preferences – one chemist, for example, seemed to downgrade heteroaromatic compounds, unless they also had a carboxylic acid, in which case they moved back up the list. Overall, the most common one-factor preference was ring topology, followed by functional groups and hydrogen bond donors/acceptors.
Comparing structural preferences across the chemists revealed many differences of opinion as well. One of them seemed to like fused six-membered aromatic rings (that would not have been me, had I been in the data set!), while others marked those down. Some tricyclic structures were strongly favored by one chemist, and strongly disfavored by another, which makes me wonder if the authors were tempted to get the two of them together and let them fight it out.
How about the number of compounds passed? Here’s the breakdown:

One simple metric of agreement is the fraction of compounds selected by each chemist per batch. The fraction of compounds deemed suitable to carry forward varied widely between chemists, ranging from 7% to 97% (average = 45%), though each chemist was relatively consistent from batch to batch. . .This variance between chemists was not related to their ideal library size (Fig. S7A) nor linearly related to the number of targets a chemist had previously worked on (R2 = 0.05, Fig. S7B). The fraction passed could, however, be explained by each chemist’s reported selection strategy (Fig. S7C). Chemists who reported selecting only the “best” fragments passed a lower fraction of compounds (0.13±0.07) than chemists that reported excluding only the “worst” fragments (0.61±0.34); those who reported intermediate strategies passed an intermediate fraction of compounds (0.39±0.25).

Then comes a key question: how similar were the chemists’ picks to each other, or to their own previous selections? A well-known paper from a few years ago suggested that the same chemists, looking at the same list after the passage of time (and more lists!), would pick rather different sets of compounds. (Update: see the comments for some interesting inside information on this work.) Here, the authors sprinkled in a couple of hundred compounds that were present in more than one list to test this out. And I’d say that the earlier results were replicated fairly well. Comparing chemists’ picks to themselves, the average similarity was only 0.52, which the authors describe, perhaps charitably, as “moderately internally consistent”.
But that’s a unanimous chorus compared to the consensus between chemists. These had similarities ranging from 0.05 (!) to 0.52, with an average of 0.28. Overall, only 8% of the compounds had the same judgement passed on them by at least 75% of the chemists. And the great majority of those agreements were on bad compounds, as opposed to good ones: only 1% of the compounds were deemed good by at least 75% of the group!
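For concreteness, here's one way agreement numbers like these could be computed. A hedged sketch: the paper's exact similarity metric isn't spelled out above, so this uses Jaccard overlap of the picked sets as a plausible stand-in, and the pick data for the three hypothetical chemists are invented.

```python
# Sketch of the kind of agreement statistics quoted above (illustrative data).
from itertools import combinations

def jaccard(picks_a, picks_b):
    """Jaccard similarity of two chemists' selections (parallel lists of bools)."""
    both = sum(a and b for a, b in zip(picks_a, picks_b))
    either = sum(a or b for a, b in zip(picks_a, picks_b))
    return both / either if either else 1.0

def consensus_fraction(all_picks, agreement=0.75):
    """Fraction of compounds given the same verdict (picked or rejected)
    by at least `agreement` of the chemists."""
    n_chemists = len(all_picks)
    n_compounds = len(all_picks[0])
    agreed = 0
    for i in range(n_compounds):
        yes = sum(p[i] for p in all_picks)
        no = n_chemists - yes
        if yes >= agreement * n_chemists or no >= agreement * n_chemists:
            agreed += 1
    return agreed / n_compounds

# Three hypothetical chemists judging the same eight compounds.
chemists = [
    [True,  True,  False, False, True,  False, True,  False],
    [True,  False, False, False, True,  True,  False, False],
    [False, True,  False, False, True,  False, False, True ],
]
pairwise = [jaccard(a, b) for a, b in combinations(chemists, 2)]
print(pairwise, consensus_fraction(chemists))
```

Note how quickly consensus evaporates: even these toy chemists, who agree unanimously on a few compounds, produce pairwise similarities well below 0.5, much like the 0.05–0.52 range reported in the paper.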
There’s one other interesting result to consider: recall that the chemists were asked to state what factors they used in making their decisions. How did those compare to what they actually seemed to find important? (An economist would call this a case of stated preference versus revealed preference.) The authors call this an assessment of the chemists’ self-awareness, which in my experience, is often a swampy area indeed. And that’s what it turned out to be here as well: “. . .every single chemist reported properties that were never identified as important by our SNB or RF classifiers. . .chemist 3 reported that several properties were important, but failed to report that size played any role during selections. Our SNB and RF classifiers both revealed that size, an especially straightforward parameter to assess, was the most important.”
So, what to make of all this? I’d say that it’s more proof that we medicinal chemists all come to the lab bench with our own sets of prejudices, based on our own experiences. We’re not always aware of them, but they’re certainly with us, “sewn into the lining of our lab coats”, as Tom Wolfe might have put it. The tricky part is figuring out which of these quirks are actually useful, and how often. . .

19 comments on “How Do Chemists (Think That They) Judge Compounds?”

  1. Phys Chem Props! says:

    Interesting that none of the chemists thought outside the box and used calculated phys chem properties themselves as a guide in addition to their eyes. When faced with this task in the past, I’ve tended to at least include columns for cLogP, MW and TPSA.

  2. Lizzard says:

    What exactly is the point of all this? It sounds like the starting point for another round of the “my favorite property is best” game. Why can’t we just admit that molecules and their properties are just too complex to reduce down to a manageable and useful parameter set?
    What was the core assumption to this whole exercise? “We only have XXX dollars to spend on new chemistry, what should we buy?” Maybe we should take all the resources we spend doing “Random Forest” analysis (WTF?), and use it to address the core problem: the economics of compound screening. If I can come up with some way, either through affinity screening, or mixture screening, or miniaturization, or whatever, to get 10 times more bang for my screening buck, then I won’t need to employ these Ix eggheads telling me about the holes in our chemical space.
    We need to stop accepting this premise that we can predict what’s going to happen when chemicals meet proteins. Just do the experiment!!! All this hand-wringing over our prejudices is just making excuses for what experiments we won’t do. Better to come up with methods that will allow us to do more. Not make more excuses.

  3. marcello says:

    totally agree with Lizzard

  4. Kazoo Chemist says:

    The “well known paper” that you referred to was unfortunate in that it did not tell ALL aspects of the process. The computational algorithm was employed to (1) eliminate undesirable compounds based on in-house filters and (2) maximize the dissimilarity relative to the existing compound collection, to optimize the use of limited funds for expansion of the collection. The resulting list of about 22,000 compounds was intentionally organized to group compounds with similarities, such as the basic ring structure, to make them easier for the chemists to review.
    The computer was apparently not smart enough to realize that a meta chloro compound is not all that different from a para bromo analog or a dichloro compound. Often a vendor would have an entire series of close analogs, and the computer would select many of them. One of the main responsibilities for the chemists was to address this problem. They were specifically instructed to avoid approving a whole series of related compounds for purchase, but to select a few good representatives and reject the remainder. OF COURSE two chemists might pick a different subset of two or three compounds from a dozen halogen isomers of a common substructure. The authors of the paper use this to imply that the chemists did not agree on the merit of the structures. The data cannot be used to support that conclusion.
    A similar comment can be made about the hidden set of 250 compounds that had been selected by a senior chemist. Many of these compounds may well have been rejected not on the basis of an underlying concern, but because they were too similar to several other compounds in the set under review at that time. Placing these otherwise acceptable compounds out of context in the other review lists, and then expecting the other reviewers to reject a high percentage of them, is ludicrous. Similarly, the experienced chemist who eliminated them on the basis that they were part of a common subset of structures might well find them perfectly acceptable when evaluated individually.
    The authors of the paper were fully aware of these issues, since I personally discussed this with them back in the Pharmacia days. They did not publish this work back then, and it was therefore not subject to review by others who knew the details.

  5. MoMo says:

    Stable Bias! I knew there was a high faluttin’ name for having a shallow view of SAR!
    Again, more diversion from the real work of synthesis and examining actual biology. Now we have experts who put metrics to everything except the emotional aspects of drug discovery, where BIG EGOS and STABLE BIAS get in the way of actual potent and useful therapeutics.
    I see Novartis is spending their cached monies wisely instead of in REAL DRUG DISCOVERY AND DEVELOPMENT!
    Halogens! Yea, go ahead and add chlorines! LOL!
    More chaff from the “Watchers” instead of the “DOers”!
    Get back to work Novartis! Get your hands dirty!

  6. psl says:

    I agree with Lizzard… to a point.
    If you can, do the experiment! You should always screen as many compounds as you can, but not more!
    Chemical space is HUGE. There HAS to be a filter for the subset you intend to screen….you simply can’t explore 10^20+ compounds……
    e.g. look at all these unknown fragments that will soon be added to the list of things to screen:
    Methodological biases of this study aside, if you can find a MORE sensible way for that filter to work, it’s worth doing.

  7. entropyGain says:

    I can see the HR consultants gearing up a new “Myers-Briggs” assessment to evaluate potential hires. They will charge HR groups $1000/candidate for this incredible value-add, and only $1500/candidate to evaluate internal chemists for possible promotion to management positions….Oh wait, not hiring any chemists? – they will help you prioritize your upcoming layoffs….
    Then again, it might have been interesting to know what the “good” chemists in the sample thought and how they differed from the rest of the herd. Guess nobody wanted to take the risk of defining a good chemist.
    (sorry, woke up feeling a bit snarky this morning).

  8. Hap says:

    1) Because guessing isn’t likely to be a successful drug discovery strategy? No one (even with combichem, DOS, and the like) can test everything, so you have to choose what to do. All properties aren’t equal – some are more important than others, and while their relative importance might vary from situation to situation, the importance of some will vary less than that of others. You have to make choices, and if those choices are no better than random, monkeys are cheaper than you are.
    2) Because everyone has knowledge biases, and it would help to figure out what those are – if there are systemic biases that lead one to spend time and money on (most often) dead ends, then you’d like to know (and maybe even when, because what’s a dead end for one project might not be for another).
    3) Assuming the world is too complex to understand reasonably is a recipe for not being around long enough to find out if the assumption is correct.

  9. MTK says:

    I didn’t read the paper, so take the following FWIW,
    what happens when you take the set of compounds that the largest number of individual chemists chose as favorable? Crowdsourcing in today’s lingo, I guess.

  10. Lizzard says:

    Thanks for engaging.
    1) I would suggest that guessing is a successful strategy if we have enough guesses. Isn’t that the whole point of HTS? Aren’t there enough examples of serendipity and surprises in this game? My point isn’t that we can test everything (that’s absurd), but that testing everything you can get your hands on is more likely to lead to success than testing only the (small) subset you think will work. So it would be wiser to spend your money figuring out how to practically get your hands on everything than to come up with reasons not to. I don’t mean this as an excuse to add insoluble greaseballs to the collection, but rather to remove target- and project-based biases.
    2) I agree, I think.
    3) I guess we’ve just had different experiences, because I feel exactly the opposite. I believe that assuming you understand a world that may be too complex for the human mind to comprehend is a recipe for disaster. I’ve just seen enough surprises, failures, and mistakes to approach the whole enterprise with humility. I guess it’s just the evolutionist in me, but I’d rather design a system that will spit out the answer than try to come up with the answer myself.

  11. Hap says:

    1) My assumption is that there is not enough available diversity to get a good sampling of chemical space, and that (depending on the people doing it) libraries can be a mess. I don’t want to concede uselessness, but there’s a lot of not-good-stuff in libraries, and GIGO. Looking for natural products is one way to search chemical space, and it selects for things active in biological systems, but it is not liked lately and leaves less room for analoging.
    I don’t think arbitrary cutoffs for anything are good, but at least some of the processes involved in drug treatment (metabolism and distribution) are held in common by lots of drug targets; knowing them might help limit the places to look. Of course, you sometimes have to break your own rules, but you have to have an idea (really greasy enzyme? active transport?) that it’s a good idea.
    2) I think knowing enough to design a system to do what you want could take less information than designing it directly, but not always. The problem with knowing something is that we can jump to thinking we know everything, and that we don’t know when our knowledge is limited sometimes, and rarely how limited and where. My reasoning was that when people decide that they can’t know something, they give up entirely. If they can create things to do what they want, that’s not true. I’m not certain which path has the potential to put us in deeper trouble.

  12. DCRogers says:

    I find myself a bit shocked at the vehemence and vitriol of the anti-computational commenters. The suggestion to “take all the resources we spend doing “Random Forest” analysis (WTF?)” (i.e., fire the computational workers) is not a sign of humility, but of hubris.
    WTF, indeed.

  13. RM says:

    The devil, as they say, is in the details:
    “I don’t mean this as an excuse to add insoluble greaseballs to the collection,” – Ah, yes. But how do you define “insoluble greaseballs”? Are you basing it on experimental solubility tests with experimentally relevant cutoffs (in which case you may need to buy a sample of each compound to test), or are you using some heuristic to evaluate it? If so, which heuristics are you using? Can they be automated objectively? If not, you’re back to the same situation the researchers were investigating, but with “insoluble greaseball” instead of “promising candidate”.
    Also, once you open the door to one exception, you open it to a bunch of them. While you don’t like insoluble greaseballs, Bob says “obviously, we don’t mean this as an excuse to add ubiquitous false positives.” Jenny says “obviously, we don’t mean this as an excuse to add compounds that will be PK dogs.” David says “obviously, we don’t mean this as an excuse to add [insert his pet peeve here].”
    Sure, it’d be great to be able to experimentally verify all the compounds, but realistically there’s going to be some sort of limit on which ones you try. Either because you’re simply not going to purchase everything that’s commercially available, or because you have only limited screening ability. Compounds are going to be eliminated (or not selected) for some reason, and it’s rose-colored glasses not to realize that the primary way it’s done currently is by chemists eyeballing it.
    What’s the point of this paper? Well, chemists eyeballing lists apparently aren’t any more complex than very simple computational models, aren’t necessarily any better than them, and are less consistent. My takeaway is that you want to encode your “obviously, we don’t mean this as an excuse to”s as objective computational tests, reduce reliance on eyeballing (as it’s highly variable over time and eyeballs), and test away on things that pass through.

  14. LeeH says:

    I agree with you. Don’t worry about the trolls – there are some in every discussion.
    That being said, I think that a weakness of the paper was the lack of a description of exactly how one would use the rules learned in the study. This invariably sets off a worst-case paranoid response in medicinal chemists’ minds (although admittedly, sometimes they ARE watching you). However, getting a handle on the biases of the medchem mind is not a bad goal, if the lessons learned are used sensibly.
    In a previous life, we had chemists look at structures and rank them on a scale of 1 to 3. The models generated from those results were used as one of the filters applied to potential compounds for our high-throughput screening deck. The result was that the chemists were happier with the hits coming out of our HTS efforts, and fewer compounds were rejected out-of-hand during hit triaging. It was a very worthwhile exercise, because it rejected structures that other filters just couldn’t catch.
    I think this study suffers from a few problems. First, if the rules are to be used during lead-op, I’m not sure the opinions about fragments are appropriate. The effect on the MedChem eye is different for a particular fragment if it sits on a larger structure, rather than if it sits alone. Conversely, size and lipophilicity are usually obvious in a larger compound, but are underwhelming in fragments. These exaggerations may be useful for proving a point, but I’m not convinced that the models are really practical as a result.
    I’m also not sure why it’s important to understand whether a chemist is actually aware of which particular factors affect their decisions, as long as they acknowledge the consequences of their decisions. That is, I’ve seen more projects fail from the systematic disregard of measured properties (such as solubility or metabolic rates), rather than the pernicious inclusion of a particular functional group.

  15. kristall says:

    I agree that everyone has knowledge biases. Biologists have them too.
    I want to know how biologists judge biological targets. I think the decisions are based on the biologists’ biases rather than on scientific relevance.

  16. bobthebuilder says:

    I wonder if left-handed chemists choose different molecules to right-handed chemists…

  17. Helical_Investor says:

    Odd paper. It appears to be looking for inherent biases, almost in the absence of any discriminating information.
    Specifically, we asked chemists to sort through ~4,000 chemical fragments over several sessions, and to identify those they deemed attractive for follow-up.
    Follow-up based on what? Unless I missed something in scanning the article, there was little in the way to discriminate the compounds other than the structures themselves. In such a case, I would not expect consensus on what might be ‘good follow-ups’, because there is nothing to base it on. I would expect some consensus on ‘don’t bother with these’, and there seems to have been some. I would expect most of the chemists to say ‘reduce the library thus, and get back to me when you have some data’.

  18. Anthony says:

    The next obvious step would be to rerun the test with a set that includes a sample from the existing pharmacopeia, a sample of compounds that looked promising through animal trials, and a sample of compounds that “unexpectedly” failed in earlier trials, then check the chemists’ picks against the actual outcomes.

  19. li says:

    Perhaps the paradigm is wrong? Use of chemical fragments assumes that a reductionist approach is, in 2012, the optimal strategy for compound selection. Perhaps the differences between chemists are due to the presumed targets (and their environment), which for each chemist will be based on their experience, research, and biases.
    The analogy that occurs to me is given 10 mechanical engineers and the following list:
    hammer, shop knife, slotted screwdriver, crescent wrench, pipe wrench, wood saw, drill, rope, chisel, torch, and hacksaw. Tell them they can only pick 3 and will be sent on their next job with only those 3. How much overlap should we/would we see? History isn’t a good predictor of the future, but it’s the best we have. We do not think “out of context”. If not given a context, we will use our own. I don’t see what this paper really tells us. Nothing useful. Given that confirmation bias and over-estimation of our own expertise are well established human traits (and the norm), it is actually quite interesting that there was even 8% concordance. (Possibly due to local history of the small pool sampled.)
    The other point is that there are many paths to knowledge. While some aspects of a problem may be too complex for us to grok, that is often corrected by a change in perspective. I think we should wait another century or two before we throw in the towel on understanding it. (And of course by that time there will be minds that can understand it, I would expect.)
