There’s an interesting paper out in PLoS One, called “Inside the Mind of a Medicinal Chemist”. Now, that’s not necessarily a place that everyone wants to go – mine is not exactly a tourist trap, I can tell you – but the authors are a group from Novartis, so they knew what they were getting into. The questions they were trying to answer on this spelunking expedition were:
1) How and to what extent do chemists simplify the problem of identifying promising chemical fragments to move forward in the discovery process?
2) Do different chemists use the same criteria for such decisions?
3) Can chemists accurately report the criteria they use for such decisions?
They took 19 lucky chemists from the Novartis labs and asked them to go through 8 batches of 500 fragments each and select the desirable compounds. For those of you outside the field, that is, unfortunately, a realistic test. We often have to work through lists of this type, for several reasons: “We have X dollars to spend on the screening collection – which compounds should we buy?” “Which of these compounds we already own should still be in the collection, and which should we get rid of?” “Here’s the list of screening hits for Enzyme Y: which of these look like useful starting points?” I found myself just yesterday going through about 350 compounds for just this sort of purpose.
They also asked the chemists which of a set of factors they used to make their decisions. These included polarity, size, lipophilicity, rings versus chains, charge, particular functional groups, and so on. Interestingly, once the 19 chemists had made their choices (and reported the criteria they used in doing so), the authors went through the selections using two computational classification algorithms, semi-naïve Bayesian (SNB) and Random Forest (RF). This showed that most of the chemists actually used only one or two categories as important filters, a result that ties in with studies in other fields on how experts in a given subject make decisions. Reducing the complexity of a multifactorial problem is a key step for the human brain to deal with it; how well this reduction is done (trading accuracy for speed) is what can distinguish an expert from someone who’s never faced a particular problem before.
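The paper doesn’t include code, but the core idea behind using classifiers this way is easy to sketch: train a model on a chemist’s yes/no picks, then ask which descriptors actually carry the signal. A minimal pure-Python version of that idea uses information gain from a single-threshold split, the same quantity decision-tree ensembles like Random Forest build on. All descriptor values and the pick pattern below are invented for illustration; the point is only that a chemist who filters on one property lights up one descriptor and not the others.

```python
import math

def entropy(labels):
    """Shannon entropy of a binary pick/reject label list."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(values, labels, threshold):
    """Reduction in entropy from splitting the picks at one descriptor threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_gain(values, labels):
    """Best information gain over midpoint thresholds between observed values."""
    pts = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return max(info_gain(values, labels, t) for t in candidates)

# Invented descriptors for eight hypothetical fragments.
descriptors = {
    "size":  [180, 220, 250, 280, 310, 340, 370, 400],  # MW, Da
    "clogp": [1.2, 2.8, 0.5, 3.1, 1.9, 0.8, 2.4, 3.5],
    "psa":   [40, 75, 55, 90, 60, 110, 35, 80],
}
# A hypothetical chemist who (knowingly or not) picks purely on size:
picked = [1, 1, 1, 1, 0, 0, 0, 0]

ranked = sorted(descriptors, key=lambda d: best_gain(descriptors[d], picked),
                reverse=True)
print(ranked[0])  # → size
```

With picks that track molecular weight perfectly, “size” gives an information gain of 1.0 (a clean split) while the other descriptors score well below it – which is exactly the kind of mismatch between stated and revealed criteria the SNB and RF classifiers surfaced.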
But the chemists in this sample didn’t all zoom in on the same factors. One chemist showed a strong preference away from the compounds with a higher polar surface area, for example, while another seemed to make size the most important descriptor. The ones using functional groups to pick compounds also showed some individual preferences – one chemist, for example, seemed to downgrade heteroaromatic compounds, unless they also had a carboxylic acid, in which case they moved back up the list. Overall, the most common one-factor preference was ring topology, followed by functional groups and hydrogen bond donors/acceptors.
Comparing structural preferences across the chemists revealed many differences of opinion as well. One of them seemed to like fused six-membered aromatic rings (that would not have been me, had I been in the data set!), while others marked those down. Some tricyclic structures were strongly favored by one chemist, and strongly disfavored by another, which makes me wonder if the authors were tempted to get the two of them together and let them fight it out.
How about the number of compounds passed? Here’s the breakdown:
One simple metric of agreement is the fraction of compounds selected by each chemist per batch. The fraction of compounds deemed suitable to carry forward varied widely between chemists, ranging from 7% to 97% (average = 45%), though each chemist was relatively consistent from batch to batch. . . This variance between chemists was not related to their ideal library size (Fig. S7A) nor linearly related to the number of targets a chemist had previously worked on (R2 = 0.05, Fig. S7B). The fraction passed could, however, be explained by each chemist’s reported selection strategy (Fig. S7C). Chemists who reported selecting only the “best” fragments passed a lower fraction of compounds (0.13±0.07) than chemists that reported excluding only the “worst” fragments (0.61±0.34); those who reported intermediate strategies passed an intermediate fraction of compounds (0.39±0.25).
Then comes a key question: how similar were the chemists’ picks to each other, or to their own previous selections? A well-known paper from a few years ago suggested that the same chemists, looking at the same list after the passage of time (and more lists!), would pick rather different sets of compounds. (Update: see the comments for some interesting inside information on this work.) Here, the authors sprinkled in a couple of hundred compounds that were present in more than one list to test this out. And I’d say that the earlier results were replicated fairly well. Comparing chemists’ picks to themselves, the average similarity was only 0.52, which the authors describe, perhaps charitably, as “moderately internally consistent”.
But that’s a unanimous chorus compared to the consensus between chemists. These had similarities ranging from 0.05 (!) to 0.52, with an average of 0.28. Overall, only 8% of the compounds had the same judgement passed on them by at least 75% of the chemists. And the great majority of those agreements were on bad compounds, as opposed to good ones: only 1% of the compounds were deemed good by at least 75% of the group!
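The paper doesn’t spell out its similarity metric in the text quoted here, but a standard choice for comparing two pick lists is the Jaccard index (overlap divided by union), and the 75%-agreement figure is just a vote count per compound. Here’s a small sketch with three hypothetical chemists and ten invented compound IDs – the names and numbers are mine, not the paper’s:

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two pick sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy pick lists (compound IDs 1-10) for three hypothetical chemists.
picks = {
    "chemist_A": {1, 2, 3, 4},
    "chemist_B": {3, 4, 5, 6},
    "chemist_C": {1, 2, 9},
}

pairwise = {(x, y): jaccard(picks[x], picks[y])
            for x, y in combinations(picks, 2)}

def consensus_fraction(picks, compounds, cutoff=0.75):
    """Fraction of compounds where >= cutoff of chemists agree (pick or reject)."""
    n = len(picks)
    agreed = 0
    for c in compounds:
        votes = sum(c in chosen for chosen in picks.values())
        if votes / n >= cutoff or (n - votes) / n >= cutoff:
            agreed += 1
    return agreed / len(compounds)

print(pairwise)
print(consensus_fraction(picks, range(1, 11)))
```

Even in this toy example, the only compounds that clear the 75% bar are the ones nobody picked – agreement clusters on the rejects, just as it did in the paper, where almost all of the 8% consensus calls were on “bad” compounds.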
There’s one other interesting result to consider: recall that the chemists were asked to state what factors they used in making their decisions. How did those compare to what they actually seemed to find important? (An economist would call this a case of stated preference versus revealed preference.) The authors call this an assessment of the chemists’ self-awareness, which, in my experience, is often a swampy area indeed. And that’s what it turned out to be here as well: “. . .every single chemist reported properties that were never identified as important by our SNB or RF classifiers. . .chemist 3 reported that several properties were important, but failed to report that size played any role during selections. Our SNB and RF classifiers both revealed that size, an especially straightforward parameter to assess, was the most important.”
So, what to make of all this? I’d say that it’s more proof that we medicinal chemists all come to the lab bench with our own sets of prejudices, based on our own experiences. We’re not always aware of them, but they’re certainly with us, “sewn into the lining of our lab coats”, as Tom Wolfe might have put it. The tricky part is figuring out which of these quirks are actually useful, and how often. . .