In Silico

Picking Diverse Compounds

Diversity deck, diversity set, diversity collection: most chemical screening efforts try to have some bunch of compounds that are selected for being as unlike each other as possible. Fragment-based collections, being smaller by design, are particularly combed through for this property, in order to cover the most chemical space possible. But how, exactly, do you evaluate chemical diversity?
There are a lot of algorithmic approaches, and a new paper helpfully tries to sort them out for everyone. Here’s the take-home:

We assessed both the similar behavior of the descriptors in assessing the diversity of chemical libraries, and their ability to select compounds from libraries that are diverse in bioactivity space, which is a property of much practical relevance in screening library design. This is particularly evident, given that many future targets to be screened are not known in advance, but that the library should still maximize the likelihood of containing bioactive matter also for future screening campaigns. Overall, our results showed that descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) correlate well in rank-ordering compounds, both within and between descriptor types. On the other hand, shape-based descriptors such as ROCS and PMI showed weak correlation with the other descriptors utilized in this study, demonstrating significantly different behavior.

One of the best-performing methods was Bayes Activity Fingerprints, a technique proposed a few years ago by a group at Novartis. That (at least to my non-computational eyes) doesn’t seem too surprising, since this new paper is trying to see how well diversity measures perform when compared to bioactivity space, and that earlier one specifically added in a measure to account for bioactivity space as well.
On the other hand, shape-based descriptors were problematic. One that turns up a lot is Principal Moments of Inertia (PMI), the scheme that separates compounds into rod-like, disk-like, and sphere-like shape families, but it and ROCS (based on overlaying molecular volumes) were definitely off in their own world when compared to the other descriptors. In fact, the authors found that there seemed to be no correlation at all between PMI diversity and diverse bioactivity, which should be worth thinking about. You’d apparently do better just picking things randomly than using PMI.
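For readers unfamiliar with the scheme: PMI binning reduces a 3D conformer to two normalized ratios of its principal moments of inertia and assigns it to the nearest corner of the standard PMI triangle. Here is a minimal pure-Python sketch, assuming the three moments have already been computed elsewhere (a real workflow would get them from a cheminformatics toolkit):

```python
# Sketch of PMI shape binning, assuming the three principal moments of
# inertia (i1 <= i2 <= i3) have already been computed from a 3D conformer.
def pmi_shape_class(i1, i2, i3):
    """Classify a molecule as rod-, disk-, or sphere-like from its
    normalized PMI ratios (npr1 = I1/I3, npr2 = I2/I3)."""
    npr1, npr2 = i1 / i3, i2 / i3
    # Corners of the normalized-PMI triangle:
    # rod (0, 1), disk (0.5, 0.5), sphere (1, 1)
    corners = {"rod": (0.0, 1.0), "disk": (0.5, 0.5), "sphere": (1.0, 1.0)}
    def dist2(name):
        cx, cy = corners[name]
        return (npr1 - cx) ** 2 + (npr2 - cy) ** 2
    return min(corners, key=dist2)

print(pmi_shape_class(0.05, 0.95, 1.0))  # elongated -> "rod"
print(pmi_shape_class(0.5, 0.55, 1.0))   # flat -> "disk"
print(pmi_shape_class(0.9, 0.95, 1.0))   # globular -> "sphere"
```

As the comments below note, this nearest-corner assignment is just a coarse binning of two continuous descriptors.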

13 comments on “Picking Diverse Compounds”

  1. Anonymous says:

    If only drug discovery processes were as diverse: We might find one that works much more efficiently.

  2. Hap says:

    Isn’t PMI what the Schreiber/Broad groups use to argue (in part) that their diversity-oriented libraries are more diverse than other more combinatorial libraries?

  3. Pete says:

     Pure shape-based methods don’t encode chemical information like an atom’s interaction type. Benzamidinium cation and benzoate anion are very similar in shape. Using PMI to classify molecules as rod-like, disk-like, and sphere-like is effectively a binning procedure, and raises the question of why not just use the PMI descriptors themselves. How one handles conformational space is a huge issue when using molecular shape-based descriptors.

  4. Anonymous says:

     Hap – yes, they have used PMI, among many other metrics, to compare “shape diversity” among small-molecule libraries. Schreiber himself has said repeatedly that he doesn’t know whether or not PMI is useful in this regard (and especially for predicting the overall “performance” of a library… which is another word worth defining), only that it’s data that is easy to generate and allows one to test hypotheses regarding the “shape” (as defined by PMI) of a molecule, despite its limitations. And my sense is that this analysis is usually retrospective and not actually used in the planning of a library, although I could be wrong.
    In the end, I’m happy to see these types of publications attempt to assess the assessments. As chemists, we intuitively know that shape is an important feature that affects a compound’s activity. But is PMI the best calculation of “shape”? Probably not.

  5. Hap says:

     I wasn’t trying to argue that it makes the DOS libraries he and the Broad do worthless. I was concluding that PMI’s use as a comparison measure and as evidence of diversity might not be all that helpful; if PMI doesn’t correlate with biological diversity (which is the real point of the libraries), then whatever PMI diversity is shown isn’t really relevant, other than in showing that the DOS libraries are different from others.

  6. It’s a nice paper. PMI has lately been a popular metric; the Broad group, along with David Spring’s group at Cambridge and Derek Tan’s group at Sloan-Kettering, have used it to design libraries, including ones with macrocycles. But since they haven’t published the results of this design in terms of hit rate, the jury is still out on whether PMIs actually give you biological diversity. The original 2003 paper by Sauer and Schwarz in JCIM is worth reading, though.
     The problem with 3D methods is that they tend to introduce a lot of noise and other complications. 2D methods often work better not because they are better per se, but because they strip out this extraneous noise. It’s interesting how diverse (pun intended) studies (including Shoichet’s Nature study on the similarity of drugs) have confirmed the utility of simple 2D fingerprints, especially ECFP4.
    We in the macrocycle field especially struggle with this whole issue of diversity (and especially to what extent building block diversity correlates with product diversity). Ultimately there is really no final answer on what the ‘correct’ diversity metric is for getting biological diversity, so even after hours of brainstorming you inevitably end up throwing a few in the mix, crossing your fingers and spinning around thrice.
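The 2D fingerprint comparisons mentioned in this comment generally come down to Tanimoto similarity on bit vectors. A minimal sketch, modeling fingerprints as plain Python sets of “on” bit positions (the bit values here are made up for illustration, not real ECFP4 output):

```python
# Tanimoto (Jaccard) similarity, the workhorse comparison for 2D
# fingerprints such as ECFP4. Fingerprints are modeled as sets of
# "on" bit positions.
def tanimoto(fp_a, fp_b):
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

fp1 = {3, 17, 42, 101, 256}
fp2 = {3, 17, 99, 256}
print(tanimoto(fp1, fp2))        # 3 shared bits of 6 total -> 0.5
print(1.0 - tanimoto(fp1, fp2))  # Tanimoto distance, used for diversity
```

Diversity selection then typically works on the complementary distance, 1 − Tanimoto.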

  7. Anonymous says:

    “Ultimately there is really no final answer on what the ‘correct’ diversity metric is for getting biological diversity, so even after hours of brainstorming you inevitably end up throwing a few in the mix, crossing your fingers and spinning around thrice.”
    I have tried spinning around only twice, but to no avail. I will have to try your ‘thrice’ method, thanks for the pointer! 🙂

  8. Anonymous says:

    There is of course no perfect metric for chemical diversity. Probably the best metric is the geometric mean of all other diversity metrics one can dream up!

  9. DCRogers says:

    One reason diversity is a hard problem (as suggested by continuing publications on this topic) is that random selection sets a pretty reasonable baseline, and is computationally easy and cheap.
    An analogous problem is the estimation of an integral over N-dimensional space from a set of sampled points. Certainly, optimally-positioned sample points are superior, but randomly-chosen points still give pretty damn good estimates.
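The integration analogy is easy to demonstrate numerically. A sketch (my own toy example, not from the paper) estimating the mean of a simple function over the unit square from uniformly random sample points:

```python
# Monte Carlo estimate of the mean of f(x, y) = x^2 + y^2 over the unit
# square. The true value is 2/3; randomly placed points get close to it
# without any clever positioning of the samples.
import random

def f(x, y):
    return x * x + y * y

random.seed(0)
n = 10_000
est = sum(f(random.random(), random.random()) for _ in range(n)) / n
print(abs(est - 2 / 3) < 0.02)  # random sampling lands near the truth
```

With 10,000 points the standard error here is under 0.005, which is the point: random selection is a surprisingly strong baseline.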

  10. @kayakphilip says:

    Interesting timing for this blog article.
     I’m working on a POC this break on using Spotfire to help with a similar issue. Note that Spotfire is only using the built-in, or a third-party, fingerprint to do the actual analysis of similarity/diversity.
     The question posed to us, however, was this: if you have, e.g., an SD file of a set of commercial compounds but can only afford to buy a given number, can you have something tell you which ones to buy once you tell it how many you can afford?
     I hope to have a video or something up in the new year, but it was a different way of thinking of things for me. I’d been thinking of chemical clustering as good for identifying things that were similar, but this use case seems much more interesting in some ways.
    Interesting thoughts re the different models, I’d have to research how many different models we could incorporate.
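One common answer to that “which k compounds can I afford” question is a greedy MaxMin picker: each round adds the candidate whose nearest already-picked neighbor is farthest away. A minimal sketch over hypothetical fingerprint sets (a real workflow would pull fingerprints from a toolkit and start from a random or chosen seed):

```python
# Greedy MaxMin diversity selection over fingerprints modeled as sets
# of "on" bits. Each iteration picks the candidate that maximizes the
# minimum Tanimoto distance to everything already selected.
def tanimoto_dist(a, b):
    common = len(a & b)
    return 1.0 - common / (len(a) + len(b) - common)

def maxmin_pick(compounds, k):
    picked = [0]  # seed with the first compound for simplicity
    while len(picked) < k:
        best = max(
            (i for i in range(len(compounds)) if i not in picked),
            key=lambda i: min(tanimoto_dist(compounds[i], compounds[j])
                              for j in picked),
        )
        picked.append(best)
    return picked

library = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 3, 4}, {20, 21}]
print(maxmin_pick(library, 3))  # -> [0, 2, 4], mutually dissimilar members
```

The greedy pass is O(n·k) distance evaluations, which is usually fine for catalog-sized lists.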

  11. DCRogers says:

    One more thing: an under-appreciated aspect of descriptor selection has to do with their dimensionality, independent of their content. The choice of dimensionality itself is a choice about the distance structure of your data space.
    Sets of whole-molecule descriptors (logP, molecular weight, etc) typically can be compressed using PCA to a handful of dimensions – useful for visualization, and with several choices of well-behaved distance metrics.
    But this amounts to building in an assumption – if the information of interest cannot be represented in low dimensions, the results will be limited in quality.
    High dimensions, on the other hand, are hard to visualize; worse, in high-dimensional spaces, our natural intuitions about distances break down. In short, from the perspective of any single sample, most other samples are ‘the same’ distance away – that is, far. Distance is basically uninformative other than in a tight near-neighborhood. (For an explanation of this, see Kanerva’s “Sparse Distributed Memory” book.)
    Such descriptors can be useful for de-cluttering local neighborhoods of near-duplicates, but that’s about it — measures of distance between different neighborhoods are effectively random.
     TL;DR — don’t read too much into the value of the content of a descriptor when many effects are explained mostly by its dimensionality.
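The distance-concentration effect described in this comment is easy to see numerically. A sketch (my own illustration) comparing the relative spread of pairwise distances between random points in 2 versus 500 dimensions:

```python
# In high dimensions, pairwise distances between random points bunch up
# around a common value, so "near" and "far" lose meaning. The relative
# spread (std / mean) of pairwise distances shrinks as dimension grows.
import math
import random

random.seed(1)

def spread(dim, n=200):
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    d = [math.dist(pts[i], pts[j])
         for i in range(n) for j in range(i + 1, n)]
    mean = sum(d) / len(d)
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return sd / mean

print(spread(2))    # low dimension: distances vary a lot
print(spread(500))  # high dimension: distances concentrate
```

The 500-dimensional spread comes out an order of magnitude smaller than the 2-dimensional one, which is exactly why distance is “uninformative other than in a tight near-neighborhood.”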

  12. Kelvin Stott says:

    Graph theory might help solve this problem.

  13. Der Hindenburg says:

    #3 von K is sage in these matters.

Comments are closed.