A common question – well, it should be a common question, anyway – is “How do I make sure that this compound collection is a useful one to screen?” There are alternative forms that come down to the same issues – if you’re putting together a new focused screening set, what should be in it? What’s the minimum set you need to cover mechanism-of-action space with validated small-molecule probes? Is screening set A really that different from screening set B? And so on.
This new paper has some tools and suggestions for answering such questions, and it’s well worth a read. What’s more, it comes with a site that lets you use these tools yourself: www.smallmoleculesuite.org. This will try to assemble sets of compounds based on known selectivities and cellular phenotypes, on chemical structure diversity, stage of clinical development, etc. The authors take several reported kinase-inhibitor screening sets and analyze them: the 429-compound Selleck kinase inhibitor collection, GSK’s 362-member Published Kinase Inhibitor set, the 209-compound Dundee collection, the 266-compound EMD collection from Tocris, the 495-member LINCS collection from Harvard, and the 95-compound Pfizer-licensed collection (also from SelleckChem). Given the numbers of compounds involved, their chemical diversity, and their reported affinities for various kinases (which are scattered widely across the literature), doing a real comparison of these is actually quite a bit of work.
The LINCS and large SelleckChem collections overlap on about 50% of their compounds, and that’s actually the largest overlap between any two of the sets. On the other end, 350 of the compounds in the GSK set are unique to it (there are a lot of reported kinase inhibitors out there). The LINCS and Dundee collections are the most structurally diverse, while the GSK published set is the least (but that’s because it was deliberately designed to have clusters of compounds in it). Comparing structures is a pretty straightforward problem, though, while comparing biological activities is much less so. The ChEMBL database has a lot of assay data in it (both at the protein and cell levels), but the level of annotation in it varies widely compound-to-compound, as it would have to. Meanwhile, the CMAP database has a huge amount of gene expression data on exposure to various compounds (in various cell lines). But when you find correlated compounds in one of those data collections, they’re almost never correlated in the other (that happens only around 5% of the time), which is something to think about.
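For anyone who wants to run this sort of overlap comparison on their own collections, here is a minimal sketch in Python. It is not the paper's code. The library names and compound IDs are made-up placeholders, and it assumes each compound has already been reduced to one canonical identifier (an InChIKey, say), which is where most of the real work hides. It also reports overlap two ways (Jaccard index and fraction of the smaller set), since the figure you get depends on which definition you pick.

```python
from itertools import combinations


def overlap_stats(a: set[str], b: set[str]) -> dict:
    """Summarize how much two compound sets share, using two definitions."""
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        # Jaccard index: shared compounds over all distinct compounds.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Shared compounds as a fraction of the smaller collection.
        "frac_of_smaller": len(shared) / min(len(a), len(b)) if a and b else 0.0,
    }


def unique_to(name: str, libraries: dict[str, set[str]]) -> set[str]:
    """Compounds found in the named library and in no other library."""
    others = set().union(*(cpds for n, cpds in libraries.items() if n != name))
    return libraries[name] - others


# Hypothetical toy data: IDs stand in for canonical structure identifiers.
libraries = {
    "lib_A": {"c1", "c2", "c3", "c4"},
    "lib_B": {"c3", "c4", "c5", "c6"},
    "lib_C": {"c7", "c8"},
}

# Every unordered pair of libraries, keyed by the (sorted) pair of names.
pairwise = {
    (x, y): overlap_stats(libraries[x], libraries[y])
    for x, y in combinations(sorted(libraries), 2)
}

for pair, stats in pairwise.items():
    print(pair, stats)

for name in sorted(libraries):
    print(name, "unique compounds:", sorted(unique_to(name, libraries)))
```

Plain set arithmetic like this only catches exact identifier matches. Salt forms, stereoisomers, and vendor naming quirks will all make two copies of the same compound look different unless the structures are standardized first.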
Clinical research is another good selection tool, but one with its own biases and difficulties. The GSK set has 0% approved drugs in it, while the Pfizer-licensed collection is 57% approved drugs, but (as the authors note) the most selective compounds aren’t necessarily the ones that make it through the clinic the best. That’s been a big part of the kinase story over the last 25 years, well known to those who’ve followed it – clinical efficacy is often a combination of effects, while exquisite selectivity doesn’t always translate (to put it lightly) to real-world benefits across a larger population. Another thing to keep in mind is that the common annotations for kinase inhibitors, especially, are not very useful. “Compound X is a c-MET inhibitor” does not mean that it’s only a c-MET compound, or that that’s the enzyme that it hits with the highest affinity, or even that c-MET inhibition is the basis for its clinical effects. In fact, this paper shows that very little useful information can be obtained by knowing a given compound’s nominal target, when you compare that to the larger picture.
Compounds with low structural similarity tend to have low target profile similarity, which makes sense, but you can’t make the converse work: high structural similarity is of little or no use in predicting a similar target profile. Overall, there’s a lot of non-redundant information in all the various profiles, which is one of the things that prompted this work: an attempt to gather all of these into one place. The software at the link above can go through the public databases and try to assemble compound sets at various levels – from most coverage in the fewest compounds, to more deliberately redundant sets with clusters or ones that try to cover the most space regardless of overlap.
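To make the “most coverage in the fewest compounds” idea concrete, here is a small sketch of the textbook greedy set-cover heuristic. To be clear, this is a swapped-in illustration and not the authors’ algorithm, which also weighs selectivity, structural diversity, and clinical stage. The compound and kinase names are hypothetical, and it assumes you already have, for each compound, the set of targets it hits well enough to count.

```python
def greedy_cover(compound_targets: dict[str, set[str]]) -> list[str]:
    """Pick compounds one at a time, each time taking the compound that
    covers the most still-uncovered targets (greedy set cover)."""
    uncovered = set().union(*compound_targets.values())
    chosen: list[str] = []
    while uncovered:
        # Iterating in sorted order makes ties resolve alphabetically,
        # because max() keeps the first maximal item it encounters.
        best = max(
            sorted(compound_targets),
            key=lambda cpd: len(compound_targets[cpd] & uncovered),
        )
        chosen.append(best)
        uncovered -= compound_targets[best]
    return chosen


# Hypothetical toy profiles: compound -> targets it is assumed to hit.
profiles = {
    "cpd_1": {"KIN_A", "KIN_B", "KIN_C"},
    "cpd_2": {"KIN_C", "KIN_D"},
    "cpd_3": {"KIN_D"},
    "cpd_4": {"KIN_A"},
}

selected = greedy_cover(profiles)
print(selected)
```

The greedy approach is not guaranteed to find the smallest possible set, but it is simple and usually close. A deliberately redundant library (two or more structurally distinct compounds per target, say) would need a different stopping rule, such as counting a target as covered only after it has been hit by enough different chemotypes.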
It’s worth remembering that this is the situation with one of the most trampled-over areas of small-molecule drug discovery:
While the LSP-OptimalKinase collection represents an optimized way to construct a compound collection using commercially available compounds, it is interesting that only 12% of the kinome (63 proteins) can be targeted with two structurally diverse compounds that have (high) selectivity (as compared with 8% by the LINCS and <1% by the PKIS libraries); we therefore estimate that a truly optimal kinase library would consist of approximately 1,000 compounds. Kinases are one of the most heavily studied classes of protein targets, particularly in oncology, and it is noteworthy that a substantial portion of the kinome nonetheless remains unaddressed with available chemical tools. . .
GPCRs are the next most-addressed target class in the literature, and things start to fall off pretty rapidly after that. (I’m sure that there are a lot of nuclear receptor compounds as well, but those have such widely varying effects that no one should feel too confident). It’s been said before, and it’ll be said many times again, for a good long time to come: We Need More Good Chemical Probes. (We also need to stop using the crappy ones, which you’d think would be easier, but apparently isn’t). There are a lot of targets out there, and a lot of questions which can best be addressed by small-molecule perturbations of them, and a lot of gaps to fill in.