Here’s a paper that’s analyzing the popularity of different structural scaffolds in medicinal chemistry over time. The authors are using the ChEMBL database and looking for the core structures with the most work done on them, tracking changes over time (1998-2014). That’s a set of nearly 283,000 unique compounds to work with, but when you filter that down to scaffolds with more than ten compounds derived from them, present in at least five years’ worth of records, and larger than just a single six-membered ring, you knock that down to just 764 structural motifs.
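To make that filtering step concrete, here is a minimal sketch of the two countable cutoffs (more than ten compounds, at least five distinct years). The record layout, the function name, and the parameter names are all assumptions for illustration, not the authors' actual pipeline, and the "larger than a single six-membered ring" criterion is left out because it needs a real cheminformatics toolkit.

```python
from collections import defaultdict


def filter_scaffolds(records, compound_floor=10, min_years=5):
    """Keep scaffolds with more than `compound_floor` unique compounds
    that appear in at least `min_years` distinct years.

    `records` is an iterable of (scaffold, compound_id, year) tuples;
    this layout is hypothetical, not the ChEMBL schema.
    """
    compounds = defaultdict(set)
    years = defaultdict(set)
    for scaffold, compound_id, year in records:
        compounds[scaffold].add(compound_id)
        years[scaffold].add(year)
    return sorted(
        s for s in compounds
        if len(compounds[s]) > compound_floor and len(years[s]) >= min_years
    )
```

Using sets means a compound reported in several papers or years is only counted once per scaffold, which is presumably what "unique compounds" is meant to imply.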
The paper then analyzes these (over time) by total number of compounds reported, total number of assays reported, and the impact factor of the journals in which they appeared (these three measures turned out themselves not to be very well correlated, and journal impact factors turned out to be something of a mess). By far the three biggest in the enumerated-compound list are biphenyl, biphenyl ether, and chalcone, each with over 500 members. (I had a look at the “200 most popular drugs” posters from the Njardarson group’s web site, out of curiosity, and it’s true that you can find a lot of biphenyls buried in widely prescribed drug structures – biphenyl ethers, though, were largely confined to thyroid hormones. Chalcones... not so much.) As for trends over time, the great majority of scaffolds that showed hits across multiple years did not show any particular pattern or slope when plotted by year. Interestingly, the scaffold with the steepest positive trend in that data was the chalcones.
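One simple way to put a number on "pattern or slope when plotted by year" is an ordinary least-squares line through the yearly compound counts. The sketch below assumes exactly that; the paper may well have used a different trend test, and the function name and input layout are hypothetical.

```python
def yearly_slope(counts_by_year):
    """Least-squares slope of compound count versus year.

    `counts_by_year` maps year -> number of compounds reported that year
    (a hypothetical layout). A value near zero means no clear trend.
    """
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        raise ValueError("need at least two years to fit a slope")
    mean_x = sum(years) / n
    mean_y = sum(counts_by_year[y] for y in years) / n
    sxx = sum((y - mean_x) ** 2 for y in years)
    sxy = sum((y - mean_x) * (counts_by_year[y] - mean_y) for y in years)
    return sxy / sxx
```

A raw slope favors scaffolds that already have large counts, so any serious comparison across scaffolds would probably want to normalize first, or at least check significance.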
As for number of assays reported, the unexpected winner was daunorubicin-like compounds. That’s a set of only 21 distinct compounds in the database. Compounds in the paclitaxel family (82 of them total) fall into the same sort of category. In both cases, though, it’s the parent compound that seems to be at least partly driving this effect, because they get run through nearly every oncology assay on the list (often, I wonder, as control compounds?). Two other scaffolds (corresponding to sorafenib and curcumin) show this effect to an extreme – their high assay counts are basically only from their one famous member. About half the scaffolds showed statistical signs of being affected by this sort of thing.
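A crude check for this "one famous member" effect is to ask what fraction of a scaffold's assays comes from its single most-assayed compound. This is only an illustrative sketch with a hypothetical function name and input layout; it is not the statistical test the authors applied.

```python
def top_compound_share(assays_per_compound):
    """Fraction of a scaffold's assay records owed to its most-assayed compound.

    `assays_per_compound` maps compound_id -> assay count (hypothetical layout).
    Values close to 1.0 suggest the scaffold's assay total is driven by
    a single member, such as a parent drug used as a control.
    """
    total = sum(assays_per_compound.values())
    if total == 0:
        return 0.0
    return max(assays_per_compound.values()) / total
```

For a scaffold like the sorafenib or curcumin cases described above, you would expect this share to sit very near 1.0.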
The biological assay data were also searched through by target, and another trend showed up. The scaffold classes with larger numbers of compounds tended to be assayed more against unique protein targets, whereas the lower-represented ones tended to be more in cell-line assays only (an effect of cell-line profiling in oncology targets or of phenotypic screens in general, one has to assume). The chalcones are the exception, though, with both a high compound count and a high number of cell assays.
The authors take a bit of time to go into those. Looking at the types of targets and assays that the chalcones have been run against, they start out as cytotoxics/anticancer agents and then sort of wander all over the place. “It is tempting to speculate,” the paper says, “that a lot of effort was driven by the high synthetic feasibility of chalcone derivatives. In addition, researchers in the field might not have been aware of the unspecific responses of chalcones due to lack of systematic literature surveys.” Unfortunately, that’s my usual response when I see work like that, too, only I put it less diplomatically: papers that present just a long list of chalcones run through a cell assay, and there are a lot of them, are disproportionately from people who don’t really know what they’re doing, or who at the very least don’t have the facilities or expertise available to do anything better.
One thing to keep in mind is that the ChEMBL database is rather biased towards academic work compared to the wider universe of drug discovery. It would be a horrible undertaking to do something similar for the patent literature, but I’m very much willing to bet that the results would be different enough to be of interest. This is another example of the tricky part of using any sort of data-mining or (a step further) machine learning in this area: there’s a ferocious amount of data curation that has to be done up front before you can believe the results you get.