Scaffold Popularity

Here’s a paper that’s analyzing the popularity of different structural scaffolds in medicinal chemistry over time. The authors are using the ChEMBL database and looking for the core structures with the most work done on them, tracking changes over time (1998-2014). That’s a set of nearly 283,000 unique compounds to work with, but when you filter that down to scaffolds with more than ten compounds derived from them, present in at least five years’ worth of records, and larger than just a single six-membered ring, you knock that down to just 764 structural motifs.
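
For readers who want to see what that kind of filtering looks like in practice, here is a minimal sketch in Python. It is not the authors' code: the record layout (scaffold, compound ID, year, and a flag marking single six-membered rings) and the function name `popular_scaffolds` are assumptions made up for illustration, and the thresholds simply restate the ones quoted above.

```python
from collections import defaultdict


def popular_scaffolds(records, min_compounds=10, min_years=5):
    """Return scaffolds passing the three filters described in the post.

    `records` is a hypothetical iterable of tuples:
        (scaffold, compound_id, year, is_single_six_ring)
    The layout is an assumption for this sketch, not the ChEMBL schema.
    """
    compounds = defaultdict(set)  # scaffold -> distinct compound IDs
    years = defaultdict(set)      # scaffold -> distinct years with records
    trivial = set()               # scaffolds that are one six-membered ring

    for scaffold, compound_id, year, is_single_six_ring in records:
        compounds[scaffold].add(compound_id)
        years[scaffold].add(year)
        if is_single_six_ring:
            trivial.add(scaffold)

    return sorted(
        s for s in compounds
        if len(compounds[s]) > min_compounds   # "more than ten compounds"
        and len(years[s]) >= min_years         # "at least five years"
        and s not in trivial                   # bigger than a lone ring
    )
```

In real use the ring test would come from a cheminformatics toolkit rather than a precomputed flag; the flag just keeps the sketch self-contained.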

The paper then analyzes these (over time) by total number of compounds reported, total number of assays reported, and the impact factor of the journals in which they appeared (these three measures turned out not to be very well correlated with each other, and journal impact factors turned out to be something of a mess). By far the three biggest in the enumerated-compound list are biphenyl, biphenyl ether, and chalcone, each with over 500 members. (I had a look at the “200 most popular drugs” posters from the Njardarson group’s web site, out of curiosity, and it’s true that you can find a lot of biphenyls buried in widely prescribed drug structures – biphenyl ethers, though, were largely confined to thyroid hormones. Chalcones... not so much.) As for trends over time, the great majority of scaffolds that showed hits across multiple years did not show any particular pattern or slope when plotted by year. Interestingly, the scaffold with the steepest positive trend in that data was chalcone.
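
To make "slope when plotted by year" concrete, here is a small sketch of an ordinary least-squares slope over yearly compound counts. Whether the paper fits its trends this way is an assumption on my part; the function name `yearly_slope` and the input dictionary are hypothetical.

```python
def yearly_slope(counts_by_year):
    """Least-squares slope (compounds per year) for one scaffold.

    `counts_by_year` maps year -> number of compounds reported that year.
    This is a plain linear fit, assumed here for illustration; it is not
    necessarily the trend measure used in the paper.
    """
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        raise ValueError("need at least two years to fit a slope")

    mean_x = sum(years) / n
    mean_y = sum(counts_by_year[y] for y in years) / n

    # Standard formulas: slope = Sxy / Sxx
    sxx = sum((y - mean_x) ** 2 for y in years)
    sxy = sum((y - mean_x) * (counts_by_year[y] - mean_y) for y in years)
    return sxy / sxx
```

A scaffold with no particular pattern comes out with a slope near zero, and a steadily rising one like the chalcones comes out clearly positive.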

As for number of assays reported, the unexpected winner was daunorubicin-like compounds. That’s a set of only 21 distinct compounds in the database. Compounds in the paclitaxel family (82 of them total) fall into the same sort of category. In both cases, though, it’s the parent compound that seems to be at least partly driving this effect, because they get run through nearly every oncology assay on the list (often, I wonder, as control compounds?). Two other scaffolds (corresponding to sorafenib and curcumin) show this effect to an extreme – their high assay counts are basically only from their one famous member. About half the scaffolds showed statistical signs of being affected by this sort of thing.
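
One simple way to spot a scaffold whose assay count is carried by a single famous member is to ask what share of the assays belongs to the most-assayed compound. The sketch below does only that. It is a crude dominance ratio, not the statistical test the paper used, and the function name and input layout are hypothetical.

```python
def top_member_share(assays_per_compound):
    """Return (compound, share) for the most-assayed member of a scaffold.

    `assays_per_compound` maps compound name -> number of assays.
    The share is that compound's fraction of all assays on the scaffold.
    Returns (None, 0.0) if there are no assays at all.
    """
    total = sum(assays_per_compound.values())
    if total == 0:
        return None, 0.0
    top = max(assays_per_compound, key=assays_per_compound.get)
    return top, assays_per_compound[top] / total
```

A share close to 1.0 would flag the sorafenib or curcumin sort of situation, where the scaffold's apparent popularity is really one compound's.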

The biological assay data were also searched through by target, and another trend showed up. The scaffold classes with larger numbers of compounds tended to be assayed more against unique protein targets, whereas the lower-represented ones tended to be more in cell-line assays only (an effect of cell-line profiling in oncology targets or of phenotypic screens in general, one has to assume). The chalcones are the exception, though, with both a high compound count and a high number of cell assays.

The authors take a bit of time to go into those. Looking at the types of targets and assays that the chalcones have been run against, they start out as cytotoxics/anticancer agents and then sort of wander all over the place. “It is tempting to speculate”, the paper says, “that a lot of effort was driven by the high synthetic feasibility of chalcone derivatives. In addition, researchers in the field might not have been aware of the unspecific responses of chalcones due to lack of systematic literature surveys.” Unfortunately, that’s my usual response when I see work like that, too, only I put it less diplomatically: papers that present just a long list of chalcones run through a cell assay, and there are a lot of them, are disproportionately from people who don’t really know what they’re doing, or who at the very least don’t have the facilities or expertise available to do anything better.

One thing to keep in mind is that the ChEMBL database is rather biased towards academic work compared to the wider universe of drug discovery. It would be a horrible undertaking to do something similar for the patent literature, but I’m very much willing to bet that the results would be different enough to be of interest. This is another example of the tricky part of using any sort of data-mining or (a step further) machine learning in this area: there’s a ferocious amount of data curation that has to be done up front before you can believe the results you get.

16 comments on “Scaffold Popularity”

  1. Schinderhannes says:

    most all of the time I am very much impressed by your careful analysis of interesting papers.
    But the fact that you actually review this pointless counting exercise is disappointing me.
    A pile of manure contains more useful information.
    Statistics from last year in the Flamingo to predict my luck in the Bellagio today are more reliable than this.
    One look at the scaffolds in figure 4 made me laugh out loud and kept me from reading.
    What a waste of time. (Sorry for the authors but also for you…)

    1. Heliastes says:

      I’m sure Derek will lament the loss of your steady viewership on the basis of one disagreeable post. Personally, I take the review for what it is: a fairly novel and interesting look at the popularity of scaffolds over the past couple decades and a stab at perhaps why some scaffolds are becoming more popular than others. It’s easy to pan some of these scaffolds as frivolous or dumb until they show up in your screens as hits and you have to explain to people why they can/cannot pursue these for med chem.

    2. Iced says:

      Ya bro totally instead we should just all discuss some excellent or at least above average photoredox papers 🔦

      1. anon says:


    3. RunDMChem says:

      Rough day at the casino? Geez.

    4. SaltBae says:


  2. RBW says:

    Regarding the chalcones, phlorizin led to the development of the gliflozins. That may not explain all of their popularity, or justify it, but could have contributed.

  3. neo says:

    “…the tricky part of using any sort of data-mining or (a step further) machine learning in this area: there’s a ferocious amount of data curation that has to be done up front before you can believe the results you get.” I have always been fascinated by the typical industrial researchers: they can make claims as wrong as this and still believe that they know what they are talking about. I guess this is a good skill to have when you often have to convince people and did not have time to gather enough evidence.

    1. Derek Lowe says:

      Counterexamples? I would be glad to be wrong about this.

      1. Sophist says:

        There are no counterexamples. Data mining of patent literature without significant curation of the source documents is nothing more than a word count. Of course, the more often a word is used, the more important it must be. Geesh.

      2. WildCation says:

        Text mining patents tends to give you a lot of very common solvents and reagents, and structure mining patents tends to give you lots of erroneous structures. I’ve linked an article about one such gem. As a non-industry chemical data curator, I approach data sets mined from patents with great caution.

    2. Chester says:

      Machine learning of any kind ultimately follows one of the basic tenets of computing: garbage in, garbage out. Feed the smartest program a bunch of crap data and it will spit out crap correlations back at you.

  4. MFernflower says:

    Can we please not ever talk about curcumin and related compounds ever again? I still have nightmares!

    1. Me says:

      Obviously that means it is time for another “god-forsaken hyperexplosive” or “devil-forsaken fluorine compound” article!

      1. Me says:

        Oi!! You stole my identity!!

  5. John Court says:

    Seems to me that counting the number of times a scaffold goes through an assay and using the journal as a weighting factor is counting events that are way too random. Need more relevant data to count! Problem is there is a lot less of that around, such as validated clinical data. The numbers there are lower and not as much of a wow factor to report. Keep up the good work on the blog, Derek!
