Drug Assays

Virtual Compound Screening: The State of the Art

Here’s an interesting article from a former colleague of mine, Pat Walters, on virtual chemical libraries. Those, of course, are meant to fill in the (large, enormous, humungous) gap between “compounds that we have on hand to screen” and “compounds that we could screen if we actually had them”. That second group, if taken seriously, sends you right into contemplating just how many compounds there are if you could screen everything, and that sends you into a discussion of which “everything” you mean.

Is that “all the commercially available compounds” (a constantly changing set, mind you)? All the ones that have ever been reported in Chemical Abstracts? All the ones that would be pretty easy to synthesize, whether anyone’s ever done so or not? All the ones that are theoretically possible? For each of those sets you’re also going to be thinking about cutoffs – molecular weight, polarity and other “druglike” properties, and so on. Just how many compounds are we looking at, then?

If you’re into that sort of question, the paper has a valuable review of the various computational attempts that have been made to answer that last question (how many druglike molecules are there in total?) The answer you get will naturally depend on the assumptions you make and the method you use to estimate feasible molecular structures, but it seems that most estimates land within a comfortable ten orders of magnitude or so: somewhere between ten to the twentieth and ten to the thirtieth. How you feel about that level of precision depends on how abstract a thinker you are, I suppose. If you geared up a compound-archive system to handle some huge horrific unprecedented planetary compound collection and were then told “Oops. A bit off on the numbers. We’re actually going to come in at ten thousand times that size”, that might cause a bit of distress. But in the abstract, the difference between eleventy Godzillion and umpteen Godzillion compounds doesn’t matter as much: what matters is that all of these estimates are massive by any human standard. They’re massive physically, to the point of complete impossibility (the Earth itself weighs about six times ten to the thirtieth milligrams), and they’re massive computationally, too.
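The physical-impossibility point is easy to check with back-of-envelope arithmetic. Here’s a minimal sketch, assuming (my figure, not the post’s) one milligram per archived sample:

```python
# Back-of-envelope check: could a physical archive of "druglike space" exist?
# Assumes 1 mg per archived sample, a generous lower bound for a screening stock.

EARTH_MASS_MG = 5.97e30      # Earth's mass in milligrams (~5.97e27 g)
SAMPLE_MASS_MG = 1.0         # assumed mass per compound sample

for exponent in (20, 30):
    n_compounds = 10.0 ** exponent
    archive_mass = n_compounds * SAMPLE_MASS_MG
    print(f"10^{exponent} compounds -> {archive_mass:.1e} mg "
          f"({archive_mass / EARTH_MASS_MG:.1e} Earth masses)")
```

Even the low-end estimate at one milligram apiece is an absurd amount of material, and the high end works out to a noticeable fraction of the planet itself.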

The paper goes on to describe methods used to produce huge enumerated compound sets (such as GDB-17) and ones based on known, reliable reactions and commercial starting materials (even the latter start heading into the quadrillions pretty rapidly). And that brings up the nontrivial question of how you handle compound databases of this size. As Walters points out, clustering algorithms are just not efficient enough to deal with any list longer than about ten million compounds, and if your virtual library is equivalent to a million sets of ten million compounds each, well then. Comparing libraries/library designs to see what parts of chemical space they’re covering is no picnic, either – the most commonly used techniques start to gasp for breath at about the million-compound level, which is totally inadequate.
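To see why clustering hits a wall, the pairwise cost can be sketched directly; the comparison throughput below is an assumed, optimistic figure of my own, not a number from the paper:

```python
# Why clustering stalls: most algorithms need all pairwise similarities,
# and the number of pairs grows quadratically with library size.

def pairwise_comparisons(n: int) -> int:
    """Number of unique compound pairs for an n-compound library."""
    return n * (n - 1) // 2

for n in (10_000_000, 10_000_000_000):
    pairs = pairwise_comparisons(n)
    # Assume an optimistic 1e8 fingerprint comparisons per second on one core:
    seconds = pairs / 1e8
    core_years = seconds / (86400 * 365)
    print(f"{n:,} compounds -> {pairs:.1e} pairs, ~{core_years:.1e} core-years")
```

Ten million compounds is days of single-core work on the pairwise step alone; ten billion compounds pushes it into tens of thousands of core-years, which is why the standard techniques simply do not scale to these library sizes.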

Then there’s, well, the actual screening part. You can run 2-D structures through a screening algorithm pretty quickly, but do you want to? 3-D ones should be a lot better, but that immediately should make you ask “Which 3-D structure might you mean?”, which takes you into conformational searching, energy minimization and so on – and that’s if you’re screening against a static binding pocket, and how realistic is that? We have just slowed down the whole process by perhaps a millionfold merely by thinking about such issues. Oh, and just how accurate are your methods to estimate the binding interactions that you’re depending on for your scoring? What kind of false positive rate do you think you’ll get? Simply making the virtual library bigger does not exactly deal with these problems. Quite the opposite. As Walters shows in the paper, parallel processing on modern hardware can deal with some pretty hefty screens. But the expertise needed to set these things up properly is in short supply.
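The false-positive worry is really a base-rate problem, and a few lines of arithmetic make it concrete. The hit rate and false-positive rate below are illustrative assumptions of mine, not figures from the paper:

```python
# The base-rate problem: even a small false-positive rate swamps a huge
# virtual screen, because true actives are vanishingly rare.
# All numbers below are illustrative assumptions.

library_size = 1_000_000_000     # a 1-billion-compound virtual library
true_active_rate = 1e-6          # assume 1 in a million is a genuine hit
false_positive_rate = 0.01       # assume a 1% scoring false-positive rate

true_hits = library_size * true_active_rate
false_hits = (library_size - true_hits) * false_positive_rate
precision = true_hits / (true_hits + false_hits)
print(f"True hits flagged (best case): {true_hits:.0f}")
print(f"False positives flagged:       {false_hits:.0f}")
print(f"Fraction of flagged list that is real: {precision:.4%}")
```

Under those assumptions the flagged list is about ten million compounds, of which only around one in ten thousand is a real active. Shrinking the false-positive rate, not growing the library, is what moves that needle.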

One take-home from the paper is that (even with our current hardware) we’re reaching the scaling limits (on several fronts!) of the techniques that have brought us this far. A possible solution is the use of generative techniques, where a virtual library is built from the ground up based on on-the-fly estimates of compound activity. This has a lot of merit compared to the brute-force here’s-ya-zillion-compounds approach, but it’s just barely been put to the test so far, and may well require new machine learning algorithms to reach anything close to its full potential. Another take-home is that no matter what the technique, we’re going to have to hammer down the false positive rates in order to have any chance of usefully navigating Big Druglike Space – and that’s probably going to require new methods as well. The good news is that there is no reason why such things should not exist.

The overall lesson (in my opinion) is that every virtual screen is going to involve compromises of some sort, and you have to decide for yourself how many of those you’re willing (or forced) to make and how much weight you should give the results thereafter. In my own limited experience, people don’t like to hear talk like that, but there are a lot of true things that people don’t particularly enjoy contemplating, right?


14 comments on “Virtual Compound Screening: The State of the Art”

  1. Peter Kenny says:

The ability to generate large, relevant chemical spaces would certainly be valuable. The key virtual screening challenge is likely to be predicting chemical behavior (e.g. affinity, stability, solubility, pharmacokinetics, toxicity, etc.) of compounds that have no close structural analogs. Coverage of chemical space is a key consideration when selecting a representative subset of compounds for synthesis and I’m guessing that we’ll also need to learn how best to sample from large chemical spaces. I would agree that on-the-fly methods will be necessary and this suggests a different type of search parallelism in which each generated structure might be screened against many predictive models.

  2. exGlaxoid says:

Would it not be more practical to simply design, model, and optimize a “virtual pharmacophore” for a receptor – one with a few properties in space, such as a “hydrophobic space”, a positive or negative charge, and a hydrogen-bonding spot, arranged as a simple triangle with some vectors and sizes – and then screen out all molecules that don’t have that 2-D “model” first, in order to minimize the number of 3-D molecules to consider?

Or in simpler terms, if you are looking for an ER ligand, simply ignore all molecules that don’t have an acidic proton, a certain sized aromatic ring or two, and certain other interactions present. That reduces the screening set (whether virtual or real) to a more manageable size quickly. I think that is the use of a medicinal chemist, to help narrow the search to a practical scope. That obviously biases the search, but in many cases that proves reasonable. I know this is not an original idea, but given the number of real or virtual compounds to screen computationally, it seems a good way to start to me.

    1. Peter Kenny says:

      You can encode presence of structural features (e.g. aliphatic hydroxyl; phenolic hydroxyl; carboxylate hydroxyl) in bit strings for more rapid retrieval of structures although this works less well when defining generic atom types (e.g. hydrogen bond acceptor; anion) for pharmacophore searches. It’s also possible to encode the presence of pharmacophoric features (e.g. cation and anion separated by a distance) in bit strings.
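The bit-string prefilter described above can be sketched in a few lines of Python; the feature names and compounds are invented for illustration:

```python
# Sketch of a feature bit-string prefilter: each compound gets one bit per
# structural feature, and a query becomes a fast bitwise test.
# Feature names and compounds are invented for illustration.

FEATURES = ["aliphatic_OH", "phenolic_OH", "carboxylate", "aromatic_ring", "basic_N"]
BIT = {name: 1 << i for i, name in enumerate(FEATURES)}

def encode(features: set[str]) -> int:
    """Pack a compound's structural features into an integer bit string."""
    bits = 0
    for f in features:
        bits |= BIT[f]
    return bits

library = {
    "cmpd_A": encode({"phenolic_OH", "aromatic_ring"}),
    "cmpd_B": encode({"aliphatic_OH"}),
    "cmpd_C": encode({"carboxylate", "aromatic_ring", "basic_N"}),
}

query = encode({"aromatic_ring"})   # require an aromatic ring
matches = [name for name, bits in library.items() if bits & query == query]
print(matches)  # cmpd_A and cmpd_C pass the prefilter; cmpd_B is screened out
```

The point of storing features this way is that the per-compound test is a single AND-and-compare, which is why these prefilters stay fast even on very large libraries.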

    2. Mark says:

      Presence or absence (or counts!) of simple pharmacophoric features is pretty easy to do and very fast, so that’s certainly possible. However, once you’re talking about “vectors and sizes” then you need a conformation analysis, and that’s where things get nasty. Pat goes into the computational costs, which are non-trivial but certainly doable (he estimates CPU costs of only a thousand dollars or so for half a billion molecules, although I think that’s too low), but the sting in the tail is the data requirements: assume an average of 25 heavy atoms for your molecules, so 50 atoms in total, so 150 coordinates per conformer, times 100 confs per molecule[*], and you’re looking at north of 30TB of data: that’s doable, but the storage costs are going to be significantly more than your CPU costs. If your virtual library is 10 billion molecules, now you’re looking at 600TB of data, and storing that and backing it up is now costing you serious money.

      [*] Big conformer databases often use 10, but that’s grossly undersized. Why do the calculations when the result is going to be dominated by noise from undersampling?
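Mark’s storage arithmetic can be reproduced directly, using the same assumed figures (50 atoms per molecule, 100 conformers, single-precision coordinates):

```python
# Reproducing the storage estimate above: coordinates per conformer, times
# conformers per molecule, times library size.

BYTES_PER_COORD = 4      # single-precision x, y, z
ATOMS = 50               # ~25 heavy atoms plus hydrogens
CONFS_PER_MOL = 100      # assumed conformers per molecule

def library_bytes(n_molecules: int) -> int:
    coords = ATOMS * 3                   # 150 coordinates per conformer
    return n_molecules * CONFS_PER_MOL * coords * BYTES_PER_COORD

for n in (500_000_000, 10_000_000_000):
    tb = library_bytes(n) / 1e12
    print(f"{n:,} molecules -> {tb:,.0f} TB")
```

That comes out to 30 TB for half a billion molecules and 600 TB for ten billion, matching the figures in the comment.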

      1. Questor says:

        Is it valid to omit H-coordinates and infer/regenerate them as needed? This would be half of the atoms…

        1. Mark says:

          Yes, you can do this (except for e.g. hydroxyl hydrogens), and there’s some other tricks you can use to cut down the storage space (quantizing the coordinates, and so on). However, once you get above a billion molecules you are going to be using *lots* of disk space, and very large data sets bring challenges that you just don’t have to deal with otherwise.
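The rough effect of those tricks can be sketched with illustrative numbers: dropping explicit hydrogens roughly halves the atom count, and quantizing float32 coordinates to int16 halves the bytes per coordinate:

```python
# Illustrative savings from the tricks mentioned above: omit hydrogens
# (regenerate them on the fly) and quantize float32 coordinates to int16.
# Atom counts and byte sizes are assumptions for the sketch.

def conformer_bytes(atoms: int, bytes_per_coord: int) -> int:
    """Bytes needed to store one conformer's x, y, z coordinates."""
    return atoms * 3 * bytes_per_coord

naive = conformer_bytes(atoms=50, bytes_per_coord=4)     # all atoms, float32
trimmed = conformer_bytes(atoms=25, bytes_per_coord=2)   # heavy atoms, int16
print(naive, trimmed, naive / trimmed)  # 600 vs 150 bytes: a 4x saving
```

A fourfold saving helps, but as the comment says, it only buys a factor – once the library crosses a billion molecules, you are still deep into serious-storage territory.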

      2. David Koes says:

I disagree with 10 conformers being grossly undersized. As you generate more conformers for structure-based pharmacophore screening, you start to see the number of false positives increase at a greater rate than the number of true positives (this is assuming you are preferentially selecting the lower energy conformers). I’ve found the sweet spot to be somewhere between 10 and 25 conformers, which is the range I use in Pharmit (although not for the largest libraries, where I cut corners).

  3. Wavefunction says:

    Solubility prediction would especially be nice; so many compounds from screening fall out because of this single parameter.

  4. Chrispy says:

    Some of the most valuable hits I have had in real screening were surprising. The best were exosite-binding, non-competitive inhibitors that are not affected by the ligand. No one would have expected a priori that these would have any effect, and virtual screening would miss them all.

    I still cast a jaundiced eye on these virtual screening efforts; the people pushing them have been saying since at least the nineties that they could do it, and back then they really couldn’t. Some of the early successes against HIV protease were touted as virtual screening successes, when in fact they were the result of conventional screens against libraries of renin inhibitor hopefuls.

    Maybe things have changed, but I’ve still not seen any of these programs that can effectively deal with the solvent, and I’m told that water is a particularly complex one. So I’ll remain skeptical, but I really wish someone would prove me wrong.

  5. Np says:

This is the very reason why natural products (NPs) are good, validated starting points for developing and designing libraries for drug discovery, provided the supply problems are addressed via scalable synthetic routes amenable to analogue preparation. Long live NP-based drug discovery!

    1. Big Freddie says:

      Yeaaaah…but then this flops onto shore…
      and at that point you should begin chewing your leg off to get out of the trap..

  6. Anonymous says:

    Dear Derek,
As you embark on the virtual screening of millions of compounds for drug-like properties and biological activity, your rodent friends would very much appreciate it if you could, in parallel, start screening for virtual adverse effects.
    It would also protect your favorite discovery toxicologists from countless headaches and some mean, mean words in group meetings.
    Perhaps this publication could serve as a starting point:
    Machine Learning of Toxicological Big Data Enables Read-Across Structure Activity Relationships (RASAR) Outperforming Animal Test Reproducibility.
Toxicological Sciences, Volume 165, Issue 1, 1 September 2018, Pages 198–212.

  7. metaphysician says:

    This really sounds like one of those “Call me back after you’ve built a friendly artificial superintelligence” situations. . .

  8. The Greatest says:

    If you want to check out chemical biology, abpp, and basically anything that goes on near la jolla, those bad boys are the masterminds. They are basically god. And above all, a little someone special that Lowe likes, the great BFC.

Comments are closed.