Here’s an interesting article from a former colleague of mine, Pat Walters, on virtual chemical libraries. Those, of course, are meant to fill in the (large, enormous, humungous) gap between “compounds that we have on hand to screen” and “compounds that we could screen if we actually had them”. That second group, if taken seriously, sends you right into contemplating just how many compounds there are if you could screen everything, and that sends you into a discussion of which “everything” you mean.
Is that “all the commercially available compounds” (a constantly changing set, mind you)? All the ones that have ever been reported in Chemical Abstracts? All the ones that would be pretty easy to synthesize, whether anyone’s ever done so or not? All the ones that are theoretically possible? For each of those sets you’re also going to be thinking about cutoffs – molecular weight, polarity and other “druglike” properties, and so on. Just how many compounds are we looking at, then?
If you’re into that sort of question, the paper has a valuable review of the various computational attempts that have been made to answer that last question (how many druglike molecules are there in total?). The answer you get will naturally depend on the assumptions you make and the method you use to estimate feasible molecular structures, but it seems that most estimates are landing within a comfortable ten orders of magnitude or so: somewhere between ten to the twentieth and ten to the thirtieth. How you feel about that level of precision depends on how abstract a thinker you are, I suppose. If you geared up a compound-archive system to handle some huge, horrific, unprecedented planetary compound collection and were then told “Oops. A bit off on the numbers. We’re actually going to come in at ten thousand times that size”, that might cause a bit of distress. But in the abstract, the difference between eleventy Godzillion and umpteen Godzillion compounds doesn’t matter as much: what matters is that all of these estimates are massive by any human standard. They’re massive physically, to the point of complete impossibility (the Earth itself weighs about six times ten to the thirtieth milligrams), and they’re massive computationally, too.
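If you want to check that physical-impossibility claim yourself, here’s the back-of-the-envelope arithmetic (my own numbers, not the paper’s): suppose you archived a single milligram of each compound and compare that to the Earth’s mass.

```python
# Rough sanity check: mass of a "1 mg of everything" compound archive
# versus the mass of the Earth (~5.97e24 kg, i.e. ~6e30 mg).

EARTH_MASS_MG = 6e30

for exponent in (20, 25, 30):
    n_compounds = 10 ** exponent
    archive_mass_mg = n_compounds * 1.0      # one milligram per compound
    fraction_of_earth = archive_mass_mg / EARTH_MASS_MG
    print(f"10^{exponent} compounds at 1 mg each = "
          f"{fraction_of_earth:.0e} of the Earth's mass")
```

Even at the low end of the estimates, the archive is a measurable fraction of the planet; at the high end it’s a sixth of the Earth, which is another way of saying: nobody is ever making all of these.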
The paper goes on to describe methods used to produce huge enumerated compound sets (such as GDB-17) and ones based on known, reliable reactions and commercial starting materials (even the latter start heading into the quadrillions pretty rapidly). And that brings up the nontrivial question of how you handle compound databases of this size. As Walters points out, clustering algorithms are just not efficient enough to deal with any list longer than about ten million compounds, and if your virtual library is equivalent to a million sets of ten million compounds each, well then. Comparing libraries/library designs to see what parts of chemical space they’re covering is no picnic, either – the most commonly used techniques start to gasp for breath at about the million-compound level, which is totally inadequate.
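To see why clustering hits a wall around ten million compounds, consider that classical hierarchical clustering needs every pairwise similarity, which is N(N−1)/2 comparisons. A quick sketch (my own illustrative arithmetic, with an assumed – and optimistic – throughput of 10^8 fingerprint comparisons per second):

```python
# Why pairwise clustering doesn't scale: the comparison count grows as N^2.

def pairwise_comparisons(n: int) -> int:
    """Number of unique pairs among n compounds: n * (n - 1) / 2."""
    return n * (n - 1) // 2

ASSUMED_COMPARISONS_PER_SEC = 1e8   # optimistic assumption, not a benchmark

for n in (10**6, 10**7, 10**13):    # 10^13 ~ "a million sets of ten million"
    comps = pairwise_comparisons(n)
    years = comps / ASSUMED_COMPARISONS_PER_SEC / (3600 * 24 * 365)
    print(f"N = {n:.0e}: {comps:.1e} pairs, ~{years:.1e} years of comparisons")
```

At ten million compounds you’re already into days of pure similarity arithmetic before any actual clustering happens; at ten trillion, the pairwise step alone outlasts the age of the universe. Hence the need for approximate or sketch-based methods at those scales.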
Then there’s, well, the actual screening part. You can run 2-D structures through a screening algorithm pretty quickly, but do you want to? 3-D ones should be a lot better, but that immediately should make you ask “Which 3-D structure might you mean?”, which takes you into conformational searching, energy minimization and so on – and that’s if you’re screening against a static binding pocket, and how realistic is that? We have just slowed down the whole process by perhaps a millionfold merely by thinking about such issues. Oh, and just how accurate are your methods for estimating the binding interactions that you’re depending on for your scoring? What kind of false positive rate do you think you’ll get? Simply making the virtual library bigger does not exactly deal with these problems. Quite the opposite. As Walters shows in the paper, parallel processing on modern hardware can deal with some pretty hefty screens. But the expertise needed to set these things up properly is in short supply.
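That false-positive question deserves some arithmetic of its own. With made-up but not unreasonable numbers (these are my assumptions, not figures from the paper), even a very good false positive rate swamps the real actives when the library is huge and true hits are rare:

```python
# Illustrative hit-list arithmetic: rare true actives plus a small false
# positive rate still yields a hit list that is mostly junk.

def hit_list(n_compounds: float, true_hit_rate: float,
             fpr: float, tpr: float = 1.0):
    """Return (true hits, false hits, fraction of the hit list that's real)."""
    true_hits = n_compounds * true_hit_rate * tpr
    false_hits = n_compounds * (1 - true_hit_rate) * fpr
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision

# Assumed scenario: a billion-compound screen, 1-in-100,000 true actives,
# and an optimistic 0.1% false positive rate with perfect recall.
t, f, p = hit_list(1e9, 1e-5, 1e-3)
print(f"{t:.0f} real hits buried in {f:.0f} false ones "
      f"({p:.1%} of the hit list is real)")
```

Under those assumptions, roughly 99% of your “hits” are noise – and making the library bigger just scales both columns up together, which is exactly why bigger-alone doesn’t help.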
One take-home from the paper is that (even with our current hardware) we’re reaching the scaling limits (on several fronts!) of the techniques that have brought us this far. A possible solution is the use of generative techniques, where a virtual library is built from the ground up based on on-the-fly estimates of compound activity. This has a lot of merit compared to the brute-force here’s-ya-zillion-compounds approach, but it’s just barely been put to the test so far, and may well require new machine learning algorithms to reach anything close to its full potential. Another take-home is that no matter what the technique, we’re going to have to hammer down the false positive rates in order to have any chance of usefully navigating Big Druglike Space – and that’s probably going to require new methods as well. The good news is that there is no reason why such things should not exist.
The overall lesson (in my opinion) is that every virtual screen is going to involve compromises of some sort, and you have to decide for yourself how many of those you’re willing (or forced) to make and how much weight you should give the results thereafter. In my own limited experience, people don’t like to hear talk like that, but there are a lot of true things that people don’t particularly enjoy contemplating, right?