DNA-encoded libraries (DELs) are a technique that many in the field will be familiar with, and they’ve come up many times here on the blog. The basic idea is simple: you build up a set of small molecules through some relatively simple synthetic steps, with plenty of branching at each stage. As a thought experiment: this core, condensed with 200 different amines, gives you 200 heterocycles; alkylate each of those with these 100 electrophiles and you get 20,000 different products; couple each of those with these 200 carboxylic acids and you have four million final compounds – that sort of thing. How do you keep track of all of these compounds? Well, you do all of this chemistry starting with the core starting material attached to a short piece of DNA. And every time you split off into a new set of wells to branch out into a new compound, you use molecular biology techniques to add on another specific “bar code” of DNA to the far end of the sequence.
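The bookkeeping behind that thought experiment can be sketched in a few lines. This is a toy illustration only – the tag format (a letter plus a three-digit index per step) is invented for the example, and a real DEL tag is an actual oligonucleotide ligated on at each split:

```python
# Toy sketch of split-and-pool DEL bookkeeping. The "barcode" strings here
# stand in for real DNA codons; the step counts come from the thought
# experiment above.

N_AMINES, N_ELECTROPHILES, N_ACIDS = 200, 100, 200

def barcode(step_letter, index):
    # Fake a DNA codon as a short string like "A184"; in a real library
    # each split step ligates a distinct oligonucleotide onto the tag.
    return f"{step_letter}{index:03d}"

# The library size is just the product of the branch counts at each step.
library_size = N_AMINES * N_ELECTROPHILES * N_ACIDS
print(library_size)  # 4000000

# One compound's full tag encodes its synthetic history, e.g.
# amine #184, electrophile #55, acid #103:
tag = barcode("A", 184) + barcode("E", 55) + barcode("C", 103)
print(tag)  # A184E055C103
```

The point is that the tag grows step by step exactly in parallel with the chemistry, so every final compound carries a complete record of which branch it took at each split.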
By the time you get to those four million compounds, every one of them should (in theory) have an oligonucleotide stretching off behind it that encodes its exact synthetic history. This one right here? That’s from condensation with amine #184, alkylation with electrophile #55, and then coupling with acid #103. Ah, but how do you read off said bar code? Through more molecular biology, specifically PCR amplification and zippy modern sequencing. So in practice you’re screening mixtures, huge mixtures, and using affinity to the target to select out the best binders. A standard way to do that is to produce your protein target of interest with a “His tag”, a run of histidine residues at one end, incubate it with the mixtures, and then run everything over a solid support of nickel-containing resin. The His tag sticks to the Ni atoms (a standard protein trick), allowing you to gently wash away the non-binders and weak binders until only the tight-binding ligands are left. You knock the bound proteins and their ligands off under more strenuous conditions, read off the DNA barcodes, and there’s your hit set.
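Reading the tag back into a synthetic history is, at bottom, a parsing problem once the sequencing is done. Here is a minimal sketch under an invented tag layout (one letter for the step type, three digits for the reagent index – not the real encoding, which is actual DNA sequence):

```python
import re

# Hypothetical tag format: each synthetic step contributes a letter plus a
# three-digit reagent index, e.g. "A184E055C103". Real DEL decoding works on
# sequenced oligonucleotides, but the lookup logic is the same in spirit.

STEP_NAMES = {"A": "amine", "E": "electrophile", "C": "carboxylic acid"}

def decode(tag):
    """Return the (step, reagent number) history encoded in a tag string."""
    return [(STEP_NAMES[letter], int(idx))
            for letter, idx in re.findall(r"([AEC])(\d{3})", tag)]

history = decode("A184E055C103")
print(history)
# [('amine', 184), ('electrophile', 55), ('carboxylic acid', 103)]
```

That’s the whole trick: the barcode is a lossless record of the branching decisions, so a sequencing read maps straight back to one specific compound.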
I have made this sound rather simpler than it is in reality. As you can imagine, you have to establish that the assay is ready to be run for real in the first place (the protein still binds ligands, ligands can be washed off or retained as advertised, nothing falls to pieces during the process, and so on). You would be well advised to run some controls while you do this, such as a separate experiment that reproduces this one except with pre-incubation with a known tight inhibitor of your target protein (assuming there is one!), or maybe with a mutant form of it with a hosed-up binding site. If you don’t have a lot of data on your DNA-encoded compound library set, you could probably stand to run it against a totally different dummy protein as well, just to start weeding out nonselective binders that stick to both targets, and so on. At the end of the process, working up the data is not light duty, either – in fact, some of the advances in running DEL on a more routine basis have come from the availability of better software. That lets you more quickly get an idea of overall hit rates, see whether certain compound classes tended to hit more often, compare more easily against the control experiments, and so on.
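One common flavor of that data workup is comparing sequencing counts per barcode between the target selection and a control selection (no target, or a dummy protein). A toy version, with invented counts and an arbitrary fold-enrichment threshold:

```python
# Toy enrichment analysis: compare per-barcode sequencing counts from the
# target selection against a control run. All numbers are invented for
# illustration; real pipelines use proper statistics, not a bare ratio.

target_counts  = {"A184E055C103": 940, "A002E017C003": 35, "A190E001C104": 12}
control_counts = {"A184E055C103": 8,   "A002E017C003": 30, "A190E001C104": 10}

def fold_enrichment(tag, pseudo=1):
    # The pseudocount keeps barcodes absent from the control from
    # dividing by zero (and damps noise on low counts).
    return (target_counts.get(tag, 0) + pseudo) / (control_counts.get(tag, 0) + pseudo)

# Flag barcodes strongly over-represented in the target selection.
hits = sorted(tag for tag in target_counts if fold_enrichment(tag) >= 5)
print(hits)  # ['A184E055C103']
```

A barcode that shows up heavily with the target but not with the control is the signature you’re after; one that shows up everywhere is exactly the sort of nonselective binder the dummy-protein run is meant to weed out.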
A new paper has come out in J. Med. Chem. with another take on the whole technique, and it’s from a somewhat unexpected source: Google, along with the well-known DEL company X-Chem. I guess we shouldn’t be surprised, though: a good DEL screening campaign generates walloping amounts of structured data, which the company flies toward like a hummingbird spotting red flowers. The idea here is that the output of such a DEL screening effort could be used to train a machine-learning model – there should be a lot for such a thing to get its virtual teeth into – which could then be used to run a virtual screen through collections of compounds that weren’t in the DNA-encoded libraries to start with.
Why would one do such a thing, when there are hundreds of millions of compounds in the DELs themselves? Well, one limitation is that those libraries are deep – very deep indeed – but perhaps not particularly broad. Because of the synthetic routes to their members, they all share some core features, and have regions of structural space that will be either quite well covered or quite sparsely covered. As an amateur astronomer, I think of DEL collections as individual galaxies or globular clusters in the universe of compound space. Each of them contains a huge number of stars (compounds), but they’re widely scattered from each other and (in the big picture) can each only cover a small area thoroughly. (This analogy pleased me very much when it occurred to me, and I’m very glad to have the chance to break it out for this article). Companies try to use a variety of chemistry schemes to prepare collections of such compound libraries, to at least make sure that each of these hundred-million-compound collections is thoroughly unlike the others, but compound space is very, very large (and large hunks of it are just not accessible with the kinds of chemistry that are compatible with all those DNA tags).
In this work, the X-Chem folks ran DEL screens against three well-studied (and quite disparate) protein targets: the soluble epoxide hydrolase enzyme, the kinase c-KIT, and the estrogen receptor (alpha). All of these have been exposed to multiple compound screens over the years, and plenty of med-chem work. Each of these targets got 30 to 40 DNA-encoded libraries run against them (!) and that data was fed into machine-learning algorithms. The Google people used both random-forest (RF) techniques and the (rather fancier) graph convolutional neural network (GCNN) approach at this point. Then the models the machine-learning algorithms built were used in turn to screen a virtual library of 80 million compounds – that one was built, in practical fashion, by taking X-Chem’s compound library and subjecting the appropriate members of it to (computer-generated) amide coupling reactions. That meant that the virtual hits could be easily synthesized to check them out in the real world afterwards. The Mcule database was also screened as a source of purchasable compounds.
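To make the virtual-screening step concrete: the paper’s actual models are random forests and graph convolutional networks, which won’t fit in a blog-sized snippet. As a deliberately simple stand-in, here’s a maximum-Tanimoto-similarity scorer that ranks virtual-library members against the DEL hit set – not the paper’s method, just the general shape of “score new compounds by a model trained on the screen output,” with fingerprints faked as sets of integer feature IDs:

```python
# Stand-in for the RF/GCNN virtual screen: rank candidates by their best
# Tanimoto similarity to any compound in the DEL hit set. Fingerprints are
# represented as sets of "on" bits; all data below is invented.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as sets of bits."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def screen(virtual_library, del_hits):
    """Score each candidate by its best similarity to any DEL hit, best first."""
    return sorted(
        ((max(tanimoto(fp, hit) for hit in del_hits), name)
         for name, fp in virtual_library.items()),
        reverse=True,
    )

del_hits = [{1, 4, 9, 16}, {2, 4, 8, 16}]   # invented "positives" from the screen
virtual_library = {
    "cand_1": {1, 4, 9, 25},                # shares features with the first hit
    "cand_2": {3, 5, 7, 11},                # unrelated chemotype
}
ranked = screen(virtual_library, del_hits)
print(ranked)  # [(0.6, 'cand_1'), (0.0, 'cand_2')]
```

A real ML model generalizes well beyond raw similarity, of course – that’s the whole point of training one – but the input/output contract is the same: DEL hits in, ranked external compounds out.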
So how well did all this work out? The DEL screens were pretty straightforward – those are well-behaved proteins, and X-Chem certainly has plenty of experience running such assays. A broad selection of the hits from the screens (50,000 to 300,000 compounds, depending on the target) was fed into the GCNN machine-learning model (note that the random-forest modeling part of the project used both positive and negative examples from the screens, all in the 10,000 to 100,000 compound range). The paper takes care to mention that no known ligands for these targets were included, nor any information about the structure of the binding sites. In fact, the modelers were blinded to the identity of the proteins that they were working on. Once compounds came out of the virtual screens, the authors note that “we avoided subjective selection of the most chemically attractive compounds from the predictions”, and that shows more self-control than most of us have. Another control experiment was done with the soluble epoxide hydrolase screen – they ran two separate DEL selections, several months apart, and found that the data generated performed equivalently as a training set for the models.
Compounds from the virtual screens were filtered with a high-pass “everything over a specific cutoff” step, then duplicate scaffolds were eliminated, followed by a Tanimoto similarity filter to increase the structural diversity of the ordered/synthesized sets even further. The top-ranking compounds from each structural cluster were chosen. Some broad structural filters (nothing over 700 MW, no silicon atoms, PAINS filter) were applied, and everything that came back-ordered from the Mcule search was removed. For the newly synthesized compounds, scaffolds with multiple reactive groups were removed, for ease of synthesis and characterization. Finally, a chemist did inspect both the final Mcule and new-compound hit lists to remove things that looked too reactive or unstable (no word on how many compounds were stripped in that step).
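The score-cutoff-plus-diversity part of that triage can be sketched compactly. This is a minimal greedy version with invented thresholds and data, not the paper’s exact procedure: keep everything above a score cutoff, walking down the ranked list, and reject any candidate too Tanimoto-similar to one already picked (which naturally keeps the top-ranking member of each structural cluster):

```python
# Toy post-screen triage: high-pass score cutoff plus a greedy Tanimoto
# diversity filter. Thresholds and fingerprints (sets of bits) are invented.

def tanimoto(fp_a, fp_b):
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def triage(scored, score_cutoff=0.5, sim_cutoff=0.7):
    """scored: iterable of (model_score, name, fingerprint-as-set)."""
    picked = []
    for score, name, fp in sorted(scored, key=lambda t: t[0], reverse=True):
        if score < score_cutoff:
            break  # high-pass filter: everything below the cutoff is dropped
        if all(tanimoto(fp, p_fp) < sim_cutoff for _, _, p_fp in picked):
            picked.append((score, name, fp))  # top-ranked member of a new cluster
    return [name for _, name, _ in picked]

scored = [
    (0.90, "hit_a", {1, 2, 3}),
    (0.85, "hit_b", {1, 2, 3, 4}),   # Tanimoto 0.75 to hit_a -> filtered out
    (0.60, "hit_c", {7, 8, 9}),      # structurally distinct -> kept
    (0.40, "weak",  {1, 9, 17}),     # below the score cutoff -> dropped
]
selected = triage(scored)
print(selected)  # ['hit_a', 'hit_c']
```

The MW, element, and PAINS filters would slot in as further predicates on each candidate before it ever reaches this stage; the final human inspection, of course, doesn’t reduce to code.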
Now the key question: how many compounds did the extra ML screen unearth, and how did they compare to the original DEL screen? It looks like the machine-learning models definitely enriched the hit sets. The soluble epoxide hydrolase screen provided the most hits, followed by estrogen receptor-alpha, and then c-KIT. This order seems to be correlated, in fact, with the number of positive training examples from the DEL screens that the ML algorithms had available. The sEH effort was also the most enriched: 50% of the hits from the GCNN ML screen were single-digit micromolar or better, and between 2.5 and 3% of them were 10 nM or better in an actual sEH assay, which is a very high hit rate indeed. For ER-alpha, 6% of the hits were below 10 µM, and for c-KIT it was 5%. In all cases, it appears that the GCNN model outperformed the random forest in coming up with potent compounds. The Mcule database of purchasable compounds, if you’re wondering, outperformed the “stuff we can make” amide collection.
I will, on the basis of these data, draw the conclusion that you can indeed use the output of a DNA-encoded library screen to provide a truly useful start for a machine-learning effort. It seems like that should work, and this paper would seem to provide proof that it does. But now let’s put these results into context.
DEL, as it’s currently used, is one of those techniques that usually gets hauled out for difficult targets, ones for which a standard-deck screen hasn’t provided anything that people can get traction with. (To be sure, that’s the fate of every new hit-generation method – virtual screening by any means, fragment-based drug discovery, and so on). What I’m saying is that in reality no one is likely to perform a DEL screen on targets like the ones listed in this paper, because interesting chemical starting points are easy to come by for them. That’s not a complaint about this paper; these are perfectly fine as test beds and to prove a point that (as mentioned) I think this work proves.
For real-world use, though, there could be a problem: as mentioned, the ML models return the most robust hit sets when they are given the most positive data from the DEL screens to build on. But the screens that return the most solid hits are the ones that are least likely to need any such assistance in finding new chemical matter. That’s a way of asking “When would I do this – add a machine-learning model on top of my DEL screen?” And to me, anyway, the answer is “When I’ve run a DEL screen – which means that I’ve probably run a more traditional screen already – and I still need starting points to work from.” That, though, looks like exactly when these models are going to run into the most trouble.
This isn’t a new problem in virtual screening – far from it! There’s always a “To those that have, more shall be given” tendency in the field, for just these reasons. As a medicinal chemist, what I am looking for from a virtual screen is the same thing I’m looking for from a DEL screen (or a fragment screen): help with the hard stuff. That’s where I hope this work is going. I would be very interested to see how the GCNN machine-learning model performs with a less extensive positive hit set than soluble epoxide hydrolase provides. The good news is that I don’t expect, or even need, that hit set to return 30% of its members under ten micromolar. Just a few single-digit or sub-micromolar compounds across different structural classes will do fine, as long as they’re real. Can it provide those? I’m sure X-Chem (and many other organizations) can easily provide a list of low-hit-rate screening targets – will the Google team be able to find out what happens when those go into the digital hopper?