
Drug Assays

Machine Learning On Top of DNA Encoded Libraries

DNA-encoded libraries are a technique that many in the field should be familiar with, and they’ve come up many times here on the blog. The basic idea is simple: you build up a set of small molecules with some relatively simple synthetic steps, with plenty of branching at each stage. As a thought experiment, this core, condensed with 200 different amines gives you these 200 heterocycles, and when you alkylate each of those with these 100 electrophiles you get 20,000 different products, and each of those coupled with these 200 carboxylic acids gives you four million final compounds – that sort of thing. How do you keep track of all of these compounds? Well, you do all of this chemistry starting with the core starting material attached to a short piece of DNA. And every time you split off into a new set of wells to branch out into a new compound, you use molecular biology techniques to add on another specific “bar code” of DNA to the far end of the sequence.
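The arithmetic in that thought experiment is easy to check. Here is a minimal sketch in Python, using the building-block counts quoted above; the function name `library_sizes` is just an illustrative choice, not anything from a real DEL toolkit.

```python
# Building-block counts from the thought experiment in the text.
N_AMINES = 200
N_ELECTROPHILES = 100
N_ACIDS = 200


def library_sizes(*cycle_sizes):
    """Return the number of distinct products after each synthetic cycle.

    Every product from the previous cycle is combined with every building
    block of the next one, so the library size is a running product.
    """
    sizes = []
    total = 1
    for n_building_blocks in cycle_sizes:
        total *= n_building_blocks
        sizes.append(total)
    return sizes


sizes = library_sizes(N_AMINES, N_ELECTROPHILES, N_ACIDS)
print(sizes)  # [200, 20000, 4000000]
```

Adding a fourth cycle of only 100 more building blocks would take that to 400 million, which is how these libraries get so large so quickly.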

By the time you get to those 4 million compounds every specific one of them should (in theory) have an oligonucleotide stretching off behind it that encodes its exact synthetic history. This one right here? That’s from condensation with amine #184, alkylation with electrophile #55, and then coupling with acid #103. Ah, but how do you read off said bar code? Through more molecular biology, specifically amplification through PCR and zippy modern sequencing. So in practice, you’re screening mixtures, huge mixtures, and using affinity to the target to select out the best binders. A standard way to do that is to produce your protein target of interest with a “His tag”, a run of histidine residues at one end, incubate it with the mixtures, and then run it over a solid support of nickel-containing resin. The His tag sticks to the Ni atoms (a standard protein trick), allowing you to gently wash all the non-binders and weak binders away until nothing but the tight-binding ligands are left. You knock the bound proteins and their ligands off under more strenuous conditions, read off the DNA barcodes, and there’s your hit set.
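To make the "bar code encodes the synthetic history" idea concrete, here is a toy sketch. The encoding is entirely an assumption for illustration: each cycle gets a fixed-length tag of four bases, read as a base-4 number. Real DEL tag designs are different (and more error-tolerant), and the names `encode_index` and `decode_barcode` are hypothetical.

```python
BASES = "ACGT"
TAG_LEN = 4  # assumption: 4 bases per cycle, enough for 4**4 = 256 building blocks


def encode_index(index):
    """Write a building-block number as a fixed-length DNA tag (toy scheme)."""
    if not 0 <= index < 4 ** TAG_LEN:
        raise ValueError("index does not fit in one tag")
    letters = []
    for _ in range(TAG_LEN):
        index, digit = divmod(index, 4)
        letters.append(BASES[digit])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(letters))


def decode_tag(tag):
    """Turn one tag back into its building-block number."""
    value = 0
    for base in tag:
        value = value * 4 + BASES.index(base)
    return value


def decode_barcode(barcode):
    """Split a sequenced barcode into per-cycle tags and decode each one."""
    if len(barcode) % TAG_LEN != 0:
        raise ValueError("barcode length is not a whole number of tags")
    tags = [barcode[i:i + TAG_LEN] for i in range(0, len(barcode), TAG_LEN)]
    return tuple(decode_tag(tag) for tag in tags)


# The compound from the text: amine #184, electrophile #55, acid #103.
barcode = encode_index(184) + encode_index(55) + encode_index(103)
amine, electrophile, acid = decode_barcode(barcode)
print(barcode, amine, electrophile, acid)
```

In this toy scheme the sequencing read alone is enough to recover all three synthetic steps, which is the whole point of the tag.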

I have made this sound rather simpler than it is in reality. As you can imagine, you have to establish that the assay is ready to be run for real in the first place (protein still binds ligands, ligands can be washed off/retained as advertised, nothing falls to pieces during the process, and so on). You would be well advised to run some controls while you do this, such as a separate experiment that reproduces this one except with pre-incubation with a known tight inhibitor of your target protein (assuming there is one!), or maybe one with a mutant form of it with a hosed-up binding site. If you don’t have a lot of data on your DNA-encoded compound library set, you could probably stand to run them against a totally different dummy protein just to start weeding out nonselective binders that stick to both targets, and so on. At the end of the process, working up the data is not light duty, either – in fact, some of the advances with doing DEL on a more routine basis have been due to the availability of better software. That lets you more quickly get an idea of overall hit rates, whether there were certain compound classes that tended to hit more often, to more easily compare against the control experiments, and so on.
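As a rough illustration of how a control experiment gets used, here is a sketch that compares sequencing read counts from the target selection with those from a control selection (a dummy protein, say) and keeps only the compounds that are enriched against the target. The pseudocount, the 5-fold threshold, and all the counts are made-up assumptions; real DEL data workup is considerably more involved than this.

```python
def selective_hits(target_counts, control_counts, min_ratio=5.0, pseudocount=1.0):
    """Return compounds whose read frequency is enriched in the target selection.

    Counts are normalized by total reads in each selection so that different
    sequencing depths can be compared. The pseudocount avoids dividing by zero
    for compounds never seen in the control. Both parameters are illustrative.
    """
    target_total = sum(target_counts.values())
    control_total = sum(control_counts.values())
    hits = {}
    for compound, target_reads in target_counts.items():
        control_reads = control_counts.get(compound, 0)
        target_freq = (target_reads + pseudocount) / (target_total + pseudocount)
        control_freq = (control_reads + pseudocount) / (control_total + pseudocount)
        ratio = target_freq / control_freq
        if ratio >= min_ratio:
            hits[compound] = ratio
    return hits


# Made-up read counts keyed by compound barcode.
target = {"cpd_A": 90, "cpd_B": 80, "cpd_C": 2, "cpd_D": 28}
dummy_protein = {"cpd_A": 3, "cpd_B": 85, "cpd_C": 2, "cpd_D": 110}

hits = selective_hits(target, dummy_protein)
print(hits)
```

Here `cpd_B` piles up against both proteins, so it looks like a nonselective binder and gets dropped, while `cpd_A` survives.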

A new paper has come out in J. Med. Chem. with another take on the whole technique, and it’s from a somewhat unexpected source: Google, along with the well-known DEL company X-Chem. I guess we shouldn’t be surprised, though: a good DEL screening campaign generates walloping amounts of structured data, which Google flies toward like a hummingbird spotting red flowers. The idea here is that the output of such a DEL screening effort could be used to train a machine-learning model – there should be a lot for such a thing to get its virtual teeth into – which could then be used to run a virtual screen through collections of compounds that weren’t in the DNA-encoded libraries to start with.

Why would one do such a thing, when there are hundreds of millions of compounds in the DELs themselves? Well, one limitation is that those libraries are deep – very deep indeed – but perhaps not particularly broad. Because of the synthetic routes to their members, they all share some core features, and have regions of space in their structures that will be quite well covered or quite sparsely covered. As an amateur astronomer, I think of DEL collections as individual galaxies or globular clusters in the universe of compound space. Each of them contains a huge number of stars (compounds), but they’re widely scattered from each other and (in the big picture) can each only cover a small area thoroughly. (This analogy pleased me very much when it occurred to me, and I’m very glad to have the chance to break it out for this article). Companies try to use a variety of chemistry schemes to prepare collections of such compound libraries, to at least make sure that each of these hundred-million-compound collections is thoroughly unlike the other, but compound space is very, very large (and large hunks of it are just not accessible with the kinds of chemistry that are compatible with all those DNA tags).

In this work, the X-Chem folks ran DEL screens against three well-studied (and quite disparate) protein targets: the soluble epoxide hydrolase enzyme, the kinase c-KIT, and the estrogen receptor (alpha). All of these have been exposed to multiple compound screens over the years, and plenty of med-chem work. Each of these targets got 30 to 40 DNA-encoded libraries run against them (!) and that data was fed into machine-learning algorithms. The Google people used both random-forest (RF) techniques and the (rather fancier) graph convolutional neural networks (GCNN) approach at this point. Then the models the machine-learning algorithms built were used in turn to screen a virtual library of 80 million compounds – that one was built, in practical fashion, by taking X-Chem’s compound library and subjecting the appropriate members of it to (computer-generated) amide coupling reactions. That meant that the virtual hits could be easily synthesized to check them out in the real world afterwards. The Mcule database was also screened as a source of purchasable compounds.

So how well did all this work out? The DEL screens were pretty straightforward – those are well-behaved proteins, and X-Chem certainly has plenty of experience running such assays. A broad selection of the hits from the screens (50,000 to 300,000 compounds, depending on the targets) were fed into the GCNN machine-learning model (note that the random-forest modeling part of the project used both positive and negative examples from the screens, all in the 10,000 to 100,000 compound range). The paper takes care to mention that no known ligands for these targets were included, nor any information about the structure of the binding sites. In fact, the modelers were blinded to the identity of the proteins that they were working on. Once compounds came out of the virtual screens, the authors note that “we avoided subjective selection of the most chemically attractive compounds from the predictions“, and that shows more self-control than most of us have. Another control experiment was done with the soluble epoxide hydrolase screen – they did two separate DEL selections, several months apart, and found that the data generated performed equivalently as a training set for the models.

Compounds from the virtual screens were filtered with a high-pass “everything over a specific cutoff” rule, then duplicate scaffolds were eliminated, followed by a Tanimoto similarity search filter to increase the structural diversity of the ordered/synthesized sets even more. The top-ranking compounds from each structural cluster were chosen. Some broad structural filters (nothing over 700 MW, no silicon atoms, PAINS filter) were applied, and everything that got back-ordered from the Mcule search was removed. For the newly synthesized compounds, scaffolds with multiple reactive groups were removed, for ease of synthesis and characterization. Finally, a chemist did inspect both the final Mcule and new-compound hit lists to remove things that looked too reactive or unstable (no word on how many compounds were stripped in that step).
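For readers who haven't seen one of these filter cascades, here is a simplified sketch of the first few steps. It is not the authors' pipeline: the score cutoff, the similarity threshold, the order of the steps, and the toy fingerprints (plain sets of "on" bit numbers) are all assumptions, and only the 700 MW ceiling comes from the text. A real workflow would use a cheminformatics toolkit for scaffolds and fingerprints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float            # model-predicted score (hypothetical scale, 0 to 1)
    mol_weight: float
    scaffold: str           # scaffold label, assumed to be precomputed
    fingerprint: frozenset  # toy fingerprint: the set of "on" bit positions


def tanimoto(a, b):
    """Tanimoto similarity of two bit sets: shared bits over total distinct bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select(candidates, score_cutoff=0.5, max_mw=700.0, max_similarity=0.6):
    """Score cutoff and MW filter, then scaffold de-duplication and diversity pick."""
    kept = [c for c in candidates
            if c.score >= score_cutoff and c.mol_weight <= max_mw]
    # Walk from best score down, so the top-ranked member of each group wins.
    kept.sort(key=lambda c: c.score, reverse=True)
    chosen = []
    seen_scaffolds = set()
    for cand in kept:
        if cand.scaffold in seen_scaffolds:
            continue  # duplicate scaffold
        if any(tanimoto(cand.fingerprint, prev.fingerprint) > max_similarity
               for prev in chosen):
            continue  # too similar to something already picked
        chosen.append(cand)
        seen_scaffolds.add(cand.scaffold)
    return chosen


pool = [
    Candidate("a", 0.95, 420.0, "S1", frozenset({1, 2, 3, 4})),
    Candidate("b", 0.90, 430.0, "S1", frozenset({1, 2, 3, 5})),     # same scaffold as a
    Candidate("c", 0.85, 390.0, "S2", frozenset({1, 2, 3, 4, 6})),  # Tanimoto 0.8 to a
    Candidate("d", 0.80, 810.0, "S3", frozenset({7, 8, 9})),        # over 700 MW
    Candidate("e", 0.70, 350.0, "S4", frozenset({7, 8, 10})),
    Candidate("f", 0.30, 300.0, "S5", frozenset({11, 12})),         # below score cutoff
]

picked = [c.name for c in select(pool)]
print(picked)  # ['a', 'e']
```

The greedy "best first, skip anything too similar" pass is one simple way to trade predicted potency against diversity; clustering first and taking the top member of each cluster is another.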

Now the key question: how many compounds did the extra ML screen unearth, and how did they compare to the original DEL screen? It looks like the machine learning models definitely enriched the hit sets. The soluble epoxide hydrolase screen provided the most hits, followed by estrogen receptor-alpha, and then c-KIT. This order seems to be correlated, in fact, with the number of positive training examples from the DEL screens that the ML algorithms had available. The sEH effort was also the most enriched: 30% of the hits from the GCNN ML screen were single-digit micromolar or better, and between 2.5 and 3% of them were 10 nM or better in an actual sEH assay, which is a very high hit rate indeed. For ER-alpha, 6% of the hits were below 10 µM, and for c-KIT it was 5%. In all cases, it appears that the GCNN model outperformed the random forest in coming up with potent compounds. The Mcule database of purchasable compounds, if you’re wondering, outperformed the “stuff we can make” amide collection.

I will, on the basis of these data, take the conclusion that you can indeed use the output of a DNA-encoded library screen to provide a truly useful start for a machine learning effort. It seems like that should work, and this paper would seem to provide proof that it does. But now let’s put these results into context.

DEL, as it’s currently used, is one of those techniques that usually gets hauled out for difficult targets, ones for which a standard-deck screen hasn’t provided anything that people can get traction with. (To be sure, that’s the fate of every new hit-generation method – virtual screening by any means, fragment-based drug discovery, and so on). What I’m saying is that in reality no one is likely to perform a DEL screen on targets like the ones listed in this paper, because interesting chemical starting points are easy to come by for them. That’s not a complaint about this paper; these are perfectly fine as test beds and to prove a point that (as mentioned) I think this work proves.

For real-world use, though, there could be a problem: as mentioned, the ML models return the most robust hit sets when they are given the most positive data from the DEL screens to build on. But the screens that return the most solid hits are the ones that are least likely to need any such assistance in finding new chemical matter. That’s a way of asking “When would I do this – add a machine-learning model on top of my DEL screen?” And to me, anyway, the answer is “When I’ve run a DEL screen – which means that I’ve probably run a more traditional screen already – and I still need starting points to work from.” That, though, looks like when these models are going to run into the most trouble.

This isn’t a new problem in virtual screening – far from it! There’s always a “To those that have, more shall be given” tendency in the field, for just these reasons. As a medicinal chemist, what I am looking for from a virtual screen is the same thing I’m looking for from a DEL screen (or a fragment screen): help with the hard stuff. That’s where I hope this work is going. I would be very interested in seeing how the GCNN machine-learning model performs with a less extensive positive hit set than is provided by soluble epoxide hydrolase. The good news is that I don’t expect, or even need, for that hit set to return 30% of its members under ten micromolar. Just a few single-digit or sub-micromolar compounds across different structural classes will do fine, as long as they’re real. Can it provide those? I’m sure X-Chem (and many other organizations) can easily provide a list of low-hit-rate screening targets – will the Google team be able to find out what happens when those go into the digital hopper?

19 comments on “Machine Learning On Top of DNA Encoded Libraries”

  1. no body says:

    In my organization DEL screens are often run prior to regular HTS because they are much faster and cheaper to run.

  2. exGlaxoid says:

    I worked with encoded libraries at a previous company, and was not very impressed with the real number of correct compounds in the actual library. Even when using only 36 x 36 compounds, there were far fewer of the desired compounds present than predicted. It turns out that in many cases, like where the R groups are bulky, many variants do not get formed in good yield, so of the 4 million compounds, I would guess that only a fraction of that is there, maybe 400,000, mostly for the smaller R groups or such. Plus you get a lot of very similar compounds in the end. Given that many pharma companies already have arrays of 1-5 million discrete compounds (that can certainly be tested in small pools if desired), I don’t know what the DEL arrays really add, other than big numbers.

  3. Mlman says:

    The Random Forest was not done by Google

  4. Barry says:

    DEL carries some big liabilities. “some relatively simple synthetic steps” as you say is limiting; lots of chemistry just isn’t compatible with the tag. And that tag is a huge solubilizing group that can mask some very un-druglike properties of the compound you’re screening.

  5. He FDR ddd says:

    My god science is stupid

    1. Mac Del-olds says:

      Billions and billions screened

  6. Christophe L Verlinde says:

    All authors declared competing financial interests.
    A healthy dose of scepticism may be warranted.

  7. Ilya says:

    How come this DNA fragment doesn’t interfere with the ligand binding? How many compounds are missed or ranked incorrectly due to the DNA part shielding moieties useful for binding?

    1. Barry says:

      In any tagging scheme, you must (tacitly or explicitly) designate a binding “head” and a solvent-exposed “tail” to your molecules. Because the phospho-sugar backbone of the DNA fragment is so lipophobic, it’s unlikely to fold onto your putative binding surface. But yes, those bases may do just that.

      1. ezra abrams says:

        not only lipophobic, but has lots of charge that can stick to proteins, and the bases themselves have, iirc, exocyclic amines that have a lot of free energy available
        not to mention stacking

  8. David Campbell says:

    I liked Derek’s analogy of the galaxy. Another fruitful analogy is Daniel Dennett’s “Library of Mendel/Library of Babel” analogy in Darwin’s Dangerous Idea.
    It’s in Chapter 5, and he entertainingly goes through some of the combinatorial arithmetic. For those who don’t know the book, it’s on Kindle at $14

  9. AlloG says:

    DEL is like attaching a refrigerator to aspirin. It makes da aspirin look good, but you still get da headache!

  10. Magrinho says:

    DELs are a prettier and less useful variant of technology developed&used by Affymax and Pharmacopeia in the 90s. Ultimately, Pharma does not need more leads – I heard this verbatim while at one of the aforementioned companies (followed by the sound of the mic hitting the floor). It’s still true.

    1. AnonymousDELResearcher says:

      Derek’s analysis is spot on. The question remains how much value can ML layer on against tough targets?

      But I have to take issue with the comments.

      “Pharma does not need more leads”

      Are you joking? We have tool compounds for 4% of the proteome. Not leads, tool compounds. Because chemists like those that have commented on this entry (not Derek) only want to use the same platforms and methodologies they were comfortable with 10 or 20 years ago, screening against the same targets. Hell, some of the comments are from the same people that were bad mouthing the tech 10 years ago. What’s next exGlaxoid, going to equate the Praecis purchase with Sirtris again and damn the fact that GSK considers it one of the best deals it ever made that happened to coincide with the worst? Finding leads against the same small subset of “chosen” targets is not the way to actually help patients.

      If you are getting crap from your DEL screens, it’s because you are crap at doing DEL. Build better libraries, screen them better, and follow them up better. The investment pays off. If you go with the lowest bidder using outdated designs and screening methodology then don’t be surprised your results are crap. It’s BS to claim the tech sucks just because you don’t know how to use it correctly.

      1. The emperors new combichem says:

        Ahh, the classic ‘you’re doing it wrong’

      2. exGlaxoid says:

        Yes, I would compare the purchase of Praecis as stupid as Sirtris. Show me ONE good from having come from encoded libraries where there was not a hit in normal screening, and I will buy your company. But GSK spent billions on magic beans over the last 10-20 years, while their stock has gone from $64 a share to $41 per share over the last 20 years while the S&P has tripled. The scientists at GW were getting 1-2 good compounds to the clinic each year with screening, targeted design, and common sense before they merged with SKB and started heading down the path of stupid decisions, just look at the stock chart to see that it was up until the merger and then down since then. And virtually every path away from simplicity and good science towards “big science” has been a disaster.

        I saw the original arrays from both Affymax (a huge load of crap, unlike the Affymetrix group which did make a great idea from the huge mess that was Affymax) and Praecis, and they were filled with triazines, peptide like amides, and crap. If GSK considers it a good buy, then that shows how bad management there is now. The problem is that the DEL companies were the ones who designed and made the libraries and most were terrible.

        Sadly, most of big pharma has gone from the likes of the old companies run by doctors and scientists to the current batch of mostly CPAs, salesmen, MBAs, and crooks, who spend their time and effort trying to make a fast buck, rather than letting the scientists do science and seeing where it goes. When I was there a long time back, we routinely got decent screening results from HTS, found good hits from further work, and optimized them to good candidates, but management was great at then killing the good ones, like selling Cialis off, cancelling most of the best cancer programs (many of which succeeded at competitors like AZ), and investing money in fish oil and resveratrol.

        I have seen pharma say that combined huge companies are better, then a few years later split into smaller, more agile, companies and back. They sell off the generics and consumer products (both of which seem like good businesses to me to keep) and then buy them back a few years later for vastly more than they sold them. So yes, if they like Praecis, then bad for them.

  11. DEL chemist says:

    A little publicized fact is the tendency of DEL to succeed mostly with proteins that also have good ligands available from ‘regular’ screening…

  12. Ezra Abrams says:

    I used to work at Praecis, back when its Lupron competitor was gonna bring in pots of money.

    Give Malcolm Gefter one thing, he was ahead of his time in making a big investment in that technology

    Pretty surprised no one has found a way to “mask” the oligos so they are chemically stable and less “sticky” (if you recall phage display, it was pretty easy to get polyTrp or polyTyr plastic binders).

  13. John Conway says:

    Thanks Derek for this article. We are having huge successes and your investigation has been spot on for us.
