
In Silico

Neural Networks for Drug Discovery: A Work in Progress

There have been many attempts over the years to bring together large amounts of biological and drug activity data, winnow them computationally, and come up with insights that would not be obvious to human eyes. It’s a natural thing to imagine – there are so many drug targets in the body, doing so many things, and there’s an awful lot of information out there about thousands and thousands of compounds. There’s no way that a human observer could pick up on all the things that are going on; you need tireless software to sift through the piles.
The success record of this sort of thing has been mixed, though. Early attempts can now be set aside as underpowered, but what to make of current attempts at “virtual clinical trials” and the like? (We’re probably still underpowered for that sort of thing). Less ambitiously, people have tried to mine for new targets and new drug activities by rooting through the Big Data. But this sort of thing is not without controversy: many of us, chemists and biologists alike, don’t have the mathematical background to say if the methods being used are appropriate, or what their weaknesses and blind spots might be.
A new paper has gotten me thinking about all this again. It’s a collaboration between several researchers at Stanford and Google (press release here) on machine learning for drug discovery. What that means is that they’re trying to improve virtual screening techniques, using a very Google-centric approach that might be summed up as “MOAR DATA!” (That phrase does not appear in the paper, sadly).

In collaboration with the Pande Lab at Stanford University, we’ve released a paper titled “Massively Multitask Networks for Drug Discovery”, investigating how data from a variety of sources can be used to improve the accuracy of determining which chemical compounds would be effective drug treatments for a variety of diseases. In particular, we carefully quantified how the amount and diversity of screening data from a variety of diseases with very different biological processes can be used to improve the virtual drug screening predictions.
Using our large-scale neural network training system, we trained at a scale 18x larger than previous work with a total of 37.8M data points across more than 200 distinct biological processes. Because of our large scale, we were able to carefully probe the sensitivity of these models to a variety of changes in model structure and input data. In the paper, we examine not just the performance of the model but why it performs well and what we can expect for similar models in the future. The data in the paper represents more than 50M total CPU hours.

I end up with several trains of thought about this kind of thing. On track one, I appreciate that if virtual screening is going to work well, it needs to draw from the largest data sets possible, since there are so many factors at work. But on track two, I wonder how good the numbers going into this hopper really are, since I (like anyone else in the business) have seen some pretty garbagey screening numbers, both in person and in the literature. Piling more noise into the computations cannot improve them, even if your hardware is capable of dealing with landfills of the stuff. (The authors do note that they didn’t do any preprocessing of the data sets to remove potential artifacts. The data come from four main sources (see the paper, which is open access, for more), and only one of these has probably been curated to that level.) And that brings us to track three: my innate (and doubtless somewhat unfair) suspicions go up when I see a lot of talk about just how Incredibly Large the data sets are, and how Wildly Intense all the computations were.
Not to be too subtle about it, asking for a virtual screen against some target is like asking for a ditch to be dug from Point A to Point B. Can you dig the ditch, or not? Does it get to where it’s supposed to go, and do what a ditch is supposed to do? If so, then to a good approximation, I don’t care how many trained badgers you herded in for the job, or (alternatively) about the horsepower and fuel requirements of the earth-moving equipment you rented. If someone spends a lot of time telling me about these things (those engines! those badgers!) then I wonder if they’re trying to distract me from what really matters to me, which is the final product.
[Figure: “Multitask” – improvement of multitask over single-task models, broken out by target class]
Well, I’m willing to accept that that’s not a completely fair criticism, but it’s something that always crosses my mind, and I may not be alone in this. Let’s take a look at the ditch – uh, the virtual screening – and see how well it came out.

In this work, we investigate several aspects of the multitask learning paradigm as applied to virtual screening. We gather a large collection of datasets containing nearly 40 million experimental measurements for over 200 targets. We demonstrate that multitask networks trained on this collection achieve significant improvements over baseline machine learning methods. We show that adding more tasks and more data yields better performance. This effect diminishes as more data and tasks are added, but does not appear to plateau within our collection. Interestingly, we find that the total amount of data and the total number of tasks both have significant roles in this improvement. Furthermore, the features extracted by the multitask networks demonstrate some transferability to tasks not contained in the training set. Finally, we find that the presence of shared active compounds is moderately correlated with multitask improvement, but the biological class of the target is not.
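
The “multitask” part is the key architectural idea: a shared stack of hidden layers learns a common representation of the compound fingerprints, and each assay or target gets its own small output layer sitting on top. Here’s a minimal sketch of that layout in PyTorch – the framework, layer sizes, and names are all mine for illustration, not the authors’ actual large-scale training system:

```python
# Minimal multitask-network sketch: one shared "trunk", one small
# classification head per assay/target. All sizes and names are illustrative.
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    def __init__(self, n_features=1024, n_tasks=200, hidden=512):
        super().__init__()
        # Shared layers: every task's data helps train these weights.
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One two-class head (active / inactive) per task.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(n_tasks)]
        )

    def forward(self, x, task_idx):
        # Only the selected task's head is used for this batch,
        # but the trunk is shared across all tasks.
        return self.heads[task_idx](self.trunk(x))

# Usage: score a batch of 1024-bit fingerprints against task 17.
net = MultitaskNet()
fingerprints = torch.rand(32, 1024)
logits = net(fingerprints, task_idx=17)   # shape: (32, 2)
```

The point of sharing the trunk is that tasks with scarce data can borrow statistical strength from the data-rich ones, which is presumably what drives the “more tasks yields better performance” result quoted above.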

As the paper notes, this is similar to Merck’s Kaggle challenge of a couple of years back (and I just noticed this morning that they cite that blog post, and its comments, as an example of the skepticism that it attracted from some parts of the med-chem community). In this case, the object isn’t (yet) to deliver up a bunch of virtual screening hits, so much as it is to see what the most appropriate architecture for such a search might be.
One of the biggest problems with these papers (as this one explicitly states) is that the criteria used to evaluate the performance of these systems are not standardized. So it’s basically impossible to compare one analysis with another, because they’re scoring by different systems. But that graphic gives some idea of how things worked on different target classes. The Y axis is the difference between using multitask models (as in this paper) and single-task neural network models, and it shows that in most cases, most of the time, multitask modeling was better. But I note that almost every class has some cases where this doesn’t hold, and that (for reasons unknown) the GPCR targets seem to show the least improvement.
But what I don’t know is how well these virtual screening techniques compared to the actual screening data. The comparisons in the paper are all multi-task versus single-task (which, to be fair, is the whole focus of the work), but I’d be interested in an absolute-scale measurement. That shows up, though, in Table B2 in the appendix, where they use Jain and Nicholls’ “enrichment” calculation. Assuming that I’m reading these correctly, which may or may not be warranted, the predictions look to be anywhere from about 5% to about 25% better than random, depending on what false-positive rate you’re looking at, with occasional hops up to the 40% better range. Looking at the enrichment figures, though, I don’t see this model performing much better than the Random Forest method, which has already been applied to med-chem work and activity prediction many times. Am I missing something in that comparison? Or does this all have quite a ways to go yet?
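
For those who want that enrichment idea in concrete terms: at a chosen false-positive rate, you ask how many more actives the ranked list has recovered than random selection would have, with 1.0 meaning no better than random. Here’s a quick sketch of that calculation as I understand the Jain and Nicholls approach – synthetic data, and not necessarily the exact formula behind Table B2:

```python
# Rough sketch of ROC enrichment: true-positive rate at a chosen
# false-positive rate, divided by that rate. A random ranking scores
# about 1.0; higher is better. All data below are synthetic.
import numpy as np
from sklearn.metrics import roc_curve

def roc_enrichment(y_true, y_score, fpr_target=0.05):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    tpr_at_target = np.interp(fpr_target, fpr, tpr)
    return tpr_at_target / fpr_target

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # 0 = inactive, 1 = active
y_score = 0.3 * y_true + rng.normal(size=1000)    # weakly informative model
print(roc_enrichment(y_true, y_score, fpr_target=0.05))
```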

25 comments on “Neural Networks for Drug Discovery: A Work in Progress”

  1. Pete says:

    Enrichment becomes progressively less useful the further you get into lead optimization. I’m guessing that being able to measure free intracellular concentration of arbitrary compounds in live humans might be more useful.

  2. anon the II says:

    I don’t want to stir the pot too much, but aren’t these the aminal people?

  3. Pete says:

    I couldn’t help being reminded that my edition of March had an index entry for ‘ketene animals’…

  4. anonymous says:

    Off-topic, but any thoughts on Andy Myers’s newly funded company, Macrolide Pharmaceuticals? It’s based on the macrocyclization chemistry described in WO2014165792.
    Chemistry looks pretty, but do all those steps really scale if you need to make a ton? And, how does this approach get around the well-recognized issues of broad-spectrum activity, resistance, cell penetration, and reimbursement?

  5. hypnos says:

    Google is very good at scaling up things that work. However, virtual screening tends to work only up to a certain extent – and enrichment factors in the reported ranges are only marginally relevant in practical applications. If I want to reduce the size of my screening deck by a factor of 10, I can also do a little bit of property filtering followed by a diverse subset selection.
    Particularly when one considers the cost of the 50M CPU hours in comparison to a plain HTS campaign.

  6. anonao says:

    @hypnos Do you know how much it costs Google to do 50M CPU hours? And how much does an HTS cost (200–300k compounds to buy and screen)?

  7. milkshake says:

    Garbage in, garbage out. My former computational chemistry boss consulted for a company trying to develop “virtual docking” software. Every time he tried their package on a kinase that he was familiar with (kinases tend to have a deep, well-defined binding site that loves heterocycles, and lots of published data on ligands is available in the literature), the virtual docking churned out impressive lists of useless stuff while missing important interactions. And I am not even considering things like the plasticity of the binding site and the entropic effects…

  8. Anonymous says:

    @1, why? (interested)

  9. oldnuke says:

    @8 Just be thankful that these people are enamored with computers and not fissionable materials…
    Well, if a little bit is good, a lot more has got to be a lot better. Let’s just keep throwing more into the corner; maybe something will happen.

  10. LeeH says:

    With all due respect to Vijay Pande, who is an extremely bright guy, I do have some concern/questions about this paper.
    Why did they choose those particular data sets? For instance, why DIDN’T they use ChEMBL, which is arguably the most extensive and cleanest publicly available set on the planet? And why did they include a tox data set, which is close to unmodelable (where all the methods are more or less guaranteed to do poorly)?
    What were their criteria for labeling compounds as active or inactive? Was it a single cutoff, or was it chosen by target?
    What parameters did they use for the other data mining methods (such as Random Forests)? Did they optimize those parameters for this data set?
    Lastly, Derek, I think you really revealed a soft spot in the paper by highlighting that the comparison of deep learning to a standard neural network was front and center in the body of the paper, while the comparison to other methods was relegated to the appendix.

  11. Puhn Dunners says:

    Looks like the folks at JH are partnering with a much smaller company, and they believe they have something to patent – Macrolide Compounds for Liver Stage Malaria.

  12. Anon anon anon says:

    It’s concerning that they didn’t address redundancies in the data.
    Quoting from the paper: “Every model we trained performed extremely well on the DUD-E datasets (all models in Table 2 had median 5-fold-average AUCs ≥ 0.99), making comparisons between models on DUD-E uninformative.” Unfortunately, “every model” also includes the incredibly simple linear model of logistic regression.
    I don’t think anybody here believes that binding activity can be (nearly perfectly!) explained through a linear combination of molecule features. Therefore, these results are almost certainly due to some training artifact.

  13. Anon anon anon says:

    The “≥” in the previous comment should read “>=”. (Showed up fine in the preview!)

  14. hypnos says:

    @anonao: I don’t know what their internal cost is. But if you play around with https://cloud.google.com/products/calculator/, you end up with costs in the range of at least $500k–1M. (Much more for decent hardware.)
    I would guess that this is not that far away from the cost of a HTS campaign in medium to big pharma. (Assuming that you already have the compounds, of course.) And keep in mind that you still have to set up the assay and screen a lot of compounds anyway.

  15. Anonymous says:

    “The success record of this sort of thing has been mixed, though”
    No it hasn’t. It has been completely and utterly useless. Complete crap. And I don’t see this changing anytime soon with Big Data and the like. It’s a bottomless pit that will suck your balance sheet dry. Deeper and deeper into the rabbit hole.
    Why? Because there are 3 billion potential variables (genetic base pairs) which can vary independently and in combination, and only 7 billion people on the planet to observe – and that is assuming you could characterize and sequence every one of us.
    Testing more hypotheses with more data will indeed throw up more potential correlations, but a greater proportion of those correlations will turn out to be meaningless and irreproducible outcomes of pure chance. So ultimately, we still have to rely on experiments to collect more data and test each and every hypothesis in turn.
    Check out Family-Wise Error Rate and Bonferri’s Correction.
    Big Data is a myth.
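
For concreteness, the multiple-comparisons arithmetic that comment is pointing at works out roughly like this (a generic illustration with invented numbers, not anything from the paper):

```python
# Family-wise error rate: to keep the chance of even one false positive
# across m tests at alpha, each individual test has to clear a threshold
# of roughly alpha / m. The numbers below are invented.
m = 1_000_000        # hypothetical number of correlations mined from the data
alpha = 0.05         # desired family-wise error rate
print(alpha / m)     # 5e-08 -- a nominal p < 0.05 "hit" means very little here
```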

  16. Anonymous says:

    Oops, sorry for the typo, I meant Bonferroni’s Correction.

  17. Pande is a crackpot says:

    100% crackpot. Doing no valuable science. Should be fired.
    I could say more but why bother?

  18. Frank Adrian says:

    You do spread a slight misconception here when you say “Piling more noise into the computations cannot improve them”. In fact, adding small amounts of noise to a neural network system does improve the outcome, because it prevents a technical issue called overfitting.
    I’m not sure what the quality of the base data set is, but I think what you’re concerned about is that, if the raw data have too much noise (i.e., too many misleading studies), they’re just fitting noise anyway. This is a valid issue. Although it varies with the domain, the good news is that neural algorithms can be pretty robust with respect to noise – occasionally recovering signals that seemed too deeply buried to retrieve.
    That being said, I tend to think that if these guys are working for the Goog, they’re not slouches in either vetting the data or doing proper neural processing. They may well be a bit naive in their interpretation of the data, and may be overestimating the clinical usefulness (or even laboratory usefulness) of what they’ve found. I still think it’s a good experiment to see where we’re at WRT algorithmic screening. Mainly because I think that we’ll keep building bigger computers, and I don’t think we’re going to speed up evolution that much.
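
A concrete (and entirely generic) version of that noise-as-regularizer point – nothing here comes from the paper:

```python
# Input-noise injection as a regularizer: jittering the training inputs
# makes it harder for the network to memorize individual examples, which
# tends to reduce overfitting. Purely illustrative.
import torch

def add_input_noise(x, sigma=0.05):
    # Gaussian jitter, applied only during training (not at prediction time).
    return x + sigma * torch.randn_like(x)

batch = torch.rand(32, 1024)          # e.g. a batch of compound fingerprints
noisy_batch = add_input_noise(batch)  # what the network would actually see
```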

  19. Anon anon anon says:

    An honest question for Derek, #8 Milkshake, or anyone else with an opinion:
    What analysis could they have shown that would convince you that their technology isn’t “garbage” or “complete crap”? We’re scientists, right? *Some* level of evidence should be convincing. Is there any retrospective analysis that’s sufficient? How many prospective tests are needed?
    I mean, we’ve been seeing post after post here about the care that’s needed to keep down the error rate of HTS (do a Google search for “site:pipeline.corante.com PAINS”), yet people still spend lots of time and money on it.

  20. John-john says:

    ‘delta log odds mean AUC’ is the wonkiest metric I have ever seen – as far as I can tell they developed it for this work. They state that the log odds reduces the impact of outliers, so why didn’t they use the median?? *Very* strangely transformed data that is a key result of the paper makes me highly suspicious…
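
For what it’s worth, here is one reading of that transform – an interpretation for illustration, not a definition taken from the paper:

```python
# One reading of a "log odds" transform on AUC: push each AUC through the
# logit before averaging or differencing, which stretches out differences
# between already-high AUCs. Interpretation is illustrative only.
import math

def log_odds(auc):
    return math.log(auc / (1.0 - auc))

# The same 0.05 gain in raw AUC looks bigger near the top of the scale:
print(log_odds(0.90) - log_odds(0.85))   # ~0.46
print(log_odds(0.99) - log_odds(0.94))   # ~1.84
```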

  21. PeterC says:

    @15: Note that the costs you’re talking about only come into play during the training of the method. Once trained, one could basically run it on an iPhone and be done in a few seconds.

  22. PeterC says:

    @20: Easy. 1. assemble a list of compound-protein interactions and non-interactions for which you know the right answers; 2. send this list to the authors (without the right answers); 3. see the results.

  23. Anon anon anon says:

    @23 Where does one get that kind of data? I mean, how do I convince @8 Milkshake to bother running the science to test it, if (s)he’s already convinced that the technique must be crap?

  24. randomdude says:

    @20 A good start would be to post a few chemical structures. Why? The data used were not filtered for promiscuous binders, aggregators, dyes, etc., so what they’ve likely trained on is this kind of non-specific junk that hits across multiple assays. It may be valid, understandable science from that point of view (it can be an independent validation of PAINS-type compounds they’ve discovered, and that would be valuable to explore), but they certainly haven’t dramatically advanced the world’s ability to find actives in screening the way their blog post reads.
