There have been many attempts over the years to bring together large amounts of biological and drug activity data, winnow them computationally, and come up with insights that would not be obvious to human eyes. It’s a natural thing to imagine – there are so many drug targets in the body, doing so many things, and there’s an awful lot of information out there about thousands and thousands of compounds. There’s no way that a human observer could pick up on all the things that are going on; you need tireless software to sift through the piles.
The success record of this sort of thing has been mixed, though. Early attempts can now be set aside as underpowered, but what to make of current attempts at “virtual clinical trials” and the like? (We’re probably still underpowered for that sort of thing.) Less ambitiously, people have tried to mine for new targets and new drug activities by rooting through the Big Data. But this sort of thing is not without controversy: many of us, chemists and biologists alike, don’t have the mathematical background to say if the methods being used are appropriate, or what their weaknesses and blind spots might be.
A new paper has gotten me thinking about all this again. It’s a collaboration between several researchers at Stanford and Google (press release here) on machine learning for drug discovery. What that means is that they’re trying to improve virtual screening techniques, using a very Google-centric approach that might be summed up as “MOAR DATA!” (That phrase does not appear in the paper, sadly.)
In collaboration with the Pande Lab at Stanford University, we’ve released a paper titled “Massively Multitask Networks for Drug Discovery”, investigating how data from a variety of sources can be used to improve the accuracy of determining which chemical compounds would be effective drug treatments for a variety of diseases. In particular, we carefully quantified how the amount and diversity of screening data from a variety of diseases with very different biological processes can be used to improve the virtual drug screening predictions.
Using our large-scale neural network training system, we trained at a scale 18x larger than previous work with a total of 37.8M data points across more than 200 distinct biological processes. Because of our large scale, we were able to carefully probe the sensitivity of these models to a variety of changes in model structure and input data. In the paper, we examine not just the performance of the model but why it performs well and what we can expect for similar models in the future. The data in the paper represents more than 50M total CPU hours.
I end up with several trains of thought about this kind of thing. On track one, I appreciate that if virtual screening is going to work well, it needs to draw from the largest data sets possible, since there are so many factors at work. But on track two, I wonder how good the numbers going into this hopper really are, since I (like anyone else in the business) have seen some pretty garbagey screening numbers, both in person and in the literature. Piling more noise into the computations cannot improve them, even if your hardware is capable of dealing with landfills of the stuff. (The authors do note that they didn’t do any preprocessing of the data sets to remove potential artifacts. The data come from four main sources (see the paper, which is open access, for more), and only one of these has probably been curated to that level.) And that brings us to track three: my innate (and doubtless somewhat unfair) suspicions go up when I see a lot of talk about just how Incredibly Large the data sets are, and how Wildly Intense all the computations were.
Not to be too subtle about it, asking for a virtual screen against some target is like asking for a ditch to be dug from Point A to Point B. Can you dig the ditch, or not? Does it get to where it’s supposed to go, and do what a ditch is supposed to do? If so, then to a good approximation, I don’t care how many trained badgers you herded in for the job, or (alternatively) about the horsepower and fuel requirements of the earth-moving equipment you rented. If someone spends a lot of time telling me about these things (those engines! those badgers!) then I wonder if they’re trying to distract me from what really matters to me, which is the final product.
Well, I’m willing to accept that that’s not a completely fair criticism, but it’s something that always crosses my mind, and I may not be alone in this. Let’s take a look at the ditch – uh, the virtual screening – and see how well it came out.
In this work, we investigate several aspects of the multitask learning paradigm as applied to virtual screening. We gather a large collection of datasets containing nearly 40 million experimental measurements for over 200 targets. We demonstrate that multitask networks trained on this collection achieve significant improvements over baseline machine learning methods. We show that adding more tasks and more data yields better performance. This effect diminishes as more data and tasks are added, but does not appear to plateau within our collection. Interestingly, we find that the total amount of data and the total number of tasks both have significant roles in this improvement. Furthermore, the features extracted by the multitask networks demonstrate some transferability to tasks not contained in the training set. Finally, we find that the presence of shared active compounds is moderately correlated with multitask improvement, but the biological class of the target is not.
As the paper notes, this is similar to Merck’s Kaggle challenge of a couple of years back (and I just noticed this morning that they cite that blog post, and its comments, as an example of the skepticism that it attracted from some parts of the med-chem community). In this case, the object isn’t (yet) to deliver up a bunch of virtual screening hits, so much as it is to see what the most appropriate architecture for such a search might be.
One of the biggest problems with these papers (as this one explicitly states) is that the criteria used to evaluate the performance of these systems are not standardized. So it’s basically impossible to compare one analysis with another, because they’re scoring by different systems. But that graphic gives some idea of how things worked on different target classes. The Y axis is the difference between using multitask models (as in this paper) and single-task neural network models, and it shows that in most cases multitask modeling was better. But I note that almost every class has some cases where this doesn’t hold, and that (for reasons unknown) the GPCR targets seem to show the least improvement.
But what I don’t know is how well these virtual screening techniques compare to the actual screening data. The comparisons in the paper are all multitask versus single-task (which, to be fair, is the whole focus of the work), but I’d be interested in an absolute-scale measurement. That shows up, though, in Table B2 in the appendix, where they use Jain and Nicholls’ “enrichment” calculation. Assuming that I’m reading these correctly, which may or may not be warranted, the predictions look to be anywhere from about 5% to about 25% better than random, depending on what false-positive rate you’re looking at, with occasional hops up to the 40% better range. Looking at the enrichment figures, though, I don’t see this model performing much better than the Random Forest method, which has already been applied to med-chem work and activity prediction many times. Am I missing something in that comparison? Or does this all have quite a ways to go yet?
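For anyone who wants to poke at an enrichment figure themselves, here is a minimal sketch of one common way to compute it. The assumption (mine, not something confirmed against the paper's code) is that enrichment at a given false-positive rate means the fraction of actives recovered divided by the fraction of inactives recovered, at the score cutoff that lets through that share of inactives. The function name `roc_enrichment` and the toy scores are hypothetical, made up purely for illustration.

```python
def roc_enrichment(scores, labels, fpr):
    """Assumed definition: TPR / FPR at the cutoff admitting ~fpr of the inactives.

    scores: predicted activity scores (higher means "more likely active")
    labels: 1 for a measured active, 0 for a measured inactive
    fpr:    target false-positive rate, e.g. 0.01 or 0.05
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")

    # How many inactives may score at or above the cutoff (at least one).
    n_allowed = max(1, round(fpr * len(inactives)))
    cutoff = inactives[n_allowed - 1]

    # Recompute both rates at the cutoff so that tied scores are counted honestly.
    tpr = sum(s >= cutoff for s in actives) / len(actives)
    realized_fpr = sum(s >= cutoff for s in inactives) / len(inactives)
    return tpr / realized_fpr


# Toy data only: four "actives" and ten "inactives" with invented scores.
toy_scores = [0.9, 0.8, 0.3, 0.2,
              0.85, 0.7, 0.6, 0.5, 0.4, 0.35, 0.25, 0.15, 0.1, 0.05]
toy_labels = [1, 1, 1, 1] + [0] * 10

print(roc_enrichment(toy_scores, toy_labels, fpr=0.1))
```

Under this definition a ranking no better than chance comes out near 1, and larger values mean that the actives are piling up ahead of the inactives at the top of the list. Whether the paper's Table B2 uses exactly this form is something to verify against the paper itself before comparing numbers.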