Where has machine learning made the most strides in recent years? A lot of people who are into this topic will tell you that it’s image processing, specifically recognition and differentiation of objects. You can see that just by how much more effective reverse image searching on the internet has become (to pick a free example), and in the you-gotta-pay-for-it world there are many more. Some of this has been driven by the amount of effort and funding that has gone into things like facial recognition for security (note the system that Apple uses for their latest round of iPhones), and some of it is due to digital images being a very algorithm-friendly form of data to work with. Compared to the shagginess of other data sets, a pile of image files is already wonderfully homogeneous stuff; some of the hardest curation has already been done up front by forcing everything into a grid of defined dimensionality, full of defined pixel chunks, each in a defined color space.
So if you ask me where the most likely applications of machine-learning algorithms in drug discovery will come from, I’d say this is a good place to bet. I’ve seen some impressive presentations on automated histopathology evaluation, and (moving back to the early stages of things) high-content cell imaging assays are another obvious target that a lot of work has gone into. This new paper is a perfect illustration. People have been running cell imaging assays for quite a while now, but often in a targeted fashion: does this compound affect the cell phenotype in the way that we’re looking for? Once in a while you’ll see a more open-ended one: which of these compounds do unusual things to the cells that the others don’t? This paper tries to fill in that gap. They’re looking at a screening set of over 500,000 compounds that had been run through a high-content cell imaging assay, and then looking at the company database (Janssen/J&J) to see what assays those compounds had been run through.
Turning the machine-learning algorithms loose on the combination of these two data sets allowed the software to construct fingerprints for all sorts of activities. I have just skipped over a good deal of work in the middle of the paper, of course, because the details of how you arrive at these relationships are the real business end of this business. I’m not competent to evaluate the techniques used here, but I can tell you that the authors tried the Bayesian matrix factorization method Macau, which throws everything into a high-dimensional vector space and can thus use vector operations to do the processing. That’s about the limit of my guidance. They compared this method with a deep-neural-network one: layers of simulated “neurons”, where the top layer takes in the imaging data and the ones below it process it according to their own algorithmic specializations and weightings, eventually (several layers later) giving you an output. This is a simplified version of what goes on in the visual cortex, with its various layers and groups of neurons that are sensitive to straight lines, contrast features, and so on – tricking the processing routines of such neurons is the basis for optical illusions.
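For readers who want a feel for the matrix-factorization idea, here is a minimal sketch, not the authors’ actual Macau implementation (which is Bayesian and uses imaging fingerprints as side information), just plain alternating least squares on a made-up sparse compound-by-assay matrix. Every name and number here is invented for illustration: each compound and each assay gets a short latent vector, and a missing activity is predicted as the dot product of the two.

```python
# Toy low-rank factorization of a sparse compound-by-assay activity matrix.
# NOT the actual Macau method -- a simplified stand-in to show the idea.
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_assays, rank = 200, 20, 3

# Hypothetical ground-truth activities with low-rank structure,
# of which only about half the cells were ever "measured"
true_c = rng.normal(size=(n_compounds, rank))
true_a = rng.normal(size=(n_assays, rank))
activity = true_c @ true_a.T
mask = rng.random(activity.shape) < 0.5      # True where a measurement exists

C = rng.normal(scale=0.1, size=(n_compounds, rank))  # compound latent vectors
A = rng.normal(scale=0.1, size=(n_assays, rank))     # assay latent vectors
reg = 0.1

for _ in range(20):                          # alternating least squares
    for i in range(n_compounds):             # refit each compound vector
        obs = mask[i]
        Ai = A[obs]
        C[i] = np.linalg.solve(Ai.T @ Ai + reg * np.eye(rank),
                               Ai.T @ activity[i, obs])
    for j in range(n_assays):                # refit each assay vector
        obs = mask[:, j]
        Cj = C[obs]
        A[j] = np.linalg.solve(Cj.T @ Cj + reg * np.eye(rank),
                               Cj.T @ activity[obs, j])

# Predictions for the *unmeasured* cells come straight from the factors
pred = C @ A.T
rmse = np.sqrt(((pred - activity)[~mask] ** 2).mean())
baseline = np.sqrt((activity[~mask] ** 2).mean())
print(f"held-out RMSE: {rmse:.3f} (predict-zero baseline: {baseline:.3f})")
```

The point of the exercise: once the factors are fit on the assays that *were* run, you get predicted activities for every compound/assay pair that never was, which is exactly the latent information the paper is mining.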
Setting a pretty stiff threshold for significance when comparing the imaging data to 535 different protein/target assays, the Macau method found high-quality models for 31 of them, while the DNN procedure found such models for 43. Note that the original imaging screen was for just one target (effects on the glucocorticoid receptor). This suggests that a single high-content imaging screen might be able to replace two or three dozen other screening campaigns. The team put this to the test, looking at a kinase target in an oncology program and another enzyme target in a CNS program. Comparing the compounds picked out by the imaging-screen models with the regular HTS hit rates, the first set was fifty-fold enriched in hits, and the second was 289-fold enriched. (It should be noted that both hit sets included a number of different chemical scaffolds.) This strongly suggests that they’re on to something.
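The fold-enrichment figures are just ratios of hit rates, and a quick sketch makes the arithmetic concrete. The counts below are entirely made up; only the shape of the 50-fold ratio mirrors the paper’s reported numbers.

```python
# Hypothetical counts to illustrate the fold-enrichment arithmetic.
def fold_enrichment(hits_selected, n_selected, hits_total, n_total):
    """Hit rate of the model-picked subset divided by the overall HTS hit rate."""
    return (hits_selected / n_selected) / (hits_total / n_total)

# e.g. if a full 300,000-compound HTS had a 0.1% hit rate, and 1,000
# model-selected compounds yielded 50 hits (a 5% rate), that is a
# roughly 50-fold enrichment with these invented numbers
print(fold_enrichment(50, 1_000, 300, 300_000))
```

So a 289-fold enrichment means the model-selected compounds hit at nearly three hundred times the background rate, which is why the prospective tests are the most convincing part of the paper.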
And as with every other discussion of stuff that’s driven by hardware and software, this is only going to get better and faster. I think the authors would agree with this summary: if you have a large enough compound library that already has a legacy of being tested across many different targets – that is to say, if you’re a big drug company – you should strongly consider unlocking the latent information in there via a selection of high-content cell imaging screens. Here’s how they put it:
We emphasize that our approach relies on a supervised machine-learning method, and hence activity measurements and imaging data must be acquired for a reasonably sized library of compounds to train the model. Subsequently, however, it seems possible to replace many particular assays with the potentially more cost-efficient imaging technology together with machine-learning models. Specifically, one would execute one or a few image screens on the library instead of dozens of target-focused assays. This raises an interesting question of the breadth of drug targets that could be accessed by imaging screens if the screen were optimized for that purpose, or if a combination of screens was used that explored multiple cell lines or sources, culturing conditions, staining of organelles, and/or incubation times.
Indeed. In fact, this paper could be just a first crude step compared to what’s possible along these lines. I very much look forward to what comes next. This whole thing, I might add, gives a good framework for thinking about the role of machine learning and automation in the drug business in general. Instead of our machine overlords coming to eat our lunch (“Unclear-input-we-do-not-require-‘lunch’”), this is an example of our tireless machine servants running off to do completely insane amounts of grunt work that would drive us nuts, and returning high-quality results that we can then use our human brains and powers of decision to act on. Bring ’em on.