I noticed this piece on Slate (originally published in New Scientist) about Kaggle, a company that’s working on data-prediction algorithms. Actually, it might be more accurate to say that they’re asking other people to work on data-prediction algorithms, since they structure their tasks as a series of open challenges, inviting all comers to submit their best shots via whatever computational technique they think appropriate.
PA: How exactly do these competitions work?
JH: They rely on techniques like data mining and machine learning to predict future trends from current data. Companies, governments, and researchers present data sets and problems, and offer prize money for the best solutions. Anyone can enter: We have nearly 64,000 registered users. We’ve discovered that creative data scientists can solve problems in every field better than experts in those fields can.
PA: These competitions deal with very specialized subjects. Do experts enter?
JH: Oh yes. Every time a new competition comes out, the experts say: “We’ve built a whole industry around this. We know the answers.” And after a couple of weeks, they get blown out of the water.
I have a real approach-avoidance conflict with this sort of thing. I tend to root for outsiders and underdogs, but naturally enough, when they’re coming to blow up what I feel is my own field of expertise, that’s a different story, right? And that’s just what this looks like: the Merck Molecular Activity Challenge, which took place earlier this fall. Merck seems to have offered up a list of compounds of known activity in a given assay, and asked people to see if they could recapitulate the data through simulation.
Looking at the data that were made available, I see that there’s a training set and a test set. They’re furnished as a long run of molecular descriptors, but the descriptors themselves are opaque, no doubt deliberately (Merck was not interested in causing themselves any future IP problems with this exercise). The winning team was a group of machine-learning specialists from the University of Toronto and the University of Washington. If you’d like to know a bit more about how they did it, here you go. No doubt some of you will be able to make more of their description than I did.
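For anyone curious what an entry in one of these competitions even looks like in practice, here’s a minimal baseline sketch in Python. To be clear, this is emphatically not the winning team’s method, and the file and column names are placeholders for whatever the actual Kaggle files used; it just shows the shape of the exercise, fitting an off-the-shelf model to those opaque descriptor columns and checking it against held-out activity data.

```python
# Minimal baseline sketch for a Merck-challenge-style data set (hypothetical
# file and column names, not the actual Kaggle files or the winning approach).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

train = pd.read_csv("activity_training_set.csv")      # hypothetical file name
X = train.drop(columns=["MOLECULE", "Act"]).values    # the opaque descriptors
y = train["Act"].values                               # measured activity

# Hold out part of the training data for validation, since the real
# test-set activities were withheld by the organizers.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(X_tr, y_tr)

r, _ = pearsonr(y_val, model.predict(X_val))
print(f"Validation R^2 (squared Pearson correlation): {r**2:.3f}")
```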
But I would be very interested in hearing some more details from the other end of things. How did the folks at Merck feel about the results, with the doors closed and the speakerphone turned off? Were they better or worse than what their own people could have come up with? Are they interested enough in the winning techniques to have approached the high-ranking groups with offers to work on virtual screening? Because that’s what this is all about: running a (comparatively small) set of real molecules past a target, and then switching to simulation to screen as much small-molecule chemical space as you can computationally stand. Virtual screening is a longstanding goal of computational drug design, for good reason: it’s always promising, always cost-attractive, and sometimes quite useful. But you never quite know when that utility is going to show up, and when the whole exercise is going to be another wild goose chase.
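Stripped of all the details, that workflow looks something like the sketch below. Again, this is a hedged illustration only, with made-up file and column names, and an off-the-shelf model standing in for whatever method you’d actually trust with your screening budget.

```python
# Sketch of a virtual-screening workflow: train on the (comparatively small)
# set of assayed compounds, then score a much larger untested library and
# shortlist the top-ranked molecules. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

assayed = pd.read_csv("assayed_compounds.csv")    # real molecules, measured activity
library = pd.read_csv("virtual_library.csv")      # untested molecules, same descriptors

descriptor_cols = [c for c in assayed.columns if c.startswith("D_")]

model = GradientBoostingRegressor(random_state=0)
model.fit(assayed[descriptor_cols], assayed["Act"])

# Score every virtual compound and keep the top 1% for purchase or synthesis.
library["predicted_activity"] = model.predict(library[descriptor_cols])
shortlist = library.nlargest(max(1, len(library) // 100), "predicted_activity")
print(shortlist[["MOLECULE", "predicted_activity"]].head())
```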
So, how good was this one? That also depends on the data set that was used, of course. All of these algorithm-hunting exercises depend crucially on their training sets and on how well those sets reflect the real data. Never was “Garbage In, Garbage Out” more appropriate. If you feed in numbers that are intrinsically too well-behaved, you can emerge with a set of rules that looks rock-solid but will take you completely off into the weeds when faced with a more real-world situation. And if you go to the other extreme, starting with woolly multi-binding-mode SAR full of outliers and singletons, you can end up fitting equations to noise and fantasies. That does no one any good, either.
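That second failure mode is easy to demonstrate for yourself: hand a flexible model nothing but random numbers and it will happily report a solid fit on the training data while predicting nothing at all on anything new. A quick illustrative sketch:

```python
# Fitting equations to noise: a flexible model on 50 meaningless "descriptors"
# looks good on its own training data and falls apart on fresh data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # 50 descriptors of pure noise
y = rng.normal(size=100)          # "activity" values, also pure noise

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

print("R^2 on training data:", model.score(X, y))   # deceptively high
print("R^2 on fresh noise:  ",
      model.score(rng.normal(size=(100, 50)), rng.normal(size=100)))  # near zero or negative
```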
Back last year, I talked about the types of journal article titles that make me keep on scrolling past them, and invited more. One of the comments suggested “New and Original strategies for Predictive Chemistry: Why use knowledge when fifty cross-correlated molecular descriptors and a consensus of over-fit models will tell you the same thing?”. What I’d like to know is, was this the right title for this work, or not?