In Silico

Did Kaggle Predict Drug Candidate Activities? Or Not?

I noticed this piece on Slate (originally published in New Scientist) about Kaggle, a company that’s working on data-prediction algorithms. Actually, it might be more accurate to say that they’re asking other people to work on data-prediction algorithms, since they structure their tasks as a series of open challenges, inviting all comers to submit their best shots via whatever computational technique they think appropriate.

PA: How exactly do these competitions work?
JH: They rely on techniques like data mining and machine learning to predict future trends from current data. Companies, governments, and researchers present data sets and problems, and offer prize money for the best solutions. Anyone can enter: We have nearly 64,000 registered users. We’ve discovered that creative-data scientists can solve problems in every field better than experts in those fields can.
PA: These competitions deal with very specialized subjects. Do experts enter?
JH: Oh yes. Every time a new competition comes out, the experts say: “We’ve built a whole industry around this. We know the answers.” And after a couple of weeks, they get blown out of the water.

I have a real approach-avoidance conflict with this sort of thing. I tend to root for outsiders and underdogs, but naturally enough, when they’re coming to blow up what I feel is my own field of expertise, that’s a different story, right? And that’s just what this looks like: the Merck Molecular Activity Challenge, which took place earlier this fall. Merck seems to have offered up a list of compounds of known activity in a given assay, and asked people to see if they could recapitulate the data through simulation.
Looking at the data that were made available, I see that there’s a training set and a test set. They’re furnished as a long run of molecular descriptors, but the descriptors themselves are opaque, no doubt deliberately (Merck was not interested in causing themselves any future IP problems with this exercise). The winning team was a group of machine-learning specialists from the University of Toronto and the University of Washington. If you’d like to know a bit more about how they did it, here you go. No doubt some of you will be able to make more of their description than I did.
But I would be very interested in hearing some more details on the other end of things. How did the folks at Merck feel about the results, with the doors closed and the speaker phone turned off? Was it better or worse than what they could have come up with themselves? Are they interested enough in the winning techniques that they’ve approached the high-ranking groups with offers to work on virtual screening techniques? Because that’s what this is all about: running a (comparatively small) test set of real molecules past a target, and then switching to simulations and screening as much of small molecule chemical space as you can computationally stand. Virtual screening is always promising, always cost-attractive, and sometimes quite useful. But you never quite know when that utility is going to manifest itself, and when it’s going to be another goose hunt. It’s a longstanding goal of computational drug design, for good reason.
So, how good was this one? That also depends on the data set that was used, of course. All of these algorithm-hunting methods can face a crucial dependence on the training sets used, and their relations to the real data. Never was “Garbage In, Garbage Out” more appropriate. If you feed in numbers that are intrinsically too well-behaved, you can emerge with a set of rules that look rock-solid, but will take you completely off into the weeds when faced with a more real-world situation. And if you go to the other extreme, starting with wooly multi-binding-mode SAR with a lot of outliers and singletons in it, you can end up fitting equations to noise and fantasies. That does no one any good, either.
Back last year, I talked about the types of journal article titles that make me keep on scrolling past them, and invited more. One of the comments suggested “New and Original strategies for Predictive Chemistry: Why use knowledge when fifty cross-correlated molecular descriptors and a consensus of over-fit models will tell you the same thing?”. What I’d like to know is, was this the right title for this work, or not?

28 comments on “Did Kaggle Predict Drug Candidate Activities? Or Not?”

  1. eugene says:

    This technology is still in the testing phase and hasn’t been successfully used by Merck. You don’t have a link to the Slate article as of now, but the quotes sound like a lot of hubris for something that is untested.
    It also makes me a bit uncomfortable that there was no chemical structure out there at all during the contest. I don’t know why that should be, since I guess they had other descriptors that could substitute well for it… but it just does.

  2. Morten G says:

    I still wonder how Kaggle determines the winner if the test set is released with the training set?
    Also, when I read that article I went straight to the Merck challenge and noticed that it was won by machine learning experts, not amateurs. Hell, more than experts – these are the people who invent these algorithms / data treatment structures / whatever you want to call them.
    PS Noticed the $3 mio challenge up now?

  3. Jack H. Pincus says:

    The best use of machine learning at this time would not be virtual screening but predicting outcomes of compounds with certain properties. That may be what Merck is hoping for. Machine learning has been useful for predicting consumer behavior or the outcome of elections. As you correctly point out, the outcome in this case may depend on the quality of the data and how the teams interpret it. I haven’t looked at Merck’s dataset but your description suggests it may be lacking details critical for successful projections.
    Successful machine learning projects often include a domain expert to help analyze and interpret the results. Kaggle teams are almost exclusively data scientists, which could affect the outcome of a drug candidate analysis project.

  4. LeeH says:

    I competed in an earlier modeling exercise, from Boehringer, and came in 31st out of 703 entrants. Rather than being an exercise in creating a linear prediction of activity, it was a categorical one (i.e. what was the probability of a particular compound being active).
    How did I do it? By successive tries, with various methods and combinations of methods, optimizing what I did by using feedback from the test set (which, by the way, is only 25% of the total test instances). Some would say that I overfit, but in fact the performance by the methods on the smaller test set mirrored the performance on the final test set extremely well. I would have done somewhat better (15th or so) by choosing one of my other methods, but I’m sure that’s true of everyone. And I did choose the second best methods, out of about 100 candidates.
    The good thing about the exercise was that I was able to test out various methods in a blinded way. I wouldn’t have put as much effort into a non-blinded situation (money and besting the “experts” is a strong motivator), but I’m glad I did it once. I did learn some interesting things about the available data mining methods which I can use in everyday life. For example, in general, Forest of Trees is king.
    On the other hand, it’s important to note a few things. First, the amount of effort put into this competition is way more than you could hope to spend in a real-world environment. Second, as mentioned before, you didn’t know what the descriptors referred to or anything about the structural features of the compounds. And third, and most importantly, the criteria used for winning are rather ridiculous. The actual score was a log-loss calculation, which is sensitive to the log of the difference between your prediction (i.e. probability of belonging to one class or another) and the actual class. The difference between the winner and the 50th place finisher was 0.009, and the scores were reported to 5 decimal places. Under these conditions, you could predict the ranking extremely well, getting all the actives at the top of a ranked list, but if you predicted a probability of 0.5 for too many of the inactives (which should be 0.000) you would blow your overall performance. This scoring method artificially created a user ranking which was not appropriate to the exercise, that is, finding active compounds.
    I did voice my concerns on the forum boards related to that particular competition. My impression is that, with a few exceptions, people didn’t really get it, and seemed impressed with improvements in the 3rd and 4th decimal places that we, in the modeling community, would consider trivial.
    What would be interesting for everyone, I think, would be to see which methods, over the entirety of the submitted results, performed best. Perhaps the folks at Boehringer are writing up a paper.
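The point in the comment above about log-loss versus ranking can be made concrete with a small sketch. The toy labels and probabilities below are hypothetical (they are not from the Boehringer or Merck data), and `log_loss` is a plain hand-written binary log-loss, not the competition's scoring code. Both sets of predictions rank every active above every inactive, yet the one that hedges at 0.5 on the inactives scores far worse.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def ranks_perfectly(y_true, p_pred):
    """True if every active gets a higher probability than every inactive."""
    actives = [p for y, p in zip(y_true, p_pred) if y == 1]
    inactives = [p for y, p in zip(y_true, p_pred) if y == 0]
    return min(actives) > max(inactives)

# Hypothetical toy set: 2 actives followed by 8 inactives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Two hypothetical models with identical (perfect) rankings.
confident = [0.9, 0.9] + [0.1] * 8   # inactives pushed toward 0
hedging = [0.9, 0.9] + [0.5] * 8     # inactives left at 0.5

loss_confident = log_loss(labels, confident)
loss_hedging = log_loss(labels, hedging)

print("perfect ranking:", ranks_perfectly(labels, confident),
      ranks_perfectly(labels, hedging))
print("log-loss:", round(loss_confident, 3), round(loss_hedging, 3))
```

For someone who only wants the actives at the top of a list, the two models are interchangeable; the log-loss leaderboard would separate them widely.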

  5. Random observer says:

    #2: According to the Kaggle data description, the test dataset as released had its activity information removed.
    Still, an important caveat about their approach for determining the “winner” is that the difference between 1st place score of 0.494 and 2nd place score of 0.488 could very well be due to chance and not necessarily indicative of how performance of these models would compare on a truly new dataset that has not been seen before. There were 99 teams participating with their scores higher than Merck’s internal result of 0.423 (to arbitrarily draw a line in the sand for choosing qualified and committed competitors) that made over 2000 (two thousand!) submissions to solve this problem (pardon my quantitative upbringing and fondness for numbers). This is a lot of shots at the same goal (i.e. same test set). Especially if the competitors know the results of their previous submission as they work on their next one – if they do, then such a process is guaranteed to overfit (i.e. make error look smaller than what it would be on a new undisclosed dataset) – albeit an elaborate one, for sure – with many teams, distributed environment, web submissions, etc.
    And, as another comparison of different (genomics based) models – MAQC-II (certainly better conducted as far as process goes) – tells us, there are datasets with more information in them, less and none to speak of. That and team proficiency were in that case what mostly determined the performance of the models on properly blinded test datasets – the choice of the modeling tool did not matter so much.
    Then consider that some of the models allow for higher interpretability and some for much less (it is usually tough to understand what drives predictions by neural networks) and the value of 0.494 versus 0.423 might be very-very questionable.

  6. ptm says:

    Only the training set is released. From what I understand of the rules, contestants submit models and people at Kaggle evaluate them on test data.
    Looking at the leaderboard, the winner’s R2 correlation coefficient is 0.494 compared to Merck’s internal standard of 0.423, so an improvement of 0.071.
    I have no idea how valuable such an improvement is in practice. Naively it doesn’t strike me as a groundbreaking result but it’s clearly not negligible either.

  7. drug_hunter says:

    For all the reasons given by previous posters and Derek, the difference of 0.07 is NOT CONCLUSIVE of any actual improvement.
    However, I am a fan of this kind of work. I think these kinds of computational experiments are going to be very useful to help us understand which methods work best under which circumstances, so over time I think we WILL get to the point where we can say with more confidence that an improvement of 0.07 is in fact useful.
    We just aren’t there yet. We need another dozen or so competitions and I bet at that point some trends will start to emerge.

  8. gwern says:

    It’s worth noting that one of the team members was Geoff Hinton. Yes, *that* Hinton, the deep learning neural networks Hinton whose work has been routinely racking up rewards and records over the past few years.
    > Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well. We probably should have explored more feature engineering and preprocessing possibilities since they might have given us a better solution.
    I could believe that.

  9. MTK says:

    “Uh, I was told there would be no math.”

  10. Guy Cavet says:

    Disclosure: I work for Kaggle.
    Thanks for the nice post and discussion.
    Regarding overfitting, there are two test sets: one used to give feedback on submissions and generate the leaderboard during the competition, and another used to determine the final standings. No information from the second test set is available to the participants during the competition. So, if the participants overfit the first test set, they will not win. And conversely, the winners will not have overfit the test set. There’s a nice blog post about this at
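The two-test-set design described in the comment above can be illustrated with a simulation. This is only a sketch under invented assumptions: the set sizes, the number of submissions, and the idea that every submission is a pure coin-flip guesser are all hypothetical and are not Kaggle's actual setup. Picking the best of many skill-free submissions on the feedback (public) set gives an inflated score, while the same submission falls back to chance on the held-back (private) set.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sizes, chosen only for illustration.
N_PUBLIC, N_PRIVATE, N_SUBMISSIONS = 100, 2000, 200

public_truth = [rng.randint(0, 1) for _ in range(N_PUBLIC)]
private_truth = [rng.randint(0, 1) for _ in range(N_PRIVATE)]

def accuracy(truth, guesses):
    """Fraction of guesses that match the true labels."""
    return sum(t == g for t, g in zip(truth, guesses)) / len(truth)

# Each "submission" is a coin-flip guesser with no real skill.
results = []
for _ in range(N_SUBMISSIONS):
    public_guess = [rng.randint(0, 1) for _ in range(N_PUBLIC)]
    private_guess = [rng.randint(0, 1) for _ in range(N_PRIVATE)]
    results.append((accuracy(public_truth, public_guess),
                    accuracy(private_truth, private_guess)))

# Select the submission that looks best on the public feedback set.
best_public, private_of_best = max(results, key=lambda r: r[0])

print("best public score:", best_public)
print("same submission on private set:", private_of_best)
```

The gap between the two printed numbers is the selection effect alone; a final ranking taken from the private set is not inflated by it.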

  11. MoMo says:

    Sounds like another scheme by Big Pharma-Merck to get high quality work for free or next to nothing.
    All of you who think you are doing something scientific are just being taken advantage of.
    Hire some computational scientists MERCK!
    You’ve been Kaggled!

  12. As has been pointed out here the improvements over basic methods are small- small enough that factors such as distribution drift, noise in the data, change in performance metric, assignment uncertainty or uneven sampling of the original distribution can easily swamp any differences. See David Hand’s wonderful paper for a better description:
    What is really shocking about the Merck event, in particular, is how poor the statistics are- the performance metric is over 15 systems, so essentially the resultant average- reported to 5 decimal places- is over 15 numbers. Not reported is the variance over those 15 systems, but even a reasonable estimate of the variability of performance over 15 systems would suggest the winning entry is statistically indistinguishable from the entry from Merck scientists. We don’t know though because all we get is the average R**2.
    If you actually plot all scores on the entries on the “leaderboard”, they form a pretty good gaussian distribution (centered around the Merck entry), again leading me to think it’s pretty much a case of random variance of methods.
    I’ve heard Merck is very excited about the 0.07 increase in R**2 over their methods- 0.07 which equates, back of the envelope to an increase in predictive accuracy of 0.05 of an activity unit over their own method. Good luck with that, Merck.
    The great golfer Lee Trevino used to say, “Driving is for show and putting is for dough”. These competitions are for show- even though Merck was happy to cough up real dough! This is the next great hype upper management in pharma is going to fall for hook, line and sinker.
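The back-of-the-envelope figure in the comment above can be reproduced under two stated assumptions that are not confirmed by the source: that R**2 is read as 1 - SSE/SST, so the root-mean-square error is the activity standard deviation times sqrt(1 - R**2), and that the activities have a standard deviation of about one activity unit. The function name is hypothetical.

```python
import math

def rmse_in_sd_units(r2):
    """RMSE as a fraction of the activity standard deviation,
    assuming R**2 = 1 - SSE/SST."""
    return math.sqrt(1.0 - r2)

merck_rmse = rmse_in_sd_units(0.423)   # Merck internal benchmark
winner_rmse = rmse_in_sd_units(0.494)  # winning entry

gain = merck_rmse - winner_rmse
print(round(merck_rmse, 3), round(winner_rmse, 3), round(gain, 3))
```

Under those assumptions the gain comes out at roughly 0.05 of a standard deviation, which matches the comment's estimate if the spread of activities is about one unit.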

  13. LeeH says:

    Thanks for elaborating on the point I was trying to make. I would contend that things are EVEN WORSE, especially in this case.
    Let’s assume that the models are completely real (yes, I know, they’re not, but let’s pretend). The problem is that they are linear models in log space. Have you ever noticed the 95% confidence limit prediction (not model) estimates on a linear model with an R2 in the neighborhood of 0.4 or 0.5? They’re huge. They easily span 2 or 3 orders of magnitude (in non-log space), maybe more, at the extremes. That means that your predictions for the compounds that supposedly live at the high end of the activity curve (i.e. where you want to be) are really not much better (if at all) than the range of activities of all of the compounds that you started with. Some model.
    This is why I never even attempt linear models.
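The width of those prediction limits can be estimated roughly. The sketch below assumes (these are assumptions, not figures from the competition) that activities are on a log10 scale with a standard deviation of one log unit, that the residuals are roughly normal, and that the extra widening from parameter uncertainty at the extremes is ignored, so the real intervals would be somewhat wider than this.

```python
import math

def prediction_interval_width_log10(r2, sd_activity=1.0, z=1.96):
    """Approximate full width of a 95% prediction interval, in log10 units.

    Assumes residual SD = sd_activity * sqrt(1 - R**2) and ignores the
    leverage term that widens the interval away from the data centroid.
    """
    residual_sd = sd_activity * math.sqrt(1.0 - r2)
    return 2 * z * residual_sd

for r2 in (0.4, 0.5):
    width = prediction_interval_width_log10(r2)
    # 10 ** width is the fold-range the interval spans in non-log space.
    print(r2, round(width, 2), round(10 ** width))
```

With these inputs the interval spans close to three log units, that is, a range of several hundred fold or more in non-log space, in line with the "2 or 3 orders of magnitude" quoted above.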

  14. Teddy Z says:

    Back in a former life, I worked at a company that had a “blackbox” model for figuring out your SAR and predicting what the next set of compounds should look like. After a few sets of black box based compound picking, the compounds were looking very much like antibiotics (but interestingly of several different classes), but this was a PPI. Well, it turns out a young biologist didn’t understand his assay well: it used a luciferase reporter readout and the compounds were simply shutting down protein synthesis, not indicative at all of inhibition of the PPI.

  15. Q tsar says:

    A stopped clock tells the right time twice a day. If you have 238 stopped clocks, one of them is likely to always be close to the right time.

  16. chris says:

    I’ve worked with academic groups who are developing novel computational methods, and things like this do give them the chance to try them out on a range of different problems.
    I don’t think the results are unexpected and probably reflect the current state of virtual screening, and whilst it would have been interesting if one group had been substantially better than the basic method, if you have to select one prize winner then this is the result.

  17. George says:

    Hi Derek,
    I’m the leader of the winning team in the competition. I have recently discovered your wonderful blog and I am ecstatic that my team’s work has been mentioned on it!
    I hope what we did will be useful to Merck and I think it probably will be, but of course these sorts of things need to be evaluated very carefully. I don’t have the drug discovery expertise to know how it will play out. If people in these pharma companies provide lots of data to train models on and a metric they are interested in improving, we machine learning researchers can only try to improve the models according to the metric and hope that we are optimizing something useful and that there is enough data.

  18. drug_hunter says:

    Regarding George’s post, and the general question of whether machine learning in the absence of any knowledge of chemistry can be useful at all, I recommend everyone take a look at:

  19. Neo says:

    Derek said: “Virtual screening is always promising, always cost-attractive, and sometimes quite useful. But you never quite know when that utility is going to manifest itself, and when it’s going to be another goose hunt. It’s a longstanding goal of computational drug design, for good reason.”
    The problem in this area is that there are a lot of software vendors who oversell their virtual screening methods using flawed retrospective “validations”. So it is hard for end-users to distinguish between things that actually work and things that just look good on paper. Always look for prospective applications of the technique published in peer-reviewed journals. And correct by the number of scientists using that technique.

  20. Neo says:

    Also, regarding your comment that you never know when virtual screening is going to be another goose hunt.
    Is it not the same with HTS when used against new targets? After all, you are only screening about 10^5 molecules out of an estimated 10^60 possible drug-like molecules. You cannot find hits where there are not hits to find. This is a very real possibility (e.g. antibacterial HTS). It is on challenging targets where virtual screening can be very useful.

  21. George says:

    drug_hunter, there is knowledge of chemistry in this process, I am just not the one who has it. Chemistry knowledge is not something that is in short supply at Merck. It is too much to ask that the people with specialized chemistry knowledge also have specialized machine learning knowledge which is why it makes sense for chemists and machine learning researchers to collaborate.

  22. ChrisL says:

    The Merck experiment as described completely prevents any conclusions as to the relative abilities of experts compared to outsiders. The reason is that chemical structures were not shown. Reducing a chemical structure to a bit string of descriptors completely eliminates the power of chemical structure to biomedical information pattern recognition which I would maintain is the forte of medicinal chemists. Imagine reducing a histo-pathology slide to the equivalent of a bit string or an old world painting to a bit string. You would eliminate the pattern recognition skill of the pathologist in reading slides or the pattern recognition skill of the art expert in detection of a bona fide versus fake picture. I think the old maxim still holds – the computational expert system typically only does about 85% as well as the human expert. The book “blink” by Malcolm Gladwell has an extensive and excellent discussion of the power of pattern recognition among human experts.

  23. gogoosh says:

    The power of these pattern recognition algorithms isn’t that they outperform human experts, it’s that they are automated. Experiments like scanning a huge virtual library cannot be done with human expertise alone.
    Some of the comments in this thread make me think that the scientists writing them have never been involved in a productive collaboration between a medicinal chemist and a computational chemist.
    In my opinion, any modeler worth her salt will acknowledge that most models are a poor substitute for human expertise, and any med chemist worth her salt will acknowledge that automated, objective methods of assessing the vastness that is chemical space are useful tools.

  24. Chris says:

    @12 Anthony Nicholls.
    I’m unclear how the scores looking Gaussian implies the methods were essentially random. If scores from students taking a test followed a Gaussian curve I’d think it was mostly because the students had different abilities, even if there were some randomness involved (i.e. test questions, test topic, etc..)
    disclaimer–I was on the winning kaggle team.

  25. Chris says:

    Also, awesome post. It’s a great question, and I hope we’ll soon learn the answer.

  26. @24 Chris
    In answer to your question as to why I think the distribution of scores looking Gaussian is an indication that you were lucky, not good, I would suggest a simple statistical argument- the metric for success was an average of 15 numbers- a very small sample that inevitably will have a significant variance. Different methods will produce a different set of 15 numbers- essentially at ‘random’. At question here is whether such a variance between methods could explain the distribution seen in the Kaggle event. By eye it would look like the std of contributions is about 0.02 (R**2), i.e. your entry was about 3 standard deviations out (1 in a 100). Anyone in our field would be very comfortable with the concept of any two different methods giving (averaged) results different by 0.02 over a sample size of 15. Hence, it is a reasonable assumption you were lucky.
    Of course, the organizers could have partially addressed this by providing all 15 scores, not just the average, because then we could have looked at the correlated improvement of your method- i.e. was your approach consistently better across all systems-that would have been interesting. But this was not provided, confirming my suspicions that a lot of people who do machine learning actually have a fairly poor grasp of statistics.
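The "correlated improvement" check mentioned above would look something like the following paired comparison. The per-system differences here are entirely hypothetical, invented only to show the mechanics, since the 15 individual scores were not published. Both invented cases have the same average gain of 0.07 in R**2, but only one is consistent across systems.

```python
import math
import statistics

# Hypothetical per-system gains in R**2 (method A minus method B), 15 systems.
consistent = [0.06, 0.08, 0.07, 0.07, 0.06, 0.08, 0.07, 0.05,
              0.09, 0.07, 0.06, 0.08, 0.07, 0.07, 0.07]
erratic = [0.40, -0.25, 0.30, -0.20, 0.35, -0.30, 0.25, -0.15,
           0.45, -0.35, 0.20, -0.10, 0.30, -0.05, 0.20]

def paired_t(diffs):
    """Paired t statistic: mean difference over its standard error."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

T_CRITICAL = 2.145  # two-sided 95% critical value for 14 degrees of freedom

t_consistent = paired_t(consistent)
t_erratic = paired_t(erratic)

for name, diffs, t in (("consistent", consistent, t_consistent),
                       ("erratic", erratic, t_erratic)):
    print(name, round(statistics.mean(diffs), 3), round(t, 2), t > T_CRITICAL)
```

The average alone cannot tell these two cases apart, which is why the individual scores matter for judging whether a 0.07 gain is real.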

  27. The true validation of the approach will be: take an absolutely NEW dataset with the same data nature/distribution, randomly split it 20-30 times into training and test sets, build models and test them. The actual improvement of 0.07 is not really a big deal, but if the approach behaves in the same manner with new datasets – it’s brilliant.
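The repeated random-split procedure proposed above can be sketched as follows. Everything in it is a stand-in: the data are synthetic (one made-up descriptor with Gaussian noise), the model is a one-variable least-squares line rather than any competition entry, and the 70/30 split and 25 repeats are arbitrary choices within the suggested range.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a "new" dataset: one descriptor x, activity y.
xs = [rng.uniform(0, 3) for _ in range(200)]
ys = [3 * x + rng.gauss(0, 1) for x in xs]
data = list(zip(xs, ys))

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    x_vals = [x for x, _ in points]
    y_vals = [y for _, y in points]
    mx, my = statistics.mean(x_vals), statistics.mean(y_vals)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in x_vals)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(points, slope, intercept):
    """1 - SSE/SST evaluated on held-out points."""
    y_vals = [y for _, y in points]
    my = statistics.mean(y_vals)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    sst = sum((y - my) ** 2 for y in y_vals)
    return 1 - sse / sst

scores = []
for _ in range(25):
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, test = shuffled[:140], shuffled[140:]
    slope, intercept = fit_line(train)
    scores.append(r_squared(test, slope, intercept))

mean_r2 = statistics.mean(scores)
spread_r2 = statistics.stdev(scores)
print(round(mean_r2, 3), round(spread_r2, 3))
```

The spread across splits is the useful number: a difference between two methods that is smaller than this spread would be hard to call an improvement.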

  28. Sergio says:

    In response to Anthony Nicholls’s argument:
    If the variation between teams were due to the addition of random effects on the 15 problems, you would expect that, given a different set of problems, the order of the teams would be scrambled.
    This doesn’t happen. Kaggle uses a separate test set to produce the final leaderboard, which is different from the one that is used by contestants throughout the competition. Nonetheless, the leaderboards before and after the deadline tend to look quite similar.
