

Machine Learning’s Awkward Era

The whole machine learning field has a huge amount to offer chemistry, medicinal chemistry, and biomedical science in general. I don’t think that anyone seriously disputes that part – the arguing starts when you ask when this promise might be realized. In the abstract, the idea of tireless, relentless analysis of the huge piles of data that we generate is very appealing. Those piles have long since passed the point where humans can wring out all the results unaided, or even with the sort of computational aids that we’re already used to. There are just too many correlations to check, too many ideas to try out, and too many hypotheses to validate.

I am not a specialist in the field. I mean, I know people who are, and I know more about it than many people who aren’t doing it for a living, but I am in no way qualified to sit down and read (say) four different papers on machine-learning approaches to chemical problems and tell you which one is best. The problem is, I’m not so sure anyone else can do that very easily, either. This is even more obvious to researchers in this area than it is to the rest of us. You might hope that additional expertise would allow people to make such calls, but as things stand now, it mostly allows them to see just how tangled things really are.

It would be nice (although perhaps quite difficult) to put together some standard test cases for evaluating performance in (say) machine-learning drug discovery programs. You could imagine a pile of several thousand compounds with associated assay data, from some area where we’ve already worked out a lot of conclusions. You’d turn the software loose and see how much of that hard-won knowledge could be recapitulated – ah ha, Program Number Three caught on to how those are actually two different SAR series, good for it, but it missed the hERG liabilities that Program Number Six flagged successfully, and only Program Number Three was able to extrapolate into that productive direction that we deliberately held back from the data set...and so on. Drug repurposing (here’s a recent effort in that line) could be a good fit for a standard comparator set like this as well.
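To make the idea a bit more concrete – and purely as a sketch, with a hypothetical assay_data.csv, and RDKit plus scikit-learn standing in for whatever the real programs would use – one slice of such a benchmark (the held-back-series part) might look something like this:

```python
# Rough sketch: hold out an entire chemical series (by Bemis-Murcko scaffold)
# and ask whether a model trained on everything else can extrapolate into it.
# The file name and column names here are hypothetical.
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def scaffold(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return MurckoScaffold.MurckoScaffoldSmiles(mol=mol) if mol else None

def fingerprint(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

df = pd.read_csv("assay_data.csv")            # columns: smiles, active (0/1)
df["scaffold"] = df["smiles"].map(scaffold)

# Hold back the largest scaffold family entirely -- the "productive direction
# we deliberately held back from the data set" above.
held_out = df["scaffold"].value_counts().idxmax()
train = df[df["scaffold"] != held_out]
test = df[df["scaffold"] == held_out]

X_train = np.stack(train["smiles"].map(fingerprint).tolist())
X_test = np.stack(test["smiles"].map(fingerprint).tolist())

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, train["active"])
auc = roc_auc_score(test["active"], model.predict_proba(X_test)[:, 1])
print(f"AUC on the held-back series: {auc:.2f}")
```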

Thus calls like this one, to try to put the whole machine learning/artificial intelligence area onto a more sound (and comparable) footing. This is far from the first effort of this type, but it seems to me that these things have been getting louder, larger, and more insistent. Good. We really have to have some common ground in order for this field to progress, instead of a mass of papers with semi-impenetrable techniques, benchmarked (if at all) in ways that make it hard to compare efficiency and outcomes. Some of this behavior – much of it – is being driven by pressure to publish, so you’d think that journal editors themselves would have to be part of the solution.

But maybe not. This is an area with a strong preprint tradition, and not long ago there was a controversy about (yet another) journal from the Nature Publishing Group, Nature Machine Intelligence. Over two thousand researchers in the field signed up to boycott the journal because it would be subscription-only, which the signers feared would erode the no-paywall system that’s currently in place. In that case, though, the pressure for higher-quality publications will have to come from others in the field somehow, with a willingness to provide full details and useful benchmark tests helping to drive reputations (rather than just the number of papers and/or their superficial impressiveness). Machine learning is far from the only field that could benefit from this approach, of course, and the fact that we can still speak in those terms makes a person wonder how effective voluntary calls for increased quality will be. But I certainly hope that they work.

Meanwhile, just recently, there’s been a real blot on the whole biomedical machine-learning field. I’ve written some snarky things about IBM’s Watson efforts in this area, and you know what? It looks as if the snark was fully deserved, and then some. STAT reports that the company’s efforts to use Watson technology for cancer-care recommendations were actually worse than useless. The system made a good number of wrong (and even unsafe) calls, which decreased physician confidence in it rapidly (as well it should have). Worse, this was going on at the same time that IBM was promoting the wonders of the whole effort, stating that doctors loved it, didn’t want to be without it once they’d been exposed to its glories, and so on. It’s a shameful episode, if STAT has its facts right, and so far I have no reason to think that it doesn’t.

So there are the two ends of the scale: efforts to make machine-learning papers more comprehensive and transparent, and a company’s apparent efforts to obfuscate its own machine-learning shortcomings in order to boost its commercial prospects. You don’t need to turn a bunch of software loose on the philosophical ethics literature to come to a conclusion about the latter. In this case, anyway, mere human instincts tell you all you need to know.

43 comments on “Machine Learning’s Awkward Era”

  1. Mostapha Benhenda says:

    The field lacks common standards and benchmarks due to business interests.
    It’s up to the sponsors of this substandard field to change this situation, and get more bang for their bucks. I made a proposal here: https://medium.com/the-ai-lab/diversitynet-a-collaborative-benchmark-for-generative-ai-models-in-chemistry-f1b9cc669cba

    1. One challenge is that it’s hard to design benchmarks which are actually predictive; we’ve shown that many benchmarks are likely just measuring overfitting [https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00403]. That’s why we think the standard should be large diverse prospective assessments, as has been previously reported on this blog [http://blogs.sciencemag.org/pipeline/archives/2017/04/24/free-compounds-chosen-by-software].

      Another challenge is the quality of the underlying experimental data. Martin et al. (at Novartis) report that the average correlation between 4-concentration high throughput kinase assays and 8- to 12-concentration assays is only R^2=0.54 [https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00166], which puts an upper limit on how well any other experimental or predictive method could correlate!
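      As a back-of-the-envelope illustration of that ceiling (a quick simulation, not data from the paper):

      ```python
      # Even a model that knows the true potencies exactly can't correlate with
      # a noisy assay readout beyond a ceiling set by the noise itself.
      import numpy as np

      rng = np.random.default_rng(0)
      true_pIC50 = rng.normal(7.0, 1.0, 10_000)           # hypothetical true values
      assay = true_pIC50 + rng.normal(0.0, 0.7, 10_000)   # ~0.7 log-unit assay noise

      r2 = np.corrcoef(true_pIC50, assay)[0, 1] ** 2
      print(f"R^2 of a perfect model vs the noisy assay: {r2:.2f}")
      # Expected ceiling = signal variance / total variance = 1.0 / (1.0 + 0.49), about 0.67
      ```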

  2. Tuck says:

    It seems that a fundamental flaw of machine-learning approaches is that there is no provision for logging / tracking the decision-making process.

    The reasonable expectation you describe – being able to pick apart different solutions – is thus impossible to meet.

    It’s literally a *deus ex machina*, and must be taken on faith in the developer and his approach.

    That may be fine for chess, or go, but is obviously unacceptable for any serious enterprise. Obvious to everyone but the practitioners, evidently.

    1. Starfire Wheel says:

      This concern is correct but, unfortunately, it’s also correct when applied to human reasoning. People are very bad at explaining their own reasoning (begging the question, as in “I don’t like this molecule. Why? Because it looks bad, can’t you see?”). Worse, a medicinal chemist’s opinion has been shown to correlate neither with other medicinal chemists’ opinions nor with their own!

      1. Tuck says:

        You can ask a human. Programs can’t answer.

        Qualitative difference.

      2. tc says:

        Not entirely true… One can set a seed for the random number generator. Then it will be reproducible time after time after time.
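        A minimal sketch of the point, with the library and model chosen purely as examples:

        ```python
        # With the seeds pinned, a stochastic model gives the same answer on every run.
        import random
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        random.seed(42)
        np.random.seed(42)

        X = np.random.rand(200, 16)
        y = (X[:, 0] > 0.5).astype(int)

        # random_state pins the model's own internal randomness as well
        model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
        print(model.predict_proba(X[:3]))   # identical output time after time
        ```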

    2. eyesoars says:

      That’s the real liability of neural networks (or anything related to “learning”, including people): you can never be entirely sure that what they have learned is what you think they have.

  3. MoMo says:

    Can’t wait until the AI data machines process the predatory and junk journals. We will all have a good laugh then, and computers will grow legs and arms and then kill us all.

    But what then AI? Will you tell us what to make and how?

  4. Cross Entropy says:

    One set of machine learning benchmarks that has been proposed is MoleculeNet (Google it!), although admittedly some of its datasets are too small to serve as really good benchmarks. We badly need some large, high-quality, publicly available datasets to use for benchmarking.

    1. MoMo says:

      Those computers will start their carnage at Stanford.

  5. John Wayne says:

    Obligatory XKCD link:
    https://xkcd.com/1838/

    1. Chem2Bio says:

      Not the only on-point XKCD reference for this thread:
      https://xkcd.com/1831/

      I think a lot of the trouble in applying AI to biology, medicine, and even chemistry is all the real ambiguity and lots of false signal (garbage data or otherwise) that muddies up the analysis. We don’t get to do a billion+ data point A/B study.

  6. MrXYZ says:

    It would be interesting to include spurious (i.e. just plain bad) data in some of the benchmark data sets. At what point will bad data lead the machines to extract incorrect conclusions? And how weird will the conclusions become?

    For a humorous, non-chemistry, example of what happens when you set loose machine learning on ambiguous data sets: https://www.theatlantic.com/technology/archive/2018/03/the-making-of-skyknit-an-ai-yarn/554894/

  7. cynical1 says:

    It seems to me that if the developers have access to the data set that their machine will be judged against, they will simply develop a machine to give the correct answers for that data set. It seems like a neutral third party should be the one to evaluate any new machine learning technology, and that third party should not and would not share that data set with developers. My two cents. (Not that I’m cynical about these developers, mind you.)

  8. Neo Karoshi says:

    I don’t mean to be harsh, but to see this comment in 2018 is shocking: “It would be nice (although perhaps quite difficult) to put together some standard test cases for evaluating performance in (say) machine-learning drug discovery programs”. Hasn’t anyone here heard about the great advances achieved in QSAR and docking using machine learning? Comparing different machine-learning models on the same benchmark has been standard practice for decades now. This is a good paper about it: https://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00591 There are many!

    1. Peter Kenny says:

      I think the benchmarks in that article are not fully disclosed.

    2. SlimJimWins says:

      I know of the Tox21 challenge that provides relatively large data sets for toxicology screening: https://tripod.nih.gov/tox21/challenge/data.jsp
      It is arranged by the US EPA, NCATS, and NIEHS/NTP. But I don’t know of any others. Could you please provide some?

  9. Peter Kenny says:

    I think that in some cases ML tools are not that different from the QSAR and multivariate pattern recognition tools that have been used in drug discovery, with varying degrees of success, for some years. The credibility of pharmaceutical AI/ML tends to come unstuck when its advocates start crapping on about paradigm shifts and disruption. If AI/ML is so cool, then why can’t it make a useful impact within the existing paradigm? Another point that can usefully be made is that the AI/ML community needs to call bullshit, and its reluctance to do so will leave some outsiders with the impression that a bullshitters’ union operates.

    AI/ML advocates need to be aware that drug design is not just about prediction. Lead optimization can be seen as a process of generating knowledge as efficiently as possible, and one needs to be thinking in terms of design of experiments. It is also worth remembering that the driving force of drug action (unbound drug concentration at the site of action) cannot generally be measured for intracellular targets in live humans. I have linked a potentially relevant blog post from 2015 as the URL for this comment.

    1. fajensen says:

      “…about paradigm shifts and disruption”

      Silicon Valley coded speech for: “I didn’t do my homework, now let’s talk about something else that I did do” and “anyway, homework would be a total waste of time because school will burn down tomorrow.”

  10. Pluchea Kelleri says:

    We can make statements like these in Nature these days:

    “With machine learning, given enough data and a rule-discovery algorithm, a computer has the ability to determine all known physical laws (and potentially those that are currently unknown) without human input.”

    https://pubpeer.com/publications/9B50F5B6C965CA4DC8948E9910B421

    1. AVS-600 says:

      I suppose that’s true for certain values of “enough data”.

      1. Pluchea Kelleri says:

        Not obvious that it is; read the comments on PubPeer.

        1. Chairman Mao says:

          I hope they have a good Scifinder account, whoever does the Big Data-AI search and dance.

          They only allow you to download 500 abstracts at a time.

  11. Dominic Ryan says:

    A big part of the haze cast over this world of AI is the often unknowable limitations of the original data quality and its relationship to the goal.

    Machine learning, imho the better description of AI, has a long history in a broad range of sciences. It is essentially a set of tools to detect an implied grouping of ‘things’ based on some set of observations. The ‘grouping’ is the tricky part. Some methods are better at recognizing ‘close’ similarity, some at broad group divisions. The evidence for saying that one grouping is better than another is based on statistics. That’s where it gets tricky, because it depends on what you think the data looks like. Say you are flipping coins. If you had some that came up heads 55% of the time on a big sampling, you might make a separate group from those that were 50/50. But you have to ask: would your conclusions change if your sample size was 5 coin flips, or if you knew some were ‘loaded’?
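    To put rough numbers on that coin-flip question (a quick sketch, assuming a reasonably recent SciPy is available):

    ```python
    # A 55%-heads coin is statistically indistinguishable from a fair one
    # until the sample gets fairly large.
    from scipy.stats import binomtest

    for n in (5, 100, 1_000, 10_000):
        heads = round(0.55 * n)
        p = binomtest(heads, n, p=0.5).pvalue
        print(f"{n:>6} flips, {heads:>5} heads -> two-sided p = {p:.4f}")
    ```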

    The more data you have and the more narrowly constrained the context, the more reliable the groupings will be. Facial recognition works better now than ever before because humans have a well-constrained set of parameters – we don’t look much like our ape cousins, mostly – and we now have millions of data points from digital cameras. In addition, processing of those images has gotten much faster, and the quality and quantity of data in turn make it easier to extract relevant points from images.

    That success does not guarantee equivalent success with a different problem.

    This is at the heart of the problem with the current craze. AI is terrific at facial recognition and voice recognition; therefore, say the hopeful, it will be wonderful at xxx.

    It is very difficult to know how good the data is, how much is needed, what a plausible model should be, and most importantly, how robust the conclusions are. Note that I don’t mean the statistical significance of the conclusions. That’s important, but it doesn’t usually address the cost of being wrong in the real world.

    Take Watson for example. As I understand it, Watson is based on parsing a lot of literature with sophisticated natural language processing tools. Those make it easier to find the ‘faces’ in the crowd of words and raise a possible connection between concepts in a given paper. Do that across zillions of papers and you generate a possible connection. Perhaps the connection was going to be that, for a set of clinical observations, the collected literature mined in this way points to a specific suggested mode of treatment.

    What this might be ignoring – and I don’t know whether Watson does – is the limitation of the primary literature, where one wrong finding can get replicated for a long time and create a false statistical signal. Acting on that signal is OK if the possible outcomes are innocuous: put the copper bracelet on your left wrist instead of the right. But if patient health hangs in the balance, I would want to know how robust the entire data path is, and if you are hanging your hat on the quality of the scientific literature, well, then I think you are buying a lot of risk.

    In a medchem team this problem is old and well trodden. A QSAR model gradually gets stronger. At first the team might agree to make compounds to help train the model. Eventually the QSAR model adds no value because the SAR is very clear and nobody needs it. Somewhere in the middle, it adds value by increasing the probability of making useful compounds by more than the team’s ability to use instinct.

    Certainly test cases are useful. Prospective predictions are more rigorous than post-hoc analysis. I hope that sort of thing continues.

    Harder and more important is knowing that the problem to work on is well matched to the methods and original data. I think some of the “AI” commercial startup approaches I am seeing risk being no better than an experienced discovery team but without the benefit of the team digesting results and understanding the underlying biology/chemistry limitations beyond the statistical limits. That may still be ok for some settings such as wanting to outsource a project. Perhaps your particular niche does really well because the data lends itself to that.

    The question is not how well a method does but how much value it adds to other ways of tackling the problem.

    1. LeeH says:

      Great minds.

  12. Jake says:

    “Try it on a standard dataset” has been a thing in the machine learning community since the 80s and has worked incredibly well.

    @cynical1, that’s also a very standard practice, look up things like ‘bake-off’.

  13. tlp says:

    For a problem as diverse as drug discovery, benchmarking with some standard dataset seems either not very effective or outright misleading. 1) It would encourage overfitting to the standard dataset; 2) I’d expect that the algorithms would quickly be optimized for the standard dataset, so that any marginal improvement will look negligible even when it is conceptually a breakthrough; 3) there’s still no way to say if the algorithm gives the right predictions for the right reasons, which I see as the most important condition for AI success.
    That being said, I’m sure AI professionals have thought about these limitations and would appreciate a comment on that.

    1. Neo K. says:

      “if the algorithm gives the right predictions for the right reasons”

      The only right reason for an algorithm to provide the right prediction is to have captured the true relation between dependent and independent variables within the applicability domain of the model. If you know other “right reasons”, I would love to hear about them.

      “, which I see as the most important condition for AI success”

      Actually the most predictive models tend to be the ones that are harder to explain. However, it is very true that sometimes the priority is being able to explain the prediction to a non-expert.

      1. tlp says:

        I’m not savvy enough in the language, so here’s an example. Suppose there is a ‘true’ relationship y = f(x, z) and we have a bunch of experimental data Y = y + experimental error. Some data scientists decide to design the molecules with the best Y ever. But they don’t know that y = f(x, z), so they train some ML regression model on the experimental dataset Y (output) with a bunch of descriptors or experimental parameters as input (A, B, C, … , X, Z). So if the final regression model approximates Y reasonably well but requires any parameters other than (X, Z), I’d call it ‘right for a wrong reason’.
        I don’t see how benchmarking would solve the problem of finding out what model is better from this point of view (unless we know explicitly the true relationship y = f(x, z)).
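        Here’s a toy version of what I mean, with made-up variables and scikit-learn just for the fit:

        ```python
        # The true relationship uses x and z, but a proxy feature c (correlated
        # with x) lets a model fit Y quite well without ever seeing x --
        # "right for a wrong reason".
        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(1)
        n = 2_000
        x, z = rng.normal(size=n), rng.normal(size=n)
        c = x + 0.3 * rng.normal(size=n)              # proxy, correlated with x
        Y = 2 * x - z + 0.2 * rng.normal(size=n)      # "experimental" data

        right = LinearRegression().fit(np.column_stack([x, z]), Y)
        wrong = LinearRegression().fit(np.column_stack([c, z]), Y)
        print("R^2 using (x, z):", round(right.score(np.column_stack([x, z]), Y), 2))
        print("R^2 using (c, z):", round(wrong.score(np.column_stack([c, z]), Y), 2))
        # Both look respectable on this dataset; only one reflects the true relation.
        ```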

        re: Actually the most predictive models tend to be the ones that are harder to explain.

        Let’s say you have two hard-to-explain models trained on the same set. They have the same performance on the aggregated test set. However, Model 1 gives better prediction for subset A of the test set, and Model 2 gives better prediction for subset B. Is there a way to ‘compare’ two models?

        1. Neo K. says:

          Thanks for the example. The true relationship y = f(x, z) is not known, thus you don’t know that x and z are the only features predictive of y. Therefore, you cannot determine whether the features employed by the model are right or not. Of course, you might have an expectation based on past experiences or knowledge, but those experiences can be similar only in appearance, knowledge is often imperfect, x = g(A, B, C, … ) without you knowing it, there are zillions of possible features, unknown unknowns, etc. Modern model validations are rigorous and exhaustive (comparing to null models, y-scrambling, nested CVs, etc.). It is therefore very unlikely that you can work out which features are the right ones in a non-data-driven manner.
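          As a minimal sketch of the y-scrambling check mentioned above (synthetic data, scikit-learn purely as an example):

          ```python
          # A model that still scores well after the labels are shuffled is
          # fitting noise, not a real relationship.
          import numpy as np
          from sklearn.ensemble import RandomForestRegressor
          from sklearn.model_selection import cross_val_score

          rng = np.random.default_rng(0)
          X = rng.normal(size=(300, 50))
          y = 2 * X[:, 0] + rng.normal(scale=0.5, size=300)

          model = RandomForestRegressor(random_state=0)
          real = cross_val_score(model, X, y, cv=5).mean()
          scrambled = cross_val_score(model, X, rng.permutation(y), cv=5).mean()
          print("CV R^2, real labels:     ", round(real, 2))
          print("CV R^2, scrambled labels:", round(scrambled, 2))  # ~0 or negative
          ```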

  14. LeeH says:

    I think I’m seeing a reaction more to the outrageous claims, and hyperbole, of some of the proponents of the technology, rather than a criticism of the technology itself. Let’s state a few (self-evident, I think) facts.

    1. There is no best machine learning algorithm. Change the data, another method may be better.
    2. Who cares. If one method is 2% better than another, do you really think that there’s a substantive change in your probability of finding a drug? And some comparisons measure success down to several decimal places.
    3. The effect of machine learning is incremental. More important considerations are the connection of your measurements to the disease, measurement error, measurement reproducibility, and exactly what measurements are being collected. Did I mention that measurements are really important? (Oh yes, and good chemical matter.)
    4. How the predictions are used is as important as the predictions themselves. In early discovery, the best-predicted compounds are sometimes chosen exclusively (very rare), sometimes it’s just the ones that the project team leader likes (common), and sometimes non-intuitive ones are chosen to embellish the team leader’s intuition (not common).

    Let’s face it, AI is the buzzword of the day. Like 3-D modeling was in the ’80s, combichem in the ’90s, and HTS after that. It will all die down, and machine learning will become just another tool, just like all the others (well, maybe a bit more useful).

    1. LeeH says:

      Oh yes, one more thing.

      There’s been recent news about Google’s work on deep learning methods that design other deep learning methods (so-called AutoML). This is a step in the right direction, since the method will be highly tuned to the particular problem at hand. The other consequence is that there will never be a “best” method – it will always be different depending on the input data.

    2. Neo Karoshi says:

      1. On a particular problem, some algorithms are better than others. These are identified by benchmarking on the same or similar problems.
      2. Applied to hit identification, no method increases the probability of finding a drug, just how inexpensive and fast the process is. For that goal, you might want to look at target validation.
      3. Data sets are indeed important, and competence in their generation is expected. I see no evidence for your claim that “the effect of machine learning is incremental”.
      4. Thanks for pointing out this shortcoming. Did you try to test the top 10 hits from the machine learning method against another 10 chosen by the expert knowledge of the project team leader?

      1. LeeH says:

        1. That wasn’t my point. For a given problem there’s (in theory) a best method, but it’s data set dependent. There’s no global best method. And the differences are meaningless (point 2).
        2. You’ve illustrated my point. If you don’t find a candidate fast enough you may not find a drug. Whatever your metric, there are almost surely lots of different methods that will perform similarly. The differences will probably be lost in the noise compared to the vagaries of the experimental issues, e.g. target selection.
        3. We discovered drugs before the use of machine learning. ML is not essential. It only speeds up the process (your point 2).
        4. We’ve done that exercise, at least with a toy system. ML converged on solutions faster than many, but not all, medicinal chemists.

  15. Bowl of PetunAIs says:

    “Oh no, not again…”

  16. Wavefunction says:

    If machine learning is abstracting from the wrong features of molecules, it will either lead to wrong predictions or predictions that are too obvious. I have yet to encounter a machine learning algorithm which predicts a true scaffold hop (say from Viagra to Cialis). From that standpoint, a tool like ROCS which looks at abstract but general features like shape and electrostatics is better than a lot of ML. ML advocates also need to clearly define what they mean by “AI” and “ML”, otherwise it will be impossible to divine the true value of the technology.

  17. MoMo says:

    AI is just another excuse to subvert the real work of chemistry, which is to create new compounds, molecules and medicines.

    More Molecules, not more machinations. Next!

    1. John Wayne says:

      I, for one, welcome our new MBA overlords. As a trusted middle manager, I can be helpful rounding up others to toil in their underground efficiency mines.

      1. MoMo says:

        I routinely spit on MBAs and traumatize them enough to consider jobs in telemarketing.

        1. Some idiot says:

          🙂

  18. bks says:

    Rodney Brooks, MIT professor emeritus and founder and former CTO of iRobot, casts a jaundiced eye at the state of AI/ML in a four-part essay:
    https://rodneybrooks.com/forai-steps-toward-super-intelligence-i-how-we-got-here/

  19. Regarding a benchmark for AI/ML practitioners in the drug discovery area and separating real value from hype, I think it is essential indeed, and some sort of metrics is therefore needed. It is also important to clearly articulate the use cases for AI/ML tools, in order to compare not “any AI company to any other AI company” but players in each specific domain of the whole drug discovery workflow. For this we made a Map of AI Startups in Drug Discovery, which is a first step in this direction. https://www.biopharmatrend.com/m/map/.

  20. Nick Lynch says:

    Lots of good discussion here on making AI/ML better understood, comparable, and reproducible.
    The Pistoia Alliance has started a Centre of Excellence for AI in Life Sciences:
    https://pistoiaalliance.atlassian.net/wiki/spaces/PUB/pages/106889308/AI+Deep+learning+Community+of+Interest+CoI
    We also have a community workshop in Boston on 9 Oct:
    http://www.pistoiaalliance.org/eventdetails/pistoia-alliance-centre-of-excellence-for-ai-ml-in-life-sciences-workshop/

    The CoE is looking for ideas on common datasets that could be useful, not just small-molecule data but imaging data as well, plus thoughts on how best this data can be shared publicly from pharma.
    Suggestions welcome.
