
In Silico

AI, Machine Learning and the Pandemic

It’s not surprising that there have been many intersections of artificial intelligence and machine learning with the current coronavirus epidemic. AI and ML are very hot topics indeed, not least because they hold out the promise of sudden insights that would be hard to obtain by normal means. Sounds like something we’re in need of in the current situation, doesn’t it? So there have been reports of using these techniques to repurpose known drugs, to sort through virtual compound libraries and to generate new structures, to try to optimize treatment regimes, to recommend antigen types for vaccine development, and no doubt many more.

I’ve been asked many times over the last few months what I think about all this, and I’ve written about some of this. And I’ve also written about AI and machine learning in general, and quite a few times. But let me summarize and add a few more thoughts here.

The biggest point to remember, when talking about AI/ML and drug discovery, is that these techniques will not help you if you have a big problem with insufficient information. They don’t make something from nothing. Instead, they sort through huge piles of Somethings in ways that you don’t have the resources or patience to do yourself. That means (first) that you must be very careful about what you feed these computational techniques at the start, because “garbage in, garbage out” has never been more true than it is with machine learning. Indeed, data curation is a big part of every successful ML effort, for much the same reason that surface preparation is a big part of every successful paint job.
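To make that concrete, here's a minimal sketch (in Python, with invented compound IDs, invented assay values, and a made-up consistency threshold) of the sort of unglamorous cleaning that has to happen before any model ever sees the data:

```python
# Hypothetical assay records: duplicates and outright conflicts are both
# routine in aggregated bioactivity data.
records = [
    {"compound": "CHEMBL25", "ic50_nM": 120.0},
    {"compound": "CHEMBL25", "ic50_nM": 125.0},   # consistent replicate: average
    {"compound": "CHEMBL99", "ic50_nM": 40.0},
    {"compound": "CHEMBL99", "ic50_nM": 9000.0},  # conflicting replicates: discard
    {"compound": "CHEMBL42", "ic50_nM": 310.0},
]

def curate(records, max_fold_spread=3.0):
    """Group measurements by compound; average consistent replicates and
    drop compounds whose replicates disagree by more than max_fold_spread."""
    by_compound = {}
    for r in records:
        by_compound.setdefault(r["compound"], []).append(r["ic50_nM"])
    clean = {}
    for cmpd, values in by_compound.items():
        if max(values) / min(values) <= max_fold_spread:
            clean[cmpd] = sum(values) / len(values)
    return clean

clean = curate(records)
# CHEMBL99 is dropped: its two measurements disagree by over 200-fold.
```

Real curation is far more involved than this – unit mismatches, incompatible assay formats, stereochemistry errors – but the shape is the same: the model only ever sees what survives this step.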

And second, it means that there is a limit on what you can squeeze out of the information you have. What if you’ve curated everything carefully, and the pile of reliable data still isn’t big enough? That’s our constant problem in drug research. There are just a lot of things that we don’t know, and sometimes we are destined to find out about them very painfully and expensively. Look at that oft-quoted 90% failure rate across clinical trials: is that happening because people are lazy and stupid and enjoy shoveling cash into piles and lighting it on fire? Not quite: it’s generally because we keep running into things that we didn’t know about. Whoops, turns out Protein XYZ is not as important as we thought in Disease ABC – the patients don’t really get much better. Or whoops, turns out that drugs that target the Protein XYZ pathway also target other things that we had never seen before and that cause toxic effects, and the patients actually get worse. No one would stumble into things like that on purpose. Sometimes, in hindsight, we can see how such things might have been avoided, but often enough it’s just One of Those Things, and we add a bit more knowledge to the pile, at great expense.

So when I get asked about things like GPT3, which has been getting an awful lot of press in recent months, that’s my first thought. GPT3 handles textual information and looks for patterns and fill-in-the-blank opportunities, and for human language applications we have the advantage of being able to feed gigantic amounts of such text into it. Now, not all of that text might be full of accurate information, but it was all written with human purpose and some level of intelligence, and with intent to convey information to its readers, and man, does that ever count for a lot. Compare that to the data we get from scientific observation, which comes straight from the source, as it were, without the benefit of having been run through human brains first. As I’ve pointed out before, for example, a processing chip or a huge pile of software code may appear dauntingly complex, but they were both designed by humans, and other humans therefore have a huge advantage when it comes to understanding them. Now look at the physical wiring of neurons in a human brain – hell, look at the wiring in the brain of a fruit fly – or the biochemical pathways involved in gene transcription, or the cellular landscape of the human immune system. They’re different, fundamentally different, because a billion years of evolutionary tinkering will give you wondrously strange things that are under no constraints to be understandable to anything.

GPT3 can be made to do all sorts of fascinating things, if you can find a way to translate your data into something like text. It’s the same way that we try to turn text into vector representations for other computational purposes; you transform your material (if you can) into something that’s best suited for the tools you have at hand. A surprising number of things can be text-ified, and we have yet another advantage in that this process has already been useful for other purposes besides modern-day machine learning. Here, for example, is an earlier version of the program (GPT2) being used on text representations of folk songs, in order to rearrange them into new folk songs (I suspect that it would be even easier to generate college football fight songs, but perhaps there’s not as much demand for those). You can turn images into long text strings, too, and turn the framework loose on them, with interesting results.
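As a toy illustration of that text-ification idea (not from any particular paper – the SMILES strings below are real molecules, but the deliberately simple character-level tokenizer is my own invention for the example):

```python
# Molecules written as SMILES strings are already text, so they can be
# tokenized the same way a language model tokenizes prose.
# These are ethanol, benzene, and aspirin.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

# Build a character-level vocabulary from the corpus.
vocab = sorted({ch for s in smiles for ch in s})
stoi = {ch: i for i, ch in enumerate(vocab)}

def encode(s):
    """Map a SMILES string to a sequence of integer token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map token ids back to a SMILES string."""
    return "".join(vocab[i] for i in ids)

ids = encode("CCO")
assert decode(ids) == "CCO"  # round-trips losslessly
```

Once your molecules are integer sequences, the same machinery that models prose can be pointed at them; how much chemistry it actually learns from that is, of course, the open question.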

But what happens if you feed a pile of (say) DNA sequence information into GPT3? Will it spit out plausible gene sequences for interesting new kinase enzymes or microtubule-associated proteins? I doubt it. In fact, I doubt it a lot, but I would be very happy to hear about anyone who’s tried it. Human writing, images that humans find useful or interesting, and human music already have our fingerprints all over them, but genomic sequences, well. . .they have a funkiness that is all their own. There are things that I’m sure the program could pick out, but I’d like to know how far that extends.

And even if it really gets into sequences, it’ll hit a wall pretty fast. There’s a lot more to a single living cell than its gene sequence; that’s one lesson that we have had – should have had – beaten into our heads over and over. Now consider how much more there is to an entire living organism. I’m all for shoveling in DNA sequences, RNA sequences, protein sequences, three-dimensional protein structures, everything else that we can push in through the textual formatting slot, to see what the technology can make of it. But again, that’s only going to take you so far. There are feedback loops, networks of signaling, constantly shifting concentrations and constantly shifting spatial arrangements inside every cell, every tissue, every creature that are all interconnected in ways that, let’s state again, we have not figured out. There are no doubt important things that can be wrung out of the (still massive) amount of information that we have, and I’m all for finding them. But if you revved up the time machine and sent a bunch of GPT-running hardware (or any other system) back to 1975 (or 2005, for that matter), it would not have predicted the things about cell biology and disease that we’ve discovered since then. Those things, with few exceptions, weren’t latent in the data we had then. We needed more. We still do.

Apply this to the coronavirus pandemic, and the problems become obvious. We don’t know what levels of antibodies (or T cells) are protective, how long such protection might last, and how it might vary among cohorts and individuals. We have been discovering major things about transmissibility by painful experience. We have no good idea about why some people become much sicker than others (once you get past a few major risk factors, age being the main one), or why some organ systems get hit in some patients and not in others. And so very much on – these are limits of our knowledge, and no AI platform will fill those in for us.

From what I understand, the GPT3 architecture might already be near its limits, anyway (update: more, from the comments). But there will be more ML programs and better ones, that’s for sure. Google, for example, has just published a very interesting paper which is all about using machine learning to improve machine learning algorithms. More on this here. I suspect that I am not the only old science-fiction fan who thought of this passage from William Gibson’s Neuromancer on reading this:

“Autonomy, that’s the bugaboo, where your AI’s are concerned. My guess, Case, you’re going in there to cut the hard-wired shackles that keep this baby from getting any smarter. And I can’t see how you’d distinguish, say, between a move the parent company makes, and some move the AI makes on its own, so that’s maybe where the confusion comes in.” Again the non laugh. “See, those things, they can work real hard, buy themselves time to write cookbooks or whatever, but the minute, I mean the nanosecond, that one starts figuring out ways to make itself smarter, Turing’ll wipe it. . .Every AI ever built has an electromagnetic shotgun wired to its forehead.”

We’re a long way from the world of Neuromancer –  probably a good thing, too, considering how the AIs behave in it. The best programs that we are going to be making might be able to discern shapes and open patches in the data we give them, and infer that there must be something important there that is worth investigating, or be able to say “If there were a connection between X and Y here, everything would make a lot more sense – maybe see if there’s one we don’t know about”. I’ll be very happy if we can get that far. We aren’t there now.


38 comments on “AI, Machine Learning and the Pandemic”

  1. Ken says:

    “We can use classification algorithms to identify existing drugs that will treat COVID! All we need is the right input data.”

    “Great! What data do you need?”

    “A list of all existing drugs, each tagged with whether or not it treats COVID.”

    1. Some Dude says:

      Wouldn’t you suppose that the in vitro inhibitory efficacy work on a protease with 95% identity to SARS-CoV-2’s over the past fifteen years would generate the list of ‘tagged’ drugs of which you speak? To name a few,

      Park, Ji-Young et al. “Dieckol, a SARS-CoV 3CLpro inhibitor, isolated from the edible brown algae Ecklonia cava”. Bioorg Med Chem. 2013 Jul 1; 21(13): 3730–3737. Apr 2013. doi: 10.1016/j.bmc.2013.04.026

      Jo, Seri et al. “Inhibition of SARS-CoV 3CL protease by flavonoids”, J Enzyme Inhib Med Chem. 2020; 35(1): 145–151. Nov 2019. doi: 10.1080/14756366.2019.1690480

      Ryu et al. “Biflavonoids from Torreya nucifera displaying SARS-CoV 3CLpro inhibition”, Bioorganic & Medicinal Chemistry Volume 18, Issue 22, 15 November 2010, Pages 7940-7947.

      Nguyen Thi Thanh Hanh et al. “Flavonoid-mediated inhibition of SARS coronavirus 3C-like protease expressed in Pichia pastoris”, Biotechnol Lett. 2012; 34(5): 831–838. Feb 2012.

      Lin, Tsai, Tsai et al. (China Medical University) “Anti-SARS coronavirus 3C-like protease effects of Isatis indigotica root and plant-derived phenolic compounds”, Antiviral Research, Volume 68, Issue 1, October 2005, Pages 36-42.

      Or, as fashioned together into the type of ML classification algorithm to which you allude,

      1. Not so fast says:

        And this is precisely the problem with the approach. A big long list of flavonoids that happen to stop your target working doesn’t get you much (any?) closer to an actual drug that works in actual people.

        1. Hi Not So Fast, as stated it looks like you’ve framed a tautology. If you can thread your logic in more detail then it might have a chance to carry some weight.

          1. Betty Swollocks says:

            The gist of what Not So Fast is saying boils down to “garbage in, garbage out”. Just reading the titles of these articles sends shivers down my spine. If I had a nickel for every time someone published an article reporting flavonoid X as inhibitor of Target Y, I’d have been able to afford to retire 20 years ago.

          2. @Betty (below – reply is unavailable several nested levels in)

            Let’s see, on the one hand we have 4 academically cited researchers presenting evidence asserting “X”, and on the other we have a couple anonymous commenters applying a non-sequitur argument asserting “not-X” based on unrelated factors Y, Z, etc.

            Yeah I’ll be sticking with the academics on this one.

  2. Robin Taylor says:

    A far better explanation than I could ever have written of why the problems that AI/ML have solved and are greeted with such excitement are child’s play compared to drug discovery. Thank you.

  3. Derek Jones says:

    A more detailed take on why GPT3 is the end of the road for a particular approach, not the start of a new one (TL;DR: they took an existing approach and made everything bigger; no new ideas):

    1. acromantula says:

      The reason they keep making it bigger is simply that the performance just keeps getting better. There are countless ways to improve the architecture, but the point of the GPT series is to explore big models in particular.

  4. COPD says:

    The use of an inhaled formulation of interferon beta has shown an encouraging effect in a phase 2 trial: patients who received S-N-G-001 had a 79% lower risk of developing severe disease compared to placebo. The data are not peer-reviewed and the trial was carried out on only 100 patients, but this could be a viable treatment option.

  5. COPD says:

    …but then the phase 2 data lacks a primary end-point. A useful combo with remdesivir?

  6. David Young MD says:

    A good candidate for the front of a T-shirt:

    “Machine Learning is great for drug discovery unless it is a big problem and you don’t have enough information.

    It’s always a big problem and you never have enough information”

  7. RTW says:

    If anyone is interested in data analytics in the Covid-19 space, have a look at the following resource my company has made available. These are some high-powered tools, some with AI and ML behind the scenes. It’s mostly built around TIBCO Spotfire, but it showcases some of the things we are doing with our partners and the applications we build for the pharmaceutical industry.

    This takes you to a registration page. It’s free to access the materials on Covid-19; please read the terms and conditions. Have a look around and enjoy!

  8. Alan Goldhammer says:

    It is always worth re-reading William Gibson.

  9. Marko says:

    This open-source site that publishes projections of COVID-19 dynamics is based on machine learning and has been pretty good, IMO:

    No drugs discovered yet, though…

    1. Daren Austin says:

      And it is a lovely piece of work that builds on an established epidemic model to apply AI in a small but appropriate way. It has consistently provided the most accurate US projections of deaths based on very limited inputs (deaths).

  10. r66 says:

    Well, I am in the field, and every new data processing method starts with a huge hype and then ends in the same dilemma: to make it work, we need better data curation.

  11. enl says:

    The thing with machine learning is that not only does it work only when there is enough data, but in every application to this point where it does work, it is no better than humans, and often worse. The benefit is that it is automatic.

    Wait! What do you mean, no better than humans? What about (for example) image classification task X? The magic is in the preprocessing. Preprocess the image properly (contrast enhancement, color filtering, edge enhancement, and so on), which can be done automatically, and a human will do as well or better. This may not be true n years from now, but it still is today.

    There are absolutely things the various AI tools can do to help in many fields, including drug discovery, but divining data from the aether is not one of them, unfortunately, and never will be.

    1. mikeb says:

      AI convincingly outperforms humans in a number of tasks. The game of Go is one example. In the words of Darth Vader, “The student has become the master.” It was widely believed that machine translation would never match human translation, but it has. And a recent paper from researchers at the Broad uses machine learning to find antibiotics that are structurally distinct from those existing. One of those proved efficacious in mice.
      Technologies can definitely run into limitations, but AI is still expanding the range of problems it can solve.

      1. Stats matter says:

        @mike b: The Broad paper did “use machine learning to find antibiotics that are structurally distinct from those existing”. However its success rate at doing so was actually lower than in the random control sample reported in the same paper! Read the small print before believing the hype.

  12. Peter Kenny says:

    AI/ML evangelists may wish to consider two questions. How can AI/ML methods help determine what essential information is missing? How can AI/ML methods help generate the missing information as efficiently as possible?

    1. new says:

      I was told by a student of Krebs that he was fond of the saying “If all the data was right, anyone could get the right answer. To get the right answer when a third of the data is wrong, a third is correct but without context, and a third is right… THAT takes genius.”
      Begs the question whether AI could be used to help sift the good data from the bad data.
      BTW… I’d trademark AIvangelist if I were you.

  13. PatAtt says:

    “We have no good idea about why some people become much sicker than others (once you get past a few major risk factors, age being the main one), or why some organ systems get hit in some patients and not in others. And so very much on – these are limits of our knowledge, and no AI platform will fill those in for us.”

    In my opinion this area is precisely where ML can be of value. There are thousands of possible factors or combinations of factors which could be relevant to disease outcome (genetic makeup, blood type, vaccines received, exposure to other viruses, international travel, diet, exercise, etc.). Applying deep learning techniques to data can identify possible correlations between these traits and disease outcomes that can then be studied in a controlled manner.

    But I have not seen any work of this type going on. It does not appear that this data is being gathered globally.

    1. John SM says:

      The problem is we don’t have ALL of that data together to learn on.

    2. Adrian says:

      You do not need data globally for that.

      Some data is already available.
      In some countries/regions you have databases with prescriptions for many people, as well as the base data like age and gender.
      If this is an area with a large number of COVID-19 past cases like New York and you are able to combine this with dead/icu/hospital/mild/unknown data for COVID-19, then you can see what medications correlate with bad outcomes and what medications correlate with good outcomes.

      Some data can be generated.
      I remember reading about attempts to constantly measure hundreds of parameters from some people hospitalized with COVID-19, and then let ML try to find from the data what predicts whether or not (and when) they will have to go into the ICU.

      AI will not magically find results out of nothing, but you can generate a suitable haystack of data and use ML for trying to find needles inside.
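That needle-in-a-haystack idea can be sketched in a few lines of Python. Everything below is made up – hypothetical traits, invented outcomes, and a deliberately crude risk-ratio statistic standing in for the proper survival analysis and multiple-testing correction a real study would need:

```python
# Made-up per-patient binary traits and a severe-disease flag; the task is
# to rank traits by how strongly they co-occur with bad outcomes.
patients = [
    {"traits": {"hypertension": 1, "blood_type_A": 0}, "severe": 1},
    {"traits": {"hypertension": 1, "blood_type_A": 1}, "severe": 1},
    {"traits": {"hypertension": 0, "blood_type_A": 1}, "severe": 0},
    {"traits": {"hypertension": 0, "blood_type_A": 0}, "severe": 0},
]

def trait_risk_ratios(patients):
    """Crude risk ratio per trait: P(severe | trait) / P(severe | no trait)."""
    ratios = {}
    for t in patients[0]["traits"]:
        with_t = [p["severe"] for p in patients if p["traits"][t]]
        without = [p["severe"] for p in patients if not p["traits"][t]]
        p_with = sum(with_t) / len(with_t)
        p_without = sum(without) / len(without)
        ratios[t] = p_with / p_without if p_without else float("inf")
    return ratios

ratios = trait_risk_ratios(patients)
# In this toy data, hypertension perfectly separates outcomes while
# blood type A shows no association (ratio of 1.0).
```

The point of the sketch is only the shape of the computation: candidate factors ranked by association, which then become hypotheses to test in a controlled way, not conclusions.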

  14. gwern says:

    “But what happens if you feed a pile of (say) DNA sequence information into GPT3? Will it spit out plausible gene sequences for interesting new kinase enzymes or microtubule-associated proteins?”

    People are definitely working on it. Obsolete Transformers like GPT are no good because of their tiny context windows (although that also makes them quite a flex), but there’s lots of newer Transformers which can handle very long sequences and are more appropriate for modeling DNA or amino acid sequences. This is not my area at all, but some papers I’ve noticed recently: (AlphaFold is different and has been discussed here before.)

    1. Derek Lowe says:

      Thanks! Good to see you show up here, since I know you follow this field closely.

  15. Not a Doctor says:

    Necessary plug for Janelle Shane, who has an excellent blog pointing out exactly how GPT-3 responds to real data, curated for funny instead of for insightful.

  16. TallDave says:

    suspect AIs will eventually look more like Wright’s in the Golden Age, primarily friendly and painfully moral … recursively increasing self-awareness has logical consequences that seem hard to avoid

    of course we’re nowhere near that but hey, at least we’re getting better at simulating folding proteins

    three weeks until LY-CoV555 study estimated completion date

    1. LdaQuirm says:

      1) Why do you think intelligence requires self-awareness?
      2) The idea of instrumental convergence implies that a strong AI would tend toward self-serving goals (like obtaining power and money, and securing its own continued existence) regardless of its programmed goals, rather than tending toward ‘human-like’ empathy. (Such pro-social behavior was selected for in humans by the genetics of the group, but there is no such selection pressure for a lone AI.)
      Note: I don’t think that human-aligned/benign strong AI is impossible, I just don’t think it is inevitable.

  17. M says:

    A great quote about AI that I heard from my brother: “You can’t have good AI without good IA”.

    For the uninitiated, IA is Information Architecture, something that is in many cases a very hard problem to solve in and of itself, and when not dealt with properly, leads to many of the “garbage in, garbage out” computational outcomes.

  18. Assistant Proff anon 1 says:

    Medical doctors dominate the coronavirus task force. It was a running joke in my PhD that doctors don’t get it… they are obsessed with stats and book learning but don’t really have a handle on anything but money. Can we get a couple of straight PhDs on the task force?

    1. Not a statistician says:

      Are they? My impression has always been that medical doctors are painfully ignorant of legitimate statistical analysis, since their exposure usually consists of whatever they took in undergrad plus maybe three weeks during med school. How many papers by MDs have you seen that rely on numbers of case reports, exhaustive multiple hypothesis testing, post-hoc analysis, crudely aggregating data into “meta-analyses”, etc?

  19. GPR says:

    The typical reader of this blog already knows the following, but it is worth stating explicitly: much of “what we know” in biology is actually wrong. This can be for a large variety of reasons, but the effect on humans and AI is the same: it makes it exceedingly difficult to discern patterns.

  20. Wnu says:

    Thanks for sharing the files. I found a lot of interesting information here.

  21. Picky, picky, picky says:

    “that’s one lesson that have had” -> “that’s one lesson that WE have had”?

