It’s not surprising that there have been many intersections of artificial intelligence and machine learning with the current coronavirus epidemic. AI and ML are very hot topics indeed, not least because they hold out the promise of sudden insights that would be hard to obtain by normal means. Sounds like something we’re in need of in the current situation, doesn’t it? So there have been reports of using these techniques to repurpose known drugs, to sort through virtual compound libraries and to generate new structures, to try to optimize treatment regimes, to recommend antigen types for vaccine development, and no doubt many more.
I’ve been asked many times over the last few months what I think about all this, and I’ve written about some of this. And I’ve also written about AI and machine learning in general, and quite a few times. But let me summarize and add a few more thoughts here.
The biggest point to remember, when talking about AI/ML and drug discovery, is that these techniques will not help you if you have a big problem with insufficient information. They don’t make something from nothing. Instead, they sort through huge piles of Somethings in ways that you don’t have the resources or patience to do yourself. That means (first) that you must be very careful about what you feed these computational techniques at the start, because “garbage in, garbage out” has never been more true than it is with machine learning. Indeed, data curation is a big part of every successful ML effort, for much the same reason that surface preparation is a big part of every successful paint job.
And second, it means that there is a limit on what you can squeeze out of the information you have. What if you’ve curated everything carefully, and the pile of reliable data still isn’t big enough? That’s our constant problem in drug research. There are just a lot of things that we don’t know, and sometimes we are destined to find out about them very painfully and expensively. Look at that oft-quoted 90% failure rate across clinical trials: is that happening because people are lazy and stupid and enjoy shoveling cash into piles and lighting it on fire? Not quite: it’s generally because we keep running into things that we didn’t know about. Whoops, turns out Protein XYZ is not as important as we thought in Disease ABC – the patients don’t really get much better. Or whoops, turns out that drugs that target the Protein XYZ pathway also target other things that we had never seen before and that cause toxic effects, and the patients actually get worse. No one would stumble into things like that on purpose. Sometimes, in hindsight, we can see how such things might have been avoided, but often enough it’s just One of Those Things, and we add a bit more knowledge to the pile, at great expense.
So when I get asked about things like GPT3, which has been getting an awful lot of press in recent months, that’s my first thought. GPT3 handles textual information and looks for patterns and fill-in-the-blank opportunities, and for human language applications we have the advantage of being able to feed gigantic amounts of such text into it. Now, not all of that text might be full of accurate information, but it was all written with human purpose and some level of intelligence, and with intent to convey information to its readers, and man, does that ever count for a lot. Compare that to the data we get from scientific observation, which comes straight from the source, as it were, without the benefit of having been run through human brains first. As I’ve pointed out before, for example, a processing chip or a huge pile of software code may appear dauntingly complex, but they were both designed by humans, and other humans therefore have a huge advantage when it comes to understanding them. Now look at the physical wiring of neurons in a human brain – hell, look at the wiring in the brain of a fruit fly – or the biochemical pathways involved in gene transcription, or the cellular landscape of the human immune system. They’re different, fundamentally different, because a billion years of evolutionary tinkering will give you wondrously strange things that are under no constraints to be understandable to anything.
GPT3 can be made to do all sorts of fascinating things, if you can find a way to translate your data into something like text. It’s the same way that we try to turn text into vector representations for other computational purposes; you transform your material (if you can) into something that’s best suited for the tools you have at hand. A surprising number of things can be text-ified, and we have yet another advantage in that this process has already been useful for other purposes besides modern-day machine learning. Here, for example, is an earlier version of the program (GPT2) being used on text representations of folk songs, in order to rearrange them into new folk songs (I suspect that it would be even easier to generate college football fight songs, but perhaps there’s not as much demand for those). You can turn images into long text strings, too, and turn the framework loose on them, with interesting results.
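To make the "text-ify everything" idea concrete, here is a minimal sketch of one way an image could be turned into a character string that a language model could chew on: quantize each pixel into a small alphabet of symbols, one character per pixel, with a separator between rows. All of the names and the specific encoding here are illustrative assumptions, not any real library’s API.

```python
# Illustrative sketch: "text-ifying" a tiny grayscale image so that a
# text-based model could consume it. The alphabet, separator, and function
# names are made up for this example.

ALPHABET = " .:-=+*#%@"  # 10 brightness buckets, darkest to lightest


def image_to_text(pixels):
    """Map each pixel (0-255) to one character; rows are joined with '|'."""
    rows = []
    for row in pixels:
        rows.append("".join(
            ALPHABET[min(p, 255) * len(ALPHABET) // 256] for p in row
        ))
    return "|".join(rows)


def text_to_image(text):
    """Approximately invert the mapping, using each bucket's midpoint."""
    bucket = 256 // len(ALPHABET)
    return [[ALPHABET.index(c) * bucket + bucket // 2 for c in row]
            for row in text.split("|")]


img = [[0, 128, 255],
       [255, 128, 0]]
encoded = image_to_text(img)   # " +@|@+ " for this tiny example
```

The round trip is lossy (ten brightness levels instead of 256), which is the usual trade-off: the textual representation has to be coarse enough to give the model a manageable vocabulary.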
But what happens if you feed a pile of (say) DNA sequence information into GPT3? Will it spit out plausible gene sequences for interesting new kinase enzymes or microtubule-associated proteins? I doubt it. In fact, I doubt it a lot, but I would be very happy to hear about anyone who’s tried it. Human writing, images that humans find useful or interesting, and human music already have our fingerprints all over them, but genomic sequences, well. . .they have a funkiness that is all their own. There are things that I’m sure the program could pick out, but I’d like to know how far that extends.
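For anyone who does want to try it, the standard trick for making genomic data look text-like is to split a sequence into overlapping k-mers, which then play the role of "words" in the model’s vocabulary. A minimal sketch, with an illustrative function name rather than anything from a real bioinformatics library:

```python
# Hedged sketch: turning a DNA sequence into overlapping k-mer "words"
# so a text model could be trained on it. Illustrative code only.

def kmer_tokenize(seq, k=3):
    """Return the overlapping k-mers of seq, e.g. 'ATGC' -> ['ATG', 'TGC']."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokenize("ATGGCGTAA")
# With a 4-letter alphabet, the vocabulary is at most 4**k tokens,
# so k controls the trade-off between context and vocabulary size.
```

Whether a model trained on such tokens would pick up anything beyond codon statistics and local motifs is exactly the open question above.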
And even if it really gets into sequences, it’ll hit a wall pretty fast. There’s a lot more to a single living cell than its gene sequence; that’s one lesson that we have had – should have had – beaten into our heads over and over. Now consider how much more there is to an entire living organism. I’m all for shoveling in DNA sequences, RNA sequences, protein sequences, three-dimensional protein structures, everything else that we can push in through the textual formatting slot, to see what the technology can make of it. But again, that’s only going to take you so far. There are feedback loops, networks of signaling, constantly shifting concentrations and constantly shifting spatial arrangements inside every cell, every tissue, every creature that are all interconnected in ways that, let’s state again, we have not figured out. There are no doubt important things that can be wrung out of the (still massive) amount of information that we have, and I’m all for finding them. But if you revved up the time machine and sent a bunch of GPT-running hardware (or any other such system) back to 1975 (or 2005, for that matter), it would not have predicted the things about cell biology and disease that we’ve discovered since then. Those things, with few exceptions, weren’t latent in the data we had then. We needed more. We still do.
Apply this to the coronavirus pandemic, and the problems become obvious. We don’t know what levels of antibodies (or T cells) are protective, how long such protection might last, and how it might vary among cohorts and individuals. We have been discovering major things about transmissibility by painful experience. We have no good idea about why some people become much sicker than others (once you get past a few major risk factors, age being the main one), or why some organ systems get hit in some patients and not in others. And so very much on – these are limits of our knowledge, and no AI platform will fill those in for us.
From what I understand, the GPT3 architecture might already be near its limits, anyway (update: more, from the comments). But there will be more ML programs and better ones, that’s for sure. Google, for example, has just published a very interesting paper which is all about using machine learning to improve machine learning algorithms. More on this here. I suspect that I am not the only old science-fiction fan who thought of this passage from William Gibson’s Neuromancer on reading this:
“Autonomy, that’s the bugaboo, where your AI’s are concerned. My guess, Case, you’re going in there to cut the hard-wired shackles that keep this baby from getting any smarter. And I can’t see how you’d distinguish, say, between a move the parent company makes, and some move the AI makes on its own, so that’s maybe where the confusion comes in.” Again the non laugh. “See, those things, they can work real hard, buy themselves time to write cookbooks or whatever, but the minute, I mean the nanosecond, that one starts figuring out ways to make itself smarter, Turing’ll wipe it. . .Every AI ever built has an electromagnetic shotgun wired to its forehead.”
We’re a long way from the world of Neuromancer – probably a good thing, too, considering how the AIs behave in it. The best programs that we are going to be making might be able to discern shapes and open patches in the data we give them, and infer that there must be something important there that is worth investigating, or be able to say “If there were a connection between X and Y here, everything would make a lot more sense – maybe see if there’s one we don’t know about”. I’ll be very happy if we can get that far. We aren’t there now.