
In Silico

Machine-Mining the Literature

We’ve made it to the point – a while back, actually – where people who actually know the subject roll their eyes a bit when the term “artificial intelligence” is used without some acknowledgment that it’s not very useful. I think that’s a real sign that it’s becoming useful. Things are to the point where you have to say, implicitly, “You know that phrase that everyone’s used for years? That’s never really been defined but that headline writers like? Well, this is probably what people had in mind, and it’s actually about to be worth something, but we really should have better words for it.”

“Machine learning” was an attempt at those better words, but I fear that one is heading down the same water slide, although “AI” does have a substantial head start. This interesting new paper from a group at LBNL avoids either one in its title, and uses “machine learning” once in the abstract, sparingly thereafter, and “artificial intelligence” not at all. But it is what people have in mind when they talk about those things. The big topic is how you get data into these things, particularly how you get it into them in a form where you can hope to get anything useful back out.

As with the old line about how armchair military buffs talk strategy and tactics while professionals talk logistics, professionals in this field tend to devote a lot of time to data curation and preparation. That’s partly because the real-world data we would like to use are often in rather shaggy piles, and also because even the best machine-learning techniques tend to be a bit finicky and brittle compared to what you’d actually want. We’re used to that with internal combustion engines: diesel fuel, ethanol, gasoline, and jet fuel are not interchangeable in most situations, and so it is with engines of knowledge. They are tuned up for specific types of input, and will stall if fed something else. To use a different analogy, data curation is very much akin to the advice that you should spend more time preparing a surface for a good paint job than you do applying the actual paint. In almost every case, you will spend more time getting your data into shape for machine learning than the actual computations will take.

The paper under discussion notes this, and also notes that a lot of the information out there is (1) not in very structured formats and (2) not even numerical. This has been clear for a long time; thus the interest in “natural language processing”. Can ML algorithms sort through the formats that we humans tend to use to communicate with each other? Words, sentences, paragraphs, journal articles, slide decks, chapter headings, reference lists, bibliographies, conference proceedings, abstracts, patent claims, reports and summaries? These things often have numerical data stuck to them, but look where it goes: into the appendices or the supplementary files. When people are communicating to people, we use words and pictures – we try to summarize the numbers in graphical form with captions, rather than just plop a big spreadsheet up on the screen or scroll through a long table of formatted ten-digit numbers. That’s what you feed the software; we humans are not built for it, although you do attend talks where the speaker doesn’t quite seem to have realized that.

Dealing usefully with natural language has been a general information-sciences goal for a long, long time (think machine translation, voice-activated controls, automated customer service, and so on). And we really have been getting better at it, although we’re still well short of our science-fiction-movie goals. Most of the attempts at extracting information from publications have needed significant human supervision – think of the old joke about the machine translation program parsing “Out of sight, out of mind” as “Invisible lunatic”. But this new paper is trying to move beyond that, and here’s the eye-catching part of the abstract:

Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. 

That vector-representation trick has been around a while, in various forms, and it is pretty neat. It can allow you to slip right up into the world of mathematics, with all the available tools. A series of 2013 papers from a team at Google on the “Word2Vec” technique (here’s one) really set off a lot of work in the field (here’s an intro, and here’s another), and this paper builds on that work. The idea is that you represent a word as a multidimensional vector (as many “directions” as you like) in an effort to encode its meaning and definition.

If you set the length of those vectors arbitrarily from 0 to 1, the word “cat” would max out to 1.0 on the “mammal” and “carnivore” vectors, and it would have a pretty high setting along the “furry” vector (still, there are those weird-looking sphinx cats) and on the “tail” vector (still, there are Manx cats). It would have a strong partial score along the “pet” vector, because cats are an important class of pets (just ask them), but not all cats are pets (just ask a bobcat, who lowers the score for the “tail” vector as well). “Cat” would zero out on many others, of course: “metallic”, for example, although based on the photo at right I would argue for a bit of length for the “liquid” vector.
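To make that picture concrete, here’s a toy sketch in Python. The axes and numbers are pure illustration, hand-set by me rather than learned; the dimensions of a real embedding have no tidy human-readable labels like these. The point is just that once words are vectors, similarity becomes a cosine calculation:

```python
import numpy as np

# Hand-set toy vectors along made-up, interpretable axes
# (mammal, carnivore, furry, pet, metallic). Purely illustrative.
words = {
    "cat":    np.array([1.0, 1.0, 0.9, 0.7, 0.0]),
    "dog":    np.array([1.0, 1.0, 0.9, 0.9, 0.0]),
    "teapot": np.array([0.0, 0.0, 0.0, 0.0, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(words["cat"], words["dog"]))     # near 1: very similar words
print(cosine(words["cat"], words["teapot"]))  # 0: nothing in common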

That’s what this team has done for the materials science literature. Importantly, you don’t have to do all these vector evaluations by hand – that’s what the algorithms inside things like Word2vec and GloVe do for you. In this case, the group turned Word2vec loose on about 3.3 million abstracts in the materials science literature from 1922 to 2018, which led to a vocabulary of about 500,000 words. There are details about how to handle phrases (such as “resistance of manganese alloys”), how many vector dimensions are used (200, in this case), what training algorithm is used and how many cycles it’s run for with various cutoffs, etc. In the loose spirit of the “skip-gram” technique used here, I’m going to skip over those and refer folks who want the nitty-gritty to the actual paper. We will skip right to the results:

The abstracts delivered a pretty robust set of embeddings (vector representations) which allow for real vector operations. The example given:

For instance, ‘NiFe’ is to ‘ferromagnetic’ as ‘IrMn’ is to ‘?’, where the most appropriate response is ‘antiferromagnetic’. Such analogies are expressed and solved in the Word2vec model by finding the nearest word to the result of subtraction and addition operations between the embeddings.
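That subtract-and-add trick can be sketched in a few lines of numpy. The four-dimensional vectors below are hand-made stand-ins (the paper’s embeddings are 200-dimensional and learned from the abstracts), deliberately constructed so the analogy works out:

```python
import numpy as np

# Hypothetical toy embeddings -- not the paper's actual vectors.
vecs = {
    "NiFe":              np.array([1.0, 0.0, 1.0, 0.0]),
    "ferromagnetic":     np.array([1.0, 0.0, 0.0, 0.0]),
    "IrMn":              np.array([0.0, 1.0, 1.0, 0.0]),
    "antiferromagnetic": np.array([0.0, 1.0, 0.0, 0.0]),
    "thermoelectric":    np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by finding the word whose
    embedding is nearest to b - a + c (the query words excluded)."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("NiFe", "ferromagnetic", "IrMn"))  # antiferromagnetic
```

With real learned embeddings the arithmetic is the same; the interesting part is that the training procedure produces vectors for which it works.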

I’d guess that one reason things worked out so well is that many of the “words” in the abstracts were empirical formulae, which encode a lot of information by themselves. And what’s kind of startling is that when you project the embeddings of the various element symbols (pulled right out of the literature abstracts) into two dimensions, the result is robust enough to largely recapitulate the periodic table. A number of other real-world properties emerge from the embeddings as well, such as crystal symmetries.

But interesting as that is, it’s just demonstrating that we can pull out things that we already knew. The team also noticed, though, that if you looked for “cosine similarities” in vector space between (say) empirical formulae and a word like “thermoelectric”, you did indeed find compounds whose thermoelectric properties had been noted as interesting. But you also see such cosine similarities for compounds that do not actually show up in the same abstracts, anywhere in the database, as the word “thermoelectric” (or any other words that would clearly identify them as thermoelectric materials). Examining these by DFT (density functional theory) and by experimental data from the full-text literature (not found in the abstracts used to produce the vectors) confirmed a good correlation with real-world evidence.
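The ranking step here is nothing more exotic than sorting candidate formulae by cosine similarity to the “thermoelectric” embedding. A minimal sketch, with hypothetical three-dimensional vectors standing in for the real learned ones:

```python
import numpy as np

# Hypothetical embeddings; in the paper these come from Word2vec
# trained on ~3.3 million materials-science abstracts.
thermoelectric = np.array([0.9, 0.1, 0.4])
candidates = {
    "CuGaTe2":    np.array([0.8, 0.2, 0.5]),
    "NaCl":       np.array([0.0, 1.0, 0.1]),
    "CsAgGa2Se4": np.array([0.7, 0.1, 0.6]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every candidate formula by similarity to the target property word.
ranked = sorted(candidates,
                key=lambda f: cosine(candidates[f], thermoelectric),
                reverse=True)
print(ranked)  # CuGaTe2 first, NaCl last, with these toy numbers
```

The crucial point from the paper is that a formula can land high on this list without ever co-occurring with “thermoelectric” in any abstract.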

The paper then slices the literature up at 18 different cutoff years between 2001 and 2018, keeping only the abstracts published before each cutoff, and uses each of these subsets to try to predict the most likely new thermoelectric materials that would show up in the later literature. They found that materials in the top 50 embedding predictions were about eight times more likely to have shown up as studied thermoelectrics within the next five years of publications than some random material. Even when you restrict things to materials that have a non-zero bandgap by density functional theory (the first price of admission to this property), the embedding-based predictions were still three times more likely to show up than a random material from that list. For example, looking at abstracts from before 2009, the embeddings had CuGaTe2 (currently one of the best materials of its kind) as a top-five prediction, four years before it actually appeared in the thermoelectric literature. How is this possible? Here you go:

For instance, CsAgGa2Se4 has high likelihood of appearing next to ‘chalcogenide’, ‘band gap’, ‘optoelectronic’ and ‘photovoltaic applications’: many good thermoelectrics are chalcogenides, the existence of a bandgap is crucial for the majority of thermoelectrics, and there is a large overlap between optoelectronic, photovoltaic and thermoelectric materials (see Supplementary Information section S8). Consequently, the correlations between these keywords and CsAgGa2Se4 led to the prediction. This direct interpretability is a major advantage over many other machine learning methods for materials discovery. We also note that several predictions were found to exhibit promising properties despite not being in any well known thermoelectric material classes (see Supplementary Information section S10). This demonstrates that word embeddings go beyond trivial compositional or structural similarity and have the potential to unlock latent knowledge not directly accessible to human scientists.
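The “eight times more likely” figure above is just a ratio of hit rates: how often the top-ranked predictions pan out versus how often a randomly chosen candidate does. A sketch, with illustrative counts that I’ve made up rather than taken from the paper:

```python
def enrichment(hits_in_top_k, k, total_hits, total_candidates):
    """Ratio of the hit rate among the top-k predictions
    to the base hit rate across all candidates."""
    return (hits_in_top_k / k) / (total_hits / total_candidates)

# Hypothetical numbers: 8 of the top 50 predictions were later studied
# as thermoelectrics, versus 2% of the candidate pool overall.
print(enrichment(8, 50, 200, 10000))  # 8.0
```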

Now that’s machine learning, as far as I’m concerned. The authors are well aware, though, that they’re working in a comparatively orderly field (as opposed, say, to drug discovery!) and that by using paper abstracts they’ve been able to tap into a deliberately information-dense source of raw material. The natural extension would be to dive into full-text databases, but frankly that’s going to need better software than the current stuff – who knows, maybe even better hardware, by the time we’re through. But there are already further refinements to context-independent methods like Word2vec, ones that try to infer context and thus zoom in on the important things more quickly. That idea has its own pitfalls, naturally, but it does look like a promising way to go.

Extending this to the biomedical literature will be quite an effort – many will recall that this is just what one aspect of “Watson For Drug Discovery” was supposed to do (root through PubMed for new correlations). As I mentioned in that linked post, though, the failure of Watson (and some other well-hyped approaches, some of which are in the process of failing now, I believe) does not mean that the whole idea is a bust. It just means that it’s hard. And that people who are promising you that they’ve solved it and that you can get in on the ground floor if you’ll just pull out your wallet should be handled with caution. The paper today gives us a hint of what could be possible, eventually, after a lot of time, a lot of money, and a lot of (human) brainpower. Bring it on!

25 comments on “Machine-Mining the Literature”

  1. Grad Student says:

    Maybe not (yet) possible on biomedical discovery en masse, but could this be done with smaller subsets/areas where there has already been a lot of work done (ex. kinase inhibitors, predicting new classes of inhibitors or improving selectivity for a particular family or individual in a family)?

  2. Earl Boebert says:

    Re: terminology: Just call it “pattern recognition” and get back to work 🙂

    1. I’ll second that. Pattern recognition has served us well in the evolution of intelligence over the past 10 million years or so and deserves a name check.

  3. Philip says:

    Famous computer translation mistake.
    hydraulic ram –> water goat

    I hope this is not from the blog:

    So in the data curation process, make sure the computer does not have access to data that you do not want it to use. If you do not want a patient’s age to bias a finding, make sure the computer does not know the age of the patient. Same for sex, referring doctor or even the model of the equipment.

  4. Curious Wavefunction says:

    Can the ML algorithm work if it were provided with the literature as it stood in 1869?

  5. anon says:

    I wonder if we can use this to fish out fabricated results.

    1. Eric says:

      This raises an important point. The entire approach rests on the body of published abstracts used for training being reasonably accurate, and the results they describe more or less true. With the rise of predatory journals that will print anything at all, I’m concerned that the training set could become so polluted with noise that nothing meaningful could ever be predicted. Enough homeopathy papers and “water” will be given a high score on the “GABA agonist” axis, or whatever. Which means even good machine learning methods will require very careful human curation.

      1. Nile says:

        Eric – you have the answer, right there: curated training sets, and a parallel effort to automate the process of finding interesting findings, alongside the effort to automate the exclusion of garbage.

        And you *may* be right, if it turns out that the system cannot operate on the whole world of polluted and (mostly) genuine publications; but you will – fortunately – be partially right, as the operators will then curate the data fed to the live system, doing pretty much what you already do: not bothering with garbage journals and – probably – weighting the prestigious ones.

        1. Anon says:

          As if human curation will not be biased and add more noise of its own?

    2. LdaQuirm says:

      One should be able to, yes. As long as there is more correct data than incorrect.
Train the model, then run it over each abstract again individually, and sum up how much the generated vectors agree or disagree with the internal model. That should generate a “trust score” that could point out papers that warrant extra verification (by a human researcher, of course).

  6. Anon says:

    “The paper today gives us a hint of what could be possible, eventually, after a lot of time, a lot of money, and a lot of (human) brainpower.”

    Unless we’re running out of viable new drug targets, in which case AI will not help, or it tells us to cure Alzheimer’s with anti-amyloid antibodies because the literature is already so biased with the idea?

  7. Aleksei Besogonov says:

    The power of AI (neural networks) is that it can (sometimes) grasp connections that are not readily apparent and it’s much better able to deal with ambiguity. And what’s more important, it can be taught using poorly structured data.

Watson used a completely different approach – lots of rules and an inference engine to evaluate them, coupled with statistical analysis. It also had to use highly structured and reliable data as its input.

  8. Shazbot says:

    Admit it, you just wanted an excuse to post the cat picture, didn’t you?

  9. Kaleberg says:

    If nothing else, this kind of system might make it easier to search through the literature.

  10. Someone says:

    To an astrophysicist, a cat would rate pretty high on the “metallic” vector…

    1. NJBiologist says:

      To a biochemist, there’s not a lot of metal in a cat, but definitely some–iron prosthetic groups, manganese/magnesium/selenium cofactors, chromium(III) doing whatever it does for insulin sensitivity….

      1. Derek Lowe says:

        Yeah, the astrophysicists and cosmologists are notorious for referring to everything past helium as a “metal”.

    2. Someone Else says:

      To a quantum physicist the cat would be half way along the dead-or-alive vector

  11. Ken says:

    To some extent AI has always been a victim of its successes. Chess used to be a hot area in AI, but now that computers regularly beat grandmasters, we’ve decided what they’re doing isn’t really “intelligent”.

  12. RW says:

    I wonder if this could lead to using such algorithms in patent litigation/prosecution to assess the “obviousness” of inventions based on knowledge of the collected literature (or subsets of literature) available at a certain time-point? I think it would probably be possible to score an idea/invention with an “obviousness coefficient,” at least in certain fields.

  13. Paul Davis says:

    For anyone interested to learn more about natural language processing, and who knows a bit of the Python programming language, I would recommend taking a look at this Python package:

  14. Earl Boebert says:

    In other news:
    Machine translation of Linear B and Ugaritic

    Linear A is the dog which fails to bark in the article.

    1. Derek Lowe says:

      Yeah, Linear A would really make me (and lot of other people!) sit right up. . .

      1. metaphysician says:

        Indeed. I lean towards “Provably translate Linear A” as a pretty good “Well, you’ve just made an ASI” benchmark. . .

        ( “Provably” would be the seriously understated sticking point. . . )
