The literature of synthetic chemistry is large, and it goes back well over a century. Those of us who know the field sometimes despair of the state that literature is in – it can be pretty messy – but we really shouldn’t. It’s actually far more orderly than many other fields, and it has a lot of aspects that make it intrinsically more “organizable”, not least the backbone of chemical structures that underlies it. Now, it’s for sure that not all those structures are drawn correctly and that not all those transformations of them actually work when you try them, but at least there’s a structured form to the data, as opposed to (say) the literature on rodent behavioral science or something.
This makes the chemical literature very attractive for a machine learning approach, and of course that’s just what we’ve seen in recent years. The advent of retrosynthesis software in organic chemistry is an applied example of just that, and there have been many other investigations into how to extract rules, trends, and even predictions of new reactions and substances from the existing literature. But to get any of those to work well – to get any machine learning approach to anything to work at all – you have to pay close attention to the state of the data that you’re pouring into the hopper. It needs to be reliable, well-formatted, and wide-ranging, with a good selection of both positive (here’s something that worked) and negative (here’s something that you’d think would work, but didn’t) results. All of those factors need some work, and by “some” it will be understood that I mean “sometimes a whole bunch”.
A useful way to check the reliability of a given transformation would be to see how many times it shows up in the literature. I know that the people building retrosynthesis programs think about this a lot. Einmal ist keinmal, as they say (one time is no time), and you wouldn’t want to fill up your database with a pile of one-off reactions that might not be real (or might only work under far more limited conditions than the titles of the papers might lead one to believe!). Here’s a new paper that looks into just that question of finding repeat syntheses and what that tells us about the chemical literature.
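To make that a bit more concrete, here’s a minimal sketch of the sort of filter a retrosynthesis-database builder might apply: canonicalize each reported transformation so that different drawings of the same molecules collapse together, then count independent reports. The reaction records and the cutoff below are invented purely for illustration; the only real API used is RDKit’s SMILES canonicalization.

```python
# Hypothetical sketch: flag one-off literature reactions before they go
# into an ML training set. The record format and the threshold are
# invented for illustration; RDKit's SMILES canonicalization is real.
from collections import Counter
from rdkit import Chem

def canonical(smiles: str) -> str:
    """Return a canonical SMILES so equivalent drawings count as one."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol else smiles

# (reactant SMILES, product SMILES) pairs, as they might come from text mining
reports = [
    ("c1ccccc1Br", "c1ccccc1B(O)O"),
    ("C1=CC=CC=C1Br", "OB(O)c1ccccc1"),  # same transformation, drawn differently
    ("CCBr", "CCO"),                     # a one-off report
]

counts = Counter((canonical(r), canonical(p)) for r, p in reports)
MIN_REPORTS = 2  # the "einmal ist keinmal" cutoff -- purely illustrative
reliable = {rxn for rxn, n in counts.items() if n >= MIN_REPORTS}
print(f"{len(reliable)} of {len(counts)} distinct transformations pass the cutoff")
```

A real pipeline would obviously key on more than reactant/product pairs – conditions, yields, and provenance all matter – but the counting step at the bottom is the idea.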
The authors, from Georgia Tech, look at the metal-organic framework (MOF) literature, and I’d say that’s a good choice. I did a fair amount of MOF work a few years ago (on the “crystalline sponge” X-ray structure idea, if you’re wondering), and if you’ve never looked at that stuff, let me tell you that the literature in that field is a massive shaggy pile. There are a zillion MOFs out there, produced under a ridiculous number of synthetic conditions, and the barrier to making new ones is extremely low. I mean it – you can step right up to your hood and within a few days make some that have never been reported before. I sure did, and it was a blast.
Those things can form spectacular crystals, and what synthetic chemist doesn’t like that? I would set up a whole line of sealed vials with various combinations of metal salts, multivalent ligands, and additives and heat them up in something like DMF for a few days, and likely as not collect a series of brand-new MOFs. This accounts for the vast “stamp-collecting” literature on these things. They’re generally not that hard to collect X-ray data on (all those metal atoms!), and even a Neanderthal like me could get decent data sets, although you don’t want me to be the guy who processes and refines them. Below are some of the many that I prepared with my own hands, and I can’t tell you when I’ve had a better time in the lab. If you’re guessing cobalt as the first metal and copper as the third one, right you are. Now, getting them to do what I wanted them to do (sequester small molecules in an ordered fashion) was another topic entirely, but making crystals to try that out on? Oh, yeah. If you’re going through one of those periods where it seems that you can’t get anything to work in the lab, go make some MOFs – you’ll feel better quickly.
The Georgia Tech team used the CoRE MOF database, a curated collection of thousands of X-ray structures in the field. They selected 130 MOFs randomly from the pre-2014 literature. The papers describing these had been cited between 8 and 168 times (average 34 citations). What they found was that most of these had never been resynthesized at all, as you might expect, while others had been made multiple times:
Only 1 material was synthesized more than 3 times: a Zn-based MOF first produced by An et al. (16) with structure code SAPBIW (common name Bio-MOF-100) has been synthesized 7 times, including 2 instances by groups distinct from the original authors. Seven of the 130 MOFs have been resynthesized by a group distinct from the original authors, and 15 of the MOFs have been synthesized more than once by anyone.
Now that’s for direct replication – if they broaden the field to modifications of the original synthesis, 65% of the 130 MOFs had had some sort of follow-up work. It seems quite possible, they note, that many of these papers also made the original substance along the way but did not bother to report it in the actual paper, so the replication statistics are likely lower bounds. A broader analysis of the MOF literature, though, picked up a short list of substances that have had hundreds of papers written about them, while the vast majority of reported structures have no direct replications at all.
This is not the 80/20 distribution beloved of consultants everywhere – it’s more like a very few substances account for nearly all the replications while everything else tails out extremely fast. Their estimate is that 0.03% of the reported MOFs account for 50% of the replications, which does not make for as instantly memorable a PowerPoint slide title. Outside of those “super repeats”, the distribution broadly follows a power law, although proving that a power law is actually at work is a much harder task. How do some of these things end up on the greatest hits list? It’s impossible to say for sure, but the authors note that they involve inexpensive materials and undemanding experimental conditions (for starters), and that there are surely sociological factors at work as well, not least the timing of the original publications.
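For anyone who wants to kick the tires on that power-law claim: the standard approach (Clauset, Shalizi, and Newman’s) is to fit the exponent by maximum likelihood above some cutoff x_min and then compare the fit against alternatives like a lognormal, rather than eyeballing a straight line on a log-log plot. Here’s a bare-bones sketch of just the MLE step, run on made-up replication counts – everything below except the formula itself is an assumption for illustration.

```python
# Bare-bones sketch of the continuous power-law MLE from Clauset,
# Shalizi & Newman (2009): alpha = 1 + n / sum(ln(x_i / x_min)).
# The replication counts are invented; a careful analysis would pick
# x_min by minimizing the Kolmogorov-Smirnov distance, use the
# discrete-data estimator for integer counts, and run goodness-of-fit
# and model-comparison tests.
import numpy as np

def powerlaw_alpha(x: np.ndarray, xmin: float) -> float:
    """MLE exponent for data assumed power-law distributed above xmin."""
    tail = x[x >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Hypothetical "times synthesized" counts for 130 materials: a long
# tail of one-offs, a handful of repeats, a couple of super-repeats
counts = np.array([1] * 115 + [2] * 8 + [3] * 4 + [5, 7, 12])
for xmin in (1, 2):
    print(f"x_min = {xmin}: alpha ~ {powerlaw_alpha(counts, xmin):.2f}")
```

Note how much the fitted exponent moves with the choice of x_min – which is a big part of why claiming a power law takes more care than drawing a straight-ish line through a log-log plot.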
How reproducible were all these repeat runs? In the MOF case, we have parameters like the Brunauer-Emmett-Teller (BET) surface area, which is pretty easy to measure. The paper has some interesting plots of how these numbers have shaped up over time for the high-repeat substances. You can see a rough bimodal distribution in some of the moisture-sensitive ones, and tighter numbers in the ones known to be more robust, which makes sense, although the numbers do not seem to be converging over time, either. Having MOFed around myself, I feel sure that what we’re seeing are variations due to different levels of solvent removal (activation) before measuring, samples with varying amounts of impurities and small crystalline defects, etc., and the authors advance just such explanations.
So how widely applicable is this analysis? Of special relevance is the repeatability of synthetic methods, and I would love to see statistics on that. You’d have to deal with a lot of variation in conditions and be willing to loosen up your literature-search constraints, but a look at these situations could be very useful in setting cutoffs for machine learning on the synthesis literature as a whole. The same sociological factors that made some MOFs super-popular have surely been at work in making some reactions and types of reactions popular as well (think, for example, of the wave of olefin metathesis reactions that hit the literature some years back). But reactions that just don’t work well don’t ever get to be popular at all. How many good or interesting ones are there, though, that never got their time in the spotlight? It might be worth mining the old journals for unusual transformations that didn’t get followed up on and devoting some high-throughput synthesis effort to seeing how many of those things can be revived. Would anyone fund such an effort?