Since I was just blogging the other day about a machine-learning paper that worked its way through decades of abstracts for materials science papers, this news is timely. Carl Malamud and a team at Jawaharlal Nehru University in New Delhi have assembled a rather huge repository of journal articles (they’re aiming for about 73 million of them), in full text.
Full-text data mining is what people want to do, for the most part. Abstracts are a pretty concentrated form of knowledge and can be easier to work with, but you’d have to think that the full papers are where some of the most interesting unseen connections would turn up – the things mentioned in passing that weren’t the main thrust of the papers they come from could well be crucial clues in another context. Multiply that by the number of such clues per paper and by 73 million papers. There is, in fact, already evidence to show that full-text data mining is a real improvement over abstract-only mining. Of course, you’re also going to pull out more false positives (by sheer number) through mining more data, but that link shows that full text gives you both the most hits and the best true positive/false positive ratios at the same time. It clearly seems to be the way to go (although, as mentioned in that earlier post, the very promising idea of natural-language data mining through vector representations is going to need some better tools in order to take on such a challenge).
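To make the abstract-versus-full-text point concrete, here is a toy sketch (nothing to do with the actual JNU pipeline, and the paper text is invented) of why a connection mentioned only in passing in the body of a paper is invisible to abstract-only mining:

```python
# Toy illustration: a co-mention check run on an abstract vs. the full text.
# The terms and the "paper" below are hypothetical examples.

def co_mentions(text: str, term_a: str, term_b: str) -> bool:
    """True if both terms appear anywhere in the text (case-insensitive)."""
    t = text.lower()
    return term_a.lower() in t and term_b.lower() in t

# Hypothetical paper: the interesting connection only shows up in the body.
abstract = "We report an improved synthesis of perovskite thin films."
full_text = (
    "We report an improved synthesis of perovskite thin films. "
    "In passing, we note that trace lead iodide impurities correlated "
    "with faster degradation under humid conditions."
)

print(co_mentions(abstract, "perovskite", "degradation"))   # False
print(co_mentions(full_text, "perovskite", "degradation"))  # True
```

Real mining uses far more sophisticated entity recognition than substring matching, of course, but the asymmetry is the same: the abstract simply doesn’t contain the clue.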
The idea of the JNU collection is that no one is going to be allowed to read or download any individual papers, since that would indeed be a copyright violation for the publishers involved. Instead, the corpus as a whole will be available for machine-learning purposes to extract trends and connections – which Malamud contends is no violation at all. As you can easily imagine, the publishers themselves aren’t quite seeing it that way. As things stand, even for people who have full-text access, such data mining across the literature ranges from difficult to impossible. The various publishers tend not to just throw things open for as many queries as you want, as quickly as you feel like making them. And if they see you downloading in bulk to run those queries on your own hardware, there goes your access, very quickly. To be sure, there are places (such as the UK) where there is an explicit legal right for non-commercial users to data-mine whatever they have legal access to, but as the article notes, there’s a clause in that law that grants publishers “reasonable restrictions” on that process. And what seems reasonable to the owners of the data often doesn’t overlap well with what seems reasonable to the people who want to use it.
Some of the large scientific publishers have, though, allowed various parties to run large query campaigns or engage in bulk downloading – but those were through individual deals with commercial customers (drug companies, to pick one category), and you can be sure that (A) money changed hands and (B) no one’s going to tell you how much. And you can also be sure that there were plenty of restrictions placed on those efforts as well (what was downloaded, who was allowed to use the data, for how long, etc.). The business model for any large data holder in any field is to be able to sell the same information to as many different users as possible, so I feel sure that the JNU effort will face more legal challenges as it gets closer to launch.
That launch is some months in the future – whipping all the information into a coherent, consistent, searchable shape is not the work of a moment, and anyone who’s done machine learning will tell you that preparation of a large database is by far the biggest part of the battle. Remember, we’re talking tens of millions of papers, from all sorts of publishers going back into the 1800s, so the formatting issues alone are enough to make you stare out the window. None of this is unsolvable, though, and once it’s done, it’s done. But I still can’t imagine that the likes of Elsevier will watch the advent of this database with just a shrug of their digital shoulders. If you look at their pages on text-mining, everything sounds fine at first, but real-world experiences are rather different. And even the Elsevier site gets into “consult your license terms” territory very quickly. Everything useful (from what I can see) is done on a strictly case-by-case basis, with permissions and restrictions at every step. (That article refers to Elsevier’s 2014 policy, which from what I can see has not been superseded.) You’re not going to get much further with the ACS, with Wiley, or Springer or the rest of them, either.
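For a flavor of why that preparation is the biggest part of the battle, here is a minimal sketch (my own hypothetical example, not anything from the JNU project) of the kind of cleanup a heterogeneous corpus demands – and a real pipeline handling OCR’d nineteenth-century scans, publisher XML, and PDF extractions needs vastly more than this:

```python
import re
import unicodedata

def normalize(raw: str) -> str:
    """Basic text cleanup: fold unicode ligatures, drop soft hyphens,
    rejoin words hyphenated across line breaks, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)      # e.g. the "fi" ligature -> "fi"
    text = text.replace("\u00ad", "")              # strip soft hyphens
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # "synthe-\nsis" -> "synthesis"
    text = re.sub(r"\s+", " ", text)               # collapse runs of whitespace
    return text.strip()

# A PDF-extraction-style fragment with a ligature and a line-break hyphen:
raw = "The  synthe-\nsis of \ufb01lms"
print(normalize(raw))  # "The synthesis of films"
```

Each publisher and era brings its own variants of these problems, which is why the cleanup multiplies so badly across tens of millions of papers.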
The whole reason this project is taking place in India to start with is a series of legal rulings there that allow the reproduction of copyrighted papers for research and educational purposes. Various publishers have already fought it out (and lost) in the Indian court system against university photocopy operations and the like, but you can be sure that they’re gearing up for another round. I expect challenges to every aspect of the plan: whether these papers should have been downloaded in the first place, where they actually came from (which is somewhat obscure), the right to non-consumptive data mining in general (that is, the kind where no one actually reads any particular whole paper), the ability to do that remotely, what use can be made of the insights so obtained and who owns them – the whole enchilada. Or since we’re talking New Delhi, the whole paratha. It’ll be quite a fight.