In Silico

Text-Mining: Preparing for Battle in India

Since I was just blogging the other day about a machine-learning paper that worked its way through decades of abstracts for materials science papers, this news is timely. Carl Malamud and a team at Jawaharlal Nehru University in New Delhi have assembled a rather huge repository of journal articles (they’re aiming for about 73 million of them), in full text.

Full-text data mining is what people want to do, for the most part. Abstracts are a pretty concentrated form of knowledge and can be easier to work with, but you’d have to think that the full papers are where some of the most interesting unseen connections would turn up – the things mentioned in passing that weren’t the main thrust of the papers they come from could well be crucial clues in another context. Multiply that by the number of such clues per paper and by 73 million papers. There is, in fact, already evidence to show that full-text data mining is a real improvement versus abstract-only. Of course, you’re also going to pull out more false positives (by sheer number) through mining more data, but that link shows that the full text gives you both the most hits and the best true positive/false positive ratios at the same time. It clearly seems to be the way to go (although, as mentioned in that earlier post, the very promising idea of natural-language data mining through vector representations is going to need some better tools in order to take on such a challenge).

The idea of the JNU collection is that no one is going to be allowed to read or download any individual papers, since that would indeed be a copyright violation for the publishers involved. Instead, the corpus as a whole will be available for machine-learning purposes to extract trends and connections – which Malamud contends is no violation at all. As will be easy to imagine, the publishers themselves aren’t quite seeing it that way. As things stand, even for people who have full-text access, such data mining across the literature is difficult to impossible. The various publishers tend not to just throw things open for as many queries as you want, as quickly as you feel like making them. And if they see you downloading in bulk to run those queries on your own hardware, there goes your access, very quickly. To be sure, there are places (such as the UK) where there is an explicit legal right for non-commercial users to data-mine whatever they have legal access to, but as the article notes, there’s a clause in that law that grants publishers “reasonable restrictions” on that process. And what seems reasonable to the owners of the data often doesn’t overlap well with what seems reasonable to the people who want to use it.

Some of the large scientific publishers have, though, allowed various parties to run large query campaigns or engage in bulk downloading – but those were through individual deals with commercial customers (drug companies, to pick one category), and you can be sure that (A) money changed hands and (B) no one’s going to tell you how much. And you can also be sure that there were plenty of restrictions placed on those efforts as well (what was downloaded, who was allowed to use the data, for how long, etc.). The business model for any large data holder in any field is to be able to sell the same information to as many different users as possible, so I feel sure that the JNU effort will face more legal challenges as it gets closer to launch.

That launch is some months in the future – whipping all the information into a coherent, consistent searchable shape is not the work of a moment, and anyone who’s done machine learning will tell you that preparation of a large database is by far the biggest part of the battle. Remember, we’re talking tens of millions of papers, from all sorts of publishers going back into the 1800s, so the formatting issues alone are enough to make you stare out the window. But none of this is unsolvable, and once it’s done, it’s done. Still, I can’t imagine that the likes of Elsevier will watch the advent of this database with just a shrug of their digital shoulders. If you look at their pages on text-mining, everything sounds fine at first, but real-world experiences are rather different. And even the Elsevier site gets into “consult your license terms” territory very quickly. Everything useful (from what I can see) is done on a strictly case-by-case basis, with permissions and restrictions at every step. (That article refers to Elsevier’s 2014 policy, which from what I can see has not been superseded.) You’re not going to get much further with the ACS, Wiley, Springer, or the rest of them, either.

The whole reason this project is taking place in India to start with is a set of legal rulings there that allow the reproduction of copyrighted papers for research and educational purposes. Various publishers have already fought it out in the Indian court system (and lost) against university photocopy operations and the like, but you can be sure that they’re gearing up for another round. I expect challenges to every aspect of the plan: whether these papers should have been downloaded in the first place, where they actually came from (which is somewhat obscure), the right to non-consumptive data mining in general (that is, the kind where no one actually reads any particular whole paper), the ability to do that remotely, what use can be made of the insights so obtained and who owns them – the whole enchilada. Or since we’re talking New Delhi, the whole paratha. It’ll be quite a fight.

13 comments on “Text-Mining: Preparing for Battle in India”

  1. anonymous says:

    About your statement “Or since we’re talking New Delhi, the whole paratha….” and coming from India, I like that take!

  2. loupgarous says:

    I can see this working out two ways:
    First, if the publishers lose their challenge in Indian courts, they could impose some sort of “great firewall” block on the IP addresses serving JNU (just as some publishers cut off Max Häussler’s access at UC Santa Cruz when they detected his software combing through their sites), forcing the Malamud team to manually scan printed copies of the papers they haven’t yet scanned (assuming they’re all available in hard copy).

    After all, the publishers have, as a last resort, ownership of the digital texts and can decline to make them available for uses they presume to be inconsistent with the license terms they’ve outlined.

    It’s unclear how many papers remain to be incorporated in the JNU corpus, but denying downloading of the materials to Malamud’s team could make full implementation a practical impossibility. The existing corpus could still be used, but the broader issue of an embargo on downloads of new papers to researchers and students at JNU could be a deal-breaker.

    Alternatively, the publishers might reflect that Google Books already has established the legality of nonconsumptive data mining in Anglo-American courts and decide not to challenge the Malamud/JNU effort in exchange for exacting stricter access restrictions than now exist – or even buy into the effort.

    The reasoning there could be that broader access by researchers to a new kind of metadata about the papers’ content would increase demand for legal, paid access to the papers themselves – and the publishers’ revenue. It’s an idea worth looking into, if no one’s studied it yet.

    Of course, the publishers are probably looking at Malamud’s track record of publishing data which has come into his hands whenever he could. It’s understandable that any scientific or technical publisher aware of the general direction of Malamud’s career would view him and his work as an existential threat, and any nonconsumptive data mining in which he was involved as merely an initial step toward making copies of papers in the JNU corpus available at no cost to Indian students and researchers on the same basis as photocopies of papers there now.

    The broader nightmare to publishers would be illegal databases in India selling articles on the Dark Web for bitcoin.

    1. Tyty says:

      ” … selling articles on the Dark Web for bitcoin.”

      Sci-Hub works pretty well and it’s free.

  3. AnonEmuss says:

    Since this effort is out of JNU, a key “finding” will be that Marxism is the savior of humanity and that Lysenko was right after all.

  4. a. nonymaus says:

    Who cares about publishers, what is Sci-Hub’s position on full-text mining?

  5. Nishanth says:

    Hi Derek, your link here does not work:

    “If you look at their pages on text-mining, everything sounds fine at first, but real-world experiences are rather different”

    Where “real-world experiences” incorrectly points to your local filesystem.

    1. Derek Lowe says:

      Dang, thanks – just added the link that I thought I was adding.

  6. Xylem says:

    73 million papers? sci-hub has 74 million at the moment, so you can stop wondering where they came from.

  7. Paul van den Bergen says:

    Some years ago a friend described working on a software project that was, frankly, horribly managed. Weekly reporting against targets, that sort of thing… It was pretty obvious the reports never got read because occasional glaring mistakes were not picked up on so they spent 2 weeks as a team writing a natural language parser to create first-glance passable reports…

    Oh my. They became the golden team that could do no wrong, merely by getting the reports in on time week after week…

    that was 2 decades ago.

    I imagine one could do a really great job of automatically writing realistic looking research papers with a trove of 73 million papers to mine…

    …and that’s where the nightmares begin…

    1. Peter Gerdes says:

      Not really. Like with murder we deter papers which fabricate data simply by imposing a very harsh sentence (academic death). If you were willing to publish auto-generated fake papers with your name on them you’d strictly prefer to just fake the data yourself since that makes you less likely to be caught. If some machines toil away in a dark corner of the internet submitting fake papers to fake journals where other machines read them…shrug. If actual people read them either they eventually get found out or the AI turns out to be so good that it actually generates correct results.

      Note that what accounts for the difference is that academia involves relatively rare publication that we are willing to impose really harsh penalties for faking. Biweekly reports are largely useless paperwork which, if you get caught, means a lecture not to do it again.

  8. james says:

    I think this is completely useless. We already have Sci-Hub, and the source of these articles is also Sci-Hub, because Carl Malamud was a collaborator of Aaron Swartz, who died by suicide in 2013 (you can read more about him here). Sci-Hub was named in memory of Aaron Swartz by its founder, Alexandra Elbakyan, who was influenced by him.

    I still don’t get the purpose of this. JNU got this data, which is 100 percent from Sci-Hub. If Carl actually wants to support open science, why doesn’t he make this data publicly available, so that deserving candidates can do deep learning and NLP with it, rather than just some lab at JNU? What is JNU’s big achievement here? It’s just data given to them by Carl; it’s like handing over a pen drive with some data on it. The Indian government has to take action against them for such illegal activity, and against all those who support it.
    Why is only JNU doing this illegal thing? This is the same JNU that became famous in February 2016 for speeches against the country. Wow.

    1. james says:
      Malamud won’t say where the articles came from, but he did tell Nature that he came into possession of eight hard-drives’ worth of articles from Sci-Hub, the pirate research site whose mission is to liberate scholarly and scientific works from paywalls and ensure that they are universally available. Sci-Hub was founded in memory of Aaron Swartz, a collaborator of Malamud’s who was persecuted by the FBI and threatened with decades in prison for downloading scientific articles from MIT’s network. Swartz hanged himself in 2013, after the federal prosecutors on the case had used legal delaying tactics to drain Swartz’s savings, including the sums he got from the sale of Reddit, which had acquired a company he founded, to Conde Nast.

    2. biba says:

      I agree with James: this is completely illegal work. The Indian police should take action against all the people involved in this, especially the so-called JNU team, as the US-based FBI did against Aaron Swartz.
      JNU has already had a bad reputation in India since 2016, and now this is what you are doing. Is this research? Strict action should be taken against the JNU team.
