Drug Assays

Why Not Share More Bioactivity Data?

The ChEMBL database of compounds has been including bioactivity data for some time, and the next version of it is slated to have even more. There are a lot of numbers out in the open literature that can be collected, and a lot of numbers inside academic labs. But if you want to tap the deepest sources of small-molecule biological activity data, you have to look to the drug industry. We generate vast heaps of such; it’s the driveshaft of the whole discovery effort.
But sharing such data is a very sticky issue. No one’s going to talk about their active projects, of course, but companies are reluctant to open the books even to long-dead efforts. The upside is seen as small, and the downside (though unlikely) is seen as potentially large. Here’s a post from the ChEMBL blog that talks about the problem:

. . .So, what would your answer be if someone asked you if you consider it to be a good idea if they would deposit some of their unpublished bioactivity data in ChEMBL? My guess is that you would be all in favour of this idea. ‘Go for it’, you might even say. On the other hand, if the same person would ask you what you think of the idea to deposit some of ‘your bioactivity data’ in ChEMBL the situation might be completely different.
First and foremost you might respond that there is no such bioactivity data that you could share. Well let’s see about that later. What other barriers are there? If we cut to the chase then there is one consideration that (at least in my experience) comes up regularly and this is the question: ‘What’s in it for me?’ Did you ask yourself the same question? If you did and you were thinking about ‘instant gratification’ I haven’t got a lot to offer. Sorry, to disappoint you. However, since when is science about ‘instant gratification’? If we would all start to share the bioactivity data that we can share (and yes, there is data that we can share but don’t) instead of keeping it locked up in our databases or spreadsheets this would make a huge difference to all of us. So far the main and almost exclusive way of sharing bioactivity data is through publications but this is (at least in my view) far too limited. In order to start to change this (at least a little bit) the concept of ChEMBL supplementary bioactivity data has been introduced (as part of the efforts of the Open PHACTS project,

There’s more on this in an article in Future Medicinal Chemistry. Basically, if an assay has been described in an open scientific publication, the data generated through it qualifies for deposit in ChEMBL. No one’s asking for companies to throw open their books, but even when details of a finished (or abandoned) project are published, there are often many more data points generated than ever get included in the manuscript. Why not give them a home?
I get the impression, though, that GSK is the only organization so far that’s been willing to give this a try. So I wanted to give it some publicity as well, since there are surely many people who aren’t aware of the effort at all, and might be willing to help out. I don’t expect that data sharing on this level is going to lead to any immediate breakthroughs, of course, but even though assay numbers like this have a small chance of helping someone, they have a zero chance of helping if they’re stuck in the digital equivalent of someone’s desk drawer.
What can be shared, should be. And there’s surely a lot more that falls into that category than we’re used to thinking.

18 comments on “Why Not Share More Bioactivity Data?”

  1. Teddy Z says:

    The way I think it would be thunk about is this. Data is a corporate asset. That’s what is constantly pounded in our heads (that’s why you countersign your notebook, those of you that actually do). So, when do companies give away corporate assets? When they have no value. So, I would hope that companies would give away the biological data to at least the marketed drugs. The idea there is that there is no meat left to pick off of those bones, so they have no value. Imagine the wealth of GPCR data that exists from the anti-depressives alone.

  2. Anonymous says:

    I suspect it’s an activation barrier problem. One – the time and effort taken to convince the organisation to release the data – and two – the time taken to find the data and curate it even in a minimal way – when the project is long in the past (I guess most Med Chem papers get written at least a year after the work was current).
    Given the recent post about data quality in assays, how much of this data would actually be useful anyway?

  3. Chris Swain says:

    Perhaps we need to move one step at a time. Perhaps a requirement of publication should be that all data in a publication must be made available in a standard format so that it can be very easily imported into ChEMBL and other public repositories.
    The next step might be to disclose HERG, AMES activity for all structures in the public domain, with the hope that better predictive tools might be designed.

  4. Pete says:

    One way forward might be to provide those who hold data with financial incentives (e.g. tax breaks) to deposit the data. Sharing the results of toxicology studies would be particularly helpful (and some might suggest to be an ethical requirement). One issue that will need to be addressed (particularly in litigation-happy USA) is what legal liabilities might result from sharing data. For example, one would want some sort of guarantee that you’re not going to have some smart ass lawyer building a patent infringement case out of the data that you’ve shared.

  5. JAB says:

    Kudos to Bill Zuercher and Dave Drewry of GSK for their efforts to distribute the well curated GSK kinase inhibitor set to as many investigators as possible, including us.

  6. Cellbio says:

    @2- There are many ways the data would have value, even with the limitations of reliability of absolute values. One such way is to inform academics, who by virtue of their capacity (they only run the assay that pays the bills) and environment, have limited access to broad data sets that help one to discern between an interesting lead, a class promiscuous compound (pan kinase inhibitor), or a compound that broadly reports as a hit but is either garbage or the cure for death of all causes.

  7. will says:

    @ Pete – generally, the use of a patented compound in a research setting of developing new drugs is exempted from infringement. I would be primarily concerned about someone making an invalidity attack on my patent based on previously unreleased in-house data.
    @ Teddy – data on even quite old drugs is still potentially valuable, as a new indication can breathe life into an otherwise decaying product.

  8. Pete says:

    @ Will, Compounds are not the only things that get patented in drug discovery.

  9. sgcox says:

    Second to #5.JAB
    The GSK guys go the extra mile with this project. Very helpful.

  10. will says:

    @ Pete – I guess I misunderstood your comment then. I thought your concern was that a company might publish biodata on a particular test compound, and then a separate entity would rise up with a patent covering said compound.
    I don’t know whether a method patent covering a particular assay would also be subject to the research exemption. Logically, I think it would.
    It’s too late in the day for me to think of any other patentable subject matter for which published biodata would constitute evidence of infringement.

  11. Pete says:

    @ Will, To be quite honest my original comment was fairly generic and I’d not been thinking too much about detailed scenarios. My main point was that we need to at least acknowledge the possibility that data could be used against those who have deposited it. Assay technology patents do have the potential to make life difficult especially when patent lawyers say what one might have developed in house was ‘obvious’.

  12. Anonymous BMS Researcher says:

    Even getting people to submit content for *internal* data repositories can be like pulling teeth. Unless something is either required from on high or directly on the critical path to making the metrics, it ain’t gonna happen. It took stringent auditing to make everybody maintain good lab notebooks, for instance.

  13. Insiliconsulting says:

    ChEMBL came into being when the Wellcome Trust paid ~4 million pounds for the data and further development. It was arguably a last-ditch effort by the content owner to make SOME money in the face of curated database competitors and ever reducing profits. Still a good deal for scientists the world over though. Thanks, Wellcome Trust.

  14. SK says:

    @ Will & Pete:
    The risk of finding evidence of infringement when releasing such data is not such a big issue, although the research exemption is somewhat narrow and only really applies to experimentation on drugs which could be the subject of an FDA submission. There may be a non-negligible risk of infringement that might deter such release.
    A bigger risk is that release of the data would act as a “defensive publication” which increases the chance of an obviousness challenge. This could impact on already marketed drugs and also prevent otherwise socially valuable compounds from getting to market due to unpatentability.
    A more efficient solution would be for companies to share data and then provide FDA clinical exclusivity once a drug candidate is ready for clinical development and then an extended period of FDA-administered market exclusivity upon regulatory approval (similar to the Orphan Drug Act, but say, 12-15 years). This would allow researchers to share valuable data at the “pre-competitive” stage, while providing ownership rights to the company which is willing to enter clinical testing (there would likely be a period where trade secret protection is used by smaller biotechs that develop drug candidates). It would also reduce waste due to excessive patent litigation between generics and innovators. This will possibly address the current productivity issues facing the industry.

  15. cdsouthan says:

    In the first instance there is a big public payoff if patent assignees (academic, US Gov or commercial) can surface at least some of their better SAR data published in patents but never written up in the journals that ChEMBL captures. What would have even more impact is if journals a) desisted from publishing pharmacological data (in vitro or in vivo) on blinded structures (violating the principles of scientific reproducibility) and b) ensured even the most basic level of transparency by (promptly) publishing clinical trial results linked to a structure. (See and

  16. cliffintokyo says:

    Reality check:
    Who has time to peruse other people’s data when we are all so busy keeping up with, analysing, reporting, explaining, and utilizing (for patents and follow-up lead discovery) the mountain of data for our own compounds?

  17. MO says:

    I can’t get that excited. As usual the benefits to the wider community are oversold. Typically the best compounds from a series (most potent or best illustrating the SAR) are the ones selected for publication, so these additional ChEMBL-only data points will be the unexciting also-rans in the series, which fit with the trends but weren’t interesting enough to talk about or to generate n=2. Also, those who don’t know better will combine data from different sources inappropriately.
    One outcome will be an increase in the number of publications “mining” this data, but how many new findings will these uncover? Few, I suggest.

  18. @MO There is real value: the curated kinase inhibitor databases are a godsend to us academics who need decent starting points and binding profiles to get funding in this valley of death.
    As an academic trying to set up a drug discovery initiative to open source precompetitive data, I can attest that UK funders and UK universities remain skeptical; this won’t happen unless big pharma and Wellcome (or similar) co-invest in an academic-led consortium that is committed and resourced to do the work. The SGC@Oxford is a great model.
    To move forward I’d like to know if a useful first step would be to share quality fragment binding data matrices (hits/affinities/sites) across a target family, comparing full length and single domains as well as activated states and complexes with published inhibitors/leads. This could allow academics to derisk potential targets and develop new mechanism-driven strategies to identify early lead matter. If this might be of interest, please let me know.