Progress in data and code deposition

It’s been just over 5 years since then-Editor-in-Chief Marcia McNutt published an editorial committing Science to adoption of the Transparency and Openness Promotion (TOP) framework to facilitate reproduction of research published in the journal. This was a multifaceted commitment, but a big component involved directing authors to repositories for permanent, accessible archiving of the data and code underlying their articles. In a small subset of disciplines, primarily (macro)molecular structure determination as well as genetics and genomics, specific community-standard repositories were already mandated. For the many other disciplines featured in the journal, a more flexible paradigm encouraged authors to consider a range of subject-specific as well as general repositories, including those affiliated with their institutions, so long as independent curation ensured permanent, unfettered access. Withholding data pending reader requests was no longer permitted. Authors could still archive small datasets in Science’s supplementary material, but our preference was, and still is, to support the growing ecosystem of community scholarly repositories that aim to promote data and code reuse through effective quality control, metadata, and cross-referencing. An integral part of this system is the formal citation of deposited datasets and software, which helps to allocate specific credit to the researchers involved in those components of a study.

In the interest of transparency and accountability, Science is pleased to share here that 70% of the 748 Research Articles and Reports published in the journal in the year 2020 archived data and/or code in an external repository, as shown in the figure*. Genetics and genomics data were the most frequently deposited, with the databases encompassed by the US National Center for Biotechnology Information accounting for 19% (141 papers) and the European, Chinese, and Japanese counterparts adding another 3%. Next were protein and small-molecule structures, with the Protein Data Bank, Electron Microscopy Data Bank, and Cambridge Crystallographic Data Centre accounting for 15%. The Zenodo general repository was used for custom software code deposition in 14% of papers and for data deposition in 8%. Institutional repositories were used for data in 12% of papers. Dryad was used for data deposition in 2% of papers, and Dataverse and Figshare were each used in 1%.

Because Science is a multidisciplinary journal, the editors have deliberately sought to respect community preferences with regard to which repositories authors choose. With the exception of the genetics and structural databases that have decades-long legacies, it is clear that authors published in Science are currently opting for general over subject-specific repositories. There were 13 more narrowly focused repositories, mainly in Earth and environmental sciences, that were each used three or fewer times†; PANGAEA was used six times. Whether this overall distribution stems from a scarcity of repositories in particular disciplines or a barrier to deposition that authors opt not to surmount is somewhat unclear. Regardless, Science is committed to engaging with both the research and repository communities (across which there is often considerable overlap) to help ensure that data and code become more findable, accessible, interoperable, and reusable going forward.

*Because some papers used more than one repository, the percentage sum exceeds 100.

†Subject-specific repositories used three or fewer times: Materials Data Facility; Materials Cloud; National Centers for Environmental Information; National Snow and Ice Data Center; Sea Scientific Open Data Publication; Environmental Information Data Center; EarthChem; Biological and Chemical Oceanography Data Management Office; Neptune Sandbox Berlin; OneStratigraphy; Knowledge Network for Biocomplexity; ESRF heritage database for palaeontology, evolutionary biology and archaeology.

