4 Repositories for Data Sharing

The second explicit requirement of the new NIH Data Management and Sharing Policy (in addition to the Data Management and Sharing Plan requirement discussed above) is that, at the conclusion of your project, you must share your project data by publishing it in a third-party data repository, in accordance with the commitments in your DMS Plan.

A data repository is essentially a digital archive that is designed to preserve research data beyond the life of a given project, so that other researchers can easily reuse and build on previous research and data collection efforts. More generally, data repositories have the goal of making research data FAIR: Findable, Accessible, Interoperable, and Reusable. It is worth reiterating that hosting your data on a personal website is not considered an acceptable alternative to sharing data via a FAIR-aligned third party data repository.

4.1 The Landscape of Data Repositories

There are three general types of data repositories within the broader repository ecosystem:

  • General purpose repositories accept data deposits from researchers from various disciplines, regardless of their institutional affiliation. An example of a general purpose repository is the Harvard Dataverse.
  • Disciplinary repositories accept deposits from researchers from designated discipline(s) or research communities, regardless of their institutional affiliation. An example of a disciplinary or community-based repository is the Inter-university Consortium for Political and Social Research (ICPSR), which hosts data from social science researchers (including public health data).
  • Institutional repositories accept data deposits from researchers who are members of a specific institution, regardless of their discipline or area of research. An example of an institutional repository is CU Scholar, which is CU-Boulder’s own institutional repository.

Within these broad categorical distinctions, there are additional policy and resource-based differences across repositories. For example, there is considerable variation in the size of the data deposits that repositories are able to accept. To take another example, some repositories only host open access data, while others give researchers the option to deposit restricted-use data that potential users must apply to access (which is a useful way to meet NIH data sharing requirements in the context of sensitive human subjects data that would not be appropriate for public release on an open access basis).

For certain grants, NIH may mandate the use of specific repositories, but in general, researchers are given the flexibility to use the repository of their choice. Given the diversity of the data repository landscape, it can be challenging to decide on an appropriate repository (or repositories) in which to deposit data in order to fulfill NIH data sharing requirements. This guide to selecting a data repository from the NIH is a useful starting point for thinking about possible repository destinations for your research data. The NIH has also provided a catalog of repositories that might give you some more specific ideas about repository options. If you would like more tailored guidance about potential repository destinations appropriate to your project needs, we also encourage you to contact CRDDS for a consultation.

4.2 CU Scholar and Repository Services at CU Boulder

CU Scholar is CU Boulder’s institutional repository. You can use CU Scholar to publish your research data, and thereby fulfill NIH data sharing requirements. CU Scholar is a popular venue for CU-affiliated researchers looking to publish their research data, since it is supported by CRDDS staff, who help coordinate the publication process. In addition, CU Scholar has been certified by CoreTrustSeal, which is seen by grant reviewers and members of the scholarly communications community as a credible signal of a repository’s overall quality. If you are interested in publishing your NIH project data on CU Scholar, the subsections below offer a brief overview of repository practices and policies.

4.2.1 Basic CU Scholar Policies

If you are thinking of depositing your data in CU Scholar, please review the Data Set Policy (scroll all the way down). Two things, in particular, are worth emphasizing:

  • Not every contributor to a dataset or project has to be affiliated with CU-Boulder, but the person who deposits the data must be employed by CU-Boulder (the depositor must authenticate with CU-Boulder credentials).
  • Once the data is published, it is immediately open access: anyone can view or download it. As a result, datasets containing personally identifiable information are not appropriate for CU Scholar.

4.2.2 Submitting data to CU Scholar

If you decide that you would like to meet NIH data sharing requirements using CU Scholar, please consult the submission guidelines for an overview of the submission process.

The basic steps of the CU Scholar data submission workflow are as follows:

  1. Click the blue “Share Your Work” button on the main CU Scholar page.
  2. When prompted to select the type of work, select the “Data Set” option.
  3. You will be prompted to fill out several fields, and upload your data. Please also upload a Readme or documentation file that provides relevant metadata. You can use the suggested Readme template for CU Scholar for this purpose.
  4. Once you have completed the submission, we will review it for adherence to FAIR data principles. We may recommend changes to your submission based on this review.
  5. Once the submission has been finalized, we will register a digital object identifier (DOI) for the data, which can be used to uniquely identify the dataset. You can share this DOI link with relevant stakeholders as proof of compliance with the NIH data publication requirement.

4.2.3 Sample CU Scholar datasets

If you would like to see what dataset publications look like on CU Scholar, here are a few recent examples:

4.2.4 CU Scholar size constraints and costs

If your data submission is less than 500 GB, we can publish it on CU Scholar at no additional cost to you.

For data submissions greater than 500 GB, there is a one-time data deposit fee of $450/terabyte. When assessing costs, file sizes are always rounded up (for example, a deposit of 750 GB will be assessed a deposit fee of $450; a deposit of 1.4 TB will be assessed a deposit fee of $900; and so on).
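To make the rounding rule concrete, the fee schedule above can be sketched as a short Python function. This is only an illustrative estimate, not an official calculator: the function name `deposit_fee_usd` is hypothetical, and the sketch assumes that 1 TB is counted as 1000 GB and that a deposit of exactly 500 GB falls in the no-cost tier, neither of which is stated explicitly above. Please confirm actual costs with CRDDS.

```python
import math

FREE_TIER_GB = 500    # deposits below this size are free (from the text above)
FEE_PER_TB_USD = 450  # one-time fee per terabyte (from the text above)
GB_PER_TB = 1000      # ASSUMPTION: decimal terabytes (1 TB = 1000 GB)


def deposit_fee_usd(size_gb: float) -> int:
    """Estimate the one-time CU Scholar deposit fee for a deposit of size_gb."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    # ASSUMPTION: a deposit of exactly 500 GB is treated as free.
    if size_gb <= FREE_TIER_GB:
        return 0
    # Above the free tier, the size is rounded up to the next whole terabyte.
    billable_tb = math.ceil(size_gb / GB_PER_TB)
    return billable_tb * FEE_PER_TB_USD


# The two examples given in the text above:
print(deposit_fee_usd(750))   # 750 GB rounds up to 1 TB
print(deposit_fee_usd(1400))  # 1.4 TB rounds up to 2 TB
```

Under these assumptions, the two calls reproduce the fees quoted above ($450 and $900, respectively).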

4.2.5 Privacy, Ethics, and CU Scholar

As we noted above, all data on CU Scholar is open data; if you decide to use CU Scholar to meet data publication requirements, please make sure that the data is appropriately de-identified or anonymized to protect human subjects. We can offer some advice on de-identification, but the responsibility for ensuring that the data is appropriate for public dissemination, and that human subjects are protected, ultimately rests with the depositor. For practical guidance on data anonymization, please see this resource from the UK Anonymisation Network.

As noted above, an alternative to anonymization is to deposit your data with a repository whose infrastructure supports the deposit of restricted-use data. If a repository has a restricted-use option, you will be able to deposit your data just as you would in a normal open-access repository, but the repository will only make the data available to researchers under controlled conditions designed to protect human subjects. For an example of how restricted-use data policies generally work, see this overview of restricted use data policies at ICPSR.