5 Data Management for In-Progress Projects
Once you are awarded an NIH grant, the Data Management and Sharing Policy requires you to implement the data management protocol specified in your DMS Plan and to follow data management best practices while carrying out your research. To that end, Subsections 5.1 to 5.4 discuss some useful resources.
5.1 Data Storage
In your Data Management and Sharing Plan, you should expect to discuss where and how research data will be stored while your project is ongoing. (Data storage during an active project is distinct from data archiving for long-term preservation and reuse, which you should also discuss in your Plan; for more on fulfilling that requirement, see Sections 6 and 7 below.) Once the project is underway, you will need to store data using the active storage plan presented in your DMS Plan.
OIT has put together a helpful guide to data storage options at CU Boulder:
- In most cases, especially if your research data is expected to total less than one terabyte, OneDrive is probably the best storage option for research data. OneDrive has been approved by OIT for the storage of highly confidential data and is considered a secure storage environment.
- If you are working in the realm of “big data” and need more than one terabyte of storage space, the CU Boulder PetaLibrary is an option worth exploring. Note, however, that unlike OneDrive, the PetaLibrary involves an out-of-pocket cost ($45/terabyte/year) and is not approved for the storage of sensitive or highly confidential data.
- If you need a secure way to transfer large amounts of data between project team members, CU Boulder’s free Large File Transfer Service is a useful resource.
A general principle of data storage and security to keep in mind is the “3-2-1” rule: Keep 3 copies of important data (1 original record and 2 copies), on 2 different storage environments, with at least 1 copy held physically off-site or in a secure cloud storage environment.
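Automating your backups makes the 3-2-1 rule much easier to follow consistently. The sketch below is one minimal way to do this from the command line; the paths and the use of rsync are illustrative assumptions rather than CU Boulder-specific requirements (a OneDrive sync client, for example, would handle the cloud copy automatically).

```bash
#!/usr/bin/env bash
# Minimal 3-2-1 backup sketch (illustrative only).
# The paths below are hypothetical: adjust them to your own project layout.

PROJECT_DATA="$HOME/projects/nih_study/data"        # 1. original working copy
EXTERNAL_DRIVE="/Volumes/backup_drive/nih_study"    # 2. copy on a second storage device
ONEDRIVE_SYNC="$HOME/OneDrive/nih_study_backup"     # 3. copy in a secure cloud environment

# rsync mirrors the data directory to each backup location;
# --archive preserves timestamps and permissions, --delete keeps the mirrors in sync.
rsync --archive --delete "$PROJECT_DATA/" "$EXTERNAL_DRIVE/"
rsync --archive --delete "$PROJECT_DATA/" "$ONEDRIVE_SYNC/"
```

A script like this can be run on a regular schedule (for example with cron) so that backups do not depend on anyone remembering to make them.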
5.2 Data Management Best Practices
The NIH has assembled an extremely useful guide to data management best practices for active projects. Two topics mentioned in that guide are worth emphasizing:
- Documentation and Metadata: Documenting your data workflows as you carry out your research will make it easier to prepare your data for long-term archiving at the end of your project. For guidance on best practices for data documentation and metadata, this document (from MIT) is a good place to start: https://libraries.mit.edu/data-management/store/documentation. Cornell University Libraries’ brief primer on standards-based metadata is also helpful. When you deposit your data in a repository for long-term preservation, it is usually a minimal requirement that you at least provide “readme” style metadata. We recommend that you review the sort of information typically presented in a data readme file and document your ongoing work with a view towards creating such a metadata file at the conclusion of your project.
- File management: Data files and directories can quickly proliferate over the course of a project, and it’s important to have a framework in place that regulates how files and directories are named, organized, and versioned. For a useful primer on file naming and organization conventions, see this resource from the University of Wisconsin; the short sketch after this list illustrates one possible convention.
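To make the documentation and file-management advice above concrete, the sketch below sets up a simple project directory structure, seeds a readme-style metadata file to be filled in as the project progresses, and shows one possible naming convention (project, ISO date, short description, version). The directory layout, readme fields, and naming scheme are illustrative assumptions; adapt them to your own project and to any conventions your team has agreed on.

```bash
#!/usr/bin/env bash
# Illustrative project skeleton; names and conventions here are assumptions, not requirements.

PROJECT="nih_study"
mkdir -p "$PROJECT"/{data/raw,data/processed,code,docs}

# Seed a readme-style metadata file to fill in over the course of the project.
cat > "$PROJECT/docs/README.txt" <<'EOF'
Project title:
Principal investigator / contact:
Date range of data collection:
File inventory (name, format, brief description):
Methods used to collect and process the data:
Variable definitions and units:
Licensing / access restrictions:
EOF

# Example of a consistent file name: project_YYYY-MM-DD_description_version.
touch "$PROJECT/data/raw/nih_study_2024-05-01_survey-responses_v01.csv"
```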
5.3 Data Management Tools
A variety of tools can help you implement data management best practices. Too many exist to catalog here, but a few are worth highlighting:
- The Unix Shell/Command Line: When creating, renaming, deleting, or moving files and directories, it is easier to keep track of everything if you perform these tasks programmatically (rather than pointing and clicking in a graphical user interface). The Unix Shell is a basic scripting tool that allows you to automate these aspects of file management relatively easily; this makes file management quicker, less error-prone, and easier to document. For an excellent tutorial on using the Unix shell for file management, see this lesson from the Carpentries program: https://swcarpentry.github.io/shell-novice/.
- Git and GitHub: Version control, an important part of data management, refers to your approach to tracking changes made to files containing data and code over time. Git is a version control system widely used in the open-source software and academic research communities. GitHub is a platform, based on Git, for hosting projects and source code; it’s a great way to implement versioning in a collaborative project. GitHub is a particularly useful way to store and track changes in the code and scripts you use to process your data, and these scripts can ultimately be submitted alongside your datasets for long-term preservation in data repositories. To get started with Git and GitHub, this Carpentries lesson is a good resource: https://swcarpentry.github.io/git-novice/. The short sketch after this list combines a few basic shell commands with Git.
- Open Science Framework: The Open Science Framework (OSF) is designed to be a one-stop shop for all of your data and project management needs. It integrates various data storage, analysis, versioning, and management applications into one unified platform.
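As a small illustration of how the shell and Git fit together in day-to-day data management, the sketch below renames a pair of analysis scripts to follow a consistent convention and records that change in a Git repository. The directory, file names, and commit message are hypothetical; the Carpentries lessons linked above cover these commands in much more depth.

```bash
#!/usr/bin/env bash
# Hypothetical example: standardize file names, then track the change with Git.

cd ~/projects/nih_study/code   # assumed project directory

git init                       # one-time setup: turn the directory into a Git repository

# Rename analysis scripts to a consistent, numbered convention (names are illustrative).
mv cleanup.R   01_clean-data.R
mv analysis.R  02_fit-models.R

git add 01_clean-data.R 02_fit-models.R
git commit -m "Rename analysis scripts to follow numbered naming convention"

# Later, a hosted copy on GitHub can be added and updated with:
#   git remote add origin <repository URL>
#   git push -u origin main
```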
5.4 CRDDS Services
CRDDS staff are available to consult on data management issues related to ongoing projects. We host consultation hours on Tuesdays (12:00-1:00) and Thursdays (1:00-2:00), which you can sign up for on the CRDDS Events page. You can also request a consultation outside of those hours by emailing crdds@colorado.edu. CRDDS regularly provides workshops and trainings related to data management (including workshops on some of the tools and best practices mentioned above), which you can also find on the CRDDS Events page.