Library Guides: Research Data Management: Writing a Data Management Plan: 5) Data Sharing & Long Term Preservation

Purpose of this Section

The Data Sharing & Long Term Preservation section of your DMP will outline how data will be shared and preserved long term in line with the FAIR principles.

FAIR Data

High-quality data have the potential to be reused in many ways. Archiving and publishing your data properly is at the core of making your data FAIR and will enable both your future self and others to get the most out of your data.

FAIR stands for Findable, Accessible, Interoperable and Reusable. The FAIR Data Principles were developed and endorsed by researchers, publishers, funding agencies and industry partners in 2016 and are designed to enhance the value of all digital resources.

Following the lead of the European Commission and Horizon 2020, Irish funders, including the Health Research Board (HRB) and Irish Research Council (IRC), are now asking Irish researchers to address, via a Data Management Plan (DMP), how they will make their data FAIR.

If your goal is to make your data FAIR you should build this into your research plan from the start.

The Four Basics of FAIR:

Findable - Discoverable with metadata, identifiable and locatable by means of a standard identification mechanism.
Accessible - Always available and obtainable by humans and machines; even if the data is restricted, the metadata is open.
Interoperable - Data and metadata should conform to recognised formats and standards to allow them to be combined and exchanged.
Reusable - Sufficiently described and shared with the least restrictive licences, allowing the widest reuse possible and the least cumbersome integration with other data sources.
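
As a rough illustration of how the four principles translate into a dataset description, the sketch below checks a metadata record for a few fields per principle. The field names and the mapping to each principle are assumptions made for this example; they are not part of the FAIR principles or of any metadata standard, and a real deposit should follow the schema of your chosen repository.

```python
# Hypothetical mapping from each FAIR principle to the metadata fields
# this sketch expects; the field names are illustrative only.
FAIR_FIELDS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access_conditions"],
    "Interoperable": ["file_format", "metadata_standard"],
    "Reusable": ["licence", "documentation"],
}


def missing_fair_fields(record):
    """Return {principle: [missing or empty fields]} for a metadata record."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = []
        for field in fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            gaps[principle] = missing
    return gaps


# Invented record: the data are restricted, but the metadata can still be open.
record = {
    "identifier": "https://doi.org/10.1234/placeholder",  # placeholder, not a real DOI
    "title": "Example survey dataset",
    "description": "Responses from a hypothetical survey.",
    "access_conditions": "Restricted; available on request",
    "file_format": "CSV",
    "licence": "",
}

for principle, fields in missing_fair_fields(record).items():
    print(f"{principle}: missing {', '.join(fields)}")
```

A check like this only confirms that fields are filled in; whether the descriptions are good enough for reuse still needs human judgement.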

Things to remember:

FAIR is a set of principles; not a standard.
Does following the FAIR principles mean that your data has to be shared openly with everyone? NO.
- Data can be FAIR but not open. For example, data could meet the FAIR principles, but be private or only shared under certain restrictions.
- Open data may not be FAIR. For example, publicly available data may lack sufficient documentation to meet the FAIR principles, such as licensing for clear reuse.
The FAIR principles can help you describe, in practical terms, how you will create, store, share, manage and preserve your data in your DMP.

(CESSDA Data Management Expert Guide) (OpenAIRE) (UCD Library Guide) (Jones, Sarah, & Grootveld, Marjan)

Why Publish and Archive your Data?

Archiving data for future reference:
Research data archiving is about storing and preserving research data for the long term. When you archive your data, you make sure you can read and access the data later on. You can then also allow access to others for verification purposes when such a request arrives. In all cases, you should store your data safely, in a suitable file format, with adequate documentation.

Publishing data for reuse:

To make your data reusable for purposes beyond the one for which you collected them, you should publish your data. Publishing your data is the act of publicly disclosing the research data you have collected, making them findable, accessible and reusable.

(CESSDA Data Management Expert Guide)

What to Include in the Data Sharing & Long term Preservation Section

While writing this section of your DMP, bear in mind that data should be as open as possible and as closed as necessary, in line with the FAIR principles, i.e. data should be findable, accessible, interoperable and reusable.

Outline how and where data will be shared & discoverable, for how long, and who can use the data. Explicitly name a data repository or archive and demonstrate that the repository policies and procedures (including any metadata standards, and costs involved) have been checked.
Note when data will be made available, bearing in mind that data needed to validate the research results presented in scientific publications should be made available at the time of publication or as soon as possible thereafter.
Outline what data must be retained or destroyed for contractual, legal or regulatory purposes.
Explain how you will prepare your data for sharing, for example whether you need to remove identifiable or sensitive information.
Outline how data will be actively maintained after the research project if you are not submitting to a repository, including what would happen to the data if you left your institution, what resources you need for data preservation, how you will prepare the data for long term preservation, and who will be the contact person for data queries.
If data cannot be shared, this must be justified, e.g. commercially sensitive data or confidential data. However, metadata and documentation describing the data and research process should still be made available, in compliance with the FAIR Data principles.

(UCD Data Management Checklist)
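
One common preparation step from the list above, removing direct identifiers before sharing, might look like the sketch below. It uses only Python's standard `csv` module; the column names and sample rows are invented for illustration. Dropping columns alone does not guarantee anonymity, since combinations of the remaining variables can still identify individuals, so treat this as a starting point only.

```python
import csv
import io

# Hypothetical direct identifiers to drop before sharing.
DIRECT_IDENTIFIERS = {"name", "email"}


def strip_identifiers(source, target, drop=DIRECT_IDENTIFIERS):
    """Copy CSV rows from source to target, omitting the columns in drop."""
    reader = csv.DictReader(source)
    kept = [col for col in reader.fieldnames if col not in drop]
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(target, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# In-memory example; with real files, open them with newline="".
raw = io.StringIO(
    "participant_id,name,email,age_group,score\n"
    "P001,Invented Person,person@example.org,30-39,7\n"
    "P002,Another Person,another@example.org,40-49,5\n"
)
shared = io.StringIO()
kept_columns = strip_identifiers(raw, shared)
print(shared.getvalue())
```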

Deciding what Data to Share

How you decide which data to share will be based on academic judgement, funder or legal requirements, and practical factors such as volume and cost.

One way to approach the question is to consider what data another researcher would need to validate your findings.

Or, turn the question around: if you read a research paper which included a statement about where to locate the underlying data, what would you expect to see?

You should:

Identify how data might be reused – for instance, verification or further analysis.
Identify data that must be shared – is there a policy requirement / funder requirement?
Identify data that should be shared – does the data have long term value?
Weigh up the costs and benefits, in terms of time, resource, and costs of repository storage and long-term curation.

(University of Leeds Library) (University of Edinburgh MANTRA Training)

How to Prepare your Data for Sharing & Preservation

Create a thoughtful and thorough DMP - Writing a data management plan (DMP) at the beginning of your project will help you to think about the different stages of data generation, storage and archiving. This will help you to plan ahead and make sensible choices, making the archiving process easier when the time comes.
Store data using recommended formats - Where possible, avoid using proprietary file formats which require specific licensed software to open, as this will help future proof your data and ensure that your data continue to be useful and usable in the long term.
Keep detailed documentation - While you may fully understand your data while you are actively using it, with passing time it may become difficult to remember exactly how your data are structured, how your variables were generated, or the steps or procedures needed to actually open and use your data. Keeping suitable documentation detailing how your data were generated, how they were manipulated, and how they can be used will help to ensure that you and others are able to fully understand and use your data in the future.

(University of Edinburgh MANTRA Training)
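
To make the last two points concrete, the sketch below writes a small table to CSV (an open, non-proprietary format) and produces a plain-text data dictionary alongside it. The variable names, descriptions, values and file names are all hypothetical; adapt them to your own data and to any documentation template your repository provides.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical variables: name -> (description, unit or coding).
VARIABLES = {
    "site_id": ("Identifier of the sampling site", "text"),
    "temp_c": ("Water temperature", "degrees Celsius"),
    "sampled_on": ("Sampling date", "ISO 8601 date (YYYY-MM-DD)"),
}

# Invented example rows.
ROWS = [
    {"site_id": "A1", "temp_c": 11.4, "sampled_on": "2021-08-02"},
    {"site_id": "B2", "temp_c": 12.1, "sampled_on": "2021-08-03"},
]


def write_dataset(folder, rows, variables):
    """Write data.csv plus a README.txt data dictionary into folder."""
    folder = Path(folder)
    data_path = folder / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(variables))
        writer.writeheader()
        writer.writerows(rows)

    readme_path = folder / "README.txt"
    lines = ["Data dictionary for data.csv", ""]
    for name, (description, unit) in variables.items():
        lines.append(f"{name}: {description} [{unit}]")
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return data_path, readme_path


# Write into a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    data_file, readme_file = write_dataset(tmp, ROWS, VARIABLES)
    print(readme_file.read_text(encoding="utf-8"))
```

Keeping the data dictionary in a plain-text file next to the data means it can be read without any special software, for the same reason the data themselves are stored as CSV.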

Most of the work in terms of preparing your data for preservation should have already been completed in the Data Collection Section and the Documentation Section of your Data Management Plan. The most important thing is to store the data using recommended formats and keep detailed documentation.

Recommended formats can be found in this OpenAIRE Guide (Data Formats for Preservation) and on the Data Collection page and Documentation & Metadata page of this Library Guide.

Data Publication / Archiving / Long term Preservation

For a dataset to count as a publication the data should be:

Properly documented with metadata.
Reviewed for quality.
Searchable and discoverable in catalogues (or databases).
Citable in publications.

There are different ways to publish your data. Your preference may depend on the existing practices in your discipline or on the expectations of your funder.

	Advantages	Disadvantages
Journal Supplementary Service	Most likely to comply with the journal or publisher’s requirements; Data readily available alongside published findings.	May be costly; May claim copyright over the data; May keep data behind a subscription wall; Unlikely to offer a data repository’s functionality or long-term solution; May not apply user-friendly or preservation formats; More likely to accept subsets rather than complete datasets.
Institutional Data Repository	Most likely to accept any data of value, especially if no suitable home can be found for it elsewhere, and to ensure that policy requirements for long-term access are met; Researchers may trust such a repository more readily; Possibly no charge for the data deposit; May make your data visible via dissemination and promotion.	May not offer sustainable long-term access to your data collection; Might not have sufficient expertise in data and metadata standards needed for long-term preservation and access.
General Purpose Repository	Most likely to offer useful search, navigation and visualisation functionality; Reach a wider audience of potential users; Accepts a wide range of data types; Suitable for cross-disciplinary data.	Requires scrutiny of terms and conditions to ensure consistency with your funder, journal or institution’s policies on cost recovery, copyright/IP, and long-term preservation; No editorial control over quality of deposited materials; In most cases, only simple metadata is available, which is usually not enough for reuse.
Domain Specific Data Repository	Offers specialist domain knowledge and data management expertise, e.g. to create a catalogue record and documentation; Likely to accept complete datasets (and not only the part of the dataset on which a publication is based); May make your data visible via dissemination and promotion.	Likely to be selective about what kind of data they accept.
Trusted Domain Specific Data Repository	Offers specialist domain knowledge and data management expertise, e.g. to create a catalogue record and documentation; More likely to accept complete datasets; Provides preservation and curation to community standards, e.g. file formats migration; Ability to control access of (sensitive) personal data; May handle data re-use queries; May make your data visible via dissemination and promotion.	Most likely to be selective about what kind of data they accept; May charge for data publishing; Requires advance planning of the effort needed to meet high standards for metadata and documentation.

(CESSDA Data Management Expert Guide)

One of the best ways to make data discoverable and sharable is to submit to a discipline-specific, community-recognised repository where possible, or to a general, multidisciplinary repository if no suitable discipline-specific repository is available.

Ideally, persistent identifiers should be used so that data can be reliably and efficiently located and referred to, and so that citations and reuse can be tracked. Typically, a trustworthy, long-term repository will provide a persistent identifier.

(UCD Data Management Checklist)

A data repository allows researchers to upload and publish their data, thereby making the data available for other researchers to re-use. Similarly, a data archive allows users to deposit and publish data but will generally offer greater levels of curation to community standards, have specific guidelines on what data can be deposited and is more likely to offer long-term preservation as a service. Sometimes the terms data repositories and data archives are used interchangeably. A data repository or archive will provide services such as:

A persistent identifier such as a “digital object identifier” or DOI; the presence of a DOI facilitates discoverability and citability.
Assistance with metadata provision e.g. through the use of a template.
Allow you to apply a licence to your data.
Aid compliance with the FAIR data principles (data that are Findable, Accessible, Interoperable, and Reusable) as data are published online with appropriate metadata and are assigned a persistent identifier, see Jones, Sarah, & Grootveld, Marjan. (2017). How FAIR are your data?. Zenodo. http://doi.org/10.5281/zenodo.1065991.
Accept a wide range of data types.
Long-term access and, in some cases, long-term preservation.
Offer useful search, navigation and visualisation functionality.
Reach a wider audience of potential users.
Manage requests for data on your behalf.

(University of Galway Library Guide)
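
A DOI such as the one cited in the list above can be turned into a resolvable link by prefixing it with the `https://doi.org/` resolver. The helper below is a minimal sketch (the function name `doi_to_url` is made up for this guide) that accepts a few common ways of writing a DOI and returns one consistent URL. It only tidies the text; it does not check that the DOI is registered.

```python
import re

# Prefixes people commonly put in front of a DOI.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def doi_to_url(value):
    """Return an https://doi.org/ URL for a DOI written in a common form."""
    doi = _DOI_PREFIX.sub("", value.strip())
    if not doi.startswith("10."):
        raise ValueError(f"Not a DOI: {value!r}")
    return f"https://doi.org/{doi}"


print(doi_to_url("http://doi.org/10.5281/zenodo.1065991"))
print(doi_to_url("doi:10.5281/zenodo.1065991"))
```

Recording identifiers in one consistent form makes it easier to cite a dataset and to match citations of it later.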

When choosing a data repository you need to be sure that your data will be curated and managed appropriately over time. Choosing a repository that has been certified will help you to be confident that the repository you choose is trustworthy and your data will remain findable, accessible and usable in the future.

Several certification standards have been developed which allow the principles and practices of digital repositories to be evaluated.

Some examples of certification standards for digital repositories include:

CoreTrustSeal - a peer-reviewed certification standard that evaluates data repositories against a set of specific criteria.
'Audit and certification of trustworthy digital repositories' (ISO 16363).
'Criteria for Trustworthy Digital Archives' (DIN 31644).

The repository community has also established a set of principles called 'TRUST: defining best practice through transparency, responsibility, user focus, sustainability, and technology' which help promote ongoing preservation standards and practice among the repository community.

(University of Edinburgh MANTRA Training)

Additionally, Science Europe has the following criteria for evaluating repositories:

Provision of Persistent and Unique Identifiers (PIDs):

Allow data discovery and identification.
Enable searching, citing, and retrieval of data.
Provide support for data versioning.

Metadata:

Enable finding of data.
Enable referencing to related relevant information, such as other data and publications.
Provide information that is publicly available and maintained, even for non-published, protected, retracted, or deleted data.
Use metadata standards that are broadly accepted (by the scientific community).
Ensure that metadata are machine-retrievable.

Data access and usage licences:

Enable access to data under well-specified conditions.
Ensure data authenticity and integrity.
Enable retrieval of data.
Provide information about licensing and permissions (ideally in machine-readable form).
Ensure confidentiality and respect rights of data subjects and creators.

Preservation:

Ensure persistence of metadata and data.
Be transparent about mission, scope, preservation policies, and plans (including governance, financial sustainability, retention period, and continuity plan).

(Science Europe Practical Guide to the International Alignment of Research Data Management)

Questions to Ask Yourself when Choosing a Repository:

Has a data repository been specified by my funder? e.g. NERC Data Centre for research funded by the UK’s Natural Environment Research Council.
Has a data repository been specified by my publisher? e.g. SpringerNature recommended repositories. PLOS recommended data repositories. Scientific Data recommended data repositories.
Is there a discipline-specific, community-recognised data repository I can submit my data to, thereby helping to preserve my data according to recognised standards in my discipline? e.g. Irish Social Science Data Archive. Cancer Imaging Archive. PubChem. PANGAEA.
Is it reputable? Is it listed in Re3data thereby meeting their conditions of inclusion?
Is it appropriate to my discipline?
Will it take the data you want to deposit?
Is there a size limit?
Does it provide a DOI / persistent identifier?
Does it provide guidance on how the data should be cited?
Does it provide access control, where necessary, for your research data?
Does it ensure long-term preservation / curation?
Does it provide expert help with e.g. metadata provision, curation?
Is there a charge?

(University of Galway Library Guide)

PLOS has an excellent list of both discipline-specific and multidisciplinary Recommended Repositories.

The Registry of Research Data Repositories (re3data.org) can be searched by discipline to find discipline specific data repositories worldwide with community specific standards.

FAIRsharing.org is a curated registry with a focus on the life sciences.

Multi-disciplinary Repositories
Zenodo	Trusted multi-disciplinary repository funded by the EU and run by CERN. Accepts data sets, publications, presentations, posters, multimedia, software, or educational resources. Datasets deposited will get a DOI (persistent and unique identifier). Suitable for long-term preservation and sharing of research results, but not for data management in ongoing projects.
Figshare	Repository where users can make all of their research outputs available in a citable, shareable and discoverable manner.
Data Hub	Provides free access to its core features letting you search for data, register published datasets, create and manage groups of datasets.
Open Science Framework	Free open platform that supports research and enables collaboration.
Dataverse	A personal dataverse is easy to set up, allows you to display your data on your personal website, can be branded uniquely as your research program, makes your data more discoverable to the research community, and satisfies data management plans.
Dryad	Hosts a wide range of data types. For some journals there is no charge to deposit in Dryad.

Irish Repositories

Irish Qualitative Data Archive (IQDA)

Irish Social Science Data Archive (ISSDA)

HRB Open Research

DATA.GOV.IE

Resources to compare repositories:

Generalist Repository Comparison Chart.

OpenAIRE - How to find a trustworthy repository for your data.

(University of Galway Library Guide)

Licensing your shared data is important to make potential users aware of how they can use your data. A licence states what can be done with the data and how that data can be redistributed.

Before considering the licensing options that are available, you should first check whether you are obliged to use a certain licence as a condition of funding or deposit, or as a matter of local policy. While bespoke licences are useful for catering for very specific circumstances, most research projects would be better served using one of the standard licences e.g. Creative Commons Licences.

How to add a Licence to your Data:

How to License Research Data from Alex Ball and DCC provides an in-depth guide on how to add a licence to your data. See the section Mechanisms for licensing data.

Creative Commons Licences:

Licences offered by Creative Commons.

Creative Commons Licence Chooser Tool.

(University of Galway Library Guide) (Digital Curation Centre)

The DCC has an excellent guide on How to Cite Datasets and Link to Publications.

Harvard Referencing Style:

Citation elements:

Author / Principal Investigator / Data Creator.
Date (in round brackets) - Publication date/Release Date, for a completed dataset.
Title of dataset (in single quotation marks) - Title of Data Source.
Available at: DOI or URL - location or persistent identifier, persistent URL where dataset can be accessed (repository link, handle, etc.) or DOI.
(Accessed: date) - when data is accessed online.

In-text citation example:

The dataset by Le et al. (2021) provided ...

Reference list example:

Le, T. et al. (2021) ‘SyntheticFur dataset for neural rendering’. Available at: https://arxiv.org/abs/2105.06409 (Accessed: 2 August 2021).

Or, if you are required to reference all named authors:

Le, T., Poplin, R., Bertsch, F., Toor, A.S. and Oh, M.L. (2021) ‘SyntheticFur dataset for neural rendering’. Available at: https://arxiv.org/abs/2105.06409 (Accessed: 2 August 2021).

(CiteThemRight) (University of Galway Library Guide)
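
The citation elements above can also be assembled programmatically, which helps when you need to cite many datasets consistently. The sketch below is one possible approach: the function name and parameters are hypothetical, and the output simply mirrors the two reference list examples in this guide rather than any official tool. Whether to shorten the author list is left to the caller, since the rule depends on the style guide you are asked to follow.

```python
def harvard_dataset_reference(authors, year, title, location, accessed,
                              et_al=False):
    """Build a Harvard-style dataset reference from its elements.

    authors: list of "Surname, Initials" strings, in publication order.
    et_al:   if True, give only the first author followed by "et al.".
    """
    if et_al and len(authors) > 1:
        author_text = f"{authors[0]} et al."
    elif len(authors) > 1:
        author_text = ", ".join(authors[:-1]) + " and " + authors[-1]
    else:
        author_text = authors[0]
    return (f"{author_text} ({year}) ‘{title}’. "
            f"Available at: {location} (Accessed: {accessed}).")


# Elements taken from the reference list examples in this guide.
authors = ["Le, T.", "Poplin, R.", "Bertsch, F.", "Toor, A.S.", "Oh, M.L."]
details = dict(
    year=2021,
    title="SyntheticFur dataset for neural rendering",
    location="https://arxiv.org/abs/2105.06409",
    accessed="2 August 2021",
)

print(harvard_dataset_reference(authors, et_al=True, **details))
print(harvard_dataset_reference(authors, **details))
```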