So You Want to Cite Your Data: The Consequences of Data Citation

(using the Sage Bionetworks data analysis pipeline as a test bed and demonstrator)

Background

Modern research upholds the key principle of the scientific method, hypothesis-driven research, as a fundamental dogma: hypotheses must be postulated, tested and either rejected or tested further. Scholarly communication was built to support the exchange and independent validation of scientific discourse. Transparency with access and peer review were adopted as enactments of these principles.

We consider scholarly communication to require two concepts: evidence of experimental process or hypothesis (data), and publication of one’s opinion based on such evidence (interpretation and presentation). Traditional publication was conceived to support these two concepts and has fulfilled the role of creating a forum to exchange ideas and evidence.

Such ‘exchange-environments’ may exist in many forms but generally permit the following to occur…

- Dissemination: Act as a focus and distributor for entire communities
- Registration: Registering unique findings or ideas to authors
- Validation: Peer review based on double-blind analyses and re-analyses
- Filtration: Focus on discipline or core value-based rules
- Designation: Professional impact for authors
- Curation: Maintaining a record or archive of collected works

Until recently, traditional publication was able to accommodate both concepts of scholarly communication, but advancing technology has provided an opportunity to generate more evidential data than has ever been considered; certainly more than can be handled by traditional publication.

This has created emergent roles for data centres and data publishers, which are tasked with maintaining the evidence base in much the same conceptual space as libraries have for printed materials.

Why data citation

Data represent the underlying evidence of research. Without evidential data nearly all research is simply a collection of unsupported assertions. In addition, many data have value beyond their original purpose, either in validation and reproducibility activities or as aggregated subunits of larger datasets. There is also increasing evidence that multidisciplinary reuse of data is occurring more often. In all these cases datasets become valuable research objects and constituents of professional impact. If data are to be recognised as first-class research objects, they must fulfil the publication roles listed above. In the same way that citation of books and articles has facilitated the roles of publication, a citation framework for data is, in principle at least, required. But while the principle of data citation may be straightforward, its implementation and consequences are far reaching [refer to citation. WHY manuscript]. Most pressing is determining how the requirements for data as a citable object differ from those of existing citation frameworks. For this piece of work we examined the Sage Bionetworks data analysis pipeline to establish how data citation may apply.

Concept → Implication → Consequence → Requirement

Cited datasets should exist or have existed

Implies a location and persistence
Requires a physical location and resource / infrastructures for preservation
Data must be managed and preserved

Cited datasets are immutable

Implies fixity
No changes and authority to control
Datasets are logical citable units
Logical dataset with context
Recognition of ‘logical dataset’ and agreement on context

Citations should persist

Implies preservation of citation and/or dataset
Implies the persistence of governance structure
Resource in addition to data curation and maintenance

Datasets contribute to professional impact

Implies a capacity to measure impact
New ways to identify and measure impact as none exist for data
Incentives must be created

Cited datasets have creators

Implies ownership and rights
Application of existing or new IPR and access agreements
Clear lines of authority and accountability must be defined

Citation framework

Implies an agreed format, syntax and ID system
Resource for standards

Citation should confer value

Implies quality
Requires a review or validation process

In reality, these concepts, their implications and consequences resolve to the following requirements:

- Citation of datasets requires data management and persistence
- Citation of datasets requires clear roles of authority and accountability
- Citation of datasets requires resource to preserve the datasets and their citations

In turn, data citation will benefit:

- Researchers, by creating a strong incentive to share and manage data responsibly.
- Funders, by realising the value of research data and its re-use.
- Libraries, by re-joining the scholarly record.

Data must be managed and preserved

What data to cite

Data are empirical and exist in all research. They can represent substantial investment in resource and effort, e.g. observational data that are unique, or reproducible data such as metabolite measurements in defined/controlled experimental systems. Unlike traditional scholarly communication, data are diverse and may be exceptionally large collections from one experiment or relatively small, complex collections from many experiments. What constitutes a citable object in a data landscape?

In Sage Bionetworks it was important to understand the properties and provenance of the datasets used in the bioinformatics pipeline. Reassurance of the validity of ingested data is an important consideration for any data analysis pipeline. The current value of the datasets as research objects, and the value added by the Sage Bionetworks analysis processes, should also be considered, e.g. how much effort would be required to re-create the output data? Answering this requires a clear grasp of the datasets used in the pipeline.

Where is the data?

Data are only useful for reuse or validation if they exist somewhere and are accessible, either freely or otherwise. Data sharing is generally accepted as good scientific practice, whether sharing between individuals or through free open data movements, but maintaining data is not a trivial undertaking. The data must be in a form that can be maintained, transmitted, accepted and understood. In addition, there must be the relevant software or rendering environments for reuse by either human or machine consumers. For any of this to occur a data centre role must exist and be capable of maintaining and preserving data so that they may be shared.

In Sage Bionetworks individuals can contribute their collections to the analysis pipeline, as can institutions. The likelihood of persistence and preservation increases with increasing organisational status, and so data citation would involve organisations rather than individuals. What are the contributor roles in Sage Bionetworks and how can a data citation framework fulfil data citation needs?

Incentives must be created

How is it recorded?

In the example from DataCite a record is maintained of the dataset’s DOI and associated metadata (including some administrative metadata). Each DOI is assigned exclusively to one dataset and must not be re-used. The data controller/data publisher updates this metadata whenever changes are required, e.g. a change of URI or supplementary metadata. The dataset itself should remain static and immutable.
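The split between a fixed identifier, updatable metadata and an immutable dataset can be sketched in a few lines of Python. This is an illustration only and not DataCite's actual data model: the `CitationRecord` class, the `register` helper, the stored checksum and the DOI string are all hypothetical names introduced here, and recording a checksum at registration is an assumption about how immutability might be checked.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CitationRecord:
    """Hypothetical registry entry: identifier and checksum stay fixed, metadata may change."""
    doi: str
    checksum: str  # SHA-256 of the dataset bytes at registration time (an assumption)
    metadata: dict = field(default_factory=dict)

    def update_metadata(self, **changes):
        # The data controller may revise descriptive or administrative metadata, e.g. the URL.
        self.metadata.update(changes)

    def verify(self, data: bytes) -> bool:
        # Fixity check: the bytes offered must match what was registered.
        return hashlib.sha256(data).hexdigest() == self.checksum


def register(doi: str, data: bytes, **metadata) -> CitationRecord:
    """Create a record for a dataset, fixing its checksum at registration."""
    return CitationRecord(doi, hashlib.sha256(data).hexdigest(), dict(metadata))


# Example with placeholder values
data = b"sample,value\nA,1\n"
record = register("10.5072/example", data, url="https://example.org/v1")
record.update_metadata(url="https://example.org/moved")  # the location changes, the DOI does not
print(record.doi, record.metadata["url"], record.verify(data))
```

In this sketch a changed dataset fails `verify`, which is the point at which a new record, with a new identifier, would be needed.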
For how long is my citation valid?

Temporal properties of datasets are ill defined. In the context of citation this can be misleading. Dataset persistence depends on many things and need not affect a citation beyond making the dataset inaccessible through it, i.e. a citation declares the existence, past or present, of a dataset; it need not contain or imply the ability to access the dataset. However, it does commit to permanence: once registered, a citation identifier should not be changed.

We believe that citation increases the likelihood of dataset persistence and supports data preservation.
How can I be attributed to it?

A formal citation framework allows for authors and creators to be registered against a dataset. Together with a temporal record and submission tag, the citation object can be registered as a surrogate for the dataset creator. This should not be underestimated, as it provides a mechanism whereby creators can be attributed and credited for something that has always been considered the foundation of research but for which no credit mechanism was available.

Data citation has the ability to contribute to the formal recognition of professional contributions of researchers, groups, institutions, funding agencies and national governments. This is presently the preserve of traditional publication and its impact calculations. A data citation framework can make considerable advances in the way professional impact is calculated.

Clear lines of authority and accountability

Whose data are these?

Empirical data have a creator. The creator or a designated representative will have rights over how those data are used. This broadly maps to the ‘data controller’ as defined in the UK’s Data Protection Act 1998. The individual assuming the role of data controller must have the authority to transfer rights to a third party for specified purposes, whether free unrestricted use and re-use or according to specified restrictions. Either way, good scientific practice requires acknowledgement when using others’ data, and there is presently no agreed method for doing this in the way that citation of published work references derived or adopted concepts.

Who/what is responsible for it?
- Is it the organisation that funds the research that generated the data?
  - Government?
  - Research council?
  - …
- Is it the researcher that generated the data to undertake novel research?
- Is it the institution that houses the researcher in the endeavour of their research?
Are any of these roles able to transfer re-use rights to these data or take responsibility in a legal context? Responsibility and authority for data must be determined if rights are to be defined, from open to totally closed.

Who pays for what?

Permanent identifiers are rarely free. Permanence requires effort and the support of a framework and infrastructure that is also persistent and fulfils the requirements of citation concepts, i.e. referential immutability. Permanence is not a natural function of UK and many international research funding structures unless a clear political will supports citation frameworks as part of their funding processes. Permanence is a property that is transferred between long-term governance structures rather than short-term projects, e.g. libraries have fulfilled this role for traditional publication. They may also have a role to play for data citation.

What about adding value to someone else’s dataset?

When datasets are changed in any way they become derivatives, abstractions etc. For the purposes of data citation they become new datasets and should be identified as such. Those derivatives or abstractions are no less valuable and in some cases will have increased the value of the source dataset. Where this takes effort and resource, each becomes its own logical citable unit that declares its own value and that of the original dataset. Any citation framework must fulfil this requirement. In Sage Bionetworks this is a particularly relevant point. When does a consumed dataset become a more valuable dataset?
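One way to picture this requirement is a record in which every derivative receives its own identifier and keeps a reference to its source, so that citing the derivative also declares the original. The Python sketch below is a minimal illustration under that assumption; the `Dataset` class, the `derive` and `lineage` helpers and the identifiers are hypothetical and are not taken from Sage Bionetworks or any existing framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical citable unit; frozen so a registered record cannot be altered."""
    identifier: str
    title: str
    derived_from: Optional["Dataset"] = None


def derive(source: Dataset, identifier: str, title: str) -> Dataset:
    """Register a changed dataset as a new citable unit that points back to its source."""
    if identifier == source.identifier:
        raise ValueError("a derivative must receive its own identifier")
    return Dataset(identifier, title, derived_from=source)


def lineage(dataset: Optional[Dataset]) -> List[str]:
    """Return identifiers from the given dataset back to the original source."""
    chain = []
    while dataset is not None:
        chain.append(dataset.identifier)
        dataset = dataset.derived_from
    return chain


# Example with placeholder identifiers
raw = Dataset("doi:10.5072/raw", "Contributed expression data")
curated = derive(raw, "doi:10.5072/curated", "Curated and normalised expression data")
print(lineage(curated))
```

The chain returned by `lineage` is what lets credit flow to both the derivative and the dataset it consumed.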

So what now?

Implementing a citation framework has implications and consequences. Citation is a formal declaration that something exists and a reference to its location and access, whether free or otherwise. As such it shares many properties with a data management function: identification, location and access. The core difference between data management and citation is the immutability and the external, independent dimension of citation. In this sense data citation has implications that extend beyond the usefulness of data and its proper management, i.e. data citation supports scholarly communication rather than data management functions.

How do I cite my data?

Data exist in context; most data are not useful without some description of what they are. This metadata can be light, where simple bibliographic information is recorded, e.g. Dublin Core types. Alternatively, it can be complex, where full metadata specifications and formats are defined for every datum type in the dataset, e.g. the ISO/IEC 11179 metadata standard. Taken to the extreme, a peer-reviewed publication can represent the ultimate vernacular metadata record of a dataset or evidence base.

Traditional citation captures various information, most commonly:
- Creator, e.g. authors
- Date, e.g. publication date or collection date
- Title, textual description
- Identifiers, references
- Location, the journal or publisher

In addition, there may be a format or presentation style suggested for text-based communication and perhaps an actionable identifier that, over an internet protocol, can locate and transfer/access the object being cited.
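As a rough illustration, the elements listed above can be held in a small record and rendered in one presentation style. The Python sketch below is not a recommended or standard citation format: the `Citation` class, the field order in `render` and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """The commonly captured citation elements, held as plain fields."""
    creators: List[str]
    date: str
    title: str
    location: str    # e.g. the journal, publisher or data centre
    identifier: str  # e.g. a DOI or other reference

    def render(self) -> str:
        # One possible presentation style; the element order is an assumption.
        names = "; ".join(self.creators)
        return f"{names} ({self.date}). {self.title}. {self.location}. {self.identifier}"


# Example with placeholder values
example = Citation(
    creators=["Smith, J.", "Jones, A."],
    date="2011",
    title="Example expression dataset",
    location="Example Data Centre",
    identifier="doi:10.5072/example",
)
print(example.render())
```

Keeping the elements separate from the rendering means the same record can be presented in whatever style a publisher or community prefers.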

DataCite has adopted very much the same framework for data. Once minimum requirements are met, a permanent, actionable identifier (in the form of a DOI) is issued as a citation object for the data. The minimum requirements for DataCite are:

- Data are maintained or persisted
- Landing pages are fully open
- Minimum metadata are submitted
- Responsibility for updating DOI metadata is accepted

Persistent identifiers are commonly used to identify citable objects; however, they are rarely persistent, often ambiguous in what they identify and almost never free of charge. For example, identifiers are sometimes re-used, and identification implies but does not guarantee access. In addition, there is often a transfer point that is not the citation target, e.g. an abstract page for a published journal article or a catalogue entry. Finally, unless stable resource is dedicated to maintaining the record of identification and metadata, persistence decays and ambiguity increases.

There is no common or widely agreed format to cite or reference datasets. Common identifier types include:
- DOI: Digital Object Identifier. A global citation syntax built on handle resolution services. Already in use by many publishers for journal articles through the CrossRef service. Also used as PUIDs for films, standards concepts, books and other objects. Independent of network protocol but most often resolved using HTTP.
- URI: Uniform Resource Identifier. A standard consisting of a scheme (http), domain (www.forexample.com) and a path (/forexample.htm). Relies on the DNS and is a cornerstone of identification on the WWW over HTTP.
- PURL: Persistent URL. Implements a resolution layer for URLs, allowing the PURL itself to persist while redirecting to new locations when required.
- ARK: Archival Resource Key. Used in archiving processes, the ARK is a URL that returns a metadata statement on the digital object and/or a service commitment from the current provider (e.g. an archive).
- HDL: Handle. The system upon which DOIs are based, but which can be implemented at lower cost and without global, institutional support/stability. Used in conjunction with hashtags in MIT’s DataVerse project. Has been useful and cost effective for institutional or consortia management of data.
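The URI components named in the list above, and the step of turning a DOI into something actionable over HTTP, can be shown with the Python standard library. This is a sketch only: the helper names `uri_parts` and `doi_to_url` are hypothetical, the DOI is a placeholder, and the use of the public proxy at https://doi.org/ as the default resolver is a choice made for this example.

```python
from urllib.parse import quote, urlsplit


def uri_parts(uri: str) -> dict:
    """Split a URI into the scheme, domain and path described in the list above."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "domain": parts.netloc, "path": parts.path}


def doi_to_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Build an HTTP address for a DOI by prefixing a resolver (an illustrative choice)."""
    # Percent-encode characters that are unsafe in a URL path, keeping the "/" separator.
    return resolver + quote(doi, safe="/")


# Example: the URI from the list and a placeholder DOI
print(uri_parts("http://www.forexample.com/forexample.htm"))
print(doi_to_url("10.5072/example"))
```

The DOI itself carries no network location; only the resolver prefix ties it to HTTP, which is why a DOI can stay fixed while the location behind it changes.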