D-Lib Magazine
The Magazine of Digital Library Research

January/February 2014
Volume 20, Number 1/2

 

Data Identification and Citation — The Key to Unlocking the Promise of Data Sharing and Reuse

Adam Farquhar
British Library and DataCite
adam.farquhar@bl.uk

Jan Brase
DataCite
jan.brase@tib-uni-hannover.de

doi:10.1045/january2014-farquhar

 


 

Abstract

The Fourth DataCite Annual Meeting was held jointly with CODATA and co-located with the Research Data Alliance (RDA) Second Plenary, 19-20 September 2013. DataCite is an international organisation whose mission is to make research better by enabling researchers to find, share, reuse, and cite data. It engages researchers, scholars, data centers, libraries, publishers, and funders through advocacy, guidance and services. The 2013 annual conference attracted nearly 200 attendees from around the world. This report describes the important discussion topics, and recent accomplishments and findings highlighted at the meeting, and also introduces some important remaining challenges.

 

Introduction

Data-driven science, scholarship, and policy are of growing importance. To be successful, however, they rely on the ability of researchers to share and reuse data. Data sharing and reuse require data that are connected: to the scientific literature, to data producers, to the data centers that curate them, and even to other data. Reliable connections in turn require reliable persistent identification. This is the challenge that has given rise to DataCite and that was addressed at the Fourth DataCite Annual Meeting held jointly with CODATA, co-located with the Research Data Alliance (RDA) Second Plenary, 19-20 September 2013.

Shortly before the conference, the Data Science Journal published the final report from the ICSTI-CODATA task force. It highlighted that "the use of published digital data, like the use of digitally published literature, depends upon the ability to identify, authenticate, locate, access, and interpret them. Data citations provide necessary support for these functions, as well as other functions such as attribution of credit and establishment of provenance." [1]

In planning the 2013 annual meeting, co-locating it with the RDA plenary was an obvious choice. Since the founding of the RDA, DataCite members have been active in the Organization Advisory Board Task Force of the RDA and in the RDA working groups on data publication, data citation, metadata and PID information types. The work of the RDA and DataCite's goals overlap, and both organisations focus clearly on enabling better access to and re-use of research data at an international level. Both initiatives agree that this can only be achieved through open co-operation and harmonisation of efforts.

 

About DataCite

DataCite is an international organisation whose mission is to make research better by enabling researchers to find, share, reuse, and cite data. It engages researchers, scholars, data centers, libraries, publishers, and funders through advocacy, guidance and services.

DataCite has grown to 18 full members from around the world which work with data stewards that provide long-term archiving for data. For example, the British Library is a member that works with the UK Data Archive to assign persistent identifiers to high quality social science data. The UK Data Archive works with researchers to package and archive data; it also provides additional services to researchers to help them work with the data and use it in new research. As of 2013, DataCite members work with over 250 different data stewards to ensure that data sets are uniquely and persistently identified. Together they have assigned over 2,000,000 DOI® names.

DataCite's key services enable reliable persistent identification of data, especially through digital object identifiers (DOI names). Identifiers play an essential enabling role for citation, discovery, tracking usage, and ensuring that researchers get the credit that they deserve for the data that they produce and collect.
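To make the idea of a DOI name concrete, here is a minimal sketch in Python that splits a DOI name into its prefix and suffix and builds a resolver link of the form used in this report's reference list (http://dx.doi.org/...). The helper names `parse_doi` and `resolver_url` are hypothetical, the validation is deliberately loose, and nothing here is DataCite's own tooling.

```python
from typing import NamedTuple
from urllib.parse import quote

# Resolver form used in this report's reference list.
RESOLVER = "http://dx.doi.org/"


class DoiName(NamedTuple):
    prefix: str  # registrant part, e.g. "10.5438"
    suffix: str  # item part chosen by the registrant


def parse_doi(text: str) -> DoiName:
    """Split a DOI name into prefix and suffix (hypothetical helper, loose checks)."""
    name = text.strip()
    # Accept the "doi:" label often printed in article headers.
    if name.lower().startswith("doi:"):
        name = name[4:]
    prefix, sep, suffix = name.partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI name: {text!r}")
    return DoiName(prefix, suffix)


def resolver_url(doi: DoiName) -> str:
    """Build a resolver link; the suffix is percent-encoded, keeping '/' intact."""
    return RESOLVER + doi.prefix + "/" + quote(doi.suffix, safe="/")


# The DOI of the DataCite Metadata Schema cited as reference [3].
doi = parse_doi("doi:10.5438/0008")
print(resolver_url(doi))
```

The point of the sketch is only that the identifier itself stays stable while the resolver link is derived from it, so a citation can keep working when the data move to a new location.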

The organisation's goals, as set out in its 2013-2015 strategy, are to:

  1. Become a sustainable organization. DataCite is rooted in work that began in Germany a decade ago and was itself founded in 2009 with strong support from a major national organisation. As data identification and citation becomes more deeply embedded in research practices, however, it is essential to build on this basis.
  2. Become part of the global research infrastructure. This is the key focus of both the organisation's and members' activities. There is considerable work to embed data identification and citation practices and ensure that seamless services can be provided at the infrastructure level.
  3. Nurture our membership & build strong communities. DataCite is a membership and community driven organisation engaged with hundreds of data stewardship organisations around the world and members with deep understanding of how research works within their purview. This provides a challenge to work effectively across subject, organisational, and geographical boundaries, but also an opportunity to learn from each other and speak with a community voice.
  4. Build and maintain services, guidelines, policies. DataCite operational and working groups are engaged in the global discussion and are specifying essential guidelines, including best practices [2] and the essential metadata to support citation and use of data [3, 4].
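As an illustration of how a small set of metadata can support citation, the following Python sketch turns a record into a citation string, loosely following the pattern recommended alongside the metadata schema [3] (creator, publication year, title, publisher, identifier). The `DatasetRecord` class, the field names, and the exact punctuation are assumptions made for this example, and the record shown uses placeholder values rather than a real dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical minimal record holding the fields needed for a citation."""
    creators: List[str]
    publication_year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None  # optional; omitted from the citation when absent


def format_citation(record: DatasetRecord) -> str:
    """Assemble 'Creators (Year): Title. [Version.] Publisher. Identifier'."""
    creators = "; ".join(record.creators)
    parts = [f"{creators} ({record.publication_year}): {record.title}."]
    if record.version:
        parts.append(f"Version {record.version}.")
    parts.append(f"{record.publisher}.")
    # The identifier is given as a resolver link, as in this report's references.
    parts.append(f"http://dx.doi.org/{record.doi}")
    return " ".join(parts)


# Placeholder values for illustration only; not a real dataset or DOI.
example = DatasetRecord(
    creators=["Example, A.", "Sample, B."],
    publication_year=2013,
    title="Placeholder survey dataset",
    publisher="Example Data Centre",
    doi="10.9999/placeholder-dataset",
    version="2",
)
print(format_citation(example))
```

Keeping the version as an explicit, optional field is a design choice for the sketch; how versions should be cited in general is one of the open questions discussed at the meeting.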
 

Fourth DataCite Annual Meeting

The 2013 annual conference attracted nearly 200 attendees from around the world.

The conference opened with a keynote by Salvatore Mele, Head of Open Access at CERN. He gave a memorable talk that culminated in demonstrating that the dataset evidencing the observation of the Higgs Boson was recently provided as an independent data publication with a DataCite identifier, 10.7484/INSPIREHEP.DATA.A78C.HK44.

The closing keynote was from George Alter, director of the Inter-university Consortium for Political and Social Research (ICPSR), the leading US social sciences data centre and DataCite affiliate member. He provided an analysis of the scholarly communication chain and argued that it was time to focus on journal editors in order to integrate data more completely.

The sessions highlighted work by North American members and clients of DataCite. Michael Witt (Purdue U) described how the PURR (Purdue University Research Repository) system integrated data archiving and identification into an institutional repository. Laurel Haak highlighted the need for further integration between articles, data, and researchers. She demonstrated the progress that ORCID has made in providing unique identifiers for active researchers. ORCID launched as a system one year ago and has already enabled more than 350,000 researchers to receive their own unique ID. DataCite is working closely with ORCID to ensure that there can be clear bi-directional links between articles, data, and the researchers that create them.

In the ODIN project (the ORCID and DataCite Interoperability Network), co-funded by the European Commission under the Seventh Framework Programme, ORCID and DataCite are designing conceptual models to harmonize the use of data and contributor identifiers. One example of such harmonization is the recently launched service for searching and claiming works in DataCite. This tool enables users to search the DataCite Metadata Store for their works and subsequently to add (or claim) those research outputs, including datasets, software, and other types, to their ORCID profile.

The January 2014 Memorandum of Understanding between DataCite and ORCID calls for bi-directional links between people and the data they produce. This follows on the 2012 joint statement between DataCite and the STM Publishers Association calling for bi-directional links between articles and data. This brings us closer to a world in which researchers, articles, and data are seamlessly interlinked in an integrated scientific record.
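What "bi-directional" means in practice can be shown with a toy example. The Python sketch below keeps an in-memory index in which every link is recorded in both directions, so that a dataset can be reached from a researcher or an article and vice versa. The `LinkIndex` class and all identifiers are hypothetical placeholders; this is not how ORCID or DataCite services store their links.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class LinkIndex:
    """Toy in-memory index; every link is stored in both directions."""

    def __init__(self) -> None:
        self._links: DefaultDict[str, Set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        # Record the relation from each side so either end can be the start.
        self._links[a].add(b)
        self._links[b].add(a)

    def linked_to(self, identifier: str) -> Set[str]:
        # Return a copy so callers cannot modify the index by accident.
        return set(self._links.get(identifier, set()))


# Placeholder identifiers for illustration only.
dataset = "doi:10.9999/placeholder-dataset"
article = "doi:10.9999/placeholder-article"
researcher = "orcid:0000-0000-0000-0000"

index = LinkIndex()
index.link(researcher, dataset)  # a researcher claims a dataset
index.link(article, dataset)     # an article cites the dataset

print(sorted(index.linked_to(dataset)))     # everything connected to the dataset
print(sorted(index.linked_to(researcher)))  # and back from the researcher
```

In real infrastructure the two directions live in different systems, which is why agreements between organisations are needed to keep them in step.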

A key message from the conference is that provision and use of persistent identifiers for data are becoming mainstream operations. DataCite members are working in an increasing number of subject areas and focusing on ease-of-use. The basic mechanisms are now well understood and becoming embedded in robust services.

Open challenges do, of course, remain. Examples include effective handling of versioning; precise citations to data (e.g., to a subset or query of a large dataset); and common agreement on specifying programmatic transformations or processing. There are many practical challenges as well. Dr Alter highlighted a critical challenge to journal editors, many of whom do not yet require reliable citations to data in the way that they do for articles.

This move to an operational basis has been accompanied by recent work to establish principles for data citation. People have been working on data citation issues for a long time, and different groups and communities have produced their own guidelines and recommendations on how data should be cited. These approaches generally agree in principle, but there are some distinct differences, and some might even say that they have been seen as competing.

Between the RDA Second Plenary and the DataCite meeting in Washington, delegates from the ICSTI-CODATA Task Group on Data Citation Standards and Practices, co-chaired by Christine Borgman (U.S. CODATA), Jan Brase (ICSTI), and Sarah Callaghan (U.K. CODATA), and from the Force11 Data Citation Synthesis Group, co-chaired by Merce Crosas (Harvard) and Todd Carpenter (NISO), met to harmonize approaches.

The group is now pleased to announce a "Draft Declaration of Data Citation Principles". The synthesis group developed these draft principles over the past nine months and now welcomes feedback and comments from the community. The request for feedback has also been sent out to the broad RDA community. The feedback received by the end of 2013 will be reviewed and incorporated into the final principles. Once the final principles are published, a mechanism will be in place for worldwide endorsement. More critically, the group will also begin to promote implementation of the principles while exploring detailed examples of implementation. This promises to lead to a broad community-endorsed set of principles for data citation that could strongly influence global implementation efforts in the coming years.

Aligning the DataCite annual meeting with the second plenary of the Research Data Alliance has led to positive interactions, with good overlap in attendees and themes, and we will explore further options for collaboration. DataCite is aiming to become an affiliate member of the RDA.

 

References

[1] CODATA-ICSTI Task Group on Data Citation Standards and Practices. (2013) "Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data". Data Science Journal, Vol. 12. http://dx.doi.org/10.2481/dsj.OSOM13-043

[2] DataCite Business Practices Working Group. (2012) "Business Model Principles". DataCite, Version 1. http://dx.doi.org/10.5438/0007

[3] DataCite Metadata Schema v 3.0. http://dx.doi.org/10.5438/0008

[4] Brase, Jan, Adam Farquhar. (2011) "Access to Research Data". D-Lib Magazine, January 2011, Vol 17, No. 1/2. http://dx.doi.org/10.1045/january2011-brase

 

About the Authors


Adam Farquhar is Head of Digital Library Technology at the British Library, where he initiated the Library's dataset programme and co-founded its digital preservation department. From 2006 to 2010, he led the EU co-funded Planets Digital Preservation project. He is President of DataCite, Chairman of the Open Planets Foundation, and Board member of the Digital Preservation Coalition. Prior to joining the Library, he was the principal knowledge management architect for Schlumberger (1998-2003) and research scientist at the Stanford Knowledge Systems Laboratory (1993-1998). He completed his PhD in Computer Sciences at the University of Texas at Austin (1993). His work focuses on improving the ways in which people can represent, find, share, use, exploit, and preserve digitally encoded knowledge.

 

Jan Brase has a degree in Mathematics, and a PhD in Computer Science. His research background is metadata, ontologies and digital libraries. Since 2005, he has been head of the DOI Registration Agency for research data at the German National Library of Science and Technology (TIB). He is also Managing Agent of DataCite. DataCite was founded in December 2009 and has set itself the goal of making online access to research data for scientists easier by promoting the acceptance of research data as individual, citable scientific objects.

 