Achieving FAIR Data Principles at the Environmental Data Initiative, the US-LTER Data Repository

Abstract

The Environmental Data Initiative (EDI) is a continuation and expansion of the original United States Long-Term Ecological Research Program (US-LTER) data repository, which went into production in 2013. Building on decades of data management experience in LTER, EDI is addressing the challenge of publishing a diverse corpus of research data (Servilla et al. 2016). EDI’s accomplishments span all aspects of the data curation and publication lifecycle, including repository cyberinfrastructure, outreach and training, and enhancements to the data documentation methodologies used by the environmental and ecological research communities. EDI manages almost 43,000 unique data packages and their revisions, contributed by a community of nearly 2,300 individual data authors. Most of these packages come from LTER sites; they are openly accessible and documented with rich science metadata in the Ecological Metadata Language (EML) standard. Here we present how EDI achieves the FAIR data principles (Wilkinson et al. 2016, Stall et al. 2017) and report data use metrics as a measure of success. The FAIR principles serve as benchmarks for EDI’s operation and management. The data we curate are Findable because they reside in an open repository, with unique and persistent digital object identifiers (DOIs) and standard metadata indexed as a searchable resource. They are Accessible through industry-standard protocols and are, in most cases, under an open-access license (access control is available if required). Interoperability is achieved by archiving data in commonly used file formats, with both metadata and data machine readable and accessible. Rich, high-quality science metadata, with automated congruence and completeness checking, render the data fit for Reuse in multiple contexts and environments, and easily generated data provenance documents their lineage.
The success of this approach is demonstrated by the number, and the spatial and temporal extent, of recent re-analyses and synthesis efforts that use these data. Although formal data citations are not yet common practice, a Google Scholar search reveals over 400 journal articles crediting data re-use through an EDI DOI. However, despite improved data availability, researchers still report that the largest time investment in synthesis projects is discovering, cleaning and combining primary datasets until all data are completely understood and converted to a similar format. Starting with long-term biodiversity observation data, EDI is addressing this issue by implementing a pre-harmonization of thematically similar datasets. Positioned between the data author’s specific data format and larger biodiversity data stores or synthesis projects, this approach allows uniform access without the loss of ancillary information. The pre-harmonization step can be accomplished by data managers because the dataset still contains all original information, without any aggregation or science-question-specific decisions about data omission or cleaning. The data remain distributed across distinct datasets, which allows asynchronous updating of long-term observations. The addition of specific and standardized metadata makes them easily discoverable.
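
To make the pre-harmonization idea concrete, the following is a minimal sketch, not EDI's actual workflow or schema. It assumes a hypothetical author-specific table in "wide" form (one column per taxon) and reshapes it into uniform long-format observation records, while every column that is neither an identifier nor a taxon is carried along as an ancillary record. The function name, the column names and the `source_row` link are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def harmonize_wide_table(
    rows: List[Dict[str, str]],
    id_columns: List[str],
    taxon_columns: List[str],
    variable: str,
    unit: str,
) -> Tuple[List[Record], List[Record]]:
    """Reshape a wide table into long observation records plus ancillary records.

    Nothing is aggregated, cleaned, or dropped: every taxon cell becomes one
    observation (empty cells included), and every remaining column becomes one
    ancillary record linked back to its source row.
    """
    observations: List[Record] = []
    ancillary: List[Record] = []
    for row_number, row in enumerate(rows):
        ids = {col: row[col] for col in id_columns}
        for taxon in taxon_columns:
            observations.append(
                {
                    **ids,
                    "taxon": taxon,
                    "variable": variable,
                    "value": row[taxon],  # kept verbatim, even if empty
                    "unit": unit,
                    "source_row": row_number,
                }
            )
        for col, val in row.items():
            if col not in id_columns and col not in taxon_columns:
                ancillary.append({"source_row": row_number, "name": col, "value": val})
    return observations, ancillary


if __name__ == "__main__":
    # Hypothetical example data; not taken from any EDI data package.
    example = [
        {"site": "A", "date": "2015-06-01", "Poa pratensis": "3", "observer": "jd"},
    ]
    obs, anc = harmonize_wide_table(example, ["site", "date"], ["Poa pratensis"], "count", "number")
    for record in obs + anc:
        print(record)
```

Because the reshaping makes no science-question-specific decisions, a step of this kind could in principle be run by a data manager each time a long-term dataset is updated, with each source dataset harmonized independently of the others.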

Publication
Biodiversity Information Science and Standards
