Archivematica Integration Phase One
Building a preservation system for research data
Archivematica Prototype / Testing
Curation and file normalization
Overview and project background
Digital curation of research data involves capturing, maintaining, preserving, and adding value to digital research data for access and archiving purposes. Data curation that follows best practices in archiving and digital preservation can help ensure that research data are actively managed, well into the future, for replication of research results and for reuse by others.
Beyond sharing data, libraries can play a role in providing safe storage, active data management, enhanced discovery, and access to data in open archives and repositories, so that others may easily access data at any time. Discovery of and access to important research data are crucial to interdisciplinary, digital, scientific inquiry.
Archivematica is an open-source tool that provides file-level micro-services built on archival best practices and standards from across libraries and archives. Many museums, archives, and libraries across Canada and the U.S. use Archivematica to preserve digital collections such as images, video, data, and text, preparing archival packages to be stored for long-term preservation.
There is strong interest in developing open-source tools to support the management, curation, and preservation of research data in Canada. This interest has led to the formation of several national initiatives around research data management, including the Canadian Association of Research Libraries (CARL) Portage Network, which aims to develop, in partnership, national infrastructure for research data in Canada.
As part of the CARL-RDC Federated Pilot, Archivematica was chosen as a digital preservation tool that could be integrated with existing research data management platforms such as Dataverse.
OAIS Framework
Building a preservation system for research data entails much more than just preserving the "bits"; it requires active management and thoughtful digital curation, using proven standards and formats that enable open and stable maintenance over time. OCUL and Scholars Portal have experience with the Open Archival Information System (OAIS) reference model - ISO 14721:2012 - having developed a preservation system for the large repository of Scholars Portal / OCUL Journals content. Since 2013, Scholars Portal has been certified as a Trustworthy Digital Repository (TDR) for electronic journals content, after undergoing an external audit and certification process.
OAIS is a framework for understanding the responsibilities and processes, both technical and human, in a preservation system / OAIS archive. Archivematica is a suite of open-source tools that adheres to the principles of the OAIS model. It allows for automation of processing, creation of preservation metadata, and repository-independent packaging of Submission Information Packages (SIPs) and Archival Information Packages (AIPs), using the METS, PREMIS, and BagIt standards (see below for more information about preservation metadata and standards).
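The BagIt packaging mentioned above has a simple core idea: a bag is a directory with a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest that allows fixity to be verified later. The following is a minimal sketch of that idea using only the Python standard library (in practice Archivematica uses dedicated tooling; the function and file names here are illustrative):

```python
import hashlib
from pathlib import Path

def make_bag(src_dir):
    """Sketch of BagIt-style packaging: write a bagit.txt declaration
    and a sha256 payload manifest for every file under data/."""
    bag = Path(src_dir)
    manifest = {}
    for f in sorted((bag / "data").rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            # manifest paths are relative to the bag root, per BagIt
            manifest[f.relative_to(bag).as_posix()] = digest
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-sha256.txt").write_text(
        "".join(f"{d}  {p}\n" for p, d in manifest.items()))
    return manifest
```

Verifying fixity later is then a matter of recomputing the digests and comparing them against `manifest-sha256.txt`.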
Building a preservation system for research data
There is strong interest in the OCUL community in preserving research data using trusted standards and formats recommended by the archival and digital preservation community. This interest has been demonstrated by a number of OCUL groups, including the OCUL Data, Geo, and Digital Curation communities, made up of members from university libraries across Ontario engaged in research data management activities and services.
Dataverse, OCUL's research data repository service, provides a platform for researchers to deposit their data, share it, and invite others to curate and reuse it, enabling long-term viability. Dataverse supports active data management activities including collaborative data management, version control, Data Documentation Initiative (DDI) metadata, Open Archives Initiative protocols such as OAI-PMH for metadata harvesting, Digital Object Identifiers (DOIs) for data, LOCKSS integration, and so on. However, Dataverse lacks certain digital preservation activities, processes, and standards that are required to ensure compliance and certification with TDR or other certification standards. It does not have built-in secure storage infrastructure, nor does it provide replicated storage and backup, or flexible, configurable file normalization processes and checks, all of which are required by a robust preservation system.
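The OAI-PMH support mentioned above means a Dataverse installation's metadata can be harvested with plain HTTP requests. As a small illustration, a `ListRecords` request can be built as follows (Dataverse serves its OAI endpoint at `/oai` and supports a DDI metadata prefix; the base URL below is a placeholder):

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_ddi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL. The oai_ddi prefix
    asks for DDI-formatted metadata; set_spec optionally limits the
    harvest to a named OAI set, if the server defines one."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}/oai?{urlencode(params)}"
```

A harvester would issue this request, parse the XML response, and follow any `resumptionToken` elements to page through the full record set.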
Building a robust archival and preservation system for research data entails capturing the essence and context of original research data, including the descriptive and structural metadata required to enable reuse. Archivematica can be used to develop a series of checks and processes, to ensure that data and metadata are complete, and preserved using open standards and formats. Data files and metadata can easily be packaged and connected to secure digital storage infrastructure.
Building a separate but integrated "preservation system", while challenging, would enable the integration of Dataverse as the platform for researchers, Archivematica as the preservation pipeline "middleware", and a storage layer ensuring that standard archival packages are replicated across a network of secure storage locations on the Ontario Library Research Cloud (OLRC), OCUL's cloud storage infrastructure.
The flow chart below represents the typical ingest and processing that Archivematica is capable of (Archivematica is represented by the red boxes in the diagram): beginning with ingest of digital objects from a repository such as Dataverse, through automated processing and packaging of the data into an AIP, to output to storage, such as an OpenStack Swift cloud.
Initial Project Participants
Artefactual – Evelyn McLellan, Justin Simpson
Dataverse – Phil Durbin, Eleni Castro (& others)
Scholars Portal and University of Toronto – Leanne Trimble, Alan Darnell, Steve Marks, Amber Leahey
UBC – Allan Bell, Eugene Barsky
University of Alberta – Geoff Harder, Larry Laliberte, Peter Binkley, Umar Qasim
SFU – Alex Garnett
CARL Portage – Chuck Humphrey
Archivematica Prototype
In summer through fall 2015, project participants gathered requirements and provided feedback to Artefactual about what would be needed for digital preservation of research data in Dataverse, including working with the latest version, Dataverse 4. In late fall 2015, Artefactual provided the group with a prototype of the Archivematica software, utilizing the Dataverse APIs to retrieve datasets for ingest into Archivematica for further processing and packaging.
The initial requirements gathering and analysis suggested that the best way to ingest the data is to have an automated ingest script that parses the information and metadata coming from Dataverse into a METS file (see below for more information about METS), so that Archivematica can correctly interpret the contents of the study, including metadata, data files, derivative data files, etc. The script itself would be configurable and would accommodate changes to Dataverse (such as new versions) without modifying Archivematica itself.
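To make the ingest-script idea concrete, the sketch below builds a skeletal METS document for a transfer: a `dmdSec` wrapping the study-level (DDI) metadata and a `fileSec` listing the data files retrieved from Dataverse. This is an illustration only, not the actual prototype script; the real METS produced by the integration is considerably richer, and the `<title>` placeholder stands in for a full DDI codebook element:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def build_transfer_mets(dataset_title, filenames):
    """Build a skeletal METS document for a Dataverse transfer:
    a dmdSec holding study-level metadata (DDI in the real workflow)
    and a fileSec listing the transferred data files."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="dmdSec_1")
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", MDTYPE="DDI")
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    # placeholder for the full DDI codebook element
    ET.SubElement(xml_data, "title").text = dataset_title
    filesec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(filesec, f"{{{METS_NS}}}fileGrp", USE="original")
    for i, name in enumerate(filenames, 1):
        f = ET.SubElement(grp, f"{{{METS_NS}}}file", ID=f"file_{i}")
        ET.SubElement(f, f"{{{METS_NS}}}FLocat",
                      {f"{{{XLINK_NS}}}href": name, "LOCTYPE": "OTHER"})
    return ET.tostring(mets, encoding="utf-8")
```

Because the mapping from Dataverse JSON to this structure lives in the script, a change to the Dataverse API surface requires only a script update, not a change to Archivematica itself.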
Below is a workflow of how data files from Dataverse can be ingested and processed in Archivematica, and packaged into Archival Information Packages (AIPs) in the system:
Artefactual Systems. "Dataverse-Archivematica Workflow Diagram". https://wiki.archivematica.org/Dataverse (Accessed 2/5/2016)
Testing of the Archivematica prototype is ongoing. Using Scholars Portal's test instance of Dataverse, a range of data collections can be tested and assessed using the Archivematica ingest script. The plan is to assist in the development of flexible options for research data in Archivematica, and to assess the usability of Dataverse and the Dataverse APIs for preservation purposes. This two-fold approach will help us to understand many of the existing digital preservation needs and gaps for research data in our existing collections.
The testing is currently exploring aspects of the Archivematica prototype, including investigating options to utilize and improve:
- Dataverse APIs;
- File and format processing options;
- Preservation and descriptive metadata (METS, PREMIS, DDI, DC);
- Metadata and data file packaging;
- Scalability of Archivematica.
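As a sketch of the first item, the Dataverse native API exposes a metadata export endpoint and a data access endpoint that an ingest client can call. The helpers below only construct the request URLs (no network calls); the base URL and DOI are placeholders, and the exact endpoints and parameters should be checked against the API guide for the Dataverse version in use:

```python
from urllib.parse import urlencode

def ddi_export_url(base_url, doi):
    """URL for the native API's metadata export endpoint, requesting
    the DDI serialization of a dataset's metadata."""
    qs = urlencode({"exporter": "ddi", "persistentId": f"doi:{doi}"})
    return f"{base_url}/api/datasets/export?{qs}"

def datafile_url(base_url, file_id):
    """URL for the data access API, which streams a single data file;
    format=original requests the original upload rather than the
    tabular derivative Dataverse generates for statistical files."""
    return f"{base_url}/api/access/datafile/{file_id}?format=original"
```

An ingest script would fetch the DDI export first, read the file listing from the dataset metadata, then retrieve each file for packaging.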
Curation and file normalization
There are clear risks associated with format and media obsolescence, already being encountered by researchers and libraries today. The inability to read file formats tied to old software and old media is a major challenge of the digital age, and the rapidly changing technological landscape is increasing the pace at which formats and media become obsolete. The challenge is heightened by the often proprietary nature of research data collection in certain disciplines, as well as by file and data size, volume, and so on.
File normalization is an archiving procedure that identifies files in a particular (often at-risk or proprietary) format and converts them to another identified format that is considered open and less vulnerable to change over time.
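In practice this amounts to a policy table: for each identified format, a preservation target and a conversion tool. Archivematica draws such rules from its Format Policy Registry; the sketch below is a hypothetical miniature of the same idea. The PRONOM format IDs are real (fmt/44 is JPEG, fmt/199 is MPEG-4, x-fmt/18 is CSV), but the tool commands and target choices are illustrative, not Archivematica's actual defaults:

```python
import shlex

# Illustrative rules: PRONOM format ID -> (preservation target,
# conversion command template). Real rules live in Archivematica's
# Format Policy Registry; these tools and targets are stand-ins.
NORMALIZATION_RULES = {
    "fmt/44":   ("tiff", "convert {src} {dst}.tiff"),   # JPEG -> TIFF
    "fmt/199":  ("mkv",  "ffmpeg -i {src} {dst}.mkv"),  # MPEG-4 -> Matroska
    "x-fmt/18": (None,   None),                         # CSV: already open
}

def normalization_command(puid, src, dst):
    """Return the conversion command for a PRONOM ID (PUID), or None
    if the format needs no normalization or has no rule."""
    target, template = NORMALIZATION_RULES.get(puid, (None, None))
    if template is None:
        return None
    return shlex.split(template.format(src=src, dst=dst))
```

Keeping the rules in data rather than code is what makes the normalization step configurable per institution and per discipline.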
The diagram below provides an overview of the different levels of curation and file normalization that can be achieved in Archivematica (adapted from Mitcham et al., 2015. Filling the digital preservation gap: A Jisc Research Data Spring project. Phase One Report. Accessed 01-31-2016).
File format identification and normalization rules for research data are still evolving, and are not well understood or established in most disciplines. Beyond statistical and spatial data formats - in the medical sciences, for example - research data can be collected using expensive, proprietary laboratory equipment and technology, such as image scanners, producing outputs not yet suited for data sharing or reuse. Efforts to establish standards for data sharing, including metadata for description and reuse, are becoming more important; however, significant gaps remain for digital preservation. This is an area being explored by libraries both within Canada (as part of the CARL Portage project) and elsewhere (Mitcham et al. 2015).
Metadata
Metadata is important for understanding research data: it provides valuable information about how data were collected, coded, and analysed, and for what purposes. It provides essential contextual information that helps others determine a dataset's appropriateness for reuse, or replicate important research results.
In addition to file normalization, Archivematica is able to capture basic file-level metadata, as well as produce preservation metadata (think of it as structured process logs) documenting each step and process performed on a data file. Similarly, descriptions of the data, such as coding, metadata, and study-level information, are preserved in open formats so that others can read and understand the data in context well into the future.
Archivematica creates standard encodings of descriptive, administrative, and structural metadata (using the METS standard, expressed in XML), stored alongside the data files, to ensure the original files and their metadata representation are preserved together in the archival package. Preservation metadata is also generated by Archivematica (using the PREMIS standard), so that preservation processes, such as normalization procedures, are captured in a structured, consistent, and well-understood encapsulation within the archival package itself.
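To illustrate what a PREMIS record captures, the sketch below generates a heavily simplified event element of the kind written into the AIP for each micro-service (format identification, normalization, fixity check, and so on). Real Archivematica PREMIS records also carry event identifiers and links to agents and objects; the element names here follow the PREMIS v2 schema:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", PREMIS_NS)

def premis_event(event_type, detail, outcome):
    """Simplified PREMIS event: what happened, when, with what
    detail (e.g. the normalization command run), and the outcome."""
    ev = ET.Element(f"{{{PREMIS_NS}}}event")
    ET.SubElement(ev, f"{{{PREMIS_NS}}}eventType").text = event_type
    ET.SubElement(ev, f"{{{PREMIS_NS}}}eventDateTime").text = (
        datetime.now(timezone.utc).isoformat())
    ET.SubElement(ev, f"{{{PREMIS_NS}}}eventDetail").text = detail
    info = ET.SubElement(ev, f"{{{PREMIS_NS}}}eventOutcomeInformation")
    ET.SubElement(info, f"{{{PREMIS_NS}}}eventOutcome").text = outcome
    return ET.tostring(ev, encoding="utf-8")
```

Because each processing step appends such an event, the AIP carries its own audit trail: a future curator can reconstruct exactly how each file was produced.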
METS
Example METS metadata file, with encoding of Data Documentation Initiative (DDI) metadata, for research data from Dataverse software (produced by Archivematica software, taken from Archivematica-Dataverse Wiki accessed 01-31-2016):
PREMIS
DDI
Dataverse metadata
Dataverse APIs
OAI-PMH
More information and other resources
Mitcham, J. et al. (2015). Filling the digital preservation gap: A Jisc Research Data Spring project. Phase One Report. Accessed 01-31-2016.
Artefactual Systems (2015). Archivematica - Dataverse Integration Wiki
Canadian Association of Research Libraries (2015). RDC - CARL Federated Pilot for Research Data Ingest and Preservation
Data Documentation Initiative (DDI)
Metadata Encoding and Transmission Standard (METS)
Preservation Metadata Maintenance Activity (PREMIS)