Research Data Preservation Guidelines 2023.05.
1. Overview
Purpose
- ㆍPresent research data preservation guidelines applicable to data preservation at the Geoscience Data Center of the Korea Institute of Geoscience and Mineral Resources (KIGAM).
- ㆍSelect a durable format and follow procedures for submitting data to a repository for long-term preservation.
Target
- ㆍAdministrators of the Geoscience Data Center who want to preserve deposited research data.
Scope of Application
- ㆍApplies to research data generated through in-house research activities and research data donated by external organizations and individuals.
Application
- ㆍMatters not specified in these guidelines may be subject to the research data management guidelines of the National Research Council of Science & Technology (NST) and KIGAM Data Management Regulations.
2. Concept of Data Preservation
Meaning of Data Preservation
- ㆍA set of management activities taken to ensure the long-term viability and continued accessibility of research data.
- ㆍLong-term refers to a period of time long enough to be concerned about the loss of integrity of digital information held in a repository, including damage to storage media, changes in technology, support for old and new media and data formats, and changes in the user community.
Necessity of Data Preservation
- ㆍDigital data preservation should be a key aspect of any research project. Some research data is unique and cannot be replaced if destroyed or lost. In addition, being able to reference verifiable data can be enough to determine that a study is sound.
- ㆍEffective documentation of data.
- ㆍStorage media may degrade, or data may be lost.
- ㆍData may not be readable if software file formats change in the future.
- ㆍData may be difficult to understand if there is no documentation left for the data file.
- ㆍData files may become unintelligible or unreliable when opened with new software to the extent that research cannot continue.
- ㆍThe preservation period for data is stipulated as permanent under Article 16 of Chapter 5 of the KIGAM Data Management Regulations.
The Goal of Data Preservation
- ㆍData management: Ensuring that digital records can be managed through inevitable changes.
- ㆍAccessibility: Ensure that data is easy to find and accessible.
- ㆍAvailability: Ensure that users can work with data the way they need to.
- ㆍData documentation: Help users understand what the data is and what it is about.
- ㆍIntegrity: Ensure the reliability of data throughout the Data Lifecycle.
Data Management Plans and Preservation
- ㆍThe Data Management Plan should specify the following retention-related matters:
◦ The administrator who is responsible for Data Preservation.
◦ The data format description to be produced.
◦ The size of the dataset to be produced.
◦ Where the data will be stored.
◦ Whether a data repository for the research field or institution exists and, if so, whether it will be used.
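The retention-related matters listed above could be captured as a small record so that incomplete plans are easy to spot. The following is only a minimal sketch: the class name `PreservationPlan`, its field names, and the helper `missing_entries` are hypothetical and are not defined by these guidelines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreservationPlan:
    """Preservation-related entries of a Data Management Plan (hypothetical names)."""
    administrator: str        # who is responsible for data preservation
    data_formats: str         # description of the data formats to be produced
    dataset_size: str         # expected size of the dataset, as free text
    storage_location: str     # where the data will be stored
    repository: Optional[str] = None                # field or institutional repository, if any
    repository_will_be_used: Optional[bool] = None  # whether that repository will be used


def missing_entries(plan: PreservationPlan) -> List[str]:
    """Return the names of entries that still need to be filled in."""
    required = ["administrator", "data_formats", "dataset_size", "storage_location"]
    missing = [name for name in required if not getattr(plan, name).strip()]
    # If a repository is named, the plan should also state whether it will be used.
    if plan.repository and plan.repository_will_be_used is None:
        missing.append("repository_will_be_used")
    return missing
```

A plan with an empty `dataset_size` and a named repository but no usage decision would be reported as missing `dataset_size` and `repository_will_be_used`.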
Data File Organization and Description
- ㆍData preservation is a set of management activities taken to ensure the long-term viability and continued accessibility of research data, and, therefore, includes data file organization and data description.
- ㆍThe format of data files should follow non-proprietary and open standards to the extent possible, given the ongoing access and potential reuse of data.
- ㆍMetadata and documentation should be used to describe the data to be preserved.
- ㆍ<Table 1> provides guidelines based on the type of material.
Data Type | Guide Content |
---|---|
Data File | |
Documentation File | |
Metadata | |
3. Selecting and Evaluating Data to Preserve
The Need to Choose Long-Term Preserving Data
- ㆍEven if data storage is not costly, there are reasons to select data for long-term preservation rather than storing all data, including:
◦ The rapid growth of digital data makes storing everything unaffordable.
◦ Digital preservation is not sustainable without proper mirroring and backup systems, but mirroring and backup increase the cost of preservation: storage costs at least double.
◦ Storing all data can require additional effort to determine which data are relevant to a search, which can be reduced by selectively storing data.
◦ Data management and preservation are costly, so both the cost of creating and managing preservation metadata and the cost of preserving the data itself must be considered.
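The effect of mirroring and backup on storage needs can be illustrated with simple arithmetic. The sketch below assumes that every mirror and every backup holds a full copy of the primary data; the function name `total_storage_tb` and its parameters are illustrative only.

```python
def total_storage_tb(primary_tb: float, mirror_copies: int = 1, backup_copies: int = 1) -> float:
    """Total capacity needed when each mirror and each backup is a full copy (assumption)."""
    if primary_tb < 0 or mirror_copies < 0 or backup_copies < 0:
        raise ValueError("sizes and copy counts must be non-negative")
    # One primary copy plus one full copy per mirror and per backup.
    return primary_tb * (1 + mirror_copies + backup_copies)
```

Under this assumption, 10 TB of primary data with a single backup copy and no mirror needs 20 TB in total, and adding one mirror raises the requirement to 30 TB.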
Criteria for Selecting Long-Term Preserving Data
- ㆍDue to data storage resource limitations, long-term preservation of all data is not possible, so the criteria listed in <Table 2> can be used to select data with a long-term preservation value.
- ㆍ<Table 2> shows the criteria for selecting long-term archival data.
Category | Content |
---|---|
Legal considerations | |
Scientific or historical value | |
Original | |
Conditions | |
Storage and Preservation | |
Access/Use | |
Format/technical limitations | |
4. Data Repository
Definition of Data Repository
- ㆍA data repository is an online database service, an archive that manages the long-term storage and preservation of digital data resources and provides a catalog for navigation and access.
Considerations for Choosing a Data Repository
- ㆍProvide a persistent identifier for the submitted dataset.
- ㆍProvide a landing page for each dataset with metadata that lets users who have found the dataset check and use its contents.
- ㆍSupport tracking of data use.
- ㆍRespond to community needs or be recognized as a “trusted data repository.”
- ㆍMeet legal requirements, such as data protection, and allow for data reuse without unnecessary licensing requirements.
Examples of Data Repository
- ㆍGeneral-purpose repositories:
◦ FigShare(http://figshare.com)
◦ Dryad(https://datadryad.org)
◦ Zenodo(http://zenodo.org/)
◦ DataHub(http://datahub.io)
◦ DANS(http://www.dans.knaw.nl/)
- ㆍ<Table 3> shows repositories in the field of geoscience.
Repository | Explanation |
---|---|
National Geoscience Data Centre (NGDC) | |
Centre for Environmental Data Analysis (CEDA) | |
UK Polar Data Centre (UK PDC) | |
PANGAEA | |
TOAR Surface Observation Database | |
Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) | |
Norwegian Marine Data Center (NMD) | |
International Council for the Exploration of the Sea (ICES) | |
ICTS SOCIB Data Repository | |
Storing Data
- ㆍIdentify how many copies of data to store and how to synchronize them.
- ㆍProvide storage for data.
- ㆍRetain a backup site for data transfer systems stored in cloud-based services.
- ㆍProvide a data download service from the backup site in the event of a power outage.
- ㆍProvide criteria for comparing storage (storage space) solutions.
- ㆍEnsure integrity and accessibility when backing up data.
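One common way to check integrity when backing up data is to compare checksums of the original file and its backup copy. The following is a minimal sketch, assuming SHA-256 as the checksum; the guidelines do not prescribe an algorithm, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(original: Path, backup: Path) -> bool:
    """Return True when the backup copy has the same digest as the original."""
    return sha256_of(original) == sha256_of(backup)
```

Storing the digest alongside each file at deposit time would also allow later checks when the original is no longer available for comparison.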
Backup and Recovery
- ㆍTo protect against data loss and damage, researchers are responsible for backing up their data regularly and automatically to multiple locations.
- ㆍThe backup system of the Geo Big Data Open Platform is a double backup consisting of an InnoStor Appliance (ISA-2000) and a Quantum Scalar i500; data on the service storage is periodically backed up to this system.
◦ Backup target: data, databases, and user data files of the Geo Big Data Open Platform.
◦ Backup cycle
- Backup of user data files, databases, and system data: Daily
- Full backup of user data, database, and research data (files): Saturday
- ㆍRecovery Policy and Guidelines:
◦ System and application software are recovered from local Git repositories.
◦ Research data, databases, and user data files are recovered from data stored on the backup device.
◦ Perform recovery from tape backups at the point of origin if the backed-up data fails.
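The backup cycle described above (a daily backup, plus a full backup on Saturdays) can be expressed as a small scheduling rule. This is a sketch only: it assumes that the daily backup also runs on Saturdays, which the guidelines do not state, and the function name `backups_due` is hypothetical.

```python
from datetime import date
from typing import List


def backups_due(day: date) -> List[str]:
    """Return the backup jobs due on the given day."""
    jobs = ["daily"]  # assumption: the daily backup runs every day, Saturdays included
    # date.weekday() counts Monday as 0, so Saturday is 5.
    if day.weekday() == 5:
        jobs.append("full")
    return jobs
```

For a Saturday the function returns both jobs; for any other day it returns only the daily backup.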
Archiving and Preserving Data
- ㆍPeriodically archive research data to magnetic tape for preservation.
- ㆍVault and archive backup tapes at a remote location once a year.
- ㆍArchive tapes are retained for a minimum of five years.
Preservation Strategies for Descriptive and Procedural Stability
- ㆍMigration: Convert file formats from less common or deprecated file formats to current file formats.
- ㆍEmulation: Mimic the functionality of an older or obsolete computer so that an older file format can be read and then saved in a current file format (a combination of emulation and migration), or so that older, obsolete files can still be read and used in the future.
- ㆍNormalization: Restrict data formats to common formats for preservation (e.g., limiting text files to open document formats or Word format), or convert software-dependent file formats (e.g., SPSS system files) to software-independent file formats (e.g., ASCII or XML-based formats).
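As an illustration of normalization, the sketch below rewrites a delimited text table from a legacy encoding and delimiter into UTF-8, comma-separated CSV. The choice of UTF-8 CSV as the preservation format is an assumption made for this example, and the function name `normalize_table` is hypothetical.

```python
import csv
from pathlib import Path


def normalize_table(src: Path, dst: Path, src_encoding: str, src_delimiter: str) -> int:
    """Rewrite a delimited text table as UTF-8, comma-separated CSV; return the row count."""
    rows = 0
    # newline="" lets the csv module handle line endings itself.
    with src.open("r", encoding=src_encoding, newline="") as fin, \
         dst.open("w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin, delimiter=src_delimiter):
            writer.writerow(row)
            rows += 1
    return rows
```

The original file is left untouched, so the normalized copy can be verified against it before the original is retired.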
5. Digital Assets Preservation Framework
Overview of Digital Assets Preservation Framework
- ㆍThe Digital Assets Preservation Framework was presented by the National Digital Stewardship Alliance (NDSA) in 2013. It can be used to assess the level of digital preservation using <Table 4>.
- ㆍThe appendix is a guide for assessing the level of preservation of digital assets, which can be used to evaluate the state of preservation in a repository and provide a year-by-year indication of where the level of preservation should be increased in the future.
Content | Level 1 (Data Protection) | Level 2 (Data Recognition) | Level 3 (Data Monitoring) | Level 4 (Data Recovery) |
---|---|---|---|---|
Storage & Geolocation | | | | |
File fixity & data integrity | | | | |
Information security | | | | |
Metadata | | | | |
File Format | | | | |
Version No. | Date | Contents |
---|---|---|
0.1 | 2023. 03. 20. | Create document outline |
0.6 | 2023. 04. 28. | Create draft |
0.8 | 2023. 05. 08. | Guideline review |
1.0 | 2023. 05. 19. | Accept review comments |