Research Data Preservation Guidelines 2023.05.

1. Overview
Purpose
  • ㆍPresent the research data preservation guidelines that apply to data preservation at the Geoscience Data Center of the Korea Institute of Geoscience and Mineral Resources (KIGAM).
  • ㆍSelect a durable format and follow procedures for submitting data to a repository for long-term preservation.
Target
  • ㆍAdministrators of the Geoscience Data Center who want to preserve deposited research data.
Scope of Application
  • ㆍApplies to research data generated through in-house research activities and research data donated by external organizations and individuals.
Application
  • ㆍMatters not specified in these guidelines may be subject to the research data management guidelines of the National Research Council of Science & Technology (NST) and KIGAM Data Management Regulations.
2. Concept of Data Preservation
Meaning of Data Preservation
  • ㆍA set of management activities taken to ensure the long-term viability and continued accessibility of research data.
  • ㆍLong-term refers to a period of time long enough to be concerned about the loss of integrity of digital information held in a repository, including damage to storage media, changes in technology, support for old and new media and data formats, and changes in the user community.
Necessity of Data Preservation
  • ㆍDigital data preservation should be a key aspect of any research project. Some research data is unique and cannot be replaced if destroyed or lost. In addition, being able to reference verifiable data can be enough to establish that a study is sound.
  • ㆍEffective documentation of data.
  • ㆍStorage media may degrade, or data may be lost.
  • ㆍData may not be readable if software file formats change in the future.
  • ㆍData may be difficult to understand if there is no documentation left for the data file.
  • ㆍData files may become unintelligible or unreliable when opened with new software to the extent that research cannot continue.
  • ㆍThe preservation period for data is stipulated as permanent under Article 16 of Chapter 5 of the KIGAM Data Management Regulations.
The Goal of Data Preservation
  • ㆍData management: Ensuring that digital records can be managed through inevitable changes.
  • ㆍAccessibility: Ensure that data is easy to find and accessible.
  • ㆍAvailability: Ensure that users can work with data the way they need to.
  • ㆍData documentation: Help users understand what the data is and what it is about.
  • ㆍIntegrity: Ensure the reliability of data throughout the Data Lifecycle.
Data Management Plans and Preservation
  • ㆍThe Data Management Plan should specify the following retention-related matters:

    ◦ The administrator who is responsible for Data Preservation.

    ◦ The data format description to be produced.

    ◦ The size of the dataset to be produced.

    ◦ Where the data will be stored.

    ◦ State if a data repository for the research field or institution exists and explain if it will be utilized.

Data File Organization and Description
  • ㆍData preservation is a set of management activities taken to ensure the long-term viability and continued accessibility of research data, and, therefore, includes data file organization and data description.
  • ㆍThe format of data files should follow non-proprietary and open standards to the extent possible, given the ongoing access and potential reuse of data.
  • ㆍMetadata and documentation should be used to describe the data to be preserved.
  • ㆍ<Table 1> provides guidelines based on the type of material.

<Table 1> Data Type and Content

Data Type Guide Content
Data File
  • ㆍData in a machine-readable form, i.e., in a state that allows software to view the individual content or internal structure of the data, or to process it by modifying, transforming, or extracting it.
  • ㆍIt is recommended that the dataset or data file to be deposited be provided in a format that is widely accepted for future reuse, or in a specific format that is generally accepted by the community of the relevant domain.
Documentation File
  • ㆍMetadata describing the contents of the data file should be provided along with the data files.
  • ㆍExamples of documentation files may include codebooks, data collection instruments, summary statistics, project summaries, and lists of data-related publications.
  • ㆍIn addition, it may include:

    ◦ Project background and objectives

    ◦ Information about the methodology

    ◦ Sources used

    ◦ Relevant studies

    ◦ Sampling procedure

    ◦ Content and structure of the dataset

    ◦ A description of the data and a list of file names

    ◦ Tools or software needed to work with or read the data

    ◦ A description of any known errors or weaknesses in the data

    ◦ References to publications related to or resulting from the project

    ◦ Documentation of records, data transformations, or format changes.
Metadata
  • ㆍMetadata describing the contents of the data file must be provided with the data file.
  • ㆍMetadata includes project title, principal investigator name, summary, distributor, keyword, geographic scope, temporal scope, and depositor.
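
The metadata elements listed above can be held in a simple record and checked for completeness before deposit. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not defined by this guideline.

```python
from dataclasses import dataclass, fields


@dataclass
class DepositMetadata:
    """Descriptive metadata for one deposited dataset (hypothetical field names)."""
    project_title: str = ""
    principal_investigator: str = ""
    summary: str = ""
    distributor: str = ""
    keywords: tuple = ()
    geographic_scope: str = ""
    temporal_scope: str = ""
    depositor: str = ""


def missing_fields(record: DepositMetadata) -> list:
    """Return the names of metadata elements that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative values only; they do not describe a real dataset.
record = DepositMetadata(
    project_title="Example geoscience survey",
    principal_investigator="Example Researcher",
    keywords=("geology", "borehole"),
)
print(missing_fields(record))
```

A deposit could be held back until `missing_fields` returns an empty list.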
3. Selecting and Evaluating Data to Preserve
The Need to Choose Long-Term Preserving Data
  • ㆍEven if data storage is not costly, there are reasons to select data for long-term preservation rather than storing all data, including:

    ◦ The rapid growth of digital data makes storing everything unaffordable.

    ◦ Digital preservation is not sustainable without proper mirroring and backup systems, and mirroring and backup at least double the cost of storage.

    ◦ Storing all data can require additional effort to determine which data are relevant to a search, which can be reduced by selectively storing data.

    ◦ Because data management and preservation are costly, both the cost of creating and managing preservation metadata and the cost of preserving the data itself must be considered.

Criteria for Selecting Long-Term Preserving Data
  • ㆍBecause data storage resources are limited, not all data can be preserved for the long term. The criteria listed in <Table 2> can be used to select data with long-term preservation value.

<Table 2> Criteria for Selecting Long-Term Preserving Data

Category Content
Legal considerations
  • ㆍIs there a legal reason to retain the data?
  • ㆍIs the data being used, or could it be used, in a lawsuit, public inquiry, police investigation, or a report or paper that could be legally challenged?
  • ㆍIs there a financial or contractual obligation to retain the data?
  • ㆍWas the data that was used to write a paper also used to register a research outcome?
Scientific or Historical Value
  • ㆍDoes the data have a geographic or temporal scope that makes it useful to others?
  • ㆍDoes the data have historical value (e.g., can it be presented as a landmark of scientific discovery)?
  • ㆍDoes the data involve changes in processing methods, new standards, or precedents?
  • ㆍDoes the data support a trend or current project in science?
  • ㆍIs there potential for more research in the relevant scientific field?
  • ㆍIs it likely to meet the future needs/directions of the scientific community?
  • ㆍIs the data contributing to a broader collection?
  • ㆍIs the data likely to be reused?
  • ㆍIs the data cited in publications?
Originality
  • ㆍIs the data unique?
  • ㆍDoes the data remain unchanged and maintain its existing integrity?
  • ㆍWould it be cost-prohibitive to reproduce or re-collect the data?
  • ㆍIs this believed to be the primary copy of this data?
  • ㆍAre copies of this data at risk?
Conditions
  • ㆍAre the data accompanied by relevant metadata?
  • ㆍDoes the data of scientific value outweigh the data without scientific value?
  • ㆍCan the data be ingested without additional processing (e.g., differentiation, format conversion, etc.)?
  • ㆍAre the data in good condition to be added to the collection (i.e., readable, intact, and robust enough to be handled)?
Storage and Preservation
  • ㆍCan the data be stored without special requirements (digital or hard copy)?
  • ㆍCan the data be preserved without special requirements (digital or hard copy)?
Access/Use
  • ㆍCan the data be deposited without intellectual property or copyright restrictions?
  • ㆍCan the data be deposited without conditions imposed by external sources or existing terms and conditions?
  • ㆍCan the data be deposited without any temporal restrictions on its use?
Format/Technical Limitations
  • ㆍIs the deposit in an acceptable data format?
  • ㆍAre the data accessible without specialized (and generally unavailable) software?
  • ㆍIs specialized software readily available from the Geoscience Data Center?
  • ㆍIf the data is not in an acceptable format, can it be transferred to an appropriate storage/archiving system or converted into a commonly used format?
4. Data Repository
Definition of Data Repository
  • ㆍA data repository is an online database service, an archive that manages the long-term storage and preservation of digital data resources and provides a catalog for navigation and access.
Considerations for Choosing a Data Repository
  • ㆍProvide a persistent identifier for the submitted dataset.
  • ㆍProvide a landing page for each dataset, with metadata that helps users who have found the dataset to check and use its contents.
  • ㆍSupport tracking of data use.
  • ㆍRespond to community needs or be recognized as a “trusted data repository.”
  • ㆍMeet legal requirements, such as data protection, and allow for data reuse without unnecessary licensing requirements.
Examples of Data Repository

<Table 3> Repository for geoscience datasets

Repository Explanation
National Geoscience Data Centre (NGDC)
  • ㆍThe National Geoscience Data Centre (NGDC) is a repository that manages datasets from the British Geological Survey (BGS) in the UK, collecting and preserving geoscience data and information for long-term use by the community.
  • http://www.bgs.ac.uk/services/ngdc/
Centre for Environmental Data Analysis (CEDA)
  • ㆍCEDA operates the atmospheric and Earth observation data center functions on behalf of the Natural Environment Research Council (NERC) for the UK atmospheric science and Earth observation community.
  • https://www.ceda.ac.uk/
UK Polar Data Centre (UK PDC)
  • ㆍThe UK Polar Data Centre (UK PDC) is the center for Arctic and Antarctic environmental data management in the UK and is part of the Natural Environment Research Council's (NERC) Environmental Data Network.
  • https://www.bas.ac.uk/data/uk-pdc/
PANGAEA
  • ㆍPANGAEA has a 30-year history as an open access library for archiving, publishing, and distributing georeferenced data in the earth, environmental, and biodiversity sciences.
  • https://www.pangaea.de/
TOAR Surface Observation Database
  • ㆍThe Tropospheric Ozone Assessment Report (TOAR) database is the world's most extensive database of surface ozone measurements.
  • https://toar-data.fz-juelich.de/
Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC)
  • ㆍThe Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) is one of the Earth Observing System Data and Information System (EOSDIS) data centers managed by the National Aeronautics and Space Administration (NASA) Earth Science Data and Information System.
  • https://daac.ornl.gov/
Norwegian Marine Data Center (NMD)
International Council for the Exploration of the Sea (ICES)
  • ㆍICES is an intergovernmental marine science organization that advises on conservation, management, and sustainability. It aims to increase, share, and apply scientific knowledge of the marine environment and its living resources.
  • https://ecosystemdata.ices.dk/
ICTS SOCIB Data Repository
  • ㆍThe Balearic Islands Coastal Observing and Forecasting System (ICTS SOCIB) is a multi-platform, distributed, and integrated system that provides oceanographic data product streams and modeling services.
  • https://www.socib.es/data/
Storing Data
  • ㆍIdentify how many copies of data to store and how to synchronize them.
  • ㆍProvide storage for data.
  • ㆍRetain a backup site for data transfer systems stored in cloud-based services.
  • ㆍProvide a data download service from the backup site in the event of a power outage.
  • ㆍProvide criteria for comparing storage (storage space) solutions.
  • ㆍEnsure integrity and accessibility when backing up data.
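
Integrity when backing up data, the last item above, is commonly checked by comparing checksums of the original file and each copy. The following is a minimal sketch only: SHA-256 and the function names are assumptions chosen for illustration, and this guideline does not prescribe a checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copies: list) -> dict:
    """Map each copy's path to True if its checksum equals the original's."""
    expected = sha256_of(original)
    return {str(copy): sha256_of(copy) == expected for copy in copies}
```

Any copy reported as `False` differs from the original and should be replaced from a copy that still matches.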
Backup and Recovery
  • ㆍTo protect against data loss and damage, researchers are responsible for backing up their data regularly and automatically to multiple locations.
  • ㆍThe backup system of the Geo Big Data Open Platform is a double backup built on an InnoStor Appliance (ISA-2000) and a Quantum Scalar i500. Data on the service storage is backed up periodically and stored in the backup system.

    ◦ Backup target: Performs backups for data, databases, and user data files of the Geo Big Data Open Platform.

    ◦ Backup cycle

    - Backup of user data files, databases, and system data: Daily

    - Full backup of user data, database, and research data (files): Saturday

  • ㆍRecovery Policy and Guidelines:

    ◦ System and application software are recovered from local Git repositories.

    ◦ Recovery for research data, database, and user data files is performed from data stored on the backup device.

    ◦ If the data on the backup device cannot be used, perform recovery from the original tape backups.
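
The backup cycle described above (a backup every day and a full backup on Saturday) can be expressed as a small scheduling rule. The following is a minimal sketch only: the function name and job labels are hypothetical, and it assumes the daily backup also runs on Saturday, which this guideline does not state.

```python
from datetime import date

SATURDAY = 5  # date.weekday() counts Monday as 0, so Saturday is 5


def backups_due(day: date) -> list:
    """Return the backup jobs to run on the given day."""
    jobs = ["daily"]  # assumption: the daily backup runs every day
    if day.weekday() == SATURDAY:
        jobs.append("full")  # weekly full backup
    return jobs


print(backups_due(date(2023, 5, 19)))  # a Friday
print(backups_due(date(2023, 5, 20)))  # a Saturday
```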

Archiving and Preserving Data
  • ㆍPeriodically archive research data to magnetic tape for preservation.
  • ㆍVault and archive the backup tapes at a remote location once a year.
  • ㆍArchive tapes are retained for a minimum of five years.
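
The five-year minimum retention of archive tapes can be checked with simple date arithmetic. The following is a minimal sketch only: the function names are hypothetical, and moving a 29 February archive date to 1 March in a non-leap year is an assumption made for illustration.

```python
from datetime import date


def earliest_disposal(archived_on: date, min_years: int = 5) -> date:
    """Return the first date on which the minimum retention period has passed."""
    try:
        return archived_on.replace(year=archived_on.year + min_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 1 March.
        return date(archived_on.year + min_years, 3, 1)


def retention_met(archived_on: date, today: date) -> bool:
    """True if the tape has been retained for at least the minimum period."""
    return today >= earliest_disposal(archived_on)


print(earliest_disposal(date(2023, 5, 19)))
```

Because five years is a minimum, a tape for which `retention_met` returns `True` is only eligible for review, not automatically due for disposal.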
Preservation Strategies for Descriptive and Procedural Stability
  • ㆍMigration: Convert file formats from less common or deprecated file formats to current file formats.
  • ㆍEmulation: Mimic the functionality of an older or obsolete computer so that a current computer can read an older file format. The file can then be saved in a current file format (a combination of emulation and migration), or emulation can serve as a technique for reading and using older, obsolete files in the future.
  • ㆍNormalization: Restrict data formats to common formats for preservation (e.g., limiting text files to open document formats or Word format), or convert software-dependent file formats (e.g., SPSS system files) to software-independent file formats (e.g., ASCII or XML-based formats).
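
Migration and normalization decisions can be recorded as a lookup from file format to preservation action. The following is a minimal sketch only: the format lists, the mapping, and the function name are hypothetical examples, not the formats accepted by the Geoscience Data Center.

```python
from pathlib import PurePath

# Hypothetical set of formats kept as-is for preservation.
PRESERVATION_FORMATS = {".csv", ".txt", ".xml", ".odt"}

# Hypothetical migration targets for software-dependent formats.
MIGRATION_TARGETS = {
    ".sav": ".csv",  # SPSS system file to plain delimited text
    ".xls": ".csv",  # legacy spreadsheet to plain delimited text
}


def preservation_action(filename: str) -> str:
    """Suggest an action for a file based only on its extension."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in PRESERVATION_FORMATS:
        return "keep"
    if suffix in MIGRATION_TARGETS:
        return "migrate to " + MIGRATION_TARGETS[suffix]
    return "review"  # unknown format: needs a manual decision


for name in ("survey.SAV", "notes.txt", "model.xyz"):
    print(name, "->", preservation_action(name))
```

The sketch only plans the action. The conversion itself depends on the software that can read each source format.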
5. Digital Assets Preservation Framework
Overview of Digital Assets Preservation Framework
  • ㆍThe Digital Asset Preservation Framework was presented by the National Digital Stewardship Alliance in 2013. This framework can be used to assess the level of digital preservation using <Table 4>.
  • ㆍThe appendix is a guide for assessing the level of preservation of digital assets, which can be used to evaluate the state of preservation in a repository and provide a year-by-year indication of where the level of preservation should be increased in the future.

<Table 4> Digital Asset Preservation Framework by Level

Content Level 1 (Data Protection) Level 2 (Data Recognition) Level 3 (Data Monitoring) Level 4 (Data Recovery)
Storage & Geolocation
  • ㆍLevel 1 (Data Protection)

    ◦ Two complete copies stored physically separate from each other

    ◦ For data on heterogeneous media (optical disks, hard drives, etc.), transfer the content from that media to the storage system

  • ㆍLevel 2 (Data Recognition)

    ◦ At least three complete copies

    ◦ At least one copy in a different geographic location

    ◦ Document the storage system, the storage media, and what is needed to use them

  • ㆍLevel 3 (Data Monitoring)

    ◦ One or more copies in a geographic location with a different disaster threat (e.g., hurricane zone vs. earthquake zone)

    ◦ Maintain an obsolescence monitoring process for storage systems and media

  • ㆍLevel 4 (Data Recovery)

    ◦ At least three copies in geographic locations with different disaster threats

    ◦ Have a comprehensive plan for keeping files and metadata on systems and media that are currently accessible

File Fixity & Data Integrity
  • ㆍLevel 1 (Data Protection)

    ◦ Verify file integrity on ingest (if integrity information is provided)

    ◦ Generate checksums, if not provided

    ◦ Virus scanning of all content

  • ㆍLevel 2 (Data Recognition)

    ◦ Integrity checks on all ingested data

    ◦ Read-only access when working with source media

    ◦ Virus scanning for high-risk content

  • ㆍLevel 3 (Data Monitoring)

    ◦ Integrity checks at regular intervals

    ◦ Maintain integrity logs; provide audit information as needed

    ◦ Maintain procedures to detect compromised data

    ◦ Virus scanning of all content

  • ㆍLevel 4 (Data Recovery)

    ◦ Check the integrity of all content in response to specific events or activities

    ◦ Maintain procedures for replacing or repairing corrupted data

    ◦ Ensure that no one person has write access to all copies of a file

Information Security
  • ㆍLevel 1 (Data Protection)

    ◦ Identify users with permissions to read, write, move, and delete individual files

    ◦ Restrict permissions on individual files

  • ㆍLevel 2 (Data Recognition)

    ◦ Document access restrictions for content

  • ㆍLevel 3 (Data Monitoring)

    ◦ Maintain a log of users who have taken actions on files, including delete and preservation actions

  • ㆍLevel 4 (Data Recovery)

    ◦ Perform a log audit

Metadata
  • ㆍLevel 1 (Data Protection)

    ◦ Inventory and storage locations of content

    ◦ Ensure backup and physical separation of inventory information

  • ㆍLevel 2 (Data Recognition)

    ◦ Store administrative metadata

    ◦ Store transformative metadata and log events

  • ㆍLevel 3 (Data Monitoring)

    ◦ Store standard technical and descriptive metadata

  • ㆍLevel 4 (Data Recovery)

    ◦ Store standard preservation metadata

File Format
  • ㆍLevel 1 (Data Protection)

    ◦ Where there is input into how digital files are created, encourage the use of a limited set of known open formats and codecs

  • ㆍLevel 2 (Data Recognition)

    ◦ Inventory of file types in use

  • ㆍLevel 3 (Data Monitoring)

    ◦ Monitor file types that are no longer supported

  • ㆍLevel 4 (Data Recovery)

    ◦ Perform format migration, emulation, and similar tasks
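
The Storage & Geolocation row of <Table 4> can be turned into a rough self-assessment. The following is a minimal sketch only. It reads the row as cumulative requirements: Level 1 needs two physically separate copies, Level 2 needs three copies with one in a different geographic location, Level 3 needs a copy under a different disaster threat, and Level 4 needs three copies under different disaster threats. The function name and parameters are hypothetical, and the documentation and planning requirements of each level are not checked.

```python
def storage_level(copies: int, locations: int, threat_zones: int) -> int:
    """Estimate the storage level (0 to 4) from copy counts alone.

    copies: number of complete copies, assumed physically separate
    locations: number of distinct geographic locations holding a copy
    threat_zones: number of distinct disaster-threat zones holding a copy
    """
    if copies >= 3 and threat_zones >= 3:
        return 4
    if copies >= 3 and locations >= 2 and threat_zones >= 2:
        return 3
    if copies >= 3 and locations >= 2:
        return 2
    if copies >= 2:
        return 1
    return 0  # a single copy does not reach Level 1


print(storage_level(copies=3, locations=2, threat_zones=1))
```

A result of 0 means that even Level 1 is not met, for example when only one complete copy exists.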
Version No. Date Contents
0.1 2023. 03. 20. Create document outline
0.6 2023. 04. 28. Create draft
0.8 2023. 05. 08. Guideline review
1.0 2023. 05. 19. Accept review comments