
Data management of GEO database

Data summary

Purpose

Acquisition and collection of geological data and the associated data analysis contribute essentially to the project's technical risk management plan. The analysis results serve to (1) obtain credible project construction cost and schedule estimates, (2) enable a world-wide community of engineers and entrepreneurs to propose credible and feasible re-use possibilities for excavated tunnel material and (3) increase the outreach of the scientific publications based on this data.

The data will be published on ZENODO to provide transparency and data clearance for potential re-use scenarios. For quality management purposes, all raw data, analysis results and ancillary data such as device calibrations and metadata will be stored at CERN in an EOS/CERNBOX repository with limited access. Some proprietary raw data and selected analysis results that are subject to publication constraints and time-based embargoes will also be kept in this internal data repository. Our approach enables follow-up projects and future generations of researchers to continue building on existing data sets, to validate the results and to document the improvement of technologies and techniques in a verifiable manner. This approach will ensure a durable impact of the EC funding obtained by EU projects such as FCCIS and DEBI beyond the project period.

Relation to the objectives of the FCC project

The objective of the FCC project and the FCCIS EC co-funded project is to develop a feasible new particle-collider based research infrastructure that serves a world-wide community of scientists until the end of the 21st century. The collection, processing, maintenance and publication of geological data will significantly help to achieve this goal. In addition, establishing a durable library of raw data and analysis results can serve a large community of researchers, engineers and entrepreneurs from different fields to:

  • get a better understanding of the subsurface in the Franco-Genevois basin across France and Switzerland,
  • support the development of economically viable re-use scenarios for excavated material, and
  • serve as an example for a large-scale subsurface investigation project with a high green environmental impact.

Sample type & data format

Published data includes comprehensive results of rock samples that are listed among the following types:

  • cores: cylindrically-shaped rock samples, which can either be fully retrieved (full core) or split (half core). Core samples typically range from several cm to m in length, with typical diameters according to Oil and Gas company standard drilling procedures (e.g. 4 to 12 cm).
  • plugs: small cylindrically-shaped rock samples drilled from a core. Usually, these plugs range from 2 - 8 cm in length and between 1 and 3 cm in diameter.
  • cuttings: rock pieces produced by the drilling process, taken at regular intervals during drilling.
  • hand pieces: these rock samples are of irregular shape and usually taken from outcrops (surface geology). Note that previous sample types can be obtained from outcrop samples if they have a sufficient volume.

Depending on the type of sample different analyses are possible. The list above ranks qualitatively each sample type in decreasing order of the amount of analyses possible. Please check the metadata of each analysis for further information on analysed samples. Results of these sample types are used to

  • derive thorough correlations for the understanding of molasse material properties for tunnelling construction and engineering purposes,
  • develop a 3D subsurface basin model,
  • develop re-use scenarios for expected rock classification types during the excavation process and
  • create figures and plots in scientific publications, such that other researchers can compare results and full transparency and reproducibility are ensured.

Numeric data and plot values are provided as value tables in Open Document Spreadsheet format (.ODS). This format is used for limited amounts of data with typed columns.

Larger data series are stored as UTF-8 encoded, comma-separated-value text files (RFC 4180) named {data filename}.CSV, each accompanied by a description of the column values and data formats in {data filename}_description.csv.
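The pairing of a data file with its description file can be sketched in Python as follows. This is a minimal illustration only: the layout of the description file (column, unit, type, description) and the demo values are assumptions, and the project template demodata_description.csv remains the reference.

```python
import csv
import tempfile
from pathlib import Path


def write_series(folder, name, columns, rows):
    """Write {name}.csv plus {name}_description.csv, both UTF-8 encoded.

    columns: list of (column name, unit, data type, description) tuples.
    rows:    list of value lists, one value per column.
    """
    folder = Path(folder)
    data_path = folder / f"{name}.csv"
    desc_path = folder / f"{name}_description.csv"

    # newline="" leaves line endings to the csv module (CRLF by default).
    with data_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in columns])
        writer.writerows(rows)

    with desc_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        # Assumed header of the description file, not taken from the template.
        writer.writerow(["column", "unit", "type", "description"])
        writer.writerows(columns)

    return data_path, desc_path


def read_series(data_path):
    """Read a data file back as a list of dictionaries keyed by column name."""
    with Path(data_path).open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical demo values, not project measurements.
DEMO_COLUMNS = [
    ("depth", "m", "float", "Sample depth below surface"),
    ("porosity", "%", "float", "Total porosity"),
]
DEMO_ROWS = [[148.0, 12.5], [152.0, 9.8]]

with tempfile.TemporaryDirectory() as tmp:
    data_file, description_file = write_series(tmp, "demodata", DEMO_COLUMNS, DEMO_ROWS)
    records = read_series(data_file)
    print(data_file.name, description_file.name, len(records))
```

All values are read back as text; the description file tells a reader which type each column should be converted to.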

Images and raw measurement data files are provided by the measurement instruments and stored in subfolders with a unique name.

A document record and change track will be kept in a separate metadata file called METADATA.ODS for each characterization action. It includes author contact information, status, version, change reason and date, description of contents, title, origin of data including a brief description of the measurement and/or experiment setup, sample ID, analysis scientist, sampler and processor.

Data files and images will be included in the open data sets and made available through a quality-managed data release procedure on ZENODO.

Proprietary raw and calibration data delivered by the measurement device may be published if access restrictions permit publication. These restrictions depend on the ownership of the raw data and the contractual conditions for making data openly accessible (e.g. through the EU H2020 funding instrument, data produced in the course of a project for the purpose of the project must be made openly accessible by the project consortium).

Templates for the different files are available:

Template | Type | Version | Use case
FCC-1611250930-JGU_SpreadsheetTemplate_V0100.ods | Open source spreadsheet | 1.0 | Limited amounts of data. Note that data columns have data types
FCC-1611250930-JGU_SpreadsheetTemplate_V0100.xlsx | MS Excel proprietary spreadsheet | 1.0 | Limited amounts of data. Note that data columns have data types
demodata.csv | Comma separated value file | 1.0 | Data series
demodata_description.csv | Comma separated value file | 1.0 | Description of the .csv data file contents in a .csv file
characterisation_metadata.ods | Spreadsheet | 2.0 | Metadata file for a set of characterised samples
metadata.ods | Spreadsheet | 1.0 | Metadata file for a data set
Creative Commons Attribution 4.0 International (CC BY 4.0) licence | Webpage | 4.0 | Contents of licence.txt file

Handling of existing data

Existing data from past and ongoing research and development projects within the scope of the FCC study serve as a basis for the data files and subsequent geomodelling.

Origin of data

The distinction between data holder and data owner is essential. The following list refers to data holders who permit the data to be analysed and processed within the scope of the FCC project. These data sets might originate from:

  • past and present subsurface investigations in the region carried out for different (non-FCC) purposes,
  • past and present subsurface investigations contracted by CERN for different (non-FCC) purposes and
  • subsurface investigations contracted by CERN for FCC purposes.

Expected size of data

The exact size of the data is unknown today. Initial experience from different measurements will permit revising this initial data management plan. The main data storage volume stems from raw seismic and borehole data as well as SEM, QEMSCAN and optical microscope images stored in high-resolution bitmap format. The following list gives a first estimate of the potential storage volume, assuming maximum data quality:

  • Active and passive seismic raw data: 1 - 15 TB.
  • Borehole data: 1 - 500 GB.
  • High-resolution images: 1 - 2 GB.
  • Laboratory measurement results (e.g. CSV-files): 1 - 10.000 MB.

Data utility

Within the Consortium:

The data sets will be shared within the consortium as the working baseline to:

  • develop re-use scenarios for tunnel excavation material,
  • create a 3D geological subsurface basin model of the project's perimeter and
  • produce scientific publications to validate results through repeated experiments at different locations.

Beyond the Consortium:

The data can be used by independent researchers and engineers to understand the contents and conclusions of the scientific publications that base their findings on the data. Furthermore, independent researchers can use the files to produce figures and publications that compare their own results with the project's results. Scientists can use the data files to repeat the experiments and measurements in order to verify and validate the project's research. The data sets can be used by a world-wide set of researchers, engineers and entrepreneurs to develop credible re-use scenarios for tunnel excavation material. The data sets can be used by public entities to increase their understanding of the geology in the Franco-Genevois basin. The data sets may be used by scientific writers and the press to produce high-quality infographics that demonstrate the scientific impact and potential of the FCC project.

Fair data

Making Data Findable, Including Provisions for Metadata

Discoverability

The ZENODO.ORG platform will be used to make the data openly accessible and discoverable. Links to the data will be made available on the FCC Twiki collaborative web site together with metadata describing the data sets once they are released on ZENODO. A link on the FCC public website fcc.cern is provided. The data will be indexed using the EU Open Data Portal. Since the open data supports the quality and credibility of the open publications, all data are discoverable through the scientific publications. Each scientific publication will include Digital Object Identifiers (DOI) that direct to the associated open data sets.

Identification

Each data set is labelled by a DOI as a unique and persistent identifier.

Data sets will be referenced in scientific publications and, if the open data platform permits, scientific papers based on the data will be linked on the open data platform. The DOI is reserved when a ZENODO entry is created, before any data are uploaded to the platform. At this point, the data set is not published and its visibility is classified as "Closed Access".

Metadata

The data sets follow the EU Open Data Portal Metadata definitions. From the comprehensive set of fields, a minimum set will be provided. The Dublin Core Metadata Initiative will be followed as much as possible. For all characterisations, at least the following metadata will be provided via the ZENODO upload form and the individual metadata files in the data folders:

Metadata element | Description
Project identifier | Points to a subfolder with the same name, holding the data
Title | Meaningful name of the sample characterisation
Alternative title | Other identifier in concise format
Description | Brief description of the data set
Keywords | List of keywords according to the Library of Congress terms
Identifier | Digital Object Identifier (DOI) reserved for this data set
URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system
Dataset type | Type of the bore log or material sample (e.g. cube, surface, loose rock)
Documentation | Detailed description of the dataset content
Format | File type of the data set, usually a compressed ZIP archive
Issue date | Date of the first issue
Modification date | Last date the data sample set was modified
Publisher | Usually the FCC collaboration
Contact point | Name of the organisation and service at the organisation who is in charge of that data sample
Contact full address | Address of the contact and organisation in charge of that data sample
Contact e-mail | E-mail address of the contact person
Contact name | First name and last name of the person who is in charge of the data sample
Contact Web page | A web page
Version number | Major and minor version number of the published data set. A minor version number 0 indicates a released version
Version description | Incremental change record
Licence | Link to licence text
Owner | Name of the person who must authorize the release of specifically marked information items
Status | One of IN WORK (minor version not equal 0), RELEASED (minor version number is 0), INVALID
Materials | A list of individual materials present in the sample as far as known
Dimensions | Sample dimensions (mass, diameter, length, width)
Origin | Geographical coordinates of the sample's origin
Source | Who provided the sample
History | A record of actions at different locations that indicates how the sample was obtained, used and characterised
Additional information | Free text with additional comments

The folder data of the CERNBOX file share contains a file called CATALOG.ODS that provides the most important metadata, notably the project identifier of the sample characterisation, which directs to the folder in which the open data is stored. The GEO database catalog is also available on the geo database Twiki page.

The following data description elements are collected for the catalog:

Data element | Description
Data set identifier | Identifier in format LLL-YYMMDD
URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system. A cernbox link may either be public or accessible only to members of the e-group if the data are not yet published
DOI | Digital Object Identifier, filled in once reserved in ZENODO
Title | Concise name of the data set
Version | Current version of the data set
Status | Release status of the data set (RELEASED, IN WORK, INVALID)
Organisation | Short name of the organisation that created and manages this data set
Contact name | Name and e-mail (hyperlink) of the person who serves as contact point for this data set
Type | Type of the data set, e.g. a bore log or material sample (e.g. cube, surface, loose rock)
Source | Where the materials for this data set have been obtained from (e.g. contracted extraction, other project, literature)
Location N/E | Geographical coordinates where the materials that relate to this data set have been taken from
Description | Brief description of the data set
Last update | Date when this record has been updated

NOTE: It is understood that these fields duplicate fields that are also stored at a lower measurement folder level. The repetition at the higher level provides a simple, easy-to-use catalogue in the project.

Each data set is stored in a separate sub-folder. Each data set sub-folder contains either a) folders for the different samples that have been characterised or b) directly the folders for the characterisation data of the entire data set.

If a data set folder has sample sub-folders, each sample sub-folder contains a METADATA.ODS file that describes the specific data set according to the metadata elements indicated above.

Versioning

A dataset has a major (MM) and a minor (mm) version number, separated by a dot (MM.mm).

NOTE: The ZENODO recommended patch number is not used. It should always be “.0”.

If the minor version number is 0, the data set is released. Any minor version number different from zero indicates a data set that is in work. At each release the major version number is incremented by one. For each change in an "in work" version, the minor version number is incremented by one. The first "in work" version starts with a major version number equal to 0.

Examples:

V 0.1 - the first draft version

V 0.2 - a second draft version

V 1.0 - the first released version

V 1.1 - an update to Version 1.0, not yet released

V 2.0 - another released version. Version 1.0 is now invalid.

Versions comply with the data set status and can either be:

  1. IN WORK (not released),
  2. RELEASED or
  3. INVALID (must no longer be used as reference)

Note: INVALID and IN WORK versions must not be published or referenced in publications.
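The versioning rules can be sketched in a few lines of Python. The function names are hypothetical and not part of any project tooling. Note that INVALID cannot be derived from the version number alone, so the sketch takes it as an explicit flag.

```python
import re

VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)$")


def parse_version(text):
    """Split an 'MM.mm' string into (major, minor) integers."""
    match = VERSION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an MM.mm version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def status(text, superseded=False):
    """Return the status implied by a version number.

    superseded marks a released version that a later release has replaced;
    this fact cannot be read from the number itself.
    """
    _major, minor = parse_version(text)
    if minor != 0:
        return "IN WORK"
    return "INVALID" if superseded else "RELEASED"


def next_in_work(text):
    """A change to an in-work data set: increment the minor number by one."""
    major, minor = parse_version(text)
    return f"{major}.{minor + 1}"


def next_release(text):
    """A release: increment the major number by one and reset the minor to 0."""
    major, _minor = parse_version(text)
    return f"{major + 1}.0"


def may_be_referenced(text, superseded=False):
    """Only RELEASED versions may be published or cited."""
    return status(text, superseded) == "RELEASED"


print(status("1.1"), next_release("1.1"), next_in_work("1.0"))
```

For example, an update to released version 1.0 becomes 1.1 (IN WORK), and releasing it yields 2.0, at which point 1.0 is marked as superseded.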

Naming convention

Data set identification

A data set comprises multiple samples and sample analysis records.

The project identifier for a data set uses the following convention: LLL-YYMMDD

where LLL is the three-letter abbreviation of the organisation at which the data set is created, e.g. MUL for Montanuniversität Leoben.

YY stands for the last two digits of the current year, e.g. 18 for 2018.

MM stands for the two digits of the current month, e.g. 04 for April.

DD stands for the two digits of the current day of month, e.g. 17 for the seventeenth day.

A complete example for a project identifier is TUW-180317.
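A small Python sketch of how such an identifier could be generated and checked is shown below. The helper names are hypothetical. The sketch assumes exactly three uppercase letters for the organisation, as stated in the convention, and assumes that two-digit years refer to the 2000s.

```python
import re
from datetime import date

# Three uppercase letters, a dash, then YYMMDD.
DATA_SET_ID = re.compile(r"^([A-Z]{3})-(\d{2})(\d{2})(\d{2})$")


def make_data_set_id(organisation, day):
    """Build an LLL-YYMMDD identifier from an abbreviation and a date."""
    org = organisation.strip().upper()
    if not re.fullmatch(r"[A-Z]{3}", org):
        raise ValueError(f"expected a three-letter abbreviation, got {organisation!r}")
    return f"{org}-{day:%y%m%d}"


def parse_data_set_id(identifier):
    """Return (organisation, date) or raise ValueError for malformed input."""
    match = DATA_SET_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an LLL-YYMMDD identifier: {identifier!r}")
    org, yy, mm, dd = match.groups()
    # Assumption: all data sets are created in the years 2000 to 2099.
    # date() itself rejects impossible month or day values.
    return org, date(2000 + int(yy), int(mm), int(dd))


print(make_data_set_id("TUW", date(2018, 3, 17)))
```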

The three-letter abbreviations of some example organisations that typically carry out material characterizations are shown in the table below. This document will be regularly updated. The three-letter abbreviations do not coincide with the typical abbreviations used to identify an organization in EC projects. They merely serve to form unique project identifiers for sample characterizations.

Abbreviation | Organisation
MUL | Montanuniversität Leoben
UNIGE | Université de Genève
ETH | Eidgenössische Technische Hochschule

Sample and analysis identification

Files that relate to samples and data from different analysis processes are placed in subfolders of the data set according to this structure:

LLL-YYMMDD is the name of the folder for a data set that contains multiple sample characterisations.

LLL-YYMMDD-{running number}-{type} is the name of a sub-folder in LLL-YYMMDD that holds the analysis data of a specific sample or characterisation.

{type} can be

  1. either sample to indicate that the folder contains data of a single sample that has been analysed with one or several methods or
  2. {characterisation method abbreviation} to indicate that this folder holds data for a specific characterisation method.
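The sub-folder naming rule (data set identifier, running number, then type) can be sketched in Python as follows. The helper names are hypothetical. The sketch assumes that everything after the running number is the type, so that hyphenated method abbreviations stay intact.

```python
import re

# Data set identifier, running number, then the type (may itself contain dashes).
SUBFOLDER = re.compile(r"^([A-Z]{3}-\d{6})-(\d+)-(.+)$")


def subfolder_name(data_set_id, number, kind="sample"):
    """Compose the sub-folder name for a sample or a characterisation method."""
    if "/" in kind:
        # A slash would be read as a path separator, so it cannot be part of a name.
        raise ValueError(f"type must not contain '/': {kind!r}")
    return f"{data_set_id}-{number}-{kind}"


def parse_subfolder(name):
    """Return (data set identifier, running number, type)."""
    match = SUBFOLDER.match(name)
    if match is None:
        raise ValueError(f"not a valid sub-folder name: {name!r}")
    data_set_id, number, kind = match.groups()
    return data_set_id, int(number), kind


print(subfolder_name("CRN-200301", 1))
print(parse_subfolder("MUL-180417-3-XRD-P"))
```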

Today, the following sample analyses are registered:

Characterisation method abbreviation | Characterisation method name | Description
QEMSCAN | Automated mineralogy and Petrography apparatus | Modal mineralogy, lithology, grain density
FTIR | Fourier Transform Infrared Spectroscopy | Mineral components in mixtures to distinguish among different types of clay minerals, characterization of both di- and trioctahedral clay minerals, amounts of kaolinite and Al/Mg/Fe types, organic material
OMI-T | Optical Microscopy in transmitted light | Optical analysis, 2.5x to 50x magnification
OMI-R | Optical Microscopy in reflected light | Optical analysis, 2.5x to 50x magnification
pXRF | Portable X-Ray Fluorescence | Elemental composition in portable format
XRD-P | X-Ray diffraction on powder sample | Identification of mineral phases on randomly oriented sample
XRD-S | X-Ray diffraction on textured sample | Identification of clay minerals on oriented sample
XRD-EG | X-Ray diffraction on ethylene glycol-treated sample | Identification of swelling clays
ICP-MS | Inductively Coupled Plasma | Elemental composition using mass spectroscopy
ICP-OES | Inductively Coupled Plasma | Ultra-trace elemental composition using an optical emission spectrometer to identify leaching characteristics
CUV | Cuvette Tests | Identification of chemical pollutants and leaching characteristics complementary to ICP-OES
UCS | Uniaxial Compressive Strength | Uniaxial Compressive Strength
LCPC | Laboratoire Central des Ponts et Chaussées | LAB and LAC values for abrasivity and breakability
CER | Cerchar abrasivity | Abrasivity behavior
VPS | Sonic P- and S-wave velocity | Sonic velocities in longitudinal and transversal direction
BRZ | Brazilian tensile strength | Tensile strength behavior
PL | Point Load | Point load index, rock strength parameter
PORO_gas | Porosity | Total & effective porosity measurements using gas adsorption
PERM_gas | Permeability | Helium / Nitrogen permeability measurements using gas adsorption
CEC | Total Cation Exchange Capacity | Total identification of solved ions in pore water and pores
CEC-ICP | Effective Cation Exchange Capacity | Effective identification of solved ions in pore water and pores using ICP optical emission spectrometer
Enslin-Neff | Free Water Uptake Capacity | Water suction of samples under free swelling conditions
Keeling | Vapour Water Adsorption | Water adsorption under 75% NaCl atmosphere identifying specific (inner crystalline) surface
BET | Brunauer-Emmett-Teller | Determination of specific (outer) surface
TC/TOC | Total carbon / Total Inorganic Carbon Content | Measurement of total carbon and total inorganic carbon content, as well as derivation of total organic carbon content via the difference method
MIP | Mercury Intrusion Porosimetry | Identification of pore size, pore size distribution and porosity by high-pressure mercury intrusion

Filename and folder naming

The filename of a sample or data set starts with a clear, concise and short name that describes the location (e.g. borehole name or outcrop name), followed by the representative feature (e.g. depth) and the analysis method, in the format fullwellname_depth_analysismethod. The "analysismethod" part is the common abbreviation of an analysis in a maximum of four letters, e.g. XRF (X-ray fluorescence) or PORO (porosity). Words are in lowercase and separated by underscores ("_"). CamelCase is discouraged. An example file name that complies with these rules is geo02_148_xrf. This sample originates from the well GEO02 (full well name) at a depth of 148 m and was analysed with an X-ray fluorescence device, i.e. an elemental (geochemical) composition analysis. Further filename extensions are encouraged where they ease understanding of the folder contents, for example to mark additional data sets within the same file (several measurements) or measurements on different sample types (e.g. halfcore or cuttings).
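A minimal Python sketch of this filename convention is given below. The helper names are hypothetical, and the sketch assumes that each part consists of letters and digits only, which is stricter than the text requires.

```python
import re

# location, feature, analysis method (max. four letters), optional extensions
FILENAME = re.compile(r"^[a-z0-9]+_[a-z0-9]+_[a-z]{1,4}(?:_[a-z0-9]+)*$")


def build_filename(location, feature, method, *extensions):
    """Join the parts in lowercase with underscores and validate the result."""
    parts = [location, feature, method, *extensions]
    name = "_".join(str(part).strip().lower() for part in parts)
    if not FILENAME.match(name):
        raise ValueError(f"filename does not follow the convention: {name!r}")
    return name


def is_valid_filename(name):
    """Check an existing name (without file extension) against the convention."""
    return FILENAME.match(name) is not None


print(build_filename("GEO02", 148, "XRF"))
print(build_filename("GEO02", 148, "XRF", "halfcore"))
```

The optional trailing parts correspond to the further filename extensions mentioned above, such as the sample type.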

All data sets are stored in the folder data of the fccgeo EOS/CERNBOX project.

Folder structure

Each top level data set folder contains at least the following files:

  • LICENCE.TXT - Text of the Creative Commons CC BY 4.0 licence. Additional note about the data creator if needed including specific clauses on a case-by-case basis
  • METADATA.ODS - the metadata of the data set including the change track record

Each data set folder contains sub-folders. These folders hold information about the entire data set from which the samples are taken and sub-folders for individual sample analysis tasks.

Collaboration members shall report needs for further characterisation methods in a timely fashion so that the template folder at top level can include further examples. The file and folder naming convention will evolve according to project requirements and growing experience.

One sub-folder needs to be created in a data set folder for each sample that is analysed with multiple techniques or for each analysis technique as indicated above.

Each sample specific sub-folder holds a CharacterisationMetadata.ods file that describes the sample and the analysis carried out on that sample. Depending on the numbers of samples taken from the original material and the different types of characterisation performed, two options exist:

  1. Either there exists one sub-folder in the data set for each sample taken from the material lot (preferred) or
  2. there exists one sub-folder for each characterisation. Both options are valid.

The sub-folders are named by extending the data set naming conventions with a dash (-) followed by a sequence number, another dash (-) and the word sample or the abbreviation of the characterisation method.

If a single sample is analysed with different analysis methods, the sample folder contains one sub-folder per analysis method that has the name of the analysis method (e.g. SEM, TEM).

Each sample or characterisation folder contains the following sub-folders:

  • img for images,
  • raw for any raw data files that devices deliver during the analysis, including calibration data files,
  • results for the derived analysis data and results of the characterisation,
  • doc for any additional documentation such as manuals, process descriptions, sample preparation instructions and
  • misc for any additional information files.

The following image shows two possible folder structures. The structure example and the files can be found in the template folder.

data_set_folder_structure_documentation.png

Another example is re-produced in tabular form:

Path name | Description
/eos/project/f/fccgeo/data/CRN-200301 | Folder for a bulk of materials taken at CERN
/eos/project/f/fccgeo/data/CRN-200301/Metadata.ods | The metadata file that describes the material taken and the list of analyses performed
/eos/project/f/fccgeo/data/CRN-200301/img | Sub-folder that holds images that relate to the entire data set (e.g. of the borelog)
/eos/project/f/fccgeo/data/CRN-200301/raw | Sub-folder that holds the raw data files (e.g. the scanned borelog)
/eos/project/f/fccgeo/data/CRN-200301/results | Sub-folder that holds the analysis results that relate to the entire data set
/eos/project/f/fccgeo/data/CRN-200301/doc | Sub-folder that holds the additional documentation that permits better understanding the data set
/eos/project/f/fccgeo/data/CRN-200301/misc | Sub-folder that holds miscellaneous additional information files
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample | Sub-folder for a qualitative analysis of a part of the material extracted
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/img | Sub-folder with images of the sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/results | Sub-folder with results of the sample analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM | Sub-folder with data relating to the SEM analysis method of that specific sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/img | Sub-folder with images from the SEM analysis of that specific sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/raw | Sub-folder with the SEM device raw data
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/doc | Sub-folder with the SEM device description, sample preparation information and process description
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM | Sub-folder for a spectroscopy analysis of another part of the material extracted
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/img | Sub-folder that holds images taken during the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/raw | Sub-folder that holds the raw data files obtained during the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/results | Sub-folder that holds the analysis results and result data files obtained after the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/doc | Sub-folder that holds the additional documentation that permits better understanding the results (e.g. device manuals)
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/misc | Sub-folder that holds miscellaneous additional information files
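The folder layout described in this section can be sketched with a short Python script that creates an empty skeleton. The function name is hypothetical, and the metadata and licence files are created as empty placeholders only; in practice they would be copied from the project templates. The sketch writes into a temporary directory rather than the EOS share.

```python
import tempfile
from pathlib import Path

STANDARD_SUBFOLDERS = ("img", "raw", "results", "doc", "misc")


def create_data_set(root, data_set_id, subfolders):
    """Create an empty data set skeleton below root and return its path.

    subfolders: names such as 'CRN-200301-1-sample' or 'CRN-200301-2-SEM'.
    """
    data_set = Path(root) / data_set_id
    for name in STANDARD_SUBFOLDERS:
        (data_set / name).mkdir(parents=True, exist_ok=True)

    # Empty placeholders; real files come from the template folder.
    (data_set / "LICENCE.TXT").touch()
    (data_set / "METADATA.ODS").touch()

    for sub in subfolders:
        for name in STANDARD_SUBFOLDERS:
            (data_set / sub / name).mkdir(parents=True, exist_ok=True)
        (data_set / sub / "CharacterisationMetadata.ods").touch()

    return data_set


with tempfile.TemporaryDirectory() as tmp:
    skeleton = create_data_set(
        tmp, "CRN-200301", ["CRN-200301-1-sample", "CRN-200301-2-SEM"]
    )
    for path in sorted(skeleton.rglob("*")):
        print(path.relative_to(skeleton).as_posix())
```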

Data format

A set of sample characterisation data is uploaded as a compressed archive file in .ZIP format to ZENODO together with the following separate files:

  • licence.txt (text of the CC BY 4.0 licence in UTF-8 format)
  • METADATA.ODS (spreadsheet describing the entire data set)
  • ./{subfolders of selected sample characterisations to be published} (organisation of the sub-folders can either be by sample or by characterisation method)

Note: Each subfolder corresponding to a specific measurement action will contain only those files that the collaboration members agree to publish openly. Usually, these comprise spreadsheets in open format (.ODS) and comma separated value files (.CSV) as well as high resolution images (.PNG, .TIFF, .JPEG), vector plots (.EPS, .PDF) and summary documents (.PDF).
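Packing selected sub-folders into a ZIP archive can be sketched in Python as shown below. The function name is hypothetical. Filtering by file extension is a simplifying assumption made for illustration: which files are actually published is decided by the collaboration members, not by the file type alone.

```python
import tempfile
import zipfile
from pathlib import Path

# File types named in the note above; a simplification of the release decision.
PUBLISHABLE = {".ods", ".csv", ".png", ".tiff", ".jpeg", ".eps", ".pdf"}


def pack_for_release(data_set_folder, archive_path, selected_subfolders):
    """Write the publishable files of the selected sub-folders to a ZIP archive.

    Returns the archive member names, relative to the data set folder.
    """
    data_set = Path(data_set_folder)
    packed = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for sub in selected_subfolders:
            for path in sorted((data_set / sub).rglob("*")):
                if path.is_file() and path.suffix.lower() in PUBLISHABLE:
                    member = path.relative_to(data_set).as_posix()
                    archive.write(path, member)
                    packed.append(member)
    return packed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "CRN-200301"
    sample = root / "CRN-200301-1-sample"
    (sample / "results").mkdir(parents=True)
    (sample / "raw").mkdir()
    (sample / "results" / "geo02_148_xrf.csv").write_text("depth,value\n", encoding="utf-8")
    (sample / "raw" / "device.bin").write_bytes(b"\x00")  # not packed
    members = pack_for_release(root, Path(tmp) / "CRN-200301.zip", ["CRN-200301-1-sample"])
    print(members)
```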

Keywords

Each dataset will at least be tagged with the following keywords:

  1. FCC
  2. FCCIS (if the data set is part of the FCCIS EC co-funded project)
  3. H2020 (if the data set was created as part of an EC co-funded project)

In addition, appropriate keywords from the Library of Congress Subject Headings classification will be added (see http://id.loc.gov/authorities/subjects.html):

The keywords need at least to include

  1. discipline
  2. sample type
  3. type of materials
  4. properties or methods used for characterization

A selected list of entries from the following keyword terms and the link to the keyword term need to be entered in the two distinct fields foreseen in the ZENODO upload webform:

Term | Identifier link
Disciplines | URI
Geology | http://id.loc.gov/authorities/subjects/sh85054037
Engineering geology | http://id.loc.gov/authorities/subjects/sh85043221
Materials science | http://id.loc.gov/authorities/subjects/sh85082094
Sample type | URI
Prospecting geophysical methods | http://id.loc.gov/authorities/subjects/sh85107597
Electric prospecting | http://id.loc.gov/authorities/subjects/sh85041941
Magnetic prospecting | http://id.loc.gov/authorities/subjects/sh85079731
Seismic prospecting | http://id.loc.gov/authorities/subjects/sh85119624
Geophysical well logging | http://id.loc.gov/authorities/subjects/sh85054183
Materials | URI
Molasse | http://id.loc.gov/authorities/subjects/sh85086543
Limestone | http://id.loc.gov/authorities/subjects/sh85077017
Moraines | http://id.loc.gov/authorities/subjects/sh85087196
Gravel | http://id.loc.gov/authorities/subjects/sh85056544
Properties | URI
Abrasion resistance | http://id.loc.gov/authorities/subjects/sh2008000012
Soil porosity | http://id.loc.gov/authorities/subjects/sh86005323
Soil permeability | http://id.loc.gov/authorities/subjects/sh85124367
Tensility | ?
Characterisation | URI
Physical measurements | http://id.loc.gov/authorities/subjects/sh85101564
Microscopy | http://id.loc.gov/authorities/subjects/sh92003369
Transmission electron microscopy | http://id.loc.gov/authorities/subjects/sh93001918
Scanning electron microscopy | http://id.loc.gov/authorities/subjects/sh91002757
Compression testing | http://id.loc.gov/authorities/subjects/sh85082069

Metadata standards

The Dublin Core Metadata Initiative will be followed (http://dublincore.org) as much as reasonably applicable. Metadata examples concerning sample measurement campaigns, such as http://icatproject-contrib.github.io/CSMD/csmd-4.0.html and an existing sample database at CERN, have been considered. However, to the best of our knowledge, no domain-specific metadata standard exists for the sample characterisation campaigns specified in the project. Therefore, a column-oriented data format with an explanation of the columns will be created in the scope of this project.

Templates

Template folder and files can be found in the folder fccgeo/templates.

The entire folder fccgeo/templates/CRN-200728 is an empty example that can be copied into the folder /fccgeo/data to be renamed using the naming convention for the data set LLL-YYMMDD to create a new data set. It contains the exemplary required files and exemplary characterisation subfolders.

Storage administration and access permissions

Data store location

Data from subsurface samples are stored on a dedicated cernbox/eos data repository. This repository can be accessed either via a website or via a cernbox client application.

The path to the data share reachable directly on lxplus is /eos/project/f/fccgeo

The data share is owned by the service account fccgeo (fcc.geo@cern.ch).

If you have access to the repository through membership of one of the e-groups, you can directly access the share using a web browser:

https://cernbox.cern.ch/index.php/apps/files?dir=/__myprojects/fccgeo

Access permissions

Access to the data store is managed via three e-groups following the cernbox access documentation:

cernbox-project-fccgeo-admins members have full access to the project in the cernbox website and can add readers and writers to the respective e-groups.

cernbox-project-fccgeo-writers members can read, write and delete in the project space. They can access the cernbox share only via the pathname.

cernbox-project-fccgeo-readers members can only read the files in the project space. They can access the cernbox share only via the pathname.

Making data openly accessible

Data from characterization campaigns will be made openly available only after approval of all persons who were involved in that characterization. Publication on ZENODO follows a quality management Data Release Procedure.

Note: Persons publishing data on the ZENODO platform need to sign up with an account at that platform at https://zenodo.org/signup/.

Increase data re-use through clarifying licenses

Licence

Openly accessible data will be licensed under Creative Commons CC BY 4.0 (see https://creativecommons.org/licenses/by/4.0/).

Users are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

for any purpose, also commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices: You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation documented by CC BY 4.0. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

Timing

Data will be made available at the latest when an accompanying scientific publication that references the data sets is published. All data will be available at the latest by the end of the project, including data sets that are not referenced in scientific publications. Data publication may occur after the six-month embargo period permitted by EC H2020 has elapsed. Published data may be limited to a subset: data that are subject to access restrictions, or whose publication would violate other existing licence restrictions, are excluded.

Re-use

Published data can be used by other scientists. Original data revealing particular sample preparation or analysis techniques that are subject to access restrictions will only be usable by researchers upon explicit request and approval of the characterisation campaign manager and the data owner.

Validity

The data will remain usable until the repository withdraws the data or ceases operation.

Quality assurance

Data sets, metadata and measurement setup and procedure description will be reviewed by at least one peer from the related scientific domain prior to engaging the release procedure. The author and the reviewer are named in the metadata.

Data sets, metadata and measurement setup and procedure description will be marked as RELEASED after approval of the measurement campaign manager (e.g. project supervisor) and one additional reviewer. Each approver is named in the metadata.

Review and release include a validation of the measured sample, the measurement setup, conditions, procedure and equipment as well as sanity checks against similar studies and control of systematic errors by a person from the related scientific field. The person who conducted the measurements cannot also approve the measured data set. For scientific research (e.g. doctoral thesis), the associated academic supervisor must approve the data.

In case of data quality uncertainties after release, a new version IN WORK is created and the released data set version is marked INVALID.
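The review and release rules above can be expressed as a simple check. This is a sketch under stated assumptions: the class ReleaseRequest, its field names and the function release_problems are hypothetical and do not correspond to an existing project tool or metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseRequest:
    """Hypothetical record of who was involved in a data set release."""
    measured_by: str                 # person who conducted the measurements
    campaign_manager: str            # e.g. the project supervisor
    peer_reviewers: list[str] = field(default_factory=list)
    approvers: list[str] = field(default_factory=list)


def release_problems(req: ReleaseRequest) -> list[str]:
    """Return the rules that still block marking the data set RELEASED."""
    problems = []
    if not req.peer_reviewers:
        problems.append("no peer review recorded")
    if req.campaign_manager not in req.approvers:
        problems.append("campaign manager approval missing")
    if not set(req.approvers) - {req.campaign_manager}:
        problems.append("additional reviewer approval missing")
    if req.measured_by in req.approvers:
        problems.append("measurer cannot approve own data set")
    return problems
```

An empty list means that none of the rules modelled here blocks the release. The supervisor approval for academic research is not modelled.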

The description of the analysis/characterisation setup (materials and method) is annotated with product references.

The measurement conditions are described.

The measurement location, date and time (periods) are noted.

Any potential and known adverse effects (environmental influences, influences of the measurement equipment) must be described in the metadata.

Allocation of resources

Cost estimate

A person at CERN, reachable through the service account fcc.geo@cern.ch, keeps the measurement data sets and performs the publication in the open data repository. The estimated effort to manage a data set is 40 hours; with 10 data sets per year, this amounts to 400 hours or 10 weeks per year over the entire project period.

This resource is covered by the project management funds and CERN matching resources.

Note: the project coordination office will track the actual efforts and regularly update this estimation.

Each researcher in the project is responsible for creating the data sets using the adopted open data format, providing the metadata files, describing the measurement setup, anonymising or selecting the data for publication, reviewing the data sets and performing the release process using the CERN provided storage infrastructure (EOS, cernbox) and the ZENODO platform. The estimated effort is another 40 hours per data set; with 10 data sets per year, this amounts to another 10 weeks per year over the entire project period.
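The two effort estimates above can be checked with a short calculation. The 40-hour working week is an assumption inferred from the statement that 400 hours correspond to 10 weeks; the names used are illustrative only.

```python
HOURS_PER_DATASET = 40   # from the estimate above
DATASETS_PER_YEAR = 10   # from the estimate above
HOURS_PER_WEEK = 40      # assumption: implied by 400 hours = 10 weeks


def yearly_effort(hours_per_dataset: int, datasets_per_year: int,
                  hours_per_week: int = HOURS_PER_WEEK) -> tuple[int, float]:
    """Return the yearly effort as (hours, working weeks)."""
    hours = hours_per_dataset * datasets_per_year
    return hours, hours / hours_per_week


hours, weeks = yearly_effort(HOURS_PER_DATASET, DATASETS_PER_YEAR)
print(f"{hours} hours per year, i.e. {weeks:g} weeks")
```

The same figures apply once to the data management at CERN and once to the preparation of the data sets by the participating researchers.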

This resource is covered by the organisations that carry out the measurement campaigns.

Note: The participating institutes are strongly encouraged to track the time they are spending to prepare the data sets and to publish them and to report their actual estimates to the Coordinator.

Data management responsibilities

This data management plan is maintained by the FCC office at CERN (fcc.office@cern.ch).

All project members at the co-operating organisations commit to cooperate on the establishment of this DMP and to deliver the required information such that the associated deliverables and milestones can be produced in due time with the required quality levels:

Data storage and backup responsibilities are covered by the data repository provider (CERN).

The CERN project repository is managed by the CERN IT department.

Service account holder fcc.geo@cern.ch manages the data store.

The FCC secretariat and office (fcc.office@cern.ch) provides support for the upload to the CERNBOX/EOS data storage system and performs a formal check (file integrity, naming, metadata completeness).

Long-term data preservation will be ensured by CERN at no additional cost.

Data security

All data delivered to the CERN project repository EOS/CERNBOX storage system is backed up by CERN's central IT services. In addition, a copy of released data will be kept on the ZENODO platform. Both services are intended for long-term storage of scientific research data. Upon unintentional loss of data (misuse of the collaborative workspace, accidental removal), the fcc.geo@cern.ch service account holder needs to be contacted via email. The person will interact with CERN's IT services to restore the latest known copy. No additional costs occur for storage, backup and restore activities.

Project members can provide non-public data sets via the HTTPS transfer protocol after authentication, by sharing a CERNbox link with the service account holder. This offers a reasonably secure way of transfer that counteracts data manipulation.

Sensitive data, i.e. non-anonymized data sets can only be accessed by the author, the measurement campaign management and the project IT managers.

Access to sensitive data may be granted through a request to the network coordinator (CERN) with a justification of request. Access will be granted on a case-by-case basis in agreement with the measurement campaign manager and, if samples, analysis and products from industrial partners are involved, in agreement with the sample owners. The data will be communicated in electronic format from the network coordinator to the data requestor in digitally signed and encrypted form. An additional IP access process, such as the establishment of a Non-Disclosure Form may apply.

Every collaboration member must inform the network coordinator without delay if a person who is affiliated with the institute (associated or employed) and who has access to the project data leaves the institute. In this case, the network coordinator will revoke that person's access to the data as soon as technically possible and as resources (working hours) permit.

Note: E-mail is not considered a secure communication channel for data and metadata files. Data can be modified and it is unclear what fields have been modified with respect to the original data source. Therefore, only a link to the authentic data source shall be considered reliable information.

Ethical aspects

Sensitive information will be kept secure. Access to non-anonymized data is managed by the network coordinator in close cooperation with the organisation that provides the data set. Non-anonymized data will only be communicated in encrypted and digitally signed form.

-- MaximilianHaas - 2021-03-17 -- JohannesGutleber - 2020-07-27


OLD VERSION (SAVE COPY)

Geo Data Management Plan

Data Summary

Purpose

Collecting and making available the geological data and the soil analysis is an essential part of the project's technical risk management plan. It (1) helps to obtain credible project construction cost and schedule estimates, (2) enables a world-wide community of engineers and entrepreneurs to propose credible and feasible re-use possibilities for excavation materials and (3) raises the quality of the scientific publications based on those data.

Eventually, all data relevant to describing the project and to developing re-use scenarios will be made openly available on ZENODO. For quality management purposes, all raw data, analysis results and ancillary data such as device calibrations and metadata will be stored internally at CERN in an EOS/CERNBOX repository with limited access. Also, some proprietary raw data and selected analysis results that are subject to publication constraints and time-based embargoes will be kept in this internal data repository. The repository serves as the source from which data that are freely accessible for the public will be pushed to ZENODO.

Our approach permits follow-up projects and further generations of researchers continuing the work to build upon existing data sets, to validate the results and to document the improvement of technologies and techniques in a verifiable manner. This approach will ensure a durable impact of the EC funding obtained by projects such as FCCIS and DEBI project beyond the project period.

Relation to the Objectives of the FCC Project

The objective of the FCC project and the FCCIS EC co-funded project is to develop a feasible new particle-collider based research infrastructure that can serve a world-wide community of scientists until the end of the 21st century. The managed collection, processing and publication of geological data will help achieve this goal. In addition, establishing a durable library of raw data and analysis results can serve a large community of researchers, engineers and entrepreneurs from different fields to:

  • get a better understanding of the subsurface in the region of the Geneva lake basin across France and Switzerland,
  • support the development of economically viable re-use scenarios for excavated materials, namely the molasse type and
  • serve as an example for a large-scale subsurface investigation project.

Types and Formats of Data

The openly accessible data will be the comprehensive result data sets of characterized samples that are used to

  • develop a 3D subsurface model of the region
  • develop re-use scenarios for the expected different soil types during the excavation process and
  • create the figures and plots in scientific publications, such that other researchers can compare their results more easily and such that further results including historic data can be produced more quickly.

Limited amounts of numeric data and plot data are stored as value tables with typed columns in Open Document Spreadsheet format (.ODS).

Larger data series are stored as UTF-8 encoded, comma separated value (RFC 4180) text files ({data filename}.CSV), each accompanied by a file that describes the columns and data formats ({data filename}_description.csv).
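The pairing of a data file with its description file can be sketched as follows. The layout of the description file is not defined here, so the columns named column and type, as well as the type names int, float and str, are assumptions made for illustration only.

```python
import csv
from pathlib import Path

# Hypothetical type names used in the description file.
CASTS = {"int": int, "float": float, "str": str}


def description_path(data_file: Path) -> Path:
    """Return {data filename}_description.csv next to the data file."""
    return data_file.with_name(f"{data_file.stem}_description.csv")


def read_typed(data_file: Path) -> list[dict]:
    """Read a UTF-8 CSV file and convert each column as described."""
    with description_path(data_file).open(encoding="utf-8", newline="") as fh:
        types = {row["column"]: CASTS[row["type"]] for row in csv.DictReader(fh)}
    with data_file.open(encoding="utf-8", newline="") as fh:
        return [{name: types[name](value) for name, value in row.items()}
                for row in csv.DictReader(fh)]
```

Keeping the column types in a separate file leaves the data file readable by any CSV tool, while the conversion stays reproducible.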

In addition, images and raw measurement data files as provided by the measurement instruments will be stored in subfolders.

For all published files, a document record and change track will be included (author contact information, status, version, change reason and date, description of contents, title, origin of the data including a brief description of the measurement and/or experiment setup) in a separate metadata file for each characterization action called METADATA.ODS.

Data files and images will be included in the open data sets that will be made available through a quality-managed data release procedure on ZENODO.

Proprietary raw data delivered by the measurement instruments may be published if access restrictions permit the publication. Those restrictions depend on the ownership of the raw data and the contractual conditions for making those data openly accessible (e.g. through the EU H2020 funding instrument, data produced in the course of a project for the purpose of the project must be made openly accessible by the project consortium).

Templates for the different files are available:

| Template | Type | Version | Use case |
| FCC-1611250930-JGU_SpreadsheetTemplate_V0100.ods | Open source spreadsheet | 1.0 | Limited amounts of data. Note that data columns have data types |
| FCC-1611250930-JGU_SpreadsheetTemplate_V0100.xlsx | MS Excel proprietary spreadsheet | 1.0 | Limited amounts of data. Note that data columns have data types |
| demodata.csv | Comma separated value file | 1.0 | Data series |
| demodata_description.csv | Comma separated value file | 1.0 | Description of the .csv data file contents in a .csv file |
| characterisation_metadata.ods | Spreadsheet | 2.0 | Metadata file for a set of characterised samples |
| metadata.ods | Spreadsheet | 1.0 | Metadata file for a data set |
| Creative Commons Attribution 4.0 International (CC BY 4.0) licence | Webpage | 4.0 | Contents of licence.txt file |

Re-use of Existing Data

Existing data from past and ongoing research and development projects in the scope of the FCC study on geomodelling will serve as basis for the data files.

Origin of the Data

The data stem from

  • past and present subsurface investigations in the region carried out for different (non FCC) purposes (e.g. University of Geneva, BRGM in France),
  • past and present subsurface investigations contracted by CERN for different (non FCC) purposes
  • subsurface investigations contracted by CERN for FCC purposes from 2022 onwards

Expected Size of Data


The size of the data is not yet known. Initial experience with storing results from different kinds of measurements will permit revising this initial data management plan. The largest contributions to the data size will stem from images, such as microscopic images of sample characteristics, that are stored in high-resolution bitmap format. However, the total data set size for a single sample characterisation is expected to be in the order of tens of MB only.

Data Utility

Within the Consortium:

The data sets will be shared within the consortium as the working baseline to

  • develop excavation materials re-use scenarios
  • establish a 3D subsurface model of the project's perimeter
  • produce the scientific publications, to verify and validate the results through repeated experiments at different locations

Beyond the Consortium:

The data can be used by independent researchers and engineers to better understand the contents and conclusions of the scientific publications that base their findings on the data. Furthermore, independent researchers can use the files to produce figures and publications, showing comparisons of their own results and the project results. Scientists can also use the data files to repeat the experiments and measurements to verify and validate the project's research. The data sets can be used by a world-wide set of researchers, engineers and entrepreneurs to develop credible use-cases for excavation materials. The data sets can be used by public entities to increase their understanding of the geology in the France/Switzerland border region. The data sets may also be used by scientific writers and the press to produce high-quality infographics, demonstrating the impact potentials of the technology.

Fair Data

Making Data Findable, Including Provisions for Metadata

Discoverability

The ZENODO.ORG platform will be used to make the data openly accessible and discoverable. Links to the data will be made available on the FCC Twiki collaborative web site together with metadata describing the data sets once they are released on ZENODO. A link will also be provided at the FCC public website fcc.cern. The data will be indexed using the EU Open Data Portal. Since the open data support the quality and credibility of the open publications, all data are discoverable through the scientific publications. Each scientific publication will include Digital Object Identifiers (DOI) that point to the associated open data sets.

Identification

Each data set will carry a DOI as unique and persistent identifier.

Data sets will be referenced in scientific publications and, if the open data platform permits, scientific papers based on the data will be linked on the open data platform. The DOI is reserved when a ZENODO entry is created, before any data are uploaded to the platform. At this point, the data set is not published and its visibility is classified as "Closed Access".

Metadata

The data sets follow the EU Open Data Portal Metadata definitions. From the comprehensive set of fields, a minimum set will be provided. The Dublin Core Metadata Initiative will be followed as much as possible. For all characterisations, at least the following metadata will be provided via the ZENODO upload form and the individual metadata files in the data folders:

| Metadata element | Description |
| Project identifier | Points to a subfolder with the same name, holding the data |
| Title | Meaningful name of the sample characterisation |
| Alternative title | Other identifier in concise format |
| Description | Brief description of the data set |
| Keywords | List of keywords according to the Library of Congress terms |
| Identifier | Digital Object Identifier (DOI) reserved for this data set |
| URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system |
| Dataset type | Type of the bore log or material sample (e.g. cube, surface, loose rock) |
| Documentation | Detailed description of the dataset content |
| Format | File type of the data set, usually a compressed ZIP archive |
| Issue date | Date of the first issue |
| Modification date | Last date the data sample set was modified |
| Publisher | Usually the FCC collaboration |
| Contact point | Name of the organisation and service at the organisation that is in charge of that data sample |
| Contact full address | Address of the contact and organisation in charge of that data sample |
| Contact e-mail | E-mail address of the contact person |
| Contact name | First name and last name of the person who is in charge of the data sample |
| Contact Web page | A web page |
| Version number | Major and minor version number of the published data set. A minor version number 0 indicates a released version |
| Version description | Incremental change record |
| Licence | Link to licence text |
| Owner | Name of the person who must authorize the release of specifically marked information items |
| Status | One of IN WORK (minor version not equal 0), RELEASED (minor version number is 0), INVALID |
| Materials | A list of individual materials present in the sample as far as known |
| Dimensions | Sample dimensions (mass, diameter, length, width) |
| Origin | Geographical coordinates of the sample's origin |
| Source | Who provided the sample |
| History | A record of actions at different locations that indicate how the sample was obtained, used and characterised |
| Additional information | Free text with additional comments |

The folder data of the CERNBOX file share contains a file called CATALOG.ODS that provides the most important metadata, notably the project identifier of the sample characterization that permits pointing to the folder in which the open data are stored. The geo database catalog is also available on the geo database Twiki page.

The following data description elements are collected for the catalog:

| Data element | Description |
| Data set identifier | Identifier in format LLL-YYMMDD |
| URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system. A cernbox link may either be public or accessible only to members of the e-group if the data are not yet published |
| DOI | Digital Object Identifier, filled if reserved in ZENODO |
| Title | Concise name of the data set |
| Version | Current version of the data set |
| Status | Release status of the data set (RELEASED, IN WORK, INVALID) |
| Organisation | Short name of the organisation that created and manages this data set |
| Contact name | Name and e-mail (hyperlink) of the person who serves as contact point for this data set |
| Type | Type of the data set, e.g. a bore log or material sample (e.g. cube, surface, loose rock) |
| Source | Where the materials for this data set have been obtained from (e.g. contracted extraction, other project, literature) |
| Location N/E | Geographical coordinates where the materials that relate to this data set have been taken from |
| Description | Brief description of the data set |
| Last update | Date when this record has been updated |

NOTE: It is understood that these fields duplicate fields that are also stored at the lower level of the measurement data set folders. The repetition at the higher level serves to create a simple-to-use catalogue in the project.

Each data set is stored in a separate sub-folder.

Each data set sub-folder in turn contains either a) folders for the different samples that have been characterised or b) directly the folders for the characterisation data of the entire data set.

If a data set folder has sample sub-folders, then each sample sub-folder contains in turn a METADATA.ODS file that describes the specific data set according to the metadata elements indicated above.

Versioning

A dataset has a major (MM) and a minor (mm) version number, separated by a dot (MM.mm).

NOTE: The Zenodo recommended patch number is not used. It should always be “.0”.

If the minor version number is 0, the data set is released. Any minor version number different from zero indicates a data set that is in work. Version 0.0, the first draft, is the exception: it is also in work. At each release, the major version number is incremented by one and the minor version number is reset to 0. For each change in an "in work" version, the minor version number is incremented by one. The first "in work" version starts with a major version number equal to 0.

Examples:

V 0.0 - the first draft version

V 0.1 - a second draft version

V 1.0 - the first released version

V 1.1 - an update to Version 1.0, not yet released

V 2.0 - another released version. Version 1.0 is now invalid.

Versions go together with the dataset status. It can either be

  1. IN WORK (not released),
  2. RELEASED or
  3. INVALID (must no longer be used as reference)

Note: INVALID and IN WORK versions must be neither published nor referenced in publications.
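The versioning rules can be sketched in a few functions. The function names are hypothetical. Treating version 0.0 as a draft is an assumption taken from the examples list, where V 0.0 is the first draft version.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Split 'MM.mm' into (major, minor)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_released(version: str) -> bool:
    """A version is released when its minor number is 0.

    Assumption: 0.0 is still a draft, as in the examples list.
    """
    major, minor = parse_version(version)
    return minor == 0 and major >= 1


def next_in_work(version: str) -> str:
    """Each change to an 'in work' version increments the minor number."""
    major, minor = parse_version(version)
    return f"{major}.{minor + 1}"


def next_release(version: str) -> str:
    """Each release increments the major number and resets the minor number."""
    major, _ = parse_version(version)
    return f"{major + 1}.0"
```

The INVALID status cannot be derived from the version number alone and therefore has to be recorded separately in the metadata.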

Naming Convention

Data Set Identification

A data set comprises multiple samples and sample analysis records.

The project identifier for a data set uses the following convention: LLL-YYMMDD

where LLL is the three-letter abbreviation of the organisation at which the data set is created, e.g. TUW for Technische Universität Wien.

YY stands for the last two digits of the current year, e.g. 18 for 2018.

MM stands for the two digits of the current month, e.g. 04 for April.

DD stands for the two digits of the current day of month, e.g. 17 for the seventeenth day.

A complete example for a project identifier is TUW-180417.
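Building and checking such an identifier can be sketched as follows, using the example values given above for YY, MM and DD. The function names are hypothetical, and the sketch assumes that LLL consists of three uppercase letters.

```python
import re
from datetime import date, datetime

# Assumption: LLL is three uppercase letters, followed by a dash and six digits.
ID_PATTERN = re.compile(r"(?P<org>[A-Z]{3})-(?P<ymd>\d{6})")


def make_id(org: str, day: date) -> str:
    """Build LLL-YYMMDD from an organisation abbreviation and a date."""
    if not re.fullmatch(r"[A-Z]{3}", org):
        raise ValueError(f"not a three-letter abbreviation: {org!r}")
    return f"{org}-{day:%y%m%d}"


def parse_id(identifier: str) -> tuple[str, date]:
    """Split LLL-YYMMDD into its parts; raise ValueError if it is malformed."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not of the form LLL-YYMMDD: {identifier!r}")
    # strptime rejects impossible dates such as month 13.
    # Python reads two-digit years 00 to 68 as 2000 to 2068.
    return match["org"], datetime.strptime(match["ymd"], "%y%m%d").date()


print(make_id("TUW", date(2018, 4, 17)))
```

Parsing the date part instead of only matching six digits also catches identifiers with an impossible month or day.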

The three-letter abbreviations of some example organisations that typically carry out material characterisations are shown in the table below. This document will be regularly updated. The three-letter abbreviations do not coincide with the typical abbreviations used to identify an organisation in EC projects. They merely serve to form unique project identifiers for sample characterisations.

| Abbreviation | Organisation |
| CRN | CERN |
| MUL | Montanuniversität Leoben |
| GEV | Université de Genève |

Sample and Analysis Identification

Files that relate to samples and data from different analysis processes are placed in subfolders of the data set according to this structure:

LLL-YYMMDD is the name of the folder for a data set that contains multiple sample characterisations.

LLL-YYMMDD-{running number}-{type} is the name of a sub-folder in LLL-YYMMDD that holds the analysis data of a specific sample or characterisation.

{type} can be

  1. either sample to indicate that the folder contains data of a single sample that has been analysed with one or several methods or
  2. {characterisation method abbreviation} to indicate that this folder holds data for a specific characterisation method.

Today, the following characterisation methods are registered:

| Characterisation method abbreviation | Characterisation method name | Description |
| SEM | Scanning Electron Microscopy | n/a |
| TEM | Transmission Electron Microscopy | n/a |
| OMI | Optical Microscopy | n/a |

If a sub-folder contains a sample that has been analysed with multiple methods, it contains sub-folders that are named according to the abbreviation of the specific characterisation method.
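The sub-folder naming rule can be checked with a small parser. This is a sketch: the function name parse_subfolder is hypothetical, and the set of methods reflects only the three abbreviations registered in the table above.

```python
import re

# Characterisation methods registered today, per the table above.
METHODS = {"SEM", "TEM", "OMI"}

# Assumption: LLL is three uppercase letters and the running number is decimal.
SUBFOLDER = re.compile(
    r"(?P<dataset>[A-Z]{3}-\d{6})-(?P<number>\d+)-(?P<type>sample|[A-Z]{3})"
)


def parse_subfolder(name: str) -> tuple[str, int, str]:
    """Split LLL-YYMMDD-{running number}-{type} into its three parts."""
    match = SUBFOLDER.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected sub-folder name: {name!r}")
    kind = match["type"]
    if kind != "sample" and kind not in METHODS:
        raise ValueError(f"unregistered characterisation method: {kind}")
    return match["dataset"], int(match["number"]), kind
```

When further characterisation methods are registered, only the METHODS set needs to be extended.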

Filename and Folder Naming

The filename of a dataset contains a clear, concise and very short name that identifies the contents. Words are in lowercase and separated by underscores ("_"). CamelCase is discouraged. Filename extensions are encouraged to ease the understanding of the folder contents.

All data sets are stored in the folder data of the fccgeo EOS/CERNBOX project.

Folder Structure

Each top level data set folder contains at least the following files:

  • LICENCE.TXT - Text of the Creative Commons CC BY 4.0 licence. Additional note about the data creator if needed including specific clauses on a case-by-case basis
  • METADATA.ODS - the metadata of the data set including the change track record

Each data set folder contains subfolders. Those can be folders that hold the information about the entire data set from which the samples are taken and sub-folders for individual sample analysis tasks.

Collaboration members shall report needs for further characterization methods in a timely fashion so that the template folder at top level can include further examples. The file and folder contents naming convention will evolve according to project needs and with growing experience.

One sub-folder needs to be created in a data set folder for each sample that is analysed with multiple techniques or for each analysis technique as indicated above.

Each sample-specific sub-folder holds a CharacterisationMetadata.ods file that describes the sample and the analysis carried out on that sample. Depending on the numbers of samples taken from the original material lot and the different types of characterisations performed, there exist two options:

  1. Either there exists one sub-folder in the data set for each sample taken from the material lot (preferred) or
  2. there exists one sub-folder for each characterisation

The sub-folders are named by extending the data set naming conventions with a dash (-) followed by a sequence number, another dash (-) and the word sample or the abbreviation of the characterisation method.

If a single sample is analysed with different analysis methods, the sample folder contains one subfolder per analysis method that has the name of the analysis method (e.g. SEM, TEM).

Each sample or characterisation folder contains the following sub-folders:

  • img for images
  • raw for any raw data files that devices deliver during the analysis, including calibration data files
  • results for the derived analysis data and results of the characterisation
  • doc for any additional documentation such as manuals, process descriptions, sample preparation instructions
  • misc for any additional information files

The following image shows the two possible folder structures. The structure example and the files can be found in the template folder.

data_set_folder_structure_documentation.png

Another example is re-produced in tabular form:

Path name Description
/eos/project/f/fccgeo/data/CRN-200301 Folder for a bulk of materials taken at CERN
/eos/project/f/fccgeo/data/CRN-200301/Metadata.ods The metadata file that describes that material taken and the list of analysis performed
/eos/project/f/fccgeo/data/CRN-200301/img Sub-folder that holds images that relate to the entire data set (e.g. of the borelog)
/eos/project/f/fccgeo/data/CRN-200301/raw Sub-folder that holds the raw data files (e.g. the scanned borelog)
/eos/project/f/fccgeo/data/CRN-200301/results Sub-folder that holds the analysis results that relate to the entire data set
/eos/project/f/fccgeo/data/CRN-200301/doc Sub-folder that holds the additional documentation that permits better understanding the data set
/eos/project/f/fccgeo/data/CRN-200301/misc Sub-folder that holds miscellaneous additional information files
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample Sub-folder for a qualitative analysis of a part of the material extracted
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/CharacerisationMetadata.ods Metadata file that describes the subset of the materials and the characterisation
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/img Sub-folder with images of the sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/results Sub-folder with results of the sample analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM Sub-folder with data relating to the SEM analysis method of that specific sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/img Sub-folder with images from the SEM analysis of that specific sample
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/raw Sub-folder with the SEM device raw data
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/doc Sub-folder with the SEM device description, sample preparation information and process description
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM Sub-folder for a spectroscopy analysis of another part of the material extracted
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/CharacerisationMetadata.ods Metadata file that describes the subset of the materials and the characterisation
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/img Sub-folder that holds images taken during the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/raw Sub-folder that holds the raw data files obtained during the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/results Sub-folder that holds the analysis results and result data files obtained after the analysis
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/doc Sub-folder that holds the additional documentation that permits better understanding the results (e.g. device manuals)
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/misc Sub-folder that holds miscellaneous additional information files
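Creating the standard sub-folders for a new sample or characterisation folder, as laid out in the table above, can be sketched as follows. The function name is hypothetical and the sketch does not create any metadata file.

```python
from pathlib import Path

# Standard sub-folders of a sample or characterisation folder.
STANDARD_SUBFOLDERS = ("img", "raw", "results", "doc", "misc")


def create_characterisation_folder(dataset_dir: Path, number: int, kind: str) -> Path:
    """Create {data set}-{running number}-{type} with its standard sub-folders.

    kind is either 'sample' or a characterisation method abbreviation.
    """
    folder = dataset_dir / f"{dataset_dir.name}-{number}-{kind}"
    for sub in STANDARD_SUBFOLDERS:
        # exist_ok keeps the call safe to repeat on an existing folder.
        (folder / sub).mkdir(parents=True, exist_ok=True)
    return folder
```

For example, calling the function with the data set folder CRN-200301, the running number 2 and the type SEM yields the folder CRN-200301-2-SEM from the table.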

Data Format

The combined set of sample characterization data is uploaded as a compressed archive file in .ZIP format to ZENODO together with the following separate files:

  • licence.txt (text of the CC BY 4.0 licence in UTF-8 format)
  • METADATA.ODS (spreadsheet describing the entire data set)
  • ./{subfolders of selected sample characterisations to be published} (organisation of the sub-folders can either be by sample or by characterisation method)

Note: Each subfolder corresponding to a specific measurement action will contain only those files that the collaboration members agree to publish openly. Usually these comprise spreadsheets in open format (.ODS) and comma separated value files (.CSV) as well as high resolution images (.PNG, .TIFF, .JPEG), vector plots (.EPS, .PDF) and summary documents (.PDF).
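Packing the selected sub-folders into a ZIP archive can be sketched with the Python standard library. The function name build_archive is hypothetical. The sketch assumes that the archive holds only the sub-folders agreed for publication, while licence.txt and METADATA.ODS are uploaded as separate files.

```python
import zipfile
from pathlib import Path


def build_archive(dataset_dir: Path, subfolders: list[str], archive: Path) -> Path:
    """Pack only the listed sub-folders of a data set into a ZIP archive."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in subfolders:
            for path in sorted((dataset_dir / name).rglob("*")):
                if path.is_file():
                    # Store paths relative to the data set folder.
                    zf.write(path, path.relative_to(dataset_dir).as_posix())
    return archive
```

Listing the sub-folders explicitly, instead of packing the whole data set folder, reduces the risk of publishing files that are not cleared for release.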

Keywords

Each dataset will at least be tagged with the following keywords:

  1. FCC
  2. FCCIS (if the data set is part of the FCCIS EC co-funded project)
  3. H2020 (if the data set was created as part of an EC co-funded project)

In addition, appropriate keywords from the Library of Congress Subject Headings classification will be added (see http://id.loc.gov/authorities/subjects.html).

The keywords must include at least:

  1. discipline
  2. sample type
  3. type of materials
  4. properties or methods used for characterization

For each keyword selected from the following terms, both the term and its identifier link need to be entered in the two distinct fields foreseen in the ZENODO upload webform:

Term                              Identifier link (URI)

Disciplines
Geology                           http://id.loc.gov/authorities/subjects/sh85054037
Engineering geology               http://id.loc.gov/authorities/subjects/sh85043221
Materials science                 http://id.loc.gov/authorities/subjects/sh85082094

Sample type
Prospecting geophysical methods   http://id.loc.gov/authorities/subjects/sh85107597
Electric prospecting              http://id.loc.gov/authorities/subjects/sh85041941
Magnetic prospecting              http://id.loc.gov/authorities/subjects/sh85079731
Seismic prospecting               http://id.loc.gov/authorities/subjects/sh85119624
Geophysical well logging          http://id.loc.gov/authorities/subjects/sh85054183

Materials
Molasse                           http://id.loc.gov/authorities/subjects/sh85086543
Limestone                         http://id.loc.gov/authorities/subjects/sh85077017
Moraines                          http://id.loc.gov/authorities/subjects/sh85087196
Gravel                            http://id.loc.gov/authorities/subjects/sh85056544

Properties
Abrasion resistance               http://id.loc.gov/authorities/subjects/sh2008000012
Soil porosity                     http://id.loc.gov/authorities/subjects/sh86005323
Soil permeability                 http://id.loc.gov/authorities/subjects/sh85124367
Tensility                         ?

Characterisation
Physical measurements             http://id.loc.gov/authorities/subjects/sh85101564
Microscopy                        http://id.loc.gov/authorities/subjects/sh92003369
Transmission electron microscopy  http://id.loc.gov/authorities/subjects/sh93001918
Scanning electron microscopy      http://id.loc.gov/authorities/subjects/sh91002757
Compression testing               http://id.loc.gov/authorities/subjects/sh85082069
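
The keyword rules above can be checked before an upload with a small helper. The sketch below is illustrative only: build_keywords is a hypothetical function, the category labels are shorthand for the four required groups, and the "term"/"identifier" field names are assumptions about how the two webform fields might be represented. The terms and links are copied from the list above.

```python
# A few rows copied from the keyword list: term -> (category, identifier link).
LCSH_TERMS = {
    "Geology": ("discipline", "http://id.loc.gov/authorities/subjects/sh85054037"),
    "Seismic prospecting": ("sample type", "http://id.loc.gov/authorities/subjects/sh85119624"),
    "Molasse": ("material", "http://id.loc.gov/authorities/subjects/sh85086543"),
    "Soil porosity": ("property or method", "http://id.loc.gov/authorities/subjects/sh86005323"),
    "Microscopy": ("property or method", "http://id.loc.gov/authorities/subjects/sh92003369"),
}

# Shorthand labels for the four groups that every data set must cover.
REQUIRED_CATEGORIES = {"discipline", "sample type", "material", "property or method"}


def build_keywords(terms, fccis=False, ec_cofunded=False):
    """Return (project keywords, subject entries) or raise if a category is missing."""
    keywords = ["FCC"]
    if fccis:
        keywords.append("FCCIS")
    if ec_cofunded:
        keywords.append("H2020")
    covered = {LCSH_TERMS[term][0] for term in terms}
    missing = REQUIRED_CATEGORIES - covered
    if missing:
        raise ValueError(f"missing keyword categories: {sorted(missing)}")
    subjects = [{"term": term, "identifier": LCSH_TERMS[term][1]} for term in terms]
    return keywords, subjects
```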

Metadata Standards

The Dublin Core Metadata Initiative (http://dublincore.org) will be followed as far as reasonably applicable. Metadata examples for sample measurement campaigns, such as http://icatproject-contrib.github.io/CSMD/csmd-4.0.html and an existing sample database at CERN, have been considered. However, to the best of our knowledge, no domain-specific metadata standard exists for the sample characteristics identification campaigns specified in the project. Therefore, a column-oriented data format with an explanation of the columns will be created in the scope of this project.

Templates

Template folder and files can be found in the folder fccgeo/templates.

The folder fccgeo/templates/CRN-200728 is an empty example data set. To create a new data set, copy the entire folder into the folder /fccgeo/data and rename the copy following the data set naming convention LLL-YYMMDD. It contains examples of the required files and of the characterisation subfolders.
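
Copying the template could be scripted roughly as follows. This is a sketch under assumptions: the reading of LLL as three upper-case letters and YYMMDD as a valid calendar date is inferred from the example name CRN-200728, and both function names are hypothetical.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Assumption: LLL is three upper-case letters, YYMMDD a six-digit date.
DATASET_NAME = re.compile(r"[A-Z]{3}-(\d{6})")


def is_valid_dataset_name(name):
    """Check the LLL-YYMMDD pattern and that YYMMDD is a real calendar date."""
    match = DATASET_NAME.fullmatch(name)
    if match is None:
        return False
    try:
        datetime.strptime(match.group(1), "%y%m%d")
    except ValueError:
        return False
    return True


def create_dataset(template_dir, data_dir, name):
    """Copy the template folder to data_dir/name and return the new path."""
    if not is_valid_dataset_name(name):
        raise ValueError(f"{name!r} does not follow the LLL-YYMMDD convention")
    target = Path(data_dir) / name
    shutil.copytree(template_dir, target)  # raises if the target already exists
    return target
```

Because shutil.copytree refuses to overwrite an existing folder, an existing data set cannot be replaced by accident.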

Storage Administration and Access Permissions

Data Store Location

Data from subsurface samples are stored on a dedicated cernbox/eos data repository. This repository can be accessed either via a website or via a cernbox client application.

The path to the data share reachable directly on lxplus is /eos/project/f/fccgeo

The data share is owned by the service account fccgeo (fcc.geo@cern.ch).

If you have access to the repository through membership of one of the e-groups, you can directly access the share using a web browser:

https://cernbox.cern.ch/index.php/apps/files?dir=/__myprojects/fccgeo

Access Permissions

Access to the data store is managed via three e-groups following the cernbox access documentation:

cernbox-project-fccgeo-admins members have full access to the project in the cernbox website and can add readers and writers to the respective e-groups.

cernbox-project-fccgeo-writers members can read, write and delete in the project space. They can access the cernbox share only via the pathname.

cernbox-project-fccgeo-readers members can only read the files in the project space. They can access the cernbox share only via the pathname.

Making Data Openly Accessible

Data from characterization campaigns will be made openly available only after approval of all persons who were involved in that characterization. Publication on Zenodo follows a quality management Data Release Procedure.

Note: Persons publishing data on the ZENODO platform need to sign up with an account at that platform at https://zenodo.org/signup/.

Increase Data Re-Use Through Clarifying Licenses

Licence

Openly accessible data will be licensed under Creative Commons CC BY 4.0 (see https://creativecommons.org/licenses/by/4.0/).

Users are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

for any purpose, also commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices: You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation documented by CC BY 4.0. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

Timing

Data will be made available at the latest when an accompanying scientific publication that references the data sets is published. All data will be available by the end of the project at the latest, including data sets that are not referenced in scientific publications.

Data publication may occur after the six-month embargo period permitted by EC H2020 has elapsed.

Published data may comprise only a subset of the data, namely those data that are not subject to access restrictions and whose publication would not violate other existing licence restrictions.

Re-use

Published data can be used by other scientists. Original data revealing particular sample preparation or analysis techniques that are subject to access restrictions will only be usable by researchers upon explicit request and approval of the characterization campaign manager and the data owner.

Validity

The data will remain usable until the repository withdraws the data or goes out of business.

Quality Assurance

Data sets, metadata and measurement setup and procedure description will be reviewed by at least one peer prior to engaging the release procedure. The author and the reviewer are named in the metadata.

Data sets, metadata and measurement setup and procedure description will be marked as RELEASED only after approval of the measurement campaign manager (e.g. project supervisor) and one additional reviewer. The approvers are named in the metadata.

Review and release includes a validation of the measured sample, the measurement setup, conditions, procedure and equipment as well as sanity checks against similar studies and control of systematic errors.

In case of data quality uncertainties after release, a new version IN WORK is created and the released data set version is marked INVALID.
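
The status handling described above can be pictured as a small state machine. The following sketch is an illustration only: the class and function names are hypothetical, the integer version counter is an assumption, and the status labels IN WORK, RELEASED and INVALID are taken from the text.

```python
from dataclasses import dataclass, field

IN_WORK, RELEASED, INVALID = "IN WORK", "RELEASED", "INVALID"


@dataclass
class DatasetVersion:
    version: int  # assumption: versions are numbered 1, 2, 3, ...
    author: str
    status: str = IN_WORK
    reviewers: list = field(default_factory=list)  # peers who reviewed before release
    approvers: list = field(default_factory=list)  # named in the metadata on release


def release(dataset, campaign_manager, additional_reviewer):
    """Mark a peer-reviewed IN WORK version as RELEASED after both approvals."""
    if dataset.status != IN_WORK:
        raise ValueError("only an IN WORK version can be released")
    if not dataset.reviewers:
        raise ValueError("at least one peer review is required before release")
    if campaign_manager == additional_reviewer:
        raise ValueError("the additional reviewer must be a second person")
    dataset.approvers = [campaign_manager, additional_reviewer]
    dataset.status = RELEASED


def reopen(dataset):
    """Mark a RELEASED version INVALID and return a new IN WORK version."""
    if dataset.status != RELEASED:
        raise ValueError("only a RELEASED version can be invalidated")
    dataset.status = INVALID
    return DatasetVersion(version=dataset.version + 1, author=dataset.author)
```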

The description of the analysis/characterisation setup (materials and method) is annotated with product references.

The measurement conditions are described.

The measurement location, date and time (periods) are noted.

Any potential and known adverse effects (environmental influences, influences of the measurement equipment) are described in the metadata.

Allocation of Resources

Cost Estimate

A person at CERN, reachable through the service account fcc.geo@cern.ch, keeps the measurement data sets and performs the publication in the open data repository. The estimated effort to manage a data set is 40 hours; at 10 data sets per year this amounts to 400 hours, or 10 weeks, per year over the entire project period.

This resource is covered by the project management funds and CERN matching resources.

Note: the project coordination office will track the actual efforts and regularly update this estimation.

Each researcher in the project is responsible for creating the data sets in the adopted open data format, providing the metadata files, describing the measurement setup, anonymising or selecting the data for publication, reviewing the data sets and performing the release process using the CERN-provided storage infrastructure (EOS, cernbox) and the ZENODO platform. The estimated effort is another 40 hours per data set; at 10 data sets per year this amounts to a further 10 weeks per year over the entire project period.

This resource is covered by the organisations who carry out the measurement campaigns.

Note: The participating institutes are strongly encouraged to track the time they are spending to prepare the data sets and to publish them and to report their actual estimates to the Coordinator.
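
The two effort estimates above follow from the same arithmetic, which can be re-run when the tracked figures change. In this sketch the 40-hour working week is an assumption implied by the conversion of 400 hours into 10 weeks, and the combined total is derived here rather than stated in the plan.

```python
HOURS_PER_DATA_SET = 40   # estimate from the plan, per role
DATA_SETS_PER_YEAR = 10   # estimate from the plan
HOURS_PER_WEEK = 40       # assumption implied by "400 hours or 10 weeks"


def yearly_effort(hours_per_data_set, data_sets_per_year, hours_per_week=HOURS_PER_WEEK):
    """Return the yearly effort as (hours, working weeks)."""
    hours = hours_per_data_set * data_sets_per_year
    return hours, hours / hours_per_week


curation_hours, curation_weeks = yearly_effort(HOURS_PER_DATA_SET, DATA_SETS_PER_YEAR)
researcher_hours, researcher_weeks = yearly_effort(HOURS_PER_DATA_SET, DATA_SETS_PER_YEAR)

print(f"Data management at CERN: {curation_hours} hours, {curation_weeks:g} weeks per year")
print(f"Researchers:             {researcher_hours} hours, {researcher_weeks:g} weeks per year")
print(f"Combined (derived):      {curation_hours + researcher_hours} hours per year")
```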

Data Management Responsibilities

This data management plan is maintained by the FCC office at CERN (fcc.office@cern.ch).

All project members at the co-operating organisations commit to cooperate on the establishment of this DMP and to deliver the required information such that the associated deliverables and milestones can be produced in due time with the required quality levels:

Data storage and backup responsibilities are covered by the data repository provider (CERN).

The CERN project repository is managed by CERN IT department.

Service account holder fcc.geo@cern.ch manages the data store.

The FCC secretariat and office (fcc.office@cern.ch) provides support for the upload to the CERNBOX/EOS data storage system and performs a formal check (file integrity, naming, metadata completeness).

Long-term data preservation will be ensured by CERN at no additional cost.
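
The file integrity part of the formal check mentioned above could be supported by a checksum manifest. The sketch below is one possible approach and not a documented project procedure: the function names are hypothetical and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file below root (relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed or modified since the manifest was built."""
    current = build_manifest(root)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

A manifest built by the author before the upload and checked again after the upload shows whether any file was altered or lost in transfer.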

Data Security

All data delivered to the CERN project repository EOS/CERNBOX storage system is backed up by CERN's central IT services. In addition, a copy of released data will be kept on the ZENODO platform. Both services are intended for long-term storage of scientific research data. Upon unintentional loss of data (misuse of the collaborative workspace, accidental removal), the fcc.geo@cern.ch service account holder needs to be contacted via e-mail. That person will work with CERN's IT services to restore the latest known copy. No additional costs are incurred for storage, backup and restore activities.

Project members can provide non-public data sets over HTTPS, after authentication, by sharing a cernbox link. The share is to be requested from the service account holder, so that the transfer takes place in a reasonably secure fashion that counteracts data manipulation.

Sensitive data, i.e. non-anonymized data sets, can only be accessed by the author, the measurement campaign management and the project's IT managers.

Access to sensitive data may be granted through a request to the network coordinator (CERN) with a justification for the request. Access will be granted on a case-by-case basis in agreement with the measurement campaign manager and, if samples, analysis and products from industrial partners are involved, in agreement with the sample owners. The data will be communicated in electronic format from the network coordinator to the data requestor in digitally signed and encrypted form. An additional IP access process, such as the establishment of a Non-Disclosure Form may apply.

Every collaboration member must inform the network coordinator without delay if a person affiliated with the institute (associated or employed) who has access to the project data leaves the institute. In this case, the network coordinator will revoke that person's access to the data as soon as technically possible, resources permitting (working hours).

Note: E-mail is not considered a secure communication channel for data and metadata files. Data can be modified and it is unclear what fields have been modified with respect to the original data source. Therefore only a link to the authentic data source shall be considered reliable information.

Ethical Aspects

Sensitive information will be kept secure. Access to non-anonymized data is managed by the network coordinator in close cooperation with the organisation that provides the data set. Non-anonymized data will only be communicated in encrypted and digitally signed form.

-- JohannesGutleber - 2020-07-27

Topic revision: r11 - 2021-03-17 - MaximilianHaas
 