Data management of GEO database
Data summary
Purpose
The acquisition of geological data and its analysis contribute essentially to the project's technical risk management plan. The analysis results serve to (1) obtain credible project construction cost and schedule estimates, (2) enable a world-wide community of engineers and entrepreneurs to propose credible and feasible re-use possibilities for excavated tunnel material and (3) increase the outreach of the scientific publications based on this data.
The data will be published on ZENODO to provide transparency and data clearance for potential re-use scenarios. For quality management purposes, all raw data, analysis results and ancillary data such as device calibrations and metadata will be stored at CERN in an EOS/CERNBOX repository with limited access. Some proprietary raw data and selected analysis results that are subject to publication constraints and time-based embargoes will also be kept in this internal data repository. Our approach enables follow-up projects and future generations of researchers to continue building upon existing data sets, to validate the results and to document the improvement of technologies and techniques in a verifiable manner. This approach will ensure a durable impact of the EC funding obtained by projects such as the FCCIS and DEBI EU projects beyond the project period.
Relation to the objectives of the FCC project
The objective of the FCC project and the FCCIS EC co-funded project is to develop a feasible new particle-collider based research infrastructure that serves a world-wide community of scientists until the end of the 21st century. The collection, processing, maintenance and publication of geological data will significantly help to achieve this goal. In addition, establishing a durable library of raw data and analysis results can serve a large community of researchers, engineers and entrepreneurs from different fields to:
- get a better understanding of the subsurface in the Franco-Genevois basin across France and Switzerland,
- support the development of economically viable re-use scenarios for excavated material, and
- serve as an example for a large-scale subsurface investigation project with a high green environmental impact.
Sample type & data format
Published data includes comprehensive results of rock samples that are listed among the following types:
- cores: cylindrically-shaped rock samples, which can either be fully retrieved (full core) or split (half core). Core samples typically range from several cm to m in length, with typical diameters according to oil and gas industry standard drilling procedures (e.g. 4 to 12 cm).
- plugs: small cylindrically-shaped rock samples drilled from a core. Usually, these plugs range from 2 - 8 cm in length and between 1 and 3 cm in diameter.
- cuttings: rock pieces produced by the drilling process, taken at regular intervals during drilling.
- hand pieces: these rock samples are of irregular shape and usually taken from outcrops (surface geology). Note that previous sample types can be obtained from outcrop samples if they have a sufficient volume.
Depending on the type of sample, different analyses are possible. The list above ranks the sample types qualitatively in decreasing order of the number of analyses possible. Please check the metadata of each analysis for further information on the analysed samples. Results of these sample types are used to
- derive thorough correlations for the understanding of molasse material properties for tunnelling construction and engineering purposes,
- develop a 3D subsurface basin model,
- develop re-use scenarios for expected rock classification types during the excavation process and
- create figures and plots in scientific publications, such that other researchers can compare results and full transparency and reproducibility is given.
Numeric data and plots are value tables in Open Document Spreadsheet format (.ODS) for limited amounts of data with typed columns.
Larger data series are UTF-8 encoded, comma-separated-value files (RFC 4180) in textual format ({data filename}.CSV), each accompanied by a description of the column values and data formats ({data filename}_description.csv).
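The pairing of a data file with its description file can be sketched as follows. This is a minimal illustration, not the project's tooling: the function name `write_series` and the layout of the description file (columns `name`, `unit`, `format`, `description`) are assumptions, since the templates define the actual layout.

```python
import csv
from pathlib import Path


def write_series(folder, name, columns, rows):
    """Write {name}.CSV and {name}_description.csv as UTF-8 CSV files.

    `columns` is a list of dicts with the (assumed) keys
    name, unit, format and description; `rows` holds the data values.
    """
    folder = Path(folder)
    data_path = folder / f"{name}.CSV"
    desc_path = folder / f"{name}_description.csv"

    # Data file: header row with the column names, then the values.
    with open(data_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([col["name"] for col in columns])
        writer.writerows(rows)

    # Description file: one row per column of the data file.
    with open(desc_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["name", "unit", "format", "description"]
        )
        writer.writeheader()
        writer.writerows(columns)

    return data_path, desc_path
```

Keeping the column description in a separate file leaves the data file readable by any CSV parser without special handling of comment lines.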
Images and raw measurement data files are provided by the measurement instruments and stored in subfolders with a unique name.
A document record and change track (author contact information, status, version, change reason and date, description of contents, title, origin of data including a brief description of the measurement and/or experiment setup, sample ID, analysis scientist, sampler and processor) will be included in a separate metadata file, called METADATA.ODS, for each characterization action.
Data files and images will be included in the open data sets and made available through a quality-managed data release procedure on ZENODO.
Proprietary raw and calibration data delivered by the measurement device may be published if access restrictions permit publication. These restrictions depend on the ownership of the raw data and the contractual conditions for making data openly accessible (e.g. through the EU H2020 funding instrument, data produced in the course of a project for the purpose of the project must be made openly accessible by the project consortium).
Templates for the different files are available:
Handling of existing data
Existing data from past and ongoing research and development projects within the scope of the FCC study serves as a basis for the data files and subsequent geomodelling.
Origin of data
The distinction between data holder and data owner is essential. The following list refers to data holders who permit the analysis and processing of the data within the scope of the FCC project. These data sets might originate from:
- past and present subsurface investigations in the region carried out for different (non-FCC) purposes,
- past and present subsurface investigations contracted by CERN for different (non-FCC) purposes and
- subsurface investigations contracted by CERN for FCC purposes.
Expected size of data
The exact size of the data is currently unknown. Initial experience from different measurements will permit revising this initial data management plan. The main data storage volume stems from raw seismic and borehole data as well as SEM, QEMSCAN and optical microscope images stored in high-resolution bitmap format. The following list gives a first estimate of the potential storage volume, assuming maximum data quality:
- Active and passive seismic raw data: 1 - 15 TB.
- Borehole data: 1 - 500 GB.
- High-resolution images: 1 - 2 GB.
- Laboratory measurement results (e.g. CSV files): 1 - 10,000 MB.
Data utility
Within the Consortium:
The data sets will be shared within the consortium as the working baseline to:
- develop re-use scenarios for tunnel excavation material,
- create a 3D geological subsurface basin model of the project's perimeter and
- produce scientific publications to validate results through repeated experiments at different locations.
Beyond the Consortium:
The data can be used by independent researchers and engineers to understand the contents and conclusions of the scientific publications, which base their findings on the data.
Furthermore, independent researchers can use the files to produce figures and publications that compare their results with the project's results.
Scientists can use the data files to repeat the experiments and measurements to verify and validate the project's research.
The data sets can be used by a world-wide set of researchers, engineers and entrepreneurs to develop credible re-use scenarios for tunnel excavation material.
The data sets can be used by public entities to increase their understanding about the geology in the Franco-Genevois basin.
The data sets may be used by scientific writers and the press to produce high-quality infographics, demonstrating the scientific impact and potential of the FCC project.
FAIR data
Making Data Findable, Including Provisions for Metadata
Discoverability
The ZENODO.ORG platform will be used to make the data openly accessible and discoverable. Links to the data will be made available on the FCC Twiki collaborative web site together with metadata describing the data sets once they are released on ZENODO. A link on the FCC public website fcc.cern is provided. The data will be indexed using the EU Open Data Portal.
Since the open data supports the quality and credibility of the open publications, all data are discoverable through the scientific publications.
Each scientific publication will include Digital Object Identifiers (DOI) that direct to the associated open data sets.
Identification
Each data set is labelled by a DOI as a unique and persistent identifier.
Data sets will be referenced in scientific publications and if the open data platform permits, scientific papers based on the data will be linked on the open data platform.
The DOI is reserved when a ZENODO entry is created, before any data are uploaded to the platform. At this point, the data set is not published and its visibility is classified as "Closed Access".
Metadata
The data sets follow the EU Open Data Portal Metadata definitions.
From the comprehensive set of fields, a minimum set will be provided.
The Dublin Core Metadata Initiative will be followed as much as possible.
For all characterisations, at least the following metadata will be provided via the ZENODO upload form and the individual metadata files in the data folders:
| Metadata element | Description |
|---|---|
| Project identifier | Points to a subfolder with the same name, holding the data |
| Title | Meaningful name of the sample characterisation |
| Alternative title | Other identifier in concise format |
| Description | Brief description of the data set |
| Keywords | List of keywords according to the Library of Congress Subject Headings |
| Identifier | Digital Object Identifier (DOI) reserved for this data set |
| URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system |
| Dataset type | Type of the bore log or material sample (e.g. cube, surface, loose rock) |
| Documentation | Detailed description of the dataset content |
| Format | File type of the data set, usually a compressed ZIP archive |
| Issue date | Date of the first issue |
| Modification date | Last date the data sample set was modified |
| Publisher | Usually the FCC collaboration |
| Contact point | Name of the organisation and service at the organisation who is in charge of that data sample |
| Contact full address | Address of the contact and organisation in charge of that data sample |
| Contact e-mail | E-mail address of the contact person |
| Contact name | First name and last name of the person who is in charge of the data sample |
| Contact Web page | A web page |
| Version number | Major and minor version number of the published data set. A minor version number 0 indicates a released version |
| Version description | Incremental change record |
| Licence | Link to licence text |
| Owner | Name of the person who must authorize the release of specifically marked information items |
| Status | One of IN WORK (minor version not equal to 0), RELEASED (minor version number is 0), INVALID |
| Materials | A list of individual materials present in the sample as far as known |
| Dimensions | Sample dimensions (mass, diameter, length, width) |
| Origin | Geographical coordinates of the sample's origin |
| Source | Who provided the sample |
| History | A record of actions at different locations that indicate how the sample was obtained, used and characterised |
| Additional information | Free text with additional comments |
The folder data of the CERNBOX file share contains a file called CATALOG.ODS that provides the most important metadata, notably the project identifier of the sample characterisation, which directs to the folder in which the open data is stored.
The GEO database catalog is also available on the geo database Twiki page.
The following data description elements are collected for the catalog:
| Data element | Description |
|---|---|
| Data set identifier | Identifier in format LLL-YYMMDD |
| URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system. A cernbox link may either be public or accessible only to members of the e-group if the data are not yet published |
| DOI | Digital Object Identifier, filled in once reserved in ZENODO |
| Title | Concise name of the data set |
| Version | Current version of the data set |
| Status | Release status of the data set (RELEASED, IN WORK, INVALID) |
| Organisation | Short name of the organisation that created and manages this data set |
| Contact name | Name and e-mail (hyperlink) of the person who serves as contact point for this data set |
| Type | Type of the data set, e.g. a bore log or material sample (e.g. cube, surface, loose rock) |
| Source | Where the materials for this data set have been obtained from (e.g. contracted extraction, other project, literature) |
| Location N/E | Geographical coordinates where the materials that relate to this data set have been taken from |
| Description | Brief description of the data set |
| Last update | Date when this record has been updated |
NOTE: These fields duplicate fields that are also stored at the lower measurement folder level. The repetition at the higher level provides a simple, easy-to-use catalogue for the project.
Each data set is stored in a separate sub-folder.
Each data set sub-folder contains either a) folders for the different samples that have been characterised or b) folders that directly hold the characterisation data of the entire data set.
If a data set folder has sample sub-folders, each sample sub-folder contains a METADATA.ODS file that describes the specific data set according to the metadata elements indicated above.
Versioning
A dataset has a major (MM) and a minor (mm) version number, separated by a dot (MM.mm).
NOTE: The ZENODO recommended patch number is not used. It should always be “.0”.
If the minor version number is 0, the data set is released.
Any minor version number different from zero indicates a data set that is in work.
At each release the major version number is incremented by one.
For each change in an "in work" version, the minor version number is incremented by one.
The first "in work" version starts with a major version number equal to 0.
Examples:
- V 0.1 - the first draft version
- V 0.2 - a second draft version
- V 1.0 - the first released version
- V 1.1 - an update to Version 1.0, not yet released
- V 2.0 - another released version. Version 1.0 is now invalid.
Versions comply with the data set status, which can be one of:
- IN WORK (not released),
- RELEASED or
- INVALID (must no longer be used as reference)
Note: INVALID and IN WORK versions must not be published or referenced in publications.
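The versioning rules above can be written down in a few lines. This is a minimal sketch under the stated rules only; the function names are illustrative and not part of any project tool. The INVALID status cannot be derived from the number alone, so it is left to the metadata.

```python
def parse_version(text):
    """Split a 'MM.mm' string into a (major, minor) tuple of integers."""
    major, minor = text.split(".")
    return int(major), int(minor)


def status(version):
    """A minor number of 0 marks a released version, anything else is in work."""
    _, minor = version
    return "RELEASED" if minor == 0 else "IN WORK"


def next_in_work(version):
    """Each change to an in-work version increments the minor number by one."""
    major, minor = version
    return major, minor + 1


def release(version):
    """Each release increments the major number by one and resets the minor number."""
    major, _ = version
    return major + 1, 0
```

For instance, `release(parse_version("1.1"))` yields `(2, 0)`, matching the step from V 1.1 to V 2.0 in the examples.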
Naming convention
Data set identification
A data set comprises multiple samples and sample analysis records.
The project identifier for a data set uses the convention LLL-YYMMDD, where
- LLL is the three-letter abbreviation of the organisation at which the data set is created, e.g. MUL for Montanuniversität Leoben,
- YY stands for the last two digits of the current year, e.g. 18 for 2018,
- MM stands for the two digits of the current month, e.g. 04 for April, and
- DD stands for the two digits of the current day of month, e.g. 17 for the seventeenth day.
A complete example for a project identifier is TUW-180317.
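The identifier convention can be checked mechanically. The following sketch builds and parses identifiers; the function names are hypothetical, the pattern assumes strictly three upper-case letters as the convention states, and two-digit years are assumed to fall in 2000-2099.

```python
import re
from datetime import date

# LLL-YYMMDD: three upper-case letters, a dash, six digits.
DATASET_ID = re.compile(r"^([A-Z]{3})-(\d{2})(\d{2})(\d{2})$")


def make_dataset_id(org, day):
    """Build an identifier from an organisation abbreviation and a date."""
    if not re.fullmatch(r"[A-Z]{3}", org):
        raise ValueError(f"not a three-letter abbreviation: {org!r}")
    return f"{org}-{day:%y%m%d}"


def parse_dataset_id(identifier):
    """Return (organisation, date); raise ValueError if the format is wrong."""
    match = DATASET_ID.match(identifier)
    if match is None:
        raise ValueError(f"not in LLL-YYMMDD format: {identifier!r}")
    org, yy, mm, dd = match.groups()
    # Assumption: all data sets are created between 2000 and 2099.
    return org, date(2000 + int(yy), int(mm), int(dd))
```

Parsing through `date` also rejects impossible calendar dates such as month 13.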
The three-letter abbreviations of some organisations that typically carry out material characterizations are shown in the table below. This document will be regularly updated. The three-letter abbreviations do not coincide with the abbreviations typically used to identify an organization in EC projects. They merely serve to form unique project identifiers for sample characterizations.
| Abbreviation | Organisation |
|---|---|
| MUL | Montanuniversität Leoben |
| UNIGE | Université de Genève |
| ETH | Eidgenössische Technische Hochschule |
Sample and analysis identification
Files that relate to samples and data from different analysis processes are placed in subfolders of the data set according to this structure:
- LLL-YYMMDD is the name of the folder for a data set that contains multiple sample characterisations.
- LLL-YYMMDD-{running number}-{type} is the name of a sub-folder in LLL-YYMMDD that holds the analysis data of a specific sample or characterisation.
{type} can be
- either sample to indicate that the folder contains data of a single sample that has been analysed with one or several methods or
- {characterisation method abbreviation} to indicate that this folder holds data for a specific characterisation method.
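A sub-folder name following this structure can be split into its parts as sketched below. The function name and the returned dictionary keys are illustrative assumptions; the method abbreviation is not checked against the list of registered methods.

```python
import re

# {data set identifier}-{running number}-{type}
SUBFOLDER = re.compile(
    r"^(?P<dataset>[A-Z]{3}-\d{6})-(?P<number>\d+)-(?P<type>.+)$"
)


def parse_subfolder(name):
    """Split a sub-folder name into data set, running number and type."""
    match = SUBFOLDER.match(name)
    if match is None:
        raise ValueError(f"not a valid sub-folder name: {name!r}")
    folder_type = match.group("type")
    return {
        "dataset": match.group("dataset"),
        "number": int(match.group("number")),
        "type": folder_type,
        # 'sample' folders hold one sample; anything else names a method.
        "kind": "sample" if folder_type == "sample" else "method",
    }
```

Because the type is matched last, abbreviations that contain a dash themselves, such as XRD-P, are kept intact.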
Today, the following sample analyses are registered:
| Characterisation method abbreviation | Characterisation method name | Description |
|---|---|---|
| QEMSCAN | Automated mineralogy and petrography apparatus | Modal mineralogy, lithology, grain density |
| FTIR | Fourier Transform Infrared Spectroscopy | Mineral components in mixtures to distinguish among different types of clay minerals, characterization of both di- and trioctahedral clay minerals, amounts of kaolinite and Al/Mg/Fe types, organic material |
| OMI-T | Optical Microscopy in transmitted light | Optical analysis, 2.5x to 50x magnification |
| OMI-R | Optical Microscopy in reflected light | Optical analysis, 2.5x to 50x magnification |
| pXRF | Portable X-Ray Fluorescence | Elemental composition in portable format |
| XRD-P | X-Ray diffraction on powder sample | Identification of mineral phases on randomly oriented sample |
| XRD-S | X-Ray diffraction on textured sample | Identification of clay minerals on oriented sample |
| XRD-EG | X-Ray diffraction on ethylene glycol-treated sample | Identification of swelling clays |
| ICP-MS | Inductively Coupled Plasma Mass Spectrometry | Elemental composition using mass spectrometry |
| ICP-OES | Inductively Coupled Plasma Optical Emission Spectrometry | Ultra-trace elemental composition using an optical emission spectrometer to identify leaching characteristics |
| CUV | Cuvette Tests | Identification of chemical pollutants and leaching characteristics complementary to ICP-OES |
| UCS | Uniaxial Compressive Strength | Uniaxial Compressive Strength |
| LCPC | Laboratoire Central des Ponts et Chaussées | LAB and LAC values for abrasivity and breakability |
| CER | Cerchar abrasivity | Abrasivity behavior |
| VPS | Sonic P- and S-wave velocity | Sonic velocities in longitudinal and transversal direction |
| BRZ | Brazilian tensile strength | Tensile strength behavior |
| PL | Point Load | Point load index, rock strength parameter |
| PORO_gas | Porosity | Total & effective porosity measurements using gas adsorption |
| PERM_gas | Permeability | Helium / Nitrogen permeability measurements using gas adsorption |
| CEC | Total Cation Exchange Capacity | Total identification of solved ions in pore water and pores |
| CEC-ICP | Effective Cation Exchange Capacity | Effective identification of solved ions in pore water and pores using ICP optical emission spectrometer |
| Enslin-Neff | Free Water Uptake Capacity | Water suction of samples under free swelling conditions |
| Keeling | Vapour Water Adsorption | Water adsorption under 75% NaCl atmosphere identifying specific (inner crystalline) surface |
| BET | Brunauer-Emmett-Teller | Determination of specific (outer) surface |
| TC/TOC | Total Carbon / Total Organic Carbon Content | Measurement of total carbon and total inorganic carbon content; total organic carbon content is derived via the difference method |
| MIP | Mercury Intrusion Porosimetry | Identification of pore size, pore size distribution and porosity by high-pressure mercury intrusion |
Filename and folder naming
The filename of a sample or data set contains a clear, concise and short name (e.g. borehole name or outcrop name) that describes the location, followed by the representative feature (e.g. depth) and the analysis method, e.g. fullwellname_depth_analysismethod. The "analysismethod" part is the common abbreviation of the analysis in a maximum of four letters, e.g. XRF (X-ray fluorescence) or PORO (porosity). Words are in lowercase and separated by underscores ("_"). CamelCase is discouraged. An example file name that complies with these rules is geo02_148_xrf: the sample originates from the well GEO02 (full well name) at a depth of 148 m and was analysed with an X-ray fluorescence device, i.e. an elemental (geochemical) composition analysis. Further filename extensions are encouraged to ease the understanding of the folder contents, for instance to distinguish additional data sets within the same file (several measurements) or measurements on different sample types (e.g. halfcore or cuttings).
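A file name following this convention could be assembled as in the sketch below. The function name is hypothetical, the depth is assumed to be given as a whole number, and the four-letter limit on the method abbreviation is enforced exactly as stated above.

```python
import re


def make_filename(well, depth, method, *extras):
    """Join the parts into a lowercase, underscore-separated file name.

    `depth` is assumed to be an integer; `extras` are optional
    further parts such as 'halfcore' or 'cuttings'.
    """
    tokens = []
    for part in (well, str(depth), method, *extras):
        # Lowercase and drop everything except letters and digits.
        token = re.sub(r"[^a-z0-9]+", "", part.lower())
        if not token:
            raise ValueError(f"empty file name part: {part!r}")
        tokens.append(token)
    if len(tokens[2]) > 4:
        raise ValueError("method abbreviation is longer than four letters")
    return "_".join(tokens)
```

Calling `make_filename("GEO02", 148, "XRF")` gives `geo02_148_xrf`, the example name used in the text.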
All data sets are stored in the folder data of the fccgeo EOS/CERNBOX project.
Folder structure
Each top level data set folder contains at least the following files:
- LICENCE.TXT - text of the Creative Commons CC BY 4.0 licence, with an additional note about the data creator if needed, including specific clauses on a case-by-case basis
- METADATA.ODS - the metadata of the data set including the change track record
Each data set folder contains sub-folders. These folders hold information about the entire data set from which the samples are taken and sub-folders for individual sample analysis tasks.
Collaboration members shall report needs for further characterisation methods in a timely fashion so that the template folder at top level can include further examples.
The file and folder's naming convention will evolve according to project requirements and growing experience.
One sub-folder needs to be created in a data set folder for each sample that is analysed with multiple techniques or for each analysis technique as indicated above.
Each sample-specific sub-folder holds a CharacterisationMetadata.ods file that describes the sample and the analysis carried out on that sample.
Depending on the numbers of samples taken from the original material and the different types of characterisation performed, two options exist:
- Either there exists one sub-folder in the data set for each sample taken from the material lot (preferred) or
- there exists one sub-folder for each characterisation.
Both options are valid.
The sub-folders are named by extending the data set naming conventions with a dash (-) followed by a sequence number, another dash (-) and the word sample or the abbreviation of the characterisation method.
If a single sample is analysed with different analysis methods, the sample folder contains one sub-folder per analysis method that has the name of the analysis method (e.g. SEM, TEM).
Each sample or characterisation folder contains the following sub-folders:
- img for images,
- raw for any raw data files that devices deliver during the analysis, including calibration data files,
- results for the derived analysis data and results of the characterisation,
- doc for any additional documentation such as manuals, process descriptions, sample preparation instructions and
- misc for any additional information files.
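The sub-folder layout listed above can be created with a short script. This is only a sketch: the function name is an assumption, and the metadata file is not generated here because it should be copied from the template folder.

```python
from pathlib import Path

# Sub-folders expected in every sample or characterisation folder.
SUBFOLDERS = ("img", "raw", "results", "doc", "misc")


def create_characterisation_folder(dataset_dir, number, kind):
    """Create {data set}-{number}-{kind} with its standard sub-folders.

    `kind` is either 'sample' or a characterisation method abbreviation.
    Returns the path of the new folder.
    """
    dataset_dir = Path(dataset_dir)
    folder = dataset_dir / f"{dataset_dir.name}-{number}-{kind}"
    for sub in SUBFOLDERS:
        # exist_ok makes the call safe to repeat on an existing folder.
        (folder / sub).mkdir(parents=True, exist_ok=True)
    return folder
```

The folder name is derived from the data set folder name, so the sub-folder automatically extends the data set identifier as the naming convention requires.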
The following image shows two possible folder structures. The structure example and the files can be found in the template folder.
Another example is re-produced in tabular form:
| Path name | Description |
|---|---|
| /eos/project/f/fccgeo/data/CRN-200301 | Folder for a bulk of materials taken at CERN |
| /eos/project/f/fccgeo/data/CRN-200301/Metadata.ods | The metadata file that describes the material taken and the list of analyses performed |
| /eos/project/f/fccgeo/data/CRN-200301/img | Sub-folder that holds images that relate to the entire data set (e.g. of the borelog) |
| /eos/project/f/fccgeo/data/CRN-200301/raw | Sub-folder that holds the raw data files (e.g. the scanned borelog) |
| /eos/project/f/fccgeo/data/CRN-200301/results | Sub-folder that holds the analysis results that relate to the entire data set |
| /eos/project/f/fccgeo/data/CRN-200301/doc | Sub-folder that holds the additional documentation that permits a better understanding of the data set |
| /eos/project/f/fccgeo/data/CRN-200301/misc | Sub-folder that holds miscellaneous additional information files |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample | Sub-folder for a qualitative analysis of a part of the material extracted |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/img | Sub-folder with images of the sample |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/results | Sub-folder with results of the sample analysis |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM | Sub-folder with data relating to the SEM analysis method of that specific sample |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/img | Sub-folder with images from the SEM analysis of that specific sample |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/raw | Sub-folder with the SEM device raw data |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/doc | Sub-folder with the SEM device description, sample preparation information and process description |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM | Sub-folder for a spectroscopy analysis of another part of the material extracted |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/img | Sub-folder that holds images taken during the analysis |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/raw | Sub-folder that holds the raw data files obtained during the analysis |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/results | Sub-folder that holds the analysis results and result data files obtained after the analysis |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/doc | Sub-folder that holds the additional documentation that permits a better understanding of the results (e.g. device manuals) |
| /eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/misc | Sub-folder that holds miscellaneous additional information files |
Data format
A set of sample characterisation data is uploaded as a compressed archive file in .ZIP format to ZENODO together with the following separate files:
- licence.txt (text of the CC BY 4.0 licence in UTF-8 format)
- METADATA.ODS (spreadsheet describing the entire data set)
- ./{subfolders of selected sample characterisations to be published} (organisation of the sub-folders can either be by sample or by characterisation method)
Note: Each subfolder corresponding to a specific measurement action will contain only those files that the collaboration members agree to publish openly. Usually, these comprise spreadsheets in open format (.ODS) and comma separated value files (.CSV) as well as high resolution images (.PNG, .TIFF, .JPEG), vector plots (.EPS, .PDF) and summary documents (.PDF).
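Packing the selected sub-folders into a ZIP archive could look like the sketch below. The function name and the idea of filtering by file extension are assumptions for illustration; an extension filter is only a first pass and does not replace the agreement of the collaboration members on which files to publish.

```python
import zipfile
from pathlib import Path

# File types named in the text as usually published (compared in lowercase).
OPEN_SUFFIXES = {".ods", ".csv", ".png", ".tiff", ".jpeg", ".eps", ".pdf"}


def pack_dataset(dataset_dir, archive_path, subfolders):
    """Zip the open-format files of the selected sub-folders.

    Returns the list of archive member names that were added.
    """
    dataset_dir = Path(dataset_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for sub in subfolders:
            for path in sorted((dataset_dir / sub).rglob("*")):
                if path.is_file() and path.suffix.lower() in OPEN_SUFFIXES:
                    # Store paths relative to the data set folder.
                    arcname = path.relative_to(dataset_dir).as_posix()
                    zf.write(path, arcname)
                    added.append(arcname)
    return added
```

Returning the member list makes it easy to review what would be published before the archive is uploaded.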
Keywords
Each dataset will at least be tagged with the following keywords:
- FCC
- FCCIS (if the data set is part of the FCCIS EC co-funded project)
- H2020 (if the data set was created as part of an EC co-funded project)
In addition, appropriate keywords from the Library of Congress Subject Headings classification will be added (see http://id.loc.gov/authorities/subjects.html):
The keywords need at least to include
- discipline
- sample type
- type of materials
- properties or methods used for characterization
A selected list of entries from the following keyword terms and the link to the keyword term need to be entered in the two distinct fields foreseen in the ZENODO upload webform:
Metadata standards
The Dublin Core Metadata Initiative (http://dublincore.org) will be followed as much as reasonably applicable. Metadata examples concerning sample measurement campaigns, such as http://icatproject-contrib.github.io/CSMD/csmd-4.0.html and an existing sample database at CERN, have been considered. However, to the best of our knowledge, no domain-specific metadata standard exists for the sample characterisation campaigns specified in the project. Therefore, a column-oriented data format with an explanation of the columns will be created in the scope of this project.
Templates
Template folders and files can be found in the folder fccgeo/templates.
The entire folder fccgeo/templates/CRN-200728 is an empty example. To create a new data set, copy it into the folder /fccgeo/data and rename it using the naming convention for the data set, LLL-YYMMDD. It contains the exemplary required files and exemplary characterisation subfolders.
Storage administration and access permissions
Data store location
Data from subsurface samples are stored on a dedicated cernbox/eos data repository. This repository can be accessed either via a website or via a cernbox client application.
The path to the data share, reachable directly on lxplus, is /eos/project/f/fccgeo.
The data share is owned by the service account fccgeo (fcc.geo@cern.ch).
If you have access to the repository through membership of one of the e-groups, you can directly access the share using a web browser: https://cernbox.cern.ch/index.php/apps/files?dir=/__myprojects/fccgeo
Access permissions
Access to the data store is managed via three e-groups following the cernbox access documentation:
- cernbox-project-fccgeo-admins: members have full access to the project in the cernbox website and can add readers and writers to the respective e-groups.
- cernbox-project-fccgeo-writers: members can read, write and delete in the project space. They only have access to the cernbox share via the pathname.
- cernbox-project-fccgeo-readers: members can only read the files in the project space. They only have access to the cernbox share via the pathname.
Making data openly accessible
Data from characterization campaigns will be made openly available only after approval of all persons who were involved in that characterization.
Publication on ZENODO follows a quality management Data Release Procedure.
Note: Persons publishing data on the ZENODO platform need to sign up with an account at that platform at https://zenodo.org/signup/.
Increase data re-use through clarifying licenses
Licence
Openly accessible data will be licensed under Creative Commons CC BY 4.0 (see https://creativecommons.org/licenses/by/4.0/).
Users are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
for any purpose, also commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation documented by CC BY 4.0. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Timing
Data will at latest be made available with the publication of an accompanying scientific publication that references the data sets. All data will at latest be available with the project end, even those data sets, which are not referenced in scientific publications.
Data publication may occur after the
six months EC H2020 permitted embargo period
has elapsed.
Published data may be restricted to the subset of data that is not subject to access restrictions and whose publication would not violate other existing licence restrictions.
Re-use
Published data can be used by other scientists. Original data revealing particular sample preparation or analysis techniques that are subject to access restrictions will only be usable by researchers upon explicit request and approval of the characterisation campaign manager and the data owner.
Validity
The data will remain usable until the repository withdraws the data or ceases operation.
Quality assurance
Data sets, metadata and measurement setup and procedure description will be reviewed by at least one peer from the related scientific domain prior to engaging the release procedure.
The author and the reviewer are named in the metadata.
Data sets, metadata and measurement setup and procedure description will be marked as
RELEASED
after approval of the measurement campaign manager (e.g. project supervisor) and one additional reviewer. Each approver is named in the metadata.
Review and release include a validation of the measured sample, the measurement setup, conditions, procedure and equipment, as well as sanity checks against similar studies and control of systematic errors by a person from the related scientific field. The person who conducted the measurements cannot also approve the measured data set. For scientific research (e.g. doctoral thesis), the associated academic supervisor must approve the data.
In case of data quality uncertainties after release, a new version
IN WORK
is created and the released data set version is marked
INVALID
.
The description of the analysis/characterisation setup (materials and method) is annotated with product references.
The measurement conditions are described.
The measurement location, date and time (periods) are noted.
Any potential and known adverse effects (environmental influences, influences of the measurement equipment) must be described in the metadata.
Allocation of resources
Cost estimate
A person at CERN reachable through the service account
fcc.geo@cern.ch
keeps the measurement data sets and performs the publication in the open data repository.
The estimated effort to manage a data set is 40 hours, 10 data sets per year, i.e. 400 hours or 10 weeks per year over the entire project period.
This resource is covered by the project management funds and CERN matching resources.
Note: the project coordination office will track the actual efforts and regularly update this estimation.
Each researcher in the project is responsible for creating the data sets using the adopted open data format, providing the metadata files, describing the measurement setup, anonymising or selecting the data for publication, reviewing the data sets and performing the release process using the CERN provided storage infrastructure (EOS, cernbox) and the ZENODO platform. The estimated effort is another 40 hours per data set, 10 data sets per year, i.e. 10 weeks per year over the entire project period.
This resource is covered by the organisations that carry out the measurement campaigns.
Note: The participating institutes are strongly encouraged to track the time they are spending to prepare the data sets and to publish them and to report their actual estimates to the Coordinator.
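The effort figures above can be reproduced with a short calculation. The following Python sketch is only an illustration; the 40-hour working week is an assumption inferred from the stated equivalence of 400 hours and 10 weeks, and the variable names are hypothetical.

```python
# Figures taken from the cost estimate above
HOURS_PER_DATA_SET = 40
DATA_SETS_PER_YEAR = 10

# Assumption: one working week corresponds to 40 hours
HOURS_PER_WEEK = 40

# Effort for one role (data management at CERN, or preparation by researchers)
hours_per_year = HOURS_PER_DATA_SET * DATA_SETS_PER_YEAR
weeks_per_year = hours_per_year / HOURS_PER_WEEK

# Both roles together: CERN service account holder plus participating institutes
total_hours_per_year = 2 * hours_per_year
total_weeks_per_year = total_hours_per_year / HOURS_PER_WEEK
```

Replacing the constants with tracked actual efforts gives the updated estimate that the project coordination office is expected to maintain.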
Data management responsibilities
This data management plan is maintained by the FCC office at CERN (
fcc.office@cern.ch
).
All project members at the co-operating organisations commit to cooperate on the establishment of this DMP and to deliver the required information such that the associated deliverables and milestones can be produced in due time with the required quality levels:
Data storage and backup responsibilities are covered by the data repository provider (CERN).
The CERN project repository is managed by the CERN IT department.
Service account holder
fcc.geo@cern.ch
manages the data store.
The FCC secretariat and office (
fcc.office@cern.ch
) provides support for the upload to the CERNBOX/EOS data storage system and performs a formal check (file integrity, naming, metadata completeness).
Long-term data preservation will be ensured by CERN at no additional cost.
Data security
All data delivered to the CERN project repository EOS/CERNBOX storage system is backed up by CERN's central IT services. In addition, a copy of released data will be kept on the ZENODO platform. Both services are intended for long-term storage of scientific research data. Upon unintentional loss of data (misuse of the collaborative workspace, accidental removal), the
fcc.geo@cern.ch
service account holder needs to be contacted via email. The person will interact with CERN's IT services to restore the latest known copy. No additional costs are incurred for storage, backup and restore activities.
Project members can provide non-public data sets over the HTTPS transfer protocol after authentication, by sharing a CERNBOX link with the service account holder. This is a reasonably secure procedure that counteracts data manipulation.
Sensitive data, i.e. non-anonymized data sets, can only be accessed by the author, the measurement campaign management and the project IT managers.
Access to sensitive data may be granted through a request to the network coordinator (CERN) with a justification of request. Access will be granted on a case-by-case basis in agreement with the measurement campaign manager and, if samples, analysis and products from industrial partners are involved, in agreement with the sample owners. The data will be communicated in electronic format from the network coordinator to the data requestor in digitally signed and encrypted form. An additional IP access process, such as the establishment of a Non-Disclosure Form may apply.
Every collaboration member must inform the network coordinator without delay if a person who is affiliated (associated or employed) with the institute and has access to the project data leaves the institute. In this case, the network coordinator will revoke that person's access to the data as soon as technically possible, resources permitting (working hours).
Note: E-mail is not considered a secure communication channel for data and metadata files. Data can be modified and it is unclear what fields have been modified with respect to the original data source. Therefore, only a link to the authentic data source shall be considered reliable information.
Ethical aspects
Sensitive information will be kept secure. Access to non-anonymized data is managed by the network coordinator in close cooperation with the organisation that provides the data set. Non-anonymized data will only be communicated in encrypted and digitally signed form.
--
MaximilianHaas - 2021-03-17
--
JohannesGutleber - 2020-07-27
OLD VERSION (SAVE COPY)
Geo Data Management Plan
Data Summary
Purpose
Collecting and making available the data of the geology and the analysis of the soil is an essential part of the project's technical risk management plan.
It serves to (1) obtain credible project construction cost and schedule estimates, (2) enable a world-wide community of engineers and entrepreneurs to propose credible and feasible re-use possibilities for excavation materials and (3) raise the quality of the scientific publications based on those data.
Eventually all data relevant to describe the project and to develop re-use scenarios will be made openly available on ZENODO. For quality management purposes, all raw data, analysis results and ancillary data such as device calibrations and metadata will be stored internally at CERN in an EOS/CERNBOX repository with limited access. Also, some proprietary raw data and selected analysis results that are subject to publication constraints and time-based embargoes will be kept in this internal data repository. The repository serves as the source from which data that are freely accessible for the public will be pushed to ZENODO.
Our approach permits follow-up projects and further generations of researchers continuing the work to build upon existing data sets, to validate the results and to document the improvement of technologies and techniques in a verifiable manner. This approach will ensure a durable impact of the EC funding obtained by projects such as
FCCIS and DEBI projects beyond the project period.
Relation to the Objectives of the FCC Project
The objective of the FCC project and the
FCCIS EC co-funded project is to develop a feasible new particle-collider based research infrastructure that can serve a world-wide community of scientists until the end of the 21st century. The managed collection, processing and publication of geological data will help achieve this goal. In addition, establishing a durable library of raw data and analysis results can serve a large community of researchers, engineers and entrepreneurs from different fields to
- get a better understanding of the subsurface in the region of the Geneva lake basin across France and Switzerland,
- support the development of economically viable re-use scenarios for excavated materials, namely the molasse type and
- serve as an example for a large-scale subsurface investigation project.
Types and Formats of Data
The openly accessible data will be the comprehensive result data sets of characterized samples that are used to
- develop a 3D subsurface model of the region
- develop re-use scenarios for the expected different soil types during the excavation process and
- create the figures and plots in scientific publications, such that other researchers can compare their results more easily and such that further results including historic data can be produced more quickly.
Numeric data and plots are value tables in Open Document Spreadsheet format (
.ODS
) for limited amounts of data with typed columns.
Larger data series will be stored as UTF-8 encoded, comma-separated value (RFC 4180) text files ({data filename}.CSV), each accompanied by a description of the column values and data formats ({data filename}_description.csv).
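As a minimal sketch of this two-file layout, the Python snippet below writes a UTF-8 encoded CSV data file together with its companion description file. The file name, column names and values are hypothetical placeholders, and the layout of the description file (column, unit, type, description) is an assumption, since this plan does not prescribe it.

```python
import csv
import tempfile
from pathlib import Path


def write_series(folder, name, columns, rows):
    """Write {name}.csv and {name}_description.csv into folder.

    columns: list of (column name, unit, data type, description) tuples.
    rows: list of value tuples, one value per column.
    """
    folder = Path(folder)
    data_path = folder / f"{name}.csv"
    desc_path = folder / f"{name}_description.csv"
    # newline="" lets the csv module control the line endings itself
    with open(data_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column[0] for column in columns])  # header row
        writer.writerows(rows)
    with open(desc_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["column", "unit", "type", "description"])
        writer.writerows(columns)
    return data_path, desc_path


# Hypothetical example data, not taken from any real measurement
columns = [
    ("depth_m", "m", "float", "Depth below surface"),
    ("density_g_cm3", "g/cm3", "float", "Bulk density"),
]
rows = [(10.5, 2.41), (12.0, 2.38)]

with tempfile.TemporaryDirectory() as tmp:
    data_path, desc_path = write_series(tmp, "density_profile", columns, rows)
    # Read the data file back to confirm that it parses as CSV
    with open(data_path, encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))
```

Keeping the column description in a separate file leaves the data file free of comments, so that standard CSV readers can load it without special handling.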
In addition,
images and raw measurement data files as provided by the measurement instruments will be stored in subfolders.
For all published files, a
document record and change track will be included (author contact information, status, version, change reason and date, description of contents, title, origin of the data including a brief description of the measurement and/or experiment setup) in a separate metadata file for each characterization action called
METADATA.ODS
.
Data files and images will be included in the open data sets that will be made available through a quality-managed
data release procedure on ZENODO.
Proprietary raw data delivered by the measurement instruments may be published if access restrictions permit the publication. Those restrictions depend on the ownership of the raw data and the contractual conditions for making those data openly accessible (e.g. through the EU H2020 funding instrument, data produced in the course of a project for the purpose of the project must be made openly accessible by the project consortium).
Templates for the different files are available:
Re-use of Existing Data
Existing data from past and ongoing research and development projects in the scope of the FCC study on geomodelling will serve as basis for the data files.
Origin of the Data
The data stem from
- past and present subsurface investigations in the region carried out for different (non FCC) purposes (e.g. University of Geneva, BRGM in France),
- past and present subsurface investigations contracted by CERN for different (non FCC) purposes
- subsurface investigations contracted by CERN for FCC purposes from 2022 onwards
Expected Size of Data
The size of the data is not known today. Initial experience with storing results from different kinds of measurements will permit revising this initial data management plan. The largest data volumes will stem from images, such as microscopic images of sample characteristics, that are stored in high-resolution bitmap format. However, the total data set size for a single sample characterization is expected to be in the order of tens of MB only.
Data Utility
Within the Consortium:
The data sets will be shared within the consortium as the working baseline to
- develop excavation materials re-use scenarios
- establish a 3D subsurface model of the project's perimeter
- produce the scientific publications, to verify and validate the results through repeated experiments at different locations
Beyond the Consortium:
The data can be used by independent researchers and engineers to better understand the contents and conclusions of the scientific publications that base their findings on the data.
Furthermore, independent researchers can use the files to produce figures and publications, showing comparisons of their own results and the project results.
Scientists can also use the data files to repeat the experiments and measurements to verify and validate the project's research.
The data sets can be used by a world-wide set of researchers, engineers and entrepreneurs to develop credible use-cases for excavation materials.
The data sets can be used by public entities to increase their understanding about the geology in the France/Switzerland border region.
The data sets may also be used by scientific writers and the press to produce high-quality infographics, demonstrating the impact potentials of the technology.
Fair Data
Making Data Findable, Including Provisions for Metadata
Discoverability
The
ZENODO.ORG
platform will be used to make the data openly accessible and discoverable. Links to the data will be made available on the
FCC Twiki collaborative web site
together with metadata describing the data sets once they are released on ZENODO. A link will also be provided at the FCC public website
fcc.cern
. The data will be indexed using the
EU Open Data Portal
.
Since the open data support the quality and credibility of the open publications, all data are discoverable through the scientific publications.
Each scientific publication will include
Digital Object Identifiers (DOI)
that point to the associated open data sets.
Identification
Each data set will carry a DOI as unique and persistent identifier.
Data sets will be referenced in scientific publications and if the open data platform permits, scientific papers based on the data will be linked on the open data platform.
The DOI is reserved when a ZENODO entry is created before any data are uploaded to the platform. At this point, the data set is not published and its visibility is classified as "Closed Access".
Metadata
The data sets follow the
EU Open Data Portal Metadata definitions
.
From the comprehensive set of fields, a minimum set will be provided.
The
Dublin Core Metadata Initiative
will be followed as much as possible.
For all characterisations, at least the following metadata will be provided via the ZENODO upload form and the individual metadata files in the data folders:
Metadata element | Description |
Project identifier | Points to a subfolder with the same name, holding the data |
Title | Meaningful name of the sample characterisation |
Alternative title | Other identifier in concise format |
Description | Brief description of the data set |
Keywords | List of keywords according to the Library of Congress terms |
Identifier | Digital Object Identifier (DOI) reserved for this data set |
URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system |
Dataset type | Type of the bore log or material sample (e.g. cube, surface, loose rock) |
Documentation | Detailed description of the dataset content |
Format | File type of the data set, usually a compressed ZIP archive |
Issue date | Date of the first issue |
Modification date | Last date the data sample set was modified |
Publisher | Usually the FCC collaboration |
Contact point | Name of the organisation and service at the organisation who is in charge of that data sample |
Contact full address | Address of the contact and organisation in charge of that data sample |
Contact e-mail | E-mail address of the contact person |
Contact name | First name and last name of the person who is in charge of the data sample |
Contact Web page | A web page |
Version number | Major and minor version number of the published data set. A minor version number 0 indicates a released version |
Version description | Incremental change record |
Licence | Link to licence text |
Owner | Name of the person who must authorize the release of specifically marked information items |
Status | One of IN WORK (minor version not equal 0), RELEASED (minor version number is 0), INVALID |
Materials | A list of individual materials present in the sample as far as known |
Dimensions | Sample dimensions (mass, diameter, length, width) |
Origin | Geographical coordinates of the sample's origin |
Source | Who provided the sample |
History | A record of actions at different locations that indicate how the sample was obtained, used and characterised |
Additional information | Free text with additional comments |
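A formal completeness check of such a metadata record could look like the following Python sketch. It assumes that the record has already been read from the metadata file into a dictionary (reading the .ODS file is not shown), and the list of checked fields is an illustrative subset of the table above, not the full set. The function name and the example title are hypothetical.

```python
# Illustrative subset of the metadata elements listed in the table above
REQUIRED_FIELDS = [
    "Project identifier", "Title", "Description", "Keywords", "Identifier",
    "Licence", "Owner", "Status", "Version number", "Contact e-mail",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or blank in record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


# Hypothetical, deliberately incomplete record
record = {
    "Project identifier": "TUW-180317",
    "Title": "Example sample characterisation",
    "Status": "IN WORK",
    "Version number": "0.1",
    "Licence": "https://creativecommons.org/licenses/by/4.0/",
}

gaps = missing_fields(record)
```

An empty result means that every checked field has a non-blank value; it says nothing about whether the values themselves are correct.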
The folder
data
of the CERNBOX file share contains a file called
CATALOG.ODS
that provides the most important metadata, notably the project identifier of the sample characterization that permits pointing to the folder in which the open data are stored.
The geo database catalog is also available on the
geo database Twiki page.
The following data description elements are collected for the catalog:
Data element | Description |
Data set identifier | Identifier in format LLL-YYMMDD |
URI | Uniform Resource Identifier linking to the place where the data are stored. Usually the path of the folder in the file system. A cernbox link may either be public or accessible only to members of the e-group if the data are not yet published |
OID | Digital Object Identifier filled if reserved in ZENODO |
Title | Concise name of the data set |
Version | Current version of the data set |
Status | Release status of the data set (RELEASED, IN WORK, INVALID) |
Organisation | Short name of the organisation that created and manages this data set |
Contact name | Name and e-mail (hyperlink) of the person who serves as contact point for this data set |
Type | Type of the data set, e.g. a bore log or material sample (e.g. cube, surface, loose rock) |
Source | Where the materials for this data set have been obtained from (e.g. contracted extraction, other project, literature) |
Location N/E | Geographical coordinates where the materials that relate to this data set have been taken from |
Description | Brief description of the data set |
Last update | Date when this record has been updated |
*NOTE: *
It is understood that these fields duplicate fields that are also stored at the lower, measurement data set folder level. The repetition at the higher level serves to create a simple-to-use catalogue for the project.
Each data set is stored in a separate sub-folder.
Each data set sub-folder has in turn either a) folders for different samples that have been characterised or b) folders that directly hold the characterisation data of the entire data set.
If a data set folder has sample sub-folders, then each sample sub-folder contains in turn a
METADATA.ODS
file that describes the specific data set according to the metadata elements indicated above.
Versioning
A dataset has a
major (
MM
) and a
minor (
mm
) version number, separated by a dot (
MM.mm
).
NOTE: The Zenodo recommended patch number is not used. It should always be “.0”.
If the minor version number is
0
, the data set is released.
Any minor version number different from zero indicates a data set that is in work.
At each release the major version number is incremented by one.
For each change in an "in work" version, the minor version number is incremented by one.
The first "in work" version starts with a major version number equal to 0.
Examples:
V 0.0
- the first draft version
V 0.1
- a second draft version
V 1.0
- the first released version
V 1.1
- an update to Version 1.0, not yet released
V 2.0
- another released version. Version
1.0
is now invalid.
Versions go together with the dataset status. It can either be
-
IN WORK
(not released),
-
RELEASED
or
-
INVALID
(must no longer be used as reference)
Note: INVALID and IN WORK versions must be neither published nor referenced in publications.
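The numbering rules can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed implementation: the class and method names are hypothetical, and version 0.0 is treated as a draft (as in the examples above) even though its minor number is 0. The INVALID status cannot be derived from the number alone and is therefore not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    major: int
    minor: int

    def __str__(self):
        return f"{self.major}.{self.minor}"

    def is_released(self):
        # Assumption: 0.0 is the first draft, so a release needs major > 0
        return self.minor == 0 and self.major > 0

    def status(self):
        # INVALID is assigned by the release procedure, not by the number
        return "RELEASED" if self.is_released() else "IN WORK"

    def next_draft(self):
        """Each change to an "in work" version increments the minor number."""
        return Version(self.major, self.minor + 1)

    def release(self):
        """Each release increments the major number and resets the minor."""
        return Version(self.major + 1, 0)


first_draft = Version(0, 0)
second_draft = first_draft.next_draft()    # 0.1
first_release = second_draft.release()     # 1.0
update_in_work = first_release.next_draft()  # 1.1
second_release = update_in_work.release()  # 2.0
```

The sequence at the end reproduces the examples listed above, from the first draft to the second released version.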
Naming Convention
Data Set Identification
A data set comprises multiple samples and sample analysis records.
The project identifier for a data set uses the following convention:
LLL-YYMMDD
where
LLL
is the three-letter abbreviation of the
organisation at which the data set is created, e.g. TUW for Technische Universität Wien.
YY
stands for the last two digits of the current year, e.g.
18
for
2018
.
MM
stands for the two digits of the current month, e.g.
04
for April.
DD
stands for the two digits of the current day of month, e.g.
17
for the seventeenth day.
A complete example for a project identifier is
TUW-180317
.
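The convention can be expressed as a small Python sketch that builds and parses such identifiers. The function names are hypothetical, and interpreting the two-digit year as a year in the 2000s is an assumption that this plan does not state explicitly.

```python
import re
from datetime import date

# Three uppercase letters, a dash, then YYMMDD
ID_PATTERN = re.compile(r"([A-Z]{3})-(\d{2})(\d{2})(\d{2})")


def make_identifier(organisation, day):
    """Build an LLL-YYMMDD identifier from an abbreviation and a date."""
    if not re.fullmatch(r"[A-Z]{3}", organisation):
        raise ValueError(f"not a three-letter abbreviation: {organisation!r}")
    return f"{organisation}-{day:%y%m%d}"


def parse_identifier(identifier):
    """Split an identifier into (organisation, date); raise ValueError if invalid."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not in LLL-YYMMDD format: {identifier!r}")
    organisation, yy, mm, dd = match.groups()
    # Assumption: two-digit years refer to the years 2000 to 2099.
    # date() raises ValueError for impossible dates such as month 13.
    return organisation, date(2000 + int(yy), int(mm), int(dd))


example_id = make_identifier("TUW", date(2018, 3, 17))
parsed = parse_identifier("TUW-180317")
```

Parsing the date part through `date()` rejects identifiers that match the pattern but do not correspond to a real calendar day.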
The three-letter abbreviations of an exemplary number of organisations that typically carry out material characterizations are shown in the table below. This document will be regularly updated. The three-letter abbreviations do not coincide with the typical organisation abbreviations used to identify an organization in EC projects. They merely serve to produce unique project identifiers for sample characterizations.
Sample and Analysis Identification
Files that relate to
samples and data from different analysis processes are placed in subfolders of the data set according to this structure:
LLL-YYMMDD
is the name of the folder for a data set that contains multiple sample characterisations.
LLL-YYMMDD-{running number}-{type}
is the name of a sub-folder in
LLL-YYMMDD
that holds the analysis data of a specific sample or characterisation.
{type}
can be
- either
sample
to indicate that the folder contains data of a single sample that has been analysed with one or several methods or
-
{characterisation method abbreviation}
to indicate that this folder holds data for a specific characterisation method.
Today, the following characterisation methods are registered:
If a sub-folder contains a sample that has been analysed with multiple methods, it contains sub-folders that are named according to the abbreviation of the specific characterisation method.
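A sub-folder name following this structure can be derived as in the Python sketch below. The function name is hypothetical, and starting the running number at 1 is an assumption based on the folder examples used elsewhere in this plan.

```python
def subfolder_name(dataset_id, running_number, kind="sample"):
    """Return {dataset_id}-{running number}-{type}.

    kind is either "sample" or the abbreviation of a characterisation
    method, for example "SEM".
    """
    # Assumption: running numbers start at 1
    if running_number < 1:
        raise ValueError("running number must be 1 or greater")
    return f"{dataset_id}-{running_number}-{kind}"


sample_folder = subfolder_name("CRN-200301", 1)
method_folder = subfolder_name("CRN-200301", 2, "SEM")
```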
Filename and Folder Naming
The filename of a dataset contains a clear, concise and very short name that identifies the contents.
Words are in lowercase and separated by underscores ("_"). CamelCase is discouraged.
Filename extensions are encouraged to ease the understanding of the folder contents.
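A simple automated check of this naming rule is sketched below in Python. The pattern is an assumption about how to read the rule: it allows digits within words and an optional extension, and it is meant for data files only, not for the fixed-name files such as METADATA.ODS. The function name and example file names are hypothetical.

```python
import re

# Lowercase words separated by single underscores, optional file extension
NAME_PATTERN = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*(\.[A-Za-z0-9]+)?")


def follows_convention(filename):
    """Return True if filename uses lowercase words joined by underscores."""
    return NAME_PATTERN.fullmatch(filename) is not None


ok = follows_convention("grain_size_distribution.csv")
camel_case = follows_convention("GrainSizeDistribution.csv")
with_space = follows_convention("grain size.csv")
```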
All data sets are stored in the folder
data
of the
fccgeo
EOS/CERNBOX project.
Folder Structure
Each top level data set folder contains at least the following files:
-
LICENCE.TXT
- Text of the Creative Commons CC BY 4.0 licence. Additional note about the data creator if needed including specific clauses on a case-by-case basis
-
METADATA.ODS
- the metadata of the data set including the change track record
Each data set folder contains subfolders. Those can be folders that hold the information about the entire data set from which the samples are taken and sub-folders for individual sample analysis tasks.
Collaboration members shall report needs for further characterization methods in a timely fashion so that the
template
folder at top level can include further examples.
The file and folder contents naming convention will evolve according to project needs and with growing experience.
One sub-folder needs to be created in a data set folder for each sample that is analysed with multiple techniques or for each analysis technique as indicated above.
Each sample-specific sub-folder holds a
CharacterisationMetadata.ods
file that describes the sample and the analysis carried out on that sample.
Depending on the numbers of samples taken from the original material lot and the different types of characterisations performed, there exist two options:
- Either there exists one sub-folder in the data set for each sample taken from the material lot (preferred) or
- there exists one sub-folder for each characterisation
The sub-folders are named by extending the data set naming conventions with a dash (
-
) followed by a sequence number, another dash (
-
) and the word
sample
or the abbreviation of the characterisation method.
If a single sample is analysed with different analysis methods, the sample folder contains one subfolder per analysis method that has the name of the analysis method (e.g.
SEM
,
TEM
).
Each sample or characterisation folder contains the following sub-folders:
-
img
for images
-
raw
for any raw data files that devices deliver during the analysis, including calibration data files
-
results
for the derived analysis data and results of the characterisation
-
doc
for any additional documentation such as manuals, process descriptions, sample preparation instructions
-
misc
for any additional information files
The following image shows the two possible folder structures. The
structure example and the files can be found in the
template
folder.
Another example is re-produced in tabular form:
Path name | Description |
/eos/project/f/fccgeo/data/CRN-200301 | Folder for a bulk of materials taken at CERN |
/eos/project/f/fccgeo/data/CRN-200301/Metadata.ods | The metadata file that describes that material taken and the list of analysis performed |
/eos/project/f/fccgeo/data/CRN-200301/img | Sub-folder that holds images that relate to the entire data set (e.g. of the borelog) |
/eos/project/f/fccgeo/data/CRN-200301/raw | Sub-folder that holds the raw data files (e.g. the scanned borelog) |
/eos/project/f/fccgeo/data/CRN-200301/results | Sub-folder that holds the analysis results that relate to the entire data set |
/eos/project/f/fccgeo/data/CRN-200301/doc | Sub-folder that holds the additional documentation that permits better understanding the data set |
/eos/project/f/fccgeo/data/CRN-200301/misc | Sub-folder that holds miscellaneous additional information files |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample | Sub-folder for a qualitative analysis of a part of the material extracted |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/img | Sub-folder with images of the sample |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/results | Sub-folder with results of the sample analysis |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM | Sub-folder with data relating to the SEM analysis method of that specific sample |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/img | Sub-folder with images from the SEM analysis of that specific sample |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/raw | Sub-folder with the SEM device raw data |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-1-sample/SEM/doc | Sub-folder with the SEM device description, sample preparation information and process description |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM | Sub-folder for a spectroscopy analysis of another part of the material extracted |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/CharacterisationMetadata.ods | Metadata file that describes the subset of the materials and the characterisation |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/img | Sub-folder that holds images taken during the analysis |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/raw | Sub-folder that holds the raw data files obtained during the analysis |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/results | Sub-folder that holds the analysis results and result data files obtained after the analysis |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/doc | Sub-folder that holds the additional documentation that permits better understanding the results (e.g. device manuals) |
/eos/project/f/fccgeo/data/CRN-200301/CRN-200301-2-SEM/misc | Sub-folder that holds miscellaneous additional information files |
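A folder skeleton of this shape can be created with the following Python sketch. It works in a temporary directory rather than on the EOS share, creates only the folders (the metadata spreadsheets are not generated), and uses a hypothetical function name. In practice, copying the template folder described in this plan remains the intended way to start a new data set.

```python
import tempfile
from pathlib import Path

# Standard sub-folders of every data set, sample and characterisation folder
SUBFOLDERS = ("img", "raw", "results", "doc", "misc")


def create_dataset_skeleton(root, dataset_id, parts):
    """Create the folder tree for one data set.

    parts: list of (running number, type) pairs, where type is "sample"
    or a characterisation method abbreviation such as "SEM".
    """
    dataset = Path(root) / dataset_id
    for name in SUBFOLDERS:
        (dataset / name).mkdir(parents=True, exist_ok=True)
    for number, kind in parts:
        part = dataset / f"{dataset_id}-{number}-{kind}"
        for name in SUBFOLDERS:
            (part / name).mkdir(parents=True, exist_ok=True)
    return dataset


with tempfile.TemporaryDirectory() as tmp:
    dataset = create_dataset_skeleton(
        tmp, "CRN-200301", [(1, "sample"), (2, "SEM")]
    )
    # Collect the created folders relative to the data set folder
    created = sorted(
        path.relative_to(dataset).as_posix()
        for path in dataset.rglob("*")
        if path.is_dir()
    )
```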
Data Format
The combined set of sample characterization data is uploaded as a compressed archive file in
.ZIP
format to ZENODO together with the following separate files:
-
licence.txt
(text of the CC BY 4.0 licence in UTF-8 format)
-
METADATA.ODS
(spreadsheet describing the entire data set)
-
./{subfolders of selected sample characterisations to be published}
(organisation of the sub-folders can either be by sample or by characterisation method)
Note: Each subfolder corresponding to a specific measurement action will contain only those files that the collaboration members agree to publish openly. Usually these comprise spreadsheets in open format (
.ODS
) and comma separated value files (
.CSV
) as well as high resolution images (
.PNG
,
.TIFF
,
.JPEG
), vector plots (
.EPS
,
.PDF
) and summary documents (
.PDF
).
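Packaging the selected files into a ZIP archive can be sketched in Python as below. The extension filter is only an illustrative stand-in for the agreement of the collaboration members on which files may be published; it does not replace that agreement. The function name and the example files are hypothetical, and the separately uploaded licence and metadata files are not handled here.

```python
import tempfile
import zipfile
from pathlib import Path

# File types that are usually published, according to the note above
PUBLISHABLE = {".ods", ".csv", ".png", ".tiff", ".jpeg", ".eps", ".pdf"}


def build_archive(dataset_dir, archive_path):
    """Zip all files with a publishable extension, keeping relative paths."""
    dataset_dir = Path(dataset_dir)
    included = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in PUBLISHABLE:
                arcname = path.relative_to(dataset_dir).as_posix()
                zf.write(path, arcname)
                included.append(arcname)
    return included


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical data set with one publishable and one raw device file
    dataset = Path(tmp) / "CRN-200301"
    (dataset / "results").mkdir(parents=True)
    (dataset / "raw").mkdir()
    (dataset / "results" / "density_profile.csv").write_text(
        "depth_m,density_g_cm3\n", encoding="utf-8"
    )
    (dataset / "raw" / "device_dump.bin").write_bytes(b"\x00\x01")

    archive = Path(tmp) / "CRN-200301.zip"
    included = build_archive(dataset, archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
```

Returning the list of included files makes it easy to review the selection before the archive is uploaded.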
Keywords
Each dataset will at least be tagged with the following keywords:
- FCC
- FCCIS (if the data set is part of the FCCIS EC co-funded project)
- H2020 (if the data set was created as part of an EC co-funded project)
In addition, appropriate keywords from the
Library of Congress Subject Headings
classification
will be added (see
http://id.loc.gov/authorities/subjects.html
):
The keywords need at least to include
- discipline
- sample type
- type of materials
- properties or methods used for characterization
A selected list of entries from the following keyword terms and the link to the keyword term need to be entered in the two distinct fields foreseen in the ZENODO upload webform:
Metadata Standards
The Dublin Core Metadata Initiative will be followed (
http://dublincore.org
) as much as reasonably applicable. Metadata examples concerning the sample measurement campaigns such as
http://icatproject-contrib.github.io/CSMD/csmd-4.0.html
and an existing sample database at CERN have been considered. However, to the best of our knowledge, no domain-specific metadata standard exists for the sample characteristics identification campaigns specified in the project. Therefore, a column-oriented data format with an explanation of the columns will be created in the scope of this project.
Templates
Template folder and files can be found in the folder
fccgeo/templates
.
The entire folder
fccgeo/templates/CRN-200728
is an empty example that can be copied into the folder
/fccgeo/data
to be renamed using the naming convention for the data set
LLL-YYMMDD
to create a new data set. It contains the exemplary required files and exemplary characterisation subfolders.
Storage Administration and Access Permissions
Data Store Location
Data from subsurface samples are stored on a dedicated cernbox/eos data repository. This repository can be accessed either via a website or via a cernbox client application.
The path to the data share reachable directly on
lxplus
is
/eos/project/f/fccgeo
The data share is owned by the service account
fccgeo
(
fcc.geo@cern.ch
).
If you have access to the repository through membership of one of the e-groups, you can directly access the share using a web browser:
https://cernbox.cern.ch/index.php/apps/files?dir=/__myprojects/fccgeo
Access Permissions
Access to the data store is managed via three e-groups following the
cernbox access documentation
:
cernbox-project-fccgeo-admins
members have full access to the project in the cernbox website and can add readers and writers to the respective egroups.
cernbox-project-fccgeo-writers
members can read, write and delete in the project space. They can access the cernbox share only via the pathname.
cernbox-project-fccgeo-readers
members can only read the files in the project space. They can access the cernbox share only via the pathname.
Making Data Openly Accessible
Data from characterization campaigns will be made openly available only after approval of all persons who were involved in that characterization.
Publication on Zenodo follows a quality management
Data Release Procedure.
Note: Persons publishing data on the ZENODO platform need to sign up for an account on that platform at
https://zenodo.org/signup/
.
Increase Data Re-Use Through Clarifying Licenses
Licence
Openly accessible data will be licensed under
Creative Commons CC BY 4.0 (see
https://creativecommons.org/licenses/by/4.0/
).
Users are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
for any purpose, also commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation documented by CC BY 4.0. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Timing
Data will be made available, at the latest, upon publication of an accompanying scientific publication that references the data sets. All data will be available by the end of the project at the latest, including those data sets that are not referenced in scientific publications.
Data publication may occur after the
six-month embargo period permitted by EC H2020
has elapsed.
Published data may be limited to the subset of the data that is not subject to access restrictions and whose publication would not violate other existing license restrictions.
Re-use
Published data can be used by other scientists. Original data revealing particular sample preparation or analysis techniques that are subject to access restrictions will only be usable by researchers upon explicit request and approval of the characterization campaign manager and the data owner.
Validity
The data will remain usable until the repository withdraws the data or goes out of business.
Quality Assurance
Data sets, metadata and the description of the measurement setup and procedure will be reviewed by at least one peer before the release procedure is started.
The author and the reviewer are named in the metadata.
Data sets, metadata and measurement setup and procedure description will be marked as
RELEASED
only after approval of the measurement campaign manager (e.g. project supervisor) and one additional reviewer. The approvers are named in the metadata.
Review and release include a validation of the measured sample, the measurement setup, conditions, procedure and equipment, as well as sanity checks against similar studies and control of systematic errors.
In case of data quality uncertainties after release, a new version
IN WORK
is created and the released data set version is marked
INVALID
.
The description of the analysis/characterisation setup (materials and methods) is annotated with product references.
The measurement conditions are described.
The measurement location, date and time (periods) are noted.
Any potential and known adverse effects (environmental influences, influences of the measurement equipment) are described in the metadata.
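The status markers used in the rules above (IN WORK, RELEASED, INVALID) behave like a small state machine. The sketch below illustrates one possible reading and is not the project's actual tooling. The class DataSetVersion, the integer version counter and the functions release and invalidate are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetVersion:
    version: int                  # assumption: versions are numbered 1, 2, 3, ...
    status: str = "IN WORK"
    approvers: list[str] = field(default_factory=list)


def release(dataset: DataSetVersion, manager: str, reviewer: str) -> None:
    """Mark a version RELEASED once the campaign manager and one
    additional reviewer have approved it."""
    if dataset.status != "IN WORK":
        raise ValueError(f"cannot release a version that is {dataset.status}")
    if not manager or not reviewer or manager == reviewer:
        raise ValueError("two distinct approvers are required")
    dataset.status = "RELEASED"
    dataset.approvers = [manager, reviewer]  # approvers are named in the metadata


def invalidate(dataset: DataSetVersion) -> DataSetVersion:
    """Mark a released version INVALID and open a new IN WORK version."""
    if dataset.status != "RELEASED":
        raise ValueError("only a released version can be invalidated")
    dataset.status = "INVALID"
    return DataSetVersion(version=dataset.version + 1)
```

Keeping the invalidated version alongside its successor, rather than overwriting it, preserves the history of what was once released.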
Allocation of Resources
Cost Estimate
A person at CERN reachable through the service account
fcc.geo@cern.ch
keeps the measurement data sets and performs the publication in the open data repository.
The estimated effort to manage a data set is 40 hours. At 10 data sets per year, this amounts to 400 hours, or 10 weeks, per year over the entire project period.
This resource is covered by the project management funds and CERN matching resources.
Note: the project coordination office will track the actual efforts and regularly update this estimation.
Each researcher in the project is responsible for creating the data sets using the adopted open data format, providing the metadata files, describing the measurement setup, anonymising or selecting the data for publication, reviewing the data sets and performing the release process using the CERN-provided storage infrastructure (EOS, cernbox) and the ZENODO platform. The estimated effort is another 40 hours per data set. At 10 data sets per year, this again amounts to 10 weeks per year over the entire project period.
This resource is covered by the organisations who carry out the measurement campaigns.
Note: The participating institutes are strongly encouraged to track the time they spend preparing and publishing the data sets and to report the actual figures to the Coordinator.
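The effort figures above follow from simple arithmetic, reproduced in the short sketch below so that the estimate can be updated when the tracked efforts change. The 40-hour working week is an assumption implied by the equivalence of 400 hours and 10 weeks. The function name yearly_effort is hypothetical.

```python
HOURS_PER_DATA_SET = 40
DATA_SETS_PER_YEAR = 10
HOURS_PER_WEEK = 40  # assumption implied by "400 hours or 10 weeks"


def yearly_effort(hours_per_data_set: float, data_sets_per_year: int,
                  hours_per_week: float = HOURS_PER_WEEK) -> tuple[float, float]:
    """Return the yearly effort as (hours, weeks)."""
    hours = hours_per_data_set * data_sets_per_year
    return hours, hours / hours_per_week


# The same figures apply once to CERN (data management) and once to the
# participating institutes (data set preparation).
hours, weeks = yearly_effort(HOURS_PER_DATA_SET, DATA_SETS_PER_YEAR)
print(f"{hours} hours per year, i.e. {weeks} weeks per year")
```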
Data Management Responsibilities
This data management plan is maintained by the FCC office at CERN (
fcc.office@cern.ch
).
All project members at the co-operating organisations commit to cooperate on the establishment of this DMP and to deliver the required information such that the associated deliverables and milestones can be produced in due time and at the required quality levels:
Data storage and backup responsibilities are covered by the data repository provider (CERN).
The CERN project repository is managed by CERN IT department.
Service account holder
fcc.geo@cern.ch
manages the data store.
The FCC secretariat and office (
fcc.office@cern.ch
) provide support for the upload to the CERNBOX/EOS data storage system and perform a formal check (file integrity, naming, metadata completeness).
Long-term data preservation will be ensured by CERN at no additional cost.
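The file integrity part of the formal check mentioned above could, for example, rely on a checksum manifest that is created before upload and compared afterwards. The following sketch shows the idea under that assumption; this plan does not prescribe a particular integrity mechanism. The function names sha256_of, build_manifest and changed_files are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the data set folder) to its digest."""
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed or modified since the manifest was built."""
    current = build_manifest(folder)
    return sorted(
        name for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

An empty list returned by changed_files means that the uploaded copy matches the manifest created by the author of the data set.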
Data Security
All data delivered to the CERN project repository EOS/CERNBOX storage system is backed up by CERN's central IT services. In addition, a copy of released data will be kept on the ZENODO platform. Both services are intended for long-term storage of scientific research data. Upon unintentional loss of data (misuse of the collaborative workspace, accidental removal), the
fcc.geo@cern.ch
service account holder needs to be contacted via email. The account holder will work with CERN's IT services to restore the latest known copy. No additional costs are incurred for storage, backup and restore activities.
Project members can provide non-public data sets over HTTPS, after authentication, by means of a cernbox sharing link. The sharing link is requested from the service account holder, so that the transfer takes place in a reasonably secure fashion that counteracts data manipulation.
Sensitive data, i.e. non-anonymized data sets, can only be accessed by the author, the measurement campaign management and the project's IT managers.
Access to sensitive data may be granted through a request to the network coordinator (CERN) with a justification for the request. Access will be granted on a case-by-case basis in agreement with the measurement campaign manager and, if samples, analysis and products from industrial partners are involved, in agreement with the sample owners. The data will be communicated in electronic format from the network coordinator to the data requestor in digitally signed and encrypted form. An additional IP access process, such as the establishment of a Non-Disclosure Form, may apply.
Every collaboration member must inform the network coordinator without delay if a person who is affiliated (associated or employed) with the institute and who has access to the project data leaves the institute. In this case, the network coordinator will revoke that person's access to the data as soon as technically possible and resources (working hours) permitting.
Note: E-mail is not considered a secure communication channel for data and metadata files. Data can be modified, leaving it unclear which fields differ from the original data source. Therefore, only a link to the authentic data source shall be considered reliable information.
Ethical Aspects
Sensitive information will be kept secure. Access to non-anonymized data is managed by the network coordinator in close cooperation with the organisation that provides the data set. Non-anonymized data will only be communicated in encrypted and digitally signed form.
--
JohannesGutleber - 2020-07-27