Analysis preservation and reproducibility

CERN Central Analysis Preservation

The central analysis preservation services, CAP and REANA, offer full archiving of analyses, including input data, analysis code and documentation. The metadata are fully searchable, and the REANA project enables rerunning and reinterpreting analyses.

A command-line tool, cap-client, is available for interacting with CAP. A tutorial for cap-client is also available.
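A session with the client might look roughly like the sketch below. The command names and flags are illustrative assumptions, not the definitive interface; consult `cap-client --help` and the tutorial for the actual commands.

```
$ pip install --user cap-client              # install the client (assumed installation method)
$ export CAP_SERVER_URL=https://analysispreservation.cern.ch
$ cap-client create --type lhcb-analysis     # illustrative: register a new analysis record
$ cap-client get                             # illustrative: list your analysis records
```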

LHCb specific practices

*As of December 2017 LHCb has adopted a minimal set of mandatory AP practices as part of the review process. See the presentation https://indico.cern.ch/event/672229/#75-analysis-preservation*

There are four domains in which an analysis can be made more reproducible in order to fully exploit the services offered by CAP:

  • Analysis code repositories
  • Ntuple storage
  • Analysis automation
  • Runtime environment preservation

Details on this architecture can be found in the internal note LHCb-INT-2017-021, "LHCb Analysis Preservation Roadmap".

In each domain tools and examples have been developed.

Analysis code repositories

Analysis code is kept in GitLab repositories at https://gitlab.cern.ch.

The main repository of each analysis should be kept in the PHWG GitLab groups. This ensures that the code remains available to the collaboration even after individual analysts have left.

LHCb physics PHWG GitLab groups (access is managed by the WG egroups):

It is easy to transfer a project from a personal namespace into a GitLab group, but this requires that access to the groups is set up correctly. This is the responsibility of the conveners.

Typically the access rights are:

  • WG Conveners (as per egroup) -- Owner
  • WG members (as per egroup) -- Maintainer
  • lhcb-general -- Developer

NOTE: GitLab has a Guest role, but it is of no use to us, as it does not allow the code to be browsed or downloaded.

In addition, please make sure that the project is marked as "private" (as opposed to "public", which makes it accessible to everyone, or "internal", which makes it accessible to anyone with a CERN account, including our colleagues on ATLAS and CMS).

If you have questions on how to transfer your project, please contact your WG conveners.

Documentation is available here: https://docs.gitlab.com/ce/user/project/settings/index.html#transferring-an-existing-project-into-another-namespace

In short: Settings > General > Advanced (expand) > Transfer project > Select a new namespace (type name of your working group) > Transfer

Ntuple Storage

There is dedicated storage available on EOS: /eos/lhcb/wg

The organisation of the working group directories is up to the working groups. Please contact your conveners.

A service account is available to allow remote machines (such as GitLab CI runners) to authenticate for data access.

account name: lbanadat

A keytab to be used for authentication is available at the preliminary location /eos/lhcb/wg/BandQ/Test/anadat.keytab. A detailed example of its intended use is part of the analysis containerisation template.
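As a sketch, a GitLab CI job could use this keytab to obtain a Kerberos ticket before copying ntuples from EOS. Only the account name and keytab location come from the text above; the job name, image and target path are illustrative assumptions.

```yaml
# .gitlab-ci.yml fragment (illustrative)
fetch_ntuples:
  image: gitlab-registry.cern.ch/my-group/my-image  # hypothetical image name
  script:
    # obtain a Kerberos ticket with the shared service account
    - kinit -kt /eos/lhcb/wg/BandQ/Test/anadat.keytab lbanadat@CERN.CH
    # copy input ntuples from the working-group EOS area (illustrative path)
    - xrdcp root://eoslhcb.cern.ch//eos/lhcb/wg/MyWG/ntuples/data.root .
```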

Working Group Production

Using a working group production has the advantage that the ntuples are automatically available in a central place, accessible to the collaboration. The output of the working group production is managed in the LHCb bookkeeping.

WG productions can be automatically submitted using the WG/CharmWGProd repository on GitLab, see the README for details. Analysts from other working groups are welcome to use the Charm package.

Automated analysis workflow

Scripting and automating an analysis is not a new concept: anybody who has run (parts of) an analysis on a computing cluster has already done exactly this. In the context of AP practices, scripting and automation have the added benefit of capturing exactly how the analysis tools were executed; the more parts of the analysis are included in the automation, the better this information is preserved. The analysis script(s) also provide an invaluable starting point for new people who want to learn about the analysis and reuse or improve the analysis tools.

Often a simple ROOT, Python or even bash script is all that is needed. Almost every analysis we are aware of has such scripts in one form or another.
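As a minimal illustration (the file and stage names are invented for this sketch), a top-level bash script pins down the exact order in which the analysis steps run, so that one command reproduces the full chain:

```shell
#!/usr/bin/env bash
# run_analysis.sh -- hypothetical top-level driver; stage names are illustrative.
set -euo pipefail   # abort at the first failing stage

mkdir -p output

# Stage 1: apply the offline selection to the input ntuple
# (stand-in for e.g. "python select.py input.root output/selected.root")
echo "selection applied" > output/selection.log

# Stage 2: fit the selected sample
# (stand-in for e.g. "python fit.py output/selected.root")
echo "fit done" > output/fit.log

echo "analysis chain finished"
```

Anyone checking out the repository can then rerun the whole chain with `bash run_analysis.sh`.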

To handle complex analysis flows, with several data-preparation steps and many systematic checks, several utilities are available. These so-called "workflow engines" provide a way to organise the component scripts of an analysis into a common workflow.

There is now a dedicated lesson on the Snakemake workflow engine in the LHCb StarterKit. Additionally, the official documentation contains an in-depth tutorial and much additional information.

The goal for full reproducibility is a complete documentation of how the analysis scripts must be executed to reproduce the results. An automated workflow can be thought of as machine-readable documentation.
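As an illustration, a minimal Snakefile (all rule, script and file names here are invented) encodes such a two-step chain in machine-readable form; Snakemake works out the execution order from the input/output dependencies:

```
# Snakefile -- hypothetical two-step workflow
rule all:
    input:
        "output/fit_result.txt"

# Stage 1: apply the selection to the raw ntuple
rule select:
    input:
        "data/ntuple.root"
    output:
        "output/selected.root"
    shell:
        "python select.py {input} {output}"

# Stage 2: fit the selected sample
rule fit:
    input:
        "output/selected.root"
    output:
        "output/fit_result.txt"
    shell:
        "python fit.py {input} {output}"
```

Running `snakemake --cores 1` then executes only the stages whose outputs are missing or out of date.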

Runtime environments

Preservation of the runtime environment of an analysis in Linux containers works and is used by CAP. This area is still under active development.

  • ROOT docker image
  • LHCb docker image
  • Customizing docker images
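As a sketch of the customisation step, a Dockerfile can build an analysis-specific image on top of an existing ROOT image. The base image tag and the package list below are illustrative assumptions; for preservation, pin a fixed tag rather than `latest`.

```dockerfile
# Dockerfile -- hypothetical customisation of a ROOT base image
FROM rootproject/root:latest   # assumed base image; pin a fixed tag for preservation

# add extra Python packages the analysis needs (illustrative list;
# assumes pip is available in the base image)
RUN python3 -m pip install snakemake uncertainties

# copy the analysis code into the image
COPY . /analysis
WORKDIR /analysis
```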

Tools

Tools for analysis preservation and reproducibility are gathered in this GitLab group: https://gitlab.cern.ch/lhcb-analysis-preservation/

Example analyses

Documents

-- SebastianNeubert - 2018-06-12
Topic revision: r15 - 2019-02-07 - SebastianNeubert