Archiving and freezing datasets on disk


  • Initial proposal (PhC): 13-Dec-2010
  • Updated proposal (PhC): 19-Jan-2011
  • Meeting discussion: 10-Jan-2011
  • Update including Ricardo's suggestions: 24-Jan-2011
  • Update on actual implementation: 07-Feb-2011
  • New proposal (RG-D, PhC): 21-Mar-2011


Our Computing Model foresees keeping on disk (TxD1 service classes) only the two latest versions of (re)processed real data. However, it is important to keep for a very long time the datasets that were used to produce published physics results. Similarly, when simulated data become obsolete, it is foreseen to keep at least one copy of them on tape only for a certain period. These datasets can be deleted in a second stage, once it is clear they are no longer needed.

It is proposed to archive datasets at creation time, as part of the regular replication process, but in storage elements different from those from which users read data. The archived datasets will replace the T1D1 storage used at 2 sites (CERN + 1 Tier1), and analysis will take place only on T0D1 service classes.

The Computing Model also assumes that a fraction of the RAW data and their corresponding R(S)DSTs is kept on disk, such that developers or subdetector experts do not need to stage data from tape. This was partially addressed during 2010, but in an unsatisfactory way (by making CERN-RDST a T1D1 storage and re-registering RAW files processed at CERN as CERN-RDST). Since the use case is rather to have a fraction of all data (from RAW to streamed DST) on disk at CERN for a small number of runs, a new mechanism has to be put in place: a "data-freezing" mechanism.

Archive and freeze SE setup and namespace

It is proposed to archive datasets using special SEs at CERN and Tier1s (<site>-ARCHIVE). In addition, data should be available on a T0D1 storage at CERN and at a variable number of Tier1s (e.g. 3 for real data, 2 for MC).

Due to the uniqueness of the namespace in Castor, archived or frozen datasets must be replicated with a Castor path different from that used for regular SEs, into an adequate service class and file class. We use the existing LHCb-RDST space token for holding the -ARCHIVE Dirac storages.

For freezing, the CERN-FREEZER Dirac storage uses LHCb-DST, which is a T0D1 space token.

These space tokens have as SA_PATH <old SAPath>/lhcb/archive and <old SAPath>/lhcb/freezer respectively. The files then have the following paths:

<old SAPath>/lhcb/archive/<LFN>
<old SAPath>/lhcb/freezer/<LFN>
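The path construction is a plain concatenation of the storage's SA_PATH and the LFN. A minimal sketch in Python, where the base SA_PATH values are illustrative placeholders, not the real site configuration:

```python
# Illustrative SA_PATH values for the special storages; the actual
# <old SAPath> prefix depends on the site configuration.
SA_PATHS = {
    "CERN-ARCHIVE": "/castor/cern.ch/grid/lhcb/archive",  # assumed base path
    "CERN-FREEZER": "/castor/cern.ch/grid/lhcb/freezer",  # assumed base path
}

def physical_path(dirac_se, lfn):
    """Build the physical path by appending the LFN to the SA_PATH."""
    return SA_PATHS[dirac_se] + lfn
```

For example, an LFN `/lhcb/data/2010/DST/x.dst` archived at CERN would land under `/castor/cern.ch/grid/lhcb/archive/lhcb/data/2010/DST/x.dst` with these assumed prefixes.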

Status on 21-Mar-2011:

  • All DIRAC SEs have been created and tested.
  • <site>-ARCHIVE SEs are banned for reading as users should not be able to use files in the archive space token.

Archiving procedure

Archiving a dataset consists of replicating it to the corresponding DIRAC SE at CERN and at one Tier1, as part of the replication process. When a dataset is retired for users, it will be deleted from the regular SEs <site>[_MC][-M]-DST, and it should be made invisible in the bookkeeping. In the case when datasets can be trashed completely, then the archived set can be deleted as well, and the replica flag unset in the bookkeeping.
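The archive/retire flow above can be sketched as two steps over a replica catalogue and a bookkeeping visibility flag. This is a minimal illustration, not the DIRAC API: `archive_dataset` and `retire_dataset` are hypothetical stand-ins for the actual data-management and bookkeeping calls, operating on plain dictionaries.

```python
def archive_dataset(lfns, tier1, catalog):
    """Replicate each file to CERN-ARCHIVE and one Tier1 -ARCHIVE SE.

    catalog maps LFN -> set of SEs holding a replica (toy stand-in
    for the replica catalogue).
    """
    for se in ("CERN-ARCHIVE", tier1 + "-ARCHIVE"):
        for lfn in lfns:
            catalog.setdefault(lfn, set()).add(se)

def retire_dataset(lfns, catalog, visibility):
    """Delete the regular disk replicas and hide the files in the BK."""
    for lfn in lfns:
        # keep only the -ARCHIVE replicas; regular <site>[_MC][-M]-DST
        # replicas are removed
        catalog[lfn] = {se for se in catalog[lfn] if se.endswith("-ARCHIVE")}
        visibility[lfn] = False  # made invisible in the bookkeeping
```

If the dataset is later trashed completely, the -ARCHIVE replicas would be removed as well and the replica flag unset in the bookkeeping.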

  • Action: Zoltan to implement a method for setting/unsetting the visibility flag on files.
  • Action: Philippe to adapt the plugins (done).

Archiving existing datasets

The existing datasets need to be transferred to two -ARCHIVE SEs (one at CERN, one at a Tier1). This can be done using the ArchiveDataset plugin (not yet released). As a consequence there will be additional tape copies of those datasets, since it appears very difficult to change the storage class of a file from T1D1 to T0D1. Therefore old datasets should be archived only just before they are deleted.

New SE setup

T1D1 space tokens should disappear in the long term. The following intermediate procedure is proposed (it works on Castor; to be checked on dCache and StoRM):

  • Create a new SRM space token (e.g. LHCb_Disk) that contains the existing space tokens LHCb_DST, LHCb_MC-DST, LHCb_M-DST, LHCb_MC-M-DST, LHCb_FAILOVER, LHCb_HISTOS and is defined as T0D1.
  • Redefine the Dirac SEs based on this space token.

The LHCb_RAW and LHCb_RDST space tokens should be merged at Tier1s (can be kept separate at CERN).

Old SE          Pool           New SE
LHCb_RAW        lhcbraw        LHCb_Tape
LHCb_RDST       lhcbrdst       LHCb_Tape
LHCb_DST        lhcbdst        LHCb_Disk
LHCb_M-DST      lhcbmdst       LHCb_Disk
LHCb_MC-M-DST   lhcbdata       LHCb_Disk
LHCb_FAILOVER   lhcbfailover   LHCb_Disk
LHCb_HISTOS     lhcbhistos     LHCb_Disk

LHCb_USER should be kept separate in order to protect production data from user overflow.

Restoring archived datasets

In order to make archived datasets active again, they should be replicated to the proper Dirac SEs (on the LHCb_Disk space token) and the visibility flag should be set in the BK. This can be achieved using the same plugin that is used for data distribution.
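Restoring is the inverse of retirement: add a disk replica back and restore BK visibility. A self-contained toy sketch (the function name and the dictionary-based catalogue are illustrative, not the DIRAC API):

```python
def restore_dataset(lfns, disk_se, catalog, visibility):
    """Replicate archived files back to a T0D1 disk SE and set the BK flag.

    catalog maps LFN -> set of SEs holding a replica; visibility maps
    LFN -> BK visibility flag (toy stand-ins for the real services).
    """
    for lfn in lfns:
        catalog.setdefault(lfn, set()).add(disk_se)  # new disk replica
        visibility[lfn] = True  # visible again in the bookkeeping
```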

Freezing procedure

For real data, we propose to set aside in the freezer datasets corresponding to selected runs processed at CERN. The selection criterion still has to be defined: taking runs whose number is a multiple of n (for example 5) would on average freeze 1/n of the runs processed at CERN. One can also freeze the first files of all runs processed at CERN (ordinal number less than 10).
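The two candidate criteria reduce to simple predicates; a sketch (function names are illustrative):

```python
def freeze_by_run(run_number, n=5):
    """Freeze runs whose number is a multiple of n (~1/n of runs on average)."""
    return run_number % n == 0

def freeze_by_file(ordinal):
    """Freeze the first files of every run (ordinal number less than 10)."""
    return ordinal < 10
```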

Freezing can be implemented in the policy of a replication plugin, with the small caveat that it is not guaranteed that non-merged DST files will not be deleted by the merging job before they have been replicated into the freezer.

There is a need to allow users to query the BK for frozen datasets, and even to enforce this for RAW and RDST files. This could be achieved with a special flag in the BK: by default, user requests for RAW or RDST files should require this flag. An alternative is for users to ask for files replicated at CERN-FREEZER (similar to "Advanced Save"). This request could be set by default for RAW and RDST file types (in the GUI).

A regular cleaning of the frozen datasets is necessary. The retention period in the freezer has yet to be determined. Once a popularity tool is implemented, the retirement of a dataset can be suggested by its loss of popularity.

Action: define the policy and implement it in the distribution plugins and/or upload module.

Savannah task

Link to a related Savannah task

Related documentation

-- PhilippeCharpentier - 13-Dec-2010

Topic revision: r11 - 2011-04-01 - PhilippeCharpentier