Automatic File Merging


  • Created: 15.09.2008
  • Updated: 06.10.2008
  • Discussed at PASTE meeting: 07.10.2008
  • Updated: 24.10.2008
  • Reviewed at DIRAC3 week: 20.11.2008
  • Updated: 20.11.2008
  • Changed handling of BK: 03.04.2009


The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) are larger than 2 GB (3 to 5 GB being preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller.

The main applications are (not exhaustive):

  • Simulation: output files are very small compared to the CPU requirements of the jobs (typically 200 MB at present)
  • Stripping: after stripping and streaming, the output files should also be quite small for 24-hour jobs.

The merging of files is just a technical implementation detail; users should not see that this necessary step happened, and should concentrate on which software produced the data, not on the files themselves.

File merging workflow

In order to merge Gaudi/POOL/ROOT files, a Gaudi application must be used, while for MDF files a simple "cat" is sufficient. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb project environment is sufficient. For clarity of naming, the application name Merge will be used; it will be associated with setting up the LHCb project environment.
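For the MDF case, "cat"-style merging can be sketched as below. This is a minimal illustration, not the production implementation; the function name is an assumption, and it relies only on MDF being a self-delimiting raw format so that byte-for-byte concatenation yields a valid merged file (Gaudi/POOL/ROOT files require a real Gaudi merge job instead).

```python
import shutil

def merge_mdf(input_paths, output_path):
    """Concatenate raw MDF files byte-for-byte, as 'cat' would.

    Illustrative sketch only: MDF records are self-delimiting,
    so plain concatenation is a valid merge for this format.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Buffered binary copy of each input onto the output
                shutil.copyfileobj(src, out)
    return output_path
```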

Action (Joel): unknown application names should set up the LHCb project environment. The job options file will then define the actual action.

The workflow will consist of:

  • Running the merger application
  • Uploading the output file to a storage
  • Uploading the BK information
  • Clearing the input files (LFC and SE)
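The four steps above can be sketched as a single driver. All step callables here are hypothetical stand-ins for the real DIRAC workflow modules; the point of the sketch is only the ordering constraint: input files are cleared from the LFC and SE only after the merged output and its BK record have been uploaded.

```python
def run_merging_workflow(inputs, merge, upload_storage, upload_bk, clear_inputs):
    """Sketch of the four-step merging workflow (names are assumptions).

    The intermediate inputs are removed last, so a failure in any
    earlier step leaves them available for a retry.
    """
    merged = merge(inputs)        # run the Merge application
    upload_storage(merged)        # upload the output file to a storage
    upload_bk(merged, inputs)     # register output + provenance in the BK
    clear_inputs(inputs)          # clear the input files (LFC and SE)
    return merged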

Merging stripped DSTs presents an additional complication: an SETC pointing to the DST must be created. It is therefore proposed that the stripping job only produce DST fragments; the merging job would then merge the DSTs and produce the SETC as a subsequent step. This solves two problems at once: the merged ETC points to the right file, and a hierarchy is kept in the BK between the DST and the ETC. This procedure should also be applied when no merging is required, in order to solve the hierarchy problem.

Action (Marco, Joel): provide job options for creating an SETC from a DST. It is probably easier to use DaVinci; an additional goodie could be monitoring of the DST content (Marco, private communication).

*NEW* If the stripping is to produce multiple streams, they could even be generated at the merging step, which would reduce the data handling. The stripping application would produce a single temporary DST; the merging job would then merge and create the streams. The SETCs would be produced by subsequent steps (one per stream, as there are overlaps between streams).
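The stream creation at merging time can be sketched as below. The event representation and selector interface are assumptions for illustration; the sketch only shows why one SETC per stream is needed: an event may pass several stream selections, so the streams overlap.

```python
def split_streams(events, stream_selectors):
    """Split a merged DST's events into stripping streams.

    stream_selectors maps stream name -> predicate. An event may
    satisfy several predicates, so streams can overlap, which is
    why a separate SETC per stream is required.
    """
    streams = {name: [] for name in stream_selectors}
    for event in events:
        for name, selects in stream_selectors.items():
            if selects(event):
                streams[name].append(event)
    return streams
```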

Implications on the production system

Production definition

Whenever a production is defined for which merging is required, a sister production in charge of the merging should be defined, using the merging workflow described above. This production should be associated with a transformation that creates jobs based on the size of the files stored at a given Tier1. This requires a change in the transformation logic (in addition to using the number of files as a criterion). It is also suggested that the transformation have a timeout after which it is automatically flushed (a week or so), in order not to be left with unmerged files. Manual flushing will of course also be possible.
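The size-based grouping logic could look like the sketch below. The function name and signature are illustrative, not the actual transformation-agent API: a merge task is emitted once its input files reach a size threshold, and a trailing incomplete group is held back unless the transformation is flushed (manually or by the timeout).

```python
def group_by_size(files, min_task_size, flush=False):
    """Group (lfn, size) pairs into merge tasks by cumulative size.

    Illustrative sketch of a size-based transformation policy:
    a task is created once min_task_size bytes are accumulated;
    the last, incomplete group is only released when flush=True.
    """
    tasks, current, current_size = [], [], 0
    for lfn, size in files:
        current.append(lfn)
        current_size += size
        if current_size >= min_task_size:
            tasks.append(current)
            current, current_size = [], 0
    if current and flush:
        # Flushing releases the remaining files even if undersized
        tasks.append(current)
    return tasks
```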

Action (Andrei): define a transformation policy based on amount of data.

Suggestion: automatic flushing should be implemented only after experience has been gained with manual flushing. Note that flushing should not stop the production. It is still to be discussed whether there will be a "flushing" state or whether the transformation will remain "active".

The definition of which files are to be merged should simply be based on the initial production number. The list of files will be obtained from a BK query giving the production number and the file type (e.g. DST_TMP).

Action (Zoltan): define TMP types in the BK (or implement an automatic way to number them, e.g. use -?) and make this type invisible in the GUI. They should nevertheless be accessible through a direct query (e.g. "Production=1000, FileType=DST_TMP").
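Mirroring the direct query quoted above, the selection of unmerged files could be expressed as a simple query dictionary. The dict keys follow the example in the text; the function name and the assumption that the BK client accepts such a dictionary are illustrative.

```python
def tmp_file_query(production, file_type="DST_TMP"):
    """Build a BK query for a production's unmerged temporary files.

    Corresponds to the direct query "Production=1000,
    FileType=DST_TMP"; the dict shape is an assumption about the
    BK client interface, not its actual API.
    """
    return {"Production": production, "FileType": file_type}
```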

NEW: For the namespace as well as for BK reporting, the merging production should use the production number of the initial production. The actual WMS production number of the merger will not appear in the BK except in the job name. This way the processing pass will be associated with the original production, and users will not notice that there was an additional merging operation.

Initial production jobs

When a production requires merging of its output data, files should be uploaded temporarily from the job to a T0D1 storage (any), until they are merged, after which they can be deleted. They should be registered in the BK for provenance information, as well as in the LFC so that the merging job can find them. The suggested SE for this upload is Tier1-Merge (to be defined in the CS), which could currently use the space token LHCb_FAILOVER. The regular namespace should be used, with the temporary file type (e.g. /lhcb/data/2008/DST_TMP/00001000/......)
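The LFN construction in the regular namespace can be sketched as below. The filename parameter and the 8-digit zero-padding of the production number are assumptions inferred from the example path above (/lhcb/data/2008/DST_TMP/00001000/...), not a confirmed convention.

```python
def temporary_lfn(year, file_type, production, filename):
    """Build an LFN for a temporary (to-be-merged) output file.

    Sketch only: follows the regular-namespace pattern from the
    example path, with an assumed 8-digit production number and a
    hypothetical filename argument.
    """
    return "/lhcb/data/%d/%s/%08d/%s" % (year, file_type, production, filename)
```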

Distribution policy

The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1-Merge SE (with preference for the closest one). Intermediate files should be removed after completion of the merging, as described in the workflow section.

Processing Pass for Bookkeeping

We should not forget that what the user sees when querying the BK is the "Processing Pass", which indicates all successive actions that created the data. Each production is associated with a Processing Pass. Merging, however, does not need to appear to the user, as it is a technical detail. Therefore the initial production will not be registered under its corresponding processing pass; only the merging production will be registered, as if it were the original production. The applications within the jobs, however, will reflect the correct processing tree.
Topic revision: r7 - 2009-04-03 - PhilippeCharpentier