Difference: FileMerging (1 vs. 7)

Revision 7 - 2009-04-03 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Automatic File merging

History

Line: 8 to 8
 
  • Updated: 24.10.2008
  • Reviewed at DIRAC3 week: 20.11.2008
  • Updated: 20.11.2008
Added:
>
>
  • Changed handling of BK: 03.04.2009
 

Aim

The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) be larger than 2 GB (3 to 5 GB being preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...
Line: 57 to 58
 The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1-Merge SE (preference to the closest one). Intermediate files should be removed after completion of the merging as described in the workflow section.

Processing Pass for Bookkeeping

Changed:
<
<
We should not forget that what the user sees when querying the BK is the "Processing Pass", which indicates all successive actions that created it. Each production is associated with a ProcessingPass. Merging, however, doesn't need to appear to the user, as this is a technical detail. The initial production is the only one that will be known to the BK, and thus be associated with the effective processing group/index. The merging step will just not appear in the processing pass table, but the merging jobs will be present in the processing tree and have their "Production" attribute set to that of the initial production.

Action (Joel): implement a scheme for passing the initial production number to the workflow and use it instead of the actual production number.

For example, if the stripping production is 1000 and the merging production 1001, all files and jobs will have the "Production" attribute set to 1000 in the BK. Only the name of the merging jobs will, by convention, reflect that they come from 1001 (e.g. 00001001_00001).

>
>
We should not forget that what the user sees when querying the BK is the "Processing Pass", which indicates all successive actions that created it. Each production is associated with a ProcessingPass. Merging, however, doesn't need to appear to the user, as this is a technical detail. Therefore the initial production will not be registered under its corresponding processing pass; only the merging production will be registered, as if it were the original production. The applications within the jobs, however, will reflect the correct processing tree.

Revision 6 - 2008-11-20 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Automatic File merging

History

Line: 6 to 6
 
  • Updated: 06.10.2008
  • Discussed at PASTE meeting: 07.10.2008
  • Updated: 24.10.2008
Added:
>
>
  • Reviewed at DIRAC3 week: 20.11.2008
  • Updated: 20.11.2008
 

Aim

The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) be larger than 2 GB (3 to 5 GB being preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...
Changed:
<
<
The main applications are:
>
>
The main applications are (not exhaustive):
 
  • Simulation: output files are very small compared to CPU requirements of jobs (typically 200 MB now)
  • Stripping: after stripping and streaming, output files of the stripping should also be quite small for 24-hour jobs.
Added:
>
>
The merging of files is just a technical implementation detail, and thus the users should not see that this necessary step happened and should concentrate on which software produced the data, not the file.
 

File merging workflow

Changed:
<
<
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used, while for MDF files a simple "cat" should be used. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one should envisage that the application name Merge be associated with setting up the LHCb project environment.
>
>
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used, while for MDF files a simple "cat" should be used. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, the application name Merge will be used. It will be associated with setting up the LHCb project environment.

Action (Joel): unknown application names should set up the LHCb project environment. The job options file will actually define the action.
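As an illustration, a generic merge options file could look like the following minimal sketch. The configurable names and I/O descriptor strings are assumptions modelled on standard Gaudi conventions; the actual generic options file referred to above may differ.

```python
# Hypothetical sketch of generic Gaudi merge options (names are assumptions;
# the real generic options file may differ). The job simply reads every event
# from the input files and copies it unchanged to a single output file.
from Gaudi.Configuration import ApplicationMgr, EventSelector
from Configurables import InputCopyStream

EventSelector().Input = [
    "DATAFILE='PFN:input_1.dst' TYP='POOL_ROOTTREE' OPT='READ'",
    "DATAFILE='PFN:input_2.dst' TYP='POOL_ROOTTREE' OPT='READ'",
]

merger = InputCopyStream("Merger")
merger.Output = "DATAFILE='PFN:merged.dst' TYP='POOL_ROOTTREE' OPT='RECREATE'"

# Run over all events of all inputs; no algorithm is needed beyond the stream.
ApplicationMgr(OutStream=[merger], EvtMax=-1)
```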

  The workflow will consist of:

  • Running the merger application
Changed:
<
<
  • Upload the output file (the distribution policy being defined in the CS)
>
>
  • Upload the output file to a storage
 
  • Upload the BK information
  • Clear the input files (LFC and SE)

Merging stripped DSTs presents an additional complication, due to the fact that we need to create an SETC pointing to the DST. For this it is proposed that the stripping job only produces DST fragments; the merging job will then merge the DSTs and produce the SETC as a subsequent step. This would solve two problems at once: having a merged ETC that points to the right file, and keeping a hierarchy in the BK between the DST and the ETC. This procedure should be applied as well in case no merging is required, in order to solve the hierarchy problem.
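The four workflow steps above can be sketched as follows. All helper names are hypothetical (the real DIRAC workflow modules differ); for MDF files the merger itself is just the byte-wise concatenation ("cat") mentioned earlier.

```python
# Sketch of the merging workflow (hypothetical helper names, for illustration).
import shutil

def merge_mdf(inputs, output):
    """MDF files are raw byte streams, so merging is plain concatenation."""
    with open(output, "wb") as out:
        for name in inputs:
            with open(name, "rb") as f:
                shutil.copyfileobj(f, out)

def run_merging_workflow(inputs, output, upload, register_bk, remove_input):
    merge_mdf(inputs, output)        # 1. run the merger application
    upload(output)                   # 2. upload the output file to a storage
    register_bk(output, inputs)      # 3. upload the BK (provenance) information
    for name in inputs:              # 4. clear the input files (LFC and SE)
        remove_input(name)
```

For Gaudi/POOL/ROOT files the first step would instead launch the merge application; the surrounding upload/register/cleanup logic stays the same.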

Added:
>
>
Action (Marco, Joel): provide job options for creating an SETC from a DST. It is probably easier to use DaVinci; an additional goodie could be to monitor the DST content (Marco, private communication).

*NEW* If stripping is to produce multiple streams, they could even be generated at the merging step, which would reduce the data handling. The stripping application would produce a single temporary DST. The merging job would then merge and create the streams. The SETCs would be produced by subsequent steps (one per stream as there are overlaps between streams).

 

Implications on the production system

Production definition

Whenever a production is defined for which merging is required, a sister production in charge of merging should be defined, using the merging workflow defined above. This production should be associated with a transformation that creates jobs based on the size of files stored at a given Tier1. This needs a change in the transformation logic (besides using the number of files as a criterion). In addition, it is suggested that the transformation have a timeout after which it is automatically flushed (could be a week or so), in order not to be left with unmerged files. Manual flushing will of course also be possible.
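A minimal sketch of the size-based grouping logic described above (the function shape and threshold are illustrative assumptions, not the actual transformation agent code): accumulate files until the target size is reached, and on a flush release whatever remains even if below threshold.

```python
# Sketch of size-based task creation for the merging transformation.
def group_merge_tasks(files, target_bytes, flush=False):
    """files: list of (lfn, size_in_bytes). Returns a list of merge tasks,
    each task being the list of LFNs to merge into one output file."""
    tasks, current, size = [], [], 0
    for lfn, sz in files:
        current.append(lfn)
        size += sz
        if size >= target_bytes:
            tasks.append(current)
            current, size = [], 0
    if flush and current:
        # Flushing (manual, or via the suggested timeout) releases the
        # leftover group so no files stay unmerged indefinitely.
        tasks.append(current)
    return tasks
```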
Changed:
<
<
The definition of which files to be merged should be simply based on the initial production number. The list of files can be obtained either from a query to the BK (preferred) or using the file naming convention and the LFC.
>
>
Action (Andrei): define a transformation policy based on amount of data.

Suggestion: automatic flushing should be implemented only after experience with manual flushing. Note that flushing should not stop the production. It is still to be discussed whether there will be a "flushing" state or whether the transformation will remain "active".

The definition of which files are to be merged should simply be based on the initial production number. The list of files will be obtained from a query to the BK, giving the production number and the file type (e.g. DST_TMP).

Action (Zoltan): define TMP types in the BK (or implement an automatic way to number them, e.g. use -?) and make this type invisible in the GUI. They should, though, be accessible through a direct query (e.g. "Production=1000, FileType=DST_TMP")
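As a toy illustration of such a direct query (the record layout below is an assumption for illustration, not the Bookkeeping schema):

```python
# Toy model of a direct BK query like "Production=1000, FileType=DST_TMP".
def bk_direct_query(records, production, file_type):
    """Return the LFNs of records matching the given production number and
    (GUI-invisible) temporary file type."""
    return [r["LFN"] for r in records
            if r["Production"] == production and r["FileType"] == file_type]
```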

NEW: For the namespace as well as for BK reporting, the merging production should use the production number of the initial production. The actual WMS production number of the merger will not appear in the BK except in the job name. This way the processing pass will be associated with the original production, and users will not notice that there was an additional operation for merging.
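The convention above can be sketched as follows; the padding widths are read off the example job name 00001001_00001 elsewhere on this page, and the record layout is purely illustrative.

```python
# Sketch: in the BK, merging jobs carry the *initial* production number as
# their "Production" attribute; only the job name encodes the actual (WMS)
# merging production. Padding widths inferred from the example 00001001_00001.
def bk_job_record(initial_production, merging_production, job_index):
    return {
        "Production": initial_production,
        "JobName": "%08d_%05d" % (merging_production, job_index),
    }
```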

 

Initial production jobs

Changed:
<
<
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got_Replica" flag of the BK should preferably not be set (but see the Processing Pass section) such that these files are not visible to users. The suggested SE for this upload is Tier1_Merge, which uses the space token LHCb_FAILOVER with a specific SAPath (e.g. /lhcb/merge)
>
>
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage (any), until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. The suggested SE for this upload is Tier1-Merge (to be defined in the CS), which could currently use the space token LHCb_FAILOVER. The regular namespace should be used, with the temporary file type (e.g. /lhcb/data/2008/DST_TMP/00001000/......)
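A sketch of that temporary-namespace convention; the directory layout and zero-padding are read off the example path above (which is truncated in the text, so this helper stops at the production directory):

```python
# Sketch of the temporary-namespace directory for unmerged files,
# e.g. /lhcb/data/2008/DST_TMP/00001000/...
def tmp_namespace_dir(year, file_type, production):
    return "/lhcb/data/%d/%s/%08d" % (year, file_type, production)
```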
 

Distribution policy

Changed:
<
<
The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1_Merge SE (preference to the closest one). Intermediate files should be removed after completion of the merging as described in the workflow section.
>
>
The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1-Merge SE (preference to the closest one). Intermediate files should be removed after completion of the merging as described in the workflow section.
 

Processing Pass for Bookkeeping

Changed:
<
<
We should not forget that what the user sees when querying the BK is the "Processing Pass" which indicates all successive actions that created it. Each production is associated with a ProcessingPass. Merging however (which is a Processing Pass ;-)) doesn't need to appear to the user as this is a technical detail. Therefore probably only the merging production number should be declared to the BK, including all steps description. The initial production is only an intermediate step unknown to the BK. If this is the case, then the Got_Replica flag of intermediate files can be set and removed when the file is deleted, as the file will never be exposed to users, unless they query the BK by production number, which is unlikely, except for experts (and the productionDB transformation agents).
>
>
We should not forget that what the user sees when querying the BK is the "Processing Pass", which indicates all successive actions that created it. Each production is associated with a ProcessingPass. Merging, however, doesn't need to appear to the user, as this is a technical detail. The initial production is the only one that will be known to the BK, and thus be associated with the effective processing group/index. The merging step will just not appear in the processing pass table, but the merging jobs will be present in the processing tree and have their "Production" attribute set to that of the initial production.

Action (Joel): implement a scheme for passing the initial production number to the workflow and use it instead of the actual production number.

 
Changed:
<
<
-- PhilippeCharpentier - 15 Sep 2008
>
>
For example, if the stripping production is 1000 and the merging production 1001, all files and jobs will have the "Production" attribute set to 1000 in the BK. Only the name of the merging jobs will, by convention, reflect that they come from 1001 (e.g. 00001001_00001).

Revision 5 - 2008-10-24 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Automatic File merging

History

  • Created: 15.09.2008
  • Updated: 06.10.2008
Changed:
<
<
  • Discussed at PASTE meeting: 07.10.2008
>
>
 
  • Updated: 24.10.2008

Aim

Line: 34 to 34
 The definition of which files to be merged should be simply based on the initial production number. The list of files can be obtained either from a query to the BK (preferred) or using the file naming convention and the LFC.

Initial production jobs

Changed:
<
<
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got_Replica" flag of the BK should preferably not be set such that these files are not visible to users. The suggested SE for this upload is Tier1_Merge, which uses the space token LHCb_FAILOVER with a specific SAPath (e.g. /lhcb/merge)
>
>
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got_Replica" flag of the BK should preferably not be set (but see the Processing Pass section) such that these files are not visible to users. The suggested SE for this upload is Tier1_Merge, which uses the space token LHCb_FAILOVER with a specific SAPath (e.g. /lhcb/merge)
 

Distribution policy

The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1_Merge SE (preference to the closest one). Intermediate files should be removed after completion of the merging as described in the workflow section.
Added:
>
>

Processing Pass for Bookkeeping

We should not forget that what the user sees when querying the BK is the "Processing Pass" which indicates all successive actions that created it. Each production is associated with a ProcessingPass. Merging however (which is a Processing Pass ;-)) doesn't need to appear to the user as this is a technical detail. Therefore probably only the merging production number should be declared to the BK, including all steps description. The initial production is only an intermediate step unknown to the BK. If this is the case, then the Got_Replica flag of intermediate files can be set and removed when the file is deleted, as the file will never be exposed to users, unless they query the BK by production number, which is unlikely, except for experts (and the productionDB transformation agents).
 -- PhilippeCharpentier - 15 Sep 2008

Revision 4 - 2008-10-24 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Changed:
<
<

Automatic File merging (proposal)

>
>

Automatic File merging

History

  • Created: 15.09.2008
  • Updated: 06.10.2008
  • Discussed at PASTE meeting: 07.10.2008
  • Updated: 24.10.2008

Aim

 The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) are of a size larger than 2 GB (3 to 5 being even preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...

The main applications are:

Line: 9 to 16
 
  • Stripping: after stripping and streaming, output files of the stripping should also be quite small for 24-hour jobs.

File merging workflow

Changed:
<
<
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used, while for MDF files a simple "cat" should be used. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one could envisage that the application name Merge be associated with setting up the LHCb project environment... One can also have a CMT project LHCbMerge that would just be used for that.
>
>
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used, while for MDF files a simple "cat" should be used. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one should envisage that the application name Merge be associated with setting up the LHCb project environment.
  The workflow will consist of:

  • Running the merger application
Changed:
<
<
  • Upload the output file and set the transfer requests according to the distribution policy
>
>
  • Upload the output file (the distribution policy being defined in the CS)
 
  • Upload the BK information
  • Clear the input files (LFC and SE)
Line: 22 to 29
 

Implications on the production system

Production definition

Changed:
<
<
Whenever a production is defined for which merging is required, a sister production in charge of merging should be defined, using the merging workflow defined above. This production should be associated to a transformation that creates jobs based on the amount of data already stored at a given Tier1. This needs a change in the transformation logic (besides using the number of files as criterion). In addition, it is suggested that the transformation has a timeout after which the transformation is automatically flushed (could be several days) in order not to remain with unmerged files. Note: should this be more widely used?
>
>
Whenever a production is defined for which merging is required, a sister production in charge of merging should be defined, using the merging workflow defined above. This production should be associated to a transformation that creates jobs based on the size of files stored at a given Tier1. This needs a change in the transformation logic (besides using the number of files as criterion). In addition, it is suggested that the transformation has a timeout after which the transformation is automatically flushed (could be a week or so) in order not to remain with unmerged files. Manual flushing will of course also be possible.

The definition of which files to be merged should be simply based on the initial production number. The list of files can be obtained either from a query to the BK (preferred) or using the file naming convention and the LFC.

 

Initial production jobs

Changed:
<
<
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got replica" flag of the BK should preferably not be set such that these files are not visible to users. The suggested SE for this upload is Tier1_FAILOVER.
>
>
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got_Replica" flag of the BK should preferably not be set such that these files are not visible to users. The suggested SE for this upload is Tier1_Merge, which uses the space token LHCb_FAILOVER with a specific SAPath (e.g. /lhcb/merge)
 

Distribution policy

Changed:
<
<
The data distribution policy should be executed by the merging production, while the initial production implements a dedicated distribution policy as discussed above. Intermediate files should be removed after completion of the merging as described in the workflow section.
>
>
The data distribution policy should be executed after the merging, while the initial production only uploads to any Tier1_Merge SE (preference to the closest one). Intermediate files should be removed after completion of the merging as described in the workflow section.
  -- PhilippeCharpentier - 15 Sep 2008

Revision 3 - 2008-09-29 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Changed:
<
<

Automatic File merging (proposal)

The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) are of a size larger than 2 GB (3 to 5 being even preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...

>
>

Automatic File merging (proposal)

The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) are of a size larger than 2 GB (3 to 5 being even preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...
  The main applications are:

  • Simulation: output files are very small compared to CPU requirements of jobs (typically 200 MB now)
  • Stripping: after stripping and streaming, output files of the stripping should also be quite small for 24-hour jobs.
Changed:
<
<

File merging workflow

>
>

File merging workflow

In order to merge Gaudi/POOL/Root files, a Gaudi application must be used, while for MDF files a simple "cat" should be used. For Gaudi, a generic options file exists that allows merging any ROOT file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one could envisage that the application name Merge be associated with setting up the LHCb project environment... One can also have a CMT project LHCbMerge that would just be used for that.
 
Changed:
<
<
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used. A generic options file exists that allows merging any such file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one could envisage that the application Merge be associated with setting up the LHCb project environment...
>
>
The workflow will consist of:
 
Changed:
<
<

Production jobs

>
>
  • Running the merger application
  • Upload the output file and set the transfer requests according to the distribution policy
  • Upload the BK information
  • Clear the input files (LFC and SE)
 
Changed:
<
<
When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information, in the LFC such that the merging job can find them. However the "Got replica" flag of the BK should not be set such that these files are not visible to users. The suggested SE for this upload is LHCb_FAILOVER at all Tier1's.
>
>
Merging stripped DSTs presents an additional complication, due to the fact that we need to create an SETC pointing to the DST. For this it is proposed that the stripping job only produces DST fragments; the merging job will then merge the DSTs and produce the SETC as a subsequent step. This would solve two problems at once: having a merged ETC that points to the right file, and keeping a hierarchy in the BK between the DST and the ETC. This procedure should be applied as well in case no merging is required, in order to solve the hierarchy problem.
 
Added:
>
>

Implications on the production system

 

Production definition

Deleted:
<
<
 Whenever a production is defined for which merging is required, a sister production in charge of merging should be defined, using the merging workflow defined above. This production should be associated to a transformation that creates jobs based on the amount of data already stored at a given Tier1. This needs a change in the transformation logic (besides using the number of files as criterion). In addition, it is suggested that the transformation has a timeout after which the transformation is automatically flushed (could be several days) in order not to remain with unmerged files. Note: should this be more widely used?
Changed:
<
<
The data distribution policy should be executed by the merging production, while the initial production implements a dedicated distribution policy (e.g. LHCb_FAILOVER).
>
>

Initial production jobs

When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information as well as in the LFC such that the merging job can find them. However the "Got replica" flag of the BK should preferably not be set such that these files are not visible to users. The suggested SE for this upload is Tier1_FAILOVER.
 
Added:
>
>

Distribution policy

The data distribution policy should be executed by the merging production, while the initial production implements a dedicated distribution policy as discussed above. Intermediate files should be removed after completion of the merging as described in the workflow section.
-- PhilippeCharpentier - 15 Sep 2008

Revision 2 - 2008-09-22 - PhilippeCharpentier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Automatic File merging (proposal)

Line: 9 to 9
 
  • Simulation: output files are very small compared to CPU requirements of jobs (typically 200 MB now)
  • Stripping: after stripping and streaming, output files of the stripping should also be quite small for 24-hour jobs.
Changed:
<
<
Under construction...
>
>

File merging workflow

 
Added:
>
>
In order to merge Gaudi/POOL/Root files, a Gaudi application must be used. A generic options file exists that allows merging any such file. There is no need to use any particular application; setting up the LHCb-project environment is sufficient. For clarity of naming, one could envisage that the application Merge be associated with setting up the LHCb project environment...

Production jobs

When a production needs merging of the output data, files should be uploaded temporarily from the job to a T0D1 storage, until they are merged, after which they can be deleted. They should be registered in the BK for provenance information, in the LFC such that the merging job can find them. However the "Got replica" flag of the BK should not be set such that these files are not visible to users. The suggested SE for this upload is LHCb_FAILOVER at all Tier1's.

Production definition

Whenever a production is defined for which merging is required, a sister production in charge of merging should be defined, using the merging workflow defined above. This production should be associated to a transformation that creates jobs based on the amount of data already stored at a given Tier1. This needs a change in the transformation logic (besides using the number of files as criterion). In addition, it is suggested that the transformation has a timeout after which the transformation is automatically flushed (could be several days) in order not to remain with unmerged files. Note: should this be more widely used?

The data distribution policy should be executed by the merging production, while the initial production implements a dedicated distribution policy (e.g. LHCb_FAILOVER).

 

-- PhilippeCharpentier - 15 Sep 2008

Revision 1 - 2008-09-15 - PhilippeCharpentier

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="LHCbComputing"

Automatic File merging (proposal)

The aim of this page is to present a roadmap to automatic file merging in production. Sites require that files stored on SEs (in particular T1Dx) are of a size larger than 2 GB (3 to 5 being even preferred). The advantage at the LHCb level is that the number of files to deal with at the end is much smaller...

The main applications are:

  • Simulation: output files are very small compared to CPU requirements of jobs (typically 200 MB now)
  • Stripping: after stripping and streaming, output files of the stripping should also be quite small for 24-hour jobs.

Under construction...

-- PhilippeCharpentier - 15 Sep 2008

 