
CRAB Advanced topics


Merging output files

With many jobs, handling output files manually can be quite tedious. To simplify further processing, output files can be merged with the following methods.

TFileService output

Several plain ROOT files can be easily merged with the hadd utility included with ROOT and CMSSW:

hadd mergedfile.root file1.root file2.root ... fileN.root
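
If the files follow a common naming pattern, a shell glob can be used so that each file does not have to be listed by hand (hadd accepts any number of input files):

hadd mergedfile.root file*.root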

EDM output

Paste the following configuration into a file merge.py

import FWCore.ParameterSet.Config as cms
from FWCore.ParameterSet.VarParsing import VarParsing

options = VarParsing('analysis')
options.register('loginterval', 1000, mytype=VarParsing.varType.int)
options.parseArguments()

process = cms.Process("PickEvent")

process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = options.loginterval

process.source = cms.Source ("PoolSource",
        fileNames = cms.untracked.vstring(options.inputFiles),
        duplicateCheckMode = cms.untracked.string('noDuplicateCheck')
)

process.out = cms.OutputModule("PoolOutputModule",
        fileName = cms.untracked.string(options.outputFile)
)

process.end = cms.EndPath(process.out)
The output can then be merged with:

cmsRun merge.py outputFile=merged.root inputFiles=file:file1.root,file:file2.root

Or, if many files are to be merged at once:

ls file*.root | sed 's/^/file:/' > list.txt
cmsRun merge.py outputFile=merged.root inputFiles_load=list.txt

Running MC generation on LHE files

CRAB contains basic support for running MC generation on LHE files.

How to use LHE files in input

The LHE input files should be specified in the CMSSW parameter-set configuration file as an LHE source (i.e. as input files to be read and run over by the CMSSW code). If the LHE input files are on the user's local machine and have to be sent to the worker nodes, their paths should be added to the JobType.inputFiles parameter of the CRAB configuration file, so that CRAB collects them into the input sandbox that is sent to the worker nodes. Assuming the files are named file1.lhe, file2.lhe, ..., fileN.lhe (and that they are located in the current working directory from where the crab submit command will be executed), the line to be added to the CRAB configuration file is:

config.JobType.inputFiles = ['file1.lhe', 'file2.lhe', ..., 'fileN.lhe']
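
Since the CRAB configuration file is ordinary Python, the list does not have to be typed by hand. For example, a minimal sketch using the glob module (assuming the .lhe files sit in the current working directory from where crab submit is executed):

from glob import glob
config.JobType.inputFiles = sorted(glob('*.lhe'))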

Then one has to specify in the CMSSW parameter-set configuration file that these files should be taken as an LHE source:

process.source = cms.Source("LHESource",
    fileNames = cms.untracked.vstring([
        'file:file1.lhe',
        'file:file2.lhe',
        ...
        'file:fileN.lhe'
    ])
)

If the LHE files are too large to fit in the input sandbox, they should be moved to a place that is globally accessible (e.g. via xrootd). Alternatively, if they are available only in a site-specific directory, CRAB needs to be configured to send the jobs to that site only (use the Site.whitelist parameter in the CRAB configuration). In any case, if the LHE files are not on the user's local machine, they should NOT be added to the JobType.inputFiles parameter of the CRAB configuration; instead, the CMSSW parameter-set configuration should include their LFNs or PFNs (one can prepend the xrootd redirector to make a PFN out of an LFN) in the LHE source. For example:

1) using LFNs without xrootd redirector:

process.source = cms.Source("LHESource",
    fileNames = cms.untracked.vstring([
        '/store/user/atanasi/my_LHE_files/file1.lhe',
        '/store/user/atanasi/my_LHE_files/file2.lhe',
        ...
        '/store/user/atanasi/my_LHE_files/fileN.lhe'
    ])
)

2) using LFNs with xrootd redirector:

process.source = cms.Source("LHESource",
    fileNames = cms.untracked.vstring([
        'root://cms-xrd-global.cern.ch//store/user/atanasi/my_LHE_files/file1.lhe',
        'root://cms-xrd-global.cern.ch//store/user/atanasi/my_LHE_files/file2.lhe',
        ...
        'root://cms-xrd-global.cern.ch//store/user/atanasi/my_LHE_files/fileN.lhe'
    ])
)

If an xrootd redirector is not prepended, local (at the worker node) file access will be tried first and in case of file open error the system should fallback to AAA. If an xrootd redirector is prepended, no local access will be tried and AAA will be used directly.

The rest of the CRAB configuration file is like running MC generation. Here is an example of a CRAB configuration file to run MC generation on LHE files that the user will send in the input sandbox:

from CRABClient.UserUtilities import config
config = config()

config.General.requestName = 'MC_generation_LHE'

config.JobType.pluginName = 'PrivateMC'
config.JobType.psetName = 'pset_MC_generation_LHE.py'
config.JobType.inputFiles = ['file1.lhe', 'file2.lhe', ..., 'fileN.lhe']

config.Data.outputPrimaryDataset = <primary-dataset-name>
config.Data.splitting = 'EventBased'
config.Data.unitsPerJob = 2000
NJOBS = 1000
config.Data.totalUnits = config.Data.unitsPerJob * NJOBS
config.Data.publication = True
config.Data.outputDatasetTag = 'MC_generation_LHE'

config.Site.storageSite = <storage-site>

Note: Each luminosity section in the output files will contain as many events as specified in the JobType.eventsPerLumi parameter (100 by default).
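
For example, to put 500 events in each luminosity section instead of the default 100, add to the CRAB configuration file (the value 500 is purely illustrative):

config.JobType.eventsPerLumi = 500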

Job splitting on LHE files

There is no pre-processing of the LHE input files to determine how many events each file contains. All LHE input files are passed to all jobs, and the job splitting is done purely according to what is specified in the CRAB configuration. Each job then skips events according to its job number while reading the input files up to the ones it has to process. This is better explained with the example below:

   * 2 LHE files, 150 events each
   * totalUnits = 300, unitsPerJob = 100

   - Job 1 is configured to have skipEvents = 0, maxEvents = 100; it reads only the first LHE file
   - Job 2 has skipEvents = 100, maxEvents = 100; it reads the first and the second LHE file
     (it skips the first 100 events of the first file and processes the remaining 50,
     plus the first 50 events of the second file)
   - Job 3 has skipEvents = 200, maxEvents = 100; it reads the first file but skips all of its
     events, then skips the first 50 events of the second file and processes its remaining 100 events

The above example scales up. E.g. with 50 files of 1000 events each and unitsPerJob = 100, the last 10 jobs would each read through the previous 49 files, skipping events until they reach the ones they need to process.
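
In terms of the tweaked PSet.py, this splitting boils down to a skipEvents offset in the source plus a maxEvents limit. Below is a rough sketch of what the configuration effectively looks like for Job 2 of the example above; the exact tweaking is done internally by CRAB, so this is only an illustration of the quoted skipEvents/maxEvents values:

import FWCore.ParameterSet.Config as cms

process = cms.Process("SKETCH")

# All LHE files are listed for every job; the per-job offset is expressed via skipEvents.
process.source = cms.Source("LHESource",
    fileNames = cms.untracked.vstring('file:file1.lhe', 'file:file2.lhe'),
    skipEvents = cms.untracked.uint32(100)   # Job 2 skips the first 100 events
)

# Each job processes at most unitsPerJob events.
process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(100)
)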

Running MC generation with pileup

This is more a CMSSW topic than a CRAB one. Usually such an MC workflow has a signal data sample as input and is therefore submitted like any normal analysis job. In addition, cmsRun will read the pileup sample, but this is not what drives CRAB job splitting etc. There are ways to automatically configure cmsRun for the pileup using e.g. cmsDriver, see https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCmsDriver (see also https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1714/1/2.html ).

There is one very important CRAB aspect, though.

When you run your jobs on the distributed system (aka Grid) you have to be careful with the locations of data and jobs. The best case is of course when the pileup and the signal dataset are located at the same site, so the jobs also run there and all data access is local. But if signal and pileup are at different sites, at least one of them needs to be read remotely via AAA (i.e. xrootd), and this can create severe performance issues and site overload. If you need to explore this scenario, it is better to force the CRAB jobs to run at the site which hosts the pileup, since that is where the majority of the data will be read from (assuming the pileup sample is not premixed).

Summary:

  • Always try to run CRAB jobs where the pileup sample is, not where the signal sample is
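
A minimal sketch of how to pin the task to the site hosting the pileup via the CRAB configuration (the site name is just a placeholder):

config.Site.whitelist = ['T2_XX_PileupSite']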

Running a user script with CRAB; the scriptExe parameter

The JobType.scriptExe parameter in the CRAB configuration file allows the user to run an arbitrary script in place of cmsRun. In fact, the CRAB3 job wrapper executes the following steps on the worker node:

  1. It sets up the CMSSW environment (basically running cmsrel and cmsenv).
  2. It tweaks the CMSSW parameter-set configuration file provided by the user (JobType.psetName parameter in the CRAB configuration file) by setting up the input files and the input lumis the job should analyze.
  3. It executes cmsRun -j FrameworkJobReport.xml -p PSet.py, where PSet.py is the tweaked CMSSW parameter-set configuration file.
  4. It parses the framework job report produced by cmsRun looking for OutputModule and TFileService outputs. These output files will be staged out to the user-specified storage site (unless the user disabled the transfers by setting General.transferOutputs = False). In addition, publication of PoolOutputModule output files will also be performed (unless the user disabled it by setting Data.publication = False).

By means of the JobType.scriptExe parameter, the user can alter the behavior in step 3 and execute an alternative script instead of cmsRun. To be precise, the CRAB job wrapper will

  3. Alternative: execute scriptExe jobId <scriptArgs>, where scriptExe is the script name passed in the CRAB configuration, jobId is the number of this job in the task, and <scriptArgs> are optional additional arguments which can be specified in the CRAB configuration file via the JobType.scriptArgs parameter.

The CMSSW parameter-set configuration file will still be tweaked and the CMSSW environment will still be set up. The user just has to make sure that the script produces a valid FrameworkJobReport.xml file (e.g. by running cmsRun -j FrameworkJobReport.xml -p PSet.py inside the script).
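
For reference, a minimal sketch of the relevant CRAB configuration lines (the script name and the arguments are just examples):

config.JobType.scriptExe = 'myscript.sh'
config.JobType.scriptArgs = ['option1=foo', 'option2=bar']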

The next subsections give some simple examples of how to run a user script via the JobType.scriptExe parameter. The examples make use of the CRAB configuration file crabConfig_tutorial_Data_analysis.py defined in section CRAB configuration file to run on Data and the CMSSW parameter-set configuration file pset_tutorial_analysis.py defined in section CMSSW configuration file to process an existing dataset.

Simple Hello World example

In this first example we show how to use the JobType.scriptExe parameter with a simple wrapper on top of cmsRun that prints some information. We basically execute the same command as the non-scriptExe version (i.e. cmsRun -j FrameworkJobReport.xml -p PSet.py) and add some echo commands at the beginning of the script.

The line that needs to be added in the CRAB configuration file is the following:

config.JobType.scriptExe = 'myscript.sh'

The script myscript.sh that we are going to execute is the following:

# Actually, let's do something more useful than a simple hello world... this will print the input arguments passed to the script
echo "Here there are all the input arguments"
echo $@

# If you are curious, you can have a look at the tweaked PSet. This however won't give you any information...
echo "================= PSet.py file =================="
cat PSet.py

# This is what you need if you want to look at the tweaked parameter set!!
echo "================= Dumping PSet ===================="
python -c "import PSet; print PSet.process.dumpPython()"

# Ok, let's stop fooling around and execute the job:
cmsRun -j FrameworkJobReport.xml -p PSet.py

This script will print something like this:

==== CMSSW Stack Execution FINISHING at Thu Nov 13 21:02:26 2014 ====
======== CMSSW OUTPUT STARTING ========
== CMSSW:  Log for recording SCRAM command-line output
== CMSSW:  -------------------------------------------
== CMSSW:  Here there are all the input arguments
== CMSSW:  1
== CMSSW:  ================= PSet.py file ==================
== CMSSW:  import FWCore.ParameterSet.Config as cms
== CMSSW:  import pickle
== CMSSW:  handle = open('PSet.pkl', 'rb')
== CMSSW:  process = pickle.load(handle)
== CMSSW:  handle.close()
== CMSSW:  ================= Dumping PSet ====================
== CMSSW:  import FWCore.ParameterSet.Config as cms
== CMSSW:  
== CMSSW:  process = cms.Process("NoSplit")
== CMSSW:  
== CMSSW:  process.source = cms.Source("PoolSource",
== CMSSW:      fileNames = cms.untracked.vstring('/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0003/E0DDC6AB-60D6-E111-B77A-0030487CDA4C.root', 
== CMSSW:          '/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0003/7C8CC10C-60D6-E111-89B9-BCAEC54B303A.root', 
== CMSSW:          '/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0003/F24221AB-62D6-E111-BC65-485B39800B75.root'),
== CMSSW:      lumisToProcess = cms.untracked.VLuminosityBlockRange("193834:1-193834:2", "193834:5-193834:8", "193834:21-193834:24", "193834:29-193834:30", "193834:33-193834:34", 
== CMSSW:          "193834:3-193834:4", "193834:9-193834:12")
== CMSSW:  )
== CMSSW:  process.output = cms.OutputModule("PoolOutputModule",
== CMSSW:      outputCommands = cms.untracked.vstring('drop *', 
== CMSSW:          'keep recoTracks_*_*_*'),
== CMSSW:      fileName = cms.untracked.string('output.root'),
== CMSSW:      dataset = cms.untracked.PSet(
== CMSSW:          dataTier = cms.untracked.string(''),
== CMSSW:          filterName = cms.untracked.string('')
== CMSSW:      ),
== CMSSW:      logicalFileName = cms.untracked.string('')
== CMSSW:  )
== CMSSW:  
== CMSSW:  
== CMSSW:  process.out = cms.EndPath(process.output)
== CMSSW:  
== CMSSW:  
== CMSSW:  process.CPU = cms.Service("CPU")
== CMSSW:  
== CMSSW:  
== CMSSW:  process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck")
== CMSSW:  
== CMSSW:  
== CMSSW:  process.Timing = cms.Service("Timing",
== CMSSW:      summaryOnly = cms.untracked.bool(True)
== CMSSW:  )
== CMSSW:  
== CMSSW:  
== CMSSW:  process.maxEvents = cms.untracked.PSet(
== CMSSW:      input = cms.untracked.int32(-1)
== CMSSW:  )
== CMSSW:  
== CMSSW:  process.options = cms.untracked.PSet(
== CMSSW:      wantSummary = cms.untracked.bool(True)
== CMSSW:  )
== CMSSW:  
== CMSSW:  
== CMSSW:  14-Nov-2014 09:39:13 CET  Initiating request to open file dcap://t2-srm-02.lnl.infn.it/pnfs/lnl.infn.it/data/cms/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0003/E0DDC6AB-60D6-E111-B77A-0030487CDA4C.root
== CMSSW:  14-Nov-2014 09:39:16 CET  Successfully opened file dcap://t2-srm-02.lnl.infn.it/pnfs/lnl.infn.it/data/cms/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0003/E0DDC6AB-60D6-E111-B77A-0030487CDA4C.root
== CMSSW:  Begin processing the 1st record. Run 193834, Event 221707, LumiSection 1 at 14-Nov-2014 09:39:20.067 CET
[...]

Generating an additional output file in the script

Suppose that we want to generate an additional output file in the script and collect it. One can still use the JobType.outputFiles parameter of the CRAB configuration file for collecting the output file. The line that needs to be added in the CRAB configuration file is the following:

config.JobType.outputFiles = ['simpleoutput.txt']

Let's produce the additional output in the script in the simplest way I can find; just add this line at the end of the script:

# $1 will point to the job number which is passed as the first argument.
echo "I am a simple output for job "$1 > simpleoutput.txt

As one can see, the message I am a simple output for job X has been placed in the produced output file:

crab getoutput --quantity 1

Placing file 'simpleoutput_3.txt' in retrieval queue
Please wait
Retrieving simpleoutput_3.txt
Success: Success in retrieving simpleoutput_3.txt
Success: All files successfully retrieve
Log file is /afs/cern.ch/work/m/mmascher/CRAB3-tutorial/CMSSW_7_0_5/src/crab_projects/crab_tutorial_Data_analysis_scriptExe/crab.log

cat /afs/cern.ch/work/m/mmascher/CRAB3-tutorial/CMSSW_7_0_5/src/crab_projects/crab_tutorial_Data_analysis_scriptExe/results/simpleoutput_3.txt

I am a simple output for job 3

Do not run cmsRun at all in the script

In case one does not want to run cmsRun in the script, a workaround to the requirement of having to produce a framework job report is to pass an empty framework job report via the JobType.inputFiles parameter:

config.JobType.inputFiles = ['FrameworkJobReport.xml']

And this is the minimal content that the FrameworkJobReport.xml file should have:

<FrameworkJobReport>
<ReadBranches>
</ReadBranches>
<PerformanceReport>
  <PerformanceSummary Metric="StorageStatistics">
    <Metric Name="Parameter-untracked-bool-enabled" Value="true"/>
    <Metric Name="Parameter-untracked-bool-stats" Value="true"/>
    <Metric Name="Parameter-untracked-string-cacheHint" Value="application-only"/>
    <Metric Name="Parameter-untracked-string-readHint" Value="auto-detect"/>
    <Metric Name="ROOT-tfile-read-totalMegabytes" Value="0"/>
    <Metric Name="ROOT-tfile-write-totalMegabytes" Value="0"/>
  </PerformanceSummary>
</PerformanceReport>

<GeneratorInfo>
</GeneratorInfo>
</FrameworkJobReport>

You also need to remove the following lines from your CMSSW parameter-set configuration file. Otherwise CRAB will still think that an output is produced:

#process.output = cms.OutputModule("PoolOutputModule",
#    outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
#    fileName = cms.untracked.string('output.root'),
#)
#process.out = cms.EndPath(process.output)

Finally, remove the cmsRun line from the script and add a simple instruction that prints the input files put in the tweaked PSet.py by the CRAB3 job wrapper:

echo "================= Dumping Input files ===================="
python -c "import PSet; print '\n'.join(list(PSet.process.source.fileNames))"

#cmsRun -j FrameworkJobReport.xml -p PSet.py

Define your own exit code and exit message

As a first remark, notice that in the examples above we have not exited the script explicitly, but have let the script terminate by itself. If we were to exit the script explicitly, i.e. if we were to add a line

exit <some-non-0-exit-code>

in the script, the CRAB job wrapper would exit before producing the JSON job report file necessary for further execution in the stageout wrapper and the post-job. Furthermore, this exit code is not passed to CRAB, so it would in any case be useless.

The exit code(s) (and exit message(s)) that CRAB uses are the ones defined in the FrameworkJobReport.xml file. If cmsRun fails, the FrameworkJobReport.xml file will contain a section named FrameworkError with details of the failure (exit code and error message). On the other hand, if cmsRun succeeds, the FrameworkJobReport.xml file will not contain a FrameworkError section and CRAB will assume that the exit code was 0 (exit code ≠ 0 means execution failure, exit code = 0 means no error). Therefore, if users want to pass their own exit code and/or exit message to CRAB, they have to edit the FrameworkJobReport.xml file and add an appropriate FrameworkError section to it. This has to be done as the last step before terminating the script.

Below is a real example of a FrameworkError section for a failure opening an input file:

<FrameworkJobReport>
...
...
<FrameworkError ExitStatus="8028" Type="Fatal Exception" >
<![CDATA[
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root'
   Additional Info:
      [a] Input file root://eoscms//eos/cms/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root?svcClass=default could not be opened.
Fallback Input file root://xrootd-cms.infn.it//store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root also could not be opened.
      [b] XrdClient::Open(name='root://xrootd-cms.infn.it//store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root', flags=0x10, permissions=0666) => error 'No servers are available to read the file.83]:1094 Mw[::90.147.66.75]:1094 Mw[::134.158.132.31]:1094' (errno=3011)
      [c] Current server connection: root://cms-xrd-global.cern.ch:1094//store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root

]]>
</FrameworkError>
</FrameworkJobReport>

Thus, to define a customized exit code and exit message (and error type), one has to add the following text to the FrameworkJobReport.xml file:

<FrameworkError ExitStatus="EXIT CODE" Type="ERROR TYPE" >
<![CDATA[
EXIT MESSAGE

]]>
</FrameworkError>

When more than one FrameworkError section is present in the FrameworkJobReport file, CRAB will use the one that appears first. Thus, to make sure your exit status takes precedence, it should be added at the very top of the file, just after the opening of the FrameworkJobReport section. Here is an example of how to add a user exit code and exit message (and error type) to the FrameworkJobReport file from within the script:

...
...

# At the end of the script modify the FJR
exitCode=1
exitMessage="My arbitrary exit message"
errorType="My arbitrary error type"

if [ -e FrameworkJobReport.xml ]
then
    cat << EOF > FrameworkJobReport.xml.tmp
<FrameworkJobReport>
<FrameworkError ExitStatus="$exitCode" Type="$errorType" >
$exitMessage
</FrameworkError>
EOF
    tail -n+2 FrameworkJobReport.xml >> FrameworkJobReport.xml.tmp
    mv FrameworkJobReport.xml.tmp FrameworkJobReport.xml
else
    cat << EOF > FrameworkJobReport.xml
<FrameworkJobReport>
<FrameworkError ExitStatus="$exitCode" Type="$errorType" >
$exitMessage
</FrameworkError>
</FrameworkJobReport>
EOF
fi

The actual error message that CRAB will show in the crab status error summary will be:

<N> jobs failed with exit code 1:
          Error while running CMSSW:
          My arbitrary error type
          My arbitrary exit message

Automatic cancellation of a CRAB task

CRAB will automatically kill a task if a (configurable) minimum number of jobs in the task have failed.

Note: The CRAB configuration parameter Data.failureLimit defines this minimum number. If the user doesn't set this parameter, there will be no automatic kill of the task.

Warning: It is not recommended that users set this parameter, unless they have a good reason or are encouraged to do so by a CRAB expert. Indeed, we are planning to remove the Data.failureLimit parameter from the CRAB configuration file.

If CRAB kills a task, the jobs that were not finished before the kill are lost. Depending on the failure reasons of the jobs, the user has the following options for recovering the task (we assume the task was doing an analysis as opposed to MC generation; in the latter case one can always run additional jobs):

  • Do crab resubmit in order to submit the killed and failed jobs again. This is worth doing if the user knows that the failure reasons are recoverable (one simple good indication would be that the rest of the jobs in the task were successful).
  • Submit a new task to recover the non-finished and failed jobs. One would use the crab report command to obtain the missingLumiSummary.json file with the luminosity sections that were not analyzed, and use it as the lumi-mask in the recovery task (see the sketch after this list).
  • Submit a new task from scratch.
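
A minimal sketch of the second option, assuming the original task's project directory is crab_projects/crab_<taskname> (all paths are illustrative). First obtain the file with the non-analyzed luminosity sections:

crab report -d crab_projects/crab_<taskname>

and then, in the CRAB configuration of the recovery task, use it as the lumi-mask:

config.General.requestName = '<taskname>_recovery'
config.Data.lumiMask = 'crab_projects/crab_<taskname>/results/missingLumiSummary.json'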

-- AndresTanasijczuk - 2015-05-07
