Analysis Toys

All the scripts and example input files described below can be found at UserCode/SAKoay/AnalysisToys:

cvs co -d AnalysisToys UserCode/SAKoay/AnalysisToys
chmod a+x UserCode/SAKoay/AnalysisToys/scripts/*.py UserCode/SAKoay/AnalysisToys/scripts/*.sh

It works best if you put AnalysisToys/scripts in your shell $PATH. When navigating this page, you can also click on the script names in the headers to view them in the CVS browser.

DISCLAIMER : Most of the shell/python scripts have only been tested on the cmslpc cluster, and only for private usage. So don't be surprised if they don't run out-of-the-box for you -- please email the author with a clear description of the problem, in case it can be easily fixed.

All python scripts in the following have online help, which you can access by giving the "-h" option, or by running them without the required arguments. Options start with one or two dashes ("-") and must all be specified before arguments, e.g. instead of ascript.py hello -v world you probably want to call ascript.py -v hello world.

Job creation

brew.py -- text substitution engine for creating job configuration files

This script is appropriate e.g. for creating CRAB jobs, as opposed to the CRAB-less alternatives that you can run on a local batch farm (see jobify.py). It takes at least two mandatory arguments, the first being the input file and the rest being one or more template files. The input file should be a text file where each line contains a list of strings to substitute into the template files, thereby creating a set of template instances. The format of the input file is controlled by the --parse option regular expression, which by default is:

^\s*(\S+)\s+/([^/]+)/(\S+)\s*([0-9]*)\s*(\S*)\s*([0-9.]*)\s*$

The default has been designed to take 3 mandatory fields and 3 optional ones. The labels of the fields are specified via the --fields= option:

__LABEL__;__DATASET__;__DETAILS__;__EVENTSPERJOB__=10000;__TRIGGERTABLE__=HLT;__MAXPTHAT__=-9;__FILES__=;__SECONDARYFILES__=

This is a semicolon-separated list, and each item (e.g. __LABEL__, __DATASET__, etc. in the above) defines a label for the corresponding group specified in parentheses (...) in the --parse= regular expression. A default value can be specified in the label=value format, like __EVENTSPERJOB__=10000 in the above. Note that there are actually more fields specified in this example than there are groups to parse, so in particular the __FILES__ and __SECONDARYFILES__ labels will always be substituted by their default values (which are empty strings in this case).

These default settings would for example be appropriate for an input file like:

ttbar-madgraph   /TTJets_TuneD6T_7TeV-madgraph-tauola/Fall10-START38_V12-v2/GEN-SIM-RECO       10000     HLT
Znunu-madgraph   /ZinvisibleJets_7TeV-madgraph/Fall10-START38_V12-v1/GEN-SIM-RECO              20000     HLT
W-madgraph       /WJetsToLNu_TuneD6T_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO   20000     HLT

Given such an input file, three sets of template files will be generated, with the following substitutions made in each set:

| Set | __LABEL__ | __DATASET__ | __DETAILS__ | __EVENTSPERJOB__ | __TRIGGERTABLE__ |
| #1 | ttbar-madgraph | TTJets_TuneD6T_7TeV-madgraph-tauola | Fall10-START38_V12-v2/GEN-SIM-RECO | 10000 | HLT |
| #2 | Znunu-madgraph | ZinvisibleJets_7TeV-madgraph | Fall10-START38_V12-v1/GEN-SIM-RECO | 20000 | HLT |
| #3 | W-madgraph | WJetsToLNu_TuneD6T_7TeV-madgraph-tauola | Fall10-START38_V12-v1/GEN-SIM-RECO | 20000 | HLT |

In all cases the default values of __MAXPTHAT__, __FILES__, and __SECONDARYFILES__ are used. For example, if two template files called crab.cfg.template and cmsRun_cfg.py.template are provided, brew.py would generate 6 files in the location given by the --in-directory= option (default: the current directory):

crab_ttbar-madgraph.cfg
cmsRun_cfg_ttbar-madgraph.py
crab_Znunu-madgraph.cfg
cmsRun_cfg_Znunu-madgraph.py
crab_W-madgraph.cfg
cmsRun_cfg_W-madgraph.py

The format of the above instance names is controlled by the --output-template option.
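
For concreteness, the invocation that produces the above set of files might look something like this (datasets.txt and jobs are placeholder names for the input file and the output directory):

# one crab.cfg and one cmsRun config instance will be created per line of datasets.txt
brew.py --in-directory=jobs datasets.txt crab.cfg.template cmsRun_cfg.py.template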

Lastly, the field labels should appear in appropriate locations of your template files. For example, you probably want to have something like the following in your crab.cfg.template:

[CMSSW]
datasetpath=/__DATASET__/__DETAILS__
pset=my_cmsrun_cfg___DATASET__.py
events_per_job = __EVENTSPERJOB__

[USER]
ui_working_dir = __DATASET__
user_remote_dir=/somewhere/to/output/__LABEL__

The user_remote_dir can be created by brew.py if specified via the --make-paths option.

Note that brew.py creates jobs but does not submit them. It is easy to get a list of submission commands via:

find -maxdepth 2 -mindepth 1 -name "crab*.cfg" | sed "s|.*|crab -cfg \0 -create -submit|"

which you can create an alias for, for convenience.
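
For instance, in bash (the alias name is arbitrary):

alias crabmitall='find -maxdepth 2 -mindepth 1 -name "crab*.cfg" | sed "s|.*|crab -cfg \0 -create -submit|"'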

jobify.py -- text substitution engine for creating jobs for a list of LFN's

This script is good for creating cmsRun config files that you can run locally (without CRAB), because it takes a list of file names and splits them up into however many jobs you specify. Otherwise it functions similarly to brew.py, i.e. it substitutes input strings into templates to create various instances. The template files are provided via the --templates option, typically as a wildcard, e.g. *.template* by default. The script then accepts one or more text input files containing lists of LFN's, e.g.:

/store/mc/Fall10/QCD_TuneD6T_HT-250To500_7TeV-madgraph/GEN-SIM-RECO/START38_V12-v1/0013/0A86DEE5-2FDB-DF11-9470-00304867FF03.root
/store/mc/Fall10/QCD_TuneD6T_HT-250To500_7TeV-madgraph/GEN-SIM-RECO/START38_V12-v1/0013/08FC1FC3-15DB-DF11-B2CF-00188B7AD28D.root
/store/mc/Fall10/QCD_TuneD6T_HT-250To500_7TeV-madgraph/GEN-SIM-RECO/START38_V12-v1/0013/08FB0627-B2DB-DF11-BF15-001E8CC04089.root
/store/mc/Fall10/QCD_TuneD6T_HT-250To500_7TeV-madgraph/GEN-SIM-RECO/START38_V12-v1/0013/089372E0-F5DA-DF11-A15A-001A92810AC6.root
...

Each line of the input file corresponds to a single LFN; such an input file can for example be downloaded from DBS via the cfg link for files in a particular dataset. The name of the input file will be used as the job and output label, so give it a meaningful name e.g. QCD250to500-madgraph.txt for the above. Note that jobify.py creates jobs but does not submit them -- you can do that via the submitall_condor.sh script.

The number of files per job is specified via the --files-per-job option, and controls the number of jobs that are created. Each job will run over a block of the LFN's in the given input files. The output goes into the storage directory specified via the --storage-path, --storage-subpath, and --storage-postfix options -- the final path is a concatenation of all three, plus the job label and index at the end. For example, if there are 100 LFN's in QCD250to500-madgraph.txt and you specify --files-per-job 10 as an option, 10 jobs indexed 1, 2, ..., 10 will be created under the QCD250to500-madgraph job label.
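
Putting the above together, a typical invocation might look like the following (the storage path is a placeholder; consult the online help for the remaining storage options and their defaults):

# split QCD250to500-madgraph.txt into jobs of 10 LFN's each, instantiating all *.template* files
jobify.py --templates="*.template*" --files-per-job 10 --storage-path /store/user/username QCD250to500-madgraph.txt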

You must have at least 3 template files to run cmsRun jobs in the Condor batch system at FNAL:

Condor JobAd template

The JobAd is a configuration for jobs that run on the Condor batch system. You can find a tutorial on the JobAd syntax at the US CMS website, and an example template file is AnalysisToys/templates/condor.jobad.template. The following setting is typically what you need to change:

transfer_input_files = __ENV(PWD)__/__ENV(CMSSW_VERSION)__.tgz,__JOBPATH__/my_cmsRun_cfg.py,__ENV(X509_USER_PROXY)__

These are the input files needed to run your job on the remote machine. The first, __ENV(PWD)__/__ENV(CMSSW_VERSION)__.tgz, expects to find in the current directory a tarball containing the CMSSW code and libraries under which your job runs. This can be created by executing AnalysisToys/scripts/pack.sh in the current directory, after you have run cmsenv at the appropriate location. Second, you need to change my_cmsRun_cfg.py to the name of the particular cmsRun configuration file you're providing as a template to jobify.py. The third input file is the GRID certificate proxy required to upload your output to the storage element. As long as you have executed voms-proxy-init before submitting jobs, it should work as is.
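
In other words, the typical preparation before creating and submitting jobs is along these lines (a sketch; the -voms cms argument is the usual choice for CMS, but use whatever your setup requires):

cmsenv                      # in your CMSSW release area
cd <submission directory>
pack.sh                     # creates $CMSSW_VERSION.tgz in the current directory
voms-proxy-init -voms cms   # renew the GRID proxy that the job will transfer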

cmsRun wrapper script template

An example template is AnalysisToys/templates/cmsrun.sh.template, which should require little modification. This is a script that runs on the remote batch machine, unpacks the CMSSW tarball, executes cmsRun, and uploads the output files to the storage element at the end.

cmsRun configuration template

This is a standard cmsRun configuration, except that you must insert the __FILES__ replacement tag into your source file names:

process.source = cms.Source("PoolSource",
  fileNames = cms.untracked.vstring(__FILES__),
)

__FILES__ will then be replaced by a python list of LFN's during job creation. You can also specify __SECONDARYFILES__ for the secondaryFileNames parameter if you also want to run over the parent dataset (the 2-file solution) via the --use-parents option.

submitall_condor.sh -- prints submission commands for all created Condor jobs

You just have to execute submitall_condor.sh jobDir, where jobDir is the directory containing *.jobad files (which can be in sub-directories). Such a jobDir is for example created by jobify.py. submitall_condor.sh prints a list of submission commands but does not execute them; to do so you can redirect the output to a temporary file:

submitall_condor.sh jobDir > dosubmit
source dosubmit

submitall_condor.sh also tries to check whether a job has already been submitted, so it should be pretty safe to execute even on partially submitted locations and should just obtain the list of remaining jobs. Please contact the author if this does not appear to be working.

quarry.py -- progressive job creation for processing JSON files

This is useful in case one needs to "update" processed data samples once in a while when a new JSON file is released. The script comes with step-by-step instructions online. Just run it in an empty work directory.

CRAB

craball.sh -- executes CRAB commands on multiple job directories

This is a very simple script that runs the CRAB commands you specify on a list of directories, which can be given via wildcards, e.g.:

craball.sh crabDirs_* -status

A more sophisticated version is crib.py.

crib.py -- multi-job CRAB interface

This is a "friendlier" (at least according to the author) interface to CRAB, with the following features:

  • A more compact, color-coded display of job status, and a global summary for multiple CRAB directories (i.e. running over many datasets).
  • The ability to check the storage location (if local, e.g. dCache when running at FNAL) for presence of output files, which allows one to ignore the CRAB status code -- at your own risk -- and consider the job as done as long as it has output.
  • Understands job categories like "nooutput", "aborted", "failed", etc., which can be specified in lieu of number ranges e.g. in commands like -resubmit nooutput. This means that you no longer need to cut-and-paste job ranges that CRAB reports to have failed in order to resubmit them!

The syntax is simply:

crib.py <directories...> <crab-commands...>

where directories can be a list of paths, or omitted to assume the current directory. A recursive search is performed to locate all CRAB job locations starting from this list of paths. <crab-commands...> are the usual CRAB commands, but you can omit -status because it will always be called prior to other commands. This means that you no longer have to call crab -status and then crab -get -- just the latter is enough.

The other main feature is the aforementioned job categories. Currently the following presets are available:

| Category | Specification |
| unmade | No jobs created. |
| created | Created but not-yet-submitted jobs. |
| cancelled | Cancelled jobs. |
| failed | Jobs with any nonzero exit code (executable or job). |
| killed | Jobs that have been killed. |
| aborted | Jobs that have been aborted. |
| nooutput | Jobs for which the output is not locatable. |
| multioutput | Jobs for which there is more than one output per job. |
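
For example, to fetch any finished output and resubmit all jobs that produced no output, across every matching directory (the directory pattern is just an example):

crib.py crabDirs_* -get -resubmit nooutput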

TIP crib.py creates a .leave.me.alone file in the CRAB directory in order to allow running multiple instances of crib.py (e.g. via crib.py *) without mutual collisions. In some cases, e.g. after a nasty shut-down of crib.py, this file can be left behind in the CRAB directory, a symptom of which is that instead of checking the directory when you run it again, crib.py simply prints out the directory name in the following format:
-------------------------------------------- MyCrabDir -------------------------------------------- 
If this happens undesirably, simply delete the MyCrabDir/.leave.me.alone file.

decrab.py -- remove duplicate CRAB output files [FNAL only]

CRAB occasionally likes to create multiple outputs for the same job. decrab.py locates the duplicates and removes them, keeping the file with the latest timestamp. You probably want to run it with the --dry-run option first, to be sure that it's behaving correctly:

decrab.py -d /pnfs/cms/WAX/...

Note that the output directories have to be specified in /pnfs/cms/WAX/... (local) form, and it assumes that the files in the directories have the CRAB format xxxx_N_m_yyy.root, where xxxx is some user-specified string, N is a job number, m is an instance number (increments if there are multiple outputs per job), and yyy is a random CRAB-generated string. The decrab.py script should print out a list of "Normal" job numbers, followed by a list of multi-output jobs (if any), for example in the format:

   Multi-output:
      687 :      patuple_687_1_NCZ.root = 3.8G        patuple_687_1_Glq.root = 3.8G
      688 :      patuple_688_1_OyI.root = 3.5G        patuple_688_1_P2g.root = 3.5G
      689 :      patuple_689_1_8PW.root = 3.9G        patuple_689_1_Mh8.root = 3.9G

The names and sizes of each duplicate file are listed. Out of all these duplicates the script selects one file to keep, namely the one with the largest size (within some leeway) and the latest creation time; this is the one highlighted in a non-gray color. Once you have verified that this looks ok, you should run decrab.py without the --dry-run (a.k.a. -d) option so that it actually performs the deletion operations.

Condor batch system

chill.py -- executes condor commands on a specified subset of jobs

This script allows you to filter running jobs by grepping for job info parameters (e.g. the command used to run the job), and then to execute a specified command (e.g. condor_hold) on the selected jobs. You probably need to be somewhat familiar with the Condor batch system for this to be useful.
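
As a rough illustration of the idea (this is not chill.py's actual interface; see its online help for that), the manual equivalent with standard Condor tools would be something like:

# put on hold all of your jobs whose executable path matches a pattern ("cmsrun" here is just an example)
condor_q $USER -constraint 'regexp("cmsrun",Cmd)' -format "%d." ClusterId -format "%d\n" ProcId | xargs condor_hold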

Storage element

ccopy.py -- checksum-verified copy [FNAL only]

This script provides directory tree copying facilities where the copied files are verified to have the same checksum as the input file. It should work for any filesystem where standard file manipulation commands work (ls, cp, rm), e.g. normal disks and EOS. The syntax is:

ccopy.py [-v] [-o] <source...> <target>

You probably want to always specify the -v (--verbose) and -o (--overwrite) options. Verbose tells you what is going on. Overwrite is important in case the target file already exists but appears to differ from the source file, in which case the target will not be overwritten unless the -o option has been specified. The -q (--quick) option may also be useful when resuming copies, as it can speed things up by performing a simple (but less reliable) file size check rather than a full CRC comparison in order to determine whether or not a file needs to be overwritten. In all cases a CRC check is performed after a copy to verify that the transfer is correct.
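
A typical invocation would therefore be (the paths are placeholders):

# resume an interrupted copy, using quick size checks for files that already exist at the target
ccopy.py -v -o -q /path/to/source /path/to/target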

copy.py -- recursive dCache-to-dCache copy [FNAL only]

This script provides "standard" directory tree copying facilities between storage element locations. The syntax is:

copy.py --copy-jobfile copyme <source...> <target>

This generates a file copyme which can then be provided to srmcp to perform all of the required copy commands. The srmcp syntax will be printed to the terminal after copy.py is done running.

retrieve.sh -- recursive dCache-to-disk copy [FNAL only]

This script copies from storage element to local (or network) disks. The syntax is:

retrieve.sh <copy-jobfile> <source...>

This generates a file <copy-jobfile> which can then be provided to srmcp to perform all of the required copy commands. The srmcp syntax will be printed to the terminal after the script is done running. The target directory is relative to the current directory. If you want to simplify the directory structure, e.g. to copy from /pnfs/cms/WAX/resilient/somebody/somewhere/source/... to source/..., you can specify a prefix to strip like:

retrieve.sh - somebody/somewhere/ copyme /pnfs/cms/WAX/resilient/somebody/somewhere/source

shelf.sh -- recursive disk-to-dCache copy [FNAL only]

This script copies from local (or network) disks to storage element. The syntax is:

shelf.sh <copy-jobfile> <target> <source...>

This generates a file <copy-jobfile> which can then be provided to srmcp to perform all of the required copy commands. The srmcp syntax will be printed to the terminal after the script is done running. If you want to simplify the directory structure, e.g. to copy from somewhere/along/the/line/source to /pnfs/cms/WAX/resilient/somebody/source/..., you can specify a prefix to strip like:

shelf.sh - somewhere/along/the/line/ copyme /pnfs/cms/WAX/resilient/somebody/ somewhere

ROOT

Parang -- a plot-making suite

More information at the dedicated Parang website.

-- SueAnnKoay - 10-Jun-2011
