Building the statistics tables for Monte-Carlo simulations.

Foreword

The user of a Monte-Carlo sample is interested in a number of efficiency values in order to make sense of the sample. This information is available in the XML files (GeneratorLog.xml) produced by the simulation jobs and associated with each production job (on the GRID). Statistical quantities are computed according to the formulae detailed here. The statistics tables gather this information in a well-formatted way and are published so that it is easily accessible to users (analysts). It is the task of the MC contact to generate the statistics tables for the productions of his/her WG. The machinery is most effective when the statistics pages are produced shortly after the productions are finished. The steps to follow are detailed below.
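
For orientation only (the linked formulae are authoritative), an efficiency reported in these tables is typically the ratio of accepted to generated events together with its binomial uncertainty,

    \varepsilon = \frac{N_{\mathrm{acc}}}{N_{\mathrm{gen}}}, \qquad \sigma_{\varepsilon} = \sqrt{\frac{\varepsilon\,(1-\varepsilon)}{N_{\mathrm{gen}}}}

where N_gen is the number of generated events and N_acc the number of events passing the corresponding (generator-level) requirement.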

The current version of MCStatTools, v4r*, is "frozen" following the discussion with experts during the Simulation Meeting on May 21, 2019. The package is available in CERN GitLab to be checked out and used locally, as detailed at the bottom of this page. Any issues related to parsing production XML logs that remain in .tgz archives on CASTOR should be raised during the Simulation Meeting and addressed to the Production team, as mentioned below. Any issues found to affect recent productions should be reported on LHCBGAUSS-1676, which may still trigger minor patch releases of MCStatTools. So, please, use the latest available release unless another version is specifically indicated here.

ALERT! MCStatTools is evolving towards automating the log generation procedure. The status of this process can be followed on this JIRA task: LHCBGAUSS-929.

Notes about the running environment

MCStatTools relies heavily on the LHCbDIRAC API to gather all the data needed to prepare statistics tables for MC productions. To use it, the user (MC liaison) therefore needs a valid GRID user certificate to instantiate a valid DIRAC proxy, preferably prior to running the DownloadAndBuildStat.py script.

Due to changes in setting up the environment for LHCbDIRAC v9r3 and later, MC liaisons will have to make a temporary clone of the MCStatTools data package in order to use it (as detailed here):

$ git lb-clone-pkg MCStatTools
$ cd MCStatTools/scripts
The latest production version of LHCbDIRAC (v9r3 and later) should be set up with:
$ source /cvmfs/lhcb.cern.ch/lib/lhcb/LHCBDIRAC/lhcbdirac
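
TIP With the environment sourced, make sure you have a valid DIRAC proxy before launching the scripts (this is the standard DIRAC command, also invoked automatically at run-time as noted in section 2 below):

$ dirac-proxy-init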

ALERT! This procedure will change in a soon-to-be-released version of the package, which should offer a better way to set up the runtime environment.

TIP The following procedure is performed automatically by the latest versions of MCStatTools in the final stage of the script, when the actual HTML tables are generated; the details are nevertheless presented below to ease debugging when the scripts do not yield the expected result.

The HTML table generation is done in the GaussStat.py module, which requires access to the latest <your_WG> statistics tables corresponding to the MC type <MC_type> (MC2012, SIM08STAT, SIM09STAT) and the MC production conditions <MC_conditions> (e.g. beam energy, collision parameters, event generator, etc.). These files are stored on EOS and may be copied locally with the following command:

    cp -a /eos/project/l/lhcbwebsites/www/projects/STATISTICS/<MC_type>/<your_WG>-WG/Generation_<SimVer>-<MC_conditions>.html .

1. Get the production ID number(s) for your request(s).

Use the Dirac webpage (alternate webpage) to retrieve the Dirac Production ID associated with your request. Follow 'Production' -> 'Request manager', then use the left-hand panel to filter the displayed requests and pin down the one you are interested in. Click on the request and select 'Production monitor'. This brings you to another Dirac webpage, which you can use directly if you already know the request ID number. Each request has several steps, each with its own Dirac Production ID. You are interested only in the step of type 'MCSimulation'. Write down its ProdID, shown in the first column.

Go through these steps for all the requests you want to process, as the rest of the workflow can be performed on several ProdIDs in one go.

TIP An alternative way to find production IDs when you only know the production event type is to use the LHCbDirac command dirac-bookkeeping-decays-path:

$ dirac-bookkeeping-decays-path -p 13563000
('/101396/13563000/RXCHAD.STRIP.DST', 'dddb-20170721-2', 'sim-20160321-2-vc-md100', 41, 26022, 101396)
('/101399/13563000/RXCHAD.STRIP.DST', 'dddb-20170721-2', 'sim-20160321-2-vc-mu100', 41, 26049, 101399)

which gives you the production IDs as the first component of the stripped path in the output tuples. Without the -p flag these paths become the full BKK paths for the productions (which, however, do not contain the production ID).

TIP Another way to get this information from a bookkeeping path is to use the patched LHCbDirac command dirac-bookkeeping-prod4path. Mind that if there are spaces in the BK path, you should enclose it in quotes.

$ dirac-bookkeeping-prod4path -B '/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia8/Sim08c/Digi13/Trig0x409f0045/Reco14a/Stripping20NoPrescalingFlagged/41900006 ( ttbar_gg_1l17GeV ) /ALLSTREAMS.DST'
For BK path /MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia8/Sim08c/Digi13/Trig0x409f0045/Reco14a/Stripping20NoPrescalingFlagged/41900006 ( ttbar_gg_1l17GeV ) /ALLSTREAMS.DST: 
Productions found (Merge): 32263 
...
Parent productions (MCSimulation): 32262 

You are interested only in the ProdID of the 'MCSimulation' step.

2. Retrieve the XML files and produce the statistics tables.

This step consists of retrieving the XML files of a set of ProdIDs and constructing the statistics tables from them. The latter operation is done in the module called GaussStat.py. The full operation is steered by DownloadAndBuildStat.py, which uses GaussStat.py internally.
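
For orientation, the merging essentially amounts to summing equivalent counters across the per-job GeneratorLog.xml files. The following is a minimal sketch of that idea only: the flat <counter name="..."><value>N</value></counter> layout is an assumption for illustration, and the real GaussStat.py performs far more validation and bookkeeping (see the TIPs below).

import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sum equivalent generator counters over all downloaded per-job XML logs
# (replace <prodID> with the actual production number).
totals = defaultdict(int)
for log in sorted(glob.glob("GenLogs-<prodID>/*.xml")):
    root = ET.parse(log).getroot()
    for counter in root.iter("counter"):
        value = counter.find("value")
        if value is not None:
            totals[counter.get("name")] += int(value.text)

for name in sorted(totals):
    print("%-60s %12d" % (name, totals[name]))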

The following instructions are/will be updated for the latest LHCbDirac and MCStatTools versions.

To begin, cd to a directory on a filesystem with free space/remaining quota of the order of 5-7 MB for each production you need to process. This takes into account that the XML files will be downloaded locally using the LHCbDIRAC API, and also includes the size of the files produced by the script when a large number of XML files is requested for merging (of order 2000 at the least). TIP In more recent versions, the largest JSON file is automatically gzipped, reducing the disk space needed to successfully complete the execution of the script.

Launch DownloadAndBuildStat.py, passing it the comma-separated list of ProdIDs as the last command line argument. The other options are:

$ python DownloadAndBuildStat.py --help
Usage: DownloadAndBuildStat.py [options] <ProdIDs>

   <ProdIDs> : list of ProdID(s), comma-separated, with no blank spaces.

Please, use '-h'/'--help' option to get the full list of options.

Options:
  -h, --help            show this help message and exit
  -n <NB_LOGS>, --number-of-logs=<NB_LOGS>
                        number of logs to download [default : 1000]
  --use-mem             instruct new download algorithm to use memory as
                        temporary buffer
  --delta-size=<DELTA>  log files smaller by <DELTA> from largest file of
                        sample will be deleted/ignored [default : 0.07] It is
                        used only when file size distribution is non-gaussian.
  -v <VERB_LEVEL>, --verb-level=<VERB_LEVEL>
                        case insensitive verbosity level [CRIT, ERROR, WARN,
                        INFO, DEBUG, VERBOSE; default: info]
  --usage               show this help message and exit
  --use-local-logs      Use already downloaded XML logs, to recover data after
                        crash.

TIP Of these, the most frequently used are -n, to set the number of XML logs to merge for each considered production, and --use-local-logs, for the exceptional cases when the script crashes for reasons other than code bugs.
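
For example, to merge 2000 XML logs for each of two productions with more verbose output:

$ python DownloadAndBuildStat.py -n 2000 -v debug 101396,101399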

TIP In case you do not have a valid GRID proxy initialized on the running machine, you will be prompted to create one at run-time. This is done externally using the dirac-proxy-init command and triggers an automatic re-run of the script. Beware that minimal DIRAC access is required so that the production meta-information can be extracted each time a given production is processed.

TIP When merging XML files, the current algorithm creates a template of the set of statistical quantities to be summed, based on the first processed XML file. The code applies quite elaborate filtering to avoid using malformed, incomplete or corrupted XML files as the template; if such a file is nevertheless picked as the template, the merging will most likely issue lots of warnings and the resulting statistics tables should be discarded. In such rare cases, the workaround is to identify the culprit XML file from the log stream, remove it from the set and re-run the script using the --use-local-logs flag.
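
TIP If the culprit is not obvious from the log stream, one quick check is to try parsing each downloaded log and report the files that are outright malformed; a minimal sketch (the GenLogs-<prodID> directory name is taken from the final output list below):

import glob
import xml.etree.ElementTree as ET

# Report downloaded XML logs that fail to parse, so they can be removed
# before re-running with --use-local-logs.
for log in sorted(glob.glob("GenLogs-<prodID>/*.xml")):
    try:
        ET.parse(log)
    except ET.ParseError as err:
        print("malformed: %s (%s)" % (log, err))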

ALERT! The validation of the XML file LFNs against the Log-SE may take quite a long time. Measures are taken to indicate to the user that the script is not stuck; nevertheless, please be aware of this peculiarity of the current algorithm.

The final output includes the following files and directories:

Prod_<00_prodID>_Generation_log.json[.gz] - debug file that allows reloading of production data without re-accessing XML files
Generation_Sim..-<MC_conditions>[.html | _pid-<prodID>.json] - files containing final statistics tables in HTML template or as JSON dictionary  
GenLogs-<prodID> - a directory containing all temporary working data for a given production (most likely only the XML files renamed with unique file names) 
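
TIP The gzipped debug JSON can be inspected without re-downloading anything; a minimal sketch (the structure of the stored dictionary is not documented here, so only generic inspection is shown):

import gzip
import json

# Load the per-production debug dump written by the script
# (fill in the actual zero-padded production number).
with gzip.open("Prod_<00_prodID>_Generation_log.json.gz", "rb") as handle:
    data = json.load(handle)
print(type(data), len(data))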

3. Publish the tables.

Create a JIRA task in LHCBGAUSS (you have to log in with CERN SSO using your account). Choose 'Generators Statistics' as the Component, and preferably give a pointer to a folder containing the updated tables in a publicly readable area (a public directory on lxplus or on EOS/CERNBox) or, as a last resort, upload the statistics pages (the HTML page, but also the JSON/JSON.GZ files when possible). The tables will then be added to the Gauss web site by the collaboration-wide responsible.

Helper scripts

ALERT! To get the basic filtering efficiency, the new LHCbDirac versions include a dedicated script, e.g.

 $ dirac-bookkeeping-rejection-stats -P 57611
 Using BK query {'Visible': 'Yes', 'Production': 57611, 'ReplicaFlag': 'Yes'}
 Checking if BK EventInputStat is reliable... : OK in 38.0 seconds
 Getting metadata for 121 files  : completed in 0.5 seconds
 Getting metadata for 120 jobs : completed in 0.2 seconds
 Event stat: 1957882 on 121 files
 EventInputStat: 27302528 from 121 jobs
 Retention: 7.17 %
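
(The quoted retention is simply the ratio of the two event counts above: 1957882 / 27302528 ≈ 7.17 %.)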

TIP A useful executable script which prints the production meta-information extracted from DIRAC for given production IDs is show_prod_meta.py, e.g.

  $ ./show_prod_meta.py 101394
  [...]
  For ProdID. #101394:
   - NbJobsStep1 : 17774
   - APPCONFIG_file : 'Sim09-Beam4000GeV-2012-MagDown-Nu2.5-Pythia8'
   - SIMCOND : 'sim-20160321-2-vc-md100'
   - BKKPath : '/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia8/Sim09h'
   - diracOK : True
   - eventType : '13563000'
   - simYear : '2012'
   - DDDB : 'dddb-20170721-2'
   - APPCONFIG_version : 'v3r391'
   - DecFiles_version : 'v30r41'
   - prodID : 101394
   - Gauss_version : 'v49r15p1'
  =============

(Almost) obsolete scripts and modules

This section deals with scripts and modules used in previous versions of MCStatTools which, for various reasons, are considered obsolete but may still prove useful for retrieving old data. This code is mostly designed to retrieve the XML log files stored at various locations over recent years (an HTTP server and CASTOR in particular). The section also describes scripts introduced in the package to provide support as specific algorithms and modules were detached from the main working code.

ALERT! Access to CASTOR has been officially closed for regular CERN users since January 2020. As a consequence, the MCStatTools production scripts no longer contain code for accessing CASTOR through the XRootD protocol; a possible workaround procedure is presented below for the special users who are still able to access the CASTOR system. Normal users will get the error below when trying to access CASTOR:

[ERROR] Server responded with an error: [3005] Insufficient privileges for user nnnn,mmm performing a StagePrepareToGetRequest request on svcClass 'default'
Please contact the MC Production team to be advised on how to obtain access to the GeneratorLog.xml files corresponding to the targeted production. This is not an MCStatTools bug, and it cannot be solved unless the experts provide an alternative, long-term supported API to access these files.

1. set up an environment containing a working XRootD Python interface

$ lb-run --ext xrootd_python LCG/94 bash
The new bash shell may be automatically stopped; issue $ fg 1 to bring it into the foreground and be able to run the following commands.
2. use the check_for_staged.py script to stage the archive files on CASTOR and download them locally, on a filesystem with enough free disk space (a sketch of the underlying staging call is shown after this list)
$ python check_for_staged.py -s <ProdID list>
3. a few hours later, check whether the files are staged (if not, wait a few more hours and repeat)
$ python check_for_staged.py -c <ProdID list>
4. if the files are shown as staged, download the tarballs
$ python check_for_staged.py -r <ProdID list>
5. use legacy_unpack_ProdLogs.py script to extract the XML files from the downloaded archives (see command help for details)
$ python legacy_unpack_ProdLogs.py <ProdID_list>
6. set up the environment for the latest (production) LHCbDIRAC (preferably in a new session)
$ source /cvmfs/lhcb.cern.ch/lib/lhcb/LHCBDIRAC/lhcbdirac
7. proceed with parsing and merging the XML statistics using the --use-local-logs command line flag
$ python DownloadAndBuildStat.py [flags] <ProdID_list>
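
For reference, the staging request issued by check_for_staged.py can also be reproduced directly with the XRootD Python bindings; a rough sketch only, in which the redirector URL and the archive path are illustrative assumptions, not values taken from the script:

from XRootD import client
from XRootD.client.flags import PrepareFlags

# Ask the CASTOR redirector to stage an archive from tape
# (both the URL and the path below are placeholders).
fs = client.FileSystem("root://castorlhcb.cern.ch")
status, _ = fs.prepare(["<path_to_tgz_archive>"], PrepareFlags.STAGE)
print(status.message)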