Building the statistics tables for Monte-Carlo simulations.

Foreword

The user of a Monte-Carlo sample is interested in a number of efficiency values to make sense of the sample. This information is available in XML files produced by the simulation jobs and associated with each production job. Statistical quantities are computed according to the formulae detailed here. The statistics tables gather this information in a well-formatted way and are published so that it is easily accessible to users. It is the task of the MC contact to generate the statistics tables for the productions of his/her WG. The machinery is most effective when the statistics pages are produced shortly after the productions are finished. There are several steps to follow, detailed below.
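As an illustration of the kind of quantities involved, a simple counting efficiency and its binomial uncertainty can be sketched as below. The function and the numbers are illustrative only; the tables use the formulae referenced above, not this exact code.

```python
# Minimal sketch of an efficiency with its binomial uncertainty.
# Illustrative only -- NOT the formulae implemented in GaussStat.py.
from math import sqrt

def efficiency(n_passed, n_total):
    """Return (efficiency, binomial error) for a simple counting ratio."""
    eff = n_passed / n_total
    err = sqrt(eff * (1.0 - eff) / n_total)
    return eff, err

eff, err = efficiency(1957882, 27302528)
print(f"efficiency = {100 * eff:.3f} +/- {100 * err:.3f} %")
```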

The current version of MCStatTools, v4r7p*, is "frozen" following the discussion with experts during the Simulation Meeting on May 21, 2019. The package is available in CERN GitLab to be checked out and used locally as detailed at the bottom of this page. Any issues related to parsing production XML logs remaining in .tgz archives on CASTOR should be reported on LHCBGAUSS-1677 (this JIRA task will be closed with prior notice upon a decision announced at the Simulation Meeting). Any issues that are found to affect a larger number of productions should be reported on LHCBGAUSS-1676. Both types of issues may still trigger minor patch releases of MCStatTools. So please use the latest available patch release from the v4r7p* series unless another version is specifically indicated by experts.

TIP Parts of this documentation may be obsolete, but they are kept mainly to provide insight into the algorithms involved in retrieving and merging generator statistics for MC productions and their evolution with the package releases.

1 Get the production ID number(s) for your request(s).

Use the Dirac webpage (alternate webpage) to retrieve the Dirac Production ID associated with your request. Follow 'Production' -> 'Request manager', then use the left-hand panel to filter the displayed requests and pin down the request you are interested in. Click on the request and select 'Production monitor'. This will bring you to another Dirac webpage, which you could use directly if you know the request ID number. Each request has several steps, each with a Dirac Production ID. You are interested only in the step of type 'MCSimulation'. Write down its ProdID, shown in the first column.

Do this for all the requests you want to process, as the rest of the workflow can be performed on several ProdIDs in one go.

Another way to get this information from a bookkeeping path is to use the LHCbDirac command dirac-bookkeeping-prod4path. Mind that if there are spaces in the BK path, you should enclose it in quotes.

$ lb-run LHCbDirac/prod dirac-bookkeeping-prod4path --BK '/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia8/Sim08c/Digi13/Trig0x409f0045/Reco14a/Stripping20NoPrescalingFlagged/41900006 ( ttbar_gg_1l17GeV ) /ALLSTREAMS.DST'
For BK path /MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia8/Sim08c/Digi13/Trig0x409f0045/Reco14a/Stripping20NoPrescalingFlagged/41900006 ( ttbar_gg_1l17GeV ) /ALLSTREAMS.DST: 
Productions found (Merge): 32263 
Parent productions (MCSimulation): 32262 

You are interested only in the prodID for the 'MCSimulation'.

2 Produce the tables.

This step consists of retrieving the XML files of a set of ProdIDs and constructing statistics tables for them. The latter operation is done by a script called GaussStat.py. The full operation is steered by DownloadAndBuildStat.py, which runs GaussStat.py internally.

The following instructions are updated for the latest LHCbDirac and MCStatTools versions. For older, now obsolete versions, one could use the old method of setting up the run-time environment for a compatible LHCbDirac package:

SetupProject LHCbDirac

To begin, cd to a directory with a lot of free space.

Tip, idea In order to assess the amount of free disk space required and the status of log files for archived productions, please make use of the check_for_staged.py helper script:

Usage: check_for_staged.py [options] <ProdIDs>

       <ProdIDs> : list of ProdID(s), comma-separated, with no blank spaces.

Options:
  -h, --help            show this help message and exit
  -s, --stage           stage the files (do NOT use repeatedly)
  -c, --check           check whether log file was staged
  -r, --copy            copy file to current dir
  -v <VERB_LEVEL>, --verb-level=<VERB_LEVEL>
                        case insensitive verbosity level [CRIT, ERROR, WARN,
                        INFO, DEBUG; default: info]
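The <ProdIDs> argument format (a comma-separated list with no blank spaces) can be pictured with a few lines of Python. This is only an illustration of the expected input, not the validation actually performed by the scripts:

```python
# Illustration of the <ProdIDs> argument format expected by the scripts:
# a comma-separated list of integer production IDs with no blank spaces.
# This is NOT the validation code from MCStatTools itself.
def parse_prod_ids(arg):
    tokens = arg.split(",")
    if not all(tok.isdigit() for tok in tokens):
        raise ValueError(f"invalid ProdID list: {arg!r}")
    return [int(tok) for tok in tokens]

print(parse_prod_ids("32262,32263"))  # → [32262, 32263]
```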

Copy the current stats table of <your_WG> corresponding to the MC type <MC_type> (MC2012, SIM08STAT, SIM09STAT) into this directory:

    cp $LHCBDOC/STATISTICS/<MC_type>/<your_WG>-WG/*.html .

Launch DownloadAndBuildStat.py, passing it the list of prodIDs you have just found:

    lb-run --use MCStatTools LHCbDirac/prod bash --norc
    python $MCSTATTOOLSSCRIPTS/DownloadAndBuildStat.py [opts..] XXXX,YYYY,ZZZZ

where XXXX, YYYY, ZZZZ are the ProdIDs of the simulation steps for the requests.

This will try to get the logs first from the web, then from the "CASTOR archive" (a.k.a. ARCHIVE in some script log messages). It will also filter out obviously malformed and incomplete XML files. It then calls the script that builds up the tables and updates existing tables with the information corresponding to the ProdIDs passed.
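The filtering of broken XML files can be pictured as follows. This is an illustrative sketch only; the real checks applied by DownloadAndBuildStat.py are more involved:

```python
# Illustrative sketch of rejecting obviously malformed generator XML logs
# before they are merged; NOT the actual DownloadAndBuildStat.py code.
import os
import xml.etree.ElementTree as ET

def is_usable(path):
    """Reject empty files and files that are not well-formed XML."""
    if os.path.getsize(path) == 0:
        return False
    try:
        ET.parse(path)
    except ET.ParseError:
        return False
    return True
```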

ALERT! In case you do not have a valid GRID proxy initialized on the running machine, you will be prompted to create such a proxy at run-time. This is done using the dirac-proxy-init command.

ALERT! If execution appears to hang, try running the script from an lxplus node. When using your local machine, there is a chance that the incoming packets from the archive (XRootD) server will be rejected by your firewall.

Interesting options are --verb-level=<VERB_LEVEL> and --number-of-logs=<NB_LOGS>.

Full help:

$ python DownloadAndBuildStat.py --help
Usage: DownloadAndBuildStat.py [options] <ProdIDs>

       <ProdIDs> : list of ProdID(s), comma-separated, with no blank spaces.

Options:
  -h, --help            show this help message and exit
  -n <NB_LOGS>, --number-of-logs=<NB_LOGS>
                        number of logs to download [default : 1000]
  --delta-size=<DELTA>  log files smaller by <DELTA> from largest file of
                        sample will be deleted [default : 0.07]
  -t, --tape            get files from ARCHIVE (on disk/tape) directly
  -v <VERB_LEVEL>, --verb-level=<VERB_LEVEL>
                        case insensitive verbosity level [CRIT, ERROR, WARN,
                        INFO, DEBUG, VERBOSE; default: info]
  --usage               show this help message and exit
  --use-local-logs      Use already downloaded logs, for debuging purpose.

ALERT! MCStatTools v4r6 (see LHCBGAUSS-1586) introduces a rigorous validation of the prodIDs provided on the command line, using (and caching) metadata from LHCbDIRAC. Also, the full Sim version (including the letter for the minor version) is now included in the statistics table comments to allow users to identify the associated LFN path in BKK.

ALERT! MCStatTools v4r5 (see LHCBGAUSS-1415) makes the HTML table generation interactive. In case no existing HTML table is found in the working directory for the simulation conditions corresponding to a given ProdId, the user is asked to select the WG (working group) for which such a file should be searched automatically in the current EOS repository for the HTML tables. In case the search is successful, the latest file in the repository is copied locally before the merging operations are triggered.

Also, yet another JSON file is saved (/overwritten!) for each generated table, containing a dictionary with all the quantities that are output in the HTML table. This feature is experimental for now and would require further work to integrate into a future dynamic UI for the production statistics tables.

ALERT! As of MCStatTools v4r4, the "old" default behaviour of DownloadAndBuildStat.py to generate an HTML table for each ProdId was recovered. However, for each ProdId an additional JSON file is generated which caches the generator counters for that ProdId. This JSON file may be used for debugging and/or faster rebuilding of HTML tables. The JSON files should be compressed when used as supplemental information for debugging issues reported via JIRA. The ProdId JSON file will need to be removed in case you want to extract more production logs or the script fails when re-run (though it would be nice of you to inform the developers about such issues). All changes in v4r4 are discussed in LHCBGAUSS-1352.
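Such per-ProdId JSON caches can in principle be combined off-line, e.g. when debugging. The snippet below is a sketch with an invented flat counter layout; it does not reflect the JSON schema actually written by MCStatTools:

```python
# Sketch of combining per-ProdId generator counters cached in JSON files.
# The flat {counter_name: count} layout is invented for this example and
# does NOT reflect the actual MCStatTools JSON schema.
import json
from collections import Counter

def merge_counters(json_paths):
    """Sum the integer counters found in each JSON file."""
    total = Counter()
    for path in json_paths:
        with open(path) as fh:
            total.update(json.load(fh))
    return dict(total)
```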

ALERT! MCStatTools is evolving towards automating the log generation procedure. The status of this process can be followed on this JIRA task: LHCBGAUSS-929.

TIP In MCStatTools v3r2 and later, the fall-back to parsing parameters from jobDescription.xml files, in case the Dirac request for Simulation conditions parameters fails, was removed since it led anyway to invalid naming of the generated HTML page. Instead, the script will fail completely for the specific ProdId.

ALERT! To get the basic filtering efficiency, the new LHCbDirac versions include a dedicated script, e.g.

 $ dirac-bookkeeping-rejection-stats -P 57611
 Using BK query {'Visible': 'Yes', 'Production': 57611, 'ReplicaFlag': 'Yes'}
 Getting metadata for 121 files  : completed in 0.2 seconds
 Getting metadata for 121 jobs : completed in 0.1 seconds
 Event stat: 1957882 on 121 files
 EventInputStat: 27302528 from 121 jobs
 Retention: 7.17 %
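The retention reported on the last line is simply the ratio of the two event counts printed above it:

```python
# Retention as printed by dirac-bookkeeping-rejection-stats: the fraction of
# input events that survive the filtering, using the numbers from the output
# shown above.
event_stat = 1957882         # "Event stat": events in the output files
event_input_stat = 27302528  # "EventInputStat": events processed by the jobs
retention = 100.0 * event_stat / event_input_stat
print(f"Retention: {retention:.2f} %")  # → Retention: 7.17 %
```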

Known MCStatTools issues

As the MCStatTools code is restructured and rewritten to attain the level of functionality described in LHCBGAUSS-929, there are a number of known issues (Thumbs-down) that will be (are) solved in the development version of the package. The issues are presented together with workarounds whenever they are available. The list below also contains some (Thumbs-up) tips and features that should improve the user's experience and provide a better understanding of the scripts' workflow.

Thumbs-down A bug detected in versions v4r1 through v4r3 prevents the generation of HTML tables when production data is loaded from JSON. A possible work-around until the code gets patched is to use the flags --save-html (and --use-local-logs in case you still have the XML logs on disk) in order to force generation of HTML tables from the first run.

Thumbs-up Use the check_for_staged.py script to ensure that for each ProdID at least some of the production logs are staged on disk (ALERT! Recommended procedure for archived productions!). Since v3r3, the user gets both the summed size of log files for each ProdID and the summed size of all staged log files, to ease selection of a partition with the necessary free disk space.

Thumbs-up The newly introduced feature to initialise the GRID proxy at run-time requires the script to restart in order for LHCbDirac API calls to be successful. Mind that the script will issue a warning when doing so. If the debug logging level is set, the command being executed is printed before the old instance of the script exits.

Tip, idea Please open a JIRA task under the LHCBGAUSS project for the Generator Statistics component if you are sure you found a bug or, as MC liaison, you really need a new feature to be implemented.

3 Publish the tables.

Create a JIRA task at LHCBGAUSS (you have to log in with your CERN SSO account). Choose 'Generator Statistics' as the Component, and either upload the statistics pages or give a pointer to a folder containing the updated tables in a public-readable area. The tables will then be added to the Gauss web site.

Look for available files in the bookkeeping given an event type.

This is automated in the following script from Vanya. For each returned tuple, the first entry is the BKK path and the last two are the number of files and the overall number of events.

 > lhcb-proxy-init
 > PATH=/afs/cern.ch/user/i/ibelyaev/public/scripts:$PATH
 > get_bookkeeping_info 11102003
 ('/MC/2012/Beam4000GeV-MayJune2012-MagDown-Nu2.5-EmNoCuts/Sim06a/Trig0x0097003dFlagged/Reco13a/Stripping19aNoPrescalingFlagged/11102003/ALLSTREAMS.DST', 'head-20120413', 'sim-20120727-vc-md100', 17, 203599)
 ('/MC/2012/Beam4000GeV-MayJune2012-MagUp-Nu2.5-EmNoCuts/Sim06a/Trig0x0097003dFlagged/Reco13a/Stripping19aNoPrescalingFlagged/11102003/ALLSTREAMS.DST', 'head-20120413', 'sim-20120727-vc-mu100', 17, 207000)
 [...]

LHCbDIRAC based versions

Starting with v4r7, MCStatTools includes a new algorithm for downloading the XML files needed for creating the generator statistics tables. This new algorithm is fully based on the LHCbDIRAC interfaces in v9r3 and later. The previous downloading methods can be recovered by using the -k, --compat command-line flag (although these methods should be regarded as obsolete and will be removed from the package in the next major version). Also, the older downloading algorithm will only work with the environment set up by LHCbDIRAC v9r2p9 or earlier, which provides a working Python interface to the XRootD library. DownloadAndBuildStat.py should fail with a clear message if this interface is corrupted.

Due to changes in setting up the environment for LHCbDIRAC v9r3 and later, MC liaisons will have to make a temporary clone of the MCStatTools data package in order to use it (as detailed here):

$ git lb-clone-pkg -b v4r7p5 MCStatTools
$ cd MCStatTools/scripts

The latest production version of LHCbDIRAC (v9r3 and later) should be set up with:

$ source /cvmfs/lhcb.cern.ch/lib/lhcb/LHCBDIRAC/lhcbdirac

ALERT! The validation of XML file LFNs with the Log-SE may take quite a long time. Despite measures taken to indicate to the user that the script is not stuck, please be advised of this peculiarity of the current algorithm's implementation.

ALERT! A subshell with the same environment as set by the lb-run command may be obtained by issuing the following command (after setting up the LHCbDIRAC environment):

$ ( eval $(xenv --sh -x /cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/MCStatTools/v4r7p5/MCStatTools.xenv); bash -i )
Topic revision: r37 - 2019-07-09 - AlexGrecu
 