RelMon: a Tool to Perform Goodness of Fit Tests of Large Sets of Histograms

RelMon is a tool to perform automatic comparisons of two rootfiles containing histograms and profiles, possibly organised in directories. It allows, for example, regressions of CMSSW releases through the comparison of the DQM histograms produced. Its primary usage is to perform systematic validation of a 'test CMSSW release' against a 'reference CMSSW release': the same suite of DQM distributions (developed by DPG/POG/PAG) is produced over a group of samples (data and MC) in the context of relVal production; RelMon performs a compatibility test between the 'test' and 'reference' instance of each distribution, and provides an easily browsable, hierarchical organisation of the outcome of the tests, with pointers to access the tested distributions in the DQM GUI.

The tool is entirely written in Python and relies on the pyROOT bindings of ROOT. Two interfaces are available to read the DQM histograms, both from rootfiles and from the DQM database via the DQM2json utility. The comparisons between two histograms are ranked using statistical tests, configurable by the user. Single directories are then ranked according to the ranks of their contents. The information about the tests can be viewed as a plain ASCII report or as a set of web pages that aggregate the information in a concise form using pie-chart diagrams. Third-party technologies are involved in the generation of the RelMon graphics, namely the DQM GUI plotting infrastructure, the Google Chart API and the Blueprint CSS framework.

A set of handy command-line scripts eases the usage of the interfaces. A web-based service to further automate and streamline the production of RelMon reports is under development.

Getting RelMon and setting up the environment

For CMS users

RelMon is available natively with every CMSSW release since the CMSSW 5 cycle!

scramv1 project CMSSW CMSSW_8_0_14
cd CMSSW_8_0_14/src
git cms-addpkg Utilities/RelMon/
chmod 744 Utilities/RelMon/scripts/fetchall_from_DQM_v2.py
chmod 744 Utilities/RelMon/python/*
scramv1 build -j 2
cmsenv
voms-proxy-init -voms cms

For All Users

Get the tarball here and uncompress it.
cd RelMon
chmod +x scripts/*
source scripts/RelMon_set_env.sh

Now set up ROOT. You can use a stand-alone installation or the ROOT version that comes with CMSSW. In case you plan to use CMSSW to set up ROOT, pyROOT and the DB interface, remember that it is sometimes easier to set up the Grid environment before the CMSSW one, because of the Python version included in the gLite middleware.

File Interface

The file interface allows you to compare two CMSSW releases and produce a report about their compatibility, using the agreement of the whole set of histograms that the DQM harvesting step produced when running on a particular dataset. The agreement is quantified by a statistical test. A script is available in the scripts directory. Let's start with a meaningful example; suppose two sample root files are available:
compare_using_files.py DQM__RelValZMM__CMSSW_4_2_0_pre7-START42_V6-v2__DQM.root DQM__RelValZMM__CMSSW_4_2_0_pre6-START42_V4-v1__DQM.root -C -R -d Muons  -o File_int
This command will start the comparisons ( -C ) of the two rootfiles, concentrating only on the Muons directory ( -d Muons ), and produce a report ( -R, this option must appear) in the directory File_int ( -o File_int ). NOTA BENE: the format of the filename, XXX__SAMPLE__CMSSWVERSION__TIER.root, i.e. the one of the harvested DQM files, is at the moment essential for the correct functioning of the script. You can also use files not formatted as DQM harvested rootfiles, with the help of the --meta switch. You can specify a short description of your files with:
... --meta "MyShortDescription1 @@@ MyShortDescription2"
The "@@@" is used a s a separator between the two descriptions. More options are at your disposal of course. To explore them just type:
compare_using_files.py -h

How does it work?

Basically, the directory structure of the files is walked and the histograms are compared one by one. An object of type Directory (see the dirstructure module) is filled with the information about the outcome of the tests. At the end of the comparison, this Directory instance is dumped to disk as a pickle (the serialised form of a Python object) for future usage, e.g. html report production. The option -P allows you to start from such a pickled object and produce the report without re-running the comparisons. If you compare the directory structure of a harvested file in a TBrowser and the one in your report, you'll notice two things: first, empty directories are skipped, since they are uninformative; second, the RunSummary directories have disappeared. This is done on purpose, to align the output of the DB and File interfaces. If you are interested in the details, look at the Directory::prune method. The webpages are produced by functions manipulating Directory objects, which are collected in the directory2html module. This library is indeed meant to lay the groundwork for a purely hypothetical, and not at all promised, future CherryPy version of RelMon.
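
To make the mechanism concrete, here is a minimal pyROOT sketch of the walk-and-compare idea. It is an illustration, not RelMon's actual code: the file names are placeholders, and the flat chi-square cut stands in for the configurable tests described below.

import ROOT

def walk_and_compare(dir1, dir2, path=""):
    # recurse over the directory structure, comparing histograms pairwise
    for key in dir1.GetListOfKeys():
        name = key.GetName()
        obj1, obj2 = dir1.Get(name), dir2.Get(name)
        if not obj2:
            continue  # present in only one file: nothing to compare
        if obj1.InheritsFrom("TDirectory"):
            walk_and_compare(obj1, obj2, path + "/" + name)
        elif obj1.InheritsFrom("TH1"):
            p_value = obj1.Chi2Test(obj2)  # ROOT's chi-square compatibility test
            print("%s/%s: %s" % (path, name, "OK" if p_value > 1e-5 else "FAIL"))

file1 = ROOT.TFile.Open("test.root")       # placeholder file names
file2 = ROOT.TFile.Open("reference.root")
walk_and_compare(file1, file2)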

Where are the reports?

The reports are in general collected here: http://cms-service-reldqm.web.cern.ch/cms-service-reldqm/ReleaseMonitoring, which is where the /afs/cern.ch/cms/offline/dqm/ AFS directory is served. Access specifically to the release validation reports is eased by the top-level interface. After a few months of being available online, RelMon reports are removed from the top-level interface and archived here: /eos/cms/store/group/pdmv . Once available, the eos2castor archival tool will need to be used to archive out of EOS. Please ask PdmV if you need an old RelMon report to be temporarily made available online again.

Running locally

If you want to run RelMon locally (to view the reports on your own computer), use the --standalone flag, so that the report HTML files fetch JavaScript over HTTP. The standalone method is also triggered when setting up metas for custom root files. For example:
ValidationMatrix.py -a rootFullSimPU -o FullSimReport --standalone

DQM DB Interface

The DQM DB interface allows you to gather the information about the histograms to be compared directly from the DQM DB content. The API exploited to achieve this goal is DQM2json. Again, a script is available that wraps the interface class. Let's inspect a sample command:
compare_using_db.py -1 CMSSW_5_3_0-START53_V4-v1 -2 CMSSW_5_3_1-START53_V5-v1 -S RelValZMM  -C -R  -d "Muons" -o DB_int
This command uses again the already known -C, -R, -d and -o options. The difference is that now the sample ( -S ) and the two releases ( -1, -2 ) have to be specified separately. This slight asymmetry between the two approaches is due to the fact that all the information that was automatically parsed from the filename must be entered here explicitly. As usual,
compare_using_db.py -h
will display all the necessary help. Please note the -T option: along the lines of the previous argument, the tier of the data must also be entered here, since it cannot be parsed from a filename. The default is DQM.

Another possibility could be:

compare_using_db.py -1 CMSSW_4_2_0_pre7-START42_V6-v2 -2 CMSSW_4_2_0_pre6-START42_V4-v1 -S RelValZMM  -C -R  -d "00 Shift" -o 00Shift

You might have noticed the difference between the directory investigated here, 00 Shift, and the previous one, Muons. The 00 Shift directory is an example of usage of DQMGuiLayouts (https://twiki.cern.ch/twiki/bin/view/CMS/DQMGuiLayouts). These kinds of directories do not physically exist inside the harvested root files as TDirectoryFile objects: they are "fictitious", in the sense that they summarise histograms that already exist.

How does it work?

Here as well, the directory structure exposed by the DQM2json API is navigated, and comparisons are then performed histogram by histogram. The mechanism is then the same as for the File Interface: a Directory instance is filled, pickled to disk, and the html is produced. You might have noticed some messages about the number of threads run simultaneously. One of the main differences between the two interfaces is the usage of threads to recursively navigate the directory structure. The problem is indeed easily parallelisable, though threads only really bring a big advantage when using the DQM2json API, where they compensate for the latencies of the server's responses. If you are curious about the way in which Python implements threads, please have a look here.
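
A minimal sketch of why threads pay off here: several directory listings can be requested concurrently, so the per-request latency of the server overlaps instead of adding up. The fetch function and the URLs are placeholders, not the DQM2json API.

import threading
import urllib.request

def fetch(url, results, index):
    # each thread spends its time waiting on the network, not the CPU,
    # so several requests overlap and the server latency is paid only once
    results[index] = urllib.request.urlopen(url).read()

urls = ["https://example.org/dir1", "https://example.org/dir2"]  # placeholder URLs
results = [None] * len(urls)
threads = [threading.Thread(target=fetch, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()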

Black Lists

Black listing directories

Both interfaces (and scripts!) support blacklists. This is particularly desirable in the presence of directories containing a very big number of histograms, for example RPC or HLT. The way in which a blacklist is entered on the command line is:
-B DIRNAME1@LEVEL1,DIRNAME2@LEVEL2,...
That is, a comma-separated list of name-level pairs, each characterising a single directory. If the level is -1, all the directories in the tree having the specified name will be skipped.
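
For example (the directory names are purely illustrative), to skip the HLT directory at level 1 and every directory called RPC anywhere in the tree:
-B HLT@1,RPC@-1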

Black listing single histograms

Since CVS tag V00-08-05 it is possible to exclude single histograms from the comparison. This is supported by the ValidationMatrix.py and compare_using_files.py scripts. The blacklist is located at http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/Utilities/RelMon/data/blacklist.txt . By default the scripts do not skip single histograms; users who want to use this list for skipping should include the --use_black_file command line parameter. Examples:
ValidationMatrix.py -a rootFilesDIR -o reportsOUTPUT --use_black_file -N 7
compare_using_files.py DQM_rootFile1.root DQM_rootFile2.root -C -R -o outputREPORT --use_black_file

Statistical Tests

Considerable emphasis was placed on the requirement of statistical tests to check the compatibility of the pairs of histograms. A complete suite of tests is natively available, together with a very flexible mechanism that allows the collection of statistical test classes to be expanded. The available tests are:
  • Chi2: this is the default test of RelMon. The ROOT implementation of the test is exploited. The default selection on the p-value to determine whether a comparison has passed is p > 1e-05 (failed otherwise).
  • KS: this test is NOT designed for the comparison of binned data. Indeed, the Kolmogorov-Smirnov theorem holds only for unbinned datasets. It is nevertheless included, since it is fast and gives almost appropriate results in the presence of very fine binning. The ROOT implementation of the test is exploited.
  • Bin2Bin: this test is implemented to check the 1-to-1 compatibility of two sets of histograms. A use case could be a change of the CMS production job submission tool and the comparison of the produced samples. The implementation is in Python, but all the computationally heavy operations are pushed down to the C level of the Python interpreter. It must be noted that the p-value returned by this test should be handled with care: it is not a true p-value, but rather the fraction of bins whose contents matched. Therefore, if you have 100 bins and the contents of 83 of them are the same, the test returns 0.83. The default selection on the p-value to determine whether a comparison has passed is p > 0.9999 (failed otherwise).

How does it work?

All RelMon statistical tests (and the statistical test wrappers around the ROOT implementations) inherit from a very generic class, StatisticalTest. It is sufficient to look into the utils module to understand the details of the implementation and how easy it is to implement new tests, always provided that the literature offers new candidates with respect to the two and a half cited above. This is indeed not a trivial problem, but please contact the author in case you think that your test should be implemented in RelMon.
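
To give a feel for how little is needed, here is a sketch of a Bin2Bin-like test. The base class and method names here are illustrative assumptions; the real interface is defined in the RelMon utils module and may differ.

class StatisticalTest(object):
    # stand-in for the generic base class described above
    def __init__(self, threshold):
        self.threshold = threshold
    def do_test(self, h1, h2):
        raise NotImplementedError

class Bin2BinLike(StatisticalTest):
    # returns the fraction of bins whose contents coincide, as described above
    def do_test(self, h1, h2):
        nbins = h1.GetNbinsX()
        if nbins == 0 or nbins != h2.GetNbinsX():
            return 0.0
        equal = sum(1 for i in range(1, nbins + 1)
                    if h1.GetBinContent(i) == h2.GetBinContent(i))
        return float(equal) / nbins

A comparison would then pass if do_test(h1, h2) exceeds the configured threshold (0.9999 by default for Bin2Bin).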

Advanced Topics

The following topics go beyond the usual needs of the average user; nevertheless, if you are curious, read on!

Validation Matrix

Is it possible to verify the compatibility of two CMSSW releases using all the histograms produced in the harvesting steps for all the Release Validation samples produced? The answer is yes, with ValidationMatrix.py. This script manages multiple instances of the file interface so as to perform several comparisons of rootfile pairs simultaneously, each root file corresponding to a dataset/release combination. A typical command line looks like:
ValidationMatrix.py -a myRootfilesDir  -o CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2 -N 7
This command will generate a directory, CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2, in which all the samples represented by the rootfiles in myRootfilesDir will be reproduced in separate directories. The number of processes to be run at the same time is specified by -N: please be gentle, since these are not shy Python threads, but each process will run on a different CPU (if available). More details are needed about the -a option. The tool will try to fetch all the rootfiles contained in the specified directory and organise them in pairs to be compared. So please help the tool help you and copy pairs of files into the directory: pairs will be built from each unique release/dataset combination. Be careful that, for instance, FullSim and FastSim are handled as if they were different datasets, so you should separate those sets of files into different folders in order to get the meaningful comparisons you want. PU files (i.e. samples with pile-up simulation) are also treated as different datasets, so in this case too you may want to make a dedicated directory. If for some reason the automatic sorting fails (see the sketch below for the idea behind it), you can always fall back to a manual selection with the -R and -T options, where R stands for reference and T for test. The output directory should be of the form XXXXVSYYYY; use any other formatting at your own risk.
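
The pairing relies on the harvested-file naming convention already mentioned (XXX__SAMPLE__CMSSWVERSION__TIER.root). A minimal sketch of the grouping idea, not ValidationMatrix.py's actual code:

import os
from collections import defaultdict

groups = defaultdict(list)
for name in os.listdir("myRootfilesDir"):  # placeholder directory
    if name.endswith(".root"):
        _, sample, release, tier = name[:-len(".root")].split("__")
        groups[(sample, tier)].append((release, name))
# after the loop, each (sample, tier) key should hold exactly two
# files, one per release: the reference/test pair to be compared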

Once the output directory is generated, you can produce the summary condensing all the information about the single comparisons with the command line:

ValidationMatrix.py -i CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2
which tells ValidationMatrix.py to process all the pickles together and generate a comprehensive report.

Compressing reports and publishing reports

Reports can be published in two ways:
  • Copy the report directory to a webspace, for example AFS.
  • Compress the report, add a .htaccess file to enable Apache to serve the compressed files, and copy the report directory to a webspace, for example AFS, gaining a factor 3-4 in disk space and bandwidth.

For a simple report the first option can be a valid strategy, but once a report spans many RelVal samples it is better to optimise. The dir2webdir.py script optimises the report. The syntax is:

dir2webdir.py ReportDir
All the files will automatically be compressed using the gzip algorithm, the pickles will be copied to a separate directory, and a .htaccess file will be created in the top directory.
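
As an illustration of the compression step (not dir2webdir.py's actual code; the directory name is a placeholder):

import gzip
import os
import shutil

for root, _, files in os.walk("ReportDir"):
    for name in files:
        if name.endswith(".html"):
            path = os.path.join(root, name)
            # write a gzip-compressed copy next to the original
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)

The .htaccess file then instructs Apache to serve the .gz files, with the appropriate Content-Encoding, to browsers that accept them.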

For operators: generating full reports after upload of harvested files

The steps to be followed by the operators in order to fetch the relevant files, build and publish a RelMon report are the following:
# Set up the grid proxy in order to authenticate to the server
export MAIN_RELEASE=CMSSW_X_Y_Z
export COMP_RELEASE=CMSSW_X_Y_Zprime   # the CMSSW_X_Y_Z' release to compare against
cd /tmp/"$USER"
export WORKDIR=RelMonReport_"$MAIN_RELEASE"VS"$COMP_RELEASE"
mkdir $WORKDIR
cd $WORKDIR
cmsrel $MAIN_RELEASE
cd $MAIN_RELEASE; cmsenv; cd -
cvs co -d RelMon -r V00-06-05 UserCode/dpiparo/RelMon
cd RelMon; source scripts/RelMon_set_env.sh; cd -
mkdir work; cd work
compare_two_releases.sh $MAIN_RELEASE $COMP_RELEASE

These command lines should be executed on a machine on which AFS is mounted and which has at least ~50 GB of free space in the /tmp directory: perfect candidates are the lxplus machines. The relval user has full access to the /afs/cern.ch/cms/offline/dqm/ directory. NOTE: due to an unexpected pyROOT behaviour, it is better to log in with X forwarding enabled.

How to choose the CMSSW_X_Y_Z and CMSSW_X_Y_Z' releases

The idea is to test one release against the preceding one, to spot the changes that could have been introduced. As a general rule, every pre-release should be checked against the previous pre-release (CMSSW_X_Y_Z_preN vs CMSSW_X_Y_Z_preN-1). The test of the pre1 release should be carried out against the last release of the preceding cycle. Releases (not pre-releases) should be checked against the last pre-release (CMSSW_4_3_0 VS CMSSW_4_3_0_pre7) or the last release (CMSSW_4_2_6 VS CMSSW_4_2_5).

Minimization of output HTML file names

RelMon generates an HTML report file for each directory. As the AFS file system limits the length of file names to a certain number of characters, you can use output file name minimisation. To minimise the report's file names, use the command line parameter --hash_name. It uses Python's hashlib.md5 module to generate unique names; to shorten them further, RelMon uses only the first 10 characters of the generated hash. Example:
ValidationMatrix.py -a rootFileDIR -o reportOUTPUT -N 4 --hash_name
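
In essence, the scheme is (the hashed input string below is an assumption; RelMon may hash a different combination of path elements):

import hashlib

full_name = "Muons/MuonRecoAnalyzer"  # hypothetical report page path
short_name = hashlib.md5(full_name.encode()).hexdigest()[:10]
print(short_name)  # a stable, 10-character file name stem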

Screen Shots

Main Page

[Screenshot: the RelMon main page]

Top of global Report

[Screenshot: the first part of the global report with the single subsystems]

Detail of the subsystems with a gauge

[Screenshot: the quality of a single subsystem can be inspected quickly by looking at gauges]

Even more detail

[Screenshot: all the directories compared for every RelVal sample]

One directory of one single sample

[Screenshot: understand the status of the single subdirectories in an instant and click to get more information]

The plots: from the DQM GUI

[Screenshot: the DQM GUI can be exploited to display the comparison plots for all histograms of the analysed set]

The plots: internally generated

[Screenshot: if the files are not indexed in the GUI (e.g. non-DQM files) the plots can be generated and visualised directly]

-- DaniloPiparo - 15-Jul-2011

RelMon service

The RelMon service is a tool that automates RelMon report production. The service is accessible at https://cms-pdmv-dev.cern.ch/relmonsvc/. The tool consists of a RESTful service and a graphical user interface.

Notation

  • RelMon campaign - a request for the RelMon service to make a RelMon comparison (as on the RelMon Reports homepage). Initially, a RelMon campaign consists of its name, a threshold (see Threshold) and one to six categories (see below). After the RelMon service starts to process a RelMon campaign, the campaign is appended with more detailed information.
  • Category - a part of a RelMon campaign. A category consists of an HLT option, two lists of relval workflow names - reference and target - and a name identifying the category. In this context there are six available category names: "Data", "FullSim", "FastSim", "Generator", "FullSim_PU" and "FastSim_PU".
  • Threshold - the percentage of workflows for which root files must be accessible before RelMon report production starts.

Operation (https://cms-pdmv.cern.ch/relmonsvc)

To start a new RelMon report production process, the RelMon service must be given an initial RelMon campaign as input. Below is a real example of an initial RelMon campaign which could be passed to the RelMon service:
{
    "name":"BH&SE_740pre6_vs_740pre8",
    "threshold":100,
    "lastUpdate": 1461836633,
    "categories":[
        {
            "name":"FullSim",
            "HLT":"only"
            "lists":{
                "reference":[
                    "fabozzi_RVCMSSW_7_4_0_pre6BeamHalo_13_150130_155653_4907",
                    "fabozzi_RVCMSSW_7_4_0_pre6SingleElectronPt10_UP15_150130_155755_467"
                ],
                "target":[
                    "franzoni_RVCMSSW_7_4_0_pre8BeamHalo_13__MinGT_150315_195114_8636",
                    "franzoni_RVCMSSW_7_4_0_pre8SingleElectronPt10_UP15__MinGT_150315_200318_6695"
                ]
            }
        }
    ]
}
The example shows the JSON representation of a RelMon campaign named "BH&SE_740pre6_vs_740pre8", with the threshold set to 100% and one category, "FullSim". The HLT option of that only category is set to "only", which means that the RelMon report for this category will be made only with the HLT flag set to true. The category also has two workflow names in each of its lists.

After a RelMon campaign has been passed to the RelMon service to start a new RelMon report production, the service starts checking the names of the workflows' DQMIO outputs at the Request Manager. If the DQMIO output name is accessible, the RelMon service queries the DBSReader to get the expected number of root files for the given workflow. The RelMon service also queries the DQM GUI for root file availability and compares the number of available root files with the expected number. If these two numbers are equal, the workflow is assigned the status "ROOT" (see Workflow statuses explained). Moreover, the RelMon service checks the Workload Manager database to get information about the processing state of the workflow. The graph of states that can be returned from the Workload Manager database is explained here. The interesting states are those from which no more root files can appear for a workflow. This lets the RelMon service detect workflows which do not have enough root files available and are already in a state where no more root files will be produced; in such cases the RelMon service assigns the "NoROOT" status to them. This whole procedure of querying services and databases is repeated at constant intervals until the threshold is reached or exceeded (see the explanation below).

By default the threshold is calculated by the formula t = R/(T-d-r), where t is the threshold, R the number of workflows for which enough root files already exist, T the total number of requested workflows (from each category, from each list), d the number of workflows which do not produce DQMIO output, and r the number of workflows for which it is already known that not enough root files are available. This formula can be influenced by changing the RelMon service configuration file (see Config options).
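
A minimal sketch of the same bookkeeping, assuming the threshold is configured as a percentage (the function and variable names are hypothetical):

def threshold_reached(ready, total, no_dqmio, no_root, target_percent):
    # t = R / (T - d - r), compared against the configured percentage
    eligible = total - no_dqmio - no_root
    if eligible <= 0:
        return True  # nothing left that could still produce root files
    return 100.0 * ready / eligible >= target_percent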

After the threshold is reached or exceeded, if not all workflows (except the ignored ones, i.e. "NoDQMIO" and "NoROOT") have enough root files available, the RelMon service waits the amount of time specified in the configuration (4 hours by default); then, for the last time, it checks the available root files and starts downloading the needed ones. When the downloads are finished, RelMon report production is started.

For RelMon report production the ValidationMatrix.py script is used. Finding the pairs of root files to be compared is done automatically by the ValidationMatrix.py script. It is known that there exist cases when the automatic pairing fails. In such cases the RelMon service might produce empty or incomplete RelMon reports for the categories for which pairing has failed.

The produced RelMon reports are optimised by running the dir2webdir.py script on them. The optimised reports are then moved to a directory visible to the RelMon Reports homepage (the directory can be customised in the configuration file, see Config options).

Workflow statuses explained

"initial"
RelMon service has not done any action on this workflow.
"waiting"
It is known that the workflow produces DQMIO output, but the name of this output is still unknown.
"NoDQMIO"
Workflow does not produce DQMIO output.
"DQMIO"
DQMIO output name is known; the RelMon service is checking for root file availability.
"NoROOT"
According to Workload Manager the workflow is in one of the final states (final states are defined in service configuration), but there are not enough root files available.
"ROOT"
There are enough root files for given workflow.
"downloaded"
root files for this workflow have been downloaded.
"failed"
Something went wrong while processing this workflow.
$*"failed_rqmgr"*
Request manager doesn't know this worklow name. $*"failed downlod"*: something went wrong during the download process

Graphical user interface

Visiting the home page of the RelMon service redirects the browser to the graphical user interface. In case you are not signed in to your CERN account, you will be asked to sign in before reaching the GUI. On the home page there are two sections: "New RelMon request (campaign)" and "Latest requests".

Submitting a new RelMon campaign

  1. Click "New RelMon" button.
  2. Fill needed categories (see Filling categories) with reference and target workflow names and the HLT option.
  3. Enter RelMon campaign name and threshold.
  4. Click "Submit" (or "Submit query" depending on your browser).
  5. Click "Confirm"

Filling categories

After clicking "New RelMon" button, the form for new RelMon campaign is revealed. There is a tab for each category (e.g. "Data", "FullSim", "Generator", etc.). Each tab contains a field for list of reference workflow names, a field for list of target workflow names and an HLT option (except for "Generator" category which does not have the HLT option). Needed workflow names can be looked up at https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests.html. Workflow names in the lists must be separated with white space.

Campaign name

Below the tab set there is a field for the RelMon campaign name. This is how the finished comparison will be named on the RelMon Reports page. Note: choose a name that is a valid name for a Linux directory. If you create a campaign with a name which already exists, you delete all the categories from the first campaign: e.g. if you have a campaign "xxx" with categories Data and FullSim, and you create another campaign with the same name "xxx" and categories FullSim and FastSim, you take responsibility that everything from the first campaign in the FullSim category will be deleted and replaced with the new files from the second campaign.

Latest requests

In this section there is a table of submitted RelMon campaigns. The left column holds general information about the campaigns. The right column holds information about the categories and workflows of the specific campaigns.

EDIT functionality

There is "Edit" button in each campaign. You can edit campaigns while they're not in the final statuses. It means, that while not finished or failed. When you press Edit, all fields will be filled in by information from that campaign. After submitting edited campaign, that campaign immediately starts from the beginning. It doesn't matter it was in downloading or comparing files phase.

General information

The interesting elements of the general information column are the "Status" field, the "Log file" link, and the "Terminate" and "Close" buttons. The "Terminate" button, when clicked, stops the ongoing RelMon request and cleans all downloaded and generated files. The "Close" button is meant for cleaning the RelMon report record at the RelMon service while leaving the actual report available. "Log file" is a link to a log file generated while producing the RelMon reports. The "Status" field can have one of the values explained below.

Explanation of request statuses:

"initial"
Request has just been submitted.
"waiting"
Service is checking statuses of workflows and there are not enough workflows with status "ROOT" (see Workflow statuses explained).
"downloading"
root files are downloading.
"downloaded"
root files are downloaded.
"Qued_to_compare"
campaign is added to queue to compare root files.
"comparing"
RelMon reports are being produced (ValidationMatrix.py script).
"finished"
RelMon reports are completed and uploaded to be accessible via RelMon Reports page.
"failed"
Something went wrong.
"terminating"
The request processing is being stopped and cleaning is in progress.

Detailed information about categories and workflows

In this column there are expandable containers for every category. An expanded category reveals expandable lists. An expanded list reveals the workflow names, their statuses and possibly the number of root files available and expected for each particular workflow. The possible workflow statuses are explained above.

How file comparing works

There are a couple of steps. First of all there are two lists, reference and target, and the workflows are filtered: e.g. if a workflow "xxx" in the reference list has status "NoROOT", "NoDQMIO" or "failed_rqmgr", the matching workflow is looked up in the target list and both are deleted. After that, the number of workflows in each list is checked and the matching of workflows starts. For matching workflows the Levenshtein distance algorithm is used. How exactly the comparison works, with examples, can be found in these slides: https://indico.cern.ch/event/479139/contributions/2145971/attachments/1263354/1868703/Presentation_RelMon-Service_v10.pdf
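
A minimal sketch of the matching idea: each reference workflow could be paired with the target workflow at the smallest edit distance. The pairing details in the slides may differ; only the distance function below is the standard algorithm.

def levenshtein(a, b):
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

reference = "fabozzi_RVCMSSW_7_4_0_pre6BeamHalo_13_150130_155653_4907"
targets = ["franzoni_RVCMSSW_7_4_0_pre8BeamHalo_13__MinGT_150315_195114_8636",
           "franzoni_RVCMSSW_7_4_0_pre8SingleElectronPt10_UP15__MinGT_150315_200318_6695"]
best_match = min(targets, key=lambda t: levenshtein(reference, t))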

RESTful service

The previously explained graphical user interface relies on a RESTful service. Below, the endpoints of the RESTful public API (public for logged-in users) are explained; a usage sketch follows the list.
  • GET / - returns GUI HTML
  • GET /userinfo - returns details about logged in user
  • GET /requests - returns list of requests with all information about them
  • GET /requests/<id> - returns the request with the given id
  • POST /requests - creates a new RelMon campaign; the POST data must be in the format shown in the example above
  • GET /requests/<id>/log - returns the comparison process log file
  • POST /requests/<id>/terminator - starts the termination of the specified RelMon campaign
  • POST /requests/<id>/close - closes the specified RelMon campaign by removing the records about this campaign from the RelMon service
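
As an illustration of the API only (a sketch: CERN SSO authentication is omitted, the host is the one quoted above, and the payload is a toy campaign):

import json
import urllib.request

campaign = {"name": "test_campaign", "threshold": 100, "categories": []}  # toy payload
req = urllib.request.Request(
    "https://cms-pdmv.cern.ch/relmonsvc/requests",
    data=json.dumps(campaign).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:  # POST /requests creates the campaign
    print(resp.read())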

For administrators

Deployment

  1. Install CMSSW on the remote computing machine.
  2. Install the following Python packages on the RelMon service machine: Flask, Flask-RESTful, Flask-CORS, python-crontab, Paramiko. Here is an example of how to do it with the pip package manager:
       pip install flask
       pip install flask-restful
       pip install flask-cors
       pip install python-crontab
       pip install paramiko
       
  3. Download RelMon service source from GitHub:
       git clone https://github.com/cms-PdmV/relmonService.git
       
  4. Edit the configuration file config to match your case (see Config options).
  5. Set the base option in static/index.htm to match the base URL at the proxy host. This base must end with /. For example, if the service is accessible at https://cms-pdmv.cern.ch/relmonsvc/, then the base option should be set to /relmonsvc/.

NOTE: If you are deploying the RelMon service without an authentication proxy in front of it, you could use this tutorial for setting up Shibboleth authentication on the service.

Config options

list administrators
List of CERN usernames that have administrative rights. The user defined as remote_user must appear in this list, because the remote machine must have administrative rights.
list authorized_users
List of CERN usernames that have the right to use this RelMon service.
list final_relmon_statuses
List of workflow statuses, of RelMon service notation, that are considered final and cannot be changed. Most of the time the list should look like this ["failed", "terminating", "finished"].
list final_wm_statuses
List of workflow statuses, of Workload Manager notation (https://github.com/dmwm/WMCore/wiki/Request-Status), that are considered to be final and no more root files can appear after workflow has reached one of these statuses.
string remote_host
Remote machine address.
string service_host
Address of RelMon service machine.
string service_base
Base URL at RelMon service proxy e.g. /relmonsvc.
string credentials_path
Path to the credentials file which is of structure {"user": <username>, "pass": <password>}.
string key_path
Path to the userkey. NOTE: RelMon service expects passwordless userkey.
string certificate_path
Path to the user certificate. NOTE: RelMon service expects passwordless user certificate.
string keytab_path
Path to the Kerberos keytab file.
string host_cert_path
Path to the RelMon service machine host certificate.
string host_key_path
Path to the RelMon service machine host key.
string remote_cmssw_dir
Path to the CMSSW directory at remote machine.
string remote_work_dir
Directory on remote machine for scripts to run and store temporary files.
string cmsweb_host
Host of cmsweb.
string dqm_root_url
URL to the DQM GUI root file access at cmsweb.
string data_file_name
Name of the file to store RelMon service data.
string logs_dir
Directory to store log files uploaded from the remote machine.
string dbsreader_url
URL to the DBSReader at cmsweb.
string wmstats_url
URL to Workload Manager status api at cmsweb.
string datatier_check_url
URL to Request Manager at cmsweb to get datasets info by name.
string relmon_path
Path to directory (most probably on afs) where finished RelMon reports should be placed.
integer service_port
RelMon service communication port.
integer time_between_status_updates
Interval in seconds to wait between rechecking workflow statuses.
integer time_between_downloads
Interval in seconds between downloader retries.
integer time_after_threshold_reached
Time in seconds to wait after threshold is reached or exceeded (except if 100% of workflows are ready) before doing the last statuses check and starting downloader.
boolean ignore_noroot_workflows
If set to true, workflows with the status "NoROOT" are ignored when calculating the threshold.

Starting and stopping

To start the service one needs to execute service.py:
ssh pdmvserv@vocms085.cern.ch
cd /home/relmonsvc/relmonService/
source /home/mcm/cmsPdmV/mcm/kinit.sh &
python service.py &> out.log &
You should wait one minute before closing the connection, to keep the AFS token. After one minute you will see the output of the kinit.sh script, "Password for pdmvserv@CERN.CH:"; you don't need to type anything, and the connection to the machine can be terminated.
To stop the service you can press Ctrl+C if the service is running in the foreground; otherwise, kill the process.

Screenshots

New RelMon campaign

[Screenshot: the form for a new RelMon campaign]

Latest RelMon campaigns

[Screenshot: the table of latest RelMon campaigns]