RelMon: a Tool to Perform Goodness of Fit Tests of Large Sets of Histograms
RelMon is a tool for the automatic comparison of two rootfiles containing histograms and profiles, possibly organised in directories.
It allows for example to make regressions of CMSSW releases through the comparison of the DQM histograms produced.
Its primary usage is to perform systematic validation of a 'test CMSSW release' against a 'reference CMSSW release'; the same suite of DQM distributions (developed by DPG/POG/PAG) is produced over a group of samples (data and MC) in the context of relVal production;
RelMon performs a compatibility test between the 'test' and 'reference' instance of each distribution, and provides an easily browsable and hierarchic organization of the outcome of tests, with pointers to access the tested distributions in the DQM gui.
The tool is entirely written in Python and relies on the pyROOT bindings of ROOT.
Two interfaces are available to read the DQM histograms both from rootfiles and from the DQM database via the
DQM2json utility.
The comparisons between two histograms are ranked using statistical tests, configurable by the user. Single directories are then ranked according to the ranks of their contents.
The information about the tests can be viewed as a plain ASCII report or as a set of web pages that try to aggregate the information in a concise form using pie-chart diagrams.
Third-party technologies are involved in the generation of the graphics of
RelMon, namely the DQM GUI plotting infrastructure, the
Google Chart API
and the
BluePrint
css framework.
A set of scripts eases the usage of the interfaces, providing a set of handy command-line tools. There is ongoing development of a web-based service to streamline and further automate the production of
RelMon reports.
Getting RelMon and set up the environment
For CMS users
RelMon is available natively with every CMSSW release since the 5.X cycle!
scramv1 project CMSSW CMSSW_8_0_14
cd CMSSW_8_0_14/src
git cms-addpkg Utilities/RelMon/
chmod 744 Utilities/RelMon/scripts/fetchall_from_DQM_v2.py
chmod 744 Utilities/RelMon/python/*
scramv1 build -j 2
cmsenv
voms-proxy-init -voms cms
For All Users
Get the tarball
here
and uncompress it.
cd RelMon
chmod +x scripts/*
source scripts/RelMon_set_env.sh
Now set up ROOT. You can use a stand-alone installation or the ROOT version that comes with CMSSW.
In case you plan to use CMSSW to set up ROOT, pyROOT and the DB interface, remember that it is sometimes easier to set up the Grid environment
before the CMSSW one, because of the Python version included in the gLite middleware.
File Interface
The file interface allows you to compare two CMSSW releases and produce a report about their compatibility, using the agreement of the whole set of histograms that the DQM harvesting step produced running on a particular dataset. The agreement is quantified by a configurable statistical test.
A script is available in the scripts directory.
Let's start with a meaningful example; suppose two sample rootfiles are available:
compare_using_files.py DQM__RelValZMM__CMSSW_4_2_0_pre7-START42_V6-v2__DQM.root DQM__RelValZMM__CMSSW_4_2_0_pre6-START42_V4-v1__DQM.root -C -R -d Muons -o File_int
This command will start the comparisons (
-C
) of the two rootfiles, concentrating only on the Muons directory (
-d Muons
), and produce a report (
-R
, this option must appear) in the directory File_int (
-o File_int
). NOTA BENE: the format of the filename,
XXX__SAMPLE__CMSSWVERSION__TIER.root
, the one of the harvested DQM files, is at the moment essential for the correct functioning of the script.
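Given this fixed naming scheme, the sample, release and tier fields can be recovered by simple string splitting. Below is a minimal illustrative sketch (not RelMon's actual parser):

```python
# Illustrative sketch: parse the harvested DQM filename format
# XXX__SAMPLE__CMSSWVERSION__TIER.root into its fields.
import os

def parse_dqm_filename(filename):
    """Split e.g. DQM__RelValZMM__CMSSW_4_2_0_pre7-START42_V6-v2__DQM.root
    into (sample, cmssw_version, tier)."""
    base = os.path.basename(filename)
    if not base.endswith(".root"):
        raise ValueError("not a rootfile: %s" % filename)
    parts = base[:-len(".root")].split("__")
    if len(parts) != 4:
        raise ValueError("unexpected DQM filename format: %s" % base)
    _, sample, version, tier = parts
    return sample, version, tier

print(parse_dqm_filename(
    "DQM__RelValZMM__CMSSW_4_2_0_pre7-START42_V6-v2__DQM.root"))
```

This also makes clear why arbitrarily named files confuse the script: without the four double-underscore-separated fields there is nothing to parse.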
You can also avoid using files named like DQM harvested rootfiles, with the help of the
--meta
switch. You can specify a short description of your files with:
... --meta "MyShortDescription1 @@@ MyShortDescription2"
The "@@@" is used as a separator between the two descriptions.
More options are at your disposal of course. To explore them just type:
compare_using_files.py -h
How does it work?
Basically, the directory structure of the files is walked and the histograms are compared one by one. An object of type Directory (see the dirstructure module) is filled with the information about the outcome of the tests.
At the end of the comparison, this Directory instance is dumped on disk as a pickle (the serialised form of a Python object) for future usage, e.g. html report production.
The option
-P
allows you to start from such a pickled object and produce the report without re-running the comparisons.
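The pickle round trip can be illustrated with a toy class (the class below is a made-up stand-in, not the actual dirstructure.Directory API):

```python
import os, pickle, tempfile

class Directory:
    """Toy stand-in for RelMon's dirstructure.Directory (illustrative only):
    it stores a name and the per-histogram comparison outcomes."""
    def __init__(self, name):
        self.name = name
        self.results = {}  # histogram name -> p-value of the comparison

d = Directory("Muons")
d.results["ptDistribution"] = 0.42

# Dump the comparison outcome to disk, as done after the -C step...
pkl_path = os.path.join(tempfile.mkdtemp(), "outcome.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(d, f)

# ...and reload it later (the role of the -P option) to rebuild the
# report without re-running the comparisons.
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)
print(restored.name, restored.results["ptDistribution"])
```

The design choice is simple: the expensive step (statistical comparison) is decoupled from the cheap one (report generation), with the pickle as the interface between them.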
If you compare the directory structure of a harvested file in a TBrowser with the one in your report, you'll notice two things: first, empty directories are skipped, since they are uninformative; second, the
!RunSummary directories have disappeared.
This is done on purpose to align the output of the DB and File interface. If you are interested in the details look at the
Directory::prune
method.
The webpages are produced by functions manipulating
Directory
objects which are collected in the
directory2html
module. This library is indeed meant to lay the groundwork for a purely hypothetical and not at all promised future
CherryPy version of
RelMon.
Where are the reports?
The reports are in general collected here:
http://cms-service-reldqm.web.cern.ch/cms-service-reldqm/ReleaseMonitoring
which is where the
/afs/cern.ch/cms/offline/dqm/
afs directory is served. The access specifically to the release validation reports is eased by the
top level interface
.
After a few months of being available online, relmon reports are removed from the
top level interface
and archived in here:
/eos/cms/store/group/pdmv
. Once available, the eos2castor archival tool will need to be used to archive out of EOS. Please ask
PdmV if you need an old RelMon report to be temporarily made available online again.
Running locally
If a user wants to run
RelMon locally (to view the reports on their own computer), they should use the --standalone flag so that the report HTML files fetch
JavaScript over HTTP.
The standalone method is also triggered when setting up metas for custom root files.
For example:
ValidationMatrix.py -a rootFullSimPU -o FullSimReport --standalone
DQM DB Interface
The DQM DB interface allows you to gather the information about the histograms to be compared directly from the DQM DB content. The API exploited to achieve this goal is the
DQM2json. Again a script is available that wraps the interface class. Let's inspect a sample command:
compare_using_db.py -1 CMSSW_5_3_0-START53_V4-v1 -2 CMSSW_5_3_1-START53_V5-v1 -S RelValZMM -C -R -d "Muons" -o DB_int
This command uses again the already known
-C
,
-R
,
-d
and
-o
options. The difference is now that the sample (
-S
) and the two releases have to be specified separately (
-1
,
-2
). This slight asymmetry between the two approaches is due to the fact that all the information that was automatically parsed from the filename must be entered explicitly here.
As usual,
compare_using_db.py -h
will display all the necessary help.
Please note the
-T
option. Along the lines of the previous argument, the data Tier also has to be entered here, since it cannot be parsed from a filename. The default is
DQM
.
Another possibility could be:
compare_using_db.py -1 CMSSW_4_2_0_pre7-START42_V6-v2 -2 CMSSW_4_2_0_pre6-START42_V4-v1 -S RelValZMM -C -R -d "00 Shift" -o 00Shift
You might have noticed the difference between the directory investigated here,
00 Shift
, and previous one,
Muons
. The
00 Shift
directory is an example of the usage of [[https://twiki.cern.ch/twiki/bin/view/CMS/DQMGuiLayouts][DQMGuiLayouts]]. These kinds of directories do not physically exist inside the harvested rootfiles as
TDirectoryFiles
, but they are "fictitious" in the sense that they summarise histograms that exist already.
How does it work?
Here as well the directory structure exposed by the
DQM2json API is navigated. Comparisons are then performed histogram by histogram. The mechanism is then the same as for the File Interface: a Directory instance is filled, pickled on disk and the html is produced.
You might have noticed some messages about the number of threads run simultaneously. One of the main differences between the two interfaces is the usage of threads to recursively navigate the directory structure. The problem is indeed easily parallelisable, though threads only really bring a big advantage when using the
DQM2json API, where they compensate for the latencies imposed by the responses of the server. If you are curious about the way in which Python implements threads, please have a look
here
.
Black Lists
Black listing directories
Both interfaces (and scripts!) support blacklists. This is particularly desirable in the presence of directories containing a very large number of histograms, for example RPC or HLT.
The way in which a blacklist is entered on the command line is:
-B DIRNAME1@LEVEL1,DIRNAME2@LEVEL2,...
i.e. a comma-separated list of name-level pairs, each characterising a single directory. If the level is
-1
, all the directories in the tree having the specified name will be skipped.
Black listing single histograms
From CVS tag V00-08-05 it is possible to skip single histograms in the comparison. This is supported by the
ValidationMatrix.py and compare_using_files.py scripts. The black-list is located at:
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/Utilities/RelMon/data/blacklist.txt
By default the scripts do not skip single histograms; if users want to use this list for skipping, they should include the
--use_black_file
command line parameter. Examples:
ValidationMatrix.py -a rootFilesDIR -o reportsOUTPUT --use_black_file -N 7
compare_using_files.py DQM_rootFile1.root DQM_rootFile2.root -C -R -o outputREPORT --use_black_file
Statistical Tests
Considerable emphasis was given to the requirement of statistical tests to check the compatibility of the pairs of histograms. There is a complete suite of tests available natively, and a very flexible mechanism that allows the collection of statistical test classes to be expanded.
The available tests are:
- Chi2: this is the default test of RelMon. The ROOT implementation of the test
is exploited. The default selection on p-value to determine whether a comparison has passed (or not) is set to >1e-05 (<).
- KS: this test is NOT designed for the comparison of binned data. Indeed, the Kolmogorov-Smirnov theorem is not valid for binned datasets. It is nevertheless included since it is fast and gives almost appropriate results in the presence of very fine binning. The ROOT implementation of the test
is exploited.
- Bin2Bin: this test is implemented to check the 1-to-1 compatibility of two sets of histograms. A use-case could be a change of the CMS production job submission tool and the comparison of the produced samples. The implementation is in Python, but all the computationally heavy operations are pushed down into the C layer of the Python interpreter. It must be noted that the P-Value returned by this test should be handled with care. Indeed it is not a P-Value, but rather the fraction of bins whose contents matched. Therefore, if you have 100 bins and the content of 83 was the same, the test returns 0.83. The default selection on p-value to determine whether a comparison has passed (or not) is set to >0.9999 (<).
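The Bin2Bin idea is easy to sketch (this is an illustrative reimplementation, not RelMon's code):

```python
# Illustrative sketch of the Bin2Bin test: the returned "p-value" is just
# the fraction of bins whose contents are identical in the two histograms.
def bin2bin(bins1, bins2):
    if len(bins1) != len(bins2):
        raise ValueError("histograms have different binning")
    equal = sum(1 for a, b in zip(bins1, bins2) if a == b)
    return float(equal) / len(bins1)

# 100 bins, 83 of which agree -> the test returns 0.83
h1 = list(range(100))
h2 = list(range(83)) + [-1] * 17
score = bin2bin(h1, h2)
print(score)            # 0.83
print(score > 0.9999)   # False: fails the default selection
```

This makes the caveat above concrete: 0.83 looks like a respectable p-value, but here it only means that 17% of the bins differ.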
How does it work?
All
RelMon statistical tests (and statistical test wrappers around ROOT implementations) inherit from a very generic class, the
StatisticalTest
.
It is sufficient to look into the
utils
module to understand the details of the implementation and how easy it is to implement new tests, always provided that literature offers new candidates with respect to the two and a half cited above. This is indeed not a trivial problem, but please contact the author in case you think that your test should be implemented in
RelMon.
Advanced Topics
The following topics go beyond the normal needs of average users; nevertheless, if you are curious, read on!
Validation Matrix
Is it possible to verify the compatibility of two CMSSW releases using all histograms produced in the harvesting steps for all Release Validation samples produced? The answer is yes, with
ValidationMatrix.py
.
This script manages multiple instances of the file interface so as to perform several comparisons of rootfile pairs simultaneously, each rootfile corresponding to a dataset/release combination.
A typical command line looks like:
ValidationMatrix.py -a myRootfilesDir -o CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2 -N 7
This command will generate a directory,
CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2
, in which all the samples represented by the rootfiles in
myRootfilesDir
will be reproduced in separate directories. The number of processes to be run at the same time is specified by
-N
: please be gentle since these are not shy Python threads, but each process will run on a different CPU (if available).
More details are needed about the
-a
option. The tool will try to fetch all the rootfiles contained in the specified directory and organise them in pairs to be compared. So please help the tool help you and copy pairs of files into the directory: pairs will be built from each unique release/dataset combination; be careful that, for instance,
FullSim
and
FastSim
are handled as if they were different datasets, so you should separate those sets of files into different folders in order to get the meaningful comparisons you want.
PU
files (i.e. samples with pile-up simulation) are treated as different datasets, thus also in this case you may want to make a dedicated directory. If for some reason the automatic sorting fails, you can always fall back to a manual selection, with the
-R
and
-T
options, where R stands for reference and T for test.
The output directory should be of the form XXXXVSYYYY, use any other formatting at your own risk.
Once the output directory is generated, you can produce the summary condensing all the information about the single comparisons with the command line:
ValidationMatrix.py -i CMSSW_X_Y_Z_pre1VSCMSSW_X_Y_Z_pre2
which tells the
ValidationMatrix.py
to process all the pickles together and generate a comprehensive report.
Compressing reports and publishing reports
Reports can be published in two ways:
- Copy the report directory on a webspace, for example AFS
- Compress the report, add a .htaccess file to enable Apache serving compressed files and copy the report directory on a webspace, for example AFS gaining a factor 3-4 in disk space and bandwidth.
For a simple report the first option can be a valid strategy, but once a report spans over many
RelVal samples it is better to optimise.
The
dir2webdir.py
script optimises the report. The syntax is:
dir2webdir.py ReportDir
All the files will automatically be compressed using the gzip algorithm, the pickles will be copied into a separate directory and a .htaccess file will be created in the top directory.
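Conceptually, the optimisation amounts to something like the following sketch (the .htaccess contents and the function name are illustrative assumptions, not dir2webdir.py's actual output):

```python
# Illustrative sketch of the dir2webdir.py idea: gzip every HTML file of the
# report and add a .htaccess telling Apache to serve the compressed copies.
import gzip, os, shutil, tempfile

def compress_report(report_dir):
    for root, _dirs, files in os.walk(report_dir):
        for name in files:
            if name.endswith(".html"):
                path = os.path.join(root, name)
                with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                os.remove(path)  # keep only the compressed copy
    # Hypothetical .htaccess: map *.html requests to the *.html.gz files.
    with open(os.path.join(report_dir, ".htaccess"), "w") as f:
        f.write("RewriteEngine on\n"
                "RewriteRule (.*)\\.html$ $1.html.gz [L]\n"
                "AddEncoding gzip .gz\n")

# Demo on a throwaway report directory
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "index.html"), "w") as f:
    f.write("<html></html>")
compress_report(demo)
print(sorted(os.listdir(demo)))  # ['.htaccess', 'index.html.gz']
```

The factor 3-4 mentioned above comes from the fact that HTML compresses very well and the files are compressed once, at publication time, rather than on every request.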
For operators: generating full reports after upload of harvested files
The steps to be followed by the operators in order to fetch the relevant files, build and publish a
RelMon report are the following:
# Setup the grid proxy in order to authenticate to the server.
export MAIN_RELEASE=CMSSW_X_Y_Z
export COMP_RELEASE=CMSSW_X_Y_Z'
cd /tmp/"$USER"
export WORKDIR=RelMonReport_"$MAIN_RELEASE"VS"$COMP_RELEASE"
mkdir $WORKDIR
cd $WORKDIR
cmsrel $MAIN_RELEASE
cd $MAIN_RELEASE ;cmsenv ;cd -
cvs co -d RelMon -r V00-06-05 UserCode/dpiparo/RelMon
cd RelMon;source scripts/RelMon_set_env.sh ;cd -
mkdir work;cd work
compare_two_releases.sh $MAIN_RELEASE $COMP_RELEASE
These command lines should be executed on a machine on which
afs
is mounted and which has at least ~50 GB of free space in the
/tmp
directory: perfect candidates are the
lxplus
machines. The
relval
user has full access to the
/afs/cern.ch/cms/offline/dqm/
directory.
NOTE: Due to an unexpected pyROOT behaviour it is better to log in with X forwarding enabled.
How to choose the CMSSW_X_Y_Z and CMSSW_X_Y_Z' releases
The idea is to test one release against the preceding one to spot the changes that could have been introduced.
As a general rule, every pre-release should be checked against the previous pre-release (CMSSW_X_Y_Z_preN vs CMSSW_X_Y_Z_preN-1). The test of the pre1 release should be carried out against the last release of the preceding cycle.
The releases (not pre-releases) should be checked against the last pre-release (CMSSW_4_3_0 VS CMSSW_4_3_0_pre7) or the last release (CMSSW_4_2_6 VS CMSSW_4_2_5).
Minimization of output HTML file names
RelMon generates report HTML files for each directory. Since in the afs file system the file name length is limited to a certain number of characters, you can use output file name minimization. To minimize the report's file names, use the command line parameter
--hash_name
It uses Python's hashlib.md5 to generate unique names; to shorten them further,
RelMon uses only the first 10 characters of the generated hash. Example:
ValidationMatrix.py -a rootFileDIR -o reportOUTPUT -N 4 --hash_name
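The shortening scheme can be sketched in a few lines (illustrative; RelMon's exact naming convention may differ in details):

```python
# Illustrative sketch of the --hash_name idea: replace a long report file
# name by the first 10 characters of its md5 hex digest.
import hashlib

def short_name(long_name):
    return hashlib.md5(long_name.encode()).hexdigest()[:10]

name = short_name("Muons_MuonRecoAnalyzer_comparison_report.html")
print(len(name))  # always 10, regardless of the input length
```

Since md5 is deterministic, the same directory always maps to the same short name, so links between report pages stay consistent.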
Useful Links
Screen Shots
Main Page
Main Page
Top of global Report
The first part of the global report with the single subsystems
Detail of the subsystems with a gauge
The quality of a single subsystem can be inspected quickly looking at gauges
Even more detail
All the directories compared for every
RelVal sample
One directory of one single sample
Understand the status of the single subdirectories in an instant and click to get more information
The plots: from the DQM GUI
You can exploit the power of the DQM GUI to display the comparison plots for all histograms of the analysed set
The plots: internally generated
If the files are not indexed in the GUI (e.g. non DQM files!) you can generate the plots yourself and visualise them
--
DaniloPiparo - 15-Jul-2011
RelMon service is a
RelMon report production automation tool. The production service is accessible at
https://cms-pdmv-dev.cern.ch/relmonsvc/
. The tool consists of RESTful service
and graphical user interface.
Notation
- RelMon campaign - a request for the RelMon service to make a RelMon comparison (as at the RelMon Reports homepage). Initially a RelMon campaign consists of its name, a threshold (see Threshold) and one to six categories (see below). After the RelMon service starts to process a RelMon campaign, the campaign is appended with more detailed information.
- Category - part of a RelMon campaign. A category consists of an HLT option, two lists - reference and target - of relval workflow names, and the name identifying the category. In this context there are six available category names: "Data", "FullSim", "FastSim", "Generator", "FullSim_PU" and "FastSim_PU".
- Threshold - the percentage of workflows for which root files must be accessible before starting RelMon report production.
To start a new
RelMon report production process,
RelMon service must be given initial
RelMon campaign as an input. Below is a real example of an initial
RelMon campaign which could be passed to
RelMon service:
{
"name":"BH&SE_740pre6_vs_740pre8",
"threshold":100,
"lastUpdate": 1461836633,
"categories":[
{
"name":"FullSim",
"HLT":"only",
"lists":{
"reference":[
"fabozzi_RVCMSSW_7_4_0_pre6BeamHalo_13_150130_155653_4907",
"fabozzi_RVCMSSW_7_4_0_pre6SingleElectronPt10_UP15_150130_155755_467"
],
"target":[
"franzoni_RVCMSSW_7_4_0_pre8BeamHalo_13__MinGT_150315_195114_8636",
"franzoni_RVCMSSW_7_4_0_pre8SingleElectronPt10_UP15__MinGT_150315_200318_6695"
]
}
}
]
}
In the example there is a JSON representation of a
RelMon campaign named "BH&SE_740pre6_vs_740pre8", with the threshold set to 100% and with one category, "FullSim". The HLT option in that only category being set to "only" means that the
RelMon report for this category will be made only with the HLT flag set to true. The category also has two workflow names in each of the two lists.
After
RelMon campaign has been passed to the
RelMon service to start a new
RelMon report production, the service starts checking the names of the workflows' DQMIO outputs at
Request Manager
. If the DQMIO output name is accessible, then the
RelMon service starts querying
DBSReader
to get the expected number of root files for the given workflow. Also, the
RelMon service starts querying the DQM GUI for root file availability and compares the number of available root files with the expected number. If these two numbers are equal, the workflow is assigned the status "ROOT" (see
Workflow statuses explained). Moreover, the
RelMon service checks the Workload Manager database to get information about the processing state of the workflow. The graph of states that can be returned from the Workload Manager database is explained
here
. The interesting states are those which guarantee that no more root files can appear for a workflow once it is in that state. This lets the
RelMon service detect workflows which do not have enough root files available and are already in a state where no more root files will be produced; in such a case the
RelMon service assigns the "NoROOT" status to these workflows. This whole procedure of querying services and databases is repeated at constant intervals until the threshold is reached or exceeded (see the explanation below).
By default threshold is calculated by the formula:
t = R/(T-d-r)
,
where t is the threshold, R the number of workflows for which enough root files already exist, T the total number of requested workflows (from each category, from each list), d the number of workflows which do not produce DQMIO output and r the number of workflows for which it is already known that not enough root files are available. This formula can be influenced by changing the
RelMon service configuration file (see
Config options).
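As a worked example of the formula (the numbers below are made up for illustration):

```python
# Worked example of the default threshold formula t = R / (T - d - r).
def threshold(R, T, d, r):
    return float(R) / (T - d - r)

# 20 workflows requested in total, 2 produce no DQMIO output, 1 is known
# to lack root files, and 14 already have enough root files available:
t = threshold(R=14, T=20, d=2, r=1)
print(t)  # ~0.82 of the workflows that can still deliver root files are ready
```

Note that d and r shrink the denominator: workflows that can never deliver root files are excluded from the count, so the threshold measures readiness only among the workflows that still matter.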
After the threshold is reached or exceeded, if not all workflows (except the ignored ones, i.e. "NoDQMIO" and "NoROOT") have enough root files available, the
RelMon service waits an amount of time specified in the configuration (4 hours by default), then, for the last time, checks the available root files and starts downloading the needed ones. When the downloads are finished,
RelMon report production is started.
For
RelMon report production
ValidationMatrix.py
script is used. Finding pairs of root files to be compared is done automatically by
ValidationMatrix.py
script. It is known that there exist cases in which the automatic pairing fails. In such cases the
RelMon service might produce empty or incomplete
RelMon reports for categories for which pairing has failed.
Produced
RelMon reports are optimized by running
dir2webdir.py
script on them. The optimized reports are then moved to a directory visible by the
RelMon Reports homepage (the directory can be customized in the configuration file, see
Config options).
Workflow statuses explained
- "initial"
- RelMon service has not done any action on this workflow.
- "waiting"
- It is known that the workflow produces DQMIO output, but the name of this output is still unknown.
- "NoDQMIO"
- Workflow does not produce DQMIO output.
- "DQMIO"
- The DQMIO output name is known; the RelMon service is checking for root file availability.
- "NoROOT"
- According to Workload Manager the workflow is in one of the final states (final states are defined in service configuration), but there are not enough root files available.
- "ROOT"
- There are enough root files for given workflow.
- "downloaded"
- root files for this workflow have been downloaded.
- "failed"
- Something went wrong while processing this workflow.
- "failed_rqmgr"
- Request manager doesn't know this workflow name.
- "failed_download"
- Something went wrong during the download process.
Graphical user interface
Visiting home page of
RelMon service redirects the browser to the
graphical user interface. In case you are not signed in to your CERN
account you will be asked to sign in before reaching the GUI. On the
home page there are two sections: "New
RelMon request (campaign)" and
"Latest requests".
Submitting new RelMon campaign.
- Click "New RelMon" button.
- Fill needed categories (see Filling categories) with reference and target workflow names and the HLT option.
- Enter RelMon campaign name and threshold.
- Click "Submit" (or "Submit query" depending on your browser).
- Click "Confirm"
Filling categories
After clicking "New
RelMon" button, the form for new
RelMon campaign is
revealed. There is a tab for each category (e.g. "Data", "FullSim",
"Generator", etc.). Each tab contains a field for list of reference workflow names, a field for list of target workflow names and an HLT option (except for
"Generator" category which does not have the HLT option). Needed workflow names can be looked up at
https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests.html
. Workflow names in the lists must be separated with white space.
Campaign name
Below the tab set there is a field for the
RelMon campaign name. This is how the finished comparison will be named on the
RelMon Reports page. Note: choose a name that is a valid Linux directory name. If you create a campaign with a name that already exists, you delete all categories from the first campaign: e.g. if you have a campaign "xxx" with categories Data and
FullSim, and you create another campaign with the same name "xxx" and categories
FullSim and
FastSim, you take responsibility that everything from the first campaign in the category
FullSim will be deleted and replaced with the new files from the second campaign.
Latest requests
In this section there is a table of submitted
RelMon campaigns. The left column shows general information about the campaigns. The right column shows information about the categories and workflows of a specific campaign.
EDIT functionality
There is an "Edit" button for each campaign. You can edit campaigns while they are not in a final status, i.e. while they are neither finished nor failed. When you press Edit, all fields will be filled in with the information from that campaign. After submitting the edited campaign, that campaign immediately starts from the beginning, regardless of whether it was in the downloading or comparing phase.
General information
Interesting elements of the general information column are the "Status" field, the "Log file" link, and the "Terminate" and "Close" buttons. The "Terminate" button, when clicked, stops the
ongoing
RelMon request and cleans all downloaded and generated files. The "Close" button is meant for cleaning the
RelMon report record at the
RelMon service while leaving the actual report available. "Log file" is a link to a log file generated while producing the
RelMon reports. The "Status" field can have one of the following values: "initial", "waiting", "downloading", "comparing", "finished", "failed" or "terminating".
Explanation of request statuses:
- "initial"
- Request has just been submitted.
- "waiting"
- Service is checking statuses of workflows and there are not enough workflows with status "ROOT" (see Workflow statuses explained).
- "downloading"
- root files are being downloaded.
- "downloaded"
- root files are downloaded.
- "Qued_to_compare"
- campaign is added to queue to compare root files.
- "comparing"
- RelMon reports are being produced (
ValidationMatrix.py
script).
- "finished"
- RelMon reports are completed and uploaded to be accessible via RelMon Reports page.
- "failed"
- Something went wrong.
- "terminating"
- The request processing is being stopped and cleaning is in progress.
Detailed information about categories and workflows
In this column there are expandable containers for every category. An expanded category reveals expandable lists. An expanded list reveals workflow names, their statuses and possibly the number of root files available and expected for the particular workflow. The possible workflow statuses are explained
here
How file comparison works
The comparison proceeds in a couple of steps. First of all there are two lists: Reference and Target. The workflows are filtered first: e.g. if a workflow "xxx" in the Reference list has status
"NoROOT",
"NoDQMIO" or "failed_rqmgr", the workflow matching it in the Target list is looked up and both of them are deleted. After that, the number of workflows left in each list is checked and the matching of workflows starts. For the matching, the Levenshtein distance algorithm is used. How exactly the comparison works, with examples, can be found in these slides:
https://indico.cern.ch/event/479139/contributions/2145971/attachments/1263354/1868703/Presentation_RelMon-Service_v10.pdf
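The matching step can be sketched with a classic edit-distance implementation (illustrative only; the service's actual pairing code may differ):

```python
# Illustrative sketch of Levenshtein-distance-based workflow pairing.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match(reference, target):
    # Pair each reference workflow with the closest target workflow name.
    return {ref: min(target, key=lambda name: levenshtein(ref, name))
            for ref in reference}

ref = ["fabozzi_RVCMSSW_7_4_0_pre6BeamHalo_13"]
tgt = ["franzoni_RVCMSSW_7_4_0_pre8SingleElectronPt10",
       "franzoni_RVCMSSW_7_4_0_pre8BeamHalo_13"]
print(match(ref, tgt))
```

Workflow names of the same sample differ mostly in the requestor prefix and the release string, so the edit distance to the matching sample is much smaller than to any other sample.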
RESTful service
The previously explained graphical user interface relies on a RESTful
service. Below, the endpoints of the RESTful public API (public for
logged-in users) are explained.
- GET / - returns GUI HTML
- GET /userinfo - returns details about logged in user
- GET /requests - returns list of requests with all information about them
- GET /requests/&lt;id&gt; - returns the request with the given id
- POST /requests - creates new RelMon campaign. POST data must be of format as in this example
- GET /requests/&lt;id&gt;/log - returns the comparison process log file
- POST /requests/&lt;id&gt;/terminator - starts the termination of the specified RelMon campaign
- POST /requests/&lt;id&gt;/close - closes the specified RelMon campaign by removing the records about this campaign from the RelMon service
For administrators
Deployment
- Install CMSSW on remote computing machine
- Install the following python packages on RelMon service machine: Flask, Flask-RESTful, Flask-CORS, python-crontab, Paramiko. Here is an example of how to do it with pip
package manager:
pip install flask
pip install flask-restful
pip install flask-cors
pip install python-crontab
pip install paramiko
- Download RelMon service source from GitHub:
git clone https://github.com/cms-PdmV/relmonService.git
- Edit configuration file
config
to match your case. See Config options
- Set
base
option in static/index.htm
to match base url at proxy host. This base
must end with /
. For example if the service is accessible at https://cms-pdmv.cern.ch/relmonsvc/
, then the base
option should be set to /relmonsvc/
.
NOTE: If you are deploying the
RelMon service not behind an authentication proxy, you could use
this tutorial
to set up Shibboleth authentication on the service.
Config options
- list administrators
- List of CERN usernames that have the administrative rights. The user that is defined as
remote_user
must appear in this list because the remote machine must have administrative rights.
- list authorized_users
- List of CERN usernames that have the right to use this RelMon service.
- list final_relmon_statuses
- List of workflow statuses, of RelMon service notation, that are considered final and cannot be changed. Most of the time the list should look like this
["failed", "terminating", "finished"]
.
- list final_wm_statuses
- List of workflow statuses, of Workload Manager notation (https://github.com/dmwm/WMCore/wiki/Request-Status
), that are considered to be final and no more root files can appear after workflow has reached one of these statuses.
- string remote_host
- Remote machine address
- string service_host
- Address of RelMon service machine.
- string service_base
- Base URL at RelMon service proxy e.g.
/relmonsvc
.
- string credentials_path
- Path to the credentials file which is of structure
{"user": <username>, "pass": <password>}
.
- string key_path
- Path to the userkey. NOTE: RelMon service expects passwordless userkey.
- string certificate_path
- Path to the user certificate. NOTE: RelMon service expects passwordless user certificate.
- string keytab_path
- Path to the Kerberos keytab file.
- string host_cert_path
- Path to the RelMon service machine host certificate.
- string host_key_path
- Path to the RelMon service machine host key.
- string remote_cmssw_dir
- Path to the CMSSW directory at remote machine.
- string remote_work_dir
- Directory on remote machine for scripts to run and store temporary files.
- string cmsweb_host
- Host of cmsweb
- string dqm_root_url
- URL to the DQM GUI root file access at cmsweb.
- string data_file_name
- Name of the file to store RelMon service data.
- string logs_dir
- Directory to store log files uploaded from remote machine
- string dbsreader_url
- URL to the DBSReader at cmsweb.
- string wmstats_url
- URL to Workload Manager status api at cmsweb.
- string datatier_check_url
- URL to Request Manager at cmsweb to get datasets info by name.
- string relmon_path
- Path to directory (most probably on afs) where finished RelMon reports should be placed.
- integer service_port
- RelMon service communication port.
- integer time_between_status_updates
- Interval in seconds to wait between rechecking workflow statuses.
- integer time_between_downloads
- Interval in seconds between retrying downloader.
- integer time_after_threshold_reached
- Time in seconds to wait after threshold is reached or exceeded (except if 100% of workflows are ready) before doing the last statuses check and starting downloader.
- boolean ignore_noroot_workflows
- If set to
true
workflows with statuses "NoROOT" are ignored when calculating threshold.
Starting and stopping
To start the service one needs to execute
service.py
:
ssh pdmvserv@vocms085.cern.ch
cd /home/relmonsvc/relmonService/
source /home/mcm/cmsPdmV/mcm/kinit.sh &
python service.py &> out.log &
You should wait one minute before closing the connection, to keep the afs token. After one minute you will see the output of the kinit.sh script:
Password for pdmvserv@CERN.CH:
You don't need to type anything, and the connection to the machine can be terminated.
To stop the service you can press Ctrl+C if the service is running in the foreground; otherwise, kill the process.
Screenshots
New RelMon campaign
Latest RelMon campaigns