Note, this is being deprecated for https://twiki.cern.ch/twiki/bin/view/CMS/CompOpsProductionReprocessing
Overview of release validation
Release validation workflows are organized into batches of similar workflows. Organizing workflows into batches is convenient for the following reasons:
- similar workflows have similar errors, and the reports about the errors can be condensed this way
- discussions of the status of workflows are easier to organize
- e-mail notifications about completed workflows are easier to organize
The chronology of a batch is the following:
- The batches are requested by members of the PPD using the webpage https://anlevin.web.cern.ch/anlevin/relval/batchmanager.html
- The requests are sent to a server running on cmsrelval002.cern.ch
- The server does some simple checks, for example on the workflow names, then inserts the information into a database and sends a notification e-mail to hn-cms-dataopsrequests@cern.ch. The server code is here: https://github.com/AndrewLevin/relval_batch_manager
- The database, which runs on the machine dbod-cmsrv1.cern.ch, receives the information
- Scripts running on the machine cmsrelval003.cern.ch do careful checks of the workflows' configuration and then assign the workflows to a site
- The workflows get acquired by the relval wmagent and run at the site
- When all of the workflows in a batch are finished, scripts running on the machine cmsrelval003.cern.ch collect information about failed jobs and output datasets and send a notification e-mail to hn-cms-relval@cern.ch
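The simple checks the server performs on workflow names are not spelled out here. As a rough illustration only, a sanity check in that spirit could look like the sketch below; the function name and the regular expression are assumptions, not the actual code in relval_batch_manager:

```python
import re

# Hypothetical sketch of a workflow-name sanity check.  Relval workflow
# names end in a date stamp, a time stamp, and a numeric suffix, e.g.
# bsutar_PR_reference_RelVal_273410_160519_125653_7959.  The exact
# pattern used by the real server is an assumption here.
WORKFLOW_NAME_RE = re.compile(r'^[A-Za-z0-9_]+_\d{6}_\d{6}_\d+$')

def workflow_names_look_valid(workflow_names):
    """Return True if every workflow name matches the expected pattern."""
    return all(WORKFLOW_NAME_RE.match(name) for name in workflow_names)
```

A request whose names fail such a check would be rejected before anything is inserted into the database.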
Responsibilities of the release validation operators
- Maintain all of the scripts running on cmsrelval002.cern.ch and cmsrelval003.cern.ch
- Pay attention to changes in the services that these scripts interact with, e.g. reqmgr2, dbs3, and phedex
- Monitor the status of the non-announced relval batches and make sure things do not get stuck
- Respond quickly to e-mails/messages/phone calls asking for help with submitting relval workflows, asking about the status of relval workflows, asking about errors in the relval workflows, etc.
- Attend the weekly workflow team meetings and computing operations meetings
- Interact with sites using ggus tickets if there are site-related problems e.g. file access problems
The mysql database
- The workflow names and the metadata associated with the batches are stored in a MySQL database provided by the CERN DB On Demand service (https://dbondemand.web.cern.ch/DBOnDemand/). This database contains four tables:
- workflows, which contains workflow names
- batches, which contains batch metadata
- datasets, which contains the output dataset names
- clone_reinsert_requests, which contains unprocessed requests to clone a batch
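The table layout can be sketched as follows. Only the columns that appear elsewhere on this page are shown (the real MySQL schema has more), and sqlite3 is used instead of MySQL purely so the sketch is self-contained. A batch id like 2016_05_19_3_0 decomposes into (useridyear, useridmonth, useridday, useridnum, batch_version_num):

```python
import sqlite3

# Minimal sketch of the four tables; column sets beyond those mentioned
# on this page are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batches   (useridyear TEXT, useridmonth TEXT, useridday TEXT,
                        useridnum INTEGER, batch_version_num INTEGER, status TEXT);
CREATE TABLE workflows (useridyear TEXT, useridmonth TEXT, useridday TEXT,
                        useridnum INTEGER, batch_version_num INTEGER,
                        workflow_name TEXT);
CREATE TABLE datasets  (workflow_name TEXT, dataset_name TEXT);
CREATE TABLE clone_reinsert_requests (useridyear TEXT, useridmonth TEXT,
                        useridday TEXT, useridnum INTEGER,
                        batch_version_num INTEGER,
                        site TEXT, processing_version INTEGER);
""")
conn.execute("INSERT INTO batches VALUES ('2016','05','19',3,0,'approved')")
conn.execute("INSERT INTO workflows VALUES ('2016','05','19',3,0,"
             "'bsutar_PR_reference_RelVal_273410_160519_125653_7959')")

# List the workflows belonging to batch 2016_05_19_3_0
workflows_in_batch = [row[0] for row in conn.execute(
    "SELECT workflow_name FROM workflows WHERE useridyear='2016' AND "
    "useridmonth='05' AND useridday='19' AND useridnum=3 AND batch_version_num=0")]
```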
Batch status
- A batch can have the following statuses:
- approved, there is actually no approval step, so this is the status of the batch when it is created
- input_dsets_ready, means that it has been checked that the input datasets are at the site where the batch will run
- waiting_for_transfer, means that input datasets are being transferred to the site where the batch will run
- assigned, the workflows in the batch have been assigned to a site and have already started running or will start running soon
- announced, means that the workflows in the batch have finished running and the announcor.py script has done the announcement step
- reject_abort_requested, means that the batch is waiting to be killed by the batch_killor.py script
- assistance, all of the workflows in the batch are completed, but there is a large rate of some error, so a human should check what is going on
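The movements between these statuses can be summarized as a small transition table. The transitions below are inferred from the descriptions on this page, not taken from the scripts themselves:

```python
# Inferred batch status flow; the real scripts may allow other transitions.
BATCH_TRANSITIONS = {
    "approved":             {"input_dsets_ready", "waiting_for_transfer", "reject_abort_requested"},
    "waiting_for_transfer": {"input_dsets_ready", "reject_abort_requested"},
    "input_dsets_ready":    {"assigned", "reject_abort_requested"},
    "assigned":             {"announced", "assistance", "reject_abort_requested"},
}

def can_move(old_status, new_status):
    """Return True if a batch in old_status may move to new_status."""
    return new_status in BATCH_TRANSITIONS.get(old_status, set())
```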
The top-level script that runs every hour
- The script run_all_or_scripts.sh is run every hour by a cronjob on cmsrelval003.cern.ch
- This script runs these six scripts
- batch_killor.py, kills batches (i.e. aborts or rejects all workflows) that are waiting to be killed
- batch_clonor.py, clones batches that are waiting to be cloned
- kerberos_voms_renewor.py, renews the kerberos ticket and the voms proxy for cmsrelval003.cern.ch
- input_dset_checkor.py, described below
- assignor.py, described below
- announcor.py, described below
- All of these scripts, and some other miscellaneous ones, are stored in https://github.com/AndrewLevin/WmAgentScripts
input_dset_checkor.py
- This script runs over every workflow in a batch and checks if the input datasets and the pileup datasets are at the site where the workflow will run
- If the input datasets or pileup datasets are not at the site, a transfer request is made and the batch is moved to the status waiting_for_transfer
- If the input datasets and pileup datasets are at the site, the batch is moved to the status input_dsets_ready
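The decision above can be sketched like this; the function and argument names are assumptions for illustration, not the real interface of input_dset_checkor.py:

```python
def decide_batch_status(needed_datasets, datasets_at_site, request_transfer):
    """Sketch of the input_dset_checkor.py decision described above.

    needed_datasets:  input + pileup datasets the batch requires
    datasets_at_site: datasets already present at the target site
    request_transfer: callback used to request a transfer of missing datasets
    """
    missing = [d for d in needed_datasets if d not in datasets_at_site]
    if missing:
        # Request a transfer of the missing datasets to the site ...
        request_transfer(missing)
        # ... and park the batch until the transfer completes
        return "waiting_for_transfer"
    return "input_dsets_ready"
```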
assignor.py
- This script moves the status of each workflow in the batch to assigned
- It also checks whether the output datasets already exist, and if they do it sends an e-mail alert
- A large number of parameters of the workflow are set when the workflow is assigned e.g. the RSS limit
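The output-dataset safety check can be sketched as below; the names are illustrative assumptions, not assignor.py's own:

```python
def alert_on_existing_outputs(planned_outputs, existing_datasets, send_alert):
    """Sketch of the assignor.py check described above: if any output
    dataset a workflow would produce already exists, send an alert."""
    clashes = sorted(set(planned_outputs) & set(existing_datasets))
    if clashes:
        # In the real script this is an os.system(... mail ...) call
        send_alert(clashes)
    return clashes
```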
announcor.py
- For each batch that is in the assigned status and all of whose workflows are in the completed status, this script does the following things:
- It collects the names of the output datasets and makes several transfer requests for them
- It checks the job failure information that is in wmstats and creates a condensed report about it
- It sets the status of the output datasets to VALID
- It creates a text file containing the list of output datasets and the number of events in each of them, and puts this text file in a publicly accessible web area
- It sends an e-mail to the hn-cms-relval hypernews announcing that the batch is finished and giving information about the job failures and the output datasets
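The condensed failure report can be sketched as follows; the input format, a list of (workflow, exit_code) pairs, is an assumed shape for the wmstats information, not announcor.py's real data structure:

```python
from collections import Counter

def condensed_failure_report(failed_jobs):
    """Sketch of the condensed error report announcor.py derives from
    wmstats job-failure information: count failed jobs per exit code."""
    counts = Counter(code for _workflow, code in failed_jobs)
    return "\n".join("exit code %s: %d failed jobs" % (code, n)
                     for code, n in sorted(counts.items()))
```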
Management of the relval eos space at CERN
Monitoring webpage
- The main and only monitoring webpage for batches of relval workflows is https://cms-project-relval.web.cern.ch/cms-project-relval/relval_monitor_most_recent_50_batches.txt
- The link to this monitoring webpage is given to everyone who submits a batch, so many PPD people use it, and it is important to make sure it is accurate
- It is updated every ten minutes and shows information about the 50 most recent batches
- It is created by a cronjob running on cmsrelval003.cern.ch, which runs the script /home/relval/WmAgentScripts/RelVal/create_monitoring_page.py every ten minutes
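The page is a plain text file, so its generation can be sketched as below; the exact columns and layout of the real page are assumptions, not the output of create_monitoring_page.py:

```python
def render_monitoring_page(batches, max_batches=50):
    """Sketch of the monitoring text page: one line per batch with its
    id and status, limited to the most recent max_batches entries."""
    lines = ["%-20s %s" % ("batch", "status")]
    for batch in batches[:max_batches]:
        lines.append("%-20s %s" % (batch["batch_id"], batch["status"]))
    return "\n".join(lines)
```

A cronjob would write this string to the publicly served file every ten minutes.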
E-mail alerts
- There are a large number of e-mail alerts hard-coded into the scripts. For example, in assignor.py there is:
if len(schema) != 1:
    os.system('echo '+wf[0]+' | mail -s \"assignor.py error 9\" andrew.m.levin@vanderbilt.edu')
    sys.exit(1)
- Some of these e-mails indicate temporary problems that can be ignored if they do not repeat, but some of them indicate real problems that require action from people. For example, if you see an e-mail like this:
Subject: assignor.py error 1
Body:
2016_05_19_3_0
bsutar_PR_reference_RelVal_273410_160519_125653_7959
2016_05_19_2_0
bsutar_PR_reference_RelVal_273410_160519_122839_6314
/DoubleEG/CMSSW_8_0_8-2016_05_19_PRreference_80X_dataRun2_Prompt_v8-v1/*
it means that the workflow bsutar_PR_reference_RelVal_273410_160519_125653_7959 from batch 2016_05_19_3_0 would write into the same datasets as the workflow bsutar_PR_reference_RelVal_273410_160519_122839_6314 from the already assigned batch 2016_05_19_2_0. In this case, you should kill the batch and send an e-mail to the requester reporting this. When there is a cmsweb upgrade, you will get a lot of e-mail alerts which can be ignored.
Batches needing assistance
- Normally batches are assigned and announced without any manual work
- However, sometimes there are problems with the batch that require human intervention
- For example, there are some errors that are a problem on our (i.e. the computing operations) side. This includes file system problems
- For each batch, the announcor.py script calls the script assistance_decision.py to decide if the batch should be announced or if it should be looked at by a person
- The assistance_decision.py script checks the rate of various error codes and compares with the corresponding thresholds
- If the rate of one of the error codes is too high, the batch is moved to status "assistance", and an e-mail is sent to the hard-coded e-mail address
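The threshold comparison can be sketched as below; the error codes, rates, and thresholds shown are made-up illustrations, not the real values used by assistance_decision.py:

```python
def codes_needing_assistance(error_rates, thresholds):
    """Sketch of the assistance_decision.py check described above: return
    the error codes whose failure rate exceeds the corresponding
    threshold.  An empty result means the batch can be announced."""
    return sorted(code for code, rate in error_rates.items()
                  if rate > thresholds.get(code, float("inf")))
```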
Replication of pileup datasets
- Pileup datasets contain a small number of files that are accessed very heavily by some workflows
- Sometimes this can overload a site's file system
- If there is a high rate of file access failures for the pileup dataset, you need to ask the site to replicate that dataset in a ggus ticket like this: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121337
Site where the workflows run
- The site where the workflows run is set in the server that runs on cmsrelval002.cern.ch
- The site is hard-coded to be T1_US_FNAL for all batches except HI batches (see below)
- T1_US_FNAL was chosen because it is a large, reliable site
Heavy ion workflows
- Some batches are classified as HI by the requester
- These batches cannot be run at T1_US_FNAL due to political reasons
- They can be run at any other site, and normally they are run at T1_FR_CCIN2P3 or T1_DE_KIT
Special command-line operations
- Killing a batch
- Causes all of the workflows in the batch to be aborted or rejected
- Does not clean up the output datasets that the workflows produce
- Does not do anything immediately, just changes the status of the batch in the database to "reject_abort_requested"
[relval@cmsrelval003 RelVal]$ python2.6 kill_batch.py 2016_05_20_7_0
- Cloning a batch
- Allows you to specify the site to which the cloned batch will be assigned and the processing version of the new batch
- Does not do anything immediately, just inserts information into the clone_reinsert_requests table in the mysql database
[relval@cmsrelval003 RelVal]$ python2.6 clone_batch.py 2016_05_09_3_0 T1_US_FNAL 2
- Assigning an individual workflow
- The arguments are workflow_name site processing_version [processing_string]
[relval@cmsrelval003 RelVal]$ python2.6 assignment.py anlevin_RVCMSSW_8_1_0_pre5GGToHgg_NLO_Pow_13TeV_py8__gen_160525_215350_6206 T1_US_FNAL 2
[relval@cmsrelval003 RelVal]$ python2.6 assignment.py anlevin_RVCMSSW_8_1_0_pre5GGToHgg_NLO_Pow_13TeV_py8__gen_160525_215350_6206 T1_US_FNAL 2 TEST
- Cloning an individual workflow
- The second argument just affects the name of the cloned workflow
- The third argument should probably always be DATAOPS -- it is the group attached to the workflow
[relval@cmsrelval003 RelVal]$ python2.6 resubmit.py mewu_RVCMSSW_8_1_0_pre5GGToHgg_NLO_Pow_13TeV_py8__gen_160520_103003_9441 anlevin DATAOPS
Special command-line operations on the database
- Delete a workflow from a batch
mysql> delete from workflows where workflow_name="mewu_RVCMSSW_8_1_0_pre5DYTollJets_LO_Mad_13TeV_py8__gen_160520_102407_7629";
- Insert a workflow into a batch
mysql> insert into workflows set useridmonth="12", useridyear="2015", useridday="04", useridnum=1,batch_version_num=0, workflow_name="mewu_RVCMSSW_7_6_1RunJetHT2015D__rerecoGT_RelVal_jetHT2015D_151204_195434_876";
- Change the status of a batch
mysql> update batches set status="assigned" where useridmonth="05" and useridyear="2016" and useridday="08" and useridnum=1 and batch_version_num=0;
AndrewLevin - 2016-06-04