Release Validation operations on WMAgent

General useful links

WMAgent useful links

Available Resources

Jobs are submitted via glideins, going through the CERN Frontend/Factory.
  • cmssrv113.fnal.gov:
    • WMAgent version v0.9.33 ("relval" team, dedicated to RelVals only)
  • vocms174.cern.ch:
    • WMAgent version v0.9.33 ("relvallsf" team, dedicated to CERN LSF submission only)

WMAgent overview

The WMAgent infrastructure can be split into 4 components:
  1. RequestManager (ReqMgr): it receives and manages the requests. When you create a request, you have to inject it into the ReqMgr. It's a central component deployed at cmsweb.cern.ch (or cmsweb-testbed.cern.ch).
  2. Global WorkQueue: when you assign a request to a given team/site(s), it is handled by the Global WorkQueue, which splits the request into blocks of work. Each block of work corresponds to a set of jobs that will be created by the WMAgent. It's a central component deployed at cmsweb.cern.ch (or cmsweb-testbed.cern.ch).
  3. WMAgent: the agent itself. The agent pulls blocks of work from the Global WorkQueue (when there are slots available at the whitelisted sites); once a block is in the Local WorkQueue, the agent handles it (mysql/couchdb records, creation and submission of jobs, DBS insertion, PhEDEx injection, etc.) and submits the jobs to the whitelisted sites.
  4. WMStats: the monitoring tool used to monitor the workflows/requests. It's a central component deployed at cmsweb.cern.ch (or cmsweb-testbed.cern.ch).

The requests can be in one of several statuses: "new", "assignment-approved", "assigned", "acquired", "running", "completed", "closed-out", "announced", "failed", "aborted", "rejected", etc. The main ones are listed below (a sketch for finding the condor jobs of a running request follows the list):

  • "new": this is the status of ACDC requests, which are just created in the ReqMgr
  • "assignment-approved": usually we get requests from PdmV/PPD in this status, which means that the request was created and it's approved for assignment. runTheMatrix.py creates and auto-approve the requests.
  • "assigned": it means your request was assigned and is being handled by the ReqMgr and Global WQ :-). In our normal RelVal environment, the requests stay on this status for less than 30min.
  • "acquired": if everything went fine during the assignment phase, then the Global WQ acquire the blocks of work. Once the work is available in the Global WQ, the Local WQ will try to pull these elements.
  • "running": after acquired by the Local WQ, jobs are created and submitted to the condor local queue - at this point the workflow status move to "running". You are already able of condor_q these jobs.
  • "completed": this status means that all the jobs related to a given request have completed (successfully or not). At this point you have the workloadSummary link in WMStats (giving you a summary of the request, hosted in the cmsweb couchdb), with the name of the output datasets, number of events, performance plots, error report about failures jobs and so on.
  • "closed-out": intermediate status between completed and announced. When the request is in closed-out it means that the request reached 100% of statistics with no failures. From this point on you should make the PhEDEx transfer requests and set the datasets produced to VALID in DBS.
  • "announced": when all the workflows are completed, phedex transfer already made, VALID in DBS, harvesting done and announced in the Hypernews.

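For instance, once a request is "running" you can look its jobs up in the local condor queue. A minimal sketch, assuming the agent tags its jobs with a WMAgent_RequestName classad (verify the actual classad names on your agent with condor_q -l):

  req=<requestName>       # the full request name as shown in WMStats
  condor_q -constraint "WMAgent_RequestName == \"$req\""
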
For further information, please refer to: WMAgentReference

Types of RelVal workflows and dataset naming convention

RelVal workflows are always created using the TaskChain type of request. This is the most flexible type of request, allowing N tasks/steps to be chained, where each one uses the output of the previous one. Every step in the request corresponds to a different task, which means that if a job fails, only the output of that task is lost instead of the output of the whole chain. The flexibility that TaskChain requests provide is very useful to handle all the different types of workflows existing in the RelVal sets. We can categorize RelVal workflows into 5 different types:
  1. data workflows: these workflows usually have 1 or 2 steps, take real data as input and reprocess it
  2. MC FastSim workflows: workflows with a single step and no input dataset
  3. MC FullSim from scratch: workflows with several steps and no input dataset; they start running at GEN and SIM level.
  4. MC FullSim by recycling: workflows with several steps that recycle a GEN-SIM input dataset from a previous release; they do not produce GEN-SIM samples.
  5. MC PileUp workflows: these pileup workflows may or may not start from scratch, and they take an MC PileUp dataset as input to the DIGI step.

Since there are different RelVal workflows and different ways of naming the output datasets - according to the categories described above - we have to be very careful at assignment time and make sure to set/recycle the correct information. The first mandatory rule for RelVal datasets is that they must contain the word "RelVal" in their name. The dataset naming convention is the following (worked examples are given after the field descriptions below):

  1. data datasets:
    /<PD>/<release>-<GT>_RelVal_<label>-<ProcessingVersion>/<Datatier>
  2. MC datasets:
    /<PD>/<release>-<GT>-<ProcessingVersion>/<Datatier>
  3. FS datasets:
    /<PD>/<release>-<GT>_FastSim-<ProcessingVersion>/<Datatier>

Where:

  • PD: Primary dataset. It comes from the input dataset. If there is no input dataset, the PrimaryDataset will be set according to what is in the request dictionary. In short, you don't need to worry about it.
  • release: automatically picked up from the request dictionary
  • GT: usually a given release has a few different Global Tags: one or two for real data workflows, one for MC/FastSim and one for HeavyIon workflows.
  • label: used to distinguish the different eras for real data (e.g. Run2012A, Run2012D, etc.)
  • ProcessingVersion: the version of the workflow (v1, v2, v3, etc.)
  • Datatier: you don't need to set it; it's set automatically by the agent, according to the python configuration files.
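
Putting the conventions together, some illustrative examples (the primary datasets and datatiers below are hypothetical; the GTs and labels are the ones used as examples elsewhere in this twiki):

  1. data: /MinimumBias/CMSSW_6_1_0-GR_R_61_V6_RelVal_mb2010B-v1/RECO
  2. MC: /RelValMinBias/CMSSW_6_2_0_pre3-START61_V8-v1/GEN-SIM-RECO
  3. FS: /RelValMinBias/CMSSW_6_2_0_pre3-START61_V8_FastSim-v1/GEN-SIM-DIGI-RECO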

In terms of assigning a workflow, the parameters to properly set the output dataset names are:

  1. data datasets:
    /<PD>/<AcquisitionEra>-<ProcessingString>-<ProcessingVersion>/<Datatier>
  2. MC datasets:
    /<PD>/<AcquisitionEra>-<ProcessingString>-<ProcessingVersion>/<Datatier>
  3. FS datasets:
    /<PD>/<AcquisitionEra>-<ProcessingString>-<ProcessingVersion>/<Datatier>

Where:

  • AcquisitionEra: corresponds to the CMSSW version (e.g. CMSSW_6_2_0_pre3)
  • ProcessingString: it depends on the category of the workflow, as follows (see the sketch after this list):
    • MC workflows: it corresponds to the Global Tag only (e.g. START61_V8)
    • FS workflows: it corresponds to the Global Tag plus the word "FastSim" (e.g. START61_V8_FastSim)
    • Data workflows: it corresponds to the Global Tag plus the word "RelVal" plus the label (e.g. GR_R_61_V6_RelVal_mb2010B)
    • PU workflows: it corresponds to the string "PU_" plus the Global Tag (e.g. PU_START61_V8). If it's a FastSim workflow, then of course we also have to append "FastSim" to the ProcessingString.
  • ProcessingVersion: an integer number only (e.g. 1, 2, 3, etc.)
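
As a concrete illustration of the rules above, a minimal shell sketch building the ProcessingString for a pileup FastSim workflow (the GT value is just the example used above):

  GT=START61_V8
  ProcStr="PU_${GT}_FastSim"    # "PU_" prefix for pileup, "_FastSim" suffix for FastSim workflows
  echo $ProcStr                 # -> PU_START61_V8_FastSim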

runTheMatrix.py command and its parameters

This command can be used for several purposes: creating the requests/configurations, injecting these requests into ReqMgr, and even creating the good old cmsDriver_standard_hlt.txt file. It ships with the release in the PyRel (Configuration/PyReleaseValidation) package. You can start by taking a look at its help:
runTheMatrix.py --help 

Example:

runTheMatrix.py -l 5.2,8.0,33.0 -i all --wmcontrol init --noCafVeto 
The parameters and arguments are:
  • "-l ": use "-l" option when there are specific number of workflows to be created/injected. The list of workflows must be comma separated.
  • "-i all": this parameter says that you want to create workflows by recycling the default GEN-SIM samples configured in the PyRel package. Do not use it if you are producing MC samples from scratch (producing also the GEN-SIM samples).
  • "--wmcontrol init": it says to wmcontrol that the workflows should be only created but not injected into ReqMgr. Keep in mind that you MUST create the workflows/configurations/requests before inject them into ReqMgr.
  • "--wmcontrol force": once you have locally created the configurations and workflows, you can run this command this command to upload the configurations to couchDB and injected the workflows into ReqMgr.
  • "--noCafVeto": it's an old parameter which says to create/inject workflows even if the input dataset is at CAF (Castor). It should be deprecated... Anyway, always use this parameter when injecting RelVal workflows.

Another very useful example, in case you want to create all the standard workflows/configurations in a single command (with GEN-SIM recycling in place):

runTheMatrix.py -i all --what standard --wmcontrol init --noCafVeto | tee creating.log

then, if everything succeeded, you can inject all those locally tested workflows; type:

runTheMatrix.py -i all --what standard --wmcontrol force --noCafVeto | tee injection.log

You can also create a text file containing all the workflows and their cmsDriver.py commands using the following (almost deprecated) command:

runTheMatrix.py --show --raw standard

The last example shows how to print a summary of the workflows and their step names, type:

runTheMatrix.py --what pileup --show

Step-by-step: preparing the environment and creating/injecting data requests

This section addresses the creation of the CMSSW area, the changes required in MatrixInjector.py, and the creation and injection of the requests/workflows. After that, you have to follow the instructions to assign those requests.
  1. Log in to vocms174 with the relval user;
  2. Create the software area and keep record of the PyRel tag
    release=CMSSW_X_Y_Z         # change it
    cd /data/relval/CMSSW/
    export SCRAM_ARCH=slc5_amd64_gcc472         # make sure it's the correct production ScramArch
    scramv1 p CMSSW $release
    cd $release/src/
    cmsenv
    addpkg Configuration/PyReleaseValidation
  3. Take note of the PyRel tag that is set. Now you have to change a few parameters inside MatrixInjector.py: set your username in "self.user" and your group (DATAOPS) in "self.group" in the following file:
    vim Configuration/PyReleaseValidation/python/MatrixInjector.py 
  4. Compiling the PyRel package and creating a working dir:
    scramv1 b
    mkdir -p standardWMA/v1
    cd standardWMA/v1 
  5. Setting the cms, crab and wmclient environment:
    source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
    source /afs/cern.ch/cms/ccs/wm/scripts/Crab/crab.sh
    source /data/relval/WMClient/v01/etc/wmclient.sh
    export PATH=/data/amaltaro/WMcontrol/::${PATH}
    export PYTHONPATH=/data/amaltaro/WMcontrol/:${PYTHONPATH}
    export WMAGENT_REQMGR=cmsweb.cern.ch
  6. Now it's time to create the configurations and prepare the workflows. To create all the standard workflows in one shot, recycling GEN-SIM samples, run the following command (and go have a coffee :-D, it takes almost 30 min):
    runTheMatrix.py -i all --what standard --wmcontrol init --noCafVeto | tee creating.log 
  7. Take a quick look at the creating.log in order to check that the requests are ok (does it have the correct cmsweb? is it recycling the correct samples? does it have the correct job splitting?)
  8. If everything is fine and there were no failures while locally creating the workflows, then we can proceed to inject them into the RequestManager. The following command will upload the configuration files to couchDB and inject the requests into ReqMgr. Note that you must have created all the workflows before the injection; if there are failures, this process will be aborted. Type and go get another coffee (it takes ~15 min):
    runTheMatrix.py -i all --what standard --wmcontrol force --noCafVeto | tee injection.log
  9. If everything went fine, you should see several dictionaries and messages like "Injected workflow" and "Approved workflow". Now it's time to get the list of injected requests (also approved by the injection script); type the following (a sanity-check sketch based on these logs is given after this list):
    cat injection.log | grep "Approved workflow:" | awk '{print $3}' | tee injectedWFs.log 
  10. You can cross-check that these requests were submitted by going to WMStats and looking for them (you can use the Campaign filter).
  11. You still have to do one thing before the assignment. Go to WMStats and change the job splitting of the following workflows:
    1. RunEl2012B, RunEl2012C, RunMu2012C, RunMu2012B, RunEl2012D, RunMu2012D
    2. Set the job splitting for the HLTD step to 4 lumis per job and press "Update Parameters" (wait for the Update message in red).
    3. Set the job splitting for the RECODreHLT step to 1 lumi per job and press "Update Parameters" (wait for the Update message in red).
  12. The job splitting for all the other workflows should already be OK, giving ~2 or ~3 h jobs (except for the data workflows...).
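
As mentioned in step 9, you can quickly cross-check the logs before moving to the assignment. A minimal sketch based on the messages quoted above:

  grep -c "Injected workflow" injection.log     # how many workflows were injected into ReqMgr
  grep -c "Approved workflow" injection.log     # how many workflows were auto-approved
  wc -l injectedWFs.log                         # should match the approved count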

Assigning RelVal workflows (NEW WAY)

First of all, you should use this assignment script only for new releases, meaning >= CMSSW_6_1_0, >= CMSSW_6_2_0_pre1, >= CMSSW_5_3_8. This script takes the AcquisitionEra, ProcessingString and ProcessingVersion from the injected dictionary and uses them for the assignment, which means we rely on the information injected into ReqMgr. The script is stored on vocms174 and is used to assign RelVal workflows. With this script you can assign data, MC and FS requests ALL together; everything is handled by the script. For more information, please take a look HERE.

Here is the basic usage of this script; you just run it against all the standard workflows (remember that you can always use "--test" to just check what would be done without any real assignment; a loop over all the injected workflows is sketched below). But before that you need to source the grid environment variables:

source /afs/cern.ch/project/gd/LCG-share/current_3.2/etc/profile.d/grid-env.sh
python /data/amaltaro/WmAgentScripts/RelVal_Official/newAssignRelValWorkflow.py -w $req --test
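
To assign all the injected workflows in one go, a minimal loop sketch reading the injectedWFs.log list produced earlier (drop "--test" for the real assignment):

  for req in $(cat injectedWFs.log); do
      python /data/amaltaro/WmAgentScripts/RelVal_Official/newAssignRelValWorkflow.py -w $req --test
  done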

In case you want to assign PileUp workflows, you could use the following command line:

python /data/amaltaro/WmAgentScripts/RelVal_Official/newAssignRelValWorkflow.py -w $req --test --pu

Assigning RelVal workflows (OLD WAY)

First of all, you will use this assignment script only for older releases, meaning < CMSSW_6_1_0 and < CMSSW_5_3_9. The main difference from the new script is that this one builds the AcquisitionEra based on the "type" of workflow (data, MC, FastSim), which means you have to split the assignment into three parts. The script is stored on vocms174 and is used to assign RelVal workflows. With this script you must assign the different categories of workflows using different parameters for data, MC, FS and special requests; almost everything is handled via parameters. For more information, please take a look HERE.

Here are the basic cases you will use to assign standard workflows, according to their category (remember that you can always use "--test" to just check what would be done without any real assignment; a per-category loop is sketched after the commands below). But before that you have to renew the kerberos ticket and source the grid environment variables:

kinit
source /afs/cern.ch/project/gd/LCG-share/current_3.2/etc/profile.d/grid-env.sh
python /data/amaltaro/WmAgentScripts/RelVal_Official/assignRelValWorkflow.py -w <workflowName> -p 1 --data
python /data/amaltaro/WmAgentScripts/RelVal_Official/assignRelValWorkflow.py -w <workflowName> -p 1 --mc
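
If you keep the workflow names split by category in separate text files (the file names below are hypothetical), the split assignment can be sketched as:

  for req in $(cat dataWFs.txt); do
      python /data/amaltaro/WmAgentScripts/RelVal_Official/assignRelValWorkflow.py -w $req -p 1 --data
  done
  for req in $(cat mcWFs.txt); do
      python /data/amaltaro/WmAgentScripts/RelVal_Official/assignRelValWorkflow.py -w $req -p 1 --mc
  done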

Finalizing, closing-out and announcing the samples/requests

This section describes the procedures to follow when all workflows are already running or completed. The first thing to do is to look at the requests in the GlobalMonitor and check whether there are failed jobs (in the "failure" column) or jobs that are failing but waiting for a resubmission (in the "cool off" column). If there are, then you have to debug the problem, collect all the information/logs you can, and provide them in the announcement.

Usually we run around 97 workflows for a full set of standard RelVals (105 from 620pre3 on, excluding the 2 RunHI2010 ones that must be rejected), so once more than 99% of the workflows are completed, we can start to finalize/close out those requests following this procedure.

  1. Get a list with all the workflow names for the standard/pileup set of RelVals and save it in a text file, then run the closeOutTaskChainWorkflows.py script (see more details HERE), giving the text file as an argument (it may take an hour).
  2. When this script finishes, you need to check every workflow that was not closed out by going to WMStats and checking whether:
    1. the workflow is in completed status
    2. there are production/processing/merge failures; if so, you need to get logs to provide in the announcement. Remember to look first for the logArchive; in case it doesn't exist, you can look for the condor logs in the agent.
    3. IF the workflow is completed and there are no failures, you can manually move it to closed-out status
  3. After that, you can get a list of the produced dataset names and their statistics (number of events) using the getRelValDsetNames.py script, as described HERE.
  4. Once you have the text file containing dataset names + statistics, you can print the dataset names (for those actually produced, i.e. with stats different from "-1") and make a PhEDEx subscription to T1_US_FNAL (custodial replication, under the RelVal group) and to T2_CH_CERN (normal replication, under the RelVal group). Example:
    cat 600p10_datasets.txt | grep -v '\-1' | awk '{print $1}'

  5. If more than 99% of the datasets were produced, then we can start preparing the announcement according to the following steps:
    i. Get the list of datasets/events produced, as aforementioned, and create the statistics table as described HERE. Example:
      cat 600p10_datasets.txt | awk '{printf("|%-100s | %7s |\n",$1,$2)}' | tee ~/webpage/relval_stats/CMSSW_6_0_0_pre10_standard_v1.txt
    ii. Make sure that all datasets to be announced were already subscribed to FNAL and CERN.
    iii. Set the produced datasets to VALID status in DBS, as described HERE. Example:
      python setDatasetStatus.py 620pre3_standard_dsets.txt VALID --url=https://cmsdbsprod.cern.ch:8443/cms_dbs_prod_global_writer/servlet/DBSServlet
  6. Go to the RelVal Samples and Release Testing HN and create the announcement, like THIS one. Please don't forget i) to mention the requests that are still running, IF there are any, and ii) to change the PyReleaseValidation tag.
  7. Now you can also move the requests that are in "closed-out" to "announced" status, using the setrequeststatus.py script, as described HERE.

Preparing and injecting DQM harvesting manually

The automatic DQM harvesting is usually enabled in the RelVal workflows (when the variable "EnableDQMHarvest" is set to 1). When it's not enabled in the workflows, we need to:
  1. Transfer the DQM samples to CERN (T2_CH_CERN)
  2. Manually create the configuration files
  3. Submit the jobs to the LSF batch system
In order to create these configuration files and the harvesting job to be submitted to the LSF system, we have to use the "buildHarvesting.sh" script, which reads the statistics table text file and creates a DQM.sh file with all the dataset/run/scenario combinations in it, as described HERE. A usage sketch follows below.
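
A minimal sketch, assuming buildHarvesting.sh takes the statistics table as its argument and that the LSF queue name below is still valid (both are assumptions; check the documentation linked above):

  sh buildHarvesting.sh ~/webpage/relval_stats/CMSSW_6_0_0_pre10_standard_v1.txt    # creates DQM.sh
  bsub -q cmscaf1nd DQM.sh                                                          # submit the harvesting job to LSF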

Translating/Injecting step0 workflows into WMA

  1. Connect to vocms174 with the relval account
  2. Create an environment variable based on the request and batch number, e.g.:
    label=R2535_B149
  3. Create a software area, for example:
    release=CMSSW_5_2_6
    cd /data/amaltaro/
    export SCRAM_ARCH=slc5_amd64_gcc462         # watch out for the correct ScramArch
    scramv1 p CMSSW $release
    cd $release/src/
    cmsenv
  4. Create the $label directory and load the following environments:
    mkdir $label; cd $label
    source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
  5. Get the tarball from the official request, for example:
    wget --no-check-certificate https://cms-project-generators.web.cern.ch/cms-project-generators/ProductionConfigurations/Summer12_FS53-20130506021410.tgz
  6. Untar it, for example:
    tar xvf Summer12_FS53-20130506021410.tgz
  7. Finally run the following script in order to create the requests (you need to provide the relative path to the summary.txt), e.g.:
    python /data/amaltaro/scriptStep0/translateInjectStep0.py $PWD"/Summer11-20130513152712_0/summary.txt" | tee creating.log
  8. After it finishes, you can get the list of injected workflows with:
    grep Created creating.log | awk '{print $3}'

Troubleshooting

Request not acquired by the Local WorkQueue

If you find requests that have been assigned for more than ~2 h and were moved to "Acquired" by the Global WorkQueue but not by the Local one, then you may have to check the following:
  1. Make sure that the services are properly running (i.e. the local CouchDB and MySQL). You can check whether couch is up and running by tailing its log (whose timestamps are in UTC):
    tail install/couchdb/logs/couch.log 
    and you can try to connect to MySQL to check that it's up and responsive (see the note on $manage after this list):
    $manage mysql-prompt wmagent 
  2. Check that the WorkQueueManager component is working, as well as the other components, by doing:
    $manage status 
    or even better, looking directly into their logs, for instance:
    tail -f install/wmagent/WorkQueueManager/ComponentLog 
  3. Make sure that the input dataset/block in the request is present at the site where you are trying to run it.
  4. Make sure that you have assigned the request to the correct teamName
  5. Finally, make sure that you have set thresholds for the targeted site and that they are not 0.
  6. If everything above was checked and is fine, and the request is still not acquired by the agent, then contact Samir or Diego B.
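
A note on the $manage command used above: it refers to the agent's management script. A minimal sketch of how it is typically resolved (the exact path depends on your deployment and is an assumption here):

  # conventional location in a WMAgent deployment; adjust to your installation
  manage=/data/srv/wmagent/current/config/wmagent/manage
  $manage status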

-- AlanMalta - 03-Aug-2012
