WMAgent operations
Introduction
This page aims to be a comprehensive guide to operations with the new WMAgent production tool.
Overview of the Request Manager and the WMAgent
The WMAgent architecture consists of a MySQL database for state tracking, a CouchDB instance for history recording, and a collection of component daemons that perform atomic tasks related to automated processing. It operates as a front end to the grid and farm submission systems used to access CMS resources, managing job-level information and working together with the Request Manager (ReqMgr).
Operators are organized in Teams, and each Team generally drives more than one Agent instance. The ReqMgr takes the workflows from the Requestor/Physics and assigns them to the respective Agent instances.
Once the requestor makes the production request, the workflow goes through some initial statuses (new, approved) until it gets ASSIGNED; then the Request Manager handles it (ACQUIRED).
The workflow status becomes RUNNING when the jobs created and queued in the Agent go to the grid. When the workflow is finished it becomes COMPLETED and, if it ran without problems, it can be switched by hand to CLOSED-OUT.
| Status | Description | Responsibility |
| new | workflow was created by the requestor | Physics |
| approved | physics management approved of this request | Unified Officer |
| assigned | L3 assigns the workflow to the sites or site groups | Unified Officer |
| acquired | coordinator assigned the request to a site and defined the processed dataset name, era and splitting | OS |
| running-open and running-closed | workflow is being processed | OS |
| completed | workflow processing is complete | Workflow Traffic Controller |
| closed-out | workflow post mortem has been checked | L3's/Unified |
| announced | output datasets have been made VALID and announced by L3 | L3's/Unified |

Other "Final States"

| Status | Description |
| failed | Agent is unable to run the workflow |
| rejected/rejected-archived | a member of the workflow team has manually rejected this workflow from a new, assignment-approved, approved, completed or closed-out state |
| aborted/aborted-archived | a member of the workflow team has manually aborted a workflow in the acquired or running states |
| force-complete | a member of the workflow team has determined that we have enough statistics for the workflow to do what we need, so we force the workflow to the completed status so it can be closed out |
It's up to the operator to look after workflows in the ASSIGNED, ACQUIRED, RUNNING-OPEN and RUNNING-CLOSED statuses, and to check for failures, errors and any non-regular situation, so let's focus on those statuses. Here is a more detailed explanation of the states:
https://github.com/dmwm/WMCore/wiki/Request-Status
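As a compact restatement of the status table above (nothing new, just the same states in order), the normal lifecycle and the statuses the operator actively watches could be written like this:

# Restating the status table above: the normal request lifecycle and the
# subset of statuses the operator actively watches.
NORMAL_SEQUENCE = [
    "new", "approved", "assigned", "acquired",
    "running-open", "running-closed", "completed",
    "closed-out", "announced",
]
OTHER_FINAL_STATES = [
    "failed", "rejected", "rejected-archived",
    "aborted", "aborted-archived", "force-complete",
]
# The operator looks after these for failures, errors and anything irregular:
OPERATOR_WATCHLIST = {"assigned", "acquired", "running-open", "running-closed"}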
Dataflow through the agent
New - Physics
- Fresh work into the system
- Physics/Dima
- They decide if/when we want to run the data
- They decide the priority of the data
- Highest priority in the system is RelVal
- Next priority is always Data/ReReco
- Next is ReDigi
- MonteCarlo
- Testing Data/backfill
assignment-approved - Unified Officer
- Handoff to Unified Officer
- Is there input data?
- where can we run this data
- different types of data are run at different sites
- ReReco - now multicore
- runs only at T1’s + Desy, Purdue, Nebraska
- Heavy Ion data can only run at ???
- put data into place
- We need to make a list of “trusted T2’s”
- need to work with the site support team to solidify this list
- operations have one list, Unified has a list, Site Support has a list; it would be best if we could somehow put this information in SSB so that we have one place to pull it from. Operations and site support have to work together in maintaining this. The pulling of sites in and out of drain and the waiting room/morgue is killing production. How long do we usually have advance notice of downtimes?
- instead of being put into drain, maybe drop them down to an "untrusted site" list; this would allow the work that is running to finish. Things get stuck if sites are in drain/down; we could put a site on this list to clear things out and run short/fast workflows
- Only short workflows that would run in 1-2 days are allowed on these sites, so if they drop out we don't lose too much time
- We have been having issues with workflows getting stuck, or having to be restarted, because they were running on sites that were put into drain while the workflow was running
- I propose T1’s + “good T2’s” run multistep workflows
- Are any workflows high priority? If so, move the data into place on disk. How do we control overflow and make sure a workflow doesn't overflow to a bad site?
assigned - Unified Officer
- Once all the data is in place and we have determined where it can run, Unified moves the workflow to assigned
- OS pays attention to this only to report if work sits in assigned for longer than 2 days
acquired - Unified Officer
- this is the procedure from before the restructuring; OS no longer watches acquired, with the claim that things no longer get stuck here. I think Ali is now doing this?
- Work will sit here if a site that the data needs to run at is in drain/down
- kill clone workflow and let Unified pick it back up
- Work will sit here if the site is busy
- are the whitelisted sites the only sites that the data can run at?
- if not talk to JR and see what we can do to reassign
- not sure what this procedure is
- work will get stuck in global queue here
- if it sits in GQ and the LQ doesn't pick it up in a day, and the above conditions have been met, kill and clone and let Unified pick it up
running-open - Workflow team
- a workflow will stay here until all work has been pulled into the system
- watch for large failure rates
running-closed - Workflow team
- watch for high failure rates
- work sometimes gets stuck here
- sites go into drain - this can still happen
- if this is the only place that a workflow can run, check with site support team how long will the site be down? should we wait it out or kill and clone?
- can JR auto detect this now???
The next two cases haven't happened in a long time now; can we consider them fixed? If they aren't:
- sometimes JobCreator gets stuck and needs to be “kicked”
- stuckRequests.py will tell us if this is the case
- for right now, if restarting the component doesn’t work we need to talk to SeangChan or Alan to fix it
- is this something we need to learn to do? Or are we better off making the developers look so they fix the problem?
- Gets stuck in PhEDExInjector
- check PhEDEX injector, do not restart indiscriminately because it will take a long time to catch up
- frequently best just to have SeangChan or Alan look into it
Force-completing - WTC
- if data is “close enough” or if physics tells us we have enough statistics we can force complete via the web interface
- this has a high cost for the agents, so if we are close and an agent is actively working it is best to leave it be
Complete - WTC
- this is where the WTC spends a lot of their time
- If a workflow meets all of our criteria it will close out on its own and we are done
assist-manual - usual issues
- robust merge
- WMStats is written so that if it is having a hard time reading a merge file it will just skip it; for the most part this is OK, as we only need 95% completion, but sometimes this brings us to something less than 95%
- A "blind" ACDC with fewer events will sometimes find enough missing events to get us through; otherwise we need to kill and clone
- file read issues
- is the site up/good?
- check das to make sure data is on disk
- make acdc - assign either via web or script
- changing splitting frequently helps
- running at bad site/draining site
- kill and clone let unified move the data and re-assign the workflow
- for larger workflows we need to come up with a better solution; we can't be throwing away weeks of processing time due to sites going up and down when we are busy
- timeouts
- shows as bad FWJR or wallclock timeout
- change splitting and acdc
- exceed maxRSS
- merge failures
- sometimes it’s just a file read issue, sometimes it’s a file that was put onto disk that we are now having problems reading
- can be ACDC'd; need to have fewer events per file
- sometimes merges can’t be salvaged and we have to kill and clone
- Mini-AOD & ReDigi
- small, generally work on 1-2 sites
- acdc should be set to the same whitelist as the parent
- Unified sometimes sets the list to something larger than the parent; this makes debugging more difficult for later steps
- is Unified worried about overflow? How do we handle this?
- probably could run on “untrusted sites” since this is a fast workflow and is only one step
- StepChain???
- MonteCarlo
- larger, has a wide-ranging whitelist
- can be multistepped
- the first step can run anywhere on the whitelist; later steps have to run where the parent step ran, so if the child steps have to run on sites that have become unstable we need to kill and clone
- Most of the issues are as outlined above and can be handled as above
- ACDCs can be assigned either via script or via the web interface
- TaskChains
- generally have a fairly wide whitelist
- mostly the same issues as above but run into more merge issues than other types of workflows
- special in that they are cobbled-together workflows; later steps have output on disk at particular sites and can only run at those sites
- if the sites go down our only choice is to kill and clone
- it would be good to find a way to recover when we are busy; some TaskChains get pretty big and we don't want to lose processing time
- Definitely needs to run at stable sites
- Because they are cobbled-together workflows, later steps inherit information from previous steps, and these workflows cannot be assigned via the web interface; you need to use the "AssignProdTaskChain.py" script.
- if you do not use the proper script you will end up with "no version" in some of your output steps, which will get mixed up with other steps, basically messing up the dataset to the point where you are best off starting over from scratch
- The way the parent workflows are being assigned, we need to copy the whitelist from the parent. Overflow has been known to start some steps on other sites, so sometimes ACDCs get stuck in acquired because the data is not where we are expecting it. This is especially messy with merge issues.
- Rereco
- This is data! We need 100% or we need to account for every lumi/event we are missing
- Same issues as above and first stabs at recovery can be through ACDC
- we are now running multicore for ReReco and have a select list of sites the data can run on, namely T1's, Desy, Purdue, Nebraska
- because we are running multicore we need to increase the amount of memory that "a job" can use, because condor will actually divide this number for us; if we don't, the jobs will fail out. This is a number we don't mess with for anything else (see the memory sketch at the end of this section, just before "Workflow status on the grid")
- Notes on other parameters we play with here
- When we have exhausted ACDC, we go for recovery instead
- recovery takes an output dataset and tracks it back to the input lumi/event and starts processing the data from that point.
- each recovery workflow acts like a “parent” and knows nothing about the other recovery workflows
- when we run recovery workflows we need to do them one at a time or we will end up with duplicates in the output dataset
- recovery workflows can be ACDC'd just like a parent workflow, but again, we can only run the ACDCs for one recovery at a time, not multiple
- when we have done everything we possibly can to recover a ReReco, if for example there is a data file that we absolutely cannot read, we need to note the lumi/event it is failing on and report it back; then we can put the workflow in the bypass file
- bypass
- if we have done everything we can to recover the workflow and are not yet up to statistics talk to L3’s or Dima and find out if it’s “good enough” and we can put the workflow into bypass
- if we have fatal errors that we can not recover from with our usual bag of tricks contact Dima about returning the workflow to Physics
- if the workflow can not be recovered we put it into “reject” and let physics redo it from scratch
- closed out
- Unified/L3’s
- now handled automatically to move to announced
- if we manually move workflows to closed-out from complete, Unified will not pick them up
- announced - this is where L3's/Unified turn the workflows back to physics
- announced-archived
- final state
- Other states
- abort/abort-archived
- if a workflow is in status assigned/acquired/running-open/running-closed and we decide we need to kill the workflow, we move it to abort
- the workflow should move to abort-archived on its own; sometimes they get "lost" in abort and SeangChan has to move them manually
- rejected/rejected archived
- if a workflow is in states new/assigned-approved, complete, or closed-out and we decide the data is no good, we reject it
- failed
- a workflow moves to status failed on its own in the agent if the agent cannot run the data. Sometimes it tells you why it failed, but not always; we need to check for failed workflows, try to figure out why they are failed, and try to get them going again.
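As a rough illustration of the multicore memory point from the ReReco notes above: the number we set is the total memory for the whole job, and condor divides it by the number of cores, so it has to be scaled up or the jobs fail out. A minimal sketch of the arithmetic, with made-up numbers (the real values are set per workflow at assignment time):

# Made-up numbers, only to illustrate the arithmetic described in the ReReco
# notes: for multicore we request the *total* memory for the job and condor
# divides it by the number of cores when matching.
def per_core_memory(total_memory_mb, ncores):
    """Memory per core that condor effectively ends up with."""
    return total_memory_mb / ncores

ncores = 8                     # hypothetical multicore job
needed_per_core_mb = 2300      # hypothetical per-core requirement
total_request_mb = ncores * needed_per_core_mb   # 18400 MB requested in total

assert per_core_memory(total_request_mb, ncores) >= needed_per_core_mb
print("request %d MB total -> %.0f MB per core"
      % (total_request_mb, per_core_memory(total_request_mb, ncores)))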
Workflow status on the grid
When the workflow gets ACQUIRED, the Agent creates an internal queue (called local queue) and pulls work from global queue in small chunks (blocks, etc.) creating jobs until thresholds are fulfilled. Jobs prepared in the global queue assume the initial status "queued", those in the local queue are "pending". In this step, in some cases (LHE workflows) the Agent interacts with PhEDEx in order to get information on the input dataset blocks (GEN datasets).
After this first step, the workflow gets RUNNING: the Agent starts interacting with the Factory to send pilots to the sites. When a job is matched with a site for submission, its status is switched to "pending", and afterwards it's switched to "running" when it starts running on the worker node. So basically a job is "running" when it's really running on the site (i.e. on the WN), "queued" when it's ready to go to the Agent, and "pending" in any intermediate case.
Jobs that run without problems become "success"; if they fail, their status becomes "failure". There is an intermediate status, "cool off", which indicates a job that has failed for some reason but for which a resubmission will be scheduled by WMAgent (unlike the "failure" ones, which are frozen and will never be resubmitted). Cool-off jobs are useful because operators can ask sites to fix problems, expecting that WMAgent will resubmit the jobs by itself and retry running them.
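The job states named in the two paragraphs above can be summarized in a small sketch; this is only an illustration of the transitions described here, not the WMAgent implementation:

# Illustration only: job states and transitions as described in this section.
JOB_TRANSITIONS = {
    "queued":   ["pending"],                          # ready to go to the Agent
    "pending":  ["running"],                          # intermediate: in local queue / matched to a site
    "running":  ["success", "cool off", "failure"],   # actually executing on the worker node
    "cool off": ["pending"],                          # failed, but WMAgent will resubmit it by itself
    "success":  [],                                   # terminal
    "failure":  [],                                   # terminal: frozen, never resubmitted
}

def can_transition(src, dst):
    """True if the state change is allowed in this simplified picture."""
    return dst in JOB_TRANSITIONS.get(src, [])

assert can_transition("cool off", "pending")
assert not can_transition("failure", "pending")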
So when a workflow gets ASSIGNED, the ReqMgr takes care of choosing the best Agent on a slot-availability basis. If Agent1 and Agent2 are the agents to be used, the ReqMgr evaluates the thresholds set inside those Agents, choosing the agent with more free slots (or more total slots??).
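A minimal sketch of that choice, assuming "more free slots" is indeed the criterion (the question above is left open) and using made-up threshold numbers:

# Sketch of the agent choice described above, assuming the criterion is the
# number of free slots. Agent names and slot numbers are made up.
agents = {
    "Agent1": {"total_slots": 5000, "used_slots": 4200},
    "Agent2": {"total_slots": 4000, "used_slots": 1500},
}

def free_slots(info):
    return info["total_slots"] - info["used_slots"]

best_agent = max(agents, key=lambda name: free_slots(agents[name]))
print(best_agent)   # "Agent2" with these made-up numbers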
Monitoring the production
While L3/Unified creates tape families, pre-stages samples and assigns workflows to sites or site groups, operators monitor the workflows, report on temporary failures and try to solve them before the next automatic resubmission (cool off).
When completed, Team Leaders perform post mortem (all input events processed or all requested events produced, failure analysis, etc.). If a workflow does not meet our closeout criteria, ACDC/Recovery/Cloning needs to occur. After close out, L3 will check post mortem, set status to VALID and announce datasets.
Communication among operators is done through ELOG, where every change of thresholds or any weird behaviour found must be reported. Communication through ELOG is fundamental, especially when specific issues need to be communicated to the next operator looking at the system (shift model), but also to get help from the other workflow team members around.
Operators need to register to ELOG (it does not use your default CERN account credentials). The email feed works only in one direction: you can subscribe to any sub-elog and get all notifications via email, but replying via email doesn't work.
Site status, CSP, downtimes
If there are failures at a specific site, the typical clue is that jobs fail at one site, get resubmitted, and then run happily at another site. In this case, before any investigation, one should always check whether the site is in downtime.
http://dashb-ssb.cern.ch/dashboard/request.py/siteview#currentView=Production&highlight=true
If the site is not in downtime and it is failing a lot of jobs, open a GGUS ticket and inform the site support team.
CMS sites are mainly monitored via the CMS Computing Shift procedures; the places where one can find live, historical and future information about CMS Sites status and downtimes are:
Sometimes site managers announce their downtimes on relevant HN as well, but it's optional.
Generally if the site is in maintenance it is not supposed to run production jobs (and therefore its thresholds in the WMAgent should be set to 0).
Debugging
Generally you should consider opening tickets when you have problems with sites and services. Please read the CMS Ticket policy that gives you an idea of when you should use tickets.
Before opening tickets you need to understand the nature of the problem. For instance, suppose we have cool-offs/failures on a workflow and we want to figure out where the problem is. We can go through these steps (use step n+1 if step n is not useful):
Log Archives:
Retry 1 -> /store/unmerged/logs/prod/2011/10/14/cmsdataops_Backfill_111005_MCTest_T2_BR_SPRACE_111005_114953/Production/0000/1/5e6e95be-f205-11e0-ab8a-003048f0e38c-248-1-logArchive.tar.gz
To retrieve that log archive and analyze its content for further information on the execution, deduce the site where the log is stored by looking at the state transitions, and then use PhEDEx to map the (SiteName, LFN) pair to the PFN:
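For example, the PhEDEx data service lfn2pfn call can do the mapping; a minimal sketch in Python (the site name here is only deduced from the request name in the example above, so treat it as a placeholder, and the exact shape of the JSON reply may need adjusting):

# Sketch: map (site, LFN) to a PFN via the PhEDEx data service lfn2pfn call.
# The node name is a placeholder; take the real one from the state transitions.
import json
import urllib.parse
import urllib.request

def lfn2pfn(node, lfn, protocol="srmv2"):
    params = urllib.parse.urlencode({"node": node, "lfn": lfn, "protocol": protocol})
    url = "https://cmsweb.cern.ch/phedex/datasvc/json/prod/lfn2pfn?" + params
    with urllib.request.urlopen(url) as resp:
        reply = json.load(resp)
    # The mapping list in the reply carries the translated PFN
    return reply["phedex"]["mapping"][0]["pfn"]

print(lfn2pfn(
    "T2_BR_SPRACE",
    "/store/unmerged/logs/prod/2011/10/14/cmsdataops_Backfill_111005_MCTest_T2_BR_SPRACE_111005_114953/"
    "Production/0000/1/5e6e95be-f205-11e0-ab8a-003048f0e38c-248-1-logArchive.tar.gz",
))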
- use the MergeLogCollect tarball for completed workflows; after completion WMA merges all the logcollects into a single tarball, whose srm path is directly available:
Output Files:
srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2011/10/WMAgent/cmsdataops_HIG-Summer11-00799_2_111028_153758/cmsdataops_HIG-Summer11-00799_2_111028_153758-ProductionRAWSIMoutputMergeLogCollect-1-logs.tar
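To pull that tarball locally and see which logArchives it contains, something along these lines can work (a sketch assuming gfal-copy and a valid grid proxy are available on the machine; the local destination path is arbitrary):

# Sketch: copy the MergeLogCollect tarball locally and list its contents.
# Assumes gfal-copy is installed and a valid grid proxy is in place.
import subprocess
import tarfile

SRM_PATH = ("srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2011/10/WMAgent/"
            "cmsdataops_HIG-Summer11-00799_2_111028_153758/"
            "cmsdataops_HIG-Summer11-00799_2_111028_153758-ProductionRAWSIMoutputMergeLogCollect-1-logs.tar")
LOCAL_COPY = "/tmp/logcollect.tar"

subprocess.run(["gfal-copy", SRM_PATH, "file://" + LOCAL_COPY], check=True)

with tarfile.open(LOCAL_COPY) as tar:
    for member in tar.getmembers():
        print(member.name)   # the individual logArchive tarballs collected by WMAgent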
- svn co svn+ssh://svn.cern.ch/reps/WmAgentScripts .
- cd mc/
[vocms201] /afs/cern.ch/user/j/jen_a/WmAgentScripts/mc > python CheckWorkQueueElements.py nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419
Element: 00f48629339ba7d9d7c8763e63944631 is Running in http://vocms201.cern.ch:5984/workqueue
Element: 042156f8c703fd4c78a57e3461b5ae09 is Running in http://vocms201.cern.ch:5984/workqueue
Element: 05e9e5cba077fa6aab06bf84f05b981c is Running in http://vocms201.cern.ch:5984/workqueue
Element: 0714071c59ca282b05c05eec068894c0 is Running in http://vocms201.cern.ch:5984/workqueue
Element: 0c9407d05c5a92d541bdfd2feb0bf73d is Running in http://vocms201.cern.ch:5984
As user cmst1:
[vocms201] /afs/cern.ch/user/c/cmst1 > source /data/admin/wmagent/env.sh
[vocms201] /data/srv/wmagent/current > $manage db-prompt
SQL*Plus: Release 11.2.0.3.0 Production on Tue Feb 12 18:45:32 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning, Real Application Clusters and Real Application Testing options
SQL>
SQL> SELECT wmbs_users.cert_dn AS owner, wmbs_workflow.task, wmbs_job_state.name, COUNT(wmbs_job.id) AS jobs, SUM(wmbs_job.outcome) AS success, SUM(wmbs_fileset.open) AS open
FROM wmbs_workflow
INNER JOIN wmbs_users ON wmbs_users.id = wmbs_workflow.owner
INNER JOIN wmbs_subscription ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
LEFT OUTER JOIN wmbs_jobgroup ON wmbs_subscription.id = wmbs_jobgroup.subscription
LEFT OUTER JOIN wmbs_job ON wmbs_jobgroup.id = wmbs_job.jobgroup
LEFT OUTER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name='nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419'
GROUP BY wmbs_users.cert_dn, wmbs_workflow.task, wmbs_job_state.name;
OWNER (all rows): /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=nnazirid/CN=733973/CN=Nikolaos Naziridis

| TASK | NAME | JOBS | SUCCESS | OPEN |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN/MonteCarloFromGENMergeRAWSIMoutput | cleanout | 1702 | 1701 | 0 |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN/MonteCarloFromGENCleanupUnmergedRAWSIMoutput | cleanout | 2615 | 2601 | 0 |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN | cleanout | 130260 | 130084 | 0 |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN/LogCollect | executing | 1 | 0 | 0 |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN/LogCollect | cleanout | 4501 | 4231 | 0 |
| /nnazirid_HIG-Summer12-01056_253_v1__130130_175136_2419/MonteCarloFromGEN/MonteCarloFromGENMergeRAWSIMoutput/MonteCarloFromGENRAWSIMoutputMergeLogCollect | cleanout | 93 | 89 | 0 |
6 rows selected.
SQL>
Anything in "cleanout" is done and can be ignored; the jobs in "executing" should be looked at. In this case it is a LogCollect job: I went to vocms201 and looked at condor, there was one LogCollect job still running, but it was on its third time around, so I did a condor_rm on it; hopefully this will allow the workflow to close out.
- Detailed active jobs status description per workflow:
SELECT wmbs_users.cert_dn AS owner, wmbs_workflow.task, wmbs_job_state.name, wmbs_job.id AS jobs
FROM wmbs_workflow
INNER JOIN wmbs_users ON wmbs_users.id = wmbs_workflow.owner
INNER JOIN wmbs_subscription ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
LEFT OUTER JOIN wmbs_jobgroup ON wmbs_subscription.id = wmbs_jobgroup.subscription
LEFT OUTER JOIN wmbs_job ON wmbs_jobgroup.id = wmbs_job.jobgroup
LEFT OUTER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name='pdmvserv_EWK-Summer11Leg-00008_00003_v0__130904_001917_9905'
AND wmbs_job.state IN (0,1,2,3);
- Detailed job information:
SELECT * FROM wmbs_job WHERE id = 3236320;
- Workflow job status in the database per taskType
SELECT DISTINCT wmbs_workflow.task, wmbs_job_state.name, COUNT(wmbs_job.id) AS numJobs
FROM wmbs_workflow
INNER JOIN wmbs_subscription ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
LEFT OUTER JOIN wmbs_jobgroup ON wmbs_subscription.id = wmbs_jobgroup.subscription
LEFT OUTER JOIN wmbs_job ON wmbs_jobgroup.id = wmbs_job.jobgroup
LEFT OUTER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name='pdmvserv_EGM-UpgradePhase1Age0START-00006_00001_v0__131005_122949_9444'
GROUP BY wmbs_workflow.task, wmbs_job_state.name;
WMAgent hands on (goes into the tutorial)
The operator must have access to the machines on which the WMAgents are located. Then the environment_script must be sourced and you must "cd" to the current_directory. Please refer to the Agents section in the Toolkit to get an updated list of the Agents.
Use ./config/wmagent/manage status to see if all is OK, and in particular if all the components are running:
[vocms201] /data/srv/wmagent/current > ./config/wmagent/manage status
+ Couch Status:
++ Couch running with process: 32109
++ {"couchdb":"Welcome","version":"1.0.2"}
+ Status of MySQL
++ MYSQL running with process: 32316
++ Uptime: 3794448 Threads: 29 Questions: 367575591 Slow queries: 431 Opens: 146482 Flush tables: 1 Open tables: 64 Queries ...
Status of WMAgent:
Checking default database connection... ok.
Status components: ['DashboardReporter', 'WorkQueueManager', 'DBSUpload', 'PhEDExInjector', 'JobAccountant', 'JobCreator', ... ]
Component:DashboardReporter Running:9652
Component:WorkQueueManager Running:16305
Component:DBSUpload Running:9802
Component:PhEDExInjector Running:9896
Component:JobAccountant Running:9955
Component:JobCreator Running:10046
Component:JobSubmitter Running:10125
Component:JobTracker Running:10158
Component:JobStatusLite Running:10184
Component:ErrorHandler Running:10379
Component:RetryManager Running:10412
Component:JobArchiver Running:10609
Component:TaskArchiver Running:10656
Component:WorkQueueService Running:10724
To check just the health status of the Agent components you can also take a look at the Agent location link under Request Overview > Agent Monitor.
If needed, check for messages in the logs under <current_directory>install/wmagent, for instance:
[vocms201] /data/srv/wmage