WMAgent Toolkit
An operator has to carry out several operations by hand. The commands needed are listed here. Refer to the
OperationsWMAgentTutorial for examples of how to use them, and to the
CompOpsWorkflowOperationsWMAgentReference for an overview of the system.
Agents
Note: the cmst1 password has been changed. Log in to the machine as yourself and then run:
sudo -u cmst1 /bin/bash
cd ~cmst1
First, the operator must have access to the machines on which the WMAgents are located. Then source the
enviroment_script and "cd" to the current_directory. Both are listed here for each machine.
| | vocms201, 202, 215, 216 | cmssrv112 | cmssrv98 |
| enviroment_script | /data/admin/wmagent/env.sh | /data/admin/wmagent/env.sh | /data/admin/wmagent/env.sh |
| current_directory | /data/srv/wmagent/current | /data/srv/wmagent/current | /data/srv/wmagent/current |
Agents run out of Disk Space
If an agent runs out of disk space, follow these steps.
- Truncate the CouchDB log
  - Log in to the agent
  - cd /data/srv/wmagent/current
  - cd install/couchdb/logs/
  - Truncate the CouchDB log: bash-3.2$ > couch.log
- Clear out the binary logs for MySQL
  - Log in to the agent
  - cd /data/srv/wmagent/current
  - ls -lh install/mysql/database/
  - Check the oldest mysqld-bin.0000*
  - ./config/wmagent/manage mysql-prompt wmagent
  - mysql> PURGE BINARY LOGS TO 'mysqld-bin.0000*'; (replace mysqld-bin.0000* with an actual file name; MySQL removes every binary log older than the named file and keeps the named file itself)
- Truncate the stderr and stdout files for each component
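The truncation and binary log steps above can be sketched as two small shell helpers. This is only a sketch and not part of the WMAgent tooling: the function names truncate_matching and oldest_binlog are made up for illustration, and the file name patterns passed to them are assumptions that should be checked against the actual agent installation before use.

```shell
#!/bin/bash
# Hypothetical helpers, not shipped with WMAgent.

# Truncate every regular file under DIR whose name matches NAME_GLOB.
# Truncating in place frees the space even while a daemon still has the
# file open; deleting an open file would not free it until it is closed.
truncate_matching() {
    local dir="$1" name_glob="$2"
    find "$dir" -type f -name "$name_glob" -exec sh -c ': > "$1"' _ {} \;
}

# Print the lowest-numbered MySQL binary log in DIR.
# The numeric suffix is zero-padded, so a plain sort orders the files.
# The [0-9]* pattern skips mysqld-bin.index.
oldest_binlog() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f -name 'mysqld-bin.[0-9]*' | sort | head -n 1
}

# Example calls, run from /data/srv/wmagent/current:
#   truncate_matching install/couchdb/logs couch.log
#   oldest_binlog install/mysql/database
```

Neither helper purges anything in MySQL. The PURGE BINARY LOGS statement still has to be issued from the mysql prompt with the file name chosen by the operator.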
Shutting off sites
- Log in to one of the CERN WMAgent machines, for example vocms201
- sudo to cmst1
- Edit /afs/cern.ch/user/c/cmst1/www/wmaconfig/slot-limits.conf
- Find the line that has the site in the first column, for example T2_RU_PNPI
- Add the word "down" in the 3rd column of that line
- No more jobs will then be submitted to that site. To restart submission, remove "down".
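The edit described above can be previewed with a short awk filter before touching the real file. This is a sketch under assumptions: the layout of slot-limits.conf is not shown here, so the code assumes whitespace-separated columns with the site name first and one further column before the "down" marker. The function name mark_site_down is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print CONF with "down" placed in the 3rd column of
# the line whose 1st column equals SITE. The file itself is not modified.
# Assumes whitespace-separated columns; awk rewrites the changed line with
# single spaces between columns.
mark_site_down() {
    local site="$1" conf="$2"
    awk -v site="$site" '$1 == site { $3 = "down" } { print }' "$conf"
}

# Example: review the result first, then apply the change by hand.
#   mark_site_down T2_RU_PNPI /afs/cern.ch/user/c/cmst1/www/wmaconfig/slot-limits.conf
```

Printing to standard output instead of editing in place lets the operator compare the result with the original before changing the shared configuration.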
Cloning a Workflow
Sometimes a workflow needs to be cloned. Instructions on how to do it are here:
ClonningWorklow
Commands
| Operation | Command | Comments |
| start agent | ./config/wmagent/manage start-agent | starts components |
| stop agent | ./config/wmagent/manage stop-agent | stops components |
| stop services | ./config/wmagent/manage stop-services | stops couch and mysql (used before machine reboot) |
| start services | ./config/wmagent/manage start-services | starts couch and mysql (used after machine reboot) |
| status agent | ./config/wmagent/manage status | |
| start component | ./config/wmagent/manage execute-agent wmcoreD --start --components=component_name | |
| stop component | ./config/wmagent/manage execute-agent wmcoreD --shutdown --components=component_name | |
| restart component | ./config/wmagent/manage execute-agent wmcoreD --restart --components=component_name | |
| change site slots | ./config/wmagent/manage execute-agent wmagent-resource-control --site-name=site --site-slots=site_slots | |
| change task slots | ./config/wmagent/manage execute-agent wmagent-resource-control --site-name=site --task-type=task --task-slots=task_slots | task can be Cleanup, LogCollect, Merge, Skim, Processing, Production |
| show current thresholds | ./config/wmagent/manage execute-agent wmagent-resource-control -p | T1Thresholds |
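When several components need a restart, the restart command from the table can be wrapped in a loop. The following is a sketch only: the function name restart_components and the MANAGE variable are hypothetical, and the component names in the example are placeholders. It calls the manage script once per component, because whether --components accepts more than one name per call is not stated here.

```shell
#!/bin/bash
# Path to the manage script, relative to the current_directory.
# MANAGE is a hypothetical override hook, handy for dry runs (MANAGE=echo).
MANAGE="${MANAGE:-./config/wmagent/manage}"

# Restart each named component in turn and stop at the first failure.
restart_components() {
    local component
    for component in "$@"; do
        "$MANAGE" execute-agent wmcoreD --restart --components="$component" || return 1
    done
}

# Example (placeholder component names), run from /data/srv/wmagent/current:
#   restart_components ComponentA ComponentB
```

Setting MANAGE=echo before the call prints the commands instead of running them, which is a cheap way to check the arguments first.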
Condor commands
Condor tools are in /data/srv/condor/current/bin/.
| Command | Description |
| condorq | Display information about jobs in the local Condor queue. Format is <jobid> <job_status> <?> <path_to_condor_log> <site> |
| condor_overview.sh | Overview of the status of jobs at sites. Output is <count> <site> <status>; the meaning of each status is given below. |
| condor_q -analyze <jobid> | Run analysis summary |
| condor_q -constraint 'JobStatus==2 && DESIRED_Sites=="T2_XX_XXXX"' -format '%s.' ClusterId -format '%s ' ProcId -format '%s ' 'formatTime(QDate,"%m/%d %H:%M")' -format '%s ' 'formatTime(JobStartDate,"%m/%d %H:%M")' -format '%s\n' DESIRED_Sites | List the running jobs at one site with their queue and start times (change T2_XX_XXXX to the site name) |
| condor_q -constraint 'JobStatus==3' | Check for jobs that are marked for removal but stuck for some reason |
| condor_rm -constraint 'JobStatus==3' -forcex | Force removal of jobs that are marked for removal but stuck |
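The long per-site query in the table is easy to mistype, so it can be wrapped in a shell function that takes the site name as its argument. This is a sketch: the function name running_jobs_at_site is hypothetical, while the condor_q options are copied unchanged from the table above.

```shell
#!/bin/bash
# Hypothetical wrapper: list the running jobs (JobStatus==2) at SITE as
# "<cluster>.<proc> <queue date> <start date> <site>", one job per line.
running_jobs_at_site() {
    local site="$1"
    condor_q -constraint "JobStatus==2 && DESIRED_Sites==\"${site}\"" \
        -format '%s.' ClusterId \
        -format '%s ' ProcId \
        -format '%s ' 'formatTime(QDate,"%m/%d %H:%M")' \
        -format '%s ' 'formatTime(JobStartDate,"%m/%d %H:%M")' \
        -format '%s\n' DESIRED_Sites
}

# Example:
#   running_jobs_at_site T2_RU_PNPI
```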
Additional documentation here:
How to limit disk usage of jobs
Condor status in the condor_overview
| Status | Description |
| 1 | pending |
| 2 | running |
| 3 | removed |
| 5 | held |
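The numeric codes are easier to read once they are replaced by their names. The filter below is a sketch that assumes the input has exactly the <count> <site> <status> layout described for condor_overview.sh. The function name label_condor_status is hypothetical, and codes that are not in the table above are passed through with a generic label.

```shell
#!/bin/bash
# Hypothetical filter: read "<count> <site> <status>" lines on stdin and
# print them with the numeric status replaced by its name.
label_condor_status() {
    awk '
        BEGIN {
            name[1] = "pending"; name[2] = "running"
            name[3] = "removed"; name[5] = "held"
        }
        NF >= 3 {
            # Codes outside the table are kept visible rather than dropped.
            label = ($3 in name) ? name[$3] : "status " $3
            print $1, $2, label
        }'
}

# Example:
#   condor_overview.sh | label_condor_status
```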
Helper Scripts
To assist in monitoring the sites, a couple of scripts are now available. They can be run from vocms201:
- condor_global_overview will give an overview table merging the condor_overview results from all agents
- gml will give a table with:
  - the condor_global_overview info
  - the slot information (available in ~cmst1/www/wmaconfig/slot-limits.conf or online)
  - the site information from the CMS dashboard site view
  - a link to open tickets (Savannah, if there are any)
  - the sites in order of importance (according to T1list and T2list)
- for readability the script can be run with -l (extra horizontal lines) or -b (alternating background color) flags
Glidein Tools
glideinWMS tools are under /data/srv/glideinWMS/tools/
Components
Glidein Factories
Monitoring the T1 sites
CMS T1 Production Monitoring - Happy Face
See also
T1Thresholds and T2Thresholds
Contacts