WMAgent Toolkit

An operator must carry out several operations on the WMAgent machines. This page lists the commands needed. Refer to OperationsWMAgentTutorial for examples of how to use them and to CompOpsWorkflowOperationsWMAgentReference for an overview of the system.

Agents

Note: the cmst1 password has been changed. Log in to the machine with your own account, then become cmst1:

  sudo -u cmst1 /bin/bash
  cd ~cmst1

First, the operator must have access to the machines on which the WMAgents are located. Then source the environment_script and cd to the current_directory.

                     vocms201, 202, 215, 216      cmssrv112                    cmssrv98
environment_script   /data/admin/wmagent/env.sh   /data/admin/wmagent/env.sh   /data/admin/wmagent/env.sh
current_directory    /data/srv/wmagent/current    /data/srv/wmagent/current    /data/srv/wmagent/current
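
The setup steps can be wrapped in a small shell function, to be run after becoming cmst1. This is only a sketch: the function name enter_agent and its two optional arguments are illustrative assumptions, and the default paths are the ones listed in the table.

```shell
# Sketch: prepare the current shell for WMAgent work.
# enter_agent and its arguments are hypothetical names, not part of WMAgent.
enter_agent() {
    env_script=${1:-/data/admin/wmagent/env.sh}
    current_dir=${2:-/data/srv/wmagent/current}

    if [ ! -r "$env_script" ]; then
        echo "cannot read $env_script" >&2
        return 1
    fi
    # Source the script in the current shell so that its settings persist.
    . "$env_script" || return 1
    # Move to the agent's working directory.
    cd "$current_dir" || return 1
}
```

Called without arguments, enter_agent uses the default paths; the arguments only exist so the function can be tried out against a scratch directory.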

Agents run out of Disk Space

If an agent runs out of disk space, free space as follows.

  • Truncate the CouchDB log
    1. Log in to the agent
    2. cd /data/srv/wmagent/current
    3. cd install/couchdb/logs/
    4. Truncate the CouchDB log:
        bash-3.2$ > couch.log
  • Clear out the binary logs for MySQL
    1. Log in to the agent
    2. cd /data/srv/wmagent/current
    3. ls -lh install/mysql/database/
    4. From the mysqld-bin.0000* files listed, choose the oldest one that should be kept
    5. ./config/wmagent/manage mysql-prompt wmagent
    6. mysql> PURGE BINARY LOGS TO 'mysqld-bin.0000*';
       (replace the wildcard with the chosen file name; all binary logs older than that file are deleted)
  • Truncate the stderr and stdout files for each component
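
The truncation and listing steps above can be sketched as two shell helpers. The function names truncate_logs and list_binlogs are hypothetical; the default directory is the one used in the steps above, relative to /data/srv/wmagent/current.

```shell
# Empty each given log file in place. The file is kept rather than deleted,
# so a running process that holds it open keeps writing to the same file.
truncate_logs() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            : > "$f"
        else
            echo "skipping (not a regular file): $f" >&2
        fi
    done
}

# List the MySQL binary logs in a directory, sorted by name (oldest first).
list_binlogs() {
    dir=${1:-install/mysql/database}
    for f in "$dir"/mysqld-bin.0000*; do
        [ -e "$f" ] && basename "$f"
    done | sort
}
```

For example, from the current_directory, truncate_logs install/couchdb/logs/couch.log empties the CouchDB log, and list_binlogs shows the candidates for the PURGE BINARY LOGS statement. Truncating is preferred over deleting because space held by a deleted file is not released while a process still has the file open.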

Shutting off sites

  • Log in to one of the CERN WMAgent machines, e.g. vocms201
  • sudo to cmst1
  • Edit /afs/cern.ch/user/c/cmst1/www/wmaconfig/slot-limits.conf
  • Find the line with the site in the first column, e.g. T2_RU_PNPI:

    • T2_RU_PNPI 500

  • Add the word "down" in the third column:

    • T2_RU_PNPI 500 down

  • No more jobs will then be submitted to that site. To restart submission, remove "down".
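
The manual edit can also be scripted. The following is a minimal sketch, assuming each line of slot-limits.conf consists of a site name and a slot limit, optionally followed by "down", as in the example above; the function name set_site_state is hypothetical.

```shell
# Mark a site "down" or bring it back "up" in a slot-limits style file.
# Assumes lines of the form "<site> <slots> [down]"; other lines are untouched.
set_site_state() {
    conf=$1; site=$2; state=$3
    case "$state" in
        down|up) ;;
        *) echo "state must be 'down' or 'up'" >&2; return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    if awk -v s="$site" -v st="$state" '
            $1 == s { if (st == "down") print $1, $2, "down"; else print $1, $2; next }
            { print }
        ' "$conf" > "$tmp"
    then
        # Write back through the original file so its permissions are kept.
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}
```

For example, set_site_state slot-limits.conf T2_RU_PNPI down stops submission to that site, and the same call with up removes the flag again. It is safer to try this on a copy of the file first.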

Cloning a Workflow

Sometimes it is required to clone a workflow. Instructions on how to do it are here:

ClonningWorklow

Commands

Operation
    Command
    Comments
start agent
    ./config/wmagent/manage start-agent
    starts components
stop agent
    ./config/wmagent/manage stop-agent
    stops components
stop services
    ./config/wmagent/manage stop-services
    stops couch and mysql (used before machine reboot)
start services
    ./config/wmagent/manage start-services
    starts couch and mysql (used after machine reboot)
status agent
    ./config/wmagent/manage status
start component
    ./config/wmagent/manage execute-agent wmcoreD --start --components=component_name
stop component
    ./config/wmagent/manage execute-agent wmcoreD --shutdown --components=component_name
restart component
    ./config/wmagent/manage execute-agent wmcoreD --restart --components=component_name
change site slots
    ./config/wmagent/manage execute-agent wmagent-resource-control --site-name=site --site-slots=site_slots
change task slots
    ./config/wmagent/manage execute-agent wmagent-resource-control --site-name=site --task-type=task --task-slots=task_slots
    task can be Cleanup, LogCollect, Merge, Skim, Processing, Production
show current thresholds
    ./config/wmagent/manage execute-agent wmagent-resource-control -p
    see T1Thresholds
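
Some of these commands are typically used together, for example around a machine reboot. The sketch below wraps them in shell functions to be run from the current_directory. The function names and the DRY_RUN switch are hypothetical, and the order used here (agent stopped before the services, services started before the agent) is an assumption based on the components depending on couch and mysql.

```shell
# Path to the manage script, relative to the current_directory.
MANAGE=${MANAGE:-./config/wmagent/manage}

# Run a manage subcommand; with DRY_RUN=1 only print it (hypothetical switch).
wm_manage() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$MANAGE $*"
    else
        "$MANAGE" "$@"
    fi
}

# Before a machine reboot: stop the components, then couch and mysql.
pre_reboot() {
    wm_manage stop-agent && wm_manage stop-services
}

# After a machine reboot: start couch and mysql, then the components.
post_reboot() {
    wm_manage start-services && wm_manage start-agent
}

# Restart the named components one at a time, stopping at the first failure.
restart_components() {
    for c in "$@"; do
        wm_manage execute-agent wmcoreD --restart --components="$c" || return 1
    done
}
```

Running DRY_RUN=1 restart_components JobSubmitter ErrorHandler prints the two commands without executing them, which is a cheap way to check the arguments first.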

Condor commands

Condor tools are in /data/srv/condor/current/bin/.

Command
    Description
condorq
    Display information about jobs in the local Condor queue. Output format is <jobid> <job_status> <?> <path_to_condor_log> <site>
condor_overview.sh
    Overview of the status of jobs at sites. Output format is <count> <site> <status>; see below for the meaning of the status codes.
condor_q -analyze <jobid>
    Run analysis summary
condor_q -constraint 'JobStatus==2 && DESIRED_Sites=="T2_XX_XXXX"' \
  -format '%s.' ClusterId -format '%s ' ProcId -format '%s ' 'formatTime(QDate,"%m/%d %H:%M")' \
  -format '%s ' 'formatTime(JobStartDate,"%m/%d %H:%M")' \
  -format '%s\n' DESIRED_Sites
    List the running jobs at one site with their queue and start times (replace T2_XX_XXXX with the site name)
condor_q -constraint 'JobStatus==3'
    Check for jobs that are marked for removal but are stuck for some reason
condor_rm -constraint 'JobStatus==3' -forcex
    Force removal of jobs that were marked for removal and got stuck
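
The per-site query above is easier to reuse when the site name is a parameter. The following is a sketch only; the function names site_constraint and running_at_site are hypothetical, while the condor_q options are the ones from the table.

```shell
# Build a condor_q constraint for jobs with a given status code at a site.
site_constraint() {
    printf 'JobStatus==%s && DESIRED_Sites=="%s"' "$1" "$2"
}

# List running jobs (JobStatus 2) at a site with their queue and start times.
running_at_site() {
    condor_q -constraint "$(site_constraint 2 "$1")" \
        -format '%s.' ClusterId -format '%s ' ProcId \
        -format '%s ' 'formatTime(QDate,"%m/%d %H:%M")' \
        -format '%s ' 'formatTime(JobStartDate,"%m/%d %H:%M")' \
        -format '%s\n' DESIRED_Sites
}
```

For example, running_at_site T2_RU_PNPI runs the query for that site, provided /data/srv/condor/current/bin is in the PATH.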

Additional documentation here:

How to limit disk usage of jobs

Condor status in the condor_overview

Status Description
1 pending
2 running
3 removed
5 held
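
To make the numeric codes easier to read, the output of condor_overview.sh can be piped through a small filter. This is a sketch, assuming the output has exactly the three columns <count> <site> <status> described above; the function name label_status is hypothetical, and codes not listed in the table are passed through unchanged.

```shell
# Replace the numeric status in "<count> <site> <status>" lines with its name.
label_status() {
    awk '
        BEGIN {
            name[1] = "pending"; name[2] = "running"
            name[3] = "removed"; name[5] = "held"
        }
        {
            # Keep the original code when it is not in the table above.
            label = ($3 in name) ? name[$3] : $3
            print $1, $2, label
        }
    '
}
```

Usage: condor_overview.sh | label_status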

Helper Scripts

To assist in monitoring the sites a couple of scripts are now available. They can be run from vocms201:

  • condor_global_overview will give an overview table merging the condor_overview results from all agents
  • gml will give a table with:
    • the condor_global_overview info
    • the slot information (available in ~cmst1/www/wmaconfig/slot-limits.conf or online)
    • the site information from cms dashboard site view
    • a link to open tickets (Savannah, if there are any)
    • the sites in order of importance (according to T1list and T2list)
    • for readability the script can be run with -l (extra horizontal lines) or -b (alternating background color) flags

Glidein Tools

The glideinWMS tools are in /data/srv/glideinWMS/tools/.

Components

Component Name Function
DashboardReporter
DBSUpload
ErrorHandler
JobAccountant
JobArchiver
JobCreator
JobStatusLite
JobSubmitter
JobTracker
PhEDExInjector
RetryManager
TaskArchiver
WorkQueueManager
WorkQueueService

Glidein Factories

http://vocms39.cern.ch/glidefactory/monitor/glidein_v2_5_2/factoryStatus.html

Monitoring the T1 sites

CMS T1 Production Monitoring - Happy Face

See also T1Thresholds and T2Thresholds

Contacts

cms-service-glideinwms@cern.ch: contact address for the glidein factory task force
Topic revision: r5 - 2016-02-24 - JenniferAdelmanMcCarthy
 