The Workload Management System Admin Guide

Introduction

Service Overview

The Workload Management System (WMS) comprises a set of Grid middleware components responsible for distributing and managing tasks across Grid resources, so that applications are executed conveniently, efficiently and effectively. The WMS is composed of the following sub-services:

  • Workload Manager – WM: Core component of the Workload Management System; it accepts and satisfies job management requests coming from its clients
  • WMProxy: Web service interface to submit jobs to the WM
  • Job Controller – JC: Acts as an interface to Condor for the WM
  • Log Monitor – LM: Directly connected to the JC; acts as a job monitoring tool by parsing Condor log files
  • Local Logger: Copies events to be sent to the LB server into a local disk file
  • LBProxy: Keeps a local view of the job state to be sent to the LB server
  • Proxy Renewal: Service that renews the proxy of a long-running job
  • ICE (Interface to CREAM Environment): The WMS service that handles interaction with CREAM-based CEs

Installation and configuration

Hardware Requirements

  • Memory: 4 GB RAM is the suggested minimum
  • CPU: a quad-core processor is recommended, to better handle parallel matchmaking and all the different sub-services running on a WMS
  • Disk space: the minimum depends on the load and on the type of jobs submitted.
    • Under '${GLITE_LOCATION_VAR}/sandboxdir', 30-40 GB is the minimum on several WMS nodes used in production to accommodate the sandbox directories of submitted jobs, with the cron job that purges job sandboxes once a week enabled.

Install & Configure

  • Install and configure the OS and basic services according to the generic installation guide: https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310
  • Install the glite-WMS metapackage from the appropriate gLite software repository
  • Configure the WMS node by running '/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS'. The following WMS-specific variables can be set in the 'site-info.def' file:
    • $WMS_HOST -> the WMS hostname, e.g. 'egee-rb-01.$MY_DOMAIN'
    • $PX_HOST -> the hostname of a MyProxy server, e.g. 'myproxy.$MY_DOMAIN'
    • $BDII_HOST -> the hostname of the site BDII to be used, e.g. 'sitebdii.$MY_DOMAIN'
    • $LB_HOST -> the hostname of the LB server to be used, e.g. 'lb-server.$MY_DOMAIN:9000'. This variable is set as a service-specific variable in the file services/glite-wms, located one directory below the one containing the 'site-info.def' file
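
As an illustration of the variables listed above, the fragment below shows how they might be written as plain shell assignments, which is the syntax yaim reads from 'site-info.def'. The host names are the examples from the list; the domain 'example.org' is a placeholder assumed only for this sketch, not a value from this guide.

```shell
# Hypothetical site domain, used only for this illustration.
MY_DOMAIN=example.org

# WMS-specific variables in site-info.def (host names are the examples listed above).
WMS_HOST=egee-rb-01.$MY_DOMAIN
PX_HOST=myproxy.$MY_DOMAIN
BDII_HOST=sitebdii.$MY_DOMAIN

# Service-specific variable: it belongs in services/glite-wms, not in site-info.def.
LB_HOST=lb-server.$MY_DOMAIN:9000
```

In a real setup the last assignment would go in the separate services/glite-wms file; it is shown in the same block here only to keep the example compact.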

Daemons and services running

Scripts to check the status of the daemons and to start or stop them are located in the ${GLITE_WMS_LOCATION}/etc/init.d/ directory (e.g. '${GLITE_WMS_LOCATION}/etc/init.d/glite-wms-wm start/stop/status'). gLite production installations also provide a more generic service, called gLite, to manage all of them simultaneously: try 'service gLite status/start/stop'. On a typical WMS node the following services must be running:
  • glite-lb-locallogger:
    glite-lb-logd running
    glite-lb-interlogd running
  • glite-lb-proxy:
    glite-lb-proxy running as 4137
  • glite-proxy-renewald:
    glite-proxy-renewd running
  • globus-gridftp:
    globus-gridftp-server (pid 3107) is running...
  • glite-wms-jc:
    JobController running in pid: 10008
    CondorG master running in pid: 10063 10062
    CondorG schedd running in pid: 10070
  • glite-wms-lm:
    Logmonitor running...
  • glite-wms-wm:
    /opt/glite/bin/glite-wms-workload_manager (pid 9957) is running...
  • glite-wms-wmproxy:
    WMProxy httpd listening on port 7443
    httpd (pid 22223 22222 22221 22220 22219 22218 22217) is running ....
    ===
    WMProxy Server running instances:
    UID        PID  PPID  C STIME TTY          TIME CMD
  • glite-wms-ice:
    /opt/glite/bin/glite-wms-ice-safe (pid 10103) is running...
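
The per-service status checks above could be wrapped in a small helper along the following lines. This is only a sketch: the function name is hypothetical, it assumes every service in the list has a script of the same name in the directory passed as argument, and it assumes the 'status' action exits with a non-zero code when the service is not running, which this guide does not confirm for every script.

```shell
# Service names taken from the list of services above.
WMS_SERVICES="glite-lb-locallogger glite-lb-proxy glite-proxy-renewald globus-gridftp glite-wms-jc glite-wms-lm glite-wms-wm glite-wms-wmproxy glite-wms-ice"

# check_wms_services DIR
# Runs "DIR/<service> status" for each service and prints one line per service.
# Returns 1 if any script is missing or reports a failure, 0 otherwise.
# Assumption: a non-zero exit code from "status" means "not running".
check_wms_services() {
    init_dir=$1
    failed=0
    for svc in $WMS_SERVICES; do
        script="$init_dir/$svc"
        if [ ! -x "$script" ]; then
            echo "MISSING $svc"
            failed=1
        elif "$script" status >/dev/null 2>&1; then
            echo "OK      $svc"
        else
            echo "FAILED  $svc"
            failed=1
        fi
    done
    return $failed
}
```

It would be called as 'check_wms_services "${GLITE_WMS_LOCATION}/etc/init.d"'. A service whose script lives elsewhere on the node is reported as MISSING, so the output should be read together with 'service gLite status'.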

File Systems/Directories

Log files locations

Log files are located under ${GLITE_LOCATION_LOG}, typically '/var/log/glite'. On a heavily used WMS this directory can become quite big, on the order of tens of GB. Old rotated log files should be removed manually. The following default log files can be found on a typical WMS:

  • wmproxy.log: used in case of authentication or submission errors
  • workload_manager_events.log: used to check the status of the matchmaking process (from Waiting to Ready status) and the queries to the information system that fill in the InformationSuperMarket
  • ice.log: used to check jobs that matched a CREAM-based CE and were sent to it via ICE
  • jobcontoller_events.log: used to check job events once the jobs have arrived on Condor
  • httpd-wmproxy-errors.log: used in case of problems in contacting the WMProxy service
  • httpd-wmproxy-access.log
  • logmonitor_events.log: aggregated information about each job, coming from various log files
  • glite-wms-wmproxy-purge-proxycache.log
  • lcmaps.log: used when there are problems in mapping remote users to local pool accounts
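
Because old rotated log files have to be removed by hand, it can help to list the candidates first. The sketch below only prints them and deletes nothing. The function name is hypothetical, and the pattern '*.log.*' is an assumption about how rotated files are named on a given node (for example 'wmproxy.log.1'); adjust it to the local log rotation setup.

```shell
# list_old_rotated_logs DIR DAYS
# Prints rotated log files under DIR last modified more than DAYS days ago.
# Assumption: rotated files are named <name>.log.<suffix>.
list_old_rotated_logs() {
    log_dir=$1
    min_age_days=$2
    find "$log_dir" -type f -name '*.log.*' -mtime +"$min_age_days" -print
}
```

For example, 'list_old_rotated_logs /var/log/glite 30' lists rotated files older than 30 days. After reviewing the output, the files can be removed manually.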

Other log files that can be useful for troubleshooting are the Condor logs in:

  • /var/local/condor/log/
  • /var/glite/logmonitor/CondorG.log/

Configuration Guide

The general configuration file for the WMS is located in /opt/glite/etc/glite_wms.conf. This file is organized in sections, one for every running service plus a Common section. For a general description of the glite_wms.conf configuration file, the configuration parameters and their default values, see: https://twiki.cnaf.infn.it/cgi-bin/twiki/view/EgeeJra1It/WMSConfFile

Tuning of some configuration parameters

  • II_timeout: yaim sets the default to 30; increase it on low-memory machines (4 GB is the suggested minimum memory).
  • MatchRetryPeriod: once a job becomes pending, meaning that there are no resources available, this parameter sets the period between successive matchmaking attempts, in seconds. The default value is '1000'. To decrease the number of periodic retries of unmatched jobs, increase the value of this parameter. A suggested value (used on several production WMS nodes) is several hours, e.g. '14400' (4 hours).
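
Before changing either parameter, it is useful to see what the configuration file currently contains and to convert the intended retry period into seconds. The two helpers below are a minimal sketch with hypothetical names; they only read the file and do not modify it.

```shell
# show_wms_param FILE NAME
# Prints every line of FILE that mentions NAME, prefixed by its line number.
show_wms_param() {
    conf_file=$1
    param=$2
    grep -n -- "$param" "$conf_file"
}

# hours_to_seconds HOURS
# Converts whole hours into the seconds expected by MatchRetryPeriod.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}
```

For instance, 'show_wms_param /opt/glite/etc/glite_wms.conf MatchRetryPeriod' shows the current setting, and 'hours_to_seconds 4' prints 14400, the value suggested above.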

Troubleshooting

WMS Monitoring

-- ElisabettMolinari - 26 Mar 2008

Topic revision: r14 - 2009-04-16 - ElisabettMolinari
 