Put the node in maintenance (see procedure here) before making any intervention on the middleware services; otherwise the intervention will trigger alarms.

Quattor configuration of the sensors and exceptions

  • All the sensors and exceptions are defined in the CDB template pro_monitoring_metrics_lcgrb.tpl. This template is included in the CDB template pro_system_lcgrb.tpl.

  • An example of a metric and an exception is:
"/system/monitoring/metric/_5240" = nlist(  # Metric ID
        "name",         "rb_ns_count",   # Name of the metric
        "descr",        "Number of network server processes running.",   # Description of the metric
        "class",        "system.processCount",   # Class of the metric
        "param",        list("cmdline", "/opt/edg/bin/edg-wl-ns_daemon", "uid", "edguser"),   # Process to check
        "period",       60,
        "smooth",       nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
        "active",       true,
        "latestonly",   false,
);

"/system/monitoring/exception/_30108" = nlist(   # Exception ID                  
        "name",         "rb_ns_wrong",   # Name of the exception
        "descr",        "Number of network server process wrong.",   # Description of the exception
        "active",       true,
        "latestonly",   false,
        "importance",   1,
        "alarmtext",    "rb_ns_wrong",   # Name of the exception displayed on the Lemon page
        "correlation",  "5240:1 < 1",   # Exception triggered when the network server is not running
        "actuator",     nlist("execve",  "/sbin/service edg-wl-ns restart",   # Command to execute when the NS is not running
                              "maxruns", 1,
                              "timeout", 0,
                              "active",  true)
);
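
To make the "correlation" entry easier to read, here is a minimal Python sketch of how a rule such as "5240:1 < 1" can be interpreted. It is not Lemon code. It assumes, based only on the example above and its comment, that a simple rule has the shape "<metric ID>:<field number> <operator> <threshold>" and that fields are numbered from 1. The name exception_raised and the samples dictionary are hypothetical.

```python
import operator
import re

# Assumed shape of a simple rule: "<metric id>:<field> <op> <threshold>".
# Real correlation expressions may be richer than this sketch supports.
_RULE = re.compile(
    r"^\s*(\d+):(\d+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$"
)
_OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def exception_raised(correlation, samples):
    """Return True if the rule fires for the given samples.

    samples maps a metric ID to the list of field values of its latest
    sample; fields are assumed to be numbered from 1, as in "5240:1".
    """
    match = _RULE.match(correlation)
    if match is None:
        raise ValueError("unsupported correlation: %r" % correlation)
    metric_id, field, op, threshold = match.groups()
    value = samples[int(metric_id)][int(field) - 1]
    return _OPS[op](float(value), float(threshold))
```

With this reading, exception_raised("5240:1 < 1", {5240: [0]}) is true (no network server process found), and the same rule is false as soon as one process is counted.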

Exceptions defined for the cluster lcgrb

| Exception name | Description | Service to restart | Comment |
| RB_NS_WRONG | Number of network server process(es) running wrong (should be one). | edg-wl-ns | - |
| RB_WM_WRONG | Number of workload manager process(es) running wrong (should be one). | edg-wl-wm | - |
| RB_JOBCONTROLLER_WRONG | Number of job controller process(es) running wrong (should be between one and four). | edg-wl-jc | - |
| RB_CONDORMASTER_WRONG | Number of Condor master process(es) running wrong (should be one). | edg-wl-jc | - |
| RB_CONDORSCHEDD_WRONG | Number of Condor scheduler process(es) running wrong (should be one). | edg-wl-jc | - |
| RB_LOGD_WRONG | Number of logd process(es) running wrong (should be one). | edg-wl-locallogger | - |
| RB_INTERLOGD_WRONG | Number of interlogd process(es) running wrong (should be one). | edg-wl-locallogger | - |
| RB_RENEWD_WRONG | Number of renewd process(es) running wrong (should be two). | edg-wl-proxyrenewal | - |
| RB_LM_WRONG | Number of log monitor process(es) running wrong (should be one). | edg-wl-lm | - |
| RB_MJS_WRONG | Number of lcg-mon-job-status process(es) running wrong (should be one). | lcg-mon-job-status | - |
| RB_FTPD_WRONG | Number of ftpd process(es) running wrong (should be one). | edg-wl-ftpd | - |
| RB_BKSERVERD_WRONG | Number of bkserverd process(es) running wrong (should be between 1 and 11). | See OP | - |
| RB_FTPD_WRONG | Number of ftpd process(es) running wrong (should be between 1 and 2000). | See OP | - |
| RB_DG20LOGD_HIGH | Number of dg20logd files in tmp too high (should be < 1000). | - | Backlog |
| RB_INPUTFL_SIZE | Size of the /var/edgwl/workload_manager/input.fl file too high (should be < 1.5GB). | - | Backlog |
| RB_QUEUEFL_SIZE | Size of the /var/edgwl/jobcontrol/queue.fl file too high (should be < 1.5GB). | - | Backlog |
| RB_INPUTFL_WRONG | Number of records in file /var/edgwl/workload_manager/input.fl | - | Not yet implemented |
| RB_QUEUEFL_WRONG | Number of records in file /var/edgwl/jobcontrol/queue.fl | - | Not yet implemented |
| RB_FD_LM_WRONG | Number of file descriptors opened by the log monitor process too high (should be < 800). | See OP | - |
| RB_LOGS_FULL | Middleware log files partition (/data01) full. | - | - |
| RB_SANDBOX_FULL | Sandbox partition (/data02) full. | - | - |
| RB_MYSQL_FULL | MySQL partition (/data03) full. | - | - |
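
As an illustration of the "should be ..." conditions in the table above, the following Python sketch checks a process count against its expected range. It is only a reading aid, not part of the Lemon or Quattor configuration. The ranges are copied from a few rows of the table; the names EXPECTED_COUNTS and count_is_wrong are hypothetical.

```python
# Expected process counts as (minimum, maximum), copied from the table.
# Only a few rows are shown; the dictionary name is hypothetical.
EXPECTED_COUNTS = {
    "RB_NS_WRONG": (1, 1),          # network server: exactly one
    "RB_WM_WRONG": (1, 1),          # workload manager: exactly one
    "RB_RENEWD_WRONG": (2, 2),      # renewd: exactly two
    "RB_BKSERVERD_WRONG": (1, 11),  # bkserverd: between 1 and 11
}


def count_is_wrong(exception_name, count):
    """Return True if the observed process count is outside its range."""
    low, high = EXPECTED_COUNTS[exception_name]
    return not (low <= count <= high)
```

For example, count_is_wrong("RB_BKSERVERD_WRONG", 12) is true, while a count of 11 is still within the accepted range.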

Operations help guide for LCG RB nodes

The operational procedures (OP) can be found here.

-- YvanCalas - 12 Mar 2007

Topic revision: r2 - 2007-03-19 - YvanCalas
 