Experiment Lemon Metrics

Existing metrics and exceptions

The following table lists the existing metrics (and exceptions) used on experiment VO boxes. It covers only metrics related to experiment services, and the list is not yet exhaustive.

| Metric ID | Metric name | Metric description | Metric class | VO | Services | Hosts | Template |
| 4031 | dq2_ss_heartbeat | Counts "is down" alarms in the central agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 4032 | dq2_ss_file_error_rate | Counts "alert" in the central agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 4034 | dq2_ss_file_submit_rate | Counts "is down" alarms in the T0 agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 4036 | dq2_da_submit_rate | Counts "alert" in the T0 agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 4037 | dq2_da_done_rate | Counts "is down" alarms in the T1 agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 4038 | dq2_ta_error_rate | Counts "alert" in the T1 agent watchdog in the last 30 mins | log.Parse | CMS | PhEDEx | vocms20 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/phedex_monitor_debug |
| 34 | gridftp | Reused and modified metric that checks that globus-gridftp-server is running, as root with PPID 1 | system.numberOfProcesses | LHCb | DIRAC | volhcb15 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=profiles/profile_volhcb15 |
| 815 | samclient_chklockfile | Reused and modified metric that checks that SAM test submission is not stuck | samclient.checklockfile | LHCb | SamClient | volhcb05 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_samclient |
| 4060 | DIRAC_Cert_valid | Checks whether the DIRAC certificate will remain valid for the next 14 days | FIO::CertOK | LHCb | DIRAC | volhcb17-26 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_common |
| 4061 | DIRAC_Cert_key_perm | Checks whether the DIRAC host.key has the correct mode, uid and gid | file.info | LHCb | DIRAC | volhcb17-26 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_common |
| 4090 | opt_own_partition | Checks whether /opt is on its own partition by parsing /proc/mounts | log.Parse | LHCb | DIRAC | volhcb17-26 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_common |
| 9104 | partitionInfo | Existing metric, with an added exception that checks whether /opt is full (>XX%) | system.partitionInfo | LHCb | DIRAC | volhcb17-26 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_common |
| 4076 | DIRAC_Lemon_Agent_check | Checks whether the log of DIRAC's Lemon Agent reports failures of critical services/agents | log.Parse | LHCb | DIRAC | volhcb16 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_common |
| 4034 | dq2_ss_file_submit_rate | Checks the status of gsisshd in the last 30 mins | log.Parse | ALICE | gsisshd | voalice06, voalice11, voalice12, voalice13 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/alice/pro_params_voalice_acl&os=slc5&arch=x86_64&svcclass=vobox&resource=alice&customization=alice_vobox |
| 4036 | dq2_da_submit_rate | Checks the status of alice-box-proxyrenewal in the last 30 mins | log.Parse | ALICE | alice-box-proxyrenewal | voalice06, voalice11, voalice12, voalice13 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/alice/pro_params_voalice_acl&os=slc5&arch=x86_64&svcclass=vobox&resource=alice&customization=alice_vobox |
| 4037 | dq2_da_done_rate | Checks the status of the VO box registration in MyProxy in the last 30 mins | log.Parse | ALICE | VO box registration in MyProxy | voalice06, voalice11, voalice12, voalice13 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/alice/pro_params_voalice_acl&os=slc5&arch=x86_64&svcclass=vobox&resource=alice&customization=alice_vobox |
| 4120 | CheckFileAge1 | Checks how old a generic file is | file.sslmtime | CMS | SAM client | vocms36 | |
| 4121 | CheckFileAge2 | Checks how old a generic file is | file.sslmtime | CMS | SAM client | vocms36 | |
| 4122 | CheckFileAge3 | Checks how old a generic file is | file.sslmtime | CMS | SAM client | vocms36 | |
| 5230 | DBS_Req_Rates_http | Request rates of the DBS reader instances | log.Parse | CMS | DBS | vocms30, vocms31, vocms73, vocms74 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/cmsdbs/monitoring&os=slc5&arch=x86_64&svcclass=vobox&resource=cms&customization=cmst0dbs |
| 5231 | DBS_Req_Rates_https | Request rates of the DBS writer and admin instances | log.Parse | CMS | DBS | vocms30, vocms31, vocms73, vocms74 | http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/customization/cms/cmsdbs/monitoring&os=slc5&arch=x86_64&svcclass=vobox&resource=cms&customization=cmst0dbs |
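Most of the log.Parse metrics above boil down to counting log lines that contain a given string ("is down", "alert") within the last 30 minutes, and opt_own_partition boils down to looking for a /opt entry in /proc/mounts. The Python sketch below illustrates that logic only; it is not the Lemon log.Parse sensor. The function names are hypothetical, and the log layout (each line starting with a `YYYY-MM-DD HH:MM:SS` timestamp) is an assumption: the real PhEDEx, ALICE and DIRAC logs may be formatted differently.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable

# ASSUMPTION: every relevant log line starts with "YYYY-MM-DD HH:MM:SS".
# The real watchdog/service log formats may differ.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def count_recent_matches(lines: Iterable[str], pattern: str, now: datetime,
                         window: timedelta = timedelta(minutes=30)) -> int:
    """Count lines containing `pattern` stamped within `window` before `now`.

    Hypothetical helper mirroring metrics such as "counts 'is down' alarms
    in the last 30 mins"; not the Lemon log.Parse sensor itself.
    """
    cutoff = now - window
    count = 0
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match is None:
            continue  # ignore lines without a parseable timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if cutoff <= stamp <= now and pattern in line:
            count += 1
    return count


def is_separate_mount(mounts_text: str, mount_point: str = "/opt") -> bool:
    """Return True if `mount_point` has its own entry in /proc/mounts content.

    Each /proc/mounts line reads: device mountpoint fstype options dump pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


if __name__ == "__main__":
    # Made-up sample lines, not real PhEDEx output.
    log = [
        "2009-10-22 10:05:00 agent_a is down",  # older than 30 mins
        "2009-10-22 10:20:00 alert: queue growing",
        "2009-10-22 10:40:00 agent_b is down",
    ]
    now = datetime(2009, 10, 22, 10, 45, 0)
    print(count_recent_matches(log, "is down", now))  # 1
    print(count_recent_matches(log, "alert", now))    # 1
    print(is_separate_mount("/dev/sda3 /opt ext3 rw 0 0\n"))  # True
```

On a real host, `is_separate_mount(open("/proc/mounts").read())` would perform the opt_own_partition check; matching on the mount-point field rather than the whole line avoids false positives from paths such as /opt/dirac.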

Desired metrics and exceptions

The following table lists metrics (and exceptions) that would be useful for monitoring experiment services but do not yet exist. The list is not yet exhaustive.

| Metric name | Metric description | Metric class | VO | Services |
| CheckFileAge1 | Checks how old a generic file is | file.sslmtime | CMS | |
| CheckFileAge2 | Checks how old a generic file is | file.sslmtime | CMS | |
| CheckFileAge3 | Checks how old a generic file is | file.sslmtime | CMS | |
| CheckFileAge4 | Checks how old a generic file is | file.sslmtime | CMS | |
| CheckFileAge5 | Checks how old a generic file is | file.sslmtime | CMS | |
| phedex_mgmt_watchdog | Checks age of central agent watchdog logfile | file.sslmtime | CMS | PhEDEx |
| phedex_mgmt_aliveness | Counts "is down" alarms in the central agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_mgmt_alert | Counts "alert" in the central agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_t0_watchdog | Checks age of T0 agent watchdog logfile | file.sslmtime | CMS | PhEDEx |
| phedex_t0_aliveness | Counts "is down" alarms in the T0 agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_t0_alert | Counts "alert" in the T0 agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_t1_watchdog | Checks age of T1 agent watchdog logfile | file.sslmtime | CMS | PhEDEx |
| phedex_t1_aliveness | Counts "is down" alarms in the T1 agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_t1_alert | Counts "alert" in the T1 agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_caf_watchdog | Checks age of CAF agent watchdog logfile | file.sslmtime | CMS | PhEDEx |
| phedex_caf_aliveness | Counts "is down" alarms in the CAF agent watchdog | log.Parse | CMS | PhEDEx |
| phedex_caf_alert | Counts "alert" in the CAF agent watchdog | log.Parse | CMS | PhEDEx |
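The *_watchdog and CheckFileAge metrics measure how old a file is. Assuming file.sslmtime reports the seconds elapsed since a file's last modification (as its name suggests), the underlying check can be sketched in Python as follows. The function names, the 30-minute default threshold and the choice to treat a missing file as stale are illustrative assumptions, not documented Lemon behaviour.

```python
import os
import time
from typing import Optional


def seconds_since_modification(path: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since `path` was last modified.

    ASSUMPTION: this is the quantity file.sslmtime reports.
    """
    current = time.time() if now is None else now
    return current - os.stat(path).st_mtime


def watchdog_is_stale(path: str, max_age_s: float = 1800.0,
                      now: Optional[float] = None) -> bool:
    """True if the watchdog logfile is missing or older than `max_age_s`.

    The 1800 s (30 min) default is a hypothetical threshold.
    """
    try:
        return seconds_since_modification(path, now) > max_age_s
    except FileNotFoundError:
        return True  # treat a missing logfile as a stale watchdog
```

For example, `watchdog_is_stale(path, max_age_s=3600)` would flag a watchdog logfile that has not been written for more than an hour. The actual logfile path of each PhEDEx agent is not given on this page and would have to come from the PhEDEx configuration.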

Submitted requests

The following metrics (and exceptions) have been officially requested from the Lemon team:

| Metric name | Request date | Accepted? |

Proposals for new sensors

This section describes sensors that must be written from scratch because the corresponding metrics cannot be implemented with any existing sensor.

-- AndreaSciaba - 2009-09-30 -- JiriHorky - 22-Oct-2009

Topic revision: r14 - 2010-05-17 - JiriHorky