Grid Site Monitoring Questionnaire

In December 2006, to consolidate the mandate and to understand the existing deployment of monitoring tools within the infrastructure, a questionnaire was circulated to all site administrator contact addresses registered for LCG in the Grid Operations Centre database. The results presented below were also summarised at the WLCG Collaboration Meeting Monitoring BOF.

Questions

1) What local fabric monitoring system do you use?

  1. GridICE/Lemon
  2. Nagios
  3. Other (please specify)
  4. None

2) Which Grid level sensors do you use?

  1. which services are monitored
  2. what values/metrics are measured

3) Who provided the sensors?

4) Is your fabric monitoring part of any regional/off-site monitoring framework?

  1. who are you linked with
  2. generally, how is this implemented

5) When you learn that something is wrong with the services at your site, what is the most frequent way you are informed?

  1. looking in the local fabric or Grid monitoring system
  2. getting a trouble ticket
  3. getting a mail/telephone call from VOs/users
  4. other (please specify)

6) Briefly describe what you see as your top 3 monitoring priorities to help improve your service reliability/availability

Replies

Over 200 sites were polled. Following a reminder, 34 responses were received (prior to 17 Jan 2007) and analysed. Because the responses varied in detail and clarity, the following inevitably includes some approximation.

What local fabric monitoring system do you use?

Most of those who responded were using a local monitoring framework, with a majority using multiple frameworks in combination. The count of sites in each category was:

  1. Nagios: 22
  2. GridICE/Lemon: 10
  3. Other: 13 (the majority using Ganglia in addition to Nagios or GridICE/Lemon)
  4. None: 3

Which Grid level sensors do you use?

12 sites reported monitoring some Grid services, most commonly the CE and SE.

Who provided the sensors?

Excluding reports of the SAM sensors (described variously as gLite, LCG, etc.), 6 sites reported using sensors supplied by ROCs: CE (2), AP (2) and IT (2).

Is your fabric monitoring part of any regional/off-site monitoring framework?

10 sites reported being part of a regional framework, but few details were provided as to implementation. (There was clearly duplication with the previous question.)

When you learn that something is wrong with the services at your site, what is the most frequent way you are informed?

  1. Local monitoring: 21
  2. Support ticket: 10
  3. Looking at SAM/GSTAT: 4
  4. Direct from user/VO: 3

Briefly describe what you see as your top 3 monitoring priorities to help improve your service reliability/availability

Because response styles and levels of detail varied, it is hard to tabulate responses to this question. The following strong themes and repeated keywords were noted:
  • single view - common interface - global view
  • unified tools - repository
  • more/deeper diagnostics
  • more flexible – alarm levels
  • improved/reliable/redundant SAM
  • hardware/network monitoring

Despite the focus on monitoring, several sites highlighted non-monitoring priorities for improving reliability:

  • Working/debugged middleware
  • Reliable hardware
  • Experience/knowledge transfer

-- IanNeilson - 05 Mar 2007

Topic revision: r2 - 2007-03-07 - IanNeilson
 