Multi Level Monitoring Overview

Introduction

The EGEE SA1 Multi level monitoring (MLM) system is an operational monitoring system built to meet the needs of EGEE during the EGEE-III project. It is based on Nagios and integrates existing operational tools using the MSG messaging system as an integration framework.

This document gives an overview of:

  • the main components within the Multi level monitoring system
  • how these components interact
  • the deployment model, showing where the various components will be deployed
  • the teams involved in the development of the system
  • a project roadmap along with development and deployment milestones.

Aims and Objectives

The aim of the multi level monitoring system is to provide an integrated project level monitoring system for EGEE-III. The model is based on concepts outlined in the EGEE-III Operations Automation Strategy. It should also fit within the principles of future federated infrastructures such as EGI.org.

It follows a model in which regional entities, such as a ROC in EGEE-III or an NGI in EGI.org, are responsible for the smooth running of sites within their local domain. Each regional entity does this by remotely monitoring its sites to detect operational problems which could affect users, and by providing first-level support to help sites follow up on these problems. The data produced by these regional monitoring systems is also used by project level systems to carry out tasks such as SLA calculation.

Sites should also deploy their own local monitoring system. This is completely based upon the standard components of MLM, e.g. Nagios, Metric Description DB. This will improve site reliability by giving site administrators better visibility of the state of grid services at their site.

Components in multi-level monitoring

The diagram below shows the various components in the system, according to their deployment scope - project, regional, site. We will now outline the roles and responsibilities of each of these components.

0810-Work-Items-Deployment-v1.5.png

Repositories of Information

In order to configure the monitoring system we need two different types of information:
  • A list of sites and services which should be tested
  • The metrics that should be obtained for each of the services.

The definitive sources for these two types of information are the GOCDB and the Metric Description Database, respectively.

GOCDB

  • Provider: STFC
  • Deployment Model: Project Level
  • Description: A project level database which contains the definitive list of sites belonging to the EGEE infrastructure.
  • Dependencies: None
  • Responsibilities:
    • Contains personal contact information for staff at sites
    • Contains list of roles for project members
    • Contains list of downtimes and monitorable services
  • Interfaces provided:
    • Web interface for site and regional administrators to create and update information
    • Programmatic query interface for other operation tools and VO tools to export/query information
  • Comments:
  • Links and Documentation:

Metric Description Database

  • Provider: BARC/CERN
  • Deployment Model: Project Level
  • Description: A project level database which contains the metadata about metrics that can be used to test services, and also how these metrics can be combined to calculate availability or generate blacklists.
  • Dependencies:
    • GOCDB contains the list of service types
    • GOCDB contains the lists of roles for users which limit which actions they may take
  • Responsibilities:
    • Provide descriptions of tests which should be run against grid services at EGEE sites.
    • Allow customization of tests that should be run.
    • Provide a list of calculations based on the tests that can then be used for e.g. availability, blacklisting, …
    • Allow project members to provide their own calculations
    • Maintains history of which metrics and calculations were valid at which point in time
  • Interfaces provided:
    • Web interface to add new metrics, metricsets and calculations
    • Programmatic interface for NCG and SLA calculation system to query and obtain the list of metrics to be tested for each service type
    • Programmatic interface for the SLA calculation system to obtain the current set of VO-specific SLAs that should be calculated
  • Comments:
  • Links and Documentation:

Aggregate Topology Provider

  • Provider: BARC/CERN
  • Deployment Model: Project
  • Description:
  • Dependencies:
  • Responsibilities:
    • ...
    • ...
    • ...
  • Interfaces provided:
    • ...
    • ...
  • Comments:
  • Links and Documentation:

Regional monitoring with Nagios

The regional monitoring system is the core of MLM, and is responsible for the actual testing of the sites. Once the tests have been run, their results are sent to the other components which need them, in particular the regional dashboard and the central metric store / SLA calculation. This is shown below in more detail. Note the use of the messaging system to communicate results to geographically distant components.

0901-Central-Regional-Flows.png

Regional grid monitoring (Nagios)

  • Provider: Nagios.org
  • Deployment Model: Regional
  • Description: The regional grid monitoring Nagios tests sites and services in the region. It relies on the components below to be configured, to store its historical data and to publish its data to other systems via the messaging system. It has a pluggable architecture into which these other components are deployed, including extra visualization tools.
  • Dependencies:
    • Monitoring Configuration Generation (NCG)
    • Regional Metric Store
    • MSG-Nagios bridge
    • User credentials for grid tests are stored in a MyProxy server
  • Responsibilities:
    • Schedule and run tests against sites and services
    • Generate notifications in case of problems
    • Publish results of tests to message bus
  • Interfaces provided:
    • Standard Nagios web interface provided to ROC and site admins for viewing and scheduling tests
  • Comments:
  • Links and Documentation: http://www.nagios.org/docs/

Monitoring Configuration Generation (NCG)

  • Provider: SRCE
  • Deployment Model: All locations where Nagios is needed - Project, Regional, Site
  • Description: NCG is a configuration tool which generates an appropriate configuration file for Nagios
  • Dependencies:
    • GOCDB (or any other topology source, such as Aggregate Topology Provider or Information system) provides the list of sites and services
    • Metric Description database provides the list of metrics that are to be configured
  • Responsibilities:
    • Query the GOCDB and Metric Description DB and produce a suitable Nagios configuration file
  • Interfaces provided:
    • Command line tool which produces a set of Nagios configuration files.
  • Comments:
  • Links and Documentation: GridMonitoringNcgOverview
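
The kind of transformation NCG performs can be sketched as follows. This is a minimal illustration and not NCG's actual code: the `topology` and `metrics_by_service_type` dictionaries stand in for the answers that GOCDB and the Metric Description Database would give, and the host names, metric names, `check_...` command names and the `generic-service` template are all hypothetical. Only the `define service { ... }` object syntax is standard Nagios.

```python
# Hypothetical inputs; in MLM these would come from GOCDB and the
# Metric Description Database respectively.
topology = {
    "ce01.example.org": ["CE"],
    "se01.example.org": ["SRM"],
}
metrics_by_service_type = {
    "CE": ["org.example.JobSubmit"],
    "SRM": ["org.example.Put", "org.example.Get"],
}


def render_service(host, metric):
    """Render one Nagios service object definition for a host/metric pair."""
    command = "check_" + metric.replace(".", "_")  # assumed naming scheme
    return (
        "define service {\n"
        "    use                  generic-service\n"  # assumed template name
        f"    host_name            {host}\n"
        f"    service_description  {metric}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def generate_config(topology, metrics_by_service_type):
    """Join topology with metric descriptions into Nagios service definitions."""
    blocks = []
    for host in sorted(topology):
        for service_type in topology[host]:
            # Service types with no known metrics produce no checks.
            for metric in metrics_by_service_type.get(service_type, []):
                blocks.append(render_service(host, metric))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(generate_config(topology, metrics_by_service_type))
```

Keeping the topology and the metric list as separate inputs mirrors the split between the two repositories: either source can change without touching the other.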

Regional Metric Store

  • Provider: Nagios.org
  • Deployment Model: Regional
  • Description: The regional metric store provides a database of current and historical metric results
  • Dependencies: MySQL
  • Responsibilities:
    • Provide the Alarm DB with a list of current metrics which are in CRITICAL or WARNING states
    • Provide regional admins with a local historical store of metrics for their sites for debug purposes
  • Interfaces provided:
    • Direct SQL interface for query by alarm DB
    • Integrated into Metric Visualization system (e.g. RRD graphs of historical results)
  • Comments: This is simply the standard Nagios DB extension tool, NDOUtils. It's available for MySQL, PostgreSQL and Oracle.
  • Links and Documentation: http://nagioswiki.org/wiki/Addon:NDOUtils, http://www.nagios.org/docs/
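
As an illustration of the direct SQL interface, the sketch below selects the metrics currently in WARNING or CRITICAL state. It makes several assumptions: the `metric_status` table and its columns are a simplified, hypothetical schema and not the real NDOUtils one, and an in-memory SQLite database replaces MySQL so that the example is self-contained. The numeric state codes follow the usual Nagios plugin return-code convention.

```python
import sqlite3

# Nagios plugin return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
WARNING, CRITICAL = 1, 2


def problem_metrics(conn):
    """Return (host, metric, state) rows currently in WARNING or CRITICAL."""
    cur = conn.execute(
        "SELECT host, metric, state FROM metric_status "
        "WHERE state IN (?, ?) ORDER BY state DESC, host, metric",
        (WARNING, CRITICAL),
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory stand-in for the metric store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metric_status (host TEXT, metric TEXT, state INTEGER)"
    )
    conn.executemany(
        "INSERT INTO metric_status VALUES (?, ?, ?)",
        [
            ("ce01.example.org", "org.example.JobSubmit", 2),
            ("se01.example.org", "org.example.Put", 0),
            ("se01.example.org", "org.example.Get", 1),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in problem_metrics(demo_connection()):
        print(row)
```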

Metric Visualization

  • Provider: nagios.org / CERN ???
  • Deployment Model: Regional
  • Description: Provide visualization of current test results and their history (similar to what is in the current SAM Portal)
  • Dependencies:
  • Responsibilities:
  • Interfaces provided:
    • Generate web-accessible graphs of history for embedding in other tools
  • Comments: Not sure if this is development work, or just configuration of existing tools like pnp4nagios and NagVis
  • Links and Documentation: Laurence's notes on pnp4nagios, http://www.pnp4nagios.org/pnp/start, http://nagvis.org/

MSG-Nagios bridge

  • Provider: SRCE/CERN
  • Deployment Model: Regional/Project/Site
  • Description: This provides a mechanism to submit test results to a Nagios instance, and also to consume test results from a Nagios instance
  • Dependencies:
    • Messaging system
  • Responsibilities:
    • Listen on the messaging system for messages destined for this Nagios instance and push them to Nagios
    • Publish the results of all grid tests to the messaging system
  • Interfaces provided:
    • Test results can be pushed to Nagios via messaging
    • Test results can be consumed from Nagios via messaging
  • Comments: This is a set of scripts and daemons which integrate tightly with a Nagios instance to provide this functionality
  • Links and Documentation:
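
This page does not say how the bridge hands incoming results to Nagios. One standard Nagios mechanism for this is a passive check result written to the external command file, and the sketch below assumes that approach. The message keys (`host`, `metric`, `status`, `summary`) are hypothetical and not the actual MSG message format, and the command file path is site-specific, so it is left as a parameter.

```python
import time

STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def to_passive_check(message, now=None):
    """Format a test-result message as a Nagios external command line.

    `message` is a dict with hypothetical keys: host, metric, status, summary.
    """
    timestamp = int(time.time() if now is None else now)
    code = STATUS_CODES[message["status"]]
    # Keep the command on a single line.
    summary = message["summary"].replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{message['host']};{message['metric']};{code};{summary}"
    )


def submit(message, command_file):
    """Append the command to the Nagios external command file."""
    with open(command_file, "a") as handle:
        handle.write(to_passive_check(message) + "\n")
```

The host and service names in the command must match objects already defined in the Nagios configuration, otherwise Nagios ignores the result.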

Site Debugging tools

  • Provider: Nagios.org / ???
  • Deployment Model: Regional
  • Description: Provides the existing SAMAP functionality, e.g. the ability for site admins to reschedule tests via the Nagios interface
  • Dependencies:
  • Responsibilities:
  • Interfaces provided:
    • allow site/ROC admins to re-schedule testing of their sites
  • Comments: Not clear why this is different from just the standard nagios functionality - need further clarification with SAMAP team. Indeed this could be seen to be completely satisfied by a site nagios instance.
  • Links and Documentation: SAMAP

Project level metric results and SLA calculations

The EGEE project needs to keep a reliable long-term store of metric results and of the availability calculations performed on this data. This includes infrastructure availability (e.g. OPS tests) and also VO-specific availabilities.

Project Metric Store

  • Provider: Nagios.org ???
  • Deployment Model: Project
  • Description: This is a long-term store of metric results gathered from all regional monitoring instances. It would provide a much longer history of the results than the regional instance.
  • Dependencies:
    • Regional monitoring to test sites and push data to repository via messaging system using standard APIs.
    • SLA Calculation provides the availability calculation results
    • The Metric Description Database provides additional metadata on the metrics
  • Responsibilities:
    • Store authoritative EGI.org calculated availabilities.
    • Provide long-term historical repository of all site testing.
  • Interfaces provided:
    • Interface for management to obtain availability reports (per region, country, site etc…).
    • Interface for other operational tools to obtain calculated availability.
    • Interface to extract historical data from repository
    • Provide a query interface for point-in-time queries of the metric status
    • Provide an interface to request dumps of historical data
  • Comments: This would probably leverage NDOUtils much like the regional metric store.
  • Links and Documentation: http://nagioswiki.org/wiki/Addon:NDOUtils
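
A point-in-time query over a metric's history can be sketched as a binary search for the last status change at or before the requested time. This is only an illustration of the idea: the in-memory `history` list is a hypothetical stand-in for the store, whose real query interface is not specified here.

```python
import bisect


def status_at(history, when):
    """Return the status in effect at time `when`, or None if before any record.

    `history` is a list of (timestamp, status) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the last record whose timestamp is <= when.
    index = bisect.bisect_right(timestamps, when) - 1
    return history[index][1] if index >= 0 else None


if __name__ == "__main__":
    history = [(100, "OK"), (200, "CRITICAL"), (300, "OK")]
    print(status_at(history, 250))
```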

SLA Calculation

  • Provider: BARC
  • Deployment Model: Project
  • Description: This is an active component which calculates availabilities based on the metric results stored in the project metric store
  • Dependencies:
    • GOCDB or other topology sources to provide the list of sites and services that should be used for availability calculation
    • Metric Description Database to provide the list of tests and the algorithm to be used for the calculation
  • Responsibilities:
    • Store authoritative EGI.org calculated availabilities.
  • Interfaces provided: None - all interaction is via the project metric store
  • Comments: This component could be deployed regionally too if the regional and project metric store use the same underlying component (e.g. NDOUtils)
  • Links and Documentation:
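
The real algorithm is supplied by the Metric Description Database and is not described on this page. Purely as an illustration, the sketch below assumes one simple definition: availability is the fraction of a reporting window during which a service's status was OK, computed from a list of status changes. The function name and the input format are hypothetical.

```python
def availability(changes, start, end):
    """Fraction of [start, end) spent in the "OK" state.

    `changes` is a time-sorted list of (timestamp, status) pairs; each status
    holds until the next change. Time before the first change is not counted.
    Returns None when the window contains no known status.
    """
    ok_time = 0
    known_time = 0
    for i, (ts, status) in enumerate(changes):
        next_ts = changes[i + 1][0] if i + 1 < len(changes) else end
        # Clip the interval to the reporting window.
        lo, hi = max(ts, start), min(next_ts, end)
        if hi <= lo:
            continue
        known_time += hi - lo
        if status == "OK":
            ok_time += hi - lo
    return ok_time / known_time if known_time else None


if __name__ == "__main__":
    changes = [(0, "OK"), (60, "CRITICAL"), (90, "OK")]
    print(availability(changes, 0, 120))
```

A production calculation would also need rules for scheduled downtimes and UNKNOWN results, which this sketch deliberately leaves out.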

Regional operations tools

Alarm DB

  • Provider: IN2P3
  • Deployment Model: Regional
  • Description: Store a list of currently active alarms for the R-COD operations staff to act on
  • Dependencies:
    • the metric store provides the underlying metric results for which alarms could be generated (via direct SQL)
  • Responsibilities:
    • From the problem metric results provided by the metric store, raise alarms if certain threshold conditions are met. These alarms are used by operators working with the operations dashboard
  • Interfaces provided:
    • A list of alarms is provided to the operations dashboard
  • Comments:
  • Links and Documentation:
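
The threshold conditions themselves are not defined on this page. As one possible example, the sketch below raises an alarm once a metric has returned a problem state for a given number of consecutive results. The rule, the default of three results and the input format are all assumptions made for illustration.

```python
from collections import defaultdict

PROBLEM_STATES = {"WARNING", "CRITICAL"}


def raise_alarms(results, min_consecutive=3):
    """Return sorted (host, metric) keys whose latest results are all problems.

    `results` is a time-ordered iterable of (host, metric, status) tuples.
    """
    streak = defaultdict(int)
    for host, metric, status in results:
        key = (host, metric)
        # A non-problem result resets the streak for that metric.
        streak[key] = streak[key] + 1 if status in PROBLEM_STATES else 0
    return sorted(key for key, count in streak.items() if count >= min_consecutive)


if __name__ == "__main__":
    results = [("ce01.example.org", "org.example.JobSubmit", "CRITICAL")] * 3
    print(raise_alarms(results))
```

Requiring several consecutive failures is a common way to avoid raising alarms for transient glitches, at the cost of a slower reaction to real outages.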

Operations Dashboard

  • Provider: IN2P3
  • Deployment Model: Regional
  • Description: A dashboard which implements the R-COD workflow for tracking problems at sites, and creating tickets based on alarms
  • Dependencies:
    • The Alarm DB will provide the list of current alarms to the dashboard
    • ???
  • Responsibilities:
    • Implement the R-COD workflows
  • Interfaces provided:
    • Web interface
    • Interface to/from GGUS via Ticketing Interface
  • Comments:
  • Links and Documentation: OperationalDocumentation

Ticketing Interface

  • Provider: IN2P3/FZK
  • Deployment Model: Regional
  • Description: A system to create tickets in GGUS, and receive updates
  • Dependencies:
    • GGUS will provide a standard mechanism to interface with
  • Responsibilities:
    • Interface between GGUS and the operations dashboard
  • Interfaces provided:
    • Allow creation of a new ticket for an alarm
    • query interface to receive updated information on a ticket from GGUS
  • Comments: Need docs for this !
  • Links and Documentation:

Configuration Management

Worker Node Configuration (JobWrapper Tests)

Worker Node Configuration Portal

Accounting

Accounting providers

Accounting portal

Site components

Site Monitoring (Nagios)

  • Provider:
  • Deployment Model: Regional
  • Description:
  • Dependencies:
  • Responsibilities:
    • ...
    • ...
    • ...
  • Interfaces provided:
    • ...
    • ...
  • Comments:
  • Links and Documentation:

Middleware event publication

Infrastructure

MSG Brokers

  • Provider:
  • Deployment Model: Regional
  • Description:
  • Dependencies:
  • Responsibilities:
    • ...
    • ...
    • ...
  • Interfaces provided:
    • ...
    • ...
  • Comments:
  • Links and Documentation:

Operational infrastructure monitoring

  • Provider:
  • Deployment Model: Regional
  • Description:
  • Dependencies:
  • Responsibilities:
    • ...
    • ...
    • ...
  • Interfaces provided:
    • ...
    • ...
  • Comments:
  • Links and Documentation:

Portals

Operations Portal

QR Metrics Portal

Information System monitoring (gStat)

Topic revision: r3 - 2009-01-30 - JamesCasey
 