Cloud monitoring

Introduction

The motivation for this development is to offer a common solution for monitoring the performance of the VMs used for VO processing. These VMs may run on a cloud, though not necessarily. In general, the VOs are not interested in monitoring every individual VM, but they need to be notified when a machine gets stuck so that they can take action, normally via the experiment workload management systems. Another useful functionality is accounting of how resources are used, which can be particularly important for cloud resources. The proposed model is inspired by the development started in LHCb by Mario Ubeda Garcia. Unlike the LHCb implementation, we plan to provide a solution which is not coupled to any given workload management system (WMS) but can be easily integrated with the WMS of any experiment. Mario's implementation is fairly generic and based on a widely used monitoring system (Ganglia), so it can be taken as the basis for a common solution.

Main principles

  • Simplicity
  • Minimal deployment inside the cloud
  • No functionality which is VO-specific
  • Simple integration with the various WMSs of the VOs

Main components:

The Ganglia monitoring system, namely:

  • gmond, running on every VM
  • gmetad, which stores monitoring data collected from one or several clusters (clouds)
  • the Ganglia web frontend, which visualizes the data collected by gmetad
  • ganglia-api, which provides communication with the VO WMSs

Ganglia Monitoring Daemon (gmond)

Gmond is a multi-threaded daemon which runs on each cluster node to be monitored (in our case, on every VM). Gmond has its own redundant, distributed database. It has four main responsibilities: monitoring changes in host state, multicasting relevant changes, listening to the state of all other Ganglia nodes via a multicast channel, and answering requests for an XML description of the cluster state. More information can be found here. Gmond is part of the CernVM distribution.
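As an illustration of the XML description of the cluster state mentioned above, the following minimal sketch parses such a document into a nested dictionary. The sample XML is hand-written in the general shape of a gmond dump (GANGLIA_XML, CLUSTER, HOST and METRIC elements); the cluster name, host names and values are made up, and the helper name parse_cluster_state is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of a gmond XML dump.
# Cluster name, host names and metric values are invented for illustration.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="cloud-a" LOCALTIME="1389350000">
    <HOST NAME="vm-001" IP="10.0.0.1" REPORTED="1389349995" TN="5" TMAX="20">
      <METRIC NAME="cpu_idle" VAL="3.2" TYPE="float" UNITS="%"/>
      <METRIC NAME="load_one" VAL="1.95" TYPE="float" UNITS=""/>
    </HOST>
    <HOST NAME="vm-002" IP="10.0.0.2" REPORTED="1389349000" TN="1000" TMAX="20">
      <METRIC NAME="cpu_idle" VAL="99.8" TYPE="float" UNITS="%"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def parse_cluster_state(xml_text):
    """Return {cluster: {host: {"reported": int, "metrics": {name: value}}}}.

    Metric values are kept as strings, exactly as they appear in the XML.
    """
    root = ET.fromstring(xml_text)
    state = {}
    for cluster in root.iter("CLUSTER"):
        hosts = {}
        for host in cluster.iter("HOST"):
            metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
            hosts[host.get("NAME")] = {
                # REPORTED is the Unix time of the last report from this host.
                "reported": int(host.get("REPORTED", "0")),
                "metrics": metrics,
            }
        state[cluster.get("NAME")] = hosts
    return state


if __name__ == "__main__":
    for cluster, hosts in parse_cluster_state(SAMPLE_XML).items():
        for name, info in hosts.items():
            print(cluster, name, info["reported"], info["metrics"])
```

Keeping the last report time per host next to the metrics is a deliberate choice here, since a VM that stops reporting is one of the conditions the VOs want to be notified about.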

Ganglia META Daemon (gmetad)

Gmetad stores historical information in round-robin databases (RRD) and exports summary XML, which the web frontend uses to present useful snapshots and trends for all hosts monitored by Ganglia. While gmond uses multicast channels in a peer-to-peer way, gmetad pulls the XML description from Ganglia data sources (either gmond or another gmetad) via XML over unicast routes.
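The pull model described above can be sketched in a few lines: a client opens a unicast TCP connection to a data source and reads the XML dump until the peer closes the connection. This is only an illustration of the idea, not gmetad's actual code. The function names and the (name, host, port) source tuples are hypothetical, and port 8649 is assumed to be the gmond TCP port, which is the usual default but depends on the local configuration.

```python
import socket


def fetch_ganglia_xml(host, port=8649, timeout=5.0):
    """Connect to a Ganglia XML port and read the dump until EOF.

    Returns raw bytes so that an XML parser can honor the encoding
    declared in the document itself.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the data source closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def poll_sources(sources, fetch=fetch_ganglia_xml):
    """Poll each (name, host, port) source once.

    Returns {name: xml_or_None}; None marks a source that could not be
    reached (connection refused, timeout, or another socket error).
    """
    results = {}
    for name, host, port in sources:
        try:
            results[name] = fetch(host, port)
        except OSError:
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical data sources; replace with real hosts to try it out.
    print(poll_sources([("cloud-a", "gmond.example.org", 8649)]))
```

Passing the fetch function as a parameter keeps the polling loop testable without a running Ganglia instance.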

Ganglia Web Frontend

The Ganglia web frontend provides a view of the gathered information via real-time dynamic web pages. Most importantly, it displays Ganglia data in a meaningful way for system administrators and computer users. Although the web frontend to ganglia started as a simple HTML view of the XML tree, it has evolved into a system that keeps a colorful history of all collected data.

Ganglia API

The Ganglia API is a small standalone Python application that takes XML data from a number of Ganglia gmetad processes and presents a RESTful JSON API of the most recently received data. It finds all of the gmetad processes, then polls each of them in turn, keeping the latest results in memory. This makes it easy to query the latest data by environment, host, or metric name. More information can be found here.
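A WMS could query such an API roughly as sketched below. Everything specific in this example is an assumption to be checked against the deployed ganglia-api version: the base URL and path, the filter parameter names (environment, host, metric_name), and the "metrics" key in the JSON response. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: host, port and path are assumptions, not confirmed values.
API_BASE = "http://ganglia-api.example.org:8080/ganglia/api/v1/metrics"


def build_query_url(base=API_BASE, **filters):
    """Build a query URL from keyword filters, skipping filters set to None."""
    params = {key: value for key, value in sorted(filters.items()) if value is not None}
    return base + ("?" + urlencode(params) if params else "")


def latest_metrics(base=API_BASE, timeout=10, **filters):
    """Fetch the latest metrics matching the filters.

    Assumes the response is a JSON object with a "metrics" list;
    returns an empty list if that key is missing.
    """
    with urlopen(build_query_url(base, **filters), timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("metrics", [])


if __name__ == "__main__":
    # Only builds the URL; no network access is needed for this line.
    print(build_query_url(host="vm-001", metric_name="cpu_idle"))
```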

Architecture

Figure: Slide1.jpg (architecture schema)

Implementation

As can be seen from the schema above, most of the components already exist as part of the Ganglia monitoring system (shown in light orange). The main effort will therefore consist of understanding and documenting the installation and configuration of all components in the chain. The only component which might require some development effort is the one shown in gray in the schema: a component which performs real-time analysis of the data collected from the VM clusters in order to detect possible problems, for example stuck VMs. CEP/Esper technology can be considered as a candidate for implementing this component.
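To make the idea of the analysis component more concrete, here is a minimal sketch of a sliding-window detector in plain Python. It is not based on Esper and is not a proposed implementation. The definition of "stuck" used here is an assumption made only for illustration: a VM is flagged either when it has not reported for longer than max_silence seconds, or when its last few cpu_idle samples are all at or above idle_threshold. The class name, parameters and thresholds are hypothetical.

```python
from collections import defaultdict, deque


class StuckVMDetector:
    """Flag VMs that went silent or stayed idle for a full sample window."""

    def __init__(self, window=3, idle_threshold=95.0, max_silence=300):
        self.window = window                  # number of consecutive samples to inspect
        self.idle_threshold = idle_threshold  # cpu_idle percentage treated as "idle"
        self.max_silence = max_silence        # seconds without a report before "silent"
        self._idle = defaultdict(lambda: deque(maxlen=window))
        self._last_seen = {}

    def observe(self, host, timestamp, cpu_idle):
        """Record one cpu_idle sample for a host at the given Unix time."""
        self._idle[host].append(cpu_idle)
        self._last_seen[host] = timestamp

    def stuck_hosts(self, now):
        """Return {host: reason} where reason is "silent" or "idle"."""
        stuck = {}
        for host, last_seen in self._last_seen.items():
            samples = self._idle[host]
            if now - last_seen > self.max_silence:
                stuck[host] = "silent"
            elif len(samples) == self.window and min(samples) >= self.idle_threshold:
                stuck[host] = "idle"
        return stuck


if __name__ == "__main__":
    detector = StuckVMDetector()
    for t in (0, 60, 120):
        detector.observe("vm-idle", t, 99.0)
        detector.observe("vm-busy", t, 10.0)
    print(detector.stuck_hosts(now=120))
```

A CEP engine would express the same conditions as declarative queries over an event stream; the sketch only shows what kind of state such a component has to keep per VM.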

Comments

My 2c - Luca

  • I understand that storage and visualization are not the core business of the project, but maybe it is worth considering RRD alternatives such as Whisper, which has been developed in recent years to overcome RRD limitations. A Ganglia add-on should be available to use it as a backend. This would also open the possibility of using other tools for metrics visualization, such as Graphite, which is the reference today for time series visualization.
  • The CEP technology is appropriate for the real-time analysis job and Esper can be a good choice. Getting data through the Ganglia API is doable, although not optimal. The "pull" nature of a REST API is probably not the best basis for building a "stream" of monitoring events, but it's a minor issue as long as the API performs well with high-frequency queries and is able to provide many results in a single request.

Laurence

  • As a general comment, we should try to avoid running services in the provider itself (pets) as it will increase the operations overhead.
  • Only monitoring information that is useful should be exported from the provider.

Ryan

  • Having monitoring provided as a complete service would be especially valuable, so that users would only have to apply a small configuration change in their VM and stats would start showing up in the central monitoring service. It sounds like this is the intent.
  • I also agree that Whisper and Graphite are appealing and RRD has some limitations. In our experience, any issue that temporarily inhibits the ability of the system to send monitoring information will cause a blank period to show up in the monitoring, because of the inability to back-fill RRD data. Such periods are usually precisely when you want to be able to monitor the system the most, e.g. to see if it was swapping, having a network issue, etc.

Cristovao

-- JuliaAndreeva - 10 Jan 2014
Topic revision: r8 - 2014-03-21 - CristovaoCordeiro
 