Service monitoring in the LHC experiments

The computing infrastructure of the LHC experiments is distributed across the computing centers of the Worldwide LHC Computing Grid (WLCG) and must run with high reliability. It is therefore crucial to offer shifters, who are generally not experts in the individual services, a unified view that lets them follow the status of resources and the health of critical systems, and alert the experts whenever a system becomes unavailable.

Several experiments have chosen to build their service monitoring on top of the flexible Service Level Status (SLS) framework commonly used in CERN IT. Based on examples from ATLAS, CMS and LHCb, this contribution will describe the complete development process of a service monitoring instance and explain the options and deployment models that can be adopted. We will also describe the software package used in ATLAS Distributed Computing to send health reports through the MSG messaging system and publish them to SLS on a lightweight web server.
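As a rough illustration of the kind of health report mentioned above, the sketch below builds a small XML document for a single service and decides whether a shifter should alert an expert. It is not the ATLAS Distributed Computing package, and it does not reproduce the real SLS schema: the element names, the function names `build_health_report` and `needs_expert_alert`, the service name and the 50% threshold are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_health_report(service_id, availability, timestamp=None):
    """Return an XML string describing the health of one service.

    Element names are illustrative only, not the actual SLS schema.
    """
    if not 0 <= availability <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)

    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = service_id
    ET.SubElement(root, "availability").text = str(availability)
    ET.SubElement(root, "timestamp").text = timestamp.isoformat()
    return ET.tostring(root, encoding="unicode")


def needs_expert_alert(availability, threshold=50):
    """Hypothetical rule: alert when availability drops below the threshold."""
    return availability < threshold


if __name__ == "__main__":
    # "Example_Service" is a placeholder name, not a real monitored service.
    print(build_health_report("Example_Service", 100))
    print(needs_expert_alert(0))
```

In a real deployment the resulting document would be sent through the messaging system or served from a web server for the monitoring framework to collect; that transport step is left out here because its details are not given in this abstract.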

  • Track: "Distributed Processing and Analysis on Grids and Clouds" or "Computer Facilities, Production Grids and Networking"

  • Presentation type: Poster

  • Authors: Fernando Barreiro, Alessandro di Girolamo, Peter Kreuzer, Stefan Roiser, Diego da Silva Gomes, Vincent Bernardoff, Josep Flix

-- FernandoHaraldBarreiroMegino - 20-Sep-2011

Topic revision: r6 - 2011-10-13 - FernandoHaraldBarreiroMegino
 