Service monitoring in the LHC experiments

The computing infrastructure of the LHC experiments is a complex set of services distributed across the computing centers of the Worldwide LHC Computing Grid, and it needs to run with full reliability. It is therefore crucial to offer a unified view to shifters, who are generally not experts in the individual services, so that they can follow the status of resources and the health of critical systems and alert the experts whenever a system becomes unavailable.

Several of the main LHC experiments have chosen to build their service monitoring on top of the flexible Service Level Status (SLS) framework commonly used in CERN IT. Based on examples from the ATLAS, CMS and LHCb experiments, this contribution describes the complete development process of such a service monitoring instance and explains the existing deployment models that can be adopted to publish status reports to SLS. Particular focus is given to the simple, easily installable software package developed in the ATLAS Distributed Computing community, which passes the health reports through the MSG messaging system and publishes them to SLS on a common web server.

  • Track: "Distributed Processing and Analysis on Grids and Clouds" or "Computer Facilities, Production Grids and Networking"

  • Presentation type: Poster

  • Authors: Fernando Barreiro, Alessandro di Girolamo, Peter Kreuzer, Stefan Roiser, Diego da Silva Gomez, Vincent Bernardoff

