Service monitoring in the LHC experiments
The LHC experiments' computing infrastructure is distributed across the computing centers of the Worldwide LHC Computing Grid and must run with high reliability. It is therefore crucial to offer a unified view to shifters, who are generally not experts in the individual services, so that they can follow the status of resources and the health of critical systems and alert the experts whenever a system becomes unavailable.
Several experiments have chosen to build their service monitoring on top of the flexible Service Level Status (SLS) framework commonly used in CERN IT. Based on examples from ATLAS, CMS and LHCb, this contribution will describe the complete development process of a service monitoring instance and explain the options and deployment models that can be adopted. We will also describe the software package used in ATLAS Distributed Computing to send health reports through the MSG messaging system and publish them to SLS on a lightweight web server.
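As an illustration of what a health report might contain, the following is a minimal sketch in Python. It assumes a service's health is summarised as the percentage of passing checks and serialised as a small XML document. The element names (`serviceupdate`, `id`, `availability`, `timestamp`, `checks`), the helper functions and the service name are hypothetical and chosen for illustration only; they are not taken from the SLS schema or from the ATLAS package, and transport through MSG is not shown.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def availability(checks):
    """Return the percentage of passing checks as an integer in 0..100."""
    if not checks:
        return 0  # no information is treated as unavailable (an assumption)
    passed = sum(1 for ok in checks.values() if ok)
    return round(100 * passed / len(checks))


def build_report(service_id, checks, now=None):
    """Serialise one health report as an XML string (hypothetical layout)."""
    now = now or datetime.now(timezone.utc)
    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = service_id
    ET.SubElement(root, "availability").text = str(availability(checks))
    ET.SubElement(root, "timestamp").text = now.strftime("%Y-%m-%dT%H:%M:%S")
    details = ET.SubElement(root, "checks")
    for name, ok in sorted(checks.items()):
        check = ET.SubElement(details, "check", name=name)
        check.text = "ok" if ok else "failed"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical service with three probes, one of which fails.
    print(build_report("Example_Service", {"db": True, "web": True, "queue": False}))
```

A file of this kind could then be written to a directory served by a lightweight web server, from which the monitoring framework would collect it periodically.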
- Track: "Distributed Processing and Analysis on Grids and Clouds" or "Computer Facilities, Production Grids and Networking"
- Presentation type: Poster
- Authors: Fernando Barreiro, Alessandro di Girolamo, Peter Kreuzer, Stefan Roiser, Diego da Silva Gomes, Vincent Bernardoff, Josep Flix
-- FernandoHaraldBarreiroMegino - 20-Sep-2011