HammerCloud abstract for EGI CF 2012

Title

"Experience in Grid Site Testing for High Energy Physics with HammerCloud"

Overview

HammerCloud is a grid site testing service for the ATLAS, CMS and LHCb experiments centered at CERN in Geneva. It is provided as an online service that lets operation managers, site administrators and grid experts in general run on-demand tests of their computing facilities to validate them and measure their performance. In addition, HammerCloud runs automated tests to check the availability and reliability of the sites under different circumstances. The tests consist of real analysis code provided by the physics community, so that grid sites are exercised with real-world use cases. HammerCloud has been employed in HEP for more than 2 years and has helped increase the performance and reliability seen by grid users. In this work we will present the lessons learnt while deploying, optimizing and evolving the system for the three VOs, together with the development plans for the near- and mid-term future.

Description

Frequent validation and stress testing of the network, storage and CPU resources of a grid site are essential to achieve high performance and reliability. HammerCloud was previously introduced with the goal of enabling VO and site administrators to run such tests in an automated or on-demand manner. The ATLAS, CMS and LHCb experiments have all successfully integrated it into their grid operations infrastructures. This work will present the experience of running HammerCloud at full scale for more than 3 years, along with solutions to the scalability issues faced by the service. First, we will show the particular challenges faced when integrating with CMS and LHCb offline computing, including customized dashboards that show site validation reports for the VOs and a new API that integrates tightly with the LHCbDIRAC Resource Status System. Next, a study of the automatic site exclusion component used by ATLAS will be presented, along with results from tuning the exclusion policies. A study of the historical test results for ATLAS, CMS and LHCb will also be presented, including comparisons between the experiments' grid availabilities and a search for site-based or temporal failure correlations. Finally, we will look at future plans that will allow users to gain new insights into the test results; these include developments to allow increased testing concurrency, a larger number of metrics recorded per test job (up to hundreds), and a larger volume of historical job information (up to many millions of jobs per VO).

Impact

HammerCloud runs between 40,000 and 60,000 grid jobs per day for the three VOs that it is currently testing. In the ATLAS instance, the auto-exclusion feature was deployed more than one year ago, with very promising results for grid reliability: the grid error rate has decreased by up to 50%.

Conclusions

HammerCloud has proven to be a fundamental tool for grid operations: it is not only helpful for the commissioning of new sites and for upgrade and tuning campaigns, but also necessary for monitoring the availability and reliability of sites, and it provides useful insights for daily operations on grid sites.

Track classification

Operational services and infrastructure.

Comments

(None).

-- RamonMedranoLlamas - 11-Nov-2011

Topic revision: r2 - 2011-11-14 - unknown
 