• Title
Towards HammerCloud 5: Progress and Experience in Grid Site Testing
• Abstract
With the exponential growth of LHC (Large Hadron Collider) data in 2011,
and more to come in 2012, distributed computing has become the
established way to analyse collider data. The ATLAS grid infrastructure
includes more than 80 sites worldwide, ranging from large national
computing centers to smaller university clusters. These facilities are
used for data reconstruction and simulation, which are centrally managed
by the ATLAS production system, and for distributed user analysis. To
ensure the smooth operation of such a complex system, regular tests of
all sites are necessary to validate each site's ability to successfully
execute user and production jobs. We report on the development, optimization
and results of an automated functional testing suite using the HammerCloud
framework. Functional tests are short, lightweight applications covering
typical user analysis and production schemes, which are periodically
submitted to all ATLAS grid sites. Results from those tests are collected
and used to evaluate site performance. Sites that fail or are unable to
run the tests are automatically excluded from the PanDA brokerage system,
which prevents user and production jobs from being sent to problematic sites.
We show that stricter exclusion policies help to increase grid reliability,
and that the percentage of user and production jobs aborted due to network or
storage failures can be significantly reduced using such a system.
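The exclusion step described above can be illustrated with a minimal sketch. The abstract does not specify the actual policy, so everything here is an assumption made for illustration: the `JobResult` record, the `sites_to_exclude` function, the `max_failures` threshold and the site names are all hypothetical, and no real HammerCloud or PanDA interface is used. In this sketch, a "stricter" policy simply means a lower failure threshold.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class JobResult:
    """Outcome of one functional test job (hypothetical record)."""
    site: str
    succeeded: bool


def sites_to_exclude(results: Iterable[JobResult], max_failures: int = 2) -> List[str]:
    """Return, in sorted order, the sites whose number of failed test jobs
    reaches max_failures. The threshold rule is an assumed policy."""
    failures = {}
    for result in results:
        # Register every site seen, then count only its failed jobs.
        failures.setdefault(result.site, 0)
        if not result.succeeded:
            failures[result.site] += 1
    return sorted(site for site, count in failures.items() if count >= max_failures)


if __name__ == "__main__":
    results = [
        JobResult("SITE_A", True),
        JobResult("SITE_A", False),
        JobResult("SITE_B", False),
        JobResult("SITE_B", False),
    ]
    print(sites_to_exclude(results, max_failures=2))  # ['SITE_B']
    print(sites_to_exclude(results, max_failures=1))  # ['SITE_A', 'SITE_B']
```

Lowering `max_failures` from 2 to 1 excludes a site after a single failed test, which trades some available capacity for fewer user and production jobs landing on a problematic site.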
• Author
Federica Legger (LMU Munich)
• Co-authors
Johannes Elmsheuser (LMU Munich)
Ramon Medrano Llamas (CERN)
Gianfranco Sciacca (Bern)
Daniel van der Ster (CERN)
• Presentation type (parallel or poster)
parallel
-- DanielVanDerSter - 06-Oct-2011