Worker Node testing for WLCG
- Note: write access for external collaborators can be obtained here.
Introduction
As of mid-September 2012, most of the WLCG sites in EGI are still running the
old gLite 3.2 WN version on their worker nodes, despite various issues:
- The old GFAL/lcg_util code has known bugs that are only fixed in EMI/UMD releases of the WN.
- New products like GFAL2 and features like Xrootd support and federation are not getting real exposure in the production environment.
- Developers who implemented new features (often at our request) may become unavailable once the EMI project has ended.
- It becomes hard to maintain the old build infrastructure and expertise for security patches, should they be needed.
- Even though the old code may be "good enough" for current usage by ATLAS and CMS, it certainly is not for the many other VOs that most EGI sites need to support.
ALICE and LHCb are much less affected, at least for SL5, because their jobs bring essentially everything they need with them.
For the SL6 port, LHCb will also benefit from corresponding test queues.
In the spring of 2012 an initiative was launched to get the EMI-1/UMD-1 WN validated
by ATLAS and CMS on a set of sites that together cover all of the relevant SE types:
- BeStMan (as part of EOS)
- CASTOR
- dCache
- DPM
- EOS
- StoRM
Due to other activities with higher priorities at that time, the validation was only partially
completed, allowing e.g. CNAF and a few CMS T2 sites to move their WNs to the EMI-1/UMD-1 release.
We now need to restart this activity and keep testing further WN updates regularly,
so that we discover early whether a particular update breaks an experiment workflow.
The testing would be done through HammerCloud, and participating sites would set up
small, essentially permanent test queues for the experiments they support and apply WN
updates (automatically?) as they appear in the EMI-2 testing repository.
Meanwhile the EMI-2/UMD-2 WN has been released, and it has a much longer lifetime
than the release that was tested earlier, so we should concentrate on that now.
The OS will be mainly SL5 for the time being.
Sites are welcome to join this effort!
Participating sites and queues
| SE type | VOs | Site | CE + queue name | WN version | ATLAS status | CMS status | LHCb status |
| CASTOR | atlas, cms | RAL | lcgce03.gridpp.rl.ac.uk:8443/cream-pbs-gridTest lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-gridTest lcgce07.gridpp.rl.ac.uk:8443/cream-pbs-gridTest lcgce08.gridpp.rl.ac.uk:8443/cream-pbs-gridTest lcgce09.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | EMI-WN 2.0.0 | | | |
| dCache | atlas, cms | DESY | grid-cr2.desy.de:8443/cream-pbs-emi2-sl6 | EMI-WN 2.0.0 SL6 | | | |
| dCache | atlas | TRIUMF | ce1.triumf.ca:8443/cream-pbs-test | EMI-WN 2.0.0 | | | |
| DPM | atlas, cms, lhcb | Brunel | dc2-grid-65.brunel.ac.uk:8443/cream-pbs-atlas dc2-grid-65.brunel.ac.uk:8443/cream-pbs-cms dc2-grid-65.brunel.ac.uk:8443/cream-pbs-lhcb | EMI-WN 2.0.0 SL6 | | | |
| DPM | atlas, lhcb | Liverpool | hepgrid5.ph.liv.ac.uk:8443/cream-pbs-long | EMI-WN 2.0.0 | | | |
| DPM | atlas, lhcb | Manchester | vm3.tier2.hep.manchester.ac.uk:8443/cream-pbs-long | EMI-WN 2.2.0 | | | |
| DPM | atlas, cms | Oxford | t2ce02.physics.ox.ac.uk:8443/cream-pbs-shortfive t2ce02.physics.ox.ac.uk:8443/cream-pbs-mediumfive t2ce02.physics.ox.ac.uk:8443/cream-pbs-longfive | EMI-WN 2.0.0 | | | |
| StoRM | atlas, cms | CNAF | ce03-lcg.cr.cnaf.infn.it:8443/cream-lsf-emitest | EMI-WN 2.0.0 | | | |
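The "CE + queue name" entries listed above all appear to follow the pattern host:port/cream-<batch system>-<queue>. For anyone scripting against this list (for example to feed the queues into a test submission tool), the following is a minimal sketch of how such an entry could be split into its parts. The names CreamQueue, parse_ce_queue and parse_ce_cell are hypothetical helpers introduced here for illustration, and the pattern is inferred from the entries on this page rather than taken from a formal specification.

```python
import re
from typing import List, NamedTuple


class CreamQueue(NamedTuple):
    """One CE endpoint split into its parts (hypothetical helper type)."""
    host: str
    port: int
    lrms: str   # batch system, e.g. "pbs" or "lsf"
    queue: str  # queue name, may itself contain hyphens


# Assumed pattern, inferred from the entries on this page:
#   host:port/cream-<lrms>-<queue>
_CE_PATTERN = re.compile(
    r"^(?P<host>[^:/\s]+):(?P<port>\d+)/cream-(?P<lrms>[^-\s]+)-(?P<queue>\S+)$"
)


def parse_ce_queue(entry: str) -> CreamQueue:
    """Parse a single 'host:port/cream-lrms-queue' string."""
    match = _CE_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"not a recognised CE queue entry: {entry!r}")
    return CreamQueue(
        host=match.group("host"),
        port=int(match.group("port")),
        lrms=match.group("lrms"),
        queue=match.group("queue"),
    )


def parse_ce_cell(cell: str) -> List[CreamQueue]:
    """Parse a table cell holding several entries separated by spaces or commas."""
    tokens = [tok for tok in re.split(r"[,\s]+", cell.strip()) if tok]
    return [parse_ce_queue(tok) for tok in tokens]
```

For instance, parse_ce_queue("grid-cr2.desy.de:8443/cream-pbs-emi2-sl6") yields the batch system "pbs" and the queue "emi2-sl6"; only the first hyphen after "cream-" is treated as the separator, so queue names containing hyphens stay intact.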
ATLAS test details
CMS test details
Summary of fixes to data management components
The latest EMI-2 update contains fixes for all known issues related to GFAL/lcg_util and the DPM/LFC clients.
Result tables (match your site here!)
- NOTE: ATLAS found gsidcap access failing for limited (WN) proxies and opened GGUS:87065 for the dCache developers.
- Fixed in EMI-2 Update 6, released Nov 26.
- CMS have also seen this issue, but currently no CMS site is using that protocol.
- For ATLAS sites where only plain dcap is used, the Oct release was already OK.
- CMS workaround for DPM sites documented here.
- See aforementioned CMS workaround for DPM sites.
- Note: with EMI-1 an upgrade to lcg_util 1.13.9 may still be needed.
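The lcg_util note above implies a simple version comparison on EMI-1 worker nodes. A minimal sketch of that check follows; it assumes the installed version string has already been obtained by some other means (for example from the package manager, which is not shown here), and the function names are hypothetical. Versions are compared numerically, component by component, so that 1.13.10 is correctly treated as newer than 1.13.9.

```python
from typing import Tuple

# Minimum lcg_util version named in the note above.
MIN_LCG_UTIL = "1.13.9"


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.13.9' (optionally followed by '-<release>') into (1, 13, 9)."""
    upstream = version.strip().split("-", 1)[0]  # drop any package release suffix
    return tuple(int(part) for part in upstream.split("."))


def needs_lcg_util_upgrade(installed: str, minimum: str = MIN_LCG_UTIL) -> bool:
    """Return True if the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)
```

A plain string comparison would get this wrong ("1.13.10" sorts before "1.13.9" lexically), which is why the sketch converts each component to an integer first.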