Week of 140526

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday



  • local: Ben (CERN Grid Services), Maria A (SCOD), Maarten (ALICE), Felix (ASGC), Xavi (Storage), Maria D (GGUS), Pablo (Grid Monitoring + GGUS)
  • remote: Sang-Un (KISTI), Roger (NDGF), Stefano (CMS), Salvatore (CNAF), Rolf (IN2P3), Onno (NL-T1), Kai (ATLAS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
    • T0/T1s
      • BNL short DDM problem on Friday due to dCache headnode failure (solved: GGUS:105640)

  • CMS reports (raw view) -
    • Major issue with Argus identified last Friday by Middleware Support and the Rome admin while following up on GGUS:105545 (see last Thursday's report). It is the likely reason for a large part of our troubles with glexec. We are particularly concerned about glexec failures at the end of the user job: the pilot cannot regain the user identity to transfer output and clean up, so the job fails and all the resources it used are wasted. A glexec failure at user payload start is less severe; we handle it by waiting and retrying.
      • The story is in GGUS:105597. Problem description from there: PEP daemon may be throttled by outgoing OCSP requests (VOMS and CAnL certificate validators appear to be serialized, so a single rogue OCSP responder can bring the PEP daemon to a halt :-()
        • In English: one CA failing to reply quickly to a certificate validation request blocks the process, and all authz requests pending at that time ultimately fail. Sites are affected at random, depending on whether they get jobs from users with certificates from the non-responding CA.
      • Underlying technical issue: GGUS:105666, "Argus PEP incorrectly serializes certificate validation".
      • A robust fix is needed asap. What is the mechanism nowadays to prioritize and follow up on middleware issues affecting WLCG?

  • ALICE -
    • NTR

During the meeting, the Argus issue reported by CMS is explained by Maarten, who is part of the Argus GGUS SU and is aware of the problem. The main problem is in the CAnL component, where OCSP support is enabled by default. This will need to be disabled to fix the issue. Stefano asks how the priority could be raised to make sure the fix is provided asap. Maarten replies that this is already taken care of, as the issue affects many sites and experiments, not only CMS. MariaA asks whether the developers have provided a release date for a fix. Maarten says that not all the developers concerned have replied yet. MariaA also asks how WLCG could publicize this known issue so that other sites and experiments do not spend time debugging and understanding it. Argus seems to have suffered from other instabilities recently, so it is not always clear whether a given Argus problem is due to this issue or not. To be followed up offline.
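
For readers unfamiliar with the failure mode described above, the following is a minimal toy model, not Argus or CAnL code. It assumes a validator that holds one shared lock around every certificate check, and it replaces the OCSP lookup with a sleep. The CA names, delays and function names are all hypothetical. It only shows why one slow responder can stall unrelated requests when validation is serialized.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical OCSP response times in seconds; "SlowCA" plays the rogue responder.
RESPONDER_DELAY = {"FastCA": 0.01, "SlowCA": 0.4}

_validator_lock = threading.Lock()


def ocsp_check(ca):
    """Stand-in for a network OCSP lookup: sleeps for the CA's response time."""
    time.sleep(RESPONDER_DELAY[ca])
    return True


def validate_serialized(ca):
    """One shared lock: a slow lookup delays every request queued behind it."""
    with _validator_lock:
        return ocsp_check(ca)


def validate_concurrent(ca):
    """No shared lock: each request waits only for its own CA."""
    return ocsp_check(ca)


def latencies(validate, cas, stagger=0.02):
    """Start one request per CA, `stagger` seconds apart; return each latency in seconds."""
    def run(ca):
        begin = time.monotonic()
        validate(ca)
        return time.monotonic() - begin

    with ThreadPoolExecutor(max_workers=len(cas)) as pool:
        futures = []
        for ca in cas:
            futures.append(pool.submit(run, ca))
            time.sleep(stagger)  # let earlier requests reach the validator first
        return [f.result() for f in futures]


if __name__ == "__main__":
    requests = ["SlowCA", "FastCA", "FastCA", "FastCA"]
    for name, validate in [("serialized", validate_serialized),
                           ("concurrent", validate_concurrent)]:
        times = latencies(validate, requests)
        print(name, [f"{ca}: {t:.2f}s" for ca, t in zip(requests, times)])
```

In the serialized run, the FastCA requests that arrive after the SlowCA request wait for it to finish; in the concurrent run they complete in roughly their own response time. A real PEP daemon would in addition time out and fail the pending requests, which this sketch does not model.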

Sites / Services round table:

  • BNL: Not present
  • FNAL: Not present
  • IN2P3: NTR
  • JINR: Not present
  • KIT: Not present
  • NL-T1: NTR
  • PIC: Not present
  • RAL: Not present
  • RRC-KI: Not present
  • TRIUMF: Not present

Central Services:

  • GGUS: Release done this morning at 06:38 UTC. In this release: new NGI_China, decommissioning of the CMS Savannah bridge, and several minor improvements in the CMS forms. See the release notes for more details.
    • Alarm for NGI_IT still open
    • Alarms for UK and US will be done tomorrow

  • CERN Grid services: Ben reports a new WMS upgrade and the ramping down of SL5 capacity. Draining of the SL5 queues is planned to start on the 19th of June. Maarten asks whether this means that SL5 capacity will no longer be available as of that date. Ben clarifies that this only affects job submission through the old SLC5 CEs (ce201 ... ce207). To be followed up offline.

  • Storage: Xavi reports on a Castor CMS upgrade that took place during the day and took a bit longer than expected due to an unrelated problem with Castor-Public that needed urgent investigation. The delay had no impact and was properly announced on the SSB. ATLAS EOS will be updated on 27.05.2014 from 10:00 to 10:30.


  • Next meeting on Friday

Thursday: Ascension holiday

  • The meeting will be held on Friday instead.



  • local:
  • remote:

Experiments round table:

  • ALICE -

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
  • IN2P3:
  • JINR:
  • KIT:
  • KISTI:
  • NDGF:
  • NL-T1:
  • PIC:
  • RAL:
  • RRC-KI:
  • OSG:
  • CERN batch and grid services:
  • CERN storage services:
  • Grid Monitoring:
  • GGUS:
  • Databases:
  • MW Officer:


Topic revision: r8 - 2014-05-27 - MariaALANDESPRADILLO