Week of 140526

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Ben (CERN Grid Services), Maria A (SCOD), Maarten (ALICE), Felix (ASGC), Xavi (Storage), Maria D (GGUS), Pablo (Grid Monitoring + GGUS)
  • remote: Sang-Un (KISTI), Roger (NDGF), Stefano (CMS), Salvatore (CNAF), Rolf (IN2P3), Onno (NL-T1), Kai (ATLAS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
    • T0/T1s
      • BNL: short DDM problem on Friday due to a dCache head node failure (solved: GGUS:105640)

  • CMS reports (raw view) -
    • Major issue with Argus identified last Friday by Middleware Support and the Rome admin while following up on GGUS:105545 (see last Thursday's report); this is the likely reason for a large part of our troubles with glexec. We are particularly concerned about glexec failures at the end of the user job: the pilot cannot regain the user identity to transfer the output and clean up, so the job fails, wasting all the resources it used. A glexec failure at user payload start is less severe and is handled by wait-and-retry.
      • The story is in GGUS:105597. Problem description: the PEP daemon may be throttled by outgoing OCSP requests (the VOMS and CAnL certificate validators appear to be serialized, so a single rogue OCSP responder can bring the PEP daemon to a halt)
        • In plain terms: one CA failing to reply quickly to a certificate validation request blocks the process, and all authorization requests pending at that time ultimately fail. Sites are affected at random, depending on whether they receive jobs from users whose certificates were issued by the non-responding CA
      • Underlying technical issue: GGUS:105666 (Argus PEP incorrectly serializes certificate validation)
      • A robust fix is needed ASAP. What is the current mechanism to prioritize and follow up on middleware issues affecting WLCG?

  • ALICE -
    • NTR

During the meeting the Argus issue reported by CMS is explained by Maarten, who is part of the Argus GGUS support unit (SU) and is aware of the problem. The main problem is in the CAnL component, where OCSP support is enabled by default; it will need to be disabled to fix the issue. Stefano asks how the priority could be raised to make sure the fix is provided ASAP. Maarten replies that this is already being taken care of, as it affects many sites and experiments, not only CMS. Maria A asks whether the developers have provided a date for releasing a fix; Maarten says that not all the developers concerned have replied yet. Maria A also asks how WLCG could announce this known issue so that other sites and experiments do not spend time debugging and understanding it themselves. Argus seems to have suffered from other instabilities recently as well, so it is not always clear whether a given Argus problem is due to this issue or not. To be followed up offline.
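
A minimal, purely illustrative sketch of this failure mode (a toy Python model, not the actual Argus or CAnL code; the responder latencies, the client deadline and the request names are invented for the example): when certificate validations are processed serially and each one may perform a blocking OCSP call, a single slow responder delays every request queued behind it past its deadline, so all of them fail, not only the one whose CA is misbehaving.

    import queue
    import threading
    import time

    # Hypothetical OCSP responder latencies (seconds) and PEP client deadline.
    OCSP_DELAY = {"CA-fast": 0.1, "CA-slow": 5.0}
    CLIENT_DEADLINE = 2.0

    requests = queue.Queue()
    results = {}

    def ocsp_check(ca):
        # Blocking revocation check against the CA's OCSP responder.
        time.sleep(OCSP_DELAY[ca])

    def serialized_validator():
        # Single worker: validations are handled one at a time, mimicking the
        # serialization described in GGUS:105666.
        while True:
            item = requests.get()
            if item is None:
                break
            req_id, ca, submitted = item
            ocsp_check(ca)
            results[req_id] = (time.time() - submitted) <= CLIENT_DEADLINE

    worker = threading.Thread(target=serialized_validator)
    worker.start()

    # One request hits the slow CA; the others use a fast CA but are queued behind it.
    now = time.time()
    requests.put(("authz-1", "CA-slow", now))
    for i in range(2, 6):
        requests.put((f"authz-{i}", "CA-fast", now))
    requests.put(None)  # shutdown sentinel
    worker.join()

    for req_id, ok in sorted(results.items()):
        print(req_id, "OK" if ok else "FAILED: deadline exceeded")

The mitigation discussed above is to disable the OCSP support that CAnL enables by default; the underlying serialization itself is tracked in GGUS:105666.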

Sites / Services round table:

  • ASGC: NTR
  • BNL: Not present
  • CNAF: NTR
  • FNAL: Not present
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: NTR
  • KIT: Not present
  • NDGF: NTR
  • NL-T1: NTR
  • PIC: Not present
  • RAL: Not present
  • RRC-KI: Not present
  • TRIUMF: Not present

Central Services:

  • GGUS: Release done this morning at 06:38 UTC. In this release: the new NGI_China, the decommissioning of the CMS Savannah bridge, and several minor improvements to the CMS forms. See the release notes for more details.
    • The test alarm for NGI_IT is still open
    • The test alarms for the UK and US will be done tomorrow

  • CERN Grid services: Ben reports on a new WMS upgrade and the ramping down of SL5 capacity. The plan is to start draining the SL5 queues on the 19th of June. Maarten asks whether this means that SL5 capacity will no longer be available as of that date. Ben clarifies that this only affects job submission through the old SLC5 CEs (ce201 ... ce207). To be followed up offline.

  • Storage: Xavi reports on a Castor CMS upgrade that took place during the day and took a bit longer than expected due to an unrelated problem with Castor-Public that needed urgent investigation. The delay had no impact and was properly announced in the SSB. ATLAS EOS will be updated on 27.05.2014 from 10:00 to 10:30.

AOB:

  • Next meeting on Friday

Thursday: Ascension holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Andrea M (MW Officer), Ben (CERN batch and grid services), Felix (ASGC), Maarten (SCOD), Pablo (grid monitoring + GGUS)
  • remote: Alexei (ATLAS), Antonio (CNAF), John (RAL), Ken (CMS), Michael (BNL), Roger (NDGF), Rolf (IN2P3), Vladimir (LHCb)

Experiments round table:

  • CMS reports (raw view) -
    • It has been very quiet out there; there are scattered reports of sites running out of jobs, but we are working with the sites to understand why and to get things moving.
    • With the demise of the Savannah-GGUS bridge, we're busy learning more about GGUS. I for one started a ticket drill on the OSG; on Tuesday, I sent test tickets to all the OSG T2 sites, just asking them to report who received notification of the ticket and to figure out how to get the ticket closed. Only half the tickets have been closed so far. We'll keep working with OSG and the sites on this.

  • ALICE -
    • CERN
      • SLC6 CREAM CEs often reporting wrong job numbers in the BDII (GGUS:105855)
        • started Wed evening
        • ALICE needed to babysit VOBOXes to avoid overloading LSF with job submissions
        • looking OK again since yesterday mid evening
    • KIT
      • after the maintenance downtime a network configuration issue prevented full use of the SE
        • fixed Wed evening, thanks!
      • high network load due to usage of old SW versions by various ALICE users
        • they have been asked to switch to newer versions ASAP
        • we are looking further into preventing easy access to old versions
          • since Mon the vast majority are no longer available, but some were kept
        • the job cap has been lowered to mitigate the issue
    • NDGF
      • job failures due to many files not found in dCache; being debugged

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
    • scheduled downtime next Mon for tape server upgrade; some ALICE or ATLAS data might be temporarily unavailable
  • NL-T1:
  • OSG:
  • PIC:
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • the ATLAS LFC daemons have been switched off and the service taken out of SLS monitoring
  • CERN storage services:
  • Databases:
  • GGUS:
    • One of the test alarms done during the release took 14 hours to be acknowledged
      • CNAF will look into what went wrong there
  • Grid Monitoring: ntr
  • MW Officer:
    • the fix for the DPM 1.8.8 bug affecting CMS T2 FTS-2 transfers has been released in EPEL

AOB:
