Week of 131007

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance:

  • local: AndreaS, Jan, Maarten, Robert, Maria, Pablo, Alex
  • remote: Dave/CMS, Vladimir/LHCb, Michael/BNL, Wei-Jen/ASGC, Lucia/CNAF, Lisa/FNAL, Rolf/IN2P3, Xavier/KIT, Sang-Un/KISTI, Ulf/NDGF, Onno/NL-T1, Tiju/RAL, Rob/OSG, Pepe/PIC
Experiments round table:

  • CMS reports (raw view) -
    • A firewall rule was changed on one of the 3 Monalisa/Dashboard servers (~Friday?), leading to loss of Hammercloud/Glidein test result data. As a result, many sites were shown as failing this test over the weekend. Under investigation; these metrics will need to be corrected for the affected period.
    • GGUS ticket roundup -- GGUS:97732 , GGUS:97677 , GGUS:97705 from last week's report still active.
      • GGUS:97786 and GGUS:97787 BDII misreporting site status -- seems related to earlier solved ticket: GGUS:97601
      • GGUS:97814 lcg-voms.cern.ch appears to have a bad host certificate -- voms.cern.ch though is OK, so users seem to have a 50-50 chance to obtain a proxy...
Pablo: the firewall problem has been solved. Dave still sees bad numbers, but improving. Need to verify that there are no other problems.

  • ALICE -
    • CERN: lcg-voms.cern.ch host certs updated with wrong DNs, causing job submission failures for ALICE around the grid (alarm GGUS:97815)
    • alarm tickets do not seem to work for CERN (GGUS:97817)
Maria: the problem with the ALARM ticket is not caused by GGUS or SNOW but by the expansion of the ALICE e-group into individual email addresses. Need to investigate why it happened, as nothing was changed recently.

Maarten requests a SIR for the lcg-voms.cern.ch incident. Everybody agrees with the request.

  • LHCb reports (raw view) -
    • Main activity is MC production.
    • Fall incremental stripping campaign was launched Oct 3
    • T0:
      • Pilots aborted at ce202 (GGUS:97736), fixed
      • FTS3 transfers stopped before week-end (GGUS:97796), problem fixed
    • T1:
      • IN2P3: Alarm ticket (GGUS:97743) b/c of wrong protocol returned by SRM, promptly fixed
      • IN2P3: Pilots aborting b/c of issue with BLAH server (GGUS:97804), problem fixed immediately
      • GRIDKA: Problem with tape recall during the week-end (GGUS:97795), problem fixed during week-end
    • Other: Asked the GGUS team about the list of LHCb "GGUS alarmers", which seems not to be in sync with VOMS (GGUS:97755)
Maria: the last issue is due to a synchronisation problem between VOMRS and VOMS-Admin, as she commented in the ticket, which is now assigned to ROC-CERN and to the CERN VOMS experts.

Sites / Services round table:

  • ASGC: ntr
  • BNL: the lcg-voms.cern.ch problem affected us when FTS proxies were renewed, which caused failed transfers and job failures due to LFC authentication errors for about two hours. Both GGUS tickets (GGUS:97808 and GGUS:97813) were raised against BNL this morning because of the VOMS problem at CERN. There was no problem at BNL. Maarten: about the BNL VOMS server issue reported last week, experts converged on a plan requiring all sites to update their configuration so that BNL VOMS proxies are recognised again. A broadcast will be sent after the meeting.
  • CNAF: working hard to align our accounting system to the EGI portal, as it is missing about 5M jobs from CNAF. The two portals should be consistent in a couple of days.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • KISTI: ntr
  • NDGF: there was a short network outage in the night between Friday and Saturday, due to one of the OPN routers, which had to be rebooted. Apparently this went unnoticed.
  • NL-T1: reminder that tomorrow morning we will switch from the OPN to the LHCONE; it should be transparent.
  • PIC: ntr
  • RAL: tomorrow morning we will upgrade the primary OPN link; traffic will switch to the backup connection, so it should be transparent.
  • OSG: tomorrow morning we will have a maintenance intervention on our monitoring system. It should be transparent and we are in contact with the SAM team.
  • CERN batch and grid services: nothing to add
  • CERN storage services:
    • when updating the central CASTOR nameserver, there was a glitch that affected the extended group administration rights. It was fixed at 11:30
    • experiments should comment on the dates proposed last week for the planned CASTOR upgrades, including a downtime of several hours. Jan will send an email to the experiments so they can answer offline.
  • Dashboards: this Tuesday or Wednesday we will clean up detailed SAM test outputs older than three months from the SAM database.

AOB:

Thursday

Attendance:

  • local: AndreaS, Maarten, Stefan/LHCb, Alessandro/ATLAS, Luca, Alex, MariaD
  • remote: Michael/BNL, Dennis/NL-T1, Xavier/KIT, David/CMS, Rolf/IN2P3-CC, Kyle/OSG, Lisa/FNAL, Ulf/NDGF, Tiju/RAL, Jeremy/GridPP, Wei-Jen/ASGC

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • The problems mentioned on Monday as related to BNL (BNL LFC glitch and BNL FTS proxy expired) were all due to the lcg-voms issue tracked in GGUS:97821
      • Reboot of all ATLAS central service machines went smoothly on Wednesday (9th Oct.)
    • T0
      • Staging Failures at CERN Castor are still there, CASTOR team checking, GGUS:97662
    • T1/T2
      • SARA disk pool problem on Wednesday, GGUS:97885
      • INFN-T1 DATADISK issue was actually a network switch problem, solved the next morning.

  • CMS reports (raw view) -
    • 2011 Legacy rereco & 13 TeV Run 2 MC production ongoing
    • A few issues in the last couple of days:
      • GGUS:97860 -- CVMFS bad nodes at KIT -- now solved.
      • GGUS:97866, GGUS:97873, GGUS:97874 SAM test metrics failing briefly at FNAL due to bad firewall rules -- solved.
      • Working to revise the site readiness metrics that failed due to the Dashboard firewall problem reported at the last meeting -- covering Friday to ~6 AM Monday.

  • ALICE -
    • Major outage: all activity stopped due to unexpected expiration of the AliEn CA at 11:33 CEST this morning! Experts are working on it...

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and Simulation
    • T0:
      • All FTS transfers between all storage elements have been switched to FTS3 in production (a few sites were missing)
    • T1:
      • CNAF: Short interruption on Tuesday because of a problem with the switch connecting to storage, fixed promptly.
      • IN2P3: Pilots aborted because of BLAH server issue, problem fixed (GGUS:97804)

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • FNAL: ntr
  • IN2P3-CC: ntr
  • KIT: ntr
  • KISTI: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • GridPP: ntr
  • CERN batch and grid services: produced the requested SIR (see below). In the future the certificate renewal for VOMS will be automated. [Maarten: will correct some inaccuracies in the SIR]
  • CERN storage services: ntr

AOB:

Topic revision: r17 - 2013-10-10 - AndreaSciaba
 