Week of 131021

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: (MariaD/SCOD, Przemek/DB, Eddy/Dashboards, AlexandreL/Grid Services, Jan/Storage)
  • remote: (Xavier/KIT, DavidM/CMS, Lisa/FNAL, Wei-Jen/ASGC, Rolf/IN2P3, Tiju/RAL, Rob/OSG, Salvatore/CNAF, Christian/NDGF, Vladimir/LHCb, Pepe/PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • NTR
      • We have ATLAS S&C week so apologies in advance if someone can't make the meeting.

  • CMS reports (raw view) -
    • Mostly incident free weekend, MC production and 2011 Legacy rereco ongoing
    • GGUS:98180 Jobs failing at KIT due to CMS failing to request tape family for MC workflow
    • GGUS:98214 HC Glidein tests failing due to missing file at RAL -- appears fixed, green since Sat afternoon or so...

  • ALICE -
    • CERN: significant fraction of jobs (~35%) running on SLC6 with CVMFS now; so far the efficiencies have been lower and the failure rates have been higher than on SLC5 with Torrent; not clear yet if that was accidental or systematic; under investigation

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and Simulation(T2)
    • T0: NTR
    • T1:
      • RAL: Pilots aborted during the weekend (GGUS:98210) - Solved. Pilots also aborted at the ARC CE (Condor queue) today.
      • KIT: Occasional long staging times will be discussed in a new GGUS ticket, which LHCb will open.

Sites / Services round table:

  • ASGC: ntr
  • BNL: not connected
  • CNAF: ntr
  • FNAL: ntr
  • NL_T1: not connected
  • IN2P3: ntr
  • KIT: ntr
  • PIC: ntr
  • KISTI: not connected
  • OSG: ntr
  • RAL: FTS will be down on Wednesday and some other services will be at risk. All relevant entries are in GOCDB.
  • NDGF: ntr

  • CERN:
    • Storage: Today's CASTOR ATLAS update is done. An EOS ATLAS update is planned for Wednesday, along with other EOS interventions; all are published in GOCDB.
    • Databases: ntr
    • Grid Services: ntr
    • Dashboards: ntr

  • GGUS: File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings. There will be a GGUS Release this Wednesday 23 October, including the usual round of test ALARMs. This will be the first release for which KISTI is tested.

AOB:

Thursday

Attendance:

  • local: (MariaD/SCOD, Przemek/DB, Eddy/Dashboards, Jan/Storage, Alexandre/Grid Services)
  • remote: (Oli/CMS, Rolf/IN2P3, Lisa/FNAL, Saverio/CNAF, Onno/NL_T1, Gareth/RAL, Wei-jen/ASGC, Pavel/KIT, Christian/NDGF, Sang-Un/KISTI, Vladimir/LHCb, Rob/OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Nothing to Report
      • We have ATLAS S&C week so apologies in advance if someone can't make the meeting.

  • CMS reports (raw view) -
    • Some minor issues, MC production and 2011 Legacy rereco continue
    • GGUS:98256 Debugging with Fermilab opportunistic glideIn WMS jobs on FermiGrid

  • ALICE -
    • CERN: still investigating why SLC6 jobs have lower efficiencies and higher failure rates than SLC5 jobs. Discussion at the meeting suggested that a GGUS ticket would be helpful to better follow progress.

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and Simulation(T2)
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: not connected
  • CNAF: ntr
  • FNAL: ntr
  • NL_T1: Following a power outage, a hardware issue caused multiple dCache pool nodes to go down and fail to reboot. The vendor shipped 4 new storage components, which will be installed tomorrow, but the intervention may not be completed before the end of the weekend. GGUS:98370 describes this incident. GOCDB is updated.
  • IN2P3: ntr
  • KIT: The WNs' transition to SLC6 is progressing. 50% of the nodes are done. By next Monday 28/10, 100% of the WNs will be done, i.e. no SLC5 left on site.
  • PIC: not connected.
  • KISTI: Last Friday's downtime is over. The network is being monitored. The first GGUS ALARM test was successful.
  • OSG: The GGUS-Footprints' interface worked very well after the release.
  • RAL: There will be UPS (Uninterruptible Power Supply) work on Tuesday 5/11, followed by tests. CASTOR storage and batch services will be down. Details will follow.
  • NDGF: ntr

  • CERN:
    • Databases: ntr
    • Storage: The EOS ATLAS upgrade was done yesterday. The EOS client was found to need an update, which was also applied. The CASTOR update for the other 3 experiments will be done on 28-29/10.
    • Grid Services: ntr
    • Dashboards: ntr

  • GGUS: The ALARM tests following yesterday's release went well. Please read the Did You Know?... of this month. The list of released features is HERE.

AOB:

Topic revision: r9 - 2013-10-24 - MariaDimou
 