Week of 130422
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The scod rota for the next few weeks is at
ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Simone (SCOD), Maarten (Alice), Felix (ASGC), Jan (CERN), Alessandro (ATLAS), Ignacio (CERN), Edie (CERN), Zbigniew (CERN), Massimo (CERN), Maria (GGUS)
- remote: Ulf (NDGF), Michael (BNL), Onno (NL-T1), Xavier (KIT), David (CMS), Rolf (IN2P3), Tiju (RAL), Salvatore (CNAF), Rob (OSG), Pepe (PIC)
Experiments round table:
- ATLAS reports (raw view) -
- Central services
- CERN network issue Saturday 17:40-20:30 CET. GGUS:93514
- ATLAS observed a drop in jobs running on the Grid during that period (the Grid was draining, with no possibility to start new jobs). On Saturday evening the problem seemed to be solved, but the same symptoms (Grid draining, no new jobs starting) reappeared around midnight and lasted until ~5am. Is CERN aware of any other issue during that period? We also observed that many services (running both on physical hardware and on VMs) were not reporting monitoring information to SLS or Lemon.
- From Ignacio: there was a different network issue in SafeHost (an intervention announced as "transparent" which turned out not to be).
- CMS reports (raw view) -
- LHC / CMS
- Rereconstruction of 2012 data is nearly done; activity at most T1s is down to the level of the ever-present "pedestal" of analysis jobs.
- CERN / central services and T0
- Network outage for a few hours on Saturday: (itssb link). Primarily affected us while the outage was occurring, recovery was swift once access was restored.
- Tier-1:
- Tier-2:
- ALICE -
- central services: a mistaken update of the job agent on Thu caused most sites essentially to get drained on Fri (most jobs quickly failed); understood and fixed Fri late afternoon
- CERN: mostly drained Sat late afternoon due to network incident, recovered mid evening
- KIT: some SE test failures were observed Fri afternoon and the concurrent jobs cap was therefore lowered to 3k for the weekend; back at 10k since 10:50 CEST
Sites / Services round table:
- NDGF: storage problem during the weekend; investigating. On Wednesday at 10:00 there will be a 3-hour SRM downtime for an upgrade.
- NL-T1: a maintenance intervention on the SARA MSS is currently ongoing. At approximately 3 PM CEST, SARA suffered a network outage lasting about 10 minutes. Things seem OK now.
- IN2P3: confirmed that IN2P3 will be able to absorb, as failover, the requests for conditions data normally served by RAL during the RAL DB intervention on Wednesday.
- RAL: reminder about the Oracle patch being applied on Wednesday
- OSG: monthly maintenance tomorrow. All machines will be restarted (should be transparent).
- ASGC: FTS crashed because of power failure (hardware).
- PIC: On May 7 there will be an intervention at the tape library. Scheduled downtime flagged in GOCDB
- CERN Storage: EOS ATLAS was not directly affected by the power cut, but was down for a few hours afterwards because of side effects from an interrupted test suite. EOS CMS was marked unavailable because of an expired certificate.
- CERN DB: services survived the network outage, except for the replication of LHCb online to offline (down for 2 hours).
- GGUS: GGUS Release this Wednesday, April 24 from 06:00 to 07:00 UTC with ALARM test round as usual. GOCDB entry is https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=121824&grid_id=0
Reminder: As announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday, last Thursday and at the WLCG Ops Coord. meeting last week, the GGUS host certificate will be renewed. This certificate is used for authenticating SOAP requests and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227
- The file ggus-tickets.xls is up to date and uploaded to page WLCGOperationsMeetings.
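Consumers of the GGUS web services may want to confirm that the renewed host certificate is in place after the Wednesday release window. A minimal sketch (not an official GGUS procedure; the hostname and helper names are illustrative, and the date format is the one returned by Python's ssl.getpeercert()):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter date, given in the format
    returned by ssl.getpeercert(), e.g. 'Apr 24 06:00:00 2013 GMT'."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y GMT")
    expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days

def fetch_not_after(host, port=443):
    """Fetch the notAfter field of the certificate presented by host:port."""
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((host, port)),
                         server_hostname=host) as s:
        return s.getpeercert()["notAfter"]
```

After the release, a client could call fetch_not_after() against the GGUS host and check that days_until_expiry() has jumped to reflect the new certificate.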
AOB:
Thursday
Attendance:
- local: Simone (SCOD), Jarka (CERN - Dashboards), Felix (ASGC), Alessandro (ATLAS), Stefano (CMS), Zbigniew (CERN DB), Ignacio (CERN PES), Maria (GGUS)
- remote: Ulf (NDGF), Xavier (KIT), Lisa (FNAL), Stefano (CMS), Ronald(NL-T1), Rob (OSG), Gareth (RAL), Rolf (IN2P3-CC)
Experiments round table:
- ATLAS reports (raw view) -
- Central services
- CERN Saturday/Sunday issue: ATLAS is still investigating why, on Sunday between 00:00 and 07:00, ATLAS resources were not fully exploited. The issue does not seem to be correlated with the SafeHost issue (which lasted from 11 PM Saturday to 2 AM Sunday). Not clear yet.
- T0/T1
- Transfer issues between SARA and IHEP GGUS:93643
- It seems to be at the FTS configuration level.
- SARA SRM issue GGUS:93551
- SRM stuck; now solved.
- PIC transfer errors to DATATAPE GGUS:93553
- "The problem was because this new file family data was not defined on our dCache system." Problem solved.
- CMS reports (raw view) -
- really nothing seems to have happened other than business as usual
- as we use glexec more and more for analysis jobs, we hit issues at sites more often; in many cases it seems to be a transient local problem. We have added retries after 30 and 60 seconds, which does not seem to help much but avoids black holes. Most likely sites need to develop expertise with glexec and ARGUS. We plan to live with this.
- Jobs are currently aborted due to this, but they can be resubmitted later and very few resources are wasted.
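The retry scheme mentioned above (retry after 30 and then 60 seconds, to ride out transient local glexec/ARGUS problems) can be sketched as follows; the function name and delay values are illustrative, not actual CMS submission code:

```python
import time

def call_with_retries(fn, delays=(30, 60)):
    """Call fn; on each failure, sleep for the next delay and retry.
    If every retry fails, the final attempt's exception propagates,
    so the job aborts and can be resubmitted later."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            time.sleep(delay)
    return fn()  # final attempt; exception propagates on failure
```

The point of the fixed delays is mainly to avoid turning a briefly misconfigured worker node into a black hole that eats jobs, rather than to guarantee eventual success.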
Sites / Services round table:
- ASGC: unscheduled network interruption on Tuesday (30 mins). Cause still unknown, under investigation.
- NDGF: tried to upgrade dCache to 2.5.2 but had to roll back to 2.4 (issues with space tokens). Under investigation with the dCache developers.
- OSG: an issue was spotted by the monitoring probe contacting the CERN BDII; GGUS:93650 submitted to CERN.
- CERN DB: experiencing serious problems with LCGR, CMSOffline and CMSOnlineADG (the last used by CMS Frontier). The issue is at the storage layer: one of the databases overloaded a NAS box because of the volume of data inserted. The CMS file-transfer application is a possible culprit. Under investigation.
AOB: