Week of 130422

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board WLCG Baseline Versions WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance:

  • local: Simone (SCOD), Maarten (ALICE), Felix (ASGC), Jan (CERN), Alessandro (ATLAS), Ignacio (CERN), Edie (CERN), Zbigniew (CERN), Massimo (CERN), Maria (GGUS)
  • remote: Ulf (NDGF), Michael (BNL), Onno (NL-T1), Xavier (KIT), David (CMS), Rolf (IN2P3), Tiju (RAL), Salvatore (CNAF), Rob (OSG), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN network issue Saturday 17:40-20:30 CET (GGUS:93514). ATLAS observed a drop in jobs running on the GRID during that period (GRID draining, no possibility to start new jobs). On Saturday evening the problem seemed to be solved, but the same symptoms (GRID draining, no new jobs starting) reappeared around midnight and lasted until ~5am. Is CERN aware of any other issue during that period? We also observed that many services (running both on physical HW and on VMs) were reporting monitoring information neither to SLS nor to Lemon.
      • From Ignacio: there was a different network issue (a "transparent" intervention which was not transparent) in SafeHost.

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data is nearly done; activity at most T1s is down to the level of the ever-present "pedestal" of analysis jobs.
    • CERN / central services and T0
      • Network outage for a few hours on Saturday: (itssb link). It primarily affected us while the outage was occurring; recovery was swift once access was restored.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE -
    • central services: a mistaken update of the job agent on Thu caused most sites to be essentially drained on Fri (most jobs quickly failed); understood and fixed Fri late afternoon
    • CERN: mostly drained Sat late afternoon due to network incident, recovered mid evening
    • KIT: some SE test failures were observed Fri afternoon and the concurrent jobs cap was therefore lowered to 3k for the weekend; back at 10k since 10:50 CEST

Sites / Services round table:

  • NDGF: storage problem during the weekend. Investigating. SRM downtime (3h) on Wed at 10:00 AM for an upgrade.
  • NL-T1: a maintenance intervention on the SARA MSS is currently ongoing. At approximately 3 PM CEST, SARA suffered a network outage that lasted 10 minutes. Things seem OK now.
  • IN2P3: confirmed that IN2P3 will be able to absorb, as failover, the requests for condition data at RAL during the RAL DB intervention on Wednesday.
  • RAL: reminder about the Oracle patch being applied on Wednesday
  • OSG: monthly maintenance tomorrow. All machines will be restarted (should be transparent).
  • ASGC: FTS crashed because of a power failure (hardware).
  • PIC: On May 7 there will be an intervention at the tape library. Scheduled downtime flagged in GOCDB
  • CERN Storage: EOS ATLAS was not directly affected by the power cut, but was down for a few hours afterwards because of side effects from an interrupted test suite. EOS CMS was marked unavailable because of an expired certificate.
  • CERN DB: services survived the network outage except for the replication of the LHCb online to offline (down for 2 hours).
  • GGUS:
    • GGUS Release this Wednesday, April 24, from 06:00 to 07:00 UTC, with an ALARM test round as usual. The GOCDB entry is https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=121824&grid_id=0
    • Reminder: as announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday, again last Thursday, and at the WLCG Ops. Coord. meeting last week, the GGUS host certificate will be renewed. This certificate is used for SOAP authentication and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227.
    • The file ggus-tickets.xls is up to date and uploaded to the page WLCGOperationsMeetings.

AOB:

Thursday

Attendance:

  • local: Simone (SCOD), Jarka (CERN - Dashboards), Felix (ASGC), Alessandro (ATLAS), Stefano (CMS), Zbigniew (CERN DB), Ignacio (CERN PES), Maria (GGUS)
  • remote: Ulf (NDGF), Xavier (KIT), Lisa (FNAL), Stefano (CMS), Ronald (NL-T1), Rob (OSG), Gareth (RAL), Rolf (IN2P3-CC)

Experiments round table:

  • CMS reports (raw view) -
    • Nothing seems to have happened other than business as usual.
    • As we use glexec more and more for analysis jobs, we hit issues at sites more often. In many cases it seems to be a transient local problem. We have added retries after 30 and 60 sec; this does not seem to help, but it avoids black holes. Most likely sites need to develop expertise with glexec and Argus. We plan to live with this.
      • Jobs are currently aborted due to this, but they can be resubmitted later and very few resources are wasted.
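
The retry behaviour described by CMS (retry after 30 and then 60 sec, abort if it still fails so the job can be resubmitted later) could look roughly like the sketch below. This is only an illustration, not the CMS implementation: the function name `run_with_retries`, the `action` callable and the injectable `sleep` parameter are hypothetical, and treating every exception as a transient failure is an assumption.

```python
import time


def run_with_retries(action, delays=(30, 60), sleep=time.sleep):
    """Call action(); after each failure wait the next delay and try again.

    Returns (True, result) on success, or (False, last_exception) once all
    attempts are used up, so the caller can abort the job and resubmit later.
    All names here are hypothetical; this is not the CMS code.
    """
    last_error = None
    # One initial attempt plus one retry per configured delay.
    for attempt in range(len(delays) + 1):
        try:
            return True, action()
        except Exception as exc:  # assumption: any failure is treated as transient
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    return False, last_error
```

A caller would wrap the failing step, for example `ok, outcome = run_with_retries(start_payload)` with a hypothetical `start_payload` function, and abort the job when `ok` is false. Bounding the number of attempts is what keeps a broken worker node from consuming jobs indefinitely.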

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC: unscheduled network interruption on Tuesday (30 mins). Cause still unknown, under investigation.
  • NDGF: tried to upgrade dCache to 2.5.2 but had to roll back to 2.4 (issues with space tokens). Under investigation with dCache developers.
  • OSG: some issue spotted by the monitoring probe contacting the CERN BDII. GGUS:93650 to CERN.
  • CERN DB: experiencing some serious problems with LCGR, CMSOffline and CMSOnlineADG (the latter used by CMS Frontier). The issue is due to the storage layer: one of the databases overloaded one NAS box because of the amount of data inserted. One application possibly responsible is CMS File Transfer. Under investigation.

AOB:

Topic revision: r14 - 2013-04-25 - SimoneCampana
 