Week of 151026

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Links to Tier-1 downtimes

ASGC BNL CNAF FNAL IN2P3-CC JINR KISTI KIT
NDGF NIKHEF SARA-MATRIX NRC-KI PIC RAL TRIUMF  

Monday

Attendance:

  • local: Luca (SCOD+Storage), Eric (CMS), Raja (LHCb), Steve (Batch), Maarten (ALICE), Andrei (databases)
  • remote: Dario (ATLAS), Asa (ASGC), Michael (BNL), Francesco (CNAF), Lisa (FNAL), Onno (NLT1), John (RAL), Kyle (OSG), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • Normal data-taking and Grid production activities ongoing.
    • Many FTS "error 500" failures during the weekend at CERN and RAL.

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1/2 sites. Some T2 attached to T1 in order to speed up the processing.
      • Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • Stable number of running jobs for processing of data at T0/1.
      • Data from pHe fully processed.
    • T0
      • Transfer to EOS stable now.
    • T1
      • Low level of upload failures at RAL - being followed up with the site. Also one CE down at RAL (GGUS:117171) due to hypervisor problems.
      • SARA SRM seems under load, possibly related to GGUS:116939. Wait and see.
    • AOB
      • Various DB interventions at CERN were announced to specific individuals only. It would be useful if they were sent to a mailing list (lhcb-geoc) or announced properly as a WLCG service.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: the problem on ATLAS disk reported last week is solved. Migration of WNs from LSF7 to LSF9 is still ongoing.
  • FNAL: Issue on a CMS dCache pool that led to data loss. There was a severe hardware problem affecting one machine; as a consequence one filesystem is unrecoverable. FNAL is following up with CMS for the lost files.
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
  • NL-T1: As mentioned by Raja, in the last month there were some load-related issues affecting SARA, which are being investigated.
  • NRC-KI:
  • OSG: NTR
  • PIC: NTR
  • RAL: the low level of failures seen by LHCb is due to 2 problematic disk servers (which are now offline), but all the data is still available. There was also a problem on one hypervisor that affected one of the FTS nodes.
  • TRIUMF:

  • CERN batch and grid services: The issues seen during our last attempt to upgrade our LSF masters to 9.1.3 are understood. On 4/11/2015, starting at 10am, we will make a new attempt to upgrade the LSF masters of our public instance to the new version. The upgrade will take several hours and involves a restart of LSF services on the worker nodes. Provided the upgrade succeeds, the worker nodes will be upgraded in the days following the intervention, starting with QA.
  • CERN storage services: NTR
  • Databases: NTR
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Luca (SCOD+Storage), Ben (Batch), Raja (LHCb), Maarten (ALICE), Andrei (databases)
  • remote: Michael (BNL), Rolf (IN2P3), Thomas (KIT), Andrew (Nikhef), Chris (OSG), Pepe (PIC), John (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Normal data-taking and Grid production activities ongoing.

  • ALICE -
    • CERN got largely drained of jobs due to the Argus banning incident

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1/2 sites. Some T2 attached to T1 in order to speed up the processing.
      • Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • Processing pAr data
    • T0
      • Problem with CREAM CEs starting late yesterday (GGUS:117263).
      • Problem with FTS (GGUS:117206): solved yesterday, but GGUS ticket not updated.
    • T1
      • Low level of upload failures at RAL - being followed up with the site.
      • SARA SRM now down - GGUS:116939
      • RRCKI problems with tape system - GGUS:117267. Seems to be recurrent at the site.
    • AOB
      • Tickets especially at CERN not being explained / closed.

Sites / Services round table:

  • ASGC:
  • BNL: NTR
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT: Downtime for GGUS extended to fix minor issues.
  • NDGF:
  • NL-T1: SARA: problem with the dCache cluster, follow up on Monday. Due to this issue all dCache components were restarted.
  • NRC-KI:
  • OSG: repo1 and repo2 are not accepting updates but are still serving packages; investigation ongoing.
  • PIC: NTR
  • RAL: NTR
  • TRIUMF:

  • CERN batch and grid services:
    • ARGUS issue with rules on central banning, affecting users' authorization and all CERN job submissions (OTG:0025994)
    • Reminder for LSF9 update on 4/11/2015 as already announced on Monday
    • FTS fts3.cern.ch will have 10 minutes of downtime on the morning of Tuesday 10th November. See GOCDB and OTG:0025970.
  • CERN storage services: NTR
  • Databases: NTR
  • GGUS:
    • GGUS release on the 28th. Among other things: reworking the email notification, and creation of a mailing list for announcement of unscheduled outages. All test alarms have been acknowledged.
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic attachments
  • MB-Oct-15.pptx (pptx, r1, 2856.9 K, 2015-10-26 15:55, PabloSaiz)
Topic revision: r17 - 2015-10-29 - OnnoZweersExternal
 