Week of 150323

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Luca (SCOD+Storage), Iain (Batch and Grid), Lorena (Databases), Maarten (ALICE), Stefan (LHCb), Alessandro (ATLAS)
  • remote: Christoph (CMS), Felix (ASGC), Dmytro (NDGF), Onno (NL-T1), Rolf (IN2P3), Matteo (CNAF), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • FTS lost messages: some transfers were moved to the CERN pilot, but lost messages from BNL are still observed.

  • CMS -
    • Cosmic run with magnetic field about to start: CRAFT (Cosmic Run At Four Tesla)
    • Not much production/processing in the system
      • Scaling exercises taking quite some CPU slots at the sites though
    • Trouble with staging from CASTOR at CERN since Friday, GGUS:112490
      • Experts looking at it

  • ALICE -
    • CASTOR at CERN: staged files appear to be garbage-collected prematurely
      • being looked into

  • LHCb reports (raw view) -
    • operations dominated by MC jobs
    • no issues to report for this meeting

Sites / Services round table:

  • ASGC: NTR
  • BNL:
  • CNAF: NTR
  • FNAL:
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: Planned 2h downtime next Thursday for SRM due to maintenance on power feeds.
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • RRC-KI:
  • TRIUMF: NTR

  • CERN batch and grid services: SAM tests for the CREAM CE failed during the weekend. Experts are investigating the problem, which is causing increased memory (and swap) usage.
  • CERN storage services: we are following different CASTOR issues
  • Databases: NTR
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

OSG apologizes for not being able to join this week because of its "All Hands Meeting", which runs the entire week.

Thursday

Attendance:

  • local: Luca (SCOD+Storage), Iain (Batch and Grid), Maarten (ALICE), Stefan (LHCb), Andrea (Grid Monitoring)
  • remote: Alessandro (ATLAS), Christoph (CMS), Onno (NL-T1), Dmytro (NDGF), Tiju (RAL), Thomas (KIT), Di Quing (TRIUMF), Sang (KISTI), Rolf (IN2P3), Matteo (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Found a solution for the FTS share problem; it was apparently due to the capitalization of some shares

  • CMS -
    • Quite some trouble with overloaded Frontier infrastructure
      • Not fully clear - might be caused by some scaling exercise running many jobs
    • Trouble with staging from CASTOR at CERN since Friday, GGUS:112490
      • Any updates?

  • ALICE -
    • CASTOR at CERN: instabilities affecting both online and offline activities
      • Oracle row lock contention due to concurrent activities
      • bunch size of staging requests was not optimal (fixed)
      • disk pool was not being rebalanced, which led to unnecessary garbage collection (fixed)
      • garbage collection removed newly staged files because they had not (yet) been accessed!
        • will be improved
      • good support from the CASTOR team!
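
The garbage-collection behaviour reported above can be illustrated with a toy sketch. This is not CASTOR's actual code or policy: the `PoolFile` fields, the two key functions and `pick_victims` are hypothetical names, and the assumption is only that eviction candidates are ranked by last access time. Under that assumption, a file that was just staged and never read looks like the oldest file in the pool; counting the staging time as an access is one possible way to avoid that.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PoolFile:
    name: str
    staged_at: float              # time the file landed on the disk pool (hypothetical field)
    last_access: Optional[float]  # None until the file is read for the first time


def naive_key(f: PoolFile) -> float:
    # Ranks purely by last access: a never-read file sorts as the oldest.
    return f.last_access if f.last_access is not None else 0.0


def staging_aware_key(f: PoolFile) -> float:
    # Treats staging as an access, so a freshly staged file is not the oldest.
    return max(f.staged_at, f.last_access or 0.0)


def pick_victims(files: List[PoolFile], count: int,
                 key: Callable[[PoolFile], float]) -> List[str]:
    # Evict the `count` files with the smallest key (least recently used).
    return [f.name for f in sorted(files, key=key)[:count]]


files = [
    PoolFile("old_read", staged_at=100.0, last_access=200.0),
    PoolFile("just_staged", staged_at=900.0, last_access=None),
    PoolFile("recently_read", staged_at=300.0, last_access=800.0),
]

print(pick_victims(files, 1, naive_key))          # evicts the freshly staged file
print(pick_victims(files, 1, staging_aware_key))  # evicts the stalest file instead
```

With the naive key the freshly staged file is evicted before the job that requested it can read it; with the staging-aware key the file that has been idle longest goes first.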

  • LHCb reports (raw view) -
    • operations dominated by MC jobs
    • T0:
    • T1: occasional failures observed for user jobs trying to upload output data to the IN2P3-User ST; the issue is currently being investigated within LHCb

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: NTR
  • FNAL:
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: Today's maintenance on the power feeds was completed successfully. In the meantime dCache was updated to 2.12.3. Last Tuesday there was a problem during another power maintenance intervention: a fuse blew and, as a consequence, one switch was left without power. The last power maintenance intervention will take place on Monday.
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services:
    • Incident with CE404: this CE was removed during a reconfiguration of LSF, which led the BDII for CE404 to report zero job numbers. This is the first time such an incident has happened. To fix the issue, CE404 was removed from the BDII and the cache was cleaned, after which the LSF reconfiguration was operational again.
    • The SAM test problem is still being investigated; the tests appear to be timing out due to a very high load on the batch system. The batch team is waiting for new resources to become available in the coming weeks.
  • CERN storage services: CASTOR LHCb update finished around 15:00
  • Databases:
  • GGUS:
    • Alarm tests: problems occurred at FNAL; the test will be repeated today. INFN has not replied to the alarm so far.
    • GGUS-SNOW: problems with the interface after enabling new web service calls on the GGUS side; these problems are fixed now. Closing of SNOW requests when a ticket is re-assigned in GGUS still does not work.
  • Grid Monitoring:
  • MW Officer:

AOB:

  • NOTE: European summer time (CEST) starts on Sun March 29.

-- AndreaSciaba - 2015-02-27

Topic revision: r12 - 2015-03-27 - PabloSaiz