Week of 150928

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Akos (grid & batch), Andrei (databases), Doug (ATLAS), Maarten (SCOD + ALICE)
  • remote: Christoph (CMS), Di (TRIUMF), Dimitri (KIT), Francesco (CNAF), Kyle (OSG), Michael (BNL), Onno (NLT1), Rolf (IN2P3), Sang-Un (KISTI), Stefan (LHCb), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • FTS issues last Thursday and Friday at CERN and BNL (both places ran out of inodes) - fixed
      • Due to the BNL FTS issues, submissions were redirected to other FTS servers; beginning to slowly migrate back to BNL
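As an aside on the inode exhaustion above: inode usage is easy to overlook because disk monitoring often tracks bytes only. A minimal sketch of a check using Python's os.statvfs (the path and the 90% threshold are illustrative, not taken from the actual FTS setup):

```python
import os

def inode_usage(path="/"):
    """Return (used_fraction, free_inodes) for the filesystem holding `path`.

    The default path is illustrative; FTS log/spool locations vary per site.
    """
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    # Some virtual filesystems report zero total inodes; guard against that.
    used_fraction = 1.0 - (free / total) if total else 0.0
    return used_fraction, free

# Example: warn when more than 90% of inodes are consumed (arbitrary threshold)
used, free = inode_usage("/")
if used > 0.90:
    print(f"WARNING: {used:.0%} of inodes used, only {free} left")
```

A cron job running a check like this would have flagged the exhaustion before transfers started failing.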

  • CMS reports (raw view) -
    • Unfortunately nobody from CMS available to call in - please address questions afterwards to Christoph
    • Tier0
      • Presently having a backlog with file registration in our DBS system
    • Storage issues at FNAL due to bad update of CMS TFC (Trivial File Catalog)
    • Issues with SAM tests at CERN
      • One CE (ce406.cern.ch) with submission problems: GGUS:116440
      • glexec-Test seems not to get scheduled in time (24h): GGUS:116468
        • Akos & Maarten have followed up

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp and pHe data at T0/1 sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
    • T0
      • Investigation of slow worker nodes ongoing; it is being followed up internally and the ticket is closed (GGUS:116023)
      • Problem accessing EOS caused by a stuck namespace service (GGUS:116445); issue solved
      • Failed transfers to CERN-RAW, re-surfaced on Sunday (GGUS:116321)
      • Observed that pilots seem to get stuck in the CREAM CEs in front of LSF and are not submitted to the batch system (GGUS:116473)

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
    • Tomorrow's intervention to replace the broken component of one of the two electrical supplies affected by the fire is confirmed. It will start at 9:00 and will take all day.
    • We have scheduled a downtime for the LHCb file system from today (12:00) until the end of the intervention. The LHCb queue at CNAF will be closed for about 15 minutes during the shutdown of the file system to allow a change in the pre-exec command (the file system is no longer required during the intervention; files will then be accessed remotely)
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
  • NL-T1:
    • Last Friday there was a network intervention on 2 dCache pool nodes that may have caused some data access errors between 12:50 and 13:24 CEST
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services:
    • FTS fts3.cern.ch upgrade from 3.2.33 to 3.3.31 at 10:00 CEST on Tuesday (tomorrow), OTG:0025042; this is the migration that was cancelled last week.
  • CERN storage services:
  • Databases:
    • experts have been investigating an issue that affected the ALICE online DB 2 weeks ago:
      • certain queries returned an empty set intermittently
      • the affected rows were found to have been deleted, then rapidly reinserted by the WinCC archiver
      • this resulted in a race condition
      • the matter will be followed up further
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Akos (grid & batch), Andrei (databases), David (ATLAS), Luca (storage), Maarten (SCOD + ALICE), Stefan (LHCb)
  • remote: Christoph (CMS), Dennis (NLT1), Di (TRIUMF), Matteo (CNAF), Michael (BNL), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Security incident on a UI node used by ATLAS users: all those users' certificates were revoked; ATLAS resources do not appear to have been affected by the incident

  • ALICE -
    • The Russian sites have been suffering network connectivity issues since yesterday.
      • JINR and PNPI are still not OK.
      • Stefan: LHCb also had network issues at RRC-KI; it looks OK now

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1 sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
    • T0
      • Failed transfers to CERN-RAW, re-surfaced on Sunday (GGUS:116321)
      • Observed that pilots seem to get stuck in the CREAM-CEs of LSF and not submitted to batch (GGUS:116473), back to normal, ticket closed
    • T1
      • CNAF: scheduled downtime for storage Mon/Tue
      • GRIDKA: scheduled site outage Tue - Thu (draining by LHCb Mon evening)
      • PIC: scheduled batch & storage downtime Wed (draining started by site 24 h before DT)
      • Maarten: we will discuss the issue of concurrent T1 downtimes at the Ops Coordination meeting later today

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
    • the power intervention on Tue went OK
    • the LHCb file system is back online
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
    • the big maintenance downtime ended early; everything looks OK so far
    • the SGE batch system was upgraded to version 8.3.0 and cgroups were enabled on the WNs
      • the limits were taken from the VO cards in the EGI Ops Portal
      • the memory limit is applied to the RSS, not the VMEM
      • please let us know if jobs suffer from these new conditions
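On the RSS vs VMEM point above: a job that merely reserves address space inflates its VMEM, while only the memory it actually writes to counts against an RSS-based cgroup limit, so jobs with large but sparsely used allocations benefit from KIT's choice. A small illustration of the difference using Python's resource module (the 50 MB figure is arbitrary):

```python
import resource
import sys

def max_rss_kb():
    """Peak resident set size of this process, in kilobytes.

    On Linux ru_maxrss is reported in kB; on macOS it is in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss // 1024 if sys.platform == "darwin" else rss

before = max_rss_kb()
# Writing to the buffer forces its pages to become resident, raising the RSS;
# an untouched reservation (e.g. a large mmap) would raise only the VMEM.
data = bytearray(50 * 1024 * 1024)
after = max_rss_kb()
print(f"peak RSS grew from {before} kB to {after} kB")
```

Under an RSS-based limit, only growth like the above counts; under a VMEM-based limit, even the untouched reservation would.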
  • NDGF:
  • NL-T1: ntr
  • NRC-KI:
  • OSG:
    • the GGUS test alarm workflows appear to have gone OK, though 1 alarm was marked as 'critical' while the other was only 'high'
      • Maarten: for CMS an alarm ticket is generated, while for ATLAS it initially is a TEAM ticket that then gets promoted to an alarm;
        it has been like that for years, but it could be that the latest release has some change that is related to what you noticed
      • Rob: will compare the current situation with the previous set of test alarms
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services: nta
  • Databases: ntr
  • GGUS:
    • New release of GGUS deployed yesterday, including a new feature, 'my dashboard', which shows a personalized view of the tickets selected by the registered user
  • Grid Monitoring:
  • MW Officer:

AOB:

- SAM3 draft availability reports for September 2015 have been sent

Topic revision: r13 - 2015-10-01 - MaartenLitmaath