Week of 181126

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Alexander (LHCb, NRC-KI), Belinda (Storage), Borja (Chair, Monitoring), Maarten (ALICE), Miroslav (Databases), Vladimir (LHCb)
  • remote: Darren (RAL), Dave (FNAL), Di (TRIUMF), Dmytro (NDGF), Jeff (OSG), Xavier (KIT), Marcelo (CNAF), Onno (NL-T1), Sang-Un (KISTI), Vincenzo (EGI), Xin (BNL), David B (IN2P3)

Experiments round table:

  • CMS reports -
    • Likely no one from CMS able to join the call
    • Heavy Ion run
      • Had some problems with Frontier last week due to high load - better now
      • With increasing intensity also rising backlogs in tape archiving and "prompt" reconstruction (as expected)
    • Very good CPU usage
      • ~185k cores for production
      • ~55k cores for analysis

  • ALICE -
    • Normal to high activity levels on average
      • Lowish Mon-Wed last week due to ALICE central service instabilities

Vladimir asked what the reason for the instabilities was. Maarten explained that raw data replication partly goes through the central cluster and that the high rates likely provoked the issues.

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues
      • SARA: Ticket open concerning data transfer problems (GGUS:138472)

Sites / Services round table:

  • ASGC: NC
  • BNL: dCache space filled up last Friday, causing failures of transfers and jobs. As a temporary workaround, ATLAS DDM aggressively deleted a lot of secondary data to free space. It was later found that the space reporting JSON file had not been updated due to an NFS issue, which is fixed now.
  • CNAF: NTR
  • EGI: UMD 4.8.0 RC is ready, bringing a BDII update that switches the "OSG" option from "true" to "false" and addresses an issue with a stale PID. A broadcast will be sent right after the release, asking sites to apply the OSG=false option (either by hand or by upgrading the BDII package). Please report any issues with the BDII to the GGUS Information System Support Unit.

Maarten asked for the release date; it is expected within the next 2 days.

  • FNAL: NTR
  • IN2P3: Site will be in scheduled downtime next Tuesday Dec. 4th. CEs and SEs will be off for the whole day.
  • JINR: NTR
  • KISTI: NTR
  • KIT:
    • Last week's downtime for LHCb was a success. We ended the downtime early and have heard no complaints from LHCb - so far.
    • Announced the dCache update downtime for ATLAS - atlassrm-fzk.gridka.de - for Thursday, 6th of December.
  • NDGF: Power outage in Umeå, temporarily impacting our computing and storage.
  • NL-T1:
  • NRC-KI: NTR
  • OSG: NTR
  • PIC: NC
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime has been scheduled for 07:00-10:00 UTC
      • Test alarms will be submitted as usual
  • Monitoring: NTR
  • MW Officer: NC
  • Networks: NC
  • Security: NTR

AOB:

  • Xavier drew attention to the recurring issue with the CMS SAM tests, seen over the last couple of weeks, where the most recent test results are unknown - this time since last Thursday. Last week's GGUS ticket on this subject (GGUS:138351) had been marked resolved, and he has reopened it.

The Monitoring team is already looking into the issue; more details will be provided.

Topic revision: r21 - 2018-11-26 - MaartenLitmaath
 