Week of 180910

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Borja (Monit, Chair), Eddie (Storage), Gavin (Computing), Ivan (ATLAS), Maarten (ALICE), Remi (Storage), Vladimir (LHCb)
  • remote: Christian (NDGF), Christoph (CMS), David B (IN2P3), Dave (FNAL), Di (TRIUMF), Jeff (OSG), John (RAL), Marcelo (CNAF), Onno (NL-T1), Sang Un (KISTI)

Experiments round table:

  • CMS reports -
    • Again a typical week regarding CPU usage: ~235k cores on average
      • ~190k cores for production
      • ~45k cores for analysis
    • We have a number of issues with EOS at CERN - the ones listed below are critical for T0 operations and production
      • INC:1785566 - Correctly written files (checksums OK) disappeared from the namespace or got zero size
      • INC:1784940 - We have services that depend on a reliable EOS fuse mount
      • INC:1783686, INC:1784454 - Problems with deletions, supposed to be fixed

Maarten advised tracking such issues with GGUS tickets, since otherwise they will not be accounted for in the management report. It is not up to the supporters to push for this; the clients themselves should open the tickets properly.

  • ALICE -
    • Normal activity levels on average last week
    • A bad MC production with huge memory consumption is under investigation

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues
      • CERN: Pilot submission problem (GGUS:137037); Solved
      • CERN: Problem with accessing files (GGUS:137079)
      • CNAF: Minor problems at worker nodes

Sites / Services round table:

  • ASGC: NC
  • BNL: NC
  • CNAF:
    • The latest version of StoRM has been installed at CNAF. An issue was observed where StoRM crashes under high load. The storage team is in contact with the StoRM developers to solve it. Meanwhile, some temporary measures have been taken so that operations are not affected.
    • An ATLAS ticket (GGUS:137060) led to GlueHostMainMemoryRAMSize being raised from 2 GB to 3 GB (jobs were using more memory than they declared necessary and were getting killed)

Marcelo also mentioned an LHCb worker node problem that was sorted out, and asked Vladimir for feedback.

  • EGI: NC
  • FNAL: NTR
  • IN2P3: Next Tuesday, 18 September, IN2P3-CC will undergo scheduled maintenance. The CEs and SEs will be in downtime for the whole day.
  • JINR: NC
  • KISTI: NTR
  • KIT: NC
  • NDGF: NTR
  • NL-T1:
    • We restarted all dCache pools this morning to fix a configuration issue affecting WebDAV access. The issue was a missing limit on the number of concurrent transfers, which could theoretically have caused a dCache overload.
  • NRC-KI: NC
  • OSG: NTR
  • PIC: NC
  • RAL: Busy with HTCondor all last week, otherwise NTR
  • TRIUMF: NTR

  • CERN computing services:
    • Shared batch resources have now been patched for L1TF and are approximately back to full capacity.
    • Service cell patching for L1TF will start on 12th September to upgrade kernel and switch off SMT. Hypervisor reboots will proceed by availability zone as per the schedule in OTG:0045522. Please watch the OTG for updates.
    • Problem with Argus on Friday affecting Grid jobs.
    • Problem with our site-bdii noticed today - issue with DNS alias. Being looked at.
  • CERN storage services:
    • FTS: the VMs belonging to 2 instances (CMS/LHCb and ATLAS) will be rebooted as part of the hypervisor reboot campaign to install the patches mitigating the L1TF vulnerabilities (OTG:0045522), according to this calendar:
    • Transfers running at the time of the reboot will fail, and clients may experience connection failures to the services
  • CERN databases: NC
  • GGUS: NTR
  • Monitoring: NTR
  • MW Officer: NC
  • Networks: NC
  • Security: NTR

AOB:

Topic revision: r17 - 2018-09-10 - MaartenLitmaath