Week of 180115

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Maarten (ALICE, WLCG), Kate (chair, DB), Christoph (CMS), Vincent (security), Gavin (comp), Alberto (mon), Borja (mon), Ivan (ATLAS), Andrea (MW, FTS)
  • remote: Xavier (KIT), Onno (NL-T1), Dave M (FNAL), Darren (RAL), Marcelo (CNAF), Ulf (NDGF), Di Qing (TRIUMF), Victor (JINR), David B (IN2P3), Pepe (PIC), Vincenzo (EGI), Renato (LHCb)

Experiments round table:

  • ATLAS reports -
    • No major issues
    • Workflow management
      • Stable operation with ~320k slots, including HLT and T0 resources
    • Data transfer
      • CNAF recovery:
        • The recovery pointed to 4 data16 RAW files lost on Castor - under investigation.
        • We continue exporting data15 RAW from Castor (data16 and data17 are finished).
      • RAL 600k-lost-files incident - all files have been declared bad.

  • CMS reports -
    • Productive week using close to 200k CPU cores (~140k production, ~60k analysis)
    • No major issues
    • Main difficulty: staging of some inputs from tape
      • Bad transfers from/to CERN Castor (GGUS:132570 and INC:1560174)
        • The suggested move to xrootd as transfer protocol only needs to be done for T2-to-T0 transfers
        • A patch was applied today to improve the transfer situation

  • ALICE -
    • High activity on average
    • Central services incident Jan 10 afternoon
      • Slowness and file system corruption due to a bad disk
      • Recovery of the main impacted services took ~20 hours
        • During that time grid activity was much reduced and most jobs were failing

  • LHCb reports -
    • Activity
      • Running at the maximum possible amount of resources. The HLT farm was stopped yesterday and will return "when cooling is stable again".
      • 2016 data restripping is running at full steam: approximately half of the data (excluding CNAF) was processed during the YETS.
      • Monte Carlo productions using remaining resources.
    • Meltdown & Spectre, performance hit after fix expected to be less cricital for data processing and monte carlo jobs (accounting for vast majority of work carried out).
      • voboxes patch: reboot will be tomorrow.
    • Site Issues
      • CERN/T0
        • NTR
      • T1
        • RRC-KI: problems with FTS transfers are currently under investigation.
        • RAL had issues during the weekend. The job "burst" was reduced and all looks OK today.

Sites / Services round table:

  • ASGC: The Meltdown/Spectre patch for CentOS 7 nodes was applied. SLC6 nodes are being patched.
  • BNL: Meltdown & Spectre: patched interactive nodes and CEs; WNs to follow
  • CNAF: This week we will start to power up some systems and test the cooling.
  • EGI: ntr
  • FNAL: ntr
  • IN2P3: Patches applied for Meltdown and Spectre on interactive nodes and WNs. The VO boxes will follow.
  • JINR: Singularity and the corresponding CVMFS repository have been installed on the SL6 WNs.
  • KISTI: nc
  • KIT:
    • Latest kernels installed on VO boxes; batch farm almost completed (95%).
  • NDGF: ntr
  • NL-T1: NTR
  • NRC-KI: nc
  • OSG: nc
  • PIC: 50% of the farm has been patched with the variant 1 and 3 patches; the remaining nodes are being patched. The microcode is being tested.
  • RAL: Issues with Castor over the weekend for ATLAS and LHCb. Investigations are ongoing.
  • TRIUMF: Updated kernel and microcode on CEs and WNs for Meltdown & Spectre

  • CERN site:
    • The service availability zone "cern-geneva-a" will be rebooted along with its VMs on Tuesday 16th: see OTG:0041682 and the WARNING downtime in GOCDB.
    • OTG:0041682 also describes the schedule for the other availability zones, which will be done on Mon 22nd, Tue 23rd and Wed 24th next week. Please monitor the OTG, as the schedule may change.

  • CERN computing services:
    • The CERN batch service will be drained and rebooted in two halves, starting with draining this week for the first reboots on Monday 22nd. The second half will then be drained for a reboot on Monday 29th. While draining, jobs whose wall-clock time fits before the reboot will still be accepted.
  • CERN storage services:
    • FTS:
      • 2 Pilot VMs were down for 2 hours today at lunch time and the running transfers were cancelled. The Cloud team promptly fixed the issue.
      • The FTS Prod and Pilot VMs will be rebooted to install the patches for the Meltdown/Spectre vulnerabilities (both on the VMs and on the hypervisors). The first half will be done tomorrow, Tue 16th Jan (OTG:0041784), the second half on Mon 22nd Jan (OTG:0041785). Since we cannot get an estimate of when the hypervisors will be rebooted, and we cannot drain half of the clusters for the whole day each time, the transfers running on the VMs at the time of the reboot will fail.
  • CERN databases: A number of storage-related issues again during the last week. A patch was applied today and it was not as smooth as expected; multiple databases were affected. Sorry about that.
  • GGUS: ntr
  • Monitoring: ntr
  • MW Officer:
    • A recent update in EPEL6-testing (https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-71db8f6f28), which includes new versions of bouncycastle and canl-java, breaks voms-client installations on UI and WN nodes installed from the UMD repository, as well as on the CREAM-CE. The UMD URT has been informed so that the issue can be fixed before this update is pushed to EPEL stable.
  • Networks:
  • Security: Speculative Execution:

AOB: Julia reminded there will be WLCG ops coordination meeting this Thursday. Please remember to reply to questions on SAM usage.
