Week of 171002

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow another time slot to be chosen, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
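
The conflict check described above can be sketched in a few lines of Python. This is only an illustration, assuming the downtimes are available as simple records: the `Downtime` class, its fields, and the site names are hypothetical and do not reflect the format of the actual downtimes calendar.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical downtime record; not the real calendar schema."""
    site: str
    severity: str      # "outage" or "warning"
    start: datetime
    end: datetime
    vos: frozenset     # VOs supported by the site


def find_outage_conflicts(downtimes):
    """Return (site_a, site_b, shared_vos) for overlapping "outage" downtimes."""
    outages = [d for d in downtimes if d.severity == "outage"]
    conflicts = []
    for a, b in combinations(outages, 2):
        if a.site == b.site:
            continue
        shared = a.vos & b.vos
        # Two intervals overlap when each one starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        if shared and overlaps:
            conflicts.append((a.site, b.site, sorted(shared)))
    return conflicts


# Example with made-up sites and dates.
calendar = [
    Downtime("SITE-A", "outage", datetime(2017, 10, 3, 8),
             datetime(2017, 10, 3, 18), frozenset({"atlas", "cms"})),
    Downtime("SITE-B", "outage", datetime(2017, 10, 3, 12),
             datetime(2017, 10, 4, 0), frozenset({"cms", "lhcb"})),
    Downtime("SITE-C", "warning", datetime(2017, 10, 3, 9),
             datetime(2017, 10, 3, 10), frozenset({"cms"})),
]
print(find_outage_conflicts(calendar))
```

Only downtimes classified as "outage" are compared, and a pair is reported only when the two sites share at least one VO, which mirrors the wording of the recommendation.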

Links to Tier-1 downtimes

ALICE  ATLAS  CMS   LHCB
       BNL    FNAL

Monday

Attendance:

  • local: Kate (chair, DB), Ben (Computing), Marian (network), Alberto (monitoring), Vincent (security), Maarten (ALICE, GGUS, WLCG), Paul (storage), Alejandro (FTS), Mayank (WLCG)
  • remote: Luca (CNAF), Elena (CNAF), Andrew (NL-T1), David C (ATLAS), Zoltan (LHCb), Gareth (RAL), Christoph (CMS), Di Qing (TRIUMF), David (FNAL), Xin (BNL), Christian (NDGF), Vincenzo (EGI), Kyle (OSG)

Experiments round table:

  • ATLAS reports -
    • Normal activities for the last week
    • A Rucio server issue on Friday night caused the grid to partially drain. Quickly fixed by experts on Saturday morning.
    • 3 Tier-1s down (unscheduled) for more than 12 hours in the last week
    • Sites with SLC5 DPM:
      • Bern: upgraded
      • RRC-KI T2: will upgrade this week

  • CMS reports -
    • Again overall good CPU utilization: ~140k cores for production and ~45k for analysis
    • High transfer activity incl. tape staging
    • T0 Condor pool was shrinking over the weekend
      • Traced down to a Factory problem

  • ALICE -
    • Normal to high activity on average
    • RAL: Stratum-1 was not updating between Tue and Fri afternoon (GGUS:130835)

  • LHCb reports -
    • Activity
      • Monte Carlo simulation, data processing and user analysis
      • Pre-staging of 2015 data for reprocessing has started and will continue for weeks.
    • Site Issues
      • T1:
        • Failures in transfers to and from GRIDKA (GGUS:130848). This was due to heavy load on dCache. It is stable now.
        • File upload and download failures at CNAF, due to a hardware failure which has already been fixed.
        • Missing expatbuilder at NIKHEF-ELPROD (GGUS:130832); solved
        • Failed transfers from IC to SARA (IPv6) (GGUS:129946); the problem is probably in the SARA connection to LHCONE
      • T2:
        • Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715); solved
Marian commented that the issue between IC and SARA has been worked around with a temporary solution. The vendor is investigating the router issue.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF:
    • On Saturday Sept. 30th the LHCb storage system (DDN SFA10K) suffered a double failure: a hardware failure on one controller and a firmware failure on another, which brought the entire system offline. It was solved a few hours later by power cycling the first controller. Half an hour later another failure occurred (in a different computer room) on a Fibre Channel switch, which knocked out the main production TSM server, the tape library and several HSM servers. The on-site intervention for this problem was postponed until the next working day (Monday). It was solved this morning and all services are now up and running. We are investigating the causes of both failures with the hardware providers.
    • CMS: on Friday evening CMS reported an issue with XrootD tests accessing CNAF storage. Our analysis showed that it was caused by an enormous number of I/O requests from CERN WNs. About 2400 CERN WNs tried to get data from CNAF, saturating both the CPU and the network resources of our servers and creating a denial-of-service condition. To protect our servers we decided to limit the number of XrootD threads running on each server.
  • EGI: NTR
  • FNAL: NTR
  • IN2P3: Site in downtime from today 20:00 UTC to tomorrow 11:00 UTC to update the SL6 WNs because of a security vulnerability [EGI-SVG-CVE-2017-1000253]. The interactive access machines were already updated this morning.
  • JINR: nc
  • KISTI: nc
  • KIT: nc
  • NDGF: Round-robin upgrade of dCache pools at sub-sites from Tuesday 12 UTC to Wednesday 12 UTC. Only short interruptions of less than 10 minutes are expected during reboots. Minimal impact expected.
  • NL-T1:
    • The SARA-MATRIX SE went down on Thursday evening, as it did the week before (GGUS:130827 and GGUS:130691). This time we have found the cause. dCache nowadays relies on ZooKeeper for internal communication, and we have 5 ZooKeeper VMs. A few weeks ago we started making weekly snapshots of all VMs, all at the same time on Thursday evening. This caused multiple ZooKeepers to be unresponsive at the same time. We have rescheduled the snapshots so that only one VM makes a snapshot at a time. If only one ZooKeeper is unresponsive, the others should be able to handle that.
  • NRC-KI: nc
  • OSG: NTR
  • PIC: nc
  • RAL: NTR
  • TRIUMF: NTR
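
The NL-T1 report above comes down to quorum arithmetic: a ZooKeeper ensemble stays available only while a strict majority of its servers is responsive. The sketch below illustrates why simultaneous snapshots can take the service down while staggered ones should not. The VM names and the 30-minute gap are made up for illustration and are not taken from the site's configuration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size, unresponsive):
    """True if the responsive servers still form a majority."""
    return ensemble_size - unresponsive >= quorum_size(ensemble_size)


def staggered_offsets(vm_names, gap_minutes):
    """Give each VM its own snapshot start offset, in minutes."""
    return {name: i * gap_minutes for i, name in enumerate(vm_names)}


# Hypothetical names for the 5 ZooKeeper VMs mentioned in the report.
vms = ["zk1", "zk2", "zk3", "zk4", "zk5"]

print(quorum_size(len(vms)))        # majority needed out of 5
print(has_quorum(len(vms), 1))      # one VM busy with its snapshot
print(has_quorum(len(vms), 3))      # several VMs busy at the same time
print(staggered_offsets(vms, 30))   # one snapshot slot per VM
```

With five servers the majority is three, so the ensemble can tolerate up to two unresponsive members. Snapshotting one VM at a time keeps it well inside that margin.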

  • CERN computing services:
    • Note the batch draining campaign for the kernel vulnerability. OTG (needs login): OTG0040095
    • We would like to propose that we use the opportunity to migrate GRID submissions fully to HTCondor, but welcome feedback if this is too aggressive a schedule.
Maarten commented that ALICE is OK with that. Christoph will check with the factory experts whether this is OK. Ben remarked that single-core vs. multicore jobs need to be considered for CMS. ATLAS will verify with Alessandro. Ben will contact the experiments directly.

  • CERN storage services: FTS links have been stuck after an upgrade at BNL
  • CERN databases: CMSONR ADG had to be restarted today as an emergency reaction to the synchronisation delay. An Oracle bug is suspected.
  • GGUS:
    • A new release was deployed on Thu Sep 28
      • The alarm tests went fine
      • A new support unit for IPv6 was added 2 weeks earlier
  • Monitoring:
    • Draft reports for 09/2017 availability sent to WLCG Office for distribution.
  • MW Officer:
  • Networks:
  • Security: New vulnerability published: CVE-2017-1000253 (EGI)
    • Local privilege escalation, affecting in particular UIs, VOBOXes and WNs
    • Currently only "High" because the exploit is not trivial and not yet public (but announced for CC7 & ping (partial escalation only))
    • Will most likely be escalated to "Critical" once the exploit is public: please prepare update plans!

AOB:

Topic revision: r19 - 2017-10-02 - MaartenLitmaath
 