Week of 170213

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
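
The overlap check can be illustrated with a short sketch. This is only a minimal illustration, not an official tool: the downtime records, site names and VO sets below are made-up placeholders, whereas in practice the information comes from the downtimes calendar (GOCDB).

```python
from datetime import datetime

# Hypothetical downtime records: (site, severity, affected VOs, start, end).
# In reality these would come from the downtimes calendar (GOCDB).
downtimes = [
    ("CNAF", "OUTAGE", {"ALICE", "ATLAS", "CMS", "LHCb"},
     datetime(2017, 2, 13, 8), datetime(2017, 2, 15, 9)),
    ("SARA", "OUTAGE", {"ATLAS", "LHCb"},
     datetime(2017, 2, 14, 10), datetime(2017, 2, 14, 11)),
    ("PIC", "WARNING", {"ATLAS", "CMS", "LHCb"},
     datetime(2017, 2, 21, 9), datetime(2017, 2, 21, 12)),
]

def overlaps(a, b):
    """True if the two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def find_conflicts(records):
    """Yield pairs of 'OUTAGE' downtimes at different Tier-1s that overlap
    in time and affect at least one common VO."""
    outages = [r for r in records if r[1] == "OUTAGE"]
    for i, a in enumerate(outages):
        for b in outages[i + 1:]:
            if a[0] != b[0] and (a[2] & b[2]) and overlaps(a[3:5], b[3:5]):
                yield a, b

for a, b in find_conflicts(downtimes):
    print("Conflict: %s and %s overlap for VO(s): %s"
          % (a[0], b[0], ", ".join(sorted(a[2] & b[2]))))
```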

Links to Tier-1 downtimes

  ALICE    ATLAS    CMS      LHCb
           BNL      FNAL

Monday

Attendance:

  • local: Kate (chair, DB), Vladimir (LHCb), Julia (WLCG), Maarten (ALICE + GGUS), Gavin (computing), Vincent (security), Andrea M. (MW Officer + storage), Jesus (storage)
  • remote: Andrew (NLT1), David B (IN2P3), David C (ATLAS), John K. (RAL), Dimitri (KIT), Pepe (PIC), Kenneth (CMS), Matteo (CNAF), Ulf (NDGF), Xin (BNL), Vincenzo (EGI), Victor (JINR), David M. (FNAL), Kyle (OSG)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Very high level of running jobs: over 300k, helped by 50k from the HLT farm
    • Bulk reprocessing of all Run 2 data is still 2-3 weeks away
    • Tape staging test:
      • Last week we ran a test to stage ~150 TB from tape at each Tier-1
      • Results were mostly very good, with just a couple of sites slower than expected
      • Since we submitted a large number of staging requests at the same time, we set a large bringonline timeout of 48h in FTS
        • However, we discovered that most sites had internal timeouts configured to be much shorter (4, 8, or 24h)
        • We would prefer that sites set this internal timeout high, so that the FTS timeout is the effective one (see the sketch after this report)
    • The CERN S3 object store is being heavily used by event service jobs; a couple of gateways were not running, which impacted performance until they were quickly fixed
    • Maarten suggested opening tickets for the sites with short timeouts. ATLAS is following up with the sites concerned. An internal dCache setting is suspected to be the cause.
    • Pepe (PIC) asked what the target for the tests was. It was internal to ATLAS and will be presented at the next ATLAS SW & Computing meeting.
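
For reference, the 48h bringonline timeout mentioned in the report above is set at FTS job submission time. The sketch below is only an illustration, assuming the FTS3 Python "easy" bindings (fts3.rest.client.easy) and a valid grid proxy; the endpoint and SURLs are hypothetical placeholders. A site's own internal staging timeout still applies on top of this value, which is why ATLAS asks for it to be raised.

```python
# Minimal sketch, assuming the FTS3 Python "easy" bindings are installed
# and a valid grid proxy is available. Endpoint and SURLs are placeholders.
import fts3.rest.client.easy as fts3

FTS_ENDPOINT = "https://fts3.cern.ch:8446"             # example FTS server
SOURCE = "srm://tape-se.example.org/path/to/file"      # hypothetical tape SURL
DESTINATION = "srm://disk-se.example.org/path/to/file"

context = fts3.Context(FTS_ENDPOINT)
transfer = fts3.new_transfer(SOURCE, DESTINATION)

# bring_online is the staging (bringonline) timeout in seconds: 48 h here.
# If a site's internal staging timeout is shorter, it takes effect first,
# which is what the ATLAS test ran into.
job = fts3.new_job([transfer], bring_online=48 * 3600, copy_pin_lifetime=3600)

job_id = fts3.submit(context, job)
print("Submitted job", job_id)
```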

  • CMS reports ( raw view) -
    • Either this was a very quiet week or I didn't do a very good job of staying on top of things.
    • Life is good: we were consistently running 160k-180k cores through the week, but tailing off over the weekend as we were apparently getting through our backlog of work. That also led to a shift of usage from production towards analysis.
    • There was a scheduled network intervention at CERN on Thursday, but the only effect that I saw was one complaint that day about DBS.
    • And there was a scheduled reboot of a number of the CRAB3 VMs on Friday and others on Monday for a hardware intervention. The CRAB support team was properly notified, and the reboot was quick enough that there was no disruption.

  • ALICE -
    • High activity

  • LHCb reports ( raw view) -
    • Activity
      • MC Simulation, user analysis and data reconstruction jobs
    • Site Issues
      • T0:
        • A second instance of SRM for EOS LHCb is in production. The original EOS SRM reports zero available space from time to time.
      • T1:
        • CNAF: Downtime for 3 days
        • SARA: Downtime tomorrow (1 hour)
        • Maarten commented that the CNAF downtime was well announced and that the SARA one is very short. Vladimir recalled the "no 2 T1s in downtime simultaneously" rule.
    • Maarten asked whether the EOS reporting of zero space is being followed up. There is an ongoing discussion with the EOS team.

Sites / Services round table:

  • ASGC: nc
  • BNL: ntr
  • CNAF: downtime until 9am on Feb 15th. No news about possible downtime shortening.
  • EGI: ntr
  • FNAL: brief downtime on Feb 21st
  • GridPP: ntr
  • IN2P3:ntr
  • JINR: The farm has been almost empty since the start of the weekend, with very few production jobs. Our Tier-1 serves CMS.
  • KISTI: nc
  • KIT: ntr
  • NDGF:
    • We have one subsite (University of Copenhagen) down with limited networking, after an optical networking component that was due to be replaced failed during the weekend. It was supposed to be swapped out this afternoon.
    • One tape system restoring ATLAS tape files bent a loader arm, so some files are delayed. This is being looked at.
    • Last week ATLAS staged a very large number of files, causing an overload of the LHCONE link to Slovenia. This was worked around, and since ATLAS is still restoring files, the workaround remains in place.
  • NL-T1: The Xen server will be updated; this should be transparent.
  • NRC-KI: nc
  • OSG: ntr
  • PIC: Intervention on the Enstore instance on Feb 21st. A short stall is expected.
  • RAL:
    • Ongoing problems with the tests for srm-cms, but we believe it is just the tests that are affected, not production
    • Issues with the ATLAS Echo instance, possibly a network issue.
  • TRIUMF: Last Tuesday we upgraded the production dCache system to 2.13.51 to avoid the issue with the 'Non Repudiation' key usage in RFC proxies (see the sketch below).
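
The 'Non Repudiation' key usage mentioned by TRIUMF can be inspected directly on a proxy file. The sketch below is only an illustration, assuming the Python cryptography package and a hypothetical proxy path; it reads the leaf (first) certificate from the proxy and prints its key usage flags (the library exposes 'Non Repudiation' as content_commitment).

```python
# Minimal sketch, assuming the 'cryptography' package is installed.
# The proxy path is a hypothetical placeholder.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID

PROXY_PATH = "/tmp/x509up_u1000"

with open(PROXY_PATH, "rb") as f:
    pem = f.read()

# A proxy file contains the leaf proxy certificate first, followed by the
# private key and the certificate chain; keep only the first PEM block.
begin = pem.index(b"-----BEGIN CERTIFICATE-----")
end = pem.index(b"-----END CERTIFICATE-----") + len(b"-----END CERTIFICATE-----")
leaf = x509.load_pem_x509_certificate(pem[begin:end])

try:
    ku = leaf.extensions.get_extension_for_oid(ExtensionOID.KEY_USAGE).value
except x509.ExtensionNotFound:
    print("No Key Usage extension on the leaf proxy certificate")
else:
    print("digitalSignature:", ku.digital_signature)
    print("nonRepudiation  :", ku.content_commitment)  # the 'Non Repudiation' flag
    print("keyEncipherment :", ku.key_encipherment)
```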

  • CERN computing services: Network intervention on Thursday, multiple services will be affected. OTG:0035550
  • CERN storage services: NTR
  • CERN databases: ATONR will be migrated to new HW tomorrow. Downtime between 9am and 2pm. OTG:0035535
  • GGUS: ntr
  • Monitoring: nc
  • MW Officer: NTR
  • Networks: NTR
  • Security: NTR

AOB:
