Week of 180924

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (chair, WLCG, DB), Julia (WLCG), Maarten (WLCG), Gavin (computing), Borja (monitoring), Roberto (storage)
  • remote: Darren (RAL), Elena (CNAF), Jeff D. (OSG), Onno (NL-T1), Balazs (MW Officer), Christoph (CMS), Di (TRIUMF), Sabine (ATLAS), David B (IN2P3), Xin (BNL), Dmytro (NDGF), Zoltan (LHCb)

Experiments round table:

  • ATLAS reports -
    • Usual activities:
      • Stable running at 300-400k run slots depending on opportunistic resources (CERN@P1 for instance)
      • Lately lower transfer traffic than usual (small reprocessing tasks finished)
      • Some UK sites moved from APF to Harvester submission
      • A few T1s are full: the lifetime model will be run soon
    • Problems
      • CERN reboot campaign impacted production a bit on Tuesday evening

  • CMS reports -
    • The RAW data reported lost last week could be recovered
      • Required quite some manual effort to re-run some T0 jobs
    • CPU usage by production is still on a rather low value: ~150k cores
    • CERN EOS FUSE mount has become more stable: INC:1784940

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF: NTR
  • EGI: nc
  • FNAL: nc
  • IN2P3: NTR
  • JINR: NTR
  • KISTI: nc
  • KIT: nc
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI: nc
  • OSG: NTR
  • PIC: nc
  • RAL: NTR
  • TRIUMF: There will be a 5-hour downtime (17:00-22:00 UTC) today to replace the core switch of the new data centre. This is part of the plan to migrate to the new data centre.

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime has been scheduled for 07:30-10:00 UTC
      • Test alarms will be submitted as usual
  • Monitoring: NTR
  • MW Officer:
    • globus-gssapi-gsi issue
      • On Sep 21 globus-gssapi-gsi in EPEL was updated to version 13.10
      • That version sets v1.2 as the default TLS version
      • At the same time it sets the minimum version to that
      • Unfortunately there are "legacy" services in WLCG not ready for that
      • A handful of DPM instances were found using an old Globus version
        • They just need to upgrade ASAP
      • The BeStMan SRM on EOS services is also affected
        • Its code is implemented in Java + Jetty + JGlobus
        • We did not manage to make it accept TLS v1.2 yet
      • Quick tests of dCache and StoRM were fine
      • There could be other services affected, though
      • We have thus asked for the minimum to be set to TLS v1.0 again
      • Our EPEL packager Mattias Ellert prepared 14.7-2 with that change
        • and at the same time comments that "TLS 1.0 and 1.1 are recommended not to be used due to security concerns, and setting the default minimum version to 1.2 make sense long term. So it would be nice if this revert would not be permanent"
        • Hopefully it will soon be released
        • In the meantime GGUS:137130 has more information
      • Thanks to all involved for helping to minimize the fallout!
Maarten commented that the EOS BeStMan should no longer be in use in a few months, so it should be possible to redo the change early next year. The workflow will be improved to ensure that such announcements reach the wider community. Sabine asked if other sites were affected. Maarten replied that 4 sites were and that they will be contacted directly.
  • Networks: nc
  • Security: NTR
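
The globus-gssapi-gsi issue reported by the MW Officer comes down to TLS version negotiation: a connection is only possible if the version ranges accepted by client and server overlap. The following is a minimal sketch of that rule, not code from Globus or any WLCG service. The `TLS` enum, the `negotiate` helper and the assumption that a "legacy" service supports at most TLS v1.1 are all hypothetical and chosen for illustration only.

```python
from enum import IntEnum
from typing import Optional


class TLS(IntEnum):
    """Hypothetical labels for the TLS versions discussed in the minutes."""
    V1_0 = 10
    V1_1 = 11
    V1_2 = 12


def negotiate(client_min: TLS, client_max: TLS,
              server_min: TLS, server_max: TLS) -> Optional[TLS]:
    """Return the highest version both sides accept, or None if the ranges do not overlap."""
    lowest_allowed = max(client_min, server_min)
    highest_allowed = min(client_max, server_max)
    if lowest_allowed > highest_allowed:
        return None  # no common version: the handshake would fail
    return TLS(highest_allowed)


# Assumption for illustration: a legacy service that accepts TLS v1.0 to v1.1 only.
LEGACY_MIN, LEGACY_MAX = TLS.V1_0, TLS.V1_1

# Client whose minimum was raised to v1.2: no overlap with the legacy service.
strict = negotiate(TLS.V1_2, TLS.V1_2, LEGACY_MIN, LEGACY_MAX)

# Client whose minimum is set back to v1.0 while still offering v1.2.
relaxed = negotiate(TLS.V1_0, TLS.V1_2, LEGACY_MIN, LEGACY_MAX)

print("minimum v1.2:", strict.name if strict else "no common version")
print("minimum v1.0:", relaxed.name if relaxed else "no common version")
```

In this model, lowering the client minimum does not weaken connections to up-to-date services: against a server that accepts v1.2 the negotiated version stays v1.2, and only the legacy service falls back to an older version.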

AOB:

Topic revision: r16 - 2018-09-24 - MaartenLitmaath