Week of 180827

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Alberto (Monit), Borja (Chair, Monit), Gavin (Computing), Julia (WLCG), Maarten (ALICE), Michal (ATLAS), Miroslav (DB)
  • remote: Andrei (LHCb), Andrew (NL-T1), Christoph (CMS), Di (TRIUMF), Dave (FNAL), David B (IN2P3), Jeff (OSG), Jose (PIC), Marcelo (CNAF), Sang Un (KISTI)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Activities:
      • normal activities
    • Problems
      • T1_DATADISKs getting full
        • there are a few PB of small files to delete - they are now mixed with bigger files, so deletion can keep up but is still slowed down
        • this will take a longer time
      • unavailable files at INFN-T1_MCTAPE (GGUS:136823) - they were on a list of files to be recalled which had not been processed since August 6th
      • lost files at INFN-T1 tapes - 95 files on datatape and 35 files on mctape - DDM ops will process the list
      • transfers from pic to all clouds fail with "Transfer canceled because the gsiftp performance marker timeout of 360 seconds has been exceeded, or all performance markers during that period indicated zero bytes transferred" (GGUS:136820)
        • failures from disk - some dCache pools were overloaded; the maximum number of active movers was reduced
        • failures from tape - a hardware issue with one of the tape recall pools
      • RAL IPv6 problem
        • during Thursday evening we observed an increasing number of queued requests in Rucio
        • on Friday it turned out that Rucio was being slowed down by slow submission to the RAL FTS
        • sites using the RAL FTS were moved to the CERN FTS (which overloaded the CERN FTS)
        • once the IPv6 problem was solved, sites were switched back from the CERN FTS on Saturday morning
        • timeouts appeared again on Saturday afternoon and were solved in the evening
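The "performance marker timeout" quoted in the pic transfer error above can be sketched as follows. This is a hypothetical illustration of the cancellation logic, not the actual FTS/GridFTP implementation; only the 360-second window is taken from the error message.

```python
# Sketch of a gsiftp performance-marker timeout check (hypothetical,
# not the real FTS code). A transfer is canceled if all performance
# markers within a full timeout window report zero bytes transferred.

def should_cancel(markers, timeout=360):
    """markers: list of (timestamp_seconds, total_bytes_transferred),
    sorted by timestamp. Returns True if `timeout` seconds pass
    without the byte counter increasing."""
    if not markers:
        return False
    last_progress_time, last_bytes = markers[0]
    for ts, total in markers[1:]:
        if total > last_bytes:            # progress seen: reset the window
            last_progress_time, last_bytes = ts, total
        elif ts - last_progress_time >= timeout:
            return True                   # zero bytes for >= timeout seconds
    return False

# A healthy transfer: a marker every 60 s, bytes keep growing.
healthy = [(t, t * 10_000_000) for t in range(0, 600, 60)]

# A stalled transfer: the byte counter stops increasing after 120 s.
stalled = [(0, 0), (60, 5_000_000), (120, 9_000_000),
           (180, 9_000_000), (300, 9_000_000), (480, 9_000_000)]
```

Under this model, overloaded dCache pools or a broken tape recall pool show up as a flat byte counter, which is exactly what trips the 360-second marker timeout.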

  • CMS reports ( raw view) -
    • CMS EOS instance at CERN melted down under huge (and improper) user access from Aug 21st to Aug 22nd
    • RAL FTS3 service lost IPv6 connectivity evening hours on Aug 23rd: GGUS:136859
    • Global Pool basically recovered from short job issue last week
      • Pool almost collapsed on Friday due to a bad HTCondor parameter setting
        • Fixed on Saturday and recovering since then

  • ALICE -
    • NTR

  • LHCb reports ( raw view) -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues
      • CERN: The reported problem with uploads to EOS via xrootd (GGUS:136720) is likely related to the LHCb bundled grid middleware, the fix is being tested
      • RAL: IPv6 connection problems resulting in failed FTS transfers (GGUS:136863)
      • RAL: A failing disk server resulting in jobs failing to get input data

Sites / Services round table:

  • ASGC: NC
  • BNL: NC
  • CNAF: ATLAS - during routine checks, files on tape could not be read; the ATLAS DM team has been informed to check the possibility of replicating these files to CNAF.
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NC
  • NL-T1: NTR
  • NRC-KI: NC
  • OSG: NTR
  • PIC: NTR
  • RAL: NC
  • TRIUMF: the data migration to the storage at the new data centre has completed.

  • CERN computing services:
    • L1TF mitigation:
      • CERN Advisory: https://security.web.cern.ch/security/advisories/l1tf/l1tf.shtml
      • Public batch share in HTCondor and LSF draining now, will be done in two slices. Tier-0 clusters being arranged with ATLAS and CMS. See OTG:0045525.
      • Services will need to be rebooted (together with the hypervisors) - this will start during the technical stop (12th September) for the 1st availability zone. After a few days to verify there are no problems, the remaining zones will be rebooted, day by day, according to the schedule in OTG:0045522.
  • CERN storage services:
    • EOSCMS: Degraded between Tuesday evening and Thursday ~15h00 CEST (user overload, router line-card issue, software bugs).
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Final availability reports for July sent around
  • MW Officer: NC
  • Networks: NC
  • Security: NC

AOB:

  • no site information in Top-BDII for OSG (GGUS:132611)
    • the OSG resource XML no longer provides LDAPURL info (empty)
    • the old OSG LDAP endpoints point to the decommissioned is.grid.iu.edu
    • does WLCG even need a Top-BDII (installed at each site)? It looks like the broken OSG LDAPURL did not affect our services
    • answers:
      • top-level BDII services were never installed at each site, only a number per region
      • WLCG still needs a bit from the top-level BDII for some use cases
      • OSG services have been absent for more than 1 year without problems
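For context, a top-level BDII is an LDAP server publishing GLUE records, and the LDAPURL in the OSG resource XML is the endpoint such queries would hit. As a minimal sketch, extracting a service endpoint from an LDIF answer could look like the following; the record below is a made-up example in GLUE 1.3 style, not real OSG or CERN data.

```python
# Minimal sketch: parsing a GLUE 1.3-style LDIF record of the kind a
# top-level BDII returns over LDAP. The entry below is a fabricated
# example for illustration only.

def parse_ldif(text):
    """Parse a single LDIF entry into a dict of attribute -> list of values."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(": ")
        entry.setdefault(key, []).append(value)
    return entry

ldif = """\
dn: GlueServiceUniqueID=example.cern.ch_srm,mds-vo-name=local,o=grid
objectClass: GlueService
GlueServiceUniqueID: example.cern.ch_srm
GlueServiceType: SRM
GlueServiceEndpoint: httpg://example.cern.ch:8446/srm/managerv2
"""

entry = parse_ldif(ldif)
```

In practice such records are typically fetched with `ldapsearch -x -H ldap://<top-bdii-host>:2170 -b mds-vo-name=local,o=grid`, which is the kind of query the broken OSG LDAPURLs would have served.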
Topic revision: r16 - 2018-08-28 - GavinMcCance
 