Week of 180212

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally until 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (chair, WLCG), Julia (WLCG), Maarten (WLCG, ALICE), Vincent (security), Michal (ATLAS), Vladimir (LHCb), Alberto (mon), Borja (mon), Roberto (storage)
  • remote: Xavier (KIT), Andrew (NL-T1), David M (FNAL), Dmytro (NDGF), John (RAL), Kyle (OSG), Marcelo (CNAF), Christoph (CMS), Sang Un (KISTI), Vincenzo (EGI), Jens (NDGF), Matthew (EGI)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Activities:
      • normal activities
      • HepOSlibs update - BLAS library needed for user analysis - sites were informed about the update
    • Problems:
      • Slow transfers to BNL (GGUS:133295) - caused by a dCache bug [www.dcache.org #9341]
      • Timeouts to FZK-LCG2 tapes (GGUS:133332) - the tapes are served by the RAL FTS now and the limit is set to 100 connections
      • Transfers to CERN-PROD_PERF-MUONS failed with "Permission denied" (GGUS:133348) - permissions fixed
      • Transfers from SARA-MATRIX are failing with "File is unavailable" (GGUS:133407) - a disk pool on one of the nodes wasn't started
      • Transfers to RAL-LCG2-ECHO fail with "Address already in use" (GGUS:133399) - the site ran out of available ports; the fix needs changes on the central router, so it will be done this week
      • Transfers from CERN-PROD_datadisk fail with "No such file or directory" (GGUS:133414) - we need to find out why the file(s) are unavailable
      • CNAF flooding - problem with lack of communication/response, e.g. GGUS:133320
Maarten remarked that with the experts busy and the site in downtime, tickets might not be a top priority. Marcelo stated that CNAF is in downtime and the tapes are being tested; user support might not be checking tickets regularly, and the ATLAS contact has recently moved elsewhere. Marcelo offered to check and update the ATLAS ticket. The CNAF director will provide a status update during the GDB this week.

  • CMS reports ( raw view) -
    • Last week another Global Midweek Run (GMWR) took place with the available CMS detector components
      • No major computing issues
    • CPU utilization at an average scale
      • Present work is rather I/O-dominated
        • Production of Pileup-Premixing libraries
        • Test production of (a prototype) nanoAOD (an analysis format of ~1-2 kB/event)
    • Continued heavy tape operations
      • RE-RECO of 2017 data continues and RAW data needs to be re-staged

  • ALICE -
    • High to very high activity level on average

  • LHCb reports ( raw view) -
    • Activity
      • HLT farm fully running
      • MC simulation and user jobs
    • Site Issues
      • CERN/T0: problem with updating DBOD - LHCbDirac was in downtime for almost a week

Sites / Services round table:

  • ASGC: We will have the Chinese New Year break this week, from Wednesday (14 Feb) until 20 Feb. We still have on-call engineers and will keep operations running on a best-effort basis.
  • BNL: the slow transfer problem on BNL dCache last week was traced to orphaned CLOSE_WAIT connections between dCache door nodes and pool nodes, mitigated by restarting the door nodes. A ticket is open with the dCache developers for further investigation. (An illustrative sketch of how such a build-up of CLOSE_WAIT connections can be spotted follows at the end of this round table.)
  • CNAF: the CNAF director will be present at CERN for the GDB this week
    • Powering up some systems
  • EGI: NTR
  • FNAL: NTR
  • IN2P3: nc
  • JINR: NTR
  • KISTI: NTR
  • KIT: NTA
  • NDGF: NTR
  • NL-T1:
    • A dCache pool for ATLAS had fallen out of the configuration, leaving some files unavailable. Fixed this morning. GGUS:133407
    • The SARA IPv6 LHCONE issue of GGUS:129946 is now understood and a permanent solution is in place. There was a VPNv6 issue between our Juniper core routers where IPv6 traffic destined for LHCONE was not routed properly. The IPv6 tables are synced using VPNv6 iBGP routing, whereby the second core router learns an MPLS label for this traffic. The problem was that our primary core router didn't handle this MPLS-labelled traffic correctly: the label was stripped, but the traffic couldn't be routed out of the IRB. We changed the behaviour so that the advertised label corresponds to a routing table rather than to an interface. Now the label is popped and the traffic is routed normally, instead of being popped and punted out of an interface immediately.
    • Over the weekend both compute and storage services at NIKHEF-ELPROD suffered unexpected downtimes (the first approx. 6 hours, the second approx. 2 hours) when the underlying virtualisation platforms failed. This impacted many services distributed over a wide range of hosts, so that even services run in multi-redundant HA configurations failed. The failures are being investigated; at the moment we consider the stuck XenServer hosts to be affected by a combination of broken Intel microcode (from the Spectre variant 2 patches) and the I/O-intensive workload that the services impose on the virtualisation hosts. The offending microcode has now been manually removed from the platform. We continue to monitor the situation.
  • NRC-KI: nc
  • OSG: an SSO service was deployed for MyOSG ticketing; corner cases are under investigation. The downtime calendar is working, but there might be issues with updates. More news to come.
  • PIC: nc
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Distributed January SAM3 Availability Site report to WLCG Office.
    • Not receiving XRootD transfer data from ATLAS US sites. Data from other XRootD sources is fine.
Julia commented that this might be OK. Further verification to be done.
  • MW Officer: NTR
  • Networks: NTR
  • Security: NTR
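
As a footnote to the BNL report above: a minimal, purely illustrative sketch (not the procedure used at BNL) of how a build-up of orphaned CLOSE_WAIT connections can be spotted on a node, assuming Python 3.7+ and a reasonably recent iproute2 'ss' utility.

    #!/usr/bin/env python3
    # Illustrative only: count TCP sockets stuck in CLOSE-WAIT per remote peer,
    # e.g. on a dCache door node, to spot orphaned connections such as those
    # described in the BNL report above.
    import subprocess
    from collections import Counter

    def close_wait_per_peer():
        # List only TCP sockets in the CLOSE-WAIT state, numeric, no header.
        out = subprocess.run(
            ["ss", "-H", "-tan", "state", "close-wait"],
            capture_output=True, text=True, check=True,
        ).stdout
        peers = Counter()
        for line in out.splitlines():
            fields = line.split()
            if fields:
                # The last column is the peer "address:port"; drop the port.
                peers[fields[-1].rsplit(":", 1)[0]] += 1
        return peers

    if __name__ == "__main__":
        # Print the ten peers with the most CLOSE-WAIT sockets.
        for peer, count in close_wait_per_peer().most_common(10):
            print(f"{count:6d}  {peer}")

A steadily growing count for a given peer (e.g. a pool node, in the scenario above) would point at connections that are never being closed on the local side.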

AOB:
