Week of 180528

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (DB, chair), Julia (WLCG), Vladimir (LHCb), Maarten (WLCG, ALICE), Gavin (computing), Belinda (storage), Vincent (security), Borja (monitoring), Marian (networks)
  • remote: Andrzej O. (ATLAS), Dmytro (NDGF), Marcelo (CNAF), Balazs (MW), Di (TRIUMF), Dave M (FNAL), Ken (CMS), Sang Un (KISTI), Xin (BNL), Victor (JINR), David B (IN2P3), Pepe (PIC)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Overall production running smoothly
    • INFN-T1 temporary storage configuration and tape endpoint issues
    • CERN eosatlas crashed on Thursday at 10:10 and was not available until 19:45.
      File transfers were delayed and production at CERN crashed,
      but no further disturbances were found.

  • CMS reports ( raw view) -
    • It's a holiday weekend in the US -- why am I on shift?
      • Also, I wrote this Sunday night my time, so some things may have happened during the CERN day.
    • Average CPU usage: ~170k cores for production and ~58k cores for analysis
    • Biggest news: Data files were being incorrectly deleted from the T0. I was asleep during the joint operations meeting about it this morning, hopefully we'll have an update on how to proceed. The streamer files were not lost, so it should be possible to recover all the data.
    • Site status board (SSB): problems persisted for much of the week but appear to be resolved now (GGUS:135095).
    • OSG downtimes may not be propagating properly to be visible to WLCG as a result of OSG operational transitions. We're working to make sure sites are following revised procedures.
    • GGUS:135291, trouble with transfers from T1_FNAL_Disk. Was under investigation at FNAL, no resolution yet (and then holiday weekend). Also, separately, continuing concerns about transfer rates to FNAL tape, but FNAL experts believe this is more a problem of PhEDEx reporting that the transfers were completed.
    • GGUS:135249, problems with CNAF transfers, resolved.
Julia asked whether the revised procedure changes where the information is published. Ken replied that the information goes to the same place as before; only the procedure has changed.

  • ALICE -
    • NTR

  • LHCb reports ( raw view) -
    • Activity
      • Data reconstruction for 2018 data
    • Site Issues
      • NIKHEF: pilots failed (GGUS:135325) during the weekend; fixed.
      • Most pilots at ce515.cern.ch finished "successfully" without matching jobs due to missing CVMFS.
Maarten asked whether a ticket had been opened. Vladimir replied that LHCb was not sure whether the ticket should be opened with CERN. Gavin confirmed that tickets for all CERN CEs should be opened with CERN and that all types of CERN resources should be considered part of the CERN service.

Sites / Services round table:

  • ASGC: nc
  • BNL: 1) the rolling upgrade of the T1 farm to SL7 is done, with jobs running inside Singularity containers (SL6); 2) the CVMFS infrastructure was migrated a week ago to new hardware and a new network topology, and the CVMFS release was upgraded from 3.2 to 3.5
  • CNAF:
    • ATLAS ticket (GGUS:135303) for file transfer problems: Solved
    • Downtime during the weekend for SRM (GOCDB:25376); filesystem issues on the ATLAS gridftp: Solved
    • Transfer-to-tape problems with ATLAS also seem to be solved (no ticket was opened)
    • There will be an electrical intervention on one of the lines this week; it should be transparent.
  • EGI: nc
  • FNAL:
    • 24-hour downtime is planned for July 20th
    • CMS transfer issue: there are multiple agents involved (PhEDEx, dCache). PhEDEx is not marking files as migrated quickly enough, and this is causing a backlog.
  • IN2P3: NTR
  • JINR: NTR
  • KISTI: NTR
  • KIT: nc
  • NDGF:
    • One of the clusters will go into a week-long downtime starting Monday, 04 June, for a filesystem upgrade. The compute capacity for ATLAS at NDGF will be heavily reduced next week.
  • NL-T1:
    • Unable to attend because of dCache workshop
    • Last Friday we had a network outage of 1 hour in our dCache instance; our networking experts are still investigating the cause.
  • NRC-KI: nc
  • OSG: nc
  • PIC: Problem with an ATLAS disk pool. 145 TB of data were migrated to another pool, and the problematic one was sent to the company for further inspection (it is the second failure in 4 months of operation).
  • RAL: nc
  • TRIUMF: New tape library is in production now and the total tape capacity at TRIUMF increased to ~30PB.

  • CERN computing services: NTR
  • CERN storage services: CASTOR nodes will be upgraded (aiming for 7th June, during Machine Development). Will confirm exact date.
  • CERN databases: NTR
  • GGUS:
    • There was a scheduled downtime 06:00-06:30 UTC today to apply patches for CVE-2018-3639
  • Monitoring: NTR
  • MW Officer: New WLCG baseline versions of the following products: FTS 3.6.8 (was 3.5.8), ARC-CE 5.3.1 (was 5.0.2), Singularity 2.5.0 (was 2.4.2)
  • Networks: GGUS:135304 - Uni.Freiburg unreachable from CERN; CERN prefixes were missing in the routing announcements to SWITCH
  • Security: NTR

AOB:

Topic revision: r17 - 2018-05-28 - KateDziedziniewicz
 