Week of 180409

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic needs to be discussed at the operations meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Luca (SCOD+Storage), Julia (WLCG), Maarten (ALICE), Olga (Computing), Vincent (Security)
  • remote: David (ATLAS), Xin (BNL), David M (FNAL), David B (IN2P3), Sang Un (KISTI), Xavier (KIT), Andrew (NLT1), Dmytro (NDGF), Kyle (OSG), John (RAL), Di (TRIUMF), Alberto (Monitoring)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Activities:
      • Grid is full with mainly light activities: event generation and simulation
      • Some event generation tasks have rather high failure rates and/or use large amounts of disk I/O. Being investigated by experts.
    • Problems:
      • ADCR Oracle database serving all ATLAS grid-related services suffered under high load, leading to a couple of outages last week
      • CERN network outage on Thursday morning did not affect operations too much (HammerCloud blacklisting was disabled during the outage)
      • Certificate renewal season: one service mistakenly did not have the correct certificate renewed, which led to a slight draining of the grid over the weekend
      • EOS reported potential corruption of files during an 8h period on 30 March - files may be corrupted even if the adler32 checksum is correct.
    • Reminder:
      • ATLAS is looking forward to the 2018 pledge deployment. New disk space should be put into the ATLASDATADISK token.
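The EOS note above is a reminder that adler32 is a weak 32-bit checksum: a matching value does not by itself prove a file is intact. A minimal sketch of how such a checksum is typically computed chunk by chunk over a large file, using Python's zlib (the function name is illustrative, not an ATLAS tool):

```python
import zlib

def adler32_over_chunks(chunks):
    """Compute a streaming adler32 over an iterable of byte chunks,
    as storage systems typically do for large files."""
    checksum = 1  # adler32 starts at 1 by definition
    for chunk in chunks:
        # zlib.adler32 accepts the running value as its second argument,
        # so chunked and whole-file computations give the same result
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

# Example: the chunked checksum equals the whole-buffer checksum
data = [b"event data ", b"block 1", b" block 2"]
print(f"{adler32_over_chunks(data):08x}")
```

Because adler32 has only 32 bits and is not collision-resistant, two different payloads can share a value, which is why a correct checksum is consistent with the corruption scenario reported above.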

  • ALICE -
    • Lowish to normal activity on average

  • LHCb reports ( raw view) -
    • Activity
      • HLT farm fully running
      • 2017 data re-stripping almost 100% finished
      • Stripping 29 reprocessing is ongoing

    • Site Issues

Sites / Services round table:

  • ASGC:
  • BNL: overlay disabled in singularity on WNs, in light of the reported vulnerability
  • CNAF:
  • EGI:
  • FNAL: Last week scheduled downtime successful
  • IN2P3: NTR
  • JINR:
  • KISTI: 52h planned downtime for a power intervention, with a complete shutdown. Everything was restored, but a few issues with the routing path caused timeouts.
  • KIT:
    • Milestone storage resources 2018 deployed for all VOs.
    • Issues with full /var partitions for CMS on Wednesday last week and over the weekend. Installed cron jobs to mitigate and will talk to dCache.org about their very chatty CMS-TFC plugin.
    • This noon there was an at-risk downtime for adapting network configuration for deployment of IPv6 (in certain areas). To our knowledge, that was completely transparent to production activity.
    • We have discovered that over many years the archiving of ALICE data to tape failed on rare occasions: the flush to tape concluded successfully, but with 0 bytes transferred. We estimate that about 30k files since 2010 were affected by this problem, about two thirds of which are probably lost at GridKa. More safety checks were added to the workflow and the investigation is ongoing.
  • NDGF: NTR
  • NL-T1:
    • One of the dCache doors had incorrect java.security settings, causing some GridFTP handshakes to fail. GGUS:134451 was put on hold to wait and see whether the issue has been fixed.
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services:
  • CERN storage services:
    • Issue with the EOSLHCB namespace last night from 00:00 to 00:30; it seems related to sssd getting locked up and being unable to resolve UIDs
  • CERN databases:
  • GGUS:
    • Changes concerning the OSG ticketing system are being discussed among all parties involved
  • Monitoring:
    • Draft Availability report for March 2018 sent
  • MW Officer:
  • Networks: NTR
  • Security: NTR

AOB:

  • Discussion on whether SSLv3 can be retired from SRM endpoints
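As context for that discussion, a quick client-side check with Python's ssl module shows whether the local OpenSSL build still ships SSLv3 at all, and how a minimum protocol version can be pinned (a sketch; the TLS 1.2 floor is an illustrative choice, not a decision from the meeting):

```python
import ssl

# ssl.HAS_SSLv3 reports whether this OpenSSL build supports SSLv3;
# on modern distributions it is usually False, one practical argument
# for retiring it on the server side as well.
print("SSLv3 available in this OpenSSL build:", ssl.HAS_SSLv3)

# A client context can enforce a protocol floor explicitly:
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
print("Minimum negotiated version:", ctx.minimum_version)
```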
Topic revision: r15 - 2018-04-09 - MaartenLitmaath
 