Week of 190107

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Kate (WLCG, chair), Remi (storage), Ivan (ATLAS), Miro (DB), Alberto (monitoring), Borja (monitoring)
  • remote: Alexander (NL-T1), Christoph (CMS), Di (TRIUMF), Dave M (FNAL), Dmytro (NDGF), Jeff (OSG), John (RAL), Marcelo (CNAF), Raja (LHCb), Xin (BNL), David B (IN2P3), Pepe (PIC)

Experiments round table:

  • ATLAS reports -
    • Production - relatively smooth operation - 400k slots on average. Currently - everything normal.
      • AGLT2 with storage problems - blacklisted (whitelisted on 04.01.). As a result - many transferring jobs, emptying Sim@P1.
      • Taiwan queues stuck since 02.01 - wrong configuration - fixed.

  • CMS reports -
    • Best wishes for 2019!
    • Very good CPU utilization during Xmas break
      • typically 200k CPU cores (and beyond) for production
      • typically 30-40k CPU cores for analysis
    • No major issues
    • Some users reported problems getting VOMS proxies due to one VOMS server with an expired certificate
      • Reported by ATLAS user: GGUS:139038
      • (Recent enough) clients should fall back to another server

  • ALICE -
    • Best wishes for 2019!
    • High activity on average in the last 3 weeks: 140k+ jobs on average
      • Lots of everything: MC, reconstruction, analysis trains, user jobs
      • Almost no issues and nothing major
      • Thanks to the sites!

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User, WG processing and MC jobs
    • Site Issues
      • CERN : Curious to know the status of the CERN cloud T-systems (GGUS:139080), RHEA (GGUS:138848)
      • CERN : Also ran out of space in EOS "recycle-bin" (GGUS:139077) earlier today. Requesting a shorter retention period for now, before we decide on further measures
      • RAL : Aborted pilots (GGUS:139081)
    • Also a few other issues over the holiday period which were resolved either internally or through GGUS tickets.

Sites / Services round table:

  • ASGC: nc
  • BNL: dCache downtime scheduled for Jan 22nd
  • CNAF: Happy New Year ragazzi! and NTR
  • EGI: nc
  • FNAL: NTR
  • IN2P3: Happy new year to everybody! NTR for IN2P3-CC.
  • JINR: NTR
  • KISTI: nc
  • KIT:
    • Uneventful Christmas and New Year's break, nothing broke/failed to our knowledge.
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI: nc
  • OSG: NTR
  • PIC: Happy 2019!!! We will have an at-risk downtime on 15 Jan, lasting approx. 2h, to upgrade the tape storage system. We can only declare this for LHCb, since we have Nearline SRM for them. If CMS and ATLAS are interested in having this, please contact me (J. Flix).
  • RAL: NTR. Catalin is aware of the LHCb ticket and is fixing arc-ce04.
  • TRIUMF: On Tuesday there will be a 6-hour downtime to update dCache, replace one of the DDN storage controllers, and update the firmware of the VM server, Oracle backend, etc.

  • CERN computing services: NC
  • CERN storage services:
    • EOSLHCB:
      • GGUS:139067 - headnode dropped out of external firewall since puppet was off, issues for external users. Fixed.
      • GGUS:139077 - recycle bin was full so could not delete anymore (after 1PB in 1 week). Workaround applied while LHCb decides what to do.
    • EOSATLAS: got tentative OK to switch off SRM-EOSATLAS.
      • Ivan commented that this was already disabled on the ATLAS side.
    • CASTORALICE : turned off tape recalls since an ALICE service incorrectly attempted to access data from CASTOR instead of EOS. Resolved this morning.
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Final reports for the Nov 2018 availability sent around
    • Draft reports for the Dec 2018 availability sent around
  • MW Officer: NC
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r24 - 2019-01-07 - MaartenLitmaath
 