Week of 170306

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
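
As a purely illustrative aside, the sketch below implements the overlap check described above: it flags already-declared "outage" downtimes at other Tier-1 sites that overlap the proposed time slot and support a common VO. The Downtime structure and its fields are assumptions made for this example, not the GOCDB schema.

```python
# Minimal sketch of the Tier-1 "outage" overlap check (illustrative only).
from datetime import datetime
from typing import List, NamedTuple

class Downtime(NamedTuple):
    site: str                # Tier-1 site name
    vos: frozenset           # VOs supported by the site, e.g. {"atlas", "cms"}
    start: datetime          # start of the "outage" downtime
    end: datetime            # end of the "outage" downtime

def conflicts(proposed: Downtime, declared: List[Downtime]) -> List[Downtime]:
    """Return the already-declared outages that clash with the proposed one."""
    return [d for d in declared
            if d.site != proposed.site          # a different Tier-1
            and d.vos & proposed.vos            # at least one common VO
            and proposed.start < d.end          # and overlapping time slots
            and d.start < proposed.end]

# Example: a proposed outage overlapping an already-declared one for ATLAS.
declared = [Downtime("SITE-A", frozenset({"atlas", "cms"}),
                     datetime(2017, 3, 21, 6), datetime(2017, 3, 21, 18))]
proposed = Downtime("SITE-B", frozenset({"atlas", "lhcb"}),
                    datetime(2017, 3, 21, 12), datetime(2017, 3, 22, 12))
print(conflicts(proposed, declared))  # -> the SITE-A downtime is reported
```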

Links to Tier-1 downtimes

  ALICE   ATLAS   CMS    LHCb
          BNL     FNAL

Monday

Attendance:

  • local: Alberto (monitoring), Andrea (MW Officer + storage), Belinda (storage), Gavin (computing), Jesus (storage), Julia (WLCG), Maarten (SCOD + ALICE + GGUS), Marian (networks)
  • remote: FaHui (ASGC), Alexander (NLT1), Christoph (CMS), Dario (ATLAS), David B (IN2P3), David M (FNAL), Di (TRIUMF), Dimitri (KIT), Elena (CNAF), Elizabeth (OSG), Jens (NDGF), John (RAL), Leonardo (Sussex), Sang-Un (KISTI), Stefan (LHCb), Victor (JINR), Vincent (security), Vincenzo (EGI)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Smooth operations, running on ~300k job slots. New production campaign MC16a continues.
    • Reprocessing is still due to start by the beginning of March, with data16 first, then data15. Staging of RAW files from Tier-1 tapes started 2 weeks ago but is not yet completed, especially at KIT.
    • Deletion through xrootd is not working at several US sites; deletion was reverted to SRM (see the illustrative sketch below this report).
    • Kibana central services monitoring showed grey for 24 hours between Friday night and Saturday. It has been back working since then, but there was no feedback on the GGUS ticket, nor on the internal SNOW ticket.
      • see the Monitoring report below
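
As a purely illustrative aside on the xrootd deletion item above: a minimal gfal2-python sketch of issuing a deletion over xrootd and falling back to SRM. The URLs are invented placeholders; this is not the actual ATLAS deletion machinery.

```python
# Illustrative sketch only: delete a replica through the xrootd protocol and,
# if that fails, retry the same deletion through SRM.
import gfal2

ctx = gfal2.creat_context()

xrootd_url = "root://xrootd.example-site.org//atlas/path/to/file"       # placeholder
srm_url = "srm://srm.example-site.org/pnfs/example/atlas/path/to/file"  # placeholder

try:
    ctx.unlink(xrootd_url)   # deletion through xrootd
except gfal2.GError as err:
    print("xrootd deletion failed, retrying via SRM:", err.message)
    ctx.unlink(srm_url)      # fall back to SRM
```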

- Dario: what was the explanation of last week's FTS issue at BNL?
Might we want a SIR for that?
- Andrea: the FTS DB got corrupted and had to be reinstalled;
when ATLAS tried to set up the activity shares again, they found a problem related to an incorrect DB schema;
BNL together with the FTS developers have fixed this
- Maarten: was the issue bad enough to warrant a SIR? Probably not...
The matter will be recorded in the Service Report for the MB next week.

  • CMS reports ( raw view) -
    • CMS General
      • Rather modest production and processing activity
      • Mid Week Global Run #2 (MWGR#2) took place at the end of last week
        • Detector commissioning
    • Frontier/Squid infrastructure basically brought down on Thursday
      • Affected basically all jobs - no access to calibration data
      • Traced back to a bad configuration on our Launchpads during Puppet4 migration
      • Some data now in the caches with a bad (= too long) lifetime
      • Restarting site squids clears that out
    • Some trouble with monitoring infrastructure at CERN:
      • Propagation of one Dashboard metric stuck GGUS:126940 (problem identified)
      • DNS entry for SAM3 Dashboard not accessible GGUS:126941 (fixed)
      • Meter/ElasticSearch information not updating GGUS:126945 (fixed)
        • Alberto: the Meter service got filled with logs at 10x the usual rate,
          due to lots of restarts for the Puppet 4 migration; an alarm was finally sent
          late on Friday evening; the problem was cured by adding extra nodes
    • Investigating some problems with the new GFAL2 plugin in FTS3
      • It has been reverted on Monday
      • see the FTS item in the CERN storage services report below

  • ALICE -
    • Very low activity during the weekend and lowish activity last week,
      due to major issues affecting the central services:
      • A bad RAID controller slowed down I/O and had to be replaced
      • Frequent MySQL crashes, finally resolved by restoring the DB
    • Sang-Un: the SAM A/R of the SEs were affected
    • Maarten: we will check if corrections need to be applied

  • LHCb reports ( raw view) -
    • Activity
      • MC Simulation; a Stripping campaign is to start this week, which will increase the load on T1 tape systems
    • Site Issues
      • T0:
        • Wed: ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights.
          • Gavin: a bug has been identified and should soon be fixed
        • Observed CVMFS failures on batch and cloud machines (GGUS:126876). The failure rate has decreased now
          • Gavin: probably due to high I/O wait, whose cause has not yet been identified
      • T1:
        • SARA: SRM problems over the weekend (GGUS:126937). Currently cannot test whether it is fixed, because the site is in downtime
        • FZK: Switching to ARC-CEs only. Last week a software update for the ARC-CEs produced failures (GGUS:126882). CREAM-CE submission has already been stopped from the LHCb side.
        • PIC: Currently in downtime for a dCache upgrade. Batch is closed but the CEs are open, which produces aborted pilots on the LHCb side.
          • Maarten: the CE queue states should be set to something other than Production

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • EGI: NTR
  • FNAL: NTR
  • IN2P3: During the maintenance downtime on March 21st, there will be an update of the router, which will disconnect the site for around 1 hour (6:00 to 7:00 UTC). The services (CE, SE) will be off during most of the day. As usual, details will be available one week before the event.
  • JINR: All OK on the site last week, but the farm is still almost empty. Tape robot maintenance on 10.03.2017, 8:00-12:00 UTC; only reading from tapes is affected.
    • Christoph: there were data transfer issues which led to the low usage; we will follow up
  • KISTI: NTA
  • KIT:
    • Dimitri: the retirement of the CREAM CEs starts tomorrow.
      We also need to do a firewall maintenance soon, with several ~5-minute interruptions over 1-2 hours:
      should it be declared as an outage or is at-risk good enough?
      • at-risk is OK for all experiments
  • NDGF:
    • Jens: a transparent dCache upgrade to version 3.0.10 is foreseen this week
  • NL-T1:
    • Last Wednesday, the network at SURFsara went down. A restart of a Juniper QFabric director fixed the problem. The outage lasted from 20:30 to 22:00 CET.
    • LHCb submitted GGUS:126937. Some transfers to/from SURFsara SRM end with a "No route to host" while others succeed. Investigating. GGUS has been updated with a request for additional information.
    • This morning we had a scheduled downtime to get one step closer to IPv6. Apologies for the short notice; we scheduled it this way to avoid overlapping with the FZK downtime.
  • NRC-KI:
  • OSG: NTR
  • PIC: PIC will be in complete scheduled downtime during working hours (09:00 to 18:00 CERN time) on next Tuesday, March 7th, in order to upgrade our dCache to version 2.16. (J. Flix is not present today, since I am at the ISGC 2017 conference.)
  • RAL: At risk on Wednesday (08/03/2017) while we enable IPv6 on Tier-1 routers; otherwise NTR.
  • TRIUMF: NTR

  • CERN computing services:
    • Multiple services will be affected on Wed 15th (06.00-08.30) and Wed 22nd (06.00-08.30). Watch the ITSSB for the full list.
  • CERN storage services:
    • FTS
      • Transfer issues after the gfal2 upgrade last week (OTG:0036193, GGUS:126946 and GGUS:126955). This affected transfers from SRM CASTOR/StoRM to gridftp with checksum verification enabled (an illustrative gfal2 sketch follows at the end of this round table). We have downgraded both the production and the pilot clusters to the previous version of gfal2, and the issue has been fixed. RAL was also running this version and has already downgraded.
    • EOSCMS: Intervention tomorrow between 07:30 and 08:30 CERN Time in order to apply a configuration change (explicitly map AAA T0 proxies to a known identity)
    • CASTOR* and EOS*: Partially unavailable on Wed 15th 06.00 - 08.30 and Wed 22nd 06.00 - 08.30 because of a network maintenance.
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Draft reports for February distributed. Recomputations done.
    • METER problems: OTG:0036165
    • MONIT problems: OTG:0036182
    • The AUP signature expired for the certificate used in SAM ATLAS tests; once reported, it was quickly fixed by ATLAS (INC:1302417)
    • Issue with wlcg-sam.cern.ch from 4 March to 6 March: the web server was unavailable due to full disk space on all the machines behind the alias. GGUS:126941 solved
    • Issue with CMS SSB being investigated (GGUS:126940)
  • MW Officer: NTR
  • Networks:
  • Security: NTR
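
As a purely illustrative aside on the FTS item above (CERN storage services): a minimal gfal2-python sketch of a checksum-enabled copy of the kind that was affected, i.e. from an SRM source to a gridftp destination with checksum verification turned on. The URLs are invented placeholders, error handling is minimal, and this is not the FTS code itself.

```python
# Illustrative sketch only: an SRM -> gridftp copy with checksum verification.
import gfal2

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.overwrite = True         # replace the destination file if it exists
params.checksum_check = True    # verify source/destination checksums

src = "srm://srm.example-source.org/castor/example/path/to/file"     # placeholder
dst = "gsiftp://gridftp.example-dest.org/pnfs/example/path/to/file"  # placeholder

try:
    ctx.filecopy(params, src, dst)
except gfal2.GError as err:
    # With the affected gfal2 version, such copies failed at the checksum step
    print("transfer failed:", err.message)
```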

AOB:
