Week of 170605

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally until 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will report the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
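The overlap check described above can be sketched as a small script. This is a minimal illustration only: the site names, VO sets, and downtime windows below are invented example data, not real schedule entries, and in practice the records would come from the downtimes calendar.

```python
from datetime import datetime
from itertools import combinations

# Hypothetical downtime records: (site, supported VOs, start, end).
# Real data would be taken from the Tier-1 downtimes calendar.
downtimes = [
    ("IN2P3", {"alice", "atlas", "cms", "lhcb"},
     datetime(2017, 6, 13, 6, 0), datetime(2017, 6, 13, 20, 0)),
    ("RAL", {"alice", "atlas", "cms", "lhcb"},
     datetime(2017, 6, 13, 8, 0), datetime(2017, 6, 13, 12, 0)),
    ("NDGF", {"alice", "atlas"},
     datetime(2017, 6, 20, 9, 0), datetime(2017, 6, 20, 17, 0)),
]

def outage_conflicts(downtimes):
    """Return pairs of Tier-1 'outage' downtimes that overlap in time
    while supporting at least one common VO."""
    conflicts = []
    for (s1, vos1, a1, b1), (s2, vos2, a2, b2) in combinations(downtimes, 2):
        overlap = a1 < b2 and a2 < b1   # the two time intervals intersect
        common = vos1 & vos2            # VOs served by both sites
        if overlap and common:
            conflicts.append((s1, s2, sorted(common)))
    return conflicts

for s1, s2, vos in outage_conflicts(downtimes):
    print(f"Conflict: {s1} vs {s2} for VO(s): {', '.join(vos)}")
```

With the example data above, the IN2P3 and RAL windows intersect on June 13th and the sites share all four VOs, so that pair is flagged; the NDGF downtime is on a different day and raises no conflict.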

Links to Tier-1 downtimes

  ALICE | ATLAS | CMS  | LHCB
        | BNL   | FNAL |

Monday: Whit Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Kate (DB, WLCG, chair), Maarten (ALICE, WLCG), Krystof (computing), Lorena (computing), Vincent (security), Belinda (storage), Herve (storage), Andrea M (MW, FTS), Roberto (storage)
  • remote: Di Qing (TRIUMF), Kyle (OSG), Luca (CNAF), Tommaso (CMS), Ulf (NDGF), Xin Zhao (BNL)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Grid production has been running at full capacity, up to ~300k cores, in the past week, dominated by MC16 campaign simulation. The derivation campaign on the reprocessed data15/data16 has not yet started in bulk; it is waiting for the git migration and for updates/fixes of the cache, expected in about a week.
    • Tier0 Grid resources have been increased to the pledges, 19k cores. Tier0 has been processing the special runs collected earlier.

  • CMS reports ( raw view) -
    • Fairly quiet, just the usual low-level problems.
    • 2016 Data ReReco almost done (~200M events missing out of a total of 6B injected; some more may come).
    • By the end of this week the preparatory work for 2017 MC injection should be finalized. Really massive MC production will not start for at least a week.
    • Most exciting event was probably GGUS:128788, citing EOS issues that were slowing transfers from P5. The CMS daily run meeting minutes reported that this was resolved, but the ticket hasn't been updated (or closed) with any information as to how. What is being done to make sure that this problem doesn't recur?
      • Herve commented that the EOS issue was identified: only a small number of disk servers was being used. A new EOS release with fixes for the discovered issues is being prepared.
      • Maarten asked what makes ATLAS and CMS different. Herve replied that ATLAS makes GSI calls much less frequently. Tommaso added that the number of calls in CMS has increased significantly since the issue was introduced, hence it is more visible now.
    • There was a CMS-IT meeting this morning on the subject, where IT proposed some measures. This is not the end of the story; some iterations will be needed.
    • HammerCloud has not been working for CMS since last Tuesday. It seems to have been solved today (no further information available at the moment).

  • ALICE -
    • continued very high activity

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF:
    • NTR
  • EGI: nc
  • FNAL: nc
  • IN2P3: Scheduled maintenance next Tuesday, June 13th: CEs and SEs will be down the whole day
  • JINR: Working on IPv6 for the Tier1 services. Done for the SE services.
  • KISTI: nc
  • KIT: nc
  • NDGF: ntr
  • NL-T1: nc
  • NRC-KI: nc
  • OSG: There is an issue with ticket synchronisation between the OSG system and GGUS. Guenter is investigating. Possibly some notifications from GGUS are not arriving.
  • PIC: nc
  • RAL: nc
  • TRIUMF: The tape system at the new data centre is online, adding ~6 PB of tape capacity.

  • CERN computing services: The SAM BDII machines were deleted this morning, after being decommissioned as previously announced (OTG:0037723). Please contact the BDII team in case any issues are discovered.
  • CERN storage services:
    • FTS
      • An issue with CRAB ASO transfer jobs was discovered last week, after we upgraded the FTS production servers to the latest OS packages (OTG:0037924). The FTS servers were downgraded first; we then discovered that the issue was due to a stricter Apache validation of the request headers. The problem has been fixed on the CRAB side.
      • The deployment of 2 Load Balanced HAProxy servers for the FTS production servers was performed today Tue 6 June at 10:00 CEST (OTG:0037930).
    • EOSCMS instabilities over the weekend.
    • IPv6 is being enabled for LHCb.
    • EOS glitches for ALICE and LHCb during the weekend.
  • CERN databases:
    • A rolling intervention to apply a patch to the CMSONR ADG database will be performed on Thursday
  • GGUS:
    • During last Wednesday's release we updated both BMC Remedy servers to the latest version, the production server and the stand-by server.
    • Unfortunately we faced a couple of problems which did not occur during the updates of the GGUS test instance.
    • Two unscheduled downtimes were needed to sort everything out. By 16:00 UTC the service was working again.
    • A SIR has been uploaded to the WLCGServiceIncidents archive.
  • Monitoring:
  • MW Officer:
    • As broadcast by C Aiftimiei, the EMI repos will be shut down on 15/06. Please make sure your site is no longer using those repos.
    • The first version of the WN meta-package for C7 has started the verification process in UMD (GGUS:128753). The plan is to have it ready for the UMD4 June release.
  • Networks: NTR
  • Security: NTR
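As a follow-up to the EMI repository shutdown mentioned in the MW Officer report, a site admin may want to scan the local yum configuration for leftover EMI repo definitions. This is a minimal sketch: the `emisoft.web.cern.ch` host name and the `/EMI/` path fragment are assumed patterns for the historical EMI repositories and should be adjusted to whatever URLs were actually used at your site.

```python
import re
from pathlib import Path

# Patterns suggesting an EMI repository URL (assumed; adjust to match
# the repo URLs actually configured at your site).
EMI_PATTERN = re.compile(r"emisoft\.web\.cern\.ch|/EMI/", re.IGNORECASE)

def find_emi_repos(repo_dir):
    """Return the names of .repo files under repo_dir that still
    reference EMI repository URLs."""
    hits = []
    for repo_file in sorted(Path(repo_dir).glob("*.repo")):
        text = repo_file.read_text(errors="replace")
        if EMI_PATTERN.search(text):
            hits.append(repo_file.name)
    return hits

# Typical usage on a worker node or service host:
#   find_emi_repos("/etc/yum.repos.d")
```

Any file reported by this check should be removed or repointed before the repositories disappear on 15/06, so that yum operations do not start failing afterwards.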

AOB:

Topic revision: r23 - 2017-06-06 - MaartenLitmaath