Week of 160516

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 has already declared an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
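
The conflict check described above can be sketched as follows. This is only an illustration, not the tooling actually used by the SCOD or the sites: the `Downtime` record, its field names, the `find_outage_conflicts` helper and the site names `T1_A` to `T1_D` are all hypothetical, and real downtime data would have to be taken from the downtimes calendar. Two downtimes are treated as conflicting when both are classified as "outage", they belong to different sites, their time slots overlap, and the sites share at least one VO.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    # Hypothetical record; field names are assumptions, not a real schema.
    site: str
    severity: str        # e.g. "outage" or "warning"
    start: datetime
    end: datetime
    vos: frozenset       # VOs supported by the site


def find_outage_conflicts(downtimes):
    """Return pairs of "outage" downtimes at different sites that
    overlap in time and share at least one VO."""
    outages = [d for d in downtimes if d.severity == "outage"]
    conflicts = []
    for a, b in combinations(outages, 2):
        if a.site == b.site:
            continue
        # Two intervals overlap when each starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        if overlaps and a.vos & b.vos:
            conflicts.append((a, b))
    return conflicts


# Made-up example data for four hypothetical Tier-1 sites.
a = Downtime("T1_A", "outage", datetime(2016, 6, 14, 6), datetime(2016, 6, 14, 18),
             frozenset({"ATLAS", "CMS"}))
b = Downtime("T1_B", "outage", datetime(2016, 6, 14, 12), datetime(2016, 6, 15, 12),
             frozenset({"CMS"}))
c = Downtime("T1_C", "warning", datetime(2016, 6, 14, 6), datetime(2016, 6, 14, 18),
             frozenset({"CMS"}))
d = Downtime("T1_D", "outage", datetime(2016, 6, 14, 6), datetime(2016, 6, 14, 18),
             frozenset({"LHCb"}))

conflicts = find_outage_conflicts([a, b, c, d])
for first, second in conflicts:
    print(f"Conflict: {first.site} and {second.site}")
```

In this made-up example only `T1_A` and `T1_B` are reported: `T1_C` has declared a "warning" rather than an "outage", and `T1_D` shares no VO with the other sites. A check covering the current and the following two weeks would simply restrict the input list to downtimes in that window.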

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
        BNL     FNAL

Monday: Whit Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Maria D. (SCOD), Liana Lupsa (CERN DB), Andrea Manzi (MW), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Julia Andreeva (WLCG Ops), Maria A. (CMS & WLCG Ops).
  • remote: John Kelly (RAL), Elena Corni & Francesco Noferini (CNAF), Dmytro Karpenko (NDGF), Eric Lancon (BNL), Andrew Pickford (NL_T1), Di Qing (Triumf), Rolf Rumler (IN2P3), Daniele Bonacorsi (CMS).

Experiments round table:

  • ATLAS reports (raw view) -
    • Activities:
      • MC15c, derivation production, and Heavy Ion reprocessing ongoing
    • Problems:
      • T1 DATADISKs were getting full but had secondaries - more space at the end of the week
      • Problem with RAL TAPE robot from Tuesday up to the weekend (now a bit of backlog being cleared)
      • Some transfers to CERN failing with "end-of-file was reached globus_xio" (GGUS:121550)

  • CMS reports (raw view) -
    • Full utilization of the pledges at T1+T2, >100% pledges since >1 week, really OK, sites ok and WM/DM keeping up
    • MC prod: major DigiReco campaign with CMSSW v8 for ICHEP at full speed, >3B events produced out of 4B requested, ahead of schedule
    • busy days for T0 ops but ultimately managing to keep up, experts on top of issues
    • a couple of EOS quota adjustments needed (unmerged on May 11, t0streamer on May 15)

  • ALICE -
    • Lowish activity on average until Monday

  • LHCb reports (raw view) -
    • Site Issues
      • T0: Failing Transfers to CERN/EOS srm on Monday, fixed (GGUS:121582)
      • T1:
        • FZK: Tape staging on Saturday down to 0 but resurrected (GGUS:121541)
        • FZK: WNs couldn't access site SRM on Wed, fixed (GGUS:121374)

Sites / Services round table:

  • ASGC: not connected
  • BNL: NTR
  • CNAF: A downtime is scheduled for CMS from 17/05 11:00 UTC to 18/05 16:00 UTC to add 500 TB to the CMS storage. (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=20690)
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: The site will have a full-day downtime on June 14th. All services will be impacted for various durations. Among other things, power supply and network will be under maintenance. More details will follow one week before the outage.
  • JINR: (from Victor by email) We had some issues due to disk buffer overflow in dCache, but the problem was resolved and last week was smooth. Nothing else to report.

  • KISTI: not connected
  • KIT: not connected
  • NDGF:
    • Downtimes this week:
      • Tuesday 07:00-07:30 headnode updates
      • Tuesday 07:30-12:00 Pool updates.
      • Friday 07:00-10:00 pool updates, Norwegian sites
    • The full outage is usually around 6 minutes, and the pool updates are rolling updates that should not be visible to the projects.
      The downtimes are in GOCDB.
  • NL-T1:
    • Downtimes in the near future (SARA):
      • 19-05-2016 10:00 - 19-05-2016 11:00 Network redundancy test in preparation for our move to our new data center later this year. There is a chance that this test might affect LHCOPN and LHCONE traffic.
      • 30-05-2016 15:00 - 01-06-2016 15:00 We are going to merge our two tape environments (on the software level) so we can deploy our tape drives more dynamically. This should lead to a higher availability of our tape drives for our users. During this maintenance the data in our tape libraries will be unavailable.
      • 29-07-2016 10:00 - 15-08-2016 07:00 Tape storage system will be moved to new datacenter. Please plan your tape storing and staging activities around this period.
  • NRC-KI: not connected
  • OSG: not connected
  • PIC: (email from Pepe after the meeting) Tomorrow we are going to upload the SIR on the T10KD incident we had at PIC.
  • RAL:
    • There have been two problems at RAL during this last week:
      • On Monday (9th May) there was a problem with the cooling in the machine room. At around 16:30 local time the air-conditioning chillers and pumps stopped. Our running batch work was suspended and no new jobs were allowed to start. After around 30 minutes staff restarted the pumps and chillers and temperatures fell back. Batch jobs were then un-suspended and continued to run. In order to keep the rate of any temperature rise down should there be a recurrence overnight, new batch jobs were not started until the following morning. An outage for the CEs was declared in the GOC DB overnight.
      • There were problems with the Tier1 Tape Library from Monday (9th) through to Friday (13th). In the week beforehand one of the two "elevators" in the tape library had failed. This was not an operational problem. However, on the Monday the second "elevator" stopped working too. This led to some tape mounts failing. At the end of the afternoon the following day (Tuesday) there was a more severe failure within the library and all tape mounts stopped. While the vendor was working on the problem, some reconfiguration was done to make use of a second tape library, and from Thursday 12th we were able to write Tier1 data. The vendor fully fixed the problem on Friday (13th) and during that afternoon full tape access (read & write) was restored. The tape system was left "at risk" over the weekend.
  • TRIUMF: NTR

  • CERN computing services: no report
  • CERN storage services: no report
  • CERN databases:
    • Storage intervention affecting experiment integration/development databases performed successfully last week. We also used this time slot to apply the latest OS patches to the databases running on the host affected by the intervention.
  • GGUS: NTR
  • Monitoring: no report
  • MW Officer: The EOS ATLAS problem could not be reported on, because the ATLAS and Storage representatives were absent and a fire alarm obliged us to leave the building before the meeting ended.
  • Networks:
    • GGUS:120957 BNL->CERN->SARA resolved
    • GGUS:119820 ASGC connectivity, narrowed down to local Cisco N7K at ASGC, further testing needed to confirm
    • BNL->RAL consistent loss to RAL reported from BNL, but appears to have no impact on production traffic (no ticket)
  • Security: no report

AOB:

  • MW Readiness WG meeting tomorrow. Please check agenda http://indico.cern.ch/e/MW-Readiness_17 and join if interested.
  • A fire alarm obliged us to leave the building before the meeting ended. Nothing serious, here is the Computer Centre manager's report: The investigations from the fire brigade determined that two smoke sensors detected dust particles in the false floor of the computer centre tape vault. This was very likely related to maintenance works that are ongoing in this area.
Topic revision: r15 - 2016-05-18 - MariaDimou