Week of 170828

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota.
  • General information about the WLCG Service can be accessed from the Operations Web.
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
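
The conflict rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way the downtimes calendar actually works: the `Downtime` record, the severity labels, the function names and the three-week horizon (standing in for "the current and the following two weeks") are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime."""
    site: str
    vos: frozenset      # VOs supported by the site, e.g. {"ATLAS", "CMS"}
    severity: str       # "outage" or "warning" (assumed labels)
    start: datetime
    end: datetime


def find_outage_conflicts(candidate, declared):
    """Return declared outages at other sites that overlap the candidate
    in time and share at least one VO with it."""
    if candidate.severity != "outage":
        return []
    conflicts = []
    for d in declared:
        if d.site == candidate.site or d.severity != "outage":
            continue
        shares_vo = bool(candidate.vos & d.vos)
        # Two intervals overlap when each one starts before the other ends.
        overlaps = candidate.start < d.end and d.start < candidate.end
        if shares_vo and overlaps:
            conflicts.append(d)
    return conflicts


def scod_check(declared, now, horizon=timedelta(weeks=3)):
    """Return pairs of sites with conflicting outages inside the horizon
    (assumed here to be three weeks from `now`)."""
    window_end = now + horizon
    upcoming = [d for d in declared if d.end > now and d.start < window_end]
    pairs = []
    for i, a in enumerate(upcoming):
        # Compare each downtime only with later entries to avoid duplicates.
        for b in find_outage_conflicts(a, upcoming[i + 1:]):
            pairs.append((a.site, b.site))
    return pairs
```

A site would call `find_outage_conflicts` with its planned downtime before declaring it (steps 1 and 2), while `scod_check` mirrors the additional precaution taken once per SCOD shift.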

Links to Tier-1 downtimes

ALICE  ATLAS  CMS   LHCB
       BNL    FNAL

Monday

Attendance:

  • local: Kate (chair, DB, WLCG), Julia (WLCG), Maarten (ALICE, WLCG), Ivan (ATLAS), Vladimir (LHCb), Yolanda (storage), Paul (storage)
  • remote: Kyle (OSG), Luca (CNAF), Vincenzo (EGI), Di Qing (TRIUMF), Christoph (CMS), Sang Un (KISTI), Christian (NDGF), Victor (JINR)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Normal activities - derivations, overlay
    • Problems - nothing ongoing

  • CMS reports ( raw view) -
    • CPU utilization about 150k cores over the last week
    • CMS transfer system is under high load (likely even too high)
      • Sizeable backlog of data imports to some European T1s
      • Imports at 1-2 GB/s seem to fully utilize the available bandwidth
    • Some transfer issues
      • CCIN2P3
      • KIT
        • Proposal to adjust Phedex agent settings to improve transfer rate: GGUS:130160
      • Tape staging at CNAF
        • CMS transfer system stages data from CNAF disk, which is on disk elsewhere (not really intended by CMS): GGUS:130211
      • Large tape recall at RAL
      • Transfers from FNAL
      • Some files not being staged from FNAL tape

  • ALICE -
    • Continued high activity

  • LHCb reports ( raw view) -
    • Activity
      • Monte Carlo simulation, data processing and user analysis
    • Site Issues
      • T0:
        • Incomplete Python installation on worker nodes (GGUS:130018)
      • T1:
        • Failed transfers from IC to SARA (IPv6) (GGUS:129946)
        • Failed transfers from many sites to dCache sites (GGUS:130190)

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF:
    • Regarding ticket GGUS:130211:
      • This activity has caused physical damage to our tape robot. If it is necessary to continue, contact our support ASAP.
        • Maarten noted that improvement of tape efficiency was discussed during a recent operations coordination meeting and T1 sites were informed that they can impose a longer wait time for requests. A site has the right to protect its own resources and can do so by imposing restrictions and requiring users to verify their workloads.
          • Added after the meeting: it turns out those protections were already implemented at CNAF, yet their tape system still got hammered by problematic workflows that CMS are looking into.
      • Is it planned to make GGUS bidirectional, so that tickets to experiments can also be opened here?
        • Maarten commented that GGUS has always been bidirectional and that sites can open tickets to experiments. Christoph confirmed that GGUS:130211 has arrived at CMS.
    • CMS has implemented a policy of cancelling and purging queued pilots that are 1 hour old; this affects CEs strongly. Is this necessary?
      • Christoph commented that pilot purging was an attempt to improve CPU efficiency by removing pilots that wait too long for a payload. Pilots also exit by themselves in such cases, but the timeout was too long (30 mins, later changed to 10 mins). Maarten remarked that job cancelling is not a cheap operation and may well contribute to overloading CEs; this will be reported back to CMS submission infrastructure.
  • EGI: ntr
  • FNAL: nc
  • IN2P3: nc
  • JINR: Nothing to report.
  • KISTI: ntr
  • KIT: nc
  • NDGF: The Bergen site will move to a new room. All local services will be offline tomorrow, 7:30-16:00 CEST. We hope to have data back online by noon. ALICE data is affected. The declared downtime has the wrong time; it seems GOCDB thinks Sweden is on UTC: GGUS:130272
  • NL-T1: nc
  • NRC-KI: nc
  • OSG: OSG will not connect next week due to national holiday
  • PIC: NTR
  • RAL: Today is a Bank Holiday in the UK, so no one is available to attend. Approx. 20K ATLAS files were lost from Ceph/Echo storage last week; more details to follow.
  • TRIUMF: NTR

  • CERN computing services: nc
  • CERN storage services:
    • FTS: the FTS production cluster will be upgraded to v3.7.3 on Sept 4th from 10:00 to 11:00 CEST (OTG:0039413)
  • CERN databases: ntr
  • GGUS: ntr
  • Monitoring: nc
  • MW Officer: nc
  • Networks: nc
  • Security: NTR

AOB: Christian has opened ticket GGUS:130272 about GOCDB not converting UTC time properly.
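
To illustrate the kind of mismatch reported in that ticket, the sketch below converts the announced Bergen window (7:30-16:00 CEST) to UTC and shows the shift that appears if the local wall-clock time is instead taken to be UTC. It makes several assumptions: CEST is modelled as a fixed UTC+2 offset, the date 2017-08-29 is inferred from "tomorrow" relative to the Monday call, and the helper `to_utc` is hypothetical. It does not reproduce how GOCDB actually handles time zones.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST modelled as a fixed UTC+2 offset; a real tool would use a
# named zone so that daylight-saving transitions are handled automatically.
CEST = timezone(timedelta(hours=2), "CEST")


def to_utc(local_dt):
    """Convert a timezone-aware datetime to UTC; reject naive values."""
    if local_dt.tzinfo is None:
        raise ValueError("naive datetime: the time zone must be stated explicitly")
    return local_dt.astimezone(timezone.utc)


# Window announced in the minutes, on the assumed date.
start_local = datetime(2017, 8, 29, 7, 30, tzinfo=CEST)
end_local = datetime(2017, 8, 29, 16, 0, tzinfo=CEST)

start_utc = to_utc(start_local)   # 05:30 UTC
end_utc = to_utc(end_local)       # 14:00 UTC

# What the start becomes if the local wall-clock time is wrongly read as UTC.
start_misread = start_local.replace(tzinfo=timezone.utc)
shift = start_misread - start_utc  # two hours

print(f"Correct window: {start_utc:%H:%M}-{end_utc:%H:%M} UTC")
print(f"Misread start:  {start_misread:%H:%M} UTC (shifted by {shift})")
```

Rejecting naive datetimes in `to_utc` is a deliberate choice in this sketch: it forces the time zone to be stated wherever a downtime is entered, which is the step that appears to have gone wrong here.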

Topic revision: r18 - 2017-08-28 - MaartenLitmaath
 