Week of 161107

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota.
  • General information about the WLCG Service can be accessed from the Operations Web.
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 should point out the existence of the conflict on the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
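
As a minimal illustration of the conflict check described above (a sketch only, not part of the official procedure; the site names, VO lists and time windows below are hypothetical), two "outage" downtimes conflict when their time windows overlap and the two Tier-1 sites support at least one common VO:

    from datetime import datetime

    # Hypothetical "outage" downtime records: (site, supported VOs, start, end).
    downtimes = [
        ("Tier1-A", {"ATLAS", "CMS"},  datetime(2016, 11, 14, 8),  datetime(2016, 11, 14, 18)),
        ("Tier1-B", {"ATLAS", "LHCb"}, datetime(2016, 11, 14, 12), datetime(2016, 11, 14, 20)),
    ]

    def conflicts(records):
        # Yield pairs of downtimes that overlap in time and share at least one VO.
        for i, (site1, vos1, start1, end1) in enumerate(records):
            for site2, vos2, start2, end2 in records[i + 1:]:
                if start1 < end2 and start2 < end1 and vos1 & vos2:
                    yield site1, site2, vos1 & vos2

    for site1, site2, shared in conflicts(downtimes):
        print(f"Conflict: {site1} and {site2} overlap for {sorted(shared)}")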

Links to Tier-1 downtimes

ALICE    ATLAS    CMS     LHCb
         BNL      FNAL

Monday

Attendance:

  • local: Kate D-W (DB, Chair), Andrew McNab (LHCb), Julia A (WLCG), Alberto A (Monitoring), Maarten L (WLCG), Marian B (network), Alessandro F. (storage), Vincent B. (security), Gavin M., Marcelo (LHCb), Ivan G. (ATLAS)
  • remote: Gareth (RAL), Vincenzo (EGI), Victor (JINR), Antonio F (CNAF), Christian (NDGF), Di Qing (TRIUMF), Dave M (FNAL), Xin Zhao (BNL), Luca L (CNAF), Onno (NL-T1), Max Fischer (KIT), Rolf (IN2P3), FaHui (ASGC)
  • apologies: CMS

Experiments round table:

  • ATLAS reports -
    • Production running fine, pre-MC16 samples submitted, preparing for Heavy Ion run.
    • Network intervention last Thursday morning, recovered fine.
    • Data loss at TAIWAN; GGUS:124597.
    • Tier-1 disks are getting full; constant automatic and manual re-balancing. Dedicated discussion this week on production input replicas.

  • CMS reports -
    • CMS Computing Week. Most probably no one from CMS will attend this meeting; please read the (few) news items.
    • CMS switched to pA mode, ready for data taking; no collisions yet (tomorrow morning?)
    • Data ReReco up to ~August finished and ready for analysts.
    • Moriond 17 (high PU) MC production started a few hours ago. Heavily relying on Xrootd for PU distribution.
    • The network intervention on Nov 3rd left CMS without Kibana for some hours; shifters were basically blind until the early afternoon.
    • Current outstanding tickets:

  • ALICE -
    • CERN: half of myproxy.cern.ch was broken last Tue (GGUS:124746)
    • CERN: two incidents caused short downtimes of EOS on Fri

  • LHCb reports -
    • Activity
      • Monte Carlo simulation, data reconstruction/stripping and user jobs on the Grid
    • Site Issues
      • T0:
        • LHCb will be OK with a purely HTCondor-CE service (i.e. without a CREAM CE as well).
      • T1:
        • RRCKI - very slow transfers caused by network problems have been resolved (GGUS:124538); the configuration is being optimised, then the ticket will be closed.

Sites / Services round table:

  • ASGC: The data loss is now confirmed. A SIR has been requested by Maarten and should be shared with the SCOD team.
  • BNL: The HPSS upgrade went well last week; the downtime ended one day early, with no interruption to production jobs.
  • CNAF:
    • We need some clarification from ATLAS about the new site movers. It is not clear to us which protocols are currently supported; it also seems there has been some lack of communication, given that we were not aware of such changes. Ivan promised to follow up, as no official document has been published.
    • Some CMS SAM tests are failing because of an issue with the storage, which we are investigating.
  • EGI: ntr
  • FNAL: ntr
  • GridPP: nc
  • IN2P3: One of the dCache core servers crashed last Monday afternoon due to a kernel bug. This led to an unscheduled downtime of about 3 hours, erroneously declared as "at risk"; our apologies for this. The service has been up and running since then, and we have updated our operations manual with respect to the downtime type. Finally, a heads-up: the site will be in maintenance downtime on December 6th, a Tuesday. As usual, details will be available one week before the event.
  • JINR: NTR, except for an error in the CMS_SAM test: "file:/dir..." was used instead of "file://dir..." (see the parsing sketch after this list). Ticket raised: GGUS:124860
  • KISTI: nc
  • KIT: ntr
  • NDGF: Storage downtime tomorrow 8-10 UTC. Final step in making dCache setup fully redundant.
  • NL-T1:
  • NRC-KI: nc
  • OSG: A ticket about the blocking of AFS callbacks has been reported.
  • PIC: nc
  • RAL: On Friday we had problems with the database behind our "test" FTS3 system. We asked ATLAS to move to the "production" service. We are working on a problem with glexec for CMS.
  • TRIUMF: One UPS inductor burned out last Thursday and the UPS has been in bypass mode for safety. Hopefully the replacement inductors will arrive and be installed on Tuesday. The whole data center will be down for about 3-4 hours for the replacement work.
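
Regarding the JINR item above, a minimal sketch (Python; the directory and file names are placeholders, not the actual paths from the ticket) of how the two quoted spellings are parsed differently: with a single slash everything after "file:" is taken as the path, while with a double slash the first component is interpreted as the authority (host):

    from urllib.parse import urlparse

    # The two spellings quoted in the JINR report above; paths are placeholders.
    for uri in ("file:/dir/subdir/data.root", "file://dir/subdir/data.root"):
        p = urlparse(uri)
        print(f"{uri}: netloc={p.netloc!r} path={p.path!r}")

    # file:/dir/subdir/data.root:  netloc='' path='/dir/subdir/data.root'
    # file://dir/subdir/data.root: netloc='dir' path='/subdir/data.root'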

  • CERN computing services: VOMS intervention tomorrow between 10:00 and 12:00.
  • CERN storage services:
    • Decommissioning of RFIO write operations in CASTOR. As announced at the ITUM in October (and in the SSB), we have started to inform users who are still sending data to CASTOR via RFIO. In fact, we have identified some residual production usage of RFIO also in the LHC instances. Experiments will be contacted for follow-up.
    • Several GGUS tickets have been opened for T2/T3 sites in order to address a firewall misconfiguration that blocks AFS callbacks and to make sure that they are aware of the AFS phase-out at CERN.
  • CERN databases: ntr
  • GGUS: ntr
  • Monitoring:
  • MW Officer:
  • Networks:
    • Investigated the NRC-KI to CERN issue; it was resolved by re-routing via Budapest, but the root cause of the high traffic in Amsterdam is not understood.
    • The UNL to Fermilab link was reported to suffer from packet loss; UNL is investigating.
    • Issues with RAL packet loss and MIT inbound throughput were resolved.
  • Security: NTR

AOB:
