Week of 150209

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Herve (Storage Services), Aris (Grid Services), Emile (Databases), Maarten (ALICE), Mark (ALICE), Alessandro (ATLAS), Xavier (SCOD + Storage Services)
  • remote: Felix (ASGC), Michael (BNL), Matteo (CNAF), Lisa (FNAL), Rolf (IN2P3), Sang-Uhn (KISTI), Onno (NL-T1), Rob (OSG), Pepe (PIC), Tiju (RAL), Christoph (CMS), Dimitri (KIT)

Experiments round table:

  • CMS reports -
    • Some issues with one of our internal web services, ReqMgr (Request Manager) - GGUS:111627
      • Some slowdown of centrally managed production and processing
    • Tape staging tests progressing
      • Targeting CCIN2P3 this week (as usual communication via GGUS ticket)

  • ALICE -
    • CERN: EOS-ALICE problem on Friday caused a large drop in activity for 5 hours

  • LHCb reports -
    • "Legacy Run1 Stripping" still looking at recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
    • T0: Many of the job failures at CERN appear to be related to the batch queue timeleft mechanism. Still investigating. (GGUS:111565)
    • T1: NTR

Sites / Services round table:

  • ASGC: Downtime extended until tomorrow
  • BNL: Following up the FTS3 issue reported by ATLAS
  • CNAF: ntr
  • FNAL: Major downtime tomorrow, updating EOS, dCache and kernels (nodes will be rebooted).
  • GridPP: np
  • IN2P3: Announcing a downtime on the 24th of February
  • JINR: np
  • KISTI: Scheduled intervention from the 11th to the 14th of February. A downtime of a few hours is expected during the firewall intervention (disk replacement).
  • KIT: ntr
  • NDGF: np
  • NL-T1: SIR ready for the incident involving the RAID controller failure that happened in the second week of January.
  • NRC-KI: np
  • OSG: Glide-in factory maintenance tomorrow.
  • PIC: ntr
  • RAL: ntr
  • TRIUMF: np

  • CERN batch and grid services:
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 16th of February.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
      • Alessandro (ATLAS) suggests sending mail to vo-admins before agreeing on the date.
  • CERN storage services: Proposed dates for upgrading EOS to the latest version: EOSALICE (Tue 17 Feb at 9:30) and EOSLHCB (Thu 19 Feb at 9:30)
  • Databases: ntr
  • GGUS: ntr
  • Grid Monitoring: ntr
  • MW Officer: ntr

AOB:

Thursday

Attendance:

  • local: Luca (SCOD + Storage), Aris (Batch), Andrew (LHCb)
  • remote: Christoph (CMS), Felix (ASGC), Michael (BNL), Lisa (FNAL), Rolf (IN2P3), Jeremy (GridPP), Di Qing, Dennis (NL-T1), Gareth (RAL), Rob (OSG), Thomas (KIT), Sang Un (KISTI), Alessandro, Andrej (ATLAS)

Experiments round table:

  • ATLAS reports -
    • Central Services/T0/T1
      • CERN-PROD: GGUS:111714 SRM (used only for deletion) was stuck - solved now. GGUS:111676 and GGUS:111701 are related to the Castor ATLSPECIAL pool, which is now being decommissioned - Thanks for the support; ATLAS is also trying to understand the issues (i.e. they are not purely CERN issues)
      • RAL-LCG2 GGUS:111709 - solved (an old squid node was left in AGIS)
      • TAIWAN-LCG2 GGUS:111664 - now solved. Site was in downtime - we need to follow up on whether it was declared or not.

  • CMS reports -
    • CRUZET started this Tuesday
      • Cosmic data taking with magnet off (Cosmic RUn at ZEro Tesla)
    • Problems with Tier-0 virtual machine provisioning
    • Tape staging test at Tier-1s ongoing
      • CCIN2P3 this and next week

  • ALICE -
    • NTR

  • LHCb reports -
    • Still finalising last few file statuses of Stripping21 campaign. User jobs and MC at the moment. Expect MC to ramp up soon.
    • T0: CERN job failures ticket still open (GGUS:111565) and on us to understand where the job time limit problem is.
    • T1: Thank you to T1s for bearing with us this week while we killed jobs from a user who had submitted a large number of tape requests.

Sites / Services round table:

  • ASGC: reminder about tomorrow's downtime from 7:30 to 12:30 UTC
  • BNL: NTR
  • CNAF:
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR:
  • KISTI: in scheduled downtime; DNS was turned off last night, it has now been restored and T1 services are recovering
  • KIT: NTR
  • NDGF:
  • NL-T1: NTR
  • NRC-KI:
  • OSG: NTR
  • PIC:
  • RAL: Downtime to be scheduled next week (still discussing the exact time)
  • TRIUMF: NTR

  • CERN storage services:
    • update of EOSALICE Tue 17 Feb from 9:30 to 11:30
    • update of EOSLHCB Thu 19 Feb from 9:30 to 11:30
  • CERN batch and grid services:
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 2nd of March.
      • It has been postponed by two weeks after requests from CMS and LHCb.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
    • The whole VOMS service will be unavailable on Wednesday the 18th of February from 9AM to 10AM due to a hardware intervention in the underlying database.
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

-- AndreaSciaba - 2014-12-16

Topic revision: r15 - 2015-02-12 - MaartenLitmaath
 