Week of 111003

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Dirk, Jamie, Maria, Cedric, Peter, Maarten, Dan, Mattia, Massimo, Ulrich, Eva);remote(Burt, Michael, Onno, Rolf, Vladimir, Rob, Tiju, Maria Francesca).

Experiments round table:

  • ATLAS reports - Dan
    • T0/Central Services
      • t0merge file access issue (GGUS:74838) requires an update to CASTOR release 2.1.11-6. Scheduled for Tuesday 10:00-12:00 by CASTOR operations.
    • T1 sites
      • Since Sunday morning, RAL MCTAPE bring-online failures (GGUS:74856): "did not answer within the requested time" and "SRM_FILE_UNAVAILABLE".
      • Taiwan: a few hundred production jobs failed due to little disk space left on the WNs (GGUS:74858). Retries succeeded.
    • T2 sites

  • CMS reports -
  • LHC / CMS detector
    • 2 large fills over the weekend: acquired 210 pb-1 (recorded 215.5 pb-1)
  • CERN / central services
    • A security vulnerability warning on several CMS job submission boxes (WMAgent) triggered late-Friday action by the CMS VOC, who closed most ports on 2 CERN VOBoxes but omitted to leave 2 essential ports open for production. This blocked most CMS central production/processing over the weekend. The problem was fixed Monday morning.
  • T0 and CAF:
    • Transfer blockage onto EOS showed that CMS needs to improve its quota management on EOS
  • T1 sites:
    • Fall 11 MC reprocessing on-going
  • T2 sites:
    • MC workflows (also with WMAgent) and analysis
  • AOB
    • next CMS CRC yet to be found...

  • ALICE reports -
    • T0: ntr
    • T1 sites : ntr
    • T2 sites: usual operations

  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN only
    • Reprocessing at all T1 sites and a few T2 sites
  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751), waiting for user reply [ Massimo - suggest to close and reopen if it comes back. Vladimir - cannot reproduce so ok to close. ]
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
    • IN2P3: Increased number of stalled jobs (GGUS:74733), upgrade of CREAM-CE has fixed the problem.
    • Gridka: Observing problems with jobs with many input files when accessing storage via protocol, mostly user jobs affected by this.
    • Gridka: missing results from nagios tests for CE, fixed yesterday around 8PM
    • SARA: (GGUS:74875) We have a problem with staging requests that are not responding for SARA-RAW. [ Onno - some problem with the tape units is under investigation; will check after the meeting. ]
  • T2 sites:


Sites / Services round table:

  • ASGC:
  • BNL : ntr
  • CNAF:
  • FNAL: ntr
  • IN2P3: reminder of major outage tomorrow - should have received notifications
  • KIT:
  • NDGF: ntr
  • NLT1: nta
  • PIC: ntr
  • RAL: currently seeing timeouts - not affecting us too much
  • OSG: ntr

  • CASTOR/EOS: On Friday ATLAS and CMS were contacted to agree on an upgrade of EOS (tomorrow, Tue 4-Oct-2011, starting at 10:00). Details are available in GOCDB/ITSSB (transparent update).
  • Dashboards: ntr
  • Databases: ntr
  • Grid services: ntr

AOB:

Tuesday:

Attendance: local(Peter, Cedric, MariaDZ, Maria, Jamie, Uli, Ookey, Alessandro, Ignacio, Maarten, Mattia, Eva, Massimo);remote(Gonzalo, Michael, Xavier, Paco, Burt, Rolf, Vladimir, Maria Francesca, Tiju, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • Transparent intervention on Castor/EOS today. No problem noticed
  • T1 sites
    • Lyon: many services in scheduled downtime today, in particular LFC. The whole cloud is offline in Panda and DDM.
    • RAL_MCTAPE bring-online failures (GGUS:74856): two tapes were disabled, but some errors were still seen in the last hours.
    • SARA: problem reported last week about ATLAS jobs stuck on new worker nodes added to an existing CE. It was due to a driver issue with CentOS 5.6.
  • T2 sites
    • Long standing issue regarding degraded transfer T0 - AGLT2 (GGUS:73463) fixed.


  • CMS reports -
    • nothing to report (at the last minute a wrong ticket was opened against IN2P3 and closed with apologies, as they are in downtime)
    • AOB
      • Peter Kreuzer will participate in the WLCG daily call for Markus Klute (CRC)

  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN only
    • Reprocessing at all T1 sites and a few T2 sites
  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751) [ Massimo - will contact Joel offline. Ignacio - as posted in ticket more info is required to debug ]
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
    • Gridka: Observing problems with jobs with many input files when accessing storage via protocol, mostly user jobs affected by this.
    • SARA: (GGUS:74875) We have a problem with staging requests that are not responding for SARA-RAW.


Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • IN2P3: we are in scheduled downtime. Everything is going fine except for Oracle, where there is a problem with a disk. In consequence the downtime for the LHCb LFC had to be extended. We still hope to keep the other times as planned.
  • KIT: ntr
  • NDGF: ntr
  • NLT1: comment on the SARA ticket: the problem is solved. Two directories were set to the same filesystem; they are now separated and things are working much better. The ticket is closed from their side and just needs to be verified.
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: nta
  • Dashboards: ntr
  • Databases: ntr
  • Grid services: 1 LCG CE got overloaded but is back now

AOB:

Wednesday

Attendance: local(Cedric, Maria, Jamie, Luca, Uli, Mattia, MariaDZ, Massimo, Maarten, Edoardo);remote(John, Michael, Lisa, Markus, Maria Francesca, Rolf, Vladimir, Jhen-Wei, Lorenzo, Rob, Ron, Dimitri).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • CERN-PROD castor xroot redirector auth issue (GGUS:74964) [ Massimo - seems to be a problem with a specific ID - will discuss offline ]
  • T1 sites
    • Lyon: came back online yesterday evening, but there was a problem with the conditions DB (fixed this morning).
    • ASGC: tests were failing because they were killed by the testing robot; the problem was that some jobs had accumulated in the queues and hence took too long. Today all looks fine.
  • T2 sites
    • ntr


  • CMS reports - CAF job submission problem in the morning. Solved, see ticket.

  • T0 site
    • Nothing to report

  • T1 sites
    • CNAF: GGUS:74963 - delegation of the proxy to one of the CREAM CEs is not working [ Lorenzo - will look into it and hopefully solve ASAP and update the ticket ]
  • T2 sites
    • Usual operations


  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN only
    • Reprocessing at T1 sites and a few T2 sites
    • Only 1000 jobs running at CERN - any reason why? A: did you open a ticket? Can't answer without checking - please open a ticket.
  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751)
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
    • Gridka: Observing problems with jobs with many input files when accessing storage via protocol, mostly user jobs affected by this.
    • SARA: (GGUS:74875) We have a problem with staging requests that are not responding for SARA-RAW.
    • IN2P3: (GGUS:74961) Cannot get pilot status at the CREAM CEs. Solved.


Sites / Services round table:

  • ASGC: ntr
  • BNL :
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: finished the downtime yesterday essentially on time, except for the LHCb and conditions DB problems. Had to omit some updates to keep to the schedule. Various problems with Oracle; now OK. Will probably revisit the internal architecture on the Oracle side. Q: is replication re-established? A: yes, all OK now.
  • KIT: ntr
  • NDGF: 1) Currently having connection problems between pool and head nodes due to a fibre break; some ALICE and ATLAS data might not be available. The break started 1h ago and will hopefully be solved soon; people are working on the fibre. 2) Tomorrow there is a scheduled downtime for the head node of the system; some ATLAS data will not be available from 10:00 CET for 1h.
  • NLT1: last night a pool node crashed and this morning another one crashed, probably due to the driver of the 10GbE cards. Today new storage was put into production; now fulfilling the 2011 pledges.
  • PIC:
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: nta
  • Network: ntr
  • Dashboards: ntr
  • Databases: investigating latency for CMS online to offline for PVSS data. Currently 17h.
  • Grid services: ntr

AOB: (MariaDZ)

  1. Usage of the field Type of Problem (ToP) in TEAM tickets since its entry into production on 2011/09/28 click here.
  2. Usage of the field ToP in ALARM tickets since its entry into production on 2011/09/28 click here.
  3. The search tool of points 1 and 2 above is temporary. The GGUS search engine will be changed to include ToP as a field to tick and display at the next GGUS Release 2011/10/19.
  4. No T1SCM tomorrow but, if GGUS tickets of concern to experiments and/or sites don't get appropriate support, please email MariaDZ for a short presentation at tomorrow's daily meeting. No T2 issues please as the relevant partners don't join this meeting.

Thursday

Attendance: local(Cedric, Jamie, Uli, Ookey, Dawid, Peter, Alessandro, Maarten, MariaDZ, Mattia);remote(Lisa, Lorenzo, Ronald, Kyle, Gareth, Vladimir, Rolf, Maria Francesca).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • CERN-PROD castor xroot redirector auth issue (GGUS:74964). It was due to a change of the certificate ATLAS uses to read data from the T0 (changed in September, but the grid-mapfile was not updated; a minimal grid-mapfile check is sketched after this report). [ Ale - worked with IT-DSS to understand why the xroot redirector is checking both GSI and Kerberos authentication. The problem is not fixed yet - still working on it; the first problem was just part of the whole issue. ATLAS is issuing disk-to-disk copies - not expected; working with IT-DSS to understand. ]
    • CERN-PROD_DATATAPE -> DATADISK transfer error (GGUS:74977) : Node giving errors was overloaded
    • Load on ATLR due to the trigger reprocessing.
  • T1 sites
    • Transfer errors from SARA (GGUS:74931 reopened): "Indeed, at the end of the afternoon a pool node crashed again. We have strong suspicions that the latest CentOS 5.7 kernel upgrade got us a kernel that does not like the 10GigE cards of our older nodes very much. In the meantime we have downgraded the machines to an older kernel and everything is still up and running now."
  • T2 sites
    • Transfer errors from/to IL-TAU-HEP (GGUS:74967) + Weizmann. Problem due to bad CRL.
    • Some problem reported for praguelcg2 (SRM_ABORTED), GRIF-LAL (GRIDFTP_ERROR), DESY (network problem between Hamburg and Zeuthen), UKI-SOUTHGRID-OX-HEP (SE issue)
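
A minimal sketch of the stale-mapping check relevant to GGUS:74964 above. It assumes the standard grid-mapfile format of one quoted DN plus a local account per line; the path and DN in the example are hypothetical, for illustration only, and not the actual ones involved in the ticket.

```python
# Sketch only: check whether a given certificate DN is still mapped in a grid-mapfile.
# Assumes the usual format:  "<DN>" <local account>   (one mapping per line).
import re

def dn_mapped(gridmap_path: str, dn: str) -> bool:
    """Return True if the quoted DN appears in the grid-mapfile."""
    pattern = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<account>\S+)')
    with open(gridmap_path) as f:
        for line in f:
            m = pattern.match(line.strip())
            if m and m.group("dn") == dn:
                return True
    return False

if __name__ == "__main__":
    # Hypothetical path and DN, purely for illustration.
    path = "/etc/grid-security/grid-mapfile"
    dn = "/DC=ch/DC=cern/OU=computers/CN=some-atlas-t0-service.cern.ch"
    print("mapped" if dn_mapped(path, dn) else "NOT mapped - grid-mapfile may be stale")
```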


  • CMS reports - Problem with CMS services this morning. Suspected Oracle problem and issued a ticket. Restarting services fixed CMS problems. We are monitoring the situation.


  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751)
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
    • GRIDKA: Observing problems with jobs with many input files when accessing storage via protocol, mostly user jobs affected by this.
    • NIKHEF: (GGUS:74976) Jobs failed, pilots aborted. Solved
    • RAL - aware of problem on CE and looking at it now.


Sites / Services round table:

  • ASGC: a CMS disk server failed yesterday; the motherboard was changed and the server is fixed.
  • BNL :
  • CNAF: decommissioning the LCG CE; starting to use only CREAM at CNAF. Contact us with any questions. The CEs involved are ce01 and ce07-lcg.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT:
  • NDGF: ntr
  • NLT1: downtime announcement Monday from 16:00 local to Tuesday 17:00 - compute nodes at NIKHEF will be unavailable due to maintenance on power systems.
  • PIC:
  • RAL: nta
  • OSG: ntr

  • CASTOR/EOS:
  • Dashboards:
  • Databases: nothing to report except for the CMS PhEDEx issue: execution plan instability, which causes some performance problems. Also reboots of the CMS offline DB; the issue is still under investigation (at 08:30 two nodes rebooted spontaneously).
  • Grid services:

AOB: (MariaDZ)

  • Following yesterday's AOB discussion on ToP please comment in Savannah:123890 with arguments in favour or against making ToP mandatory.
  • WLCG shifters, being heavy GGUS users, please comment in Savannah:120505 if you agree with the request for GGUS to generate an email notification to the GGUS ticket submitter.
  • GGUS usage from the WLCG community very often involves direct routing to sites (by selecting a value from the "Notify Site" drop-down). Please comment in Savannah:122581 on the suggestion to offer the possibility to check the site availability published in GOCDB before submitting a GGUS ticket 'against' a given site (a minimal sketch of such a check follows this list).
  • The duplicate email notifications received from GGUS and reported by the T0 service managers were due to a wrong flag configuration that "slipped in" with the GGUS Release of 2011/09/28.
  • The tickets showing "User notification: on every change" seem to be a choice of the submitters. E.g. GGUS:48685 and many others hold: "User notification: on solution".
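
As a minimal sketch of the GOCDB check suggested in the list above: query GOCDB for ongoing downtimes of a site before ticketing it. The public programmatic interface endpoint, the get_downtime method, the ongoing_only parameter and the XML element names used here are assumptions about the GOCDB PI, not something stated in these minutes; the site name is only an example.

```python
# Sketch only: list downtimes GOCDB currently reports for a site before opening a ticket.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def ongoing_downtimes(site: str) -> list:
    """Return 'SEVERITY: DESCRIPTION' strings for the site's ongoing downtimes."""
    query = urllib.parse.urlencode(
        {"method": "get_downtime", "topentity": site, "ongoing_only": "yes"}
    )
    url = "https://goc.egi.eu/gocdbpi/public/?" + query
    with urllib.request.urlopen(url, timeout=30) as reply:
        root = ET.fromstring(reply.read())
    return [
        "%s: %s" % (dt.findtext("SEVERITY", "?"), (dt.findtext("DESCRIPTION") or "").strip())
        for dt in root.iter("DOWNTIME")
    ]

if __name__ == "__main__":
    for entry in ongoing_downtimes("IN2P3-CC"):  # example site name, for illustration
        print(entry)
```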

Friday

Attendance: local(Cedric, Jamie, Maarten, Lukasz, Uli, Peter, Alessandro, Ookey, Massimo);remote(Xavier, Rolf, Vladimir, Lisa, Michael, Alexander, Kyle, John, Thomas, Lorenzo).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • CERN-PROD castor xroot redirector auth issue (GGUS:74964) solved
    • T1 sites: ntr
    • T2 sites: ntr

  • CMS reports -
    • Proxy renewal from myproxy.cern.ch failing Thursday night. Issued ALARM ticket GGUS:75056 at 17:43 UTC, work-around put in place at 18:43 UTC.
    • The GGUS ticket was solved and verified last night, but the SNOW ticket was only opened today at lunch-time.

  • ALICE reports -
    • myproxy.cern.ch stopped working Thu early evening after the host certs of the machines got updated; ALARM GGUS:75055 opened at 19:35 CEST, solved ~21:40
      • this ticket and the one from CMS led to a deluge of SNOW tickets due to some communication problem between SNOW and GGUS, which looks solved now (confirmed by Guenter Grein / SCC)

  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751) Closed, problem with content of file.
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
    • CERN: Pilots aborted at ce130.cern.ch (GGUS:75068). Reopened. [ Uli - was fixed before I got ticket so closed ticket again. Vladimir - reopened as we still have problems. ]
  • T1 sites:
    • GRIDKA: Observing problems with jobs with many input files when accessing storage via protocol, mostly user jobs affected by this.
    • RAL: (GGUS:75004) Pilots aborted at lcgce08.gridpp.rl.ac.uk. Solved.


Sites / Services round table:

  • ASGC: some transfer failures in ATLAS scratch disk due to "file exists" issues. No further errors in last 10 hours.
  • BNL : ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: one cluster is a bit flaky in accepting jobs; looking at it. Also flaky network to the dCache pools in Finland since this morning; network people are looking into it - the pools are currently not available. During the night Sun/Mon there will be maintenance on the fibre connecting Sweden and Finland, so most pools will be offline for 5h. Monday evening there is another fibre maintenance affecting some of our pools.
  • NLT1: a few file servers lost their network connection. Traced to an old driver; a work-around was created and the drivers are now being upgraded.
  • PIC:
  • RAL: ntr
  • OSG: did not notice any problems from the GGUS outage - all OK.

  • CASTOR/EOS: some pools might show up red in the monitoring as they are being retired.
  • Dashboards: ntr
  • Databases:
  • Grid services: (myproxy) (Uli) - At around 17:00 UTC Nagios probes complained about an expiring certificate, so the certificate was renewed. The nodes should have host certificates whose subject matches the name of the host (a minimal check of this is sketched below). The old certificates were put back in place as a temporary measure and the certificates were later renewed with the proper subject; this is a change between the SLC4 and SLC5 setup. Massimo - it would be nice to review the instructions for the operators, who called the CASTOR PK and then were re-directed.
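
The subject-vs-hostname mismatch described above can be spotted with a simple check. This is only a sketch, not the procedure actually used by the service managers; it assumes the openssl CLI is installed and the host certificate sits at the conventional /etc/grid-security/hostcert.pem location.

```python
# Sketch only: compare the host certificate's subject CN with the machine's FQDN.
import re
import socket
import subprocess

def cert_subject_cn(cert_path="/etc/grid-security/hostcert.pem"):
    """Return the CN field of the certificate subject, extracted via the openssl CLI."""
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-subject", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout
    match = re.search(r"CN\s*=\s*([^,/\n]+)", out)
    if not match:
        raise ValueError("no CN found in subject: " + out.strip())
    return match.group(1).strip()

if __name__ == "__main__":
    cn = cert_subject_cn()
    fqdn = socket.getfqdn()
    status = "OK" if cn.lower() == fqdn.lower() else "MISMATCH"
    print("%s: certificate CN=%r, host FQDN=%r" % (status, cn, fqdn))
```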

AOB: (MariaDZ)

  • There were problems with multiple GGUS notifications. Diagnostics in GGUS:75013.
  • The GGUS web service interfaces broke yesterday around 3pm CEST. The reason was a problem with the DNS service. It looks like the KIT DNS server could sometimes not access its partners outside KIT.
  • SNOW deployers were contacted to correctly route the CMS GGUS:74993 ticket to the right supporters.

-- JamieShiers - 14-Sep-2011
