Week of 090309

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Julia, Harry, Alessandro, Ricardo, Roberto, MariaDZ);remote(Gareth, Daniele).

Experiments round table:

  • ATLAS - Several problems during the weekend. 1) The U.Toronto site has been giving intermittent dCache "permission denied" errors when storing files for more than a week. ATLAS suggest they talk to TRIUMF. 2) There are SRM problems exporting data to Coimbra and Lisbon, and both have been taken out of production. 3) FZK is down for 3 days, so the German cloud is out of production. 4) Cloud validation tests were started last Friday as a precursor to starting the planned reprocessing.

  • CMS reports - There is a summary of last week's GGUS Alarm ticket tests, showing response times, on the reports Twiki. Response times were quite good, with no showstoppers on reaching Tier 1 sites.

  • ALICE -

  • LHCb - 1) GGUS alarm tickets specialised for LHCb use cases will be tested this afternoon. 2) Dummy Monte Carlo production will restart using an older version of Gauss (the current one needs modifying to allow for jobs that run out of time). 3) Tier 1 reconstruction jobs are also restarting, but at a low level. These have already shown several issues, e.g. problems at Nikhef when trying to reconnect to the conditions database. 4) On the issue of wrong file locality being returned by the SRM at IN2P3, they have followed advice from Lionel Schwarz but still have a problem.

Sites / Services round table:

CERN (by email from T.Cass): CMS has asked to postpone the already announced network intervention scheduled for the 18th of March: http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ServiceChangeArchive/ImortantDisruptiveNetworkIntervention18March.htm

The new proposed date is Wednesday 1st of April 2009, with the same time schedule. Please let us know before tomorrow evening (Tuesday 9th/3) if anyone objects to this new date; in that case, the original date of the 18th/3 will be kept.

TRIUMF (mail from Di Qing following up on FTS 'defective credentials' error): Sorry for the late reply. We tried to investigate it on our SE. We found the same error message in the gridftp door logs on several pool nodes. The errors had already appeared one hour before we reactivated all FTS channels, so it was definitely not a problem with the FTS. However, we still have no clue why this happened.

AOB: GGUS Team Tickets (MariaDZ): Experiments are reminded that there is a well documented procedure for adding members to the team ticket submitters lists; she has added the link to the GGUS section above.

RAL (Gareth): RAL and other Tier 1 sites were marked as unavailable following CE SAM test failures from about 15.00 on Saturday. Ricardo said they were aware of this and had an open GGUS ticket.

Tuesday:

Attendance: local(Alessandro, Simone, Kors, Harry, Riccardo, Julia, Jean-Philippe, Nick, Olof, Patricia, Roberto, JT, Maria), remote(Daniele, Michael, Michel, Gareth).

Experiments round table:

  • ATLAS (Simone) - ATLAS observed problems at 4 T1s: RAL went into unscheduled downtime yesterday, scheduled today, due to DNS problems; Lyon is in scheduled downtime, as is ASGC, and FZK is in scheduled downtime until tomorrow. That is 40% of the ATLAS T1s! IFIC site in Spain (StoRM + Lustre): need to check with the StoRM developers so that the site can publish its info correctly and be used again. Reprocessing campaign: pre-campaign site validation is ongoing; we hope validation will complete this week but let's see. Tomorrow there is another downtime at CNAF. The ADC review starts tomorrow. JT - timeouts at NIKHEF accessing files? We have a couple of tickets about stage-in of files to WNs and timeouts. We have a 2Gbit network link between NIKHEF and SARA, to be upgraded to 10Gb next month. Could limit the number of ATLAS jobs so that the limit is not reached, wait, ban the site or ... Simone: which files? Propose that 'hot files' are replicated to DPM at NIKHEF, so that normal input files would not congest the link - will check with Hong. RAL (Gareth) - RAL has had problems since the middle of yesterday related to a DNS problem here. The site was put "AT RISK", which got extended through the night - marked as scheduled but clearly not! The CASTOR SRM, particularly for ATLAS, had quite severe problems - maybe the site should have been in unscheduled outage. These were resolved this morning and we believe we are fully back now.

  • CMS reports (Daniele) - quite some details in today's CMS report (see inline wiki:https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports#10_March_2009_Tuesday): Have opened about 7 CMS internal Savannah tickets, nothing yet forked to GGUS. T0: thanks for full CASTOR p-m. Tape migration - in consequence of above - now fixed. T1s: IN2P3 export errors to Fr site, fixed, closed. CNAF After some interventions on StoRM and PhEDEx at CNAF the transfers started with ~100% quality and ~10MB/s. The average rate of the last 24h is ~10MB/s so the link could be commissioned. (Savannah #107382 --> CLOSED) ; IN2P3 -> FNAL transfer issue - item fixed & closed. New tickets against FNAL, CNAF, IN2P3. Existing open tickets to CNAF & FZK. T2s: close ticket to IC re transfers to RAL; some new: KNU in Korea; Bari - problems getting files from CNAF - local storage in Bari; Pisa in downtime - heavy load on storage. Reinstall dCache to solve. MIT : problems reading from / writing to SE there. GGUS->OSG https://gus.fzk.de/ws/ticket_info.php?ticket=46798 reply last night. Reply on March 9. Assigned March 2 to affected site MIT_CMS. No reaction for 1 week.

  • ALICE (Patricia) - WMS issues: working quite well after upgrade to latest megapatch. About 3600 jobs running - detailed report to GDB tomorrow. Working hard with CREAM. Several sites working very well. Best FZK, RAL working well in the past. Putting CERN into production. 2 CEs only accessible from CERN - ce201, ce202. Small problem with info provider for ce201. First test : CREAM bug 47152 affecting CERN - in contact with developers to fix. Workaround for this problem sent to CERN; same problem seen at RAL & same workaround. Ricardo - these CEs really not production ready yet - one was reinstalled last night.

  • LHCb (Roberto) - Continuing T1 commissioning activity by running recons jobs using FEST data. The RAL DNS issue also prevented recons jobs from resolving TURLs using the SRM endpoint at RAL. GridKA down - also IN2P3. "Wrong locality" reported by SRM at Lyon - suggest contacting SARA, who had the problem and fixed it. NIKHEF: reconnection to conditions DB still under investigation. CERN: LHCb user space token - globus_xio EOF error. A disk server exhausted the number of gridftp servers running / idle; the site admin has to intervene to clean up. Re-used the GGUS ticket opened a couple of weeks ago to find a longer-term solution rather than regular cleanup. glexec tests in pps/production. Started with NIKHEF: tests failing due to LCMAPS failure - mail sent to mailing list. Ran GGUS tests yesterday - ok except CNAF (not recognised as valid alarmer - checking cert); problem with GGUS i/f. NL-T1: the GGUS i/f asks which of the 2 sites to send the alarm to, but the ticket is sent to one specific site. Maria - concerning the comment about NIKHEF/SARA: there is an open ticket in Savannah - a solution is foreseen for the March release.

Sites / Services round table:

  • RAL - see above under ATLAS. [ Had significant problems for 24h - will do our own brief pm - first steps on disaster recovery plan as situation went on too long! Severely degraded all night and good chunk of yesterday and this morning. One thing to add - sorry for confusion over scheduled at-risk! ]

  • GRIF (Michel) - very severe cooling problem at the end of last week - lost 2K jobs or so! Unexpected power cuts on 2 consecutive nights.

  • NL-T1 (JT) - where to raise the question of one address or 3? -> MB

AOB:

Wednesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb (Roberto)
    1. Issue with the new French CA, whose certificate does not appear under the usual grid-security path on the AFS UI. This is affecting some French users who renewed their certificates (or registered in the LHCb VO) before the change was done by Steve and Pierre. Those responsible for UI deployment are working on it, and Steve is checking whether their procedures are updated for the next time this happens.
    2. The issue of hung processes on the CASTOR gridftp servers (globus_xio problems) is still under in-depth investigation by Shaun and Giuseppe Lopresti. It seems to be correlated with clients being abruptly killed when they time out, leaving processes on the server in CLOSE_WAIT status that sit there forever (eventually exhausting the number of threads).

Sites / Services round table:

  • ASGC (Suijian Zhou) - FTS is back at ASGC, and also VOMS, UI, BDII etc. The recovery of CASTOR still needs some time, although we are trying to speed it up. See the GDB agenda for a report on the incident.

AOB: (MariaDZ) Conclusions from the ALARM ticket test exercise are assembled in https://savannah.cern.ch/support/index.php?105104#comment36. Pending action for the next round in one month: sort out the Dutch case in https://savannah.cern.ch/support/?107440 . Progress will be monitored via https://savannah.cern.ch/support/?107452 . Testing rules have been updated with dates in https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru . Another interesting topic: the routing to USA T1s, to be discussed tomorrow 12 March (ATLAS, CMS please attend!); details in http://indico.cern.ch/conferenceDisplay.py?confId=54492 .

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb -

Sites / Services round table:

AOB:

-- JamieShiers - 06 Mar 2009

Topic revision: r9 - 2009-03-11 - MariaDimou