Week of 091123

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

All tickets are synchronized except https://gus.fzk.de/ws/ticket_info.php?ticket=53365&from=search which has still not made it into the OSG system, although the mail was sent from GGUS to OSG on 2009-11-18 10:09:09 UTC.

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Simone, Alessandro, Miguel, Jan, Harry, JPB, Olof, Patricia, Roberto, Giuseppe, Maria, Maria, Dirk); remote(Daniele/CMS, Gonzalo/PIC, Ono/NL-T1, Alexei/ATLAS, Rolf/IN2P3, Angela/KIT, Jeremy/GridPP, Jason/ASGC, Gareth/RAL, Michael/BNL, Kyle/OSG, Paolo/CNAF).

Experiments round table:

  • ATLAS - (Alexei) ATLAS is taking LHC data and exporting it to T1/2/3 (60 centers) smoothly. Big thanks to Miguel for his help during an SRM problem on Fri afternoon (details below). During this problem the GGUS alarm system also did not work. Dirk: a service incident report is requested from the GGUS responsible. ATLAS sees some T1-T1 transfer problems: FZK has (had) a dCache problem; data transfers from FZK to BNL and from FZK to T2s were affected. More info will be available after the FZK meeting later today. There was also competition between data and user dataset transfers from BNL to LYON (in addition, LYON had a problem with one of its dCache servers). ATLAS also experienced large spikes in DB access on Sun: Luca Canali noticed spikes at 9:00 am and 30 mins later ATLAS noticed slow DB access. An investigation between the DB responsible in ATLAS and IT has been initiated. Simone: SRM issue on Fri 5:30-6:00 pm with an increased transfer error rate to all T1s. The symptoms looked similar to previous problems and are not related to SRM load. FTS started showing source errors; as GGUS was not operational, the experiment called CASTOR operations directly and the problem was fixed in 10 mins by an SRM restart. No impact on data taking, and the problem did not reappear over the weekend. Still, ATLAS has seen several SRM-related problems during the last period and requests a plan to find the root cause. Miguel: Jan and Giuseppe were also looking at the problem when it occurred. In the meantime a service incident report has been produced with the state of the analysis, an action plan to find the root cause and several other suggested improvements to the service. ATLAS suggests sending another test alarm ticket to confirm that the system is working for them. Depending on the problem analysis from the GGUS team, this should probably also be tested for the other experiments.

  • CMS reports - (Daniele) On 20 Nov 19:00 - beam at P5, splash events, possibly collisions but a high halo rate. The status of all tickets is up to date in the CMS twiki. CMS was affected by a condor bug (affecting MC) - a working patch has been received; a more detailed report at the 5 pm CMS meeting. CMS lost 100 TB of custodial data at IN2P3 due to a communication problem between CMS and site staff. Operation procedures are being reviewed now.

  • ALICE - (Patricia) Stressing the transfer infrastructure with the new AliEn version; it looks very promising so far. Some T2-related issues over the weekend are being followed up directly with the sites. Had to reopen the ticket concerning the CREAM CEs at CERN after problems - being worked on.

  • LHCb reports - (Roberto) Collisions/splash events were recorded also in LHCb - the first files have been fully reconstructed and shipped. Problem at NIKHEF: WN jobs failing due to a stuck nscd daemon (ROOT fails as it cannot determine $HOME). The site has put an nscd watchdog in place. SARA: users with a double VO cert (ATLAS and LHCb) failing since the dCache upgrade to the golden release - dcap seems to be back at static VO mapping and the certs were mapped to ATLAS. Now statically mapping to LHCb as a temporary workaround; the problem has been reported to the dCache developers, who are working on a fix. PIC: ran out of space in the shared area for the Brunel installation; PIC added more space. CNAF: on Fri StoRM went down but was restarted quickly. All issues are covered by GGUS tickets.

Sites / Services round table:

  • Gonzalo/PIC: ntr

  • Ono/NL-T1: NIKHEF problem with LHCb - workaround in place: nscd is checked every 1 min (a minimal sketch of such a check is given after this round table). Is this rate sufficient? Roberto: when was this put in place? Users still complained on Sat. To be followed up between LHCb and NIKHEF. NIKHEF: new computing nodes had to be put offline as communication problems with the storage started appearing during the weekend - under investigation. SARA: the SRM problems reported on Fri were not caused by the iptables config; the root cause was an SRM dCache crash (confirmed as a 1.9.5-8 bug due to the use of a non-thread-safe library). Now fixed in 1.9.5-9, which has been installed and works fine. The site is working on a new storage h/w setup which should become available tomorrow or Wed.

  • Angela/KIT: Some discussion on an ATLAS team ticket: the local contact considered this rather an ALARM ticket. Simone: no, that's fine - the shift decides on the severity.

  • Gareth/RAL: Rescheduled the “at risk” time for the bypass test to tomorrow morning (being added to GOCDB).

  • Jason/ASGC: new resources delivered to the site.

  • Michael/BNL: One storage server was down for 45 min - restored. Simone: concerning the transfer backlog from FZK to BNL - observed long latency for TURLs at FZK. Can you confirm that the BNL SRM is 2.2 in legacy mode? BNL will get back to ATLAS.

  • Paolo/CNAF: ntr

  • Rolf/IN2P3: An "at risk" is scheduled for Wed to split EGEE VOs from LHC VOs; this should be transparent for the LHC VOs. Alessandro: how long is the at risk? -> All day.

  • MariaG/CERN: High load on the ATLAS DB at the weekend - first on DB node 3 (PanDA, 9:15 in the evening), then on node 5 (DQ2, 9:20). Will follow up the cause of the large spikes (factor 5-6 of the maximum load) with the ATLAS DB experts. The other nodes of the cluster stayed unaffected (e.g. conditions). Also continuing the plan to move other applications off the ATLAS cluster and to rebalance the core apps (e.g. isolate PanDA and DQ2 on dedicated DB nodes).
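
A minimal Python sketch of what a once-a-minute nscd check of the kind NIKHEF describes could look like; the getent probe, the init-script path and the restart command are illustrative assumptions, not NIKHEF's actual watchdog:

#!/usr/bin/env python
# Hypothetical nscd watchdog sketch: probe the name service once a minute
# and restart nscd if the lookup fails.
import os
import subprocess
import time

CHECK_INTERVAL = 60  # seconds, matching the "checked every 1 min" rate

def nscd_responds():
    """Return True if a passwd lookup through NSS (cached by nscd) succeeds."""
    with open(os.devnull, "wb") as devnull:
        return subprocess.call(["getent", "passwd", "root"],
                               stdout=devnull, stderr=devnull) == 0

def restart_nscd():
    """Restart the daemon via its init script (path is an assumption)."""
    subprocess.call(["/etc/init.d/nscd", "restart"])

if __name__ == "__main__":
    while True:
        if not nscd_responds():
            restart_nscd()
        time.sleep(CHECK_INTERVAL)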

AOB:

  • Simone: A DDM node of the central catalog lost its DB connection - due to the cx_ora problem fixed in SL5. Question: it would be handy to take the node out of DNS load balancing as a workaround / to simplify problem analysis (see the DNS lookup sketch at the end of today's AOB). Can we have a procedure to reconfigure the DNS alias for a small set of experiment experts?

  • MariaD: Note to Kyle: GGUS ticket 53365 is not synchronized with OSG - please have a look and add info to the ticket.

Further info in the ticket itself.

* ATLAS ALARM ticket by Alexei: In reply to the post-mortem request at the WLCG meeting today, the solution in https://gus.fzk.de/ws/ticket_info.php?ticket=53520 by Guenter Grein contains the full explanation. We hope this suffices.

* GGUS fail-safe system: MariaD had a phone conversation after today's WLCG meeting with the GGUS team leader Torsten Antoni. Updates are in https://savannah.cern.ch/support/?101122 The issue will be discussed at the next USAG meeting on December 10th http://indico.cern.ch/conferenceDisplay.py?confId=73657 and at the MB tomorrow.

* A service incident report for the ATLAS SRM problem on Fri has been produced by the CERN CASTOR team. Please refer to https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents
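
On the DNS load-balancing point raised by Simone in the AOB: a small illustrative Python sketch of how one might list the hosts currently behind a load-balanced alias. The alias name below is a placeholder, not the actual DDM catalog alias, and actually removing a node from the alias requires a change on the DNS side, which is what the requested procedure would cover:

#!/usr/bin/env python
# Hypothetical helper: show which IPs a load-balanced DNS alias currently
# resolves to (alias name is a placeholder for illustration only).
import socket

ALIAS = "some-ddm-catalog-alias.cern.ch"  # placeholder, not the real alias

def hosts_behind_alias(alias):
    """Return the set of IPv4 addresses the alias currently resolves to."""
    infos = socket.getaddrinfo(alias, None, socket.AF_INET)
    return set(info[4][0] for info in infos)

if __name__ == "__main__":
    for ip in sorted(hosts_behind_alias(ALIAS)):
        print(ip)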

Tuesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

CERN LHC VOMS Service was unavailable for 20 minutes
Monday 16:02 -> 16:21 UTC there was complete unavailability of the VOMS service. The cause was a misconfiguration of the gd lcg-fw service. The service was restored after automatic notification to the service manager.

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 20-Nov-2009
