Week of 090119

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site | Date | Duration | Service | Impact | Report | Assigned to | Status

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Akos, Gavin, Jamie, James, Jan, Jean-Philippe, Harry, Maria, Sophie, Eva, Julia, Olof, Patricia, Simone, Andrea, Markus);remote(Gonzalo, Michel, Daniele).

Experiments round table:

  • CMS (Daniele) - activity running smoothly. Stage-out errors at PIC related to one dataset for a couple of hours - privileges on a directory. Gonzalo - intermittent, not fixed; it happens "every now and then". Daniele - it was reported by data ops in the US time zone at night and fixed by Pepe this morning - almost invisible and nothing to worry about.

  • ATLAS (Simone) - Friday: ATLAS DDM people fixed a few problems, especially in accessing the MySQL DBs in the site services. This cured the problem whereby the rates achieved in the first days had dropped - an effect of the growth of some tables in the DB, where some information was kept instead of being cleaned up. Fixed Friday night. Saturday: the dashboard stopped receiving callbacks. Mail was sent to the dashboard people - the space in a table was exhausted. This version runs against a non-production Oracle instance (INTR), also part of the test; it will move to the production DB at the end of the test. Ricardo sent mail to Gancho & Florbella to increase the tablespace but this didn't happen over the weekend. A few callbacks were dropped; the dashboard DB was cleaned of old (Dec / early Jan) entries, which helped, and from Saturday night callbacks started arriving again. Julia - was the space increased in the end? Not sure - to be verified. Simone - FTS delegation problem at CERN: transfers from CERN to some T1s, but not all, failed - 100% failure on some channels, 100% success on others. Gavin - box related; 2 boxes are involved and all failures were on fts203. Why did this happen, given that the delegation patch is running and ATLAS has one patch per service? Gavin - need to dig through the logs to understand. This triggered an alarm e-mail which Gavin answered - something 'fishy' with the ticket. Jan - a GGUS team ticket was opened at 11:15; this would normally be picked up Monday morning. Then 10 hours later a mail went to the ATLAS operations alarm list, who called the DM piquet (Jan). The mail did not arrive - a misconfiguration on the SMoD alarm list, now fixed. An SMS would also have been welcome, but there are problems with signed e-mails; the new gateway this month should allow sending of the e-mail subject - good enough! Simone - the last point is the evolution of the tests: all datasets are now subscribed, the last this morning. Some sites have already aggregated most of their data; the test, meant to last 10 days, will finish this coming Sunday. The system is now in 'draining' mode - no new subscriptions - and we have to see whether the backlogs will be digested before Sunday. Jan - the problem with FTS for ATLAS is with delegated proxies, which has hit us before. What to do - can this bug be fixed? Akos - a fix is scheduled for FTS 2.2, the next major release. Gavin - the impact of this is sufficiently serious that it should probably be fixed before FTS 2.2. Akos - have all sites installed FTS 2.1? Would not like to backport to SLC3, hence sites must be(?) on SLC4. Jan - timeline for a fix in production? Akos - timeline for a fix is under 2 weeks; a proper release needs more time... to be taken offline. Jan - there was a problem with the CASTOR ATLAS instance early Saturday morning: two stuck sessions - calls to the Oracle piquet and the data services piquet. The degradation lasted less than 1 hour, just after midnight. Olof - not an alarm ticket, a direct automatic alarm.
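
The dashboard stall above came down to an exhausted Oracle tablespace that went unnoticed over the weekend. As a rough illustration of the kind of check that catches this early, here is a minimal monitoring sketch; it assumes Python with the cx_Oracle module and read access to the DBA_* dictionary views, and the connection details, account and threshold are placeholders, not the actual dashboard setup.

```python
# Minimal sketch: warn when an Oracle tablespace is close to full.
# DSN, credentials and the threshold below are placeholders.
import cx_Oracle

THRESHOLD_PCT = 90  # hypothetical warning level

def check_tablespaces(dsn, user, password):
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()
    # Allocated size per tablespace vs. free space still available in it.
    cur.execute("""
        SELECT d.tablespace_name,
               ROUND(100 * (1 - NVL(f.free_bytes, 0) / d.total_bytes), 1) AS pct_used
          FROM (SELECT tablespace_name, SUM(bytes) AS total_bytes
                  FROM dba_data_files GROUP BY tablespace_name) d
          LEFT JOIN (SELECT tablespace_name, SUM(bytes) AS free_bytes
                       FROM dba_free_space GROUP BY tablespace_name) f
            ON d.tablespace_name = f.tablespace_name
    """)
    for name, pct_used in cur:
        if pct_used >= THRESHOLD_PCT:
            print(f"WARNING: tablespace {name} is {pct_used}% used")
    conn.close()

if __name__ == "__main__":
    # Placeholder connect string for an INTR-like instance.
    check_tablespaces("dbhost:1521/INTR", "monitor", "secret")
```

Note that data files set to autoextend would need their maximum size taken into account; this sketch only compares allocated versus free space.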

  • ALICE (Patricia) - continuing with the WMS in France. A dedicated node has been provided for ALICE and has been successfully tested. The last one to be tested was GRIF-DAPNIA - issues with a firewall. The WMSs at CERN have also been tested and are being put into production. The ALICE DB was down for 2 hours this morning; since this work must be done centrally, tests could not continue during this period. GRIF-DAPNIA reported a very high load on the VO box - much higher than e.g. CMS. A new submission module has been put on this VO box for testing, which should reduce the load. Michel - about the ticket, propose to leave it open for a couple of days and close it if there are no more issues.

  • LHCb (Roberto) - see http://lblogbook.cern.ch/Operations/1150, where it is reported that GridKA seemed to have a problem with the SRM returning TURLs for user analysis jobs on the 15th of January. FZK was problematic at that time (also reported by ATLAS) and was put into an unscheduled downtime.

Sites / Services round table:

  • CASTORSRM (Jan) - testing a new version which we propose to put into production in the next few days. The upgrade should be transparent - it fixes some core dumps.

  • GRIF (Michel) - one issue we discovered last weekend with Torque/Maui. We use the version maintained by Steve Traylen - not part of the official release but used by several sites. The Torque client crashes on all WNs when the server restarts. A workaround was applied this weekend - the hope is that the new version, Torque 2.3.5, will fix it.

  • DB (Eva) - intervention this morning on the LHCb online DB to fix the problem on the storage from one week ago. The intervention went fine - the only problem is that some of the archive logs were lost, so the capture process on the online DB needs to be recreated.
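
Recreating a Streams capture process after lost archive logs is a DBA procedure that is not spelled out in these minutes; purely as an illustration of the first step, the sketch below checks the capture process state through the DBA_CAPTURE view and restarts it if it merely aborted. It assumes cx_Oracle; the connection details and capture name are made up, and a full recreation (DBMS_CAPTURE_ADM.DROP_CAPTURE / CREATE_CAPTURE) is deliberately not shown.

```python
# Illustration only: inspect an Oracle Streams capture process and restart it
# if it has aborted.  Connection details and the capture name are placeholders.
import cx_Oracle

def restart_aborted_capture(dsn, user, password, capture_name):
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()
    cur.execute(
        "SELECT status, error_message FROM dba_capture WHERE capture_name = :n",
        n=capture_name,
    )
    row = cur.fetchone()
    if row is None:
        print(f"capture process {capture_name} not found")
    else:
        status, error_message = row
        print(f"{capture_name}: status={status}, error={error_message}")
        if status == "ABORTED":
            # Only works if the required archive/redo logs are still available;
            # otherwise the capture process has to be dropped and recreated.
            cur.callproc("DBMS_CAPTURE_ADM.START_CAPTURE", [capture_name])
    conn.close()

# Hypothetical usage:
# restart_aborted_capture("lhcb-online-db:1521/ONLINE", "strmadmin", "secret", "LHCB_CAPTURE")
```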

  • ELOG (James) - the backend daemon hung; the symptoms look reproducible, so it should be possible to trap them and raise an alarm. The elog daemon was stuck and had to be killed. Will look closer if it happens again... Simone - next time ATLAS should send a mail to ??

AOB:

  • GGUS alarms testing (Maria) - typical cases for ATLAS & LHCb alarms are now in the GGUS documentation in a prominent place, and a short "howto" has been written. Do we do the testing at the end of February and then every 3 months, or...? Jamie - do it in the week before the GDB in the target month.

Tuesday:

Attendance: local(Eva, Miguel, Ewan, Sophie, Ulrich, Jamie, Julia, Olof, Gavin, Simone, MariaDZ, Roberto, Patricia, Jan);remote(Luca, Jeremy, Daniele, Gareth).

Experiments round table:

CMS (DB): 1) They are seeing a high load on the SRM server at FZK (a recurrent problem). 2) There has been slow processing of jobs at IN2P3 over the last 12 hours. They were submitted from an FNAL machine whose proxy had expired, so they have been cancelled and resubmitted, though it is too early to tell if this was the root problem. 3) For the last 24 hours there have been increasing numbers of timeout errors at CNAF when exporting to other sites, where the evidence points to a local site problem. Luca will take a look. 4) PIC is showing no signs of activity nor reporting problems. They had announced an FTS 'at risk' period this morning.

ATLAS (SC): The 10M file test is now in draining mode and should finish by Sunday. Sites will get no new subscriptions but should finish what is there. FZK is now transferring files with high efficiency. There will be a presentation of results on Thursday at an ATLAS jamboree meeting, for which they are trying to get information from the site FTM services. For 4 Tier-1s they see no link to FTM; at CNAF they get a security exception in the web browser (Luca reported this as a closed firewall port that he will get opened); CERN needs FTM to be refreshed (Gavin will do this); NDGF will install FTM next month.

LHCb (RS): 1) Asked Luca about the state of the shared software area. The reply was that they have set up a new NFS-over-GPFS configuration that they will start testing tomorrow. They believe there is a conflict over cache use, where the few large data files read once push out the many small software files read many times, and they are discussing buying new hardware in a small tender. 2) There are some 25K jobs stuck waiting in the CERN WMS. A GGUS ticket has been submitted.

ALICE (PM): Will be testing the new SLC5 CE at CERN. Ulrich suggested of the order of 100 concurrent jobs, as there are 58 8-core worker nodes in the cluster. There will also be a ramp-up of jobs at the CERN WMS. The CREAM CE at FZK has been upgraded to the latest gLite release and testing will continue.

Sites / Services round table:

CERN (ER): Reminder that the RB and gLite 3.0 WMS will be stopped next Monday (they will first be set into draining mode). There is still some work being submitted to WMS105 by ATLAS and LHCb and to WMS117 by CMS (experiment reps to follow up).

CERN SRM (JvE): The new maintenance release of SRM (2.7-14) will be applied for ATLAS this afternoon and for CMS and LHCb tomorrow morning.

CERN CE (US): The 4 SLC5 CEs put into production yesterday were inadvertently set into draining mode; this has been corrected. They are now seeing ALICE jobs, with no new problems yet.

Databases (EdF): LFC streams propagation to FZK has now been reset to real-time synchronisation rather than the hourly schedule set up many months ago.
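
For context, the "hourly versus real-time" behaviour of a Streams propagation is controlled by the latency of its propagation schedule. The sketch below shows how such a change could look in principle (cx_Oracle again; the queue name, destination database link and credentials are placeholders, and this is not a record of the command actually run at CERN).

```python
# Sketch: lower the latency of an Oracle Streams/AQ propagation schedule so
# that changes are pushed (near) immediately instead of being batched.
# Queue name, destination dblink and credentials are placeholders.
import cx_Oracle

def set_propagation_latency(dsn, user, password, queue, destination, latency_seconds):
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()
    cur.execute(
        """
        BEGIN
          DBMS_AQADM.ALTER_PROPAGATION_SCHEDULE(
            queue_name  => :q,
            destination => :d,
            latency     => :lat);  -- 0 means push as soon as messages arrive
        END;
        """,
        q=queue, d=destination, lat=latency_seconds,
    )
    conn.commit()
    conn.close()

# Hypothetical usage: from a one-hour batch to near real time.
# set_propagation_latency("lcg-db:1521/LCGR", "strmadmin", "secret",
#                         "STRMADMIN.LFC_QUEUE", "FZK_DBLINK", 0)
```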

AOB: Gareth noted that GGUS alarm ticket testing is now running and queried whether marking the ticket as solved in GGUS was the only required action. Maria DZ confirmed this was the case and, as a reminder, gave the link to the testing procedure, namely https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru

Wednesday

Attendance: local(Patricia, Roberto, Harry, Jamie, Jan, Andrea, Sophie, Maria, Nick);remote(Daniele, Jeremy, Gareth).

Experiments round table:

  • CMS (DB): 1) Yesterday I sent an alarm ticket to the CNAF T1 site. After a while, a mail landed on the "t1-alarms@cnaf.infn.it" mailing list (on which, by chance, I still happen to be, due to my previous job at CNAF), so in this case I had the interesting opportunity to be both the submitter of the GGUS alarm ticket and the receiver of the alarm (at least by mail). I did not see anything wrong in the mail itself, and no other mails reached me. But I was pointed by Luca (cc'ed) to the online "public diary" section of the ticket, which was updated (from t1-admin@cnaf.infn.it) with: "You are not allowed to trigger an SMS alarm for INFN Tier1", so this should be looked into. The alarm ticket didn't generate an SMS but was forwarded to the mailing list. This was seen only this morning; tried to reply but GGUS was in maintenance. The problem itself seems ok - the ticket is on hold waiting for CMS to confirm. Need to follow up on how to submit such alarm tickets successfully.
    2) GridKA - noticeable improvement wrt load on the SRM server: 1000 concurrent jobs, reasonable I/O and responsiveness. Have to monitor for a couple of days to be sure it is ok for more than a few hours, and then increase the job slots.

  • ALICE (Patricia) - a new production began last night, with ~100 jobs running concurrently so far and no problems. Also helping to test the latest versions of the submission modules etc. Maarten asked if the new MyProxy handler is fully deployed - it is currently at 5 sites for testing. One is SARA - so far so good, but a large number of jobs is needed to make sure all is ok before putting it at all sites. Once a new bunch of jobs has been submitted we'll see...

  • LHCb (Roberto) - yesterday reported WMS issues at CERN: 25K jobs "stuck" in waiting status. Further investigation suggests an LB 'bug' that occasionally leaves jobs in a limbo state. The plan is to see if the latest patch fixes it; in the meantime DIRAC will discard such jobs after 24h. There are usually 7-10K dummy MC jobs in the system. The workflow will be modified to also allow dummy data output to be uploaded to the target SE, which will generate the corresponding "dummy" load. Luca said he would fix the shared area issue but there is still no news from CNAF.
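
The 24-hour discard mentioned above is essentially a timeout policy on jobs the LB has lost track of. A toy sketch of such a policy is given below; it is not DIRAC code, and the job records and status names are invented for illustration.

```python
# Toy sketch of a "discard jobs stuck in Waiting for more than 24h" policy,
# in the spirit of the DIRAC workaround described above.  The job records and
# status names are hypothetical, not the real DIRAC data model.
from datetime import datetime, timedelta

STUCK_TIMEOUT = timedelta(hours=24)

def discard_stuck_jobs(jobs, now=None):
    """Split jobs into (kept, discarded); a job is discarded when it has sat
    in 'Waiting' for longer than STUCK_TIMEOUT."""
    now = now or datetime.utcnow()
    kept, discarded = [], []
    for job in jobs:
        if job["status"] == "Waiting" and now - job["last_update"] > STUCK_TIMEOUT:
            discarded.append(job)
        else:
            kept.append(job)
    return kept, discarded

# Example with fabricated records:
jobs = [
    {"id": 1, "status": "Waiting", "last_update": datetime.utcnow() - timedelta(hours=30)},
    {"id": 2, "status": "Running", "last_update": datetime.utcnow() - timedelta(hours=30)},
    {"id": 3, "status": "Waiting", "last_update": datetime.utcnow() - timedelta(hours=2)},
]
kept, discarded = discard_stuck_jobs(jobs)
print("discarding", [j["id"] for j in discarded], "keeping", [j["id"] for j in kept])
```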

Sites / Services round table:

  • CERN Network (Edoardo) - This Thursday 22/01/2009, starting at 10h00 and ending by 17h00, we will progressively move back into production the LCG backbone routers that were shut down before Christmas.
    We are doing this because we have the green light from the hardware support engineers, who will be on site during the maintenance. The operation will take the whole day because we will validate every linecard, one by one.
    We have been assured that the intervention will be transparent and will resolve the problem; but since that was not the case last time, and we are not able to detect the issue while it is ongoing, we will need your collaboration to quickly identify any problem and abort the operation.
    So, during the maintenance window, we would appreciate it if you could quickly report to noc@cern.ch any issue with your services that you suspect may be related to our intervention.

  • SRM (Jan) - the CMS, ALICE and LHCb instances were upgraded at CERN as announced, without problems.

  • GGUS (Daniele) - where can I find the composition of the USAG group? Maria - in Simba, via the list usag AT cern.ch (an alias of project-eu-egee-sa1-esc AT cern.ch); the membership can be seen in e-groups (the explicit list previously shown here was removed for spam fighting).

  • GGUS (Maria) - there was a GGUS release today; the most important feature is direct routing to sites for all tickets, not just team & alarm tickets. A USAG meeting will be held on Thu 29 Jan - agenda and phone details to follow - its main purpose is to test what we understood were the requirements and to do 'fine-tuning' as needed. Daniele - what does this mean if a ticket is not clearly identified for a site? A: if no site is selected from the drop-down it will go via the usual TPM route; otherwise the ROC receives a notification but the site gets the ticket directly. The requirement comes from the EGEE conference. Andrea - what is the purpose of a team ticket? A: co-ownership, e.g. for shift teams.

AOB:

Thursday

Attendance: local(Julia, Luca, Maria, Jamie, Nick);remote(Gonzalo, Daniele, Jeremy, Gareth).

Experiments round table:

  • CMS (Daniele) - one more site issue plus one update. FNAL: the data ops team started getting "failure to open input file" / segmentation fault errors. Not related to site problems but worth noting. CNAF: the GGUS ticket opened yesterday - problem solved and checked soon after; PhEDEx monitoring shows files moved to several T2s with 0 errors. There had been some problems exporting to MIT - destination errors. Ticket closed. Luca - the problem was on a single disk server. Daniele - it has now been put out of production and things have improved; nothing worrisome seen after that. About GGUS - the CMS facilities meeting next Monday afternoon will discuss how to profit from the latest GGUS release. Julia - on the usage of GGUS team and alarm tickets: any plan to switch from Savannah to using GGUS alarm and team tickets? A: we need to reach sites; GGUS is the tool sites want, except for some that prefer Savannah. Savannah also allows us to track problems related to subprojects, e.g. a task tracker vs a ticketing system. We are trying to use GGUS more, especially for sites that request it. If GGUS has new features we would like to profit from them, hence the request for a presentation. Julia - the same system for all experiments would give us tracing of all problems, particularly at sites. Daniele - some problems are not obviously site problems, so we use Savannah, but GGUS is the golden system for sites.

Sites / Services round table:

  • CNAF (Luca) - exchanged emails also with the Italian community of the LHCb computing group. Situation: the small tender for new hardware has been finalised; meanwhile the access to the LHCb software area has been doubled (in number of servers), and the link capacity of the servers for the other VOs has also been doubled. For this they are trying to switch from GPFS to NFS over GPFS, keeping GPFS high availability and adding a cache on the NFS client side. GPFS has a cache, but its behaviour is dominated by the data component, which is much larger than the software one. Tests are ongoing. From LHCb learned that SAM tests at CNAF still fail - hope for more news late this afternoon or tomorrow. Luca should attend tomorrow's meeting in person. To stress the point, this is our main concern at this time; it risks becoming a blocking issue also for other experiments in the longer term.
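
The cache conflict described here (a few large data files, read once, evicting many small software files that are read over and over) is easy to reproduce with a toy LRU model. The sketch below is only an illustration of that effect; the cache size and file sizes are invented and have nothing to do with the actual GPFS configuration at CNAF.

```python
# Toy LRU cache shared between many small, frequently re-read "software" files
# and a stream of large "data" files read only once.  All sizes are made up.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity_mb):
        self.capacity = capacity_mb
        self.used = 0
        self.items = OrderedDict()  # name -> size in MB, oldest first

    def access(self, name, size_mb):
        """Return True on a hit; on a miss, insert the file, evicting as needed."""
        if name in self.items:
            self.items.move_to_end(name)
            return True
        while self.used + size_mb > self.capacity and self.items:
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size
        self.items[name] = size_mb
        self.used += size_mb
        return False

cache = LRUCache(capacity_mb=1000)
software = [(f"sw{i}", 1) for i in range(200)]   # 200 x 1 MB software files
hits = misses = 0
for round_no in range(50):
    for name, size in software:                  # software re-read every round
        if cache.access(name, size):
            hits += 1
        else:
            misses += 1
    cache.access(f"data{round_no}", 900)         # one 900 MB data file, read once
print(f"software hit rate: {hits / (hits + misses):.0%}")  # stays low: data evicts software
```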

  • PIC (Gonzalo) - question for CMS regarding the issue of yesterday and the day before with high load on the SRM at FZK. Yesterday PIC also had issues with this: the SRM was overloaded from 04:00 onwards. In the morning it was understood to come from CMS jobs doing recursive srmls calls. What is the understanding of CMS about this - is it the same problem? Daniele - AFAIK it is the same issue; don't know the status from Pepe. Armin at FZK reports things are quite smooth; don't know the level of the problem at PIC. Gonzalo - this is a "vulnerability" in SRM: srmls does not have an easy/feasible control, so any application doing this is essentially a denial-of-service attack. Was it a single user? A new release? It appeared just yesterday. Daniele - let me check closely and get back. Question - what is the situation regarding FTS 2.1 for SL4? We hear that there are some issues. Is it official that Tier-1s should migrate asap? Are there known issues? Luca - CNAF is testing it but detecting some issues: a problem starting the info provider service; the team managing FTS has contacted support. (Maybe https://twiki.cern.ch/twiki/bin/view/LCG/FtsKnownIssues21?)
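
The recursive srmls issue above is essentially an unthrottled fan-out over the namespace: every directory triggers another SRM request, with nothing on the client side to pace them. As a hedged sketch of the obvious mitigation - walking the tree one directory at a time, with a rate limit and a hard cap - the snippet below assumes the lcg-ls client is installed and parses its output very crudely; the SURL, delay and parsing are illustrative assumptions, not the CMS tool that caused the load.

```python
# Illustration only: rate-limited, bounded walk of an SRM namespace using one
# lcg-ls call per directory, instead of an unbounded recursive listing.
# Assumes the lcg_util "lcg-ls" client; SURL, delay and output parsing are
# simplified assumptions (lcg-ls output format varies between versions).
import subprocess
import time

DELAY_S = 1.0      # pause between SRM calls to avoid hammering the server
MAX_DIRS = 1000    # hard cap on the number of directories listed

def throttled_walk(surl):
    pending = [surl]
    listed = 0
    while pending and listed < MAX_DIRS:
        current = pending.pop(0)
        result = subprocess.run(["lcg-ls", "-l", current],
                                capture_output=True, text=True)
        listed += 1
        for line in result.stdout.splitlines():
            # Crude parse: ls-style lines starting with 'd' are directories;
            # take the last whitespace-separated field as the entry name/path.
            if line.startswith("d"):
                entry = line.split()[-1]
                pending.append(entry if entry.startswith("srm://")
                               else current.rstrip("/") + "/" + entry)
        time.sleep(DELAY_S)
    return listed

# Hypothetical usage:
# throttled_walk("srm://srm.example.org:8443/srm/managerv2?SFN=/pnfs/example/data")
```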

  • DB (Maria) - the January Oracle security patch has been received. It is a rolling patch and has been tested; it will be deployed on the validation clusters next week and then on production 2 weeks later.

  • ASGC (Jason) - 3D databases will be back in production hopefully early Feb, CASTOR & SRM mid Feb.

AOB:

  • WLCG section of weekly joint operations meeting: There is a suggestion that an explicit WLCG section - initially established in the days of the WLCG Service Challenges - is no longer needed now that daily (weekday) WLCG operations calls have been held since early 2008. This would reduce the strong overlap between the daily WLCG call and this section at the weekly call. No immediate action - just a "heads-up" to get feedback from regular attendees of this meeting...

Friday

No conference call due to CERN IT department meeting

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

-- JamieShiers - 16 Jan 2009

Topic attachments:
  • ATLASDDMPM.ppt (PowerPoint, 2374.0 K, 2009-01-22 15:23, JamieShiers)