Week of 081103

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

| Site | Date | Duration | Service | Impact | Report | Assigned to | Status |
| NDGF | 18-20 Oct | 2 days | streams | - | https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem#Problem_with_ATLAS_replication_f | input from NDGF pending | received |
| CERN | 24 Oct | 3-4 hours | FTS | channels down or degraded | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortemFts24Oct08 | Gavin McCance | received |
| CERN | 24 Oct | 2 hours | VOMS | short interrupt then degraded | https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x10x24 | Steve Traylen | received |
| RAL | 18 Oct | 55 hours | CASTOR | downtime | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018 | Andrew Sansum | received |
| ASGC | 25 Oct | several days | CASTOR | down | - | ASGC | due |
| SARA | 28 Oct | 7 hours | SE/SRM/tape b/e | down | - | SARA | due |

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry,Miguel,Jean-Philippe,Nick,Olof,Patricia,Simone);remote(Michel,Gareth).

elog review:

Experiments round table:

ALICE (PM): The latest version of aliroot is not completely ready for production, so they will use this time for testing: 1) at CERN, the SLC5 worker nodes; 2) at CNAF, a new feature that collects the number of running and waiting jobs when the information system does not provide this information. Michel said he would prefer not to invent a new system, and Harry agreed that this should be solved in the middleware and for all experiments. This will be taken offline.

ATLAS (SC): Several problems, some ongoing from last week. 1) Some 140 datasets, mostly ESD, were not replicated to BNL due to ATLAS internal problems. They are now on tape in CASTOR, so there are many FTS failures with SRM timeouts while waiting for tape mounts on the first attempt to transfer these files. 2) ASGC is still down and therefore cut off from production activities. 3) FZK fixed their pnfs load problems at the end of last week but failed again late on Saturday. Some transfers are running again this morning, but there is a very high failure rate for sites trying to get data from FZK. 4) CNAF had an SRM problem from 11.00 today. 5) NDGF has experienced a well-known dCache race condition where a directory tree becomes owned by root; a GGUS ticket has been raised. 6) The SARA endpoint for MCDISK is showing a 50% failure rate, so production has been stopped.

Sites round table:

RAL: Email report from B. Strong at 14.00 GMT: at RAL we are seeing frequent errors in the ATLAS stager logs of "ORA-14403: cursor invalidation detected after getting DML partition lock". We have also seen these in the past in the CMS stager. The database team thought these would be fixed by upgrading from Oracle 10.2.0.3 to 10.2.0.4. This was true for CMS, but not for ATLAS. The database group are also reporting this to Nilo (Segura). They theorize that it is related to the poor performance we are seeing in the ATLAS stager database. Gareth Smith reported that they experienced another double disk failure on a disk server supporting ATLAS MCDISK early on Sunday morning. Investigations continue, but they may have lost more data. There is a suspicion that this type of RAID array may have problems after a long period of low activity.

CERN (MC-S): One of the Sun tape robots failed on Saturday. Engineers were on site Monday morning and it came back about 1.5 hours ago. There is a long queue of mounts for ATLAS with an average queue time of 100000 seconds (roughly 28 hours), but it is dropping fast. Olof reported that the CASTOR disk servers are busy, sustaining an aggregate rate of 15 GB/s, of which 7 GB/s is from CMS.

Services round table:

  • VOMS "service incident report: https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x10x24
    Summary:
    • 5 minute loss of service, 2 hours degraded.
    • A security scan triggered it. Am now in process of requesting that scans happen during working hours.
    • Operator alarm was not raised. The acctuator designed to "fix" the problem hung. If the acctuator does not complete the alarm is never raised. By default there is no timeout on accuators. These have now been configured for VOMS but other service managers are advised to consider such a case.
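
As a general illustration of the last point, here is a minimal sketch (Python) of wrapping an actuator with a timeout so that a hang cannot suppress the alarm. The actuator command, timeout value and raise_alarm() call are hypothetical placeholders, not the actual CERN monitoring/actuator configuration.

    # Minimal sketch only: wrap an actuator so that a hang cannot suppress the alarm.
    # The actuator path, timeout and raise_alarm() are hypothetical placeholders,
    # not the real VOMS / operator-alarm setup.
    import subprocess

    ACTUATOR_CMD = ["/usr/local/bin/restart-voms.sh"]   # hypothetical actuator script
    TIMEOUT_SECONDS = 300                               # assumed upper bound on a "fix"

    def raise_alarm(message):
        # Placeholder: in production this would raise the operator alarm.
        print("OPERATOR ALARM:", message)

    def run_actuator_with_timeout():
        proc = subprocess.Popen(ACTUATOR_CMD)
        try:
            rc = proc.wait(timeout=TIMEOUT_SECONDS)
        except subprocess.TimeoutExpired:
            proc.kill()                                 # do not let a hung actuator block the alarm
            raise_alarm("actuator hung for more than %d s" % TIMEOUT_SECONDS)
            return
        if rc != 0:
            raise_alarm("actuator failed with exit code %d" % rc)

    if __name__ == "__main__":
        run_actuator_with_timeout()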

AOB:

Tuesday:

Attendance: local(Jamie, Maria, Harry, Simone, Olof, Miguel, Flavia, Nick, Patricia);remote(Gareth, JT, Gonzalo).

elog review:

Experiments round table:

ATLAS: By email from UK operations, confirming yesterday's RAL report - We had another disk "double whammy" this weekend, affecting another RAID array serving the ATLASMCDISK token, srmns156. The data loss this time is 9 TB (63 K files), with ~1/3 srmv1 data.

  • ATLAS (Simone) - several problems with storage:
    • SARA: MCDISK fairly unstable over the last several days; yesterday also DATADISK. Ron said it seemed to suffer from pnfs overload. Production in the NL cloud has been stopped - there is a huge backlog of data to be transferred to SARA and it is not possible to aggregate output from the Tier2s. The backlog needs to be reduced, for which the problems with SARA storage need to be understood.
    • Overload also at other dCache sites, e.g. IN2P3: at the end of last week Lyon closed all channels except the CERN-Lyon channel to understand the problem better. The overload was thought fixed and the channels were reopened; this was ok for a day and a half, but now there are 40% failures getting or exporting data.
    • The same at FZK: import is ok with fairly high efficiency, but export fails almost 100%. Alessandro is investigating; help from Flavia has been requested.
    • CNAF-BNL channel: CNAF cannot get data from BNL; other channels are ok.
    • ASGC dead - no news.
    • Graeme reported a RAL problem with one disk for the LFC, with an unscheduled downtime today - any news? Gareth: one of the mirrored disks had problems; it is being resolved, the mirror set has been re-established and the resync has happened - should be back up soon. The LFC has been out since ~09:00 this morning UK time.
    • Problems following the intervention on CASTOR at CERN: Jan sent a message that the problems were being investigated - seemed ok ~12:30, and transfers now look ok. Jan: as part of the intervention a cleanup on one table was performed; it was incomplete and one of the indices had to be recreated. This is a problem in SRM 1.3 which has also hit CMS - the index space ran out a couple of months ago and the DB is growing. There is no intention to fix it in this version - it is ok in SRM 2.7. Olof: how are the SRM 2.7 tests going? They look ok - the success rate is fairly high and more load can be put on. Jan: will talk offline. Simone: 100% efficiency, 1 MB/s.
    • Flavia, on the pnfs business: apparently ATLAS stresses pnfs a lot due to many entries per directory; CMS went back to a maximum of 1000 entries per directory. The recommendation from some dCache developers is to go to the improved "fast" version of pnfs, which requires Postgres 8.3. There are some incompatibilities with previous installations - the Postgres libraries don't install nicely; ignore the errors and go ahead with the installation. NDGF and TRIUMF have observed a drastic decrease of pnfs load with this version; Simone confirmed these sites show no load problem. SARA will try it. SARA also needs to perform a DB cleanup (vacuum and cleanup) much more often than they do (a minimal sketch of such a periodic vacuum follows after this item).
    • Root permissions on directories: a race condition in dCache. The fix is to run a program which looks at the log and resets the permissions on the directory to those of the parent. This is now part of the dCache distribution and has been published on the dCache mailing list - sites are encouraged to run it.
    • Simone: there will be a big discussion on storage at the pre-GDB next Tuesday, with people from dCache at CERN - try to get people from the major Tier1s. People from SARA will come; Lyon and FZK too would be good. Flavia: all dCache Tier1s have been informed and there is a registration page in the agenda; Tier2s have also been contacted - some should show up or connect via EVO. Simone: 1000 files per directory is not impossible, but then the number of directories increases; the cure is to go to CHIMERA. Is it worth the effort if this is fixed in a couple of months? Among the topics to be discussed - see the agenda.
    • JT: we are trying to figure out how to improve things and need a metric. Looking through the official metrics by VO; we know you have had a lot of problems, but we need a clear metric for ATLAS to see if things are improving. Best would be if this ends up at some point in the SAM suite, but anything concrete would be appreciated. Simone: the concrete thing is the amount of failures in the dashboard. Easily done, but reluctantly - the dashboard needs to be interpreted; some problems come from the ATLAS workflow and some from the Tier1. This is the meeting where problems are reported. JT: would be nice to have a number! In September NL-T1 was 97% up. Simone: the main database is the ATLAS dashboard; a snippet of it which is more site-focused could be provided, but 80% in the dashboard doesn't mean 20% broken - some of it might be someone else's problem. JT: a start...
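
As referenced in the pnfs item above, a minimal sketch of a periodic "vacuum and cleanup" of the pnfs PostgreSQL back end is shown below. It assumes the pnfs companion databases live in a local PostgreSQL instance; the database names and connection settings are placeholders to be adapted to the local installation, not a definitive dCache procedure.

    # Minimal sketch of a periodic VACUUM of the pnfs PostgreSQL back end
    # (e.g. run from cron). Database names and connection settings are
    # placeholders, not a definitive dCache/pnfs procedure.
    import psycopg2

    PNFS_DATABASES = ["admin", "data1"]       # assumed pnfs companion databases

    for dbname in PNFS_DATABASES:
        conn = psycopg2.connect(dbname=dbname, user="postgres", host="localhost")
        conn.autocommit = True                # VACUUM cannot run inside a transaction block
        cur = conn.cursor()
        cur.execute("VACUUM ANALYZE;")        # reclaim dead rows and refresh planner statistics
        cur.close()
        conn.close()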

  • LHCb (JT) - looked at the same numbers as for ATLAS; for LHCb they say 0% - too pessimistic!

  • ALICE (Patricia) - regarding the issue of one week ago with the WMS at FZK - none working for ALICE. The site has been down but ALICE has been advised to test again; will report on this tomorrow. Regarding tests for SLC5: tests began yesterday, with voalice03 in pre-production. Jobs will be real production jobs! Several small issues, reported via GGUS: the corresponding CEs were published as closed in the IS. Solved - Farida Naiz informed prior to the meeting. One dummy job was tested and is ok; about to close the GGUS ticket. The WMS holds all jobs in status waiting, so there is a huge bunch of jobs waiting - 800 jobs for 16 nodes. Will try to kill them; if the ALICE service is stopped - they are pilots - they will die quickly. Hope for more news tomorrow. One requirement remains (will be put in a GGUS ticket): these WNs should have access to the production s/w area (in AFS - proper env variable). One question for NIKHEF: in November they should be able to provide a WMS for ALICE - how is it going? JT: prod and test-bed; it goes first into the test-bed. There is a working WMS on the test-bed, being migrated to the production system - into NAGIOS, quattor profiles and this sort of stuff.

Sites round table:

RAL: By email from Gareth - In the early hours of Sunday morning (2nd November) another disk server that is part of the RAL Tier1 suffered a double disk failure. The server has a RAID 5 disk array. The failure of one of the disks in the array was followed very shortly (around 10 minutes) by the failure of a second. Attempts have been made during this morning to see if it is possible to recover data, but so far these are not looking promising. Note that this is not the same server as failed a week ago; however, this one also forms part of the ATLASMCDISK area. These two disk servers were from the same batch - a couple of years old; the serial numbers on the disks will be checked to see how they match. Miguel: when ATLAS is not running and there is no activity, do you run any script to simulate load? A: no, but it is something that was discussed. Data should be periodically accessed - these failures look very much like 'silent failures', and for this second case that seems very much to be so. Olof: CERN has a probe which exercises the disks every hour. This was started two years ago when silent data corruptions began to appear, and it was presented at HEPiX a year ago by Peter Kelemen. Harry: recommend sites to run this? Olof: they should consider it; it provides a mechanism to detect this. Gareth: RAL does run fsprobe - needs to check. Simone: if a site could indicate a few test files per disk server, the VO might try to read them periodically - a few times per day. This would have the double benefit of exercising the disk server and letting the VO understand if there were problems; the site would have to publish where the files are (see the sketch below).
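
A minimal sketch of the periodic test-file read suggested by Simone is given below; the file paths and reference checksums are hypothetical and would come from whatever list the site publishes.

    # Sketch of the suggestion above: periodically read a few known test files
    # per disk server and compare checksums, so quiet disk servers still see I/O
    # and silent corruption is noticed early. File list and reference checksums
    # are hypothetical placeholders.
    import hashlib

    # (path, expected_sha1) pairs the site would publish, one or two per server
    TEST_FILES = [
        ("/storage/server01/testfile.dat", "da39a3ee5e6b4b0d3255bfef95601890afd80709"),
    ]

    def sha1_of(path, chunk_size=1 << 20):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    for path, expected in TEST_FILES:
        try:
            ok = sha1_of(path) == expected
        except OSError as exc:
            print(f"{path}: READ ERROR ({exc})")
            continue
        print(f"{path}: {'OK' if ok else 'CHECKSUM MISMATCH'}")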

Services round table:

  • DB (Maria) - all recommended patches on integration clusters. All rolling. Scheduled for production as of next week.

AOB:

Wednesday

Attendance: local(Julia,Jean-Philippe,Maria,Harry,Sophie,Roberto);remote(Gareth, Jeremy, Michel, Jeff).

elog review:

Experiments round table:

ATLAS (Alessandro): Today the situation is much better than yesterday:

    • The BNL->CNAF connection is under investigation.
    • NDGF dCache still has problems with ACLs (three times in the last two days).
    • FZK started working again yesterday around 17:00; it would be interesting to know which problems had been found and solved.
    • ASGC is still not working.

LHCb (RS): 1) There has been a faulty T0D1 disk server at CERN (lxfsrd4302) for one month, trapping needed data. LHCb need to know how to proceed with this. 2) Last week the traffic load generator activities started, from CERN to the Tier 1s and between Tier 1s, with an aggregate of 250 MB/s of which 100 MB/s out of CERN. Some problems at dCache sites. 3) MC production (for the later FDR) started last week. Currently 5000 jobs on the grid with no major problems; fake MC is running in parallel. Sophie asked if they still need to use RBs for job submission (for CERN planning). The answer was yes, the RBs at CERN are still needed for Dirac2, but in principle only until the end of this year.

ALICE (PM): At FZK only WMS1 is working properly, while WMS 2 and 3 still have lcmaps problems; ALICE needs at least 2 WMS to use FZK in production. SLC5 tests are ongoing in the PPS after the software area pointers were changed as requested. About 20 jobs were submitted but they have stopped, possibly due to an ALICE problem. Tests of the new SRM version will begin, but through the FTS layer, so they should be transparent to ALICE users.

Sites round table:

RAL (Shaun - email to castor external OPS): On the 31st we had a minor recurrence of big Ids being inserted into id2Type. These started at 12:35:37.788869 and the last bad value seems to have been at 16:14:19.577204. A total of 7 bad entries have been found in this timeframe (in the ATLAS STAGER database), and a restarter restarted the rhserver at 17:00:18 (due to the presence of Unique Constraint Violations in the log – need to look at that).

Anyway, I have looked for anything similar in any of these and have found nothing. The entries come from different rhserver threads, different rh ports and come from different clients. I have attached a spreadsheet with what I have found. If anyone would like me to look for anything else please let me know. Shaun
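
For illustration only, a sketch of the kind of check described above (looking for unusually large ids inserted into id2Type) is shown below; the column names, threshold and connect string are assumptions, not the authoritative CASTOR schema or the procedure actually used at RAL.

    # Illustrative sketch of a check for "big Ids" in id2Type. Column names (ID, TYPE),
    # the threshold and the connect string are assumptions, not the real CASTOR schema
    # or the procedure actually used at RAL.
    import cx_Oracle

    BIG_ID_THRESHOLD = 10 ** 15                      # hypothetical cut-off for a "big" id

    conn = cx_Oracle.connect("stager_user/password@atlas_stager")   # placeholder DSN
    cur = conn.cursor()
    cur.execute(
        "SELECT id, type FROM id2type WHERE id > :threshold ORDER BY id",
        threshold=BIG_ID_THRESHOLD,
    )
    for row_id, row_type in cur:
        print(row_id, row_type)
    cur.close()
    conn.close()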

TRIUMF (Reda - by email): I would like to correct the ATLAS report:

TRIUMF does NOT run dCache with Postgres 8.3 yet. We are still at 8.2. Although our PNFS is not overloaded, we are doing some PNFS performance testing with 8.3 and we see improvement. Testing almost complete and will be reported later. We will move our production system to Postgres 8.3 once we're happy with the results.

Regards, Reda

NL-T1 (JT): raised a question for ATLAS concerning GGUS ticket 43235, addressed to SARA, which reports two cases of 'disk full' in the SARA cloud and which he has returned. The sites concerned should be identified and the tickets sent directly to them.

GRIF (MJ): they are running an lcg-CE and, after update 33, see thousands of defunct processes (globus-gma) on the server, causing performance problems. The associated GGUS ticket awaiting a response is 42981 (a quick diagnostic sketch follows below).
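
A quick, generic diagnostic sketch for the symptom above (counting defunct globus-gma processes on the CE) is given below; only the process name is taken from the report, the check itself is not part of the CE software.

    # Generic check: count defunct (zombie) globus-gma processes on the lcg-CE.
    # Only the process name comes from the report above; the check itself is generic.
    import subprocess

    out = subprocess.run(["ps", "-eo", "stat,comm"],
                         capture_output=True, text=True, check=True).stdout

    zombies = 0
    for line in out.splitlines()[1:]:                 # skip the header line
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("Z") and "globus-gma" in parts[1]:
            zombies += 1

    print("defunct globus-gma processes:", zombies)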

Services round table:

fsprobe (Olof by email): In the meeting I promised to provide some information about the fsprobe program for exercising disk hardware.
The fsprobe RPM is part of the SLC4 distribution http://linuxsoft.cern.ch/cern/slc4X/extras/x86_64/RPMS/fsprobe-0.1-2.x86_64.rpm
It is also available under http://cern.ch/Peter.Kelemen/fsprobe/ . Peter gave a HEPiX talk about the fsprobe in 2007 http://hepix.caspur.it/storage/hep_pdf/2007/Spring/kelemen-2007-HEPiX-Silent_Corruptions.pdf
Tim Bell created a commodity disk management Twiki that gives some details about what other checks and procedures we are using at CERN https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskRefDiskManagement (I hope it's visible from outside CERN?). Olof

Databases (MG): are performing a migration of the LHCb bookkeeping database today.

AOB:

Thursday

Attendance: local(Jamie, Maria D, Gavin, Harry, Nick, Jan, Miguel, Roberto, Flavia);remote(Gareth, Harry, Michel).

elog review:

Experiments round table:

  • ATLAS (Simone) - This is the report about the current situation for ATLAS:
    • ASGC is back in business! In the last hour it has been running at 98%. It looks like the experts' intervention this morning was successful. It will be reintroduced into the functional tests tomorrow after the conference call, unless there are major concerns.
    • FZK is still not very stable; there is a 40% failure rate from SRM timeouts. Cedric in the DE cloud commented that this might be an effect of the data cleanup (which was interrupted in the morning, with no visible effect). Simon Nderitu commented that it might be due to a slow network. Still, a final conclusion is missing.
    • DDM stress test: a stress test of the ATLAS DDM services (site services in the VOBOXes and the central catalog) is scheduled in 2 weeks. Currently there is a preparation activity, started yesterday: approx 10K datasets (corresponding to 1M files of 50 MB each) are being distributed from T0 to the T1s (preplacement activity) with a share of 10% per T1. The full T1-T1 replication will be triggered on the 2-week timescale, and this will be the DDM test mentioned above.
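
To make the volumes explicit, the figures above work out as follows (a small arithmetic sketch; the input numbers are taken directly from the report).

    # Back-of-the-envelope arithmetic for the DDM preplacement figures above
    # (10K datasets, 1M files of 50 MB each, 10% share per Tier-1).
    n_datasets = 10_000
    n_files = 1_000_000
    file_size_mb = 50
    tier1_share = 0.10

    total_tb = n_files * file_size_mb / 1_000_000        # 50 TB in total (decimal TB)
    print("files per dataset :", n_files // n_datasets)  # 100
    print("total volume (TB) :", total_tb)               # 50.0
    print("per-Tier-1 volume :", total_tb * tier1_share) # 5.0 TB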

  • LHCb (Roberto) - the preparation phase of the fest09 activity will be completed soon. The next phase is the merging activity (MC simulation output merged into the final MDF file injected into the online system). MC production so far has gone extremely smoothly: up to 50K jobs, for both physics and dummy MC production, ramping up to 10K concurrent grid jobs, and starting to hit some DIRAC WMS limits. Grid: overloading 2 out of the 3 WMS instances at CERN - request to use the additional (i.e. the 3rd) WMS as well. The load generator keeps generating data out of CERN to the T1s and across T1s, aggregate 250 MB/s. The FZK endpoint is unresponsive and problematic - ticket? If not, one will be opened. Jan - follow-up on the report from Andrew yesterday: sysadmins investigated the status of the repair; 3 attempts to fix the problem have all failed. Will send the log. Tickets were opened - started by a call from Andrew Cameron Smith, and the 'open channel' continues to be used. If the files are removed from the CASTOR namespace and stager, what happens when the machine recovers? It will be ok. The list of files has been transmitted; LHCb will try to recover whatever data is possible from other sites.

Sites round table:

  • TRIUMF (Denice) - We have lost power for the central TRIUMF computing services. This has not affected the power for the Tier-1 centre; however our networking and DNS are affected. (1 hour later) An earlier power failure affected central services at TRIUMF. All services are now back to normal at the Tier-1 centre.

  • ASGC (Jason) - report concerning the recent problems affecting the CASTOR service at ASGC. A conference call with CERN DB/CASTOR experts was then held. The CASTOR instance was running but with a high rate of inserts into the id2type table which was blocking other accesses (ATLAS were getting SRM prepare-to-put failures). This was traced to a CASTOR stager internal cleaning job that had taken an exclusive lock; it was stopped at about 10.30 CET. Later a missing primary key was detected in the SUBREQUEST table and when this was corrected the performance of the DB went back to its normal good level. At about 13.00 CET the first successful transfers from CERN to ASGC since 25 October were made. Another conference call will be held tomorrow at 11.00 CET, 18.00 Taiwan.

  • RAL (Gareth) - failed disk server issue. RAL does run the fsprobe tool. A firmware upgrade was applied a few weeks ago to improve the controller's recognition of failed disks. Other suggested measures - e.g. continuous verification of RAID checksumming - are not currently done but are being discussed.

  • BNL is investigating with CNAF the network problems that ATLAS is experiencing. iperf has been installed on both sides and tests are running to find out where the packet loss happens (see the sketch below).
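
A sketch of the kind of iperf check mentioned above is given below: a UDP test whose summary reports lost datagrams. The hostname, bandwidth and duration are placeholders, and the far end is assumed to be running "iperf -s -u" as the server.

    # Sketch of a UDP iperf run whose summary reports datagram loss.
    # Hostname, bandwidth and duration are placeholders; the far end is assumed
    # to be running "iperf -s -u".
    import subprocess

    REMOTE_HOST = "iperf.example.org"     # placeholder for the far-end test host

    result = subprocess.run(
        ["iperf", "-c", REMOTE_HOST, "-u", "-b", "100M", "-t", "30"],
        capture_output=True, text=True,
    )
    print(result.stdout)                  # summary shows lost/total datagrams and loss %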

Services round table:

AOB:

  • GGUS (Maria) - How to remind VOs to submit team or alarm tickets and mention MoU violation field to get MoU compliance data.

Friday

Attendance: local(Nick, Gavin, Flavia, Patricia, Jan van Eldik);remote(Jeff, Jeremy, Derek, Michel).

elog review:

Experiments round table:

  • ALICE: WMS issues at FZK solved and the ticket closed. Following up with GRIF_DAPNIA on the status of the WMS setup for ALICE. SLC5 tests at CERN are going smoothly; jobs will continue running for the whole weekend until the next upgrade of the system to 64-bit next week.
  • ATLAS: all major problems seem to be solved at the moment.
    • ASGC has been reintroduced in the functional tests and is currently importing and exporting without any problem
    • FZK performance has also improved from the 50% failure rate of last week. There is no understanding yet of what was fixed, but the SRM timeouts disappeared at approx 11:00 this morning. It would be good to know what was fixed.
      • Doris Ressmann from FZK responded to this input later via email: the site has been updated to the new "fast pnfs server" version, therefore improving the situation.
    • Some minor open issues: there is currently 1 (one) file which cannot be retrieved from tape at CERN and is causing failures to BNL (a ticket has been submitted to castor.support). Transfers BNL->CNAF seem to be slow (200 KB/s); BNL and CNAF people have set up an iperf test between the two endpoints and are actively following up the issue.

Sites round table:

  • NIKHEF (JT by email) - post mortem on NL-T1 tape outage
  • GRIF_DAPNIA (Michel): GGUS tickets 42999 (related to the WMS) and 42981 (related to the lcg-CE) were submitted 1 week ago and still have no response from the experts. Nick has taken the ticket numbers to follow up on their status.
  • Flavia: transfers CNAF-BNL: apparently the traffic between the T1 sites is carried over ESnet. This link however seems to be very busy, with a speed currently limited to 5 MB/s. The open question is therefore whether this is the correct link to use. Flavia submitted a GGUS ticket just after the meeting to track and follow the issue (#44369). Still to be followed next week.

Services round table:


AOB:

-- JamieShiers - 30 Oct 2008

Topic attachments
| Attachment | Size | Date | Who |
| ASGC-CASTOR-Nov5.pdf | 54.7 K | 2008-11-06 | JamieShiers |
| Analysis_of_incident_with_the_SRM_service_at_PIC_on_31-Oct_-_1-Nov.pdf | 50.3 K | 2008-11-05 | JamieShiers |
| nl-t1-nov6.pdf | 44.3 K | 2008-11-06 | JamieShiers |
| post_mortem_tape-system_outage_25_10_in_NL.pdf | 49.9 K | 2008-11-07 | JamieShiers |