-- HarryRenshall - 21 Nov 2008

Week of 081124

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

| Site | Date | Duration | Service | Impact | Report | Assigned to | Status |
| ASGC | on-going | on-going | CASTOR | storage flaky - down | | Jason | due |
| CERN | 20 Nov | 1.5h degraded, 4h total | ATLAS CASTOR | down / degraded | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081120 | | received |

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Maria, Jean-Philippe, Roberto, Simone, Jamie, Nick, Markus, Harry, Olof, Jan, Gavin, Maria);remote(Michael, JT, Michel, Jeremy, Gareth, Daniele).

elog review:

Experiments round table:

  • LHCb (Roberto) - this week: a few production requests; more jobs and more MC simulation at sites. In parallel the staging exercise should ramp up at some point; currently staging tape-disk at 10MB/s. The staging exercise is now devoted to proving that 10MB/s is sustainable across all T1's. At GridKA even a sustained 10MB/s rate is not possible; GGUS ticket open. Problem on 23/24 at RAL. Last week and during the w/e not able to stage any file at CNAF - the LHCb raw space token was not configured after the recent upgrade. Now fixed? If so, the team ticket opened on Friday will be closed.

  • ATLAS (Simone) - notifications from sites: 1) RAL - DB and site at risk; 2) BNL - networking on trans-oceanic link (expect comments from sites...). Entries put in GOCDB plus a "courtesy" elog - very nice, thanks a lot! Problem notification: SRM problem with StoRM. Notified Friday night that someone would look at it, but no notification - including via the GGUS ticket - when solved. Beijing: on the 20th (Thursday) at 1:00 pm a ticket was assigned without specifying the site -> AP ROC. Jason reassigned to CERN; 6pm Friday, so people in Beijing were already home. Still open (SE in Beijing) - the info is in GOCDB! To be discussed at the TPM meeting Thursday; enhance material for TPM training. AP ROC and CERN ROC are in very different time zones, which makes this worse. The GGUS release after Xmas will have direct routing to sites for all registered sites(!) Alarm ticket to SARA at 08:00 Thursday - problem with the SARA SRM. Answered at 16:00 "looking into it"; 20 minutes later it was fixed. Problem on Sunday registering in the LFC@CNAF; logging at the ATLAS DDM level is insufficient to figure out such problems. ATLAS switched to FTS on SLC4 at CERN this morning; a configuration problem with the site services meant the BNL datatape endpoint couldn't get files for 4-5h - fixed by the end of the morning. Similar problem for the "Great Lakes" T2 - being fixed (Stephane Jezequel). Meeting with ASGC - comments? Harry - another working con-call tomorrow.

  • CMS (Daniele) - nothing special to report.

Sites round table:

  • BNL (Michael) - USLHCnet involved in a transatlantic network problem. Global Crossing technicians are trying to fix the problem at the POPs in NY & Chicago. The failure, reported last Thursday, was "repaired" but the fibre deteriorated within a few hours to unusable. Communication is still a problem - haven't heard back. WAN networking at BNL - the backup path seems up. The problem was affecting internal USLHCNET infrastructure; the primary path for BNL was never affected. Traffic BNL-US, BNL-CERN and BNL-T1s was never impaired. FNAL lost its primary path. ATLAS operations were never affected. The communication path between the VO and the network engineers needs to be improved. It looks like the infrastructure is back to normal, but no news from USLHCNET or the network operations centre. Maria - the GGUS - LHCOPN Savannah ticket is https://savannah.cern.ch/support/?103810 . The last 3 comments in the ticket show how uncertain we are about the outcome of the GDB presentation. Exchange with Guillaume to clarify. Needs to be followed...

  • RAL (Gareth) - problems with the SRM i/f for ATLAS over the w/e. Extra logging in place to try to understand the situation with Oracle; Graeme Stewart in the loop. Networking issue at 11pm last night: one of the network switches was reconfigured, including the backend DB for NAGIOS monitoring, so no notification was received until this morning! Simone - the intensive test of last Friday was stopped as agreed; should we restart some heavier testing? Brian Davies - because of the "at risk" and network problems there is enough of a backlog at the moment that we can use; if we need more load before the 10M event sample we will holler. See also the post-mortem blog.

  • CERN (Jan) - as announced Friday, upgraded CASTOR CMS to 2.1.7-23; will upgrade LHCb and others. SRM 2.7-11 preproduction also ok. FTS production on SL4 brought to the current production patch.

  • DB (Maria) - patching online databases this week. Started with CMS then ALICE & LHCb. ATLAS (as in CC) already done.

Services round table:

  • FTS on SLC4. The fix - for the Bouncy Castle issue - should be available tomorrow. Deployment test <1 day, so ready for sites the day after tomorrow.

AOB:

Tuesday:

Attendance: local(Harry, Jean-Philippe, Gavin, Simone, Julia, Steve, Flavia, MariaG);remote(Jeremy, Michael, Michel, Jeff, Gareth).

elog review:

Experiments round table:

LHCb (RS by email): 1) MC simulation: a large number of pilots running in WLCG currently. Some problems at various sites accessing the SQLite DB (the local instance of the ConditionDB at non-T1 sites, meant to be found on the shared area of each site) because of the shared area timing out. 2) Preparations for the staging exercise are continuing across all T1's. 3) CERN, PIC, RAL and SARA tape recalls are happily sustaining the 10MB/s rate and LHCb will increase it towards the target rates (as reported in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek081117). 4) GridKA and CNAF are reaching only 5 MB/s (investigating the reasons why; also, with this poor rate a backlog is forming on the tape recall queues!). 5) IN2P3: all staging requests (BoL requests) seem to have been lost by the remote system. Investigating with the local contact person and Lionel.

ATLAS (SC): Following a cooling system leak at about 08.30 this morning CERN closed down much of its batch farm. This is where the ATLAS load generator for the functional tests runs, so this activity stopped. The farm is coming back now so the FTs will automatically restart. ATLAS is planning to extend the functional tests to Monte Carlo endpoints after many reports that MC production is suffering from data management issues, notably file transfers from Tier 1 to Tier 2. This is not currently seen in ATLAS monitoring, probably because different space tokens are tested. A survey is being made to see if MC sites have space for additional temporary FT data, and the plan is to start these tests with the start of the next FT cycle next Thursday.

Sites round table:

  • ASGC (Jason) - update on on-going CASTOR / Oracle problems.

Services round table:

  • The cooling system leak that led to the batch farm close-down this morning was fixed at about 11.40 and the farm is being progressively restarted. This was signalled in GOCDB as an unscheduled downtime (see https://goc.gridops.org/downtime/list?id=13255360 ).

The voms-pilot service has still seen little testing by the LHC VOs. In the last month and a bit there have been:

  • 59 requests by 3 users for ops.
  • 17 requests by 6 users for dteam.
  • 17 requests by 2 users for atlas.
  • None for CMS, ALICE or LHCb.

The validation instance of SAM used the voms-pilot.cern.ch VOMS service without incident; testing by me and others has been completed.

OSG started some tests yesterday, so the outcome of these may influence the proposed upgrade time.

The proposed upgrade is now at 08:30 UTC on Monday 1st of December, with confirmation to be made on Thursday. It will be an upgrade of voms-core, voms-admin and vomrs.

Changes for users:

  • Emails from VOMRS now contain the DN-CA pair that the email is talking about. Should avoid some previous confusion for users.

There will be an at risk period of two hours during which time registration processing will be suspended for a 5 minute window.

Also at the same time, write access at the voms-admin level will be denied for VO-Admins. This had effectively been the case already, in the sense that all writes were subsequently removed by vomrs, but now the write buttons in voms-admin will be disabled. Read access for VO-Admins will be preserved.

  • Databases (MG): We are restricting access to the LHC online databases so that only named IP addresses can connect, in order to increase their security. CMS has already been done and it is proposed to do ATLAS this week.

AOB:

Wednesday

Attendance: local(Simone, Gavin, Sophie, Flavia, Harry, Nick, Roberto, MariaDZ, Olof);remote(Michael, Jeremy, Jeff, Gareth).

elog review:

Experiments round table:

ALICE (by email from PM): About to release the AliEn 2.16 version, which is in the testing phase right now at CERN, Legnaro, FZK, SARA, CNAF, Prague and JINR. The new version will contain the CREAM submission code, although the testing of this new code will depend on the resources provided by the sites. This new AliEn version will be THE WMS version, including the multi-WMS code created specifically for this version. Sites will be informed in time of the new code upgrade. More news after the ALICE TF meeting tomorrow.

ATLAS (SC): 1) Problem yesterday with storage at PIC - an alarm ticket was sent; this was acknowledged within 20 minutes and solved in a few hours. 2) Problems yesterday contacting the SRMs at the DESY and Great Lakes Tier 2 sites - again tickets were sent and solved quickly. 3) This morning a storage problem at Lyon - ticket sent and it was resolved before midday. 4) ASGC is broken again with a new Oracle error 257 (archiver error). 5) Following yesterday's report on slow file transfers of Monte Carlo production, the DDM shares for this activity in the two worst affected clouds, Germany and the UK, were boosted with positive results.

LHCb (RS): We are increasing the staging rates at the well-performing Tier 1s and have reached about 20 MB/s. SARA is still only at 10 MB/s and CNAF and GridKA still at 5 MB/s. All are being followed up and further increases are planned for Monday. A SAM LHCb critical test of CEs was failing everywhere due to a table permissions change in the online DB, so an internal LHCb problem. Meanwhile the test has been changed to be non-critical, and Jeff asked how long it would take for the Nagios feed to reflect this. The consensus was it should be with the next set of hourly tests (which may queue locally of course). Jeff then raised a follow-up issue: they received a ticket from LHCb over the weekend - GGUS 43954 - which was submitted as a standard rather than a team ticket (the latter bypass the ROCs). The ticket hence spent 31 hours before being sent to ROC North and another 17 hours before it reached SARA, by which time the problem had been resolved. Roberto replied they were trying to train their operations people and will look at the ticket to see the originator.

Sites round table:

NL-T1 (JT): Reminded that they prefer team tickets for known site problems as these go direct. They also feel they get too many operator alarm tickets and they are going to review this. Another issue is they have been red on the LHCb dashboard for some time due to a WMS test failing but this should not be treated as critical as the important site functions are still working. After some discussion it was agreed to bring this up at a future management board.

BNL (ME): Reported good results in the replication of ATLAS Cosmic AOD and DPD data sets to the US Tier 2 cloud. They see high aggregate transfer rates with an average rate of 350 MB/sec over the last 7 days and 200 MB/sec average over the last month with 5 million files replicated. They recently had 3 out of 7 cooling pumps go down but managed to install a 50-ton rental cooler unit within a day so there was no degradation of services. They will now be examining the failed pumps.

RAL (GS): A scheduled SRM outage for RAL-LCG2 in the GOCDB did not show up in the site detailed view, so additional unscheduled outages were inserted. These were picked up by GridView but displayed as all RAL-LCG2 components being in maintenance for 1.5 hours, after which the display corrected itself. Gareth will send Harry a detailed timeline for follow-up with the GridView service.

Services round table: Jan reported that CASTOR LHCb was upgraded to 2.1.7-23 and the team is ready to roll out CASTOR SRM 2.7, starting with the CERN public service next Monday. MariaDZ reported there will be a USAG meeting at 9.30 tomorrow (see http://indico.cern.ch/conferenceDisplay.py?confId=45916) which will address the changing role of the TPMs as increasing use is made of direct routing of tickets to sites, and JRA1 also has other needs for their help. All interested are welcome to join the meeting.

AOB:

Thursday

Attendance: local(Harry, Gavin, Julia, Maria G, Jamie, Maria D, Jan, Graeme, Olof, Roberto);remote(Jeremy, Gonzalo, Gareth).

elog review:

Experiments round table:

  • ATLAS (Graeme) - not too much. Change to this week's functional tests: usually data flow out along the T0 data path -> DATADISK; this week it was decided to test more of the production pathways: CERN -> DATADISK at T1; between T1s from DATADISK -> MCDISK; then -> T2 PRODDISK area (i.e. simulating more of the MC flow). Don't anticipate a big impact but better to check... Yesterday tested the new CASTOR instance at CERN, ppsdisk: put 100K files in over 4-5 hours, ~94% efficiency. ASGC: only ~50% efficiency over the past 24h, with many periods of severe CASTOR/Oracle errors.

  • LHCb (Roberto) - preparation for the staging exercise. RAL & CERN still behaving fine at 20MB/s - most likely will increase. PIC was at 20MB/s but started to accumulate a big backlog over the past 12h, so the rate has been dropped to 10MB/s. The GridKA problem was investigated - the explanation is that the exercise uses a very small file size (much smaller than that of ATLAS - maybe ~300MB?); with real data (2GB files) it should be ok (see the rough rate estimate after the experiment reports). CNAF: low rate. IN2P3: no activity at all - is the site still not working properly for LHCb? Confirmation needed from Andrew. Working with Maarten before the meeting on a detailed broadcast to all sites to introduce the pilot role.

  • CMS (Daniele): CMS now has a new ProdAgent release installed and tested at CERN, so DataOps people are informing the T1 sites about the next reprocessing activities starting soon, and T1 coordination under FacilitiesOps is collecting the site feedback and the preparation status. This is the "CRAFT reprocessing" exercise. Ops people would like to begin reprocessing at the custodial T1s ASAP. Budgeting extra effort wrt previous reprocessing rounds, the overall numbers now seem to be:

| Site | New Dataset | Size |
| IN2P3 | /Cosmics/Commissioning08-ReReco-v1/RECO | 69.6 TB |
| FZK | /Calo/Commissioning08-ReReco-v1/RECO | 62.2 TB |
| RAL | /MinimumBias/Commissioning08-ReReco-v1/RECO | 19.9 TB |

This doesn't seem to pose a problem for any of these sites, judging from the stats posted on the FacOps blackboard used by the site contacts. The "acquisition era" (CMS jargon for the namespace) is also unchanged, so Ops people don't believe that any setup work is required at the sites.

  • ALICE (Patricia) - regarding ALICE and the latest AliEn v2.16 version, we are still testing the latest LCG modules which will be part of this new release. We have been able to add the CREAM submission module, which will be ready for the testing of this service as soon as we begin to deploy it.
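
A rough rate estimate (a sketch, not a measurement): the GridKA observation above is consistent with each recalled file paying a roughly fixed tape mount/positioning overhead, so small files cap the achievable recall rate. The Python sketch below illustrates this; the drive streaming rate and per-file overhead are assumed, illustrative numbers, not GridKA figures.

    # Back-of-envelope model of tape recall rate vs. file size.
    # The drive streaming rate and per-file overhead below are illustrative
    # assumptions, not measured GridKA values.

    def effective_rate_mb_per_s(file_size_mb, drive_rate_mb_per_s=60.0,
                                per_file_overhead_s=30.0):
        """Average recall rate when every file pays a fixed positioning cost."""
        transfer_time_s = file_size_mb / drive_rate_mb_per_s
        return file_size_mb / (transfer_time_s + per_file_overhead_s)

    for size_mb in (300, 2000):  # ~300 MB test files vs ~2 GB real data files
        print("%5d MB files -> ~%.1f MB/s per drive"
              % (size_mb, effective_rate_mb_per_s(size_mb)))

With these assumed numbers, ~300 MB files give under 10 MB/s per drive while ~2 GB files give around 30 MB/s, consistent with the expectation that the rate will be fine with real data.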

Sites round table:

  • CERN (TNT) - post-mortem of cooling problems at CERN Tuesday.

Services round table:

  • Jan - sent out mails to all experiments on the proposal for SRM 2.7 production deployment mid next week. Some minor problems but no show-stoppers.

  • Dashboards - new version of the ATLAS data management dashboard released. All activities now in one instance. Much more convenient to use.

  • Databases - replication for CMS stopped on Wednesday at 18:30 due to a problem on the APPLY side - a wrong operation when adding a new schema. Fixed this morning. Since 05:00 we have been observing a problem with replication to TRIUMF and ASGC for the ATLAS conditions data. First investigations point to the network? Investigating with experts.

  • Phone meeting on 4th December with OSG management, the OSG GOC, US-ATLAS and US-CMS on direct routing of GGUS tickets to OSG sites. This is a requirement from ATLAS - how can it be implemented? (Current model: GGUS -> GOC of OSG.) A related Savannah ticket exists. The USAG meeting today (agenda and related documents) had no VOs represented; it decided on a new role for the TPMs: debugging, Savannah ticket opening, and escalation to GGUS supporters / developers on slow tickets.

AOB:

Friday

Attendance: local(Ewan, Gav, Jamie, Harry, Nick, Patricia, Jan);remote(Mr Egee Broadcast, Gonzalo, Derek, JT).

elog review:

Experiments round table:

  • ATLAS (Graeme) - Functional tests with different data flow path started yesterday.
    T0 -> T1_DATADISK -> T1'_MCDISK -> T2'_PRODDISK
    Minor misconfiguration issues fixed on DDM to process subscriptions properly.
    Most clouds performing well.
    Issue with slow subscription processing to the DE cloud which is being looked at by experts.

  • LHCb (Roberto) - Not much activity because of the LHCb week going on.
    Remarks on the preparation for the staging exercise:
    • CERN,RAL (rate increased to 30MB/s) - OK
    • PIC reduced to 10MB/s yesterday - Not fully OK (Gonzalo - the pre-stage tests needed to be reduced because of a backlog, believed to be due to a couple of incidents yesterday and the day before: 1 tape drive got stuck with an LHCb tape, which caused rates to be reduced, and 1 of the 2 disk pools for this had an intervention 2 days ago, which caused a reduction in rate for some hours. In principle since yesterday morning both should be ok and hence the backlog should clear. Should be able to cope with the high rate.)
    • CNAF increased to 10MB/s and behaving well - OK
    • GridKA seems now to be able to at least sustain 5 MB/s - OK
    • SARA started degrading yesterday at ~7pm: TEAM GGUS #44204
    • IN2P3 under investigation (TEAM GGUS #44208 and dCache ticket will be filed by Lionel)

  • ALICE (Patricia) - still waiting for the green light to put AliEn 2.16 at all sites. WMS modules OK; CREAM support OK from the ALICE point of view. Still waiting for some fixes on the data management side for xrootd (again internal to ALICE) - due today or Monday - and then should be ready to deploy at the sites.

Sites round table:

  • NL-T1 (via EGEE broadcast) - the dCache head node will be migrated to other hardware to solve performance issues. Intervention scheduled for 09-12-2008 07:00 - 18:00. Nick - meeting of the GOCDB advisory board next week.

  • GRIF (ditto) - GRIF-IRFU NFS server with software area is down - investigation is ongoing: Start: 10:58, estimated(?) end: 12:58

  • RAL (Derek) - power glitch at 12:10 affected services across RAL. 20 disk servers rebooted - mainly affecting ATLAS but also CMS and LHCb. Some drives being reverified. Also affected GOCDB. Nothing seems to be down but fsck & array rebuild...

  • ASGC - extended log space (ran out of archive log space). 100% success rate when checked this morning.

  • IN2P3 (Catherine Biscarat [biscarat@in2p3.fr]) - the IN2P3-CC site (in Lyon) has scheduled a major downtime this coming Tuesday (December 2nd 2008). This downtime is driven by electrical work on site.
    This will impact all major services:
    • a minimum service is maintained for dCache: incoming transfers will be OK while all outgoing transfers will be closed (as well as access from the worker nodes);
    • HPSS will be worked on (an upgrade of the disk firmware is planned) and it will be unavailable all day long;
    • the CEs will be closed starting Monday December 1st at 8AM; however new jobs will not be accepted on the batch queue starting Sunday evening (~11PM).

      All services are planned to be back mid-afternoon (~4PM) and the batch should be reopened around 6PM on Tuesday.

Services round table:

  • DB (Maria) - Since Thursday 27th at around 05:00 am there have been problems with the replication to the Tier 1 sites TRIUMF and ASGC. Solved this morning(?)

AOB:

Topic attachments:
  • Re__asgc_working_conf_call_Tuesday_25_Nov_09.00_CET_-_16.00_Taiwan.pdf (60.0 K, 2008-11-25, JamieShiers)
  • alice-ggus.pdf (3.7 K, 2008-11-24, JamieShiers)
  • atlas-ggus.pdf (49.9 K, 2008-11-24, JamieShiers)
  • cms-ggus.pdf (8.4 K, 2008-11-24, JamieShiers)
  • lhcb-ggus.pdf (17.3 K, 2008-11-24, JamieShiers)