Week of 090525
WLCG Service Incident Reports
GGUS Team / Alarm Tickets during last week
Archive of Broadcasts
Weekly VO Summaries of Site Availability
Daily WLCG Operations Call details
To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
General Information
See the weekly joint operations meeting minutes
Additional Material:
Monday:
Attendance: local(Jamie, Gang, Stephane, Nick, Harry, Andrea, Patricia, MariaD, Simone, Roberto, Diana, Dirk); remote(Andrea (CNAF), JT, Daniele).
Experiments round table:
- ATLAS - (Stephane/Simone) An issue with the LFC at PIC over the weekend triggered a discussion on the mechanism for sites to contact experiment experts. ATLAS proposes to set up an email list / e-group of experiment experts which can be used by a small group of known T1 site contacts (10-20 in total) to reach the experiment expert on call (and possibly, later, Point 1 experts). MariaG: Could VOMS groups + an email list be used? Or GGUS tickets? Simone: would need to be able to generate SMS. Jamie: traceability of this channel would be useful. On the technical side of the problem: ATLAS was trying to delete LFC entries at PIC with the central deletion tool. This triggered instability of the PIC LFC instance, which had not been seen in similar activities/load before. The full analysis is still going on, but the problem seems to be related to particular non-canonical ACL values rather than unusual load (a rough sketch of how such ACLs could be scanned for is given after this round table). PIC experts and LFC support are working on the remaining cleanup and on understanding the problem (ticket 48980). Simone: T2 transfers from Taiwan are enabled again as of Sunday; no problems observed so far. Jeff mentioned problems on Sunday with failing SAM tests due to expired proxies. Andrea saw similar problems for submissions via the WMS, even though direct submission works fine. To be checked by ATLAS.
- CMS reports - (Daniele) Daniele reminded sites to consult the wiki documents which have been put in place in preparation for STEP09. He asked whether June 1st (a CERN holiday) would be considered the first full day of STEP09. Jamie: June 1st will be a best-effort day.
- ALICE - (Patricia) No major issues during the weekend - smooth running. Some problems at the CERN T0 with jobs which required manual killing; it is not yet clear why this apparently happens only at CERN. The new T2 at Hiroshima is ramping up; there are still some issues with submission via their VObox, which is being worked on together with another new site in Spain.
- LHCb reports - (Roberto) MC09 production is going on. Cooling problems at Lyon and problems with the CRL at CNAF (being worked on by CNAF experts). Some jobs had been killed by the NIKHEF watchdog - Jeff added that the problem has been analysed and was due to external database connections.
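Purely as an illustration of the ACL check mentioned in the ATLAS report above (none of this is taken from the minutes): the sketch below scans a list of suspect LFNs for ACL entries that do not match a getfacl-style pattern. It assumes the standard lfc-getacl client tool is available with LFC_HOST set; the file of paths, the exact output format and the "canonical" pattern are all assumptions.
<verbatim>
#!/usr/bin/env python3
# Illustrative sketch only: scan LFC entries for ACL lines that do not
# match an assumed "canonical" getfacl-style pattern. The lfc-getacl
# output format and the pattern below are assumptions.
import re
import subprocess
import sys

CANONICAL = re.compile(r'^(user|group|mask|other)(:[^:]*)?:[rwx-]{3}(\s+#effective:[rwx-]{3})?$')

def acl_entries(path):
    """Return the non-comment ACL lines reported by lfc-getacl for one LFN."""
    out = subprocess.run(['lfc-getacl', path], capture_output=True,
                         text=True, check=True).stdout
    return [l.strip() for l in out.splitlines()
            if l.strip() and not l.startswith('#')]

def main(path_list):
    for path in open(path_list):          # one LFC path per line
        path = path.strip()
        if not path:
            continue
        try:
            bad = [e for e in acl_entries(path) if not CANONICAL.match(e)]
        except subprocess.CalledProcessError as err:
            print('%s: lfc-getacl failed (%s)' % (path, err))
            continue
        if bad:
            print('%s: suspicious ACL entries: %s' % (path, ', '.join(bad)))

if __name__ == '__main__':
    main(sys.argv[1])
</verbatim>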
Sites / Services round table:
- CNAF (Andrea) - working on LHCb certificate problems and also cross checking on other nodes to avoid similar problems. The site also saw some job submission problems (~20 jobs died) which are being analysed.
- NIKHEF (Jeff): The dCache upgrade to 1.9.2-5 showed problems with memory consumption (>8GB) for the gsi-dcap doors. The issue has been forwarded to dCache support and the site is preparing for a possible downgrade to 1.9.0-10 in case the problem cannot be fixed quickly. Simone/Jeff warned other sites to take this into account in case of planned dCache upgrades.
- ASGC(Gang): SRM problems last week have been identified as another instance of the Oracle "BigID" problem.
- CERN-PROD: The ATLAS issue with the CASTOR SRM last Thursday is not yet understood. A service request has been opened with the developers; a SIR will follow once the problem is better understood.
AOB:
Tuesday:
Attendance: local(Nick, Roberto, Julia, Gavin, MariaD, Gang, Dirk); remote(Daniele, JohnK (RAL), Andrea (CNAF), JT, Michael).
Experiments round table:
- ATLAS - (Simone) ASGC progress: ATLAS has started to test the recall-from-tape functionality. Recalls issued without space tokens end up in the correct space. The access to a VO box for Simone has been appreciated and simplifies the recommissioning. There is still an issue with the small number of available tape drives (2), which are shared with CMS; a further 6 drives are expected to become available soon (ASGC is waiting for an IBM intervention). Simone remarked that some new recall requests have overtaken older ones and asked the site to investigate.
- CMS reports - (Daniele) CRLs on the CMS UI at CERN expired (ticket 49039). The CASTOR upgrade for CMS finished smoothly apart from some minor issues: the upgrade broke some CMS download scripts because the stager query output format had changed (a small format-tolerant parsing sketch is given after this round table). CMS has in the meantime adjusted their production PhEDEx to cope with the unintended change; the old output format will be restored with the upcoming minor upgrade to 2.1.8-8. Daniele reported that CMS got confirmation from ATLAS for multi-VO tape I/O tests; a more concrete schedule is expected after a meeting between CMS and ATLAS tomorrow. Prestaging tests have started at CNAF and will continue with other sites. Transfer test samples have also been prepared and tests will proceed soon.
- LHCb reports - (Roberto) Monte Carlo production is continuing smoothly, but issues with gsi-dcap have been observed at Lyon and, during the weekend, at GridKa. Roberto reported an issue with the UI in CERN AFS. Gavin added that this issue is currently being looked at by the support team in FIO.
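On the CMS point above about download scripts broken by the changed stager query format: as a purely illustrative sketch (the actual CASTOR output format and the CMS scripts are not described in these minutes), the snippet below keys on a known status keyword instead of a fixed column position, so a cosmetic change in the output layout would not break it. The stager_qry invocation and the status names shown are assumptions.
<verbatim>
#!/usr/bin/env python3
# Illustrative sketch only: query the CASTOR stager for one file and
# pick out the status keyword without relying on column positions.
# The exact stager_qry output format is an assumption.
import subprocess
import sys

STATUSES = {'STAGED', 'STAGEIN', 'STAGEOUT', 'CANBEMIGR', 'INVALID'}

def file_status(castor_path):
    """Return the stager status keyword for one CASTOR file, or None."""
    out = subprocess.run(['stager_qry', '-M', castor_path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        tokens = line.split()
        if castor_path in tokens:
            # Look for a known status keyword anywhere on the line,
            # rather than assuming it is e.g. the third column.
            for tok in tokens:
                if tok in STATUSES:
                    return tok
    return None

if __name__ == '__main__':
    print(file_status(sys.argv[1]))
</verbatim>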
Sites / Services round table:
- RAL (John): RAL had scheduled downtime for a network change which unfortunately failed. The intervention will have to be repeated.
- CNAF (Andrea): Several tickets solved and closed. Upgraded to the latest gLite version; no problems observed so far.
- BNL (Michael): BNL upgraded to the latest version of the ATLAS site services. Michael mentioned some issues but will provide more detail later. Simone added that this change was required because two weeks ago, while still using a candidate release of the site services, BNL did not properly mark locations in the DDM catalog. This issue is now fixed with the upgrade, but it is not yet clear what happened to the failed locations from before. ATLAS will follow up with the site.
- NIKHEF/SARA (Jeff): Still discussions/investigations on a dCache downgrade, or possibly an upgrade to a new patch level, triggered by the problems seen after the recent dCache upgrade.
- CERN (Gavin): The ATLAS SRM issue of last Thursday has now been understood. A post mortem about the AFS UI problems can be found here.
AOB: (MariaDZ) Since the GGUS May 2009 release (last week) one can now see the LHCOPN tickets linked from the GGUS homepage (bottom left). I am putting this link in these minutes' template. When people need more information, a network person will be called to the wlcg-operations meeting. This was also discussed today at an LHCOPN meeting. We need OSG, or at least BNL, feedback on the provision of a site contact and an emergency email address for use by GGUS; this was discussed in two dedicated meetings (last December and March), details in http://savannah.cern.ch/support/?107531. About the ATLAS requirement presented yesterday, please check https://savannah.cern.ch/support/?108277; we should refine the requirement in this ticket.
Wednesday
Attendance: local(Jamie, Nick, Gav, Harry, Gang, Antonio, Roberto, MariaDZ, Simone, Diana); remote(John Kelly (RAL), Andrea Chierici (CNAF), Michael, Daniele).
Experiments round table:
- ATLAS (Simone) - follow-up on previous issues: PIC - contacts reported that the entries with corrupted ACLs have been cleaned; the deletion agents were restarted and the problem did not reoccur. Can confirm the problematic ACLs; the ticket stays open as ATLAS would like the developers to explain the crash - bad ACLs should not crash the server. BNL: exchange with Hiro, who updated the site services (SS) to the latest stable release. SS DB - the backlog of registrations in the central catalog is being done by hand; the whole process is now ok - closed. Problem this morning with FTS@SARA - very, very slow, eventually timing out. The VO boxes serving SARA & the IT sites were "disturbed" - the SS have to be checked (DDM developers). Ron Trompert fixed it quickly. Tomorrow, datasets marked CCRC'08 will stop being produced; STEP'09 data will be subscribed according to the full pattern so that export can ramp up on Monday/Tuesday.
Brian (RAL) - in hand to have space in DATADISK ready for STEP. Within the UK, looking at the perceived R/W rate of CASTOR for ATLAS - maybe due to small files? Simone - is this for the current data distribution? A: yes, for current transfers. Simone: the T1 gets data which is later shipped to T2s but not fully synchronised; might get T1 pile-up. Filling the FTS channels etc.? Have to check. Brian - preparing T2 disk space as well.
- CMS reports (Daniele) - The CRL expiration errors are now fixed - thanks to Sophie for the post mortem; the tickets opened by CMS on this are now closed & verified. News from the last 24h: mainly a compact & good meeting with Graeme (ATLAS) yesterday; details discussed, all sorted - ATLAS+CMS overlapping T0 tape-writing tests(!). Updated plans from the other VOs on this would be good - there is still some possibility to shift in order to get a multi-VO overlap. T1: more tests with Andrea's script at CNAF; more work at IN2P3 to finalise. First tests with the script by Andrea & Elisa are on a Twiki - migrating soon to the main CMS twiki. Transfers: possibly today, and before tomorrow, tables with all details on the samples at T1 sites for the T1-T1 tests. Analysis at T2: no progress to report. A meeting later today, in time for a report tomorrow. Roberto - when is the overlap in tape writing? Daniele: June 2-14 is the overlap period from the ATLAS side. CMS: global run in the 2nd week and CRUZET in the first week; hope for ~2 x 48h slots of full-speed overlap, so the 2nd week of June should give a couple of slots with ATLAS+CMS. Simone - the ATLAS test suite runs writing to tape at the T0 at all times, but the tape family is recycled; this will run for the full period of STEP, so any time CMS can kick in will be ok! Harry - the plan to go up to 1000 Hz - still on? Daniele - depends on timing; the T0 team is writing to tape now but it is not so obvious we can push to 1 kHz. Roberto - LHCb will be writing to tape at the nominal "FEST'09" rate for the full period. Harry - first-pass processing data also to tape? A: Yes.
- ALICE - (site tutorial ongoing)
- LHCb reports (Roberto) - MC09 production is ongoing at T1s/T2s, with merging at the T1s and CERN. The issue accessing data at Lyon is still ongoing - to be confirmed by the local contact and site admin. The CRL problem on the AFS UI had an impact but is now closed, thanks.
Sites / Services round table:
- NL-T1: We are downgrading dCache to 1.9.0-10, because we keep having problems with the GSIdCap service of version 1.9.2-5.
- ASGC (Gang) - 5+2 tape drives are now online, i.e. 7 of 8, so 1 is still not online.
- RAL (John) - scheduled downtime tomorrow 08:00-11:00 local time, to make a 3rd attempt to fix the network!
- CERN (Gav) - set of CASTOR interventions - another tomorrow - should be transparent!
AOB: (MariaDZ): USAG meeting tomorrow, agenda: http://indico.cern.ch/conferenceDisplay.py?confId=59811
Reminder - please join / attend so that we can start promptly at 15:00!
Thursday
Attendance: local(Jamie, Sophie, Nick, Jacek, Miguel, Julia, MariaD, Simone, Gang, Harry, Gavin, Dirk); remote(Daniele (CMS), Andrea (CNAF), Gareth (RAL), Michael (BNL), Jeremy (GridPP), Greig (LHCb)).
Experiments round table:
- ATLAS - (Simone) Worried about the FZK tape problems, as Lyon will also be only partially available for the first week of STEP. Since FZK represents 25% of the ATLAS T1 capacity, this may significantly limit the STEP work for the VO. Other news: today the functional tests have been stopped, leaving only the STEP load generators running. ATLAS expects to reach 20% of the STEP load at the weekend, 50% early next week and 100% by Wednesday.
Michael: Are there already plans for how to redistribute the STEP shares of missing sites? Simone: this will be done according to MoU fractions (ESD and AOD will not be redistributed).
- CMS reports - (Daniele) CMS will have their last STEP preparations meeting this afternoon; the CMS STEP wikis should contain the relevant updates by tomorrow morning. Current issues: major tape problems at FZK. It is not yet clear whether the site can join STEP for CMS; if not, a later tape test at the site may be required. Following up on problems accessing data from a CMS global run, CMS has been informed of a data unavailability caused by a tape volume which had to be sent to SUN for media recovery; two other tapes were sent in April and are not yet back. This triggered questions/proposals from CMS on the procedure for dealing with extended (weeks/months) data unavailability (and possible loss):
- Q1: Could CMS be notified when tape volumes are sent to recovery at the tape vendor site?
- Q2: Are/should the files on the disk pool be protected from garbage collection, to ensure that data which is possibly lost on the tape media is not also removed from disk?
- Q3: What can be done to avoid discovery of unavailable data only at the time when a job attempts to read?
- Miguel: The CASTOR team will get back to CMS with answers on the above points offline, aiming for a general solution as these problems also apply to the other experiments. MariaD: Would cms-vo-support be sufficient for this notification? Daniele: probably yes, the list may even be larger than required. Miguel: A list of offline tapes exists, which is also consulted by other experiments. The mapping from individual tape volumes to unavailable files can be done using the CASTOR tools, but files with more than one tape copy need to be taken into account, as their accessibility is not affected by one tape being offline (a minimal sketch of such a volume-to-file check is given after this round table). Simone points out that similar issues with offline tapes also exist at other sites.
- LHCb reports - (Greig) LHCb week ongoing. Otherwise smooth running.
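Following up on Miguel's remark above that offline tape volumes can be mapped to unavailable files while taking dual-copy files into account, here is a purely illustrative sketch. It assumes the inputs already exist: a plain list of offline volume IDs, and a dump with one "<file path> <volume id>" line per tape copy (extracted beforehand, e.g. with the CASTOR name server tools); both file formats are assumptions. A file is reported only if every one of its tape copies sits on an offline volume.
<verbatim>
#!/usr/bin/env python3
# Illustrative sketch only: given a list of offline tape volumes and a
# dump of file -> tape-copy mappings (both assumed inputs), report the
# files whose every tape copy is on an offline volume. Dual-copy files
# with at least one healthy copy are therefore not reported.
import sys
from collections import defaultdict

def load_offline_volumes(path):
    return {line.strip() for line in open(path) if line.strip()}

def load_copies(path):
    """Map each file to the set of volumes holding one of its tape copies."""
    copies = defaultdict(set)
    for line in open(path):
        parts = line.split()
        if len(parts) >= 2:
            copies[parts[0]].add(parts[1])
    return copies

def unavailable_files(copies, offline):
    for fname, vids in sorted(copies.items()):
        if vids and vids <= offline:   # every copy is on an offline tape
            yield fname, vids

if __name__ == '__main__':
    offline = load_offline_volumes(sys.argv[1])  # e.g. offline_vids.txt
    copies = load_copies(sys.argv[2])            # e.g. file_copies.txt
    for fname, vids in unavailable_files(copies, offline):
        print('%s  (copies on: %s)' % (fname, ', '.join(sorted(vids))))
</verbatim>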
Sites / Services round table:
- ASGC(Gang): Today is a public holiday. Five new tape drives are not yet working - in progress.
- CNAF(Andrea): no issues to report
- RAL(Gareth): third try for network intervention failed - RAL is reviewing the procedures for this intervention.
- Physics Databases(Jacek) - proposed a database intervention on integration/pre-production database clusters to move to RH5 and 64bit for Tue next week 9:00-12:00. Affected clusters will be INTR (ATLAS), INT2R (CMS) and INT11R (WLCG).
AOB:
- Jamie: proposal to do a short (~0.5h) network backup test on Wed 17 June. During this test all internet traffic will be at risk, even though no impact is expected. The daily ops meeting template has been changed to point to the experiment STEP wikis.
- Harry: Should review whether the kernel upgrade on 15th June could possibly be done later (it comes too soon in the STEP schedule).
- Nick: baseline service wiki has been updated. Experiments are invited to cross check.
Friday
Attendance: local(Jamie, Gavin, Gang, Nick, Simone, Harry, Dirk); remote(Andrea (CNAF), Michael (BNL), Jeremy (GridPP), Greig (LHCb), JT (NIKHEF), Daniele (CMS)).
Experiments round table:
- ATLAS - (Simone) The German cloud will not be able to cope with tape recall, but will be able to do tape writing; ATLAS will provide the nominal data volume and, on their request, a slightly reduced job count. For the French cloud it was confirmed that the downtime only affects tape, so ATLAS will provide disk-based data and redistribute the RAW volume over other sites. A problem with the STEP load generator has been found (10GB files instead of 5GB) and fixed by Simone. Simone explained that the full transfer setup is now ready for ATLAS and the rate will ramp up as discussed on Thursday. He mentioned that the HammerCloud load is being set up. NIKHEF saw problems with too-aggressive pilot rates; ATLAS will reduce the rate.
- ALICE - (submitted earlier by Patricia): Several changes in the production cycles yesterday stopped production, and brief disturbances in the job profiles at the sites were observed; no actions were necessary by the site experts. Bug 49057 assigned to CERN (bad behaviour of the CREAM-CE services) has been closed this morning; the system is back in production.
Sites / Services round table:
- FZK/GridKa (email from Andreas): Dear all,
unfortunately, we have some bad news about the GridKa tape infrastructure.
The connection to archival storage (tape) at GridKa is broken and will probably not function before STEP09.
The actual cause is yet unknown but is rooted in the SAN configuration and/or hardware.
We experience random timeouts on all fibre channel links that connect the dcache pool hosts to the tape drives and libraries.
GridKa technicians (and the involved supporters from the vendors) are doing everything possible to fix the problems as soon as possible, but the chances of having the issue solved by next week are low. We therefore cannot take part in any tape-dependent activities during STEP09. If the situation improves during STEP09 we might be able to join later, maybe during the second week.
Sorry for the late information (experiments have been informed earlier) but I hoped to get some better news from our technicians this afternoon.
We will keep you updated if the situation changes or (hopefully) improves.
Regards,
Andreas
AOB:
- Reminder - no meeting on Monday! STEP'09 starts Tuesday!
--
JamieShiers - 25 May 2009