-- HarryRenshall - 09 Feb 2009
Week of 090209
- This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
- Status can be received, due or open
GGUS Team / Alarm Tickets during last week
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
General Information
See the weekly joint operations meeting minutes.
Additional Material:
Monday:
Attendance: local(Harry, Jean-Philippe, Ricardo, Nick, Eva, Simone, MariaDZ, Patricia); remote(Gonzalo, Gareth, Daniele).
Experiments round table:
CMS (DB): 1) RAL had a permission-denied problem this morning, promptly fixed by CMS contacts; no need for any tickets to the site. 2) Issues with many jobs pending on the gLite WMS at CERN. The main cause seems to have been the unzipping of input sandboxes, with a first workaround being to put ZippedISB = false in their JDL (see the sketch below). See the complete report at
http://indico.cern.ch/getFile.py/access?contribId=25&sessionId=0&resId=0&materialId=slides&confId=52425
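For illustration, a minimal sketch of how the workaround above could be applied to an existing JDL file. Only the ZippedISB attribute comes from the report; the helper function and file name are hypothetical, and this is not CMS's actual tooling.

```python
# Hedged sketch: append 'ZippedISB = false;' to a classad-style JDL file so the
# WMS does not have to unzip the input sandbox. Helper and file names are
# hypothetical; only the JDL attribute itself is taken from the report above.
from pathlib import Path

def disable_zipped_isb(jdl_path: str) -> None:
    """Add 'ZippedISB = false;' to the JDL unless the attribute is already set."""
    path = Path(jdl_path)
    text = path.read_text()
    if "zippedisb" in text.lower():
        return  # attribute already present; respect the existing setting
    path.write_text(text.rstrip() + "\nZippedISB = false;\n")

# Example usage (hypothetical file name):
# disable_zipped_isb("analysis_job.jdl")
```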
ATLAS (SC): 1) There was an upgrade of the machine hosting the ATLAS central catalogues this morning with a 5-minute downtime. Follow-up validation checks look OK. 2) Currently the site services are crashing from time to time. The restart is automatic but restarting the databases takes some time, so there are service gaps. The developers are investigating. 3) ATLAS operations have decided on a new policy on AOD copies in order to avoid a disk space crisis. An AOD copy at a Tier 1 site can be deleted if and only if that Tier 1 was not the originating site and a copy exists in its associated Tier 2 cloud.
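As a clarification of the deletion policy just described, a minimal sketch follows. It is not ATLAS's actual tooling; the function, data structures and site names are hypothetical, only the rule itself comes from the report above.

```python
# Hedged illustration of the stated AOD deletion policy; all names are hypothetical.

def aod_copy_deletable(tier1: str,
                       originating_site: str,
                       replica_sites: set,
                       tier2_cloud: dict) -> bool:
    """Return True if the AOD copy held at `tier1` may be deleted under the policy."""
    if tier1 == originating_site:
        return False                         # the originating site's copy is always kept
    cloud_sites = tier2_cloud.get(tier1, set())
    return bool(cloud_sites & replica_sites)  # a copy must survive in the associated T2 cloud

# Hypothetical example: a Tier 1 could delete its copy if a Tier 2 in its cloud holds one.
# aod_copy_deletable("FZK", "CNAF", {"DESY-HH"}, {"FZK": {"DESY-HH", "DESY-ZN"}})
```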
ALICE (PM): 1) Currently running about 4000 concurrent jobs grid-wide. 2) Experiencing again at CERN the serious WMS backlogs seen during the Christmas runs, so working with FIO on this. The problem is not seen at RAL or CNAF. 3) The French WMS is failing again - a ticket has been submitted. 4) The software area at the Cagliari Tier 2 has become inaccessible - no ticket yet.
Sites / Services round table:
FZK: Following an earlier ATLAS report that FZK were running with 20-80% failures for file transfer / stage-in/out caused by pnfs overload (due to the number of files and problems with the DB load), which could be cured with a roughly half-day operation on the pnfs database, Jos van Wezel replied: FZK will be able to do this no sooner than Tuesday. The operation means taking down the dCache name service (pnfs) for all supported VOs, so we have to discuss this with them. The ATLAS table space has grown to a disproportionate size because deleted records still linger. The quickest way to compact the DB is to dump and restore it, which is expected to take about 5 hours in total. It may cure the problem, it may not; it will not work miracles. The SRM is still under high load and no medicine has been found yet to heal it. We will most likely dump/restore on Tuesday or Wednesday, but can tell you definitely no sooner than Monday afternoon.
Databases (ED): The ATLAS online database was moved this morning. It is back up but with some residual connectivity problems.
GGUS weekly ticket review (MDZ): There were no experiment alarm tickets last week.
AOB:
Tuesday:
Attendance: local(Julia, Simone, Jean-Philippe, MariaDZ, Jamie, Eva, Nick, Harry, Luca, Ricardo, Patricia); remote(JT, Daniele, Gareth).
Experiments round table:
- CMS (Daniele) - presented yesterday's proposed modifications to the CMS VO card - the ones already asked for by Pepe from PIC, also commented on at yesterday's call. A single slide is attached to the twiki (https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports#10_February_2009_Tuesday). T1 sites: some issues at several T1s over the last week affected the overall status - 65% of sites passed >90% of the CMS SAM tests, against 88% the previous week. Some minor issues: CNAF - a myproxy-FTS related issue (operational mistake, notified and corrected); IN2P3 - glitches with SRM and HPSS, reported at the weekly meeting, still seeing some export problems not strictly related to small files. PIC: scheduled downtime today; another myproxy-FTS expiration (like CNAF) just over the weekend. T2: the general status of CMS SAM tests is getting worse - only 52% of sites with availability > 80%, versus 67% previously. Some actions at the regional level; the restart of computing shifts will be sped up to focus on this more closely. Andrea Sciaba presented the status of USAG activities, GGUS ticket tests etc. JT: in last week's results the colours are different - is brown "undefined"? Julia: maintenance. JT: can we use the GridView colours as presented at the MB? Julia: OK. JT: and undefined? Julia: grey.
- ATLAS (Simone) - 2 site-related issues: Canada and the FTS at TRIUMF. It still uses a very conservative number of parallel files (3) for T1-T1 transfers - thought we had agreed to increase it? Several FTS channels are set as inactive, with a backlog of transfers to be done. Mail sent to TRIUMF - what is the reason for this? FZK: last week there was a comment about trying to clean the DB for dCache, with a tentative plan for Tue-Wed this week - any news? Good news of today: ATLAS is ready to start merging HITS files. A big breakthrough - a factor 10 fewer files for MC. The procedure is retroactive, also for old HITS, so the number of files in the SEs will decrease. For the computing model HITS should be processed and then go to tape - now, with 2 GB files, this can be done, which frees up T1 disk space - in production by the end of the week. Next step: merge AODs - code ready but not yet tested.
pcache - a tentative solution for jobs which try to access the same file(s): a job downloads the file to a common area on the WN, and any subsequent job can use that common area (cache) for a certain time (see the sketch below). Luca - talk tomorrow at the GDB? Simone - it is intrusive, as sites like to clean up WNs when jobs are done; discussion tomorrow - sites, please participate. Julia - input files? A: yes; the DB release tar ball is a typical example. JT - cache cleanup? Yes, driven by the site - it can be cleaned up as wanted. JT - ATLAS-specific SAM tests: this morning the ATLAS CE test had only run successfully twice since 5 Feb. Looks WMS related? Same pattern as for ALICE - do they use the same WMS? (No.) Brian - good news about HITS. Merging of AODs: problems transferring datasets with 16K files going to a single directory - encourage merging of AODs! Simone - some datasets have ~20K files; is the number of files per directory a problem? Brian - srmls on a directory has a maximum of 1023 entries (it returns the first 1023). Simone - it is only used for single files. Luca - having problems with the pre-staging exercise, affected by a bug in the SRM endpoint (gsoap error); the CASTOR team advises upgrading to the latest version. Q: we were warned by Simone that the deletion rate would increase from 1 to 10 Hz - is it going OK? Simone - no problems observed.
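For readers unfamiliar with the pcache idea mentioned above, the following is a minimal sketch assuming only a shared scratch directory on the worker node. It is not the ATLAS pcache implementation; the paths, names and retention window are hypothetical, and the actual download is passed in as a callable so no particular grid client is assumed.

```python
# Hedged sketch of a worker-node file cache: the first job pays the download cost,
# later jobs on the same WN reuse the cached copy until the site cleans it up.
import os
import time

CACHE_DIR = "/scratch/pcache"      # hypothetical shared area on the WN
MAX_AGE_SECONDS = 24 * 3600        # hypothetical retention window, site-controlled

def fetch_via_cache(lfn: str, download):
    """Return a local path for `lfn`, downloading it into the cache only if needed.

    `download(lfn, destination)` is a caller-supplied function (e.g. wrapping the
    site's preferred copy tool); nothing here assumes a specific grid client.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, os.path.basename(lfn))
    fresh = (os.path.exists(cached)
             and time.time() - os.path.getmtime(cached) < MAX_AGE_SECONDS)
    if not fresh:
        download(lfn, cached)      # first job on this WN stages the file
    return cached                  # subsequent jobs hit the cache

# Typical use case from the discussion above: a DB-release tar ball shared by many jobs.
```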
- ALICE (Patricia) - the situation with the WMS is getting worse and worse. 3 WMS at CNAF: 1 went down this morning and the other 2 are in drain mode, so there is no WMS in Italy for ALICE; France - a ticket was opened yesterday, no reaction yet. The WMSs at CERN are completely overloaded. Trying to figure out the reason - only 1900 jobs, so why so many requests? Asked for a 10' slot at the GDB to discuss. Maria - where is the WMS in France? A: GRIF. Maria - assigned directly to GRIF? A: warned the site admins and assigned the ticket to the French ROC. Ticket 46051.
- LHCb (Simone(!)) - all VO views at CERN are publishing 0 as the number of waiting and running jobs. Ricardo - fixed today.
Sites / Services round table:
- NL-T1 (JT) - SRM problems over the weekend (dCache at SARA). Started Saturday, solved Sunday. No clue in the logs! The service started to become slow and the number of DB connections backed up; a restart fixed it. Next Monday 16 Feb the dCache at SARA will be offline for part of the day as it is being moved to a more powerful machine and the OS upgraded.
- RAL (Gareth) - RAL is being scanned by a WN at another site: 194.67.77.133 = wn012.lxfarm.mephi.ru.
- DB (Eva) - the connectivity problems with the ATLAS online DB were fixed yesterday at ~16:30.
AOB:
- GGUS (Maria) - ticket 45850 requires some debugging from FZK!
Wednesday
Attendance: local(Harry, Jan, Miguel, Julia, Simone); remote(Daniele (CMS), John Kelly (RAL)).
Experiments round table:
CMS (DB): Ten new tickets today - all so far internal to the CMS Savannah system. CMS is trying to use GGUS more than in the past but first passes issues through their internal filtering. 1) CERN-RAL transfers failing with 'device busy'; not yet clear which site is at fault. 2) Two Tier 1 issues, with transfer errors from FNAL to US Tier 2 sites and slow transfers from IN2P3 to Purdue (also a US Tier 2). 3) Four Tier 2 - Tier 2 transfer errors requiring consistency checks at the sites. 4) 3 internal CMS tickets.
ATLAS (SC): Two new problems have been found in the newly installed site-services upgrade, involving transfers into CNAF, SARA and NDGF. It was a tricky race condition in the VO boxes which has already been patched at the problematic sites. There will be an upgrade to the central catalogue tomorrow to fix a few minor issues.
Sites / Services round table:
- FZK (Jos) - because of new information (the pnfs update of last Friday has significantly improved the scalability of pnfs) and the help offered by dcache.org for Wednesday 11/2 (to assist with the SRM problems), Andreas and I have decided to postpone the DB shrink action to 17/2, during the planned downtime for FTS and LFC. This means that tomorrow there will be intermittent restarts of the SRM in order to experiment with different settings. It looks like some restrictions on the SRM can be reverted to the settings of early January, which should reduce the SRM failure rate. Apologies for this sudden change, but we consider it the best option at the moment for all users of GridKa. There is already a GGUS team ticket open against us; we are working on it.
- TRIUMF (by email from Reda, replying to Tuesday's ATLAS query about the FTS channels) - we are forced to run at a lower capacity due to very high load on our FTS server causing duplicate entries in the DB which essentially block the transfer agents. DB cleanup does not seem to help much as the problems reoccur. The temporary fix is to route some of the broken T1-T1 channels through STAR-TRIUMF to minimize channel maintenance. The good news is that FTS 2.1 on SL4 has already been deployed and configured on much better hardware. Most of the channels have been tested. We plan on putting it into production by Monday, or later this week if we're happy with all the testing. Hope this clarifies the situation. Thanks, Reda.
- CERN (JvE): The SRM endpoints will be upgraded tomorrow to level 2.1.15. This is a maintenance upgrade which should be transparent. It will fix the problem of creating recursive directory entries.
AOB: Low attendance today due to ongoing GDB meeting.
Thursday
Attendance: local(Harry, Eva, Nick, Julia, Ricardo, Jean-Philippe);remote(Joel, Gareth, Daniele, Brian, Jeff).
Experiments round table:
LHCb (JC): Several site problems: 1) CERN has files in status 'CAN-BE-MIG' since 7 Feb; a ticket has been sent. Also there are very few batch jobs running at CERN - about 30 with a queue of 10000, whereas the LHCb quota should allow about 600 running. Ricardo reported no obvious malfunction but will follow up. 2) The CNAF SRM endpoint is giving 'destination errors during preparation phase'. 3) The IN2P3 SRM endpoint for user storage is giving Globus errors; JPB clarified that these are in fact GridFTP error messages. IN2P3 still has the problem of replicated raw-data dCache files being in status 'UNAVAILABLE' (first reported end of January, see GGUS ticket 45699).
During questions Jeff asked about the meaning of the LHCb dashboard at FZK and RAL being red (as reported at http://dashb-lhcb-sam.cern.ch). Joel confirmed these are reporting the experiment-specific SAM tests, and Gareth replied that they had been failing an SRM test since 06:00 today.
ATLAS (SC): 1) They would appreciate an update from FZK on the schedule of their clean-up of the pnfs database. 2) At CERN, using lcg-cp to copy between pools in a CASTOR instance has been broken for some months (following a gridftp change). ATLAS has only a few use cases of this, but a new one has just arisen and they would now like a schedule for the repair of this problem. 3) Sites needing to communicate with ATLAS operations do so through the agreed mechanism of GGUS tickets, but the ATLAS workflow for handling these is weak and it would be useful if sites could also mail the numbers of important tickets to the SCOD (wlcg-scod@cern.ch) for highlighted inclusion in these daily minutes.
CMS (DB): 4 open tickets, of which 1 is new and 3 are internal to CMS. 1) MC production failing in Warsaw. 2) CERN to RAL transfers were getting 'device or resource busy' errors, so a GGUS ticket was sent. The reply was that this had been related to RAL CASTOR issues, and there have been no errors since. Brian Davies asked how they could look for such errors in PhEDEx/FTS - to be followed up (Daniele/Harry). 3) IN2P3 transfers to Tier 2 sites failing was an issue of files not being staged in from HPSS. 4) Various minor issues at Tier 2 sites, e.g. a misconfiguration of the FTS server in Nebraska.
Sites / Services round table:
Databases (ED): Security patches were installed on the ATLAS offline DB. A CMS application was down overnight as the destination DB was being upgraded.
AOB: Nick: We have set up a debug instance of a gLite 3.1 WMS (including a new patch) to try and understand the backlog problems being seen by ALICE at CERN.
Friday
Attendance: local(Harry, Jean-Philippe, Eva, Ricardo, Patricia, Simone, Kors); remote(Michel, Daniele, Jeremy, Michael, Gareth, Joel).
Experiments round table:
CMS (DB): In the last 24 hours a couple of new tickets for Tier 2 sites, one for a Tier 3 and a few internal data-consistency Savannah entries. 1) The CERN to RAL 'device busy' GGUS ticket has been closed and verified. 2) FZK is getting timeout errors transferring to T2_FR_GRIF_LLR; under investigation by CMS contacts. 3) FNAL to US Tier 2 failures are still open. 4) IN2P3 to Tier 2 being slow has seen some progress, e.g. to Pisa is resolved. 5) Getting some 'file exists' errors transferring to Strasbourg; Michel suggested some contact names.
ALICE (PM): 1) Continuing debugging the CERN WMS slow performance. Two new WMS from the PPS will be put in production with freedom to be drained/debugged as necessary. 2) CNAF have now put an upper limit to the number of jobs submitted to their WMS above which it will automatically switch to draining mode. 3) The WMS ticket to GRIF has now been closed as resolved.
LHCb (JC): 1) Have submitted a new ticket to IN2P3 following a failure to get a TURL; the reply that rfio is not supported is not understood. 2) Yesterday's ticket to RAL was resolved by a disk server intervention. 3) CNAF have scheduled a change of their SRM software next week. 4) Queried progress on the low number of running jobs at CERN. Ricardo replied they would be making a few LSF configuration changes next Monday to make the system more agile. The suspicion is that a few users were able to submit a series of very long jobs, but it is not completely understood. Joel reported they have no jobs actively running now, but 4 that seem to have been stuck since the end of January. Ricardo will check and kill them in that case.
ATLAS (SC): Two site reports 1) FZK will perform their pnfs database cleanup next Wednesday. 2) PIC have upgraded their FTS to level 2.1 with a new, empty database and this shows an obvious improvement in the failure rate. There are, however, a few percent of transfers failing with 'file exists' errors that are not yet understood. The transfers are a consolidation of AOD data and also input files for the next Monte Carlo production.
Sites / Services round table:
AOB: