Week of 100927
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday:
Attendance: local(Miguel, Jamie, Maria, Stephane, Carlos, Lola, Ueda, Harry, Jacek, Alessandro, Julia, Nicolo, MariaDZ, Massimo, Simone, Ricardo); remote(Jon, Ron, Michael, Rolf, Kyle, Xavier, Joel, Andrea V, G. Misurelli, Tiju, Farida).
Experiments round table:
- ATLAS reports -
- Ongoing issues
- INFN-BNL network problem (slow transfers) GGUS:61440
- BNL-NDGF network problem (timeouts/killed) GGUS:62287
- Sept 27 (Sat, Sun, Mon)
- LHC/ATLAS activities
- Physics runs (> 1 pb-1 collected in the last 24 hours)
- Main CPU activity for MC production (50 k jobs running over the week-end)
- T0
- Data export was not triggered for 24 hours: SantaClaus (the tool that makes subscriptions) no longer works with python 2.3 (problem contacting Oracle). Data export restarted on the afternoon of the 25th (see the sketch after this report).
- T1-T1 network issues
- INFN-BNL network problem (slow transfers) GGUS:61440
- BNL-NDGF network problem (timeouts/killed) GGUS:62287
- T1s
- IN2P3-CC storage problem GGUS:62394 (ticket issued at 7:00 am on the 23rd). Errors have evolved over time but some files (recent or not) are not accessible at all. Manual operations from ATLAS were required to keep MC production running (when the data are available at other sites). [ Noticed that reconstruction jobs have been stuck at IN2P3-CC for one week; a security update was followed by a scheduled downtime. Some files are urgent and there has been no feedback on this since this morning. ] { Will update the ticket. }
- RAL-LCG2 : Disk server gdss405 (ATLASMCDISK) was not accessible for 12 hours. The server has been put back in read-only mode for the moment. RAL reported no data loss.
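A minimal sketch of the kind of pre-flight check that could have caught the SantaClaus failure earlier, i.e. failing loudly before the subscription loop starts instead of silently blocking export. The minimum interpreter version and the cx_Oracle module name are assumptions for illustration, not details taken from the ATLAS report.
```python
# Hypothetical pre-flight check for a subscription tool like SantaClaus.
# The required interpreter version and the cx_Oracle module name are
# assumptions, not details from the report above.
import sys

MIN_PYTHON = (2, 4)  # assumed minimum for the Oracle binding


def preflight():
    """Raise immediately if the interpreter or the Oracle binding is unusable."""
    if sys.version_info[:2] < MIN_PYTHON:
        raise RuntimeError(
            "Python %d.%d is too old; need >= %d.%d to talk to Oracle"
            % (sys.version_info[0], sys.version_info[1], MIN_PYTHON[0], MIN_PYTHON[1])
        )
    try:
        import cx_Oracle  # noqa: F401  (only verifying the binding loads)
    except ImportError as exc:
        raise RuntimeError("Oracle binding not usable: %s" % exc)


if __name__ == "__main__":
    preflight()
    print("pre-flight OK, starting subscriptions")
```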
- CMS reports -
- Experiment activity
- CERN
- Tier1 issues
- various workflows from data rereco, MC redigi/rereco and full MC production
- FTS settings for CERN-FNAL in the CERN FTS:
- GGUS:62401 submitted to get the defaults corrected, no update yet [ Use the "escalate" button ]
- Tier2 Issues
- MC production
- Status of large 1200M production : 466M RAW events produced, available at T1s: 434M RAW events
- AOB
- due to meetings, I (Oli) cannot be at the meeting today. Sorry.
- Next CRC from tomorrow: Jose
- CMS DB problems seen a couple of weeks ago - an update will be deployed in production to reduce the load from the DBS application
- ALICE reports
- GENERAL INFORMATION: Two MC cycles in production during the weekend together with the usual reconstruction activities. T0-T1 raw data transfers have continued during the weekend (currently to IN2P3-CC, FZK and CNAF). All T1 SE services are performing well (MonALISA test results)
- T0 site
- Production was interrupted for several hours (from Saturday night until Sunday morning). The origin of the problem was a wrong version of the PackMan service running on the T0 VOBOXes. The problem was solved on Sunday afternoon; a few minutes later the number of agents was ramping up again.
- T1 sites
- All T1 sites have been in production with no important issues to report.
- T2 sites
- Subatech: Scheduled downtime declared on Friday afternoon (14:00 to 16:00) used to test the new CREAM-CE module. The module behaved as expected and it will be put in production today in INFN-Torino.
- LHCb reports - Analysis, no particular issues
- T0
- T1 site issues:
- RAL : downtime for CASTOR upgrade.
- SARA : LHCBUser space full (GGUS:62447)
Sites / Services round table:
- FNAL - ntr
- NL-T1 - NIKHEF rebooted WNs with new kernels this morning. We see that LHCb transfers to LHCB_DST are failing due to lack of space; according to the pledges it is 200 TB, which is there but full! Joel - will have a look.
- BNL - ntr
- IN2P3 - short update on the ticket - two of our dCache servers could not be updated correctly and have to be reinstalled. These servers are the ones holding the stuck files - the files will be accessible again when the servers are back. No schedule for now...
- KIT - on Saturday around 11:00 had to restart the dCache servers for CMS - memory consumption went up to 16 GB; there were so many requests that internal timeouts occurred. The on-call service was triggered and restarted dCache. Have updated the LFC for ATLAS to SL5; LHCb is next - transparent to clients.
- NDGF - the CERN - Copenhagen link was down for 20h over the weekend - the fibre is now fixed; GEANT carried the traffic in the meantime.
- INFN - ntr
- RAL - upgrade progressing as scheduled
- ASGC - last Friday we had an unexpected power cut; because of it most services were down. They are back since Friday night. Unfortunately 3 disk servers are still failing; trying to recover them. Some ATLAS production jobs failed due to a Python error: the new pilot factory does not check the python path. Fixed manually in the script; watching... LFC problem - the LFC DB was shut down around 07:45 Taiwan time by an old cron job still running - fixed, and old cron jobs cleaned up. Jacek - there is also a problem with the ATLAS DB at ASGC - since Friday replication does not work. Felix Lee says there is some issue with "lost rights".
- Network - regarding the BNL-NDGF problem - we suspect a s/w bug in the routers and are still discussing with the manufacturer, who will provide an update in a couple of weeks which should fix the problem. The NDGF prefix that may have other connectivity issues is 109.105.124.0/22. Any Tier1 with network-related problems from/to that prefix should contact us in order to verify whether we have the same issue as with BNL and RAL (see the sketch below).
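A hedged sketch of how a site could check whether a failing transfer endpoint resolves into the affected NDGF prefix; only the 109.105.124.0/22 prefix comes from the report above, and the host name in the example is a placeholder.
```python
# Check whether a transfer endpoint resolves into the NDGF prefix quoted above.
# Only the 109.105.124.0/22 prefix comes from the report; the host name is a
# placeholder, not a real NDGF endpoint.
import ipaddress
import socket

AFFECTED_PREFIX = ipaddress.ip_network("109.105.124.0/22")


def in_affected_prefix(hostname):
    """Resolve the host and report whether its IPv4 address is in the prefix."""
    addr = ipaddress.ip_address(socket.gethostbyname(hostname))
    return addr in AFFECTED_PREFIX


if __name__ == "__main__":
    host = "srm.example-ndgf-pool.org"  # placeholder endpoint
    print(host, "in affected prefix:", in_affected_prefix(host))
```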
AOB:
- About the GGUS escalation button - it is only available to the submitter, even for team tickets. A Savannah bug will be submitted!
Tuesday:
Attendance: local(Stephane, MariaDZ, Luca, Simone, Andrea, Jamie, Maria, Maarten, Alessandro, Harry, Lola, Zsolt, Ricardo); remote(Rolf, NDGF, Ronald, Gonzalo, Oliver, Gareth, Xavier, G. Misurelli, Jon, Michael).
Experiments round table:
- ATLAS reports -
- Ongoing issues
- INFN-BNL network problem (slow transfers) GGUS:61440
- BNL-NDGF network problem (timeouts/killed) GGUS:62287 (test to be set up to validate the solution)
- Sept 28 (Tue)
- LHC/ATLAS activities
- Physics runs -> Data export to T1s
- Main CPU activity for MC production
- T0
- ALARM GGUS:62467 : LSF response time was slow (up to 1 minute to issue bsub) between 13:30 and 16:00. Solution: kill a 'home-made job robot' run by a production manager of another experiment.
- T1-T1 network issue
- INFN-BNL network problem (slow transfers) GGUS:61440 : Update yesterday afternoon from Italy [ Giuseppe - the investigation is not trivial, 4 network teams are trying to solve it: our experts, GARR, BNL etc. Since we have been told to update the ticket on a daily basis but we think that is not important, we propose once every week. Jamie - please update every day even if there is no news! ]
- T1 :
- IN2P3-CC : GGUS:62389 : IN2P3 reports that
- 2 problematic disk servers are up
- files are being migrated to 2 other disk servers
- as soon as files are replicated, problematic disk servers will be reinstalled.
It should be transparent but the situation should be considered AT RISK. ATLAS still observes the problem and is preparing to submit some of the urgent tasks to other clouds.
- RAL : GGUS:62486 (solved) : Files not accessible. Now solved with an explanation from RAL (FTS feature).
- RAL : Elog:17566 Test glite3.2/64-bit FTS server: a few transfers were successful. Tests will continue over the coming days.
- CMS reports -
- Experiment activity
- CERN
- Tier1 issues
- various workflows from data rereco, MC redigi/rereco and full MC production
- FTS settings for CERN-FNAL in the CERN FTS:
- GGUS ticket 62401 submitted to get the defaults corrected, no update yet
- T1_UK_RAL: SAM "mc" and "user" tests fail: https://savannah.cern.ch/support/index.php?117041
- transfers to FNAL put a big load on the disk servers, admins investigating
- Tier2 Issues
- MC production
- AOB
- Next CRC from tomorrow: Jose
- ALICE reports
- Production status: Continuing the MC production together with the usual reconstruction activities. T0-T1 raw data transfers have continued during the weekend (currently to IN2P3-CC, FZK and CNAF). All T1 SE services are performing well (MonALISA test results).
- T0 site
- T1 sites
- NDGF SE: showing xrdcp errors this morning. Checking the issue with the ALICE experts
- T2 sites
- Continuing the setup work with new sites (LLNL)
- LHCb reports - Analysis, no particular issues
- T0
- T1 site issues:
- RAL : downtime for CASTOR upgrade.
- SARA : LHCBUser space full (GGUS:62447)
- PIC : FTS credentials problem GGUS:62490
Sites / Services round table:
- PIC - in the last couple of weeks the SAM tests for the LHCb SRM show failures - "no space left on device". Any comment from the site? Gonzalo - it was an issue with the test itself. One of the LHCb space tokens ("USER") is managed by the VO; the test was marking the site as unavailable. We interacted with Roberto, the developer of the test, who changed the test code so that a site is not marked unavailable when VO-managed tokens are full (see the sketch after this round table).
- IN2P3 - ntr, but a comment: a few seconds ago there was a short discussion about tickets in Savannah vs GGUS - nobody here follows up on Savannah. If there is any problem with the site, local CMS support will know about it; on the other hand, if the problem comes via GGUS it is much better.
- NDGF - ntr, except that the Estonian sites are rather slow at processing GGUS tickets.
- NL-T1 - next Monday there is maintenance on the network; as a result the MSS will be unavailable for a couple of hours.
- RAL - ATLAS question in the report on file transfers; not a lot to add beyond the ticket. One problematic disk server was put back in r/o mode; there are problems getting files off it - needs further investigation. CASTOR LHCb upgrade - going fine, open for LHCb testing now. Question: where are the LHC schedules and technical stops published? Simone - concerning the files not available: why is this an FTS issue? A: we don't understand either. It is due to a disk server in CASTOR being in draining state; this triggers a disk-to-disk copy and then the data is served off another disk. Simone - if CASTOR reports the file unavailable then FTS can do nothing. Brian - the issue is that FTS does correctly determine that the file is not accessible, and the FTS job then exits. The older version, which used stager_get, triggered the disk-to-disk copy; the new FTS does not initiate this, which means subsequent transfers that used to succeed now end up in the same state. Looking into it...
- KIT - ntr
- CNAF - ntr except issue before
- FNAL - ntr
- BNL - ntr
- OSG - ntr
- ASGC - In yesterday's meeting I reported a problem with 3 disk servers, which was a misunderstanding on my side, sorry about that. All disk servers are working very well; it was actually an issue with the 3D project, as mentioned by Jacek.
- CERN Grid
- LSF problem - a new version of a job-management tool being tested by a non-LHC VO was querying LSF heavily and caused the problem.
- For the AFS UI there is a new link, i.e. new_3.1 has been updated from 3.2.6 to 3.2.8 - please test and report back.
- CREAM CEs - gridftp was only using one port due to a syntax error (bug in YAIM) - fixed and deployed. Ale - did the monitoring work or did you need the ALARM ticket from ATLAS? A: it helped - we got an alert in the monitoring around the time it happened; when we got the ticket we really looked. Need to monitor better.
- CERN DB - update - the ASGC DB is still down - received a report on the grid-services databases - restoring onto a new DB, expected to be up in a few days.
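An illustrative sketch of the changed SAM check Gonzalo described above (PIC item): a full space token that is managed by the VO degrades to a warning instead of marking the site unavailable. The token names and status labels are assumptions, not the actual SAM test code.
```python
# Illustrative version of the changed SAM check described above: a full space
# token managed by the VO should not mark the site unavailable.
# Token names and status labels are assumptions, not the real SAM test code.
OK, WARNING, CRITICAL = "ok", "warning", "critical"

VO_MANAGED_TOKENS = {"LHCb_USER"}                 # assumed: cleaned up by the VO
SITE_MANAGED_TOKENS = {"LHCb_M-DST", "LHCb_RAW"}  # assumed: site responsibility


def evaluate_space_token(token, free_bytes):
    """Return a SAM-style status for one space token."""
    if free_bytes > 0:
        return OK
    # "No space left on device":
    if token in VO_MANAGED_TOKENS:
        # The VO has to clean up; do not flag the site itself as unavailable.
        return WARNING
    return CRITICAL


if __name__ == "__main__":
    print(evaluate_space_token("LHCb_USER", 0))   # warning -> site stays available
    print(evaluate_space_token("LHCb_M-DST", 0))  # critical -> site flagged
```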
AOB: (MariaDZ) On yesterday's AOB: progress on allowing TEAMers to escalate GGUS tickets can be followed in https://savannah.cern.ch/support/?117033
LHC Schedule available at https://espace.cern.ch/be-dep/BEDepartmentalDocuments/BE/2010-injector-schedule_v1.8.pdf
Wednesday
Attendance: local(Harry(chair), Andrea, Stefan, Alessandro, Zoltan, Ricardo, Patricia, Jose, Simone); remote(Onno(NL-T1), Giuseppe(CNAF), Xavier(KIT), Catalin(FNAL), Joel(LHCb), Michael(BNL), Kyle(OSG), Tiju(RAL), Gang(ASGC), Rolf(IN2P3), Gonzalo(PIC)).
Experiments round table:
- ATLAS reports -
- LHC/ATLAS activities
- No data collected
- Main CPU activity for MC production
- T0
- T1-T1 network issue: No recent update in the last 24 hours.
- T1 :
- IN2P3-CC : GGUS:62389 : Solved : all problematic tasks finished successfully.
- IN2P3-CC : GGUS:62500 : Transfers IN2P3-CC -> T2s have a significant error rate (about 10%). IN2P3-CC just checked that it was not an FTS error.
- RAL : Elog:17602 Test glite3.2/64 bit FTS server: Extended to more UK sites to trigger more activity
- CMS reports -
- Tier1 issues
- various workflows from data rereco, MC redigi/rereco and full MC production
- FTS settings for CERN-FNAL in the CERN FTS:
- GGUS ticket 62401 submitted to get the defaults corrected, no update yet (has been assigned at CERN).
- T1_UK_RAL: SAM "mc" and "user" tests fail: https://savannah.cern.ch/support/index.php?117041
- transfers to FNAL put a big load on the disk servers => export capacity reduced: the number of files/streams in the FTS RAL-FNAL channel and the number of concurrent stage-ins in the PhEDEx stager agent were reduced.
- Failing export transfers from IN2P3: GGUS ticket 62551. Problem has gone but no ticket update yet.
- AOB
- CRC this week: Jose Hernandez
- ALICE reports
- GENERAL INFORMATION: Pass1 reconstruction activities ongoing. The total number of running jobs is decreasing at the moment (no new MC cycles for the moment). No issues reported by MonALISA for the local T1 SE systems.
- T0 site: No issues to report
- T1 sites:
- IN2P3-CC: the AFS sw area is full; many obsolete packages are still placed in that area. For the moment the cleanup procedure can only partially remove these files. Rolf asked whether the site can help, but the responsibility is with ALICE.
- CNAF: GGUS:62576. Wrong information published by the local CREAM resource BDII (the total number of CPUs published is zero). The ticket has just been closed (see the sketch after this report).
- T2 sites: Usual operations at these sites, no remarkable issues to report
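A hedged sketch of how such a published value can be spot-checked against the CE's resource BDII. The CE host name is a placeholder; port 2170 and the GlueCEInfoTotalCPUs attribute are the usual resource-BDII port and Glue 1.3 attribute, assumed (not confirmed by the ticket) to be what was wrong here. Requires the ldapsearch CLI.
```python
# Spot-check the total CPU count published by a CREAM CE's resource BDII.
# The host name is a placeholder; port 2170 and GlueCEInfoTotalCPUs are the
# usual resource-BDII port and Glue 1.3 attribute, assumed to apply here.
import subprocess


def published_total_cpus(ce_host):
    """Return all GlueCEInfoTotalCPUs values published by the CE's BDII."""
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://%s:2170" % ce_host,
        "-b", "o=grid",
        "(objectClass=GlueCE)", "GlueCEInfoTotalCPUs",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [
        int(line.split(":", 1)[1])
        for line in out.splitlines()
        if line.startswith("GlueCEInfoTotalCPUs:")
    ]


if __name__ == "__main__":
    values = published_total_cpus("cream-ce.example.infn.it")  # placeholder host
    if not values or 0 in values:
        print("BDII publishes zero (or no) total CPUs:", values)
    else:
        print("published total CPUs:", values)
```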
- LHCb reports - Analysis, no particular issues
- T1 site issues:
- RAL : the downtime for the CASTOR upgrade finished and the instance was re-opened at lunchtime. FTS problems since then - being looked at.
- PIC: Transfers problem at PIC now ok - to be checked with LHCb data manager.
Sites / Services round table:
- RAL: Will be rebooting disk servers for a new kernel tomorrow so have declared an outage of srmatlas.
- PIC: Have closed the LHCb GGUS ticket from yesterday. Joel requested copies of the FTS log from PIC, as requested by the LHCb data manager for use by RAL.
- OSG: Have submitted ggus ticket 62543 addressing bdii caching issues at bnl. Ticket is advancing well.
- CERN CASTOR: An LHCb user disk in the Tier0 pool has a hardware failure. They are attempting to recover data from it and will inform LHCb of any lost files.
- CERN Streams: The ASGC database is still down.
AOB:
Thursday
Attendance: local(Eddie, Jacek, MariaDZ, Maarten, Jamie, Maria, Alessandro, Jan, Stephane, Flavia, Ricardo, Zsolt); remote(Rolf, Gareth, Rob, NDGF, Xavier, Jon, Michael, G. Misurelli, Jose, Ronald, Gang).
Experiments round table:
- ATLAS reports -
- Ongoing issues
- CNAF-BNL network problem (slow transfers) GGUS:61440
- BNL-NDGF network problem (timeouts/killed) GGUS:62287 (tests will be done in the coming days)
- Sept 30 (Thu)
- LHC/ATLAS activities
- Data collected
- Main CPU activity for MC production
- T0
- T1-T1 network issue
- CNAF-BNL : Info from CNAF : a detailed e-mail was written yesterday to the GEANT NOC, ESnet and all the people involved about the asymmetry, but no answer until now.
- T1 :
- IN2P3-CC : GGUS:62500 : Transfers IN2P3-CC -> T2s failing. Same behaviour for transfers IN2P3-CC -> FZK (added to the GGUS ticket). No feedback from IN2P3-CC since last Tuesday evening.
- RAL : GGUS:62623 : Test glite3.2/64-bit FTS server: fails for some UK T2 -> RAL transfers. The error message reports 'cannot access SRM in T2 site' but SAM reports OK. Switching back to the old FTS solved the issue. Tests are stopped in agreement with RAL.
- BNL : GGUS:62613 / GGUS:62594 : Short glitches at BNL affecting the SE (solved).
- CMS reports -
- Experiment activity
- CERN
- Tier1 issues
- various workflows from data rereco, MC redigi/rereco and full MC production
- Smooth sailing
- Tier2 Issues
- MC production
- AOB
- ALICE reports
- GENERAL INFORMATION: Pass1 reconstruction activities ongoing.
- T0 site - ntr
- T1 sites
- CNAF: GGUS:62576. The batch system information provider plugin was running with the wrong execution environment. SOLVED
- T2 sites - ntr
- LHCb reports - Analysis, no particular issues
- T0
- T1 site issues:
- RAL : FTS transfers out of RAL are failing with checksum errors (checksums disabled for the time being; see the sketch after this report).
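Since checksum mismatches recur in Friday's reports below, here is a minimal sketch of computing an adler32 file checksum of the kind FTS/gridftp transfers compare against the catalogue. That adler32 is the algorithm involved in these particular failures is an assumption, and the path in the example is a placeholder.
```python
# Compute the adler32 checksum of a local file, the kind of value FTS /
# gridftp transfers compare against the catalogue. That adler32 is the
# algorithm involved in the failures above is an assumption; the path in
# the example is a placeholder.
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as an 8-character hex string."""
    checksum = 1  # adler32 seed value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return "%08x" % (checksum & 0xFFFFFFFF)


if __name__ == "__main__":
    # Compare the result with the checksum stored in the file catalogue / SE.
    print(adler32_of_file("/tmp/example.dst"))  # placeholder path
```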
Sites / Services round table:
- NDGF - have some failing SRM tests due to some kind of certificate error. Some compute nodes in Finland are down due to o/s problems; don't know how long that will last - perhaps until next week. MariaDZ - we had sent you a question on how to handle alarms; no reply yet.
- FNAL - ntr
- BNL - comments on the issues reported by ATLAS: the SRM failed yesterday afternoon - we observed that the SRM was still functional to some extent but was cancelling requests at a high rate; a restart fixed it. pnfs server: further investigation in the last couple of days revealed that we have exhausted the h/w capabilities of the server. Namespace operations have increased dramatically over the last couple of days - probably hitting the bandwidth limit - and we are preparing a replacement. Depending on the activity we may hit the problem again and have to restart the pnfs server whenever needed. There was a short spike of transfer errors. Monitoring and alarms work well - in both cases the problem was solved before a GGUS ticket arrived.
- CNAF - we haven't checked whether BNL answered, but for us it is important to have feedback on the tests we requested; otherwise it is difficult to give daily updates in the ticket.
- RAL - reported a disk server problem to ATLAS first thing this morning. Doing checksumming tests - the server will be put back in draining state and watched, then back in production. LHCb: CASTOR upgrade for the LHCb instance yesterday - this issue is being investigated. Nothing to add on ATLAS FTS testing. AT RISK over the weekend - a power outage in one of the buildings on site which houses networking equipment; alternative power is available, but just in case. The R-GMA registry will be unavailable over the weekend.
- NL-T1 - at NIKHEF a virtual memory limit has been deployed - jobs using more than 4 GB will be killed by the batch system.
- IN2P3 - on the ticket mentioned in the ATLAS report - behind the scenes some work is ongoing. The reason for the problem is not yet known - it looks like a network problem. It affects traffic between the T1 and the FR T2s since the last maintenance update last week. The dCache people tried both the current version of dCache and the one in use before the last maintenance; both showed very low performance for these sites - Stephane has the same impression. It looks like it may be a certain type of data, possibly in some pools?
- ASGC - ntr
- KIT - a downtime is in GOCDB for one week from now: a large part of the ATLAS disk will be down for a firmware upgrade, for 7 hours.
- OSG - GGUS ticket-transfer tests and some alarm tickets were opened against BNL and FNAL. All transfers went fine but the SMS messages were not delivered to Michael Ernst. MariaDZ - if you need GGUS to investigate anything please let us know.
- Dashboards - the ATLAS DDM dashboard goes down for a scheduled downtime at 14:00 - back in 50'.
- DB - started upgrading selected experiment DBs to 10.2.0.5. Today CMS Online integration, and next week some more.
- Storage - ATLAS problem - a file is probably lost due to h/w errors; there is a 2nd copy of the file. Another ticket about production & analysis jobs failing due to timeouts: this appears to be overload and we would like to discuss what to do.
AOB: (MariaDZ) The 'Did you know ...?' link in the left (or top) banner of the GGUS pages changed content yesterday with the release, as it does every month. This time it reminds you of the easy way to 'migrate' an email thread into a GGUS ticket and the advantages of doing so.
- Maria - there is a discrepancy between the site availability plots and what is discussed at the daily meeting. Would like to tackle this a bit better, e.g. by checking VO site availability at the daily meeting.
Friday
Attendance: local(Eddie, Lola, Przemek, Alessandro, Maarten, Simone, Ueda, Jamie, Maria, Farida, Stephane, Jan, Ricardo, Jean-Philippe, Zsolt); remote(Michael, Jon, Xavier, Vera, Joel, Rolf, Gonzalo, Onno, Rob).
Experiments round table:
- ATLAS reports -
- Ongoing issues
- CNAF-BNL network problem (slow transfers) GGUS:61440
- BNL-NDGF network problem (timeouts/killed) GGUS:62287 (tests will be done in the coming days)
- Oct 1 (Fri)
- LHC/ATLAS activities
- No data collected
- Main CPU activity for MC production
- T0
- GGUS ALARM GGUS:62662 : afs26 slow response
- GGUS ALARM GGUS:62568 : Test of alarm (29th September). Ticket not closed
- GGUS:62591 : Access to some files too intensive (DBRelease). Replication of hot files (no HOTDISK at CERN) will be discussed with CERN
- GGUS:62607 : File missing in T0 (I/O errors with a disk server). File was recovered from ATLAS DAQ.
- T1-T1 network issue
- CNAF-BNL : Info from CNAF : Update from M. Ernst/ESNET : Tests done on US side.
- T1 :
- IN2P3-CC : GGUS:62500 : IN2P3 -> FZK transfers (+ IN2P3-CC -> FR T2s) are penalised. A backlog of ESD from recent data now exists [ Rolf - still under investigation, not yet understood ]
- IN2P3-CC : GGUS:62660 : Inaccessible files at IN2P3-CC [ also being looked at ]
- TAIWAN-LCG2 : GGUS:62663 : SRM/LSF issues at TAIWAN-LCG2 -> FTS transfers fail. The problem should be fixed
- RAL : GGUS:62623 : Test glite3.2/64-bit FTS server: problem not reproduced by Matt. Deeper tests with ATLAS are needed
- SAM tests: the PIC and TAIWAN issues are consistent with what the ATLAS tools see
- PIC : Temporary glitch
- TAIWAN : see GGUS ticket
- CMS reports -
- Experiment activity
- CERN
- Computing shifter got an alarm because SLS shows that the CVS service for LCG is not available: http://sls.cern.ch/sls/service.php?id=LCGCVS. We opened a ticket (CT715077) to clarify whether it is a real problem or a monitoring glitch.
- Tier1 issues
- various workflows from data rereco, MC redigi/rereco and full MC production
- No issues
- Tier2 Issues
- AOB
- ALICE reports
- GENERAL INFORMATION: Pass1 reconstruction activities ongoing.
- LHCb reports - Analysis, no particular issues
- T0
- T1 site issues:
- RAL : CONDDB error (GGUS:62667) [ Gareth - thanks for the update to the ticket. Were the last successful runs one week ago, before draining the queues for the CASTOR upgrade? Joel - can't give a definite date & time, but probably. When CORAL loses the connection it tries to reconnect but crashes; one common symptom is loss of connection. ]
- AOB - yesterday we had a problem with transfers and disabled checksums. We have our own checksums and these seem OK. We didn't open a GGUS ticket for this.
- Tests - SRM tests failing for PIC, IN2P3, NL-T1 (50TB disk space issue) - giving "false negatives" in site availability plots
Sites / Services round table:
- BNL - ntr
- FNAL - ntr
- KIT - ntr
- ASGC - regarding the GGUS ticket: FTS is back; it was a temporary problem due to a misconfiguration which is now fixed.
- NDGF - ntr
- IN2P3 - nta
- PIC - ntr
- NL-T1 - ntr
- RAL - we had announced an AT RISK over the weekend - this has been cancelled, so the AT RISK has been removed. The R-GMA registry will be available over the weekend too.
- OSG - working on the GGUS SMS problems reported yesterday. Some ongoing emails between the GOC, Michael & GGUS.
- CERN Grid - comment on the problem with LSF this morning. There are two parts to the LSF configuration - one started before the other daemon was responding, which caused some queues to disappear. First time this has happened - a race condition? The delay between the two phases of the configuration has been increased; will try to understand the full chain. Fixed just before 12:00 by a reconfiguration. Joel - next time we can submit an alarm ticket. Ric - that would be justified.
- CERN Storage - the gridftp checksum failures between RAL & CERN seem to be different issues. RAL: a 32-bit issue? CERN: we sent LHCb a list of files with holes in them - do these correspond to aborted transfers? CASTOR will try to get these files on disk asap. It would be nice to know whether they correspond to known problems.
- CERN DB - ASGC managed to create an empty DB there and we are preparing the re-synchronisation between RAL and ASGC, which will be done Monday-Tuesday.
AOB:
--
JamieShiers - 22-Sep-2010