Week of 120507
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday
Attendance: local(Andrea V, Jan, Jhen-Wei, Lukasz, Maarten, Maria D, Przemek, Simone);remote(Gonzalo, Joel, Kyle, Lisa, Michael, Onno, Paolo, Stephen, Ulf, Xavier M).
Experiments round table:
- ATLAS reports -
- T0/Central Services
- T1s/CalibrationT2s
- Taiwan-LCG2: AGENT error during TRANSFER_SERVICE. It is a test spacetoken that suffered a large number of transfers; the error ceased once the transfers were done. No harm to the production service, ticket closed. GGUS:81877
- Taiwan-LCG2: Core switch down (04:00 - 08:00 UTC, Monday).
- CMS reports -
- LHC machine / CMS detector
- CERN / central services and T0
- Database problem on Saturday causing jobs to fail everywhere. INC:126416
- Przemek: our NAS has problems when there are a great many files in a single directory; a workaround has been implemented on Sat, while a permanent fix is expected soon; as it will not be transparent, the intervention will be agreed with CMS
- Tier-1/2:
- T1_TW_ASGC has CASTOR problems causing jobs to fail. Savannah:128277 / GGUS:81887
- Jhen-Wei: as of 1h ago CASTOR should be OK; CMS should verify that through a few production jobs
- T1_TW_ASGC core network went down. Reported to be back; the vendor is working on a fix.
- T2_US_UCSD lost cooling water Thursday evening. We have two CRAB servers there for glideinWMS, which were still having problems over the weekend.
- Other:
- LHCb reports -
- DataReprocessing of 2012 data at T1s with new alignment
- MC simulation at Tier-2s
- Prompt reprocessing of data
- New GGUS (or RT) tickets
- T0
- T1
- RAL: 1 disk server down during the weekend.
- Joel: will be looked into tomorrow (today is a bank holiday)
- Joel: was there a modification of the queue length at NIKHEF? It was requested 2 weeks ago and supposedly done, but it seems to have been reset now
- Others
- FTS: problem with FTS transfers between RAL and CNAF (under investigation by LHCb)
Sites / Services round table:
- ASGC - nta
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- KIT - ntr
- NDGF - ntr
- NLT1
- Mon May 14 13:30-14:00 UTC downtime for SARA SRM, dCache will be restarted
- OSG
- we received an e-mail stating that the REBUS topology is not available; could this be related to the issue affecting the forwarding of accounting records from Gratia to APEL?
- Maarten: APEL services were affected by last week's network problem at RAL; an EGI broadcast announced delays in the processing of accounting records, but recovery would be automatic; send e-mail to the operations list and/or open a GGUS ticket if you still see issues on the OSG side; REBUS currently looks OK
- PIC - ntr
- dashboards - ntr
- databases
- Oracle security patch will be applied to the WLCG integration DBs on Wed and to the production DBs on Mon if no problems are seen; existing connections may suffer from the interventions; the physics DBs should also be done soon
- GGUS/SNOW
- storage
- EOSATLAS had a problem 11:30-13:30 CEST, some modifications may have been lost; under investigation
- rolling DB updates for CASTOR on Wed and Thu, transparent for clients
AOB:
- (MariaDZ) File ggus-tickets.xls with total numbers of GGUS tickets per experiment per week is up-to-date and attached to the WLCGOperationsMeetings page. There was one real ALARM last week, GGUS:81786, on May Day by ATLAS against SARA-MATRIX. Detailed drills next week for the 2012/05/15 WLCG MB (they will cover 8 weeks - 13 real ALARMs so far since the last MB).
- (MariaDZ) In case people haven't noticed: the latest entry at https://ggus.eu/pages/didyouknow.php#2012-04-25 is about the new status "Closed", introduced in production with the latest GGUS release of 2012/04/25.
- the new state has already been applied to some tickets
Tuesday
Attendance: local(Massimo, Claudio, Simone, Gavin, Luca, Maria D, Maarten);remote(Gonzalo, Joel, Lisa, Kyle, Ulf, Xavier, Ronald, Jeremy, Tiju, Jhen-Wei, Giovanni).
Experiments round table:
- ATLAS reports -
- T0/Central Services
- T1s/CalibrationT2s
- Questions to WLCG/sites
- After the EOS downtime yesterday and the intervention today, I understand there is no data loss. Is this correct/confirmed?
- At the T1SCM last week, T1 sites were asked to patch FTS for the gridftp2 issue. Could we have a statement from the sites on who has applied the patch?
- CMS reports -
- LHC machine / CMS detector
- Data taking with beam during the night, cosmics afterwards
- CERN / central services and T0
- Waiting for the DB upgrade to fix the cause of the problem seen on Saturday (SNOW: INC:126416)
- Tier-1/2:
- T1_TW_ASGC Savannah:128277 / GGUS:81887 solved in principle but still open (high job inefficiency because of high load on the storage servers).
- T2_US_UCSD: CRAB Servers are back
- Other:
- LHCb reports -
- DataReprocessing of 2012 data at T1s with new alignment
- MC simulation at Tier-2s
- Prompt reprocessing of data
- New GGUS (or RT) tickets
- T0
- T1
- NIKHEF: (GGUS:81930) Pilots aborted; the queue length had been reset and was put back this morning
- IN2P3: (GGUS:81927) Pilots failed at one CREAM CE (creamce05)
- GRIDKA: what are the plans for the migration of the SRM instance? Meanwhile, can we have more space, as we are running low?
- Others
- FTS: (GGUS:81996) Problem with FTS transfers between RAL and CNAF (CNAF does not see the RAL FTS instance)
Sites / Services round table:
- ASGC: Investigating the CMS problem
- BNL:
- CNAF: ntr
- FNAL: ntr
- IN2P3:
- KIT: new SRM for LHCb: the procedure is being prepared. Expect to have some firm dates/planning in ~ 2 weeks. For the disk space addition (LHCb): 30 TB ready to go in; LHCb to confirm details
- NDGF: ntr
- NLT1: Next Tue (May 15): all-day downtime due to multiple maintenance activities
- PIC: 1 CE (out of 4) was drained. Noticed that Panda is using only one CE (hardcoded in Panda). ATLAS took note of it
- RAL: FTS patch applied. The disk server (LHCb) having problems is back in R/O: all the files should be available
- OSG: ntr
- CASTOR/EOS: Confirmed that the intervention on EOSATLAS should recover all the data
- Central Services: ntr
- Data bases: CMS Data Guard problem: one more intervention needed (server file system); the DB team will inform the experiment
- Dashboard:
AOB:
Wednesday
Attendance: local(Massimo, Simone, Luca, Maria D);remote(Claudio, Lisa, Kyle, Ulf, Pavel, Ron, Tiju, Shu-Ting, Paolo).
Experiments round table:
- ATLAS reports -
- T0/Central Services
- After yesterday's intervention on EOS to restore metadata missing since Monday's name server problem, we still see files missing. GGUS:81907 has been updated.
- T1s/CalibrationT2s
- More investigation of the analysis queues at PIC, reported by Gonzalo yesterday:
- The pilot factories were submitting to many CEs, not only ce07.pic.es (in downtime)
- The number of analysis jobs in PIC was low, but not zero (100 jobs out of 2000 jobs running in total for ATLAS). This is consistent with the 5% share of analysis we asked T1s to configure
- In conclusion, we believe there was never a problem running jobs at PIC
- The problem with the Taiwan DPM was reported by the TW people before ATLAS noticed it. No GGUS ticket. Problem fixed in approx. 1h
- Problem with PrepareToPut in the SARA SRM. GGUS:82032 sent during the night by the Asia-Pacific shifters. Problem under investigation at the time of writing this report.
- RAL reported an on-site connectivity problem. The problem was spotted before ATLAS noticed it. No GGUS ticket.
- CMS reports -
- LHC machine / CMS detector
- CERN / central services and T0
- Patching of CMS integration databases (INT2R and INT9R) today apparently gave no problems
- Patch on CMSR database expected on Monday. Will the ADG reconfiguration happen at the same time?
- Tier-1/2:
- T1_TW_ASGC Savannah:128277 / GGUS:81887 site managers are still investigating
- T2_US_UCSD some 'unmerged' data loss due to the cooling incident last week
- Other:
Sites / Services round table:
- ASGC: DPM crash. dev team contacted (patch will be provided)
- BNL:
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: 30 TB added. Pilot problem solved (waiting for confirmation). Good progress on the WMS.
- NDGF: 16:00 UTC network outage (Finland-Sweden link). Some ALICE data might be unavailable for ~2h
- NLT1: FTS patched (ATLAS req)
- PIC:
- RAL: 4:00 UTC last night a network switch was down (~1 h). Some disk servers and batch nodes were unavailable
- OSG: ntr
- CASTOR/EOS: EOSATLAS file recovery ongoing
- Central Services:
- Data bases: Security patches applied on all instances, also in preparation for various upgrades next week.
- Dashboard:
AOB:
Thursday
Attendance: local(Massimo, Claudio, Simone, Yarka, MariaDZ, Maarten);remote(Gonzalo, Michael, Joel, Ulf, Ron, Kyle, John, Lisa, Giovanni, Jhen-Wei, Rolf).
Experiments round table:
- ATLAS reports
- T0/Central Services
- T1s/CalibrationT2s
- At 18:30 yesterday the problem at SARA mentioned in GGUS:82032 appeared again. Ticket has been re-opened and is being looked into.
- CMS reports -
- LHC machine / CMS detector
- CERN / central services and T0
- Tier-1/2:
- Other:
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: ntr
- NDGF: dCache pool update today in DK: some ALICE files unavailable for a few hours. Overnight the Norway batch capacity went down; now back to normal
- NLT1: Investigating the ATLAS problem (it might be some misconfig in the storage DB sector)
- PIC: ntr
- RAL: CASTOR upgraded
- OSG: ntr
- CASTOR/EOS: DB upgrades (transparent) in ~ 10 days
- Central Services: CERN FTS: today at 07:00 UTC the t0export and t2 FTS services had the gridftp2 and msg patches applied.
- Dashboard: ntr
AOB: (MariaDZ)
- The GGUS-SNOW interface will be ready with the May release of 2012/05/30 to support the creation of REQUESTS (not only incidents). Details in Savannah:120007.
- The BIOMED VO asked for the implementation of automatic TEAM ticket creation out of the Operations dashboard https://operations-portal.egi.eu/dashboard. We'd like to understand if the WLCG VOs use this dashboard for GGUS ticket creation and if this functionality would be useful for the WLCG community. Details in Savannah:127494.
- Important info for developers of ticketing systems interfaced to GGUS: a new SOAP web service is available for testing in the GGUS test system https://train-ars.ggus.eu/arsys/WSDL/public/train-ars/GGUS. It will eventually replace the current one, so please DO test it (a minimal inspection sketch follows below). Details in Savannah:127763.
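For developers who want a quick first look at the new endpoint, here is a minimal sketch (not an official GGUS example) of inspecting the announced WSDL from Python. It assumes the third-party zeep SOAP library and network access to the GGUS test system, uses only the URL quoted above, and does not assume any particular GGUS operation names; actual ticket operations on the test system will normally require GGUS credentials/registration.

```python
# Minimal WSDL inspection sketch -- an illustration under stated assumptions,
# not an official GGUS client. Requires the third-party "zeep" library
# (pip install zeep) and network access to the GGUS test system.
from zeep import Client

# WSDL URL taken verbatim from the announcement above
WSDL_URL = "https://train-ars.ggus.eu/arsys/WSDL/public/train-ars/GGUS"

client = Client(WSDL_URL)  # downloads and parses the WSDL
client.wsdl.dump()         # prints the bindings, operations and message types it defines
```

Comparing such a dump against the interface of the current production web service is one quick way to spot changes before the switch-over.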
Friday
Attendance: local(Massimo, Lukasz, Claudio, Simone, Stephan, Jan, Jarka, Maarten, Eva, Ignacio); remote(Michael, John, Alexandre, Kyle, Xavier, Giovanni, Rolf, Jhen-Wei, Catalin, Roger).
Experiments round table:
- ATLAS reports -
- T0/Central Services
- T1s/CalibrationT2s
- At 18:30 yesterday the problem at SARA mentioned in GGUS:82032 appeared again. Ticket has been re-opened and is being looked into.
- CMS reports -
- LHC machine / CMS detector
- CMS magnet was down during LHC collisions last night
- CERN / central services and T0
- Tier-1/2:
- T1_TW_ASGC Savannah:128277 / GGUS:81887 many "unmerged" files were garbage collected during the intervention and production has been stopped again. Will restart soon
- Other:
- LHCb reports -
- DataReprocessing of 2012 data at T1s with new alignment
- MC simulation at Tier-2s
- Prompt reprocessing of data
Sites / Services round table:
- ASGC: ntr
- BNL: Incident last night (SRM DB). Initially switched the system to a backup and now back in production
- CNAF: ntr
- FNAL: FTS being updated
- IN2P3:
- Very high number (50k) of extremely short-lived pilot jobs, resulting in very low efficiency
- The site asks LHCb for examples of corrupted files
- KIT: downtime 21-MAY [EDIT by Xavier: wrong date corrected] from 6:00 UTC to 11:00 UTC (tape back end intervention - only tape reading/writing will be blocked)
- NDGF: ntr
- NLT1: ATLAS problem under investigation (hopefully fixed next week)
- PIC: ntr
- RAL: 2 downtimes uploaded for next week
- OSG: ntr
- CASTOR/EOS: Retry policy for migration changed
- Central Services: as requested by ATLAS (Cedric), an additional node has been added to prod-lfc-atlas.
- Data bases: Next week rolling DB upgrades
- Dashboard: ntr
AOB:
--
JamieShiers - 11-Apr-2012