Week of 140217
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
- Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
- To have the system call you, click here
- In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: Alessandro, Felix, Luca M, Maarten, Nacho
- remote: Gareth, Kyle, Lisa, Michael, Onno, Pavel, Pepe, Rolf, Sang-Un, Stefano, Vladimir
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- T1
- BNL-ATLAS: Sat-Sun the BNL dCache was suffering from high I/O load on the name space database, caused by PostgreSQL auto-vacuum. dCache was down and the site was blacklisted for transfers and production (GGUS:101275). Fixed now.
- INFN-T1: Sat-Sun thousands of jobs were failing at INFN-T1, with a failure rate of about 60%, due to get errors: staging of input files failed (GGUS:101281).
Sites / Services round table:
- ASGC
- downtime Mon 24 04:00-10:00 to fix 2 issues:
- firmware bug in storage back-end for CASTOR DB
- DPM HW failure
- BNL
- over the weekend the dCache name service provider became unresponsive due to behavior similar to what was seen for the SRM: a massive vacuum operation launched by the PostgreSQL DB interfered with name server queries from Chimera; parameters were adjusted and the situation has been stable since; the resolution will be communicated to the dCache admin forum (a generic tuning sketch follows after this round table)
- FNAL - ntr
- IN2P3 - ntr
- KISTI
- downtime tomorrow 0:00-9:00 for network maintenance
- KIT - ntr
- NLT1 - ntr
- OSG - ntr
- PIC
- today there was a downtime of the tape back-end system; we tried to use the new GOCDB mechanism to declare that only the tape back-end was affected, but we did not manage to do so
- Maarten: to be followed up in the Ops Coordination meeting on Thu
- RAL - ntr
- GGUS
- Activity for the last 4 weeks is attached to this page for tomorrow's MB.
- grid services
- transparent WMS updates tomorrow
- storage
- LHCb EOS SRM upgraded OK to new SHA-2 compliant version
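The BNL fix above amounts to tuning how aggressively PostgreSQL auto-vacuums the name space tables. As a minimal illustration (not the actual change applied at BNL), the sketch below throttles autovacuum per table via psycopg2; the connection settings, table names, and parameter values are assumptions.

```python
# Hedged sketch: per-table autovacuum throttling so vacuum I/O interferes
# less with Chimera name server queries. Connection settings, table names
# and parameter values are illustrative, not the actual BNL configuration.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="chimera", user="postgres")
cur = conn.cursor()

# Tables assumed to dominate the name-space workload (illustrative only).
for table in ("t_inodes", "t_dirs"):
    # Vacuum more often (smaller scale factor) but with cost-based
    # throttling, so each run is less disruptive to concurrent queries.
    cur.execute(
        f"ALTER TABLE {table} SET ("
        " autovacuum_vacuum_scale_factor = 0.05,"
        " autovacuum_vacuum_cost_delay = 20,"
        " autovacuum_vacuum_cost_limit = 200)"
    )

conn.commit()
cur.close()
conn.close()
```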
AOB:
Thursday
Attendance:
- local: Felix, Maarten, Nacho, Oliver
- remote: Dennis, Jeremy, John, Lisa, Michael, Pavel, Rolf, Sang-Un, Saverio, Vladimir
Experiments round table:
- CMS reports (raw view) -
- global xrootd redirector stopped serving responses, fixed by a restart (GGUS:101414)
- work has started to introduce redundancy and to provide proper critical-service instructions
- CNAF T1 is in downtime for a storage upgrade; production activities were stopped, but queues were kept open for analysis jobs reading input via AAA, which seems to work fine
- LHCb reports (raw view) -
- MC simulation and user jobs. Stripping verification.
- T0: NTR
- T1: NTR
Sites / Services round table:
- ASGC
- reminder: downtime Mon 24 04:00-10:00 to fix storage-related issues
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- GridPP - ntr
- IN2P3
- on March 18 there will be maintenance affecting various services:
- batch will be down for at least half a day
- the duration of the mass storage downtime is not yet known
- KISTI - ntr
- KIT - ntr
- NLT1
- short downtime yesterday to replace a broken disk controller; this may have affected the availability of ATLAS data
- RAL
- on Tue the FTS-3 service twice suffered a downtime of ~2 h, due to a failed move of the MySQL DB; the service is OK now
- Oliver: beware that this service is becoming steadily more important for production
- GGUS
- NB! Monthly Release next Wed 2014/02/26 with ALARM tests.
- grid services
- fts-t2-service.cern.ch service certs had expired, fixed on Mon (GGUS:101301); a generic expiry-check sketch follows after this round table
- FTS-3 was updated as agreed with the users
- WMS updates went OK
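As a generic illustration related to the expired fts-t2-service.cern.ch certificates above, the sketch below checks how long a host's TLS certificate remains valid; the port and warning threshold are assumptions, and this is not the actual CERN monitoring.

```python
# Hedged sketch: warn when a service host's TLS certificate is close to
# expiry. Port and threshold are assumptions, not an official WLCG probe.
import socket
import ssl
from datetime import datetime, timedelta

host, port = "fts-t2-service.cern.ch", 8443   # port is an assumption
warn_before = timedelta(days=14)

ctx = ssl.create_default_context()
with socket.create_connection((host, port), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        cert = tls.getpeercert()

# 'notAfter' looks like 'Feb 26 12:00:00 2014 GMT'
not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
days_left = (not_after - datetime.utcnow()).days
if not_after - datetime.utcnow() < warn_before:
    print(f"WARNING: certificate for {host} expires in {days_left} days")
else:
    print(f"certificate for {host} is valid for another {days_left} days")
```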
AOB: