WLCG Operations Coordination Minutes - 1st November 2012
Agenda
Attendance
- Local: Maria Girone (chair), Andrea Sciabà (secretary), Wei-Jen, Felix Lee, Alessandro Di Girolamo, Maria Dimou, Stefan Roiser, Jan Iven, Maarten Litmaath, Maite Barroso Lopez, Nicolò Magini, Simone Campana, Ikuo Ueda, Massimo Lamanna, Michail Salichos, Andrea Valassi, Ulrich Schwickerath, Ian Fisk, Oliver Gutsche, Helge Meinhard, Domenico Giordano
- Remote: Stephen Burke, John Gordon, Lucy, Alexei Klimentov, Burt Holzman, Dave Dykstra, Gareth Smith, Rob Quick, Di Qing, Ian Collier, Ron Trompert, Peter Solagna
Task Force reports
CVMFS
- Reminder sent to all site admins to fill in the twiki page; so far we have responses from about 1/3 of the sites
- Sites that deployed CVMFS since the last meeting
- DESY-ZN (LHCb)
- IN2P3-CC (CMS)
- IN2P3-CC-T2 (CMS)
- ru-PNPI (ATLAS, CMS, LHCb)
- RU-Protvino-IHEP (CMS)
- UAM-LCG2 (ATLAS)
- Total deployment status (since start of task force): 94 sites contacted, 13 sites deployed, 1 site decommissioned
- Sites asking for new CVMFS version 2.1 with shared cache and NFS export (release foreseen for Q1/2014)
gLExec
- ATLAS: integration of glexec usage into the pilot code has restarted; testing expected in the coming weeks.
PerfSonar
Tracking tools
ALARM raising for additional CERN services?
This item concerns the Tier0 only. A presentation (http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=1&materialId=slides&confId=215003) was prepared on ATLAS request. The 2012/11/20 WLCG MB will discuss which additional services (if any) should be added to the list of services eligible to raise ALARMs. The "criticality" of e-groups, in particular, should be well justified, since the service is supported by another department (not CERN IT). Implementation-wise, there is consensus that GGUS ALARMs are the way to go.
Send GGUS reminders for outstanding tickets to "Notified Sites" as well (not only ROCs/NGIs)?
This item concerns all sites except the Tier0. A presentation (http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=3&materialId=slides&confId=215003) was prepared to get WLCG feedback for Savannah:131988. First reactions were in favour of this development, because reminders group all outstanding tickets for a given Support Unit (SU) and are sent only 1-2 times per week, i.e. they do not generate too much traffic.
We left 1 month for more detailed comments.
SHA-2 migration
- EGI validation infrastructure time frame?
- Try to let the experiments profit from it.
Generic links:
FTS 3 integration and deployment
- Functional tests started on fts3-pilot-service.cern.ch, some bugs observed and promptly fixed by FTS developers:
- "Connection reset" errors in communication between client and server: TCP keepalive disabled (already included in FTS3 version deployed at other sites)
- Server repeatedly crashing when checksumming was issued for bulk transfers: fix deployed on Monday 22nd together with other bugfixes
- Tests will resume now that the stability issue is solved.
Middleware deployment
- Worker Node testing for WLCG
- CERN: EMI-2 WN deployed in preprod (~10% of the farm), allowing ATLAS to verify compatibility also with the EOS SRM (BeStMan)
- WLCG Software Life Cycle Process beyond EMI: please look at the document and/or presentation attached to the Oct 16 Management Board meeting agenda
- please send comments or questions to Markus and Maarten
XrootD
Squid monitoring
WMS future
News from other WLCG working groups
Experiment operations review and plans
ALICE
- The old SE hosting conditions data was retired last week: for that type of data ALICE now mainly relies on EOS-ALICE, with backup replicas at other sites.
- Migration from CASTOR ALICE_DISK to EOS ongoing, foreseen to be finished by Nov 30. Thanks to the CASTOR/EOS team for their efforts in this matter!
- On Nov 1 between 00:00 and 01:00 local time job submissions were found failing on multiple (possibly all) gLite 3.2 VOBOX nodes at various sites, with the CREAM client complaining that the proxy had supposedly expired while it was fine. This happened for different DNs from different CAs. GGUS:87997 opened for the CREAM developers.
ATLAS
- The bulk reprocessing will start soon, today at the earliest.
- Some of the urgent simulation jobs have been stuck due to FZK TAPE+DISK problems
CMS
LHCb
- Reprocessing of 2012 data has progressed very well so far; the first part, 1.2 /fb, has been processed. Currently waiting for new conditions to be deployed next week. Until then the activities will be ramping down.
- Discussion with FZK next week on how to improve the staging performance at the site
- EOS: files were deleted because of a buggy script and a problem with SRM upload and concurrent writing to the same file. Some files could be recovered, some reproduced, some lost.
GGUS tickets
No VO or site reported being unhappy with the support provided for the tickets concerning them.
Tier-1 Grid services
Storage deployment
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.13-5; SRM-2.11 for all instances. EOS 0.2.21/xrootd-3.2.5/BeStMan2-2.2.2 for all instances except CMS (0.2.16) | EOS-0.2.19/20/21 | CASTOR - deploy 2.1.13-6 over next weeks; EOS - get CMS onto current version, deploy prototype readonly slave |
| ASGC | CASTOR 2.1.11-9 SRM 2.11-0 DPM 1.8.2-5 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup) http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) |  |  |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3 Scalla xrootd 2.9.7/3.2.2-1.osg Oracle Lustre 1.8.6 EOS 0.2.20/xrootd 3.2.2-1.osg with Bestman 2.2.2.0.10 |  |  |
| IN2P3 | dCache 1.9.12-16 (Chimera) on core servers and 1.9.12-24 on pool nodes. New hardware (more RAM, SSD disks) for Chimera and SRM servers (with SL6). Postgres 9.1 xrootd 3.0.4 | None | None |
| KIT | dCache atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera) cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera) gridka-dcache.fzk.de: 1.9.12-17 (PNFS) xrootd (version 20100510-1509_dbg) |  |  |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. |  |  |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) |  |  |
| PIC | dCache 1.9.12-20 (Chimera) | None | None |
| RAL | CASTOR 2.1.11-8/2.1.12-10 2.1.11-8/2.1.12-10 (tape servers) SRM 2.11-1 |  |  |
| TRIUMF | dCache 1.9.12-19 with Chimera namespace | voms+kpwd authentication and tape recycling | postgres9 upgrade |
FTS deployment
| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.11-1 | applied new proxy delegation patch on Mon 29th |  |
| ASGC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | Planning to apply proxy delegation patch next week |
| CNAF | 2.2.8 - transfer-fts-3.7.10-1 |  |  |
| FNAL | 2.2.8 - transfer-fts-3.7.10-1 | applied patch1 last week | will test patch2 on cmsfts2.fnal.gov next week and cmsfts1.fnal.gov after 1 week of stable running |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | Last patch applied on Oct. 31st |  |
| KIT | 2.2.8 - transfer-fts-3.7.10-1 |  |  |
| NDGF | 2.2.8 - transfer-fts-3.7.10-1 |  |  |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | applied the patch during the ops coordination phone conf November 1st :-) |  |
| PIC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | applied new patch on Nov 1st |  |
| TRIUMF | 2.2.8 - transfer-fts-3.7.10-1 |  | new proxy delegation patch next week |
LFC deployment
Other site news
Data management provider news
CASTOR news
CERN operations and development
EOS news
xrootd news
dCache news
StoRM news
DPM news
FTS news
A new patch for the proxy expiration issue is available and was deployed at CERN on Monday 29th.
The patch fixes the following issues:
- submitter's proxy expiration (GGUS:81844)
- jobs wrongly allocated to different VOs (GGUS:87929)
- VOMS attrs can't be read from CRLs, delegation ID can't be generated (GGUS:87975, GGUS:86775)
- mkgridmap cron job has old gLite paths, thereby preventing the addition of new VO members to the submit-mapfile
VOs report that the frequent transfer submission and allocation errors seen with the previous patch have disappeared, and the proxy expiration issue has not yet resurfaced.
Therefore we would like to ask all T1s to deploy the patch starting from next Monday. Please update the table with the deployment schedule.
Installation instructions:
- Download and update only the transfer-fts-3.7.12-1 rpm from this repository:
- Restart tomcat
- Disable the cron job used to restart tomcat automatically (such service interruptions are no longer needed and should be avoided)
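To follow up on the deployment request, the versions in the FTS deployment table can be compared with the rpm named in the installation instructions. The following is only a minimal sketch: the dictionary is copied by hand from the table in these minutes, and the helper names (`parse_version`, `sites_below`) are hypothetical. It also assumes that the rpm version is a reliable indicator of the patch level, which may not hold everywhere (CERN is listed at 3.7.11-1 although it has applied the patch).

```python
import re

# Versions copied by hand from the FTS deployment table in these minutes.
FTS_VERSIONS = {
    "CERN": "transfer-fts-3.7.11-1",
    "ASGC": "transfer-fts-3.7.10-1",
    "BNL": "transfer-fts-3.7.10-1",
    "CNAF": "transfer-fts-3.7.10-1",
    "FNAL": "transfer-fts-3.7.10-1",
    "IN2P3": "transfer-fts-3.7.12-1",
    "KIT": "transfer-fts-3.7.10-1",
    "NDGF": "transfer-fts-3.7.10-1",
    "NL-T1": "transfer-fts-3.7.12-1",
    "PIC": "transfer-fts-3.7.10-1",
    "RAL": "transfer-fts-3.7.12-1",
    "TRIUMF": "transfer-fts-3.7.10-1",
}

# rpm named in the installation instructions.
TARGET = "transfer-fts-3.7.12-1"


def parse_version(rpm_name):
    """Turn 'transfer-fts-3.7.12-1' into a comparable tuple (3, 7, 12, 1)."""
    match = re.fullmatch(r"transfer-fts-(\d+)\.(\d+)\.(\d+)-(\d+)", rpm_name)
    if match is None:
        raise ValueError(f"unexpected rpm name: {rpm_name!r}")
    return tuple(int(part) for part in match.groups())


def sites_below(target, versions):
    """Return the sorted list of sites whose rpm is older than the target."""
    wanted = parse_version(target)
    return sorted(
        site for site, rpm in versions.items() if parse_version(rpm) < wanted
    )


if __name__ == "__main__":
    for site in sites_below(TARGET, FTS_VERSIONS):
        print(site, FTS_VERSIONS[site])
```

Comparing numeric tuples rather than raw strings avoids ordering mistakes such as "3.7.9" sorting after "3.7.12".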
LFC news
gfal/lcg_util news
Middleware news and baseline versions (Nicolò)
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
AOB
Action list
--
AndreaSciaba - 26-Oct-2012