WLCG Operations Coordination Minutes - 1st November 2012
Agenda
Attendance
- Local: Maria Girone (chair), Andrea Sciabà (secretary), Wei-Jen, Felix Lee, Alessandro Di Girolamo, Maria Dimou, Stefan Roiser, Jan Iven, Maarten Litmaath, Maite Barroso Lopez, Nicolò Magini, Simone Campana, Ikuo Ueda, Massimo Lamanna, Michail Salichos, Andrea Valassi, Ulrich Schwickerath, Ian Fisk, Oliver Gutsche, Helge Meinhard, Domenico Giordano
- Remote: Stephen Burke, John Gordon, Si Liu, Alexei Klimentov, Burt Holzman, Dave Dykstra, Gareth Smith, Rob Quick, Di Qing, Ian Collier, Ron Trompert, Peter Solagna
Communication tools
Andrea announces that the wlcg-tier1-contacts@cern.ch list has been checked and can be used to contact the Tier-1 sites. He proposes to return wlcg-operations@cern.ch to its original, "pre-working group" status and to create a wlcg-ops-broadcast@cern.ch list for broadcasts. The proposal meets no objection. Maria D. proposes to add the new list to the CIC operations portal so that broadcasts can be sent from the web portal.
Task Force reports
CVMFS
- A reminder was sent to all site admins to fill in the twiki page; so far about 1/3 of the sites have responded
- Sites that deployed CVMFS since the last meeting:
  - DESY-ZN (LHCb)
  - IN2P3-CC (CMS)
  - IN2P3-CC-T2 (CMS)
  - ru-PNPI (ATLAS, CMS, LHCb)
  - RU-Protvino-IHEP (CMS)
  - UAM-LCG2 (ATLAS)
- Total deployment status (since start of task force): 94 sites contacted, 13 sites deployed, 1 site decommissioned
- Sites asking for new CVMFS version 2.1 with shared cache and NFS export (release foreseen for Q1/2014)
gLExec
- ATLAS: integration of gLExec usage into the pilot code has restarted; testing is expected in the coming weeks.
PerfSonar
Simone presented some slides.
- The PS service type has been added to GOC; Simone and Rob will discuss later today how to do it for OSG.
- We now have the list of the most important sites and channels to test; the next steps are to:
  - register existing PS instances in GOC/OIM
  - install PS at sites which do not have it
Oliver asks whether there is any recommendation to use a special hardware setup, as is the case in the US. Simone answers that there is none for now, although using different nodes for the bandwidth and latency measurements is recommended, and that he will discuss it with Shawn.
Tracking tools
ALARM raising for additional CERN services?
This item concerns the Tier0 only. The presentation http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=1&materialId=slides&confId=215003 was prepared at the request of ATLAS. The 2012/11/20 WLCG MB will discuss which additional services (if any) should be added to the list of services eligible to raise ALARMs. The "criticality" of e-groups, in particular, should be well justified, as the service is supported by another department (not CERN IT). Implementation-wise, there is consensus that GGUS ALARMs are the way to go.
Maria G. encourages the experiments to give her the list of additional services for which it should be possible to submit ALARM tickets. She will bring this up at the MB in two weeks during her update on the service criticality.
Maite points out that ALARM tickets have never been abused and that it is correct to use them for any serious incident.
Maarten clarifies that sending an alarm ticket does not necessarily wake an expert at night: only the operator will react immediately, following the published procedures. Helge says that for most services an SMS is sent to the expert at any time, and then he/she decides how to proceed. The general consensus is that the operator's response time is always very short and the experts' response more than adequate in most cases, even when support is best effort (as for the majority of the services).
Rob asks why the twiki is considered so critical and Ian explains that it's basically because almost all experiment documentation is on twiki, including the procedures to restart services!
Send GGUS reminders for outstanding tickets to "Notified Sites" as well (not only ROCs/NGIs)?
This item concerns sites only, excluding the Tier0. The presentation http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=3&materialId=slides&confId=215003 was prepared to get WLCG feedback for Savannah:131988. First reactions were in favour of this development because reminders group all outstanding tickets for a given Support Unit (SU) and are sent only 1-2 times per week, i.e. not too much traffic.
One month was left for more detailed comments.
SHA-2 migration
- EGI validation infrastructure time frame?
- Try to let the experiments profit from it.
Generic links:
FTS 3 integration and deployment
- Functional tests started on fts3-pilot-service.cern.ch, some bugs observed and promptly fixed by FTS developers:
- "Connection reset" errors in communication between client and server: TCP keepalive disabled (already included in FTS3 version deployed at other sites)
- Server repeatedly crashing when checksumming was issued for bulk transfers: fix deployed on Monday 22nd together with other bugfixes
- Tests will resume now that the stability issue is solved.
Middleware deployment
- Worker Node testing for WLCG
- CERN: EMI-2 WN deployed in preprod (~10% of the farm), allowing ATLAS to verify compatibility also with the EOS SRM (BeStMan)
- WLCG Software Life Cycle Process beyond EMI: please look at the document and/or presentation attached to the Oct 16 Management Board meeting agenda
- please send comments or questions to Markus and Maarten
XrootD
Squid monitoring
It is agreed that the squid monitoring support provided by CMS, which also covers ATLAS and CVMFS, can be considered a CMS contribution to WLCG.
WMS future
News from other WLCG working groups
Experiment operations review and plans
ALICE
- The old SE hosting conditions data was retired last week: for that type of data ALICE mainly relies on EOS-ALICE now, with backup replicas at other sites.
- Migration from CASTOR ALICE_DISK to EOS ongoing, foreseen to be finished by Nov 30. Thanks to the CASTOR/EOS team for their efforts in this matter!
- On Nov 1 between 00:00 and 01:00 local time job submissions were found failing on multiple (possibly all) gLite 3.2 VOBOX nodes at various sites, with the CREAM client complaining that the proxy had supposedly expired while it was fine. This happened for different DNs from different CAs. GGUS:87997 opened for the CREAM developers.
ATLAS
- The bulk reprocessing will start soon, today at the earliest.
- Some of the urgent simulation jobs have been stuck due to FZK TAPE+DISK problems
CMS
Ian brings up the topic of xrootd fallback to Tier-1 sites. Normally the WNs are not in the OPN, so it might generate very heavy traffic on the site firewall.
Alessandro adds that in fact CERN and FNAL have the WNs in the OPN and Oliver says that KIT agreed to accommodate the higher load on the firewall but the official approval from the site management will require at least three weeks.
Maria G. proposes to start xrootd discussions in the working group to decide how best to deal with this kind of issue.
LHCb
- Reprocessing of 2012 data has progressed very well so far; the first part (1.2 /fb) has been processed. Currently waiting for new conditions to be deployed next week. Until then the activities will be ramping down.
- Discussion with FZK next week on how to improve the staging performance at the site
- EOS: files were deleted because of a buggy script and a problem with SRM upload combined with concurrent writing to the same file. Some files could be recovered, some were reproduced and some were lost.
GGUS tickets
No VO or site reported being unhappy with the support provided for the tickets concerning it.
Tier-1 Grid services
Storage deployment
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.13-5; SRM-2.11 for all instances. EOS 0.2.21/xrootd-3.2.5/BeStMan2-2.2.2 for all instances except CMS (0.2.16) | EOS-0.2.19/20/21 | CASTOR - deploy 2.1.13-6 over next weeks; EOS - get CMS onto current version, deploy prototype readonly slave |
| ASGC | CASTOR 2.1.11-9 SRM 2.11-0 DPM 1.8.2-5 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup) http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) | | |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3 Scalla xrootd 2.9.7/3.2.2-1.osg Oracle Lustre 1.8.6 EOS 0.2.20/xrootd 3.2.2-1.osg with Bestman 2.2.2.0.10 | | |
| IN2P3 | dCache 1.9.12-16 (Chimera) on core servers and 1.9.12-24 and pool nodes. New hardware (more RAM, SSD disks) for Chimera and SRM servers (with SL6). Postgres 9.1 xrootd 3.0.4 | None | None |
| KIT | dCache atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera) cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera) gridka-dcache.fzk.de: 1.9.12-17 (PNFS) xrootd (version 20100510-1509_dbg) | | |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) | | |
| PIC | dCache 1.9.12-20 (Chimera) | None | None |
| RAL | CASTOR 2.1.11-8/2.1.12-10 2.1.11-8/2.1.12-10 (tape servers) SRM 2.11-1 | | |
| TRIUMF | dCache 1.9.12-19 with Chimera namespace | voms+kpwd authentication and tape recycling | postgres9 upgrade |
FTS deployment
| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.11-1 | applied new proxy delegation patch on Mon 29th | |
| ASGC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | Planning to apply proxy delegation patch next week |
| CNAF | 2.2.8 - transfer-fts-3.7.10-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.10-1 | applied patch1 last week | will test patch2 on cmsfts2.fnal.gov next week and cmsfts1.fnal.gov after 1 week of stable running |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | Last patch applied on Oct. 31st | |
| KIT | 2.2.8 - transfer-fts-3.7.10-1 | | will apply new patch on 06.11. |
| NDGF | 2.2.8 - transfer-fts-3.7.10-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | applied the patch during the ops coordination phone conf november 1st :-) | |
| PIC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | applied new patch on Nov 1st | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.10-1 | | new proxy delegation patch next week |
LFC deployment
Other site news
Data management provider news
CASTOR news
CERN operations and development
EOS news
xrootd news
dCache news
StoRM news
DPM news
FTS news
A new patch for the proxy expiration issue is available and has been deployed at CERN on Monday 29th.
The patch fixes the following issues:
- submitters proxy expiration (GGUS:81844)
- jobs wrongly allocated to different VOs (GGUS:87929)
- VOMS attrs can't be read from CRLs, delegation ID can't be generated (GGUS:87975, GGUS:86775)
- mkgridmap cron job has old gLite paths, thereby preventing the addition of new VO members to the submit-mapfile
VOs report that the frequent transfer submission and allocation errors seen with the previous patch have disappeared, and the proxy expiration issue has not yet resurfaced.
Therefore we would like to ask all T1s to deploy the patch starting from next Monday; please update the table with the deployment schedule.
Installation instructions:
- Download and update only the transfer-fts-3.7.12-1 rpm from this repository:
- Restart tomcat
- Disable the cron job used to restart tomcat automatically (such service interruptions are no longer needed and should be avoided)
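The three steps above could be scripted roughly as follows. This is only a dry-run sketch: the rpm name is taken from the instructions, while the tomcat service name, the location of the restart cron job and the directory it is moved to are assumptions that each site would need to adapt. It also assumes that the repository mentioned above is already configured for yum. The script only prints the commands and does not execute them.

```python
import shlex

# rpm name as given in the instructions above
PATCH_RPM = "transfer-fts-3.7.12-1"
# ASSUMPTION: name of the tomcat service on the FTS node
TOMCAT_SERVICE = "tomcat5"
# ASSUMPTION: hypothetical location of the cron job that restarts tomcat
RESTART_CRON = "/etc/cron.d/restart-tomcat"
# ASSUMPTION: directory where the disabled cron file is parked
DISABLED_DIR = "/root"


def patch_commands(rpm=PATCH_RPM, service=TOMCAT_SERVICE,
                   cron_file=RESTART_CRON, disabled_dir=DISABLED_DIR):
    """Return the commands for the three steps, in order, without running them."""
    return [
        # 1. update only the patched rpm
        ["yum", "update", rpm],
        # 2. restart tomcat so that the new code is loaded
        ["service", service, "restart"],
        # 3. disable the automatic restart by moving the cron file out of cron's directory
        ["mv", cron_file, disabled_dir],
    ]


if __name__ == "__main__":
    for cmd in patch_commands():
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands instead of running them leaves the site admin free to review each step and to substitute the local service name and cron job path before applying anything.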
It is agreed that, given the simplicity of the patch installation process, the target date for having it installed at all Tier-1 sites is the end of next week. An official release (2.2.9) including all the patches released so far will take longer, but there won't be any difference between installing it and installing the individual patches. It is also agreed that rollback instructions will be provided, as requested in particular by Burt.
After the meeting the instructions were consolidated on the FTS 2.2.8 admin documentation page:
LFC news
gfal/lcg_util news
Middleware news and baseline versions (Nicolò)
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
Nicolò repeats some recommendations for sites:
- In general, if a site wants to upgrade a service, the EMI-2 version should be chosen
- Sites with a gLite WMS should upgrade to version 3.3.8, which fixes important bugs
- From today, EMI-1 gets only security updates: therefore, important bug fixes will be available only for EMI-2
- Now the baseline version for the WN is the EMI-2 one: sites are strongly encouraged to upgrade if they still have a gLite WN, but it's less urgent if they have EMI-1 (check the WLCG WN testing page for the most recent information)
- The EMI-2 UI has a serious bug affecting the submission to the gLite WMS; moreover, the tarball distribution is not yet available
Peter brings up the fact that the SAM/Nagios boxes still use the gLite UI. Anyway, this is not considered a problem because it's a very controlled environment and the required functionality works as needed with gLite.
AOB
It is agreed that the next meeting will be on November 22, in order to avoid the GDB week.
Action list
--
AndreaSciaba - 26-Oct-2012