WLCG Operations Coordination Minutes - December 4th, 2014
Agenda
Attendance
- local: Pepe Flix (chair), Nicolò Magini (secretary), Andrea Sciabà, Tsung-Hsun Wu (ASGC), Luca Canali (IT-DB), Andrea Manzi (MW Officer), Catherine Biscarat, Stefan Roiser (LHCb), Alberto Peon, Maite Barroso (Tier-0), Maria Dimou, Maarten Litmaath (ALICE)
- remote: Antonio Perez-Calero Yzquierdo (PIC), Burt Holzman (FNAL), Christoph Wissing (CMS), Dave Dykstra, Di Qing (TRIUMF), Dave Mason (FNAL), Jeremy Coles (GridPP), Michel Jouvin, Rob Quick (OSG), Yuri Lazin (NRC-KI-T1)
Operations News
- European HTCondor Site Admins Meeting on December 8 and 9 (link to the agenda)
- Workshop on the future of Argus on December 11 (link to the agenda)
- WLCG critical services: feedback received - independent meeting to discuss with the experiments to be held next week
- WLCG survey: web form finished and broadcast last week (http://wlcg-survey.web.cern.ch/). The deadline to collect information is set to 19th December. So far, ~20 sites answered.
- Reminder. WLCG OpsCoord meeting on 8th is virtual. January 22nd ok.
Middleware News
- Baselines:
- FTS 3.2.30 set as baseline. Deployed already everywhere
- MW Issues:
- gridftp logging too verbose on DPM 1.8.9. The fix is now in EPEL stable. The problem reported last time with EGI SAM probe failing with the new version has been fixed. The deployment by the NGIs is foreseen next week.
- T0 and T1 services
- NDGF
- KIT
- planned dCache upgrade to 2.11 (via 2.10) for
- ATLAS on December 9th, 2014
- CMS on December 11th, 2014
- LHCb on December 16th, 2014
- planned update of the xrootd pool monitoring plugin and FAX name-2-name plugin (no repository known for CMS' TFC plugin).
- JINR-T1
- dCache upgraded on pools and tape to 2.10.13
- RRC-KI-T1
- planned to add ALICE support for tape instance and upgrade tape dCache to 2.10*
- CERN
- Castor upgrade to 2.1.14-15 completed for all VOs
- PIC
- dCache upgraded to 2.6.38 on pool doors
- On 15th December, updating to dCache 2.10.x + xrootd pool monitoring plugin and FAX name-2-name plugin and CMS TFC plugin (though not avail. in WLCG repo!)
- FTS upgrade to v 3.2.30
- Christoph Wissing explains that the CMS TFC plugin for xrootd on dCache is currently provided by dCache developers at DESY, who put the rpm for download on dCache.org. Pepe Flix and Andrea Manzi suggest putting it in the WLCG repository for homogeneity.
Oracle Deployment
- Luca Canali reminds that the Oracle HW migrations were concluded as reported last week. He announces a security upgrade on the LCGR DB next week:
Tier 0 News
- voms.cern.ch and lcg-voms.cern.ch replaced by voms2.cern.ch and lcg-voms2.cern.ch on Wed, Nov 26
- AFS UI statistics: ran today (1h period, p.gd.lcgshare only). Proposal: decommission it by the end of January? This documentation explains how to get the CA certs locally and have the standard fetch-crl cron job update the CRLs regularly (instead of using the CAs on the AFS UI): https://cern.service-now.com/service-portal/article.do?n=KB0002772
Top volumes accessed: stat fetch store create remove lck/ul Acc
p.gd.lcgshare aa: 13895 11587 2308 0 0 0 0 377199
[..]
Totals : 14291 11719 2440 132 0 0 0
Top hosts accessing:
cmsui04.na.infn.it : 626
vocms157.cern.ch : 448
lxplus0229.cern.ch : 411
vocms122.cern.ch : 321
lxplus0027.cern.ch : 266
lxplus0227.cern.ch : 234
lxplus0003.cern.ch : 226
Top users accessing:
anonymous : 8928
acyz (Alice) : 294
lxu (Atlas) : 200
pwang (cms) : 182
zli (RE1) : 177
dmytro (cms) : 145
davec (cms) : 143
kimura : 132
- Maite Barroso notes that the top accesses are anonymous: this means no valid AFS token (either expired, or a non-AFS account). Most of the files are accessible without one.
- CMS VOBOXes in the top list; also other experiments. Migration in progress.
- After the meeting, Maite provided the daily statistics, with a complete list of top hosts and users. They are attached to these minutes.
- Agreed to propose 2nd of February as decommissioning date to experiments.
- Stefan Roiser asks to get statistics also for the CRL volume.
- Alberto Peon presents the experiment feedback on voms-admin, see the agenda for the slides.
- Users editing their email address in voms-admin will get suspended at the next renewal, since the matching with HR DB is performed on e-mail; it should be read-only. Agreed that allowing users to edit their personal data in voms-admin is probably not a showstopping bug, but it should be cross-checked with the WLCG security team.
- Discussion about the CMS suggestions: deducing the user name from the DN is not high priority; wildcard searches would be useful.
- Maria Dimou reminds that the project for the migration from VOMRS to voms-admin running until 2009 collected requirements. She will circulate the list to evaluate if they are still relevant. This is now done in GGUS:110227#update#61
- Pepe Flix proposes the 2nd of February as target date for the migration if there are no showstoppers; agreed.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- high activity during the last 2 weeks
- ARC CE SAM tests:
- direct job submission probe being worked on by ALICE colleague from UA-BITP
- code is being debugged with good progress
- new probe will also benefit LHCb
ATLAS
CMS
- Production/Processing overview
- Tier-1: DIGI-RECO campaign Phys14DR
- Tier-2: Various MC productions: Upgrade, Run2
- VOMS server migration
- No major issues
- 'Detected' quite a few production services still depending on the AFS UI
- UI got quickly updated for new VOMS servers - THANKS!
- A couple of sites needed to update their PhEDEx machines
- Difficult to test prior to actual closing
- Christoph Wissing explains that the sites that needed to update their PhEDEx machines for VOMS were some T2s and T3s.
LHCb
- Stripping 21 - final checks performed. The production input data range has been extended by some 10k files; from then on we will keep pushing as much as we can before the Xmas break.
- Some hiccups with the VOMS migration because individual users were referring to many different locations for VOMS servers, including the AFS UI and /cvmfs/grid.cern.ch. All were fixed, but we shall make an effort to reduce the possible locations within LHCb
- All but two VOBoxes have been migrated to Openstack/AI. The two remaining boxes are currently needed for accounting
- Stefan Roiser adds that running during the Xmas break or suspending until January is still under discussion.
- Stefan will check the timeline for the migration of the remaining voboxes.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
- gLExec in PanDA:
- testing campaign ongoing with good results
- minor issues at a few sites were handled, one still in progress
Machine/Job Features
- TF meeting on Monday
- Agreed on a new protocol and implementation proposal for virtualized environments, i.e. single retrieval of a feature, similar to the file-based interaction. It could be implemented e.g. via Apache, which would also provide additional benefits such as horizontal scalability
- No more need for a client library (mjf.py), all features can be retrieved in the same way (e.g. python: openurl(featurepath))
- Need to get all implementations (also batch) into the repository
- GDB presentation next week on status of the TF
- Stefan Roiser clarifies that the sites can deploy the MJF as a local file on the WN, or as a central apache service - discovered via environment variable on the WN. For the server, the TF will provide a reference configuration, but the sites are free to customize (e.g. firewall rules, authorization).
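The two deployment options described above (a local file on the WN, or a central web service) can be read through one code path. The following is a minimal sketch only, not the TF reference implementation: the environment variable name MACHINEFEATURES and the key name hs06 are assumptions that do not appear in these minutes, and the helper names are hypothetical.

```python
import os
import urllib.error
import urllib.request
from pathlib import Path


def feature_url(base, key):
    """Build the URL of one feature key under a directory path or a base URL."""
    if "://" in base:
        # Central service case: base is already a URL.
        return base.rstrip("/") + "/" + key
    # Local file case: turn the directory path into a file:// URL.
    return (Path(base).resolve() / key).as_uri()


def read_feature(base, key, timeout=10):
    """Return one feature value as a stripped string, or None if it is absent."""
    try:
        with urllib.request.urlopen(feature_url(base, key), timeout=timeout) as response:
            return response.read().decode().strip()
    except urllib.error.URLError:
        # Covers a missing local file as well as HTTP errors.
        return None


if __name__ == "__main__":
    # MACHINEFEATURES is an assumed variable name, hs06 an assumed key.
    base = os.environ.get("MACHINEFEATURES")
    if base:
        print(read_feature(base, "hs06"))
```

Because both cases go through the same URL open call, a payload would not need to know which deployment a site has chosen.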
Middleware Readiness WG
- DPM 1.8.9 verification completed, the bug discovered on dmlite lib has been fixed and dmlite 0.7.2 is on EPEL stable now.
- dCache 2.11.0 verification for ATLAS completed
- dCache 2.6.38 and 2.11.4 verifications for ATLAS are ongoing
- Progress can now also be followed on the MW Readiness JIRA dashboard
- the WLCG package reporter has been rebranded as Pakiti v3 after the discussion with the EGI security team and released to EPEL (under review). It will be used both by the EGI security team and by MW Readiness.
- Our Tasks overview is updated.
- Full minutes of our last meeting are now published here. Please observe the actions.
- Maria Dimou asks if the task overview table in the twiki needs to be maintained. Agreed to keep only tasks with larger implications in the task overview table; day-to-day verification tasks can be followed in JIRA.
Multicore Deployment
- CMS multicore:
- Ongoing test of submission of PromptReco multithreaded jobs to T1s.
- Working with CNAF to understand and improve low number of max. multicore pilots which get to run.
- Test deployment to CMS T2s still waiting for test-bed infrastructure (pilot factory) deployment.
- Antonio Perez adds that CNAF is running LSF as batch system.
SHA-2 Migration TF
- new VOMS servers finale
- ATLAS: old ports were blocked Nov 24 afternoon
- ALICE, LHCb and Ops: Nov 26 afternoon
- CMS: new ports were opened Nov 24 afternoon, old ports blocked Nov 26 morning
- AFS UI and CVMFS UI configurations were quickly fixed Nov 26 afternoon
- PhEDEx hosts at various sites needed to be fixed similarly
- LHCb users had to fix private recipes
- VOMS LSC and vomses documentation page updated
- EGI and WLCG broadcasts pending
- update after the meeting: done on Fri Dec 5 (link)
- retirement plans for the old servers
- experiments need to update their VO cards:
- the old hosts will be "alive" until Tue Feb 3, 2015
- the VOMS daemon ports refuse connections, while VOMRS and VOMS-Admin are available
- on Feb 3 the special router configurations will be removed
- further references to the old services may hang from then on
- either or both of the old services can get decommissioned as of that date
- hopefully also VOMRS is no longer needed by then
- experiments should get the old servers removed from the VOMS client configuration on the UI instances they use
- for lxplus that will happen "automatically" next week
- update after the meeting: next week OSG will release a configuration update that has references to the old services removed
- RFC proxies
- shall we pick that up now?
- For SHA-2 the task force can be closed.
- Concerning RFC proxies, they should already be supported by all middleware. There were known issues with SAM and the Condor version used in ATLAS pilot factories. LHCb should also test DIRAC again.
- Michel Jouvin asks if legacy and rfc proxies can coexist. Maarten explains that they can, provided that they are not mixed in a single chain.
- Maarten proposes the start of Run2 as target for switch to RFC proxies for WLCG.
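For the clean-up of the old VOMS servers from client configurations requested above, a small scan can help locate leftover references in private recipes. This is a sketch under stated assumptions: it assumes the usual vomses line layout of quoted fields ("alias" "host" "port" "server DN" "vo"), the two directories scanned in the example are assumed locations that may differ per site, and the function names are hypothetical. Only the old host names are taken from these minutes.

```python
import re
from pathlib import Path

# Old server host names, as listed in these minutes.
OLD_HOSTS = {"voms.cern.ch", "lcg-voms.cern.ch"}

QUOTED_FIELD = re.compile(r'"([^"]*)"')


def stale_vomses_entries(text):
    """Return (line_number, line) pairs whose host field names an old server."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = QUOTED_FIELD.findall(line)
        # Assumed layout: the second quoted field is the server host name.
        if len(fields) >= 2 and fields[1] in OLD_HOSTS:
            hits.append((number, line.strip()))
    return hits


def scan(locations):
    """Yield (path, line_number, line) for stale entries under the given locations."""
    for location in locations:
        root = Path(location).expanduser()
        if root.is_file():
            files = [root]
        elif root.is_dir():
            files = sorted(p for p in root.rglob("*") if p.is_file())
        else:
            files = []
        for path in files:
            for number, line in stale_vomses_entries(path.read_text(errors="replace")):
                yield path, number, line


if __name__ == "__main__":
    # Both locations are assumptions; adjust to the UI instance in use.
    for path, number, line in scan(["/etc/vomses", "~/.glite/vomses"]):
        print(f"{path}:{number}: {line}")
```

Matching on the host field rather than on the whole line avoids false hits on the new servers, whose names contain the old ones only as a prefix.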
IPv6 Validation and Deployment TF
Squid Monitoring and HTTP Proxy Discovery TFs
- The WLCG squid monitor page that's automatically generated from GOCDB and OIM was updated to support multiple squid services per site. The registered DNS name of the site was included with the site name to keep it unique.
- We're now ready to go ahead with an announcement asking everybody to register. Here is the proposed text, is it OK? Where should it be sent, and who should send it?
Subject: WLCG Squid registration
Dear sites,
The WLCG Operations Coordination Team requests that all sites now
register their Squid services in GOCDB or OIM. These registrations
are beginning to be used to automatically configure monitoring.
A representative from each site should please follow these
instructions:
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGSquidRegistration
Questions can be directed to wlcg-squidmon-support@cern.ch.
Best regards,
The WLCG Operations Coordination Team
- Andrea Sciabà suggests sending the broadcast to wlcg-broadcast, with wlcg-operations, wlcg-ops-coord and project-lcg-gdb in cc.
Network and Transfer Metrics WG
- Metrics area meeting held last week, minutes available at https://indico.cern.ch/event/354593/
- WG waiting on input from transfer systems and experiments on use-cases/requirements for network metrics
- Strawman planned for early next year
- Status of perfSONAR presented also at ATLAS jamboree yesterday
- Update campaign ongoing, hard deadline for all sites to update is 8th January 2015
- perfSONAR data store configured in ITB; stress testing ongoing
Action list
- ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
- Maite provided full updated statistics after lxplus5 closure.
- ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing.
- Ongoing discussions on publication in AGIS for ATLAS.
- ONGOING on experiment representatives - report on voms-admin test feedback
- CLOSED on Alberto Peon and IT-PES - summarize feedback on voms-admin received so far at the next meeting
- ONGOING on Andrea Sciabà - review the critical services table
- Andrea will provide an update at a dedicated meeting on Tuesday and report at the next regular meeting.
AOB
- Next meeting on December 18th.
--
NicoloMagini - 2014-11-21