WLCG Operations Coordination Minutes - 11 April 2013
Agenda
Attendance
- Local: Maria Girone (chair), Andrea Sciabà (secretary), Simone Campana, Joel Closier, Maarten Litmaath, Stefan Roiser, Christoph Wissing, Felix Lee, Oliver Keeble, Nicolò Magini, Michail Salichos, Maite Barroso Lopez, Alessandro Di Girolamo, Jan Iven, Ikuo Ueda
- Remote: Ewan Mac Mahon, Peter Gronbech, Stefano Belforte, Stephen Burke, Alessandro Cavalli, Jeremy Coles, Andreas Petzold, Burt Holzman, Daniele Bonacorsi, Gareth Smith, Thomas Hartmann, Maria Dimou, Salvatore Tupputi, Ian Fisk
- Apologies: Pepe Flix, Alessandra Forti
News (M. Girone)
Edoardo Martelli, the CERN responsible for the IPv6 task force, contacted us to ask whether we can coordinate the testing of experiment and WLCG applications; it was therefore decided to start a task force on IPv6 compatibility. Volunteers from the experiments and the sites are needed; Edoardo mentioned that Tony Wildish and Costin Grigoras are already involved for CMS and ALICE, respectively. The leadership details of the task force are still being discussed.
Joel: Peter Clarke can represent LHCb.
Middleware news and baseline versions (N. Magini)
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
Highlights:
- there was a security release for CREAM; sites should upgrade to it
- the baseline versions table now contains the versions of the clients to deploy on UIs and WNs
- EMI-3 has been released but no product is baseline yet; still, sites are free to upgrade services to EMI-3 (the WN needs more testing)
- WLCG VOBOX
- EGI WLCG repository tested, looks usable so far
- official release expected in a few days
- CERN WLCG repository to be created in the coming days
- can augment EGI WLCG repository and/or serve for failover
- will serve various use cases
Oliver mentions that there will be a meeting in Rome to discuss middleware support after EMI; it would be good to have a well-defined WLCG position concerning the repository. Maarten thinks that the current approach is sufficient for our needs, but it can always be rediscussed if necessary.
Tier-1 Grid services
Storage deployment
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.13-9 and SRM-2.11 for all instances. EOS: ALICE (EOS 0.2.20 / xrootd 3.2.5), ATLAS (EOS 0.2.28 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2) | CASTOR: closed-file update feature and fully load-balanced stager aliases | EOSALICE: short term (next week) go to 0.2.29 / xrootd 3.2.7; looking at BeStMan 2.3 |
| ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.6-1, xrootd 3.2.7-1 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 with hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (ATLAS, CMS, LHCb) | | Waiting for new tapes to be installed in the next few days |
| FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM); httpd 2.2.3; Scalla xrootd 2.9.7/3.2.7-2.osg; Oracle Lustre 1.8.6; EOS 0.2.29-1 / xrootd 3.2.7-2.osg with BeStMan 2.2.2.0.10 | | |
| IN2P3 | dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-25 on pool nodes; Postgres 9.1; xrootd 3.0.4 | Upgrade to 1.9.12-25 on all pool servers during the last scheduled downtime (2013-03-18) | Upgrade to 2.2-10+ in June 2013 (3rd golden release) |
| KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera), cmssrm-fzk.gridka.de 1.9.12-17 (Chimera), lhcbsrm-kit.gridka.de 1.9.12-24 (Chimera); xrootd (versions 20100510-1509_dbg and 3.2.6) | | Will update to the new golden release, but no date yet |
| NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes | | |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) | | |
| PIC | dCache 1.9.12-20 (Chimera); doors at 1.9.12-23 | Progressive upgrade to 1.9.12-23 on all pool servers | |
| RAL | CASTOR 2.1.12-10; 2.1.13-9 (tape servers); SRM 2.11-1 | | |
| TRIUMF | dCache 1.9.12-19 (Chimera) | | Will upgrade dCache to 2.2 on April 17, with WebDAV enabled |
FTS deployment
| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |
LFC deployment
Other site news
- CERN:
- Upgrade of WMS servers to EMI-3: the CERN WMS (Workload Management System) service will be upgraded to version 3.5 (EMI-3 release) on Tuesday, April 16th 2013. The WMS servers are load-balanced and will be upgraded progressively, so there will be no service interruption for job submission. During the upgrade, users may notice a delay in job status updates, as the L&B service responsible for providing and updating the job status may stop responding for a few minutes. See the IT SSB entry: https://itssb.web.cern.ch/planned-intervention/upgrade-wms-servers-emi-3/16-04-2013. The upgrade will include the Condor version that is necessary to submit to ARC CEs (this is the case also for CNAF).
Data management provider news
- GFAL/lcg_util v. 1.15.0 has been certified and will be released to EPEL and EMI; it is a bugfix release, including fixes for xrootd-3.3.
- xrootd-3.3 has been released to EPEL; it is not backward compatible, so DPM has produced a new version of the dpm-xrootd plugin to be released to EMI.
- dCache: the developers are collaborating with PIC to solve the disk/tape separation for CMS and are going to involve all CMS dCache Tier-1 sites.
UMD release plans (P. Solagna)
Experiment operations review and plans
ALICE (M. Litmaath)
- CERN: very bad job efficiency for a number of days, due to excessive job submission times and failures, which caused CERN to be almost fully drained of ALICE jobs; fixed as of the early evening of March 21 (GGUS:92521)
- CERN: since March 20 the DAQ uses the CASTOR-ALICE Xrootd redirector instead of the ALICE VOBOX
- KIT:
- new Xrootd servers and redirectors online since March 21
- ~1.2 PB migrated to new servers
- consistency checks ongoing to address missing and dark data
- new servers do not yet work for writing (Apr 10), to be debugged...
ATLAS (I. Ueda)
Activities:
- After finishing the urgent jobs for the conferences, there were periods when ATLAS did not have enough jobs to fill all the sites.
- Now ATLAS is getting enough MC jobs again.
Issues:
- GGUS:92166 (transfers to CERN failing with "Error with credential") is still open and causing trouble. The issue has been open since the beginning of March; it has been intermittent and has never been really understood so far. CMS observed the same issue at some point. ATLAS updated the ticket today with the most recent failures (see WLCGDailyMeetingsWeek130408). Michail reports that he is working on the problem.
CMS (C. Wissing)
- Little Production and Processing activity at the moment
- Updated strategy for CVMFS requirements at CMS sites
- Sending of install jobs will stop by September 30th 2013
- Cron-based pull methods remain supported for another few months
- In spring 2014 (April 1st), CVMFS becomes a requirement (cron-based pull installation might stop functioning)
- Supported models
- CVMFS as classical WN client
- CVMFS over NFS
- Large enough disk space on WNs to allow CVMFS installation during runtime via Parrot
- Requests to the Tier-2 sites
- Fair-share allocations: 50% Role=production or Role=t1production, 40% Role=pilot, 10% for the rest of CMS
- Provide and publish 48h job queues
- Multi core strategy (not changed with respect to last report)
- Start using multi-core slots in dynamic partitioning mode running single-core jobs
- No request yet to configure additional multi core queues
- Will use what is already provided
- There is still interest in a common SAM submission probe based on Condor-G
Concerning SAM, Maarten says that if all Globus CEs use GRAM5, there should no longer be any need to use Condor-G; this would avoid having to run a Condor scheduler just for the SAM tests. The ARC CEs must also have a working solution, though; Maarten notes that they have their own client tools that can be used by the corresponding probe.
- Request to re-discuss the switch to Wall clock time accounting
Maarten adds that technically the switch could happen at any time, but the decision must be taken by the WLCG MB (Ian thinks it will be discussed next Tuesday).
LHCb (S. Roiser)
- Very little activity on the distributed computing facilities, mainly simulation and user jobs
- CASTOR/EOS migration: the bulk of the data has been migrated; many small files (histograms) are still on CASTOR, and it is currently being checked which of these data need to be migrated
- An FTS3 test instance with the features requested by LHCb (bringonline) has been deployed; testing shall start this week
- Upcoming major activities
- Full data reprocessing of the last pp data taken in 2013. The full processing chain (reco-stripping-merging) will be executed; expected to last at most one week
- Incremental stripping of all 2011/2012 data (stripping-merging). Final tests are being done. Expected to last 6-8 weeks, initially running only at CERN + Tier-1 sites; if needed, extra compute power can be used at specific Tier-2 sites. The bottleneck will be that all FULL.DST data (on tape) need to be staged; the bandwidth needed for staging was communicated in the last planning meeting
- Issues
- The CVMFS Stratum 1 server was going out of sync; this was promptly fixed by IT/PES
- The SLS sensor for the LFC is flickering; this is currently being discussed with the providers of the sensor
- For FTS3 testing the client has been deployed; a dependency on the system-provided Boost has been discovered, which could interfere with the LCG/AA-provided software stack (used for LHCb applications). A solution is currently being discussed with the development team.
Nicolò suggests using the REST API of FTS3: this would completely decouple the FTS usage from any other software.
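As an illustration of that decoupling, a submission through the FTS3 REST interface needs nothing beyond an HTTP client and a grid proxy. The sketch below is a hypothetical example, not LHCb's recipe: the endpoint host/port, the CA path and the exact JSON schema (including the bring_online parameter, i.e. the staging feature mentioned above) are assumptions to be checked against the pilot service documentation.

```python
# Minimal sketch of an FTS3 REST job submission (endpoint and paths assumed).
import json
import requests

FTS3 = "https://fts3-pilot.example.org:8446"  # assumed pilot endpoint
PROXY = "/tmp/x509up_u1000"                   # grid proxy: client cert + key

job = {
    "files": [{
        "sources": ["srm://source.example.org/lhcb/data/file"],
        "destinations": ["srm://dest.example.org/lhcb/data/file"],
    }],
    # bring_online asks FTS3 to stage the source file from tape first;
    # the value is a timeout in seconds (field name assumed from FTS3 docs).
    "params": {"bring_online": 3600, "overwrite": False},
}

r = requests.post(FTS3 + "/jobs",
                  data=json.dumps(job),
                  headers={"Content-Type": "application/json"},
                  cert=(PROXY, PROXY),
                  verify="/etc/grid-security/certificates")
r.raise_for_status()
print("Submitted job:", r.json().get("job_id"))
```

No experiment framework, FTS client library or Boost dependency is involved; that is the point of the suggestion.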
Task Force reports
CVMFS (S. Roiser)
- 24 sites are still missing the CVMFS deployment
- out of these, 11 are needed for ATLAS/LHCb (to be deployed by the end of this month)
- out of these, another 5 plan to deploy before the end of April
- CMS has decided on deployment target dates (see report above)
- A SAM probe for CVMFS is currently in preparation (a minimal illustration follows below)
- it may be included in the experiment SAM suites - is this enough for experiment testing?
Alessandro is very interested in a CVMFS SAM probe that could also be used by ATLAS.
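A probe of this kind can be very small. The following is only a sketch of the idea, not the probe under preparation: it checks that a repository is mounted and readable, and reports through the Nagios-style exit codes used by SAM; the repository path is an assumed example.

```python
#!/usr/bin/env python
# Illustrative CVMFS check following the Nagios/SAM convention:
# exit 0 = OK, exit 2 = CRITICAL. The repository is an assumed example.
import os
import sys

REPO = "/cvmfs/atlas.cern.ch"  # hypothetical repository to test

def main():
    if not os.path.isdir(REPO):
        print("CRITICAL: %s is not mounted" % REPO)
        return 2
    try:
        entries = os.listdir(REPO)  # triggers the automount and a catalog read
    except OSError as err:
        print("CRITICAL: cannot read %s: %s" % (REPO, err))
        return 2
    if not entries:
        print("CRITICAL: %s is mounted but empty" % REPO)
        return 2
    print("OK: %s readable, %d top-level entries" % (REPO, len(entries)))
    return 0

if __name__ == "__main__":
    sys.exit(main())
```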
gLExec (M. Litmaath)
- LHCb: the DIRAC developers discovered that much of the gLExec support essentially has to be reimplemented
- a technical discussion is ongoing via e-mail to determine the most robust way for the pilot to prepare the payload work area, run the payload, take care of the output files and clean up (a simplified sketch follows below)
- no time estimate yet
Simone reports that the manpower issue for the ATLAS-gLExec integration has been solved for the time being; tentatively, the development could be finished in May.
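For reference, the pilot-side sequence under discussion is roughly the one sketched below. This is a simplified illustration, not the DIRAC implementation: the glexec path and the GLEXEC_* variables follow the common gLExec setup of the time, while the work-area handling, proxy paths and payload command are assumptions (and the naive permission handling is exactly the kind of detail the e-mail discussion is about).

```python
# Simplified sketch of a pilot running a payload under gLExec (paths assumed).
import os
import shutil
import subprocess
import tempfile

def run_payload(payload_proxy, payload_cmd):
    # 1. Prepare a work area that the mapped payload account can use.
    workdir = tempfile.mkdtemp(prefix="payload-")
    os.chmod(workdir, 0o777)  # naive: doing this safely is part of the discussion
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = payload_proxy   # identifies the payload owner
    env["GLEXEC_SOURCE_PROXY"] = payload_proxy  # proxy handed to the payload
    try:
        # 2. Run the payload under the mapped identity.
        rc = subprocess.call(["/usr/sbin/glexec"] + payload_cmd,
                             cwd=workdir, env=env)
        # 3. Collect output files from workdir here (omitted).
        return rc
    finally:
        # 4. Clean up the work area.
        shutil.rmtree(workdir, ignore_errors=True)

if __name__ == "__main__":
    print(run_payload("/tmp/payload_proxy.pem", ["/bin/sh", "./payload.sh"]))
```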
perfSONAR (S. Campana)
SHA-2 migration (M. Litmaath)
- the new CERN CA is ready for pilot users to get certificates
- the CA is NOT yet supported by any machine in the CERN VOMS clusters (voms, lcg-voms)
- further followup has led to a plan B that looks feasible in the short term (days, not weeks)
- EMI/UMD compatibility table maintained by EGI
- for some important products only EMI-3/UMD-3 supports SHA-2
FTS 3 integration and deployment (N. Magini)
- Starting discussions with IT-PES about the deployment of an FTS3 server with production-level support, to migrate larger-scale experiment workflows
- Continuing the experiment testing on the pilot FTS3 server
- Transfers to StoRM sites with FTS3 require an upgrade to the latest StoRM EMI-3 version, 1.11.0; looking for volunteer sites to test
Xrootd deployment (D. Giordano)
Organize the distribution and deployment of external plugins.
- Select a standard repository in which to collect the external plugins. Among the options, a repository at CERN seems to be a good compromise.
Examples:
- The VOMS-XRootD plug-in (developed by G. Ganis) provides a function to validate and extract the VOMS attributes from a proxy; it is meant as an add-on to the libXrdSecgsi authentication plug-in (an illustrative configuration fragment follows after this list)
- requires VOMS >= 2.0.6 and XRootD >= 3.3.1
- Tests: positive feedback provided by David Smith (DPM) and Wei Yang (ATLAS)
- tarball and RPM for SL5 and SL6 (plus documentation) available from http://gganis.github.com/vomsxrd/
- UCSD collector for detailed monitoring (developed by M. Tadel)
- dCache-xrootd monitoring plugin for f-stream (developed by I. Vukotic)
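For illustration, enabling the VOMS extraction on an xrootd server is expected to amount to pointing the gsi security protocol at the plug-in library, along the following lines. The option and library names are assumptions based on the vomsxrd documentation; the exact syntax should be taken from http://gganis.github.com/vomsxrd/.

```
# xrootd configuration fragment (illustrative paths and option names)
xrootd.seclib /usr/lib64/libXrdSec.so
sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates \
    -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so
```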
Tracking tools (M. Dimou)
- Update of the GGUS host certificate with the April release on Wednesday 2013/04/24. Interface developers have received multiple e-mail and Savannah notifications since 2013/03/01; no reply so far, except from GGUS SU ROC_CERN (SNOW system). This was also reported in WLCGDailyMeetingsWeek130304#Thursday. Details in Savannah:136227
- The GGUS development team needs the CMS position and a contact point for Savannah:131565 and dependent development items.
Http Proxy Discovery (D. Dykstra)
News from other WLCG working groups
AOB
Action list
- Build a list, by experiment, of the Tier-2 sites that need to upgrade dCache to 2.2.
- Inform sites that they need to install the latest Frontier/squid RPM by April at the latest.
- Maarten will look into SHA-2 testing by the experiments when experts can obtain VOMS proxies for their VO.
- Tracking tools TF members who own Savannah projects should list them and submit them to the Savannah and JIRA developers if they wish to migrate them to JIRA. AndreaV and MariaD to report on their experience with the migration of their own Savannah trackers.
- CMS to appoint a contact point for Savannah:131565 and dependent development items, for the replacement of the Savannah-GGUS bridge by a GGUS-only solution tailored to CMS ops needs.
Chat room comments
--
AndreaSciaba - 09-Apr-2013