Summary of GDB meeting, October 9, 2013 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=251190
Welcome - M. Jouvin
September summary only available today: apologies for being late.
- Reminder: Need more volunteers.
- Today, thanks to Oliver for accepting taking the notes
Future GDBs until end of 2014: second wednesday of each month, except in January
- January moved to 15th because of a clash with CERN Director's New Year speach
- 2014 events created so please check for clashes.
Next pre-GDBs
- Possible topics: review of cloud activities, batch system support and ops coord F2F.
- Ops Coord F2F meeting will probably be in February
- Other suggestions welcome
Actions in progress others than those followed by Ops Coord
- Storage accounting: update planned at December GDB
- site Nagios testing: more sites needed
- handling job with high mem requirements: no feedback received but still a potential problem
- ALICE issue when heavy ion running takes place.
- ATLAS has some workflow for jobs with more memory needed than on ID card - specific sites agreed to this situation
- CMS similar. Small number of jobs have these requirements and make special arrangements for them.
- LHCb: RSS and virtual memory issue not solved yet. LHCb has written needs into VO ID card.
- JT: Some VOs outside LHC world with similar issues. Best to solve at the batch system level. Could present at a pre-GDB our approach and try to see if it would fit in CREAM-CE. LHCb liked our approach. Counts all of process requirements.
EGI TF
- Strong focus on federated cloud infrastructure.
- EGI leading role in operating DCIs in Europe recognised.
- Ops coordination with WLCG improved.
- Transition period coming with end of EGI-Inspire in 6 months and S. Newhouse (director) leaving
- Some hope to see EGI-Inspire extended...
Simone Campana takes over Maria’s role as she is becoming the new CMS Computing Coordinator in January
- Andrea Sciaba will be Simone's deputy
Data Preservation - J. Shiers
More on DP at CHEP next week: introductory talk + DP workshop
DPHEP Implementation Board set up: similar to WLCG GDB/MB
- Public agendas in Indico
- Twitter
DP is more than HEP: many projects/disciplines, can profit a lot by collaboration
- Some other communities more advanced
- Jamie involved in several coordinations around these projects or related
- High level strategy with others (projets, funding agencies): make them aware of us, clarify what we can offer
DP may have implication on services
- Site representatives should participate to the cost evaluation
Concentrate on use cases (motivations and costs): 3 identified for HEP
- Long tail of paper after the end of an experience requiring access to data
- New theorical insights requiring reprocessing/reanalyzing the data
- Should preserve date forever just in case: no clear business case
Translate into scenarios and evaluate the cost: 1 decade preservation, 2 decades, 3 decades
- Planning a workshop to estimate the costs of curation (January)
- As input, look at many migrations we have performed in the past (Linux, Objectivity...)
- Take into account media migration required during the preservation period and the OS changes...
- Manpower expected to be the dominant cost
- Cost foreseen as affordable as long that there is a valid/strong business case
- Identify who to optimize the cost by a better coordination and sharing of efforts
- See if we can make our data more "preservational" by adapting the way we work today
Data Management
Future Directions for DM Clients - A. Alvarez Ayllon
GFAL2: a replacement for GFAL allowing grid and cloud operations
- Addressing the shortcomings of GFAL: error handling, extensibility
- No requirement to use an information system (disabled by default but still possible)
- Session reuse
- More protocols supported
- Already used by FTS3
A new set of CLI interfaces to GFAL2 to replace lcg_util: GFAL2-utils
- Drop support for LFC? Will affect lcg-cr and catalog specific CLI (lcg-aa, lcg-lg...)
- Either not used by WLCG exps (CMS) or not used through lcg-xx commands (ATLAS/LHCb)
- Impact on other VOs? would use plugin level not command line or the LFC CLI (all lcg-xx commands having a matching LFC CLI command? to be checked)
- LFC will remain usable as one of the protocol supported to access a file
- Other commands (in addition to LFC-related ones) with no replacement planned in GFAL2-utils: lcg-get-checksum, lcg-getturls, lcg-gt, lcg-stmd (space token mgmt)
- Comments or complaints go to: lcgutil-devel@cernNOSPAMPLEASE.ch all documented in the wiki.
Python API not yet complete but will expose most of the C API entries
- Philippe - definitely need the getturl functionality at the python level
Proposal
- Freeze development of GFAL: support until end of next year only for critical bugs
- Release in EPEL5 and EPEL6 for GFAL and GFAL2 only: will be removed from EMI-2 and EMI-3
- Remove gfal/lcg_util from EPEL after the proposed/agreed end of life
Experiment feedback
- ATLAS: seems doable... but need to sort out the impact of utils not ported
- CMS and ATLAS: would like to see the new clients deployed in CVMFS...
- CMS experts for DM clients absent: need to check details with them
- No major objection from experiments: more precise feedback from experiments expected by end of November to come with a finalized migration plan at December GDB
- Need to inform and get proper feedback from non LHC VOs: to be done by EGI?
Discussion
- Helge - At CERN have had problems with EPEL and EMI repos getting in way of each other. If you can not sync the withdrawal from EMI repos please be sensitive to such issues.
- Oliver: Removal has already started. Done for DPM. Can we ask expt reps to check internally and gfal team assemble feedback to conclude what we have got.
FTS3: Entering Production, Future Plans - M. Salichos
FTS3 released in EPEL6.
Status
- Protocol supported: SRM, GridFTP, http, xroot
- DB : MySQL + Oracle
- Clients: FTS2 compat, FTS3 cli with new features, REST API
FTS3 entering production: heaviliy used by ATLAS for prod transfers
- Some activity also in CMS and LHCb
- LHCb started to use FTS3 for all prod transfers yesterday
- Some non LHC VOs already started evaluating FTS3, including EUDAT with gridftp (to dCache or iRods)
WLCG FTS3 TF still actively involved but developers would like to reduce the frequency of demos to 1/month (or 6 weeks)
8 instances installed now, all of them except one using
MySQL
- RAL succeeded to transfer 300 TB per day with its instance
- Stats from the last week: FTS2 transferred 4.6 PB/8.7 Mfiles, FTS3 1.8 PB/1.3 Mfiles
- 8 VMs (4 at CERN, 4 at RAL) compared to 38 for FTS2
FTS3 monitoring: 3 solutions
- Transfer dashboard
- Standalone monitoring service
- Nagios probe (not yet released)
Main issues seen during the last year mostly connected to DB tunning and usage: most problems fixed
Testing to be done
- HAve not reached yet the 1M file per day on 1 instance
- New features implemented in FTS3 like session reuse
Missing features to be implemented in the future
- Multi-instance VO-shares
- Multi-hop transfers
- Integration with perfSonar
- Activity fair-share inside a VO
- Look at https://svnweb.cern.ch/trac/fts3/roadmap
for details
- Would be nice to have any other requirements from experiments as soon as possible
- In particular those who may require database schema changes to avoid too many downtimes later.
Dev team confident that FTS3 can handle the load of FTS2 with only one instance but exact configuration to be deployed must be discussed with the exps.
- VO-specific instances vs. global instance: everything possible, let's start with the simplest configuration (global instance), always possible to evolve
Philippe: FTS3 big advantage is the simple configuration, would like this to remain in the future rather than implementing very complex use case into FTS3
Oliver: need to discuss with experiments the migration to new FTS3 clients and the obsolescence of FTS2 ones
- ruccio already using it, LHCb will start working on it soon
- LHCb: current situation makes reverting to FTS2 easier until FTS3 is fully validated
- CMS: experts to be asked
Impact on non WLCG VOs: work well advertized in EGI
Oliver must send Michel a pointer to documentation of changes compared to FTS2.
Discussion
- Chris - can a site restrict the bandwidth used by FTS? Not directly
- NDGF - we used FTS3 and it's the first time our 10Gb link was filled up.
Actions in Progress
Ops Coord Report - J. Flix
New Ops Coord leader: S. Campana
MW baseline
- New EMI-3 StoRM version
- Critical bug affecting top BDII: fix announced via EGI Broadcast
- DPM 1.8.7 released to EPEL
- gfal/lcg_util bug fix released to EPEL
glexec: 46 sites verified, 48 in progress
- Often coupled with SL6 migration
SL6
- T1: 10/16 completed, 3 in progress, 3 with a plan
- T2: 81/130, 15 to complete in time to meet the deadline, 30 not replied yet
- HEPSPEC06 for SL6 is being filled: sites encouraged to publish their results
- HEP_OSLibs new version 1.0.13: sites must upgrade at their convenience
SHA-2
- CERN VOMS servers should become SHA-2 compliants in the next month
- CERN plans to move a SHA-2 certificate for VOMS host certificate in the coming month, inducing a DN change
- Concerns after the problem that affected BNL VOMS server after a similar change: plan is to deploy a 3d server and remove the old one when it has been been properly configured everywhere
VOMRS
- Need experiments to test the new VOMS-Admin
- No progress seen on this in the last months
CVMFS
- ALICE: good progress, 50% sites done
- CMS: not yet completed despite the deadline set to Oct. 1st
- LHCb: conddb repo can be removed, no longer used
Machine/job features
- Now really active
- Meetings and minutes in Indico
- Developed a tool mjf.py returning the machine/job information
- Tested at CERN
- Next step: packaging for wider deployment
- Open question: how to extract HS06 in VMs?
perfSONAR
- All sites must upgrade to 3.3.1
IPv6 validation and deployment
- Concentrate on assessing that moving to dual stack doesn't break existing Ipv4 services
- Joint work with HEPiX WG: sites invited to join the testbed and the WG
XrootD monitoring deployment
- Detailed monitoring of dCache instances now available
FTS3
- See previous talk
- Plan is to support the whole WLCG load on 1 instance (with 1 failover)
Tracking tools
- Important meetings these weeks, more next month
MW readiness verification
- Maarten is TF convenor
- A twiki page created, an email being created
- Membership still to be defined
WMS decommissionning TF just created
- CERN would like to assess the future usage of VMS and see if decommissionning could be planned at some point
Next Ops Coord meeting clashing with CHEP: postponed to the week after (Oct. 24)
HEPiX Puppet WG - B. Jones
Config tools used at sites evolving: last survey by EGI showed that Puppet is now as used than Quattor with several sites without a config tool and planning to look at it.
- Growing user community in EGI and a huge wider community
- CERN working on and publishing modules on GitHub
- Every grid service except VOMS has a puppet module. Half probably published. SOme work out of the box and some don’t. SOme need testing for hard coding. As on github will help fix for other sites.
WG goals: share information, experiences and code amongst sites using puppet
- Help with possible migration paths for products where YAIM future is unclear
- Several collaboration options: central point for documentation and support, more formal collaboration on modules, a full suite of HEP modules, publishing to the Puppet forge
- Is Puppet forge too formal?
- DPM dev team publishing its modules to Puppet forge
WG status
- 30 subscribers to the list
- Currently documenting modules available for EMI-2 and EMI-3
Would like to receive feedback from WLCG sites
- They are encouraged to join the WG if interested in Puppet
Discussion
- Jeremy: UK has several sites involved in a semi-active group. Found that there are several ways of doing things but do not yet have a consensus on ways forward but good to get the discussion started. Will be very useful to have a WLCG wide community effort sharing experiences and modules.
- Helge gives assurance that contributions back to CERN modules will be accepted after a careful review, to avoid the community splintering.
- Maarten: Are there any signs in Quattor community that sites would be interested in moving to Puppet?
- Michel: A Quattor workshop held 2 weeks ago. Not a very big community but still active and no site looking in short-term. Keeping an eye on events: some may want to move in the future but the critical feature is the availability of a service description maintained by the community, one of the great Quattor success.
WLCG IS - M. Alandes Pradillo
BDII distribution
- EMI-2 and EMI-3 repository: versions aligned
- UMD: UMD-2 still has the old version, to be updated soon
- EPEL5 and EPEL6: only the resource BDII
- Other packages by the end of the year
WLCG baseline updated recently: 5.2.21-1
- Still many sites with the previous version
GLUE2 validation
- MW: now done, still some consistency issues for storage attributes between different implementations (see slides)
- Sites: still quite a lot of errors due to publishing of obsolete attributes, should be reduced with last version of BDII
- Also many errors due to MW issues: followed up with developers
- GLUE2-validator will be used by EGI in a Nagios probe
New webpage with all information related to information system: gridinfo.web.cern.ch
- Sysadmins, users, developers...
GLUE1 retirement plan (EGI)
- 2014Q1: assess/fix MW clients for correct work with GLUE2 data
- May 1, 2014: decommission GLUE1 (doesn't mean it will not be published anymore but new services like cloud will not appear)
- Will require using ginfo rather than lcg-info or lcg-infosites but ginfo is not working with GLUE1
- Impact on OSG? To be followed up by Ops Coord TF
WLCG IS service still being deployed but still in a prototype phase
- Progresses may be presented at a future GDB
Discussion
- MJ: Retirement and impact on OSG. Long standing issue. Is there any progress?
- MAP: No. No plans in OSG to use Glue2. Perhaps
- ML: when the experiments need it then effort will be provided. That was the management response from OSG.
- Lothar Bauerdick (LB): There is no request for Glue2. Thinking about having a compatibility layer to cover those things we got from Glue1. Not planning to make a transition but may be revised if something really depends on GLUE2. Our general direction is to reduce dep on global information services.
- MAP: Glue1 will remain but will not fix it.
- SB: The key thing is that new developments will be in Glue2. EGI is looking at publishing Cloud services and these will only be in Glue2.
- MAP: In next TF meeting we’ll discuss the VO feedback on the use of ginfo.
- MAP: ginfo will not work with OSG resources as it uses Glue2.
Storage WG Report - W. Bhimji
Storage interfaces: now concentrating on replacement interfaces for disk-only sites (dav, xrootd)
- ATLAS exploring dav deletion with Ruccio
- ATLAS also testing http put with spacetokens for stageouts
- LHCb integrating with xroot and http/dav
- Need to ensure that both xroot and http/dav are registered in GOCDB in the near future
- Probably time for setting up a deadline for RFIO retirement: currently progressing slowly
Space tokens: Atlas use decreasing and could probably live without it (through implementation of Ruccio quota)
- Need to ensure non-ATLAS ST use is covered too
Benchmark/IO pattern activity continuing.
- Mainly based on EOS. Harvesting log info proving interesting.
WAN transfers:
- ALICE has a lot of interesting real data in MonAlisa: failover to remote site working well
- CMS AAA: 2/7 T1, 39/51 T2
- Main goal is failover and support of diskless sites
EOS now has
WebDav and http support
Davix: toolkit for optimized remote I/O
- Not another http library
- Support all http based protocols (S3, WebDav, CDMI)
A future
F2F meeting as a pre-GDB with a wider attendance would be useful
Discussion
- Michel: on adding xrootd to GOCDB, I thought this was decided.
- Wahid: If it is then not all sites are doing it, even the ones being used.
- Michel rfio decommissioning - what is current usage by exps? can we get a status on use of rfio (ie move to xrootd as default access protocol)?
- Alessandro: Atlas his taking a relaxed approach: not many sites left which have rfio as preferred access protocol.
- Maarten - can we move to disabling rfio by default at some time in the future?
- Oliver: DPM team to understand site implications (ie can rfio be turned off for access but not for internal use?)