Summary of GDB meeting, December 11, 2013 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=251192
Welcome - M. Jouvin
Next meetings
- March planned outside CERN: proposal to host it must be sent quickly to Michel
- Pre-GDBs expected for January and February, probably one in March (outside CERN)
VO-based SAM tests: potential scheduling issue as test jobs are run in competition with production jobs
- No easy solution
- Do we care about the status of a site when the VO has no share available?
- Agreement that we need consistency in the handling between VOs: all VOs seem happy not to consider a timeout as a failure
Security
Identity Federation pre-GDB - R. Wartel
25 participants
Identity federation
- Follow-up discussions on the pilot
- Not only a technical problem: the devil is in the details!
Need a policy for "attributes for WLCG": which attributes are required from the users, when/how they should be released
- Persistent ID
- Back channel to the user (e.g. email address)
- VO membership and roles
- Real name: not to be released except under very special circumstances like incident handling
We should accept losing the Common Name in credentials
- Requirements to comply with data privacy policies
Transfer LoA to VOs
- Make it easier for existing IdPs to contribute to WLCG
WLCG should build on existing federations, generally based on NRENs
- eduGAIN seems the appropriate forum
Building blocks have been identified and exist, and several successful pilots were done, but it is unclear how to fit them together
- Concentrate on web applications, as CLI support is difficult given the current lack of ECP availability
- Would be useful to have a pilot app from an experiment, like downloading a file with gridftp from a web portal
- Use the experience gained with a few pilot apps to build a strawman architecture
New Authorization Profile - D. Kelsey
Current authorization profile is based on ID vetting done by CAs: works well for structured communities like WLCG but doesn't fit all communities and IdPs, nor stronger authorization needs
IGTF proposal: Identifier-Only Trust Assurance with Secured Infrastructure Authentication Profile (IOTA)
- Lower ID vetting by CAs
- Transfer level of assurance to communities/VOs
- Persistent unique identifier: the unique identifier will never be reused for a different user/entity
- Generated by authorities using secured and trusted infrastructure
- Renewal or re-keying must ensure that the entity requesting the renewal is the same as the original one
- Nothing prevents an entity from requesting several certificates with different identifiers, but this is already the case
- Recommendation to issue long lived certificates
- Authorities are required to collect only the user data necessary to ensure ID uniqueness, not full traceability
- Must be used with complementary services/information managed by other authorities doing the strong vetting process
- No requirement for incident handling
- Current draft uploaded to GDB agenda (original page not publicly accessible)
- Input from CILogon Basic and UK SARoNGS
- Accept federations/IdPs that do not perform F2F identity vetting (photo-ID)
- Accept federations/IdPs that refuse to release common names
Many operational issues
- Identity vetting: for WLCG VOs could be done by CERN HR
- Are sites accepting not to have the common name?
- The VO should know the name of the user and have a way to contact them if necessary
- IOTA CAs may operate trusted credential repositories that can be used by other services
- Will be difficult for sites to have a mix of IOTA-compliant VOs and VOs that are not
TERENA about to call for tender for the renewal of TCS service: IOTA may be added as a requirement for the next version of the service
EGI core services provisioning and UMD decommissioning plans - P. Solagna
UMD2: security updates only
- End of support has been fixed at the end of April 2014: according to policy this means upgrade by end of May
- Jan. 31: First broadcast about EMI-3 upgrade
- Feb. 28: start of alarms to sites
- May 31: sites not upgraded eligible for suspension
- At least service downtime
- UMD3: no specific problem known, except the VOMS client memory leak on the WN
- Problem not fully understood yet
- Not considered as a showstopper, workaround available for affected VOs (mainly Atlas)
New core service provisioning will start May 1st
- Affected services are services run by EGI.eu that will be run by NGIs in the future
- Handover process, where applicable, is starting
- Message brokers network will be provided by GRNET and SRCE
- The main change is that CERN will not contribute anymore
- Not much impact on the service except that the test infrastructure will be discontinued
- Accounting: still provided/operated by STFC but with only one instance
- Also 2nd level of support for APEL by STFC
- Accounting/metrics portal: no change
- SAM central service: SRCE, GRNET + CNRS in replacement of CERN
- Also a migration plan to a new availability/reliability calculation engine for EGI
- CERN will continue to operate the central SAM until the end of handover by CNRS
- Monitoring central tools (e.g. probes for MW decommissioning and security): SRCE, GRNET
- CERN stopping its contribution
- GOCDB: no change (STFC)
- Operational support services (COD, catch-all services, dteam): GRNET (catch-all VOMS/CA, dteam) + CYFRONET (COD)
- SARA stopping its contribution
- Security tools: NIKHEF + CESNET + SRCE/GRNET for security-related Nagios components
- Security coordination: FOM, STFC as currently + SNIC
- Security training to be funded by EGI mini-project (link with EGI 6-month extension)
- UMD criteria: will continue as today by CSIC + FCTSG (IberGrid)
- No new criteria added in 2014
- Reducing the testbed
- Staged rollout coordination: IberGrid rather than LIP (CSIC added)
- No expected changes despite a small effort reduction
- SW provisioning tools: CESGA in addition to GRNET
- UMD repositories and release tools
- Release management still handled at EGI.eu
- Minimum requirement for availability reduced to 90%
- Helpdesk
- KIT committed to continue support/operation/development for GGUS
- 1st and 2nd level support: CESNET + IBERGRID instead of CESNET + KIT + INFN, both levels merged; SAM and APEL maintained by the sites operating them
Plan for core services provisioning is independent of the EGI-InSPIRE extension: funded by EGI partners
Discussion on the impact on WLCG of further reduction of services provided by EGI
- Maarten: WLCG has been relying on UMD provisioning and the 1st/2nd level support provided by EGI; it will be difficult to live without them
- No consensus on whether WLCG can live without them in the future but WLCG will continue to use them as long as they are available
Actions in Progress
Ops Coord Report - A. Sciaba
TF changes
- New TF approved on multicore deployment, led by A. Forti and A. Perez-Calero
- Focused on grid resources
- 2 TFs cancelled: data access and dynamic data placement
SL6 TF completed
- 92% of the infrastructure upgraded at the end of October
- EMI3 WN usable
CVMFS
- ALICE deployment going more quickly than anticipated: deadline advanced to the end of this year
- A few operational issues (in particular with caches): SAM probe being added
- New baseline version: 2.1.15
glexec
- 30 sites still have to deploy it (some not yet migrated to SL6)
- ATLAS and ALICE need developments to use it
perfSONAR
- 3.3.1 upgrade deadline: April 1st
- Sites with older version already received tickets
FTS3
- Service stable in the last 2 months
- 30% of production transfers for ATLAS, 100% for LHCb
- Deployment scenario: single instance favoured
Machine/job features
- Recent contact about I. Sfiligoi's development in CMS to avoid the time wasted by draining
- Involves bi-directional communication between machine and pilot
- Intention to merge both approaches
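For context, the machine/job features mechanism discussed here exposes per-machine and per-job metadata as one small file per key, in directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables. Below is a minimal sketch of how a pilot might read them; the key names shown (hs06, wall_limit_secs) follow the draft HEPiX specification and are assumptions, as actual keys and their availability vary per site.

```python
import os

def read_features(env_var):
    """Read all feature key files from the directory pointed to by
    env_var ($MACHINEFEATURES or $JOBFEATURES), one file per key."""
    features = {}
    directory = os.environ.get(env_var)
    if not directory or not os.path.isdir(directory):
        return features  # site does not provide this feature directory
    for key in os.listdir(directory):
        path = os.path.join(directory, key)
        if os.path.isfile(path):
            with open(path) as handle:
                features[key] = handle.read().strip()
    return features

machine = read_features('MACHINEFEATURES')
job = read_features('JOBFEATURES')
# A pilot could, e.g., compare the remaining wall-time budget with the
# expected payload duration before pulling in another job
print(machine.get('hs06'), job.get('wall_limit_secs'))
```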
IPv6
- CMS has started testing with promising results
- ATLAS about to start DDM tests
WMS decommissioning
- Still a small usage in CMS for analysis: migration to be done
- LHCb: still 20 sites to move to direct submission
MW readiness
- TF will have a kick-off meeting tomorrow...
Christmas break: all experiments will continue some activity but are happy with the best-effort support provided by most sites during this period
SHA-2 readiness - M. Litmaath
Main concern is SEs
- Last EGI OMB, Nov. 28: 5 dCache, 7 StoRM to be done
- OSG: BNL planned Dec. 17, FNAL just started, aiming to be ready at the end of the month
- dCache SRM client needs to be updated to 2.6.12 to be able to handle SHA-2 host certs
- Not necessarily urgent...
- Last client being released as part of EMI3 update 11, EMI2 update being prepared
Experiment frameworks: lots of testing, no problem found so look ready for SHA-2
- Tested with the CERN test CA; needs to be confirmed with the first real user certs
By mid-January, WLCG infrastructure should be ready
- Unlikely that SHA-2 cert will appear before next year
VOMRS: doesn't support SHA-2 and is not maintained anymore
- Should be replaced by VOMS-Admin, but its different GUI and CLI have delayed adoption by experiments
- VOMS-Admin test setup started a few weeks ago
- Loaded with VOMRS data
- Some instabilities being investigated
- In the meantime (workaround), a SHA-2 certificate can be uploaded as a secondary cert in VOMRS if the user has a SHA-1 certificate
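As an illustration of the readiness checks above, here is a minimal sketch for determining whether a given certificate is SHA-1 or SHA-2 signed, using the Python cryptography package (not part of the WLCG tooling discussed here); the file name is a placeholder.

```python
from cryptography import x509
from cryptography.hazmat.backends import default_backend

def signature_algorithm(pem_path):
    """Return the hash algorithm used to sign an X.509 certificate,
    e.g. 'sha1' for legacy certs, 'sha256'/'sha512' for SHA-2."""
    with open(pem_path, 'rb') as handle:
        cert = x509.load_pem_x509_certificate(handle.read(),
                                              default_backend())
    return cert.signature_hash_algorithm.name

# 'usercert.pem' is a placeholder path
algo = signature_algorithm('usercert.pem')
if algo == 'sha1':
    print('SHA-1 signed: a SHA-2 cert can be added as secondary in VOMRS')
else:
    print('SHA-2 signed (%s): requires SHA-2-ready services' % algo)
```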
Migration to new generation of DM clients - A. Alvarez Ayllon
gfal 2.3.0 released
- GFAL1 in maintenance mode
- ABI and API incompatible with GFAL1 but many advantages (protocol independence, fewer dependencies...)
- 2.4.8 about to be released: mainly FTS3 related changes
- 2.5 will bring LFC registration support and multiple BDII support
gfal-utils ready for testing: feedback needed
- Replaces lcg-utils
- No replacement for lcg-stmd
- Partial replacement for LFC related commands
- lcg-cr is a 2-step operation now
Released only in EPEL
lfn:// deprecated and replaced by lfc://
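To illustrate the new client generation, here is a minimal copy sketch using the GFAL2 Python bindings, which ship alongside the gfal-utils CLI; the URLs are placeholders and the parameter values are arbitrary.

```python
import gfal2

# One context per process; plugins make the operation protocol-independent
ctx = gfal2.creat_context()

params = ctx.transfer_parameters()
params.overwrite = True    # replace the destination if it already exists
params.timeout = 300       # seconds; arbitrary value for illustration

# Placeholder URLs: any protocol with a GFAL2 plugin (srm, gsiftp,
# xroot, file, ...) can be used on either side
ctx.filecopy(params,
             'gsiftp://source.example.org/path/file',
             'srm://dest.example.org/dpm/example.org/home/vo/file')
```

The gfal-utils CLI equivalent is gfal-copy; the old lcg-cr copy-and-register becomes this copy followed by a separate LFC registration step.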
Discussion
- Claudio: any plan for a GFAL2 plugin for ROOT to replace the GFAL1 one?
- Oliver: no known plan, to be further rediscussed if there is a real need
- LHCb: will move to new DM clients as part of the effort in progress to move to native FTS3 clients
- CMS: no firm plan yet but no major difficulty expected
- ATLAS: to be checked offline
Action
- Experiments must update the development team with their concrete plans to move to the new DM clients (GFAL2 and FTS3)
Network
LHCOPN/ONE Report - E. Martelli
New T1s:
- KISTI ready to connect to LHCOPN with a 2 Gb/s link
- Russia: problems connecting to LHCOPN; aiming at connecting to LHCONE first using a connection to Starlight in Chicago... Not optimal
OPN evolution
- Triggered by I. Fisk presentation on evolution of computing models at WLCG workshop
- Discussion about opening LHCOPN to T2s and the possibility of merging LHCOPN and LHCONE
- Workshop planned on Feb. 10-11
- Overlap with Ops Coord F2F: adjust agendas to minimize overlap
LHCONE
L3VPN: 50 sites connected
P2P service: on-demand service available at most providers in Europe and US, would like to demonstrate their service with LHC use cases
- Looking for sites wanting to be involved in this effort
- Also need integration with the experiment SW
- Coordinated by M. Ernst and the Network WG
100G transatlantic link just put into production
- Currently demonstrating that it can be used
IPv6 WG Report - D. Kelsey
IPv6 usage taking off: Google reported 2.5% and growing
New testbed sites: good site coverage now but still very few machines
- Also increased involvement from all LHC experiments, CMS being the most active
- Lots of testing activities
WLCG Ops Coord TF created: working together
- Focusing on some concrete use case: see slides
- Dual stack being tested at Imperial College
IPv6 file transfers for the last 8 months (T. Wildish, CMS)
- 1 GB file with gridftp (UberFtp)
- Monitoring time to transfer and errors
- Good success rate (87%) considering that this is pure best-effort
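A sketch of the kind of measurement described above, assuming hypothetical URLs and using globus-url-copy as a stand-in GridFTP client (the actual CMS harness used UberFTP and is not shown in the source).

```python
import subprocess
import time

# Placeholder URLs; any GridFTP endpoint reachable over IPv6 would do
SRC = 'gsiftp://ipv6-host.example.org/store/test/1GB.dat'
DST = 'file:///tmp/1GB.dat'

attempts, failures, durations = 20, 0, []
for _ in range(attempts):
    start = time.time()
    # globus-url-copy <source> <destination> is the basic copy invocation
    if subprocess.call(['globus-url-copy', SRC, DST]) == 0:
        durations.append(time.time() - start)
    else:
        failures += 1

if durations:
    print('success rate: %.0f%%, mean transfer time: %.1f s'
          % (100.0 * len(durations) / attempts,
             sum(durations) / len(durations)))
```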
CMS
PhEDEx/FTS3/DPM
- Dual-stack FTS3 server at Imperial College
- 2 IPv6-only DPM at Imperial and Glasgow
- No problem: PhEDEx is production ready on IPv6
Stress testing important to find problems
- E.g. FNAL had been using IPv6 for more than a year without noticing a problem on their border router due to a misconfiguration
- Useful for a site to be in the testbed!
Dual-stacked production tests: Imperial configured almost all their services (including core services like DNS, NFS, SSH) as dual-stack
- Using stateless autoconfiguration (SLAAC)
- No problem observed: no need to turn off IPv6
SW and tools survey in progress: need to cover all apps
- When IPv6 readiness is known, it can be registered; otherwise it needs to be investigated further
- http://hepix-ipv6.web.cern.ch/wlcg-applications
- DPM, StoRM, dCache all work in some configuration
- May require some specific configuration
- Latest Globus (5.2.5) fixes the globus-ftp-client issues found by the WG at the beginning of its tests
- xrootd: will have to wait for v4, expected at the beginning of next year
- Batch systems: many known problems, work in progress
Testing activities planned for 2014
- Try more T2s with dual-stacked services
- Glasgow, CERN, KIT deploying larger test clusters
- Decide a target date for large deployment of dual-stacked services?
Next
F2F meeting in Spring/Summer at CERN
Conclusion: good progress despite limited effort
- Sites encouraged to join: contact Dave
IPv4 depletion foreseen during 2014, based on current usage and the growth seen in the last year
CERN approach to IPv6 decided one year ago
- All machines/devices dual-stacked when possible
- True for GPN only, can be done on experiment networks when they want, probably not done on technical network (accelerators) before next LS
- Every device with an IPv4 address has an IPv6 address in CSDB, independently of the real use of it
- DynDNS for dynamic/portable devices
- Identical performance to IPv4
- True everywhere except for the IPv6 firewall bypass (expected next year)
- Same provisioning tools
- True for the main tools: cfsmgr, CSDB, WebReq
- Same network services
- True but with some restrictions
- IPv6 address returned by DNS only if querying the ipv6 zone (to avoid timeouts with devices lacking IPv6 connectivity), until the device is flagged as IPv6 ready (see the resolution sketch after this list)
- Common security policies
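The DNS behaviour above can be observed from any client; a small sketch, with hypothetical hostnames, showing how to compare the A and AAAA records a resolver returns.

```python
import socket

def addresses(hostname):
    """Return the IPv4 (A) and IPv6 (AAAA) addresses DNS publishes."""
    v4, v6 = set(), set()
    for family, _, _, _, sockaddr in socket.getaddrinfo(hostname, None):
        if family == socket.AF_INET:
            v4.add(sockaddr[0])
        elif family == socket.AF_INET6:
            v6.add(sockaddr[0])
    return v4, v6

# Hypothetical names: until a device is flagged IPv6-ready, only a
# query against the ipv6 zone would be expected to return its AAAA record
for host in ('somehost.cern.ch', 'somehost.ipv6.cern.ch'):
    try:
        v4, v6 = addresses(host)
        print(host, 'A:', sorted(v4), 'AAAA:', sorted(v6))
    except socket.gaierror as exc:
        print(host, 'lookup failed:', exc)
```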
dhcpv6
- For static and dynamic (portable) devices
- IPv6 address returned only if the device is flagged as IPv6 enabled in CSDB
- Registration of unknown portables can only be done through IPv4, but they can use IPv6 after registration
- Known issue with CERN MAC address authentication, as the dhcpv6 client doesn't have to use the MAC address of the interface. Will be fixed by a new RFC.
The IPv6-ready flag triggers the opening of the appropriate firewall rules for IPv6
Next steps
- January 2014: deploy SW for main firewall, training for support, user information
- January: dhcpv6 for the IT department
- February: dhcpv6 for static devices
- March: dhcpv6 for dynamic devices
- General IPv6 availability at the end of 2014Q1
WLCG Global Service Registry - M. Alandes Pradillo
VO information systems are the authoritative source for VO information
- Partly duplicating BDII information
- Some specific VO information about services like internal VO names
- Full control by the VO
GSR is an attempt to provide VOs with a central authoritative source of information, hiding the different sources and avoiding the duplication of effort by each VO
- Dynamic aggregation of different sources
- Unique entry point, single interface
- Caching, including the ability to fix problems in information published by sources
- No intention to replace VO configuration databases/sources but to simplify their maintenance and increase their consistency
ALICE and CMS are currently not interacting with information system services (GOCDB, OIM, BDII) but are interested in GSR
- ALICE has no effort available in the short term
- ATLAS has been strongly involved in the prototype phase and will keep on integrating GSR into AGIS
- In fact GSR has been designed with the AGIS use case in mind...
Positive feedback after the first prototype, but questions remain about what the authoritative sources for VOs are
- If a VO doesn't trust GOCDB, this will not work if it is used as a source by GSR
- Need a real use case to make further progress
- Need to evaluate the cost/motivation of VOs to migrate to GSR
- The effort to move is not big; it is pretty simple
Next steps proposed after this discussion: ATLAS and CMS should try to put some effort into evaluating GSR and into attempting to integrate it into their VO information systems, for a better assessment of benefits and shortcomings.
- Specific meetings already exist to discuss this with VOs
HEPiX Report - H. Meinhard
Reminder: open to everybody interested (mainly sysadmins and service managers)
- No formal procedure to apply: just register to the mailing list and to workshops
Last workshop in Ann Arbor, Michigan
- 115 participants: record for North-American meetings
- ~half from the US: many new faces from universities and T2s
Networking & Security
- 100G now available for WAN and several successful tests demonstrating efficient usage
- BNL mentioned looking at IPoIB as an alternative to 10G for the WNs: cost advantage, better performance
Storage
- openAFS: complex situation, 2 companies providing closed-source versions where most new features are added; no IPv6 support foreseen, but now confidence that it is not really needed in HEP (access restricted to within the site)
- CEPH: very promising, several pre-prod services including CERN and RAL
- Currently mainly distributed object storage (block devices)
- Very interesting talk from WD about drive reliability and new features planned for improving predictions: see slides!
Batch systems: situation now clearer
- Several large sites moved to UNIVA GE
- Oracle sold all assets to UNIVA
- Little uptake at scale for open-source projects
- Several sites looking at HTCondor: scalability seems impressive
- RAL moved its production CE, CERN investigating
- SLURM: several disappointing experiences
- Not so good scalability with a high number of jobs: more focused on a high number of nodes (large HPC clusters)
Configuration management: Puppet is the clear winner nowadays, adopted by most sites starting with a configuration tool
- Other configuration management tools (CFEngine, Quattor, Chef) still present
- WG in charge of establishing best practices and promoting collaboration between sites in module development
Several WGs in HEPiX
- IPv6: see Dave's talk
- Benchmarking: lots of results collected, new CPU benchmark from SPEC expected for October 2014
- Need to prepare for a new HEP benchmark, starting now: long process to validate the benchmarks, requires experiment participation; need to identify the people wanting to contribute
- Discuss boundary conditions: OS and compiler versions, optimisation level...
- Configuration management: see above
- Bit preservation: technical advice on bit preservation as an input to the DPHEP project
- Energy efficiency: nothing at Ann Arbor but W. Shalter in charge of a new attempt at the next meeting in Annecy
Next HEPiX meetings
- LAPP, Annecy, May 19-23
- Fall 2014: Univ. of Nebraska, dates to be defined soon
- Spring 2015: a candidate site in Europe identified
- Proposals to host meeting always welcome
--
MichelJouvin - 03 Jan 2014