Summary of GDB meeting, November 14, 2012
Agenda
http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155074
Attendance
Local
- Adrien Devresse
- Alberto Di Meglio
- Alberto Masoni
- Alessandro De Salvo
- Alessandro Di Girolamo
- Andreas Petzold
- Antonio Perez
- Borut Kersevan
- Dave Kelsey
- David Colling
- David Smith
- Eric Lancon
- Frederique Chollet
- Guenter Grein
- Hans Von Der Schmitt
- Helge Meinhard
- Ian Bird
- Ikuo Ueda
- Jeremy Coles
- John Gordon
- John Green
- Laurence Field
- Luca Dell'Agnello
- Maarten Litmaath
- Maite Barroso Lopez
- Maria Alandes
- Maria Dimou
- Maria Girone
- Marian Babik
- Markus Schulz
- Mattias Wadenstein
- Michal Simon
- Michel Jouvin
- Philippe Charpentier
- Renaud Vernet
- Ricardo Rocha
- Rod Walker
- Romain Wartel
- Simone Campana
- Stefan Roiser
- Stephen Gowdy
- Ulf Tigerstedt
- Yannick Patois
Remote
- Alessandra Forti
- Andrew Elwell
- Christoph Wissing
- Christopher John Walker
- Claudio Grandi
- Doina Cristina Aiftimiei
- Ewan Mac Mahon
- Fabrizio Furano
- Jeff Templon
- Leslie Groer
- Matt Doidge
- Peter Gronbech
- Peter Solagna
- Stephen Burke
- Tiziana Ferrari
Introduction
See slides.
December GDB: there will be a pre-GDB on storage AAI.
- Probably only 1/2 day
- Doodle poll running if you are interested
Action updates: see slides
- glexec deployment status
- RAL to have their new CEs publish the "glexec" capability (fixed).
- ATLAS to investigate why KIT was no longer receiving any SAM CE tests (fixed).
Forthcoming meetings: see slides
GGUS Update
Report Generator - G. Grein
New feature usable on demand
- Access requires a special authorization: ask for it with a ticket
Several reports available
- Response time, solution time, status, ...
- Status: by any status available in tickets (one at a time), including meta-statuses (e.g. opened)
- Time: working days only, ability to exclude the time waiting on submitter
- Ability to access individual tickets from the report summary when the result list contains individual tickets
- Not yet possible to select the ticket on a site basis to get a site view
- Will come in the future, in particular for T0 and T1 sites; to be investigated how to manage the huge list of sites
Export possible in CSV format
Result list can be searched for specific values in all the columns.
Status: still in development, send feedback, changes may happen...
Discussion
- A ticket can go directly from "Assigned" to "Waiting for reply" without passing through "In progress"; this will not disrupt the report analysis
- The exact time zone associated with a support unit will be exposed at some point, e.g. in the FAQ
- The Ticket Timeline Tool (TTT) can be used per site and enhancements can be discussed, e.g. numerical output, T3 support
Discussions about the future and the interfaces of all reporting tools used in our community: GGUS, Savannah, JIRA, Trac, SNOW...
- Highly linked with Ops Coordination: report at each fortnightly meeting
- Use the weekly GGUS developer meeting for detailed discussions
Some hot topics
- SHA-2 support in GGUS
- ALARMS and site email notification
- Future of Savannah-GGUS bridge: meeting scheduled beginning of December
- CMS announced its intention to use GGUS only
- Future of Savannah itself and transition to JIRA: meeting planned mid-December
Discussion
- JIRA and Trac developments are to be discussed in the IT Technical Users Meeting (ITUM) and the IT-Experiment meetings
- Maria Dimou will join when needed
- JIRA itself is not open source, but can work with other products that are
Actions Update
EMI Migration and SHA-2 Proxy Support - P. Solagna
Policy: see last GDB
EGI security dashboard extended to raise alarms in case of unsupported MW
- Used by EGI CSIRT and COD
- COD opens tickets against sites
- Some issues because some components didn't publish their version correctly: CREAM, dCache, VOMS
- Generally related to customized info providers
Issues with Quattor templates, which became available only in late October
- Everything is available now (except MyProxy)
Status
- 120 sites (210 services) with unsupported services deployed
- 90 planning to upgrade by end of Nov.
- 8 with late plans (after mid-December)
- 19 sites didn't answer: suspensions will start at the beginning of next week
DPM/LFC will be unsupported from 1st of December
- Deadline to upgrade: January 31
WN timeline to be defined in the coming days: likely end of Nov
- After the release of the last WN in UMD, expected next Monday
EMI-1 will reach end of life at the end of April
- Warning alarms will be triggered in the operations dashboard starting in January
- Critical alarms starting in March
- Alarms will be handled by NGI RODs with central escalation if necessary
SHA-2 support
- Product assessment: CREAM and StoRM readiness is being developed as part of EMI-3, but a backport to EMI-2 might be possible afterwards
- DPM and LB not reported yet (both should already be OK)
- EGI document public and updated regularly
- UMD will soon start testing SHA-2 support of components, with alerts sent to developers in case of failure
- EGI planning a Nagios probe to check SHA-2 support in deployed MW during the winter
Security WGs - R. Wartel
Traceability WG: ensure there is enough traceability on CE and SE and that they use standard services (syslog)
- A questionnaire is about to be issued
- SE: next pre-GDB
Identity Federation pilot: non-browser pilot
- A CLI login tool is needed to replace voms-proxy-init: CILogon (USA) and EMI STS are the 2 candidates
- Both are based on SAML ECP but the implementations are very different and incompatible
- Make them compatible: not easy, investigation by experts in progress but both projects have very limited resources
- Now would be better than later, seems doable
- Short-term plans
- Set up a pilot with EMI STS at CERN
- Conduct a survey to see how many IdPs support SAML ECP in EU
- Identify the attributes that must be pulled from IdPs
Discussion
- EU sites currently cannot use CILogon; it is only enabled for US sites
- Documentation on how to set up syslog etc. will be done in the Traceability WG
Information System - M. Alandes Pradillo
BDII development status: no major incident, no release since August
- EMI-1/2/3 versions aligned
- Updates only to EMI-2
- Easy upgrade path from EMI-1 to EMI-2 on SL5
Next release planned in December
- EPEL compliance
- EMIR integration
- ARC integration
- glue-validator improvements
- Service information provider bug fixes
Top-level BDIIs and failover: after discussion with EGI, underperforming top-level BDIIs in small NGIs will be decommissioned
- WLCG recommendation is to use standard top-level BDII in the NGI with possible failover to another well-performing BDII
- Good cooperation between WLCG and EGI
- Most sites still configure only one top-level BDII but most of these top-level BDIIs are 100% available
- No clear policy for LCG_GFAL_INFOSYS: still being discussed with EGI
GLUE2 publication: 99 sites already publishing, 59 not publishing
- 1/3 of these non-publishing sites are part of EGI: should make progress as part of the ongoing upgrade effort
Multicore support: is there anything else needed from IS?
- No further reactions after the proposal made during the summer...
Service discovery requirements discussed with experiments: see slides
Top-level BDII usage statistics now available
- Based on LDAP logs
- Broken out by type of requester (WN, WMS, ...)
- Most queried attributes: no attribute, GlueSE, GlueSA, GlueCE
- Conclusions are difficult to extract... except that BDII is heavily used
- Cached BDII has improved the perceived stability
What next?
- Monitor quality of information: part of GLUE2 effort
- Common query tool for OSG and EGI could be ginfo
- Some missing info (compared to lcg-info) reported by experiments
- But OSG has no plans for GLUE2...
- Possibility of implementing service discovery with EMIR
EMIR now available
- Provides references to services
- Services are the authoritative source
- 3 components: service publisher, domain publisher, global registry
Requirement for GLUE-2 publication by OSG: needs to be rediscussed with all the parties involved
Discussion
- LCG_GFAL_INFOSYS is ordered by decreasing preference, i.e. not round-robin or random (see the sketch after this list)
- Tools need to use it correctly
- It seems to work OK e.g. at many UK sites
- ATLAS are using the BDII for EGI and OSG info; would prefer one tool/place also in the future
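Illustration (not presented at the meeting): a minimal Python sketch of the intended ordered/failover behaviour, assuming the usual comma-separated LCG_GFAL_INFOSYS format and the default BDII port 2170; host names are hypothetical.

    # Minimal sketch (hypothetical hosts): pick the first reachable top-level BDII
    # from LCG_GFAL_INFOSYS, which lists endpoints in decreasing order of preference.
    import os
    import socket

    def pick_bdii(default="topbdii.example.org:2170"):
        endpoints = os.environ.get("LCG_GFAL_INFOSYS", default).split(",")
        for endpoint in endpoints:
            host, _, port = endpoint.strip().partition(":")
            try:
                # Cheap reachability check on the LDAP port (2170 by default)
                with socket.create_connection((host, int(port or 2170)), timeout=5):
                    return host, int(port or 2170)
            except OSError:
                continue  # endpoint unreachable: fall back to the next one in the list
        raise RuntimeError("no top-level BDII reachable")

    if __name__ == "__main__":
        print(pick_bdii())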
EGI GLUE2 Profile - S. Burke
EMI-2 has ~ full GLUE2 support
- Took 6 years from initial discussions
BDII implementation: merged LDAP schema for 1.3 and 2.0
- Differences with 1.3: attribute names, case sensitivity, usage of foreign keys
- New namespace: o=glue instead of o=grid (see the sketch below)
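Illustration (not presented at the meeting): a minimal Python sketch using the ldap3 library, with a hypothetical BDII host, showing that the merged schema serves GLUE1 under o=grid and GLUE2 under o=glue on the same endpoint.

    # Minimal sketch (hypothetical host): the same top-level BDII answers GLUE1
    # queries under "o=grid" and GLUE2 queries under "o=glue".
    from ldap3 import Server, Connection

    server = Server("topbdii.example.org", port=2170)   # hypothetical endpoint
    conn = Connection(server, auto_bind=True)           # BDIIs allow anonymous binds

    # GLUE1: services published under o=grid
    conn.search("o=grid", "(objectClass=GlueService)",
                attributes=["GlueServiceType", "GlueServiceEndpoint"])
    print(len(conn.entries), "GLUE1 service entries")

    # GLUE2: the corresponding information published under o=glue
    conn.search("o=glue", "(objectClass=GLUE2Service)",
                attributes=["GLUE2ServiceType", "GLUE2ServiceID"])
    print(len(conn.entries), "GLUE2 service entries")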
No explicit steps needed to configure GLUE2 at the site level (AdminDomain)
- gLite retirement campaign helped
- 195 sites publishing GLUE2 against 228 sites publishing GLUE1
- A few issues being followed up, in particular case sensitivity
Service publication status
- Most services GLUE2 ready in EMI1, some only in EMI2
- Fraction of services published in GLUE2 roughly matching the fraction of sites with GLUE2 enabled
Explicit EGI GLUE2 profile required because the schema is intentionally very flexible
- Currently "in EGI" means "in BDII"
- Attributes are classified according to their potential use: 5 categories
- Service Discovery: static information
- Service Selection: dynamic information
- Monitoring: not used by MW
- Oversight: high-level management info like installed capacity
- Diagnostics
- Schema just defines some attributes as mandatory, others are implicitly optional. EGI profile refines 'optional'.
- Mandatory
- Recommended: should be published unless there is a good reason not to do it
- Desirable: likely to be useful if published
- Optional: no known usage but harmless to publish
- Undesirable: some negative side-effect
Validation is an integral part of GLUE2 effort
- Many mistakes and inconsistent info published in GLUE1 have been a major source of problems
- GLUE2 validation defines 4 error severities
- FATAL: value so incorrect that it invalidates the information structure, e.g. supposedly unique IDs that are not unique
- ERROR: value definitely incorrect but with a limited impact
- WARNING: value likely to be incorrect (e.g. a very large number of jobs)
- INFO: value technically correct but suspect
EGI Profile review process
- Long, technical document: will take a long time to converge
- Feedback welcome, by Dec 7 for v1.0
- Document is versioned and will evolve based on experience
- Version 1.0 expected at the end of the year
- Start validating published information by hand and progressively integrate checks into glue-validator
- Starting with important things
- Complete coverage as a goal
Goal: have GLUE2 fully usable next spring
Discussion
- GLUE 2 can properly represent capabilities etc. where v1.3 needed (or would need) hacks
- E.g. multiple CPU benchmarks, multi-core support, installed/unavailable capacities
- Lists of batch system names etc. will be collected and maintained by the GLUE WG in OGF
- Easy availability via the web for other uses (e.g. accounting)
- A generic "data dictionary" service might also be conceived in the future, hosted e.g. by EGI, OSG, ...
IPv6 Test Activities and Deployment - D. Kelsey
Address pool exhaustion now a reality in Asia and Europe
HEPiX site survey: questionnaire sent on Sept. 26, only partial coverage in the answers
- 42 sites, 20 countries
- Still time to answer
- 6 questions: see slides
- 15 sites are already IPv6 enabled
- Generally means the basic core services (DNS, web servers, email servers) are reachable by IPv6
- 2 sites already using dual stack extensively
- In fact IPv6 is active at many sites through Windows/Mac systems, where it is enabled by default and where machines are sensitive to rogue Router Advertisements
- 10 sites have plans in the coming year, including CERN (CERN reports possible IPv4 address exhaustion within 2 years)
- Concerns about IP address management and about IPv6 security
- Concerns about apps and tools (including operations tools) not being ready
Survey main conclusions
- IPv6 is there, we cannot ignore it
- Main problem is applications, tools and sites
- End systems and core networks are ready
Many tests to do with IPv6 (see the sketch after these questions):
- Does the service break/slow down?
- When is IPv6 used? Does it have a higher priority than IPv4?
- Failover from IPv6 to IPv4?
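Illustration (not presented at the meeting): a minimal Python sketch of one such test, preferring IPv6 and falling back to IPv4 if the IPv6 path does not answer; host and port are hypothetical.

    # Minimal sketch (hypothetical host/port): resolve a dual-stack service,
    # try its IPv6 addresses first and fall back to IPv4 if they do not answer.
    import socket

    def connect_prefer_ipv6(host, port, timeout=5):
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        # Put AF_INET6 results before AF_INET so IPv6 is tried first
        infos.sort(key=lambda ai: 0 if ai[0] == socket.AF_INET6 else 1)
        for family, socktype, proto, _, sockaddr in infos:
            try:
                sock = socket.socket(family, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock, family          # connected: report which family was used
            except OSError:
                continue                     # e.g. broken IPv6 path: try the next address
        raise RuntimeError("neither IPv6 nor IPv4 connection succeeded")

    sock, family = connect_prefer_ipv6("gridftp.example.org", 2811)
    print("connected via", "IPv6" if family == socket.AF_INET6 else "IPv4")
    sock.close()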
MW readiness assessment concentrated on data transfers
- Managed to get gridftp, globus-url-copy and FTS working with some patches
- DPM working
Asset survey underway: spreadsheet where application/tool readiness can be recorded
- Summary available soon on IPv6 WG wiki
- Goal: complete coverage of all the apps/MW/tools used in WLCG
- When readiness status is unknown, further investigations are needed
- Several apps already fixed (patches contributed) but some well-known apps remain problematic
- OpenAFS: no plan
- UberFTP: patches fed back
- All batch systems seem to have problems in internal communication between head node and WNs (tests made by an ARC site, ARNES, in Slovenia)
- Many IGTF CA CRLs not available over IPv6: followed up by IGTF
Recent news
- IPv6 peering established via LHCOPN between CERN T0 and NDGF T1 for dCache testing
- IPv6 @ CERN going well: available for end users who want it next year, starting to configure public-facing servers
- LFC being tested at GARR
- FTS3 being assessed: contact with developers
News from USA
- The US government mandate for public-facing servers did not apply to DOE labs but they basically met the September deadline...
- Now concentrating on IPv6 support on clients: deadline is Sept. 2014
Future work
- Coordinate tests with EMI
- Install services using standard configuration tools and strategies
- Currently only isolated services
- HEP IPv6 day next year?
- Probably not before June and only if it makes sense, i.e. if there is no well-known showstopper
NDGF T1 plans to become fully dual-stacked in the coming year
- May take more time in case problems are identified...
- Currently working on dCache readiness
Discussion
- F2F meeting in Jan, in particular about security aspects
- China: incentives ("carrots") for IPv6 use (lower cost, higher bandwidth), but IPv6 does not seem to be pushed yet in particle physics
MW Clients in Application Area
CMS
The main issue is DM clients
- Hack required for DPM rfio support: same as in gLite
- dCache: unauthenticated dcap only
- Bug being fixed in EMI-2 gsidcap
- lcg-cp time-out fixed in October EMI-2 release but not in EMI-1
- EMI-1 is in security fixes phase
From CMS SAM tests, most sites are still running gLite WNs
- EMI-2 recommended but EMI-1 is acceptable
- CMS ready to use SL6 WNs: demonstrated performance advantages
- CMS-only sites can move to SL6 WNs
- DPM and dcap tested successfully
- OSG fully supported
LHCb
All the MW clients used by LHCb are in AA
- Not relying on any components deployed by the site
- Deployment in LHCb CVMFS from LCG-AA is not specific to LHCb: follows the LCG-AA platform convention
- Now includes WM commands and FTS CLI
Platform/versions used by LHCb
- SL5 (32 and 64-bit), SL6 (64-bit)
- gcc 4.6
- Python 2.6 (SL5), 2.7 (SL6)
Happy with this approach!
ATLAS
ATLAS currently relies on what is installed on the WN, leading to many problems: want to move away from this
- MW clients in CVMFS: each release with a dedicated path
- Appropriate setup.sh script in this path
- PanDA job will select the version to use
- Also need various versions of Python
MW clients here means DM clients + VOMS clients
- Require both 32 and 64-bit
CVMFS approach was used to test EMI-2 clients without asking sites to set up EMI-2 WNs
- Built a tar ball from RPM
The problem is not to make the tar ball but to maintain it
- Should discuss who is interested and who can participate in the effort
- ATLAS can offer an infrastructure for testing (PanDA+HammerCloud)
- Not clear if it can be experiment agnostic
EMI-2 WN tar ball - D. Smith
Idea is to build a tar ball from the EMI WN RPM, bringing in all dependencies
- 141 dependencies
- tarball: 219 MB
- Also a few OS-provided RPMs that the sites must provide on the WN
- Site-provided information
- Additionally need egi-ca-policy-core and CRLs: don't plan to add them to the tar ball
- Site must configure VOMS clients
- Config for all LHC VOs could be included as standard
- A central location could be made available with up-to-date information
Issues
- glexec: unlikely to be part of the archive
- Larger scale tests to be done
Discussion
WN tar ball from EMI releases looks like an interesting approach to improve the deployment time of MW clients
- Build a CVMFS repository common to all experiments to host it?
- Would ensure immediate uptake by the vast majority of sites/resources
- Might reduce cache thrashing (as of CVMFS v2.1)
Unclear if it will fit all needs but ATLAS is interested in testing it
- LHCb would like to keep the flexibility of adding unreleased versions fixing problems, as done today using LCG-AA
- In the short-term will keep its private copy of MW clients in its CVMFS repository: will review it later
Tar ball and CVMFS distribution are alternatives to the regular WN installation; the LCG-AA work is independent
System dependencies need to be handled carefully, e.g. the change of the OpenSSL library ".so" version on SL6 (see the sketch below)
- Avoid strange job failures
- Restore "HEPOSLibs" meta package that was maintained for gLite?
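Illustration (not presented at the meeting): a minimal Python sketch of the kind of sanity check a tar-ball setup could run on a WN; the sonames are an assumption based on the SL5/SL6 OpenSSL versions.

    # Minimal sketch (assumed sonames): verify that the OpenSSL libraries expected
    # by the tar-ball binaries are installed on the WN (SL5 ships *.so.6, SL6 *.so.10).
    import ctypes

    expected = ["libssl.so.10", "libcrypto.so.10"]   # assumption: SL6 build of the tar ball

    for soname in expected:
        try:
            ctypes.CDLL(soname)                      # resolved through the ld.so cache
            print(soname, "found")
        except OSError:
            print(soname, "MISSING: binaries linked against it would fail strangely")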
The setup script currently needs write access in the area where the tar ball was unpacked
- Will be fixed to allow it to work with CVMFS
The CAs and CRLs are a complication, being looked into
Glexec configuration and dependencies might be served through CVMFS or tar ball
- Sites would just need to install the setuid binary in a convenient location
GFAL2 - A. Devresse
New features
- Support for cloud protocols in addition to grid protocols
- Protocol generic: the protocol is derived from the provided URL, with negotiation if/when possible (e.g. SRM)
- Parallel get/put operation during copy operation
- No environment variable needed: a configuration file in /etc/gfal2.d
- config directory location can be overridden with one environment variable
- GFAL1 environment variables are used if defined and the config file is not present
Components
- libgfal2: C library, set of independent plugins
- gfal2-python: simple and pythonic Python bindings (see the sketch after this list)
- gfalFS: FUSE module for GFAL2
- gfal-tools: experimental command line tools similar to lcg_util
- Feedback and suggestions welcome: prototypes available, open to discuss the level of backward compatibility needed for options and error reporting
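Illustration (not presented at the meeting): a minimal sketch of the gfal2-python bindings with hypothetical URLs; method names are as in the gfal2-python package and should be checked against the released API.

    # Minimal sketch (illustrative URLs): stat a remote file and copy it elsewhere;
    # the protocol is derived from each URL, no environment variables are needed.
    import gfal2

    ctx = gfal2.creat_context()

    src = "srm://se.example.org/dpm/example.org/home/myvo/file.root"   # hypothetical
    dst = "root://xrootd.example.org//store/file.root"                 # hypothetical

    info = ctx.stat(src)                     # POSIX-like call, protocol-agnostic
    print("size:", info.st_size)

    params = ctx.transfer_parameters()       # copy options instead of lcg-cp flags
    params.overwrite = True
    params.timeout = 3600
    ctx.filecopy(params, src, dst)           # third-party copy used when possible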
Designed to be the successor of GFAL1 and lcg_util but not 100% backward compatible
- POSIX API is backward compatible
- Python API to lcg_util: users would need to update their code to use the new GFAL2 Python bindings instead
Status: part of EMI-2 release
- Packages in EPEL
- Soon packaged for Debian
- Sources publicly available and development open to everybody
- GFAL2 is the core of FTS3
Discussion
- Source and destination URLs can have different protocols (e.g. "dcap" -> "root")
- the copy will then go through the issuing host (UI, WN)
- Third-party copies are used whenever possible
- Reimplementing lcg-cp etc. with the same behavior would be difficult
- Command line arguments would be doable, error messages would not!
- It would allow a faster uptake wherever the CLI (not the GFAL API) is being used
- Asynchronous calls: bringOnline missing, will appear soon
- Library is completely thread-safe
- C++ binding to stream object would be easy, but may be Linux specific
- GFAL2 plugin for ROOT could help reduce the list of plugins to be maintained
- Follow up in Operations Coordination WG
Wrap-Up
One forthcoming workshop was omitted earlier: DPM workshop at LAL, Dec. 3-4
Progress on several actions
- EMI migration progressing at a good pace: very successful major upgrade so far...
- short follow-up in Dec GDB
Follow-up discussion on GLUE2 publication by OSG
WLCG CVMFS repository with MW clients
GFAL2: feedback about lcg_util compatibility needed
--
Michel Jouvin / Maarten Litmaath - 14-Nov-2012