Summary of GDB meeting, November 14, 2012

Agenda

http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155074

Attendance

Local

  • Adrien Devresse
  • Alberto Di Meglio
  • Alberto Masoni
  • Alessandro De Salvo
  • Alessandro Di Girolamo
  • Andreas Petzold
  • Antonio Perez
  • Borut Kersevan
  • Dave Kelsey
  • David Colling
  • David Smith
  • Eric Lancon
  • Frederique Chollet
  • Guenter Grein
  • Hans Von Der Schmitt
  • Helge Meinhard
  • Ian Bird
  • Ikuo Ueda
  • Jeremy Coles
  • John Gordon
  • John Green
  • Laurence Field
  • Luca Dell'Agnello
  • Maarten Litmaath
  • Maite Barroso Lopez
  • Maria Alandes
  • Maria Dimou
  • Maria Girone
  • Marian Babik
  • Markus Schulz
  • Mattias Wadenstein
  • Michal Simon
  • Michel Jouvin
  • Philippe Charpentier
  • Renaud Vernet
  • Ricardo Rocha
  • Rod Walker
  • Romain Wartel
  • Simone Campana
  • Stefan Roiser
  • Stephen Gowdy
  • Ulf Tigerstedt
  • Yannick Patois

Remote

  • Alessandra Forti
  • Andrew Elwell
  • Christoph Wissing
  • Christopher John Walker
  • Claudio Grandi
  • Doina Cristina Aiftimiei
  • Ewan Mac Mahon
  • Fabrizio Furano
  • Jeff Templon
  • Leslie Groer
  • Matt Doidge
  • Peter Gronbech
  • Peter Solagna
  • Stephen Burke
  • Tiziana Ferrari

Introduction

See slides.

December GDB: there will be a pre-GDB on storage AAI.

  • Probably only 1/2 day
  • Doodle poll running if you are interested

Action updates: see slides

  • glexec deployment status
    • RAL to have their new CEs publish the "glexec" capability (fixed).
    • ATLAS to investigate why KIT was no longer receiving any SAM CE tests (fixed).

Forthcoming meetings: see slides

GGUS Update

Report Generator - G. Grein

New feature usable on demand
  • Access requires a special authorization: ask for it with a ticket

Several reports available

  • Response time, solution time, status, ...
  • Status: by every status available in tickets (only one), including meta-status (e.g. opened)
  • Time: working days only, ability to exclude the time waiting on submitter
  • Ability to access individual tickets from the report summary when the result list contains individual tickets
  • Not yet possible to select the ticket on a site basis to get a site view
    • Will come in the future, in particular for T0 and T1 sites; to be investigated how to manage the huge list of sites
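
The "working days only" and "exclude the time waiting on submitter" rules can be illustrated with a minimal sketch. This is not the GGUS implementation: it assumes working days are whole Monday-Friday calendar days, ignores public holidays and time zones, and assumes the waiting intervals do not overlap each other. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def working_seconds(start, end):
    """Seconds between start and end that fall on Monday-Friday."""
    total = 0.0
    cursor = start
    while cursor < end:
        # Process at most one calendar day per iteration.
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        chunk_end = min(next_midnight, end)
        if cursor.weekday() < 5:  # 0..4 are Monday..Friday
            total += (chunk_end - cursor).total_seconds()
        cursor = chunk_end
    return total


def solution_seconds(opened, solved, waiting_on_submitter=()):
    """Working-day time to solution, minus time spent waiting on the submitter."""
    total = working_seconds(opened, solved)
    for wait_start, wait_end in waiting_on_submitter:
        # Clip each waiting interval to the ticket lifetime before subtracting.
        clipped_start = max(wait_start, opened)
        clipped_end = min(wait_end, solved)
        if clipped_start < clipped_end:
            total -= working_seconds(clipped_start, clipped_end)
    return total
```

For a ticket opened on a Friday at noon and solved the following Monday at noon, only the Friday afternoon and the Monday morning are counted.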

Export possible in CSV format

Result list can be searched for specific values in all the columns.

Status: still in development, send feedback, changes may happen...

Discussion

  • A ticket can go directly from "Assigned" to "Waiting for reply" without passing through "In progress"; this does not disrupt the report analysis
  • The exact time zone associated with a support unit will be exposed at some point, e.g. in the FAQ
  • The Ticket Timeline Tool (TTT) can be used per site and enhancements can be discussed, e.g. numerical output, T3 support

TrackTools - M. Dimou

Discussions about future and interfaces of all reporting tools used in our community: GGUS, Savannah, JIRA, Trac, SNOW...
  • Highly linked with Ops Coordination: report at each fortnightly meeting
  • Use the weekly GGUS developer meeting for detailed discussions

Some hot topics

  • SHA-2 support in GGUS
  • ALARMS and site email notification
  • Future of Savannah-GGUS bridge: meeting scheduled beginning of December
    • CMS announced its intention to use GGUS only
  • Future of Savannah itself and transition to JIRA: meeting planned mid-December

Discussion

  • JIRA and Trac developments are to be discussed in the IT Technical Users Meeting (ITUM) and the IT-Experiment meetings
    • Maria Dimou will join when needed
  • JIRA itself is not open source, but can work with other products that are

Actions Update

EMI Migration and SHA-2 Proxy Support - P. Solagna

Policy: see last GDB

EGI security dashboard extended to raise alarms in case of unsupported MW

  • Used by EGI CSIRT and COD
  • COD opens tickets against sites
    • 261 tickets opened!
  • Some issues because some components did not publish their version correctly: CREAM, dCache, VOMS
    • Generally related to customized info providers

Issues with Quattor templates, which became available only in late October

  • Everything is available now (except MyProxy)

Status

  • 120 sites (210 services) with unsupported services deployed
  • 90 planning to upgrade by end of Nov.
  • 8 with late plans (after mid-December)
  • 19 sites didn't answer: will start to be suspended beginning of next week
    • Several warnings sent

DPM/LFC will be unsupported from 1st of December

  • Deadline to upgrade: January 31

WN timeline to be defined in the coming days: likely end of Nov

  • After the release of the last WN in UMD, expected next Monday

EMI-1 will be end of life at the end of April

  • Warning alarms will be triggered in the operations dashboard starting in January
  • Critical alarms starting in March
  • Alarms will be handled by NGI RODs with central escalation if necessary

SHA-2 support

  • Product assessment: CREAM and StoRM readiness developed as part of EMI-3 but backport to EMI-2 might be possible afterwards
    • DPM and LB not reported yet (both should already be OK)
  • EGI document public and updated regularly
  • UMD will soon start testing SHA-2 support of components, with alerts sent to developers in case of failure
  • EGI planning a Nagios probe to check SHA-2 support in deployed MW during the winter

Security WGs - R. Wartel

Traceability WG: ensure there is enough traceability on CE and SE and that they use standard services (syslog)
  • A questionnaire is about to be issued
  • SE: next pre-GDB

Identity Federation pilot: non-browser pilot

  • A CLI login tool needed to replace voms-proxy-init: CILogon (USA) and EMI STS are the 2 candidates
    • Both based on SAML ECP but implementation very different and incompatible
    • Make them compatible: not easy, investigation by experts in progress but both projects have very limited resources
      • Now would be better than later, seems doable
  • Short-term plans
    • Set up a pilot with EMI STS at CERN
    • Conduct a survey to see how many IdPs support SAML ECP in EU
    • Identify the attributes that must be pulled from IdPs

Discussion

  • EU sites currently cannot use CILogon, only enabled for US sites
  • Documentation on how to set up syslog etc. will be done in the Traceability WG

Information System - M. Alandes Pradillo

BDII development status: no major incident, no release since August
  • EMI-1/2/3 versions aligned
  • Updates only to EMI-2
  • Easy upgrade path from EMI-1 to EMI-2 on SL5

Next release planned in December

  • EPEL compliance
  • EMIR integration
  • ARC integration
  • glue-validator improvements
  • Service information provider bug fixes

Top-level BDIIs and failover: after discussion with EGI, underperforming top-level BDIIs in small NGIs will be decommissioned

  • WLCG recommendation is to use standard top-level BDII in the NGI with possible failover to another well-performing BDII
    • Good cooperation between WLCG and EGI
  • Most sites still configure one top-level BDII but most of these top-level BDIIs are 100% available
  • No clear policy for LCG_GFAL_INFOSYS: still being discussed with EGI

GLUE2 publication: 99 sites already publishing, 59 not publishing

  • 1/3 of these non-publishing sites are part of EGI: should make progress as part of the ongoing upgrade effort

Multicore support: is there anything else needed from IS?

  • No more reactions after proposal made during the summer...

Service discovery requirements discussed with experiments: see slides

Top-level BDII usage statistics now available

  • Based on LDAP logs
  • Broken out by type of requester (WN, WMS, ...)
  • Most queried attributes: no attribute, GlueSE, GlueSA, GlueCE
  • Conclusions are difficult to extract... except that BDII is heavily used
    • Cached BDII has improved the perceived stability

What next?

  • Monitor quality of information: part of GLUE2 effort
  • Common query tool for OSG and EGI could be ginfo
    • Some missing info (compared to lcg-info) reported by experiments
    • But OSG has no plans for GLUE2...
  • Possibility of implementing service discovery with EMIR

EMIR now available

  • Provides references to services
    • Services are the authoritative source
    • 3 components: service publisher, domain publisher, global registry

Requirement for GLUE-2 publication by OSG: needs to be rediscussed with all the parties involved

Discussion

  • LCG_GFAL_INFOSYS is ordered by decreasing preference, i.e. not round-robin or random
    • Tools need to use it correctly
    • It seems to work OK e.g. at many UK sites
  • ATLAS are using the BDII for EGI and OSG info; would prefer one tool/place also in the future
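
The "decreasing preference" semantics of LCG_GFAL_INFOSYS can be sketched as follows. This is only an illustration, not the code of any existing client: it assumes the variable holds a comma-separated list of host:port entries, that a missing port defaults to 2170, and that the caller supplies its own `is_usable` check. The function names are hypothetical.

```python
def parse_infosys(value, default_port=2170):
    """Split a comma-separated 'host:port' list, preserving its order."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else default_port))
    return endpoints


def first_usable(endpoints, is_usable):
    """Return the first endpoint accepted by is_usable, in list order.

    The list is walked strictly from most to least preferred: there is
    no round-robin and no random choice.
    """
    for endpoint in endpoints:
        if is_usable(endpoint):
            return endpoint
    return None
```

A tool following this logic only reaches the second entry when the first one is unusable, which is what makes a well-performing failover BDII at the end of the list harmless in normal operation.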

EGI GLUE2 Profile - S. Burke

EMI-2 has nearly full GLUE2 support
  • Took 6 years from initial discussions

BDII implementation: merged LDAP schema for 1.3 and 2.0

  • Differences with 1.3: attribute names, case sensitivity, usage of foreign keys
  • New namespace: o=glue instead of o=grid
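
Because both schemas live in the same BDII, a client chooses between them through the search base. The sketch below builds an `ldapsearch` command line for either namespace. It assumes the usual BDII port 2170 and uses the GlueService and GLUE2Service object classes as example filters; the helper name and the host in the usage note are hypothetical.

```python
def ldapsearch_command(host, glue_version, port=2170):
    """Build an ldapsearch argument list for GLUE 1.3 or GLUE 2.0 services."""
    if glue_version == 1:
        base, ldap_filter = "o=grid", "(objectClass=GlueService)"
    elif glue_version == 2:
        base, ldap_filter = "o=glue", "(objectClass=GLUE2Service)"
    else:
        raise ValueError("glue_version must be 1 or 2")
    return [
        "ldapsearch", "-x", "-LLL",        # anonymous bind, plain LDIF output
        "-H", f"ldap://{host}:{port}",
        "-b", base,                         # the namespace selects the schema
        ldap_filter,
    ]
```

For example, `ldapsearch_command("bdii.example.org", 2)` yields a command that searches under o=glue, and the returned list can be passed to `subprocess.run` on a host where `ldapsearch` is installed.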

No explicit steps needed to configure for GLUE2 at the site level (AdminDomain)

  • gLite retirement campaign helped
  • 195 sites publishing GLUE2 against 228 sites publishing GLUE1
    • A few issues being followed up, in particular case sensitivity

Service publication status

  • Most services GLUE2 ready in EMI1, some only in EMI2
  • Fraction of services published in GLUE2 roughly matching the fraction of sites with GLUE2 enabled

Explicit EGI GLUE2 profile required because the schema is intentionally very flexible

  • Currently "in EGI" means "in BDII"
  • Attributes classified according to their potential use: 5 categories
    • Service Discovery: static information
    • Service Selection: dynamic information
    • Monitoring: not used by MW
    • Oversight: high-level management info like installed capacity
    • Diagnostics
  • The schema only defines some attributes as mandatory; the others are implicitly optional. The EGI profile refines 'optional'.
    • Mandatory
    • Recommended: should be published unless there is a good reason not to do it
    • Desirable: likely to be useful if published
    • Optional: no known usage but harmless to publish
    • Undesirable: some negative side-effect

Validation is an integral part of GLUE2 effort

  • Many mistakes and inconsistent info published in GLUE1 have been one major source of problems
  • GLUE2 validation defines 4 error severities
    • FATAL: value so incorrect that it invalidates the information structure, like unique IDs not unique
    • ERROR: value definitely incorrect but with a limited impact
    • WARNING: value likely to be incorrect (e.g. very large number of jobs)
    • INFO: value technically correct but suspect
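
The severity levels can be illustrated with a small sketch. This is not glue-validator code: the record layout (`id`, `running_jobs`) and the plausibility threshold are assumptions made for the example, and only three of the four severities are exercised.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    FATAL = "FATAL"
    ERROR = "ERROR"
    WARNING = "WARNING"
    INFO = "INFO"


def validate(records, max_plausible_jobs=100_000):
    """Return a list of (severity, record id, message) findings."""
    findings = []
    id_counts = Counter(record["id"] for record in records)
    for record in records:
        rid = record["id"]
        if id_counts[rid] > 1:
            # A duplicated unique ID invalidates the information structure.
            findings.append((Severity.FATAL, rid, "unique ID is not unique"))
        jobs = record.get("running_jobs")
        if jobs is None:
            continue
        if jobs < 0:
            # Definitely wrong, but the damage is limited to this value.
            findings.append((Severity.ERROR, rid, "negative job count"))
        elif jobs > max_plausible_jobs:
            # Possibly right, but far more likely a publishing mistake.
            findings.append((Severity.WARNING, rid, "implausibly many jobs"))
    return findings
```

Keeping the severity separate from the message lets a consumer decide, for instance, to reject a record on FATAL while merely logging WARNING findings.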

EGI Profile review process

  • Long, technical document: will take a long time to converge
    • Feedback welcome, by Dec 7 for v1.0
  • Document is versioned and will evolve based on experience
    • Version 1.0 expected at the end of the year
  • Start validating published information by hand and progressively integrate checks into glue-validator
    • Starting with important things
    • Complete coverage as a goal

Goal: have GLUE2 fully usable next spring

  • Seems feasible

Discussion

  • GLUE 2 can properly represent capabilities etc. where v1.3 needed (or would need) hacks
    • E.g. multiple CPU benchmarks, multi-core support, installed/unavailable capacities
  • Lists of batch system names etc. will be collected and maintained by the GLUE WG in OGF
    • Easy availability via the web for other uses (e.g. accounting)
  • A generic "data dictionary" service might also be conceived in the future, hosted e.g. by EGI, OSG, ...

IPv6 Test Activities and Deployment - D. Kelsey

Address pool exhaustion now a reality in Asia and Europe

HEPiX site survey: questionnaire sent on Sept. 26, only partial coverage in the answers

  • 42 sites, 20 countries
  • Still time to answer
  • 6 questions: see slides
  • 15 sites are already IPv6 enabled
    • Generally means the basic core services (DNS, web servers, email servers) are reachable by IPv6
    • 2 sites already using dual-stack extensively
    • In fact IPv6 active at many sites through Windows/Mac systems where it is started by default and where machines are sensitive to rogue Router Advertisements
    • 10 sites have plans in the coming year, including CERN (CERN reports possible IPv4 exhaustion in 2 years from now)
  • Concerns about IP address management and about IPv6 security
  • Concerns about apps and tools (including operations tools) not being ready

Survey main conclusions

  • IPv6 is there, we cannot ignore it
  • Main problem is applications, tools and sites
  • End systems and core networks are ready

Many tests to do with IPv6:

  • Does the service break/slow down?
  • When is IPv6 used? Does it have a higher priority than IPv4?
  • Failover from IPv6 to IPv4?
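
The priority and failover questions can be probed with a short script. The sketch below is only one possible test harness, not a tool used by the working group: it resolves a host with `socket.getaddrinfo`, optionally moves the IPv6 candidates to the front, and tries each candidate in turn. The helper names are hypothetical, and the connection function is injectable so the ordering logic can be exercised without a network.

```python
import socket


def candidates_for(host, port):
    """Resolve host into (family, sockaddr) pairs, in the resolver's order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(info[0], info[4]) for info in infos]


def ipv6_first(candidates):
    """Stable reordering that puts IPv6 candidates before all others."""
    return sorted(candidates, key=lambda c: c[0] != socket.AF_INET6)


def tcp_connect(family, sockaddr, timeout=5.0):
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)


def try_in_order(candidates, connect=tcp_connect):
    """Return (first working candidate or None, list of failed attempts)."""
    failures = []
    for family, sockaddr in candidates:
        try:
            connect(family, sockaddr)
        except OSError as exc:
            failures.append((family, sockaddr, exc))
            continue
        return (family, sockaddr), failures
    return None, failures
```

Comparing the order returned by `candidates_for` with the candidate that finally works shows whether IPv6 was preferred and whether the fallback to IPv4 happened; timing the call gives a rough measure of any slowdown.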

MW readiness assessment concentrated on data transfers

  • Managed to get gridftp, globus-url-copy and FTS working with some patches
  • DPM working

Asset survey underway: spreadsheet where application/tool readiness can be recorded

  • Summary available soon on IPv6 WG wiki
  • Goal: complete coverage of all the apps/MW/tools used in WLCG
  • When readiness status is unknown, further investigations are needed
  • Several apps already fixed (patches contributed) but some well-known problematic apps
    • OpenAFS: no plan
    • UberFTP: patches fed back
    • All batch systems seem to have problems in internal communication between head node and WNs (tests made by an ARC site, ARNES, in Slovenia)
    • Many IGTF CA CRLs not available on IPv6: followed-up by IGTF

Recent news

  • IPv6 peering established via LHCOPN between CERN T0 and NDGF T1 for dCache testing
  • IPv6 @ CERN going well: available for end users who want it next year, starting to configure facing servers
  • LFC being tested at GARR
  • FTS3 being assessed: contact with developers

News from USA

  • US government mandate for public-facing servers did not apply to DOE labs but they basically met the September deadline...
  • Now concentrating on IPv6 support on client: deadline is Sept. 2014

Future work

  • Coordinate tests with EMI
  • Install services using standard configuration tools and strategies
    • Currently only isolated services
  • HEP IPv6 day next year?
    • Probably not before June and only if it makes sense, i.e. if there is no well-known showstopper

NDGF T1 plans to become fully dual-stacked in the coming year

  • May take more time in case problems are identified...
  • Currently working on dCache readiness

Discussion

  • F2F meeting in Jan, in particular about security aspects
  • China: carrots for IPv6 use (lower cost, higher bandwidth), but particle physics does not seem to be pushed yet

MW Clients in Application Area

CMS

The main issue is DM clients
  • Hack required for DPM rfio support: same as in gLite
  • dCache: unauthenticated dcap only
    • Bug being fixed in EMI-2 gsidcap
  • lcg-cp time-out fixed in October EMI-2 release but not in EMI-1
    • EMI-1 is in security fixes phase

From CMS SAM, most sites still running gLite WN

  • EMI-2 recommended but EMI-1 is acceptable
  • CMS ready to use SL6 WNs: demonstrated performance advantages
    • CMS only sites can move to SL6 WNs
    • DPM and dcap tested successfully
    • OSG fully supported

LHCb

All the MW clients used by LHCb are in AA
  • Not relying on any components deployed by the site
  • Deployment in LHCb CVMFS from LCG-AA not specific to LHCb: follow LCG-AA platform convention
  • Now includes WM commands and FTS CLI

Platform/versions used by LHCb

  • SL5 (32 and 64-bit), SL6 (64-bit)
  • gcc 4.6
  • Python 2.6 (SL5), 2.7 (SL6)

Happy with this approach!

ATLAS

ATLAS currently relies on what is installed on the WN, leading to many problems: want to move away from this
  • MW clients in CVMFS: each release with a dedicated path
    • Appropriate setup.sh script in this path
    • PanDA job will select the version to use
  • Also need various versions of Python

MW clients mean DM clients + VOMS clients

  • Require both 32 and 64-bit

CVMFS approach was used to test EMI-2 clients without asking sites to set up EMI-2 WNs

  • Built a tar ball from RPM

The problem is not to make the tar ball but to maintain it

  • Should discuss who is interested and who can participate in the effort
  • ATLAS can offer an infrastructure for testing (PanDA+HammerCloud)
  • Not clear if it can be experiment agnostic

EMI-2 WN tar ball - D. Smith

Idea is to build a tar ball from the EMI WN RPM, bringing in all dependencies
  • 141 dependencies
  • tarball: 219 MB
  • Also a few OS provided RPMs that the sites must provide on the WN
  • Site-provided information
    • Additionally need egi-ca-policy-core and CRLs: don't plan to add them to the tar ball
    • Site must configure VOMS clients
      • config for all LHC VOs could be included as standard
    • A central location could be made available with up-to-date information

Issues

  • glexec: unlikely to be part of the archive
  • Larger scale tests to be done

Discussion

WN tar ball from EMI releases looks like an interesting approach to improve the deployment time of MW clients
  • Build a CVMFS repository common to all exps to host it?
    • Would ensure immediate uptake by the vast majority of sites/resources
    • Might reduce cache thrashing (as of CVMFS v2.1)

Unclear if it will fit all needs but ATLAS interested to test it

  • LHCb would like to keep the flexibility of adding unreleased versions fixing problems, as done today using LCG-AA
    • In the short-term will keep its private copy of MW clients in its CVMFS repository: will review it later

Tar ball and CVMFS distribution are alternatives to regular WN; LCG-AA work is independent

System dependencies need to be handled carefully, e.g. the change of OpenSSL library ".so" version on SL6

  • Avoid strange job failures
  • Restore "HEPOSLibs" meta package that was maintained for gLite?

The setup script currently needs write access in the area where the tar ball was unpacked

  • Will be fixed to allow it to work with CVMFS

The CAs and CRLs are a complication, being looked into

Glexec configuration and dependencies might be served through CVMFS or tar ball

  • Sites would just need to install the setuid binary in a convenient location

GFAL2 - A. Devresse

New features
  • Support for cloud protocols in addition to grids
  • Protocol generic: protocol derived from provided URL, negotiation if/when possible (e.g. SRM)
  • Parallel get/put operation during copy operation
  • No environment variable needed: a configuration file in /etc/gfal2.d
    • config directory location can be overridden with one environment variable
    • GFAL1 environment variables used if defined and config file not present

Components

  • libgfal2: C library, set of independent plugins
  • gfal2-python: simple and pythonic Python bindings
  • gfalFS: fuse module for GFAL2
  • gfal-tools: experimental command line tools similar to lcg_util
    • Feedback and suggestions welcome: prototypes available, open to discuss the level of backward compatibility needed for options and error reporting

Designed to be the successor of GFAL1 and lcg_util but not 100% backward compatible

  • POSIX API is backward compatible
  • Python API to lcg_util: users would need to update their code to use the new GFAL2 Python bindings instead

Status: part of EMI-2 release

  • Packages in EPEL
  • Soon packaged for Debian
  • Sources publicly available and development open to everybody
  • GFAL2 is the core of FTS3

Discussion

  • Source and destination URLs can have different protocols (e.g. "dcap" → "root")
    • the copy will then go through the issuing host (UI, WN)
  • Third-party copies are used whenever possible
  • Reimplementing lcg-cp etc. with the same behavior would be difficult
    • Command line arguments would be doable, error messages would not!
    • It would allow a faster uptake wherever the CLI (not the GFAL API) is being used
  • Asynchronous calls: bringOnline missing, will appear soon
  • Library is completely thread-safe
  • C++ binding to stream object would be easy, but may be Linux specific
  • GFAL2 plugin for ROOT could help reduce the list of plugins to be maintained
  • Follow up in Operations Coordination WG
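
The points on URL-derived protocols and copy routing can be sketched in a few lines. This does not use or reproduce the GFAL2 API: the function names are hypothetical, and the table of protocol pairs that allow a third-party copy is an illustrative assumption, not a statement about what the GFAL2 plugins support.

```python
from urllib.parse import urlsplit

# Illustrative assumption only: pairs for which a third-party copy is possible.
THIRD_PARTY_PAIRS = frozenset({
    ("gsiftp", "gsiftp"),
    ("srm", "srm"),
    ("srm", "gsiftp"),
    ("gsiftp", "srm"),
})


def protocol_of(url):
    """Derive the protocol from the URL scheme, as a lower-case string."""
    scheme = urlsplit(url).scheme.lower()
    if not scheme:
        raise ValueError(f"no protocol in URL: {url!r}")
    return scheme


def copy_mode(source_url, destination_url):
    """Decide whether the data flows directly or through the issuing host."""
    pair = (protocol_of(source_url), protocol_of(destination_url))
    if pair in THIRD_PARTY_PAIRS:
        return "third-party"
    # Mixed or non-negotiable protocols: stream through the UI or WN.
    return "via-issuing-host"
```

With this table, a "dcap" to "root" copy is routed through the issuing host, matching the example given in the discussion, while a "gsiftp" to "gsiftp" copy is left to the two storage endpoints.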

Wrap-Up

Forgot one forthcoming workshop: DPM workshop at LAL, Dec. 3-4

Progress on several actions

  • EMI migration progressing at a good pace: very successful major upgrade so far...
    • short follow-up in Dec GDB

Follow-up discussion on GLUE2 publication by OSG

WLCG CVMFS repository with MW clients

GFAL2: feedback about lcg_util compatibility needed

-- MichelJouvin/Maarten Litmaath - 14-Nov-2012

Topic revision: r5 - 2012-11-16 - MaartenLitmaath
 