Summary of GDB meeting, February 13, 2013

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=197800

Welcome - M. Jouvin

Note takers: more needed

  • Currently 7; half from CERN, most of the others from the UK...

Next meetings

  • March meeting at KIT: registration required
    • See Indico
    • pre-GDB during the afternoon on cloud issues
  • April meeting canceled
  • An "out-of-band" pre-GDB on storage interfaces and I/O benchmarking will take place next Tuesday (February 19) afternoon
    • See agenda
    • Remote participation possible

Coordination with Operations Coordination to avoid overlap: the "Actions in progress" part of future GDBs will mainly be a report from the Ops Coordination, with a specific (short) focus on 1 or 2 topics

  • Will be filled in during the days before the GDB, after the last fortnightly meeting of the week before

Upcoming meetings: see slides

  • ISGC (Taipei, mid-March): several workshops scheduled, including the 3rd DPM workshop and a dCache workshop

Michel would like people to be more responsive when asked or proposed to give a presentation at the GDB, to help get the agenda ready in advance...

Future Support of EGI Services - T. Ferrari

Serious impact of the end of EGI-Inspire in 2014

  • Global tasks current funding
    • 25% EC (EGI-Inspire)
    • 25% EGI.eu (NGI membership fees)
    • 50% NGI
  • NGI International tasks
    • 33% EC (EGI-Inspire)
    • 67% NGI

EGI Council strategy defined in Spring 2012

  • Global tasks
    • EC Horizon 2020: new project for funding development and innovation
    • Community funding
  • International tasks: national funding

Global tasks classified in Critical, Desirable for non degradation, Desirable for continuation

  • EGI global tasks have been prioritized; this will be used to decide on future funding schemes
  • Different funding models to be discussed at next council in April
  • Definition of a process to select between candidates for global tasks (April council)
  • July: selection of new global task owners
  • September: transition plans for migrating global tasks

Operational service changes foreseen

  • Services interacting with PTs (end of EMI): will impact the UMD process
  • Several coordination functions can continue with a reduced level of funding: consolidate effort on fewer partners
  • Review current security operations, in particular for incident response
  • Emerging support services for federated clouds
  • Operational tools: more pro-active maintenance

New UMD structure/process envisioned to deal with more uncoordinated PTs at the end of EMI

  • UMD SW provisioning will include verification and staged rollout
  • Move to quarterly rather than monthly releases
  • Give PTs access to EGI testbed for integration activities (new)
  • 3 main categories of UMD components: core products, contributing products managed by Platform Integrators (PI), Community products
    • Core products: SW verification and release centrally by EGI
    • Contributing products: SW verification by PIs, released in UMD but in a separate repo
    • Community products: complete delegation to community, released separately from UMD
  • UMD Release Team: will collect information/release plans from PTs, track the progress in fixing urgent bugs, communicate requirements to PTs/PIs

Discussion

  • Have the platform integrators been identified? Not yet
  • Will the funding work? Unclear yet, more after the next EGI council in April (during Community Forum)
  • What channels between WLCG and EGI UMD Release Team?
    • No specific channel currently defined: WLCG not represented per se in the Release Team
    • The normal channel is the UCB (User Coordination Board) but Tiziana agrees that it's more for small/medium-size communities. WLCG is specific! Even though there is less duplication if all requirements are fed into the same channel.
    • Collaboration increased in the last years/months: will continue to collaborate as previously done
    • Take the opportunity of the Operations Coordination fortnightly meetings to discuss releases: Maria pointed out that the agenda of the fortnightly meeting is already very dense and the Ops Coordination already has a lot to do with limited manpower. Details will be discussed offline.
    • Stay tuned!

IPv6 Update and New Plans - D. Kelsey

Update of the Nov. 2012 report based on IPv6 F2F meeting outcome (Jan. 2013)

News

  • CERN policy: every node must be dual-stacked to allow a transparent IPv4->IPv6 transition
  • IPv4 shortage at CERN
    • 5 /16s heavily used: a new GPN /16 started in Oct. and is already 5% used
    • CERN can only get 1 more /22 subnet
    • VM plans put IPv4 at risk of depletion during 2014: IPv6-only VMs or private IPv4 (but performance and other problems foreseen)
  • OpenAFS and IPv6: as of now, no IPv6 support, no activity to add it
    • IPv4 embedded everywhere
    • Backward compatibility is the highest priority
    • No multi-home support (implied by dual stack)
    • For IPv6 support to happen, funding situation must change: openAFS does not come for free!

IPv6 approach at CERN based on dual stack: same provisioning tools, same performance...

  • IPv6-only VMs
    • +: will push forward WLCG IPv6 adoption
    • -: several apps not working properly with IPv6, most CERN partners without IPv6 connectivity
  • IPv4 private networks + NAT are considered a plan B
    • Not ideal for performance but not necessarily worse than going through a firewall
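The dual-stack behaviour discussed above can be sketched with a minimal client (illustrative only, not CERN's provisioning code): `getaddrinfo` returns both AAAA and A records, and the client tries each address in the order the resolver returns them, so IPv6 is used when available and IPv4 remains a transparent fallback.

```python
import socket

def connect_dual_stack(host, port, timeout=5.0):
    """Try every address returned by getaddrinfo (AAAA and A records);
    the first one that connects wins, so IPv6 is used when the resolver
    prefers it and IPv4 serves as a transparent fallback."""
    last_err = None
    for family, socktype, proto, _name, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(addr)
            return s
        except OSError as e:
            last_err = e
            s.close()
    raise last_err or OSError("no addresses for %s" % host)
```

An IPv6-only VM simply gets no A records for itself, while dual-stacked services keep accepting both client families.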

openAFS future usage: Maarten/IanF stressed that it may be time to move away from AFS

  • CERN pretty unique in its high and increasing AFS usage: almost shut down everywhere else
  • Recent increase of AFS quotas at CERN doesn't help...

New timetable for IPv6 transition at CERN: now IPv6 client support required by 2014

  • A challenge!
  • Focus carefully on what is really needed for WNs and similar services

The WG has recently seen increased participation

  • E.g. ATLAS; GridPP plans to join the testbed
  • But still several T1s missing: please contact Dave

Short term plans (until April)

  • Add more sites to the testbed (Glasgow DPM), some LXPLUS nodes?
  • The GridFTP N*N mesh is still useful
  • Understand CMS data transfer performance issues, move to tests of PhEDEx with DPM endpoints (and maybe other implementations)
  • LHCb: planned tests for job submissions and CVMFS at Glasgow and RAL
  • Currently continuing with SL5 but need to look at SL6 soon: will require redoing some of the manual work done for SL5
    • US-LHCNET will deploy SL6 and IPv6 soon

Need to plan for tests in the production infrastructure this year

  • Step 1: confidence that dual stack breaks nothing. May result in the ability to enable IPv6 permanently on production systems, allowing easier further testing
  • Step 2: functionality and performance testing: service by service
  • Step 3: complete 24-hour turn up

Dual-stack endpoint we are aiming for: all the services required for IPv6-only clients

  • When do we require it at all T1s? all T2s?
  • Is an SL6 UI required: probably not...

Known issue with batch systems: none is IPv6-ready yet

Next F2F meeting at CERN April 4-5

  • Increased participation welcome...

Actions in Progress

MW upgrade Status - P. Solagna

Current status: 92% of the infrastructure upgraded to supported MW in 4 months. No risk of major disruption of the infrastructure in case of a security vulnerability.

  • Unsupported CREAM, dCache, LB or BDII: 19
  • DPM, LFC and WN: 48
  • Total number of sites affected: 29
    • 21 sites have a downtime
    • 5 are waiting for the WN tarballs
    • 3 sites need to open a downtime
  • This does not cover VOMS services: some VOMRS instances need EMI-3 VOMS to maintain their features

WN tarball: EGI encouraged to publicize the CVMFS repository `grid.cern.ch` as an alternative to tarball deployment

  • Same maintainer as tarball

This upgrade allowed defining a workflow that will be reused in future upgrade campaigns

  • Soon for EMI-1: first alarms to appear March 1st in the operations dashboard
    • Dedicated Nagios monitoring
    • Many services upgraded directly to EMI-2: hope it will help...
  • Too early for direct installation of EMI-3, but if some sites would like to do an early evaluation, it can be discussed

Operations Coordination News - A. Sciaba

2 new co-chairs of the fortnightly meetings: Pepe Flix and A. Forti

Try to increase site participation

  • T1-related items moved to the beginning of the meeting
  • Identify people from different regions to represent T2s: reinforcing the coordination is very important

During LS1, only 2 "daily meetings" per week (Monday, Thursday)

Candidate for new TFs

  • Data placement

MW deployment

  • Last issues with tarball WN fixed
  • EMI-2 UI recommended, but a bug in WMS submission affects 2% of the jobs
  • EMI-2 WLCG VOBOX almost ready (tested by ALICE): waiting for a WLCG repo courtesy of EGI

CVMFS: only 6 sites didn't answer

Squid monitoring: TF completed with some proposals

glExec: more to come in March

SHA-2: new CERN CA (for testing) available soon

FTS3: new features demonstrated, stress tests ongoing and ramping up

Tracking tools: evaluation of non-cert-based authentication

  • Will develop authorization based on IdPs

perfSONAR

  • 60% of sites installed: still missing a contact person for Asia
  • Unfortunately: perfSONAR not yet treated as a production infrastructure
  • Need to improve its reliability
  • TF manpower is very limited: more welcome

SL6: all experiments validated SL6 except ATLAS group production

  • For ATLAS, SL5 and SL6 WNs must be in separate CEs due to the need to tag separately the supported releases
    • Other VOs don't have this requirement
  • T1s are asked not to move until the final validation, others can
  • No push from experiments to upgrade
  • LXPLUS will move part of the resources to SL6: same proportion as the SL6 grid resources
  • No plan to upgrade VOBOX to SL6 soon
  • TF coordinated by Alessandra: goal is migration by Sept. 2013

I. Fisk: thanks for the work done by the coordination and the progress made in a few months on languishing issues...

Storage System AAI - M. Litmaath

pre-GDB to complete the TEG work

  • Good attendance: 36, almost all parties except LHCb who could not make it in the end

Read access to data: ATLAS and CMS insist on no world-readable access to data

  • Access must be restricted to VO
  • No military secrets... just raise the bar enough
  • Final analyses and Indico presentations are considered more sensitive than the data themselves
  • Local clients can still be given a lower authz overhead for better performance: not clear if achievable
    • The SE needs to determine at least the client's VO

Data ownership

  • Discussion on using VOMS nickname = CERN account = Krb principal for ownership
  • Feeling that ACLs offer an alternative approach to fix changing-ownership issues and the need for a VO superuser
    • With appropriate use of VOMS groups, no need to update the ACLs when people are changing
    • Full use of ACLs probably requires using the implementation-specific API
  • ALICE uses a different approach where users are stored in a central LDAP, each user being able to have several DNs
    • The SE just needs to check the validity of the security envelope generated from LDAP and central-catalog permissions: it doesn't really care about the owner...
    • May be an interesting approach to support external identity providers, like a federated identity infrastructure
  • Recommended use of robot certs rather than user certs for ownership by VO, group or service
  • Krb local access: a local matter rather than a WLCG one
    • Generally implies maintenance of a map file: already done by CMS for EOS, ATLAS ready to do it

VO superuser for SE

  • Available in EOS, but with VO-dedicated instances
  • Can be done for file deletion with appropriate ACLs in other implementations
    • But no ability to fix ACL/DN problems
    • Requires use of the implementation-specific API

Cloud storage concerns: too early

  • Avoid asking for non-trivial features that end up unused and make the use of standard technologies more difficult
  • Use industry solutions like ACLs: will improve the ability to use new technologies and get funding

Cloud Discussions Summary - M. Jouvin

See slides.

Site and/or user contextualisation

  • Agreement that contextualisation is needed: probably as a startup script
    • In the past, agreed on amiconfig (used by CERNVM)
    • Now cloud-init has emerged as the new standard: more flexible but more user-contextualization oriented
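For illustration, the startup-script style of contextualization under discussion could look like the following cloud-init user-data fragment (the package name and pilot script path are hypothetical, not an agreed WLCG convention):

```yaml
#cloud-config
# Hypothetical site contextualization: install and set up CVMFS,
# then launch a pilot at first boot.
packages:
  - cvmfs
runcmd:
  - [ cvmfs_config, setup ]
  - [ /usr/local/bin/start-pilot.sh ]
```

amiconfig served a similar role with a different, plugin-based syntax; cloud-init's `runcmd`/`write_files` modules are what makes it more user-contextualization oriented.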

VM duration? Still a debate: use spot-style termination as in Amazon, or favor a graceful end instead?

  • The ideal for LHCb would be very long-lived VMs.
  • Gavin proposed two parameters: a) a minimum lifetime in days and b) end-of-life notification a minimum number of hours in advance.
    • Latchezar: the user must be able to shut down the machine when the job is done, even though the site would allow him to run it longer
    • Claudio: need to be able not only to shut down the VM but also to deallocate the resources
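A sketch of how Gavin's two parameters might interact (the class, names and units are hypothetical; no interface was agreed at the meeting):

```python
from datetime import datetime, timedelta

class VmLease:
    """Sketch of the two proposed parameters: a guaranteed minimum
    lifetime (days) and an advance-notice window (hours) that the
    site must respect before forcing a shutdown."""
    def __init__(self, started, min_lifetime_days, notice_hours):
        self.started = started
        self.min_lifetime = timedelta(days=min_lifetime_days)
        self.notice = timedelta(hours=notice_hours)
        self.shutdown_at = None  # set when the site announces the end

    def announce_shutdown(self, when, now):
        """The site schedules the end; both guarantees must hold:
        the VM lived at least min_lifetime, and the notice window
        between announcement and shutdown is respected."""
        earliest = max(self.started + self.min_lifetime, now + self.notice)
        if when < earliest:
            raise ValueError("shutdown violates lease guarantees")
        self.shutdown_at = when
```

Nothing here prevents the user from shutting the VM down earlier (Latchezar's point); the parameters only bound what the site may impose.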

Follow-up of initial discussions expected to take place at a F2F meeting during next pre-GDB

  • Will not be a pure F2F meeting: remote participation will be possible
  • Several overlapping meetings (e.g. ATLAS, ROOT) but experts from each experiment will be available at least for remote participation. No alternative possible before May...

Storage Federations

WG Final Report - F. Furano

~10 meetings in a few months of life

Federation: see everything natively as one storage, minimizing complexity and the amount of glue

  • Thin clients preferred
  • Interactions that use only one protocol are preferred
  • In fact, the original grid vision, with products like LFC
    • All experiments share the vision; they are just building different tools

Glue: SW to bridge heterogeneous systems

  • +: independence from vendors, many knobs to influence behaviour
  • -: scalability (at least effort required to achieve it), never perfect, requiring a lot of babysitting

Common objectives

  • Relaxing requirements on data locality to increase the chances of using CPUs
    • Does not mean end of data placement
  • Lower the bar for physicists to do analysis: reduce the complexity/steps to access data once they know what they want
  • Direct access to data
    • Ultimate wish: WAN direct data access
  • Consistent/coherent file naming, independent of location/site (SFN)
  • Dynamic handling of down storage
  • Enable the use of non WLCG resources

Protocols

  • Be prepared for future projects that may involve HTTP
    • Most GRID SE implementations provide an HTTP/DAV interface
  • All GRID SE technologies provide an Xrootd interface and can join an Xrootd federation
    • Xrootd was born with federation in mind
    • Tries to be topology-aware
  • Recommendation: add an HTTP/DAV R/W interface to Xrootd
  • DAV adds browsing features to Xrootd

Use case scenarios

  • Keep separated the concept of federation from applications
    • Apps should not care about the federation: only about data access, self-healing implementation...
  • Job failover: requires a protocol supporting redirection (Xrootd or http)
    • Transparent failover
  • Self-healing: instrumentation required
    • CMS prefers to leave this instrumentation to Xrootd server
    • ATLAS proposes an optional Xrootd instrumentation
    • ALICE has everything in place but deactivated
  • Site cache for analysis: will serve all local users
    • Can decide to serve a user from the local cache, download the missing piece or redirect to an external location
    • Should be able to make its volatile contents available to the federation
    • Recommendation: continue tests of site cache implementations with Xrootd and http/DAV
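The job-failover scenario above boils down to trying replicas in order until one opens; a redirection-capable protocol (Xrootd or HTTP) lets the federation supply that list. A minimal, protocol-agnostic sketch, where the `opener` callable stands in for a real Xrootd/HTTP client open call (not any experiment's actual failover code):

```python
def open_with_failover(replicas, opener):
    """Try each replica URL in turn and return the first handle that
    opens; raise only if every replica fails.  'opener' stands in for
    an Xrootd or HTTP client open call."""
    errors = []
    for url in replicas:
        try:
            return opener(url)
        except OSError as e:
            errors.append((url, e))
    raise OSError("all replicas failed: %r" % errors)
```

With a redirector, `replicas` can be a single federation entry point; the redirection then happens inside the protocol instead of in this loop, which is what makes the failover transparent to the application.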

Security: federations rely on access protocols used

Name translation and catalogues: the current situation is non-trivial because path names are "decorated" with site-specific tokens, in particular for ATLAS

  • CMS uses an algorithmic name translation: no external service involved
  • LHCb gets the relevant tokens from their IS and LFC. LFC used only to get host names.
  • ALICE: already a global namespace, final path topology-aware
  • ATLAS: instrumentation of all SEs so that they synchronously contact the LFC
    • Work in progress to use new file names that can be translated algorithmically
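An algorithmic translation in the CMS style is a pure function from logical file name to site path, with no catalogue lookup. A minimal sketch (the prefix rules and hostnames are invented for illustration, not CMS's actual trivial-file-catalog configuration):

```python
# Hypothetical site-local prefix rules, applied in order
# (first match wins, so more specific prefixes come first).
SITE_RULES = [
    ("/store/data/", "root://se.example.org//dpm/example.org/data/"),
    ("/store/",      "root://se.example.org//dpm/example.org/store/"),
]

def lfn_to_pfn(lfn):
    """Translate a logical file name to a site PFN by prefix
    rewriting; purely algorithmic, so no external catalogue
    service is involved."""
    for prefix, replacement in SITE_RULES:
        if lfn.startswith(prefix):
            return replacement + lfn[len(prefix):]
    raise KeyError("no rule matches %s" % lfn)
```

Each site ships its own rule table, while the logical namespace stays identical everywhere; this is exactly the property that removes the external-service dependency noted for CMS above.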

Status of existing federations

  • CMS: federation already well in place, global redirectors available, 60% of sites involved representing 90% of data
  • ALICE: redirection used only inside sites: no global redirector
    • Connected to the central catalogue
    • A lot of access already made remotely
  • ATLAS: a file is first searched locally using the unique global name, then a hierarchy of redirectors is used
  • LHCb uses LFC+SRM
    • LFC is used only as a replica catalog
    • Provides federation features but client is not redirection-aware

Monitoring: useful for many different things like identifying HW issues, debugging clients...

  • Challenge: may produce huge amounts of data, put a lot of pressure on internal implementation to scale with federation size
    • Took 4 man-years for Xrootd to achieve it
  • Need to ensure that monitoring data are tagged by site
    • Current activity: impact of enabling WAN direct access
  • Consumers: a lot of lessons learnt from dashboard experience, helped to identify real challenges

Discussion

Markus: the main motivation for federation is probably not failover, as this can be implemented on the client side, as demonstrated by LHCb with SRM+LFC

  • Other use cases probably more important: first "site cache" or ability to run jobs at storage-less sites, second self-healing
    • ATLAS confirms that the ability to use CPUs available at sites with no storage (or at least no storage dedicated to ATLAS) is a top priority and a strong motivation for FAX
    • Storage is the most difficult service to run: a barrier for small sites to contribute (efficiently) to experiment computing.

How difficult is it today to use http/xrootd with existing storage systems?

  • No major difficulty: just an access protocol to configure, not more difficult than others
  • 3rd party transfers work seamlessly with xrootd

http plans

  • No concrete plans currently for CMS and ATLAS but checks were made to ensure that http is working fine as an access protocol. May join http-based federation later.
  • Main reason for http rather than xrootd: lots of clients available, including browsers, some with appealing features like multi-stream transfers, and caching

Site caching and file permissions

  • Need to check if file permissions are honoured when accessing data from the site cache: as stated during the pre-GDB on storage AAI, ATLAS and CMS insist that their data must not be readable from outside the VO.
  • The ARC cache caches access rules.

Topic revision: r4 - 2013-02-20 - MichelJouvin