Summary of GDB meeting, May 14, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272621/

Introduction - M. Jouvin

Next GDBs

  • July meeting replaced by WLCG workshop. No meeting in August.
  • Do we want a meeting outside CERN in Autumn 2014? Attendance is often lower.
  • January 2015 GDB on 2nd Wednesday, as usual: no conflict this year with CERN Director's New Year Speech

WLCG workshop in Barcelona July 7th-9th

  • Registration deadline 9th June
  • Workshop sections: Status + Experiment session + future evolution. Suggestions welcome.
  • Preliminary agenda available: https://indico.cern.ch/event/305362
    • Comments and suggestions welcome: contact Maria, Andrea and Michel

June pre-GDB on IPv6 management (please register):

Actions: https://indico.cern.ch/event/305362/

  • Experiments to update developers with their GFAL2/FTS3 plans.
    • Decommissioning (planned) date: October 2014
  • CVMFS MW client repository restructuring: J. Blomer seems to be responsible for the action; will contact him to get news.
  • IPv6: need volunteer sites to move production endpoints to dual stack
  • There is an FTS web portal available for testing
  • More volunteers sought for batch accounting WG

LHC news

  • LS1 completion started.
  • Run2 will start at the end of Q1 2015 and last until summer 2018.

Meetings

  • Ops coordination: see Indico
  • HEPiX 19-23 May
  • EGI Community Forum 19-23 May
  • IPv6 pre-GDB 10th June: please register
  • WLCG workshop 7-9th July: please register asap
  • ACAT 1st-5th September.

Configuration Management Tools

Puppet Community Update - B. Jones

HEPiX WG formed to share experience and possibly code/effort

  • Not intended to replace the wider community

A few pre-existing projects

  • github.com/cernops: modules developed at CERN but also interaction with upstream through GitHub standard features
  • github.com/hep-puppet: started by UK sites
    • ARC CE, HTCondor, Torque, Dirac...

Standard modules can be submitted to forge.puppetlabs.com

  • 2.3K modules registered...
  • Searchable with 'puppet module' command
  • Community modules produced for grid services are submitted to Puppet forge
    • GitHub only for development and issue tracking

Success using the open development model with several modules already

  • Some contributed by sites, others by product developers (DM team in particular)
  • Also issues with the open development model: slow delivery of things that are published...
    • Open development takes confidence...
    • GitHub use may be a barrier at the beginning
  • Secret handling is an issue without an agreed/easy solution yet: specific developments at some sites (CERN/DESY), but generally not really shareable
    • In particular authorization methods

Main open questions:

  • Where do I go to find list of modules?
  • Where do I report issues?
  • Where do I put my developments?
  • Made a bit more complex by the multiple (GitHub) projects developing modules

Next GridKA school of computing has puppet course (September)

WG has not committed to offer a full YAIM replacement even though it remains a goal

  • No attempt to deliver a fully integrated suite to configure a grid site

Quattor Community Update - I. Collier

One of the big changes in the last year: move of everything to GitHub

  • Much easier/effective collaboration, more active contributions
  • New release process with regular releases

Next generation Quattor configuration database now entering production: Aquilon

  • Based on a SQL database for storing site-specific information: no need to edit templates for everyday operations
  • Templates in a Git repo: give access to Git workflows
  • Developed by Morgan Stanley
  • Already used at UGent and RAL

YUM-based package management: a major change that removes the main 'pain point' in Quattor

  • Most sites have started migrating; at some the migration is complete
    • EMI3 templates only support YUM-based deployment
  • Although this was a major internal change, the impact on site configuration was minimal: all sites are migrating without any major effort, other than setting up an appropriate YUM infrastructure (YUM snapshot management)
    • Very good demonstration of efficiency of the Quattor template library approach where sites only maintain their site configuration description, not how to configure the service
    • Also demonstrates the backward compatibility that Quattor has been able to implement: most site descriptions are 5 to 10 years old...
  • Frees some effort for real feature development

A new component to benefit from Puppet developments: ncm-puppet

  • Allows Puppet modules to be used in standalone mode, driven from Quattor
  • Benefit from mature Puppet modules in a way similar to how YAIM was used in the past
    • Avoid duplicating efforts to maintain Quattor specific configuration modules when there is a good one produced/maintained upstream
    • DPM as a prototype

The ncm-metaconfig configuration module, introduced 1 year ago, allows services to be managed without writing a full configuration module

  • Can manage configuration files in a wide variety of formats
  • Brings the ability to define a service-specific schema without writing a component
  • May allow a lot of legacy components to be retired that were there just to define a schema

Community is small but engaged

  • Most sites remain committed to use Quattor
  • ~15 regular contributors/reviewers on GitHub
  • Weekly development meetings, twice yearly workshop

Discussion

  • How dependent is Aquilon product on Morgan Stanley?
    • MS did the initial development and remains a major contributor, but in the last year expertise has grown outside MS and most of the development and packaging happened outside MS
    • BTW MS is a very active contributor, not only to Aquilon: very much involved in the general development/reviewing process
    • GitHub was instrumental in making this strong collaboration possible
  • YAIM replacement: the Quattor template library allows a full grid site to be configured without using YAIM, so the end of YAIM has no impact
    • Developing the YAIM-free configuration has been a significant effort done a while ago...
    • Maintenance of some specific (and complex) configuration modules (e.g. for DPM) is a significant effort, thus the attempt to benefit from what is maintained by the developers (Puppet module)
    • ncm-metaconfig configuration module will help to reduce this load (by reducing the number of required components)

WLCG Monitoring Consolidation - P. Saiz

Goals: reduce complexity; simplify operations; common development; unify… in order to support with fewer resources.

Recent improvements

  • Service management/deployment: rely on Agile infrastructure and its tools
    • 100 nodes, 5 TB of storage, SLC6 except Nagios boxes
  • New submission method for Nagios probes
    • CREAM CE (ALICE and LHCb): going to production in the next days
    • CondorG (ATLAS and CMS): validated, ready for pre-production
    • These probes are different from standard CREAM CE probes: provide a richer support for specifying user payload, a requirement for VO tests. Information complementary to standard probe run as part of ops tests (more detailed testing of CREAM CE internal services)
  • Migration of SAM (ops tests) for EGI: handover to a consortium of GRNET, SRCE and CNRS
    • Transparent for WLCG
    • WLCG monitoring doesn't depend at all on the EGI infrastructure: completely disconnected (EGI infrastructure only for ops tests)
  • Site nagios plugin, contributed by PIC
    • Import dashboard data in JSON format and publish it in the local Nagios
    • Generic: development for the WLCG community
    • Dynamic: automatically handle tests added/removed locally
    • 1 dependency: jq (JSON processor)
  • Data imported from VOs
  • SAM3 interfaces implemented and used for avail/reliability calculation
    • Database shared with SSB
    • Currently validating that SAM3 is giving the same result as previous SUM implementation: a few differences mainly due to a different/better handling of downtimes
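The idea behind the site Nagios plugin listed above (import dashboard results as JSON, publish them in the local Nagios) can be illustrated with a minimal sketch. It is not the PIC plugin, which according to the summary relies on jq. The JSON layout, the host and test names and the status strings are all assumptions made for illustration; only the `PROCESS_SERVICE_CHECK_RESULT` line format is the standard Nagios external command for passive results.

```python
import json
import time

# Map a dashboard status string to a Nagios return code.
# The status names are assumptions; adapt them to the real feed.
STATUS_TO_CODE = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
UNKNOWN = 3


def to_passive_checks(dashboard_json, now=None):
    """Turn a JSON document of test results into Nagios external-command lines."""
    now = int(time.time()) if now is None else int(now)
    lines = []
    # Iterating over whatever results are present is what makes the import
    # "dynamic": tests added or removed upstream need no code change here.
    for result in json.loads(dashboard_json)["results"]:
        code = STATUS_TO_CODE.get(result["status"].upper(), UNKNOWN)
        lines.append(
            "[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{svc};{code};{out}".format(
                ts=now,
                host=result["host"],
                svc=result["test"],
                code=code,
                out=result.get("summary", result["status"]),
            )
        )
    return lines


# Hypothetical sample input, not a real dashboard export.
SAMPLE = json.dumps({"results": [
    {"host": "ce01.example.org", "test": "org.example.JobSubmit",
     "status": "ok", "summary": "job finished"},
    {"host": "se01.example.org", "test": "org.example.Put", "status": "critical"},
    {"host": "se01.example.org", "test": "org.example.Get", "status": "missing"},
]})

if __name__ == "__main__":
    for line in to_passive_checks(SAMPLE, now=1400000000):
        print(line)
```

In a real setup the lines would be written to the Nagios command file, and the matching passive services would still have to exist in the local Nagios configuration.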

WLCG Nagios boxes

  • Currently 3 instances: production, pre-production, development
    • Still managed by Quattor
    • 1 box per VO in each instance
    • Prod instance publishing both in SAM and SAM3 db
  • Plan to remove the development instance and build a new pre-prod instance in Agile infrastructure that will publish only to SAM3 db
  • After validation that the new instance produces the same results as the current prod instance: remove the Quattor-based prod instance

A lot of progress in the last months but still plenty of work ahead of us

Discussion

  • What about ARC CE probes?
    • Condor-G probes will test for ARC CE too.

Future MW Support - C. Aiftimiei

UMD-3: at the end of EMI most partners committed to “at least one year support”

  • End of IGE: best effort from EGCF
  • UMD-2 support now over, including bug fixes

Carried out a support calendar survey, covering support, bug fixes and further developments

  • AMGA (KISTI), ARC, caNI-c++ (NorduGrid), APEL (STFC-UK) – updates coming,
  • BDII and lcg-info-clients (CERN) okay
    • Long term support for BDII
    • Clients: ginfo (GLUE2) supported, lcg-infosites : critical bug fixes only, lcg-info not supported anymore
  • CERN Data Management (DPM/LFC/FTS3/DMC) okay (including developments) except LFC best effort
    • GFAL replaced by GFAL2, lcg-util replaced by gfal2-utils
    • No major release breaking backward compatibility foreseen
  • dCache (DESY) – all golden releases receive 2yrs support
    • 2.6 branch until Apr 2014
    • 2.10 to be released in July
    • Every new golden release tends to introduce non backward compatible changes...
  • CREAM, WMS, STORM, VOMS, EMI UI/WN (INFN): support till at least April 2015.
    • No longer supporting Cluster, TORQUE config
    • Yaim-core is only best effort but no changes in the last years
  • CREAM GE module (LIP) & CREAM LSF (CERN) – best effort
  • caNI(-c), GridSite, ProxyRenewal, L&B (CESNET) – maintenance only
  • caNI-java (ICM) – support till (at least) end 2014.
  • MPI (IFCA/CSIC) – at least till end 2014
  • QosCosGrid (PCSS) – at least until end 2015
  • ARGUS (SWITCH): the main problem, SWITCH wanting to step out
    • Offering best effort support from April 2014
    • Would like to hand over support, bug fixes and further developments by the end of the year
    • Discussion going on with EGI, WLCG & INFN.
  • glexec-wn, including LCMAPS and LCAS: currently unknown
  • EMIR, UNICORE (JUELICH): unknown

Discussion

  • Cristina thanked for this detailed survey
  • Probably an update at next GDB

Actions in Progress

Ops Coord Report - P. Flix

Next meetings: see Indico http://indico.cern.ch/category/4372

Main news in the last month

  • Alastair Dewhurst replaces S. Campana as the IPv6 TF convener
  • Discussion on ARGUS support: see previous talk
  • New WG on Network and Transfer Metrics: mandate being discussed
    • Goal: identify and publish relevant metrics for network and transfers, that could be used by network-aware apps
    • Chaired by S. McKee and M. Babik
  • Xrootd and perfSONAR TF stopped
    • PS deployment finished: 8 sites without it, 64 still have to upgrade to the latest versions, 205 sites in the mesh
    • Xrootd: main goal was to ensure proper coordination between AAA and FAX, achieved, still several open questions/actions, in particular regarding monitoring (e.g. dCache sites)
  • New dCache 2.9 released May 5
  • Many details in minutes from last week and planning meeting on 17th April. https://indico.cern.ch/event/305362/

T0 investigations about possible efficiency differences between Meyrin and Wigner

  • Several causes of inefficiency found and being investigated
    • Combination of CPU architecture, virtualization, OS version, zombie pilot jobs, malfunctioning network HW...
  • No evidence of an efficiency difference related to geographical sites/network latency
  • Investigation continuing: further reports expected

Other T0 news

  • WMS decommissioning progressing as expected
  • VOMS-admin: waiting for next release
  • ARGUS sporadic authentication failures: situation improved by adding more nodes to the ARGUS alias
    • Detailed investigation in progress
  • 70% of resources in SLC6

ALICE

  • KIT overload understood and fixed: troubleshooting difficult, thanks to KIT support
  • QM2014 next week: intense analysis activity in the last week

ATLAS

  • Rucio stress tests planned May 19
  • Multicore: request to reduce the multicore partition where the cluster is statically partitioned

CMS

  • SAM test for glexec will become critical May 15
  • AAA: scale testing in progress
  • Multicore: ramping up at several sites
  • FTS3 mandatory for all sites in the PhEDEx debug instance, prod migration in June

LHCb news

  • CASTOR -> EOS user data migration completed
  • Pb with certificates from Brazilian VO at some sites: under investigation

GGUS

  • Email ticket submission to be stopped soon
  • Savannah to GGUS migration for CMS in progress

FTS3: agreement by all experiments to stop FTS2 in August

MW readiness

  • Initial list of sites approved at last MB, rewarding method agreed
  • Next meeting of the TF tomorrow

glexec

  • Only 15 sites with open tickets
  • TF continuing
  • Get experience after CMS has defined the glexec SAM test as critical

Machine/job feature

  • Active dev/deployment of a service for cloud infrastructure
  • Sites asked to provide the appropriate information for the client
  • Client doesn't need to be deployed on WN for LHCb: is deployed in LHCb SW area

MW validation: next meeting tomorrow, more concrete plans then

Multicore TF

  • Concentrating on getting feedback from early adopter sites
  • Will then provide recipes for dynamic provisioning for all batch systems

SHA2

  • A problem identified in CREAM CE with SHA512 certificates used by new VOMS servers
    • Fix tested, waiting for deployment in EMI/UMD repos: sites will be requested to upgrade asap then, timeline tight
  • CMS experienced a problem with current (legacy) proxy format: moved some core services to RFC proxy, no problem so far

HTTP Proxy TF

  • New fields defined in GOCDB and OIM
  • Waiting for Squid monitoring recommendations to be implemented
  • Frontier client ready to use WPAD/PAC
  • CVMFS client can use WPAD/PAC but with only limited support so far

Discussion

  • What services use the legacy proxies?
    • Maarten: just about everything. voms-proxy-init gives you a legacy proxy by default. Experiments need to confirm there are no services left that require legacy proxies. Should make RFC proxies default later this year (will require a new voms-proxy release).
    • Maarten: infrastructure cleanup for SHA-2 support provides RFC support for free (was part of the same changes).
    • Check status: CMS have done it. ATLAS now checking. A similar check for infrastructure should be done – pre-prod SAM may switch to RFC soon. Some coordination with EGI needed.
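When checking which services still see legacy proxies, a quick first pass is to look at the subject DN: legacy Globus proxies append `CN=proxy` or `CN=limited proxy`, while RFC 3820 proxies conventionally append a numeric CN. The sketch below is only a heuristic under those assumptions. The function name and sample DNs are hypothetical, it cannot tell RFC proxies from older draft-style ones, and the authoritative check is the ProxyCertInfo extension (OID 1.3.6.1.5.5.7.1.14) in the certificate itself.

```python
def proxy_style(subject_dn):
    """Guess the proxy style from the last CN of an OpenSSL-style subject DN.

    Heuristic only: DNs whose values contain '/' and end-entity
    certificates with an all-digit CN will be misclassified.
    """
    last = subject_dn.rstrip("/").rsplit("/", 1)[-1]
    if not last.startswith("CN="):
        return "not a proxy"
    value = last[3:]
    if value in ("proxy", "limited proxy"):
        return "legacy"
    if value.isdigit():
        # Numeric CN: RFC 3820 or older draft-style proxy; the DN alone
        # does not say which.
        return "rfc-or-draft"
    return "not a proxy"


if __name__ == "__main__":
    for dn in (
        "/DC=org/DC=example/CN=Jane Doe/CN=proxy",
        "/DC=org/DC=example/CN=Jane Doe/CN=1234567890",
        "/DC=org/DC=example/CN=Jane Doe",
    ):
        print(proxy_style(dn), dn)
```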

UMD-2 Decommissioning Status – Cristina Aiftimiei

UMD-2 EOL – April 2014

  • From end of May sites must put in downtime any remaining UMD-2 services (except dCache). Failing this sites may be suspended.
  • Statistics of remaining UMD-2 services per NGI: see slides
  • Most remaining sites indicate they will upgrade by the end of May.

Discussion

  • Michel: thanks to EGI for coordinating this migration, a new success after the previous EMI-2 migration and SHA-2 migration

Data-access pre-GDB Summary – Wahid Bhimji

A pre-GDB summary is available at the usual place: https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20140513.

Meeting goals

  • Do we understand our data access well enough? Are I/O performance wins out there?
  • Data federations are in production and offer increased flexibility and resource usage:
    • But do we have everything needed to work at scale?
    • Do sites need to plan or provision more?
    • How do we use this software: monitoring; caches etc.?
  • Are we employing solutions compatible with wider communities? Should we? (c.f. Big Data etc.)
  • Is our protocol zoo growing (http / xrootd / (rfio) / gridftp etc.)? Are there paths to simplification?

Main items presented

  • Progress allowing WLCG Tier-2 disk-only sites to not have SRM in Run-2: take a look if interested
  • Federation Workshop (mid-April, SLAC) summary: lots of interesting items, a few of them rediscussed and updated during pre-GDB
    • Xrootd4 in RC, release expected by the end of the month: caching proxy server, IPv6 support, http plugin...
    • A. Hanushevsky came from SLAC to attend the pre-GDB in person: a good sign that the whole community is engaged together in this federation work
  • Comprehensive monitoring overview for xrootd. Views could be used more than now, in particular by sites.
    • Can easily be extended to httpd: proposal presented
  • Storage (EOS) infrastructure monitoring: semi-auto detection of performance anomalies from EOS logs and LSF. Metrics need to be defined appropriately.
  • Looked at Meyrin-Wigner test results. Some differences seen but no definite conclusions.
    • See Ops Coord report today
  • CMS: data federation (AAA): working and widely used for fallback and remote work.
    • Scale tests show some infrastructure limits to be understood, more tests planned to identify the real limits
    • Want wider deployment
    • Throttling plugin for xrootd needed by some sites: will come in 4.1 (July)
  • ALICE: remote reading for ‘urgent’ tasks and data access failover. Detailed understanding and monitoring of job efficiency and site bottlenecks.
    • Average analysis needs 2MB/s/core.
  • ATLAS: FAX in production and stable.
    • Improved with Rucio (no LFC lookup).
    • Failover used widely: no visible impact on the network infrastructure
    • ‘Overflow’ (rebrokering without moving data) in test.
    • Ask for sites to offer xrootd and HTTP/webdav.
  • LHCb: Jobs CPU-bound so download input dataset. Access files using xrootd.
  • German sites: widely deployed FAX and AAA.
    • Good experience with LAN direct IO for several years, using dcap
    • Interested in xrootd to reduce the protocol zoo (replace dcap, which is specific to dCache)
    • Lots of hammercloud testing: see slides from pre-GDB
  • Recent HTTP developments
    • A plug-in for xrootd4 (XrdHTTP done) coming soon.
    • Dynamic federations using http for redirection (to find and list) in test.
    • ATLAS – Rucio uses DAV (where available) for renaming etc.
      • Not monitored as a service yet.
      • Direct access with Davix under test.
  • Root I/O: detailed presentation on recent enhancements and the new things on the roadmap
    • TTreeCache configurable in environment coming soon.
    • Root I/O now thread friendly.
    • Workshop at CERN on 25th June: everybody interested should register and come!
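As a worked example of what the ALICE figure of 2 MB/s per core quoted above implies for site provisioning, the arithmetic can be written out as below. The 1000-core site is a hypothetical illustration, not a number from the meeting, and decimal units (1 Gbit = 1000 Mbit) are assumed.

```python
MB_PER_S_PER_CORE = 2.0  # ALICE average analysis figure quoted in the summary


def required_bandwidth(cores, rate_mb_s=MB_PER_S_PER_CORE):
    """Return (MB/s, Gbit/s) needed to feed `cores` concurrent analysis slots."""
    mb_s = cores * rate_mb_s
    gbit_s = mb_s * 8 / 1000  # bytes to bits, then Mbit to Gbit (decimal)
    return mb_s, gbit_s


if __name__ == "__main__":
    # Hypothetical site running analysis on 1000 cores at the same time.
    mb_s, gbit_s = required_bandwidth(1000)
    print("%.0f MB/s, i.e. %.0f Gbit/s" % (mb_s, gbit_s))
```

For the hypothetical 1000 cores this gives 2000 MB/s, i.e. 16 Gbit/s of sustained storage throughput.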

Conclusions of the meeting

  • We have lots of data on data-access performance, covering all the way from ROOT I/O to storage and network
    • we need more work to understand them and use them to improve
  • Data federations in production: need to convince remaining sites to join
    • Monitoring available: we need to use it
    • Waiting for a few new features like throttling, VO filtering for monitoring
  • Many interesting developments on http, brokering, caching etc.
    • Should help to work with other communities

Q&A:

  • JT: if having only http, could a site join an Xrootd federation?
    • Wahid: Currently not. This site and xrootd sites could join the dynamic (http) federation thanks to the new Xrootd http plugin, but this is not what ATLAS and CMS are using currently. This is one reason that makes the dynamic http federation potentially interesting to WLCG.
    • Michel: no clear use case since DPM and dCache support both xrootd and http. For DPM, not much complexity in running Xrootd in addition to http/Dav in recent versions.
    • A use case could be to add cloud storage to the federation
    • Dirk: protocol re-brokering (different protocol between the client and the redirector and the client and the disk server) within client is being discussed. May help with moving towards http federation.
  • JT: would be good to reduce protocols: adding Xrootd is not going in the right direction... http/Dav already enabled for us for quite a long time after local request.
    • Michel/Wahid: hopefully we are in transition and we'll be able to remove older protocols (RFIO, gridftp) in the future.
    • Long term: this could be simplified even more if redirectors become http based.
  • ALICE’s statement about EOS being the recommended solution for new storage requires some follow-up, outside GDB
    • There is ‘pressure’ on sites, including small ones, to move from Xrootd to EOS: this makes no sense as EOS has been designed for large/very large storage installations. No real benefit to be expected for small sites
    • Support capacity at CERN may quickly reach its limit: no enthusiasm about this request/recommendation
    • ALICE advice to use RAIN configuration makes no sense: this feature has not been used yet, even at CERN. Other sites must wait for validation of this feature in the CERN context (Meyrin/Wigner). It is not even clear whether a real benefit will result from this configuration...

EGI Federated Cloud Update – Peter Solagna

Task force activities started 2 yrs ago. Main goals: enable cloud services technical integration over EGI, and support user communities in porting applications to a federated cloud environment.

EGI's cloud infrastructure

  • Uses OCCI (VM management) and CDMI (storage management).
  • Integration: Uses GOCDB for cloud service registration; cloud sites publish to top-BDII; AA – integration of X509 (in Keystone for Openstack).
  • Accounting – uses Usage Record 2 for APEL
  • Monitoring – several probes available (OCCI, CDMI, BDII, accounting freshness…) using RFC proxies (so currently using alternative SAM).
  • Cloud specific services: AppDB – inc. mechanism for image distribution. Brokering via SlipStream extension – gives possibility for Helix Nebula interoperation

EGI cloud moves to production this month (5000 cores, 225TB storage).

  • Sites being certified: certification process is similar to the grid process (GOCDB reg., publishing and monitoring working, security assessment, passing SAM tests)
  • EGI SVG team and CSIRT prepared a security survey for Cloud providers
  • To be “production” means: certified; EGI policies endorsed; services monitored successfully during 3 days in a row
    • Downtimes and service problems monitored the same way (same workflow and tools) as grid services, tickets raised in case of problems
    • By participating, a site agrees on availability/reliability targets (same as for grid)
  • Proof-of-concept use cases confirmed on testbed.

Q&A:

  • Andrea: do you expect many grid sites to move resources to cloud?
    • In general, doubt there will be a full migration from grid to cloud. But some sites very interested.
  • Pepe: according to slides, Spain is contributing 3PB for the federated cloud infrastructure – so confused if they are offering what they have (already pledged to existing projects) or adding more.
    • Peter (PS): Numbers in slides may not be correct but did check.
    • PIC resources are provided for specific projects and are allocated.
    • JG: Is it any different from grid?
    • MJ: yes because the resources deployed are for projects that are not really using the EGI cloud infrastructure. What is the model for sustainable funding of resources in the federated cloud infrastructure? Current users of this infrastructure are not bringing any resource...
    • PS: Process for central allocation by NGIs, otherwise users must agree with RCs.
  • MJ: OCCI interface has been and remains controversial. At GRIF and Stratuslab we do not use it or have resource to develop it. The strong requirement means we have to fund development to build an interface for communities who do not provide resources when the communities that we support don't care about interface (they have the ability to speak to any cloud infrastructure with its native interface)
    • PS: it's true that groups approach cloud with existing interfaces. OCCI is currently developed for Openstack and OpenNebula, rOCCI should allow more to be added...
    • MJ: The real value of EGI Federated Cloud infrastructure is AA, monitoring, accounting, publication/discovery... The only requirement to participate in the infrastructure should be there. The OCCI requirement reduces the number of sites that could potentially participate...
    • PS: There is definitely room for discussion here.

IPv6 Workshop Summary – Andrea Sciaba

Experiment activities

  • ATLAS: Panda and Pilot development machines dual stack. NDGF T1 added dual stack storage for AGIS. Assume CE either IPv4 or 6, while SE will be IPv4 or dual-stack.
  • CMS: Use GridFTP to send files around full mesh. Expanding PhEDEx testbed to more storage technologies. IPv6 CEs at some sites. glideinWMS dual stack being tested.

Sites and services

  • CERN: DHCPv6 enabled on all nodes, including portables
  • Xrootd 4.0.0 RC1 released.
  • Plan to contact perfSONAR TF to use PS to monitor IPv6 links

Conclusions

Recent news from last week Vidyo meeting (mostly dedicated to June pre-GDB preparation)

  • dCache 2.9 allows IPv6 FTP transfers without having to proxy via FTP door
  • lxplus has a dual-stack sub-cluster (lxplus-ipv6): small today, can be increased according to the needs
  • NDGF testing dual-stack with SLURM (IPv4 only) and dCache 2.9
  • Imperial and Nebraska testing IPv6 with dual-stack BeStMan SE.
  • QMUL testing StoRM and PS.
  • CCIN2P3 joined the data transfer testbed.
  • Next Vidyo meeting on 27th May
  • Pre-GDB IPv6 workshop agenda now available: https://indico.cern.ch/event/313194/
    • All T1 sites should attend
    • Participating in this effort is considered as a contribution to the community and will be rewarded the same way as participation in the MW readiness verification

Q&A

  • When CERN says that it enabled DHCPv6 for all portables, does it mean they are dual-stacked?
    • Not really, this means that the IPv6 stack is configured if present. But it is not used by default: this requires an additional step
  • MJ: What does "ATLAS will work with the assumption of an IPv4 or IPv6 CE" mean? Is dual-stack a problem?
    • No, dual stack will work too. But it is not a requirement on CE as the CE is contacted from PanDA which is dual stack and can talk either IPv4 or IPv6 depending on the CE.
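The answer above relies on a dual-stack client being able to reach a CE over whichever address family the CE publishes. A minimal sketch of how a site could check what an endpoint advertises is given below. The function names, the default port and the sample addresses (documentation ranges) are assumptions for illustration and have nothing to do with the actual PanDA code.

```python
import ipaddress
import socket


def stack_type(addresses):
    """Classify IP address strings as ipv4-only, ipv6-only, dual-stack or none."""
    versions = set()
    for addr in addresses:
        # Drop any IPv6 zone id ("fe80::1%eth0") before parsing.
        versions.add(ipaddress.ip_address(addr.split("%")[0]).version)
    if versions == {4, 6}:
        return "dual-stack"
    if versions == {4}:
        return "ipv4-only"
    if versions == {6}:
        return "ipv6-only"
    return "none"


def resolve(host, port=443):
    """Return every address getaddrinfo reports for host, both families."""
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


if __name__ == "__main__":
    # Static examples using documentation addresses; no network access needed.
    print(stack_type(["192.0.2.10"]))
    print(stack_type(["2001:db8::10"]))
    print(stack_type(["192.0.2.10", "2001:db8::10"]))
```

Calling `stack_type(resolve("ce.example.org"))` with a real host name would report what that endpoint publishes in DNS as seen from the local resolver; it says nothing about whether the service actually listens on both families.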

-- MichelJouvin - 14 May 2014
