Summary of GDB meeting, October 9, 2013 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=251190

Welcome - M. Jouvin

September summary only available today: apologies for being late.

  • Reminder: Need more volunteers.
  • Today: thanks to Oliver for agreeing to take the notes

Future GDBs until end of 2014: second Wednesday of each month, except in January

  • January moved to 15th because of a clash with CERN Director's New Year speech
  • 2014 events created so please check for clashes.

Next pre-GDBs

  • Possible topics: review of cloud activities, batch system support and ops coord F2F.
    • Ops Coord F2F meeting will probably be in February
  • Other suggestions welcome

Actions in progress other than those followed by Ops Coord

  • Storage accounting: update planned at December GDB
  • Site Nagios testing: more sites needed
  • Handling jobs with high memory requirements: no feedback received but still a potential problem
    • ALICE issue when heavy ion running takes place.
    • ATLAS has a workflow for jobs needing more memory than stated on the VO ID card: specific sites have agreed to this
    • CMS is similar: a small number of jobs have these requirements and special arrangements are made for them.
    • LHCb: RSS and virtual memory issue not solved yet. LHCb has written needs into VO ID card.
    • JT: Some VOs outside LHC world with similar issues. Best to solve at the batch system level. Could present at a pre-GDB our approach and try to see if it would fit in CREAM-CE. LHCb liked our approach. Counts all of process requirements.

EGI TF

  • Strong focus on federated cloud infrastructure.
  • EGI leading role in operating DCIs in Europe recognised.
  • Ops coordination with WLCG improved.
  • Transition period coming with end of EGI-Inspire in 6 months and S. Newhouse (director) leaving
    • Some hope to see EGI-Inspire extended...

Simone Campana takes over Maria’s role as she is becoming the new CMS Computing Coordinator in January

  • Andrea Sciaba will be Simone's deputy

Data Preservation - J. Shiers

More on DP at CHEP next week: introductory talk + DP workshop

DPHEP Implementation Board set up: similar to WLCG GDB/MB

  • Public agendas in Indico
  • Twitter

DP is more than HEP: many projects/disciplines, can benefit a lot from collaboration

  • Some other communities more advanced
  • Jamie involved in several coordinations around these projects or related
  • High level strategy with others (projects, funding agencies): make them aware of us, clarify what we can offer

DP may have implications for services

  • Site representatives should participate in the cost evaluation

Concentrate on use cases (motivations and costs): 3 identified for HEP

  • Long tail of papers after the end of an experiment, requiring access to data
  • New theoretical insights requiring reprocessing/reanalyzing the data
  • Should preserve data forever just in case: no clear business case

Translate into scenarios and evaluate the cost: 1 decade preservation, 2 decades, 3 decades

  • Planning a workshop to estimate the costs of curation (January)
  • As input, look at many migrations we have performed in the past (Linux, Objectivity...)
  • Take into account media migration required during the preservation period and the OS changes...
  • Manpower expected to be the dominant cost
  • Cost foreseen as affordable as long as there is a valid/strong business case
  • Identify how to optimize the cost through better coordination and sharing of efforts
  • See if we can make our data more "preservational" by adapting the way we work today

Data Management

Future Directions for DM Clients - A. Alvarez Ayllon

GFAL2: a replacement for GFAL allowing grid and cloud operations

  • Addressing the shortcomings of GFAL: error handling, extensibility
  • No requirement to use an information system (disabled by default but still possible)
  • Session reuse
  • More protocols supported
  • Already used by FTS3

A new set of CLI interfaces to GFAL2 to replace lcg_util: GFAL2-utils

  • Drop support for LFC? Will affect lcg-cr and catalog specific CLI (lcg-aa, lcg-lg...)
    • Either not used by WLCG exps (CMS) or not used through lcg-xx commands (ATLAS/LHCb)
    • Impact on other VOs? would use plugin level not command line or the LFC CLI (all lcg-xx commands having a matching LFC CLI command? to be checked)
    • LFC will remain usable as one of the protocols supported to access a file
  • Other commands (in addition to LFC-related ones) with no replacement planned in GFAL2-utils: lcg-get-checksum, lcg-getturls, lcg-gt, lcg-stmd (space token mgmt)
  • Comments or complaints go to lcgutil-devel@cern.ch; all documented in the wiki.

Python API not yet complete but will expose most of the C API entries

  • Philippe: definitely need the getturl functionality at the Python level

Proposal

  • Freeze development of GFAL: support until end of next year only for critical bugs
  • Release in EPEL5 and EPEL6 for GFAL and GFAL2 only: will be removed from EMI-2 and EMI-3
  • Remove gfal/lcg_util from EPEL after the proposed/agreed end of life

Experiment feedback

  • ATLAS: seems doable... but need to sort out the impact of utils not ported
  • CMS and ATLAS: would like to see the new clients deployed in CVMFS...
    • CMS experts for DM clients absent: need to check details with them
  • No major objection from experiments: more precise feedback from experiments expected by end of November, to come up with a finalized migration plan at December GDB
  • Need to inform and get proper feedback from non-LHC VOs: to be done by EGI?

Discussion

  • Helge: at CERN we have had problems with EPEL and EMI repos getting in each other's way. If you cannot sync the withdrawal from EMI repos, please be sensitive to such issues.
    • Oliver: Removal has already started. Done for DPM. Can we ask experiment reps to check internally and the gfal team to assemble the feedback, so that we can conclude on what we have got?

FTS3: Entering Production, Future Plans - M. Salichos

FTS3 released in EPEL6.

Status

  • Protocols supported: SRM, GridFTP, http, xroot
  • DB : MySQL + Oracle
  • Clients: FTS2 compat, FTS3 cli with new features, REST API

FTS3 entering production: heavily used by ATLAS for prod transfers

  • Some activity also in CMS and LHCb
    • LHCb started to use FTS3 for all prod transfers yesterday
  • Some non-LHC VOs already started evaluating FTS3, including EUDAT with gridftp (to dCache or iRODS)

WLCG FTS3 TF still actively involved but developers would like to reduce the frequency of demos to 1/month (or 6 weeks)

8 instances installed now, all of them except one using MySQL

  • RAL succeeded in transferring 300 TB per day with its instance
  • Stats from the last week: FTS2 transferred 4.6 PB/8.7 Mfiles, FTS3 1.8 PB/1.3 Mfiles
    • 8 VMs (4 at CERN, 4 at RAL) compared to 38 for FTS2
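
As a rough reading of the figures above, the sketch below derives the average file size, the weekly volume per VM, and the sustained rate implied by 300 TB per day. It assumes decimal units (1 PB = 10^15 bytes), that "Mfiles" means 10^6 files, and that the VM counts apply to the same week as the transfer statistics; none of this is stated in the minutes, and the function names are purely illustrative.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the minutes.
# Assumptions (not stated in the minutes): decimal units, 1 Mfiles = 1e6 files.

PB = 1e15
TB = 1e12
SECONDS_PER_DAY = 86_400


def avg_file_size_mb(volume_pb, mfiles):
    """Average file size in MB for a given volume (PB) and file count (millions)."""
    return volume_pb * PB / (mfiles * 1e6) / 1e6


def volume_per_vm_tb(volume_pb, vms):
    """Volume handled per VM in TB, assuming an even spread across VMs."""
    return volume_pb * PB / vms / TB


def sustained_gbit_per_s(tb_per_day):
    """Average network rate in Gbit/s needed to move tb_per_day in 24 hours."""
    return tb_per_day * TB * 8 / SECONDS_PER_DAY / 1e9


# Weekly statistics and VM counts as quoted above.
weekly = {
    "FTS2": {"volume_pb": 4.6, "mfiles": 8.7, "vms": 38},
    "FTS3": {"volume_pb": 1.8, "mfiles": 1.3, "vms": 8},
}

if __name__ == "__main__":
    for name, s in weekly.items():
        size = avg_file_size_mb(s["volume_pb"], s["mfiles"])
        per_vm = volume_per_vm_tb(s["volume_pb"], s["vms"])
        print(f"{name}: {size:.0f} MB/file on average, {per_vm:.0f} TB per VM")
    print(f"300 TB/day is about {sustained_gbit_per_s(300):.1f} Gbit/s sustained")
```

Under these assumptions the average file handled by FTS3 that week was larger than the average FTS2 file, so the file counts and the volumes of the two services are not directly comparable.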

FTS3 monitoring: 3 solutions

  • Transfer dashboard
  • Standalone monitoring service
  • Nagios probe (not yet released)

Main issues seen during the last year mostly connected to DB tuning and usage: most problems fixed

Testing to be done

  • Have not yet reached 1M files per day on 1 instance
  • New features implemented in FTS3 like session reuse

Missing features to be implemented in the future

  • Multi-instance VO-shares
  • Multi-hop transfers
  • Integration with perfSonar
  • Activity fair-share inside a VO
  • Look at https://svnweb.cern.ch/trac/fts3/roadmap for details
  • Would be nice to have any other requirements from experiments as soon as possible
    • In particular those that may require database schema changes, to avoid too many downtimes later.

Dev team confident that FTS3 can handle the load of FTS2 with only one instance but exact configuration to be deployed must be discussed with the exps.

  • VO-specific instances vs. global instance: everything possible, let's start with the simplest configuration (global instance), always possible to evolve

Philippe: the big advantage of FTS3 is its simple configuration; would like this to remain so in the future rather than implementing very complex use cases in FTS3

Oliver: need to discuss with experiments the migration to new FTS3 clients and the obsolescence of FTS2 ones

  • Rucio already using it, LHCb will start working on it soon
    • LHCb: current situation makes reverting to FTS2 easier until FTS3 is fully validated
  • CMS: experts to be asked

Impact on non WLCG VOs: work well advertized in EGI

Oliver must send Michel a pointer to documentation of changes compared to FTS2.

Discussion

  • Chris - can a site restrict the bandwidth used by FTS? Not directly
  • NDGF - we used FTS3 and it's the first time our 10Gb link was filled up.

Actions in Progress

Ops Coord Report - J. Flix

New Ops Coord leader: S. Campana

  • Deputy: A. Sciaba

MW baseline

  • New EMI-3 StoRM version
  • Critical bug affecting top BDII: fix announced via EGI Broadcast
  • DPM 1.8.7 released to EPEL
  • gfal/lcg_util bug fix released to EPEL

glexec: 46 sites verified, 48 in progress

  • Often coupled with SL6 migration

SL6

  • T1: 10/16 completed, 3 in progress, 3 with a plan
  • T2: 81/130, 15 to complete in time to meet the deadline, 30 not replied yet
  • HEPSPEC06 for SL6 is being filled: sites encouraged to publish their results
  • HEP_OSLibs new version 1.0.13: sites must upgrade at their convenience

SHA-2

  • CERN VOMS servers should become SHA-2 compliant in the next month
  • CERN plans to move to a SHA-2 certificate for the VOMS host certificate in the coming month, inducing a DN change
    • Concerns after the problem that affected BNL VOMS server after a similar change: plan is to deploy a third server and remove the old one when it has been properly configured everywhere

VOMRS

  • Need experiments to test the new VOMS-Admin
  • No progress seen on this in the last months

CVMFS

  • ALICE: good progress, 50% sites done
  • CMS: not yet completed despite the deadline set to Oct. 1st
  • LHCb: conddb repo can be removed, no longer used

Machine/job features

  • Now really active
  • Meetings and minutes in Indico
  • Developed a tool mjf.py returning the machine/job information
    • Tested at CERN
    • Next step: packaging for wider deployment
  • Open question: how to extract HS06 in VMs?
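
To illustrate what a tool like mjf.py might do, here is a minimal sketch. It assumes the convention discussed in the machine/job features work, where the environment variables MACHINEFEATURES and JOBFEATURES each point to a directory holding one small file per key, the file content being the value. That layout, the function name read_features and the key names are assumptions made for illustration; this is not a description of the actual mjf.py.

```python
import os
from pathlib import Path


def read_features(env_var, environ=None):
    """Return {key: value} read from the directory named by env_var.

    Assumed layout (hypothetical): one file per key, content is the value.
    Returns an empty dict if the variable is unset or is not a directory.
    """
    environ = os.environ if environ is None else environ
    location = environ.get(env_var)
    if not location:
        return {}
    directory = Path(location)
    if not directory.is_dir():
        return {}
    features = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            # Strip the trailing newline that such files usually carry.
            features[entry.name] = entry.read_text().strip()
    return features


if __name__ == "__main__":
    for var in ("MACHINEFEATURES", "JOBFEATURES"):
        for key, value in read_features(var).items():
            print(f"{var}.{key} = {value}")
```

Under this assumed layout a payload would look up a single value with something like read_features("MACHINEFEATURES").get("hs06"), where the key name "hs06" is again hypothetical.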

perfSONAR

  • All sites must upgrade to 3.3.1

IPv6 validation and deployment

  • Concentrate on assessing that moving to dual stack doesn't break existing IPv4 services
  • Joint work with HEPiX WG: sites invited to join the testbed and the WG

XrootD monitoring deployment

  • Detailed monitoring of dCache instances now available

FTS3

  • See previous talk
  • Plan is to support the whole WLCG load on 1 instance (with 1 failover)

Tracking tools

  • Important meetings these weeks, more next month

MW readiness verification

  • Maarten is TF convenor
  • A twiki page created, an email being created
  • Membership still to be defined

WMS decommissioning TF just created

  • CERN would like to assess the future usage of WMS and see if decommissioning could be planned at some point

Next Ops Coord meeting clashing with CHEP: postponed to the week after (Oct. 24)

HEPiX Puppet WG - B. Jones

Config tools used at sites are evolving: the last survey by EGI showed that Puppet is now as widely used as Quattor, with several sites that have no config tool planning to look at it.

  • Growing user community in EGI and a huge wider community
  • CERN working on and publishing modules on GitHub
    • Every grid service except VOMS has a Puppet module. Half probably published. Some work out of the box and some don’t. Some need testing for hard coding. Having them on GitHub will help get them fixed for other sites.

WG goals: share information, experiences and code amongst sites using puppet

  • Help with possible migration paths for products where YAIM future is unclear
  • Several collaboration options: central point for documentation and support, more formal collaboration on modules, a full suite of HEP modules, publishing to the Puppet forge
    • Is Puppet forge too formal?
    • DPM dev team publishing its modules to Puppet forge

WG status

  • 30 subscribers to the list
  • Currently documenting modules available for EMI-2 and EMI-3

Would like to receive feedback from WLCG sites

  • They are encouraged to join the WG if interested in Puppet

Discussion

  • Jeremy: UK has several sites involved in a semi-active group. Found that there are several ways of doing things but do not yet have a consensus on ways forward but good to get the discussion started. Will be very useful to have a WLCG wide community effort sharing experiences and modules.
    • Helge gives assurance that contributions back to CERN modules will be accepted after a careful review, to avoid the community splintering.
  • Maarten: Are there any signs in Quattor community that sites would be interested in moving to Puppet?
    • Michel: A Quattor workshop was held 2 weeks ago. Not a very big community but still active, and no site is looking at moving in the short term. Keeping an eye on events: some may want to move in the future, but the critical feature is the availability of a service description maintained by the community, one of the great Quattor successes.

WLCG IS - M. Alandes Pradillo

BDII distribution

  • EMI-2 and EMI-3 repository: versions aligned
  • UMD: UMD-2 still has the old version, to be updated soon
  • EPEL5 and EPEL6: only the resource BDII
    • Other packages by the end of the year

WLCG baseline updated recently: 5.2.21-1

  • Still many sites with the previous version

GLUE2 validation

  • MW: now done, still some consistency issues for storage attributes between different implementations (see slides)
  • Sites: still quite a lot of errors due to publishing of obsolete attributes, should be reduced with last version of BDII
    • Also many errors due to MW issues: followed up with developers
  • GLUE2-validator will be used by EGI in a Nagios probe

New webpage with all information related to information system: gridinfo.web.cern.ch

  • Sysadmins, users, developers...

GLUE1 retirement plan (EGI)

  • 2014Q1: assess/fix MW clients for correct work with GLUE2 data
  • May 1, 2014: decommission GLUE1 (doesn't mean it will not be published anymore but new services like cloud will not appear)
    • Will require using ginfo rather than lcg-info or lcg-infosites but ginfo is not working with GLUE1
    • Impact on OSG? To be followed up by Ops Coord TF

WLCG IS service being deployed but still in a prototype phase

  • Progress may be presented at a future GDB

Discussion

  • MJ: Retirement and impact on OSG. Long standing issue. Is there any progress?
    • MAP: No. No plans in OSG to use Glue2. Perhaps
    • ML: when the experiments need it then effort will be provided. That was the management response from OSG.
    • Lothar Bauerdick (LB): There is no request for Glue2. Thinking about having a compatibility layer to cover those things we got from Glue1. Not planning to make a transition, but this may be revised if something really depends on GLUE2. Our general direction is to reduce dependence on global information services.
    • MAP: Glue1 will remain but will not fix it.
    • SB: The key thing is that new developments will be in Glue2. EGI is looking at publishing Cloud services and these will only be in Glue2.
    • MAP: In next TF meeting we’ll discuss the VO feedback on the use of ginfo.
    • MAP: ginfo will not work with OSG resources as it uses Glue2.

Storage WG Report - W. Bhimji

Storage interfaces: now concentrating on replacement interfaces for disk-only sites (dav, xrootd)

  • ATLAS exploring dav deletion with Rucio
  • ATLAS also testing http put with spacetokens for stageouts
  • LHCb integrating with xroot and http/dav
  • Need to ensure that both xroot and http/dav are registered in GOCDB in the near future
  • Probably time for setting up a deadline for RFIO retirement: currently progressing slowly

Space tokens: ATLAS use decreasing and could probably live without them (through implementation of Rucio quota)

  • Need to ensure non-ATLAS ST use is covered too

Benchmark/IO pattern activity continuing.

  • Mainly based on EOS. Harvesting log info proving interesting.

WAN transfers:

  • ALICE has a lot of interesting real data in MonAlisa: failover to remote site working well
  • CMS AAA: 2/7 T1, 39/51 T2
    • Main goal is failover and support of diskless sites

EOS now has WebDav and http support

Davix: toolkit for optimized remote I/O

  • Not another http library
  • Support all http based protocols (S3, WebDav, CDMI)

A future F2F meeting as a pre-GDB with a wider attendance would be useful

Discussion

  • Michel: on adding xrootd to GOCDB, I thought this was decided.
    • Wahid: If it is then not all sites are doing it, even the ones being used.
  • Michel: on rfio decommissioning, what is the current usage by experiments? Can we get a status on the use of rfio (i.e. the move to xrootd as default access protocol)?
    • Alessandro: ATLAS is taking a relaxed approach: not many sites left which have rfio as preferred access protocol.
  • Maarten - can we move to disabling rfio by default at some time in the future?
    • Oliver: DPM team to understand site implications (ie can rfio be turned off for access but not for internal use?)
Topic revision: r1 - 2013-10-14 - MichelJouvin
 