Summary of GDB meeting, December 12, 2012


Welcome - M. Jouvin

No more GDB summaries in the future if there are no volunteers to take notes

  • Should not be a specialized role handled by 1 or 2 people

Operations Coordination

Introduction - G. Merino

All meeting minutes available on Twiki

  • 6 coordination meetings since September
  • F2F meeting at pre-GDB in January

Several dedicated task forces

  • SHA-2: need to test RFC-proxy readiness, test SHA-2 CA to be started at CERN in January
    • MW validation will be done with EGI
    • OSG ahead of EGI: preparing a compliant release very soon but still to be deployed. Bestman status still unclear (trick to be confirmed). Not really coordinated by TF.
    • dCache still not ready: waiting for an update (not clear they will go for the Bestman trick)
  • MW deployment: moving away from unsupported MW, migration to SL6
    • EMI-2 SL5 WN: ok for all VOs, tarball expected in January
    • EMI-2 SL6 WN: ATLAS not yet ready, other VOs ok. Sites should not upgrade if they support ATLAS
  • glexec: SAM test added to EGI ROC_OPERATIONS, failures raise alarms followed up by EGI/NGI
    • Goal: reach 75% by beginning of next year
    • VO: CMS ready, ATLAS/ALICE work to be done, LHCb needs to check if current support is ok
  • Tracking tools: email list created, people interested should subscribe
    • Hot topic: Savannah-JIRA migration + GGUS bridge
  • FTS3: functional tests ongoing for CMS and ATLAS, several FTS3 features demonstrated already, asked to install FTS3 clients in a public area
    • MySQL backend already available
  • xrootd
  • Squid monitoring: move monitoring of Squid as a WLCG responsibility rather than a VO responsibility
    • Currently using a tool developed at FNAL based on MRTG and awstats
    • Evaluating the possibility to use the Aggregated Topology Provider (ATP) as the source of configuration information (feeds available from OIM and GOCDB; an extension to GOCDB planned that may be enough)

Gonzalo moving to a new job, thanks for the work done

  • Probably need to find a replacement for Ops Coord

CVMFS - S. Roiser

Version 2.1.5 now in beta: shared cache and NFS export

  • NFS export is intended as a "last resort" configuration when there is no other option
  • Stable version foreseen in January
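
For reference, the NFS-export mode is enabled in the client configuration; a minimal sketch, assuming the 2.1.x parameter names (the repository list and proxy URL below are illustrative only):

```
# /etc/cvmfs/default.local -- hypothetical minimal client configuration
CVMFS_REPOSITORIES=atlas.cern.ch,lhcb.cern.ch     # repositories to mount (illustrative)
CVMFS_CACHE_BASE=/var/cache/cvmfs2                # shared local cache location
CVMFS_NFS_SOURCE=yes                              # enable the NFS-export ("last resort") mode
CVMFS_HTTP_PROXY="http://squid.example.org:3128"  # site Squid (illustrative)
```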

ALICE recently joined the TF: will move to CVMFS for SW deployment

  • Now all 4 VOs involved in TF
  • ALICE first targets the HLT farm, deployment at sites later next year after validation
  • ATLAS and LHCb set a target for full CVMFS deployment by end of April 2013
    • A GGUS ticket will be sent starting in Feb. for missing sites
    • No new SW deployment outside CVMFS at sites after the deadline

When the TF started, 97 sites were targeted

  • 26 already deployed it
  • 24 have planned to deploy it early 2013
  • 10 have plans but no date
  • 32 sites have not responded: GGUS tickets opened

perfSONAR - S. Campana

perfSONAR "requires" 2 different hosts for bandwidth and latency tests

All sites should run a perfSONAR instance, but deployment starts with the list of sites provided by the experiments

  • Sites grouped in regions or clouds

Until recently, the perfSONAR configuration was local to each instance: does not scale with a meshed configuration

  • Will move to a central configuration: feature available but not yet natively built into the current perfSONAR release: need to run a shell script to activate it
    • Will be in next release planned next quarter
    • Sites should not wait for this release to install perfSONAR
    • Existing sites are requested to move to the central configuration

Sites must register their perfSONAR instances (both) in GOCDB/OIM

Scope of tests

  • One mesh per region + OPN mesh
  • T2s to T1s: Each region against OPN
  • Inter-region: considered important by CMS and ATLAS
  • Bandwidth test frequency depends on the scope of the tests
    • Every 6h for intra-region/cloud tests

TF manpower mainly from IT, US, UK

  • Would like to have one person per region
    • 1 common to all experiments
    • Small FTE fraction: more a contact

Documentation being streamlined

A modular dashboard is available to display the results.

  • Code being moved from BNL to GitHub
  • Regions welcome to start their own instance

From January, larger deployment campaign will start.

From EGI: a GGUS support unit will start in the next month, handled by DANTE

  • Will be for perfSONAR-MDM but test interoperability not yet demonstrated
    • To be followed up in the next month

Actions Update

EMI Migration Update - P. Solagna

Sites still running unsupported MW today: 44 sites, 82 services, not including DPM/LFC/WN

  • BDII, CREAM, lcg CE, VOMS, WMS, LB, dCache
  • 26 sites had planned to migrate by end of Nov. but failed to do so
  • New deadline: Dec. 17. After this date, the impacted services must be put in "at risk" downtime
  • No site suspended yet: a downtime for unsupported services is considered enough if there is a plan to upgrade

DPM, LFC and WN now unsupported: alarms have been opened against sites since Dec. 11 (already 44)

  • WN status not clear after CE migration
  • Handled by ROD rather than COD
  • Ticket asks for a plan to be communicated in the next 10 days
  • Jan. 31 is the deadline for upgrading

Next steps

  • Add probes for ARC unsupported MW (prior to EMI-1)
    • 5 sites affected
  • Deployment of probes for EMI-1 products that will be obsolete end of April
    • Warning from start of January
    • Alarms from March 1

Requirements for GLUE2 - M. Alandes Pradillo

Monthly meeting between IS experts, experiments, EGI and OSG

  • First meeting last week
  • Twiki page to track discussions

GLUE2 deployment status: 10 more sites in EGI

GLUE2 validation: validator tool being upgraded to comply with EGI GLUE2 profile

  • Will be added as a Nagios probe
  • Longer term plan: make the GLUE validator part of the resource BDII: will prevent publication of invalid information
    • BDII will include it in next version but will take time to get it updated in all products
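
As a toy illustration of the kind of check such a validator performs (this is not the real glue-validator code), one can verify that mandatory GLUE2 attributes are published before accepting an entity; the attribute names below come from the GLUE2 schema, the entity values are invented:

```python
# Toy GLUE2 publication check: an entity is accepted only if all
# mandatory attributes are present (illustration, not glue-validator).
MANDATORY = {
    "GLUE2ServiceID",            # attribute names from the GLUE2 schema
    "GLUE2ServiceType",
    "GLUE2ServiceQualityLevel",
}

def validate(entity):
    """Return the sorted list of missing mandatory attributes (empty = valid)."""
    return sorted(MANDATORY - set(entity))

good = {"GLUE2ServiceID": "srm.example.org_1",     # example values, invented
        "GLUE2ServiceType": "SRM",
        "GLUE2ServiceQualityLevel": "production"}
bad = {"GLUE2ServiceID": "ce.example.org_1"}

print(validate(good), validate(bad))
```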

ginfo: extended use cases, not only service discovery

  • In contact with experiments to get feedback

Discussion ongoing about the information needed in the BDII, between static info provided by GOCDB and dynamic info provided by messaging.

OSG confirmed it has no request from its stakeholders: wants to see GLUE2 proven in EGI first

  • Will revise their plans in the future based on EGI's experience

Oracle at T1s

ATLAS - D. Barberis

3 databases

  • geometry, calibration data... : Oracle + FronTier
    • Confirmation from 4 T1s that they will continue to support it in the future. Will continue with these sites until end of RunII
    • DB releases will be phased out: direct use of FronTier when required
    • Condition data files now moved from HOTDISK to CVMFS
  • TagDB: reducing from 5 to 3 sites
    • Commitment from these 3 sites for 12-18 months: enough to develop a new backend
    • Longer term plan: use Hadoop instead of Oracle
  • Muon calibration centers
    • Calibration centers don't profit from CERN license
    • Revising plans to attempt to move to standard license (rather than the one including STREAM)
    • Also working on alternative plans

CMS - I. Fisk

Conditions DB accessed by FronTier

  • Oracle at T1 only as an insurance against catastrophic events: will reduce to 1 backup

Other usage at T1 is for FTS: will be happy to move to a new backend when available

  • FNAL is very interested in participating in the testing: FTS is the only Oracle use at FNAL

Some Oracle DBs at T0: may consider reducing the number if asked

  • Positive experience with MongoDB and CouchDB
  • PhEDEx is probably hard to move out of Oracle but only required at CERN


Tony: discussion with Oracle ongoing, no decision yet whether this will continue to cover T1 usage or not

  • CERN is trying to ensure that T1 will have conditions at least as good as today

DPM Community and Workshop - O. Keeble

DPM workshop last week at LAL: 25 people registered

  • First day was for product presentations and site/experiment feedback
  • Second day was more hands-on, with demos of new features: http/DAV interface, DMlite installation, memcache, new backends
  • An important milestone in the process of creating a DPM community

Site feedback about main issues

  • I/O concurrency limits
  • Draining performance
  • Rebalancing storage
  • Hot file replication
  • Metadata consistency
  • Inter-VO quotas
  • Redundant dpmdaemon
    • Nameserver redundancy can already be done
  • Some of these issues already (partly) addressed by the GridPP-maintained admin toolkit

GridFTP improvements planned to support redirection

  • Will allow SRM-less operations required by services like Globus Online


Configuration: Puppet use demonstrated; Puppet in standalone mode will be the future standard configuration tool, replacing YAIM

DPM collaboration: first meeting to establish the collaboration

  • Representatives from CERN, France, Japan, Italy, Taiwan, UK
  • Decided to proceed with establishment of the collaboration
  • Statement: see slides

ATLAS remark: sites are asked to configure/enable xrootd and http access but sites expressed some concerns about the complexity

  • Probably more information must be released to sites about what is involved exactly: additional RPMs required, config files to produce

Virtualized WNs: where are we? what next?

Final Report for HEPiX WG - T. Cass

Initial concern about root access is no longer a concern for many sites, e.g. CERN

  • Logs for traceability remain an important concern, in particular if the user has root access

Image endorsement model defined now endorsed by EGI Federated Clouds TF

Initial steps envisioned not really followed!

  • Step 1: users choose between images prepared by sites. In fact, virtualised WNs are transparent to users
  • Step 2: distribution of images between several sites. Has been demonstrated: technology is there.
    • CERNVM emerged as the main image
  • Step 4: VM images connect directly to pilot framework
    • Clearly one of the issues that may be worked on together in the future
    • Need to work out how to transmit VO credentials to images
    • Role of the pilot factory, impact on batch system needs
      • How to avoid queueing requests for VMs?


LHCb interested in working on direct use of cloud resources from DIRAC without going through batch systems

  • Would like to use multi-core VMs

ATLAS / BNL also interested in participating in the proposed work on step 4

Start an email discussion between interested persons and come up with more concrete plans for possible work together

  • Plan a pre-GDB on this topic in the coming month?

Software Defined Networks for Big-Data Sciences - I. Monga

Why is there a problem with big-data sciences requiring bulk data transfers?

  • TCP is the underlying transfer protocol but it is a "fragile" workhorse
    • Sensitive to packet loss
    • Prevents efficient use of the underlying infrastructure
    • 2 possible solutions: replace/change TCP, or layer 2 protocols like RoCE
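
The sensitivity to packet loss can be made concrete with the well-known Mathis et al. bound on steady-state TCP throughput, rate ≤ (MSS/RTT) · C/√p with C ≈ 1.22: at a 100 ms RTT, improving the loss rate by a factor 100 only raises the achievable rate by a factor 10:

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """Mathis et al. upper bound on steady-state TCP throughput:
    rate <= (MSS / RTT) * C / sqrt(p), with C ~ 1.22 for Reno-style TCP."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss_prob))

# 100 ms transatlantic RTT, standard 1460-byte MSS
for p in (1e-4, 1e-6):
    mbps = mathis_throughput_bps(1460, 0.1, p) / 1e6
    print(f"loss probability {p:g}: at most {mbps:.1f} Mb/s")
```

Even a 10^-6 loss probability caps a single stream well below the capacity of a 10 Gb/s path, which is why loss-free provisioning and virtual circuits matter.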

Whatever the solution, it will require

  • Enough capacity provisioning to provide a loss-free, high-bandwidth infrastructure
  • Fast lanes (virtual circuits) end-to-end
  • Big buffers
  • Efficient monitoring for proactive resolution of problems

OpenFlow opens new possibilities

  • Allows control of the data path (HW) from a central controller
  • Based on flow tables where entries define actions taken according to rules matching 11-field tuples
    • Main actions are forward to a port (possibly after some transformations), drop the packet (firewall), ...
    • Can also forward to the OpenFlow controller if no rule matches (or if a specific rule matches)
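
A toy model of the flow-table lookup described above (the field names are a small subset of the OpenFlow match tuple; this sketches the semantics, not a real switch):

```python
# Toy OpenFlow-style flow table: rules are (match, action) pairs checked
# in priority order; a table miss is forwarded to the controller.
def lookup(flow_table, packet):
    for match, action in flow_table:
        if all(packet.get(k) == v for k, v in match.items()):
            return action
    return ("controller",)  # table miss: send to the OpenFlow controller

table = [
    ({"ip_dst": "10.0.0.2", "tcp_dst": 80}, ("forward", 3)),
    ({"ip_dst": "10.0.0.2"},                ("drop",)),  # firewall-style rule
]

print(lookup(table, {"ip_dst": "10.0.0.2", "tcp_dst": 80}))  # first matching rule wins
print(lookup(table, {"ip_dst": "10.0.0.2", "tcp_dst": 22}))
print(lookup(table, {"ip_dst": "10.0.0.9"}))
```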

SDNs: provide a virtualized view of the global topology to applications (users)

  • Allows an application to define rules applied to its view of the network resources and have it distributed throughout the network
  • Example: ESnet can be presented to apps as a WAN virtual switch interconnecting a few sites
  • Demonstrated at last SC with a 3-port VirtualSwitch
    • Specify edge OF ports
    • Specify backbone topology and bandwidth: was done using the OSCARS mesh
    • Policy constraints like flowspace
    • Store the switch in a topology service

Future work

  • Harden architecture and implementation: move from experiment to test service
  • Verify scaling of the model, in particular size limit of a virtual switch
  • Automation and intelligent provisioning

MW Provisioning Lifecycle - M. Schulz

Feedback received shows the proposed approach is well accepted

No more central funding means all efforts become voluntary

  • The fact that the infrastructure depends on it doesn't change this
  • Hopefully, most institutions involved are part of WLCG and committed to its success

Role of EGI, future open-ended EMI consortium, ScienceSoft

Scope: main focus is EMI components with gLite heritage

  • OSG historically managed their MW independently, but OSG MW is now closer to ours
  • ARC: independent, no WN, not clear how much coupling is needed

Centrally managed rollback idea is rejected: no more discussion on how to do it

  • Recipes will be published on how to do it for sites that may need it
  • Site admins are generally not comfortable with the rollback idea

Versioned metapackages also rejected: will keep only unversioned metapackages

  • Just describe the dependencies, not their versions
  • Will evolve very slowly
  • Does not help with the baseline versions required for some services
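
The unversioned-metapackage idea can be sketched as an RPM spec fragment that only names its dependencies, without version constraints (all package names below are illustrative, not real EMI packages):

```
Name:     example-emi-wn
Version:  1.0.0
Summary:  Worker Node metapackage: dependency list only

# Unversioned dependencies: the metapackage describes *what* a WN needs,
# not which versions, so it evolves very slowly.
Requires: example-wn-clients
Requires: example-voms-clients
Requires: example-info-provider
```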

OS platform at the discretion of PTs as long as RH and its derivatives are covered

  • Coordination only required for OS major version changes

Configuration management: YAIM-core support ending soon and it is not clear what the alternative is for small sites

  • DPM will try the idea of a "YAIM wrapper" to Puppet
  • Transition period required to evaluate the appropriate solution

Pilot services: not necessary for most PTs

  • In particular storage tends to be covered by product communities or developers (FTS3)
  • Main issue is probably the CE

Support: not all PTs are committed to provide level 3 support through GGUS

  • EGI will do level 1 and 2
  • GGUS must remain the place for tracking change requests: well known, no real alternative

Other questions

  • MW discussions in WLCG as pre-GDB a couple of times each year
  • Impact on other sciences: to be managed by PTs and their funding bodies
  • Communication with PTs: they are asked to connect to pre-GDB or GDB when the plans for their products are discussed
  • EPEL is strong on integration, weak on testing: the move of material from epel-testing to epel-stable can be time-driven
    • Need to put emphasis on automatic testing
    • Need also to do some real world testing involving a few sites, as it was done for EMI2 WN

A new version of the proposal will be released soon, taking into account the feedback received.

NDGF Update - O. Smirnova

NDGF T1 now part of NeIC, which is an initiative of NordForsk

  • NeIC director: Gudmund Host
  • HW and SW contributed by users

NDGF T1 main roles

  • Overall coordinator: M. Wadenstein
  • CERN liaison: Oxana
  • Security officer: Leif Nixon
  • Sites participating to T1 also host T2 resources
  • Now extends outside the Nordic countries: Slovenia, Switzerland
  • MW from EMI
  • Huge variety of HW/SW configuration
  • LHCOPN extending to all sites with T1 storage
  • Pledges: 10% of ALICE T1 resources, 5% of ATLAS T1 resources

Topic revision: r1 - 2012-12-12 - MichelJouvin