WLCG Technical Forum

Introduction

The WLCG Technical Forum has been set up for discussions among WLCG stakeholders about middleware and related matters, with a view to improving the reliability and efficiency of the WLCG infrastructures.

It can and should play a role both in resolving short-term issues and in devising longer-term strategies for phasing out certain components in favor of better technologies. In the latter case it would be desirable to increase the commonality between the experiments where that is practical. Industry or de-facto standards should generally be preferred over ad-hoc approaches.

Discussions are generally proposed on the mailing list (add cern.ch domain):

    wlcg-tech-forum

Only members can post, but anyone can apply for membership using the E-Groups interface. Subscribers outside CERN may first need to apply for an external account. Alternatively, send a message to (add cern.ch domain):

    Maarten.Litmaath

For any "promising" topic, a (usually short-lived) working group would be formed by forum members who are interested in and able to participate in creating a (short) working document describing the topic's current state of affairs and giving recommendations. Such documents and the associated discussion histories are hosted on a MediaWiki instance run by Andrew McNab at the University of Manchester:

https://wlcg-tf.hep.ac.uk/wiki/Main_Page

Membership

  • Experiments
    • Various representatives per experiment
  • Sites
    • Tier-0
    • All Tier-1 centers
    • Significant number of Tier-2 sites with different setups
  • Infrastructures
    • EGEE/EGI, OSG, ARC
  • Experts
    • Some resident, others consulted as needed

Topics proposed for discussion

At the EGEE'09 conference a presentation about the Technical Forum was given, with many examples of possible topics:

http://indico.cern.ch/materialDisplay.py?contribId=201&sessionId=23&materialId=slides&confId=55893

Generic

  • Error messages
  • Logging
  • Fault tolerance
  • Service stability and failover
  • Services should protect themselves against misuse
  • Failover can allow for transparent upgrades
  • Documentation
  • Collaboration and communication between gLite, EGEE, deployment, experiments: avoid duplication and surprises
  • Sometimes it seems we are too ambitious with what we ask from the sites
  • IPv6

Packaging

  • Dodgy packaging, requiring workarounds in YAIM
  • ETICS: difficult to get package sources, ARC does this right

Configuration

  • YAIM: example site-info.def does not show new variables
  • YAIM: understanding configuration problems is difficult, much better in ARC
  • Configuration complexity, customization
  • Site upgrades require long downtimes
  • SE downtime implies CE downtime
  • Change strategy for sites: stability vs. upgrades
  • Upgrade rollbacks often not possible

Data management

  • DPM: short and long-term future
  • SRM scalability
  • Rate of I/O errors
  • dCache administration and troubleshooting
  • dCache and DPM: different ports, different security layers
  • dCache: optimization
  • dCache: DCAP service instability and client recovery
  • Local protocols: client vs. server vs. application versions
  • Temporary unavailability of T0D1 files
  • SRM vs. data access patterns outside HEP
  • SRMv2.2: some complexities/issues due to lack of a standard for synchronizing the SE and experiment catalogues

Job management

  • Pilot jobs push problems from users to sysadmins
  • WMS: Condor-G not VOMS-aware
  • WMS: VOViews should be decisive where present
  • Shared area at many sites
  • Support for multiple SubClusters per CE
  • Tight connection of CE to Torque installation
  • Benchmarks: SW must be bought, results debatable
  • Reliability of jobs, random failures, much better in ARC
  • Virtual machines: agreement on remotely generated images
  • WMS/LB: operation has required a lot of effort
  • WMS/LB: should be stateless and load-balanceable
  • Pool accounts: scalability

Information system

  • Wrong information published by sites, making them appear attractive

Clients

  • gLite UI: difficult for users to install
  • Backward/forward compatibility of clients
  • Documentation, APIs and standardization, e.g. for Java clients

Further examples

Data management

  • Efficient, scalable data access by jobs -- main STEP'09 outcome!
    • Local vs. remote
    • Protocols
    • Throttling
    • T3 farms vs. T2 load
  • ACLs
  • Quotas
  • SRM
  • Xrootd
  • GPFS, Lustre, NFSv4, Hadoop, REDDNet, ...
    • File protocol
    • Clouds
  • Issues specific to some implementation(s)
    • BeStMan, CASTOR, dCache, DPM, StoRM

Job management

  • CREAM
  • WMS
  • ARC
  • Condor-G, Condor-C, GT4
  • MyProxy failover
  • Pilot jobs
    • gLExec
    • Frameworks
  • Virtualization
  • Clouds
  • Shared SW area scalability
    • ALICE: BitTorrent
  • PROOF

Security

  • Vulnerabilities
  • Consistency

Information system

  • Failover
  • GLUE 2.0

Monitoring

  • All jobs
  • Consistency, consolidation

Accounting

  • Messaging system
  • Storage

-- MaartenLitmaath - 2009-09-14

Topic revision: r6 - 2010-01-12 - MaartenLitmaath
 