WLCG Technical Forum 2009
Introduction
The WLCG Technical Forum has been set up for discussions between WLCG
stakeholders about middleware etc. in view of improving the reliability
and efficiency of the WLCG infrastructures.
It can and should play a role both in the resolution of short-term issues
and in devising longer-term strategies for phasing out certain components
in favor of better technologies. In the latter case it would be desirable
to increase the commonality between the experiments, where that is practical.
Industry or de-facto standards generally should be preferred over ad-hoc
approaches.
Discussions are generally proposed on the mailing list (add
cern.ch
domain):
wlcg-tech-forum
Only members can post, but anyone can apply for membership using the
E-Groups
interface. Subscribers outside CERN
may need to apply first for an
external account
.
Or send a message to (add
cern.ch
domain):
Maarten.Litmaath
For any "promising" topic a (usually short-lived) working group would be formed
by forum members who are interested and able to participate in creating a (short)
working document with the topic's current state of affairs and recommendations.
Such documents and associated discussion histories are hosted on a MediaWiki
instance run by Andrew McNab at the University of Manchester:
https://wlcg-tf.hep.ac.uk/wiki/Main_Page
Membership
- Experiments
- Various representatives per experiment
- Sites
- Tier-0
- All Tier-1 centers
- Significant number of Tier-2 sites with different setups
- Infrastructures
- Experts
- Some resident, others consulted as needed
Topics proposed for discussion
At the EGEE'09 conference a presentation was given about the Technical Forum
with many examples of possible topics:
http://indico.cern.ch/materialDisplay.py?contribId=201&sessionId=23&materialId=slides&confId=55893
Generic
- Error messages
- Logging
- Fault tolerance
- Service stability and failover
- Services should protect themselves against misuse
- Failover can allow for transparent upgrades
- Documentation
- Collaboration and communication between gLite, EGEE, deployment, experiments: avoid duplication and surprises
- Sometimes it seems we are too ambitious with what we ask from the sites
- IPv6
Packaging
- Dodgy packaging, requiring workarounds in YAIM
- ETICS: difficult to get package sources, ARC does this right
Configuration
- YAIM: example site-info.def does not show new variables
- YAIM: understanding configuration problems is difficult, much better in ARC
- Configuration complexity, customization
- Site upgrades require long downtimes
- SE downtime implies CE downtime
- Change strategy for sites: stability vs. upgrades
- Upgrade rollbacks often not possible
Data management
- DPM: short and long-term future
- SRM scalability
- Rate of I/O errors
- dCache administration and troubleshooting
- dCache and DPM: different ports, different security layers
- dCache: optimization
- dCache: DCAP service instability and client recovery
- Local protocols: client vs. server vs. application versions
- Temporary unavailability of T0D1 files
- SRM vs. data access patterns outside HEP
- SRMv2.2: some complexities/issues due to lack of a standard for synchronizing the SE and experiment catalogues
Job management
- Pilot jobs push problems from users to sysadmins
- WMS: Condor-G not VOMS-aware
- WMS: VOViews should be decisive where present
- Shared area at many sites
- Support for multiple SubClusters per CE
- Tight connection of CE to Torque installation
- Benchmarks: SW must be bought, results debatable
- Reliability of jobs, random failures, much better in ARC
- Virtual machines: agreement on remotely generated images
- WMS/LB: operation has required a lot of effort
- WMS/LB: should be stateless and load-balanceable
- Pool accounts: scalability
Information system
- Wrong info published by sites making them attractive
Clients
- gLite UI: difficult for users to install
- Backward/forward compatibility of clients
- Documentation, APIs and standardization, e.g. for Java clients
Further examples
Data management
- Efficient, scalable data access by jobs -- main STEP'09 outcome!
- Local vs. remote
- Protocols
- Throttling
- T3 farms vs. T2 load
- ACLs
- Quotas
- SRM
- Xrootd
- GPFS, Lustre, NFSv4, Hadoop, REDDNet, ...
- Issues specific to some implementation(s)
- BeStMan, CASTOR, dCache, DPM, StoRM
Job management
- CREAM
- WMS
- ARC
- Condor-G,-C, GT4
- MyProxy failover
- Pilot jobs
- Virtualization
- Clouds
- Shared SW area scalability
- PROOF
Security
- Vulnerabilities
- Consistency
Information system
Monitoring
- All jobs
- Consistency, consolidation
Accounting
--
MaartenLitmaath - 2009-09-14