Summary of GDB meeting, September 11, 2013 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=251189

Welcome - M. Jouvin

GDBs scheduled for second Wednesday in October and December.

  • No GDB in November owing to WLCG workshop.

2014 GDB dates created in indico (usual 2nd Wednesday) - please check for clashes.

  • What to do about January GDB? Nominally due 11th January but may be difficult to arrange speakers [no views expressed about this].

Maria Girone will take over from Ian Fisk as CMS computing coordinator. The GDB thanked Maria for her dedication to WLCG and wished her good luck in her new role.

  • No information yet on who will replace Maria in the WLCG coordination role; an announcement is expected soon but maybe not before the October GDB.

WLCG Monitoring Consolidation - Pablo Saiz

Pablo presented CERN's plans for monitoring consolidation. The key objective is to reduce costs (owing to the expected end of EGI-InSPIRE funding) by:

  • Reducing complexity, modular design
  • Simplifying operations, support and service
  • Common development and deployment - unify components where possible

Main improvement/simplification directions

  • Remove unused applications: identified with experiments, most apps used!
  • Reduce scope: centralized deployment of monitoring, SAM central service used by EGI transferred outside CERN, remove OPS monitoring
  • Modular design: for example, ensure that there is only one storage service used by all components
    • Currently a lot of interdependencies and duplication
  • Deployment: use OpenStack/Agile Infrastructure
  • Storage: evaluating ElasticSearch (NoSQL) rather than Oracle
    • Waiting for next version of ElasticSearch (end of the year)
    • Storage service based on a SSB-like metric store
  • Merging applications
  • Common metric store: base solution already existing (SSB) and already used heavily by ATLAS and CMS

Plan is to have a new prototype monitoring framework in place by the end of 2013 and all monitoring transitioned to the new framework by summer 2014.

  • First prototype set up with a SSB instance fed with SAM results and SSB web interface
  • Working on aggregation and availability calculation
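
As an illustration of what an aggregation and availability calculation can look like, the following is a minimal sketch and not the SAM or SSB algorithm. It assumes that a site counts as available in a time bin only if every monitored service passes in that bin, and that availability is the fraction of known bins in which the site is available. The service names, status values and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical status values for one test result in one time bin.
OK, FAIL, UNKNOWN = "OK", "FAIL", "UNKNOWN"


def aggregate_bin(statuses: List[str]) -> str:
    """Combine per-service statuses for one time bin into a site status."""
    if FAIL in statuses:
        return FAIL          # any failing service makes the site fail
    if UNKNOWN in statuses:
        return UNKNOWN       # no failure, but incomplete information
    return OK


def site_availability(results: Dict[str, List[str]]) -> float:
    """Fraction of known time bins in which all services were OK."""
    bins = zip(*results.values())            # one tuple of statuses per bin
    site = [aggregate_bin(list(b)) for b in bins]
    known = [s for s in site if s != UNKNOWN]
    if not known:
        raise ValueError("no known time bins to compute availability from")
    return known.count(OK) / len(known)


# Hypothetical input: two services, four time bins each.
example = {
    "CE": [OK, OK, FAIL, OK],
    "SE": [OK, UNKNOWN, OK, OK],
}
availability = site_availability(example)
print(f"site availability: {availability:.2f}")
```

Excluding unknown bins from the denominator is one possible design choice; a real implementation would also need to define which services are critical for each experiment.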

A review of experiment requirements for monitoring applications has been carried out and a number identified as no longer required. Pablo also asked if there was any ongoing requirement for OPS VO monitoring given that it no longer featured in WLCG reporting.

Discussion:

  • Where did the site naming scheme come from? Answ: Came from the experiments.
  • How would individual services be aggregated (IP addresses)? Answ: Experiments say which site components they monitor.
  • Jeff concerned that sites had not been sufficiently involved. There was particular concern that some sites depended on Nagios Site Monitoring - was this within the scope of this review?
    • Pablo reported that engagement with the sites had been carried out through the WLCG Operations Liaison meeting. Was not clear however if sufficient feedback had been obtained.

  • Can sites still use Nagios to import/view WLCG monitoring results? Answ: No - only for Nagios SAM tests.
    • Several sites surprised by this. Alessandra noted that some sites didn't use Nagios as Nagios probes were not critical for availability.
    • Simone noted that ATLAS was only asked to comment from an experiment's PoV (not sites').
    • The working group needs another iteration that is focussed on sites. Michel stated that these directions regarding Nagios were somewhat at odds with the July discussion and that more input is needed from the sites (interaction occurred mainly during the summer) - do we want to pursue Site Nagios?

Alice SW Evolution - Predrag Buncic

Predrag presented Alice's plans for evolving its existing 15 year old "monolithic" software framework. The main motivation is to converge the online and offline reconstruction in order to handle the factor 100 higher Pb-Pb interaction rate expected in Run 3.

  • Run3 will use a continuous readout

Alice still suffering from I/O performance issues for analysis despite closely working with I/O developers.

  • The elaborate OO data model is not necessarily the most efficient.
  • Need to look again at this in order to better optimise I/O. Expect to have to cope with a factor of 10-100 in terms of file transaction rate beyond run 2.

Simulation: need to migrate from Geant3 to Geant4 but G4 is 2x slower for ALICE.

  • Need to work with G4 experts on performance: expect to profit from future G4 developments related to performance (multithreading, GPU support...)
  • Already improved this G4 penalty by 60% since the beginning of the year

Alice is moving to CVMFS to replace torrent. Alice is currently using CVMFS at two sites; 35 Alice sites have CVMFS and 17 sites still need CVMFS to be set up.

  • Alice thanked sites for their quick response.

Looking for more synergies with other experiments: many commonalities in problems to solve

  • Example: file catalog

Discussion:

  • Is Alien file catalogue scalable? Answ: No - plan not to move off Alien, just implement new catalogue
  • Michel: Useful to have update in 6 months time.

ATLAS WebDav Plans - Cedric Serfon

Cedric explained how ATLAS DQ2 was evolving towards the new RUCIO system.

  • ATLAS moving to open and widely used protocols where possible.
  • WEBDAV of interest to ATLAS as it is available on over 90% of WLCG SEs. Currently using it to manage the large scale renaming of files necessary in the migration to RUCIO.
  • File renaming campaign underway - will need to rename approx. 300 million files across WLCG, transparently and without disrupting user analysis or MC production.
    • Renaming rate of 10Hz achievable at distant sites (such as TRIUMF) with large RTT
    • Managing 30Hz at close sites (such as IN2P3). 5 Tier-1s already renamed and overall 52% of files in the LFC now renamed. The goal is to finish the campaign by the end of the year.
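
To put the quoted rates in perspective, the following back-of-the-envelope sketch estimates how long the renaming would take if all files had to pass through a single stream at 10Hz or 30Hz. This is an assumption made purely for illustration: the minutes do not state how the rates combine across sites, and the function name and the `streams` parameter are hypothetical.

```python
TOTAL_FILES = 300_000_000   # approximate figure quoted in the minutes
SECONDS_PER_DAY = 86_400


def days_to_rename(n_files: int, rate_hz: float, streams: int = 1) -> float:
    """Days needed to rename n_files at rate_hz renames/s on each of `streams` parallel streams."""
    if rate_hz <= 0 or streams < 1:
        raise ValueError("rate_hz must be positive and streams at least 1")
    return n_files / (rate_hz * streams) / SECONDS_PER_DAY


for rate in (10, 30):       # the two rates quoted for distant and close sites
    single = days_to_rename(TOTAL_FILES, rate)
    parallel = days_to_rename(TOTAL_FILES, rate, streams=10)
    print(f"{rate} Hz: {single:.1f} days on one stream, {parallel:.1f} days on 10 streams")
```

Under these assumptions a single 10Hz stream would need close to a year, which suggests the campaign relies on renaming at many sites in parallel.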

Discussion:

  • Most Tier-2 sites run DPM and will need to upgrade? Answ: DPM 1.8.7 (which supports webdav) now a stable release and therefore no obstacles to deployment at Tier-2s to allow progress. ATLAS will deploy DPM 1.8.7 on a few test sites and then press sites to deploy widely.
  • What about BESTMAN? Answ: Just 10% of ATLAS sites, will have to rename manually.
  • Michel: How does upgrade of RUCIO work? Answ: Site should contact its ATLAS Cloud support.

Comment - WebDAV support in StoRM is being tested this week.
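
For context, a rename over WebDAV maps to a single HTTP MOVE request carrying a Destination header (RFC 4918). The sketch below only builds such a request with the Python standard library and does not send it. The host name, paths and function name are hypothetical, and it is not a description of the tooling ATLAS actually uses; a real storage element would also require X.509 client authentication, which is omitted here.

```python
import urllib.request


def build_rename_request(base_url: str, old_path: str, new_path: str) -> urllib.request.Request:
    """Build (but do not send) a WebDAV MOVE request that renames a file."""
    return urllib.request.Request(
        url=base_url + old_path,
        method="MOVE",                            # WebDAV method defined in RFC 4918
        headers={
            "Destination": base_url + new_path,   # absolute URL of the new name
            "Overwrite": "F",                     # refuse to overwrite an existing target
        },
    )


# Hypothetical storage endpoint and file names, for illustration only.
req = build_rename_request(
    "https://se.example.org",
    "/atlas/old/name.root",
    "/atlas/rucio/new/name.root",
)
print(req.get_method(), req.full_url)
print("Destination:", req.get_header("Destination"))
```

Setting Overwrite to "F" is a conservative choice for a bulk campaign, since a clash with an existing target then produces an error instead of silently replacing a file.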

Ops Coord Report - Andrea Sciaba

perfSONAR now part of baseline versions

dCache: 1.9.12 support extended to end of Sept., 2.2.17 (just released) must be installed for SHA-2 compatibility

BDII: latest version must be deployed by all sites for site and top BDII

  • Important GLUE2 fixes

SHA-2: WLCG deadline extended to December 1st

  • Validation by experiments progressing well: mainly ALICE still has various SW to check
  • VOMSRS -> VOMS migration underway: fully automated

CVMFS

  • Security fix released end of August (2.1.14): must be installed by all sites
  • ALICE deployment progressing well
  • CMS: only ~10 sites remaining
  • New support unit in GGUS

FTS3

  • ATLAS using it for 30% of their production transfers
  • 2 production instances: CERN and RAL
  • Very impressive results so far

Tracking tools

  • Completed migration for some ALICE and LHCb trackers

MW readiness verification: extended staged rollout for WLCG

  • See July discussion
  • TF membership still to be defined: may start from the old MW deployment TF

Machine/job feature TF fully set up, just starting to work

News from EGI

  • perfSONAR now tracked in baseline versions table
  • End of support for dCache 1.9.12 was extended to September 30
    • Due to delay in releasing SHA-2 ready version
  • dCache 2.2.17 (just released) is SHA-2 compliant
  • All sites should update their BDIIs to the latest version, including fixes for GLUE-2 and security

SL6 Migration Status - Alessandra Forti

Several Tier-1s completed.

  • Five Tier-1s in progress and three yet to start migration.
  • All expect to be complete by the end of October or just after.

About 43% of Tier-2s now migrated. Looks tight for all sites to be completely migrated but most will be done.

  • Some sites reported they were waiting for the QUATTOR templates. Michel reported that the templates had been checked at GRIF and so other sites will be able to upgrade rapidly.

HEP_OSLibs

  • RPM still evolving: ATLAS and LHCb had problems because of sites running old versions

EMI-3 WN is still not usable by all sites.

  • There have been problems with memory usage for the EMI-3 VOMS client on nodes with a large number of cores (48?)

SHA-2 and EMI-3 Deployment Roadmap - Peter Solagna

Peter reported on the progress towards deployment of SHA-2 capable services.

  • Compliance currently: 61% CREAM, 53% VOMS, 74% WMS, 11% STORM and no dCache (74 dCache sites).
    • Correction to slide 2: dCache 2.2.17 is not in UMD yet; it was released by the PT a few days ago.
  • Existing deadline for all services to be SHA-2 compliant by 1st October is not feasible. A new timeline has been proposed which will keep SHA-1 certificates the default until 1st December 2013; this will be discussed at the next EUGridPMA meeting.
    • GRID services must be SHA-2 compliant by the end of November. This is a strict deadline: services that are not SHA-2 capable after this date must be put in downtime or suspended.
  • DPM: Michel noted that all EMI-2 software versions were compliant and non-compliant DPM instances should have been in downtime for months.

What is the situation in the USA? Is there any risk that American certificates will not comply with the new timeline?

  • Ans: OSG has moved to a commercial CA and probably will not release SHA-2 certificates any time soon.
  • Fermilab is considering their resources available to upgrade to SHA-2 compliance and believe the new deadline may be feasible.
    • Implies a major dCache upgrade already scheduled with a similar timeline

Maarten noted that we cannot be certain that a CA will not roll out SHA-2 certificates early but that seems unlikely.

  • It is unlikely we (WLCG) will switch unless everyone is able to switch but Fermilab remains a challenge - we should monitor the situation month by month.
  • It is not likely we will switch to issuing SHA-2 certificates in December - we should wait until after the Christmas break. We are in pretty good shape overall.
Topic revision: r6 - 2013-10-09 - MichelJouvin
 