Summary of GDB meeting, March 12, 2014 (CNAF)

Agenda

https://indico.cern.ch/event/272619/other-view?view=standard

Introduction - M. Jouvin

WLCG workshop dates settled: July 7-9, Barcelona

  • Finalizing registration fees: 100 to 150 €
  • Web site opening: by the end of March
  • Send program suggestions to Michel and Simone
  • No GDB in July

Spring pre-GDB

  • May: data access
  • June: IPv6 management
  • Discussing possibility of an April meeting as a follow-up to volunteer computing presentation in January
  • IPv6 post-GDB in April (April 10-11)

Actions status

  • CVMFS MW client repository: waiting for feedback from experiments
    • Mostly LHCb so far
  • Update DM client developers with experiment plans to move to GFAL2/FTS3 clients

Forthcoming meetings: see slides

CVMFS Status and Roadmap - J. Bloomer

CVMFS 2.1 is a rewrite of CVMFS 2.0 to address shortcomings and implement wishes arising from early experience

  • shared local cache, NFS support, hot patching, file system snapshots, file chunking
  • Some features require the full chain (from Stratum 0 to clients) to be 2.1
  • 6 Linux distributions supported
  • 2 releases in the last year, compared to one every month in the previous year!
    • SW lifecycle improved
  • Now 100% (WLCG) grid coverage

CVMFS deployment status: 2.0 reaches end of life at the end of March 2014

  • Deployed at all Stratum 1
  • Stratum 0 (installation boxes): several outside WLCG and new repositories at CERN
  • Migration of the main repositories planned during Spring, one repository at a time

Development plans for 2014

  • S3 backend for installation boxes and used as a source for Stratum 1: will remove the need for NFS
  • Improved support for supercomputers: shared cache between nodes, located on the supercomputer's global filesystem
  • Alternative content hashes: support for methods other than SHA-1; a rehashing campaign may be required (see the illustration after this list)
    • SHA-1 remains perfectly fine for CVMFS
  • Auto-configuration for easier support of small VOs
    • Automatic discovery of site proxies, following recommendation of WLCG TF
  • Improved support for completely independent Stratum 0
    • Ability to support only a subset of Stratum 1
    • Central key and configuration repository approach
  • Active repository replication, without the need for the cron job running every 15 minutes
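As a side note on the content-hash item above, here is a minimal, generic Python illustration (not CVMFS code) of why changing the hash algorithm of a content-addressed store implies a re-hashing campaign: the same data is stored under a different object identifier for each algorithm.

    import hashlib

    # The same chunk of data gets a different object name per hash algorithm,
    # so existing repository content would need to be re-hashed/re-published.
    chunk = b"example file chunk content"
    print("SHA-1   object id:", hashlib.sha1(chunk).hexdigest())
    print("SHA-256 object id:", hashlib.sha256(chunk).hexdigest())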

CernVM 2.x will be EOL end of Sept. 2014: moving to 3.x

  • Bundle of independent components: the microCernVM bootloader + an SL6 subset + CernVM extras like CVMFS
  • VM image is ~20 MB, with a contextualization method usable on all clouds: cloud-init + amiconfig
  • 3.1 released in January, based on SL6.4
  • 3.2 planned end of March: SL6.5, Google Compute Engine support, integration with cloud-scheduler and Shoal
  • Work in progress for long term data preservation support: SL4 and SL5 support in microCernVM

Discussion

  • Status of client update to baseline version: someone should check the Stratum 1 logs...

EU-T0 Initiative - G. Lamanna

APPEC forum: Astro-Particle Physics consortium

  • All the future experiments are challenging for data management and data analysis
  • A 3-year effort on astro-particle computing: a report will be published soon
  • Several issues identified: data management, computing models, new SW/MW for parallel computing...

Know-how of the HEP community could benefit the AP community: similar computing needs, same funding agencies...

  • CERN papers last year about an e-Infra vision for the 21st century are well in line with APPEC's thoughts

EU-T0 initiative: Integrated Distributed Data Management Infrastructures for Science and Technology

  • Aim at more efficient and sustainable e-infrastructures built around macro research communities: ELIXIR, CLARIN, LifeWatch...
  • No more room for competition between those communities for funding
  • Be research- and data-centric: all projects related to EU-T0 are data intensive
  • These research communities operate the computing centers providing the services required by their communities
  • Federate these computing centers as an EU-T0 Centre of Excellence
    • Operate common cloud services, drive evolution of current grid/cloud and big data e-Infra
    • Dedicated network development
    • Global "Virtual Research Enviroment"
    • Interfaces with HPC centers
    • Data preservation
    • SW R&D

EU-T0 as a global and coherent initiative

  • Collaboration between institutes
  • Training centre, with private+public cooperation
  • Federated stakeholder of e-Infrastructures
  • Pilot on possible public+private cooperation

EU-T0 initiative just starting: a position statement was signed by several funding agencies on Feb. 11

  • CERN is one of the signatories
  • Working on a detailed work program that could be submitted to funding programs like H2020

Discussion

  • Peter Solagna: unclear how this T0 initiative will interface with the existing distributed infrastructures
    • Giovanni: T0 in the sense of this initiative is completely different from T0 in the CERN/WLCG sense. A distributed infrastructure holding the main data from various disciplines/experiments, controlled by the scientific communities. Interacting with EGI for operational issues: no intention to replace or reinvent it. Also interacting with EGI for the needs of long-tail sciences.
  • Just the start of this initiative, many things to clarify/refine: could be rediscussed at WLCG workshop as part of the discussion on future directions for WLCG computing

GOC Recent Developments - J. Gordon

A recent feature added in GOCDB 5.2 allows sites to define arbitrary key/value tags

  • Called extension properties
  • If the tags are agreed with the communities, they can be used to classify sites

Multiple key/value pairs are allowed for the same key: this supports a list of values (e.g. a list of VOs)

  • Queries can define a filter based on extension properties: wildcards supported
  • Usual operators available to build expressions: AND, OR...

Extension properties are added to the XML output under EXTENSIONS tag.

Full set of examples available online: see slides
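For illustration, a minimal Python sketch of how such a query might look from a script. The PI base URL, method name, the exact "extensions" filter syntax and the XML element names below are assumptions based on this talk and on the GOCDB PI documentation of the time (they are not confirmed in these minutes), and the filter keys used are purely hypothetical.

    # Hedged sketch (not verified against a live GOCDB): query the GOCDB
    # programmatic interface for service endpoints matching an extension-property
    # filter. URL, method name, filter syntax and XML element names are assumptions.
    import ssl
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    GOCDB_PI = "https://goc.egi.eu/gocdbpi/private/"   # assumed base URL (authenticated PI)

    def endpoints_matching(filter_expr, certfile, keyfile):
        """Return (hostname, {key: value}) pairs for a filter such as
        "(HTCONDOR_CE=1)AND(VO=atlas)" -- hypothetical keys, for illustration only."""
        ctx = ssl.create_default_context()
        ctx.load_cert_chain(certfile, keyfile)         # the GOCDB PI requires X.509 authentication
        query = urllib.parse.urlencode({
            "method": "get_service_endpoint",
            "extensions": filter_expr,                 # wildcards and AND/OR supported, per the talk
        })
        with urllib.request.urlopen(GOCDB_PI + "?" + query, context=ctx) as resp:
            root = ET.fromstring(resp.read())
        matches = []
        for ep in root.findall("SERVICE_ENDPOINT"):
            exts = {e.findtext("KEY"): e.findtext("VALUE")
                    for e in ep.findall("./EXTENSIONS/EXTENSION")}
            matches.append((ep.findtext("HOSTNAME"), exts))
        return matches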

Proposal for the future: unauthenticated access to query GOCDB for non-sensitive information

  • No enthusiasm at a recent EGI OMB

Discussion

  • The VOs should agree with the sites to keep the namespace clean: some coordination also needed between the VOs
    • Suggestion by Atlas (Alessandro) to have a standard set of keys, but GOCDB does not see it as its role to implement it: creation of keys should remain under the control of sites
    • Maria/Andrea: need some clear use cases from the VOs. In particular should avoid duplication with BDII information (e.g. supported VO list doesn't seem a very good idea...)
    • Adding/updating extension properties on endpoints will remain under the control of the sites, not the VOs: they cannot be used by a VO to tag a particular set of sites (e.g. T2D)
  • Michel suggests follow-up in the discussion with the VOs under the umbrella of the WLCG IS TF, led by Maria Allandes
    • Maria insists that VOs have already been asked about the foreseen use cases but that no answers were received so far
  • Feedback on the proposal of a RO "public GOCDB"
    • Ulf: Need to ensure GOCDB doesn't become a source for collecting email addresses and other similar information
    • WLCG VOs generally have no problem with the authenticated access as they have certificates...

Summary of pre-GDB on Batch Systems - M. Jouvin

Well attended

  • ~25 people in the room
  • ~15 remote
  • Only site representatives (as expected)
  • Most “well known experts” of batch systems present
  • Covering Torque/MAUI, Grid Engine, LSF, HTCondor, SLURM
    • See slides for details

No major issue with MW integration, including accounting

  • Not everything works off the shelf with EMI-3, but fixes have been produced by sites and integration into EMI/UMD is in progress

Accounting: J. Gordon would like to set up a short-lived TF with the main batch system experts (1 or 2 per implementation)

  • Review the discrepancies reported by experiments between what is published in the EGI accounting and the experiments' job-based accounting: need to double-check that everything done in the APEL parsers is consistent, in particular across batch system implementations
  • Check failed job processing
  • Call for volunteers
  • Would like to report at WLCG workshop

A summary table will be produced in the Twiki to help sites wanting to review their batch system choice

  • Weaknesses, not only strengths/features…
  • Scale at which problems were observed
  • Contact of reference sites

A lot of work in progress, in particular for multi-core job support

  • Follow-up in Ops Coordination
  • Summary at a future GDB, probably end of Spring

Some topics not discussed due to lack of time, could be discussed at a future meeting

  • Dynamic WN handling
  • ARC CE vs. CREAM CE comparison based on recent experiences

Discussion

  • Philippe: backfilling is difficult in the pilot-job world: with late binding of the payload, the pilot has no idea of the job duration when it starts...
  • MAUI: clarification that, MAUI being unsupported, its usage remains possible, but sites using it are at risk of a major problem that would require an urgent migration to something else
    • No urgency to plan this migration currently but probably a good thing for those sites to review their choice for the future

Actions in Progress

Ops Coord Report - A. Sciaba

Dates of next meetings: see slides, a few changes due to conflicts

  • In particular the 1st meeting of April (April 3rd) clashes with the HEP SW Collaboration kick-off meeting: to be reviewed

Experiment news

  • ALICE: investigating with CERN the differences in failure rates and CPU efficiencies between Meyrin and Wigner
  • ATLAS
    • Migration to Rucio going well
    • Asking more sites to join FAX
    • Evaluating the use of http/DAV for federation as an alternative to xroot: sites may be asked to keep http access enabled permanently
  • CMS:
    • Working on using CVMFS for nightly builds
    • Would like to use multi-core jobs at scale at T1
  • LHCb
    • T1s must declare SRM.nearline endpoint in GOCDB to have separate downtimes

Storage

  • StoRM: release with SHA-2 support
  • dCache: 2.2 support extended as 2.6 doesn't support Enstore

glexec

  • Close to the end: only 16 sites left with open tickets
  • SAM is close to enabling the glexec test as critical

SHA-2

  • VOMRS in fact has no problem with SHA-2...
  • VOMRS migration unchanged: a VOMS-Admin cluster has been available since the beginning of Feb.

perfSONAR

  • perfSONAR 3.3.2 is now the baseline
  • April 1st is the deadline for all sites to have it deployed and registered: still 50% of sites missing
    • Tickets should be opened against these sites, to be checked
    • Country representatives could look at the perfSONAR dashboard to get a list of the non-working sites

FTS3

  • Incident at RAL on Feb. 18
  • CERN would like to decommission FTS2 by August 1st
  • Discussion with experiments on how to efficiently integrate multiple servers in their framework

Machine/job features

  • Prototype ready for testing, including for clouds (based on CouchDB)
  • Progress on bi-directional communication between clouds and experiment frameworks
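For context, a hedged sketch of how a payload might consume such features. The mechanism assumed here ($MACHINEFEATURES / $JOBFEATURES pointing either to a local directory with one file per key, or to an HTTP(S) base URL on cloud resources) and the key names ("hs06", "wall_limit_secs") follow the draft machine/job features specification as commonly described at the time; none of it is taken from these minutes.

    # Hedged sketch: read machine/job features either from a local directory
    # (batch case) or over HTTP (cloud case). Key names are assumptions.
    import os
    import urllib.error
    import urllib.request

    def read_feature(base, key):
        """Return one feature value as a string, or None if it cannot be read."""
        if not base:
            return None
        try:
            if base.startswith(("http://", "https://")):
                with urllib.request.urlopen(base.rstrip("/") + "/" + key, timeout=10) as r:
                    return r.read().decode().strip()
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except (OSError, urllib.error.URLError):
            return None

    hs06 = read_feature(os.environ.get("MACHINEFEATURES"), "hs06")
    wall_limit = read_feature(os.environ.get("JOBFEATURES"), "wall_limit_secs")
    print("HS06 per slot:", hs06, "- wall clock limit (s):", wall_limit)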

Multi-core deployment: see pre-GDB summary

WMS decommissioning

  • Green light from all experiments: starting April 1st
  • SAM has its own timeline, linked to Condor-G submission readiness

MW readiness

  • Experiments are writing instructions for sites that want to participate
  • Next meeting March 18

Oracle: CERN moving to 12c during Spring

  • T1s requested to upgrade to the latest 11g release ASAP

EMI-3/UMD-3 Migration Status - P. Solagna

Starting today, alarms will be raised against sites running unsupported SW (UMD-2/EMI-2, with the exception of dCache)

  • Sites asked to provide a migration plan in the next 15 days
  • Beginning of May: end of security support for UMD-2/EMI-2
  • End of May: services that have not been upgraded will have to be put in downtime
  • dCache is an exception: deadline set to July because of the lack of Enstore support

Remaining EMI-2 services as of March 7th:

  • DPM: 127
  • CREAM : 220
  • Sites with WNs: 130
  • Site BDIIs (probably related to the other services): 150
  • A lot to do but still 2 months ahead of us...

APEL client migration is a critical step for upgrading CEs: need to upgrade all APEL clients at the same time. Mainly 2 options:

  • Upgrade the APEL client at the same time as the CE
  • Upgrade the APEL client first, on the (still) EMI-2 CREAM
  • Upgrading CREAM while keeping the EMI-2 APEL probably works but is not recommended at all
  • Sites with doubts should open a ticket against the APEL units
    • Also look at the migration documentation, quite exhaustive
    • Sites should favour migration on a month boundary: can begin by just stopping the publication into EMI-2 and restarting it later in EMI-3 (backlog will be properly processed)
    • Reminder that a site publishing both EMI-2 and EMI-3 data will have its EMI-2 data ignored: said at last GDB, well documented

SL/CentOS discussion update - T. Oulevey

4 possible future directions formalized on hepix-users

  • See slides

No date given by RH about the end of source RPMs: it will not happen soon, keep the current way of doing things for SL6

SL7: most probable options are 3 and 4

  • No urgency to decide

Next steps

  • Progress update at April GDB
  • Discussion/decision at HEPiX about SL7 strategy

HEP-SPEC 14: Status and Next Step

HEP-SPEC History - M. Michelotto

SPEC: a non-profit organization with members from industry and the academic world

Originally started with SPEC 2K, but it was realized that there were too many ways to run it, making comparisons difficult

  • Despite this, was widely used in LHC community for pledges
  • But some discrepancies seen by experiments between SI2K score and real application performance

This led to the need for a common way of running the SPEC suite: HS06, based on SPEC 2006

  • Objective: match application performance within ±3%
  • Reproduce the batch node environment: single core apps on a fully loaded WN
  • 2 GB/core memory footprint
  • Well-defined options to use, to make it easy for computer vendors to run it as part of tenders
  • Fine tuning made using a few floating point suites

HS06 is aging because:

  • It imposes a 32-bit execution environment: not really a problem right now, as long as the applications continue to scale the same way as the benchmark

A group of experts from the experiments has been formed

  • Many already participated in the HS06 effort: should help

Need to wait for the new SPEC suite to be available before starting the real work: expected next Fall

  • A first step that can be done already: agree with experiments on compilation options to use
  • May also start to identify the experiment applications to use to assess whether HS14 is representative of our apps

Summary of Discussions about HS14 - M. Alef

Several points raised on GDB list after the initial presentation in January. Below are some answers/clarifications:

Number of parallel executions: number of job slots (or cores, if they match)

  • In parallel runs, the benchmark must use the same HT/Turbo configuration as the one used on the WNs
  • The number of parallel executions is critical, as the score does not necessarily increase linearly with the number of copies/slots (see the sketch below)
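To make the non-linearity point concrete, a small illustration follows. It assumes, as in the HEPiX HS06 recipe to the best of my understanding, that the machine score is the sum of the per-copy scores of the N copies run simultaneously (N = number of job slots); the numbers are invented for illustration only.

    # Why the number of parallel copies matters: per-copy scores typically drop
    # as more copies compete for caches, memory bandwidth, HT and Turbo headroom,
    # so a single-copy run cannot be extrapolated linearly. Numbers are invented.
    def machine_score(per_copy_scores):
        """Aggregate machine score as the sum of the per-copy results (assumed recipe)."""
        return sum(per_copy_scores)

    single_copy = machine_score([14.0])        # hypothetical 1-copy run
    fully_loaded = machine_score([11.5] * 16)  # hypothetical run with 16 slots
    print("naive 16x extrapolation:", 16 * single_copy, "vs measured:", fully_loaded)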

Benchmark score vs. accounting

  • The benchmark score is the average score of a typical job mix on a fully loaded system: a given application may show a significant variation, and the same may happen on an idle system
    • Accounting uses the worst-case: fully loaded system

Wish expressed for a benchmark that is open source (SPEC is not) and ideally gives a result in a minute...

  • Very doubtful that such a benchmark would be representative of a mix of applications

Conversion between SI2K and HS06: a rule was proposed to easily convert HS06 to SI2K, but not the other way around...

  • The proper approach is to run the HS06 benchmark and do the conversion if needed

The scaling of the HS06 score with real applications cannot be guaranteed for each individual application, but one must look for significant discrepancies

  • Possible hardware issues, in particular disk access contention
  • In VMs, compilation flags requiring specific HW features hidden by the VM may result in poor performance

Infrastructure News

OSG Update - B. Bockelman

No organizational changes since last report

Linear increase of OSG resource usage in the last year: all VOs, not only WLCG

  • ATLAS and CMS remain the main users
  • VO osg: users accessing OSG through OSG User Support, in particular users from XSEDE
  • Pilot-based payloads account for more than 99% of OSG usage
  • LHC benefits from the large user base and the resulting economy of scale

OSG SW: currently supporting 2 major releases

  • Avoid too frequent major releases: the goal is no more than one every 2 years
  • New major release only when a component update cannot fit in the current release
  • Significant investment in the last months in testing infrastructure to allow quicker issue detection
  • Adding support for UDT as an alternative to TCP for high-latency connections: working with FTS3 to get UDT support integrated into FTS
  • Koji-based build system: opened to/used by several partnership projects

HTCondor CE: CE based solely on HTCondor + BLAH

  • No more Globus components (e.g. GRAM)

Misc. issues with interoperability consequences

  • SHA-2: no known issues since Nov. 2013
  • VOMS clients: still using VOMS 2.x series and have no current plans to upgrade
    • Many applications linked with the C/C++ API
  • Upgrade to latest VOMS-Admin is under consideration
  • BDII/GLUE2: no news, no plan
  • CVMFS: large investment to allow its use by all OSG VOs
    • grid UI distributed through a CVMFS repository
    • All sites encouraged to run HTTP proxies
  • APEL: the system transferring accounting data from Gratia was recently transferred to OSG operations. No known issues.
    • GRATIA remains at FNAL
    • No plan for new features in GRATIA right now: John asking about possible developments for clouds and storage

Discussion

  • IPv6: no plan yet, probably part of next year planning
    • Main activities currently at upstream vendors
  • BDII: Atlas insists that they use the BDII as one source for AGIS and really need OSG to publish accurate data into it
    • Brian: USATLAS should be made aware of this requirement and push it in OSG
  • GLUE2: currently some extensions being discussed at OGF, in particular for cloud resources
    • Brian: no plan to use it for cloud resources
  • VOMS clients: OSG is aware that it is on its own with the v2 clients and is ready to do the minimum maintenance required for their usage
    • In fact, work has already started to add the ability to generate a proxy with the same signature/hash algorithm as the parent certificate (e.g. SHA-2) instead of always SHA-1.
  • OIM contact for discussing GOCDB extension properties: Rob Quick (OSG operations)

NDGF Update - M. Wadenstein

ARC uptake increased significantly

  • ~80 ARC CEs
  • New backends: BOINC (IHEP), HTCondor (RAL)
  • Several improvements in the integration with EGI
    • Better support of SAM tests
    • Direct data access from WN now possible

NDGF still hosted by NeIC and distributed among 7 sites

  • Plenty of opportunities for a small share on large HPC computers: lots of opportunity for short jobs (backfilling of HPC clusters)
  • NDGF will finally remain in EGI
  • 2 developers (2 FTE): 1 for dCache and 1 for ARC

NDGF current concerns

  • Full cost of networking due to the GEANT cost-sharing model: GEANT costs are 3x higher than commercial links
    • Cost related to traffic volume: inappropriate for LHCONE usage
    • GEANT costs paid by NDGF
  • Could be a problem for any-to-any usage envisioned by data federations

Discussion

  • ARC maintenance for non-Nordic countries: not foreseen as a problem, minimal load
  • Main ARC CE development project, currently unscheduled: rewrite of the interface scripts to the back-ends
    • Especially needed for HTCondor
    • Already started several times but a significant project, difficult in the current manpower context

-- MichelJouvin - 12 Mar 2014
