Summary of GDB meeting, September, 2016 (CERN)

(Thanks to Brian Davies and Michel Jouvin)

Agenda

https://indico.cern.ch/event/394786/

Introduction - I. Collier

Discussion points:

October GDB cancelled: conflict with CHEP

No pre-GDB yet scheduled in November and December

  • Network pre-GDB postponed to January due to unavailability of key people

Proposing to move the March GDB to Taipei, co-located with ISGC

  • This allows Asian sites to more easily participate - and to address issues relevant to them
  • Speak up soon if you have objections!

WLCG Workshop - J. Andreeva

Draft agenda: https://indico.cern.ch/event/555063/timetable/#20161008

  • Follow-up of some key areas discussed in Lisbon: data management, cloud provisioning, security
  • New topics: network, SW performance (including experiment frameworks)
  • Some topics not included when sufficiently discussed at GDBs (accounting, IS) or not mature enough (monitoring)
  • Goal is discussions rather than a lot of presentations

Data management: after pre-GDB will concentrate on current activities (many) rather than discussion of disruptive changes

SW performance: concentrate on experiment workflow efficiency at T0

  • Do we have the proper tools to understand it?

Discussion points:

  • T0 efficiency analysis talk should bring up issues/interest to other sites for accounting and IO rates

IPv6 Status - D. Kelsey

Plan elaborated after the June IPv6 workshop, presented at the July MB, refined at a WG meeting last week

  • Seeking approval at the next MB, Sept. 20

Main points of the plan

  • ATLAS, CMS and LHCb are aiming to have their critical central services dual stack by April 1, 2017.
    • ALICE central services are already dual stack
  • Except for ALICE, storage is not fully federated, so it is manageable to use some storage resources that are IPv6-only
    • Not possible for ALICE: all existing IPv4 storage resources must be dual stack before they can use IPv6-only ones
  • ALICE can support IPv6-only WNs as soon as enough storage is dual stack

Timeline proposed

  • April 2017: sites can provide IPv6-only WNs; all T1 storage must be dual stack
  • April 2018: same availability/reliability for IPv6 storage/nodes

July MB worried that the plan is too aggressive...

IPv6 WG meeting in September showed good progress

  • Feedback from all T0/T1s since July: no complaints
  • CERN, CNAF and RAL actively engaged in the WG
  • QMUL providing IPv6-only WNs
  • Production dCache IPv6 at PIC: 500 MB/s sustained during 12 hours
    • Transfers with QMUL
  • BNL now peering with IPv6 on OPN
  • CMS: IPv6-only WN SAM tests
    • Some failures but understood and easy to fix
  • LHCb declared itself fully IPv6 compatible: can use any dual stack storage
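In practice, a service counts as dual stack when its hostname resolves to both A (IPv4) and AAAA (IPv6) records and it listens on both. A minimal sketch of such a check (the `is_dual_stack` helper and its default port are illustrative, not a WLCG tool):

```python
import socket

def classify_families(addrinfo):
    """Split getaddrinfo() results into IPv4 and IPv6 address sets."""
    families = {"ipv4": set(), "ipv6": set()}
    for family, _type, _proto, _canon, sockaddr in addrinfo:
        if family == socket.AF_INET:
            families["ipv4"].add(sockaddr[0])
        elif family == socket.AF_INET6:
            families["ipv6"].add(sockaddr[0])
    return families

def is_dual_stack(host, port=443):
    """A host is 'dual stack' if it resolves to both A and AAAA records."""
    try:
        info = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    fams = classify_families(info)
    return bool(fams["ipv4"]) and bool(fams["ipv6"])
```

This only checks DNS resolution; actual reachability over each protocol (as in the PIC/QMUL transfer tests) requires opening a connection per address family.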

Google IPv6 statistics continue to show a strong increase in IPv6 usage compared to June

  • In particular UK changed from negligible to 15%
  • Some incoming IPv6 traffic seen at CERN recently

Adjustment proposed to the initial plan: change "T1s are required" to "T1s are requested". But would like to stay with the original plan to keep the momentum

  • In particular would like to see MB approving the April 2017 date

Discussion points:

  • Slide 6, ALICE: not only remote reads for workflows; if users are on IPv6 then all storage must be accessible over IPv6 (slide 7)
  • Slide 18: ALICE needs ~150, not ~50, per run, since 4-5% of data per job is also read remotely; so for ALICE this would be a requirement for T2s as well as T1s.
  • Requested changes to document going to MB next week.
  • ALICE requests that the T2 changes to IPv6 be conveyed to the T2s explicitly.
  • CMS requirements are getting closer to ALICE requirements (partial T2 deployment leads to reduced flexibility in job placement.)
  • Discussion on planning to allow a small amount of pledged resources to be accommodated.
  • Acknowledging the challenge for ALICE & CMS, T1s are a good starting point as they are few in number. Operationally pursuing all T2s will be a bigger job.

NDGF Review - J. Flix

Study required by NeIC about cost efficiency of NDGF T1 and possible optimizations

  • Not restricted to monetary costs
  • Conducted by Pepe
  • Discussion with experiment PIs, NT1 coordination body, some ATLAS/ALICE users, service providers

Main resource providers located in Norway and Sweden

  • Denmark has a significant share of the tape capacity

Resource utilization

  • ALICE: level of usage consistent with the average of T1s (CPU, disk), over pledged
    • At the time of the survey, was not fully integrated into ALICE SAM: done now, very good reliability
    • NT1 storage seen as 1 SE: some impact on CPU efficiency. Jobs submitted directly to sites without using the ARC cache.
  • ATLAS: CPU just at the level of pledges, over pledges for disk
    • Very good reliability as seen by SAM
    • ATLAS using ARC cache: generally hiding the SE distribution

NT1 budget sources: public grants, NeIc funding, in-kind contributions from sites

  • Total (direct) budget ~1.4 M/year
  • FTE: 2 for development, 11 for operations (12.5 including WAN operations)
  • Significant in-kind contributions covering electricity, infrastructures...
    • Electricity cost estimated at 175 k/year
  • Network (WAN) costs (NeIC, NREN + in-kind): ~500 k/year
  • Cost of operating a distributed tape system estimated at 35 k/year: marginal in the total budget
  • Total cost estimated at 3 M/year
    • ~1/2 budget related to personnel costs
    • Denmark electricity price significantly higher but marginal impact as it hosts few resources

NT1 has a very successful history in WLCG

  • Recognized competence and expertise
  • Only distributed T1

Cost of running NT1 as a single site: budget reduction evaluated at ~23% (~0.7 M/year)

  • But the budget relies on a similar level of in-kind contributions that would not be affordable for a single site
  • May negatively impact the skills in the different nordic countries
  • Any reduction in the number of sites must be done carefully, as the budget impact would be marginal
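As a sanity check, the quoted single-site saving is consistent with the figures above (total cost ~3 M/year, reduction ~23%; currency units as in the talk):

```python
# Back-of-the-envelope check of the single-site saving quoted above.
total_cost = 3.0           # M/year, total estimated cost from the review
reduction_fraction = 0.23  # ~23% reduction if NT1 ran as a single site
saving = total_cost * reduction_fraction
print(f"Estimated saving: {saving:.2f} M/year")  # ~0.69, i.e. the ~0.7 M quoted
```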

Final report public.

Looking for a replacement for Gerard, dCache developer

MISP - R. Wartel

In our WLCG/HEP community (as in other private organizations), no time to look into malware trends: just want to know what to look for, what to block...

  • Great amount of knowledge and expertise available: security vendors, security teams, law enforcements...
    • At CERN, contacts with partners considered the #1 IDS for many years

"Threat intelligence" is a critical security component

  • Main challenge: getting quality, actionable threat intelligence
  • Malware Information Sharing Platform (MISP) emerged as a key platform for this
    • Share knowledge with others
    • Pull information from other sources
  • CERN started a central MISP for WLCG/HEP
    • Goal: be quicker than update of tools by vendors
    • Access through Web portal (eduGain+egroup) or API access (with API key for automatic pulling of information)
  • Exploiting MISP information is more complicated than setting up the platform
    • Correlation with local logs and data is non-trivial: solutions likely site-dependent
    • Increases the importance of the work on a WLCG SOC: need resources for the SOC WG
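Automated pulling via the API key works against MISP's REST endpoint. A hedged sketch of building such a request (the instance URL and key are placeholders; the request is constructed but not sent):

```python
import json
import urllib.request

def build_misp_search(base_url, api_key, misp_type="ip-dst", last="1d"):
    """Build (without sending) a MISP restSearch request for recent
    attributes of a given type, authenticated with an automation key."""
    body = json.dumps({"returnFormat": "json", "type": misp_type, "last": last})
    return urllib.request.Request(
        url=f"{base_url}/attributes/restSearch",
        data=body.encode("utf-8"),
        headers={
            "Authorization": api_key,       # MISP automation (API) key
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values -- replace with your instance and automation key:
req = build_misp_search("https://misp.example.org", "YOUR_API_KEY")
# urllib.request.urlopen(req) would perform the actual pull.
```

The returned indicators (e.g. destination IPs seen in the last day) are what a site would then correlate with its local logs, which is the site-dependent part noted above.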

Data Management pre-GDB Summary - O. Keeble

Agenda: https://indico.cern.ch/event/394833/

Resource Reporting in absence of SRM

  • Proposal elaborated and discussed with experiments over the summer
  • What? mainly report about "space quotas" (aka space tokens in SRM), free/used
    • Optionally, areas with restrictive quotas and the ability to get the resource report for the full namespace
  • How? via existing protocol
    • Discussion on whether a summary file should also be provided in the namespace: may have some advantages to support new storage protocols or tapes
  • Additional: storage dumps
    • Should be "on demand" and as an exception (as rarely as possible)
    • CMS strongly relies on them
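To make the "summary file in the namespace" idea concrete, here is a purely hypothetical example of what such a per-space report could look like (all field names are illustrative; the actual schema was still under discussion at this meeting):

```python
import json

# Hypothetical per-space-quota summary, of the kind that could be
# published as a file in the storage namespace. Field names and the
# endpoint name are illustrative, not an agreed schema.
summary = {
    "storageservice": "example-dpm.site.org",
    "latestupdate": "2016-09-14T12:00:00Z",
    "spaces": [
        {"name": "ATLASDATADISK",
         "totalsize": 1200 * 10**12,   # bytes
         "usedsize": 950 * 10**12},
        {"name": "ATLASSCRATCHDISK",
         "totalsize": 100 * 10**12,
         "usedsize": 40 * 10**12},
    ],
}
print(json.dumps(summary, indent=2))
```

Free space per quota is then just `totalsize - usedsize`, which is exactly the free/used reporting the proposal asks for without SRM.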

S3/Swift early experience

  • @RAL and @CERN in particular
  • S3 can fit into current WLCG storage evolution: http-based
    • Currently testing integration with FTS
  • Has the potential to revolutionize some workflows: doesn't have the same limitations as grid storage
    • May allow moving from data management to data curation

Reducing the protocol zoo = reducing the cost of running storage

  • Reduce the number of services needed to operate storage
    • Key: remove obligation to run SRM
    • Provide missing SRM features through standard features of other protocols
  • Easier support of non HEP communities
  • No storage provider has a problem running an SRM-less storage resource
  • Experiments: main issue is space reporting (ATLAS/LHCb); a proposal is now available
    • ATLAS: want to push gridftp to its limits but don't really care about protocols, no plan to stop SRM support
    • CMS: already supporting non-SRM storage, 2 sites have already retired their SRM endpoints (CERN in progress, others expected by the end of the year)

WLCG Data Coordination: coordination of data management activities

  • Initial idea from July MB: Oliver asked to prepare a draft of a possible mandate
  • Proposal: a limited data steering group with one person per experiment, storage providers, 2 or 3 persons from infrastructure
    • Representatives should understand they are undertaking a WLCG responsibility
    • No limit on group lifetime
    • Oversee discussions around DM and trigger the necessary actions/TF
    • Meeting at least twice a year, preferably more
    • Organizing one pre-GDB every 6 months on DM issues to discuss DM developments
    • Reports to MB, escalate discussions requiring formal decisions
    • Ian B: should make clear that this group has the responsibility to draw a long term strategy for WLCG evolution

Fast Benchmarking Update

LHCb and Fast Benchmark - A. McNab

Presentation based on P. Charpentier's work and presentation at December 2015 GDB

JobPower (renormalized event/s in MC jobs) to HS06 correlation: event/s per HS06 varying ~50% depending on architecture

  • JobPower vs. Dirac benchmark: much better picture, a real Gaussian peak, mean close to the peak, very little in the +50% area

HS06 (MJF) vs. fast benchmark: arguments for both

  • Fast benchmark: picture of the current situation on the machine (that may vary during the job)
  • MJF/HS06: close to worst case that the site is promising to supply

Conclusion: LHCb very happy with Dirac benchmark, would like to see it included into any WLCG recommendations for a fast benchmark

  • Better predictor than HS06 for MC jobs
  • Would be nice to have it included in SL, run at boot time, and communicated via MJF
    • Michel: suggests getting it into EPEL rather than SL, as not all sites use SL
    • Randy: CVMFS could be used to distribute the benchmark too
    • Tim: the value of running it at boot time is not clear (machine not loaded)
    • Domenico: this problem can be overcome by running as many benchmarks as the number of cores available
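The idea behind such a fast benchmark can be sketched as follows: time a small, fixed, CPU-bound workload and report a throughput-like score. This toy version is only in the spirit of the DIRAC (DB12) benchmark, not the real code:

```python
import random
import time

def toy_fast_benchmark(iterations=200_000):
    """Toy CPU benchmark in the spirit of the DIRAC fast benchmark:
    time a fixed, CPU-bound workload and return a throughput-like
    score (higher = faster machine). Not the actual DB12 code."""
    rng = random.Random(42)  # fixed seed: identical work on every run
    total = 0.0
    start = time.perf_counter()
    for _ in range(iterations):
        total += rng.normalvariate(10.0, 1.0)
    elapsed = time.perf_counter() - start
    return iterations / elapsed / 1e6  # "mega-iterations per second"

score = toy_fast_benchmark()
print(f"score: {score:.2f}")
```

Per Domenico's point above, running one copy per available core (e.g. with `multiprocessing`) would reproduce the loaded-machine conditions a job actually sees, rather than an idle boot-time measurement.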

ATLAS Fast Benchmark Results - A. Forti

ATLAS will use the CERN benchmark suite built by D. Giordano and M. Alef

  • ATLAS KV + Dirac benchmark + Whetstone
  • Good scaling with ATLAS apps
  • To be used in cloud and standard grid pilots
    • WIP for grid pilots
    • R. Sobie carrying out tests on Canadian IaaS resources (+ EC2 soon): benchmark run regularly, results in ES, users can look for the benchmark value during a specific period

ALICE Update on Fast Benchmark - C. Grigoras

ALICE considers LHCbMarks (Dirac benchmark) as the best candidate

  • Easy to run and integrate
  • Very good scaling with ALICE simulation

Would suggest registering benchmark results in a DB that allows sharing values between VOs

  • All VOs should run it from the same place to ensure that the same benchmark is run
  • Will also allow getting an average of the last N measurements
  • Provide a URL to query the DB based on a CPU model and a site
  • Will reduce the impact of specific conditions on a machine and make it easier to spot host misconfigurations
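The "average of the last N measurements" idea can be sketched with a small in-memory store keyed by site and CPU model (class and method names are hypothetical, not an existing ALICE service):

```python
from collections import defaultdict, deque
from statistics import mean

class BenchmarkDB:
    """Toy in-memory store of fast-benchmark results keyed by
    (site, cpu_model), keeping only the last N measurements so that
    queries return a recent average. Names are illustrative only."""

    def __init__(self, keep_last=10):
        self.keep_last = keep_last
        self._data = defaultdict(lambda: deque(maxlen=self.keep_last))

    def record(self, site, cpu_model, score):
        """Store one measurement; old ones beyond N drop out."""
        self._data[(site, cpu_model)].append(score)

    def average(self, site, cpu_model):
        """Mean of the last N scores, or None if nothing recorded."""
        scores = self._data.get((site, cpu_model))
        return mean(scores) if scores else None

db = BenchmarkDB(keep_last=3)
for s in (10.0, 12.0, 14.0, 16.0):  # the oldest value drops out
    db.record("CERN", "E5-2630v3", s)
print(db.average("CERN", "E5-2630v3"))  # mean of the last 3: 14.0
```

Averaging over the last N measurements is what damps out transient load effects on a single machine, while a score far from the site/CPU-model average flags a likely host misconfiguration.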

HS06: significant variations compared to ALICE MC

  • No value seen in MJF compared to the fast benchmark + DB
  • For ALICE MC, HS06 is not always a worst case scenario (see slides)

Discussion

Latchezar: we should move forward with a WLCG fast benchmark (LHCb one or the CERN suite of 3 benchmarks)

  • Should become the benchmark used for procurements

Michel: could ATLAS provide figures on the value added by KV and Whetstone over the LHCb benchmark

  • Domenico: work in progress to compare the benchmarks; at the 10-15% level they are all equivalent; KV, which runs a bit longer (~2 min), can spot some particular effects not seen by the others

Helge: important to continue to discuss this with the benchmark WG

AOB

CHEP: program being finalized

  • Several parallel sessions done
  • Plenary sessions almost finalized

-- IanCollier - 2016-09-29
