Summary of October GDB, October 17, 2018 (CERN)

Agenda

Introduction

Next GDBs

  • March 2019 at CERN, usual week: not moving, book early
  • April 2019: during ISGC, one week earlier than usual (April 3rd)

Pre-GDB: authn/authz in December

  • No confirmed pre-GDB in November

List of upcoming meetings: see slides

  • LHCONE/LHCOPN Workshop, 30th-31st October, Fermilab, US, https://indico.cern.ch/event/725706/
  • CS3, 28th-30th January 2019, INFN Roma, Italy - http://cs3.infn.it/
  • Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP2019, February 13-15, 2019, Pavia, Italy, https://www.pdp2019.eu/std.html
  • WLCG/HSF/OSG Workshop, 18th-22nd March 2019, JLab, VA, USA
  • HEPiX Spring 2019, 25th-29th March 2019, San Diego
  • ISGC 2019, 31st March - 5th April 2019, Taipei, Taiwan
  • European HTCondor, 24 - 27 September 2019, at the Joint Research Centre of the European Commission in Ispra, Lombardy, Italy
  • CHEP date mentioned is still a tentative date...

Next WLCG/HSF workshop: JLab decided

  • Full week of March 18, 2019
  • Joint with OSG too
  • Work just started on the agenda matching expectations of the 3 parties: see slide for an early list of topics

CRIC Status - A. Di Girolamo

CRIC is a validated source of information for experiments: it presents several underlying information sources in a consistent way

  • BDII, GOCDB, REBUS, OIM...
  • Not a replacement for the information sources, e.g. BDII: CRIC is an aggregator
  • Information presented as a particular experiment expects it
  • Can include any kind of resource, including those not in the BDII/grid (HPC...)
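To illustrate the aggregator idea, here is a minimal sketch, not CRIC's actual code. It assumes each source delivers a dictionary of site attributes and that sources listed earlier take precedence. The function name, the attribute names and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Merged view of one site, remembering where each attribute came from."""
    name: str
    attributes: dict = field(default_factory=dict)
    provenance: dict = field(default_factory=dict)  # attribute -> source name


def aggregate(sources):
    """Merge per-source site descriptions; earlier sources take precedence.

    `sources` is a list of (source_name, {site_name: {attribute: value}}).
    The precedence rule is an assumption made for this sketch.
    """
    merged = {}
    for source_name, sites in sources:
        for site_name, attrs in sites.items():
            record = merged.setdefault(site_name, SiteRecord(site_name))
            for key, value in attrs.items():
                if key not in record.attributes:
                    record.attributes[key] = value
                    record.provenance[key] = source_name
    return merged


# Hypothetical sample data: two grid sources and one list of non-grid resources.
sources = [
    ("GOCDB", {"SITE_A": {"tier": 2}}),
    ("BDII", {"SITE_A": {"tier": 1, "country": "XX"}, "SITE_B": {"country": "YY"}}),
    ("HPC-list", {"HPC_C": {"type": "HPC"}}),
]
catalogue = aggregate(sources)
```

Keeping the provenance next to each value means the original sources stay authoritative: the merged view only records which one supplied each attribute.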

Implemented as a Web application with a REST API

  • Based on Django framework
  • Authn/authz for a flexible implementation of groups and roles
  • Possibility of editing information manually for urgent fixes
  • If information published by the sites is wrong, possibility to notify the site
    • Details of how this is done are still to be sorted out: a ticket is probably the best way to do it

Status

  • CRIC for CMS as a siteDB replacement: the main priority presently (https://cms-cric.cern.ch)
    • User info (including authz) and topology retrieved from CRIC
    • Also working on importing the GlideinWMS config: much more complex, many problems identified, now ready to test again...
    • Importance of collaboration with experiment experts
  • Coming soon: WLCG instance for operations and LHCb/ALICE who didn't request a specific instance (https://wlcg-cric.cern.ch)
  • Then an ATLAS instance
  • An instance for COMPASS has also been set up

Contact: cric-devs@cern.ch

Discussion

Ian C.: how to avoid making the situation more complex for sites if they have to support CRIC for WLCG and BDII and friends for other VOs?

  • A: CRIC is not a site service, it consumes information published by sites using the usual/common services.
  • Once we have demonstrated its usefulness for WLCG VOs, it may be interesting to propose it to other VOs, if they are willing to adopt this approach and implement their own instance
  • Alessandro: CRIC also makes it possible for a site to use something else (e.g. JSON export) to publish its resources if it doesn't want to run a BDII

Tim: Need to make clear which place to use for updating site info like downtimes. CRIC? GOCDB?

  • Vincent: we need to be sure that CRIC is not exposing a site that has been suspended, in particular for security reasons

DPM Upgrade TF

Storage Resource Reporting (SRR) - J. Andreeva

Prototyping phase started in Fall 2017

2 different use cases for SRR

  • Topology: description of non-overlapping storage shares, protocols
  • Accounting: query used/free space without SRM, based on JSON file published by SEs
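As an illustration of the accounting use case, here is a minimal sketch that reads a JSON document published by an SE and derives free space per storage share. The field names (`storageservice`, `storageshares`, `totalsize`, `usedsize`) and the sample values are assumptions made for this example and should be checked against the actual SRR schema.

```python
import json

# Hypothetical document, standing in for the JSON file an SE would publish.
EXAMPLE = """
{"storageservice": {"name": "example-se", "storageshares": [
  {"name": "atlas-data", "totalsize": 1000, "usedsize": 400},
  {"name": "cms-data", "totalsize": 500, "usedsize": 450}
]}}
"""


def space_report(document):
    """Return {share name: {"total", "used", "free"}} from a JSON string."""
    shares = json.loads(document)["storageservice"]["storageshares"]
    report = {}
    for share in shares:
        total, used = share["totalsize"], share["usedsize"]
        report[share["name"]] = {"total": total, "used": used, "free": total - used}
    return report


report = space_report(EXAMPLE)
```

Because the shares are non-overlapping, the per-share figures can simply be summed to get site-level totals.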

DPM support for SRR: available in 1.10.3

  • Requires DOME to be enabled

TF Report - F. Furano

DPM 1.10.4 is the current release and supports SRR

  • Based on the new directory based quota tokens: very efficient real-time processing of the information
  • Published at known path: can be retrieved by CRIC
  • DOME must be enabled: upgrading to 1.10.4 is not enough
    • Still possible to run the legacy (SRM) stack, but as a totally separate stack: it is optional once DOME is enabled

Legacy mode in DPM (aka LCGDM) will not be supported anymore after June 1st, 2019

  • No plan to remove it from EPEL as long as it compiles: pure C-code, may last very long...

TF to minimise the DOME switch effort: start with a few sites, then ask for a wide upgrade.

  • Goal: 5 pilot sites (3 so far: BRUNEL, INFN-NAPOLI, PragueLCG2)
  • The next enlarged developers meeting (Doodle in progress) will discuss this

Authorization WG - R. Wartel

Design and promote token-based authn/authz as a possible replacement for X.509

  • 2 modes: integrated with CERN/SSO and CERN HR DB or standalone (configurable authn and identity vetting sources)
  • No deployment by WLCG: use existing SW solutions

Ongoing activities

  • Pilot AAIs are being enhanced to meet the agreed requirements
    • 2 pilots: EGI Check-in & COmanage and INDIGO IAM
    • WLCG will ultimately decide on one implementation for its use
  • Token schema to allow inter-operability between implementations
  • VO interviews to align VO workflows/frameworks requirements
    • Will influence the token schema definition

Next phase

  • Dec. pre-GDB: assess the pilots
  • Feb. 2019: provide feedback to WLCG MB

Discussion

Xavier: several other communities look at sharing their infrastructure (in particular for data) with us (DUNE, SKA...), need to be proactive with them to ensure we are deciding on something that can meet their need

  • Romain/Maarten: sure, it is important. But no worry, as we are jumping on an industry-standard solution, adopted by a significant part of the world.

WLCG DOMA Report - S. Campana

Project general meeting on the 4th Wednesday of each month at 4 pm CET

3 WGs

  • Third Party Copy (TPC): alternative to GridFTP
    • Co-chaired by A. Forti (ATLAS) and B. Bockelmann (CMS)
    • Short-term: investigate and commission alternative protocols. Currently HTTPS/WebDAV and XRootD.
    • Medium-term: token-based auth (in coordination with authz WG)
    • Long-term: switch to an alternative protocol
    • Rucio+FTS testbed set up: many caveats being addressed. Initial focus on functionality, performance later.
  • Data access: performance, content delivery, caching
    • Chairs: S. Jezequel, I. Vukotic, F. Wuerthwein, X. Espinal, M. Schulz
    • 2 main subtopics: understanding access patterns and performance, caching strategies and technologies
    • Caching: simulation work done based on Rucio logs for the MWT2 site. Main conclusions: caching has to be content and workflow aware; use a hierarchy of caches (based on XCache). See WG meeting during HEPiX: https://indico.cern.ch/event/763847/contributions/3170541/attachments/1730988/2797549/go.
  • QoS: recently started
    • Chair: P. Millar
    • Expose different classes of storage (what we used to call disk and tape) and have them integrated into the DDM frameworks. A potential source of HW savings.
    • A storage class is a different tradeoff between performance, reliability and cost
    • Will help with the integration of new storage technologies

Also some work related to network activities and R&D: focus on data transfers

  • Data Transfer Nodes, bandwidth on demand, SDNs...
  • Collaboration with SKA ARENAS project and HEPiX Networking WG
  • Leverage network information from FTS as the transfer manager

AAI: follow WLCG Authz WG

WLCG DOMA WG is not about defining the Data Lake architecture but about enabling, through R&D and prototypes, the technologies that will make the Data Lake possible

  • Distributed storage is addressed through the 3 focused subgroups, which cover different aspects of a global/distributed storage
  • The adoption of Rucio by CMS opens a big opportunity for simplifying "inter-operability" work
  • Some topics not yet discussed, like tape carousel: work also done in the archival WG

Next milestones

  • LHCC review of WLCG: no date settled yet but sometime this year. Need to be ready to present our work.
  • HSF/WLCG workshop in JLab

Work done in collaboration with relevant work in XDC and ESCAPE projects

Discussion

Discussion on how much caching can be efficient in our context where there is little reuse of a file

  • Analysis mentioned before, based on MWT2, tends to suggest that there is a significant efficiency with a reasonably sized cache
  • In addition to pure caching, it is planned to look at 'latency hiders', where the required files are read ahead once the processing has started, so that the latency price is paid only by the first events.
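The kind of question discussed here (how much a cache of a given size helps when files are rarely reused) can be explored by replaying an access trace through a simulated cache. The sketch below is not the MWT2 study: it assumes a plain LRU eviction policy, and the trace is invented for illustration.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity_bytes):
    """Replay (file_name, size) accesses through an LRU cache.

    Returns the fraction of accesses served from the cache.
    """
    cache = OrderedDict()  # file name -> size, least recently used first
    used = hits = total = 0
    for name, size in accesses:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file larger than the whole cache: never stored
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / total if total else 0.0


# Invented trace: three files of equal size, with some reuse.
trace = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
small_cache_hit_rate = simulate_lru(trace, 2)
large_cache_hit_rate = simulate_lru(trace, 3)
```

Running the same trace with several capacities gives a hit-rate curve, which is one way to judge what a "reasonably sized" cache means for a given workload.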

Cost and Performance Modelling - J. Flix

Identify metrics that best describe a workload to better predict needs and impact of changes

  • Need a collaboration of SW experts and site administrators
    • SW experts: how efficiently the HW is used
    • Site admins: how the global resources are used

A tool to collect basic metrics for an application developed by HSF: prmon

  • Metrics based on OS counters

Another tool developed in WLCG performance group at CERN IT: Trident

  • Collects metrics based on CPU counters

Resource estimation: goal is to define a common framework for modelling computing requirements

  • Only needs to be good enough
  • Current work based on CMS framework: interest from other experiments
  • Input parameters: LHC parameters (trigger rates, duty cycle...), computing model (event size, processing times, processing campaign), storage model (number of replicas...), infrastructure (capacity evolution model, disk, tape...)
    • No network parameters for the moment
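To make the structure of such a model concrete, here is a toy sketch and not the CMS framework. It takes a handful of the input categories listed above and turns them into yearly storage and CPU figures. All parameter names, formulas and numbers are simplifying assumptions chosen for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class YearInputs:
    trigger_rate_hz: float        # LHC parameter
    live_seconds: float           # LHC parameter (duty cycle folded in)
    event_size_mb: float          # computing model
    cpu_seconds_per_event: float  # computing model
    processing_passes: int        # computing model (processing campaigns)
    replicas: int                 # storage model


def estimate(inputs):
    """Return yearly raw data volume, disk need and CPU need (toy formulas)."""
    events = inputs.trigger_rate_hz * inputs.live_seconds
    raw_pb = events * inputs.event_size_mb / 1e9  # 1 PB = 1e9 MB
    cpu_seconds = events * inputs.cpu_seconds_per_event * inputs.processing_passes
    return {
        "events": events,
        "raw_pb": raw_pb,
        "disk_pb": raw_pb * inputs.replicas,
        "cpu_core_years": cpu_seconds / SECONDS_PER_YEAR,
    }


# Invented numbers, only to exercise the formulas.
result = estimate(YearInputs(
    trigger_rate_hz=1000, live_seconds=5e6, event_size_mb=1.0,
    cpu_seconds_per_event=10, processing_passes=1, replicas=2,
))
```

Keeping the inputs in one explicit structure makes it easy to vary a single parameter and see its impact, which is the main use of a "good enough" model.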

Site cost estimation: a model developed at CCIN2P3

  • Goal: maximize the capacity over cost ratio
  • Takes into account HW costs, electricity, infrastructure, manpower

Also some work specifically on the impact of various storage strategies on the overall operation cost

  • Having only 15 big data sites and only (unmanaged) caches at T2: may save up to 45% of the storage manpower (60 FTEs)
    • Some discussion during the GDB about the figures used to come to this conclusion... partly based on estimates made at NDGF to run an ARC CE and its cache without running a SE
  • Removing replicas for all files that can be regenerated, e.g. AOD: in the worst case, where all the lost files are regenerated, the price is probably the same, based on 1 PB lost per year (corresponding to the observed disk failure rate in EOS)
    • Could be better if regenerating only things that are needed

Cache efficiency, throughput and latency studies: preliminary work done, see slides

Several different knobs each improve performance only slightly, but most of them can be combined to yield a significant improvement

HEPiX Report - H. Meinhard

HEPiX WGs

  • 2 new WGs formed recently: Tech Watch and AAI

Last workshop last week in Barcelona (PIC): 137 attendees (3/4 from Europe) from 54 affiliations (including 5 companies)

Covering the usual, diverse, service-related topics on basic and end user IT services, computing and batch services, storage, network and security, IT facilities and business continuity

  • Computing: first tests of AMD EPYC architecture that may be appealing for us in the future
  • Several HEP sites involved in Photon Science: a first workshop at BNL earlier this month, idea of a co-located meeting with HEPiX in the future
  • Site reports: a few newcomers like SurfSARA

Several BOFs: AAI at sites, DOMA access, archival storage/tapes

  • AAI WG/SIG created as a result of the BOF, led by Paolo Taledo (CERN AAI project manager). Subscription welcome: see slides.

Next workshops

  • SDSC, San Diego, March 25-29 (week after JLab workshop)
  • NIKHEF, Amsterdam, October 14-18
  • Also the next HTCondor European Workshop at Ispra (Italy), September 24-27

DUNE Computing Update - Ken Herner

DUNE far detector in South Dakota (Sanford Underground Research Facility, former mine)

  • 800 miles from Fermilab
  • Four liquid argon detectors, 19 x 18 x 60 m
  • Plan to start operation in 2026

protoDUNE at CERN to test technologies for DUNE (building EHN1, Prevessin)

  • 2 detector technologies: dual phase and single phase
  • x100 factor with DUNE

Single Phase computing: 2-3 GB/s at 25 Hz beam

  • 2-5 PB of raw data for the 6 weeks of running this fall
    • Only 1/3 for physics, the rest are tests and noise
  • DUNE: 10-30 PB/year from far detector, more from near detector
    • In fact, DUNE should run at a lower frequency than 25 Hz
  • Rates similar to CMS today
    • 2026 expectation: data rate similar to ATLAS + CMS today
  • Production current focus: data keepup (decoding + hit reconstruction, no tracking yet) and MC

Using LArSoft, a CMSSW fork

  • Also evaluating POMS, developed at FNAL, as the workflow management system
    • Used to schedule keepup jobs
    • Also plan to evaluate others
  • Data management: still using SAM, likely to switch to Rucio
    • CERN-FNAL transfers using FTS
    • Data out of the protoDUNE detector first stored in EOS at CERN
  • Also relying on CVMFS and HTCondor GlideinWMS

Cross-site computing monitoring developed to provide a unified view of CERN and FNAL resources

  • Covers network and transfers

Next steps

  • Move all sites to multicore
    • Tracking will be able to use multicore, based on TensorFlow: deep learning will play a large role, need for CPU/GPU combination at a large scale
  • Integrate SEs from other sites, starting with UK
  • TDR: expected in 2019

Longer term

  • Develop a computing consortium with interested parties to work on DUNE computing strategy, assemble resources and make technical decisions
    • Work with wider HEP and astrophysics where it makes sense (workflow management, data management, authz, coding practices...)
    • Many lessons to be learned from WLCG
  • Evaluate technologies and make choices for the various parts
    • Need to adapt to the evolving computing landscape
  • Operations, including management of resource commitments
  • Efficient use of HPC resources: clear mandate from DOE
    • Art v3 is a step in the right direction: it allows processing multiple events in separate threads
    • CORI (NERSC) was able to reproduce in a few hours the 4 week long NOvA analysis done in 2017

Discussion

Ian C.: how much effort is put in Data Analysis Preservation? We learned it is better to think about it upfront...

  • Ken: One WG was formed recently on the computing model and explicitly has this topic in its scope
  • Maarten: keep in mind that it is more than bit preservation, may be worth liaising with REANA project at CERN

-- IanCollier - 2018-10-18
