Summary of October GDB, October 17, 2018 (CERN)
Agenda
Introduction
Next GDBs
- March 2019 at CERN, usual week: not moving, book early
- April 2019: during ISGC, one week earlier than usual (April 3rd)
Pre-GDB: authn/authz in December
- No confirmed pre-GDB in November
List of upcoming meetings: see slides
- LHCONE/LHCOPN Workshop, 30th-31st October, Fermilab, US, https://indico.cern.ch/event/725706/
- CS3, 28th-30th January 2019, INFN Roma, Italy - http://cs3.infn.it/
- Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP2019, February 13-15, 2019, Pavia, Italy, https://www.pdp2019.eu/std.html
- WLCG/HSF/OSG Workshop, 18th-22nd March 2019, JLab, VA, USA
- HEPiX Spring 2019, 25th-29th March 2019, San Diego
- ISGC 2019, 31st March - 5th April 2019, Taipei, Taiwan
- European HTCondor Workshop, 24-27 September 2019, at the Joint Research Centre of the European Commission in Ispra, Lombardy, Italy
- The CHEP date mentioned is still tentative...
Next WLCG/HSF workshop: JLab decided
- Full week of March 18, 2019
- Held jointly with OSG
- Work has just started on an agenda matching the expectations of the three parties: see slides for an early list of topics
CRIC Status - A. Di Girolamo
CRIC is a validated source of information for the experiments, presenting several sources of information in a consistent way
- BDII, GOCDB, REBUS, OIM...
- Not a replacement for the information sources, e.g. BDII: CRIC is an aggregator
- Information presented as a particular experiment expects it
- Can include every kind of resource, including those not in the BDII/grid (HPC...)
Implemented as a Web application with a REST API (see the query sketch after the list below)
- Based on Django framework
- Authn/authz for a flexible implementation of groups and roles
- Possibility of editing information manually for urgent fixes
- If information published by the sites is wrong, possibility to notify the site
- Details of how this is done are still to be sorted out: a ticket is probably the best way to do it
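As an illustration of the REST interface, a minimal sketch of how a client could pull the JSON topology from a CRIC instance is shown below; the endpoint path and field names are assumptions for the example, not the documented CRIC API, and authentication (SSO or X.509) is omitted.

```python
# Minimal sketch of consuming a CRIC-style REST endpoint from a client script.
# The endpoint path and the returned field names are illustrative assumptions,
# and authentication (SSO cookie or X.509 proxy) is omitted.
import requests

CRIC_URL = "https://wlcg-cric.cern.ch/api/core/rcsite/query/?json"  # hypothetical path

def list_sites(url=CRIC_URL):
    """Fetch the JSON topology dump and print one line per site."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    sites = resp.json()  # assumed shape: {site_name: {attribute: value, ...}, ...}
    for name, info in sorted(sites.items()):
        print(name, info.get("country", "?"), info.get("state", "?"))

if __name__ == "__main__":
    list_sites()
```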
Status
- CRIC for CMS as a SiteDB replacement: the main priority presently (https://cms-cric.cern.ch)
- User info (including authz) and topology retrieved from CRIC
- Also working on importing the glideinWMS config: much more complex, many problems identified, now ready to test again...
- Importance of collaboration with experiment experts
- Coming soon: WLCG instance for operations and for LHCb/ALICE, who didn't request a specific instance (https://wlcg-cric.cern.ch)
- Then an ATLAS instance
- Also an instance for COMPASS has been set up
Contact: cric-devs@cern.ch
Discussion
Ian C.: isn't this making the situation more complex for sites if they have to support CRIC for WLCG and the BDII and friends for other VOs?
- A: CRIC is not a site service, it consumes information published by sites using the usual/common services.
- Once its usefulness has been demonstrated for the WLCG VOs, it may be interesting to propose it to other VOs, if they are willing to adopt this approach and implement their own instance
- Alessandro: CRIC also makes possible for a site to use something else (e.g. JSON export) to publish their resources if they don't want to run a BDII
Tim: need to make clear which is the right place for updating site information like downtimes: CRIC or GOCDB?
- Vincent: we need to be sure that CRIC is not exposing a site that has been suspended, in particular for security reasons
DPM Upgrade TF
Storage Resource Reporting (SRR) - J. Andreeva
Prototyping phase started in Fall 2017
2 different use cases for SRR
- Topology: description of non-overlapping storage shares, protocols
- Accounting: query used/free space without SRM, based on a JSON file published by the SEs (see the sketch below)
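A minimal sketch of the accounting use case, assuming an SE publishes its SRR JSON at a known URL; the URL and the field names (storageservice, storageshares, usedsize, totalsize) follow the SRR proposal as understood here and may differ from what a given implementation actually publishes.

```python
# Sketch of the SRR accounting use case: fetch the JSON published by an SE and
# report used/total space per storage share. The URL and the field names are
# assumptions and may differ from what a given SE actually publishes.
import json
import urllib.request

SRR_URL = "https://se.example.org/static/srr.json"  # hypothetical publication path

def report_shares(url=SRR_URL):
    with urllib.request.urlopen(url, timeout=30) as resp:
        srr = json.load(resp)
    for share in srr["storageservice"]["storageshares"]:
        used = share.get("usedsize", 0)
        total = share.get("totalsize", 0)
        print(f"{share.get('name', '?')}: {used / 1e12:.1f} / {total / 1e12:.1f} TB used")

if __name__ == "__main__":
    report_shares()
```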
DPM support for SRR: available in 1.10.3
- Requires DOME to be enabled
TF Report - F. Furano
DPM 1.10.4 is the current release and supports SRR
- Based on the new directory based quota tokens: very efficient real-time processing of the information
- Published at known path: can be retrieved by CRIC
- DOME must be enabled: upgrading to 1.10.4 is not enough
- Still possible to run the legacy (SRM) stack, but as a totally separate stack: it is optional once DOME is enabled
Legacy mode in DPM (aka LCGDM) will not be supported anymore after June 1st, 2019
- No plan to remove it from EPEL as long as it compiles: pure C-code, may last very long...
TF to minimise the DOME switch effort: start with a few sites, then ask for a wide upgrade.
- Goal: 5 pilot sites (3 so far: BRUNEL, INFN-NAPOLI, PragueLCG2)
- Next enlarged dev meeting (Doodle in progress) will discuss this
Authorization WG - R. Wartel
Design and promote token-based authn/authz as a possible replacement for X.509
- 2 modes: integrated with CERN/SSO and CERN HR DB or standalone (configurable authn and identity vetting sources)
- No deployment by WLCG: use existing SW solutions
Ongoing activities
- Pilot AAIs being enhanced to meet the requirements defined/agreed
- 2 pilots: EGI Check-in & COmanage, and INDIGO IAM
- WLCG will ultimately decide on one implementation for its use
- Token schema to allow interoperability between implementations (see the illustrative payload after this list)
- VO interviews to align with the VO workflow/framework requirements
- Will influence the token schema definition
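For illustration, a sketch of the kind of token payload such a schema could define; the claim names and scope values below are assumptions for the sake of the example, not the agreed WLCG schema (which is exactly what the WG is still defining).

```python
# Illustration only: the general shape a WLCG-style token payload might take once
# the common schema is agreed. Claim names (wlcg.ver, wlcg.groups) and scope values
# are assumptions for the sake of the example, not the final schema.
example_token_payload = {
    "iss": "https://vo.example.org/",        # token issuer (the VO's AAI)
    "sub": "b2c4-opaque-user-id",            # stable, opaque user identifier
    "aud": "https://se.example.org",         # intended recipient service
    "exp": 1545055800,                       # expiry (Unix time)
    "wlcg.ver": "1.0",                       # schema version claim
    "wlcg.groups": ["/vo.example.org/production"],
    "scope": "storage.read:/ storage.modify:/vo/scratch compute.create",
}

# A relying service would verify the signature and expiry, then map the groups and
# scopes to local authorization decisions, instead of parsing a VOMS proxy.
```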
Next phase
- Dec. pre-GDB: assess the pilots
- Feb. 2019: provide feedback to WLCG MB
Discussion
Xavier: several other communities are looking at sharing their infrastructure (in particular for data) with us (DUNE, SKA...); we need to be proactive with them to ensure we are deciding on something that can meet their needs
- Romain/Maarten: sure, it is important. But no worries, as we are jumping on an industry-standard solution, adopted by a significant part of the world.
WLCG DOMA Report - S. Campana
Project general meeting on the 4th Wednesday of each month at 4 pm CET
3 WGs
- Third Party Copy (TPC): alternative to GridFTP
- Co-chaired by A. Forti (ATLAS) and B. Bockelman (CMS)
- Short-term: investigate and commission alternative protocols, currently HTTPS/DAV and XRootD (see the sketch after this list)
- Medium-term: token-based auth (in coordination with authz WG)
- Long-term: switch to an alternative protocol
- Rucio+FTS testbed set up: many caveats being addressed. Initial focus on functionality, performance later.
- Data access: performance, content delivery, caching
- QoS: recently started
- Chair: P. Millar
- Expose different classes of storage (what we used to call disk and tape) and have them integrated into the DDM frameworks. A potential source of HW savings.
- A storage class is a different tradeoff between performance, reliability and cost
- Will help with the integration of new storage technologies
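As an illustration of what an HTTP/DAV third-party copy looks like, here is a rough sketch of a pull-mode COPY request; URLs and credentials are placeholders, and the header names should be checked against the HTTP-TPC documentation. Real deployments also involve credential delegation and performance-marker polling.

```python
# Rough sketch of an HTTP third-party copy in "pull" mode: the client asks the
# destination endpoint to fetch the file directly from the source, so the data
# never flows through the client. URLs and bearer tokens are placeholders.
import requests

SOURCE = "https://source-se.example.org/vo/data/file.root"      # hypothetical
DESTINATION = "https://dest-se.example.org/vo/data/file.root"   # hypothetical

resp = requests.request(
    "COPY",
    DESTINATION,
    headers={
        "Source": SOURCE,                        # where the destination should pull from
        "Authorization": "Bearer <dest-token>",  # placeholder credential for the destination
        "TransferHeaderAuthorization": "Bearer <source-token>",  # forwarded to the source
    },
    timeout=300,
)
print(resp.status_code, resp.text[:200])
```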
Also some work related to network activities and R&D: focus on data transfers
- Data Transfer Nodes, bandwidth on demand, SDNs...
- Collaboration with the SKA AENEAS project and the HEPiX Networking WG
- Leverage network information from FTS as the transfer manager
AAI: follow the WLCG Authz WG
WLCG DOMA is not about defining the Data Lake architecture, but about enabling, through R&D and prototypes, the technologies that will make the Data Lake possible
- Distributed storage is addressed through the 3 focused subgroups, which cover different aspects of a global/distributed storage
- The adoption of Rucio by CMS opens a big opportunity for simplifying the "inter-operability" work
- Some topics not yet discussed, like the tape carousel: work also done in the archival WG
Next milestones
- LHCC review of WLCG: no date settled yet, but sometime this year. We need to be ready to present our work.
- HSF/WLCG workshop in JLab
Work done in collaboration with the relevant activities in the XDC and ESCAPE projects
Discussion
Discussion on how efficient caching can be in our context, where there is little reuse of files
- The analysis mentioned before, based on MWT2, tends to suggest that there is a significant efficiency gain with a reasonably sized cache
- In addition to pure caching, it is planned to look at 'latency hiders', where the required files are read ahead once the processing has already started, so that the latency price is paid only by the first events (see the toy sketch below).
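A toy sketch of the latency-hider idea, assuming a generic fetch/process loop rather than any experiment's actual I/O framework: the next event is prefetched in a background thread while the current one is processed.

```python
# Toy illustration of a 'latency hider': while event N is processed, the data for
# event N+1 is already being fetched in the background, so only the first read pays
# the full latency. Generic sketch, not any experiment's actual I/O framework.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(event_id):
    """Stand-in for a high-latency remote read."""
    time.sleep(0.5)          # pretend WAN latency
    return f"data-{event_id}"

def process(data):
    time.sleep(0.2)          # pretend CPU work

def run(n_events=10):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, 0)               # prefetch the first event
        for i in range(n_events):
            data = future.result()                   # blocks only if the prefetch is not done yet
            if i + 1 < n_events:
                future = pool.submit(fetch, i + 1)   # start fetching the next event
            process(data)                            # overlap CPU work with the next fetch

if __name__ == "__main__":
    run()
```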
Cost and Performance Modelling - J. Flix
Identify metrics that best describe a workload to better predict needs and impact of changes
- Need a collaboration of SW experts and site administrators
- SW experts: how efficiently the HW is used
- Site admins: how the global resources are used
A tool developed by HSF to collect basic metrics for an application: prmon (see the sketch below)
- Metrics based on OS counters
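A sketch of how a workload could be wrapped with prmon and its summary read back; the command-line options and output file names are from memory and should be checked against the prmon version in use, and "my_workload.py" is a placeholder.

```python
# Sketch of wrapping a payload with prmon and reading back its summary. The exact
# options and output file names (prmon.json here) should be checked against the
# prmon version in use; "my_workload.py" is a placeholder payload.
import json
import subprocess

# Run the payload under prmon, which periodically samples /proc-based OS counters.
subprocess.run(["prmon", "--", "python", "my_workload.py"], check=True)

# prmon writes a time series plus a JSON summary with averages and maxima.
with open("prmon.json") as f:
    summary = json.load(f)
print("Max RSS:", summary.get("Max", {}).get("rss"))
```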
Another tool, developed in the WLCG performance group at CERN IT: Trident
- Collects metrics based on CPU counters
Resource estimation: the goal is to define a common framework for modelling computing requirements (see the toy sketch after this list)
- It only needs to be good enough
- Current work based on CMS framework: interest from other experiments
- Input parameters: LHC parameters (trigger rates, duty cycle...), computing model (event size, processing times, processing campaign), storage model (number of replicas...), infrastructure (capacity evolution model, disk, tape...)
- No network parameters for the moment
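A toy illustration of the kind of calculation such a framework formalises, using invented placeholder numbers rather than any experiment's real parameters:

```python
# Toy illustration with invented placeholder numbers: derive a yearly raw-data volume
# and a disk estimate from a handful of the input parameters listed above.
def raw_volume_pb(trigger_rate_hz, event_size_mb, live_seconds, duty_cycle):
    return trigger_rate_hz * event_size_mb * live_seconds * duty_cycle / 1e9  # MB -> PB

def disk_needs_pb(raw_pb, derived_fraction, n_replicas):
    return raw_pb * (1 + derived_fraction) * n_replicas

raw = raw_volume_pb(trigger_rate_hz=1000, event_size_mb=2.0,
                    live_seconds=7.0e6, duty_cycle=0.5)
print(f"raw:  {raw:.0f} PB/year")                                                     # ~7 PB/year
print(f"disk: {disk_needs_pb(raw, derived_fraction=0.5, n_replicas=2):.0f} PB/year")  # ~21 PB/year
```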
Site cost estimation: a model developed at CC-IN2P3 (see the toy sketch after this list)
- Goal: maximize the capacity over cost ratio
- Takes into account HW costs, electricity, infrastructure, manpower
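A toy sketch of the capacity-over-cost figure of merit, with invented placeholder numbers; the cost terms mirror the list above, but the real CC-IN2P3 model is of course more detailed.

```python
# Toy sketch of the capacity-over-cost figure of merit with invented placeholder
# numbers; the cost terms mirror the list above (HW, electricity, infrastructure,
# manpower).
def total_cost(hw_cost, power_kw, kwh_price, lifetime_years, infra_cost, fte, fte_cost):
    electricity = power_kw * 24 * 365 * lifetime_years * kwh_price
    manpower = fte * fte_cost * lifetime_years
    return hw_cost + electricity + infra_cost + manpower

def capacity_over_cost(capacity_hs06, **cost_kwargs):
    return capacity_hs06 / total_cost(**cost_kwargs)

ratio = capacity_over_cost(
    capacity_hs06=50_000,                       # delivered CPU capacity (HS06), placeholder
    hw_cost=300_000, power_kw=30, kwh_price=0.12,
    lifetime_years=4, infra_cost=50_000, fte=0.5, fte_cost=80_000,
)
print(f"{ratio:.3f} HS06 per currency unit over the lifetime")
```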
Also some specific work on the impact of various storage strategies on the overall operation cost
- Having only 15 big data sites and only (unmanaged) caches at T2: may save up to 45% of the storage manpower (60 FTEs)
- Some discussion during the GDB about the figures used to come to this conclusion... partly based on estimates made at NDGF to run an ARC CE and its cache without running a SE
- Removing replicas for all files that can be regenerated, e.g. AOD: in the worst case, it is probably the same price to regenerate all of the lost files, based on 1 PB lost per year (corresponding to the observed disk failure rate in EOS)
- Could be better if regenerating only things that are needed
Cache efficiency, throughput and latency studies: preliminary work done, see slides
Several different knobs each improve performance slightly, but most of them can be combined to give a significant improvement
HEPiX Report - H. Meinhard
HEPiX WGs
- 2 new WGs formed recently: Tech Watch and AAI
Last workshop last week in Barcelona (PIC): 137 attendees (3/4 from Europe) from 54 affiliations (including 5 companies)
Covering the usual, diverse, service-related topics on basic and end user IT services, computing and batch services, storage, network and security, IT facilities and business continuity
- Computing: first tests of AMD EPYC architecture that may be appealing for us in the future
- Several HEP sites involved in Photon Science: a first workshop at BNL earlier this month, idea of a co-located meeting with HEPiX in the future
- Site reports: a few newcomers like SURFsara
Several BOFs: AAI at sites, DOMA access, archival storage/tapes
- AAI WG/SIG created as a result of the BOF, led by Paolo Taledo (CERN AAI project manager). Subscription welcome: see slides.
Next workshops
- SDSC, San Diego, March 25-29 (week after JLab workshop)
- NIKHEF, Amsterdam, October 14-18
- Also the next HTCondor European Workshop at Ispra (Italy), September 24-27
DUNE Computing Update - Ken Herner
DUNE far detector in South Dakota (Sanford Underground Research Facility, former mine)
- 800 miles from Fermilab
- Four liquid argon detectors of 19 x 18 x 60 m
- Plan to start operation in 2026
ProtoDUNE at CERN to test technologies for DUNE (building EHN1, Prevessin)
- 2 detector technologies: dual phase and single phase
- DUNE will be a factor ~100 bigger
Single Phase computing: 2-3 GB/s at a 25 Hz beam rate (see the back-of-envelope sketch after this list)
- 2-5 PB of raw data for the 6 weeks of running this fall
- Only 1/3 is for physics, the rest is tests and noise
- DUNE: 10-30 PB/year from the far detector, more from the near detector
- In fact, DUNE should run at a lower frequency than 25 Hz
- Rates similar to CMS today
- 2026 expectation: data rate similar to ATLAS + CMS today
- Current production focus: data keep-up (decoding + hit reconstruction, no tracking yet) and MC
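A back-of-envelope check of the numbers above, treating the quoted rate as a rough average while taking beam:

```python
# Back-of-envelope check of the quoted protoDUNE single-phase numbers.
rate_gb_s = 2.5                     # mid-point of the quoted 2-3 GB/s
trigger_hz = 25
print(f"~{rate_gb_s * 1e3 / trigger_hz:.0f} MB per event")          # ~100 MB/event

six_weeks_s = 6 * 7 * 24 * 3600
continuous_pb = rate_gb_s * six_weeks_s / 1e6
print(f"~{continuous_pb:.0f} PB if taking data continuously")       # ~9 PB
# The quoted 2-5 PB of raw data therefore implies an effective live fraction of
# roughly 20-55% over the 6-week run (spill structure, tests, downtime).
```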
Using LArSoft, a CMSSW fork
- Also evaluating POMS, developed at FNAL, as the workflow management system
- Used to schedule the keep-up workflows
- Also plan to evaluate others
- Data management: still using SAM, likely to switch to Rucio
- CERN-FNAL transfers using FTS
- Data out of the protoDUNE detector first stored in EOS at CERN
- Also relying on CVMFS and the HTCondor-based glideinWMS
Cross-site computing monitoring developed to provide a unified view of CERN and FNAL resources
- Covers network and transfers
Next steps
- Move all sites to multicore
- Tracking will be able to use multicore, based on TensorFlow: deep learning will play a large role, hence the need for a CPU/GPU combination at a large scale
- Integrate SEs from other sites, starting with UK
- TDR: expected in 2019
Longer term
- Develop a computing consortium with interested parties to work on DUNE computing strategy, assemble resources and make technical decisions
- Work with the wider HEP and astrophysics communities where it makes sense (workflow management, data management, authz, coding practices...)
- Many lessons to be learned from WLCG
- Evaluate technologies and make choices for the various parts
- Need to adapt to the evolving computing landscape
- Operations, including management of resource commitments
- Efficient use of HPC resources: clear mandate from DOE
- art v3 is a step in the right direction: it allows processing multiple events in separate threads
- Cori (NERSC) was able to reproduce in a few hours the 4-week-long NOvA analysis done in 2017
Discussion
Ian C.: how much effort is put into Data Analysis Preservation? We learned it is better to think about it upfront...
- Ken: a WG on the computing model was formed recently and explicitly has this topic in its scope
- Maarten: keep in mind that it is more than bit preservation; it may be worth liaising with the REANA project at CERN
--
IanCollier - 2018-10-18