Summary of GDB meeting, May 11, 2016 (CERN)

Agenda

https://indico.cern.ch/event/394782/

Introduction - I. Collier

Future pre-GDBs and GDBs

  • June pre-GDB: IPv6 Workshop
    • Would be good to have as many T1s as possible represented
  • June GDB: in-depth IPv6 session following up on the workshop
  • July pre-GDB: possible Security Operation Centre WG F2F
  • July GDB: still open to suggestions

Forthcoming meetings

GDB Steering Group - I. Bird

GDB Steering Group

The idea of a WLCG Technical Forum was mentioned in the past: after discussing with many people, the conclusion is that the GDB is the right place for this

  • Need to strengthen the in-depth technical discussions at GDB: already moved in the right direction in the last months
  • Proposal of a GDB Steering Group to help drive the discussion (not to have it!)
    • 1 representative per experiment + 1 or 2 representing sites
    • Jeff: why so few site representatives compared to experiments? There are more sites than experiments!
    • Ian B.: the group doesn't need a well-balanced representation; it is meant to drive the discussion, not to have it or take decisions
    • Ian C.: one T1 representative + one T2 representative.

Let's decide the site representatives during the GDB or shortly after that

  • Experiments must nominate their representative asap

Security Operation Center Update - D. Crooks

Following the discussion at the March GDB, a WG was formed. Scope:

  • Identify key stakeholders to be considered in the SOC deployment
  • Data protection/privacy issues
  • Timeframe for delivery

Mandate

  • Establish a clear set of data inputs and outputs
  • Review of relevant SOC products and projects
  • Reference design for large sites + appliance for small sites

Participation in the WG: all kinds of sites

  • T0, T1, T2 (different T2 sizes and types, e.g. cache sites)
  • Dedicated or shared sites
  • Candidates welcome! Both experts and interested/motivated people
    • Contact: David and Liviu
    • An e-group has been created: ask to be registered if interested
    • A CERNbox area has been created

Possible technical seed: Intrusion Detection System + Threat Intelligence

  • IDS: Bro
  • Threat Intelligence: MISP

Timeline

  • F2F pre-GDB in July
  • Report/discussion at WLCG workshop in October

Machine/Job Features TF - A. McNab

Goals:

  • A common API that jobs can use to discover the parameters of their environment, e.g. time limit
  • Support VMs and clouds (no batch system command to retrieve the info)
  • Reduce the NxM matrix (when N experiments need to support M batch systems) to something closer to N+M

Status

  • Agreement on key/value pairs to publish
  • Transport mechanism: $MACHINEFEATURES and $JOBFEATURES point to a "directory" containing 1 file per key (filename is the key); see the sketch below
    • In cloud, the "directory" is a URL populated by the VM Manager
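
A minimal sketch of how a job could read one MJF key under this convention (not the reference implementation; the key name wall_limit_secs and the error handling are illustrative):

    import os
    import urllib.request

    def mjf_value(base_var, key):
        """Return the raw value of one MJF key, or None if unavailable."""
        base = os.environ.get(base_var)
        if base is None:
            return None  # this site does not publish MJF
        location = base.rstrip('/') + '/' + key
        try:
            if location.startswith(('http://', 'https://')):
                # Cloud case: the "directory" is a URL kept up to date by the VM manager
                with urllib.request.urlopen(location) as resp:
                    return resp.read().decode().strip()
            with open(location) as f:  # batch case: one plain file per key
                return f.read().strip()
        except OSError:
            return None  # this key is not published here

    # Example: a job discovering its wall-clock limit
    limit = mjf_value('JOBFEATURES', 'wall_limit_secs')
    if limit is not None:
        print('Wall-clock limit: %d seconds' % int(limit))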

Implementation available as source (GitHub) and RPM: see https://github.com/HEP-SF/documents/raw/master/HSF-TN/2016-02/HSF-TN-2016-02.pdf

SAM probe available: runs in the ETF pre-prod service

  • Can also be run by hand
  • Same test script for all batch systems: this is the whole point of MJF!!
  • Critical to follow up on the rollout

Please volunteer!!

Discussion

Ian C.: probably too early to decide whether WLCG should require general deployment

  • Need more sites to adopt it and report back: an official request for more volunteers should be issued
  • Review them in a few months

Tim Bell: what is the real benefit that can be expected by experiments?

  • Andrew: no real sustainable alternative if a job has to know about its environment. Lightweight solution.
  • Maarten: probably not a strong requirement by experiments except LHCb (due to its job masonry) but all experiments agreed that if the system was widely deployed, they could make use of it

Michel: GRIF was an early adopter of the previous version and it was not hard work to deploy (at least the machine features part). The new version is supposed to be even easier according to Andrew. We should stress that the complexity is nowhere near that of CVMFS (not to speak of glexec!).

HSF Workshop Summary - M. Jouvin

Held at LAL last week (May 2-4): https://indico.cern.ch/event/496146/timetable/

  • Good participation: 70 people, 30 affiliations
    • Not only the LHC experiments but also IF, Belle II...
  • Agenda: mix of general discussions and topical sessions

HSF objectives remain unchanged: promote collaboration around SW, avoid duplication, give visibility to new projects, help with career recognition

  • Also a potential framework for attracting support
  • A framework for interacting with other communities

HSF activities structured in WGs

  • Communication and information exchange
  • Training
  • Software packaging
  • Software projects (the main pillar of HSF)
  • Dev tools and services

Several project reports: DIANA-HEP, AIDA2020 (detector and SW R&D), future Conditions DB (common ATLAS/CMS), HEP S&C Knowledge Base, WikiToLearn

Project support: refining what HSF can bring to projects

  • Progress in the last year: a best-practice document (soon a Technical Note), a project creation script helping to implement these best practices
  • Future work planned
    • Help with visibility of projects
    • Interoperability of projects
    • Project peer review: starting with GeantV, several projects declared interest

SW packaging is another very active area

  • Key piece for SW interoperability/cross-integration
  • A lot of work around Spack, a tool from the HPC world

News from other communities: 3 projects presented with various aspects relevant/close to HSF

  • Bioconductor: biomed project portal
  • Netherlands eScience Center: already presented at GDB
  • depsy: an NSF-funded project to promote credit for SW in science

2 topical sessions

  • Machine Learning: hot topic in HEP; the Inter-experiment Machine Learning (IML) WG started last year
    • IML wants to have strong links with HSF: will become the HSF forum for ML
  • SW performance: contributions by ALICE, ATLAS, CMS, GeantV, ROOT, Art/LArSoft, and the Astroparticle community
    • A lot of interesting discussions, more questions than answers
    • An activity that will be made more visible in HSF, through the SW Technology Evolution forum (replacement for the SW Concurrency Forum)

Community Whitepaper: a proposal from American colleagues to build a roadmap for the work addressing the challenges of HL-LHC computing

  • Target: a whitepaper by summer 2017
  • Proposal: a series of HSF-branded workshops during the next year
    • Discussing a kick-off around CHEP
  • Fits well with the LHCC request for an HL-LHC computing TDR: quite complementary

Proposal of a Journal about SW&C in Data-intensive sciences

  • Refereed, indexed journal that could be a reference archive
  • Already presented at HEPiX 2 weeks ago
  • Not restricted to HEP but focused on data-intensive sciences
  • Good feedback; discussions ongoing on how to move forward
    • Present/discuss at a future GDB?

Conclusions:

  • HSF is alive and recognised
  • Community Whitepaper is a good initiative to progress towards common solutions
  • HSF will try to get an "official blessing" from ICFA and similar bodies
  • Discussion still going on about a legal entity to support HSF
    • Need discussions with funding agencies and lawyers
    • Initial goal: IPR management, like the Apache SW Foundation

WLCG Experiment Test Framework (ETF) - M. Babik

ETF: new version of the SAM/Nagios Test Framework

  • Keep track of site availability/reliability
  • Run deployment campaigns
  • Overall: simplification, reduction of complexity
  • Keep up with changes in the monitoring technologies: OMD, new communication/transport libraries...
  • Publish metrics to other services like SAM3

Core framework still based on Nagios-core but integrated with check_mk and OMD

  • New web interface (check_mk)
  • Same plugins/probes
  • WN micro-framework: tests run through job submission (illustrated below)
    • Results retrieved directly from the job output and inserted into Nagios
    • All CE technologies supported: CREAM, ARC, HTCondor-CE, Globus
  • Two custom components: rule-based configuration (ncgx), publication of results (nagios-stream)
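
The WN results are fed back as Nagios passive check results. A minimal illustration of that mechanism (not the actual ETF/nagios-stream code; the command-file path and the host/service names are assumptions):

    import time

    NAGIOS_CMD = '/var/nagios/rw/nagios.cmd'  # site-dependent Nagios command pipe

    def publish_passive_result(host, service, code, output):
        """Write one PROCESS_SERVICE_CHECK_RESULT line to the Nagios command file.

        code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN (standard Nagios return codes).
        """
        line = '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' % (
            int(time.time()), host, service, code, output)
        with open(NAGIOS_CMD, 'w') as cmd:
            cmd.write(line)

    # e.g. a result extracted from the output of a job run on a WN:
    publish_passive_result('ce01.example.org', 'org.example.WN-Basic', 0,
                           'OK: WN environment checks passed')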

Changed behaviour

  • Test with RFC proxies
  • All services in VO feeds will be tested
  • https tested for data access

Check_MK Central: http://etf.cern.ch

  • Currently in beta test
  • Reachable from outside CERN (a certificate is needed)

No change in the support channel: GGUS

Future work

  • Notifications and site-based host groups
  • Refactoring of WN framework
  • Cloud support

HEPiX Summary - H. Meinhard

HEPiX is a very lively organisation: the last meeting was held in Berlin, April 18-22

New activities

  • CPU benchmarking working group relaunched
    • People interested in a fast benchmark and the next HS06: contact Manfred
  • Monitoring: a BoF was run with good participation
    • People interested: contact Cary Whitney (LBL)

Tracks and trends

  • 15 site reports: HTCondor-CE may have the potential to replace other CE technologies
  • Security and networking - 7 contributions
  • Storage and file systems
    • Ceph usage continues to grow
    • One of the OpenAFS maintainers gave a worrisome presentation about the future of OpenAFS (in particular in terms of support for new kernels)
  • Grid/clouds: a lot of work around containers and container orchestrators
  • IT facilities: the new Green IT Cube at GSI is now in production; a very efficient datacenter

Next workshops

  • Berkeley: the week after CHEP
  • Spring 2017: not clearly settled yet; confident it will be in Budapest (organised by Wigner)
  • Fall 2017: KEK, date already fixed (October 16-20, 2017)
  • Will need to reconsider the alternation of European/North-American meetings, as Asian labs are becoming important contributors to HEPiX
  • Expressions of interest and proposals to host are always welcome

Discussion

  • Is the active participation of other sciences increasing?
    • Helge: life sciences came, dried up, and are coming back
    • Ian C.: participation also comes from people who started in our community and moved away
    • Helge: still, it goes beyond personal contacts

Lightweight Sites

Introduction - M. Litmaath

Session about small sites

  • Nevertheless, other sites may benefit from simplification

Build on earlier discussions and ideas

  • In particular the demonstrators for new data management approaches presented at the April MB

T2 vs. T3: T3 generally dedicated to one experiment, no need for the generic grid MW

  • Directly deploy the experiment framework, no need for APEL accounting/InfoSys integration
  • T2 possible simplifications: reduce the catalog of required services, replace classic/complex services by new, simpler ones, simplify deployment and maintenance
    • catalog reduction example: computing-only sites
    • CE/batch systems: reduce the number of options, HTCondor (and -CE) on the rise
    • Accounting: ARC CE can publish directly into APEL

Cloud systems: not completely easy

  • OpenStack is the most popular cloud MW but not necessarily easy to deploy/manage
  • Paradigm shift: batch slots -> VM instances; need proper accounting
  • Do not expose a wide zoo of solutions to the experiments

AuthZ

  • EGI: ARGUS is the cornerstone, supported through the INDIGO Datacloud project, release 1.7 almost ready
  • OSG: GUMS

Configuration

  • Slow but steady move to Puppet
    • Shared modules: not a lot of evidence yet
  • Some sites prefer another solution but probably know what they are doing
  • YAIM still being used for (too) many services
  • DPM approach: standalone Puppet installation - an idea to be developed?
  • Small sites will need help

Simplified deployment

  • VMs distributed by CVMFS
  • Containers
  • Ready-made (HW+)SW solutions à la perfSONAR
    • Deployed in a DMZ, remotely operated

Better documentation: will benefit everybody!

Monitoring

  • Integration into the local fabric monitoring
  • SAM tests and experiment monitoring should be able to raise local alerts

Lightweight sites in ATLAS - A. Forti

Constraints

  • Ability to use resources provided by sites that cannot be standard grid sites
  • Standard sites with decreasing funding/manpower and more and more conflicting constraints
  • Need to reduce the load on experiment operations

Storage: the main source of operation cost

  • 75% of ATLAS storage is provided by ~30 sites: small sites (<400 TB) are discouraged from further storage investments
  • Bigger sites (T1 and T2) with satellites acting as caches (without tight coupling)
  • Regrouping sites into larger ones is not trivial: bigger strain on the larger sites, potential efficiency problems...
    • If small sites remain integrated into these larger ones, reliability issues may remain
  • Object stores are an emerging, promising technology but there is no large-scale experience yet
  • Cache sites: several approaches, from pure internal caches to secondary files (handled the same way as normal files but first candidates for deletion if space is needed); see the sketch after this list
    • Different technologies available: ARC cache, Xrootd cache, upcoming DPM caching...
    • Must the cache be multi-protocol?
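
A hypothetical sketch of the secondary-file idea (illustrative only, not ATLAS code): secondary replicas are catalogued like normal ones, but when space must be reclaimed they are the first deletion candidates, e.g. in least-recently-accessed order:

    from collections import namedtuple

    # Minimal stand-in for a catalogue entry; real systems track much more.
    Replica = namedtuple('Replica', 'name secondary size last_access')

    def pick_victims(replicas, bytes_needed):
        """Return the secondary replicas to delete, oldest access first,
        until at least bytes_needed would be freed."""
        victims, freed = [], 0
        for r in sorted((r for r in replicas if r.secondary),
                        key=lambda r: r.last_access):
            if freed >= bytes_needed:
                break
            victims.append(r)
            freed += r.size
        return victims

    catalogue = [
        Replica('primary.root', False, 10e9, 100),  # never a victim
        Replica('cached_a.root', True, 5e9, 50),    # oldest secondary: first out
        Replica('cached_b.root', True, 5e9, 80),
    ]
    print([r.name for r in pick_victims(catalogue, 6e9)])
    # -> ['cached_a.root', 'cached_b.root']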

Computing: the main issue is the WN, which is a very specific environment, making sharing with other sciences difficult

  • Virtualised WNs? Not necessarily well supported by some batch systems
  • Containers: probably easier and sufficient
  • ATLAS recommends ARC CE + HTCondor, SLURM, or another BS with cgroups support
    • Alternatives to a batch system are also being explored: Vac/Vcycle, BOINC, OpenStack/EC2/Azure, all behind an ARC CE. Currently restricted to certain workloads.

Remove the dependency on BDII: work in progress (WLCG IS TF)

Event service: an important ATLAS development to make a job preemptible without losing the work done: event-level checkpointing (sketched below)
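
An illustrative sketch of event-level checkpointing (not the ATLAS Event Service implementation; the file name and helper functions are placeholders): by recording progress after every event, a preempted job loses at most the event in flight rather than the whole job:

    import json
    import os

    CHECKPOINT = 'progress.json'  # hypothetical per-job bookkeeping file

    def process_event(event_id):
        pass  # placeholder for the experiment-specific event processing

    def upload_output(event_id):
        pass  # placeholder for staging out the per-event output

    def load_done():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def run(events):
        done = load_done()  # survives preemption: finished events are skipped
        for event_id in events:
            if event_id in done:
                continue
            process_event(event_id)
            upload_output(event_id)
            done.add(event_id)
            with open(CHECKPOINT, 'w') as f:
                json.dump(sorted(done), f)  # durable per-event checkpoint

    run([101, 102, 103])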

Lightweight Site in UK using VMs - A. McNab

Vac/Vcycle: simple daemons to provision VMs at sites without running grid services

  • VMs already built/maintained by the experiments to use clouds
    • Only constraint: the VM must shut down once no more work can be retrieved from the central queue (a minimal sketch of this contract follows the list)
  • VMs only need to have the experiment framework agent installed
  • Vac: autonomous hypervisors; Vcycle: VM provisioning through a cloud API
    • Vcycle has backends for OpenStack, OCCI, DBCE and Azure
    • Vac generates APEL accounting records and has a mechanism to implement target shares
    • Integration with Machine/Job Features
  • Proposed to EGI to build "community platforms"
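
A minimal sketch of the VM-side contract mentioned above (not actual experiment agent code; fetch_job is a placeholder for the framework's central-queue call):

    import subprocess
    import time

    def fetch_job():
        """Placeholder for the experiment framework's 'get next job' call."""
        return None  # pretend the central queue is empty

    def agent_loop():
        while True:
            job = fetch_job()
            if job is None:
                # Contract with Vac/Vcycle: no more work -> power the VM off,
                # so the hypervisor can recycle the slot for another experiment
                subprocess.call(['shutdown', '-h', 'now'])
                return
            job.run()       # run the payload
            time.sleep(1)   # brief pause before asking for more work

    if __name__ == '__main__':
        agent_loop()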

Vac-in-a-box (ViaB): allows deploying Vac without relying on any site service (including DNS)

  • USB boot image to download, containing a kickstart file to install a hypervisor with the full stack (DHCPD, TFTP, Squid...)
  • PXE-boot the second hypervisor, which gets installed like the first one and becomes a second installation server
  • ...
  • Security fixes distributed from the ViaB web site through an hourly yum-update cron job

Next steps

  • Exploit the existing Vac-to-ElasticSearch reporting
  • Support mixed-size VMs on the same hypervisor
  • Manage a mix of VMs and containers on Vac hypervisors
  • Make it easier to add VOs: Vacuum Pipes

JINR experience - M. Kutouski

3 possible approaches to site simplification

  • A few endpoints with CEs per country, with sites providing only WNs integrated into these consolidated CEs
  • Cloud everywhere, national/regional federations of clouds, federation of the federated endpoints at the WLCG level
  • Basic low-level machine configuration done by sites, grid services deployed and operated remotely

OpenStack Fuel may allow combining the 3 approaches

The Ubiquitous Cyberinfrastructure - L. Bryant

CI Substrates: trusted/DMZ zones in charge of running the grid services

  • Sites only provide the resources that will be used by the CI substrate (Edge Platform) + the resources managed by these services
  • (Complex) services operated remotely/centrally
  • Produce a reference specification for the edge container platform: allows the support team to focus on SW rather than HW troubleshooting
  • May make it easier to engage with other sciences

Currently exploring several underpinning technologies, based on containers for the edge platform

  • Find a balance between platform features and ease of use by sites
  • Provide a Web interface/dashboard + REST API
  • Containers maintained centrally

Automation efforts: decouple approvals (requiring human intervention) from configuration, which should be fully automated

  • Requires having all the needed information in one place once the approval has been given

Benefits for WLCG

  • Easier maintenance/update of services
  • More consistent versions
  • Focus on documentation and builds

Jeff: in this approach, services like the CE that used to be a gatekeeper to local resources are now outside; risk of causing security issues if outside access to the batch system has to be opened

  • Lyndon/Maarten: no definitive answer; using HTCondor-CE/HTCondor, which is designed for this kind of configuration. May be more difficult with another batch system

T3 in a Box - F. Wuerthwein

T3 = site with no dedicated personnel for WLCG support

  • Want to be able to efficiently use the resources provided by ~80 universities
    • ~50 of them funded by NSF in recent years, running O(10K) clusters, with their networks upgraded in the last year

Minimum services required from a T3 site

  • Submit host to submit jobs to both local and global resources
  • CVMFS for experiment SW distribution
  • Xrootd cache for access to experiment data
  • Xrootd server for private data storage/access
  • Everything packaged in a 10K US$ box (40 cores, 12x4TB disks, 128 GB RAM, 2x10 GbE)
  • Can support several VOs

Cooperation between experts and local IT

  • Local IT manages only HW and user accounts
  • Box integration at each site is worked out with the recipients: based on the first experience (5 sites in California), there are always some differences: a real challenge!
  • Security maintenance of the box done centrally

May consider switching to the Ubiquitous Edge Platform later, once it is mature

  • The current project is already in use and helps understand the challenges of getting a central team of experts and local IT people to work together

Discussion

Storage work and TFs

  • TFs should have a well-defined mandate and clear objectives: make sure there are enough people to work in them
    • Jeff: the effort spent in TFs should be taken into account in the operation cost calculation... TFs have to be very efficient if they exist
  • Build upon work done by the demonstrators

Conclusion - I. Collier

Nominations for the GDB Steering Group extended up to next MB in 2 weeks.

  • Lacking T2 representatives

-- MichelJouvin - 2016-05-11
