Summary of April GDB, April 12, 2017 (CERN) DRAFT

Notes compiled by C. Biscarat & M. Jouvin ; any and all mistakes are ours.

Agenda

Introduction - I. Collier

  • WLCG Manchester workshop: registration is open. The workshop starts at lunchtime on June 19th and ends late on Wednesday afternoon; Thursday morning: IPv6 hands-on; Thursday afternoon: SKA
  • eduGAIN login to the CERN TWiki has been enabled; users in the “wlcg-external” e-group now have edit rights on LCG topics.
  • The “Computing and Software for Big Science” journal has now been launched, with one issue available: http://www.springer.com/physics/particle+and+nuclear+physics/journal/41781

Argus - A. Ceccanti

  • Argus 1.7.0: current release in UMD, stable (bug fix in production + port to CentOS 7).
  • Argus 1.7.1 targeting the May UMD release: make Argus aware of authentication profiles
  • Argus is being integrated in INDIGO-DataCloud AAI (OpenID Connect token)
  • Funding: the main funding comes from INDIGO (ends in Sept. 2017); if EOSC-hub is approved, funding is secured for the next 3 years

LHCONE/OPN Report - E. Martelli

LHCOPN

  • RAL: 3rd 10G link, IPv6 configured
  • CERN: provisioning 3rd 100G link to Wigner

LHCONE

  • Traffic stable in the last months (no increase): probably due to the LHC shutdown
  • 5 additional sites connected (PL, USA, TW)
  • ESnet now sees more traffic on LHCONE than on OPN

perfSonar

  • Waiting for v4 to stabilize the services and the underlying infrastructure
  • WLCG working on ETF probes to monitor the service
  • ATLAS working on integrating network metrics into analytics

New collaborations

  • Belle II: perfSonar infrastructure with MadDash setup
    • Some asymmetric performances identified
  • NOvA: traffic between two main sites (FNAL and FZU NOvA) rerouted over LHCONE in October 2016 - modest load
  • Pierre Auger Observatory: experiment in Argentina with data storage in Lyon (FR); has asked to go through LHCONE
  • Xenon: 1.5TB/day produced at LNGS (Gran Sasso, IT); just started to connect to LHCONE
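
For scale, the Xenon volume quoted above corresponds to a modest sustained rate. A back-of-the-envelope conversion, assuming decimal units (1 TB = 10^12 bytes) and traffic spread evenly over 24 hours (both are assumptions, not statements from the talk):

```python
def average_rate_mbit_per_s(terabytes_per_day: float) -> float:
    """Convert a daily data volume into an average rate in Mbit/s.

    Assumes decimal units (1 TB = 1e12 bytes) and traffic spread
    evenly over the whole day.
    """
    bits_per_day = terabytes_per_day * 1e12 * 8
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 1e6


xenon_rate = average_rate_mbit_per_s(1.5)
print(f"{xenon_rate:.0f} Mbit/s")
```

Under these assumptions the average is about 139 Mbit/s; real transfers are likely to be burstier than this average suggests.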

Developments in Asia

  • More sites connected or willing to connect to LHCONE, including China (IHEP and CCNU)
  • TransPAC: initiative to offer Asia a transpacific 100G link to LHCONE (Seattle)

Network operators are confident they are ready for Run3 and have started to discuss Run4 challenges

  • Idea to build a network brokering service based on SDN to improve the usage efficiency of the network infrastructure

Next meeting

  • Co-located with HEPiX Fall 2017

pre-GDB "Collaborating with Other Communities" Report - I. Collier

Agenda: https://indico.cern.ch/event/578969/

Idea of the pre-GDB: invite communities already using WLCG or with upcoming requirements, and discuss how we can share and coexist better. Some make good use of the grid (IceCube is happy and comfortable), LIGO is struggling a bit more, and SKA is perhaps just at the beginning of deciding what to do.

Organised as a panel with a few questions ; participants: Michel Jouvin, Maarten Litmaath, Anna Scaife (SKA), John White (EISCAT 3D)

Questions:

  • Something particularly useful?
  • One thing you learned?
  • One next step?

A. Scaife

  • Good to get here and see other projects and storage needs.
  • To use the grid, it is useful to have local contacts, like people in WLCG, to make faster progress
  • Looking for more collaboration: confirms that SKA would welcome WLCG participation in its next all-hands meeting in September in Manchester

M. Jouvin

  • Was aware of some of the projects; good to have a better overview of the current changes.
  • Distributed computing is the basis of these projects, within different constraints.
  • People are important to establish links: often ad hoc contacts so far, for example IceCube with the role of G. Merino
  • A lot of potential for a fruitful collaboration both on the infrastructure side (WLCG) and on the SW side (HSF).
  • We could establish a list of contacts to facilitate the distribution of information, invite each other to relevant meetings/workshops and add the SW side (HSF) to future initiatives (and the current CWP initiative)

J. White (EISCAT)

  • Need to pick the best bits of WLCG.
  • One particular challenge for EISCAT is to change their culture in terms of computing: small community not used to the scale of computing they require
  • EISCAT may be interested in joining some WLCG meetings to learn more from the WLCG experience: currently attending mainly EGI meetings, which are not as focused

Maarten

  • The various projects are not worlds apart, we see good opportunities for converging on technologies, some communities use the same infrastructure or same SW.
  • We could share our experience in setting up our model: data challenges, service challenges.
  • Increase collaboration and cross-participation in meetings to move to facing the big data challenges together. Avoid duplicating a lot of work.

A. McNab

  • Impressed by the commonalities already existing in terms of the kind of resources needed
  • Collaboration is also important for sites, as we would like to avoid providing different sets of resources to various communities with similar needs

Ian C.

  • There was a bit of discussion on network activities: even if LHCONE specifically targets the LHC communities, the technology is valuable for other communities.
  • Wanted to demonstrate that coming to CERN was not so difficult!
  • The next opportunity to meet is the WLCG workshop in Manchester
  • Need to identify what the other opportunities are

Benchmarking

Benchmarking WG Update - D. Giordano

HEPiX WG web site: https://twiki.cern.ch/twiki/bin/view/HEPIX/CpuBenchmark

WG mandate

  • Investigate scaling issues in HS06 compared to real workloads
  • Next generation of long-running benchmark
  • Evaluate fast benchmarks, in particular to estimate the performance of a VM

Since February:

  • DB12 underwent an in-depth set of analyses
  • CMS: preliminary comparison of CMSSW performance vs. DB12, KV, HS06
  • Study Tier-0 job performance with passive methods

Fast benchmark

  • DB12 (Dirac Benchmark) shows a good correlation for simulation jobs
    • Good agreement for ALICE and LHCb
    • Work in progress for ATLAS and CMS
  • See slides for detailed analysis of DB12 performance profile
    • Haswell performance boost now understood: it is related to the improved branch prediction in Haswell, as DB12 is dominated by branch prediction (the +45% boost appears only when running DB12 with # of slots == # of physical cores and goes down when taking advantage of SMT)
    • Alternative implementations done with NumPy and in C++: in both cases they are dominated by the mathematical functions rather than branch prediction, and they are much faster to run. See https://twiki.cern.ch/twiki/bin/view/HEPIX/DB12VsPythonVersion#DB12np_py_A_python_version_based
  • Impact of Python and OS versions on DB12
    • Effect related to the Python version (up to 18%) but only a marginal effect of the number of parallel processes launched (hyperthreading)
    • C++ version not really affected by OS/compiler versions
    • C++ and NumPy versions scale better with HS06
  • Also a detailed profiling study of other fast benchmark candidates
    • ATLAS KV (mostly based on GEANT) exhibiting a problem similar to DB12 with SMT
    • Geant4: very sensitive to non-locality of the data: exposed to cache architecture differences
    • HS06: CMS found it more greedy than its simulation code (ttbar simulation)
  • Replacement of HS06 by a fast benchmark: large divergence of opinions
    • Not a large enough mix of instructions: exposed to microarchitecture optimizations
    • Not a clear understanding of the medium/long-term consequences of such a choice
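
To make the discussion of what a DB12-like benchmark measures more concrete, here is a minimal sketch of a DB12-style fast benchmark: a tight pure-Python loop drawing Gaussian random numbers, scored as work per CPU second. The function name, iteration count and calibration constant are illustrative assumptions; this is not the DIRAC implementation, and the resulting score is not comparable with real DB12 values.

```python
import random
import time


def db12_style_score(iterations: int = 200_000, calibration: float = 1.0) -> float:
    """Return a score proportional to loop iterations per CPU second.

    The loop is dominated by interpreter overhead and branching rather
    than by heavy numerical work.
    """
    total = 0.0
    total_sq = 0.0
    start = time.process_time()
    for _ in range(iterations):
        value = random.normalvariate(10, 1)
        total += value
        total_sq += value * value
    cpu_seconds = time.process_time() - start
    # Guard against a zero reading on very coarse clocks.
    cpu_seconds = max(cpu_seconds, 1e-9)
    return calibration * iterations / cpu_seconds


score = db12_style_score()
print(f"score: {score:.0f} (arbitrary units)")
```

Because such a loop exercises only a narrow mix of instructions, its score can shift with the Python version or the CPU microarchitecture, which is the concern raised in the points above.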

Also an analysis of T0 activity performance correlation with HS06, using reco jobs (A. Sciaba)

  • Scaling generally good but 2 exceptions
    • Not true on Opteron (6276) by a significant factor
    • Haswell tends to perform better than predicted by HS06 at a 10% level
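
The kind of check described here can be sketched as follows: normalise the measured throughput per HS06 unit to a reference CPU model and flag models that deviate by more than a tolerance. The function name, the 5% tolerance and the numbers in the usage example are invented for illustration; they are not the Tier-0 measurements.

```python
def scaling_deviation(measured: dict[str, float], hs06: dict[str, float],
                      reference: str) -> dict[str, float]:
    """Relative deviation of measured throughput from the HS06 prediction.

    measured[model] is the job throughput per core (e.g. events/s),
    hs06[model] is the HS06 score per core. A value of +0.10 means the
    model delivers 10% more than HS06 predicts, relative to `reference`.
    """
    ref_ratio = measured[reference] / hs06[reference]
    return {
        model: (measured[model] / hs06[model]) / ref_ratio - 1.0
        for model in measured
    }


# Illustrative numbers only, NOT real measurements.
measured = {"ref_cpu": 1.00, "newer_cpu": 1.32, "older_cpu": 0.60}
hs06 = {"ref_cpu": 10.0, "newer_cpu": 12.0, "older_cpu": 8.0}

deviation = scaling_deviation(measured, hs06, reference="ref_cpu")
outliers = {m: d for m, d in deviation.items() if abs(d) > 0.05}
print(outliers)
```

With these toy inputs both non-reference models are flagged: one delivers more than HS06 predicts and one delivers less.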

HS06 successor: starting to draft requirements for its validation

  • Could be HS17 or a HEP suite based on experiment's workloads
  • Main target architecture will be x86_64
    • But plan to explore other architectures of interest like ARM
  • As with HS06, the OS version and the compiler version and options will be fixed
  • Perform reproducible studies, including sharing experiment codes through CVMFS or containers
    • Includes building a testbed with representative HW and apps/benchmarks

ATLAS and CMS are instrumenting pilots to run, collect and study DB12 scores vs. production jobs (including multi-threaded jobs)

Discussion:

  • The original DB12 shows some problems, while the NumPy and C++ versions scale and avoid the Haswell problem: are you considering moving to NumPy or C++ for the fast benchmark?
  • One could provide a suite of several tools and weight the results; one should compare not to HS06 but to the experiments' payloads. In this respect the proposed testbed is important and will give the opportunity to collect the results.

CPU Units Proposal - A. McNab

Need to prepare the ability to change/update the benchmark we use for accounting and pledges ; it could also help when there is a new HW generation delivering more CPU units to apps than reflected by the benchmark

  • Example of Haswell new branch prediction change
  • Impact on site procurements: either sites deliver more than they are credited for, or they may buy older HW that looks less expensive but is more expensive relative to the performance it delivers

Proposal to introduce a new unit called "CPU Units (CU)" in parallel with HS06 in the accounting system (APEL, portal...) with 1.0 CU = 1.0 HS06 when introduced

  • Then WLCG has the freedom to change what a CU is without changing the accounting/pledging infrastructure
    • Can respond to evolving experiment code
  • CU definition should be based on empirical evidence about experiment software performance across relevant hardware
  • A site should pledge in the current value of the CU, even if using old HW
    • But a revision should scale up the numbers in such a way that a site will never publish less than before
  • On newer hardware if the new definition is sensitive to improvements in technology, then new CU value may go up
    • Encourage sites to buy HW that deliver the most to experiments
  • Benchmark could be based on a suite of different microbenchmarks or apps
    • Should be open-source, easy to run and distribute
  • Lower the cost of updating the benchmark: convert benchmarking from a commissioning activity into an operational activity
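
One possible reading of the revision rule above, as a sketch: when the CU definition is revised, the new scores are multiplied by a single scale factor chosen so that the least-favoured hardware type keeps exactly its previous value and every other type goes up. The proposal does not specify how the scaling would be done, so this normalisation, the function name and the numbers are all assumptions made for illustration.

```python
def rescale_revision(old_cu: dict[str, float],
                     new_raw: dict[str, float]) -> dict[str, float]:
    """Scale revised scores so that no hardware type publishes less than before.

    old_cu[model] is the currently published value (1 CU = 1 HS06 at
    introduction); new_raw[model] is the raw score under a hypothetical
    revised definition.
    """
    scale = max(old_cu[model] / new_raw[model] for model in old_cu)
    return {model: new_raw[model] * scale for model in old_cu}


# Illustrative values only.
old_cu = {"older-generation": 10.0, "newer-generation": 12.0}
new_raw = {"older-generation": 9.5, "newer-generation": 13.8}

new_cu = rescale_revision(old_cu, new_raw)
for model in old_cu:
    print(model, old_cu[model], round(new_cu[model], 2))
```

In this toy case the older hardware keeps its published value while the newer hardware, which the revised definition favours, is credited with more CU.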

Discussion

  • A lot of the people present not really convinced that it solves the problem, just shifting it
    • Will allow more optimized changes with a more difficult comparison over time
  • Ale: thinks that if we convince APEL to support more units, we should do it in a way that allows more units, not just one more

This proposal has been presented in the Benchmarking WG and in the accounting WG already. This topic should be followed up at the Manchester workshop.

Containers

Session about what we can do with containers regarding sites.

Introduction - J. Blomer

VM vs. container virtualizations

  • Less overhead but less isolation: larger surface exposed to attacks
  • No privileged operations possible as the container user is treated as a normal user (e.g. no mount)
  • Not one feature but a set of features that make Linux containers: every component moving at its own pace... adding to the complexity

Container engines

  • Main products : Docker, Singularity
  • Dominated by Docker, introduced the push-pull model (from/to registry)
  • Singularity: new engine from HPC world, very lightweight, removing the unnecessary parts from Docker in our context
  • Other engines not as popular in our community: Linux containers (lxc), Rocket (rkt), systemd

Ability to build container clusters with orchestrators

  • Mesos (good for long-running services), Kubernetes (ability to build small clusters), ...

Containers and CVMFS

  • Bind mounts: mount in the container filesystems mounted in the host machine
    • Can be used to access CVMFS: one shared cache for all containers running on a host (and other processes on the host)
  • Docker volume driver
    • Integrates with Kubernetes
    • Can be used also to access CVMFS from a container
  • From inside the container if started as a privileged container: not really used in practice
  • Coming soon: Docker graph driver that will allow container images to be fetched from an outside source, e.g. CVMFS
    • Container images not stored as a single file in CVMFS

Possible dangers with containers: containers foster an attitude of "capturing the mess"

  • More moving parts (and moving targets) in your system
  • Automation required: containers must be disposable items
    • In particular not carriers for data, databases...

Discussion

  • If containers are instantiated as a cluster, what is the benefit in terms of performance?
    • The container cluster (VMs) can be instantiated in advance and shared by many containers that will start almost immediately

RAL Experience - A. Lahiff

Since RAL moved to HTCondor, a lot of improvements in the ability to contain user processes with OS features (cgroups...) but still sharing the same root file system

HTCondor introduced the Docker universe in 2015 to run payload in Docker containers

  • Successfully tested by RAL to run SL6 WN (in containers) on SL7 machines
  • Nebraska T2 migrated fully to Docker universe last summer
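
For readers unfamiliar with the Docker universe, the sketch below renders a minimal HTCondor submit description that asks for a job to run inside a Docker image. The image name, executable and file names are placeholders, and how a site such as RAL routes grid jobs into this universe is not covered here.

```python
def docker_universe_submit(image: str, executable: str, arguments: str = "") -> str:
    """Render a minimal HTCondor submit description for the docker universe."""
    lines = [
        "universe = docker",
        f"docker_image = {image}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Placeholder image and executable, for illustration only.
submit_text = docker_universe_submit("example/sl6-worker", "/bin/cat", "/etc/redhat-release")
print(submit_text)
```

The resulting text could be saved to a file and passed to condor_submit on a pool where the Docker universe is enabled.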

RAL current configuration

  • HTCondor 8.6.1
  • CVMFS bind mounted into containers
  • 40% of the batch farm now moved to SL7 machines, with payloads from many VOs (4 LHC + others) running in containers

Container cluster managers

  • Using Mesos to manage many different computing activities/resources
  • Starting to use Kubernetes to implement a single API across on-premise resources and multiple commercial clouds
    • Successfully demonstrated for CMS, LHCb and ATLAS

Plans

  • Add an xrootd gateway to worker nodes (requires SL7 machines)
  • Provide access to RHEL7 via CEs
    • Easy for ATLAS and CMS
    • Still need to figure out how to do it with DIRAC and ALICE
  • Give access to Singularity
    • CMS interested to migrate from glexec to Singularity ; useful for other experiments, e.g. ATLAS
  • Get rid of pool accounts

Containers in ATLAS - A. Filipcic

Motivations and benefits

  • Similar to VMs but more flexible and no performance loss
  • Independence of execution environment from the OS
    • Isolate ATLAS from site choices/upgrades
    • Isolate sites from ATLAS constraints
  • Easy to make test environments
    • Several different environments can be used at the same time on the same site
  • Common approach for execution, software distribution for all sites (including HPC)

Currently concentrating on Singularity

  • Docker only for specific use cases (more difficult to deploy)
  • Singularity easy to deploy by site: one RPM
    • No specific UID required: current UID preserved when the container is started
  • Already decided (last ATLAS S&C Workshop) to go for a large-scale Singularity deployment, starting with all modern OS sites
  • Already some good experience at several sites: encouraging all sites to deploy it on recent OS versions
    • Some specific steps needed on EL6
  • Bind mounts: some default ones added, sites must use their local one (scratch space...)
  • AGIS is container-ready using the 'catchall' parameter
    • May consider adding new parameters if needed
  • Pilots will be improved to allow a per-payload selection of the container to use, based on AGIS settings for the site
    • A few weeks ahead...

Long-term plans

  • All ATLAS jobs will use containers
  • No more than a basic OS will be requested from sites (CoreOS will be enough)
    • Libs, grid MW... will be added in the container - easier for sites and centralised SW deployment
    • CVMFS will be used as the main distribution point for container images

Open questions

  • Image management
    • How to manage them? Enable private images?
    • Images with a common core and a VO specific part
  • Security
    • Tracing the container activity: instructions for sysadmins
    • Handling/fixing of security vulnerabilities
  • Deployment model:
    • Minimal host OS - not compatible with WLCG site requirement ; need to agree in WLCG and if possible others (e.g. Belle II)

Time for a task force?

Jakob:

  • ATLAS wants to start with image (img) files in CVMFS of ~2GB; this is big, why not go for flat files?
  • Andrej: ATLAS prefers to distribute one single file.

Jeff : concerns about the impact of such a move on site responsibility in case of job errors as the site will have less information about the problems

  • Brian: part of the answer may be a central syslog
  • Jeff: cannot log all the information about every possible error to syslog... or syslog will become the problem!

Singularity in CMS - B. P. Bockelman

Nebraska (Brian's site) runs Singularity inside Docker

  • All the WNs run as Docker containers

Main objective: simple isolation

  • Isolate pilot from payload and vice versa
    • Processes that can be interacted with, files/filesystems access
  • Replace glexec, the current and problematic solution to isolation
  • Make user OS environment as minimal and identical as possible

Singularity

  • Provides the isolation needed by CMS, does not do resource management (the batch system does)
  • No daemons, no UID switching
  • Easy to install: default configuration appropriate, no need to edit config files
  • User gains no privilege being inside the container
    • E.g. all setuid binaries disabled in the container

glexec replacement: Singularity meets CMS needs in terms of isolation

  • In fact adopted by the Isolation and Traceability TF as the glexec replacement
  • Ironing out the last details to allow sites to adopt it
    • Currently a few sites running with this configuration

Will allow decoupling the OS installed by the site (and used by the pilot) from the one used to execute the payload.

  • The pilot is in charge of instantiating the appropriate container: can use a different container for each payload it schedules
  • Sites can run EL7 WNs as soon as they provide Singularity
    • Otherwise, CMS may be unable to utilize the site.

Singularity images

  • CMS decided to use Docker images rather than native ones
  • Singularity can use directories (unpacked images) rather than single image file
    • Image can be pretty big (a few GB)
    • CMS used to distribute the directory: benefit from the resulting caching

CMS image main characteristics

  • EL6 image with default passwd, group, shadow for a sane environment
  • payload run as the pilot user
  • Mounting user working directory as /srv in the image
  • Need to figure out the most appropriate way for a site to pass CMS the information about the local file systems that must be mounted in the container
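
A pilot launching a payload along the lines described above could be sketched as follows. The helper only builds a singularity exec command line: it binds /cvmfs, mounts the job's working directory as /srv and starts the payload there, using an unpacked image directory. The image path, working directory and payload are placeholders, and the exact set of options used by CMS is not taken from the talk.

```python
def singularity_payload_command(image_dir: str, workdir: str, payload: list[str],
                                extra_binds: tuple[str, ...] = ("/cvmfs",)) -> list[str]:
    """Build a `singularity exec` argv for one payload.

    image_dir may be an unpacked image directory; workdir is the pilot's
    per-job directory on the host, exposed as /srv inside the container.
    """
    argv = ["singularity", "exec"]
    for path in extra_binds:
        argv += ["--bind", path]
    argv += ["--bind", f"{workdir}:/srv", "--pwd", "/srv", image_dir, *payload]
    return argv


# Placeholder paths and payload, for illustration only.
cmd = singularity_payload_command(
    "/cvmfs/example.repo/images/el6", "/tmp/job1", ["./run_payload.sh"]
)
print(" ".join(cmd))
```

Site-specific file systems, the open point in the last bullet above, would be passed in through extra_binds in this sketch.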

Singularity and SAM tests

  • Singularity disables all setuid binaries, including glexec
  • but glexec is a mandatory/critical SAM test for CMS
    • Have a SAM probe for Singularity: need to figure out how to OR it with the glexec test
  • Not all CMS tests are EL7 ready anyway
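
The OR of the two probes mentioned above amounts to taking the better of the two results, so that a site passes the isolation check if either mechanism works. The status names below follow a Nagios-like convention and are an assumption; this is not how SAM combines metrics internally.

```python
from enum import Enum


class Status(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def isolation_status(glexec: Status, singularity: Status) -> Status:
    """Logical OR of two probes: report the better of the two results."""
    return min(glexec, singularity, key=lambda status: status.value)


# A site that dropped glexec but provides Singularity still passes.
combined = isolation_status(Status.CRITICAL, Status.OK)
print(combined.name)
```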

Traceability: glexec provides isolation and traceability but Singularity provides only isolation

  • Solution 1: sites rely on the VOs to do the appropriate logging and contact them in case of problems
    • In fact already happens and some sites comfortable with it
  • Solution 2: VO asserts to the site what the user will run
    • Basically what is done with glexec when setting GLEXEC_CLIENT_CERT
    • Work in progress to do this with HTCondor and HTCondor-CE: should be ready end of Spring. No reason to be CMS specific.

Conclusions

  • Sites may be able to decommission glexec as soon as they deploy Singularity. Nebraska will hopefully do this in April!
  • Looking for interested sites to participate. It’s an exciting-but-young effort: there will be some speed bumps, but will benefit from your help!

LHCb Perspective - A. McNab

Currently 2 sites running with containers: RAL and Skygrid at Yandex

  • Both use containers derived from DIRAC VMs
  • From this experience, developing a generic LHCb container definition
    • Uses Docker
    • CERNVM root image (via CVMFS)
    • CVMFS and init scripts to run in the container provided as Docker volumes
    • Format supported by Vac and Vcycle

Singularity as a glexec replacement

  • Need to add a Singularity-based wrapper to replace the glexec-based one in DIRAC: no major difficulty foreseen
  • Plan to test the approach with LHCb DIRAC VMs replacing the sudo wrapper approach currently used
  • Singularity is not a requirement to support EL6 environment on EL7 hosts: Docker or VM are other possible approaches
  • Singularity may also be used to allow users to package their jobs as Docker images
    • May help to make analysis more reproducible

Discussion:

  • The idea of users packaging their jobs in container images is very interesting and could be extended to other VOs; question to be clarified: shall we let any user container run on the grid?

Security Strengths and Issues - V. Brillault

Containers decouple provisioning and VOs

  • OS/library independent from VOs
  • No VOs libraries leaking to provisioning

Containers provide a better isolation than UID switch (glexec)

  • WN processes and files invisible/not accessible
  • cgroups to manage resources used

Potential issues

  • Young technology: new classes of bugs in the kernel, missing support and the ecosystem changing fast
  • Most kernel bugs can still be exploited with containers: still need the ability to update quickly (emergency updates)
  • Singularity is still suid: could disappear in 7.4 but a sysctl configuration might be needed
  • Singularity is an attractive technology to replace glexec but would rely on kernel security updates
    • No central callout/service required: simpler configuration means fewer failures but at the price of no traceability to the end-user (see Brian/CMS talk)
    • Potential impact on the way central banning is done nowadays: move from site-based central banning to VO-based central banning?

Conclusion on containers (Ian Collier + Maarten Litmaath)

There are things to explore still. A WG should be proposed to iron out the subject.

OSG All Hands Meeting - E. Fajardo

See slides.

LHC is still the most CPU consuming community.

Xenon1t (experiment at Gran Sasso) leveraged different technologies from LHC

Year of retirement for many components!

  • GRAM: replaced by HTCondor-CE
  • glexec: Singularity as the replacement
  • GIP/BDII: replaced by OSG Collector integrated into HTCondor
  • Gratia: replaced by a decentralized ElasticSearch
  • Bestman2: replaced by load-balanced GridFTP

Big newcomers:

  • Singularity

Opportunistic storage replaced by StashCache

  • Used by LIGO

Future work:

  • Monitoring: SAM is too CMS-specific; move to component self-testing
  • Simplify the VO zoo since all workflows are pilot-based
    • Long-term goal: throw away GUMS, the authorization service which is adding a lot of complexity

Current OSG project ends June 2018: discussions in progress for another 5-year extension

Topic revision: r5 - 2017-04-23 - MaartenLitmaath
 