Summary of GDB meeting, May 11, 2016 (CERN)
Agenda
https://indico.cern.ch/event/394782/
Introduction - I. Collier
Future pre-GDBs and GDBs
- June pre-GDB: IPv6 Workshop
- Would be good to have as many T1s as possible represented
- June GDB: in-depth IPv6 session following up on the workshop
- July pre-GDB: possible Security Operation Centre WG F2F
- July GDB: still open to suggestions
Forthcoming meetings
- CernVM[-FS] Workshop, RAL, 6-8 June
- LHCONE/LHCOPN, Helsinki, 19-20 September
- Digital Infrastructures 4 Research (DI4R), 28-30 September, Krakow, Poland
- http://www.digitalinfrastructures.eu/
- Joint user forum involving EGI, EUDAT, GEANT, OpenAIRE, RDA-Europe with the support of PRACE.
- WLCG Workshop, October 8-9th, San Francisco
- CHEP, San Francisco, October 10-14th
- HEPiX, Berkeley, October 17-21
GDB Steering Group - I. Bird
GDB Steering Group
The idea of a WLCG Technical Forum was mentioned in the past: after discussing with many people, the conclusion is that the GDB is the right place for this
- Need to strengthen the in-depth technical discussions at GDB: already moved in the right direction in the last months
- Proposal of a GDB Steering Group to help driving the discussion (not to have it!)
- 1 representative per experiment + 1 or 2 representing sites
- Jeff: why so few site representatives compared to experiments? There are more sites than experiments!
- Ian B.: the group doesn't need to have a well-balanced representation; the group is to drive the discussion, not to have it or take decisions
- Ian C.: one T1 representative + one T2 representative.
Let's decide the site representatives during the GDB or shortly after that
- Experiments must nominate their representative asap
Security Operation Center Update - D. Crooks
After the March GDB discussion, the WG was formed. Scope:
- Identify key stakeholders to be considered in the SOC deployment
- Data protection/privacy issues
- Timeframe for delivery
Mandate
- Establish a clear set of data input and output
- Review of relevant SOC products and projects
- Reference design for large sites + appliance for small sites
Participation in the WG: all kinds of sites
- T0, T1, T2 (different T2 sizes and type, like cache site)
- Dedicated or shared sites
- Candidates welcome! Both experts and interested/motivated people
- Contact: David and Liviu
- An egroup has been created: ask if you want to be registered
- A CERNbox area has been created
Possible technical seed: Intrusion Detection System + Threat Intelligence
- IDS: Bro
- Threat Intelligence: MISP
Timeline
- F2F pre-GDB in July
- Report/discussion at WLCG workshop in October
Machine/Job Features TF - A. McNab
Goals:
- A common API that jobs can use to discover the parameters of their environment, e.g. time limit
- Support VMs and clouds (no batch system command to retrieve the info)
- Reduce the NxM matrix (when N experiments need to support M batch systems) to something closer to N+M
Status
- Agreement on key/value pairs to publish
- Transport mechanism: $MACHINEFEATURES and $JOBFEATURES point to a "directory" containing 1 file per key (filename is the key)
- In cloud, the "directory" is a URL populated by the VM Manager
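A minimal sketch of this transport mechanism (the key name wall_limit_secs is taken from the HSF-TN-2016-02 note; treat exact key names as illustrative, not normative):

```shell
# MJF transport sketch: the site publishes one small file per key in the
# directory pointed to by $JOBFEATURES; the job payload just reads files.

# What the batch system / VM manager would publish (simulated here):
export JOBFEATURES=$(mktemp -d)          # in clouds this would be a URL instead
echo 86400 > "$JOBFEATURES/wall_limit_secs"

# What any job does to discover its wall-clock limit, identically on
# every site and batch system:
wall_limit=$(cat "$JOBFEATURES/wall_limit_secs")
echo "wall clock limit: $wall_limit seconds"
```

In the cloud case, where $JOBFEATURES is an HTTP URL populated by the VM manager, the `cat` would be replaced by a fetch with curl or similar.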
Implementation available as source (GitHub) and RPM: see https://github.com/HEP-SF/documents/raw/master/HSF-TN/2016-02/HSF-TN-2016-02.pdf
- Vac/Vcycle
- PBS/Torque
- HTCondor
- Need more volunteers!
- Installation intended to be easy: install RPM, change configuration file (/etc/sysconfig) if necessary.
- By default, the script knows enough about batch systems to do whatever is sensible
SAM probe available: runs in the ETF pre-prod service
- Can also be run by hand
- Same test script for all batch systems: this is the whole point of MJF!!
- Critical to follow up on the rollout
Please volunteer!!
Discussion
Ian C.: probably too early to make a decision whether WLCG requires the general deployment
- Need more sites to adopt it and report from them: should issue an official request for more volunteers
- Review them in a few months
Tim Bell: what is the real benefit that can be expected by experiments?
- Andrew: no real sustainable alternative if a job has to know about its environment. Lightweight solution.
- Maarten: probably not a strong requirement by experiments except LHCb (due to its job masonry) but all experiments agreed that if the system was widely deployed, they could make use of it
Michel: GRIF was an early adopter of the previous version and it was not hard work to deploy it (at least the machine features part). The new version is supposed to be much easier according to Andrew. We should insist that the complexity has nothing to do with that of CVMFS (not to speak about glexec!).
HSF Workshop Summary - M. Jouvin
Held at LAL last week (May 2-4):
https://indico.cern.ch/event/496146/timetable/
- Good participation: 70 people, 30 affiliations
- Not only the LHC experiments but also IF, Belle II...
- Agenda: mix of general discussions and topical sessions
HSF objectives remain unchanged: promote collaboration around SW, avoid duplication, give visibility to new projects, help with career recognition
- Also a potential framework for attracting support
- A framework for interacting with other communities
HSF activities structured in WGs
- Communication and information exchange
- Training
- Software packaging
- Software projects (the main pillar of HSF)
- Dev tools and services
Several project reports: DIANA-HEP, AIDA2020 (detector R&D and SW), future Conditions DB (common ATLAS/CMS), HEP S&C Knowledge Base, WikiToLearn
Project support: refining what HSF can bring to projects
- Advancements last year: best practice document (soon a Technical Note), a project creation script helping to implement these best practices
- Future work planned
- Help with visibility of projects
- Interoperability of projects
- Project peer review: starting with GeantV, several projects declared interest
SW packaging is another very active area
- Key piece for SW interoperability/cross-integration
- A lot of work around Spack, a tool from the HPC world
News from other communities: 3 projects presented with various aspects relevant/close to HSF
- Bioconductor: biomed project portal
- Netherlands eScience Center: already presented at GDB
- depsy: an NSF-funded project to promote credit for SW in science
2 topical sessions
- Machine Learning: a hot topic in HEP; the Inter-experiment Machine Learning (IML) WG started last year
- IML wants to have strong links with HSF: will become the HSF forum for ML
- SW performance: contributions by ALICE, ATLAS, CMS, GeantV, ROOT, Art/LArSoft, and the Astroparticle community
- A lot of interesting discussions, more questions than answers
- An activity that will be made more visible in HSF, through the SW Technology Evolution forum (replacement for the SW Concurrency Forum)
Community Whitepaper: a proposal from American colleagues to build a roadmap for the work addressing the challenges of HL-LHC computing
- Target: a whitepaper by summer 2017
- Proposal: a series of HSF-branded workshops during next year
- Discussing a kick-off around CHEP
- Fits well with the LHCC request for an HL-LHC computing TDR: quite complementary
Proposal of a Journal about SW&C in Data-intensive sciences
- Refereed, indexed journal that could be a reference archive
- Already presented at HEPiX 2 weeks ago
- Not restricted to HEP but focused on data-intensive sciences
- Good feedback, discussions going on on how to move forward
- Present/discuss at a future GDB?
Conclusions:
- HSF is alive and recognised
- Community Whitepaper is a good initiative to progress towards common solutions
- HSF will try to get an "official blessing" from ICFA and similar bodies
- Discussion still going on about a legal entity to support HSF
- Need discussions with funding agencies and lawyers
- Initial goal: IPR management, as with the Apache SW Foundation
WLCG Experiment Test Framework (ETF) - M. Babik
ETF: new version of the SAM/Nagios Test Framework
- Keep track of site availability/reliability
- Run deployment campaigns
- Overall: simplification, reduction of complexity
- Keep up with changes in the monitoring technologies: OMD, new communication/transport libraries...
- Publish metrics to other services like SAM3
Core framework still based on Nagios-core but integrated with check_mk and OMD
- New web interface (check_mk)
- Same plugins/probes
- WN micro-framework: tests through job submission
- Results retrieved directly from the job output and inserted into Nagios
- All CE technologies supported: CREAM, ARC, HTCondor-CE, Globus
- Two custom components: rule-based configuration (ncgx), publication of results (nagios-stream)
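As a sketch of the last step above, WN results can be injected into a Nagios core as passive checks via the external command file (PROCESS_SERVICE_CHECK_RESULT is a standard Nagios external command; the host, service name, and file path below are hypothetical examples, not the actual ETF internals):

```shell
# Feeding a WN test result into Nagios as a passive check (sketch).
cmd_file=$(mktemp)            # stand-in for the real nagios.cmd FIFO
host="wn01.example.org"       # hypothetical worker node
service="org.sam.WN-Basic"    # hypothetical metric name
status=0                      # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' \
  "$(date +%s)" "$host" "$service" "$status" "all tests passed" > "$cmd_file"
cat "$cmd_file"
```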
Changed behaviour
- Test with RFC proxies
- All services in VO feeds will be tested
- https tested for data access
Check_MK Central:
http://etf.cern.ch
- Currently in beta test
- Reachable outside CERN (need a certificate)
No change in the support channel: GGUS
Future work
- Notifications and site-based host groups
- Refactoring of WN framework
- Cloud support
HEPiX Summary - H. Meinhard
HEPiX is a very lively organisation: last meeting in Berlin, April 18-22
New activities
- CPU benchmarking working group relaunched
- People interested in a fast benchmark and the next HS06: contact Manfred
- Monitoring: a BoF run with a good participation
- People interested: contact Cary Whitney (LBL)
Tracks and trends
- 15 site reports: HTCondor-CE may have the potential to replace other CE technologies
- Security and networking - 7 contributions
- Storage and file systems
- Ceph usage continues to grow
- One of the OpenAFS maintainers gave a worrisome presentation about the future of OpenAFS (in particular in terms of new kernel support)
- Grid/clouds: a lot of work around containers and container orchestrators
- IT facilities: the new Green IT Cube at GSI is now in production; a very efficient datacenter
Next workshops
- Berkeley: the week after CHEP
- Spring 2017: not clearly settled yet, but confident it will be in Budapest (organised by Wigner)
- Fall 2017: KEK, date already fixed (October 16-20, 2017)
- Will need to reconsider the alternation of European/N-American meetings, as Asian labs are becoming important contributors to HEPiX
- Expressions of interest and proposals to host are always welcome
Discussion
- Is the active participation of other sciences increasing?
- Helge: Life Sciences came, dried out, and they are coming back
- Ian C.: participation also comes from people starting in our community and moving away
- Helge: still, it goes beyond personal contacts
Lightweight Sites
Introduction - M. Litmaath
Session about small sites
- Nevertheless other sites may benefit from simplification
Build on earlier discussions and ideas
- In particular demonstrators for new data management approaches presented at April MB
T2 vs. T3: T3 generally dedicated to one experiment, no need for the generic grid MW
- Directly deploy the experiment framework, no need for APEL accounting/InfoSys integration
- T2 possible simplifications: reduce the catalog of required services, replace classic/complex service by new, simpler ones, simplify deployment and maintenance
- catalog reduction example: computing-only sites
- CE/batch systems: reduce the number of options, HTCondor (and -CE) on the rise
- Accounting: ARC CE can publish directly into APEL
Cloud systems: not completely easy
- OpenStack is the most popular cloud MW but not necessarily easy to deploy/manage
- Paradigm shift: batch slots -> VM instances, need proper accounting
- Do not expose a wide zoo of solutions to the experiments
AuthZ
- EGI: ARGUS is the cornerstone, supported through the INDIGO Datacloud project, release 1.7 almost ready
- OSG: GUMS
Configuration
- Slow but steady move to Puppet
- Shared modules: not a lot of evidence yet
- Some sites prefer another solution but presumably know what they are doing
- YAIM still being used for (too) many services
- DPM approach: standalone Puppet installation - an idea to be developed ?
- Small sites will need help
Simplified deployment
- VMs distributed by CVMFS
- Containers
- Ready-made (HW+)SW solutions à la perfSonar
- Deployed in a DMZ, remotely operated
Better documentation: will benefit everybody!
Monitoring
- Integration into the local fabric monitoring
- SAM tests and experiment monitoring should be able to raise local alerts
Lightweight sites in ATLAS - A. Forti
Constraints
- Ability to use resources provided by sites which cannot be standard grid sites
- Standard sites with decreasing funding/manpower and more and more conflicting constraints
- Need to reduce load of experiment operations
Storage: the main source of operation cost
- 75% of ATLAS storage is provided by ~30 sites: small sites (<400 TB) are discouraged from further investments
- Bigger sites (T1 and T2) with satellites acting as cache (without tight coupling)
- Regrouping sites into larger ones is not trivial: bigger strain on larger sites, potential efficiency problems...
- If small sites remain integrated into these larger ones, reliability issues may remain
- Object stores are an emerging, promising technology but no experience yet at large scale
- Cache site: several approaches, from pure internal cache to secondary files (handled the same way as normal files but candidates for deletion if space is needed)
- Different technologies available: ARC cache, Xrootd cache, upcoming DPM caching...
- Must the cache be multi-protocol?
Computing: main issue is the WN which is a very specific environment, making sharing with other sciences difficult
- Virtualised WNs? Not necessarily well supported by some batch systems
- Containers: probably easier and enough
- ATLAS recommends ARC CE + HTCondor or SLURM or other BS with cgroups support
- Alternatives to a batch system are also being explored: Vac/Vcycle, BOINC, OpenStack/EC2/Azure, all behind an ARC CE. Currently restricted to some workloads.
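To illustrate the cgroups point, a minimal HTCondor configuration sketch enabling cgroup-based resource tracking (knob names from the HTCondor manual; values are illustrative, not an ATLAS-endorsed configuration):

```
# /etc/condor/config.d/99-cgroups.conf (illustrative)
# Place each job in its own cgroup under this base, so CPU and memory
# usage are tracked and enforced per job rather than per process:
BASE_CGROUP = htcondor
# Enforce memory limits softly: jobs may exceed their requested RSS
# as long as free memory is available on the node:
CGROUP_MEMORY_LIMIT_POLICY = soft
```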
Remove the dependency on BDII: work in progress (WLCG IS TF)
Event service: an important ATLAS service to make a job preemptable without losing the work done: event-level checkpointing
Lightweight Site in UK using VMs - A. McNab
Vac/Vcycle: simple daemons to provision VMs at sites without running grid services
- VMs already built/maintained by experiments to use clouds
- Only constraint: VMs must shut down once no more work can be retrieved from the central queue
- VMs only need to have the experiment framework agent installed
- Vac: autonomous hypervisors; Vcycle: VM provisioning through a cloud API
- Vcycle has backends for OpenStack, OCCI, DBCE and Azure
- Vac: generates APEL accounting records, mechanism to implement target shares
- Integration with Machine/Job features
- Proposed to EGI to build "community platforms"
Vac-in-a-Box (ViaB): allows deploying Vac without relying on any site service (including DNS)
- USB boot image to download, containing a kickstart file to install a hypervisor with the full stack (DHCPD, TFTP, Squid...)
- PXE-boot the second hypervisor that will get installed as the first one and become a second installation server
- ...
- Security fixes distributed from the ViaB web site through an hourly yum-update cron
Next steps
- Exploit existing Vac to ElasticSearch reporting
- Support mixed-size VMs on the same hypervisor
- Manage mix of VMs and containers on VAC hypervisors
- Make it easier to add VOs: Vacuum Pipes
JINR experience - M. Kutouski
3 possible approaches to site simplification
- A few endpoints with CEs per country, with sites providing only WNs integrated into these consolidated CEs
- Cloud everywhere: national/regional federations of clouds, federation of the federated endpoints at the WLCG level
- Basic low-level machine configuration done by sites, grid services deployed and operated remotely
OpenStack Fuel may allow to combine the 3 approaches
The Ubiquitous Cyberinfrastructure - L. Bryant
CI Substrates: trusted/DMZ zones in charge of running the grid services
- Sites only provide resources that will be used by the CI substrate (Edge Platform) + resources managed by these services
- (Complex) services operated remotely/centrally
- Produce a reference specification for edge container platform: allows support team to focus on SW rather than HW troubleshooting
- May make it easier to engage with other sciences
Currently exploring several underpinning technologies, based on containers for the edge platform
- Find a balance between platform features and ease of use by sites
- Provide a Web interface/dashboard + REST API
- Containers maintained centrally
Automation efforts: decouple approvals (requiring human intervention) and configuration that should be fully automated
- Requires having all the required information in one place once the approval has been given
Benefits for WLCG
- Easier maintenance/update of services
- More consistent versions
- Focus on documentation and builds
Jeff: in this approach, services like the CE, which used to be a gatekeeper to local resources, are now outside; risk of causing security issues if outside access to the batch system has to be opened
- Lyndon/Maarten: no definitive answer; using HTCondor-CE/HTCondor, which is designed for this kind of configuration. May be more difficult with another batch system
T3 in a Box - F. Wuerthwein
T3 = site with no dedicated personnel for WLCG support
- Want to be able to use efficiently the resources provided by ~80 universities
- ~50 of them funded by NSF in the last years, running O(10K) clusters and had their network upgraded in the last year
Minimum services required from a T3 site
- Submit host to submit jobs to both local and global resources
- CVMFS for experiment SW distribution
- Xrootd cache for access to experiment data
- Xrootd server for private data storage/access
- Everything packaged in a 10K US$ box (40 cores, 12x4TB disks, 128 GB RAM, 2x10 GbE)
- Can support several VOs
Cooperation between experts and local IT
- Local IT manages only HW and user accounts
- Box integration at the sites is worked out with the recipients: based on the first experience (5 sites in California), there are always some differences: a real challenge!
- Security maintenance of the box done centrally
May consider switching to the Ubiquitous Edge Platform later, once it is mature
- The current project is already in use and helps to understand the challenges of getting a central team of experts and local IT people to work together
Discussion
Storage work and TFs
- TFs should have a well-defined mandate and clear objectives: make sure there are enough people to work on them
- Jeff: the effort spent in TFs should be taken into account in the operation cost calculation... TFs have to be very efficient if they exist
- Build upon work done by the demonstrators
Conclusion - I. Collier
Nominations for the GDB Steering Group extended up to next MB in 2 weeks.
- Lacking T2 representatives
--
MichelJouvin - 2016-05-11