Summary of GDB meeting, January 14, 2015 (CERN)
Agenda
https://indico.cern.ch/event/319743/
Introduction - M. Jouvin
Thanks to C. Biscarat for taking notes today.
Future GDB exceptions
- The March meeting will be in Amsterdam and co-hosted by EGI and NIKHEF
- The meeting in April is cancelled due to the workshop in Okinawa the following weekend
- October GDB during HEPiX at BNL: an occasion for a meeting in the US?
- Please send feedback to Michel
Pre-GDBs planned at each GDB until next summer
- February: F2F meeting of Cloud traceability TF
- March (Amsterdam): cloud issues?
- Fill in https://doodle.com/49ghux9gw3uzgiug even if you don't plan to attend (use '(Yes)' for remote participation); decision by the end of the week
- Also a proposal by Jeff to have a meeting with the Netherlands eScience Center, which announced it wants to do more for physics: may be an occasion to discuss common issues with other communities
- The day may be split in 2 parts, based on the Doodle results and contacts with NeSC
- Spring: batch systems, volunteer computing, cloud accounting/resource reporting
- As always, let Michel know about topics of interest to consider for future meetings.
WLCG Workshop: Okinawa, April 11-12 (the weekend before CHEP)
- Important to have a good participation
- We are a worldwide collaboration: not every workshop can be in Europe…
- Just before restart of data taking
- Agenda: https://indico.cern.ch/event/345619/ (please send your feedback/suggestions)
- Organize your trip asap
- Not many flight options below 1000€ from Europe
- Registration on CHEP site
Status of actions in progress
- Reminder: sites must reinstall perfSONAR
- H2020 VRE projects: 2 submitted, news expected by the summer
- OK-Science: linked to analysis reproducibility and knowledge capture
- Medical VRE
Forthcoming meetings: see slides
Discussion
- Jeff: was there a project submitted by EU-T0?
- No, as far as we know. But EU-T0 is involved in 2 projects submitted in September (DataCloud and Zephyr)
- News about these projects?
- Not yet, at least official ones. But some people may have informal ones. Definitive answer expected by February 2
Cloud Resource Reporting - L. Field
Cloud Status document released as v1
- Only minor changes to the initial draft
WLCG relies on resource reporting to match resources delivered by sites and used by experiments against pledges and requirements
- The ability to do this with cloud resources is a requirement for using cloud resources as part of the official WLCG resources
- Sites will not give capacity for free...
- Not exactly the same thing as accounting
- Managed by REBUS: sort of an electronic version of the MoU
- Experiment requirements and site pledges
- Resource reporting: for T1s taken directly from the accounting portal, for T2s a spreadsheet is built from the accounting portal and circulated to sites
Specific challenges for cloud resources with resource reporting
- Need HS06 hours and wall time per WLCG federation (see the sketch below)
- Benchmarking VMs is a critical point for time normalization
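A minimal sketch of the quantity to be reported, assuming the usual WLCG normalization convention (HS06-hours = wall-clock hours x cores x per-core HS06 score); the function name and the numbers are illustrative only:

  # Hedged sketch: normalize wall-clock usage to HS06-hours.
  # hs06_per_core must come from benchmarking the VM flavour itself,
  # which is the hard part highlighted above.
  def hs06_hours(wall_hours, cores, hs06_per_core):
      return wall_hours * cores * hs06_per_core

  # e.g. a 4-core VM rated at 10 HS06/core running for 24 hours:
  print(hs06_hours(24, 4, 10))   # -> 960 HS06-hours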
Proposing a Cloud Reporting TF
- Objective: gap analysis, definition of an action plan, to ensure that cloud resources can be reported
- Key participants: APEL, accounting portal, REBUS/WLCG Office, Benchmark specialists
- Weekly meeting proposed every Wednesday 3 pm CET
Discussion
- Helge: while not disagreeing with the proposed TF, he had never heard about it and is surprised it is announced as already created
- Michel: I'm responsible for that! In fact the issue of cloud benchmarking/resource reporting has been identified as one of the major issues in the cloud-related pre-GDBs for 2 years; Laurence proposed to me last September to convene a TF to make progress with it and I welcomed the proposal. The goal is to put together all the activities around this topic that are already happening. As with every group, its main responsibility is to work! and come back with proposals that will be discussed: no power to take decisions on its own. The goal is to make progress "quickly", thus the rather high frequency proposed.
- Helge: sites are missing from the proposed participants but they should be key participants
- Laurence: no problem with site participation, if interested they are welcome to register. The initial list of experts proposed is the minimum list to look at the issue. Site participation would help to ensure that proposed solutions are feasible. One critical issue is related to benchmarking and time normalization.
- Jeff: should take into account that, if WLCG would like resources to belong to WLCG, this is quite orthogonal to cloud elasticity and may be a source of difficulty
- Michel: this was already the case with the grid to some extent, but agreed that this is part of the specific challenge with clouds and this is why we need this task force
- When does it start?
VO Box Services and Security - J. Templon
From the 2006 VO box definition, 2 classes of services
- Class 1: services accessed through API exposed to external world (possibly through firewall)
- No major issue: as long as a VOBOX has only Class 1 services, it can live on a separate network and be seen as an external resource for the site: less risk, can be wiped and reinstalled in case of problems
- Class 2: private access through some restricted access interface (e.g. qsub to a local cluster)
Recent vulnerability identified with a VO box running only Class 1 services: fortunately it had been moved to a specific subnet
- In fact affecting all VO boxes
- Port scan revealed a vulnerable service (due to an old SSL version) but limited exposure due to firewalling
- Once vulnerability (openssl) is fixed: wipe box and return to VO
- Fix not yet available as it proved to be not completely trivial to update openssl to a newer version but should be ready soon
2 VOs left with class 2 services on VOBOX
- ALICE: due to the computing model
- CMS: access to SE namespace because srm-ls is not working as expected (to be confirmed)
ATLAS also runs a class 2 service on SEs: the N2N service
- Potential problem with class 2 services is not specific to VOBOXes, it applies to every VO service
Need some formal process to check and do security assessment of VO-developed SW that needs trust beyond "VO boundaries"
Also some policy should be defined on how this should be handled with CVMFS: some VOs never replace what is delivered by CVMFS, meaning that the vulnerable SW can still be used
- Some VOs like ALICE already replaced files in CVMFS: technically feasible without problem
- Nothing really specific to CVMFS, except lack of site control to implement a workaround (remove file or package)
- ATLAS N2N distributed as a RPM via YUM (WLCG repo)
Reinforces the importance of not relying only on the VO for traceability: a VO can be affected by a vulnerability if distributing vulnerable SW; does the VO have enough expertise in security?
- Markus: not clear, commercial cloud providers are clear that the responsibility/liability is entirely on the user (the one who paid for the resources) and not on the provider... Need clarification of responsibility delegation. The same issue of using vulnerable SW happens with the basic OS from RH: is RH liable? Is the sysadmin who installed it liable?
- Helge: main difference between our SW and OS/EPEL: if there is a security vulnerability with OS/EPEL, we know that some people will work immediately on it and that we'll get a fix quickly
- Developing site level auditing for such SW could be a recommended good practice...
- S. Lueders: need also to check that developers are using appropriate methodology/tools (nightly builds, build tests...): automatic search for vulnerabilities is pretty easy to add if you have a testing infrastructure
Discussion conclusions
- Establish a list of Class 2 services running on VO boxes or elsewhere to clearly identify the sensitive SW
- We could then decide whether some of these pieces may be worth a best effort review, as we did for glexec
- Find new volunteers for such code reviews
- Jeff: NIKHEF is regularly in contact with students that are looking for 6-week internships around this kind of work: could ask them if they may be interested
ATLAS Experience with cgroups - G. Qin
Work done on using cgroups from Condor
Condor and cgroups
- Each job is put in its own cgroup for each selected subsystem controlled by cgroups
- 3 policies for memory: none, soft, hard (see the configuration sketch below)
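A minimal configuration sketch of how a site typically enables this, assuming an HTCondor version of that era with cgroup support; the file name and the choice of 'soft' are illustrative, not a recommendation:

  # /etc/condor/config.d/cgroups.conf (hypothetical file name)
  # Parent cgroup under which HTCondor creates one cgroup per job
  BASE_CGROUP = htcondor
  # Memory policy: none (accounting only), soft (limit enforced only
  # under memory pressure) or hard (allocation strictly enforced)
  CGROUP_MEMORY_LIMIT_POLICY = soft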
2 sites tested cgroups with Condor
- Glasgow: since last April, fully enabled on Condor cluster, no problem seen so far
- RAL: enabled on 10% of the farm, no problem so far
- Historical analysis of job behaviours every day: job profiles examined (lifetime, RSS, correlation)
See slides for analysis of ATLAS job profiles
- Help to understand the normal behaviour of different kinds of jobs and to detect the suspicious/misbehaving jobs
- For example a typical reco job will take 2h whereas a misbehaving job will run up to its limit (e.g. 48 hours): this means a lot of wasted resources, in particular with multicore jobs (see the sketch below)
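A minimal sketch of the kind of daily check described above, assuming job records carrying a wall-time field; field names and thresholds are illustrative:

  # Hedged sketch: flag jobs running close to the batch-system limit,
  # the signature of the misbehaving jobs described above.
  def suspicious(jobs, limit_hours=48, threshold=0.9):
      return [j for j in jobs if j["wall_hours"] >= threshold * limit_hours]

  jobs = [{"id": 1, "wall_hours": 2.1}, {"id": 2, "wall_hours": 47.5}]
  print(suspicious(jobs))   # -> [{'id': 2, 'wall_hours': 47.5}]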
Future work includes combining Condor information with information from PanDA
- Will require cgmemd to parse job log files
Discussion
- Michel: very comprehensive, it would be good if we got similar reports from other VOs at a next GDB if there is some work going on with cgroups
- S. Roiser (SR): in LHCb there is a similar study to kill jobs based on an RSS threshold. LHCb is very interested in cgroups but is worried about the lack of cgroup support in Torque. Is Torque support really needed?
- Michel: the issue is that, to apply cgroup limits, the process must be put in a cgroup. This is what Condor does. There is no concept of a default cgroup AFAIK (see the sketch after this discussion).
- Michel: could a GE site say whether GE has cgroup support?
- Andreas: yes, the last Univa version has support for it. Will check if KIT started to play with it and if there is enough experience to report about it.
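A minimal illustration of Michel's point, assuming a cgroups-v1 hierarchy mounted under /sys/fs/cgroup (needs root; the cgroup name and the 2 GB limit are hypothetical):

  # Hedged sketch: a memory limit only applies to processes explicitly
  # placed in a cgroup; there is no "default" cgroup a process falls
  # into, which is why the batch system has to do the placement.
  import os

  cg = "/sys/fs/cgroup/memory/testjob"          # hypothetical cgroup
  os.makedirs(cg, exist_ok=True)
  with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
      f.write(str(2 * 1024**3))                 # 2 GB limit
  with open(os.path.join(cg, "tasks"), "w") as f:
      f.write(str(os.getpid()))                 # this process is now limited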
Actions in Progress
Ops Coord Report - A. Forti
WLCG survey now closed
- 95 answers
- Analysis starting
Next Ops Coord meeting: 22/1
Kernel vulnerabilities identified just before Xmas
- Update out since 19/12
- Vulnerability tagged as important by RH but as critical by EGI
- Sites which didn't upgrade have been suspended as of today...
Middleware
- Vulnerabilities found in FTS3 and gfal2: new versions in UMD by mid-January
- New baselines: FTS3 3.2.20, GFAL2 2.7.8
- First time that FTS3 will be part of UMD
- dCache 2.11.4 considered the new baseline
- dCache 2.6 not supported after beginning of Run 2
- xrootd 4.1.1 in EPEL5/6 testing: will require a new dpm-xrootd for DPM sites, in preparation to allow sites to upgrade
- dCache xrootd plugins should be put into WLCG repo: followed up by OpsCoord
T0
- Final decommissioning of AFS UI scheduled Feb. 2
- VOMS-Admin testing progressing: looking for experts for each experiment
- Experiments involved in testing, GGUS 110227 to track issues: several identified but no showstopper
- Would like to have one contact per experiment for the migration: only LHCb supplied one so far
Experiments
- ALICE: high activity, very good site performance
- Continuing ARC testing: troubleshooting a probable memory leak
- ATLAS: no major issue, moderate workload (70-80% of resources), tuning/fixing new frameworks
- CMS: 50% of T1 capacity must be multicore enabled by end of January, merging CRAB and central production into a single Condor pool
- T1: long lifetime for pilots preferred
- Also disk rather full at some T1s: cleanup in progress
- LHCb: almost completed the Run 1 legacy stripping campaign, work continuing on http/DAV
- http/DAV: still 5 sites missing
glexec
- PanDA testing progressing: 43 sites (+11)
- Issues found at a few sites, under investigation
Machine/Job Features: sites encouraged to test MJF (a sketch of how a job reads the published values follows below)
- No volunteer sites so far
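A minimal sketch of what a test involves from the job side, assuming the site publishes values as small files under the directories pointed to by $MACHINEFEATURES and $JOBFEATURES (per the HEPiX MJF proposal); the key names shown are the commonly cited ones and may differ:

  # Hedged sketch: read Machine/Job Features values published by a site.
  import os

  def mjf_value(base_env, key):
      base = os.environ.get(base_env)
      if not base:
          return None  # site does not publish MJF
      try:
          with open(os.path.join(base, key)) as f:
              return f.read().strip()
      except OSError:
          return None

  print("node HS06:", mjf_value("MACHINEFEATURES", "hs06"))
  print("job wall limit (s):", mjf_value("JOBFEATURES", "wall_limit_secs"))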
MW Readiness
- Preparing to participate in ARGUS stress testing
- Slow-down in site participation recently: need active site participation
New Benchmark Status - M. Alef
Still waiting for the SPEC release.
Compiler flags: still waiting for a new recommendation from the Architects Forum
- Critical for HS scalability with real applications
Scalability difficult to achieve when an application uses a compiler other than gcc
Preparing next benchmark: need to identify apps representative of experiment applications
Open source benchmark: still 2 candidates, GEANT4 and LHCb python script
- GEANT4: investigation in progress
- Still difficulties to get replies from GEANT4 expert about observed behaviour
- CMS provided a docker image to run the test easily
- CVMFS dependency can make difficult to run the test everywhere
- LHCb script: attractive as very quick (~1 min)
- But tests at GridKa showed a big dispersion compared to HS06 scores: seems very dependent on system load
- A Python script: means scaling with real apps will be difficult, as those depend on the compilation flags used
Helge: no sign that HS06 doesn't scale reasonably with recent HW
- All the issues reported so far have been tracked down to misuse of the benchmark (inappropriate load on the box tested, wrong compilation flags...)
Discussion
- D. Giordano: ATLAS developed a benchmark for Helix Nebula, based on the standard simulation chain.
- Goal: compare with in-house resources
- Already collected a good amount of data
- Michele: the benchmarking WG is very interested in the tool, Domenico invited to join the mailing list
- Jeff: HS06 was designed to scale the same way applications scale, but as experiments plan to improve their SW performance, is it still meaningful, and why design a new benchmark to scale better than what we have today (HS06)?
- Michel/Alessandro/Helge: scaling is evaluated/defined with one specific software version. This is what allows comparing the relative power of machines, evaluating pledges against requests and doing provisioning
- Michel: do you have an idea when SPEC CPUv6 will be available?
- Manfred: probably by the end of 2015 or early 2016: I can compile it but there are issues that may take several months to be fixed
- Helge: this is not a major problem presently. The situation is not the same as with SI2K, when benchmarks could fit in processor caches and were giving non-representative results. We currently have no sign of any HS06 crisis: the only reported problems have been tracked down to misuse of the HS06 benchmark (wrong compilation flags, wrong number of benchmark copies per machine). We are still within the 10% scaling margin.
- Helge: a comprehensive talk on this topic planned at next HEPiX
Data Preservation: The LHC Experiments - F. Berghaus
Summary of yesterday's pre-GDB.
Objectives
- Preserve data, SW and know how in the collaborations
- Analysis reproducibility
- Requires the ability to reuse (rebuild?) old tagged versions of the SW
- Prototype portal: https://data-demo.cern.ch
- Share data and associated SW with a larger scientific community
- Should have no impact on the production infrastructure
- Makes access to data easier
- Importance of documentation
- Prototype portal: http://opendata.cern.ch : CMS already published 2010 data, ATLAS planning to do the same thing, LHCb and ALICE committed to do it later
- ATLAS experimented with another way to give open access to data: the Kaggle challenge (HiggsML)
- Education and outreach, open access to general public
- 1st target: CERN Master Class program
- Importance of meaningful examples
- Variant of the previous objectives: increased importance of ease of access/use
All objectives require bit preservation but bit preservation is not enough in itself
- Part of MoU for T0 and T1s
- Must include periodic reading of the data
- Do we need to include running some physics code on the data?
Agreed to report to the GDB about every ~6 months
- Another pre-GDB or topical meeting if needed
Discussion
- Wahid: it would be interesting to allow comparison of ATLAS and CMS data
- Frank: this is not trivial because the formats are different and, BTW, the experiments are preparing for Run 2 and thus have limited manpower available for these activities
- Why will ALICE release only 10 TB of data?
- Predrag: it is not a technical problem but the feeling that we are giving away something that is ours: it took 2 years in ALICE to agree to release these 10 TB. Also data without MC means nothing and some people are afraid that someone will pick up the data and present something completely wrong.
Data Protocol Zoo pre-GDB - W. Bhimji
Very active discussions!
Progress on many issues since the last meeting in Annecy in 2012
- FTS3, gfal2, federations...
Protocol simplification would benefit both sites and experiments
- Reduce number of access protocols, SRM-less disk SE
Experiment views
- CMS: probably ready to use SRM-less sites but no experience yet
- Deletion may require user/site intervention until gfal2 is proven to work for this need (not tested yet)
- LHCb: currently relying on SRM to get URLs but could live without
- Want to have one xrootd stable endpoint (local redirector) at each site
- ATLAS: would like to move to xrootd/file only for data access, exploring DAV as an alternative to SRM for metadata operations
Sites: interest in new storage technologies like Ceph
- Ceph is not completely ready for WLCG usage because of non-overlapping protocols: CephFS may help in the future
Storage systems
- All implementations focusing on DAV protocol support and have xrootd available
- No plan for xrootd 3rd-party copy in dCache
- StoRM investigating a non-SRM BringOnline
- Phasing out SRM support: consequences differ according to implementations
- EOS would like to get rid of SRM support
- CASTOR: would require investigation and planning
- Xrootd: 4.2 will bring a Ceph backend plugin
- 4.1 already brought cross-protocol redirections (protocol change during redirection)
- Davix: becoming mature, similar DAV/xrootd performance
- FTS3: gridftp bulk transfer coming
- GFAL2: native checksum support on gridftp, http and xrootd
- SRM deletion remains faster than any other method, thanks to bulk deletion
Possible evolutions
- 3rd-party transfer: gridftp is the only possible short-term solution, xrootd/http are possible options for the future but still a lot of work needed on performance
- Download: moving from SRM to xrootd/http feasible today
- Reporting space without SRM: possible via an ad hoc solution or RFC
- Deletion: need to investigate non-SRM bulk deletion (see the gfal2 sketch after this list)
- Local access: xrootd/file seems feasible but some xrootd issues found at ATLAS sites that may prevent moving from dcap in the short term
- Moving away from RFIO seems feasible today
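A minimal sketch of what SRM-less deletion could look like through the gfal2 Python bindings; the davs:// endpoint and path are hypothetical, and whether this adequately replaces SRM bulk deletion is exactly what remains to be investigated:

  # Hedged sketch: delete a replica directly via its access protocol
  # instead of going through SRM.
  import gfal2

  ctx = gfal2.creat_context()
  ctx.unlink("davs://storage.example.org/vo/data/file.root")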
Conclusions
- No crisis but need to push for rationalisation
- Current approach for discussion/reporting seems ok
- Mainly pre-GDB and GDB discussions
- Some items may move to WLCG Ops Coord: rfio/dcap decommissioning
- Will make a table available with storage solutions/supported protocols and experiments/used-usable protocols
Discussion
- Gerd: different motivations expressed yesterday, from getting rid of rfio/dcap to getting rid of SRM. Not the same thing, and they require different time scales.
- Michel: if we wish to get rid of SRM, we have to show we can have SRM-less sites. It's true that SRM has been battle-tested and that it's too early to decide whether we really can/want to get rid of it, but some sites want to run without SRM and this is a good opportunity to demonstrate whether it is feasible or not and to tackle issues with non-SRM disk instances.
- Philippe Charpentier: about deprecation of rfio and dcap, if the experiment says "from now on we don't use rfio", then sites can do what they wish. SRM is still useful for a few things; if we are able to do the rest without SRM, this removes a lot of pressure from SRM.
- Wahid: even if we use SRM only for deletion, sites are still obliged to keep it, but probably with fewer problems as there will be no more competition for resources between deletion and transfer/access
Evolution of UMD Repositories - C. Aiftimiei
Move management of EMI repositories under UMD as UMD-preview
- Will retain current EMI repo characteristics, in particular packages as released by product teams
- EMI repositories will continue to be available but contents frozen
- No EPEL packages
- Source packages and 3rd-party packages as repos in AppDB
- Management with the current UMD SW provisioning workflow: a team instead of a person
- New web pages replacing the EMI-3 pages from EMI projects: http://repository.egi.eu
Migration transparent to sites if using the emi-release RPM
Final plan to be announced ~mid-February
ARGUS Workshop Summary - M. Litmaath
December 12: good attendance and representation
- ARGUS PT
- WLCG Security and Operations
- OSG
- EGI Operations
- Sites
Main question: what problems can/does/should ARGUS address?
- Will it become irrelevant in a few years or might it take on new functionality?
- Possible alternatives
- Impact of federated identities
- ARGUS has been designed to be generic, not bound to X509
Performance and scalability: few instabilities occurred, always under high load
- Need to better document HA setups
- Load testing setup is very desirable: a small setup may be enough
- Last guess for the problems seen: the server gets blocked by something like OCSP checks (OCSP checks can be disabled in the last fix provided by SWITCH, not really needed in our context) but OCSP is not necessarily the only cause: even with OCSP checks disabled, a few server blockages have been observed
- Problem may be in the underlying Java libraries used, some of them used in a "non-typical" way
- Apart from these obscure instabilities under high load, general feedback is that the service typically runs by itself
ARGUS use cases in EGI go beyond what GUMS can do today: not an alternative
- OSG has no plan to extend GUMS (and doesn't really think it would be reasonable to do it)
- ARGUS provides hierarchical composition of policies
- EGI requires continued identification of users, rather than just relying on the VO
- Evolution toward federated identities: ARGUS may help with authz in this new world
Agreement to start a community to keep ARGUS alive
- INFN already supports the PAP components, may take over PDP and PEPd if they get additional funding, e.g. through the DataCloud H2020 project
- NIKHEF supports the C clients (used by glexec) and will continue to do so
- EGI agrees to do the release management/staged rollout, 1st and 2nd level support, scale testing with partner sites and the WLCG MW Readiness validation activity
- New potential partners that will be contacted: CESNET (testing), UNICORE (using the CANL library that they maintain)
- Also ARC needs to fix its client: it chose a different approach than EMI but it is not completely correct
- Others welcome: can be recognized for it!
Recent evolution (last year) and short term plans
- Code hosted in GitHub: new issue tracker will be hosted there
- CNAF will take care of creating RPMs
- Documentation: move to GH too
- Google groups mailing lists for developers, users, support
- EGI will prompt sites to take ARGUS up again and open GGUS tickets if encountering issues
- Next meeting early Feb (Vidyo)
Conclusions
- Community is forming: more participants would be welcome
- Stronger support statement is desirable when community has matured
- Progress on fixing the few issues observed in high load conditions
--
MichelJouvin - 2015-01-19