November 2016 GDB notes
DRAFT
Agenda
http://indico.cern.ch/event/394788/
Introduction (Ian Collier)
https://indico.cern.ch/event/394788/contributions/2357312/attachments/1368521/2074360/GDB-Introduction-20161109.pdf
Maarten: Why was the Rome ASTERICS/OBELICS workshop included?
Ian: Clearly it's not directly related to WLCG, but many sites are working on ways to engage, particularly with the computing requirements of these communities. It may be of interest; it's not clear where the intersection is, but we should be paying some attention to it. It's also true that if we included everything with some intersection the list would be impossibly long!
INDIGO-CMS Update (Daniele Spiga)
https://indico.cern.ch/event/394788/contributions/2357313/attachments/1368563/2074634/GDB-spiga.pdf
Romain: If you plan to do anything production related, I highly recommend you go through a security audit, in particular to check against the logging and traceability policies, and to check that the fact that you use a non-IGTF-approved CA matches up with the grid pilot job policies; look through the existing policies. You should at least document the process.
Ian: A complete audit is not part of the INDIGO project, but STFC is going to be leading some security challenges.
Romain: Excellent, very good.
Frank: Hopefully two simple questions. Pilot jobs?
A: We are not running pilot jobs. There are two ways of joining: pilots, or starting the startd process directly. If you call that a pilot then yes; otherwise condor just starts the daemon.
F: What gets submitted to the condor queue?
A: The end-user job. The schedd matches resources and creates the claim. We use the standard approach used by CMS to submit. Once the schedd knows about the user request it carries out the matches; once a claim is possible, the set of jobs starts running.
Oliver: So in the end the CMS analysis job runs with a regular proxy signed by a regular CA, against standard grid storage.
A: There are 2 proxies. The INDIGO solution is to have a proxy available that is equivalent to the pilot one, allowing jobs (including the user proxy) to be fetched. Then there is also the user proxy.
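As an illustration of the matchmaking flow described above, here is a minimal sketch using the HTCondor Python bindings; the job description and the attributes queried are placeholders, and this is not the actual INDIGO-CMS submission code.

    # Minimal sketch, assuming the HTCondor Python bindings are available.
    # The job description below is an illustrative placeholder, not CMS code.
    import htcondor

    # Describe an end-user job; the schedd will match it against any startd
    # that has joined the pool (e.g. one started on an INDIGO-provisioned node).
    sub = htcondor.Submit({
        "executable": "/bin/echo",
        "arguments":  "hello from a matched startd",
        "output":     "job.out",
        "error":      "job.err",
        "log":        "job.log",
    })

    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:      # queue the job with the schedd
        cluster_id = sub.queue(txn)
    print("queued cluster", cluster_id)

    # List the startds currently advertised to the collector, i.e. the
    # resources the schedd can match the queued job against.
    coll = htcondor.Collector()
    for ad in coll.query(htcondor.AdTypes.Startd, projection=["Name", "State"]):
        print(ad.get("Name"), ad.get("State"))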
Oliver: Will this extend to using data provided by INDIGO? So CMS will import data into Onedata?
A: That's not an answer I can give for CMS. What we will do, in order to demonstrate, is establish a zone within Onedata, set up testbeds, and demonstrate workflows, transparently for CMS. What happens next is a matter of discussion with CMS (security, etc.). The 2nd phase, in the first 6 months of 2017, will address data management: one or more zones with a provider hosting CMS data.
Domenico: CVMFS inside container?
A: After the Mesos cluster and load balancer, we instantiate a squid proxy. TOSCA then triggers CVMFS in the docker host and starts deploying everything that is needed, both CVMFS and squid.
Domenico: IdM - how much is the INDIGO approach connected with eduGAIN?
A: The token may be obtained via INDIGO IAM, which is federated with eduGAIN.
Andrea Ceccanti: The INDIGO IAM instance for this integration test is registered with eduGAIN; you can authenticate with your home identity provider if it provides enough attributes. Note that membership in this instance goes through an identity vetting step. For example, as with a CERN VOMS VO, you cannot simply register in the INDIGO IAM instance; you need the approval of an admin. You could have one INDIGO IAM instance for CMS, with VO admins who decide which members are allowed to be inside. Users can authenticate with their home institution, or Google, etc.
Romain: On authentication, it is really important that INDIGO looks at the SIRTFI framework w.r.t. incident response. It is also important to make sure that authorization remains in line with the current VO registration guidelines.
Andrea: Checking which IdPs are supported is something we can look into; we don't have a set of required attributes, but we can certainly look at this. As soon as this goes into production we can ensure the same level of integration as with VOMS, or closer integration with the CERN SSO system. That is more related to deployment in production; this is more of a proof-of-concept demonstration. Token translation can be carried out. When it goes into production, care needs to be taken to make sure that the policies are met.
Q: For the use cases where we need the authentication layer and the token translation layer - is there a CLI?
A: Both, but using REST.
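To illustrate what "using REST" looks like for obtaining a token, a rough sketch assuming a standard OAuth2 client_credentials flow; the IAM endpoint URL and client credentials below are placeholders, not values from the talk.

    # Sketch only: the token endpoint, client id and secret are placeholders.
    import requests

    IAM_TOKEN_ENDPOINT = "https://iam.example.org/token"   # hypothetical IAM instance

    resp = requests.post(
        IAM_TOKEN_ENDPOINT,
        data={"grant_type": "client_credentials", "scope": "openid profile"},
        auth=("my-client-id", "my-client-secret"),          # placeholder client
        timeout=30,
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # The token would then be presented to a token-translation service, e.g. to
    # obtain an X.509 credential; that endpoint is deployment specific.
    print(access_token[:20], "...")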
ECFA Workshop (Simone Campana)
https://indico.cern.ch/event/394788/contributions/2357313/attachments/1368563/2074634/GDB-spiga.pdf
Q: What if the worst-case scenarios play out, e.g. IBM/Oracle drop tapes? Would that affect future accelerator construction?
A: It would impact the kind of physics you do. This is about linking the LHC programme with software and computing. The priority is to define a strategy for the experiments for HL-LHC; computing needs to adapt to this strategy. This was mentioned at the last RRB when the obvious question came up: for 2017, should money be taken from the Upgrade? The answer is no.
Ian: How can we use the space in the GDB to complement the other activities? We don't necessarily need to answer this now, but for future GDBs...
Simone: Mix of three areas. Software, computing, facilities. Definitely GDB, maybe not necessarily for software but surely for computing and facilities.
Domenico: On the computing model with a content delivery network: when you think about moving storage, mainly AODs, small files, what do we feed as data?
Simone: We have to be prepared to do all of it. For different experiments the benefit might be in different areas. Looking at ATLAS and CMS today, CMS spends a fair amount of resources on analysis while the ATLAS numbers are almost negligible; ATLAS spends enormous resources on GEANT4 simulation, CMS doesn't. They differ in such a way that you have different models and can share pros and cons. Whatever the infrastructure is, it needs to be flexible enough to be able to optimise reco and end-user analysis. You cannot tune the infrastructure to one use case; that makes the job difficult, but the future is uncertain enough that you have to be flexible in what you support best.
HNSciCloud Tender Evaluation (Domenico Giordano)
https://indico.cern.ch/event/394788/contributions/2357321/attachments/1368556/2074418/HNSciCloud_Update_GDB_9-11-16.pdf
Q: 2 questions. First, some companies appear in different partnerships - is this not strange?
A: This is allowed; it is mostly confidentiality between the two activities that is of concern. Different partners have different roles in different collaborations. It is a grey area. Our goal is to arrive at commercialisation. There is a need to justify the investment over 6 months rather than the 2 years of a European project, with the possibility to commercialise before the end of the project.
Q: What is the relationship between HNSciCloud and INDIGO?
A: Some of those commercial partners are also part of R&D of INDIGO.
Q: Difference between projects?
A: Different areas. We're a hybrid cloud using different services: authentication, data management. There are different layers in INDIGO; here we are talking about the commercialisation of solutions which could have been developed in INDIGO or elsewhere.
A: This is specifically a procurement project with R&D element.
A: Don't think commercialisation component is in INDIGO.
Q: Commercialisation: WLCG buying or selling?
A: We buy capacity, but also services for the hybrid cloud. They should be able to commercialise these, so that things don't simply end when this project finishes but look towards the next 10 years. We need solutions that we can buy using models that are different to those we have now.
Q: Long perspective - timeline ends in 2018, what next, one winner, and then?
A: The winner(s) are the one(s) that, when they reach the end, demonstrate commercialisation. The goal is to then move to commercialisation. Winner means: (a) they satisfy the requirements and can commercialise a solution, which we can then buy after 2018; (b) if they are not able/willing to commercialise a working solution, we earn back the intellectual property (if you want) in 2018. Even the ones that do not arrive at the end could build their own solution and commercialise it. We will buy, not sell.
Q: Do you foresee using pre-emptible resources? That is another factor of work.
A: That is one of the investigations: which workloads can fit within those resources. It is not trivial, but we are quite advanced with respect to other institutions which do not have the experience of WLCG. A genome sequencing algorithm that needs 200 VMs for >24 hrs maybe could not adopt this.
Q: The experiments need to prepare for this.
A: Yes, part of evaluation.
Phasing out of Legacy Proxies (Erik Mattias Wadenstein)
https://indico.cern.ch/event/394788/contributions/2357322/attachments/1367917/2073109/20161109-LegacyProxies.pdf
Maarten: RFC proxies should be the default per the Task Force this year, which is almost finished. This hinges on EGI UMD; the new default has already been released about a month ago. I don't have any concerns - the opposite has already happened for a few years now with some experiment software. CMS said they were not going to debug why legacy proxies don't work and forced RFC proxies instead. ALICE also had to switch for similar reasons: the software could no longer be made to work with legacy proxies despite huge effort. For once we were on time with this. The SAM area was the first to see how well the infrastructure worked for this; zero problems were found. I will mention a problem discovered recently: a JGlobus issue with RFC proxies for certain CAs. That is the only outstanding issue to my knowledge. The sad thing is that JGlobus is not supported by anyone, while the vast majority of dCache instances depend on it; the dCache developers are looking into private builds. dCache 2.14 and 2.13 are still considered OK. The fix will happen there, and that build may also be used by EOS and BeStMan. The matter was not discovered through operational errors; folks in Canada ran into it doing particular tests. Operations work, and the vast majority of certs are not used to access storage. It is better fixed in a private build of the code, which should not be that difficult (a one-line change). Aside from this, legacy proxies had better disappear as soon as possible. (See the sketch at the end of this item for the technical difference between the two proxy flavours.)
Ian: When are we going to be ready for a closing report?
Maarten: The TF was set up a few years ago to make the infrastructure ready for SHA-2; RFC proxies came later. Basically the infrastructure is ready, with this one JGlobus proviso. Then RFC proxies will be the default as well as supported. In a few days there will be a UMD4 release with the change of default. At the December Ops coordination meeting I will document that this has happened, that the infrastructure is ready and the latest versions of the code are compliant; then it depends on what the experiments do with that. We will have to check where proxies are created and try to weed out where legacy proxies are still being used for no good reason; that could take a few months.
Ian: I look forward to a closing report in the next few months.
Maarten: It is possible that early next year legacy proxies will not be used for anything.
Mattias: There has been concern on the ARC lists about what this would mean for WLCG sites - I can reassure them.
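Sketch referred to above: technically, RFC 3820 proxies carry the proxyCertInfo certificate extension (OID 1.3.6.1.5.5.7.1.14), whereas legacy Globus proxies do not (their subject simply ends in CN=proxy). A minimal check, assuming the Python cryptography package; the proxy path is a placeholder.

    # Sketch: distinguish an RFC 3820 proxy from a legacy Globus proxy by
    # looking for the proxyCertInfo extension. The proxy path is a placeholder.
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    PROXY_CERT_INFO_OID = x509.ObjectIdentifier("1.3.6.1.5.5.7.1.14")  # RFC 3820

    with open("/tmp/x509up_u1000", "rb") as f:          # placeholder proxy file
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())

    try:
        cert.extensions.get_extension_for_oid(PROXY_CERT_INFO_OID)
        print("RFC 3820 proxy")
    except x509.ExtensionNotFound:
        # Legacy Globus proxies have no proxyCertInfo extension and a subject
        # ending in CN=proxy (or CN=limited proxy); draft proxies use another OID.
        print("legacy (or draft) proxy")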
Update on WLCG Accounting Reports (Julia Andreeva)
https://indico.cern.ch/event/394788/contributions/2367230/attachments/1368590/2074483/WLCGPortalChangesGDB.pdf
Peter: This set of requirements - is it final, or will it be final after the MB this month?
A: The MB should bless it; absent objections, this is what we'll do.
Peter: It should be submitted to the developers as the official set of requirements; a good part of them could be implemented across the whole portal, not just a subset. Once they are official I will discuss them with the developers.
A: Agreed; the main developer, Ivan, is part of the task force and so is following very closely. We're not pushing to have things included until they are ratified by the MB, hopefully in place by next Tuesday.
Ian: This might be useful for the rest of the community, not just us, so it is useful to capture.
WLCG Workshop report (Ian Collier)
http://indico.cern.ch/event/394788/contributions/2357323/attachments/1368648/2074831/GDB-WLCG-Workshop-20161109.pdf
no comments - but please note the request that volunteers to host the next WLCG Workshop contact Ian Bird or Ian Collier
Facilitating campus and grid security teams working on the same threats (Liviu Valsan, Romain Wartel)
http://indico.cern.ch/event/394788/contributions/2357324/attachments/1368633/2074577/20161109_GDB_v0.2.pdf
- Q: example of an incident propagated to the grid via a stolen credential?
- A: none. Instead, we need to worry about common attacks to gain root access at sites.
- Q: should users be trained e.g. not to send passwords around?
- A: while training can be worthwhile, passwords usually are found through other means.
- Attackers may deploy malicious payloads through compromised admin accounts
- Q: how useful is MISP today?
- A: while data selection criteria still are a work in progress, new threat info
may already be available shortly before a corresponding attack actually happens
- the SOC WG is about integrating MISP at sites, making use of Bro,
depending on the capabilities at each site
- CERN: distinction between campus and grid security has been removed
- that is the goal also for other sites
- the path to enlightenment is through informing and educating campus security teams
- buy-in can be obtained when usefulness is delivered early on
- Q: an institute may not want to share its own attack info?
- A: sharing is desirable, but not required; a site can still pull info from the WLCG MISP (see the sketch after this list)
- historically grid resources were treated differently, e.g. put into a DMZ,
while these days they can be merged with the rest of the campus infrastructure
- we ask sites to do something and we give them help also for the rest of their campus!
- Q: w.r.t. the recent kernel vulnerability, different sites reacted differently:
can the WG advise in such matters?
- A: how to deal with vulnerabilities is more an operational matter;
sites may have different practices, cultures, maturity levels
- Even T1 sites may have separate security teams for grid vs. campus!
- Q: send a questionnaire to find out how the landscape looks?
- A: possibly. We can start with sites that are ready for our initiatives.
- Opportunistic resources: we have little control there
- HPC: work with PRACE
- The active threats come from the campus!
- Clouds etc. are a different matter
- Let's start building a trust relation with campus security teams now,
so that the benefits can be reaped in the future
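As referenced in the list above, a rough sketch of pulling attributes from a MISP instance via its restSearch REST API; the instance URL, API key and filters are placeholders, and what a site actually feeds into its local tooling (e.g. Bro) will differ.

    # Sketch only: the MISP URL and API key are placeholders.
    import requests

    MISP_URL = "https://misp.example.org"        # placeholder MISP instance
    API_KEY = "REPLACE_WITH_SITE_API_KEY"        # placeholder

    resp = requests.post(
        MISP_URL + "/attributes/restSearch",
        headers={
            "Authorization": API_KEY,
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        json={"returnFormat": "json", "type": "ip-dst", "last": "1d"},
        timeout=60,
    )
    resp.raise_for_status()
    for attr in resp.json().get("response", {}).get("Attribute", []):
        print(attr["type"], attr["value"])       # e.g. feed into local IDS rules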
Performance at WLCG workshop (Andrea Valassi)
http://indico.cern.ch/event/394788/contributions/2357345/attachments/1368787/2074869/20161109_GDB_AV.pdf
no comments
Compiler based optimization - Testing AutoFDO for Geant4 (Nathalie Rauschmayr)
http://indico.cern.ch/event/394788/contributions/2357347/attachments/1368686/2074705/slides.pdf
- data needed to train the compiler is largely independent of the physics
- the code will mostly be occupied in tracking particles through detectors
- or reconstructing their tracks
- forward vs. barrel events illuminate different detectors
- but should not be a big effect overall
- training the compiler per executable could be part of typical build and validation clusters (see the workflow sketch after this list)
- newly built libraries and programs may already be validated with reference data sets
- AutoFDO typically reaches ~90% of what could be achieved with instrumented PGO
- ATLAS are interested in FDO: a speed-up of ~10% is worth spending some time for
- the Google gcc enhancements are supposed to be merged with the standard gcc
- but the necessary flag -frecord-compilation-info-in-elf is not yet available
- executables recompiled after training are unlikely to be slower for other scenarios
- larger training samples might lead to a few % further gains
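As referenced in the list above, a hedged sketch of what a per-executable training-and-rebuild cycle could look like in a build/validation cluster; the binary, training input, file names and compiler options are placeholders (create_gcov comes from Google's autofdo tooling, -fauto-profile from gcc >= 5), and this is not the setup used in the talk.

    # Sketch of an AutoFDO training + rebuild cycle driven from a build script.
    # Paths and the training workload are placeholders.
    import subprocess

    BINARY = "./geant4_app"                 # placeholder executable to train
    PROFILE = "perf.data"
    GCOV_OUT = "app.gcov"

    # 1. Run a representative workload under perf with branch sampling.
    subprocess.run(["perf", "record", "-b", "-o", PROFILE, BINARY, "train.mac"],
                   check=True)

    # 2. Convert the perf samples into a gcov-style profile for gcc.
    subprocess.run(["create_gcov", "--binary=" + BINARY,
                    "--profile=" + PROFILE, "--gcov=" + GCOV_OUT],
                   check=True)

    # 3. Rebuild with the profile feeding the optimizer.
    subprocess.run(["g++", "-O2", "-fauto-profile=" + GCOV_OUT,
                    "-o", "geant4_app_fdo", "app.cc"],
                   check=True)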
Performance discussion (Andrea Valassi)
http://indico.cern.ch/event/394788/contributions/2357348/attachments/1368794/2074881/20161109_GDB_MS.pdf
- performance discussions could be held in a pre-GDB
- though the GDB context is more about workflows involving sites etc.
- the HSF ought to have a WG on performance
- commercial cloud activities: feedback on which workflows are compatible with clouds
- i.e. can have a good performance there
- a reference workflow would be good to check performance questions
- a synthetic benchmark can represent such a workflow
- HammerCloud could be used to test resources with such workflows
- it may be difficult to package such a workflow so that an admin can run it
Regional Federations Demonstrator (Fabrizio Furano)
http://indico.cern.ch/event/394788/contributions/2357349/attachments/1368821/2074933/0-GDBNov2016FedWLCGDemonst.pdf
http://indico.cern.ch/event/394788/contributions/2357349/attachments/1368821/2074935/1-GDBNov2016FedWLCGDemonst-ATLAS-IT.pdf
http://indico.cern.ch/event/394788/contributions/2357349/attachments/1368821/2074931/2-GDB-BelleII.pdf
http://indico.cern.ch/event/394788/contributions/2357349/attachments/1368821/2074932/3-UVIC-Federations2.pdf
- Q: how is the authN/authZ done for cloud object storage?
- A: Dynafed handles the user authN/authZ and deals with the storage transparently
- it holds the keys to the storage, clients do not need to know
- a cloud storage could also have e.g. a DPM in front of it
- the data bridge concept could be used in some cases
- it could also be mounted as block storage
- Q: how does Dynafed avoid SEs that are overloaded?
- A: each SE receives a ping test every ~30 sec and gets temporarily excluded when unresponsive
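The exclusion behaviour described in the answer can be pictured roughly as follows; a minimal sketch, not Dynafed's actual implementation, with endpoint URLs as placeholders.

    # Rough illustration of the periodic health-check / temporary-exclusion
    # behaviour described above; not Dynafed's actual code.
    import time
    import urllib.request

    ENDPOINTS = ["https://se1.example.org/dpm",     # placeholder endpoints
                 "https://se2.example.org/s3"]
    CHECK_INTERVAL = 30                             # seconds, as quoted above
    excluded = set()

    def responsive(url, timeout=5):
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return True
        except Exception:
            return False

    while True:
        for se in ENDPOINTS:
            if responsive(se):
                excluded.discard(se)    # readmit the SE once it answers again
            else:
                excluded.add(se)        # temporarily skip it for redirections
        print("currently excluded:", excluded or "none")
        time.sleep(CHECK_INTERVAL)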
Wrap up (Ian Collier)