Summary of GDB meeting, September 10, 2014 (CERN)
Agenda
https://indico.cern.ch/event/272777/
Introduction - M. Jouvin
As usual, looking for volunteers to take notes
Next GDBs
- Next ones at CERN on the second Wednesday of the month
- Next one outside CERN in March 2015. Please, let Michel know if you are interested in organising it
Summary of WLCG Workshop
Actions in Progress:
- Migration to the GFAL2/FTS3 clients by October 1st (a command translation sketch follows this list)
- Volunteer sites sought to provide dual-stack IPv6 endpoints
- Batch accounting in progress
- SL/CentOS: no news
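To illustrate what the client migration means in practice, here is a minimal sketch of the command translation involved (gfal2-util commands replacing the retired lcg-util ones; the endpoint and paths below are made up):

  # lcg-util style (being retired):
  lcg-cp srm://se.example.org/vo/data/file file:///tmp/file
  lcg-ls srm://se.example.org/vo/data
  # gfal2-util equivalents:
  gfal-copy srm://se.example.org/vo/data/file file:///tmp/file
  gfal-ls srm://se.example.org/vo/data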
O. Smirnova asks about the status of batch accounting and J. Gordon explains that there are still ongoing discussions on what the accounting information should be.
O. Smirnova asks whether virtual WNs should offer IPv6 endpoints. M. Jouvin suggests contacting the IPv6 WG.
T. Bell gives an update on SL/CentOS: currently studying how to use CentOS 7 in the virtual infrastructure.
EGI Role in the Evolving DCI Landscape - Y. Legre
- Currently there is fragmentation at several levels: RIs and e-Infrastructures
- EGI would like to develop the concept of an e-Infrastructure Commons:
- Common backbone of federated services
- Joint capacity planning
- Integrated resource provisioning/access, including commercial providers
- Not only NGIs can be members of EGI, but also other types of partners
M. Jouvin expresses his concern that EGI, like EU-T0, relies on the control of resources it does not actually own. Y. Legre explains that there has to be a common strategy and joint work to consolidate the e-infrastructures in Europe; otherwise there is a risk that funding agencies won't invest in the different projects. There should be a national programme for e-infrastructures common to all countries, following a model similar to GEANT.
On a comment by O. Smirnova, Y. Legre explains that communities other than HEP are growing and have different needs, and EGI has to make sure these needs are also covered.
A question is asked for more details on the e-Infrastructure Commons proposal. Y. Legre explains that some proposals have already been made and that for others there will be a consultation with stakeholders such as WLCG.
T. Wildish asks why Y. Legre believes there is a lack of trust, as mentioned in his presentation. Y. Legre thinks people are not talking enough to each other. Moreover, EGI was originally set up for HEP and then started to look at other communities, which may have led to some frustration because it didn't have the resources to take care of all of them.
M. Jouvin explains that within WLCG there are fears that EGI won't be able to continue providing the services it used to provide if funding is lacking. Otherwise he believes WLCG and EGI have a very good collaboration in the areas they have in common.
Y. Legre explains that WLCG could decide to become a member of the EGI.eu council, and he would be very pleased if this happened.
T. Bell adds that we are trying to avoid the development of in-house solutions (or project-funded solutions that need to be maintained by us once the project comes to an end).
Identity Federation - R. Wartel
- Enable people to use their home credentials to do things (submit jobs without the need for end-user certificates, access web portals, etc.)
- Not build our own federation: build on existing ones (eduGAIN, in which many NRENs participate)
- A lot of technical aspects to becoming a member of eduGAIN (many completed, still a few to be done)
- Web access now working (CERN SSO)
- A pilot project for WLCG ongoing (no progress since the last GDB meeting)
- Trust and Policy issues
- Experiments verify people's identity and this is CRITICAL for them; with eduGAIN there is no trust/guarantee about the identities used
- eduGAIN still has a lot of work to do on operational security (it opens the gate to the world)
- Two main Trust and Policy issues
- Operational Security
- Privacy and protection of personal data
- WLCG AUP and data protection policy must be reviewed
- user consent is no longer sufficient as a basis for using personal data
- Coordinated effort among the different linked projects (SCI, Sirtfi, FIM4R...)
- GEANT has a Data Protection Code of Conduct that has been endorsed by several projects. WLCG could endorse it too!
- World is changing and we need to adapt our policies and services
T. Bell explains that Twitter login is currently used at CERN, which is another example of not very secure identity trust: the fact that you have successfully logged in with Twitter doesn't mean you are authorised to use a given service.
M. Dimou comments that the authentication of experiment users as it is done today with VOMRS would require substantial changes to move to something like eduGAIN. She also comments that US and EU legislation are very different. M. Litmaath adds that this may lead to incompatibilities that could have a very big impact on WLCG.
M. Alandes asks about the CERN Data Protection Policy and WLCG. S. Lueders explains that it affects all data stored at CERN and that an assessment will be done. In principle it should be aligned with the other EU policies and codes of conduct mentioned during the presentation.
O. Keeble asks about the IOTA CA: do we have one, and how can we convince sites to trust it? R. Wartel explains that there is currently one in CERN IT for testing. It is not yet recognised by EGI and WLCG, but that should happen in the next year. The IOTA profile is defined by IGTF; the plan is to have the CA accredited by IGTF and endorsed by EGI and WLCG.
O. Smirnova asks how eduGAIN will be integrated with computing services; she believes this will be more complicated for sys admins. R. Wartel explains that this will be much easier for the users, and that the technical aspects will of course have to be understood. S. Lueders adds that this makes service management easier too, since there will be a generic way to define who is authorised to use a service. M. Litmaath adds that the migration is not going to be easy, but we should start moving towards this approach.
Actions in Progress - M. Alandes
Ops Coordination Report
- Lack of ARGUS support now in a critical state, with zero effort from SWITCH and issues being reported
- T0 news
- Decommissioning of lxplus5 and lxbatch5. Feedback to be sent by 14th September.
- Discussions about decommissioning the AFS UI
- CVMFS server and clients updates
- FTS 2 decommissioned in August without problems
- Experiments: smooth operation over the summer
- New VOMS server certificates: sites will start failing SAM tests on 15th September if the new VOMS servers are not properly configured (an illustrative trust configuration is sketched after this list)
- A new Condor-G based SAM will be deployed on 1st October
- Multicore TF extended to cover the passing of parameters to the batch system
- Network and Transfer Metrics WG kick-off meeting held, with new actions defined. More details will be presented at the next GDB.
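For sites checking their setup, VOMS server trust is typically expressed in .lsc files under /etc/grid-security/vomsdir/<vo>/, one per server: the first line is the subject DN of the server's host certificate, the second the issuer CA DN. A minimal sketch for a hypothetical new server (the DNs are illustrative, not the real CERN ones):

  /etc/grid-security/vomsdir/myvo/voms-new.example.org.lsc:
    /DC=org/DC=example/OU=computers/CN=voms-new.example.org
    /DC=org/DC=example/CN=Example Certification Authority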
Information System Status
- 3 releases in the last months, with minor fixes. Focus on glue-validator improvements. New version of ginfo with functionality similar to lcg-info/lcg-infosites (an example query is sketched after this list).
- New GLUE 2.1 extension including Clouds and GPUs.
- BDIIs up to date in terms of versions. Number of endpoints stable over the past months.
- SW tags cleaning has considerably reduced the GLUE data size in the BDII.
- GLUE 2 validation now automated and steered by EGI. Some issues to be followed up, e.g. the 444444 WaitingJobs problem.
- Storage info provider improvements. Storage information validated for ATLAS: SRM and BDII numbers agree.
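As an example of the kind of query the GLUE 2 tooling makes possible, placeholder values like 444444 can be spotted directly with ldapsearch against a top-level BDII (the host name is just an example; port 2170 and base o=glue are the standard GLUE 2 settings):

  # List the waiting-jobs attribute of all GLUE 2 computing shares
  ldapsearch -x -LLL -h lcg-bdii.cern.ch -p 2170 -b o=glue \
    '(objectClass=GLUE2ComputingShare)' GLUE2ComputingShareWaitingJobs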
M. Jouvin asks whether there is any news on the adoption of GLUE 2 by OSG. M. Litmaath answers that there are no plans for the time being. A. Girolamo mentions that the experiments have no motivation to move away from GLUE 1. Some people give arguments in favour of GLUE 2 as a better schema, developed to overcome the constraints of GLUE 1; moreover, Cloud and GPU information will only be available in GLUE 2. M. Alandes clarifies that the validation activities are done for GLUE 2 but also improve GLUE 1, as the underlying software is the same.
M. Jouvin asks about the status of the GSR/AGIS prototypes. M. Alandes explains that there was a prototype for CMS, but CMS management didn't show any interest in it for the time being. T. Wildish explains that to the PhEDEx developers this looked very interesting and that he hopes CMS management reconsiders, as for the time being it is not among their priorities.
Cloud pre-GDB Summary - M. Jouvin
- The WG explores the possibility of using clouds as a replacement for CEs. Some progress made in different areas:
- Machine/Job Features TF (an illustrative worker-node snippet follows this list)
- Accounting done by EGI Cloud in APEL
- vcycle and OpenStack fair-share scheduler initiatives
- Cloud technology seen as a more pervasive technology; no MW development required by the community
- The idea is to foster ongoing work studying how clouds could replace CEs: realistic milestones to achieve this in shared clouds
- Achieve something similar to fair share in batch systems: deploy vcycle (already at a few UK sites) and the OpenStack fair-share scheduler at some sites
- Work on benchmarking VMs presented.
- Some features missing in APEL for cloud accounting
- Security issues to be understood
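For reference, the Machine/Job Features interface exposes key-value files through two environment variables; a minimal sketch of a worker node or VM reading them (the file names follow the draft MJF specification and should be treated as illustrative):

  # $MACHINEFEATURES and $JOBFEATURES point to directories of one-value-per-file keys
  echo "Machine power (HS06):  $(cat $MACHINEFEATURES/hs06)"
  echo "Total CPUs:            $(cat $MACHINEFEATURES/total_cpu)"
  echo "Job wall-clock limit:  $(cat $JOBFEATURES/wall_limit_secs) s"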
Availability Reports
- T0-T1 Summary
- T1 History report
- All sites reports
- VOs reports
- MB monthly reports
- Different changes introduced: experiments define the topology, site reports, ...
- Changes to the experiments' reports are presented (algorithm used for availability calculations, etc.)
- Changes in the T1 history report: sites instead of federations
- T2 federation reports within the VO reports: values are for the whole site instead of the whole VO! The per-VO information in the BDII is not correct; sys admins should check this!
- From October, start using the SAM 3 reports as the primary ones
- Change the MB report to the T0-T1 summary
There are some clarifications on the ATLAS algorithm and how availability is calculated.
There is a question on whether HammerCloud results will be used. P. Saiz says there are no plans to use them.
I. Bird explains that it is not necessary to validate the published number of cores and HEPSPEC; it is enough to look at the pledges and the accounting info. The number of cores is not needed; HEPSPEC is more important.
Data Management Discussion - O. Keeble, F. Furano and W. Bhimji
- xrootd4 (IPv6, major improvements): do experiments need it deployed soon? How should deployment be organised? Is the integration with ROOT available? What are the differences between the old and new client libraries?
- T. Wildish: IPv6 xrootd support is required by CMS by the end of next year
- An xrootd4 client talking to an xrootd3 server will work as long as the new features are not used. An xrootd3 client will be able to talk to an xrootd4 server, since xrootd4 is a superset of xrootd3.
- xrootd4 and EPEL: plans?
- M. Ellert will provide this asap.
- O. Keeble asks whether this will be a new xrootd4 package or an upgrade of xrootd3 in EPEL. M. Ellert should be contacted; the xrootd team prefers to have it as xrootd4 as well.
- Move away from SRM: move to which other protocols? Are the metadata requirements clear?
- Clarification of "move to other protocols": F. Furano explains it means using something else. We currently see high usage of SRM.
- T. Wildish explains that deletion campaigns in CMS are done locally by the sites; SRM is not used, this is done using PhEDEx agents. The end of Run 2 may be a feasible timeline to move to other protocols, but this is not clear yet.
- F. Charpentier says that LHCb is trying to move to using xrootd. SRM is still used for massive deletions.
- A. Girolamo explains that the new ATLAS Data Management system is based on plugins: whatever the site offers, ATLAS is able to use it. Most ATLAS sites support xrootd and WebDAV. ATLAS would like to start using WebDAV for massive deletions since it is more performant, but SRM can still be used where WebDAV is not available (a deletion sketch follows this list). ATLAS has requested sites in many places (GDB, WLCG Ops) to deploy xrootd and WebDAV; this was the communication channel. Right now the endpoints are taken from GOCDB.
- Interface rationalisation:
- The FTS team has been asked to support deletion. It could use GridFTP, but not only.
- Is a Data WG needed to coordinate all this?
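As a sketch of what protocol-independent deletion looks like from the client side, the gfal2 Python bindings pick the plugin from the URL scheme, so WebDAV and SRM deletions share the same code (the endpoint and paths below are made up):

  import gfal2

  ctx = gfal2.creat_context()  # sic: the gfal2 API spells it "creat"
  # davs:// selects the WebDAV/HTTPS plugin; an srm:// URL would go through SRM
  for surl in ["davs://se.example.org/vo/data/file1",
               "davs://se.example.org/vo/data/file2"]:
      ctx.unlink(surl)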
Multicore: Dynamic Partitioning with Condor at UVic - F. Berghaus
- Cloud Scheduler manages VMs on clouds
- User submits HTCondor jobs
- 17 clouds for ATLAS are deployed
- Uses CVMFS for OS and project SW
- Cloud-init and puppet contextualise images on boot
- Dynamic batch slots
- Uses condor groups to prioritise job types (an illustrative configuration follows this list)
- Shoal tool: dynamic Squid discovery
- EMI Dynamic Federation run by Univ. Victoria, with SEs from Canada and Australia
- Dynamically allocating resources for single- and multi-core job requirements; planning to test high-memory jobs
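A minimal sketch of how dynamic partitioning and job-type priorities are typically expressed in HTCondor configuration (the group names and quotas are illustrative, not the actual UVic values):

  # One partitionable slot owning the whole machine; dynamic slots are
  # carved out of it to match each job request (1 core, 8 cores, ...)
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = cpus=100%, memory=100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE
  # Accounting groups let the negotiator prioritise job types
  GROUP_NAMES = group_prod, group_analysis
  GROUP_QUOTA_DYNAMIC_group_prod = 0.8
  GROUP_QUOTA_DYNAMIC_group_analysis = 0.2
  GROUP_ACCEPT_SURPLUS = True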
Some questions from the audience, with explanations from F. Berghaus:
- The data federation is distinct from the cloud sites. The nearest SE is used when data is requested.
- This is implemented for production jobs but could be adapted for analysis jobs by reusing the same VM if several analyses have the same requirements.
- The data federation uses HTTP because they want to support other, non-HEP communities.
- This could in fact be reused at other sites. There was some criticism of the fact that Condor is used, but Condor is indeed very good at dynamic partitioning.
- It is possible to attach plain condor nodes to the cloud.
--
MariaALANDESPRADILLO - 10 Sep 20