
Summary of GDB meeting, September 10, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272777/

Introduction - M. Jouvin

As usual, looking for volunteers to take notes

Next GDBs

  • Next ones at CERN on the second Wednesday of the month
  • Next one outside CERN in March 2015. Please, let Michel know if you are interested in organising it

Summary of WLCG Workshop

Actions in Progress:

  • Migration to GFAL2/FTS3 clients on October 1st
  • Volunteering sites to provide dual stack IPv6 endpoints
  • Batch accounting assessment in progress
  • SL/CentOS: no news

O. Smirnova asks about the status of Batch accounting assessment regarding handling of failed jobs

  • J. Gordon explains that there are still ongoing discussions on what should be considered a failed job: failed jobs are currently not treated the same way by all batch systems with respect to accounting

O. Smirnova asks whether virtual WNs should offer IPv6 endpoints

  • M. Jouvin suggests contacting the IPv6 WG for details but is not aware of such a requirement
    • First goal is to assess that enabling IPv6 on a service endpoint doesn't impact IPv4

EGI Role in the Evolving DCI Landscape - Y. Legre

Evolution is marked by fragmentation

  • Fragmentation of infrastructures: technologies, provisioning, policies, access mechanisms
  • Community/RI fragmentation
  • Lack of collaboration and trust between RIs and e-Infras

EU Vision for 2020 as drafted by the e-IRG in 2012

  • Single point of contact to provision all ICT services required (e-Infra commons)
  • Ability for users to connect to the best expert consultancy (services)
  • Ability to reuse science output (open data)

Strategy for e-Infra commons

  • Federated services
  • Joint capacity planning
  • Open and integrated resource provisioning and access, e.g. OpenAIRE

EGI mission in this context: be a key provider of e-Infra commons and knowledge/expertise

  • Challenges due to fragmentation
  • Address needs both of individual users and communities
  • Joint development of solutions with communities
  • Primary focus remains data

EGI Strategic actions

  • New governance model including stakeholders, not only EGI
  • New sustainability and business model: who funds the effort
  • User involvement
  • e4ERA proposal involving all major DCI actors and ESFRIs in Europe

Discussion

M. Jouvin expresses his concern that EGI relies on controlling resources it does not actually own, and that the funding agencies are in the process of taking greater control of them with EU-T0

  • Yannick: there has to be a common strategy and we need to work together to consolidate the e-infrastructures in Europe; otherwise there is a risk that funding agencies won't invest in the different projects.
  • Yannick: there have to be national programs for e-infrastructures coordinated at the European level, a model similar to GEANT.
  • Michel: there is a risk that EGI has less influence on the way the resources are accessed and the type of services deployed, if this is more directly controlled by the communities

On a comment by O. Smirnova, Y. Legre explains that communities other than HEP are growing and have different needs, and EGI has to make sure these needs are also covered.

Question about more details on the e-infrastructure commons proposal.

  • Yannick explains that some proposals have already been made, and for some others there will be a consultation with people like WLCG, etc.

T. Wildish asks why Y. Legre believes there is a lack of trust, as mentioned in his presentation

  • Yannick thinks people are not talking to each other enough.
  • Moreover, EGI was originally set up for HEP and then started to look at other communities, which may have led to some frustration because it didn't have the resources to take care of all of them.

M. Jouvin has the feeling that WLCG and EGI have had a good collaboration so far, focused mainly on operations. There is a risk that reduced EGI funding for operations will weaken it...

Yannick reports that it is difficult to identify who can be an official WLCG representative

  • WLCG could decide to become a member of the EGI.eu council, as CERN is not a WLCG representative in this body
    • But the council is made of legal entities...

T. Bell about EGI development plans: we are trying to avoid the development of in-house solutions (or project-funded solutions)

  • Once the project comes to an end, they have to be maintained by us

Identity Federation - R. Wartel

Goals reminder

  • Enable people to use their home organization credentials: grid jobs, web portals, other X509 services
  • Allow the use of home credentials for other services: Vidyo, Indico...

Build on existing federations and infrastructures: eduGAIN

  • CERN involved via SWITCHaai
  • CERN eduGAIN integration into CERN SSO (web apps) now done
    • Included a lot of legal work
  • Next steps: enable eduGAIN on core services like Indico, OpenStack...; discuss with the experiments enabling federated identity access to several experiment core services

Grid service integration through a portal, using EMI STS to translate federated identity to X509 (WLCG pilot)

  • No concrete progress yet

Policy and trust issues: there is no trust in, or guarantee about, the identities issued

  • Our current security model relies on authorization: no identity vetting during authentication
    • Identity vetting is done during registration: registration by experiments is CRITICAL
  • eduGAIN is currently an operational security wild west...!
    • No consistency between IdPs on operational rules
    • In fact, inter-IdP eduGAIN authentication has not been widely used in production so far... but this is a moving target
  • How can we open access to our services without trusted identities and a trust framework? Still an open question...
  • 2 main issues
    • Operational security: incident handling, coordinated incident response, exchange of confidential information between actors
  • Privacy and traceability: must comply with EU policy on privacy and personal-data protection; "user consent" is no longer sufficient; must find the right balance between logging everything (current practice) and nothing (liability issues of collecting data); the WLCG AUP and data protection policy must be reviewed to comply with the new EU policies

Several ongoing efforts

  • SCI: an EUGridPMA initiative, with many participants from different e-Infras
  • Sirtfi: largely based on the SCI work; SCI members + Internet2 + InCommon + eduGAIN + REFEDS
  • FIM4R
  • AARC: H2020 proposal on AAI
  • All these projects try to address the 2 main issues
    • Operational Security: propose an incident response procedure, a trust framework and a minimum set of (verified) requirements on IdPs and SPs
    • Privacy/traceability: propose a trust framework and guidelines, compliant with EU/US (conflicting!) policies and allowing safe operations, ensure convergence/compatibility between participants

GEANT's Data Protection Code of Conduct

  • Encourage IdPs to release attributes
  • Push eduGAIN SPs to enforce basic data protection practices
  • Proposal that WLCG should endorse this CoC
    • Already endorsed by several communities

Conclusions

  • Lots of trust/policy issues: not only technical
  • Web use case almost sorted out: required a lot of work
  • Need to adapt to the new trust and policy landscape

Maria: VOMS future? Probably difficult to adapt to federated identities...

  • VOMS is the source of attributes today. This is the critical piece of our current AAI infrastructure (illustrated below)
  • In the future, we will have to see whether a similar service more integrated with the IdP infrastructure emerges...
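
For context, this is the attribute flow VOMS provides today: a user obtains a proxy with VO membership and role attributes (FQANs) embedded, which services then use for authorization. A minimal illustration (the VO name and role below are placeholders, not from the meeting):

    # Obtain a VOMS proxy carrying a role attribute, then inspect the FQANs
    voms-proxy-init --voms myvo:/myvo/Role=production
    voms-proxy-info --fqan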

Maarten: the EU/US legislation incompatibility is a major threat for federated identity; WLCG should take care of this topic

Oliver: availability of an IOTA-profile-compliant CA? How to convince sites to trust it?

  • One prototype set up at CERN
  • This prototype IOTA CA is not yet IGTF accredited
    • IOTA profile also has to be endorsed by WLCG (and EGI): planned next year

M. Alandes asks about CERN Data Protection Policy and WLCG

  • S. Lueders explains it affects all data stored at CERN.
  • An assessment will be done: in principle it should be aligned with all the other EU policies and codes of conduct mentioned during the presentation.

O. Smirnova asks how eduGAIN will be integrated with computing services. She believes this will be more complicated for sys admins.

  • R. Wartel explains this will be much easier for the users; the technical aspects will of course have to be understood.
  • S. Lueders adds that this makes service management easier too since there will be a generic way to define who is authorised to use the service.

Transition issue: how to manage users with several accounts

  • Probably no generic solution, but per-service ones

Actions in Progress

Dates for next meetings defined until the end of the year: see slides or Indico

ARGUS support: critical situation

  • No clear commitment from different partners to take over from SWITCH
    • SWITCH has now ZERO effort on ARGUS
  • Mid-August proposal to create an ARGUS collaboration sent to INFN, NIKHEF, CESNET and Helsinki University: positive reactions but no commitment
  • Urgency to find a solution: there is a problem in the last (unofficial) ARGUS rpm, used at CERN, which was intended to fix problems in the last official one...
  • Follow-up at next MB

T0

  • SL5 decommissioning on Sept. 30th
    • Deadline for feedback: Sept. 14
  • Plan to decommission the AFS UI: work in progress to understand the use cases
  • FTS2 decommissioned
  • CVMFS stratum 0 upgraded to 2.1 for all VOs

Experiments

  • Smooth operations during the summer for all VOs
  • CMS planning AAA tests at a larger scale soon: sites may see increased activity
  • ATLAS Rucio commissioning ongoing: may also create additional load

Concluded TFs

  • Tracking Tools Evolution
  • FTS3 integration: now in production for all VOs

SHA2: hard deadline end of November

  • Failing sites have been ticketed (44), but some are already OK
  • SAM tests will start using the new VOMS servers on Sept. 15
  • Experiment workflow assessment to be done

WMS decommissioning

  • SAM will start to use Condor submission on Oct. 1st
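
For illustration, direct submission to a CE via HTCondor looks roughly like the sketch below. This is a hypothetical minimal submit file, not SAM's actual configuration: the CE host name and probe script are placeholders, and the real SAM setup may differ.

    # Hypothetical HTCondor grid-universe submit file targeting a remote CE
    universe          = grid
    grid_resource     = condor ce.example.org ce.example.org:9619
    executable        = sam_probe.sh
    output            = probe.out
    error             = probe.err
    log               = probe.log
    use_x509userproxy = true
    queue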

Multicore deployment: extended to address passing of parameters to batch systems
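
As an example of such parameter passing, a multicore request can be expressed in a CREAM JDL with attributes like the following. This is a sketch only: which attributes a given CE actually honours depends on the site's batch-system integration, and the values are illustrative.

    # Hedged CREAM JDL sketch requesting 8 cores on a single node
    [
      Executable     = "run_multicore.sh";
      CpuNumber      = 8;
      SMPGranularity = 8;
      WholeNodes     = false;
    ]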

Network and Transfer metrics: kickoff last Monday

  • Still 9 sites with incorrect perfSONAR version
  • New US funded project: PuNDIT
    • Coordinated by Shawn
  • More detailed report planned at next GDB

Information System Update - M. Alandes

BDII: 3 releases since April (last update)

  • Minor fixes, mostly glue-validator fixes

New beta version of ginfo, the lcg-infosites replacement for GLUE 2
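
For reference, the GLUE 2 information that ginfo consumes can also be queried directly from a top-level BDII with ldapsearch; GLUE 2 data is published on port 2170 under the o=glue base (the host name below is just an example top BDII):

    # Query GLUE 2 computing shares from a top BDII (example host)
    ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
        '(objectClass=GLUE2ComputingShare)' \
        GLUE2ComputingShareWaitingJobs GLUE2ComputingShareRunningJobs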

GLUE2.1: GLUE2 extension being discussed

  • New objects and attributes to better describe cloud resources and GPU resources
  • Intended to be backward compatible with GLUE 2.0
  • Will require a new version of the glue-schema package: not before the end of the year
    • WLCG not really interested by these extensions

BDII overload incident in August: the LDAP configuration was tuned to better handle it

  • To be released soon for deployment at all top BDIIs

BDII deployment status stable

  • All EGI + OSG sites
  • Around 400 BDII instances: 323 site BDIIs + 79 top BDIIs

SW tags clean-up campaign: from 195000 one year ago to 66000 now!

  • Reduced the size of the BDII data by more than 20%

glue-validator runs as a Nagios probe: automatic ticketing in case of error

  • Now steered by EGI
  • More than 300 sites ticketed, 290 fixed
  • Now: mostly info messages; errors/warnings at a very low level
    • Mostly default-value problems (e.g. 444444 waiting jobs), the default published when the real value is undefined (see the query sketch below)
  • Some sites have had problems for a very long period
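
Sites can spot the default-value problem themselves with a query like this sketch (same example top BDII host as above; 444444 is the placeholder mentioned in the report for an undefined number of waiting jobs):

    # Find shares still publishing the 444444 placeholder value
    ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
        '(GLUE2ComputingShareWaitingJobs=444444)' GLUE2ShareID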

SRM vs. BDII consistency

  • ATLAS: similar check to what was done for LHCb, integrated in dashboard
  • Numbers well aligned, main inconsistencies related to known issues
    • Means that BDII info can be used by VOs as an alternative to SRM

GLUE2 Storage information providers fixes

  • CASTOR/EOS: since August
  • DPM: issues identified, fix expected in October
  • StoRM: issues fixed
  • dCache: no known issue

OSG and GLUE2: no change, no plan until experiments require them

  • Alessandro: experiments need an incentive to adopt GLUE2, as GLUE1 has been mastered and its limitations worked around to provide what is needed. New resource types (clouds, storage HTTP endpoints...) present in the BDII could be such an incentive, as adding them to experiment-specific information system solutions is a lot of work.
  • The new CE may be an opportunity to make progress, as there is currently no information provider for it: Alessandro is in contact with OSG colleagues to push them to write a GLUE2 provider
    • This would help a lot, as OSG storage is mainly dCache and dCache already has a GLUE2 provider

Alessandro (ATLAS): BDII remains a critical source of information

  • Importance of the validation work done: experiments can now be confident that they can rely on the BDII information
    • SRM-BDII consistency assessment is a good example: experiments can now be confident in using BDII rather than SRM
  • Maria: the validation work was done only for GLUE2, but it also benefits GLUE1 as the providers are generally the same
    • Nevertheless, GLUE2 validation may be one incentive to migrate

AGIS as a generic experiment IS

  • Maria: no real activity since a prototype was delivered to CMS
  • T. Wildish: evaluated positively by CMS (by several CMS service development groups, including PhEDEx), but there are higher priorities to address in the short term to be ready for Run 2
    • Still a plan to use it

SL/CentOS Discussions

(coordination problem: see slides attached to the agenda but not presented at the GDB)

CERN started to deploy some CERN CentOS7 machines for evaluation

  • Currently only the infrastructure team
  • Using the CentOS rpms
  • Good collaboration with CentOS people: CERN contributed its experience with Koji to help set up the build system

Cloud pre-GDB Summary - M. Jouvin

  • The WG explores the possibility of using clouds as a replacement for CEs. Progress has been made in different areas:
    • Machine/Job Features TF (a reading sketch follows this list)
    • Accounting done by EGI Cloud in APEL
    • Vcycle and OpenStack fair-share scheduler initiatives
  • Cloud technology is seen as a more pervasive technology; no MW development is required by the community
  • The idea is to foster ongoing work studying how clouds could replace CEs: realistic milestones to achieve this in shared clouds
    • Achieve something similar to fair shares in batch systems: deploy Vcycle (already at a few UK sites) and the OpenStack fair-share scheduler at some sites
  • Work on benchmarking of VMs was presented
  • Some features missing in APEL for cloud accounting
  • Security issues to be understood
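
As a rough illustration of the Machine/Job Features mechanism mentioned above (a sketch only: the key names follow the MJF draft specification and their availability varies per site), a payload can read per-machine values from the directory pointed to by $MACHINEFEATURES:

    # Minimal sketch: read Machine/Job Features keys if the site publishes them
    # (key names hs06, jobslots, shutdowntime are from the MJF draft spec)
    if [ -n "$MACHINEFEATURES" ] && [ -d "$MACHINEFEATURES" ]; then
        for key in hs06 jobslots shutdowntime; do
            [ -r "$MACHINEFEATURES/$key" ] && echo "$key = $(cat "$MACHINEFEATURES/$key")"
        done
    fi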
Availability Reports

  • Report types
    • T0-T1 summary
    • T1 history report
    • All-sites reports
    • VO reports
    • MB monthly reports
  • Different changes introduced: experiments define the topology, site reports...
  • Changes in the experiment reports are presented (algorithm used for availability calculations, etc.)
  • Changes in the T1 history report: sites instead of federations
  • T2 federation reports in the VO reports: values are for the whole site instead of the whole VO! Per-VO information is not correct in the BDII: sys admins should check this!
  • From October, SAM 3 reports will be used as the primary ones
  • The MB report changes to the T0-T1 summary

There are some clarifications on the ATLAS algorithm and how availability is calculated.

There is a question on whether HammerCloud results will be used. P. Saiz says there are no plans to use this.

I. Bird explains that it is not necessary to validate the published number of cores and HEPSPEC: it is enough to look at the pledges and accounting information. The number of cores is not needed; HEPSPEC is more important.

Data Management Discussion - O. Keeble, F. Furano and W. Bhimji

  • xrootd4 (IPv6, major improvements): do experiments need it deployed soon? How to organise the deployment? Is the integration with ROOT available? What is the difference between the old and new client libraries?
    • T. Wildish: IPv6 xrootd support is required by CMS by the end of next year
    • An xrootd4 client talking to an xrootd3 server will work as long as the new features are not used. An xrootd3 client will be able to talk to an xrootd4 server, since xrootd4 is a superset of xrootd3.
  • xrootd4 and EPEL: plans?
    • M. Ellert will provide this ASAP.
    • O. Keeble asks whether this will be a new xrootd4 package or an upgrade of xrootd3 in EPEL. M. Ellert should be contacted. The xrootd team prefers to have it as xrootd4 as well.
  • Move away from SRM: move to other protocols? Are the metadata requirements clear?
    • Clarification of "move to other protocols": F. Furano explains it means using something else. We currently see high usage of SRM.
    • T. Wildish explains that deletion campaigns in CMS are done locally by the sites; SRM is not used. This is done using PhEDEx agents. The end of Run 2 may be a feasible timeline to move to other protocols, but this is not clear yet.
    • Ph. Charpentier says that LHCb is trying to move to using xrootd. SRM is still used for massive deletions.
    • A. Di Girolamo explains that the new ATLAS Data Management system is based on plugins: whatever the site offers, ATLAS is able to use it. Most ATLAS sites support xrootd and WebDAV. ATLAS would like to start using WebDAV for massive deletions since it is more performant (see the sketch after this list), but SRM can still be used where WebDAV is not available. ATLAS has asked sites to deploy xrootd and WebDAV in many places (GDB, WLCG Ops): this was the communication channel. Right now endpoints are taken from the GOCDB.
  • Interface rationalisation:
    • The FTS team has been asked to support deletion. It could use gridftp, but not only.
  • Is a Data WG needed to coordinate all this?
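
As a sketch of what SRM-less deletion over WebDAV looks like (the endpoint and path are hypothetical; real deployments would more likely use a client such as davix or gfal2, which carry the same HTTP DELETE semantics):

    # Delete a file over WebDAV, authenticating with a grid proxy
    curl --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" \
         --capath /etc/grid-security/certificates \
         -X DELETE "https://se.example.org/webdav/atlas/somefile.root"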

Multicore: Dynamic Partitioning with Condor at UVic - F. Berghaus

  • Cloud Scheduler manages VMs on clouds
  • User submits HTCondor jobs
  • 17 clouds for ATLAS are deployed
  • Uses CVMFS for OS and project SW
  • Cloud-init and puppet contextualise images on boot
  • Dynamic batch slots
  • Uses HTCondor groups to prioritise job types (see the configuration sketch after this list)
  • Shoal tool: dynamic Squid discovery
  • EMI Dynamic Federation run by the University of Victoria, with SEs from Canada and Australia
  • Dynamically allocating resources for single- and multi-core job requirements; planning to test high-memory jobs
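
A rough sketch of the HTCondor configuration behind such dynamic partitioning (illustrative names and quotas, not the actual UVic setup): a single partitionable slot is carved up per job request, and accounting groups steer priorities between job types.

    # Hedged HTCondor configuration sketch for dynamic slot partitioning
    NUM_SLOTS                 = 1
    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = cpus=100%, ram=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # Accounting groups to prioritise job types (names and quotas illustrative)
    GROUP_NAMES                          = group_multicore, group_singlecore
    GROUP_QUOTA_DYNAMIC_group_multicore  = 0.7
    GROUP_QUOTA_DYNAMIC_group_singlecore = 0.3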

Some questions from the audience with explanations from F. Berghaus:

The data federation is separate from the cloud sites: the nearest SE is used when data is requested.

This is implemented for production jobs but could be adapted for analysis jobs by reusing the same VM if several analyses have the same requirements.

The data federation uses HTTP because they want to support other, non-HEP communities.

This could in fact be reused at other sites. There was some criticism of the fact that Condor is used, but Condor is indeed very good at dynamic partitioning.

It is possible to attach plain condor nodes to the cloud.

-- MariaALANDESPRADILLO - 10 Sep 2014
