Summary of GDB meeting, April 9, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272620/

Introduction - M. Jouvin

See slides.

Discussion

OpenSSL vulnerability

  • not only certificates, but also passwords or other private information could have been obtained through vulnerable services
  • CERN SSO was not affected
  • recommendations regarding end user certificates will follow
  • some CAs may have been affected through their web portals
  • use tools to identify affected versions (see the illustrative check after this list)
    • the OpenSSL project released the fix in 1.0.1g, but RHEL/SL ship it backported in 1.0.1e-16.el6_5.7!
  • security contacts of all sites should have received broadcasts with the necessary details
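
For illustration only (not part of the broadcast), a site could compare its installed OpenSSL package against the fixed releases mentioned above; the version strings and the comparison logic below are simplified assumptions:

    # Minimal sketch: check the installed openssl RPM against releases known to
    # contain the Heartbleed fix. The string comparison is deliberately naive; a
    # real check should use proper RPM version comparison or the distribution errata.
    import subprocess

    FIXED_RELEASES = {"1.0.1g", "1.0.1e-16.el6_5.7"}  # upstream fix and RHEL/SL backport

    def installed_openssl():
        """Return 'version-release' of the installed openssl package."""
        out = subprocess.check_output(
            ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", "openssl"])
        return out.decode().strip()

    if __name__ == "__main__":
        version = installed_openssl()
        ok = version in FIXED_RELEASES
        print("openssl %s: %s" % (version, "contains fix" if ok else "verify against your errata"))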


Identity Federation Update - R. Wartel

See previous discussions: decision taken to concentrate on web applications as a first stage

A pilot started at CERN

  • CERN SSO allows users to authenticate against their home institute/federation; attributes are passed to the application as environment variables
  • Applications authorize users via e-groups based on those environment variables (typically the email), as in the sketch after this list
  • Work is well in progress
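
As a purely illustrative sketch of this pattern (the attribute name ADFS_EMAIL and the e-group contents are assumptions, not the actual CERN SSO configuration), an application could authorize on the attributes exposed by the SSO layer like this:

    # Hypothetical WSGI application behind the CERN SSO pilot: the SSO layer is
    # assumed to have placed federation attributes into the request environment,
    # and authorization is a membership check against an e-group-derived list.
    ALLOWED_EMAILS = {"alice@example.org", "bob@example.org"}  # would be synced from an e-group

    def application(environ, start_response):
        email = environ.get("ADFS_EMAIL", "").lower()  # attribute name is illustrative
        if email in ALLOWED_EMAILS:
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [("Welcome %s\n" % email).encode()]
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Not a member of the required e-group\n"]

    if __name__ == "__main__":
        # Local test only; in production the SSO module injects the attributes.
        from wsgiref.simple_server import make_server
        make_server("localhost", 8080, application).serve_forever()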

Phase 2 is to build a WLCG portal based on this prototype

  • The portal will receive SAML assertions from the federation
  • The portal will use an STS to generate X509 certs on the fly via the IOTA CA and add the appropriate VOMS extensions (see the illustrative sketch after this list)
  • The generated X509 credentials will be used to connect to X509-based services
  • Many open questions: how to choose the VOMS instance? how to register new DNs in VOMS?
  • Still at early stage of implementation
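
A purely illustrative sketch of the intended credential-translation chain; every function below is a stub with an invented name, not an existing API:

    # Stubbed illustration of the portal flow: SAML assertion -> STS + IOTA CA ->
    # short-lived X509 credential -> VOMS extensions -> existing X509-based service.
    def sts_issue_certificate(saml_assertion):
        """Stub: the STS validates the assertion and has the IOTA CA issue a certificate."""
        return "dummy-cert", "dummy-key"

    def add_voms_extensions(cert, key, vo):
        """Stub: create a VOMS proxy for the chosen VO (which VOMS instance to use is an open question)."""
        return "dummy-voms-proxy-for-" + vo

    def call_x509_service(proxy, url):
        """Stub: authenticate to the existing grid service with the generated credential."""
        return "result from %s using %s" % (url, proxy)

    def portal_request(saml_assertion, vo, service_url):
        cert, key = sts_issue_certificate(saml_assertion)
        proxy = add_voms_extensions(cert, key, vo)
        return call_x509_service(proxy, service_url)

    if __name__ == "__main__":
        print(portal_request("<saml:Assertion/>", "atlas", "https://some-service.example.org"))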

The IGTF IOTA profile is central to this new approach: it requires vetting the user identity in VOMS (already done anyway) rather than at the CA level

A common policy framework is critical but current eduGAIN policies are insufficient

  • Need to preserve what we have built over 10 years in terms of incident response
  • Bilateral agreements cannot scale
  • How to integrate the eduGAIN community into the work done on Security for Collaborating Infrastructures?
  • It takes time to identify the appropriate contacts and to set up an "official" discussion
  • Currently difficult to make more than demonstrations with eduGAIN

Operational considerations are also important, in particular security incident handling

  • So far, service providers do all the security incident handling, CAs have no operational roles
  • In the federated identity world, IdPs will need to have an operational role (user suspension)

Command line interface: to be addressed once the web portal work is done

  • Allow time for ECP to become more widely adopted
  • With an approach like CILogon, the web portal could help with the command line tools (modulo some sort of copy/paste or direct access to the browser cache)

Discussion

  • command line solution prospects?
    • web interface first, command line later
    • ECP may take a few years to become widespread; prospects look dim so far
  • totally invisible to grid services?
    • yes
  • IOTA identities less reliable?
    • identities need to be vetted elsewhere, e.g. already done in VOMS via CERN HR DB
    • a candidate user has rights according to VOMS
  • VOMS proxy available in browser?
    • possibly
    • forward it to command line tool via browser cache?
    • that is the CILogon approach, but ideally a browser should not be needed at all
  • stage 2 is like SARoNGS?
    • possible resemblance to other projects indeed
    • we are not trying to reinvent the wheel
  • phase 1: work has started on SSO, eduGAIN, STS, IOTA CA at CERN, portal
  • phase 2 candidates: monitoring service in ATLAS, FTS web portal (see talk in this GDB)
  • most IdPs already assist with incident response, but we need to assert that requirement
    • eduGAIN does not impose requirements
    • will be discussed in forums
    • need to raise the bar across participating projects


HEP SW Collaboration Kickoff Meeting - P. Mato

Major SW re-engineering needed for HEP: paradigm shift with new architectures, new types of resources

  • Need to attract people with the required skills and experience
  • Opportunity to share more SW with other scientific communities
  • data-intensive application challenge

Goal is to develop interoperable SW components: facilitate both package development and package usage/adoption

  • Different (layered) domains but every user/experiment should be able to pick what they need
  • Lesson learnt from history: common packages were generally not designed to be common... they started to solve a specific need, not as a management decision
    • e.g.: ROOT which was started as an experiment tool
    • AIDA WP2 effort also "generalized" existing packages for geometry and reconstruction
  • Several levels of interoperability from common data formats to component models
  • Modularity has a price: dependency management, potentially complex. Not a reason to avoid it: code duplication is worse...

Collaboration must be based on project independence

  • Should not enforce any particular SW process
  • Ownership should remain with developers
  • Developers should commit to a certain level of support
  • Collaboration must provide benefits to developers: coherency, integration builds/tests, distribution repositories

Collaboration will help to secure R&D funding from different sources (direct or partnership) for the collaborating projects

Conclusions

  • Development model: what will be the role of the LHC experiments?
  • Build the collaboration bottom-up: need to show the added value for developers, have a lightweight management/coordination structure
  • Software foundation was the preferred term: should learn from existing foundations
  • Invitations to groups and individuals to submit white papers (5 pages max) in the next 4 weeks: the participant mailing list will serve as the initial forum

Discussion

  • difficult to summarize the meeting
    • lots of discussion
    • goals not clear yet
    • experiments were worried about imposing structures and bureaucracy without benefits
    • the meeting went off track at times
    • some time is needed to digest the outcomes
  • there was goodwill to try adding value


Actions in Progress

EMI3 Update Status - C. Aiftimiei

Deadline: end of May

  • After that, sites still running unsupported services may start to be suspended
  • dCache 2.2.x supported until July

Significant progress in 1 month. April status (instances to be upgraded):

  • site BDII: 68
  • CREAM: 87
  • DPM: 61
  • sites with WN: 90

Discussion

  • next GDB will have an update


Ops Coord Report - M. Alandes Pradillo

Simone is now the new ATLAS ADC coordinator: Ops Coordination handed over to Andrea Sciaba and Maria Alandes Pradillo

CVMFS 2.0 end of life: it will no longer be possible to access the servers

GGUS: new release with a replacement for former Savannah-GGUS bridge (CMS)

Experiment news

  • ALICE: problems with job efficiency at KIT
    • Problems accessing data locally lead to failing over to remote replicas and thus overloading the firewall
    • Other VOs affected: side effect or common cause?
      • network path between WN and local SEs suspected
    • ALICE share limited to 1.5k jobs yesterday, but this did not improve the situation for ATLAS: to be followed up offline
  • ATLAS: Rucio migration ongoing
  • CMS: FTS2 decommissioning in progress, continuing multicore tests

glexec: still 16 sites with tickets open, will be followed up in MB

  • TF completion to be discussed

perfSONAR: 9 sites out of 111 missing

  • 2 sites not responding
  • 2 without appropriate HW
  • Still configuration issues at several sites
  • TF completion to be discussed

Multicore deployment

  • Currently only ATLAS running MC jobs, CMS to start soon
  • ATLAS is implementing a way to pass parameters to batch systems: the current mechanism (through BLAH) works out of the box only with SGE
    • Scripts developed at NIKHEF for Maui: to be tested at other sites
    • Needs further discussion

WMS decommissioning proceeding as planned: no issue

MW readiness: a list of volunteer sites being prepared for next MB (April 15)


WLCG IS Status - M. Alandes Pradillo

Very good shape

  • No known problem with BDII: no scheduled new release
    • BDII IPv6-ready
  • Still fixes and improvements to glue-validator

SW tags cleanup in progress: some sites are still publishing obsolete tags

Automatic GGUS ticket submission from glue-validator in case of errors (not for warnings or info)

  • GLUE2 obsolete entries: created by an old version of BDII and not cleaned up in EMI2. Fixed by upgrade to EMI3
  • The magic value of 444444 jobs now triggers a ticket (see the illustrative query after this list)
  • Promote some warnings to errors to increase the chance of getting them fixed?
  • Max CPU time checking campaign done for LHCb: most problems solved
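
For illustration, the placeholder value can be spotted with a plain LDAP query against a top-level BDII; the endpoint below is the usual CERN alias and the parsing is a simplified assumption:

    # Sketch: list CEs publishing the 444444 placeholder for GlueCEStateWaitingJobs.
    # Calls the ldapsearch CLI via subprocess to avoid extra Python dependencies.
    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"   # a common top-level BDII alias
    MAGIC = "444444"                        # value published when the real number is unknown

    def waiting_jobs():
        out = subprocess.check_output(
            ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
             "(objectClass=GlueCE)", "GlueCEUniqueID", "GlueCEStateWaitingJobs"])
        ce, results = None, []
        for line in out.decode(errors="replace").splitlines():
            if line.startswith("GlueCEUniqueID:"):
                ce = line.split(":", 1)[1].strip()
            elif line.startswith("GlueCEStateWaitingJobs:"):
                results.append((ce, line.split(":", 1)[1].strip()))
        return results

    if __name__ == "__main__":
        for ce, jobs in waiting_jobs():
            if jobs == MAGIC:
                print("placeholder job count published by", ce)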

Current action: cross-checking of CPU resources found in the accounting and published in the BDII for T0 and T1s

Meeting with storage developers April 15 to re-discuss storage information providers and align how things are implemented

  • Also discussing publishing xrootd and WebDAV into the BDII and doing some checks based on this
    • Alessandro: registration in GOCDB is also needed, both for xrootd and WebDAV: currently done by very few sites...
      • they currently are only in AGIS as a stopgap solution

Cloud resources now published into BDII: a request from EGI

  • Maintain backwards compatibility
  • Discussion at OGF about extending GLUE2 for cloud resources: GLUE 2.1?

ginfo: new version under preparation that is able to combine entries of different GLUE2 objects, similarly to lcg-info/infosites

GSR: currently on hold in favor of AGIS

  • CMS has been asked to evaluate the use of AGIS as a common information system interface, feedback expected soon

Discussion

  • notifications for warnings and infos?
    • no; a GGUS ticket will be opened by the EGI ROD only for errors
  • ginfo used little today, mostly for debugging info system content issues


SL/CentOS Discussions Update

Jarek

  • CentOS has made progress towards making their build system public soon
  • The main option explored is #4: an additional SLC repository on top of the CentOS core distribution

Connie

  • A survey was done on the mailing list: good feedback from the community, with a preference for keeping the SL identity
  • No clear choice yet
  • Not necessarily the same choice as CERN's

A new meeting with RH/CentOS next week involving CERN and FNAL: decision to be made at HEPiX

  • will mainly affect SL7
  • SL5 and SL6 will probably stay as they are

HS06 and 64-bit - J. Templon

  • discussion at last MB did not converge
  • summary of the issue:
    • HS06 is 32-bit
    • running applications in 64-bit mode typically gives a 15% performance increase
    • experiments should use 64-bit applications and they already do so
    • accounting and therefore benchmarks should be done accordingly
  • main points of the subsequent discussion:
    • HS06 procedures and values need to be consistent across all sites
    • HS06 was designed to scale with HEP applications behavior
      • relationship has diverged due to changing HW and usage of 64-bit mode
      • to be restored in HS14 (or HS15)
    • experiments got 15% extra performance out of current pledges for various reasons
      • SL5 --> SL6 kernel
      • better compiler
      • 32 --> 64-bit mode: already used by most experiments for a few years!
    • relationship between 32- and 64-bit benchmarks need not be linear
    • HS06 is just a measurement unit to match the needs of experiments with what is provided by sites
      • a change of its definition would lead to huge confusion
      • a benchmark is an approximation of reality in any case
      • as the mismatch is still not large, we can just live with it for another year
      • an interim benchmark is not worth the effort
    • plan:
      • wait for the upcoming SPEC suite expected this year or early next year
      • create HS14 or HS15 accordingly
      • HEPiX WG has already started preparations


Machine/Job Features - S. Roiser

Goal: a mechanism to communicate information from the resource provider to the VO (job) using the resource (see the sketch after the list below)

  • Can be used for different types of information: benchmark, cores, wall/cpu limit...
  • Working for both batch systems and cloud infrastructures
    • Including commercial clouds, via a service running in the cloud that maintains the information instead of the resource provider producing it
  • Original idea extended by Igor to a bi-directional mechanism between VOs and resource providers
  • A common thin client to consume the information produced by the resource provider
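
A minimal sketch of the consumer side, assuming the environment variables and key names proposed for machine/job features (the exact set of keys was still being agreed on the Twiki at the time):

    # Sketch: a job reads machine/job features from two directories, pointed to by
    # $MACHINEFEATURES and $JOBFEATURES, each containing one small file per key.
    # Key names below (hs06, jobslots, wall_limit_secs, cpu_limit_secs) follow the
    # proposal and may differ from what a given site actually publishes.
    import os

    def read_feature(env_var, key):
        base = os.environ.get(env_var)
        if not base:
            return None                      # resource provider does not publish MJF
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except IOError:
            return None                      # this particular key is not provided

    if __name__ == "__main__":
        print("HS06 rating   :", read_feature("MACHINEFEATURES", "hs06"))
        print("Job slots     :", read_feature("MACHINEFEATURES", "jobslots"))
        print("Wall limit (s):", read_feature("JOBFEATURES", "wall_limit_secs"))
        print("CPU limit (s) :", read_feature("JOBFEATURES", "cpu_limit_secs"))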

Current status: information for all batch systems except SLURM

  • A SLURM volunteer would be very welcome...
  • A SAM probe is in pre-production to check the availability of MJF: currently no real (or only a very coarse) sanity check of the published values

Sites welcome to join: SW available on GitHub

Clouds: the original idea was to use the OpenStack metadata service, but this was found too risky, as potential overload of the metadata service must be avoided

  • Replaced by a key/value store: deployment at CERN pending; will be ported to other cloud MW once done and validated

MJF client (Python) available in WLCG repository

Bi-directional communication: last details being discussed; the plan is to extend MJF during the summer to handle it

Open questions

  • What to publish: the currently agreed information is available in the Twiki
  • Validation of the correctness of the published information
  • Agreement on grace periods...

Discussion

  • bidirectional communication: what does the site expect from the job?
    • info on the importance of individual jobs/VMs to discover which VM should be terminated first when resources need to be reclaimed
    • more important for clouds than batch systems
  • the WG is trying to establish how site-specific the implementations for the various batch systems are
    • try a second site for each
    • LSF: some CERN-specific scripts serve as examples, cannot be used out of the box
    • Torque/Maui: expected to work also at other such sites
    • SGE: 1 site-specific script to provide HS value for a given WN
    • SLURM: contributions by Nordic sites look feasible, will be followed up


FTS Web Portal - A. Abad Rodriguez

WebFTS: a user-oriented web interface to FTS3

  • Funded by EUDAT
  • Multi-protocol

Includes a dashboard of submitted transfers

  • Gives access to the transfer log in case of errors

Future improvements planned

  • Authentication based on identity federations
  • Handling of storage cloud providers (dropbox, owncloud, gdrive...)

Beta service available at CERN: feedback is welcome!

Discussion

  • the beta instance is accessible from outside CERN
  • users need to have a valid certificate and association with a VO
  • this service would be interesting in particular for users outside WLCG
  • support for other types of credentials?
    • federated identities (see talk in this GDB)
    • open to suggestions
  • deployment plan for EUDAT?
    • the service is currently under evaluation
  • does the service expose its own API?
    • no, but the code could be incorporated into bigger applications
    • frameworks that want to use an API should rather invoke FTS3 directly (see the sketch after this list)
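
For frameworks invoking FTS3 directly, a sketch using the fts3-rest Python "easy" bindings could look as follows; the endpoint, the file URLs and the exact function names are assumptions to be checked against the installed fts-rest version:

    # Sketch: submit one transfer straight to the FTS3 REST interface, assuming the
    # fts3-rest Python bindings are installed and a valid grid proxy is available.
    import fts3.rest.client.easy as fts3

    ENDPOINT = "https://fts3.cern.ch:8446"   # example FTS3 REST endpoint

    source = "gsiftp://source.example.org/path/to/file"
    destination = "gsiftp://destination.example.org/path/to/file"

    context = fts3.Context(ENDPOINT)             # authenticates with the user's proxy
    transfer = fts3.new_transfer(source, destination)
    job = fts3.new_job([transfer])
    print("Submitted job", fts3.submit(context, job))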


T0 Update - M. Barroso Lopez

Agile infrastructure in production both in Meyrin and Wigner

  • Weekly meetings with users
  • OpenStack block volume service based on Ceph started
  • Configuration management with Puppet: aggressive migration from Quattor, deadline end of October
  • Monitoring/alarming: moving from LAS to GNI

Wigner: currently 10% of CERN capacity, continuous ramp-up

  • Good experience with Wigner, including everyday operations

Grid services

  • Currently shutting down experiment WMS
    • SAM WMS: end of June
  • FTS2: shutdown planned for the summer (1st of August)
  • VOMS: a lot of activity related to the move from VOMRS to VOMS-Admin
    • A few issues found, waiting for next VOMS-Admin release
    • VOMRS is able to deal with SHA-2 certs

Batch services

  • Looking at LSF replacement: large scale testing with HTCondor
  • Checking CPU accounting
  • Job efficiency/throughput issues at Wigner reported by experiments in the last months: no indication of a genuine Wigner problem
    • A report from a team of IT experts and experiments is expected next week: it looks like a multi-dimensional issue
  • SLC6 migration: currently 65% of the resources, combined with migration from physical HW to VMs
    • Need to go through the whole provisioning chain
    • 5% more next week
    • When 80% is reached, will discuss what to do with the remaining SLC5

Network

  • Issue on 100G link to Budapest fixed
  • DHCPv6 rolled out in Meyrin and Wigner

Databases

  • Oracle upgrades in progress at CERN and T1s: mostly 11.2.0.4, a few already using 12c

Discussion

  • the cause of the 100G link issue lay with a component in an Alcatel device at Wigner
    • it was a general problem with the series of that device
    • fixed ~10 days ago
  • SLC5 is still used by all 4 experiments
    • since January LHCb no longer builds SL5 binaries
  • SLC6 capacity is also taken from OpenStack now
    • SLC6 usage is higher than suggested by batch system numbers
    • the majority of the 1k ATLAS cores is expected to be at Wigner
  • Puppet modules that are not CERN-specific are uploaded to GitHub
  • the configuration evolution will be presented at the next GDB and at HEPiX
  • job efficiency report may also be interesting for other sites!
    • a lot of effort is being spent to understand all the relevant factors
    • a current summary will be presented at the MB
  • easy way for a job to find out if the WN is in Meyrin or Wigner?
    • let's wait until after the presentation


EGI Future Directions - T. Ferrari

Federated cloud: build a network of trusted cloud providers

  • Discovery of cloud services
  • Easy deployment of services independent of existing clouds: AAI, accounting, VM management
  • Address needs of big data analysis: data movement, community-specific services (PaaS, SaaS)

Two federation profiles

  • Basic: security compliance, AAI, accounting
  • Advanced: adoption of cloud standards, VM management

Production to start in May: launch at CF

Future steps

  • Complete the federation integration
    • Federated ID management interoperable between providers
    • Evolve operations tools: grid and cloud platform, open-source projects for these tools (e.g.: SAM -> ARGO)
  • User co-development of PaaS and SaaS: starting Proof of Concepts
    • Partners: international user communities, developers, service providers
    • Call for proposals: April-May; selection: May-June; inclusion in the EGI-Engage proposal
  • From service catalog to EGI marketplace
    • Application/VM image repository
    • commercial and academic resource providers

Helix Nebula marketplace: EGI FedCloud as a resource provider

Business models for EGI FedClouds: several options leading to different EGI.eu structures and different pricing policies

e-Infrastructure Commons: promote international cooperation between e-infrastructures, offering a common service catalog to users

EGI-Engage: the follow-up project proposal to EGI-Inspire

  • Coordination of operation, user-driven operation, development of grid and cloud platforms
  • Out-of-scope: EGI central operations (funded by EGI.eu fees) and NGI operations (which was 35% of EGI-Inspire funding)
  • Editorial board defined: T. Ferrari coordinating

Discussion

  • EGI central and NGI operations: will NGIs commit to what is expected?
    • no specific issues were raised
    • minimal contributions from NGIs are defined in the service catalog as a subset of the usual tasks of today
      • some tasks are optional