Summary of GDB meeting, October 8, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272778/

Introduction - M. Jouvin

Dates and rooms booked for GDB until end of 2015

  • Location outside CERN needed for March 2015 GDB: send proposals to Michel

Pre-GDB in the two coming months

  • November 11: Volunteer computing
  • December 8: Data Management
    • 1/2 day on data preservation impact on data management
    • 1/2 day on reducing the protocol zoo

Argus is still in bad shape since it's unsupported

  • After discussion with a few security developers/experts, proposal to hold a meeting at CERN in December or January to assess interest in creating a community behind ARGUS and its development.
    • Not necessarily restricted to WLCG people, e.g. EGI (non WLCG) institutes
    • Include OSG in the discussions
    • Date being discussed: will be announced as soon as it is fixed.

New availability/reliability reports discussed at last GDB: switching next month (November).

  • Active checks by experiments: no major problem found, minor issues fixed

Data Preservation Update - J. Shiers

Research Data Alliance (RDA) co-funded by US, EU and ANDS has become important for everything related to data, in particular data sharing

  • DP is a form of data sharing
    • DP is also about preserving workflows
  • Can be used as both a source of knowledge and a source of funding...
    • We are experts in many areas but not all of them
  • 2 plenary meetings per year
  • WGs: short-lived (12-18m) groups focused on one topic
  • IGs: long-lived (no fixed term) groups, on larger topics
    • E.g. domain repositories, reproducibility, data fabric....

New DoE policy since October 1st: every project submitted must have a Data Management Plan (DMP) that describes data sharing and data preservation

  • If no DMP is submitted, this should be explicitly justified

RDA4: last RDA meeting in Amsterdam end of September

  • DP in every talk: 5% cost repeatedly discussed
    • A bit higher than HEP estimates
  • Co-located events with major infrastructure providers (EUDAT, EGI...): opportunity for networking
  • Several FAs present and positive about the DP challenge

CTRUST H2020 proposal: certification of sites, 60+ sites envisioned

HEP should invest in the Reproducibility IG and try to steer it in the right direction

Doing a multi-disciplinary LTDP seems to be feasible... and possibly funded

  • Need to identify the generic common services needed to address the key Use Cases from several communities
  • A first proposal could be submitted as a proof-of-concept based on a limited number of Use Cases at EINFRA-9-2015 (January 14, 2015) but the schedule is tight...
    • Currently only CMS in HEP/WLCG expressed a clear interest
    • Need to decide in the coming month
    • We can probably build a credible proposal but we also need to ensure that we'll have the manpower to do it: the same people tend to be engaged in all the projects...
  • A second possible project may be submitted in 2-3 years with a wider scope in terms of communities

December pre-GDB on services/resources required for the 2 key use cases identified for WLCG (open access for outreach and reproducibility)

  • Impact of HepData, RIVET, RECAST on sites
  • T2s have a role to play, in particular for the outreach use case
  • Should clarify role of sites, in particular T1: MoU speaks about data curation but the knowledge to do the curation is in the experiments
    • Should probably rephrase it as data bit preservation

Technology evolution means that preserving data forever is not a cost issue once it has already been preserved for a few decades

  • The real issue is to be able to reuse it

Discussion

  • Jeff: no EU T0 mentioned in the presentation. Why?
    • Jamie: discussion tends to be with infrastructure providers and it is not clear if EU-T0 has this role. DP is on their agenda but concrete engagement is not clear...
  • Decision about a project in EINFRA-9-2015: urgently need to ping concerned/interested people in experiments. Difficulty is that this project will be an additional effort for experiments without an immediate return on investment.
    • ALICE (Pedrag): interested in principle, but no manpower available for this
    • LHCb (Stefan): cannot answer on this topic, to be followed offline
    • ATLAS (Alessandro): will check with relevant people
    • CMS may be enough to start a proof-of-concept project if others are unable to join

GGUS Update - G. Grein

2 latest major enhancements

  • Ability to do bulk submission to many sites
  • Integration of CMS in replacement for Savannah

Bulk submit

  • Extension of alarm tickets: currently require alarm permission
    • To be discussed in the future: a separate permission?
  • By default priority is "top priority": change the default?
    • Notification of multiple sites does not mean a top priority issue
  • Site selection: ticking in list
    • Currently no VO filtering
    • Request for ability to select T1 sites, T2 sites... would require getting the info from REBUS. To be discussed.

Customized submit form: done for CMS

  • Customized site names (from VO feed): only at ticket creation, not possible to modify the assigned site using the CMS site name
    • CMS site name is translated into the actual GOCDB or OIM name
  • Specific support units
  • Specific type of problems
  • This customization is a new GGUS feature, not an ad hoc solution: open to requests from other VOs

GGUS team welcomes suggestions and requests for improvements: use GGUS (or JIRA) to submit and discuss them

Discussion

  • Oxana: can the support unit/VO be prefilled in the submit form?
    • Guenter: it is possible to select them and to bookmark the form: next use of the bookmarked URL will result in the field being prefilled

Support of new VOMS Servers : Postmortem Analysis

Goal of the discussion: understand the lessons we can learn from this particular issue about the efficiency of communication with sites and possible ways to reduce the operational cost/effort of the infrastructure.

Simple and extensively documented changes, announced well in advance

  • First announcement: March 17
  • First deadline postponed due to a bouncycastle-mail issue making SHA-2 certificates (used by new VOMS servers) unusable
  • Still 88 sites ticketed Sept. 1st
    • All sites starting to fail SAM tests on Sept. 1 (but downtimes not taken into account)
  • A lot of assistance required by many sites: at least to find the URL of the documentation...

Discussion (several countries collected input from their sites following Michel's email when announcing the GDB)

  • Broadcasts are mostly ignored: not completely true but broadcasts are clearly suboptimal as sites tend to receive many of them. GGUS tickets are much more efficient and were probably used too late in this particular case.
  • GGUS ticket content was not self-sufficient: a few comments that the GGUS ticket was mainly a reference to the broadcast, whereas it would have helped to have all the information in the ticket
    • The consequence was probably particularly bad in this case where the original information was sent months ago: several sites spent time (or sent emails to Maarten!) to check if some new information had been missed during the summer
  • The lack of a test focusing on this issue alone resulted in a number of false positives, with sites having other problems (sometimes with downtimes) being ticketed
  • In several cases, at well-managed sites, issues in configuration tools were resetting the support added previously: this probably points to the weakness of site-specific configuration tools, or of site-specific configuration descriptions for tools like Puppet, cfengine or Quattor.
    • Importance of a community effort behind the tools with shared, well tested, tools and configuration modules/descriptions
    • Importance of using off-the-shelf solutions, such as the configuration RPM in this case, for sites relying on home-grown solutions or having limited manpower

Sites encouraged to report feedback about handling of such events and communication with WLCG to the WLCG Ops Coord

  • A site survey to better understand how to improve operations efficiency and cost is being prepared by OpsCoord as an input for further discussions
    • Requested by last MB

Actions in Progress

OpsCoord Report - J. Flix

Dates of OpsCoord meeting until the end of the year: see slides or Indico

  • Sites encouraged to participate

Main news

  • Bash ShellShock : perfSonar must be reinstalled, if not patched before Sept. 26
  • EGI proposes to replace max time in VO cards by max HEPSPEC*time
  • M. Dimou temporarily replaces M. Alandes as co-chair of the OpsCoord
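
As an illustration of the EGI proposal above to express limits as HEPSPEC*time rather than plain time, the following minimal sketch converts a work budget into the wall-clock limit it implies on slots of different power. The function name, the 480 HS06-hours budget and the per-slot HS06 ratings are hypothetical values chosen for the example, not figures from the meeting.

```python
def max_wallclock_hours(budget_hs06_hours, slot_hs06):
    """Wall-clock limit (hours) implied by a HS06*time budget on a given slot."""
    if slot_hs06 <= 0:
        raise ValueError("slot_hs06 must be positive")
    return budget_hs06_hours / slot_hs06


# Hypothetical budget published by a VO: 480 HS06-hours per job.
budget = 480.0

# The same budget gives a longer limit on a slow slot than on a fast one.
slow_limit = max_wallclock_hours(budget, 10.0)   # slot rated 10 HS06
fast_limit = max_wallclock_hours(budget, 16.0)   # slot rated 16 HS06

print(f"slow slot: {slow_limit:.1f} h, fast slot: {fast_limit:.1f} h")
```

The point of such a normalisation is that a single published value can apply to heterogeneous resources, each site deriving its own time limit from its slot rating.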

T0

  • AFS UI: CVMFS UI agreed to be a viable replacement
  • lxplus5 decommissioning: Oct. 13
  • WMS decommissioned

Experiments

  • ALICE using HLT farm as a site
  • ATLAS DC14 ran successfully
  • CMS AAA exercise on-going with some overlap with HC tests
  • LHCb: new stripping campaign soon

New VOMS servers rollout in production

  • Used by SAM prod since Sept. 16
  • Node firewalls are open selectively to assess proper functioning of VO workflows
    • No usage outside CERN

Multicore Jobs: accounting now working properly but requires a specific setting in APEL parser config for CREAM CE

  • Sites must reparse and republish accounting if they already published MC jobs before the config modification
  • CPU time is properly scaled but not the elapsed time: efficiency is meaningless for MC jobs
  • ATLAS published clear requirements for Phys memory, RSS and VMEM for MC jobs/queues
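
The efficiency remark above can be illustrated with a small sketch. It assumes, as one possible reading of the note, that CPU time is summed over all cores of a multicore job while the elapsed time is not multiplied by the number of cores. The function name and the sample 8-core job are hypothetical.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
    """CPU efficiency = CPU time / (elapsed time * number of cores)."""
    if wall_seconds <= 0 or cores < 1:
        raise ValueError("wall_seconds must be > 0 and cores >= 1")
    return cpu_seconds / (wall_seconds * cores)


# Hypothetical 8-core job: 1 hour elapsed, 7 core-hours of CPU consumed.
cpu = 7 * 3600
wall = 3600

# Elapsed time not scaled by the core count: the value exceeds 100%.
unscaled = cpu_efficiency(cpu, wall)

# Elapsed time scaled by the 8 cores: a meaningful efficiency.
scaled = cpu_efficiency(cpu, wall, cores=8)

print(f"unscaled: {unscaled:.0%}, scaled: {scaled:.1%}")
```

Under this assumption, any efficiency computed from unscaled elapsed time says more about the core count than about how well the job used its slot.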

Site survey is being prepared to understand operational efforts/costs and how to optimize them

  • Survey will be sent to GDB first for early feedback

SL/CentOS Update - T. Oulevey

CERN CentOS7

  • CentOS upstream RPMs: no CERN specific customization
  • Additional SW via CERN and CERNONLY repo
  • Staging CentOS7 updates and configuring machines with yum-autoupdates

Build infrastructure being updated to support el7 and allow users to test/provide their packages early

Relationship with CentOS community

  • Very welcoming
  • Open discussion on the mailing list and IRC
  • Contributed CERN Koji experience
  • Access to Q/A code
  • Future collaboration discussed: OpenAFS, SW collections

New Benchmark Update - M. Michelotto

Fast benchmark: request from machine/job TF for a benchmark that is fast to run, to estimate the performance of provided job slots

  • Open source, small, a few minutes max
  • A few requirements unclear
    • Reproducibility
    • single core or multicore
    • Run every time or run to sample resources available?

An example based on Geant4 evaluated (provided by Geant4 people)

  • Intel x86 or ARM
  • CPU bound
  • Simple to run: 5 to 10mn, including compilation
  • Realistic use case: 1/3 to 1/4 of real experiments
  • Perf stabilizes after 1K events in single thread but requires a long time in multicore context with cores >= 16
  • Difficult to troubleshoot problems without the Geant4 people expertise...
  • Need to assess the scaling with HS06: not yet done

A new fast benchmark was provided by LHCb but was received only last week: too early to report on it, but quite promising

  • Very small and quick benchmark, scaling pretty well with HS06 in the order of +/-20%

SPEC14 : beta available and still expected by the end of this year

  • KIT is a member of the consortium and had a first look at it

Discussion

  • Fast benchmark: agreement that 5 mn of running time is a max limit if we want to be able to run it as part of jobs
  • Geant4 benchmark: part of the time required is for getting and compiling the benchmark (1-2 mn)
    • Investigate if the benchmark could be distributed through CVMFS to avoid this
  • The new benchmark will not necessarily be based on SPEC if an alternative proves to be as representative, scaling well with SPEC and easier to run/maintain
  • Not clear that we'll end up with only one benchmark: may need one for procurements/pledges and one for job slot performance assessment (fast benchmark)
    • Bottom line: if we have two benchmarks, they must scale the same way and it should be possible to define a "calibration ratio"
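
The "calibration ratio" idea above can be sketched as follows: given scores from a fast benchmark and HS06 on the same machines, compute the mean ratio and check how far individual machines deviate from it. The function names and all scores are hypothetical; the 20% threshold is taken from the +/-20% scaling quoted for the LHCb benchmark.

```python
from statistics import mean


def calibration_ratio(fast_scores, hs06_scores):
    """Mean HS06 / fast-benchmark ratio over a set of machines."""
    if not fast_scores or len(fast_scores) != len(hs06_scores):
        raise ValueError("need two non-empty lists of equal length")
    return mean(h / f for f, h in zip(fast_scores, hs06_scores))


def max_relative_deviation(fast_scores, hs06_scores):
    """Largest relative error when predicting HS06 from the fast score."""
    ratio = calibration_ratio(fast_scores, hs06_scores)
    return max(abs(f * ratio - h) / h
               for f, h in zip(fast_scores, hs06_scores))


# Hypothetical scores measured on three machines.
fast = [10.0, 20.0, 40.0]
hs06 = [100.0, 210.0, 390.0]

ratio = calibration_ratio(fast, hs06)
deviation = max_relative_deviation(fast, hs06)
print(f"calibration ratio: {ratio:.2f}, max deviation: {deviation:.1%}")
print("scales within 20%:", deviation < 0.20)
```

If two benchmarks scale the same way, the deviation stays small across machine generations; a growing deviation would indicate that a single ratio is no longer sufficient.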

More discussion at HEPiX next week

IPv6 F2F Meeting Summary - D. Kelsey

News from experiments

  • ALICE not able to make the workshop: unfortunate as they had strong activities
  • ATLAS: successfully tested dual stack pilot factory to dual stack CE
    • No time yet for submission to IPv6 only WN
    • Also would like to see FTS3 http transfers
    • Testing dual stack Squid
  • CMS: relying on GlideinWMS
    • Good news: Condor should allow mixed IPv4 and IPv6 pools soon (Wisconsin running out of IPv4 addresses...)
    • Would like to see a significant fraction of AAA accessible through IPv6 by end of 2015
  • LHCb
    • CVMFS access through IPv6 through dual stacked CVMFS
    • DIRAC: only binds to IPv4 currently but only a small modification is required to enable the use of IPv6 too
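
To show the kind of small change that lets a service accept both IPv4 and IPv6 connections, here is a generic sketch using only the Python standard library (Python 3.8 or later). It is not DIRAC code and makes no assumption about how DIRAC opens its sockets; the function name is hypothetical.

```python
import socket


def open_listener(port=0):
    """Return a listening TCP socket, dual-stack when the host supports it.

    Falls back to an IPv4-only socket if IPv6 is unavailable.
    Port 0 lets the operating system pick a free port.
    """
    if socket.has_dualstack_ipv6():
        try:
            # One IPv6 socket that also accepts IPv4 connections.
            return socket.create_server(("::", port),
                                        family=socket.AF_INET6,
                                        dualstack_ipv6=True)
        except OSError:
            pass  # IPv6 support compiled in but not usable on this host
    return socket.create_server(("0.0.0.0", port), family=socket.AF_INET)


server = open_listener()
kind = "dual-stack" if server.family == socket.AF_INET6 else "IPv4 only"
print(f"listening on port {server.getsockname()[1]} ({kind})")
server.close()
```

The design choice here is a single dual-stack socket rather than two separate sockets, which keeps the accept loop of an existing IPv4 service unchanged.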

News from sites

  • DESY: network ready, waiting for users...
  • KIT: started work with FTS3 and dCache 2.10
  • PIC and Imperial: also testing dCache 2.10
  • QMUL: StoRM + WebDav + xrootd v4
  • Many UK sites awaiting readiness of the university network...
  • Recent test during several hours between QMUL StoRM and Imperial dCache at 750 MB/s

Security and IPv6

  • Useful discussion with CERN Security Team about lessons learnt, vulnerabilities identified, security tools
  • Will create a best-practice document with a top-10 list of potential issues
    • Block all IPv6 protocols except ICMP and open on request
    • Ensure that firewall rules are identical for IPv4 and IPv6
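
The recommendation to keep IPv4 and IPv6 firewall rules identical can be checked mechanically. The sketch below compares two rule sets expressed as (protocol, port, action) tuples; this representation and the sample rules are hypothetical, and a real check would first have to parse the site's actual firewall configuration.

```python
def rule_differences(ipv4_rules, ipv6_rules):
    """Return the rules present for one IP version but missing for the other."""
    v4, v6 = set(ipv4_rules), set(ipv6_rules)
    return {
        "only_ipv4": sorted(v4 - v6),
        "only_ipv6": sorted(v6 - v4),
    }


# Hypothetical rule sets: (protocol, port, action)
ipv4 = [("tcp", 22, "accept"), ("tcp", 443, "accept"), ("tcp", 2811, "accept")]
ipv6 = [("tcp", 22, "accept"), ("tcp", 443, "accept")]

diff = rule_differences(ipv4, ipv6)
for version, rules in diff.items():
    for rule in rules:
        print(f"{version}: {rule}")
```

An empty result for both keys means the two rule sets match; anything else is a candidate for review.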

OSG has started to work actively on the topic: aim is to provide dual-stack services

  • Currently testing IPv4-only and dual-stack clients
  • More collaboration is required

ATLAS made an explicit statement that they would like to see their storage accessible with IPv6 to allow opportunistic use of IPv6-only WN

  • Request all T1s to engage with the IPv6 TF: only 1/2 participating currently
  • perfSonar is a good candidate as a first IPv6 (dual stack) service: 3.4 should be entirely IPv6 compliant/ready
    • Also true for T2s

Next steps (next 6 months)

  • Move GridFTP testbed to PhEDEx and FTS3
  • Perform enough tests with dual stack SE to gain confidence and be able to propose a wide deployment plan after next meeting
  • Some central services will also need to be dual-stack

Next F2F meeting: 21-22/1/2015

  • To be confirmed

CERNVM and CVMFS Update - R. Meusel & G. Ganis

Growing number of CVMFS repositories at CERN Stratum0

  • Apps and condition DB
  • Not only LHC experiments: AMS, Boss, Belle...

Migration of Stratum0 to 2.1 completed beginning of September

  • Required to have 2.1.19 client deployed everywhere first
  • Smooth transition with only minor issues, generally caused by machines still running CVMFS 2.0

New features brought by the 2.1 migration (or coming soon)

  • Plug-in backend architecture allowing support of S3, Ceph and others: test instance of S3 backend at LHCb (lhcbdev.cern.ch)
    • Both for Stratum0 and Stratum1
    • Used for nightly builds
  • New version of cvmfs-keys package including public keys for egi.eu and opensciencegrid.org
    • In the future will break this monolithic package into several packages
  • Support for central CVMFS client configuration (2.1.20)
  • Garbage collection for Stratum0: remove versions that are useless
    • Especially useful with nightly builds where history is of no long-term interest
  • Fix support of multiple repositories with Parrot: currently unstable
  • Web API on Stratum1 servers: allow a client to ask for the geographically ordered list of Stratum1 servers it knows
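
The idea of a geographically ordered server list can be sketched as sorting servers by great-circle distance from the client. This is not the CVMFS Web API: the server names, the coordinates and the function names are hypothetical, and the actual service may determine location and ordering differently.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def order_by_distance(client, servers):
    """Return server names sorted from nearest to farthest from the client."""
    return sorted(servers, key=lambda name: great_circle_km(client, servers[name]))


# Hypothetical servers with approximate (latitude, longitude) coordinates.
servers = {
    "stratum1-taipei.example": (25.0, 121.5),
    "stratum1-geneva.example": (46.2, 6.1),
    "stratum1-chicago.example": (41.9, -87.6),
}
client = (48.9, 2.4)  # a client located near Paris

ordered = order_by_distance(client, servers)
print(ordered)
```

A client would then try the servers in the returned order, falling back to the next one on failure.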

Ulf: Stratum1 is not IPv6-ready, only works with a dual-stack squid

  • René: was not aware, will follow-up

Helge: a serious problem has been encountered at Wigner, any idea on the reason and the timeline for the resolution?

  • René : being worked on, reason not yet clear, may be related to some misconfiguration somewhere... working on it to identify and fix the issue asap

CernVM

  • v2 (SL5) is end of life since Sept. 30
  • v3: microCernVM + OS (SL6 by default) from CVMFS
    • Drastic reduction in size: 12 MB image + 100 MB cache
    • Contextualization through CloudInit and amiconfig
    • Several extras available: HTCondor, Ganglia, Puppet, Squid, Xroot...
    • cvm2ova to produce OVF files
  • Web portal: CernVM online
  • Images used for desktop, cloud and volunteer computing
    • Works with every hypervisor (except for a restriction with Parallels)
  • Looking at using CernVM for data preservation: use the snapshotting features of CVMFS
    • 2 use cases demonstrated: ALEPH (SL4) and CMS OpenData (SL5)
  • Future: Evaluation of SW containers (Docker) integration
    • Also SL7 support/SL based version

Network and Transfer Metrics WG - M. Babic

Mandate: ensure all the relevant network and transfer metrics are collected to allow sites and experiments to better understand their network usage and issues and improve the efficiency of experiment workflows.

  • This includes the usage of network-aware tools, in particular perfSonar (operations, deployment and configuration)

WG: already a good participation but more volunteers are welcome.

Metrics

  • Network metrics provided by perfSonar: network path, bandwidth, latency
  • Transfer metrics: rates, link status, errors

perfSonar DataStore, based on esmond

  • Tested and deployed at OSG
  • Give access to all perfSonar data with a common API

perfSonar-related plans

  • GGUS support unit
  • Complete rewrite of the documentation
  • Infrastructure monitoring: SSB operation dashboard planned

perfSonar 3.4: brings interesting new features and fixes but sites asked to wait for clear upgrade instructions

  • Configuration adjustments needed
  • Just released: proposal to deploy it as part of the reinstallation campaign following Bash ShellShock vulnerability
  • Broadcast to sites through WLCG and EGI soon...

Discussion

  • How many perfSonar instances are currently down? ~60%
    • What impact? Not too much as this is not a service production depends on. But need to restore it asap: combine reinstallation with 3.4 upgrade, plan is to start the upgrade campaign next week
  • Markus: is perfSonar really needed? FTS3 and experiment frameworks collect a lot of information related to network operations too
    • Shawn: reminder that to interpret FTS3 or experiment framework logs, we need to be able to disentangle network problems from application problems. Pure network monitoring (and related metrics) is needed for this: this was the reason for deploying perfSonar. The WG has the responsibility to ensure that all the required metrics are collected and that there is no unnecessary duplication between what we collect with perfSonar and what is collected at application level.

Topic revision: r3 - 2014-11-12 - MichelJouvin