Sep 2017 GDB notes

Agenda

https://indico.cern.ch/event/578990/

Introduction (Ian Collier)

presentation

Next meetings

  • No pre-GDB in October
  • November: Authz pre-GDB
  • December pre-GDB: SOC Hackathon
  • March at ISGC

Many meetings in the coming months: check slides

WLCG Workshop probably March 26-29

CWP Status (Michel Jouvin)

presentation

IPv6 F2F report (Dave Kelsey)

presentation

Monthly Vidyo meetings, next F2F January 12-12

  • More participation from sites is welcome

No change in the deployment timeline decided by WLCG, with a major milestone on April 1, 2018

  • Transition has begun: ~9% of the services registered in BDII are dual-stacked
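As an illustration of what "dual-stacked" means in practice, here is a minimal sketch that classifies a service host by the address families its name resolves to. It is not the method used to produce the BDII figure above: the function names and the injectable `resolver` parameter are assumptions made for this example, and a name resolving to both families does not prove the service is actually reachable over IPv6.

```python
import socket


def address_families(host, port=443, resolver=socket.getaddrinfo):
    """Return the set of address families the resolver reports for host."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name does not resolve at all
        return set()
    # Each getaddrinfo entry starts with the address family
    return {info[0] for info in infos}


def classify(families):
    """Label a host from the address families found for it."""
    has_v4 = socket.AF_INET in families
    has_v6 = socket.AF_INET6 in families
    if has_v4 and has_v6:
        return "dual-stack"
    if has_v6:
        return "IPv6-only"
    if has_v4:
        return "IPv4-only"
    return "unresolved"


def dual_stack_fraction(hosts, resolver=socket.getaddrinfo):
    """Fraction of hosts (hypothetical list) that resolve to both families."""
    labels = [classify(address_families(h, resolver=resolver)) for h in hosts]
    return labels.count("dual-stack") / len(labels) if labels else 0.0
```

Passing the resolver as a parameter keeps the classification logic testable without network access; in real use the default `socket.getaddrinfo` does the DNS lookup.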

T1 status

  • All T1s have an active IPv6 peering, except KISTI
  • Dual-stack storage is being deployed slowly: 6 T1s still have no clear plans, including FNAL and BNL

T2 status & experiments

  • Decision that it was important to have experiments involved in chasing the T2 sites (in coordination with the WG)
  • ALICE: 8 out of 71 SEs reachable by IPv6
    • All SEs/CEs IPv6 readiness monitored in alimonitor.cern.ch
  • CMS: adding a new XRootd IPv6 test to the production ETF instance, will be usable by other experiments
    • Also monitoring site IPv6 readiness: 12 sites with a healthy IPv6, 23 sites with known issues
  • LHCb: will use the ALICE approach (with the same table)
    • 6 out of 21 SEs IPv6 capable
  • ATLAS: no update at this meeting (the person in charge could not make it)

EOS and IPv6

  • Citrine branch, based on XRootd v4, is IPv6 ready
  • Currently 59% of EOS traffic is IPv6, but much of it is internal traffic: 9% of the users come from IPv6
  • EOSLHCB already IPv6 since May, EOSALICE next week

CVMFS@CERN: Stratum 0 and Stratum 1 are IPv6 ready

CERN S3: internal traffic is IPv6 only (no support for dual-stack for internal traffic), 1% of the users are IPv6

FTS: 2-3% of transfers are using IPv6

perfSONAR dual-stack mesh: a lot of orange, probably something wrong in the PS dashboard

ETF IPv6 instance provides dual-stack testing support

  • Ready to start one instance per experiment

IPv6 known issues

  • Still the majority of the CAs have only IPv4 CRLs
  • CERN Agile Infrastructure: plan to turn on IPv6 by default, delayed because of a router bug
  • Docker: open issue with network namespaces
    • Singularity: no precise status but it is believed it is not affected by this issue (not using network namespaces)

Storage pre-GDB report (Oliver Keeble)

presentation

Object Stores

  • RAL is now the main actor in attempting to use the Ceph object store and integrate it in our production infrastructure
    • Raised some operational issues, like management of secret keys that may require an additional service (Dynafed?)
  • Dynafed: several features, in particular a potential "cloud storage integrator"
  • Exploratory Ceph-based S3 service at CERN

Tapes: follow-up of discussions in the FTS Steering Group

  • How to make the most out of tapes: optimizations, key metrics, impact on experiment workflows
    • Not all sites have the same view on request optimisation (number of requests the experiment should send)... How to let FTS know the site preferences?
  • EOS+CTA: a drop-in replacement for CASTOR
    • Same data, only metadata changes
  • dCache: abstraction of archiving technologies, can rely on several, selection based on QoS policies per pool

Resource reporting: part of the discussion about the post-SRM era...

  • Already several iterations, seems to converge
  • Is it good enough to request prototypes from the storage providers?
  • Aim to finish this by next GDB (October)

Security Policy Update (Dave Kelsey)

presentation

The last policies to be updated as part of the EGI project

  • Joint effort between EGI Security Group and AARC2

Now based on Communities: VO is just one possible form

  • Group of users organized as an entity that can act as an intermediary between users and infrastructure

2 new policies

  • Community Operation Security Policy: aimed at governing relationship between Community and Infrastructures
  • Community Membership Management Policy
  • No major comments received so far (deadline expired), in particular none from WLCG: next Security Policy Group in October, still time for feedback
    • No major change in policy, just trying to address new use cases linked to smaller communities and long tail of science

Scalable Negotiator for a Community Trust Framework in Federated Infrastructures (SNCTFI)

  • Inspiration from SCI and Sirtfi
  • Owned by IGTF

Archive storage & discussion (German Cancio Melia)

presentation

Tape: market dominated by LTO consortium (3 drive manufacturers)

  • Niche market of enterprise tapes: IBM, Oracle
    • IBM has a roadmap with regular improvements
    • Oracle: no roadmap, basically stepping out of the market. Used to produce heads for HP but the TMR investment was too large: only IBM will have the capability to produce them (HP will use them).
  • Moving from GMR to TMR technology for heads: LTO8 will use the new TMR technology (expected end of 2017)
  • No technology problem but tape market is decreasing
    • LTO decreasing since 2007: competition of disk solutions
    • IBM is left as the last major actor: worrying, how long will they continue?

Disks

  • spinning disks: future not very clear
    • HAMR still not there, reliability+cost issues.
    • If HAMR is available, we could get 100 TB disks by 2025
  • SSD: expected to stay an order of magnitude more expensive than spinning disks

Disk servers for archival @CERN: requires optimizing the disk server cost, currently dominated by the (enterprise) disk prices (75%)

  • Currently disk-based storage is 3x the cost of tape
  • Use desktop disks and compensate for the higher failure rate with more redundancy or by letting the capacity decay over time
    • Being tested with ALICE: successful so far
    • Also some "archival disks" appearing on the market: to be evaluated in the future
  • Monster disk servers: 192 disks behind one server
    • Achieve a volume/throughput perf comparable to tape

Optical disks: promising technology by Sony/Panasonic based on an evolution of Blu-Ray...

  • Max current capacity : 300 GB
    • Roadmap to 1 TB but not beyond, and the availability date for 1 TB keeps moving, now 2020
  • Behind magnetic storage for capacity, perf and reliability
  • No clear info about media pricing
  • Libraries announced but not yet available

Holographic Storage: record over media volume rather than surface

  • Some promising demos in the past years but no real progress, and the density achieved is close to LTO7
  • Startup companies behind this technology went bankrupt, major players like IBM have stepped away from this technology: may reappear at some time but still very far from a product, will not play a role in the next 10 years

New Big Data Solutions and Opportunities for Database Workloads (Luca Canali)

presentation

Young but evolved ecosystem/technology

  • Main change is the use of a declarative interface where a user says what they want to achieve and this is converted into a DAG and executed on a distributed, fault-tolerant infrastructure
  • SQL still very strong... even if not with traditional DBs
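To illustrate the declarative idea above, here is a toy sketch in plain Python. It is not the Spark API: the `Dataset` and `Node` classes and their method names are invented for this example. The user only describes the transformations; they are recorded as a plan (here a linear chain, the simplest form of DAG) and nothing runs until `collect()` is called. A real engine would optimize this plan and distribute it over a fault-tolerant cluster.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """One step of the recorded plan (hypothetical structure)."""
    op: str                        # "source", "filter" or "map"
    fn: Optional[Callable]         # function applied at this step, if any
    parent: Optional["Node"]       # previous step in the plan
    data: Optional[tuple] = None   # input rows, only set on the source node


class Dataset:
    """Lazy dataset: methods extend the plan instead of computing results."""

    def __init__(self, node):
        self._node = node

    @classmethod
    def from_list(cls, items):
        return cls(Node("source", None, None, tuple(items)))

    def where(self, predicate):
        return Dataset(Node("filter", predicate, self._node))

    def select(self, fn):
        return Dataset(Node("map", fn, self._node))

    def _chain(self):
        # Walk back to the source, then reverse to get execution order
        chain, node = [], self._node
        while node is not None:
            chain.append(node)
            node = node.parent
        chain.reverse()
        return chain

    def plan(self):
        """Names of the recorded steps, in execution order."""
        return [node.op for node in self._chain()]

    def collect(self):
        """Execute the recorded plan and return the resulting rows."""
        chain = self._chain()
        rows = list(chain[0].data)
        for node in chain[1:]:
            if node.op == "filter":
                rows = [r for r in rows if node.fn(r)]
            else:
                rows = [node.fn(r) for r in rows]
        return rows


ds = Dataset.from_list(range(6)).where(lambda x: x % 2 == 0).select(lambda x: x * 10)
print(ds.plan())     # ['source', 'filter', 'map']
print(ds.collect())  # [0, 20, 40]
```

Separating the description of the work (`plan`) from its execution (`collect`) is what lets an engine reorder, fuse, or re-run steps after a failure without the user changing their code.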

Several solutions available. Competition between:

  • Data-analytics, Hadoop being the most popular: cost/perf and scalability very good
  • RDBMS and in-memory databases

Hadoop ecosystem based on YARN and HDFS

  • Many additional tools widely used like Spark, HBase... Kudu the next generation?

Hadoop@CERN: 3 production clusters + 1 new one being deployed for accelerator logging

  • Accelerator logging: critical system for running LHC, 700 TB currently, +200 TB/year
    • Data ingestion with Kafka + Gobblin

New projects

  • Next generation archiver for WinCC (ex PVSS): currently using Oracle, planning to move to a Hadoop-like cluster
    • Evaluating using Kudu, a potential Hadoop replacement with db (HBase) capabilities
  • CMS Data Reduction Facility, based on Spark: challenge is to process PBs of data
    • Already there: ability to read ROOT files in Spark and to read from EOS
    • Testing at scale is another challenge

Jupyter Notebooks as physics analysis interface

  • SWAN: ROOT and other libraries available
  • In front of a Hadoop/Spark cluster

Spark became a major component in this ecosystem

  • Workload engine
  • Machine learning: efficient support requires an ecosystem of many other components, including data storage, workload management...
    • Spark is currently a key component
  • Spark SQL provides DB access, now mature
  • Several presentations at last ACAT

Training effort at CERN-IT

  • Introduction and overview of Hadoop and Spark last Spring
    • Recorded, archive available
  • New similar tutorial in November: see Indico

Growing community, both inside WLCG and outside: many opportunities to share experience and get new ideas

Database on Demand (Ignacio Coterillo Coz)

presentation

See slides.

Currently hosting 380 MySQL, 110 Postgres, 60 InfluxDB

Ready to share the SW developed to provide the service

  • Already some institutes interested (EPFL, ESO)
  • Built on Apiato

DBOD monitoring: Telegraf + InfluxDB + Grafana

Q: why a DBOD service when OpenStack has a similar service?

  • The OpenStack service still has limited features, similar to DBOD 5 years ago

Workload Management directions (Erik Mattias Wadenstein)

presentation

Follow-up to the June discussion at GDB

Some motivations

  • Unfavourable support situation for SW like CREAM and Torque/PBS/MAUI
  • Unnecessary diversity is increasing the support cost for experiments

Target of possible recommendations: new sites and sites that are considering changing their solution

  • Not a requirement
  • Understanding that a site may have to take into account requirements from other communities than WLCG

Support of recommended solutions is not done by WLCG but by SW team behind products

  • WLCG will participate in the deployment documentation, tests, troubleshooting/problem analysis

Batch systems: 2 candidates

  • HTCondor: the typical recommendation for WLCG sites (HTC load)
    • Main target: single/multi core jobs up to a full node
  • SLURM: mainly for sites with a significant HPC load
    • Especially when running multi-node MPI jobs

Recommended CEs

  • HTCondor-CE
    • Can be appealing for sites using HTCondor, does not really make sense for others (even if technically feasible)
  • ARC-CE
  • Criteria for choosing between them not very well defined, importance of local preferences (and associated support)
    • Trend to see HTCondor-CE preferred in OSG, ARC-CE in Europe
  • Experiments may express preferences and they could be added to the recommendation, if they are supported by facts and rationale (not just a statement)

CREAM CE no longer recommended for a new site

  • Working well but support very limited, in particular for the support of the recommended batch systems

Editorial board for the recommendation: Mattias in charge of making proposals at next GDB

Globus Toolkit Status (Andrea Manzi)

presentation

U. of Chicago announced EOL January 2018

  • Security fixes for some components until the end of 2018
  • Package repositories will be shut down end of 2018
    • WLCG relies on EPEL: no problem
  • Source code will remain on GitHub but U. of Chicago will no longer contribute to it

Impact mainly on DM servers and clients but not only

  • No major open bugs
  • Concern about possible future security issue and support of new OS/new openssl
  • Based on the last years' experience, not too much concern about the work needed: in fact, for the most important issues, we were already mainly fixing things and integrating changes directly in the EPEL version without going upstream (as upstream was not very responsive)

OSG already announced the full support of gsiftp and GSI

  • During last EGI Data TCB MW providers and infra representatives (including EUDAT, PRACE) agreed to coordinate their support effort in the coming years
  • CERN ready to contribute

Long term: look for/make possible alternatives

  • E.g. https for gridftp, OAuth2 for GSI
  • Mattias: we should also evaluate the cost of making the part of Globus that we rely on maintainable and compare it with moving to something else

May consider starting new package repositories

2018 WLCG Workshop (Ian Collier, Guido Russo, Michel Jouvin)

introduction

Napoli proposal

LAL proposal

RRB scheduled in the week initially foreseen (April 23-27): March 26-29 now seems the best (and probably only) possible date

  • Friday March 30 (Good Friday) is a holiday in several countries: 4 days for the WLCG and HSF workshops

Proposal: 2 1/2 days for WLCG and 2 1/2 days for HSF, with Tuesday afternoon and Wednesday morning as a common part

2 hosting proposals:

  • NAPLES (Italy): large INFN site with strong connection with universities, WLCG T2, involved in many HEP experiments
    • Venue: conference center of the university. Main conference room can accommodate 150 people. A second room for 46. Can book a 3rd one. And several smaller rooms possible.
    • Hotels: many options around the conference center (pedestrian area)
    • Social dinner also planned in the same area to avoid buses
    • Getting there: direct flights from the main European cities, 65 min train from Rome
    • Registration fees < 200 for each workshop
    • Organizational work/registration by a professional travel agency
  • LAL (France, near Paris)
    • See slides

Discussion

  • Graeme: because of the "Good Friday", people may want to travel home on Thursday while still attending the afternoon sessions... Paris may be a little bit easier.
  • Naples has already proposed to host the last workshop... Inclined to choose it!
    • HSF meeting tomorrow: will discuss proposals and give feedback

Wrap up (Ian Collier)

Topic revision: r4 - 2017-09-28 - IanCollier
 