Summary of pre-GDB meeting on Data Access, May 13, 2014 (CERN)



Focus of today's meeting is data access

  • Not covering data transfer and management

SRM-less access: some progress...

  • Dav/http supported/used by Rucio
  • Performant gridftp redirection (DPM)
  • LHCb moved to using xrootd SURLs: no longer any need for SRM at disk-only storage by Run 2
  • ATLAS still missing checksums with xrootd before it can stop using SRM
  • Still need a non-SRM method to report storage usage per directory in order to deprecate space tokens

Some questions still need work

  • Understanding our data access patterns
  • Data federations are in production, but do we have everything working at scale?
  • Are our data access methods compatible with other communities? Is that a requirement?
  • How to reduce our protocol zoology

Federated Storage Workshop - A. Hanushevsky

Experiment review

  • ALICE: long history with fed storage, catalog driven, heavy production usage, large penalty with remote small reads (3X compared to local), waiting for xrootd v4
  • ATLAS: rocky start because of the LFC bottleneck, now very stable with Rucio; PanDA now enabled for FAX failover with significant improvement in job efficiency
  • CMS: a lot of work done to improve WAN I/O; now 80% of sites federated, federated transfers at the same I/O level as bulk file transfer
    • AAA used for failover and small-scale opportunistic scheduling
  • COEPP (Australia): initially a read-only federation; WAN transfer rates problematic when TTreeCache is off

Monitoring: a very active area, analysis done to predict impact of federated access on EOS

  • Understand relationship between I/O and CPU
  • Proposal to use events/s as a metric rather than wall clock: more meaningful to users
  • Minimal impact on EOS (1%) when used for failover
  • Developing a cost matrix seen as essential for a scheduler not to overcommit a site
  • Monitoring data growing rapidly: scalability issues to solve, but many potential uses (data popularity...)

WAN access requires retooling of applications for effective access

  • Large latencies, from 30 to 300 ms
  • Caching is a must, but TTreeCache is not always sufficient: it trains while the first 20 objects are read, which can lead to significant performance degradation
    • Bulk transfer before training seems to offer a huge improvement
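The training penalty above can be illustrated with a toy latency model (this is not ROOT's actual implementation; the batch size and costs are illustrative assumptions):

```python
# Toy model of the TTreeCache "training" penalty: during training, each
# object is fetched with its own round trip; a bulk prefetch covers many
# objects in one vectored request. Purely illustrative, not ROOT code.

def naive_training_cost(n_objects, rtt_ms):
    """One round trip per object while the cache learns the access pattern."""
    return n_objects * rtt_ms

def bulk_prefetch_cost(n_objects, rtt_ms, batch_size=100):
    """Objects fetched in large vectored requests: far fewer round trips."""
    n_requests = -(-n_objects // batch_size)  # ceiling division
    return n_requests * rtt_ms

# With 300 ms WAN latency, reading the first 20 objects one by one costs
# 6 seconds of pure latency; a single bulk transfer costs one round trip.
assert naive_training_cost(20, 300) == 6000
assert bulk_prefetch_cost(20, 300) == 300
```

This is why bulk transfer before training helps so much at 300 ms latency: the per-object round trips dominate over the actual data transfer time.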

Access patterns: crucial to understand, but it is difficult to find a dominant pattern in user access, and the analysis has to be revisited regularly

  • Very dependent on data format

Scale testing

  • File open rate: tested by CMS at the 200 Hz level when 100 Hz is enough, but dependent on SE type
    • Investigation underway to understand the discrepancies between implementations
  • Read: US sites performing very well despite variations between sites, even with the same SE type

Workload management

  • Main work by PanDA: failover, migrating jobs queued at an over-subscribed site to a site with free slots

Recent developments

  • http plugin for Xrootd
  • Caching Xrootd Proxy server
  • http federations

v4 in RC2

  • IPv6 support
  • http support
  • Caching proxy server
  • Vector reads passed through proxy servers for better performance

Next meeting: January 2015, UK or Rome

Xrootd Monitoring Report - A. Beche


  • Understand data flows
  • Give information for efficient operations
  • Provide information for data placement

Information collected from storage systems through the GLED collector, which aggregates all the information for a particular file before sending it to the monitoring infrastructure

  • Up to 150 Hz (FAX EOS) out of a GLED collector
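The per-file aggregation step can be sketched as follows (a conceptual illustration only; the class, record layout, and field names are invented, not the actual GLED schema):

```python
from collections import defaultdict

# Sketch of per-file aggregation as done conceptually by a GLED-style
# collector: partial read reports for the same file are merged locally,
# and only one summary record is forwarded to the monitoring
# infrastructure when the file is closed.

class FileAggregator:
    def __init__(self):
        self.stats = defaultdict(lambda: {"bytes_read": 0, "n_reads": 0})

    def report(self, file_id, bytes_read):
        """Accumulate one partial read report for a file."""
        s = self.stats[file_id]
        s["bytes_read"] += bytes_read
        s["n_reads"] += 1

    def flush(self, file_id):
        """On file close: emit a single summary record downstream."""
        return self.stats.pop(file_id)

agg = FileAggregator()
agg.report("/store/data/f1.root", 4096)
agg.report("/store/data/f1.root", 8192)
summary = agg.flush("/store/data/f1.root")
assert summary == {"bytes_read": 12288, "n_reads": 2}
```

Aggregating before sending is what keeps the downstream message rate bounded (e.g. the 150 Hz out of a collector) even when individual reads are far more frequent.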

Monitoring data scalability: now 2 GB/day inserted

  • Tables partitioned per day with no global index: used for dashboard
  • Global aggregation done asynchronously for pre-defined set of views
  • Many views available: ability to zoom/drill down

Monitoring also enables several operational tasks

  • Identify the main users, the kind of operations they do, the files they access
  • Identify dormant files for archiving

Some issues

  • xrootd servers should report their site name: the topology is difficult to guess without this information
  • GLED collector improvements for better reliability: restart can take a long time
  • Multi-VO sites: idea is to start a new GLED connector (locally for example) that will receive all data and feed them to the appropriate GLED collector based on the VO
    • Discussion about the possibility to filter the VO reported at the SE level: Andy agrees this is doable and the most suitable approach, will add it to the request list...
  • Access to raw data for building other applications on these data
  • Need to understand why the number of distinct users per day is significantly lower with xrootd federations than with FTS

http monitoring

  • xrootd/httpd: no problem, xrootd v4 will report the protocol used
  • pure httpd/dav: not clear yet whether to report directly to ActiveMQ or to GLED collector
    • Preliminary work started by DPM to publish to GLED collector

Data Access Analysis @CERN

Infrastructure view - C. Nieke

Several data sources (EOS, LSF, Experiment frameworks) but missing unique identifiers to cross match

Metrics: events/sec look more meaningful to experiments but difficult to compare different workloads

  • Mainly useful to compare sites for the same workload
  • Still need CPU/memory for understanding problems and predict them...
  • Event rate will be useful only if applications fill the app_info field in xrootd to be able to compare similar applications
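The point about app_info can be illustrated with a sketch: event rates are only comparable within one application type, so jobs must tag themselves (the job records and field names below are invented for illustration):

```python
# Sketch of the events/s metric: comparing sites only makes sense when
# restricted to jobs of the same application type, identified here by an
# app_info tag as proposed for xrootd. Sample data is invented.

jobs = [
    {"app_info": "reco_2012", "site": "A", "events": 10000, "wall_s": 500},
    {"app_info": "reco_2012", "site": "B", "events": 10000, "wall_s": 1000},
    {"app_info": "skim_fast", "site": "A", "events": 90000, "wall_s": 300},
]

def event_rates(jobs, app_info):
    """events/s per site, restricted to one application type."""
    return {j["site"]: j["events"] / j["wall_s"]
            for j in jobs if j["app_info"] == app_info}

rates = event_rates(jobs, "reco_2012")
# Site A processes this workload twice as fast as site B; the skimming
# job's much higher raw rate is excluded, so it cannot skew the comparison.
assert rates == {"A": 20.0, "B": 10.0}
```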

Proposal to have a new xrootd plugin that will collect CPU/memory used by a payload for easier correlation with I/O metrics

Experiment view - N. Magini

Comparison of job behaviours between Meyrin and Wigner

Many different data collected: the impact of remote access between the two sites is small

  • Comparable with other effects like OS version

CMS Plans and Site Expectations - K. Bloom

Status of AAA federation

  • 6/8 T1s in, PIC coming soon, TW as an "opportunistic T1"
  • 41/52 T2s, representing 96% of the unique datasets resident at T2s
  • Main application is the fallback mechanism: a few easy configuration changes needed at sites
    • 47/52 T2s with fallback enabled
    • RAL still having problems due to tricky firewall configuration issues

Production with remote data

  • Test done with reprocessing of 2012 data: input at T1s, T2s accessing the data through AAA allowing opportunistic use of their resources
    • Like to continue during Run 2 as it was very successful
    • Sites should be prepared to serve data at a scale similar to that at which they process data: robust storage needed, storage no longer only for local batch slots
  • Optimizing resource use by moving jobs at over-subscribed sites to sites with free slots: data will be downloaded via AAA rather than pre-placed
  • Sites without data (T3): analysis only, done through AAA
    • 99% success rate observed; 2-3 Gb/s sustained
  • Opportunistic resource: SW deployed with Parrot+CVMFS (500 MB), data read through AAA
    • clouds, HPC, HLT farm...

Scale tests:

  • Baseline: 20 Hz; technology goal: 100 Hz; testing up to 200 Hz
    • Results vary by site, but DPM and dCache seem to have problems sustaining 100 Hz
    • Need to discuss with developers: in particular for DPM, there is no reason for such a low limit (it does not match tests done in other contexts)
  • File serving: up to 800 concurrent jobs running remotely from 1 site at 0.25 MB/s
    • Tests at Nebraska and UCSD show good scalability but different read time per block with the same storage implementation (Hadoop)...
  • Client side: concurrent jobs at a site all reading from AAA, limit in the ability to sink data
  • Next step is a chaos test, centrally forcing jobs to run everywhere and ignore data location
    • Will be done as part of the CSA14 exercise this summer
    • Will measure success rate, I/O rate, network load...
    • Sites are required to configure xrootd monitoring before then, otherwise the test will be blind (dCache: requires 2.6)

ALICE Plans - C. Grigoras

Data access model

  • Central catalog of LFNs and replicas
  • Jobs go where the data is
    • Other files (configuration, calibration) accessed remotely (not downloaded)
    • Job locality constraint relaxed for high priority tasks
    • Every job receives a list of replicas for failover if needed

Data management

  • xrootd exclusively for data files
    • Other protocols can be used for other files
  • xrd3cp for managed transfers

The analysis train is the main recommended way to submit jobs: it optimizes access to storage

  • The last week of organized analysis trains showed 0.4% remote access, but a significant penalty when reading remotely (2x)
    • Long delay observed at the initialization of the job
  • Problems can come from the network or the storage: some sites show better remote performance than local performance...
    • A congested firewall is a well-known problem

Monitoring data collected from Xrootd using standard reporting and AppMon

  • Many views available in MonAlisa
    • Producing topology information with a "network distance" used to order replicas
    • Weighted factors based on available free space and reliability
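A replica-ordering scheme of this kind can be sketched as follows; the scoring formula and weights below are illustrative assumptions, not ALICE's actual MonALISA algorithm:

```python
# Sketch of ordering replicas by "network distance", weighted by the SE's
# free space and reliability. Formula and weights are invented for
# illustration; lower score = preferred replica.

def replica_score(distance, free_fraction, reliability,
                  w_free=0.3, w_rel=0.7):
    """Base distance, discounted for roomy and reliable storage elements."""
    return distance * (1.0 - w_free * free_fraction) \
                    * (2.0 - w_rel * reliability)

replicas = [
    {"se": "SE_close_full",   "distance": 1.0, "free": 0.05, "rel": 0.99},
    {"se": "SE_far_reliable", "distance": 3.0, "free": 0.60, "rel": 0.99},
    {"se": "SE_close_flaky",  "distance": 1.0, "free": 0.50, "rel": 0.50},
]

ordered = sorted(replicas, key=lambda r: replica_score(
    r["distance"], r["free"], r["rel"]))
# With these weights, the nearby reliable SE wins even though it is
# nearly full; the distant SE loses mainly on distance.
assert ordered[0]["se"] == "SE_close_full"
```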

SE functional tests run every 2h against known redirectors + a test suite run against discovered disk servers

Future plans

  • xrootd v4 upgrade: impact on AliEn (command line) and ROOT (library)
  • Implement IPv6 topology
  • Newly deployed storage: EOS recommended, using the RAIN functionality between well-connected sites
    • CERN would like to validate RAIN between Meyrin and Wigner first...
    • CERN will do its best to support EOS sites outside CERN but would prefer to avoid small sites...

Current I/O requirement : 2 MB/s/core

  • Observed over long-period analysis train: average throughput weighted by CPU efficiency
    • I/O assumed to be the main contributor to job inefficiency
  • Will have to increase with new generation of CPUs
  • May require advanced HW configuration: not necessarily compatible with flat budget model...
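The "average throughput weighted by CPU efficiency" behind the 2 MB/s/core figure amounts to the following computation (the job samples are invented; only the weighting scheme is taken from the talk):

```python
# Sketch of a CPU-efficiency-weighted average of per-job I/O throughput,
# as used to derive the ~2 MB/s/core requirement. Sample jobs invented.

def weighted_io_requirement(jobs):
    """jobs: list of (throughput_MBps, cpu_efficiency) tuples."""
    total_weight = sum(eff for _, eff in jobs)
    return sum(tp * eff for tp, eff in jobs) / total_weight

sample = [(1.5, 0.9), (3.0, 0.6), (2.0, 0.8)]
req = weighted_io_requirement(sample)
# (1.35 + 1.8 + 1.6) / 2.3 ≈ 2.07 MB/s/core: efficient jobs count more,
# since I/O starvation is assumed to be the main source of inefficiency.
assert 2.0 < req < 2.1
```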

LHCb (absent)

1 summary slide

  • Production jobs (simulation): data downloaded
  • Working group/analysis jobs: data access through xrootd
    • List of replicas given to the job (processed by Gaudi)
    • Remote access only for failover

Atlas Plans and Expectations - R. Gardner

Motivation similar to CMS

Current FAX status

  • 48 sites, stable
  • Using Rucio + gLFN (global name)
  • Hourly functional tests + support by cloud squads
  • Sites able to throttle use of remote access by parameters in AGIS but may have an impact on job efficiency at remote site


  • Failover: 180K jobs failed over to remote locations in the last 3 months
    • Very smooth, modest network impact
  • Overflow: now in testing in US cloud: 100% success in BNL -> MWT2 (700 jobs)
    • Reading done from original site

Test at scale planned during DC14: up to 10% of remote access

  • T1s and T2s encouraged to join FAX by DC14 preparation step

Also planning tests with http/DAV access/federation

German Sites Experience - G. Duckeck

Main experience in ATLAS DE cloud

  • All sites using dCache
  • Most sites interconnected with 10G links (except PL)
  • Using direct I/O with dcap for many years
  • A couple of sites with efficient but small storage (Lustre/GPFS): not enough for a full SE, but could provide good/efficient caching

2 possible setups for federated Xrootd with dCache

  • dCache plugin: invasive setup, very dependent on the dCache version, but able to do pool redirection for better performance
  • Xrootd proxy server: no impact on dCache, but all remote traffic goes through this node
    • This is the setup used at every site except GridKA

HC-based stress tests: remote I/O has no impact on the success rate, but performance is affected, partly related to node distance (latency)

Also some tests done with the 3 CMS sites (GridKA, DESY, RWTH)

  • All joined AAA: very smooth
  • Main issue is xrootd monitoring requiring a plugin installed in each pool

Some preliminary tests with http/Dav federation


Dynamic Federation and http plugin for xrootd - F. Furano

http for Xrootd: a plugin adding http/Dav connectivity to an existing Xrootd installation

  • Well integrated, no new daemon...

http/Dav and grid storage motivations

  • Use browser to access grid storage to lower the barrier
  • Use the many standard clients existing
  • Build customized web portal/clients
  • Write specific clients with very advanced features
  • Important to have xrootd in this world rather than 2 separate islands: hence the importance of the http plugin for xrootd
    • Both for client access to xrootd and for integration of xrootd in federations

http redirections allow several mixes of authenticated/unauthenticated access

  • redir and disk server authenticated (https)
  • redir authenticated (https) and disk server unauthenticated (http)
  • the opposite

Dynamic federation: dynamic discovery of metadata, no persistency

  • Parallelization of metadata discovery for efficiency
  • A few early adopters: NEP101, FNAL
  • Next plans: move federator to EPEL, move doc to Drupal to get it indexed
  • Looking for new adopters, including non WLCG
  • Probably the main use cases are outside WLCG, as LHC experiments have a list of all file replicas in their central catalogs, usable through metalink
    • metalink provides additional features like parallel download of a file from different locations

http/Dav preliminary tests in Atlas - C. Serfon

See slides

Rucio knows about the site status: will exclude sites in downtime from the replica list returned

Rucio also provides a redirector service

Developing a Nagios probe for http/Dav.

ROOT Multiple Exec Streams - P. Canal

Current area of work

  • Thread safety (but not lock free...)
  • TTreeCache configurable to be on by default
  • Reform of checksum

Next milestones

  • First ROOT6 production release in the next weeks
  • ROOT I/O workshop at CERN June 25: want to refocus on ROOT I/O after the ROOT6 release

Likely focus for future developments includes

  • Performance improvements, in particular tweaking the file format to be little-endian (compatible with Intel)
  • Interface simplifications

Current use of multiple stream of execution

  • Parallelism through PROOF: multi-core to many-node
  • Parallel file merge
  • Multiple threads in a few places, like read-ahead TTreeCache
  • Vectorization in math libraries

Concurrent I/O support: allow every thread to have its own TFile/TTree to access the same file

  • Using the same TFile/TTree from multiple threads requires locking
  • Leverage on C++ new features (atomics, thread_local)
  • Moving away from CINT is a requirement: it has no separation between execution and database access
  • Preliminary tests promising but still some work to troubleshoot all corner cases
    • 5% sequential execution when reading the CMS Event TTree, 11% for the conditions DB
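The "one TFile/TTree per thread" pattern can be sketched with plain file handles standing in for TFile (ROOT itself is not used here; this only illustrates why private handles remove the need for a lock on the read path):

```python
import os
import tempfile
import threading

# Sketch of per-thread file handles: each thread opens its OWN handle on
# the same file, so seeks and reads never interleave on shared state and
# no locking is needed. A shared handle would require a lock around each
# seek+read pair. Plain Python files stand in for ROOT's TFile/TTree.

def reader(path, offset, length, results, idx):
    with open(path, "rb") as f:      # private handle, like a per-thread TFile
        f.seek(offset)
        results[idx] = f.read(length)

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"abcdefgh")

results = [None, None]
threads = [threading.Thread(target=reader, args=(path, 0, 4, results, 0)),
           threading.Thread(target=reader, args=(path, 4, 4, results, 1))]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == [b"abcd", b"efgh"]
```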

Future work to benefit from multiple threads

  • Seamless migration from single core to multiple core to multiple threads: remove TSelector restrictions in PROOF
  • Multiple threads reading from the same TTree: parallelization of unzipping
  • Parallelization of histograms
  • Remove need for locks and atomics

Challenge of introducing new, cleaner, simpler interfaces while keeping backward compatibility

  • The main solution seems to be interface versioning

-- MichelJouvin - 14 May 2014
