Summary of pre-GDB meeting on Data Access, May 13, 2014 (CERN)
Agenda
https://indico.cern.ch/event/272787/
Introduction
Focus of today's meeting is data access
- Not covering data transfer and management
SRM-less access: some progress...
- Dav/http supported/used by Rucio
- Performant gridftp redirection (DPM)
- LHCb moved to using xrootd SURLs: no longer any need for SRM at disk-only storage by Run 2
- ATLAS: still missing checksum retrieval with xrootd in order to avoid using SRM
- Still need a non-SRM method to report storage usage per directory in order to deprecate space tokens
Some questions still needing work
- Understanding our data access patterns
- Data federations are in production, but do we have everything working at scale?
- Are our data access methods compatible with other communities? Is that a requirement?
- How to reduce our protocol zoology
Federated Storage Workshop - A. Hanushevsky
Experiment review
- ALICE: long history with fed storage, catalog driven, heavy production usage, large penalty with remote small reads (3X compared to local), waiting for xrootd v4
- ATLAS: rocky start because of the LFC bottleneck, now very stable with Rucio, PanDA now enabled for FAX failover with significant improvement in job efficiency
- CMS: a lot of work done to improve WAN I/O, now 80% of sites federated, federated transfers at the same I/O level as bulk file transfers
- AAA used for failover and small-scale opportunistic scheduling
- COEPP (Australia): initially read-only federation, WAN transfer rates problematic when TTree cache is off
Monitoring: a very active area, analysis done to predict impact of federated access on EOS
- Understand relationship between I/O and CPU
- Proposal to use events/s as a metric rather than wall clock time: more meaningful to users
- Minimal impact on EOS (1%) when used for failover
- Developing a cost matrix seen as essential for a scheduler not to overcommit a site
- Monitoring data growing rapidly: scalability issues to solve but a lot of potential usages (data popularity...)
WAN access requires retooling of applications for effective access
- Large latencies, from 30 to 300 ms
- Caching is a must but TTreeCache is not always sufficient: the training period, during which the first 20 objects are read uncached, can lead to significant performance degradation (see the sketch after this list)
- Bulk transfer before training seems to offer a huge improvement
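(For illustration only: a minimal ROOT/C++ sketch of explicit TTreeCache tuning, showing what the training phase is and how declaring the branches up front avoids it; the file URL and tree name are placeholders.)

    #include "TFile.h"
    #include "TTree.h"

    // Placeholder URL and tree name, for illustration only
    TFile *f = TFile::Open("root://some-redirector.example.org//store/data.root");
    TTree *tree = nullptr;
    f->GetObject("Events", tree);

    tree->SetCacheSize(30 * 1024 * 1024);   // 30 MB TTreeCache
    tree->SetCacheLearnEntries(10);         // entries read uncached during the training phase
    tree->AddBranchToCache("*", kTRUE);     // declare branches explicitly...
    tree->StopCacheLearningPhase();         // ...so no training phase is needed at all
    for (Long64_t i = 0; i < tree->GetEntries(); ++i)
       tree->GetEntry(i);                   // reads now served from the cache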
Access patterns: crucial to understand, but difficult to find a dominating pattern in user access; the analysis has to be revisited regularly
- Very dependent on data format
Scale testing
- File open rate: tested by CMS at the 200 Hz level while 100 Hz is enough, but dependent on SE type
- Investigation underway to understand the discrepancies between implementations
- Read: US sites performing very well despite variations between sites, even with the same SE type
Workload management
- Main work by PanDA: failover, migrating jobs queued at an over-subscribed site to a site with free slots
Recent developments
- http plugin for Xrootd
- Caching Xrootd Proxy server
- http federations
Xrootd v4 in RC2
- IPv6 support
- http support
- Caching proxy server
- Vector reads passed through proxy servers for better performance
Next meeting: January 2015, UK or Rome
Xrootd Monitoring Report - A. Beche
Goals
- Understand data flows
- Give information for efficient operations
- Provide information for data placement
Information collected from storage systems through the GLED collector, which aggregates all the information for a particular file before sending it to the monitoring infrastructure
- Up to 150 Hz (FAX EOS) out of a GLED collector
Monitoring data scalability: now 2 GB/day inserted
- Tables partitioned per day with no global index: used for dashboard
- Global aggregation done asynchronously for pre-defined set of views
- Many views available: ability to zoom/drill down
Monitoring enables several operational tasks
- Identify the main users, the kind of operations they do, the files they access
- Identify dormant files for archiving
Some issues
- xrootd servers should report their site name: topology is difficult to guess without this information
- GLED collector improvements for better reliability: restart can take a long time
- Multi-VO sites: idea is to start a new GLED connector (locally for example) that will receive all data and feed them to the appropriate GLED collector based on the VO
- Discussion about the possibility to filter the VO reported at the SE level: Andy agrees this is doable and the most suitable approach, will add it to the request list...
- Access to raw data for building other applications on these data
- Need to understand why the number of distinct users per day is significantly lower with xrootd federations than with FTS
http monitoring
- xrootd/httpd: no problem, xrootd v4 will report the protocol used
- pure httpd/dav: not clear yet whether to report directly to ActiveMQ or to GLED collector
- Preliminary work started by DPM to publish to GLED collector
Data Access Analysis @CERN
Infrastructure view - C. Nieke
Several data sources (EOS, LSF, experiment frameworks) but missing unique identifiers to cross-match
Metrics: events/sec look more meaningful to experiments but difficult to compare different workloads
- Mainly useful to compare sites for the same workload
- Still need CPU/memory for understanding problems and predict them...
- Event rate will be useful only if applications fill the app_info field in xrootd to be able to compare similar applications
Proposal to have a new xrootd plugin that will collect CPU/memory used by a payload for easier correlation with I/O metrics
Experiment view - N. Magini
Comparison of job behaviours between Meyrin and Wigner
Many different data collected: the impact of remote access between the two sites is small
- Comparable with other effects like OS version
CMS Plans and Site Expectations - K. Bloom
Status of AAA federation
- 6/8 T1s in, PIC coming soon, TW as an "opportunistic T1"
- 41/52 T2s, representing 96% of the unique datasets resident at T2s
- Main application is the fallback mechanism: a few easy configuration changes needed at sites
- 47/52 T2s with fallback enabled
- RAL still having problems due to tricky firewall configuration issues
Production with remote data
- Test done with reprocessing of 2012 data: input at T1s, T2s accessing the data through AAA allowing opportunistic use of their resources
- Would like to continue during Run 2 as it was very successful
- Sites should be prepared to serve data at a scale similar to that at which they process data: robust storage needed, storage no longer only for local batch slots
- Optimizing resource use by moving jobs at over-subscribed sites to sites with free slots: data will be downloaded via AAA rather than pre-placed
- Sites without data (T3): analysis only, done through AAA
- 99% success rate observed; 2-3 Gb/s sustained
- Opportunistic resource: SW deployed with Parrot+CVMFS (500 MB), data read through AAA
Scale tests:
- Baseline : 20 Hz, technology goal : 100 Hz, testing up to 200 Hz
- Results vary by site, but DPM and dCache seem to have problems sustaining 100 Hz
- Need to discuss with developers: in particular for DPM, there is no reason for such a low limit (it does not match tests done in other contexts)
- File serving: up to 800 concurrent jobs running remotely from 1 site at 0.25 MB/s
- Tests at Nebraska and UCSD show good scalability but different read time per block with the same storage implementation (Hadoop)...
- Client side: concurrent jobs at a site all reading from AAA, limit in the ability to sink data
- Next step is a chaos test, centrally forcing jobs to run everywhere and ignore data location
- Will be done as part of CSA14 exercise this summer
- Will measure success rate, I/O rate, network load...
- Sites are required to configure xrootd monitoring before then, otherwise the test will be blind (dCache: requires 2.6)
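(For illustration only: on a plain xrootd server the detailed monitoring is enabled by an xrootd.monitor directive in the server configuration; the collector host/port and the exact reporting options below are placeholders, the real values come from the experiment's monitoring instructions.)

    # Illustrative only: report file, user and redirection info to a central collector
    xrootd.monitor all auth flush 30s window 5s dest files info user redir monitoring-collector.example.org:9330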
ALICE Plans - C. Grigoras
Data access model
- Central catalog of LFNs and replicas
- Jobs go where the data is
- Other files (configuration, calibration) accessed remotely (not downloaded)
- Job locality constraint relaxed for high priority tasks
- Every job receives a list of replicas for failover if needed
Data management
- xrootd exclusively for data files
- Other protocols can be used for other files
- xrd3cp for managed transfers
Analysis train is the main recommended way to submit jobs: it optimizes access to storage
- Last week of organized analysis trains showed 0.4% remote access but a significant penalty when reading remotely (2x)
- Long delay observed at the initialization of the job
- Problems can come from the network or the storage: some sites show better remote performance than local performance...
- A congested firewall is a well-known problem
Monitoring data collected from Xrootd using standard reporting and AppMon
- Many views available in MonAlisa
- Producing topology information with a "network distance" used to order replicas (an illustrative sketch follows this list)
- Weighted factors based on available free space and reliability
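(For illustration only: the real weighting is internal to MonALISA/AliEn; the sketch below just shows the idea of ranking replicas by network distance corrected by free space and reliability, with invented field names and weights.)

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Replica {
       std::string url;
       double networkDistance;   // 0 = same site, larger = further away
       double freeFraction;      // fraction of free space on the SE
       double reliability;       // fraction of recent functional tests passed
    };

    // Illustrative scoring only: lower score = preferred replica
    double score(const Replica &r) {
       return r.networkDistance - 0.1 * r.freeFraction - 0.5 * r.reliability;
    }

    void orderReplicas(std::vector<Replica> &replicas) {
       std::sort(replicas.begin(), replicas.end(),
                 [](const Replica &a, const Replica &b) { return score(a) < score(b); });
    }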
SE functional tests run every 2h against known redirectors + a test suite run against discovered disk servers
Future plans
- xrootd v4 upgrade: impact on Alien (command line) and ROOT (library)
- Implement IPv6 topology
- Newly deployed storage: EOS recommended, using the RAIN functionality between well-connected sites
- CERN would like to validate RAIN between Meyrin and Wigner first...
- CERN will do its best to support EOS sites outside CERN but would prefer to avoid small sites...
Current I/O requirement: 2 MB/s/core (a worked example follows the list below)
- Observed over a long period of analysis trains: average throughput weighted by CPU efficiency
- I/O assumed to be the main contributor to job inefficiency
- Will have to increase with new generation of CPUs
- May require advanced HW configuration: not necessarily compatible with flat budget model...
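(For illustration only, with purely hypothetical node and farm sizes, not figures from the talk:)

    32 cores     x 2 MB/s/core ≈ 64 MB/s ≈ 0.5 Gb/s sustained per worker node
    10,000 cores x 2 MB/s/core ≈ 20 GB/s ≈ 160 Gb/s aggregate out of the site storage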
LHCb (absent)
1 summary slide
- Production job (simulation): data downloaded
- working group/analysis: data access through xrootd
- List of replicas given to the job (processed by Gaudi)
- Remote access only for failover
ATLAS Plans and Expectations - R. Gardner
Motivation similar to CMS
Current FAX status
- 48 sites, stable
- Using Rucio + gLFN (global name)
- Hourly functional tests + support by cloud squads
- Sites able to throttle use of remote access by parameters in AGIS but may have an impact on job efficiency at remote site
Statistics
- Failover: 180K jobs failed over to remote locations in the last 3 months
- Very smooth, modest network impact
- Overflow: now in testing in US cloud: 100% success in BNL -> MWT2 (700 jobs)
- Reading done from original site
Test at scale planned during DC14: up to 10% of remote access
- T1s and T2s encouraged to join FAX by DC14 preparation step
Also planning tests with http/DAV access/federation
German Sites Experience - G. Duckeck
Main experience in ATLAS DE cloud
- All sites using dCache
- Most sites interconnected with 10G links (except PL)
- Using direct I/O with dcap for many years
- A couple of sites with efficient but small storage (Lustre/GPFS): not enough for a real SE, but could provide good/efficient caching
2 possible setups for federated Xrootd with dCache
- dCache plugin: invasive setup, very dependent on the dCache version, but able to do pool redirection for better performance
- Xrootd proxy server: no impact on dCache, but all remote traffic goes through this node
- This is the setup used at every site except GridKa
HammerCloud (HC) based stress tests: remote I/O has no impact on the success rate, but the impact on performance is partly related to site distance (latency)
Also some tests done with the 3 CMS sites (GridKa, DESY, RWTH)
- All joined AAA: very smooth
- Main issue is xrootd monitoring requiring a plugin installed in each pool
Some preliminary tests with http/Dav federation
http/Dav
Dynamic Federation and http plugin for xrootd - F. Furano
http for Xrootd: a plugin adding http/Dav connectivity to an existing Xrootd installation
- Well integrated, no new daemon...
http/Dav and grid storage motivations
- Use browser to access grid storage to lower the barrier
- Use the many existing standard clients
- Build customized web portal/clients
- Write specific clients with very advanced features
- Important to have xrootd in this world rather than 2 separate islands: hence the importance of the http plugin for xrootd
- Both for client access to xrootd and for integration of xrootd in federations
http redirections allow several mixes of authenticated/unauthenticated access
- redir and disk server authenticated (https)
- redir authenticated (https) and disk server unauthenticated (http)
- the opposite
Dynamic federation: dynamic discovery of metadata, no persistency
- Parallelization of metadata discovery for efficiency
- A few early adopters: NEP101, FNAL
- Next plans: move federator to EPEL, move doc to Drupal to get it indexed
- Looking for new adopters, including non WLCG
- Probably the main use cases are outside WLCG, as LHC experiments have a list of all file replicas in their central catalogs, usable through metalink
- metalink provides additional features like parallel download of a file from different locations
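(For reference: a metalink document simply lists all known replicas of a file, which standard clients can then use for parallel download or failover; minimal example below, with placeholder hostnames and paths.)

    <?xml version="1.0" encoding="UTF-8"?>
    <metalink xmlns="urn:ietf:params:xml:ns:metalink">
      <file name="AOD.pool.root">
        <size>1073741824</size>
        <url priority="1">https://se1.example.org/atlas/rucio/AOD.pool.root</url>
        <url priority="2">https://se2.example.org/dpm/atlas/AOD.pool.root</url>
      </file>
    </metalink>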
http/Dav preliminary tests in Atlas - C. Serfon
See slides
Rucio knows about the site status: will exclude sites in downtime from the replica list returned
Rucio also provides a redirector service (see the illustrative example below)
Developing a Nagios probe for http/Dav.
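(For illustration only: the redirector behaves like any HTTP redirection service, the client asks for a logical name and is sent a 302 pointing to a concrete replica; the hostname and URL path below are hypothetical, not the actual Rucio endpoint.)

    $ curl -L --cert $X509_USER_PROXY --capath /etc/grid-security/certificates \
           -o AOD.pool.root \
           https://rucio-redirect.example.cern.ch/redirect/mc14_13TeV/AOD.pool.root
    # -L makes curl follow the 302 redirection to the chosen replica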
ROOT Multiple Exec Streams - P. Canal
Current area of work
- Thread safety (but not lock free...)
- TTreeCache configurable to be on by default
- Reform of checksum
Next milestones
- First ROOT6 production release in the coming weeks
- ROOT I/O workshop at CERN June 25: want to refocus on ROOT I/O after the ROOT6 release
Likely focus for future developments includes
- Performance improvements, in particular tweaking the file format to be little-endian (Intel-compatible)
- Interface simplifications
Current use of multiple streams of execution
- Parallelism through PROOF: multi-core to many-node
- Parallel file merge
- Multiple threads in a few places, like read-ahead TTreeCache
- Vectorization in math libraries
Concurrent I/O support: allow every thread to have its own TFile/TTree to access the same file (a minimal sketch of the pattern follows this list)
- Sharing the same TFile/TTree between multiple threads requires locking
- Leverages new C++ features (atomics, thread_local)
- Moving away from CINT is a requirement: it has no separation between execution and database access
- Preliminary tests promising but still some work to troubleshoot all corner cases
- 5% sequential execution when reading the CMS Event TTree, 11% for the conditions DB
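(For illustration only: a minimal sketch of the one-TFile-per-thread pattern described above; the URL and tree name are placeholders, and the exact thread-safety initialisation call may depend on the ROOT version.)

    #include <memory>
    #include <thread>
    #include <vector>
    #include "TFile.h"
    #include "TThread.h"
    #include "TTree.h"

    // Each thread opens its own TFile and therefore gets its own TTree object,
    // so the event reads themselves need no locking between threads
    void readRange(const char *url, Long64_t first, Long64_t last) {
       std::unique_ptr<TFile> f(TFile::Open(url));
       if (!f || f->IsZombie()) return;
       TTree *tree = nullptr;
       f->GetObject("Events", tree);          // placeholder tree name
       if (!tree) return;
       for (Long64_t i = first; i < last && i < tree->GetEntries(); ++i)
          tree->GetEntry(i);
    }

    int main() {
       TThread::Initialize();                 // enable ROOT's internal thread safety
       const char *url = "root://some-server.example.org//store/data.root";
       std::vector<std::thread> workers;
       for (int t = 0; t < 4; ++t)
          workers.emplace_back(readRange, url, 10000LL * t, 10000LL * (t + 1));
       for (auto &w : workers) w.join();
       return 0;
    }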
Future work to benefit from multiple threads
- Seamless migration from single core to multiple core to multiple threads: remove TSelector restrictions in PROOF
- Multiple threads reading from the same TTree: parallelization of unzipping
- Parallelization of histograms
- Remove need for locks and atomics
Challenge of introducing new, cleaner, simpler interfaces while still keeping backward compatibility
- The main solution seems to be interface versioning
--
MichelJouvin - 14 May 2014