Site Storage Survey

Introduction

In this document we summarise the results of the "Storage Chapter" of the WLCG Site Survey. The questions were compiled by the QoS WG, whose goal is to explore mechanisms for providing WLCG storage at lower cost by introducing a new consideration: Quality of Service. The WG used the survey to assess the current situation and identify the directions being pursued by WLCG sites.

In the first part of this document we summarise the results of the survey. We then discuss the conclusions to be drawn and propose follow-up actions to be taken by the community.

Results

The raw survey results are available on the WG twiki. Here we present summaries of the answers to each question.

By Question

Q1 - underlying media

At the T1s, the majority of sites that declared a media grade are using enterprise-grade disks, most commonly SAS. There are nevertheless sites relying on commodity hardware.

At the T2s, enterprise-grade hardware is also prevalent (roughly half of sites). SATA is slightly more common than SAS. One T2 reported having tape storage available.

SSD storage is commonly available, mostly for service use (hot-swap, journaling, caching).

Q2 - media combinations

The most common deployment is RAID6 with 12-16 disks. Over 2/3 of respondents mentioned this, not counting a few sites on equivalent ZFS configurations.

The other installations are almost all JBOD, with Ceph, EOS, HDFS and GPFS all mentioned as being used to implement redundancy.
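
As a rough illustration of the capacity overheads behind these two deployment styles, the sketch below compares usable-capacity fractions. The RAID6 12+2 layout comes from the answers above; the replication and erasure-coding parameters are our own illustrative assumptions, not survey data.

    # Usable-capacity fractions for the layouts mentioned by respondents.
    # Illustrative sketch only; layouts other than RAID6 12+2 are assumptions.

    def raid6_usable(disks_per_array):
        """RAID6 dedicates two disks per array to parity."""
        return (disks_per_array - 2) / disks_per_array

    def replica_usable(copies):
        """n-way replication (e.g. a Ceph replicated pool) keeps 1/n usable."""
        return 1.0 / copies

    def erasure_usable(k, m):
        """k data + m coding chunks (e.g. a Ceph erasure-coded pool)."""
        return k / (k + m)

    print(f"RAID6 12+2         : {raid6_usable(14):.0%} usable")      # ~86%
    print(f"3-way replication  : {replica_usable(3):.0%} usable")     # ~33%
    print(f"Erasure coding 10+2: {erasure_usable(10, 2):.0%} usable") # ~83%
    print("Pure JBOD          : 100% usable, no redundancy")

The point is simply that erasure coding over JBOD sits in the same overhead range as RAID6; only pure JBOD removes the redundancy cost entirely (see the conclusions below).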

Sites typically use a variety of solutions targeted at different use cases, such as grid storage, POSIX shared filesystems, VM block devices, fast SSD and S3 services.

Q3 - storage system

The grid-facing systems reported are in line with expectations. T1s predominantly run dCache, with DPM, EOS, StoRM and Echo also present. GridFTP and Xrootd are supported by all sites, with some offering HTTPS or WebDAV as well.

T2s are oriented towards DPM, dCache, StoRM and Xrootd. The most common protocols are Xrootd and GridFTP, followed by HTTP/WebDAV and SRM.
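
As an aside on what this multi-protocol support means on the client side, here is a minimal sketch using the gfal2 Python bindings; the endpoint hostnames and paths are placeholders, not taken from the survey.

    # Illustrative only: hostnames and paths are placeholders.
    # gfal2 lets a client address the same file through whichever door a site exposes.
    import gfal2

    ctx = gfal2.creat_context()
    for url in ("root://storage.example.org//dpm/example.org/home/vo/file",    # xrootd
                "gsiftp://storage.example.org/dpm/example.org/home/vo/file",   # GridFTP
                "davs://storage.example.org/dpm/example.org/home/vo/file"):    # HTTPS/WebDAV
        info = ctx.stat(url)
        print(url, info.st_size)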

Behind the grid layer, more diversity can be seen.

The grid layer can be connected to the data stores in a number of ways:

  • Direct mounts of local fs
    • Potentially with RAID
  • Through a mediating service
    • Mounting a shared fs which could provide redundancy
      • CephFS, GPFS
    • Block devices from a service which could provide redundancy
    • Deeper integration with an intermediate system
      • HDFS + xrootd
      • Ceph + xrootd (Echo)

A POSIX layer exists for a number of uses, sometimes combined, and is typically provided by GPFS, Lustre or CephFS:

  • direct user access
    • some sites report providing this via a FUSE mount, often with HDFS.
  • input buffer (ARC style)
  • mounted by grid-style disk servers
  • datastore for StoRM
    • used with both Lustre and GPFS.

Further investigation would be needed to quantify the distribution and popularity of all the possible deployment scenarios.

Q4 - effort

Disclaimer: the reported FTE data are rather incomplete, with only 10 of 15 Tier 1 sites and 38 of 63 Tier 2 sites reporting a number.

For Tier 1 sites, the average reported effort is 2.63 FTE. It is important to note that the reported numbers did not distinguish between tape effort and disk effort.

For Tier 2 sites, the average reported effort is 0.64 FTE. A scatter plot of FTE versus capacity [TB], grouped by country, has been produced.
The plot suggests a base cost of maintaining the system of around 0.5 FTE, above which effort scales with the volume of storage.
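
One simple way to quantify this "base effort plus scaling" pattern is a least-squares fit of a linear model to the per-site points in the scatter plot. The sketch below illustrates the idea; the function name and unit conversion are our own, and the per-site numbers (available on the WG twiki) are not reproduced here.

    # Sketch of the effort model suggested by the T2 scatter plot:
    #   FTE ~ base + rate * capacity
    # The per-site (capacity, FTE) values are not reproduced here.
    import numpy as np

    def fit_effort_model(capacity_tb, fte):
        """Fit FTE = base + rate * capacity_TB; returns (base_fte, fte_per_pb)."""
        rate, base = np.polyfit(np.asarray(capacity_tb, dtype=float),
                                np.asarray(fte, dtype=float), deg=1)
        return base, rate * 1000.0  # per-TB slope expressed per PB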

Overall, due to the diverse nature of the storage sites and the incomplete data, no definitive conclusion can be drawn.

[Figure: T2 scatter plot of FTE vs capacity (T2_FTE_chart.png)]

Q5 - "storageless sites"

The status of the T0 and the T1s is easily summarised: they will not become storageless, as they expect to act as origin storage for larger regions. The main worries of the T1s were (a) that it is unclear whether the current storage and network throughput would suffice to support a larger number of storageless sites, and (b) that those already experimenting with such setups observe relatively low hit rates on the intermediate caches between origin and destination. So while a majority think that smaller sites would benefit, the main suggestions from the T1s were that workflows and processing need to adapt in several respects: (1) brokering needs to take these new site setups into account, for example to properly size the intermediate caches; (2) job IO profiles need to be reflected in the scheduling to reduce the influence of network effects, for example by focusing low-IO event generation or simulation on storageless sites; and (3) performance profiles of job throughput and walltime need to be scrutinised much more closely.

The situation for the T2s and T3s is more diverse. Contrary to the T1s' suggestions, the vast majority of T2s are neither planning nor wanting to move to storageless setups. A handful of sites responded that they are interested or actively looking into it, though without concrete plans to switch. The main reasons given for non-adoption were (a) that local storage is required by a variety of different communities, (b) that the main effort at the site does not come from storage administration, and (c) that their infrastructure, for example the network, is not sufficient to support such a setup. One stand-out reply was that going storageless would cut the funding of the data centre.

Q6 - non WLCG communities

Practically all T1s already share their resources between WLCG and other communities with few problems. Notably, the LHC experiments drive the technology decisions, which is starting to become problematic at a few T1s supporting a wide variety of other communities, as it potentially means running two separate installations in the same data centre. In terms of capacity, at least one T1 mentioned that WLCG will not be its largest consumer in the future.

The situation for T2s and T3s is almost equally split: approximately half of the sites are already shared, the other half are not. Of the shared sites, practically all run without problems and do not expect to change. As with the T1s, a handful of sites mentioned that WLCG drives the technology decisions in their setup, with some sites worrying that WLCG decisions are prohibitive with respect to other communities; they would rather see more widespread and widely-accepted solutions. The need for an efficient POSIX infrastructure at the sites was mentioned explicitly several times.

Q7 - future directions

By far the most common keyword here is "Ceph", which is cited in numerous contexts:

  • Provision of redundancy (current RAID systems could convert to JBOD).
    • Via CephFS : Existing systems (dCache, DPM) would continue to work on top
    • Via libRados: some existing systems are ready, others would need to be adapted (open question: which systems are ready and which are not?); see the sketch after this list
  • Provision of S3, though not directly usable in this form by the experiments.
  • Provision of POSIX as a stage-in cache (ARC style)
  • Migration from Lustre
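
To illustrate what "via libRados" means in practice, here is a minimal sketch using Ceph's Python rados binding to store and read an object directly in a pool, bypassing any POSIX layer. The pool name, object name and conffile path are placeholders.

    # Minimal librados sketch; pool, object and conffile are placeholders.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("grid-data")            # hypothetical pool
        try:
            ioctx.write_full("example-object", b"payload") # store an object
            print(ioctx.read("example-object"))            # read it back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()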

There is some examination of newer filesystems: OpenIO, BeeGFS.

Some inconclusive experimentation with media (shingled, low-endurance SSD, consumer) was mentioned, with no one media type receiving much attention.

Two sites reported pursuing densification in an effort to cut costs. This involves minimising the frontend and network infrastructure overheads per disk, for example with 192 × 12 TB disks per server (roughly 2.3 PB of raw capacity behind a single frontend).

Many sites have no exploratory activities, and some explicitly say they have no effort available for them. Others are worried about reliability and about breaching MoU conditions.

Q8 - experiment workflows

Several comments were made about the granularity of requirements. Such considerations appear in different scenarios:

  • Better expression of what is needed at procurement time (MoU-level requirements).
  • Identification ("tagging") at the job level, better brokering of different workflows
    • This implies similar tagging of resources to allow matching - QoS classes (a hypothetical illustration follows this list)
  • Note - a third alternative, allowing experiments to modify QoS within a single system, is not mentioned.
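
As a purely hypothetical illustration of what such tagging could look like (none of these field names or values come from the survey), a QoS class might be expressed as a small structure attached both to resources and to workflows so that a broker can match them:

    # Hypothetical illustration only; field names and values are invented.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class QoSClass:
        name: str          # e.g. "analysis-input", "scratch"
        loss_rate: float   # acceptable annual data-loss probability
        redundancy: str    # e.g. "raid6", "ec-10+2", "jbod", "tape"
        latency: str       # e.g. "online", "nearline"

    def satisfies(provided: QoSClass, requested: QoSClass) -> bool:
        """A resource matches a workflow's request if it loses data no more
        often than allowed and offers the requested latency class."""
        return (provided.loss_rate <= requested.loss_rate
                and provided.latency == requested.latency)

    # A broker could then match a job tagged with a requested class against
    # the classes advertised by each site.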

A number of concerns were expressed about the cost and the disruptive nature of inbound WAN data access, plus some observations about this being an explicit part of many future "data lake" models. This is an "access to storage" cost.

There were some concerns about system overloads from concurrent connections. Can throttling help? Are the systems up to date and properly tuned?

The responses included a couple of calls for better tools, provided by experiments to sites, to allow sites to adapt workflows. For example, pausing inbound transfers if storage is under load from compute.

It was noted that less divergence between LHC experiments would help sites.

There were some tape related comments regarding recall patterns, similar to what has already been collected by the Archival WG.

Integrated Picture

Conclusions

Current situation

There were few surprises in the questions pertaining to the current situation. Two main groupings of sites emerge: those using RAID6 (or a close equivalent) with a single-replica storage system, and those using JBOD with redundancy introduced at higher layers. In the latter group, this redundancy can come from systems such as Ceph or HDFS, or it can be natively supported in the storage system (e.g. EOS). While these configurations offer more flexibility and reduced recovery times, there was little indication of sites moving away from their RAID solutions, as this could necessitate the disruptive adoption of a new technology. It remains to be seen whether increasing disk sizes (and therefore longer RAID rebuild times) will force a re-evaluation. What is clear is that such a transition will not bring major cost savings, as the current RAID ratios would liberate at most around 15% if redundancy were eliminated (in a RAID6 12+2 array, parity occupies 2 of 14 drives, i.e. roughly 14% of the raw capacity).

The conclusion above can alternatively be stated as "the only way to reduce the redundancy overhead is to abandon redundancy and use straight JBOD". The viability of such a move depends significantly on automation of data loss reporting (by sites) and recovery (by experiments).

Most sites offer some kind of POSIX service, which may be entirely independent of the main grid storage (e.g. the ARC input buffer). Convergence of multiple systems at a site offers the prospect of reducing operational overheads.

Answers indicate that storage operation is not the major cost in storage provision. Initiatives aimed at reducing the cost of storage operation (e.g. multi-site operation or caching/volatile pools) should therefore consider what impact they can realistically achieve.

Community directions

Data lakes

Despite a lot of discussion about new infrastructure architectures which involve storageless sites or the conversion of smaller installations to caches, such directions do not figure significantly in sites' current plans. The vast majority of T2s are neither planning nor wanting to move to storageless setups.

The survey indicated concern about the cost of WAN data access to custodial sites. This is a cost of storage access which is not well accounted for. Issues include the necessary investment in networking infrastructure, the chaotic and unpredictable nature of the accesses, and the difficulty of judging how to dimension a service for such a role (how many simultaneous clients, at what interaction rate, are to be expected?). A caching layer is a proposed remediation, but cost savings here have yet to be demonstrated, and the survey itself indicates that storage operation (expected to be lighter for a cache) is not the main cost driver.
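
To make the dimensioning question concrete, here is a back-of-the-envelope sketch (entirely our own illustration; the numbers are placeholders, not survey results) of how the WAN load on an origin site depends on the cache hit rate. The low hit rates reported in Q5 mean that most of the demand still reaches the custodial storage.

    # Back-of-the-envelope model: only cache misses cross the WAN to the origin.
    # All values are placeholders.

    def wan_load(client_demand_tb_per_day, cache_hit_rate):
        """WAN traffic back to the origin site, in TB/day."""
        return client_demand_tb_per_day * (1.0 - cache_hit_rate)

    demand = 100.0  # placeholder: aggregate read demand of downstream sites [TB/day]
    for hit in (0.1, 0.5, 0.9):
        print(f"hit rate {hit:.0%}: {wan_load(demand, hit):.0f} TB/day over the WAN")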

Storage

One community direction is evident - the interest in Ceph. This interest is not concentrated on a single deployment scenario however. Sites are interested in using it as a redundancy layer over JBOD with a grid storage system on top, as well as for provision of an S3 service and POSIX via CephFS.

No strong indication of directions in adoption of media variations (SMR, consumer, etc) is evident. There are a few isolated experiments with this and also server densification.

Experiment Workflows and QoS

Discussion of experiment workflows indicated an acknowledgement that there are inefficiencies introduced by the granularity of our classifications ("disk", "tape"). There is a large variety of workflows which could be better matched to particular resources, allowing such resources to be delivered more cheaply where lower performance or reliability is acceptable. This is exactly what the discussion on "QoS classes" is about.

The survey also indicated the potential of tools, provided by the experiments to the sites, to improve the sites' ability to deliver a stable service.

Actions

On the basis of the results discussed, the WG proposes a small number of themes which can be followed up in the community.

Proposed classification:

  • Site level
    • Procurement, densification and media
      • Purchasing strategy, server density and overheads, SMR, SSDs
      • Networking & implications of the data lake model on origin storage
      • This is trying to provide the current QoS, unchanged, at a lower price
    • Software defined storage ("SDS")
      • Configurable storage characteristics
        • Replication, erasure coding, media transitions
        • Future of RAID-6 and higher capacity disks
        • Pure JBOD operation
      • Stack consolidation - serving as many different use cases as possible with the same system
      • Introducing "SDS" into the stack
        • Use of Ceph, HDFS etc behind existing grid storage systems
        • Direct use of cluster FS tech by experiments (e.g. CephFS).
      • Identification of which configurations can be mapped onto which WLCG workflows
        • -> QoS classes
  • Grid level
    • WLCG QoS classes
      • Definition of the most useful set of QoS classes allowing WLCG to progress beyond "disk" and "tape"
      • Tagging and brokering
      • MoU
    • Client-driven QoS
      • Interfaces, clients, orchestration
        • including bring-online
      • Data Lifecycle

Followup actions should be adapted to the communities involved.

"Site level"

These lower-level issues should be followed up as closely as possible to the community of site operators. We should discuss in DOMA what the best mechanisms are for triggering a community process. An association with HEPiX is a natural starting point, possibly also with the Cost Modelling WG. We should identify the right forum to allow sites to report their experiences and findings, in the hope of stimulating exploration of the possibilities available.

  • Invite sites to begin a classification of their current offerings
    • Do this after the first set of recommended classes has been produced
    • Sites could also add any classes they provide (or are interested in providing) that do not figure in the list
  • Invite sites to report on current relevant directions
    • Media diversity
    • Redundancy layer
      • In particular, introducing this into existing systems
  • Attempt the "JBOD experiment" in conjunction with an experiment.
    • Remove all redundancy and work on handling data loss gracefully
    • As a cache? Is this being done already in the Access WG?

"Grid level"

The "grid level" issues can be pursued in different forums to the "site level" ones.

  • The WG will organise a dedicated consultation meeting with each of ALICE, CMS and LHCb. A standard set of questions could be used to aid comparison and aggregation.
    • Identification of QoS classes useful to the experiment
    • Some reference use cases for these classes
    • Plan for how a new QoS class can be introduced, integrated, tested and exploited
      • Start with one (potentially already available)
  • The WG will digest the results of this consultation, along with the survey conclusions, and publish its white paper. This will summarise commonalities and identify where community decisions need to be made.
  • Sites should be solicited to provide new experimental storage areas with novel QoS features in order to trigger exploration of what adaptations are needed over the stack in order to exploit them. Experiment/Site combinations could be encouraged to report.