Storage Layer

The storage system should be structured into semi-independent layers to decouple low-latency from high-latency and random from sequential access patterns:

  1. Front-End Layer (Cache?)
  2. Archive/CDR Layer
  3. Tape/Backup Layer


The Front-End Layer is optimized for low-latency and/or random access [typical HEP application: ROOT and ROOT files] to experiment data and user files. Experiment data files peak at GB file sizes, while user files peak at MB sizes. In the long term this front-end might combine CPU and storage nodes: processing would be optimized for locality, but cross-node access has to remain possible. The data distribution at dataset level should be as flat as possible to reach optimal performance. Data access has to be authenticated and authorized according to the client's user identity. This layer requires volume management at user, group or project level. The need for high availability depends on archive latency and user requirements. The assigned hardware pool might be highly dynamic.
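As a rough illustration of locality-optimized processing with a cross-node fallback, the following C++ sketch picks a replica for a read. The Replica record and its fields are hypothetical names for this illustration, not part of the design above.

    // Minimal sketch: prefer a replica on the local node, fall back to any
    // remote node (cross-node access must remain possible). Illustrative only.
    #include <string>
    #include <vector>

    struct Replica {
      std::string host;   // storage node holding a copy of the file
      std::string path;   // physical path on that node
    };

    // Return the replica to read from: local copy if one exists,
    // otherwise the first remote copy.
    const Replica* selectReplica(const std::vector<Replica>& replicas,
                                 const std::string& localHost) {
      for (const auto& r : replicas)
        if (r.host == localHost) return &r;   // locality-optimized path
      return replicas.empty() ? nullptr : &replicas.front();  // remote fallback
    }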


The Archive Layer is optimized for organized streaming bulk access. It can take advantage of the dataset definition for file placement to achieve high availability and redundancy. The archive layer should be scalable to 10^12 files. Write-once operation can be assumed. Access to the data can be considered an admin access on behalf of a certain user identity, and access needs scheduling. This layer requires volume management at user, group or project level. High availability of the data has high importance. Changes to the assigned hardware pool are pre-defined.
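A minimal sketch of dataset-driven file placement, assuming that all files of one dataset should be co-located on the same small set of nodes with a fixed replica count; the function name and hashing scheme are assumptions for illustration, not the agreed placement algorithm:

    // Map a dataset name to a fixed set of nodes, so every file of that
    // dataset lands on the same nodes with nReplicas copies. Illustrative.
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    std::vector<std::string> placeDataset(const std::string& dataset,
                                          const std::vector<std::string>& nodes,
                                          std::size_t nReplicas) {
      std::vector<std::string> chosen;
      if (nodes.empty()) return chosen;
      const std::size_t start =
          std::hash<std::string>{}(dataset) % nodes.size();
      for (std::size_t i = 0; i < nReplicas && i < nodes.size(); ++i)
        chosen.push_back(nodes[(start + i) % nodes.size()]);
      return chosen;
    }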




Meta Data Store Structure

  • Central Meta Data Store
    • Example → Castor Nameserver
  • Filesystem based
    • Example → Lustre
  • Distributed Meta Data (on Storage Nodes) with CDN
    • Haystack Model: "Traditional file systems are governed by the POSIX standard governing metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system written once and never deleted, with access granted to the world, has little need for such overhead."
  • Semi-distributed Meta Data (sketched below)
    • Master node + Chunkserver
      • Example → googleFS
    • In-Memory Cache with Snapshot/Changelog and CDN or Directory Service
      • Example → XtreemFS, Hadoop HDFS, Flickr (HDFS)
  • Directory Service & Volumes
    • Example → AFS (new AFS with Object Store)
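To make the semi-distributed (master + chunkserver) model concrete, here is a minimal C++ sketch of its lookup path: the master only resolves (file, chunk index) to chunkserver locations, and the file data itself is read directly from a chunkserver, never via the master. All names are illustrative, not taken from googleFS itself.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Master {
      // (file, chunk index) -> chunkservers holding a replica of that chunk
      std::map<std::pair<std::string, uint64_t>,
               std::vector<std::string>> chunkMap;

      std::vector<std::string> locate(const std::string& file,
                                      uint64_t chunk) const {
        auto it = chunkMap.find({file, chunk});
        return it == chunkMap.end() ? std::vector<std::string>{}
                                    : it->second;
      }
    };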

File Store Structure

  • Logical Files
    • Example → NFS3/4.0
  • Physical Files (Object Store with OIDs, File Store with FIDs)
    • Example → Lustre, Castor, Grid SE
  • Files with fixed-size chunk layout (sketched below)
    • Example → googleFS, HDFS (CloudStore)
  • Haystack/Needle file layout (sketched below)
    • Example → Facebook Photostore
  • Objects with variable layout
    • Example → pNFS (NFS 4.1), Lustre striping
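As a rough sketch of the two non-POSIX layouts above: with a fixed-size chunk layout the chunk holding a byte offset is pure arithmetic, so only the chunk-to-node mapping needs metadata; a Haystack/needle layout keeps an in-memory index from object key to (offset, size) inside one large append-only volume file, so a read costs one seek and no per-file POSIX metadata. The chunk size and field names below are assumptions for illustration.

    #include <cstdint>
    #include <unordered_map>

    // Fixed-size chunk layout (googleFS/HDFS style): the chunk index for a
    // byte offset is computed, not stored.
    constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64 MB, illustrative
    inline uint64_t chunkIndex(uint64_t offset) { return offset / kChunkSize; }

    // Haystack/needle layout: one append-only volume file plus an in-memory
    // index key -> (offset, size), rebuilt from the volume on restart.
    struct NeedleLocation {
      uint64_t offset;  // byte offset of the needle in the volume file
      uint32_t size;    // payload size in bytes
    };
    using NeedleIndex = std::unordered_map<uint64_t, NeedleLocation>;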

Meeting Minutes


Present: Andreas, Dennis, Fabrizio, Lukasz, Miguel

  • Review of meta data implementations in large-scale projects (see above)
    • There was an agreement not to build meta data on top of a database
  • First design consensus (see the sketch after these minutes):
    • meta data is stored together with the stored objects (files)
    • container catalogue with authorization meta data on container level
    • location cache: storage index (file → node)
    • reverse location cache: storage inventory (node → filelist)
  • Supported Archive Operations
    • file IO
      • add file (write-once)
        • Andreas: do we allow open(O_EXCL), write(offset,len), close?
      • remove file
      • get file
        • Andreas: do we allow open(O_RDONLY), read(offset,len), close?
      • stat/fstat (comment: can only be partially POSIX compliant)
      • Andreas: additionally for discussion:
        • add replica(s) ?
        • remove replica(s) ?
    • object meta data OP
      • getattr
      • setattr
      • listattr
    • container OP
      • open (read or add)
      • read member
      • add member
      • checkpoint commit (flush)
      • close
      • query container
      • browse ('ls')
        • container
        • container catalogue
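A minimal C++ sketch of the design consensus and the write-once file IO discussed above: a location cache (file → nodes), a reverse location cache (node → filelist), and an addFile that fails on a second add, matching the open(O_EXCL) question. This is an illustration under assumed names, not an agreed API.

    #include <map>
    #include <set>
    #include <stdexcept>
    #include <string>

    class ArchiveIndex {
      std::map<std::string, std::set<std::string>> fileToNodes;  // storage index
      std::map<std::string, std::set<std::string>> nodeToFiles;  // storage inventory

    public:
      // add file (write-once): a second add of the same name must fail
      void addFile(const std::string& file, const std::string& node) {
        if (fileToNodes.count(file))
          throw std::runtime_error("write-once violation: " + file);
        fileToNodes[file].insert(node);
        nodeToFiles[node].insert(file);
      }

      // add replica: the same file on a further node (throws if file unknown)
      void addReplica(const std::string& file, const std::string& node) {
        fileToNodes.at(file).insert(node);
        nodeToFiles[node].insert(file);
      }

      const std::set<std::string>& locate(const std::string& file) const {
        return fileToNodes.at(file);   // location cache: file -> node(s)
      }
      const std::set<std::string>& inventory(const std::string& node) const {
        return nodeToFiles.at(node);   // reverse cache: node -> filelist
      }
    };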


  • Do we allow to change user meta data after a file has been added?
  • Do we allow arbitrary length of user meta data?
  • Can an object be part of several containers? How can we handle that?

Milestones/Timeline Draft

Phase 1

  • Running production pool in summer 2010 (August) with TSM backup
    • Tier-3 sized pool (20-100 servers?)
    • Redundancy via replica
    • Performance/Scalability
      • O(10^7) files
      • 1000 Read Open/s + 250 concurrent Creations/s
    • Access via xroot protocol
    • basic POSIX compliance (directory views)
      • no write-once simplification
      • permission on directory level possible
    • Strong authentication (krb5 + X509) - only physical UIDs/GIDs
    • Admin tools for
      • fsck
      • server drain
      • server restore
    • Soft Quota
    • Policy-driven cluster segmentation (see the sketch after this list)
      • by identity (user, group, project), by container, by meta data match
    • Monitoring
      • xroot
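A minimal sketch of the policy-driven cluster segmentation milestone: the first matching rule, by identity, container prefix or meta data match, selects the target pool. The rule structure, match semantics and default pool are assumptions for illustration only.

    #include <string>
    #include <vector>

    struct Request {
      std::string user, group, project;
      std::string container;   // e.g. "/archive/cms/run2010a" (hypothetical)
      std::string metaMatch;   // e.g. "type=raw" (hypothetical)
    };

    struct Rule {
      enum Kind { Identity, Container, Meta } kind;
      std::string pattern;     // value to match for this kind
      std::string pool;        // target pool if the rule matches
    };

    std::string selectPool(const Request& rq, const std::vector<Rule>& rules) {
      for (const auto& r : rules) {
        bool hit = false;
        switch (r.kind) {
          case Rule::Identity:
            hit = (rq.user == r.pattern || rq.group == r.pattern ||
                   rq.project == r.pattern);
            break;
          case Rule::Container:
            hit = (rq.container.rfind(r.pattern, 0) == 0);  // prefix match
            break;
          case Rule::Meta:
            hit = (rq.metaMatch == r.pattern);
            break;
        }
        if (hit) return r.pool;
      }
      return "default";        // fall-back pool, illustrative
    }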

Phase 2

  • Scalable Front-end, Archive & Tape Layer



Definitions

Dataset

A two-dimensional grouping of files (members) with unique authorization meta data for each dataset member.

File Layout

A description of the algorithm used to assemble a complete file.

Meta Data Store

A persistent storage of file/object meta data.

Storage Cluster

A group of storage pools with a unique logical namespace.

Storage Pool

A group of storage nodes.

Storage Node

A node providing access to single or multiple storage volume groups.

Storage Volume Group

Disk space with static size and a unique name across a storage cluster.

Storage Volume

Part of a volume group assigned to a user or group.

-- AndreasPeters - 16-Mar-2010
