Concepts
Requirements
Storage Layer
The storage system should be structured into semi-independent layers in order to decouple low-latency from high-latency and random from sequential access patterns:
- Front-End Layer (Cache?)
- Archive/CDR Layer
- Tape/Backup Layer
Front-End Layer
The Front-End layer is optimized for low-latency and/or random access (typical HEP application: ROOT and ROOT files) to experiment data and user files. Experiment data files peak at GB file sizes, while user files peak at MB sizes. In the long term this front-end might combine CPU and storage nodes. Processing would be optimized for locality, but cross-node access has to be possible. The data distribution at dataset level should be as flat as possible to reach optimal performance. Data access has to be authenticated and authorized according to the client user identity. This layer requires volume management on user, group or project level. The need for high availability depends on archive latency and user requirements. The assigned hardware pool may be highly dynamic.
Archive/CDR Layer
The Archive layer is optimized for organized streaming bulk access. It can take advantage of the dataset definition for file placement to achieve high availability and redundancy. The archive layer should be scalable to 10^12 files. Write-once operation can be assumed. Access to the data can be considered admin access performed on behalf of a certain user identity. Access needs scheduling. This layer requires volume management on user, group or project level. High availability of data has high importance. The assigned hardware pool has a pre-defined dynamic.
Tape/Backup Layer
Undefined
Technologies
Central Meta Data Store
Database
Example → Castor Nameserver
Filesystem based
Example → Lustre
Distributed Meta Data (on Storage Nodes) with CDN
Haystack Model
http://www.niallkennedy.com/blog/2009/04/facebook-haystack.html
"Traditional file systems are governed by the POSIX standard governing metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system written once and never deleted, with access granted to the world, has little need for such overhead."
http://www.facebook.com/note.php?note_id=76191543919
Semi-distributed Meta Data
Master node + Chunkserver
Example → googleFS
http://en.wikipedia.org/wiki/Google_File_System
In-Memory Cache with Snapshot/Changelog and CDN or Directory Service
Example → XtreemFS, Hadoop HDFS, Flickr (HDFS)
http://www.xtreemfs.org
http://hadoop.apache.org/hdfs/
http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
Directory Service & Volumes
Example → AFS (new AFS with Object Store)
File Store Structure
Logical Files
Example → NFS3/4.0
Physical Files (Object Store with OIDs, File Store with FIDs)
Example → Lustre, Castor, Grid SE
Files with fixed-size chunk layout
Example → googleFS, HDFS (CloudStore)
Haystack/Needle Filelayout
Example → Facebook Photostore
Objects with variable layout
Example → pNFS (NFS 4.1), Lustre Striping
http://www.pnfs.com/
http://www.lustre.org/
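As an illustration of the fixed-size chunk layout listed above, the following sketch maps a logical file offset to a chunk index and an intra-chunk offset. The 64 MB chunk size is only an assumption for the example (GFS/HDFS use defaults in this range); real systems make it configurable.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: a 64 MB chunk size is assumed here.
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;

struct ChunkLocation {
    uint64_t chunk_index;     // which fixed-size chunk holds the byte
    uint64_t offset_in_chunk; // byte offset inside that chunk
};

// Map a logical file offset to its chunk and intra-chunk offset.
ChunkLocation locate(uint64_t file_offset) {
    return { file_offset / kChunkSize, file_offset % kChunkSize };
}

int main() {
    ChunkLocation loc = locate(200ULL * 1024 * 1024); // byte at 200 MB
    std::printf("chunk %llu, offset %llu\n",
                (unsigned long long)loc.chunk_index,
                (unsigned long long)loc.offset_in_chunk);
    return 0;
}
```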
Meeting Minutes
16.3.2010
Present: Andreas, Dennis, Fabrizio, Lukasz, Miguel
- Review of meta data implementations in large scale projects (see above)
- There was an agreement not to build meta data on top of a database
- First design consensus:
- meta data is stored together with the stored objects (files)
- container catalogue with authorization meta data on container level
- location cache: storage index (file → node)
- reverse location cache: storage inventory (node → filelist); see the data-structure sketch below
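A minimal sketch of the two caches from the design consensus, assuming simple in-memory hash maps; the type and structure names (FileId, NodeId, LocationCaches) are hypothetical and only illustrate the file → node and node → filelist mappings.

```cpp
#include <string>
#include <unordered_map>
#include <vector>
#include <iostream>

// Hypothetical identifiers; the consensus above only fixes the two mappings.
using FileId = std::string;  // logical file name or identifier
using NodeId = std::string;  // storage node identifier

struct LocationCaches {
    // location cache: storage index (file -> node)
    std::unordered_map<FileId, NodeId> storage_index;
    // reverse location cache: storage inventory (node -> file list)
    std::unordered_map<NodeId, std::vector<FileId>> storage_inventory;

    void add(const FileId& file, const NodeId& node) {
        storage_index[file] = node;
        storage_inventory[node].push_back(file);
    }
};

int main() {
    LocationCaches caches;
    caches.add("/archive/run1234/events.root", "node-042");
    std::cout << caches.storage_index["/archive/run1234/events.root"] << "\n";
    return 0;
}
```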
- Supported Archive Operations
- file IO
- add file (write-once)
- Andreas: do we allow: open(O_EXCL), write(offset,len), close? (see the POSIX sketch after this list)
- remove file
- get file
- Andreas: do we allow: open(O_RDONLY), read(offset,len), close?
- stat(fstat) (comment: can only be partially POSIX compliant)
- Andreas: additionally for discussion:
- add replica(s) ?
- remove replica(s) ?
- object meta data OP
- container OP
- open (read or add)
- read member
- add member
- checkpoint commit (flush)
- close
- query container
- browse ('ls')
- container
- container catalogue
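To make the write-once file IO questions above concrete, here is a minimal POSIX sketch of "add file" and "get file" using open(O_EXCL)/pwrite and open(O_RDONLY)/pread. The helper names and the path are hypothetical, and the O_EXCL create is only one possible way to enforce the write-once rule.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

// Write-once "add file": O_EXCL makes the create fail if the file
// already exists, enforcing a single add per file.
int add_file(const char* path, const void* buf, size_t len, off_t off) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) { perror("open(O_EXCL)"); return -1; }
    ssize_t n = pwrite(fd, buf, len, off);   // write(offset,len)
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

// "get file": read-only open followed by offset/length reads.
ssize_t get_file(const char* path, void* buf, size_t len, off_t off) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open(O_RDONLY)"); return -1; }
    ssize_t n = pread(fd, buf, len, off);    // read(offset,len)
    close(fd);
    return n;
}

int main() {
    const char* msg = "event data";
    if (add_file("/tmp/demo.archive", msg, strlen(msg), 0) == 0) {
        char out[32] = {0};
        get_file("/tmp/demo.archive", out, sizeof(out) - 1, 0);
        std::printf("read back: %s\n", out);
    }
    return 0;
}
```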
Discussion
- Do we allow changing user meta data after a file has been added?
- Do we allow user meta data of arbitrary length?
- Can an object be part of several containers? How can we handle that?
Milestones/Timeline Draft
Phase 1
- Running production pool in summer 2010 (August) with TSM backup
- Tier-3 sized pool (20-100 servers?)
- Redundancy via replica
- Performance/Scalability
- O(10^7) files
- 1000 Read Open/s + 250 concurrent Creations/s
- Access via xroot protocol
- basic POSIX compliance (directory views)
- no write-once simplification
- permissions on directory level possible
- Strong authentication (krb5 + X509) - only physical UID/GIDs
- Admin tools for
- fsck
- server drain
- server restore
- Soft Quota
- Policy-driven cluster segmentation (see the sketch after this list)
- by identity (user, group, project), by container, by meta data match
- Monitoring
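One possible reading of the policy-driven cluster segmentation milestone is sketched below: a request is routed to a storage pool by container or project rules, falling back to a default pool. The rule precedence (container before project) and all names are assumptions made only for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <iostream>

// All names here are hypothetical; the milestone only asks for
// segmentation by identity, by container, or by meta data match.
struct Request {
    std::string user, group, project, container;
};

struct SegmentationPolicy {
    std::unordered_map<std::string, std::string> by_project;    // project -> pool
    std::unordered_map<std::string, std::string> by_container;  // container -> pool
    std::string default_pool = "default";

    // Pick the target pool for a request; container rules win over
    // project rules in this sketch, which is an arbitrary choice.
    std::string select_pool(const Request& r) const {
        auto c = by_container.find(r.container);
        if (c != by_container.end()) return c->second;
        auto p = by_project.find(r.project);
        if (p != by_project.end()) return p->second;
        return default_pool;
    }
};

int main() {
    SegmentationPolicy policy;
    policy.by_project["atlas"] = "pool-atlas";
    policy.by_container["/archive/alice/raw"] = "pool-alice-raw";
    std::cout << policy.select_pool({"jane", "zp", "atlas", "/scratch/jane"}) << "\n";
    return 0;
}
```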
Phase 2
- Scalable Front-end, Archive & Tape Layer
Glossary
Dataset
A two-dimensional grouping of files (members) with unique authorization meta data for each dataset member.
File Layout
A description of any algorithm to assemble a complete file.
Meta Data Store
A persistent storage of file/object meta data.
Storage Cluster
A group of storage pools with a unique logical namespace.
Storage Pool
A group of storage nodes.
Storage Node
A node providing access to single or multiple storage volume groups.
Storage Volume Group
Disk space with static size and a unique name across a storage cluster.
Storage Volume
Part of a volume group assigned to a user or group.
-- AndreasPeters - 16-Mar-2010