DRAFT - CASTOR Tape Subsystem Twiki


Introduction

The Tape Subsystem comprises all functionality directly dealing with storage on, and management of, magnetic tapes (volumes), tape servers, tape drives and tape libraries. This ranges from low-level tape positioning and access, through disk-to-tape copying, up to the management of tape drives, queues and the volume database.

Team and Contributors

The members of the development team (hosted in IT/DM/SGT) work in close collaboration with the tape operations team (IT/FIO/TSI), who are the main source of operational requirements. Coordination is also ensured with the DM architecture team.

  • Team members:
    • Steve Murray IT/DM/SGT (investigations; overall architecture coordination; rtcpd; VDQM2)
    • Giulia Taurelli IT/DM/SGT (Repack2; rtcpclientd; Recall/Migration policy framework)
    • German Cancio IT/DM/SGT (investigations; planning)
    • Nicola Bessone IT/DM/SGT (starting Sept. 08)
  • (Main) Contributors:
    • Vlado Bahyl IT/FIO/TSI; Tim Bell IT/FIO/TSI; Charles Curran IT/FIO/TSI; Arne Wiebalk IT/FIO/TSI; Olof Barring IT/FIO/FS

Work Areas

Tape and remote copy daemons (taped/rtcpclientd/rtcpd)

  • Goals:
    1. Acquire knowledge on tape technology: Software, Hardware, Media
    2. Understand current performance limitations in both tape write and read operations
    3. Address operational recoverability and reliability requirements and their impact on the tape data structure and format
    4. Re-engineer existing old/legacy code base using standard CASTOR development principles and foundation libraries, taking into account the above, following a low-risk migration path

Acquire tape technology knowledge

The current CASTOR tape software is outdated and not well understood by the current development team. The technology used (tape drives, SCSI devices, etc) requires in-depth technical understanding which needs to be built up. Operational knowledge is available in IT/FIO/TSI; some development knowledge is available via A. Wiebalk which needs to be transferred to the developers in IT/DM.

Performance limitations with tape reads and writes

The current tape data formats used in CASTOR were set in place in the 1990's (ANSI AUL, NL - see link below). The NL format is not used for production data as it lacks metadata critical for media recovery (see below). There are a number of problems in terms of performance:

  • Reads: The number of files read per tape mount is very low (around 1.5 files on average, while the overhead of a tape mount ranges from 1 to 3 minutes). This is due to:
    1. related files are not written together but rather spread over available tape servers, in order to increase write performance
    2. read operations for data on the same tape are not grouped (in order to avoid stale jobs on the batch farms)
  • Writes: The use of migration policies enables building up file streams to be sent to tape. However, the efficiency of writing small files is low, as the AUL format requires writing 3 tape marks per data file (one after the header labels, one after the data, and one after the trailer labels); each tape mark takes 5-7 seconds to write (a rough throughput estimate is sketched below). Encouraging the experiments to increase their data file sizes has not been fully successful so far.
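
To illustrate the tape-mark overhead for small files, the sketch below estimates the effective per-file write throughput, assuming 3 tape marks at ~6 seconds each and a nominal drive speed of 160 MB/s (illustrative figures based on the numbers quoted on this page; actual values depend on the drive and media):

```cpp
#include <cstdio>

// Rough model of the AUL write overhead described above: each data file costs
// three tape marks (after the header labels, the data and the trailer labels)
// on top of the streaming time. Figures are illustrative: ~6 s per tape mark
// (the 5-7 s range quoted above) and a nominal drive speed of 160 MB/s.
int main() {
    const double tapeMarkSeconds = 6.0;
    const double marksPerFile    = 3.0;
    const double driveMBperSec   = 160.0;

    const double fileSizesMB[] = {10, 100, 1000, 10000};
    for (const double sizeMB : fileSizesMB) {
        const double totalTime = sizeMB / driveMBperSec + marksPerFile * tapeMarkSeconds;
        const double effective = sizeMB / totalTime;
        std::printf("%6.0f MB file -> %6.1f MB/s effective (%4.1f%% of nominal)\n",
                    sizeMB, effective, 100.0 * effective / driveMBperSec);
    }
    return 0;
}
```

Under these assumptions a 10 MB file achieves well under 1 MB/s effective, whereas a 10 GB file comes close to the nominal drive speed.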

Operational recoverability and reliability requirements

  • Format: Despite its performance disadvantages, the AUL format provides metadata which is critical for reliability and media recoverability (e.g. file name server ID, size, drive ID, etc.). In order to avoid write positioning errors leading to data corruption, the AUL trailer label of the preceding file is checked when a new file is appended to the end of a tape and compared with the information kept in the database. The AUL header/trailer information also allows the identification of affected files in case of media problems (e.g. scratched/folded tape sections). Any alternative format must provide metadata that facilitates media recovery and safe positioning. A hypothetical sketch of such an append-time check is given below.
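
The following is a minimal, hypothetical sketch of the kind of append-time consistency check described above; the field names and structures are invented for illustration and do not correspond to the actual CASTOR label or database layout:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the append-time safety check described above: before
// a new file is written at the end of a tape, the trailer label of the last
// file on the tape is compared with what the database expects. Field names
// and types are invented for this illustration; they are not the real CASTOR
// label or database structures.
struct TrailerInfo {
    std::string   fileId;        // name server ID recorded in the trailer
    std::uint64_t sizeBytes;     // file size recorded in the trailer
    int           fileSequence;  // position of the file on the tape
};

void checkBeforeAppend(const TrailerInfo& onTape, const TrailerInfo& inDatabase) {
    // Any mismatch suggests a positioning error or a stale database entry;
    // refusing to write avoids overwriting (and corrupting) existing data.
    if (onTape.fileId != inDatabase.fileId ||
        onTape.sizeBytes != inDatabase.sizeBytes ||
        onTape.fileSequence != inDatabase.fileSequence) {
        throw std::runtime_error("trailer label does not match database: refusing to append");
    }
}
```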

  • Stability:
    • The software components, in particular rtcpclientd, are not robust against failures (e.g. daemon crashes). rtcpclientd is stateful and cannot be replicated (for load balancing or redundancy). After a crash, a manual database cleanup operation is needed.
    • Some of the software components, like rtcpd, are extremely delicate and critical, as bugs or instabilities may (silently) corrupt production data stored in CASTOR. Also, backwards compatibility with CASTOR1 is required until CASTOR1 is completely stopped or confined. A development/deployment model is therefore required in which modules are updated/replaced incrementally, instead of a big-bang approach.

Re-engineering

In order to address the above-mentioned issues, a gradual re-engineering of the tape layer components is proposed. A design document with the details will need to be prepared. The main items are briefly described here:

  • Tape format: A new tape format will be defined in addition to the existing AUL and NL. While the AUL structure will be kept, at least initially, the payload inside each AUL data file will consist of an aggregation of multiple CASTOR files, in order to reduce the number of tape marks. The aggregation will be generated by the tape layer itself. The incoming stream (list of files) to be migrated will be aggregated up to a configurable maximum total size (e.g. 10 GB) and/or a configurable maximum number of files (e.g. 1000 files); a sketch of this grouping is shown below. If a file exceeds the maximum total size, it will be written in a separate aggregation consisting of that single file. On hardware with efficient tape mark handling, the number of files per aggregation can be decreased. Aggregations are not a static concept but are dynamically (re-)created on every migration; files (more precisely: segments) may be rearranged into different aggregations, e.g. after repacking of a tape.
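
As an illustration of the grouping rule described above, the sketch below packs an incoming file stream into aggregations; the types, names and limits are invented for this sketch and do not correspond to actual CASTOR code:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the grouping rule described above (names and types are invented):
// incoming files are packed into aggregations up to a configurable maximum
// total size and maximum number of files; an oversized file gets an
// aggregation of its own.
struct FileToMigrate {
    std::string   nsFileId;   // name server ID of the file (hypothetical field)
    std::uint64_t sizeBytes;
};
using Aggregation = std::vector<FileToMigrate>;

std::vector<Aggregation> buildAggregations(const std::vector<FileToMigrate>& stream,
                                           const std::uint64_t maxBytes,  // e.g. 10 GB
                                           const std::size_t   maxFiles)  // e.g. 1000
{
    std::vector<Aggregation> result;
    Aggregation current;
    std::uint64_t currentBytes = 0;

    for (const FileToMigrate& f : stream) {
        const bool oversized = f.sizeBytes > maxBytes;
        const bool overflow  = currentBytes + f.sizeBytes > maxBytes ||
                               current.size() + 1 > maxFiles;
        if (!current.empty() && (oversized || overflow)) {
            result.push_back(current);   // close the aggregation being built
            current.clear();
            currentBytes = 0;
        }
        current.push_back(f);
        currentBytes += f.sizeBytes;
        if (oversized) {                 // oversized files are written alone
            result.push_back(current);
            current.clear();
            currentBytes = 0;
        }
    }
    if (!current.empty()) result.push_back(current);
    return result;
}
```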

  • Aggregation format: The aggregation format is internal to the tape layer and not visible from higher levels which still continue to see individual files (or "clusters" as defined by the stager, see below). Possible choices for the aggregation format include IL (see link below), GNU tar, CPIO, ZIP. The currently preferred choices are IL and TAR (see here for further discussion.)

  • tape gateway: rtcpclientd will be replaced by another module ("tape gateway") which communicates with VDQM and the tape servers, using a new tape database.

  • tape aggregator: A new module ("tape aggregator") will be developed which will run on the tape server and which interacts with the tape gateway (via the new tape database) on the one hand, and with rtcpd/taped running on the same tape server on the other (the write path is sketched after this list).
    • For write operations, the tape aggregator receives the list of files to be migrated from the tape gateway and performs the aggregation (using rfio and a memory buffer) using the selected aggregation format. Each aggregation is then (asynchronously) sent over to rtcpd/taped as a single file, using the existing protocols. In case of errors during write, all files of the aggregation are flagged as failed.
    • For read operations, the complete aggregation is read out (via rtcpd) - potentially into a memory buffer - and then (asynchronously) sent over to the disk servers using rfio. The use of a memory buffer will be investigated, to better cope with slow disk server connections.
    • Data communication between the tape aggregator and rtcpd will be done using named pipes (over rfio) as they run on the same node. Data communication between the disk servers and the tape aggregator will use RFIO but can be replaced later by xrootd as needed.
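
The following sketch illustrates the aggregator write path described above under simplifying assumptions: the RFIO read of the source data and the aggregation formatting are replaced by placeholders, and plain POSIX calls stand in for the handoff to rtcpd over a named pipe. It is not the actual implementation.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>
#include <vector>

// Sketch (not the real implementation) of the aggregator write path described
// above: file data is collected into a memory buffer and handed over to the
// tape-writing process (rtcpd in the real setup) as a single stream through a
// named pipe on the same node. The real code would read the source data over
// RFIO and apply the selected aggregation format; placeholders are used here.
static std::vector<char> fetchFileData(const std::string& /*nsFileId*/) {
    return std::vector<char>(1024, 'x');   // stands in for an RFIO read
}

int main() {
    const char* fifoPath = "/tmp/aggregation.fifo";           // hypothetical path
    const std::vector<std::string> files = {"file1", "file2", "file3"};

    if (mkfifo(fifoPath, 0600) != 0 && errno != EEXIST) {
        std::perror("mkfifo");
        return 1;
    }

    // Build the aggregation in memory; simple concatenation stands in for the
    // IL/tar formatting step.
    std::vector<char> aggregation;
    for (const std::string& f : files) {
        const std::vector<char> data = fetchFileData(f);
        aggregation.insert(aggregation.end(), data.begin(), data.end());
    }

    // open() blocks until a reader (rtcpd) opens the other end of the pipe.
    const int fd = open(fifoPath, O_WRONLY);
    if (fd < 0) {
        std::perror("open");
        return 1;
    }
    const ssize_t written = write(fd, aggregation.data(), aggregation.size());
    std::printf("wrote %zd of %zu bytes to %s\n",
                written, aggregation.size(), fifoPath);
    close(fd);
    return 0;
}
```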

  • rtcpd/taped: Initially, rtcpd/taped will remain unchanged in order to preserve backwards-compatibility and to enable a low-risk migration path.
    • In the short term, small changes may be required to accommodate the existence of a new tape format.
    • In the future, the tape aggregator will be extended to cover their functionality (after CASTOR1 has been stopped). This will also allow enhanced functionality such as direct file positioning within an aggregation, avoiding the need to always read complete aggregations.

  • name server: The name server needs to be enhanced to allow storing/retrieving the mapping between segments and the aggregations they belong to (and their position within a given aggregation); an illustrative record layout is sketched below.
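
For illustration, a hypothetical record layout for this segment-to-aggregation mapping could look as follows; the field names are invented and do not reflect the actual name server schema:

```cpp
#include <cstdint>
#include <string>

// Hypothetical record illustrating the additional name server information
// discussed above: for each tape segment, the aggregation it belongs to and
// its position inside that aggregation. Field names are invented and do not
// correspond to the actual name server schema.
struct SegmentAggregationInfo {
    std::uint64_t fileId;           // name server ID of the CASTOR file
    int           copyNumber;       // tape copy this segment belongs to
    std::string   vid;              // volume (tape) holding the aggregation
    int           aggregationSeq;   // sequence number of the aggregation on tape
    std::uint64_t offsetInAggr;     // byte offset of the segment inside it
    std::uint64_t segmentSizeBytes; // size of the segment
};
```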

Discussion points

  • "Aggregations" vs. "Clusters": Tape level aggregations as proposed here are an independent, lower-level concept than "clustering" as proposed by the stager working group, but both concepts are compatible. The following should be done on the stager side to further increase efficiency (but there is no functional dependency from the new tape subsystem):
    • Migration policies should group files belonging to the same cluster to be written to tape within the same migration stream(s). This will ensure that files belonging to the same cluster are stored together in aggregations.
    • The recall policies should group as many files as possible from the same cluster and/or aggregation, possibly by extending user requests for individual files to cover other files in the same aggregation ("speculative read ahead", see FIOgroup.SpeculativeRecalls; a sketch follows after this list). The mapping of file to aggregation can be resolved via the name server.
    • The potential benefit of having FTS send aggregations has also been discussed. Here, the bottleneck is less the network than the achievable disk throughput; FTS therefore uses parallel connections for multiple files and fills up the available bandwidth, which fits naturally with the file spreading done by CASTOR. From a throughput perspective, serialization of files would therefore not be an advantage (nor a disadvantage, if the bandwidth is filled up anyway).
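
A minimal sketch of the "speculative read ahead" expansion mentioned in the list above, assuming a hypothetical in-memory view of the name server's file-to-aggregation mapping (all names are invented):

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Sketch of the "speculative read ahead" expansion mentioned above: a recall
// request for individual files is widened to cover all files stored in the
// same aggregations, so each aggregation is read from tape only once. The two
// maps stand in for a name server lookup; all names here are hypothetical.
using FileId        = std::uint64_t;
using AggregationId = std::uint64_t;

std::set<FileId> expandRecallRequest(
    const std::vector<FileId>& requested,
    const std::map<FileId, AggregationId>& fileToAggregation,
    const std::map<AggregationId, std::vector<FileId>>& aggregationMembers)
{
    std::set<FileId> toRecall;
    for (const FileId f : requested) {
        const auto it = fileToAggregation.find(f);
        if (it == fileToAggregation.end()) {
            toRecall.insert(f);          // file not (yet) part of an aggregation
            continue;
        }
        const std::vector<FileId>& members = aggregationMembers.at(it->second);
        toRecall.insert(members.begin(), members.end());
    }
    return toRecall;
}
```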

  • Sequencing within aggregations: There is no sequencing worth respecting by the tape layer, as the sequence of user operations (e.g. file creation) is 'lost' by the stager (due to scheduling/load balancing across disk servers and the application of policies). The sequence of files within an aggregation is therefore meaningless and cannot be exploited.

  • Dedicated disk cache layer: Efficient tape recalls/migrations require high bandwidth between the tape server and the disk server sending/receiving the data. With the speed of modern drives (which nominally can reach 160 MB/s), the 1 Gb network link (~110 MB/s) of a disk server gets saturated. This implies that during migrations/recalls, other data accesses (e.g. by end users) should be avoided on that particular disk server, which means that all other data on that disk server becomes unavailable while the tape data transfer is happening. With increased disk server sizes (10-15 TB) this becomes a significant limitation. A dedicated disk layer would make it possible to decouple user I/O access from the I/O load generated by tape drives running at maximum speed. The following options are conceivable:
    1. Define a standard Castor disk pool (no user access) where files get replicated on creation (via "replicate on close"). Migrations/recalls will be run from that disk pool. The configuration of that disk pool requires investigation (many small batch-node like servers allowing for high and fast throughput from/to tape? Large disk servers with higher retention and longer streams?)
    2. Use the local tape server disk for data collection and caching. This would require local disk scheduling on the tape servers; also, the attached tape drive is idle while data collection/caching is running.
    3. Complementary to 1. or 2.: create the aggregation outside the tape layer. Tests have demonstrated that there is little overhead resulting from the concatenation of small files scattered across multiple disk servers using rtcpd, so there is no advantage in performing this step outside the tape layer.

  • Hardware support and evolution:
    • Tier-1 sites. What is the lowest common denominator? Will block positioning support (needed for positioning within aggregations) work on tape drives used by Tier-1's?
    • Impact of FC fabrics

Roadmap

A coarse-grained roadmap is listed below; it will need refinement and the addition of dependencies. This roadmap shall feed into the general CASTOR release planning.

  • End Q3 / beginning Q4 2008:
    • Feasibility study and architectural design draft for new tape aggregator, tape gateway, and tape DB.
    • Proposal for new tape and aggregation format
  • End Q1 2009:
    • First functional prototype for new tape aggregator, gateway, tape DB
  • End Q2 2009:
    • First limited production (write) usage for repacking tapes only (with extended verification of repacked data).
    • Full production read support.
  • Q3 2009:
    • Full production usage for Repack; first limited/controlled production (write) usage for non-repack migrations
    • First integrated prototype (replacing rtcpd)
  • Q4 2009 / Q1 2010:
    • Integrated version in production, drop of rtcpd

Repack2

  • To Be Completed

VDQM2

  • To Be Completed

Recall/migration/stream policy framework

  • To Be Completed

VMGR

  • To Be Completed

Meetings and Presentations

  • Meeting with FIO/TSI, 1/7/08 link
  • Presentation at DM arch meeting, 16/7/08 link
  • Meeting with FIO/TSI, 25/7/08 link

Links

-- GermanCancio - 15 Aug 2008

