DRAFT - CASTOR Tape Subsystem Twiki


The Tape Subsystem comprises all functionality directly dealing with storage on, and management of, magnetic tapes (volumes), tape servers and drives, and tape silos (libraries). This ranges from low-level tape positioning and access, through disk-to-tape copying, up to the management of tape drives and queues and the volume database.

Team and Contributors

The members of the development team (hosted in IT/DM/SGT) work in close collaboration with the tape operations team (IT/FIO/TSI) who are the main source for operational requirements.

  • Team members:
    • Steve Murray IT/DM/SGT (investigations; overall architecture coordination; rtcpd and taped; VDQM2)
    • Giulia Taurelli IT/DM/SGT (Repack2; rtcpclientd; Recall/Migration policy framework)
    • German Cancio IT/DM/SGT (investigations; planning)
  • Contributors:
    • Vlado Bahyl IT/FIO/TSI
    • Tim Bell IT/FIO/TSI
    • Charles Curran IT/FIO/TSI
    • Arne Wiebalk IT/FIO/TSI

Work Areas

Tape and remote copy daemons (taped, rtcpclientd/rtcpd)

  • Goals:
    1. Acquire knowledge on tape technology: Software, Hardware, Media
    2. Understand current performance limitations in both tape write and read operations
    3. Address operational recoverability and reliability requirements and their impact on the tape data structure and format
    4. Re-engineer existing old/legacy code base using standard CASTOR development principles and foundation libraries, taking into account the above, following a low-risk migration path

Acquire knowledge on tape technology

The current CASTOR tape software is outdated and not well understood by the current development team. The technology used (tape drives, SCSI devices, etc.) requires in-depth technical understanding which needs to be built up. Operational knowledge is available in IT/FIO/TSI, but knowledge transfer is at risk as C. Curran will retire during 2008. Some development knowledge is available via A. Wiebalk and needs to be transferred to the developers in IT/DM.

Performance limitations in tape read and writes

The current tape data formats used in CASTOR were put in place in the 1990s (ANSI AUL, NL - see link below). The NL format is not used for production data as it lacks metadata critical for media recovery (see below). There are a number of problems in terms of performance:

  • Reads: The number of files read per tape mount is very low (around 1.5 files on average; the overhead for a tape mount ranges from 1 min to 3 min). This is due to:
    1. related files are not written together but rather spread over the available tape servers, in order to increase write performance
    2. read operations for data on the same tape are not grouped (in order to avoid stale jobs on the batch farms)
  • Writes: The usage of migration policies enables building up file streams to be sent to tape. However, the efficiency of writing small files is low, as the AUL format requires writing 3 tape marks per data file (two label files and the actual data file); each tape mark takes 5-7 seconds to write. Encouraging the experiments to increase their data file sizes has not been fully successful so far.
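The impact of the per-file tape marks can be illustrated with a back-of-the-envelope model. The drive speed and tape-mark time below are the figures quoted above; the file sizes are assumed examples:

```python
# Rough model of AUL write efficiency for small files (illustrative only).
# Assumptions: 3 tape marks per file at 6 s each (midpoint of the 5-7 s
# range quoted above) and a nominal drive speed of 160 MB/s.

def write_efficiency(file_size_mb, drive_mb_per_s=160,
                     marks_per_file=3, mark_seconds=6.0):
    """Fraction of wall-clock time spent actually moving data."""
    data_time = file_size_mb / drive_mb_per_s
    overhead = marks_per_file * mark_seconds
    return data_time / (data_time + overhead)

for size in (100, 1000, 10000):  # MB
    print(f"{size:>6} MB file: {write_efficiency(size):.1%} efficient")
# -> 3.4%, 25.8% and 77.6% respectively
```

The model makes clear why aggregating many small files behind one set of tape marks (see below) pays off: at 100 MB per file the drive spends most of its time writing tape marks rather than data.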

Operational recoverability and reliability requirements

  • Format: Despite its performance disadvantages, the AUL format provides metadata which is critical for reliability and media recoverability (e.g. name server file ID, size, drive ID, etc.). In order to avoid write positioning errors leading to data corruption, the AUL trailer label of the preceding file is checked when a new file is appended to the end of a tape and compared with the information kept in the database. The AUL header/trailer information also makes it possible to identify the affected files in case of media problems (e.g. scratched/folded tape sections). Any alternative format must provide metadata allowing for safe positioning and easing media recovery.
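The append-time safety check described above can be sketched as follows. The `TrailerLabel` structure and its field names are illustrative assumptions, not the actual AUL record layout:

```python
# Illustrative sketch of the append-time consistency check: before writing
# a new file at the end of tape, the trailer label of the preceding file
# is read back and compared with what the database expects. The field
# names below are hypothetical, not the real AUL record layout.

from dataclasses import dataclass

@dataclass
class TrailerLabel:
    file_seq: int      # sequence number of the file on the tape
    ns_file_id: int    # name server file ID recorded in the label
    size: int          # file size recorded in the label

def safe_to_append(on_tape: TrailerLabel, in_db: TrailerLabel) -> bool:
    """Refuse to write if the tape and the database disagree about the
    last file -- a mismatch indicates a positioning error that could
    otherwise silently overwrite production data."""
    return (on_tape.file_seq == in_db.file_seq
            and on_tape.ns_file_id == in_db.ns_file_id
            and on_tape.size == in_db.size)

label = TrailerLabel(file_seq=42, ns_file_id=123456, size=2048)
assert safe_to_append(label, TrailerLabel(42, 123456, 2048))
assert not safe_to_append(label, TrailerLabel(41, 123456, 2048))
```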

  • Stability:
    • The software components, in particular rtcpclientd, are not robust against failures (e.g. daemon crashes). rtcpclientd is stateful and cannot be replicated (for load balancing or redundancy). After a crash, a manual database cleanup operation is needed.
    • Some of the software components, like rtcpd, are extremely delicate and critical, as bugs or instabilities may (silently) corrupt production data stored in CASTOR. In addition, backwards compatibility with CASTOR1 is required until CASTOR1 is completely stopped or confined. A development/deployment model is therefore required in which modules get updated/replaced incrementally, instead of a big-bang approach.


In order to address the above-mentioned issues, a gradual re-engineering of the tape layer components (rtcpclientd, rtcpd, tape daemon) is proposed. A design document with the details will need to be prepared. The main items are briefly described here:

  • Tape format: A new tape format will be defined in addition to the existing AUL and NL. While the AUL format structure will be kept for this format, at least initially, the payload inside each AUL data file will consist of an aggregation of multiple CASTOR files, in order to reduce the number of tape marks. The aggregation will be generated by the tape layer itself. The incoming stream (list of files) to be migrated will be aggregated up to a configurable maximum total size (e.g. 10 GB) and/or a configurable maximum number of files (e.g. 1000 files). If a file exceeds the maximum total size, it will be written in a separate aggregation consisting of that single file. On hardware with efficient tape mark handling, the number of files per aggregation can be decreased. Aggregations are not a static concept but are dynamically (re-)created on every migration; files (more precisely: segments) may be rearranged into different aggregations, e.g. after a tape repacking.
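The cut-off rule described above (close an aggregation at a configurable total size and/or file count, and give an oversized file an aggregation of its own) can be sketched as follows. The thresholds are the example figures from the text, not fixed values:

```python
# Sketch of the aggregation cut-off logic described above. The thresholds
# (10 GB / 1000 files) are the example values from the text; the real
# limits would be configurable in the tape layer.

MAX_BYTES = 10 * 1024**3   # e.g. 10 GB per aggregation
MAX_FILES = 1000           # e.g. 1000 files per aggregation

def build_aggregations(files):
    """Partition an incoming migration stream (list of (name, size)
    pairs) into aggregations. A file larger than MAX_BYTES gets an
    aggregation of its own."""
    aggregations, current, current_bytes = [], [], 0
    for name, size in files:
        if size > MAX_BYTES:               # oversized: own aggregation
            if current:
                aggregations.append(current)
                current, current_bytes = [], 0
            aggregations.append([(name, size)])
            continue
        if current and (current_bytes + size > MAX_BYTES
                        or len(current) >= MAX_FILES):
            aggregations.append(current)   # cut: start a new aggregation
            current, current_bytes = [], 0
        current.append((name, size))
        current_bytes += size
    if current:
        aggregations.append(current)
    return aggregations
```

For example, a stream of two 6 GB files, one 11 GB file and one small file yields four aggregations: the two 6 GB files cannot share one (12 GB > 10 GB), and the 11 GB file is written alone.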

  • Aggregation format: The aggregation format is internal to the tape layer and not visible from higher levels which still continue to see individual files (or "clusters" as defined by the stager, see below). Possible choices for the aggregation format include IL (see link below), GNU tar, CPIO, ZIP. The currently preferred choices are IL and TAR (see here for further discussion.)

  • tape gateway: rtcpclientd will be replaced by another module ("tape gateway") which communicates with VDQM and the tape servers using a new tape database.

  • tape archiver: A new module ("tape archiver") will be developed which will run on the tape server and which interacts with the tape gateway on one hand (via the new tape database), and rtcpd/taped running on the tape server on the other hand.
    • For write operations, the tape archiver receives the list of files to be migrated from the tape gateway and performs the aggregation (using rfio and a memory buffer) with the selected aggregation format. Each aggregation is then (asynchronously) sent over to rtcpd/taped as a single file, using the existing protocols. In case of errors during a write, all files of the aggregation are flagged as failed.
    • For read operations, the complete aggregation is read out (via rtcpd) into a memory buffer and then (asynchronously) sent over to the disk servers using rfio. The asynchronous writing from the memory buffer to disk helps to better cope with slow disk server connections.
    • Data communication between the tape archiver and rtcpd will be done using named pipes (over rfio) as they run on the same node. Data communication between the disk servers and the tape archiver will use RFIO but can be replaced later by xrootd as needed.
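The asynchronous hand-off through a memory buffer can be pictured as a producer/consumer pair. This is a purely illustrative toy, not archiver code; the real components would stream over rfio and named pipes:

```python
# Toy producer/consumer illustrating the memory-buffer hand-off: one
# thread fills the buffer (standing in for reading an aggregation via
# rtcpd), the other drains it (standing in for sending data to disk
# servers via rfio). A bounded buffer means a slow consumer throttles
# the producer instead of exhausting memory.

import queue
import threading

buffer = queue.Queue(maxsize=8)  # bounded memory buffer

def producer(chunks):
    for chunk in chunks:
        buffer.put(chunk)        # blocks if the consumer is slow
    buffer.put(None)             # end-of-stream marker

received = []

def consumer():
    while (chunk := buffer.get()) is not None:
        received.append(chunk)   # stand-in for writing to a disk server

t = threading.Thread(target=consumer)
t.start()
producer([b"block%d" % i for i in range(4)])
t.join()
print(len(received), "chunks transferred")  # -> 4 chunks transferred
```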

  • rtcpd/taped: Initially, rtcpd/taped will remain unchanged in order to preserve backwards-compatibility and to enable a low-risk migration path.
    • In the short term, small changes may be required to accommodate the existence of a new tape format.
    • In the future, the tape archiver will be extended to cover their functionality (after the stoppage of CASTOR1). This will also allow for enhanced functionality such as direct file positioning within an aggregation, avoiding the need to always read complete aggregations.

  • name server: The name server needs to be enhanced to allow storing/retrieving the mapping between segments and the aggregation they belong to (and the position within a given aggregation).
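A minimal in-memory model of the enhanced mapping (segment to aggregation plus position) is sketched below. The record layout and field names are assumptions for illustration, not the actual name server schema:

```python
# Illustrative in-memory model of the segment -> aggregation mapping
# that the name server would need to store. Field names are assumptions,
# not the actual name server schema.

from dataclasses import dataclass

@dataclass
class SegmentLocation:
    tape_vid: str        # volume the aggregation lives on
    aggregation_id: int  # which aggregation on that volume
    position: int        # position of the segment inside the aggregation

locations = {}  # keyed by (ns_file_id, copy_number)

def record(ns_file_id, copy_no, loc):
    locations[(ns_file_id, copy_no)] = loc

def locate(ns_file_id, copy_no=1):
    """Resolve a file segment to its aggregation, e.g. for read-ahead
    or for direct positioning within an aggregation."""
    return locations[(ns_file_id, copy_no)]

record(123456, 1, SegmentLocation("T00417", aggregation_id=7, position=3))
print(locate(123456))  # aggregation 7, position 3 on tape T00417
```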

Discussion points

  • "Aggregations" vs. "Clusters": Tape level aggregations as proposed here are an independent, lower-level concept than "clustering" as proposed by the stager work group, but both concepts are compatible. The following should be done on the stager side to further increase efficiency (but there is no functional dependency from the new tape subsystem):
    • Migration policies should group files belonging to the same cluster to be written to tape within the same migration stream(s). This will ensure that files belonging to the same cluster are stored together in aggregations.
    • The recall policies should group as many files from the same cluster and/or aggregation, possibly by extending user requests for individual files to cover other files in the aggregation ("read ahead"). The mapping of file to aggregation can be resolved via the name server.
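The "read ahead" expansion mentioned above could look like this. The two mapping dictionaries are hypothetical stand-ins for the name server lookups:

```python
# Sketch of "read ahead": expand a user request for individual files so
# that each touched aggregation is recalled in full, rather than mounting
# the tape once per file. The mapping dictionaries are hypothetical
# stand-ins for name server queries.

def expand_recall(requested, file_to_agg, agg_to_files):
    """Given requested file IDs and the name-server mappings, return a
    recall plan covering every file of every touched aggregation."""
    touched = {file_to_agg[f] for f in requested}
    return {agg: agg_to_files[agg] for agg in sorted(touched)}

# Toy mappings (illustrative data only).
file_to_agg = {"f1": "A", "f2": "A", "f3": "B", "f4": "B"}
agg_to_files = {"A": ["f1", "f2"], "B": ["f3", "f4"]}

print(expand_recall(["f1", "f3"], file_to_agg, agg_to_files))
# -> {'A': ['f1', 'f2'], 'B': ['f3', 'f4']}
```

A request for f1 and f3 thus recalls f2 and f4 as well, at the cost of one pass over each aggregation instead of two positionings.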

  • Sequencing within aggregations: There is no sequencing worth respecting at the tape layer, as the sequence of user operations (e.g. file creation) is 'lost' by the stager (due to scheduling/load balancing across disk servers and the application of policies). The sequence of files within an aggregation is therefore meaningless and cannot be exploited.

  • Dedicated disk cache layer: Efficient tape recalls/migrations require high bandwidth between the tape server and the disk server sending/receiving the data. With the speed of modern drives (which can nominally reach 160 MB/s), the 1 Gb network link (~110 MB/s) of a disk server can get saturated. This implies that during migrations/recalls, other data accesses (e.g. by end users) should be avoided on a particular disk server, which means that all other data on that disk server becomes unavailable while the tape data transfer is happening. With increasing disk server sizes (10-15 TB) this becomes a significant limitation. A dedicated disk layer would allow decoupling user I/O access from the I/O load generated by tape drives running at maximum speed. The following options are conceivable:
    1. Define a standard CASTOR disk pool (no user access) where files get replicated on creation (via "replicate on close"). Migrations/recalls will be run from that disk pool. The configuration of that disk pool requires investigation (many small batch-node-like servers allowing for high and fast throughput from/to tape? Large disk servers with higher retention and longer streams?)
    2. Use the local tape server disk for data collection and caching. This would require running scheduling on the tape servers; in addition, the attached tape drive is idle while data collection/caching is running.
    3. Complementary to 1. or 2.: create the aggregation outside the tape layer. However, tests have demonstrated that little overhead results from concatenating small files scattered across multiple disk servers using rtcpd, so there is no advantage in performing this step outside the tape layer.
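The saturation argument above follows directly from the two figures quoted in the text:

```python
# Back-of-the-envelope check of the saturation argument above: a modern
# drive at its nominal speed exceeds what a single disk server's 1 Gb/s
# link can deliver, so the drive cannot be kept streaming from one
# disk server alone.

drive_mb_s = 160   # nominal drive speed quoted above
link_mb_s = 110    # ~usable payload of a 1 Gb/s disk server link

deficit = drive_mb_s - link_mb_s
print(f"shortfall: {deficit} MB/s "
      f"({deficit / drive_mb_s:.0%} of drive capacity unused)")
# -> shortfall: 50 MB/s (31% of drive capacity unused)
```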

  • Hardware support and evolution:
    • Tier-1 sites: what is the lowest common denominator? Will block positioning support (needed for positioning within aggregations) work on the tape drives used by the Tier-1s?
    • Impact of FC fabrics


  • Tape formats currently used by CASTOR (AUL, NL)
  • A possible description of the IL format


  • To Be Completed


  • To Be Completed

Recall/migration/stream policy framework

  • To Be Completed


  • To Be Completed



Work Breakdown Structure (WBS)


  • (with FIO/TSI) (agenda / notes)


-- GermanCancio - 13 Aug 2008

This topic: DataManagement > CastorTapeSubsystem
Topic revision: r5 - 2008-08-13 - GermanCancio