DRAFT - WORK IN PROGRESS CASTOR Tape Subsystem TWiki

Introduction

The Tape Subsystem comprises all functionality directly dealing with storage on, and management of, magnetic tapes (volumes), tape servers and drives, and tape silos (libraries). This ranges from low-level tape positioning and access, through disk-to-tape copying, up to the management of tape drives and queues and of the volume database.

Team and Contributors

The members of the development team (hosted in IT/DM/SGT) work in close collaboration with the tape operations team (IT/FIO/TSI) who are the main source for operational requirements.

  • Team members:
    • Steve Murray IT/DM/SGT (investigations; overall architecture coordination; rtcpd and taped; VDQM2)
    • Giulia Taurelli IT/DM/SGT (Repack2; rtcpclientd; Recall/Migration policy framework)
    • German Cancio IT/DM/SGT (investigations; planning)
  • Contributors:
    • Vlado Bahyl IT/FIO/TSI
    • Tim Bell IT/FIO/TSI
    • Charles Curran IT/FIO/TSI
    • Arne Wiebalk IT/FIO/TSI

Work Areas

Tape and remote copy daemons (taped, rtcpclientd/rtcpd)

  • Goals:
    1. Acquire knowledge on tape technology: Software, Hardware, Media
    2. Understand current performance limitations in both tape write and read operations
    3. Address operational recoverability and reliability requirements and their impact on the tape data structure and format
    4. Re-engineer existing old/legacy code base using standard CASTOR development principles and foundation libraries, taking into account the above, following a low-risk migration path

Acquire knowledge on tape technology

The current CASTOR tape software is outdated and not well understood by the current development team. The technology used (tape drives, SCSI devices, etc.) requires in-depth technical understanding which needs to be built up. Operational knowledge is available in IT/FIO/TSI, but knowledge transfer is at risk as C. Curran will retire during 2008. Some development knowledge is available via A. Wiebalk and needs to be transferred to the developers in IT/DM.

Performance limitations in tape read and writes

The current tape data formats used in CASTOR were put in place in the 1990s (ANSI AUL, NL - see link below). The NL format is not used for production data as it lacks metadata critical for media recovery (see below). There are a number of problems in terms of performance:

  • Reads: The number of files read per tape mount is very low (around 1.5 files on average; the overhead of a tape mount ranges from 1 min to 3 min). This is due to:
    1. related files are not written together but are spread over the available tape servers, in order to increase write performance
    2. read operations for data on the same tape are not grouped (in order to avoid stale jobs on the batch farms)
  • Writes: The usage of migration policies enables building up file streams to be sent to tape. However, the efficiency of writing small files is low, as the AUL format requires writing 3 tape marks per data file (two label files and the actual data file); each tape mark takes 5-7 seconds to write. Encouraging experiments to increase their data file sizes has not been fully successful so far.
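
A rough back-of-the-envelope estimate of this write-efficiency problem is sketched below. The 3 tape marks per file and the 5-7 second cost per mark are taken from the description above; the assumed drive speed and the example file sizes are illustrative assumptions, not measured CASTOR figures.

    // Back-of-the-envelope estimate of AUL write efficiency for small files.
    // The 3 tape marks per file and the ~6 s per tape mark come from the text
    // above; the drive speed and the file sizes are illustrative assumptions.
    #include <cstdio>

    int main() {
        const double driveSpeedMBs   = 120.0; // assumed native drive speed [MB/s]
        const double tapeMarkSeconds = 6.0;   // middle of the quoted 5-7 s range
        const int    marksPerFile    = 3;     // AUL: header labels, data, trailer labels

        const double fileSizesMB[] = {100.0, 1024.0, 10240.0};
        for (double sizeMB : fileSizesMB) {
            const double dataTime   = sizeMB / driveSpeedMBs;
            const double markTime   = marksPerFile * tapeMarkSeconds;
            const double efficiency = dataTime / (dataTime + markTime);
            std::printf("%8.0f MB file: %5.1f%% of drive speed (%6.1f MB/s effective)\n",
                        sizeMB, 100.0 * efficiency, efficiency * driveSpeedMBs);
        }
        return 0;
    }

Under these assumed numbers a 100 MB file achieves well under 10% of the drive's native speed, which illustrates why aggregating small files before they hit the tape marks is attractive.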

Operational recoverability and reliability requirements

  • Format: Despite its performance disadvantages, the AUL format provides metadata which is critical in terms of reliability and media recoverability (e.g. name server file ID, size, drive ID, etc.). In order to avoid write positioning errors leading to data corruption, the AUL trailer label information of the preceding file is checked when a new file is appended to the end of a tape and compared with the information kept in the database (see the sketch below). The AUL header/trailer information also makes it possible to identify the affected files in case of media problems (e.g. scratched or folded tape sections). Any alternative format must provide metadata allowing for safe positioning and easing media recovery.
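
A minimal sketch of this positioning safety check, assuming a simple comparison of a few trailer fields against the database record, is given below. The struct and field names are hypothetical illustrations and do not correspond to the actual CASTOR interfaces.

    // Hypothetical sketch: before appending a new file, the trailer label of
    // the last file on tape is read and compared with what the database
    // expects. A mismatch means the tape is not positioned where we think it
    // is, so writing is refused rather than risking data corruption.
    #include <cstdint>
    #include <stdexcept>
    #include <string>

    struct TrailerLabel {          // subset of the metadata carried in an AUL trailer
        std::string nsFileId;      // name server ID of the preceding file
        uint64_t    fileSize;      // size of the preceding file in bytes
        uint32_t    fileSeq;       // sequence number of the file on the tape
    };

    void checkBeforeAppend(const TrailerLabel& onTape, const TrailerLabel& inDb) {
        if (onTape.nsFileId != inDb.nsFileId ||
            onTape.fileSize != inDb.fileSize ||
            onTape.fileSeq  != inDb.fileSeq) {
            throw std::runtime_error("AUL trailer label does not match database: "
                                     "refusing to append (possible positioning error)");
        }
    }

    int main() {
        TrailerLabel onTape{"ns:12345", 1048576, 42};  // values for illustration only
        TrailerLabel inDb  {"ns:12345", 1048576, 42};
        checkBeforeAppend(onTape, inDb);               // passes; a mismatch would throw
        return 0;
    }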

  • Stability:
    • The software components, in particular rtcpclientd, are not robust against failures (e.g. daemon crashes). rtcpclientd is stateful and cannot be replicated (for load balancing or redundancy). After a crash, a manual database cleanup operation is needed.
    • Some of the software components, like rtcpd, are extremely delicate and critical, as bugs or instabilities may (silently) corrupt production data stored in CASTOR. Also, backwards compatibility with CASTOR1 is required until CASTOR1 is completely stopped or confined. A development/deployment model is therefore required in which modules are updated/replaced incrementally, instead of a big-bang approach.

Re-engineering

In order to address the above-mentioned issues, a gradual re-engineering of the tape layer components is proposed (rtcpclientd, rtcpd, tape daemon).

  • Tape format: A new tape format will be defined in addition to the existing AUL and NL. While the AUL format structure will be kept for this format, at least initially, the payload inside each AUL data file will consist of an aggregation of multiple CASTOR files, in order to reduce the number of tape marks. The aggregation will be generated by the tape layer itself. The incoming stream (list of files) to be migrated will be aggregated up to a configurable maximum total size (e.g. 10GB) and/or a configurable maximum number of files (e.g. 1000 files), as sketched below. If a file exceeds the maximum total size, it will be written as a separate aggregation consisting of that single file. On hardware with efficient tape mark handling, the number of files per aggregation can be decreased.
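
The sketch below illustrates this aggregation rule, assuming a simple greedy grouping of the incoming file list; the function and type names and the default limits are illustrative assumptions only and do not reflect the actual tape-gateway implementation.

    // Hypothetical sketch: group the incoming migration stream into
    // aggregations bounded by a configurable maximum total size and maximum
    // number of files; a file at or above the size limit forms an aggregation
    // of its own (the exact boundary handling here is an assumption).
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct FileToMigrate { uint64_t sizeBytes; /* plus name server ID, path, ... */ };

    std::vector<std::vector<FileToMigrate>>
    buildAggregations(const std::vector<FileToMigrate>& stream,
                      uint64_t    maxBytes = 10ULL * 1024 * 1024 * 1024, // e.g. 10 GB
                      std::size_t maxFiles = 1000)                       // e.g. 1000 files
    {
        std::vector<std::vector<FileToMigrate>> aggregations;
        std::vector<FileToMigrate> current;
        uint64_t currentBytes = 0;

        for (const FileToMigrate& f : stream) {
            if (f.sizeBytes >= maxBytes) {
                // Oversized file: close the open aggregation, then write the
                // file as a single-file aggregation.
                if (!current.empty()) { aggregations.push_back(current); current.clear(); currentBytes = 0; }
                aggregations.push_back({f});
                continue;
            }
            if (!current.empty() &&
                (currentBytes + f.sizeBytes > maxBytes || current.size() >= maxFiles)) {
                aggregations.push_back(current);
                current.clear();
                currentBytes = 0;
            }
            current.push_back(f);
            currentBytes += f.sizeBytes;
        }
        if (!current.empty()) aggregations.push_back(current);
        return aggregations;
    }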

  • Aggregation format: The aggregation format is internal to the tape layer and not visible from higher levels. Choices include IL (see link below), GNU tar, CPIO and ZIP. The currently preferred choices are IL and tar (see here for a discussion).

  • rtcpclientd will be replaced by another module ("tape gateway") which communicates with VDQM and the tape servers using a new tape database.

  • A new module ("tape archiver") running on the tape server receives the files to be recalled and performs the aggregation. The aggregated file is sent over to rtcpd/taped, which remain unchanged (backwards compatibility, low-risk migration path).

  • Would help: read-ahead and clustering - not to be confused with aggregation, even if complementary (driven by the stager)

  • TODO: design proposal document (Action: Steve)

Issues / Open Questions to address

  • Labels: AUL vs NL
  • Payload: IL vs TAR vs others e.g. zip, cpio
  • Positioning support in HW (Tier0, Tier1 + LTOs)
  • read cache (cf mail 7/8/08)
  • role of disk cache (tape server / CASTOR disk pool -> small vs big file servers; cf discussion with Miguel)

  • FTS: No advantage (cf mail)

Links

  • Tape formats currently used by CASTOR (AUL, NL)
  • A possible description of the IL format

Repack2

  • Goals:

VDQM2

Recall/migration/stream policy framework

(link to) requirements

VMGR

Presentations

Plans

Work Breakdown Structure (WBS)

Meetings

  • (with FIO/TSI) (agenda / notes)

Links

-- GermanCancio - 12 Aug 2008

