Physics Analysis Pool related work areas

Draft test plan for an internal test

Metrics to be collected during the test

  • Time to 1st byte
  • Request summary message: type {open, write, read, delete, file stat, dir stat}, user, server (head node(s) that handled it), time to process, etc.
  • Quotas: max quota and utilization per user
  • Disk server stats: space per fs {total, free}, number of clients {read, write, etc.}
  • Instantaneous speed per transfer
  • Starvation detection
  • Ability to break down, at the disk server level, internal I/O vs. client I/O
  • Request queue (?)
  • Errors
  • File distribution per disk server: number of files, size, age
  • File history: writes, accesses, deletes, stats, renames
  • Time to second copy: at creation time, and when the number of copies drops below what it should be
  • Single copy detection

Tests to be executed

  • Drop 30% of the files when the pool is almost full and the clients are still writing/reading (see the sketch after this list)
  • A few 10 GB files being read/seeked by a large number of clients (2500?)
  • 'Hardware/OS problems':
      • Drop files from the file systems
      • Drop file systems
      • Mix mount points
      • Unmount file systems
      • RAID rebuild
  • "Random user command" test as done for CASTOR
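As a hedged illustration of the first test (the pool path and the 30% fraction are placeholders), the file drop could be scripted roughly like this while the clients keep reading/writing:

<verbatim>
# Drop ~30% of the files in the pool under the feet of the clients.
# The pool path is a placeholder.
POOL=/lustre/cern.ch/pool1

find "$POOL" -type f \
  | awk 'BEGIN { srand() } rand() < 0.30' \
  | while read -r f; do
      rm -f -- "$f"
    done
</verbatim>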

Setup description

Here is the list of details and the current state, for your information.

If there is something you would change or propose, it is much faster to discuss it directly in the office and/or during the next meeting.

I have started a TWiki page with the intended setup and the documentation of every step of the setup.

Hardware, Lustre & Xrootd Setup

Disk Server (OST)

We will hopefully have 2 x 12 disk servers with 24 disks each. 2 disks are used for the root filesystem. The recommended RAID setup according to Bernd is to use RAID-5 over 3 disks. We need to set aside one RAID-1 array to keep the filesystem journal. So we can have 6 RAID-5 sets with 2 spares per disk server. In Lustre terms this means we will configure 6 OSTs per disk server, or 72 OSTs per filesystem. We should also test RAID-0 with 6 filesystems to gain disk space.
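As a rough, non-authoritative sketch of this layout for one disk server (device names, the fsname 'pool1' and the MGS node name are placeholders I made up), one RAID-5 set would be turned into an OST with its ldiskfs journal on the RAID-1 set:

<verbatim>
# journal RAID-1 set + one of the 6 RAID-5 OSTs (3 disks each)
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sde /dev/sdf /dev/sdg

# dedicated external journal device (in practice one per OST on the RAID-1 set)
mke2fs -O journal_dev -b 4096 /dev/md1

# format the RAID-5 set as a Lustre OST pointing at the MGS/MDS head node
mkfs.lustre --ost --fsname=pool1 --mgsnode=lustre-mds1@tcp0 \
            --mkfsoptions="-J device=/dev/md1" /dev/md2
mkdir -p /mnt/ost0 && mount -t lustre /dev/md2 /mnt/ost0
</verbatim>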

MDS/MDT (Lustre Head Nodes)

I got two small disk servers. For the Lustre namespace I have configured one 1 TB RAID-10 array (4 disks) with 2 spares, managed with LVM. I have created 4 volumes for the Lustre namespaces MDT1-4 (200 GB each), one for the MDS (10 GB) and 4 DRBD metadata volumes (4 GB each). The 4 Lustre filesystems are formatted for 800 million files in total.
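A minimal sketch of that volume layout, assuming a volume group 'vg_lustre' on the RAID-10 array (all names and the exact format options are assumptions, not the final setup):

<verbatim>
pvcreate /dev/md0
vgcreate vg_lustre /dev/md0                  # the 1 TB RAID-10 array

for i in 1 2 3 4; do
  lvcreate -L 200G -n mdt$i vg_lustre        # namespace volumes MDT1-4
  lvcreate -L 4G   -n drbd_meta$i vg_lustre  # DRBD metadata volumes
done
lvcreate -L 10G -n mds vg_lustre             # small MDS volume

# format one of the MDTs, sized for ~200 million inodes (x4 = ~800 million files);
# with DRBD in front (see Failover below) the /dev/drbdX device would be
# formatted instead of the LV.
mkfs.lustre --mdt --fsname=pool1 --mgsnode=lustre-mds1@tcp0 \
            --mkfsoptions="-N 200000000" /dev/vg_lustre/mdt1
</verbatim>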

Failover

Currently we have no fibre channel disks, so I have set up DRBD to synchronize the 5 LVM partitions. I run DRBD in protocol C, i.e. every write is acknowledged as done on the master only once the slave has acknowledged the block.

With the help of Andras I managed to use IPMI with a new user/password, which allows head-node B to power-cycle head-node A and take over the role of A. The configuration of heartbeat is actually much simpler than I expected, and there is very good integration of DRBD + filesystem mounts. It is easy to set up, and first tests show that it works out of the box.

Backup

The MDT/MDS partitions are created via LVM. Every night we can schedule a backup via an LVM snapshot of the MDS + MDT volumes and transfer it elsewhere. To restore a Lustre FS, one just has to untar a backup into the filesystem and restore the extended attributes with setfattr.
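A minimal sketch of the nightly procedure, assuming the LVM layout above (snapshot size, mount points and backup paths are placeholders; the snapshot is mounted with the ldiskfs backing filesystem type):

<verbatim>
# snapshot one MDT volume and mount it read-only as ldiskfs
lvcreate --snapshot --size 20G --name mdt1_snap /dev/vg_lustre/mdt1
mkdir -p /mnt/mdt1_snap
mount -t ldiskfs -o ro /dev/vg_lustre/mdt1_snap /mnt/mdt1_snap

# dump the extended attributes and tar the namespace, then transfer elsewhere
cd /mnt/mdt1_snap
getfattr -R -d -m '.*' -e hex -P . > /backup/mdt1-ea-$(date +%F).bak
tar czf /backup/mdt1-$(date +%F).tgz .

cd / && umount /mnt/mdt1_snap && lvremove -f /dev/vg_lustre/mdt1_snap

# restore = untar into a freshly formatted MDT and re-apply the attributes:
#   tar xzf mdt1-<date>.tgz -C /mnt/mdt1
#   cd /mnt/mdt1 && setfattr --restore=/backup/mdt1-ea-<date>.bak
</verbatim>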

xrootd headnode/gateways

I got two 8-core CPU servers with 16 GB of memory to run the xrootd head nodes. I asked for a DNS alias 'xlustre.cern.ch' load-balancing the two head nodes. The final setup should leave only port 1094 open, and the machines should do an internal round-robin forwarding to 8 instances of xrootd running on ports 1101-1108. This will need a kernel patch to iptables on SLC4.
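A non-authoritative sketch of what this round-robin forwarding could look like; the syntax shown is the later 'statistic' match, while on SLC4 the equivalent 'nth' match from the mentioned kernel patch would be needed:

<verbatim>
# spread incoming connections on port 1094 over the 8 local xrootd ports
for i in $(seq 0 6); do
  iptables -t nat -A PREROUTING -p tcp --dport 1094 \
    -m statistic --mode nth --every $((8 - i)) --packet 0 \
    -j REDIRECT --to-ports $((1101 + i))
done
# whatever is left over goes to the last instance
iptables -t nat -A PREROUTING -p tcp --dport 1094 -j REDIRECT --to-ports 1108
</verbatim>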

For easier debugging and separation of tasks I will set up the following configuration for startup:

xrootd #1 1094 - handling requests coming from applications (xrootd client library)
xrootd #2 1095 - handling xrdcp requests
xrootd #3 1096 - handling xl commands (for the shell busy box, registration, mapping installation, quota requests, etc.)
xrootd #4 1097 - for FUSE mounts
xrootd #5 1098 - configured additionally with the ALICE authorization plugin
xrootd #6 1099 - forcing krb5 authentication
xrootd #7 1100 - forcing GSI authentication

Each xrootd instance can handle ~150 file open requests/s (= 1 core busy).
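For illustration only (instance names, config file and log paths are placeholders), the startup of these instances could look like this, using the standard xrootd options for background mode, port, instance name, config file and log file:

<verbatim>
xrootd -b -p 1094 -n app   -c /etc/xrd/app.cf   -l /var/log/xroot/app.log
xrootd -b -p 1095 -n xrdcp -c /etc/xrd/xrdcp.cf -l /var/log/xroot/xrdcp.log
xrootd -b -p 1096 -n xl    -c /etc/xrd/xl.cf    -l /var/log/xroot/xl.log
xrootd -b -p 1097 -n fuse  -c /etc/xrd/fuse.cf  -l /var/log/xroot/fuse.log
xrootd -b -p 1098 -n alice -c /etc/xrd/alice.cf -l /var/log/xroot/alice.log
xrootd -b -p 1099 -n krb5  -c /etc/xrd/krb5.cf  -l /var/log/xroot/krb5.log
xrootd -b -p 1100 -n gsi   -c /etc/xrd/gsi.cf   -l /var/log/xroot/gsi.log
</verbatim>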

User-Registration & Authentication

REGISTRATION

I propose a 'registration' command, which automatically creates a user home directory and assigns by default 100 GB of quota (?). More only on request? Each request should be forwarded to 'some' email addresses/lists for information.
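Just to make the proposal concrete, a sketch of what the registration could do on the server side (user name, pool path and the soft/hard limit split are placeholders, not the final implementation):

<verbatim>
U=apeters
FS=/lustre/cern.ch/pool1
HOMEDIR=$FS/user/${U:0:1}/$U

mkdir -p "$HOMEDIR"
chown $U:$(id -gn $U) "$HOMEDIR"

# default quota of 100 GB (lfs setquota block limits are given in KB)
lfs setquota -u $U -b 104857600 -B 104857600 -i 0 -I 0 $FS
</verbatim>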

I would stick to the '/castor' / '/afs' structure, e.g. '/user/<1st initial>/<username>/' + 4 experiment directories '/alice', '/atlas', '/cms', '/lhcb'. I will not export any prefix like '/castor' or '/lustre' etc. for the xrootd protocol. This will later allow us to easily glue the namespace between all storage systems supporting the xrootd protocol, like DPM/CASTOR/dCache.

F.Y.I.: changing prefixes in the directory structure is trivial and can be done with one configuration parameter if needed.

  • the registration command is implemented via an 'xrdcp' command

  • Users can immediately authenticate and use the system with Kerberos tokens and Grid certificates issued by CERN

  • I have written a GSI/VOMS authentication plugin for xrootd based on pure SSL with session reuse (~10 ms to re-establish an existing session). With an established session you don't notice any difference to the non-authenticated mode when running a normal copy command from a shell. To use the GSI/VOMS authentication from experiment frameworks, users have to recompile the plugin against the ROOT version inside their framework, or use krb5! I will try to get the plugin into the xrootd/ROOT CVS asap.
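As a usage sketch (host alias and file names as in the namespace section below; XrdSecPROTOCOL selects the client-side security protocol):

<verbatim>
# with a CERN Kerberos token
kinit apeters@CERN.CH
XrdSecPROTOCOL=krb5 xrdcp myfile.root root://xlustre.cern.ch//user/a/apeters/myfile.root

# with a Grid/VOMS proxy
voms-proxy-init
XrdSecPROTOCOL=gsi xrdcp root://xlustre.cern.ch//user/a/apeters/myfile.root /tmp/
</verbatim>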

USER MAPPING

I suggest the following:
  • Users with external certificates have to execute a mapping procedure -- this will first authenticate the user via krb5 and return a mapping secret
  • the mapping secret then has to be sent within a few seconds (one shot) via a GSI-authenticated connection and will finally install a mapping from the GSI DN to the krb5 principal. This saves a lot of manual work putting entries into a mapping file. If people don't like that (Akos?), I fear the only option is to do it manually for the moment, or see if someone has a better idea.

  • the procedure will be implemented via the 'xrdcp' command - it is not done yet.
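Since the procedure is not implemented yet, this is only an illustrative sketch of how it might look from the user side; the '/proc/map/...' paths are made-up placeholders:

<verbatim>
# step 1: authenticate via krb5 and fetch a one-shot mapping secret
kinit apeters@CERN.CH
XrdSecPROTOCOL=krb5 xrdcp root://xlustre.cern.ch//proc/map/secret /tmp/map.secret

# step 2: within a few seconds, present the secret over a GSI-authenticated
#         connection to install the GSI-DN -> krb5-principal mapping
voms-proxy-init
XrdSecPROTOCOL=gsi xrdcp /tmp/map.secret root://xlustre.cern.ch//proc/map/install
</verbatim>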

NON-CERN GROUPS/VOMS-GROUPS

After discussions with the people working on the LDAP database for the passwd/group file, it does not seem feasible to have VOMS groups/roles in the CRA database in the near future. Therefore the easiest is that people ask explicitly for the setup of VOMS groups/roles. These groups have to be created only on the xrootd head node(s) of a Lustre FS, not on each disk server. One could also create them on the fly like DPM does with virtual IDs, but they would have to be kept in sync between the two xrootd head nodes.

Quota-Setup & Garbage Collection

Garbage collection in Lustre is trivial since the accounting is already done by the quota system, and there is no need to run regular full scans for the garbage collection. I suggest that for every quota rule the owner (user or group) can execute a SETGC command which defines the high and low watermark. When the high watermark is reached, a GC script deletes files starting from the least recently accessed until the remaining data no longer exceeds the low watermark. The high watermark could be checked every 1 or 5 minutes; it is not a heavy operation for Lustre, just displaying some counters. If desired, we could first send the user a mail with the files to be deleted and delete them after some grace period, which can be user-defined. One could also offer time-based garbage collection. If the GC user file is not present I would not run any GC by default.
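A minimal sketch of such a GC script, assuming the watermarks come from the user's SETGC/GC file (paths, user name and values are placeholders):

<verbatim>
#!/bin/bash
FS=/lustre/cern.ch/pool1
DIR=$FS/user/a/apeters
HIGH_KB=$((90 * 1024 * 1024))     # high watermark: 90 GB
LOW_KB=$((70 * 1024 * 1024))      # low watermark:  70 GB

# current usage straight from the quota accounting - no namespace scan needed
USED_KB=$(lfs quota -u apeters $FS | awk 'NR==3 {gsub(/\*/,"",$2); print $2}')

if [ "$USED_KB" -gt "$HIGH_KB" ]; then
  # remove the least recently accessed files until we are below the low watermark
  find "$DIR" -type f -printf '%A@ %s %p\n' | sort -n | \
  while read -r atime size path; do
    [ "$USED_KB" -le "$LOW_KB" ] && break
    rm -f -- "$path"
    USED_KB=$((USED_KB - size / 1024))
  done
fi
</verbatim>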

Anyway, after the recent discussions there seems to be no real need for that.

Namespace-Setup & Filesystem-IDs

I would set the local root in xrootd and the mount point on the disk servers and head nodes to
  • /lustre/cern.ch/<pool-name>_1/ and /lustre/cern.ch/<pool-name>_2/ for mirrored FS
  • /lustre/cern.ch/<pool-name>/ for non-mirrored FS

where pool-name is a 6/8 digit identifier. '<pool-name>_1', '<pool-name>_2' and '<pool-name>' are used by the Lustre mounts to identify the filesystem. E.g. to access a file visible on a disk server under /lustre/cern.ch/<pool-name>_1/user/a/apeters/file1, a user can specify the URL root://xlustre-<pool-name>.cern.ch//user/a/apeters/file1 or root://xlustre.cern.ch//user/a/apeters/file1 [if we set up a global redirector, which could also redirect to the CASTOR namespace + xrootd@castor...].

SRM & GRID-FTP Access

If there is a need, we can install SRM + gridFTP access to the Lustre FS. The SRM which should be usable out of the box is BeStMan:

http://datagrid.lbl.gov/bestman/bestman-manual.html

The gridFTP + SRM server could be installed on a (not yet existing) gateway machine - the only requirement is a proper grid-map file. But currently it does not seem to be necessary.

Deployment

The disk server + head node RPMs related to Lustre can be deployed via Quattor. The RPMs are uploaded; the template is still missing. I will do it with German on Monday.

I installed the xrootd code on our project space on AFS, as well as the RPMs I have built, like heartbeat + DRBD for SLC4.

Monitoring

We can use Lemon to see some basic machine parameters, but for a good understanding the Lemon diagrams are too limited. I suggest running MonALISA. There is already an xrootd interface + RPM for MonALISA which sends information like files/s, disk activity, read/write, etc. xrootd itself already sends information like open files via UDP packets, which can be displayed by MonALISA.

Next week I will connect one OST to the filesystems and try to exercise the failover procedure, producing some fake power cuts etc., and install the nightly backup.

CVS repository

-- DirkDuellmann - 09 Jun 2008
