Physics Analysis Pool related work areas
Draft test plan for an internal test
Metrics to be collected during the test
- Time to first byte
- Request summary message: type {open, write, read, delete, file stat, dir stat}, user, server (head node(s) that handled it), time to process, etc.
- Quotas: max quota and utilization per user
- Disk server stats: space per fs {total, free}, number of clients {read, write, etc.} (see the sampling sketch after this list)
- Instantaneous speed per transfer
- Starvation detection
- Ability to break down, at the disk server level, internal I/O vs. client I/O
- Request queue (?)
- Errors
- File distribution per disk server: number of files, size, age
- File history: writes, accesses, deletes, stats, renames
- Time to second copy: at creation time, and when the number of copies drops below what it should be
- Single copy detection
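A minimal sampling sketch for two of the disk-server metrics (space per filesystem and number of connected clients); the OST mount-point pattern /mnt/ost* and the client port 1094 are placeholders, not the final setup.

  #!/bin/sh
  # Space per OST filesystem (kB), taken straight from df:
  df -Pk /mnt/ost* | awk 'NR>1 {printf "%s total_kb=%s free_kb=%s\n", $6, $2, $4}'
  # Number of established client connections on the xrootd port (1094 assumed):
  netstat -tn 2>/dev/null | awk '$4 ~ /:1094$/ && $6 == "ESTABLISHED"' | wc -l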
Tests to be executed
- Drop 30% of the files when the pool is almost full and clients are still writing/reading (a scripting sketch follows this list)
- A few 10 GB files being read/seeked by a large number of clients (2500?)
- ‘Hardware/OS problems’:
- Drop files from the file systems
- Drop file systems
- Mix mount points
- Unmount file systems
- Raid rebuild
- "Random user command" test as done for CASTOR
Setup description
Here is the list of details and the current state, for information.
If there is something you would change or propose, it is much faster to
discuss it in the office and/or during the next meeting.
I have started a TWiki page with the intended setup and the
documentation of every step of the setup.
Hardware- & Lustre- & Xrootd-Setup
Disk Server (OST)
We will hopefully have 2 x 12 disk servers with 24 disks each. 2 disks
are used for the root filesystem. The recommended RAID setup according
to Bernd is to use RAID-5 over 3 disks. We need to set aside one RAID-1
file system to keep the filesystem journal. So we can have 6 RAID-5
arrays with 2 spares per disk server. In Lustre terms this means we
will configure 6 OSTs per disk server, or 72 OSTs per filesystem. We
should also test RAID-0 with 6 filesystems to gain disk space.
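A sketch of the per-server layout described above, with placeholder device names and a placeholder MGS NID; the external-journal wiring is only noted, not shown.

  # 6 x RAID-5 over 3 disks for the OSTs, 1 x RAID-1 reserved for the filesystem
  # journal, 2 spares (device names are placeholders, shown for one array only):
  mdadm --create /dev/md10 --level=5 --raid-devices=3 /dev/sdc /dev/sdd /dev/sde
  mdadm --create /dev/md20 --level=1 --raid-devices=2 /dev/sdu /dev/sdv
  # Format and mount one OST, pointing it at the MGS node:
  mkfs.lustre --fsname=pool1 --ost --mgsnode=mgs@tcp0 /dev/md10
  mkdir -p /mnt/ost00 && mount -t lustre /dev/md10 /mnt/ost00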
MDS/MDT (Lustre Head Nodes)
I got two small disk servers. For the Lustre namespace I have
configured one 1 TB RAID-10 array (4 disks) with 2 spares, managed with
LVM. I have split off 4 volumes for the Lustre namespace MDTs 1-4
(200 GB each), one for the MDS (10 GB) and 4 DRBD meta-data volumes
(4 GB each). The 4 Lustre filesystems are formatted for 800 million
files in total.
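For illustration, a sketch of the corresponding LVM layout and MDT formatting on one head node; the volume group name, fsname and MGS NID are placeholders, and the inode ratio of 1 inode per 1024 bytes is an assumption that yields roughly 200 million files per 200 GB MDT, i.e. ~800 million over the 4 MDTs.

  # LVM layout (volume group "vgmds" is a placeholder):
  lvcreate -L 200G -n mdt1 vgmds          # repeat for mdt2..mdt4
  lvcreate -L 10G  -n mds  vgmds
  lvcreate -L 4G   -n drbd-meta1 vgmds    # repeat for the other DRBD meta-data volumes
  # Format one MDT with an inode ratio sized for ~200 million files:
  mkfs.lustre --fsname=pool1 --mdt --mgsnode=mgs@tcp0 --mkfsoptions="-i 1024" /dev/vgmds/mdt1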
Failover
Currently we have no fibre channel disks, so I have set up DRBD to
synchronize the 5 LVM partitions. I run DRBD with protocol C, i.e.
every write is acknowledged as done on the master only once the slave
has acknowledged the block.
With the help of Andras I managed to use IPMI with a new user/password,
which allows head-node B to power-cycle head-node A and take over the
role of A. The configuration of heartbeat is actually much simpler than
I expected and there is a very good integration of DRBD + filesystem
mounts. It is easy to set up and first tests show that it works out of
the box.
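For illustration, a sketch of one DRBD resource in protocol C (a drbd.conf excerpt), the commands to bring it up, and the IPMI fencing call used before a takeover; hostnames, IPs, devices and the IPMI account are placeholders.

  # excerpt of /etc/drbd.conf - one resource per synchronized LVM volume:
  resource mdt1 {
    protocol C;     # a write is acknowledged only after the peer has the block
    on headnode-a { device /dev/drbd1; disk /dev/vgmds/mdt1;
                    address 10.1.1.1:7789; meta-disk /dev/vgmds/drbd-meta1[0]; }
    on headnode-b { device /dev/drbd1; disk /dev/vgmds/mdt1;
                    address 10.1.1.2:7789; meta-disk /dev/vgmds/drbd-meta1[0]; }
  }

  # bring the resource up, and the fencing (STONITH) call head-node B
  # performs before taking over the resources of A:
  drbdadm create-md mdt1 && drbdadm up mdt1
  ipmitool -I lan -H headnode-a-ipmi -U fencing -P '<secret>' chassis power cycle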
Backup
The MDT/MDS partitions are created via LVM. Every night we can
schedule a backup via an LVM snapshot of the MDS + MDT volumes and
transfer it elsewhere. To restore a Lustre FS one just has to untar a
backup into the filesystem and restore the extended attributes with
setfattr.
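A sketch of the nightly backup and of the restore step, following the snapshot + tar + setfattr idea above; the volume, snapshot size and target paths are placeholders, and mounting the snapshot read-only as ldiskfs is an assumption.

  # nightly backup of one MDT via an LVM snapshot:
  lvcreate -s -L 20G -n mdt1-snap /dev/vgmds/mdt1
  mkdir -p /mnt/mdt1-snap
  mount -t ldiskfs -o ro,noload /dev/vgmds/mdt1-snap /mnt/mdt1-snap   # noload: skip journal replay
  cd /mnt/mdt1-snap
  getfattr -R -d -m '.*' -e hex -P . > /backup/mdt1-ea.bak   # extended attributes
  tar czf /backup/mdt1-files.tgz .                           # the namespace itself
  cd / && umount /mnt/mdt1-snap && lvremove -f /dev/vgmds/mdt1-snap
  # restore = untar into a freshly formatted MDT, then:
  #   setfattr --restore=/backup/mdt1-ea.bak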
xrootd headnode/gateways
I got two 8-core servers with 16 GB of memory to run the xrootd head
nodes. I asked for a DNS alias 'xlustre.cern.ch' load-balancing the two
head nodes.
The final setup should expose only port 1094 and the machines should do
an internal round-robin forwarding to 8 instances of xrootd running on
ports 1101-1108. This will need a kernel patch to iptables on SLC4 (see
the sketch after the instance list below).
For easier debugging and separation of tasks I will set up the
following configuration for startup:
xrootd #1 1094 - handling requests coming from applications (xrootd
client library)
xrootd #2 1095 - handling xrdcp requests
xrootd #3 1096 - handling xl commands (for the shell busy box,
registration, mapping installation, quota requests etc.)
xrootd #4 1097 - for FUSE mounts
xrootd #5 1098 - configured additionally with the ALICE authorization
plugin
xrootd #6 1099 - forcing krb5 authentication
xrootd #7 1100 - forcing GSI authentication
Each xrootd instance can handle ~150 file open requests/s (= 1 core busy).
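A sketch of the internal round-robin forwarding of port 1094; modern iptables provides this as '-m statistic --mode nth', while on SLC4 the equivalent 'nth' match needs the kernel/iptables patch mentioned above. Only new connections (--syn) are redirected, so an established client stays on one instance.

  # distribute new connections on 1094 evenly over the 8 local instances 1101-1108:
  for i in 0 1 2 3 4 5 6 7; do
      iptables -t nat -A PREROUTING -p tcp --syn --dport 1094 \
          -m statistic --mode nth --every $((8 - i)) --packet 0 \
          -j REDIRECT --to-ports $((1101 + i))
  done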
User-Registration & Authentication
REGISTRATION
I propose a 'registration' command, which automatically creates a user
home directory and assigns by default 100 GB of quota (?). More only on
request?
Each request should be forwarded to 'some' email addresses/lists for
information.
I would stick to the '/castor'/'/afs' structure, e.g.
'/user/<1st initial>/<user name>/'
plus 4 experiment directories '/alice', '/atlas', '/cms', '/lhcb'.
I will not export any prefix like '/castor' or '/lustre' etc. for the
xrootd protocol. This will later allow us to easily glue the namespace
together between all storage systems supporting the xrootd protocol,
like DPM/CASTOR/dCache.
FYI: changing prefixes in the directory structure is trivial and can be
done with one configuration parameter if needed.
- the registration command is implemented via an 'xrdcp' command
- Users can immediately authenticate and use the system with Kerberos
tokens and GRID certificates issued by CERN
- I have written a GSI/VOMS authentication for xrootd based on pure SSL
with session reuse (~10 ms to re-establish an existing session). With an
established session you don't notice any difference to the
non-authenticated mode when using a normal copy command from a shell.
To use the GSI/VOMS authentication from experiment frameworks, users
have to recompile the plugin against the ROOT version inside their
framework or use krb5! I will try to get the plugin into the xrootd/ROOT
CVS asap.
USER MAPPING
I suggest the following:
- Users with external certificates have to execute a mapping procedure
-- this will first authenticate a user via krb5 and return a mapping
secret
- the mapping secret then has to be sent within a few seconds (one shot)
via a GSI-authenticated connection and will finally install a mapping
from the GSI DN to the krb5 principal. This saves a lot of manual work
putting entries into a mapping file. If people don't like that (Akos?),
I fear the only option is to do it manually for the moment, or to see if
someone has a better idea.
- the procedure will be implemented via the 'xrdcp' command - it is not
done yet.
NON-CERN GROUPS/VOMS-GROUPS
After discussions with the people working on the LDAP database for the
passwd/group file it does not seem feasible to have VOMS groups/roles in
the CRA database in the near future. Therefore the easiest is that
people ask explicitly for the setup of VOMS groups/roles. These groups
have to be created only on the xrootd head node(s) of a Lustre FS, not
on each disk server. One could also create them on the fly like DPM does
with virtual IDs, but this has to be kept in sync between the two xrootd
head nodes.
Quota-Setup & Garbage Collection
Garbage collection in Lustre is trivial since the accounting is already
done by the quota system and there is no need to run regular full scans
for the garbage collection. I suggest that for every quota rule the
owner (user or group) can execute a SETGC command which defines the high
and low watermark.
When the high watermark is reached, a GC script deletes the least
recently accessed files until the usage is back below the low watermark.
The high watermark could be checked every 1 or 5 minutes; this is not a
heavy operation for Lustre, just displaying some counters.
If desired we could first send a mail with the files to be deleted to
the user and delete them after some grace period, which can be
user-defined. One could also offer time-based garbage collection.
If the per-user GC file is not present I would not run any GC by
default. Anyway, after the recent discussions there does not seem to be
a real need for that.
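A sketch of such a GC script for one quota owner; the filesystem path, the watermarks and the home-directory layout are placeholders, and 'lfs quota' delivers the current usage without any namespace scan.

  #!/bin/sh
  # GC for one user (given as $1): if usage exceeds the high watermark, delete
  # the least recently accessed files until it is back below the low watermark.
  user=$1; fs=/lustre/cern.ch/testpool; high_gb=90; low_gb=70
  used_kb=$(lfs quota -u "$user" "$fs" | awk -v fs="$fs" '$1 == fs {print $2; exit}')
  [ "$used_kb" -lt $((high_gb * 1024 * 1024)) ] && exit 0
  to_free_kb=$((used_kb - low_gb * 1024 * 1024))
  initial=$(printf '%s' "$user" | cut -c1)
  find "$fs/user/$initial/$user" -type f -printf '%A@ %k %p\n' | sort -n | \
  while read atime kb path; do
      [ "$to_free_kb" -le 0 ] && break
      rm -f "$path" && to_free_kb=$((to_free_kb - kb))
  done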
Namespace-Setup & Filesystem-IDs
I would set the local root in xrootd and the mount point on the disk
servers and head nodes to
/lustre/cern.ch/<pool-name>_1/
/lustre/cern.ch/<pool-name>_2/ for mirrored FS
/lustre/cern.ch/<pool-name>/ for non-mirrored FS
where pool-name is a 6/8 digit identifier. '<pool-name>_1',
'<pool-name>_2' and '<pool-name>' are used by the Lustre mounts to
identify the filesystem.
E.g. to access a file visible on a disk server under
/lustre/cern.ch/<pool-name>_1/user/a/apeters/file1 a user can specify
the URL
root://xlustre-<pool-name>.cern.ch//user/a/apeters/file1
or
root://xlustre.cern.ch//user/a/apeters/file1
[if we set up a global redirector, which could also redirect to the
CASTOR namespace + xrootd@castor...]
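A sketch of the relevant xrootd configuration lines on a head node implementing this mapping; the pool name ("testpool") is a placeholder.

  # excerpt of the xrootd configuration (file name/location is a placeholder):
  # export the namespace without any /lustre or /castor prefix ...
  all.export /
  # ... and translate it to the Lustre mount point on the local machine:
  oss.localroot /lustre/cern.ch/testpool_1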
SRM & GRID-FTP Access
If there is a need we can install an SRM + gridFTP access to the Lustre
FS. The SRM which should be usable out of the box is BeStMan:
http://datagrid.lbl.gov/bestman/bestman-manual.html
The gridFTP+SRM server could be installed on a (not yet existing)
gateway machine - the only requirement is a proper grid-map file. But
currently it does not seem to be necessary.
Deployment
The Lustre-related disk server + head node RPMs can be deployed via
Quattor. The RPMs are uploaded - the template is still missing. Will do
it with German on Monday.
I installed the xrootd code on our AFS project space, together with
RPMs which I have built, like heartbeat + DRBD for SLC4.
Monitoring
We can use Lemon to see some basic machine parameters, but for a good
understanding the Lemon diagrams are too limited. I suggest running
MonALISA. There is already an xrootd interface + RPM for MonALISA which
sends information like files/s, disk activity, read/write etc. xrootd
itself already sends information like open files etc. via UDP packets,
which can be displayed by MonALISA.
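A sketch of how an xrootd instance could be pointed at the MonALISA collector via its monitoring UDP stream; the collector host, port and reporting intervals are placeholders.

  # excerpt of the xrootd configuration - send monitoring UDP packets
  # (file open/close, user and I/O information) to the MonALISA host:
  xrootd.monitor all flush 30s window 5s dest files info user monalisa-host.cern.ch:9930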
Next week I will connect one OST to the filesystems, try to exercise
the failover procedure by producing some fake power cuts etc., and
install the nightly backup.
CVS repository
-- DirkDuellmann - 09 Jun 2008