Notes on dCache

Concepts

Storage class

The storage class is an attribute that can be set per directory. It defines in which set of pools a file should be stored or, in the case of a tape system, on which set of tapes.

Cache class

The cache class works like the storage class, but it has no effect on the tape system.
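
As an illustration, both classes are derived from PNFS directory tags. A minimal sketch, assuming an OSM-type tape backend and made-up store, group and cache class names (the resulting storage class would be myStore:myGroup@osm):
  # executed inside the PNFS directory
  echo "StoreName myStore" > ".(tag)(OSMTemplate)"
  echo "myGroup" > ".(tag)(sGroup)"
  echo "myCacheClass" > ".(tag)(cacheClass)"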

Pool Manager

The heart of a dCache system is the pool manager. When a user performs an action on a file - reading or writing - a transfer request is sent to the dCache system. The pool manager then decides how to handle this request.

If a file the user wishes to read resides on one of the storage pools within the dCache system, it will be transferred from that pool to the user. If it resides on several pools, the file will be retrieved from the pool which is least busy. If all pools the file is stored on are busy, a new copy of the file on an idle pool will be created and this pool will answer the request.

A new copy can either be created by a pool to pool transfer (p2p) or by fetching it from a connected tertiary storage system (sometimes called HSM - hierarchical storage manager). Fetching a file from a tertiary storage system is called staging. It is also performed if the file is not present on any of the pools in the dCache system. The pool manager also has to decide on which pool the new copy will be created, i.e. staged or p2p-copied.

The pool manager is a unique cell in the domain “dCacheDomain” and consists of several sub-modules. The most important ones are the pool selection unit (PSU) and the cost manager (CM).

The PSU is responsible for finding the pools which the Pool Manager is allowed to use for a specific transfer request. From those, the CM selects the optimal one. By telling the PSU which pools are permitted for which type of transfer request, the administrator of the dCache system can adjust the system to any kind of scenario: separate organizations served by separate pools, special pools for writing data to a tertiary storage system, pools in a DMZ which serve only a certain kind of data (e.g. for the Grid), etc.

Links

A link connects a set of transfer requests to a group of pools. In practice, a link is a rule plus a set of pools: a transfer request satisfying that rule is allowed to use the pools associated with the rule. The relevant properties of a transfer request are:
  • the file location, that is the directory of the file in PNFS
  • the IP address of the requesting host
  • the type of transfer: read, write or cache. A request for a file on tape translates into cache+read
  • the storage class of the requested file (set per directory)
  • the cache class of the requested file (set per directory)
Each link contains one or more conditions, defined by AND and OR combinations of elementary conditions of three possible types: network (-net), storage class (-store), and cache class (-dcache) conditions.
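
For illustration, elementary conditions correspond to units in PoolManager.conf, which are collected into unit groups. A minimal sketch with made-up names and values, to be checked against the dCache Book rather than taken verbatim:
  psu create unit -net    131.169.0.0/255.255.0.0
  psu create unit -store  myStore:myGroup@osm
  psu create unit -dcache myCacheClass
  psu create ugroup my-net-cond
  psu addto ugroup my-net-cond 131.169.0.0/255.255.0.0
  psu create ugroup my-store-cond
  psu addto ugroup my-store-cond myStore:myGroup@osm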

Three link attributes, -readpref, -writepref, and -cachepref, can be set to a value equal to or larger than zero. If all the conditions in the link are satisfied, the corresponding preference is assigned to each pool the link points to.

If more than one preference value different from zero is used, the PSU will not generate a single list but a set of lists, each containing pools with the same preference. The Cost Manager will use the list of pools with the highest preference and select the one with the lowest cost for the transfer. Only if all pools with the highest preference are unavailable will the next list be considered by the Cost Manager. This can be used to configure a set of fall-back pools which are used if none of the other pools are available.
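
A sketch of how such preferences might be expressed in PoolManager.conf, reusing the unit groups from the previous sketch and two hypothetical pool groups, my-pools and my-fallback-pools (pool group commands are shown in the next section); treat the exact syntax as an assumption:
  psu create link my-read-link my-net-cond my-store-cond
  psu set link my-read-link -readpref=20 -writepref=0 -cachepref=20
  psu add link my-read-link my-pools
  # fall-back link: lower, non-zero read preference, used only if my-pools is unavailable
  psu create link my-fallback-link my-net-cond my-store-cond
  psu set link my-fallback-link -readpref=10 -writepref=0 -cachepref=10
  psu add link my-fallback-link my-fallback-pools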

Pool Group

A pool group is simply a collection of pools which allows them to be treated in the same way. The default pool group contains all the pools.
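
A minimal sketch of the corresponding PoolManager.conf commands, with made-up pool and group names:
  psu create pool pool1
  psu create pool pool2
  psu create pgroup my-pools
  psu addto pgroup my-pools pool1
  psu addto pgroup my-pools pool2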

The Cost Module

The cost module is used to determine the "best" pool for reading or writing a file. It calculates a cost for each pool and selects the one with the lowest cost. There is a performance cost, which depends on how many active and waiting transfers there are compared to the maximum number of allowed transfers, and a space cost, which is higher if the free space is lower and, when old files need to be deleted, if the least recently used file is newer. The total cost is a linear combination of the two.
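
Schematically (a sketch of the combination, not the exact expressions used by the code; ccf and scf denote the configurable cpu and space cost factors):
  \mathrm{cost}_{\mathrm{perf}} \sim \frac{n_{\mathrm{active}} + n_{\mathrm{waiting}}}{n_{\mathrm{max}}},
  \qquad
  \mathrm{cost}_{\mathrm{total}} = ccf \cdot \mathrm{cost}_{\mathrm{perf}} + scf \cdot \mathrm{cost}_{\mathrm{space}}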

Note: p2p transfers give an additional contribution to the cost and they are not queued, but start immediately.

Space Hopping

Pool-to-pool copies may be triggered under certain conditions. For example:
  • If a file is requested by a client but the file resides on a pool from which this client, by configuration, is not allowed to read data, the dataset is transferred to an “allowed” pool first.
  • If a pool encounters a steady high load, the system may, if configured, decide to replicate files to other pools to achieve an equal load distribution.
  • HSM restore operations may be split into two steps. The first one reads data from tertiary storage to an “HSM connected” pool and the second step takes care that the file is replicated to a general read pool. Under some conditions this separation of HSM and non-HSM pools might become necessary for performance reasons.
  • If a dataset has been written into dCache, it might become necessary to have the file replicated instantly. The reason can be either to have a second, safe copy, or to make sure that clients don't read the file from the write pools.

Section

A section is a set of Pool Manager configuration parameters that can be associated with a link. It provides a way to have different configuration parameters for different types of requests.

GPlazma

gPlazma is the cell that takes care of authentication. It supports several mechanisms:
  • kpwd: the legacy mechanism, maps a user's DN to a local username, and then the username to the uid, gid, and rootpath;
  • the grid-mapfile to map the DN to a username and the storage-authzdb file to map the username to the uid, gid, and rootpath (see the sketch after this list);
  • gplazmalite-vorole-mapping to map a DN + FQAN to a username;
  • saml-vo-mapping: as above but with a callout to GUMS;
  • xacml-vo-mapping: as above but with a callout to the latest version of GUMS or to SCAS or to the future authorization server.
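
As an illustration of the grid-mapfile mechanism, a sketch of the two files with a made-up DN, username and paths; the exact column order of storage-authzdb (uid, gid and the various paths) is an assumption and should be checked against the dCache Book:
  # grid-mapfile: DN -> username
  "/C=DE/O=GermanGrid/OU=DESY/CN=John Doe" johndoe

  # storage-authzdb: username -> uid, gid and paths (field order to be verified)
  version 2.1
  authorize johndoe read-write 1001 1001 / /pnfs/site.de/data/johndoe /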

xroot

The number of parallel xrootd file transfers per pool node is limited by the port range defined in the dCache configuration, as each transfer occupies one (non-firewalled) port of its own. This is an example of a URL:
root://<door_hostname>//pnfs/<site.de>/data/xrd_test
xroot access is completely unauthenticated; therefore it is read-only by default. Write access can be granted on a per-directory basis (including subdirectories). The token-based authentication developed by Peters and Feichtinger is supported, but authentication support is generic.

SRM

Implementation

The dCache SRM is implemented as a web service running under the Apache Tomcat application server and the Axis Web Services engine. This service starts a dCache SRM domain with a main SRM cell and a number of other cells the SRM service relies on, such as SrmSpaceManager, PinManager and RemoteGsiftpCopyManager. Of these services, only SRM and SrmSpaceManager require a special configuration.

Spaces

LinkGroups

A LinkGroup is a collection of links, each of which contains a list of pools, as described above. Each LinkGroup knows its available size, which is the sum of the available sizes of all the included pools. In addition, a LinkGroup has five boolean properties called replicaAllowed, outputAllowed, custodialAllowed, onlineAllowed and nearlineAllowed; their values (true or false) can be configured in PoolManager.conf.
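
A sketch of the corresponding PoolManager.conf commands, with made-up names; treat the exact syntax as an assumption to be checked against the dCache Book:
  psu create linkGroup my-link-group
  psu addto linkGroup my-link-group my-read-link
  psu set linkGroup custodialAllowed my-link-group true
  psu set linkGroup nearlineAllowed my-link-group true
  psu set linkGroup onlineAllowed my-link-group true
  psu set linkGroup replicaAllowed my-link-group true
  psu set linkGroup outputAllowed my-link-group false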

Space reservation

Space reservation is supported in dCache. If successful, it allocates a certain amount of space identified by a space token (see the sketch after this list). A space has these properties:
  • retention policy: REPLICA, OUTPUT, CUSTODIAL. OUTPUT is not used in WLCG, but it could be used for files managed by the Resilient Manager.
  • access latency: NEARLINE, ONLINE. In case of dCache, ONLINE means that there will always be a copy of the file on disk, while NEARLINE does not provide such guarantee.
The retention policy and the access latency of a file are inherited from the space where the file is written.
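
For illustration, a reservation could be requested with the dCache SRM client tools. A sketch with made-up sizes, lifetime, description and endpoint; the option names are given from memory and should be treated as assumptions:
  srm-reserve-space -retention_policy=CUSTODIAL -access_latency=NEARLINE \
      -desired_size=1000000000 -guaranteed_size=1000000000 \
      -lifetime=86400 -space_desc=MY_SPACE srm://srm.site.de:8443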

In dCache only disk space can be reserved, and only for write operations. Once files are migrated to tape, and if no copy is required on disk, the space occupied by these files is returned to the SRM space. When files are read back from tape and cached on disk, they are not counted as part of any space. An SRM space can be assigned a non-unique description, which can be used to discover all SRM spaces with a given description.

Each SRM space reservation is made against the total available disk space of a particular LinkGroup. The total space in dCache that can be reserved is the sum of the available sizes of all LinkGroups. If dCache is configured correctly, each byte of disk space that can be reserved belongs to one and only one LinkGroup. Therefore it is important to make sure that no pool belongs to more than one pool group, no pool group belongs to more than one link, and no link belongs to more than one LinkGroup.

Explicit and Implicit Space Reservations for Data Storage in dCache

In dCache, if a space token is specified when writing a file, the file will be stored in it (assuming the user has permission to do so in the namespace). If the space token is not specified, and implicit space reservation is enabled, then a space reservation will be performed implicitly for each SRM v1.1 and SRM 2.2 srmPrepareToPut or srmCopy in pull mode. The access latency and the retention policy for the space reservation will be taken (in that order) from:

  • the values specified by the user
  • the corresponding tags for the PNFS directory containing the file (see the sketch below)
  • some default values in the SRM dCache configuration.
If no implicit space reservation can be made, the transfer will fail. (Note: some clients also have default values, which are used when not explicitly specified by the user. In this case the server-side defaults will have no effect.)
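
A sketch of the second source, the PNFS directory tags, with the tag names given from memory (treat them as assumptions):
  # executed inside the PNFS directory
  echo "ONLINE" > ".(tag)(AccessLatency)"
  echo "REPLICA" > ".(tag)(RetentionPolicy)"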

If the implicit space reservation is disabled, the pools belonging to LinkGroups will be excluded and only the remaining pools will be considered as candidates for storing the file.

Space Manager access control

When a space reservation request is executed, its parameters, such as the space size, lifetime, access latency, retention policy and the user Virtual Organization (VO) membership information are forwarded to the SRM SpaceManager.

The SpaceManager uses a special file to list all the VOs and FQANs that are allowed to make reservations in a given LinkGroup. If the space reservation is authorized on a LinkGroup and the total available space and the replicaAllowed, outputAllowed, custodialAllowed, onlineAllowed and nearlineAllowed properties of the group are compatible with the user request, the space reservation succeeds. The access control applies only to space reservation: once a space has been created, any user can store files in it (provided they have write permission in the PNFS directory).
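
A sketch of such an authorization file, with a made-up LinkGroup name and FQANs; the file is referenced by a configuration parameter (SpaceManagerLinkGroupAuthorizationFileName in this dCache generation, as far as I recall), and the exact format should be checked against the dCache Book:
  LinkGroup my-link-group
  /atlas/Role=production
  /atlas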

Configuration

srmCopyReqThreadPoolSize and remoteGsiftpMaxTransfers

Make sure that srmCopyReqThreadPoolSize and remoteGsiftpMaxTransfers are set to the same value, roughly equal to the maximum number of SRM-to-SRM copies your system can sustain. For example, if you expect 3 gridftp transfers per pool and you have 30 pools, the value should be 3 x 30 = 90. These parameters are relevant only for srmCopy requests.
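
A sketch for the 30-pool example above, in dCacheSetup-style key=value syntax:
  # 3 gridftp transfers per pool x 30 pools = 90
  srmCopyReqThreadPoolSize=90
  remoteGsiftpMaxTransfers=90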

Space Management

Space management can be enabled or disabled with srmSpaceManagerEnabled; for transfers with no space token specified, it is controlled by srmImplicitSpaceManagerEnabled.
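
For example (dCacheSetup-style syntax; the yes/no values are an assumption):
  srmSpaceManagerEnabled=yes
  srmImplicitSpaceManagerEnabled=yes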

srmGetLifeTime, srmPutLifeTime and srmCopyLifeTime

They specify the lifetimes of the srmPrepareToGet (srmBringOnline), srmPrepareToPut and srmCopy requests in milliseconds. If the system is unable to fulfill a request before its lifetime expires, the request is automatically garbage collected. The default value is “14400000” (4 hours).
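
For example, to state the 4-hour default explicitly:
  # 4 hours, expressed in milliseconds
  srmGetLifeTime=14400000
  srmPutLifeTime=14400000
  srmCopyLifeTime=14400000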

Maximum number of TURLs

The parameters srmGetReqMaxReadyRequests and srmPutReqMaxReadyRequests define the maximum number of TURLs handed out by the SRM. It is safe to set them to a very large number, like 50000 or 100000; if they are too small, srmPrepareToGet(Put) requests can fail. The remaining files that are ready to be transferred are put on the “Ready” queues, whose maximum length is controlled by the srmGetReqReadyQueueSize and srmPutReqReadyQueueSize parameters. These parameters should be set according to the capacity of the system and are usually greater than the maximum number of gridftp transfers that the gridftp doors of the dCache instance can sustain.
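
A sketch with the kind of values suggested above; the ready-queue sizes are made up and should be tuned to what the gridftp doors can sustain:
  srmGetReqMaxReadyRequests=50000
  srmPutReqMaxReadyRequests=50000
  # should exceed the number of concurrent gridftp transfers the doors can handle
  srmGetReqReadyQueueSize=10000
  srmPutReqReadyQueueSize=10000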

srmGetReqMaxNumOfRunningBySameOwner

The srmGetReqMaxNumOfRunningBySameOwner parameter affects the job-to-thread assignment inside the SRM. Setting it to 100 with a thread pool of 1000 means that any given user can use at most 100 of those threads. It cannot be used to limit the number of TURLs on a per-user basis.
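
For example, with a request thread pool of 1000 (configured separately):
  # a single user can occupy at most 100 of the 1000 threads
  srmGetReqMaxNumOfRunningBySameOwner=100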

Checksum

dCache can calculate a file checksum both during transfer (transfer checksum) and after the file has been written in a pool (server file checksum). Both checksums can be compared with each other and with a checksum provided by the client (client checksum). For example the dCap client sends the checksum to the server on the close operation.

In short, all provided or calculated checksums must match, and the checksum is written to PNFS.

Copy Manager

The Copy Manager is a module that copies the content of one pool (one to one) to another pool. The mode of the files (precious or cached) is retained unchanged.

File Pinning

A file can be globally pinned indefinitely with an appropriate command.

PostgreSQL

It is used for the following databases:
  • pnfs
  • SRM
  • Pin manager
  • Space manager
  • Replica manager (not activated by default)
  • pnfs companion (not activated by default)
  • billing cell (will write to a file by default)

-- AndreaSciaba - 12 Mar 2009
