How to set up a single dCache instance with multiple SRM servers?
A single dCache instance means one pool manager, one PNFS, one pin manager and one space manager, with multiple SRM servers in front of them. The advantage of this configuration is to have a separate SRM server per VO, which would avoid interference between the activities of different VOs and would reduce the load on each SRM server, since SRM has proved to be the bottleneck in some high-load situations.
Some considerations:
- Is it possible to set it up with a separate space manager for each VO? Yes, but it is important to note that sharing one space manager is the easy setup; running multiple space managers requires care (and what would the advantage be?).
- This setup is possible on the condition that all SRM servers refer to the same 'link group' (this point still needs clarification).
- Each SRM server must have its own dedicated database (see the sketch after this list).
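To make the intended layout concrete, here is a minimal, purely illustrative Python sketch (this is not dCache configuration syntax; all host, link-group and database names are made up): one shared core, one link group that every SRM server refers to, and one SRM front end per VO, each with a dedicated database.

```python
# Purely illustrative sketch of the proposed deployment (NOT dCache
# configuration syntax; host, link-group and database names are made up).

SHARED_CORE = {
    "pool_manager":  "dcache-head.example.org",
    "pnfs":          "dcache-head.example.org",
    "pin_manager":   "dcache-head.example.org",
    "space_manager": "dcache-head.example.org",
    "link_group":    "shared-linkgroup",       # all SRM servers refer to this
}

SRM_FRONTENDS = [
    {"vo": "atlas", "host": "srm-atlas.example.org", "db": "srm_atlas"},
    {"vo": "cms",   "host": "srm-cms.example.org",   "db": "srm_cms"},
    {"vo": "lhcb",  "host": "srm-lhcb.example.org",  "db": "srm_lhcb"},
]

def check_layout(frontends):
    """Each SRM server needs a dedicated database; the link group is shared."""
    databases = [f["db"] for f in frontends]
    assert len(set(databases)) == len(databases), "SRM databases must not be shared"

check_layout(SRM_FRONTENDS)
```

The point of the sketch is only that the per-VO separation happens at the SRM layer; everything behind it (pool manager, PNFS, pin manager, space manager, link group) stays shared.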
Further question: how to publish the SRM endpoints in the information system?
Clients like FTS discover the endpoints through the information system and assume a single SRM endpoint per dCache instance (this is not strictly necessary; it is just a limitation of the clients).
For VO users the problem does not arise, as they usually hard-code the endpoint. But generic clients do use the information system, so if a site wishes to deploy multiple SRM nodes for its dCache instance, this should be published as a single GlueSE object and multiple GlueService objects (one for each SRM endpoint). That is: one SE object, many services.
The problem is mapping the GlueService back to the SE. This is currently done with a naming convention that enforces a 1:1 mapping between the SRM GlueService object and the corresponding GlueSE object, so the 'one SE object, many services' scheme is not foreseen.
In fact, generic clients (lcg-utils, FTS) are "broken" in this respect and require that the GlueSE UniqueID be the FQDN of the SRM endpoint (the naming convention mentioned above). If you have multiple SRM endpoints with differing FQDNs, which one do you use as the GlueSEUniqueID? Whichever one you choose, all FTS and lcg-util users will hit that endpoint and ignore the other(s).
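To illustrate why this breaks generic clients, here is a rough Python sketch of the lookup they effectively perform (this is not actual lcg-utils or FTS code; host names and endpoint URLs are made up): the SRM GlueService is matched to the GlueSE by requiring that the endpoint FQDN equal the GlueSEUniqueID, so only one of several endpoints can ever be selected.

```python
# Sketch of the 1:1 naming convention as seen from a generic client
# (illustrative only; host names and URLs are invented).

GLUE_SE = {"GlueSEUniqueID": "srm-atlas.example.org"}   # only one value possible

GLUE_SERVICES = [
    {"GlueServiceType": "SRM",
     "GlueServiceEndpoint": "httpg://srm-atlas.example.org:8443/srm/managerv2"},
    {"GlueServiceType": "SRM",
     "GlueServiceEndpoint": "httpg://srm-cms.example.org:8443/srm/managerv2"},
]

def endpoint_fqdn(endpoint):
    # "httpg://host:port/path" -> "host"
    return endpoint.split("//", 1)[1].split(":", 1)[0]

def usable_services(se, services):
    """Keep only the SRM services whose endpoint FQDN matches the
    GlueSEUniqueID: this is what the 1:1 naming convention enforces."""
    return [s for s in services
            if s["GlueServiceType"] == "SRM"
            and endpoint_fqdn(s["GlueServiceEndpoint"]) == se["GlueSEUniqueID"]]

print(usable_services(GLUE_SE, GLUE_SERVICES))
# Only the srm-atlas endpoint survives; the srm-cms endpoint is invisible.
```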
A possibility is to publish multiple GlueSE objects, so that each of them is uniquely identified by its endpoint FQDN. This means that downtime announcements have to be made for each endpoint, but are there any other problems?
Flavia points out that one problem could be correctly reporting the space for each of the Storage Areas: it is difficult with the current schema to express the fact that two SEs share the same space.
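A toy illustration of this accounting issue, with made-up numbers: if two GlueSE objects are published for the same dCache instance, the shared space behind both SRM endpoints is easily counted once per SE.

```python
# Toy example (invented numbers): two GlueSE objects backed by the same space.

PHYSICAL_FREE_TB = 100  # the single pool of space behind both endpoints

published_ses = [
    {"GlueSEUniqueID": "srm-atlas.example.org", "free_tb": PHYSICAL_FREE_TB},
    {"GlueSEUniqueID": "srm-cms.example.org",   "free_tb": PHYSICAL_FREE_TB},
]

naive_total = sum(se["free_tb"] for se in published_ses)
print(naive_total)  # reports 200 TB, although only 100 TB really exists
```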
--
ElisaLanciotti - 30 Apr 2009