Index | Issue/Requirement | Res | Estimated Cost | Comments | Status
---|---|---|---|---|---
100 | Security, authorization, authentication | - | - | - | -
101 (was 101, 308, 501, 502, 503, 513, 572) | VOMS groups and roles used by all middleware; support for priorities based on VOMS groups/roles; all user-level commands must extract VO information from a VOMS proxy (see the VOMS proxy sketch after the table); services must provide access control based on VOMS groups/roles. Critical is fine-grained control of access to files, queues and metadata; ability to control access to data based on the schema with a granularity of entries and fields | JRA1, SA3; tracked in 2926 | The short-term solution for job priorities is to use VOViews, available on the LCG-RB. Code frozen for the gLite WMS (wait for the first post-gLite 3.0 release of the WMS); see Jeff T.'s document. Longer term, use GPBOX to define and propagate VO policies to sites. A prototype targeted at the gLite 3.0 infrastructure is available but not integrated and tested. | O(10) groups, O(3) roles; ATLAS: O(25) groups, O(3) roles. LHCb: not a clear item, should be specified for each service separately. ATLAS remarked that this should work without a central service. Sites: VOMS should be used by users too; a non-VOMS proxy means no special roles or privileges at a site. | -
103 | Automatic handling of proxy renewal | Sec. | - | Users should not need to know which server to use to register their proxies for a specific service. LHCb: item should be split by services | -
103a | Proxy renewal within the service | JRA1 | - | - | -
103b | Establishment of trust between the service and MyProxy | Sec; tracked in 2929 | | |
103c | Find the right MyProxy server | SA3 (configuration) | | |
104 | Automatic renewal of Kerberos credentials via the Grid | | | |
105 | Framework and recommendations for developing secure experiment-specific services | JRA1 | | Including delegation and renewal. LHCb: this should include certification of security frameworks already developed by the experiments; this is also a requirement from the VO boxes group. Sites: agree this is a requirement. |
110 | Information System | - | - | - |
111 (was 111, 130) | Stable access to static information; no direct access to the information system for any operation: a) caching of endpoint information in the clients and b) no need to go to the information system if the information is already available elsewhere (e.g. through parameters) | Tracked in 3069 | | Service end points and service characteristics stressed by LHCb. LHCb: 124, 128, 130 to be merged. Sites: should be addressed by the split of the IS into static and dynamic parts, currently discussed within GD | -
120 | Storage Management | - | - | - | -
122 | Same semantics for all SEs | SRM group | - | Sites: isn't this an agreed part of the Witzig proposal? |
125 | Disk quota management | Storage solution group | - | At group and user level |
126 | Verification of file integrity after replication | JRA1; tracked in 2932 | | Checksum (on demand) and file size (see the checksum sketch after the table) |
200 | Data Management | - | - | - |
210 | File Transfer Service | - | - | - |
213 | Real-time monitoring of errors | | | Has to be "parser friendly" and indicate common conditions (destination down, ...) |
217b | SRM interface integrated to allow specification of lifetime, pinning, etc. | | | LHCb: different types of storage should have different SEs; pinning is important here. Storage type done in 217; this lists the remaining work |
218 | Priorities, including reshuffling | | | |
240 | Grid File Catalogue | - | - | - |
243 | Support for the concept of a Master Copy | | | Master copies can't be deleted. LHCb: cannot be deleted the same way as other replicas |
244 | Pool interface to LFC | POOL | | Access to file-specific metadata, maybe via an RDD-like service. LHCb: POOL to use GFAL, which will be interfaced to LFC |
250 | Performance | - | - | - |
260 | Grid Data Management Tools | - | - | - |
262a | POSIX file access based on LFN | Tracked in 2936 | | |
262b | Including "best replica" selection based on location, prioritization, and current load on networks | | | Research problem |
263 | File access libraries have to access multiple LFC instances for load balancing and high availability | | | |
264 | Reliable registration service | | | Supporting ACL propagation and bulk operations |
265 | Reliable file deletion service that verifies that actions have taken place and is performant | | | Sites: ATLAS is asking us for this. |
266 | Staging service for sets of files | | | LHCb: item not clear |
300 | Workload Management | - | - | - |
302 | Single RB endpoint with automatic load balancing | JRA1/SA3 | Design required. Major issue. No estimate | |
303 | No loss of jobs or results due to temporary unavailability of an RB instance | | Standard Linux HA available now (two machines; if one dies, the other takes over). Multiple RBs plus a network file system (N RBs using an NFS/AFS shared disk; hot-swap RBs to replace failed ones with the same name, IP, certs, status and jobs within minutes): 1 FTE*month (JRA1/WM) + N (~3) months of testing | |
307 | Latency for job execution and status reporting has to be proportional to the expected job duration | JRA1 | Support for SDJ at the level of the middleware is in the first post-gLite 3.0 release of the WMS. | |
310 | Interactive access to running jobs and commands on the WN like top, ls, and cat on individual files | JRA1 | Job file perusal is in gLite 3.0. Basic functionality like top and ls needs ~1 FTE-month. More design needed for full functionality. | |
311 (was 311, 405) | All CE services have to be accessible directly by user code; computing element open to VO-specific services on all sites | Tracked in 3072 | CEMon already in gLite 3.0. A CREAM prototype targeted at the gLite 3.0 infrastructure is available but not integrated and tested. | Sites: what does this mean? |
312 | Direct access to status of the computing resource (number of jobs/VO, ...) | JRA1 | Using CEMon (in gLite 3.0) | Sites: don't we already have this in the info system? Sites are likely to reject any proposal to query the computing element directly; it is yet another way to poll the system to death, and this is why we have an information system. Users will have to accept that the information may be O(1 min) old. |
313 (was 313, 314) | Allow agents on worker nodes to steer other jobs; mechanism for changing the identity (owner) of a running job on a worker node | Tracked in 3073 | glexec is available as a library in gLite 3.0; O(1 FTE-month) to have it as a service usable on the WNs, but it is a security issue to be decided by the sites | Sites: what does "agents on WN steer other jobs" mean? |
320 | Monitoring Tools | - | 0 | - | -
321 | Transfer traffic monitor | | 0 | |
322 | SE monitoring, statistics for file opening and I/O by file/dataset, abstract load figures | | 0 | |
324 | Publish/subscribe to logging and bookkeeping, and local batch system events for all jobs of a given VO | | 0 | |
330 | Accounting | - | 0 | - | -
331 | By site, user, and group based on proxy information | All applications; tracked in 2941 | Suitable log files from the LRMS on the LCG and gLite CE in the first post-gLite 3.0 release. DGAS provides the needed privacy and granularity. APEL provides an easy collection and representation mechanism for aggregate information. | DGAS; all applications should check whether the currently available information is enough |
333 | Storage Element accounting | | 0 | |
400 | Other Issues | | 0 | These have been grouped under Deployment Issues and partially deal with services provided at certain sites |
402 | Different SE classes: MSS, disk with access for production managers, public disk storage | | 0 | |
500 | From Cal's list | | 0 | - | -
510 | Short Deadline Jobs | SDJ WG | 0 | - | -
511 | The release should support SDJ at the level of the batch systems | | 0 | Required for gLite 3.0 |
512 | The resource broker has to be able to identify resources that support SDJs | | In the first post-gLite 3.0 release of the WMS, as far as 511/406 are satisfied | Required for gLite 3.0. BUG: 31278 |
514 | Modify the system to ensure the shortest possible latency for SDJs | | Design needed | Longer term |
520 | MPI | MPI WG | | Use cases: running large-scale parallel applications on the grid effectively |
521b | Publication of the maximum number of CPUs that can be used by a single job | NA4; tracked in 2938 | 0 | Required for gLite 3.0 |
530 | Disk Space Specification | | Handled with information pass-through via BLAH. Available as a prototype in the first post-gLite 3.0 release. Would need at least 1 FTE-month for each supported batch system to use it. | Use cases: jobs need scratch space, shared between nodes (MPI) or local, and will fail if this resource is not available |
531 | Specification of required shared disk space | | As in 530 | Required for gLite 3.0. Needs deployment of the CREAM CE + plug-ins |
532 | Specification of required local scratch disk space | | As in 530 | Required for gLite 3.0 |
540 | Publication of software availability and location | | 0 | Use cases: applications use certain software packages frequently. Not all have standard locations or versions. |
541 (was 541, 542) | Publication of the Java and Python versions; mechanism to find the required versions of those packages | | 0 | Required for gLite 3.0; discussion not conclusive yet. Sites: note this is an old HEPCAL requirement |
550 | Priorities for jobs | Job Priorities WG | 0 | |
551 | Users should be able to specify the relative priority of their jobs | | 0 | Required for gLite 3.0 |
552 | A VO should be able to specify the relative priority of jobs | | 0 | Required for gLite 3.0. Groups can have different priorities, but VO control is not available |
553 | VO and user priorities must be combined sensibly by the system to define an execution order for queued jobs | | 0 | Required for gLite 3.0 |
580 | Encryption Key Server | | 0 | Use cases: data can be highly sensitive and must be encrypted to control access |
581 | Ability to retrieve an encryption key based on a file id | | 0 | |
582 | Ability to do an M/N split of keys between servers to ensure that no single server provides sufficient information to decrypt files | | 0 | |
583 | Access to these keys must be controlled by ACLs | | 0 | |
590 | Software License Management | | 0 | |
591 | Ability to obtain licenses for a given package from a given server | | 0 | |
592 | Access to the server must be controlled via ACLs based on grid certificates | | 0 | |
592 | The system should know about the availability of licenses and start jobs only when a license is available | | 0 | |
600 | Database Access | DB access WG | 0 | Use cases: application data resides in relational and XML DBs. Applications need access to this data based on grid credentials. Sites: a very oft-requested feature by non-HEP users! |
601 | Basic access control based on grid credentials | NA4; tracked in 2937 | 0 | NA4 to evaluate OGSA-DAI |
602 | Fine-grained control at table, row and column level | | 0 | |
603 | Replication mechanism for databases | | 0 | |
604 | Mechanism to federate distributed servers (each server contains a subset of the complete data) | | 0 | Sites: a very oft-requested feature by non-HEP users (esp. biobanking) |
701 | OutputData support in JDL | From Savannah bug #22564 | | |
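
The following sketch illustrates item 101's requirement that user-level commands extract VO information (groups and roles) from a VOMS proxy. It is only a minimal illustration, assuming the standard VOMS client tools are installed and a proxy already exists; it wraps `voms-proxy-info --fqan`, and the function names are hypothetical rather than part of any middleware component.

```python
import subprocess

def proxy_fqans():
    """Return the FQANs (VOMS group/role attributes) of the current proxy.

    Assumes the standard VOMS client tools are installed and a valid VOMS
    proxy already exists (created with voms-proxy-init)."""
    out = subprocess.run(["voms-proxy-info", "--fqan"],
                         capture_output=True, text=True, check=True)
    # voms-proxy-info prints one FQAN per line,
    # e.g. /lhcb/Role=production/Capability=NULL
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def has_role(fqans, vo, role):
    """Hypothetical helper: does the proxy grant the given role in the given VO?"""
    return any(f.startswith("/" + vo) and "/Role=" + role in f for f in fqans)

if __name__ == "__main__":
    fqans = proxy_fqans()
    print("FQANs:", fqans)
    print("lhcb production role:", has_role(fqans, "lhcb", "production"))
```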
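
Item 126 asks for verification of file integrity after replication by means of an on-demand checksum together with a file-size comparison. The sketch below only illustrates such a check for a locally accessible replica; the function name and the choice of Adler-32 are assumptions, not the interface of any existing data-management tool.

```python
import os
import zlib

def verify_replica(path, expected_size, expected_adler32):
    """Hypothetical post-replication check: compare the file size and an
    on-demand Adler-32 checksum against the values recorded for the source."""
    if os.path.getsize(path) != expected_size:
        return False
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            checksum = zlib.adler32(chunk, checksum)
    # Compare in the usual 8-digit, zero-padded hexadecimal form
    return format(checksum & 0xFFFFFFFF, "08x") == expected_adler32.lower()
```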