of issues that the experiments have supported and Cal's list of requirements. The table contains fields where the users can assign priorities and the developers can give estimated costs. Each VO should distribute a total of 100 "priority points" between the different issues. Estimated costs are given in person-weeks.
Under the heading "Origin" you can find whether an item came from Flavia's list (FD) or from Cal's (NA4).
You might notice that there are no security or deployment issues in this list; these will be prioritized by different means.
A few of the NA4 requirements overlap considerably with the issues summarized in Flavia's list; the difference is mostly that the NA4 requirements are more general.
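Since each VO is supposed to hand out exactly 100 points, the totals are easy to verify mechanically once the table has been filled in. The following is a minimal sketch in Python; it assumes the table has been exported to a CSV file called priorities.csv with the column headers used below — the file name and export step are hypothetical, not part of this page.

<verbatim>
# Hypothetical checker: verifies that each VO's priority points sum to 100.
# Assumes the table below was exported as "priorities.csv" with the same
# column headers; file name and export format are illustrative only.
import csv

VO_COLUMNS = ["ALICE", "ATLAS", "CMS", "LHCb", "Biomed", "NA4"]

def check_priority_totals(path="priorities.csv"):
    totals = {vo: 0 for vo in VO_COLUMNS}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for vo in VO_COLUMNS:
                cell = (row.get(vo) or "").strip()
                if cell.isdigit():  # skip "-" placeholders and empty cells
                    totals[vo] += int(cell)
    for vo, total in totals.items():
        status = "OK" if total == 100 else "expected 100, got %d" % total
        print("%s: %s" % (vo, status))

if __name__ == "__main__":
    check_priority_totals()
</verbatim>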
| *Index* | *Issue/Requirement* | *Origin* | *ALICE* | *ATLAS* | *CMS* | *LHCb* | *Biomed* | *NA4* | *Estimated Cost* | *Comments* |
| 100 | Security, authorization, authentication | FD | - | - | - | - | - | - | - | - |
| 101 | VOMS groups and roles used by all middleware | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O(10) groups, O(3) roles |
| 102 | VOMS supporting user metadata | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | see list for details |
| 103 | Automatic handling of proxy renewal | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | users should not need to know which server to use to register their proxies for a specific service |
| 104 | Automatic renewal of Kerberos credentials via the Grid | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 105 | Framework and recommendations for developing secure experiment-specific services | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | including delegation and renewal |
| 110 | Information System | FD | - | - | - | - | - | - | - | - |
| 111 | Stable access to static information | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | service endpoints, service characteristics |
| 112 | Identical GLUE schema for gLite and LCG | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 120 | Storage Management | FD | - | - | - | - | - | - | - | - |
| 121 | SRM used by all Storage Elements | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | SRM as specified in the Baseline Services Working Group Report |
| 122 | Same semantics for all SEs | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 123 | Smooth migration from SRM v1 to v2; gfal and FTS should hide the differences | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 124 | Direct access to SRM interfaces | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | SRM client libs. |
| 125 | Disk quota management | FD | 0 | 0 | 0 | 0 | 0 | 0 | not before 3Q 2006 (CASTOR, dCache, DPM) | at group and user level |
| 126 | Verification of file integrity after replication | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | checksum (on demand), file size |
| 127 | Verification that operations have had the desired effect at the fabric level | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 128 | Highly optimized SRM client tools | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 129 | Python bindings for SRM client tools | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 130 | No direct access to the information system for any operation | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | stressed by LHCb |
| 200 | Data Management | FD | - | - | - | - | - | - | - | - |
| 210 | File Transfer Service | FD | - | - | - | - | - | - | - | - |
| 211 | FTS clients on all WNs and VOBOXes | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 212 | Retry until explicitly stopped | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 213 | Real-time monitoring of errors | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | has to be "parser friendly" and indicate common conditions (destination down, ...) |
| 214 | Automatic file transfers between any two sites on the Grid | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | not linked to a catalogue; files specified via SURL |
| 215 | Central entry point for all transfers | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 216 | FTS should handle proxy renewal | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 217 | SRM interface integrated to allow specification of storage type, lifetime, pinning, etc. | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 218 | Priorities, including reshuffling | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 219 | Support for VO-specific plug-ins | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 230 | File Placement Service | FD | - | - | - | - | - | - | - | - |
| 231 | FPS plug-ins for VO-specific agents | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 232 | FPS should handle routing | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 233 | FPS should handle replication | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | choosing the sources automatically |
| 234 | FPS should handle transfers to multiple destinations | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 240 | Grid File Catalogue | FD | - | - | - | - | - | - | - | - |
| 241 | LFC as global and local catalogue with a peak access rate of 100 Hz | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 242 | Support for replica attributes: tape, pinned, disk, etc. | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 243 | Support for the concept of a Master Copy | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | master copies can't be deleted |
| 244 | POOL interface to LFC | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | access to file-specific metadata, maybe via an RDD-like service |
| 250 | Performance | FD | - | - | - | - | - | - | - | - |
| 251 | Emphasis on read access | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 252 | Unauthenticated read-only instances | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 253 | Bulk operations | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 260 | Grid Data Management Tools | FD | - | - | - | - | - | - | - | - |
| 261 | lcg-utils available in production | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 262 | POSIX file access based on LFN | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | including "best replica" selection based on location, prioritization, and current network load |
| 263 | File access libraries have to access multiple LFC instances for load balancing and high availability | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 264 | Reliable registration service | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | supporting ACL propagation and bulk operations |
| 265 | Reliable, performant file deletion service that verifies that actions have taken place | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 266 | Staging service for sets of files | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 300 | Workload Management | FD | - | - | - | - | - | - | - | - |
| 301 | Configuration that defines a set of primary RBs to be used by the VO for load balancing, and allows defining alternative sets to be used if the primary set is unavailable | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 302 | Single RB endpoint with automatic load balancing | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 303 | No loss of jobs or results due to temporary unavailability of an RB instance | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 304 | Handling of 10^6 jobs/day | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 305 | Use of the information system in match-making to send jobs to sites hosting the input files AND providing sufficient resources | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 306 | Better input sandbox management (caching of sandboxes) | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 307 | Latency for job execution and status reporting has to be proportional to the expected job duration | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 308 | Support for priorities based on VOMS groups/roles | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ATLAS remarked that this should work without a central service |
| 309 | RB should reschedule jobs in the internal task queue | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 310 | Interactive access to running jobs | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | and commands on the WN like top, ls, and cat on individual files |
| 311 | All CE services have to be directly accessible by user code | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 312 | Direct access to the status of the computing resource (number of jobs/VO, ...) | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 313 | Allow agents on worker nodes to steer other jobs | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 314 | Mechanism for changing the identity (owner) of a running job on a worker node | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 320 | Monitoring Tools | FD | - | - | - | - | - | - | - | - |
| 321 | Transfer traffic monitor | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 322 | SE monitoring: statistics for file opening and I/O by file/dataset, abstract load figures | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 323 | Scalable tool for VO-specific information (job status/errors/...) | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 324 | Publish/subscribe access to logging and bookkeeping, and to local batch system events, for all jobs of a given VO | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 330 | Accounting | FD | - | - | - | - | - | - | - | - |
| 331 | By site, user, and group based on proxy information | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | DGAS |
| 332 | Accounting by a VO-specified tag that identifies certain activities (e.g. MC production, reconstruction) | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 333 | Storage Element accounting | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 400 | Other Issues | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | these have been grouped under Deployment Issues and partially deal with services provided at certain sites |
| 401 | Read-only mirrors of the LFC service at several T1 centers, updated every 30-60 minutes | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 402 | Different SE classes: MSS, disk with access for production managers, public disk storage | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 403 | XROOTD at all sites | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 404 | VOBOX at all sites | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | requested by ALICE, ATLAS, and CMS; LHCb requested T1s and some T2s |
| 405 | Computing Element open to VO-specific services at all sites | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | including direct access to information, bypassing the information system (CREAM and CEMon) |
| 406 | Dedicated queues for short jobs | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 407 | Standardized CPU time limits | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 408 | Tool to manage VO-specific, site-dependent environments | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 409 | Rearranging priorities of jobs in the local queue | FD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ATLAS: requirement for a priority system, including local queues at the sites, that can rearrange the priority of jobs already queued at each single site to accommodate newly submitted high-priority jobs. Such a system requires some deployment effort but essentially no development, since this feature is already provided by most batch systems; it is a local implementation, not a Grid one. |
| 500 | From Cal's list | NA4 | - | - | - | - | - | - | - | |
| 501 | All user-level commands must extract VO information from a VOMS proxy | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 502 | Membership in multiple organizations must work correctly | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 503 | Services must provide access control based on VOMS groups/roles; fine-grained control of access to files, queues, and metadata is critical | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required after gLite 3.0 |
| 510 | Short Deadline Jobs | NA4 | - | - | - | - | - | - | - | |
| 511 | The release should support SDJs at the level of the batch systems | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 512 | The resource broker has to be able to identify resources that support SDJs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 513 | SDJ resource access should be controlled via ACLs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | after gLite 3.0 |
| 514 | Modify the system to ensure the shortest possible latency for SDJs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | longer term |
| 520 | MPI | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: running large-scale parallel applications on the Grid effectively |
| 521 | Use a batch system that can handle the "CPU count problem" | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0. This problem arises from a scheduling mismatch in the versions of Maui/Torque used by default: typically an MPI job can only use half of the CPUs available at a site, yet the broker will happily schedule jobs that require more onto the site, and these jobs will never run. |
| 522 | Publication of the maximum number of CPUs that can be used by a single job | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 523 | Publication of whether the home directories are shared (alternatively, transparently move sandboxes to all allocated nodes) | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 524 | Ability to run code before/after the job wrapper invokes "mpirun" | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required after gLite 3.0; this will allow compilation and setup of the job by the user |
| 530 | Disk Space Specification | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: jobs need scratch space, shared between nodes (MPI) or local, and will fail if this resource is not available |
| 531 | Specification of required shared disk space | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 532 | Specification of required local scratch disk space | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 540 | Publication of software availability and location | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: applications frequently use certain software packages; not all have standard locations or versions |
| 541 | Publication of the Java and Python versions | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 542 | Mechanism to find the required versions of those packages | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 550 | Priorities for jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 551 | Users should be able to specify the relative priority of their jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 552 | A VO should be able to specify the relative priority of jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 553 | VO and user priorities must be combined sensibly by the system to define an execution order for queued jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required for gLite 3.0 |
| 560 | Job Dependencies | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: applications often require workflows with dependencies |
| 561 | Ability to specify arbitrary (non-circular) dependencies between jobs inside a set of jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required after gLite 3.0 |
| 562 | Ability to query the state of, and control, such a set of jobs as a unit | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required after gLite 3.0 |
| 563 | Ability to query and control the sub-jobs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | required after gLite 3.0 |
| 570 | Metadata Catalogue | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: identify datasets based on metadata information |
| 571 | Ability to add metadata according to a user-defined schema | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 572 | Ability to control access to data based on the schema, with a granularity of entries and fields | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 573 | Ability to distribute metadata over a set of servers | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 580 | Encryption Key Server | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: data can be highly sensitive and must be encrypted to control access |
| 581 | Ability to retrieve an encryption key based on a file ID | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 582 | Ability to do an M/N split of keys between servers to ensure that no single server provides sufficient information to decrypt files | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | see the sketch after this table |
| 583 | Access to these keys must be controlled by ACLs | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 590 | Software License Management | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 591 | Ability to obtain licenses for a given package from a given server | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 592 | Access to the server must be controlled via ACLs based on grid certificates | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 593 | The system should know about the availability of licenses and start jobs only when a license is available | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 600 | Database Access | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Use case: application data resides in relational and XML DBs; applications need access to this data based on grid credentials |
| 601 | Basic access control based on grid credentials | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 602 | Fine-grained control at table, row, and column level | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 603 | Replication mechanism for databases | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 604 | Mechanism to federate distributed servers (each server contains a subset of the complete data) | NA4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
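Requirement 582 asks for an M/N split of encryption keys so that no single key server holds enough information to decrypt a file. For illustration only, the standard technique for such a threshold split is Shamir secret sharing over a prime field; the minimal sketch below uses toy parameters and hypothetical helper names, and is not the interface of any actual key server.

<verbatim>
# Illustrative M-of-N threshold split (Shamir secret sharing).
# Toy parameters; a real key would be encoded as an integer < PRIME.
import random

PRIME = 2**127 - 1  # a Mersenne prime, large enough for a toy key

def split(secret, m, n):
    """Split `secret` into n shares; any m of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

key = 123456789                # stand-in for a file encryption key
shares = split(key, m=3, n=5)  # 5 servers, any 3 suffice
assert reconstruct(shares[:3]) == key
assert reconstruct(shares[1:4]) == key
</verbatim>

With fewer than m shares, every candidate key remains equally likely, which is exactly the property requirement 582 asks for.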
-- Main.markusw - 06 Jan 2006: filled in the HEP and NA4 issues and requirements. There are still duplicates, especially in the area of short jobs.