DRAFT DRAFT DRAFT

This is a merger of Flavia's list of issues that the experiments have supported and Cal's list of requirements. The table contains fields where the users can assign priorities and the developers can give estimated costs. Each VO should distribute a total of 100 "priority points" among the different issues. Estimated costs are given in person-weeks.
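To make the scoring scheme concrete, here is a minimal, purely illustrative Python sketch (the issue numbers and point values are hypothetical, not taken from the table below):

<verbatim>
# Illustrative only: each VO distributes at most 100 priority points over the
# issues, and the "Sum" column is the per-issue total across all VOs.
priorities = {
    "VO_A": {101: 40, 111: 35, 126: 25},   # hypothetical allocations
    "VO_B": {101: 60, 111: 40},
}

def within_budget(vo, budget=100):
    """True if the VO has not allocated more than its 100 priority points."""
    return sum(priorities[vo].values()) <= budget

def row_sum(issue):
    """Per-issue total across all VOs (the 'Sum' column)."""
    return sum(points.get(issue, 0) for points in priorities.values())

for vo in priorities:
    print(vo, "within budget:", within_budget(vo))
print("Sum for issue 101:", row_sum(101))  # 100 with the numbers above
</verbatim>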

Under the heading "Origin" you can find whether this item came from Flavia's list or from Cal's (FD/NA4). The list has also been extended with requirements from the VO Box discussions (marked VOB). You might notice that there are no security or deployment issues in this list. These will be prioritized by different means.

A few of the NA4 requirements have a considerable overlap with the issues summarized in Flavia's list. The difference is mostly that the NA4 requirements are more general.

Index Issue/Requirement Origin ALICE ATLAS CMS LHCb Biomed NA4 Sum Responsible Estimated Cost Comments Status
100 Security, authorization, authentication FD - - - - - - 0 - - -
101 (was 101, 308, 501, 502, 503, 513, 572) VOMS groups and roles used by all middleware;
Support for priorities based on VOMS groups/roles;
All user level commands must extract VO information from a VOMS proxy;
Services must provide access control based on VOMS groups/roles. Fine-grained control of access to files, queues, and metadata is critical;
Ability to control access to data based on the schema with a granularity of entries and fields
FD, NA4 5 12 12 0 18 14 61 JRA1,SA3
tracked in:
2926, 2927, 2928, 2935
Short term solution for job priorities is to use VOViews.
Available on the LCG-RB. Code frozen for gLite WMS (wait for the first post-gLite 3.0 release of WMS)
See: Jeff T. document
Longer term, use GPBOX to define and propagate VO policies to sites. A prototype targeted at the gLite 3.0 infrastructure is available but not yet integrated and tested.
O(10) groups, O(3) roles. ATLAS: O(25) groups, O(3) roles. LHCb: item not clear, should be specified for each service separately. ATLAS remarked that this should work without a central service. Sites: VOMS should be used by users too; a non-VOMS proxy means no special roles or privileges at a site. CMS: ongoing
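As a purely illustrative sketch of what "access control based on VOMS groups/roles" can look like on the service side, the following Python fragment matches VOMS-style FQANs (/vo[/group...][/Role=role]) against a simple ACL. The ACL structure and helper names are hypothetical, not the actual middleware implementation:

<verbatim>
def parse_fqan(fqan):
    """Split an FQAN into its group path and optional role."""
    group, _, role = fqan.partition("/Role=")
    return group, (role or None)

def is_authorized(user_fqans, acl):
    """acl: list of (group_prefix, required_role_or_None) entries granting access."""
    for fqan in user_fqans:
        group, role = parse_fqan(fqan)
        for allowed_group, required_role in acl:
            if group.startswith(allowed_group) and required_role in (None, role):
                return True
    return False

# Example: only /atlas/production with Role=manager may write to this queue.
acl = [("/atlas/production", "manager")]
print(is_authorized(["/atlas/production/Role=manager"], acl))  # True
print(is_authorized(["/atlas/users"], acl))                    # False
</verbatim>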
103 Automatic handling of proxy renewal FD 5 2 1 3 0 5 16 Sec. 0 Users should not need to know which server to use to register their proxies for a specific service. LHCb: item should be split by services. CMS: done
TCG:discussed
103a proxy renewal within the service               16 JRA1 done   TCG:discussed
103b establishment of trust between the service and myproxy               16 Sec
tracked in 2929
    TCG:discussed
103c find the right myproxy server               16 SA3 (configuration)     TCG:discussed
104 Automatic renewal of Kerberos credentials via the GRID FD 1 0 0 0 0 0 1   0    
105 Framework and recommendations for developing secure experiment-specific services FD, VOB 0 3 1 4 0 0 0 8   including delegation and renewal. LHCb: this should include certification of security frameworks already developed by the experiments;
this is also a requirement from the VO Box group. Sites: agree this is a requirement.
CMS: ongoing
TCG:discussed
110 Information System FD - - - - - - 0 - - -  
111 (was 111, 130) Stable access to static information;
No direct access to the information system for any operation
a) caching of endpoint information in the clients, and
b) not needing to go to the information system if the information is already available elsewhere (e.g. passed through parameters)
FD 0 6 0 6 0 0 12
tracked in 3069
0 service endpoints, service characteristics
Stressed by LHCb. LHCb: 124, 128, 130 to be merged. Sites: should be addressed by splitting the IS into static and dynamic parts, currently discussed within GD
CMS: ongoing (new info system?)
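Point a) of requirement 111 amounts to a client-side cache with a lifetime. A minimal, purely illustrative Python sketch is given below; query_information_system is a hypothetical stand-in for a real information-system lookup:

<verbatim>
import time

_cache = {}  # service name -> (endpoint, expiry time)

def query_information_system(service):
    # Hypothetical stand-in for a real (comparatively expensive) IS query.
    return "https://example.invalid/" + service

def get_endpoint(service, ttl=3600.0):
    """Return a cached endpoint, querying the IS only when the entry expired."""
    now = time.time()
    cached = _cache.get(service)
    if cached and cached[1] > now:
        return cached[0]
    endpoint = query_information_system(service)
    _cache[service] = (endpoint, now + ttl)
    return endpoint

print(get_endpoint("lfc"))  # queries the information system once
print(get_endpoint("lfc"))  # served from the client-side cache
</verbatim>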
120 Storage Management FD - - - - - - 0 - - -
122 Same semantics for all SEs FD 5 5 5 4 0 0 19 SRM group 0   Sites: isn't this an agreed part of the Witzig proposal?
125 Disk quota management FD 1 2 5 1 0   8 0 not before 3Q 2006 (CASTOR, dCache, DPM) at group and user level to be discussed in the storage solution group
126 Verification of file integrity after replication FD 1 5 3 2 0 0 11 JRA1
tracked in 2932
0 checksum (on demand), file size CMS: done
TCG:discussed
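Requirement 126 (verification of file integrity after replication) boils down to comparing file size and, on demand, a checksum. A minimal illustrative Python sketch follows; in production the remote values would come from the SE or the catalogue rather than from local files:

<verbatim>
import hashlib
import os

def file_checksum(path, algo="md5", chunk=1 << 20):
    """Stream the file so arbitrarily large replicas can be checked."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def replica_ok(source, replica):
    """Cheap size comparison first; compute checksums only if sizes match."""
    if os.path.getsize(source) != os.path.getsize(replica):
        return False
    return file_checksum(source) == file_checksum(replica)
</verbatim>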
200 Data Management FD - - - - - - 0 - - -
210 File Transfer Service FD - - - - - - 0 - - -
213 Real-time monitoring of errors FD 1 5 2 0 0 0 8   0 has to be "parser friendly" and indicate common conditions (destination down, ...) CMS: ongoing
TMB discussed
217b SRM interface integrated to allow specification of lifetime, pinning, etc. FD 2 2 2 1 0 0 7   0 LHCb: different types of storage should have different SEs; pinning is important here TCG:discussed
Storage type was done in 217; this lists the remaining work
218 Priorities, including reshuffling FD 0 2 2 0 0 0 4   0   CMS: not done
TMB discussed
240 Grid File Catalogue FD - - - - - - 0 - - -
243 Support for the concept of a Master Copy FD 0 0 2 2 0 0 4   0 master copies can't be deleted. LHCb: cannot be deleted the same way as other replicas. TMB discussed
244 POOL interface to LFC FD 0 5 5 0 0 0 10 pool 0 access to file-specific metadata, maybe via an RDD-like service. LHCb: POOL to use GFAL, which will be interfaced to the LFC. TMB discussed
250 Performance: FD - - - - - - 0 - - - ?
260 Grid Data Management Tools FD - - - - - - 0 - - -
262a POSIX file access based on LFN FD 1 0 0 4 0 5 10 NA4
tracked in 2936
0   ?
262b including "best replica" selection based on location, prioritization, current load on networks               0 research problem     CMS: open
263 File access libraries have to be able to access multiple LFC instances for load balancing and high availability FD 1 0 0 2 0 0 3   0   ?
264 reliable registration service FD 0 1 0 0 0 0 1   0 supporting ACL propagation and bulk operations ?
265 reliable file deletion service that verifies that actions have taken place and is performant FD 1 5 0 2 0 0 8   0 Sites: ATLAS is asking us for this. ?
266 Staging service for sets of files FD 0 2 0 0 0 0 2   0 LHCb: item not clear ?
300 Workload Management FD - - - - - - 0 - - -
302 Single RB end point with automatic load balancing FD 0 0 0 2 0 0 2   Design required. Major issue. No estimate   CMS: Open
303 No loss of jobs or results due to temporary unavailability of an RB instance FD 0 5 2 1 0 0 8   Standard Linux HA is available now (two machines; if one dies, the other takes over).
Multiple RBs plus a network file system (N RBs using an NFS/AFS shared disk, hot-swap RBs to replace failed ones with the same name, IP, certs, status and jobs within minutes): 1 FTE*month (JRA1/WM) + N (~3) months of testing
  CMS: Open/Ongoing
307 Latency for job execution and status reporting has to be proportional to the expected job duration FD 0 0 2 0 0 0 2   Support for SDJ at the level of the middleware is in the first post-gLite 3.0 release of WMS.   ?
310 Interactive access to running jobs FD 0 2 2 0 0 0 4   Job file perusal is in gLite 3.0.
For basic functionality like top and ls, ~1 FTE-month is needed. More design is needed for full functionality.
and commands on the WN such as top, ls, and cat on individual files CMS: open
311 (was 311, 405) All CE services have to be accessible directly by user code; Computing element open to VO specific services on all sites FD 0 0 0 7 0 0 7 tracked in 3072 CEMon is already in gLite 3.0. A CREAM prototype targeted at the gLite 3.0 infrastructure is available but not integrated and tested. Sites: what does this mean? CMS: Ongoing
312 Direct access to status of the computing resource (number of jobs/VO ...) FD 0 0 0 4 0 0 4   Using CEMon (in gLite 3.0) Sites: don't we already have this in the info system? Sites are likely to reject any proposal to query the computing element directly; it is yet another way to poll the system to death. This is why we have an information system. Users will have to accept that the information may be O(1 min) old. CMS: Ongoing
313 (was 313, 314) Allow agents on worker nodes to steer other jobs; Mechanism for changing the identity (owner) of a running job on a worker node FD 5 0 1 8 3 0 17
tracked in 3073
glexec available as a library in gLite 3.0
O(1 FTE month) to have it as a service usable on the WNs, but it is a security issue to be decided by sites
Sites: what does "agents on WN steer other jobs" mean? CMS: ongoing
320 Monitoring Tools FD - - - - - - 0 - - -
321 Transfer traffic monitor FD 1 1 1 1 0 0 4   0   CMS: open
TMB discussed
322 SE monitoring, statistics for file opening and I/O by file/dataset, abstract load figures FD 0 0 0 2 0 0 2   0   CMS: open
TMB discussed
324 Publish subscribe to logging and bookkeeping, and local batch system events for all jobs of a given VO FD 0 0 2 0 0 0 2   0   CMS: ongoing
TMB discussed
330 Accounting FD - - - - - - 0 - - -
331 By site, user, and group based on proxy information FD 3 5 1 0 0 5 14 all applications
tracked in 2941
Suitable log files from the LRMS on the LCG and gLite CE in the first post-gLite 3.0 release. DGAS provides the needed privacy and granularity. APEL provides an easy collection and representation mechanism for aggregate information. DGAS: all applications should check whether the currently available information is enough. CMS: open
TMB discussed
333 Storage Element accounting FD 0 2 1 0 0 2 5   0   CMS: open
400 Other Issues FD 0 0 0 0 0 0 0   0 These have been grouped under Deployment Issues and partially deal with services provided at certain sites
402 Different SE classes: MSS, disk with access for production managers, public disk storage FD 0 2 1 4 0 0 7   0   CMS: Done
TCG:discussed
500 From Cal's list NA4 - - - - - - 0 - - -
510 Short Deadline Jobs NA4 - - - - - - 0 - - -
511 The release should support SDJ at the level of the batch systems NA4 0 0 0 0 4 0 4   0 required for gLite 3.0
512 The resource broker has to be able to identify resources that support SDJs NA4 0 0 0 0 4 0 4   In the first post-gLite 3.0 release of WMS, as long as 511/406 are satisfied. Required for gLite 3.0. BUG:31278
514 Modify system to ensure shortest possible latency for SDJs NA4 0 0 1 0 5 0 6   design needed longer term
520 MPI NA4 0 0 0 0 0 0 0   0 Use cases: Running large scale parallel applications on the grid effectively
521b Publication of the maximum number of CPUs that can be used by a single job NA4 0 0 0 0 2 5 7 NA4
tracked in 2938
0 required for gLite 3.0
530 Disk Space Specification NA4 0 0 0 0 0 0 0   Handled with information pass-through via BLAH. Available as a prototype in the first post-gLite 3.0 release. Would need at least 1 FTE-month for each supported batch system to use it. Use cases: jobs need scratch space, shared between nodes (MPI) or local, and will fail if this resource is not available
531 Specification of required shared disk space NA4 0 0 0 0 0 5 5   As in 530; required for gLite 3.0. Needs deployment of CREAM CE + plug-ins
532 Specification of required local scratch disk space NA4 0 0 0 0 1 5 6   As in 530; required for gLite 3.0
540 Publication of software availability and location NA4 0 0 0 0 0 0 0   0 Use cases: applications use certain software packages frequently. Not all have standard locations or versions.
541 (was 541, 542) Publication of the Java and Python version; Mechanism to find the required versions of those packages NA4 0 0 0 0 4 4 8   0 required for gLite 3.0; discussion not conclusive yet. Sites: note this is an old HEPCAL requirement
550 Priorities for jobs NA4 0 0 0 0 0 0 0 Job Priorities WG 0  
551 Users should be able to specify the relative priority of their jobs NA4 0 0 1 0 3 2 6   0 required for gLite 3.0
552 A VO should be able to specify the relative priority of jobs NA4 0 5 1 0 0 2 8   0 required for gLite 3.0. Groups can have different priorities, but VO control is not available
553 VO and user priorities must be combined sensibly by the system to define an execution order for queued jobs NA4 0 0 0 0 2 2 4   0 required for gLite 3.0
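For 551-553, one possible way of combining a VO-assigned and a user-assigned priority into an execution order is a simple weighted score, sketched below in Python. The weighting and the job structure are purely illustrative assumptions, not the scheduler's actual policy:

<verbatim>
from dataclasses import dataclass

@dataclass
class QueuedJob:
    job_id: str
    vo_priority: int    # set by the VO (552)
    user_priority: int  # set by the user (551)

def execution_order(jobs, vo_weight=10, user_weight=1):
    """Higher combined score runs first (553); ties keep submission order."""
    return sorted(
        jobs,
        key=lambda j: vo_weight * j.vo_priority + user_weight * j.user_priority,
        reverse=True,
    )

queue = [QueuedJob("a", 1, 9), QueuedJob("b", 3, 0), QueuedJob("c", 1, 1)]
print([j.job_id for j in execution_order(queue)])  # ['b', 'a', 'c']
</verbatim>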
580 Encryption Key Server NA4 0 0 0 0 0 4 4   0 Use cases: data can be highly sensitive and must be encrypted to control access
581 Ability to retrieve an encryption key based on a file id NA4 0 0 0 0 6 0 6   0  
582 Ability to do an M/N split of keys between servers to ensure that no single server provides sufficient information to decrypt files NA4 0 0 0 0 3 0 3   0  
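The M/N split in 582 corresponds to standard (M, N) threshold secret sharing: any M of the N key servers can reconstruct the key, fewer cannot. The sketch below is a plain Shamir-style illustration in Python, not the actual key-server implementation:

<verbatim>
import random

PRIME = 2**127 - 1  # a prime larger than any key value used in this demo

def split_key(secret, m, n):
    """Return n shares (x, y); any m of them reconstruct `secret`."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct_key(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = split_key(123456789, m=3, n=5)
print(reconstruct_key(shares[:3]) == 123456789)  # any 3 of the 5 shares suffice
</verbatim>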
583 Access to these keys must be controlled by ACLs NA4 0 0 0 0 6 0 6   0  
590 Software License Management NA4 0 0 0 0 0 0 0   0  
591 Ability to obtain licenses for a given package from a given server NA4 0 0 0 0 2 3 5   0
592 Access to the server must be controlled via ACLs based on grid certificates NA4 0 0 0 0 2 3 5   0
593 The system should know about the availability of licenses and start jobs only when a license is available NA4 0 0 0 0 1 2 3   0
600 Database Access NA4 0 0 0 0 0 0 0   0 Use cases: application data resides in relational and XML DBs. Applications need access to this data based on grid credentials. Sites: a very often requested feature by non-HEP users!
601 Basic access control based on grid credentials NA4 0 5 0 0 3 5 13 NA4
tracked in 2937
0 NA4 to evaluate OGSA-DAI
602 Fine-grained control at table, row and column level NA4 0 0 0 0 3 5 8   0  
603 Replication mechanism for data bases NA4 0 0 0 0 2 0 2   0  
604 Mechanism to federate distributed servers (each server contains a subset of the complete data) NA4 0 0 0 0 2 0 2   0 Sites: a very often requested feature by non-HEP users (esp. biobanking)
701 OutputData support in JDL Savannah 0 0 0 0 0 0 0   From Savannah bug #22564  

Done or obsolete items

Index Issue/Requirement Origin ALICE ATLAS CMS LHCb Biomed NA4 Sum Responsible Estimated Cost Comments Status
102 VOMS supporting user metadata FD 3 0 0 2 0 0 5   0 see list for details Done
112 Identical GLUE schema for gLite and LCG FD 0 5 0 0 0 0 5   In the first post gLite 3.0 release of WMS and gLite CE   Done
121 SRM used by all Storage Elements FD 5 5 5 4 0 0 19 SA1 (being put in place) 0 SRM as specified in the Baseline Services Working Group Report Done
123 Smooth migration from SRM v1 to v2, gfal and FTS should hide differences FD 1 5 5 4 0 0 15 JRA1
tracked in 2930
0 LHCb: 121, 122, 123 should be merged into one; 4 points are counted for this set Done
124 (was 124, 128) Direct access to SRM interfaces;
Highly optimized SRM client tools
FD 3 7 5 8 0 0 23 SA3
tracked in 2931
0 SRM client libs. Done
127 Verification that operations have had the desired effect at fabric level FD 0 0 0 2 0 0 2   0 LHCb: 126,127 to be merged Sites: what does this mean?? Obsolete
129 Python binding for SRM client tools FD 0 5 0 1 0 0 6   0   Obsolete
211 FTS clients on all WNs and VOBOXes FD 5 5 0 4 0 0 14 SA3 0 For ALICE and LHCb only in VO-BOXES Done
212 Retry until explicitly stopped FD 3 5 3 0 0 0 9 JRA1 0 will see gradual improvements in error handling, no specific action Obsolete, issue resolved
214 Automatic file transfers between any two sites on the Grid FD 5 5 0 4 0 0 14 JRA1 0 not linked to a catalogue, file specified via SURL LHCb: This should be handled by FTS Done
215 (was 215, 232) Central entry point for all transfers; FPS should handle routing FD 1 5 0 8 0 0 14 JRA1
tracked in 2933
0 LHCb: 214,215 to be merged Obsolete
216 FTS should handle proxy renewal FD 5 1 3 3 0 0 12 JRA1
tracked in 2933
0   Done
217 SRM interface integrated to allow specification of storage type, lifetime, pinning, etc. FD 2 2 2 1 0 0 7   0 LHCb: different types of storage should have different SEs; pinning is important here Done
219 Support for VO specific plug-ins FD 0 1 0 0 0 0 1   0   Done
230 File Placement Service FD - - - - - - 0 - - ATLAS comment: is it not included now in the FTS specs? FPS is covered by FTS plus plug-ins; obsolete
231 FPS plug-ins for VO specific agents FD 0 0 0 1 0 0 1   0   obsolete
233 FPS should handle replication FD 0 0 0 1 0 0 1   0 choosing the sources automatically obsolete
234 FPS should handle transfers to multiple destinations FD 0 0 0 1 0 0 1   0 LHCb: 233,244 to be merged obsolete
241 LFC as global and local catalogue with a peak access rate of 100 Hz FD 5 5 5 4 0 0 19 testing 0 Sites: it would be good to clarify the roles of the global and local catalogues; they seem to get out of sync. ATLAS use case: done in their framework
242 Support for replica attributes: tape, pinned, disk, etc. FD 1 1 3 0 0 0 5   0   Obsolete
251 Emphasis on read access FD 3 0 0 0 0 0 3   0   Done
252 Unauthenticated read-only instances FD 3 2 0 4 0 0 9   0 Sites: will reject because it opens a DoS vector. If users insist, they get zero guarantee about downtime. obsolete
253 Bulk operations FD 3 2 0 4 0 0 9   0 Sites: optional async bulk deletes from SRM server? done
261 lcg-utils available in production FD 5 0 0 4 0 0 9   0   Done
301 Configuration that defines a set of primary RBs to be used by the VO for load balancing and allows defining alternative sets to be used in case the primary set is not available FD 5 0 0 2 0 0 7   UIs may be configured to use a number of RBs (chosen randomly). To do it in a smarter way will take ~1 FTE-month, but specifications are needed   obsolete
304 Handling of 10**6 jobs/day FD 3 5 10 0 0 0 18 testing bulk match-making: 6 FTE*months
Upgrade to a new version of Condor (first post-gLite 3.0 release)
use the CREAM CE (a prototype targeted at the gLite 3.0 infrastructure is available but not integrated and tested)
LHCb: this is a metric, not a task obsolete
305 Using the information system in the match-making to send jobs to sites hosting the input files AND providing sufficient resources FD 0 2 0 0 0 0 2   done   Done
306 Better input sandbox management (caching of sandboxes) FD 0 1 2 0 0 0 3   Sandbox as URL (gsiftp) in gLite 3.0
Download from http server is already possible. Some work needed for uploads: 2 FTE*months
  Done
309 RB should reschedule jobs in the internal task queue FD 0 0 2 0 0 0 2   Should be merged with 101   obsolete
323 Scalable tool for VO specific information (job status/errors/..) FD 0 0 2 0 0 0 2   0 Sites: is this not the dashboard? CMS: open
moved to done thanks to the dashboards
332 Accounting by VO specified tag that identifies certain activities. These could be MC, Reconstruction, etc. FD 0 0 1 0 0 0 1   0  
401 Read-only mirrors of LFC service at several T1 centers updated every 30-60 minutes FD 0 0 0 3 0 0 3   0 Sites: already provided by 3D project. Done
403 XROOTD at all sites FD 5 0 0 0 0 0 5 closed done by LCG Sites: not an EGEE requirement ?
404 VOBOX at all sites FD 5 2 0 2 0 0 9 VOBox WS 0 requested by ALICE, ATLAS, CMS; LHCb requested T1s and some T2s. ATLAS comment: awaiting conclusions of the NIKHEF workshop. Sites: not an EGEE requirement done
406 dedicated queues for short jobs FD 1 5 1 0 0 0 7 SDJ WG should be merged with 511   Working SDJs have been demonstrated; sites that want to support this are able to
407 Standardized CPU time limits FD 3 5 0 3 0 0 11 JRA1
tracked in 2942
0 Sites: more important are standards for publishing the information; for example, if one publishes a CPU time limit, this should be actual time and not some scaled time. obsolete
408 Tool to manage VO-specific, site-dependent environments FD 0 0 1 0 0 0 1   0 Sites: this could mean almost anything. What does it actually mean? CMS: obsolete
409 Rearranging priorities of jobs in the local queue FD 0 2 0 0 0 0 2   Should be merged with 101 ATLAS: requirement for a priority system including local queues at the sites, able to rearrange the priority of jobs already queued at each site in order to take care of new high-priority jobs being submitted. Such a system requires some deployment effort, but essentially no development, since such a feature is already provided by most batch systems and is a local implementation, not a Grid one. CMS: obsolete Sites: decided to move to long term (Witzig recommendations)
moved to 'done/obsolete' due to widespread use of pilot jobs
410 Package management VOB - - - - - - 0 - - according to the requirements document (need to link!); simple, bare-bones implementation initially. Sites: also an old GAG requirement obsolete
521a Use a batch system that can handle the "CPU count problem" NA4 0 0 0 0 3 5 8   0 required for gLite 3.0. This problem arises because of a scheduling mismatch in the versions of Maui/Torque used by default. The end result is that typically an MPI job can only use half of the CPUs available at a site, yet the broker will happily schedule jobs that require more on that site. These jobs will never run. done
522 Publication of whether the home directories are shared (alternatively, transparently move sandboxes to all allocated nodes) NA4 0 0 0 0 3 0 3 JRA1
tracked in 2939
0 required for gLite 3.0 done
523 Ability to run code before/after the job wrapper invokes "mpirun" NA4 0 0 0 0 2 5 7 JRA1
tracked in 2940
A job prologue executed before the job will be available in the first post-gLite 3.0 release of WMS. An epilogue is also possible, but developers need to know the required semantics. Required after gLite 3.0. This will allow compilation and setup of the job by the user done
560 Job Dependencies NA4 0 0 0 0 0 0 0   0 Use cases: applications often require workflows with dependencies done via DAGs
561 Ability to specify arbitrary (non-circular) dependencies between jobs inside a set of jobs NA4 0 0 0 0 2 2 4   Done via DAGs required after gLite 3.0
562 Ability to query the state and control such jobs as a unit NA4 0 0 0 0 2 0 2   0 required after gLite 3.0 done via DAGs
563 Ability to query and control the sub-jobs NA4 0 0 0 0 1 0 1   0 required after gLite 3.0 done via DAGs
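To illustrate 561-563: checking that a set of job dependencies is non-circular and deriving a valid submission order is exactly what a DAG-type job does. A minimal Python sketch with hypothetical job names, using the standard-library graphlib module (Python 3.9+):

<verbatim>
from graphlib import TopologicalSorter, CycleError

# job -> set of jobs it depends on (hypothetical workflow)
dependencies = {
    "merge":    {"reco_1", "reco_2"},
    "reco_1":   {"simulate"},
    "reco_2":   {"simulate"},
    "simulate": set(),
}

try:
    order = list(TopologicalSorter(dependencies).static_order())
    print("submission order:", order)  # 'simulate' first, 'merge' last
except CycleError:
    print("circular dependencies; the DAG would be rejected")
</verbatim>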
570 Metadata Catalogue NA4 0 0 0 0 0 0 0   0 Use cases: identify datasets based on metadata information done via AMGA
571 Ability to add metadata according to user defined schema NA4 0 0 0 0 6 5 11 SA3 0   done via AMGA
573 Ability to distribute metadata over a set of servers NA4 0 0 0 0 3 0 3   0   done via AMGA

-- Main.markusw - 21 Dec 2005

-- Main.markusw - 06 Jan 2006 filled in the HEP and NA4 issues and requirements. There are still duplicates, especially in the area of short jobs

-- Main.markusw - 09 Jan 2006

-- Main.markusw - 10 Jan 2006

-- ErwinLaure - 03 Feb 2006 modified table according to discussion of Jan 18.

-- ErwinLaure - 13 Mar 2006 added savannah tasks and updates according to discussion of Feb. 22.

-- ClaudioGrandi - 20 Dec 2006 added request from Savannah

-- ErwinLaure - 17 Mar 2008 - moved done/obsolete items to a separate list at the end
