TOP ROC issues (to be raised at the TCG)

This is the list of issues gathered from the sites through the ROCs to be raised to the TCG on a regular basis.

Links to the reports on the discussions at the TCG are here:

Process:

  • Every ROC can submit up to 3 issues
  • ROCs may be requested to add details and a requested action to each of their points.
  • The ROCs will be asked to vote on the list (indicating their top 5 issues) on a regular basis.
  • The list will be reviewed and presented to the TCG on a regular basis.

Num. ADDED BY DESCRIPTION DATE ADDED STATUS
1 Italy, France, DECH
Fine-grained list of RPMs per service component (and not only per node/service). Those lists should include all dependencies, both internal and external, and should be available at an official and stable URL.

Does the information at this URL: http://glite.web.cern.ch/glite/packages/R3.0/deployment/ satisfy this requirement?
- Update from ROC-IT:
The lists are OK for ROC-IT
- Update from ROC France 10/07/07:
1) About 50% of French sites are now using Quattor to deploy the middleware. Unfortunately, people working with Quattor are still experiencing problems with missing RPMs. For more details about the encountered problems, contact Michel Jouvin, who is a representative of the Quattor Working Group and submitted some GGUS tickets (e.g. #18358, #23255). In conclusion, we are afraid some problems remain with the current process used to build the RPM lists at new release time.
2) The lists of RPMs are provided per node, not per M/W component as initially requested. So, even if this is not considered high priority, we remain very interested in the result of the SA3 investigations regarding the possibilities of the ETICS facilities, as proposed in http://egee-docs.web.cern.ch/egee-docs/operational_tools/Issues_raised_by_ROCs_to_TCG/ISSUES_DISCUSSED_AT_TCG.pdf.

- Update from ROC DECH 23/07/07:
The lists appear to be very helpful. However, they seem to be for SL3 (I checked the WN part of 3.0.22-2, which is our most important issue). As the experiments press for SL4, and we would be more than happy to go to SL4 as soon as we get the WNs to install, we won't look further into this, e.g. by doing test installations. A list for the SL4 WN installation would be more than welcome.
-Update from TCG (11Jul07):
The list of RPMs is given node by node => the reason is that many services cannot be hosted together, so the RPM list is given for frequently co-hosted combinations (the certified ones). We are aware that more combinations work, e.g. DPM + MONBOX. If needed, ROCs should bring other important combinations to SA3's attention.

-
2 Italy External dependencies: no restrictions on versions of dependencies unless proven necessary. All dependencies should be “version >= x.y.z”. Require middleware components to provide detailed information about all dependencies (e.g. at the packaging level). - CLOSED
3 DECH Reduce dependency on the OS: e.g. better support for the UI/WN tarball (it should be certified and released at the same time as the WN/UI RPMs are certified and released).

- Update from TCG(SA3): ongoing work on porting, ETICS will help
- Update from TCG(SA3): WN/UI tarball problems are now fixed

-
4 all Documentation: release notes should contain all information on: required configuration changes; detailed deployment instructions (for example, whether the service must be restarted after the upgrade); bugs fixed by the update; new functionality introduced by the update; etc.

- Update from TCG(SA3): Trying to improve documentation; the nodes that need to be reconfigured and restarted are now indicated.

-
5 UKI New services should come with clear and complete error messages. This should be part of the SA3 checklist (if it isn't already); new components should not be accepted if they don't provide this.

-Update from TCG(11Jul07): New services are supposed to come with better error handling, but it is difficult to verify as it is subjective and in any case difficult to handle during the certification phase. The 3.1 UI already has better handling of errors. Improving error messages is ongoing work for all services. But the TCG states that new services should not be deployed if error messages are not satisfactory or if a user-reported error cannot be correlated on the service side by the service administrator.

-Update from TCG 09 April 08: the reference to 'new' services should be dropped; this should refer to all services. Action #6711 is to discuss standards for error messages in a future meeting. Action #6712 reflects the TCG decision to view 'bugs on error messages' as valid and appreciated bugs. Action #6713 is for SA1 to disseminate encouragement to submit such bugs. Temporarily assigned to okeeble as Maite is not yet a registered TCG person (action #6714)!

- In Progress
6 France SE: Several storage areas per VO per SE. (This is being reworked in the Glue working group; the TCG should give it higher priority when they receive the Glue report.) Currently the Glue schema is certainly not fully used by lcg-utils; for example, it does not take into account the published MinFileSize.

- Is this still an issue? No discussion at the TCG since it received 0 votes.
- Response from France?

-
7 France SE: Publish network cost information to prevent job submission to sites at which network access to SEs is already producing a bottleneck.

- No discussion at TCG since 0 votes

-
8 AP SE: Provide a utility that can locate the physical location of a file in a DPM disk cluster.
(Removed at the request of the AP ROC.)
-
9 SWE SE, high priority: Implement "access control" in the Storage Elements. The current Glue 1.2 schema allows GlueSAs (Storage Areas) to be defined for a given SE. For each of these GlueSAs, one can publish an AccessControlBaseRule, in which in principle one could put a VOMS FQAN that identifies a group of users allowed to access that GlueSA. The middleware is not making use of this AccessControlBaseRule at all, so it is somewhat urgent that it does. In a way, this is also about making the SE "VOMS aware".
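
For illustration only (not an agreed implementation): the sketch below shows, in Python, how a service could match a user's VOMS FQAN against the AccessControlBaseRule values published for a GlueSA. The "VO:<vo>" and "VOMS:<fqan>" rule formats are assumptions based on common Glue 1.2/1.3 usage and should be checked against the schema documentation.

# Minimal sketch (not part of any release): check whether a user's VOMS FQAN
# matches the AccessControlBaseRule values published for a GlueSA.
# The "VO:<vo>" and "VOMS:<fqan>" rule formats are assumptions to verify
# against the Glue 1.2/1.3 usage at the sites.

def fqan_matches_rule(fqan: str, rule: str) -> bool:
    """Return True if the FQAN (e.g. '/atlas/Role=production') satisfies one rule."""
    if rule.startswith("VOMS:"):
        pattern = rule[len("VOMS:"):]
        # A rule for a group is assumed to also cover its subgroups/roles.
        return fqan == pattern or fqan.startswith(pattern.rstrip("/") + "/")
    if rule.startswith("VO:"):
        vo = rule[len("VO:"):]
        return fqan.split("/")[1] == vo if fqan.startswith("/") else fqan == vo
    return False

def allowed(fqans, rules):
    """True if any of the user's FQANs matches any published AccessControlBaseRule."""
    return any(fqan_matches_rule(f, r) for f in fqans for r in rules)

# Example usage with made-up values:
print(allowed(["/dteam/Role=NULL"], ["VO:dteam", "VOMS:/atlas/Role=production"]))  # True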

- Is this still an issue?
- Response from SWE?

- Update TCG 09 April 08: JT should submit (#6716) the associated bugs. Note this is related to item #6.

-
10 UKI Implement VO quotas on shared disk pools provided by DPM or dCache. Publish the quota, and also the available space per VO, based on this quota.

- Update from TCG: Need to wait for SRMv2
- Update from RM: Maite to ask Flavia if this is tested in PPS where srmv2 is deployed
-Update from TCG(11Jul07): DPM: Jean-Philippe Baud is working on a first implementation of VO quotas. Tests and feedback from users/experiments are expected then. Based on this, dCache will implement VO quotas.
- Update 23/07/07: From Lana Abadie: VO quotas were briefly discussed in the GSSD meeting in May. For the DPM, a prototype is being implemented. For Castor and dCache, nothing has been done, and I don't think it is their highest priority for now. For the moment, with SRM v2.2, you can reserve space for a VO statically (because of Castor) and also update it if necessary.

-Update tcg 09 April 08: status unknown: JRA1 to check (#6717) on the status.

- In Progress
11 SWE SE: Publish the type of SE (tape, disk, etc.) in the information system to avoid storing lots of small files on tape storage.
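
As an illustration of how a client or tool could already consume this information, here is a minimal sketch that queries a BDII for the GlueSEArchitecture value of each SE. It assumes the third-party Python 'ldap3' package and the usual BDII LDAP port 2170 with the 'o=grid' base; the hostname is a placeholder.

# Minimal sketch, not an official tool: list SEs and their published
# GlueSEArchitecture (disk, tape, multidisk, ...) from a BDII.
# Assumes the third-party 'ldap3' package and a BDII reachable on port 2170
# with the usual 'o=grid' base; adjust the hostname to a real BDII.
from ldap3 import Server, Connection, ALL

server = Server("ldap://bdii.example.org:2170", get_info=ALL)   # placeholder host
conn = Connection(server, auto_bind=True)                        # anonymous bind

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueSE)",
    attributes=["GlueSEUniqueID", "GlueSEArchitecture"],
)

for entry in conn.entries:
    # A client could skip SEs whose architecture indicates tape when storing small files.
    print(entry.GlueSEUniqueID, entry.GlueSEArchitecture)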

- Isn't the GlueSEArchitecture Glue schema value an answer?
This answer is considered sufficient by SWE.

-
12 France, UKI Provide a better and clearer logging system, with a standard logging format. This should be part of the SA3 checklist (if it isn't already); new components should not be accepted if they don't provide this. Also, the logging system must be able to store the logs on a remote host (as the "syslog" system does).
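
To make the remote-logging part of the request concrete, here is a minimal sketch (using only the Python standard library) of a service forwarding its log records to a central syslog host in one common, parseable format; the host name and service name are placeholders.

# Minimal sketch of remote logging via syslog, using only the Python standard
# library. 'loghost.example.org' is a placeholder for the site's central log host;
# the service would otherwise keep its usual local log files as well.
import logging
import logging.handlers

logger = logging.getLogger("my-grid-service")          # hypothetical service name
logger.setLevel(logging.INFO)

# UDP syslog to a remote collector on the standard port 514.
remote = logging.handlers.SysLogHandler(address=("loghost.example.org", 514))
remote.setFormatter(logging.Formatter(
    "%(asctime)s %(name)s %(levelname)s %(message)s"   # one common, parseable format
))
logger.addHandler(remote)

logger.info("service started, version=%s", "1.2.3")
logger.error("transfer failed: %s", "connection refused")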

- Update from TCG 18th April 2007 (JRA1): We are addressing it; it is now in the JRA1 workplan.
- Update from TCG 11Jul07 (JRA1): The document on common logging is being finalized. It will soon be communicated to the security group and the ROCs.

- Update from TCG 17 Oct 2007: The document was finalized by JRA1. This should be added as a TCG priority; every component should follow the recommendations, together with their other work. John will bring it up at the EMT/JRA1 steering meeting.
- Update TCG 09 April 08: The document exists; the issue needs to be reopened with new wording stating that the JRA1 logging document needs to be implemented by the software.
-
13 France Provide diagnostic scripts for each service so that, after installation, admins can verify whether there are any configuration or functionality problems. These scripts will also help remote diagnosis by providing output that can be sent back for analysis.
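
A minimal sketch of the kind of probe meant here (not an official SA3/JRA1 probe): it checks that a service port answers and prints a one-line, machine-readable result. Nagios-style exit codes are assumed so that monitoring frameworks could reuse the same probes; host and port are placeholders.

#!/usr/bin/env python
# Sketch of a post-installation diagnostic probe (not an official SA3/JRA1 probe):
# checks that a service port answers and prints a one-line, machine-readable result.
# Nagios-style exit codes (0=OK, 2=CRITICAL) are assumed here.
import socket
import sys

HOST = "se.example.org"   # placeholder service host
PORT = 8446               # placeholder service port

def check_tcp(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print("CRITICAL: cannot connect to %s:%d (%s)" % (host, port, exc))
        return False

if __name__ == "__main__":
    if check_tcp(HOST, PORT):
        print("OK: %s:%d is listening" % (HOST, PORT))
        sys.exit(0)
    sys.exit(2)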

- Update from TCG 18th April 2007 (JRA1): Started working with the monitoring group in order to see what is already in place and which are the most important things to monitor.
- Update from TCG (11Jul07): JRA1: started to work on this with the Monitoring Group. The Monitoring Group tried to encapsulate what is critical for a service to run and put it in "probes". The probes will be given back to the developers, and should be maintained by them as the service changes.

-Update TCG 09 April 08: open issue, monitoring group is working well but there is no visible contact between them and JRA1

- In Progress
14A France, UK Provide administration tools such as: add/remove/suspend a VO or a user, add/remove/close/drain a queue, close a site (on the site BDII), close a storage area. A good example is the FTS: you can dynamically add or remove a transfer channel by way of a command. The FTS node is easier to administrate.
Click on the following link for the detailed list of administration tools being requested: RequestedAdminTools

The above text replaces: Provide a service-specific management interface for each of the middleware services. Clarifications: We don't mean "start/stop/status" of Linux services. We are actually talking about administration tools which are more or less node-specific. For example, add/remove/suspend a VO from a node, add/remove/close a CE queue, close a site (on the site BDII), close a storage area, etc. We don't want to launch a node configuration. A good example is the FTS: you can dynamically add or remove a transfer channel by way of a command. The FTS node is easier to administrate.

- Update from TCG 18th april 2007 (JRA1): Not addressed for now but not ignored. A solution is impossible in the short term.
- Update from TCG 11Jul07 (JRA1): Not addressed for now but not ignored. A solution is impossible in the short term.
- Update from TCG 17 Oct: At EGEE'07, during the joint SA1/JRA1 session, Alessandra Forti suggested creating a working group per component to check/push them to have a minimal set of administrative tools (startup scripts, etc.)
- Update 15 Feb08: A single WG for all services has been created and it is collecting requirements from sysadmins.

-Update 09 April 08: still valid, work in progress

- In Progress
14B UK A common control interface to all grid services to provide (at a minimum) remote stop/start/status. A common method would give monitoring frameworks a simple way to check that a service was actually working, and allow simple remote management based on certificate identity or VOMS role/group, e.g. by ROCs or sysadmins at other sites or, in extreme cases, the COD. If there exists a standard web service interface for this, so much the better. By defining our own standard for this, the WSDL and some of the code could be re-used between gLite components and the client end would be the same for all.
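
Purely as an illustration of the requested contract (this is not an agreed gLite interface), the sketch below shows a common stop/start/status interface in Python with a simple authorization check on the caller's certificate DN or VOMS FQAN; the DN and FQAN values are placeholders for whatever authorization source (ROC, site, COD) would really be used.

# Purely illustrative sketch of the kind of common control contract meant here;
# this is not an agreed gLite interface. The DN/FQAN lists are placeholders.
from abc import ABC, abstractmethod

AUTHORIZED_DNS = {"/DC=org/DC=example/CN=Site Admin"}      # hypothetical
AUTHORIZED_FQANS = {"/ops/Role=lcgadmin"}                   # hypothetical

class GridServiceControl(ABC):
    """Minimal remote-management contract: stop / start / status."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def status(self) -> str:
        """Return a short machine-readable state, e.g. 'running' or 'stopped'."""

    def handle(self, action: str, caller_dn: str, caller_fqans=()) -> str:
        """Dispatch a remote request after a simple identity/role check."""
        if caller_dn not in AUTHORIZED_DNS and not (set(caller_fqans) & AUTHORIZED_FQANS):
            raise PermissionError("caller not authorized: %s" % caller_dn)
        if action == "status":
            return self.status()
        if action in ("start", "stop"):
            getattr(self, action)()
            return "%s requested" % action
        raise ValueError("unknown action: %s" % action)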

- Update from TCG 18th april 2007 (JRA1): Not addressed for now but not ignored. A solution is impossible in the short term.
- Update from TCG 11Jul07 (JRA1): Not addressed for now but not ignored. A solution is impossible in the short term.

- In Progress
15 France Storage area occupancy management service: VOs should be able to explicitly specify for their community what each storage area is for at the site level. This is very important when a site supplies different storage spaces to a VO.

- No discussion at TCG since 0 votes.

-
16 SWE Grid service failover for those services that still do not have it: R-GMA registry, R-GMA, BDII, site-BDII, CE, WMS, RB, gCE, SRM, SE, LFC, FTS. Or provide an installation profile describing how to make them highly available.

- No discussion, but see issue #17

- NEW (but see #17)
17 UKI Failover in clients and user tools. E.g. enable user data management tools to use redundant BDIIs to look up information, so that if the primary BDII specified by LCG_GFAL_INFOSYS does not respond, a backup BDII can be used instead. This will eliminate the need to set up an HA BDII service, which most sites do not have. There are clear problems that are non-site-specific, such as BDII failures, which are being reported against sites by the SAM tools. BDII stability has to be a high priority.
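
A minimal sketch of the client-side fallback being requested: try each BDII endpoint listed in LCG_GFAL_INFOSYS and use the first one that answers. Treating the variable as a comma-separated "host:port" list is an assumption to be checked against the deployed lcg-utils/GFAL version; the hostnames in the example are placeholders.

# Sketch of the client-side fallback being requested: try each BDII endpoint in
# LCG_GFAL_INFOSYS (assumed here to be a comma-separated "host:port" list) and
# use the first one that answers on its LDAP port.
import os
import socket

def pick_bdii(env="LCG_GFAL_INFOSYS", timeout=5.0):
    endpoints = [e.strip() for e in os.environ.get(env, "").split(",") if e.strip()]
    for endpoint in endpoints:
        host, _, port = endpoint.partition(":")
        try:
            with socket.create_connection((host, int(port or 2170)), timeout=timeout):
                return endpoint             # first responsive BDII wins
        except OSError:
            continue                         # dead/unreachable BDII: try the next one
    raise RuntimeError("no responsive BDII in $%s" % env)

# Example (placeholder hosts):
# os.environ["LCG_GFAL_INFOSYS"] = "bdii1.example.org:2170,bdii2.example.org:2170"
# print(pick_bdii())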

- Update from TCG 18th April 2007 (JRA1): BDII/lcg-utils - it is in the workplan to have clients use more than one BDII.
- UI: it is already possible to use more than one WMS.
- Update from TCG 11Jul07: Considered very important by the ROCs. BDII failover in lcg-utils is being worked on. WMS failover on the UI is being worked on.

-Update TCG 09 April 08: yes, however the text should be more specific

- In Progress
18 France, UKI Pass parameters to the LRMS: In order to improve the efficiency of the LRMS, some information from the user's job description should be passed to the CE through the RB, for instance the required amount of memory, the required size of scratch space, and the required maximum CPU time.
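
For illustration, a small sketch of the mapping meant here: requirements taken from the user's job description (memory, scratch space, maximum CPU time) translated into an LRMS resource request. A Torque/PBS-style "qsub -l" resource list is assumed; the exact resource names must be checked against the local batch system.

# Illustration of the mapping meant here: job-description requirements
# (memory, scratch space, max CPU time) translated into an LRMS resource request.
# A Torque/PBS-style "qsub -l" resource list is assumed; the resource names
# (mem, file, cput) are assumptions to verify against the local batch system.
def lrms_resource_args(requirements):
    parts = []
    if "memory_mb" in requirements:
        parts.append("mem=%dmb" % requirements["memory_mb"])
    if "scratch_mb" in requirements:
        parts.append("file=%dmb" % requirements["scratch_mb"])
    if "max_cpu_seconds" in requirements:
        s = requirements["max_cpu_seconds"]
        parts.append("cput=%02d:%02d:%02d" % (s // 3600, (s % 3600) // 60, s % 60))
    return ["-l", ",".join(parts)] if parts else []

# Example: requirements taken from the user's job description.
print(lrms_resource_args({"memory_mb": 2048, "scratch_mb": 10240, "max_cpu_seconds": 7200}))
# -> ['-l', 'mem=2048mb,file=10240mb,cput=02:00:00']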

- Update from TCG (27th June): There was a detailed discussion; see the minutes.

- In Progress
19 SWE VOMS: Get rid of the VO LDAP servers and make all services that use them VOMS-aware.

- This should be solved.
This answer is considered sufficient by SWE.

-
20 DECH Test under real conditions. E.g. make it easier to tailor the MW to specific fabric needs (e.g. Quattor, batch systems, etc.)
CLARIFICATION - It means:
This is not restricted to regional certification; moreover, it is raised as a top issue specifically for middleware design, development and rollout (SA3 and JRA1). E.g. if a middleware distribution were designed for one single use case, we would have a problem.
Real conditions means:
* the real situations of as many sites as possible, such as
* small sites - like universities etc.: they have no cluster of their own to use (it is e.g. shared with different institutes), which means a large number of conditions for the site's administrators to respect, and they have little manpower for MW installation and configuration
* large sites - with different non-grid VOs in production using the same WNs with the same OS, and possible conflicts with the MW etc.: they want it to be easy to tailor the middleware to their needs, if necessary at all.
* the use of the middleware in production (maintenance concepts, garbage collection etc.)
Test means, to check
* for small sites, whether the default installation works
* for large sites, whether the MW is easy enough to tailor in detail (in general, and with Quattor as an example)
* for both, that there are as few dependencies as possible on the OS and OS flavour

- Update from TCG: The project cannot put in more effort than is already in place. Testing under real conditions should be done in the PPS.
- Update from DECH (23/07/07)
This issue can be closed. We'll raise more specific, single issues instead in future.

-
21 DECH Need clear concepts for maintenance and garbage collection for each persistent/core MW node (like RB/WMS and storage services)

- Update from TCG: There are indications from the security group (90 days for logs related to security); the rest is an operational issue and should be handled inside SA1.
- Update from DECH (23/07/07)
We actually do not care who provides the concepts, SA1 or someone else. Nevertheless, we see middleware development as the main contributor to these concepts. The core services are simply much too complicated to be operated by non-experts without clear instructions on how to clean things up.

-Update 09 April 08: SA3 will update the patch acceptance criteria accordingly (action #6715).

- In progress
22 SWE Make LFC-client work with certificates containing a dot "." in their DN

- Update from RM: this is a bug for GGUS/Savannah and not an issue for the TCG. Maite to verify if a ticket/bug was created.
6/07/07 Update from Maite: GGUS ticket (unsolved): https://gus.fzk.de/ws/ticket_info.php?ticket=12389, Savannah bug: https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=19878. It is in progress, but the last comment dates from 2007-02-07: "We have a proposed solution, but it needs careful testing as it is a change in the very heart of the authorization logic."
- Response from SWE?

- Update TCG 09 April 08: should be removed, it's a bug not a requirement.

? In progress
23 CERN A middleware tool for carrying out the bulk removal of files from SEs, appropriately updating the catalogues, etc. is needed. (This came out of the Grid Operations Meeting.)
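
Until such a tool exists, a stopgap could look like the sketch below: loop over a list of LFNs and call the existing lcg-del client for each one so that replicas and catalogue entries are removed together. The lcg-del options used ('-a' for all replicas, '--vo') are assumptions to be verified against the installed lcg_utils version.

# Sketch of a stopgap bulk-deletion driver (not the requested middleware tool):
# loop over a file of LFNs and call the existing lcg-del client for each one,
# so that replicas and catalogue entries are removed together. The lcg-del
# options ('-a' for all replicas, '--vo') are assumptions to check against the
# installed lcg_utils version.
import subprocess
import sys

def bulk_delete(lfn_file, vo):
    failures = []
    with open(lfn_file) as handle:
        for lfn in (line.strip() for line in handle):
            if not lfn:
                continue
            result = subprocess.run(["lcg-del", "-a", "--vo", vo, lfn])
            if result.returncode != 0:
                failures.append(lfn)     # keep going; report failed LFNs at the end
    return failures

if __name__ == "__main__":
    failed = bulk_delete(sys.argv[1], sys.argv[2])   # usage: bulk_del.py lfns.txt dteam
    print("%d deletions failed" % len(failed))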
-Update from TCG 11Jul07: Bulk delete methods will be provided for the LFC and the DPM Name Server (ongoing work). But the request should be sent to each SE implementation...
2007/03/30 CLOSED
  TCG Site reps HANDLING UPDATES:
* A high-priority update should contain only the components that are impacted by it.
* Critical service updates with non-trivial procedures involving a temporary (even short) shutdown of the service should also be released as separate updates (e.g. a DPM release involving a database schema upgrade).
* Non-critical updates, in particular updates impacting only the client side of services, can be bundled together in one update.

ANSWER: SA3 will provide clearer and more fine-grained information about updates, e.g. the priority per service.
Each service will publish to the information system the gLite update it is running (short term); in the long term the idea is that it publishes the service version it runs.
2007/01/09 In Progress
  TCG Site reps We received the following request from Romain Wartel
"In order to address common questions from the sites with regards to the
security configuration of the middleware, and in order to improve the overall
understanding of the security aspects of the middleware components, the OSCT
(http://cern.ch/OSCT) would like to request the relevant developers to provide
the necessary information to complete the following questionnaire.
The knowledge gained from the questionnaire will be used to prepare security
best practice documentation for the sites, as well as targeted security
training. One of the main objectives is also to give a presentation during the
joint MWSG/OSCT session at EGEE08.
It is therefore essential that the questionnaire is completed before Friday 29
August 2008."
Questionnaire sent to the TMB mailing list.
2008/07/15

Votes

  • 5 points for your ROC's No. 1 priority
  • 4 points for the No. 2 priority
  • 3 points for the No. 3 priority
  • 2 points for the No. 4 priority
  • 1 point for the No. 5 priority

Num AP CE CERN DECH FR IT NE RU SEE SWE UKI SUM RANK
1         3 3           6 5
2                       0 ~
3                 2     2 ~
4           1 3 2 4   1 11 5
5     2 3             2 7 4
6                       0 ~
7                       0 ~
9                   1   1 ~
10                       0 ~
12   1 5   1   1 1 5     14  
13 2     4               6 7
14a   3 3   2 2     3   3 16  
14b                       0 NEW
15                       0 ~
16 1                 2   3 8
17 3 2 4       2   1     12  
18                     4 4 ~
20                       0 ~
21       5               5 ~
22                   3   3 8
23     1               5 6 NEW
