LCG Management Board

Date/Time:

Tuesday 16 May 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061501

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 19.5.2006)

Participants:

A.Aimar (notes), L.Bauerdick, K.Bos, T.Cass, Ph.Charpentier, J.Gordon, B.Gibbard, V.Guelzow, I.Fisk, M.Lamanna, E.Laure, H.Marten, G.Merino, G.Poulard, Di Qing, L.Robertson (chair), Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList 

Next Meeting:

Tuesday 23 May 2006 at 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of the Previous Meeting

Minutes approved.

 

1.2         Matters arising

-          All QR reports have been received (more information). The Executive Summary for the OB will now be prepared.

 

-          The accounting sheets are expected from the Tier-1 sites before the end of the week (already in the action list).

 

Action:

20 May 06 – L.Robertson to add a “comment” field to the site accounting sheet template.

 

2.      Action List Review (list of actions, more information)

 

 

No actions due for this week.

 

3.      Proposal for the Internal Review of the LCG Services (more information) - V.Guelzow

 

 

The attached document includes the mandate of the Internal Review of the LCG Services, the description of the review procedures and a questionnaire for the Tier-1 sites.

 

Review Overview

The main focus of the review will be:

-          Organization and readiness of the operations of the centers.

-          Readiness of the middleware, deployment, maintenance, collaboration between the teams.

-          Overall functionality, reliability and performance.

-          Interoperability and usage of different grids (EGEE, OSG, NorduGrid)

-          Integration of the Tier-2 centers into the service

 

Point to add to the review mandate:

Identify the essential components that are still missing in SC4 and needed by the experiments, together with the plans, from both EGEE and OSG, to make these components available for the LHC start-up.

 

The review will last 2 days (8-9 June 2006). The reviewers are: V.Guelzow (chair), J.Gordon, A.De Salvo, J.Templon, and F.Wuerthwein.

 

The reviewers will also collect information from other reviews made during the same period, without duplicating the presentations and the review work (Castor readiness review and SRM Interoperability Workshop).

 

Questionnaire

A questionnaire is included in order to collect as much information as possible before the review. It covers organizational matters, procedures, the status of the infrastructure and the readiness of user support and maintenance.

 

 

Agenda

The agenda outline is:

-          First day.
Tier-1 status and organization.

-          Second day, AM
Status of the middleware, development and deployment, priorities and issues.

-          Second day, PM
Status of SRM 2 development and deployment planning (M.Litmaath). Summary of the CASTOR readiness review (J.Harvey).

 

 

Action:

20 May 06 – Produce the agenda of the Internal Review of the LCG Services.

Update: Done. Here is the agenda: http://agenda.cern.ch/fullAgenda.php?ida=a062385

 

Representatives

Sites should send representatives and have a contact person for all issues of the review. Preferably the site should send someone in person to the review.

 

Action:

23 May 06 – Sites should send the names of the contact persons who will attend the review meeting, preferably in person.

 

Action:

23 May 06 – Sites should complete and send the questionnaire to the chair, V.Guelzow.

 

 

4.      Scheduling of Service Interruptions at LCG Sites (more information) - J.Shiers

 

 

The document attached to the agenda, which describes the organization and basic principles for service interventions, was discussed.

 

The discussion focused on the advance warning required for scheduled interruptions, and on whether a site should drain all the jobs in its queues beforehand.

 

In summary:

-          Any interruption up to 4 hours should be announced at least one working day in advance.

-          Interruptions longer than 4 hours should be announced at the Weekly Operations meeting prior to the interruption.

-          Interruptions longer than 12 hours should be announced at least one week in advance.

-          This procedure should be independent of whether or not the accelerator is operating.

-          Upgrades should be considered interventions and announced accordingly if they affect the service.
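
For illustration only, the announcement rules summarised above can be sketched as a small lookup. The names Notice and required_notice are hypothetical and are not part of any LCG tool. The sketch assumes that the rule for interruptions longer than 12 hours takes precedence over the rule for those longer than 4 hours, and that it is applied only to interventions that affect the service.

```python
from datetime import timedelta
from enum import Enum


class Notice(Enum):
    """Minimum announcement required, as summarised in the minutes."""
    ONE_WORKING_DAY = "at least one working day in advance"
    WEEKLY_OPERATIONS_MEETING = "at the Weekly Operations meeting prior to the interruption"
    ONE_WEEK = "at least one week in advance"


def required_notice(duration: timedelta) -> Notice:
    """Return the minimum announcement for a scheduled interruption.

    Assumption: the longest matching threshold wins, so an interruption
    of more than 12 hours needs one week of notice. The rule does not
    depend on whether the accelerator is operating.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if duration > timedelta(hours=12):
        return Notice.ONE_WEEK
    if duration > timedelta(hours=4):
        return Notice.WEEKLY_OPERATIONS_MEETING
    return Notice.ONE_WORKING_DAY


if __name__ == "__main__":
    for hours in (2, 4, 6, 12, 48):
        print(hours, "hours:", required_notice(timedelta(hours=hours)).value)
```

An interruption of exactly 4 hours falls under the first rule ("up to 4 hours"), and one of exactly 12 hours under the second, since the thresholds in the summary are stated as "longer than".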

 

Decision:

For the moment all interventions that affect the services in use should be reported.

Later there may be a differentiation between categories of services, by importance of the service and severity of the problems.

 

Action:

31 May 06 – J.Shiers should clarify with the LCG Operations how, and to whom, the announcements of interventions should be distributed.

 

Update: Here is the updated document, modified by J.Shiers to reflect the discussion that took place at the meeting.

 

5.      Review of Site Monitoring and Operation in SC4 (more information) - J.Shiers

 

The attached document provides an executive summary on the state of site monitoring and on the Tier-0/Tier-1 throughput tests carried out in April 2006.

5.1         Critical issues

 

1. Ramp up period too long.

It took about two weeks to reach the required performance level, even though the same tests had already been done in January. This seems to show that, at some sites, the data transfer is not integrated with the operational service.

 

2. Rates to tapes missing.

Some sites never provided the performance rates of the tape tests, neither in real time nor in reports after the event.

 

3. Site monitoring.

Many issues and problems have been detected by manually monitoring the transfers from CERN; sites often fail to detect transfer problems with their local operations monitoring.

 

4. Service interventions.

Discussed in the previous item in today’s agenda.

 

5. Reporting of operational problems.

The reporting of operational problems is, for the moment, inconsistent and insufficient, and needs to be improved.

 

5.2         Recommendations

 

6. Monitoring CPU usage, disk and tape rates.

Sites should provide a schedule for implementing the monitoring of data rates to disk and tape. A common solution would be preferable but is not easy to achieve. Different sites already use different tools (Lemon at CERN, Ganglia at FZK, etc.), and discussing or changing such tools is not an urgent priority. Sites that do not have a defined solution would benefit from a common discussion of the issue.

 

Action:

31 May 06 – K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities to provide information as needed by the VOs.

 

Action:

23 May 06 – V.Guelzow will add a question on the “site monitoring system status” to the internal review’s questionnaire.

 

7. Monitoring of the services at the sites.

This is not easy to implement because the middleware does not provide the appropriate APIs to control the status of each service.

 

Action:

15 June 06 – I.Bird to report to the TCG the need for APIs to verify the working status of the middleware services.

 

8. Scheduled interventions.

Already discussed.

 

9. Site operational logs.

The proposal was that all sites should maintain logs visible to the other partners in the LCG. Reports should be open and visible, and problems should be reported as soon as they appear, not only after they have been solved.

 

RAL and TRIUMF have very good examples of collections of log information, although this information is not distributed to the other partners.

 

The discussion at the MB highlighted that this infrastructure is not easy for some sites to set up, and no agreement was reached. In some cases sites use mailing lists for all sys admin issues, or their log database also contains information that is private to the site and cannot be published as such.

 

The Management Board did not decide on any policy about site operational logs.

However, the MB considers it important that the Weekly Report at the LCG Operations meeting improves and provides sufficient information on the status of the services, on the operation log and on the incidents that have occurred.

 

10. Schedule of recovery downtimes.

By the end of May there should be a plan on how to recover from scheduled downtimes at the Tier-0 and Tier-1 sites, both for short (4 hours) and long (about 48 hours) interventions.

 

Action:

31 May 06 – J.Shiers to propose a plan for demonstrating the capability to recover from short and long interventions on the Tier-0 and Tier-1 sites.

 

11. Low priority transfers.

While the experiments use the grid for production during SC4, the transfers between Tier-0 and Tier-1 sites should continue as low priority transfers. These low priority transfers should become a permanent service, monitored and able to run almost unattended, rather than being sustained by individual “special effort” as they are currently.

 

The proposal was that this service be monitored in turns, with a different site acting as coordinator each time, but the meeting did not reach a conclusion on the subject.

 

Related issue: Tier-1 to Tier-1 and Tier-2 transfers

The Tier-1 sites should set up the infrastructure and configuration for transfers to Tier-1 and Tier-2 sites, as needed by the experiments. This is an important SC4 milestone and is becoming very urgent. Not all information on the status and set-up of these transfers was available during the MB meeting; the missing information will be collected later.

 

Action:

23 May 06 – Tier-1 sites should confirm that they have set up and tested their FTS channel configuration for transfers to Tier-1 and Tier-2 sites.

5.3         Metrics

 

This section is postponed to the next MB meeting.

 

6.      AOB

 

 

No AOB.

 

7.      Summary of New Actions

 

 

20 May 06 – L.Robertson to add a “comment” field to the site accounting sheet template.

 

20 May 06 – Produce the agenda of the Internal Review of the LCG Services.

Update: Done. Here is the agenda: http://agenda.cern.ch/fullAgenda.php?ida=a062385

 

23 May 06 – Sites should send the names of the contact persons who will attend the review meeting, preferably in person.

 

23 May 06 – Sites should complete and send the questionnaire to the chair, V.Guelzow.

 

23 May 06 – Tier-1 sites should confirm that they have set up and tested their FTS channel configuration for transfers to Tier-1 and Tier-2 sites.

 

23 May 06 – V.Guelzow will add a question on the “site monitoring system status” to the internal review’s questionnaire.

 

31 May 06 – J.Shiers should clarify with the LCG Operations how, and to whom, the announcements of interventions should be distributed.

 

31 May 06 – K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities to provide information as needed by the VOs.

 

31 May 06 – J.Shiers to propose a plan for demonstrating the capability to recover from short and long interventions on the Tier-0 and Tier-1 sites.

 

15 June 06 – I.Bird to report to the TCG the need for APIs to verify the working status of the middleware services.

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.