LCG Management Board

Date/Time:

Tuesday 22 January 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=26624

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 25.1.2008)

Participants:

A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, J.Gordon, F.Hernandez, U.Marconi, H.Marten, G.Merino, Di Qing, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 29 January 2008 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 13 Jan 2008 – Tier-1 Sites should report on whether they can collect the metrics on tape performance as proposed by the Tier-0 (T.Bell).

SARA, RAL, FZK, PIC, CNAF, ASGC, BNL and TRIUMF will try to collect all the metrics. CERN, which proposed the metrics, is already collecting them.
IN2P3 will not be able to collect all metrics easily because they are distributed over several systems and the resources are shared by LHC VOs and other users at IN2P3.

  • 21 Jan 2008 - The LCG Office should define where (a web area, wiki, SharePoint?) the Sites can upload their statistics about their tape storage performance and efficiency.

Not done.

 

3.    CCRC08 Update (Baseline versions; CCRC'08 Wiki; Draft F2F agenda; Next CCRC Planning Meeting Agenda; Paper) - J.Shiers

 

 

The update on CCRC08 is reported in the Paper attached.

 

The services are not only going to be tested for scaling; new updates (LFC, Gfal, lcg-utils) are also still being tested or certified.

One should agree on a target set of versions for all components of middleware and storage software no later than the April Face-to-Face Meetings (April 1/2).

 

For February, we should learn as much as possible. For now we have the following three sets of metrics:

1.     The “Scaling Factors” published by the experiments for the various “functional blocks” that will be tested.
These are monitored continuously by the experiments and reported on at least weekly.

2.    The lists of “Critical Services”, also defined by the experiments.
These are complementary to the above and provide additional details as well as service targets. It is a goal that all such services are handled in a standard fashion – i.e. as for other IT-supported services – with appropriate monitoring, procedures, alarms and so forth.
Whilst there is no commitment to the problem-resolution targets – as short as 30 minutes in some cases – the follow-up on these services will be through the daily and weekly operations meetings.

3.    The services that a site must offer and the corresponding availability targets based on the WLCG MoU.
These will also be tracked by the operations meetings.

 

Deferring assessment of scalability to May is clearly risky, as no further run is scheduled to check whether residual problems have been solved. That check would therefore happen during data taking!

Note: It is assumed that after May we operate in continuous data taking mode until the end of the year.

 

The daily CCRC’08 meetings at 15:00 (every day except Monday) will continue throughout February. However, it is proposed that the weekly planning call (Mondays at 17:00) be suspended and that any operations issues be handled at the standard weekly operations meeting (Mondays at 16:00).

 

Planning meetings for May will restart in March at the F2F during the pre-GDB on the 4th.

Long-term planning will continue but does not warrant a dedicated meeting during this period of high activity.

 

Ph.Charpentier asked why the SRM Update was now removed from the MB agendas.

J.Shiers replied that information on SRM is now in the CCRC08 report. The main SRM issues still open are configuration issues at the sites.

The current problem(s) with SRM V2 should be carefully assessed and a solution found (or a compromise/workaround adopted if the solution will take time).

 

Ph.Charpentier mentioned that some severe current SRM problems (with dCache) should be reported to the MB in the next meetings for close follow-up.

I.Bird suggested that the updated list of issues should always be known to the MB.

J.Shiers agreed and added that this list will be at the top of the CCRC'08 Wiki home page.

 

J.Templon asked that the final version of the packages, for the May CCRC and beyond, should not require any major update of dCache, because deploying it is usually very complex.

J.Shiers replied that the current dCache version is 1.8-12, but patches should be accepted when they are needed. Hopefully no new major version will be required.

 

4.    LHCC Review Report and Referees Meeting (Agenda) - I.Bird

 

4.1      LHCC Review Report

I.Bird distributed the LHCC Review Report and asked for comments from the MB members. MB members should send any additional comments within the next few days.

4.2      Referees Meeting

Here is the agenda (as of 28.1.2008) for the Meeting with the Referees.

 

 

Tuesday 19 February 2008

from 12:00 to 14:55

 

12:00->12:55    Status of CCRC08

    12:00    Status and progress of sites & services (20') - Jamie Shiers (CERN)

    12:20    Experiments' status and progress (20')

    12:40    Summary of SRM v2.2 deployment (15')

 

12:55->13:55    Castor2 and storage metrics

    12:55    Proposed metrics for Castor performance and reliability (15')

    13:10    Metrics for T0 + T1 MSS performance (15')
             How we monitor drive and system performance - small files, etc.

 

13:55->14:55    Overview of Tier 1 status for all Tier 1s

    Description: One talk with comparison tables; no need to cover T1s already examined in comprehensive review

 

The status of the February 2008 CCRC run must be presented, even though it will still be running on the 19th, together with the status of the SRM 2.2 deployment.

The status of the Experiments should be summarized by a single representative for all four Experiments.

 

The CASTOR metrics were requested at the Review, because those presented were not clear to the reviewers.

The new metrics by T.Bell should be shown and explained.

 

The talk about the Tier-1 sites should be a single talk and cover the High Level milestones, 24x7, VOBOX SLAs, accounting, procurement, etc. for all Tier-1 Sites.

 

5.    Status of 2008 Installations for CCRC and beyond - Sites Roundtable

 

The pledges for 2008 should be available to the VOs by April 2008, as agreed.

Each site reported:

-       CERN will match the pledges

-       NL-T1 will not match the pledges

-       UK-RAL will match the pledges. Disk resources are available and will be tested for a month before being put in production.

-       DE-KIT: the hardware is already available and will be ready by April.

-       FR-IN2P3 has not received all the hardware yet, but it should arrive by the end of January and be installed by April.

-       ES-PIC CPU and disk storage will match the pledges. For tape, the ramp-up will start now but reach the full pledges only by October.
The installation of disks will be completed before April, but they also need to know the requirements of the VOs for February.

-       IT-INFN-CNAF: the hardware will be delivered in March. If it arrives in early March, it will be installed for April.

-       TW-ASGC will send an email after the meeting.

-       BNL are confident they can match the pledges by April

-       TRIUMF: the hardware is all installed already, but not yet completely available to the VOs.

-       FNAL was not present at the meeting.

 

6.    Pilot Jobs Framework (Mandate) - J.Gordon

 

J.Gordon presented the updated Mandate resulting from the last GDB discussion.

 

Here are the details about the proposed mandate.

 

Multi-User Pilot Job Frameworks Review Team (14/1/2008)

 

Membership: Maarten Litmaath (chair), David Groep (EGEE2), Eileen Berman, Mine Altunay (OSG), ALICE (tbd), ATLAS (Torre Wenaus), CMS (Igor Sfiligoi), LHCb (Andrei Tsaregorodtsev)

 

Terms of Reference: To review the Multi-User Pilot Job Frameworks of the LHC Experiments (Draft 1).

To produce a report to the Management Board about the safety and effect of the framework.

 

This review to consider at least:

1. the handling of user proxies from the user client through to the running job, via any intermediate storage;

2. identity change;

3. auditability of running processes;

4. the interaction with the local batch system;

5. the creation and destruction of jobs within a pilot job.

 

The LHC experiments should provide sufficient documentation for the review. Having the experiment framework people present to give a verbal account is insufficient.

Timescale to be determined/agreed.

 

Notes:

a) The inclusion of experiment framework people in the team is because of their experience and to expose them to the practices of the other experiments. If this leads to any common code, so much the better.

b) When each experiment is being considered, the relevant experiment person answers questions rather than acting as a reviewer. They can be joined by other people from their experiment. Nor do they have to agree with the team's report on their experiment.

c) I suggest a questionnaire to each framework, a review of the answers by the team then a session with each framework face to face and/or video/phone. The team should only have to resort to code reviews if they are unhappy with the experiment's explanation on any points.

 

 

The review team is defined, but a couple more members could be added if MB members request it.

Obviously the participants are not going to be assigned to review their own framework.

To initiate the review, a questionnaire will be distributed to each framework; this will then be followed by F2F and phone meetings.

 

J.Templon asked what happens if (1) the frameworks pass the review or (2) some do NOT pass the review.

J.Gordon replied that the sites had agreed in principle to install the pilot job frameworks if they pass the review.

But they will be asked again at the MB. If a framework does not pass the review, the sites should not install it until it is modified according to the requests of the review.

 

I.Bird added that if a framework is not fixed, the respective VO will not be able to use the Sites safely. Hopefully the problems discovered in the frameworks will be fixed, so that all VOs can run on all their assigned Sites.

J.Gordon added that the review should also encourage convergence on similar solutions to similar problems.

 

7.    GDB Summary (Slides) - J.Gordon

 

The GDB in January (http://indico.cern.ch/conferenceDisplay.py?confId=20225) covered the following topics:

 

-       Benchmarking
Update from Helge Meinhard and additional statistics from INFN. Not easily parameterized for the moment.
A testbed is being constructed for the experiments to benchmark their codes.

 

-       Data management
Proposal on how to handle storage tokens. Agreement was reached, but dCache problems have arisen since.

 

-       GSSD
Flavia Donno presented the GSSD final report. Some work still needs to be completed and some interesting new issues were raised.
John Gordon, Ian Bird and Jamie Shiers will consider the future programme of work of GSSD and present it at the MB.

 

Ph.Charpentier asked that GSSD at least follow the current work and bug fixing on SRM 2.2.

J.Gordon agreed but added that GSSD should not look beyond that unless they receive it in a new mandate.

 

-       Worker Node Working Group
Tasks: The WG should investigate:

-       Hard limits, e.g. memory, hard disk space, ...

-       Subsequent matchmaking of those resources.

-       The efficient choice of worker nodes within the EGEE grid.

-        e.g. low-memory jobs on low-memory nodes.

-       Batch farms contain heterogeneous resources.

Expected Results: The outputs of the WG should be:

-       Provide deployment process with details to advertise heterogeneous WNs.

-       Provide users with examples for matching a particular set of WNs.

-       Provide middleware development with shortfalls which make efficient utilization of WN resources difficult or impossible.
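As an illustration only of the matchmaking idea listed in the tasks above (hard limits first, then low-memory jobs on low-memory nodes), here is a minimal Python sketch. It is not taken from any EGEE middleware: the WorkerNode and Job classes, the match_node function and the node and job names are all hypothetical, and a real deployment would rely on whatever attributes the sites actually advertise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerNode:
    # Hypothetical description of a worker node as a site might advertise it.
    name: str
    memory_mb: int
    disk_gb: int
    busy: bool = False


@dataclass
class Job:
    # Hypothetical job requirements, treated as hard limits.
    name: str
    memory_mb: int
    disk_gb: int


def match_node(job: Job, nodes: List[WorkerNode]) -> Optional[WorkerNode]:
    """Pick the smallest free node that satisfies the job's hard limits."""
    candidates = [
        n for n in nodes
        if not n.busy
        and n.memory_mb >= job.memory_mb
        and n.disk_gb >= job.disk_gb
    ]
    if not candidates:
        return None  # no node satisfies the hard limits
    # Best fit: prefer the node with the least memory, then the least disk,
    # so that large nodes stay free for jobs that really need them.
    best = min(candidates, key=lambda n: (n.memory_mb, n.disk_gb))
    best.busy = True
    return best


nodes = [
    WorkerNode("wn-small", memory_mb=2048, disk_gb=20),
    WorkerNode("wn-large", memory_mb=8192, disk_gb=100),
]
jobs = [
    Job("low-mem", memory_mb=1024, disk_gb=10),
    Job("high-mem", memory_mb=6000, disk_gb=50),
    Job("too-big", memory_mb=16000, disk_gb=10),
]

assignments = {}
for job in jobs:
    node = match_node(job, nodes)
    assignments[job.name] = node.name if node is not None else None

print(assignments)
```

In this sketch the low-memory job lands on the small node, which leaves the large node free for the high-memory job; a job exceeding every node's limits is left unmatched.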

 

-       Security Policies

A set of documents should be approved:

-       VO Operations Policy. “Final call” version (V1.5).
https://edms.cern.ch/document/853968/1

-       Pilot Jobs Policy. V0.3 (1st Oct 07). Policy on Grid Multi-User Pilot Jobs.
https://edms.cern.ch/document/855383/1

-       Logging and Traceability Policy. V1.6 of what was “Security Audit Requirements”.
https://edms.cern.ch/document/428037

 

The future plans for security are:

-       Update old VO security policy documents

-       VO Security Policy

-       User registration and VO membership requirements

-       Update CA policy (to include new IGTF profile)

-       User-level accounting

 

-       Pilot Jobs

-       Reviewed the current progress. SCAS due in certification end of February (slipped).

-       Starting the Review of LHC Multi-User Pilot Job Frameworks. The mandate is on this agenda and was discussed earlier in the meeting.

 

-       CCRC08

-       Introduced to the GDB; J.Shiers already reported on it.

 

-       Issues to Follow

-       GSSD Future

-       Storage Tokens

-       Pilot Jobs

-       Job Priority

 

8.    AOB

 

 

F.Hernandez asked what capacity is expected from the Tier-2 sites in April 2008.

I.Bird replied that in principle their 2008 pledges should be made available for April, just like for Tier-1 Sites.

 

9.    Summary of New Actions

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.