LCG Management Board

Date/Time:

Tuesday 13 February 2007 - 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11625

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 10.2.2007)

Participants:

A.Aimar (notes), D.Barberis, O.Barring, I.Bird, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, J.Gordon, C.Grandi, A.Heiss, F.Hernandez, M.Kasemann, J.Knobloch, G.Merino, B.Panzer, R.Pordes, H.Renshall, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 27 February 2007 - 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments. Minutes approved.

1.2         Matters Arising

 

Site Reliability

A note proposing a change in the way that Site Reliability is calculated was circulated and comments followed (see email).

 

Decision:

The MB supported the proposal; the new algorithm, which defines reliability as availability/scheduled_availability, will be used to calculate “site reliability”.
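
As an illustration of the new algorithm, here is a minimal Python sketch; it assumes availability and scheduled downtime are both expressed as fractions of the reporting period (the variable names are illustrative, not from the note):

    # A minimal sketch of the agreed calculation (names are illustrative).
    # availability: fraction of the period the site was available.
    # scheduled_downtime: fraction of the period declared as scheduled downtime.
    def site_reliability(availability, scheduled_downtime):
        scheduled_availability = 1.0 - scheduled_downtime
        return availability / scheduled_availability

    # Example: a site available 85% of the month, with 5% scheduled downtime,
    # is credited with a reliability of 0.85 / 0.95, i.e. about 89%.
    print(site_reliability(0.85, 0.05))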

 

Job Priorities

The status of groups/roles and their mapping onto job priorities will be re-assessed at the Face-to-Face MB Meeting on 6 March 2007.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

         31 Jan 2007 - All Tier-1 Sites should send to H.Renshall their procurement plans for 2007.

 

Done. H.Renshall said that he has the data he needs and that it should now be verified by the sites.

 

J.Templon added that SARA-NIKHEF will send updated numbers when they are ready.

 

         9 Feb 2007 - All Tier-1 Sites should send to the MB their Site Reliability Report for December 06 and January 07.

 

Not done. Several sites still have to send their reports. A reminder will be sent to the Tier-0 and Tier-1 sites.

 

3.      Site Resources Table (aka Harry's Table) (2007Q1; 2007Q2; Slides; Document) – H.Renshall

 

 

H.Renshall presented how requested and planned resources are described in his quarterly tables (2007Q1; 2007Q2), and the process for keeping up to date the information from experiments (requested resources) and sites (planned and installed resources).

3.1         Functions of Harry’s Table

The MB, at the meeting of 31 January, decided that all Tier-1 sites should send H.Renshall their procurement plans for 2007, at least for Q1 and Q2 for now.

 

Experiments and sites agree that:

-          The reference for sites’ resources and experiments’ requirements is the set of numbers in Harry’s Table.

-          Both sites and experiments should submit the values there and keep them up to date.

-          For future quarters the installed capacity is the capacity planned to be available at the beginning of the quarter; for the current quarter it is the capacity actually installed, to be updated by the sites whenever new equipment is installed.

-          The accounting and quarterly reports will simply use the latest numbers from the table as the installed capacity.

 

Slide 2 shows an example from IN2P3 that summarizes what is required and what is available at that site.

3.2         Points for Discussion

The accounting report includes the installed resources which, as agreed at the MB of 6 February, will be taken directly from Harry’s table for the current quarter. It also includes for each site the disk capacity allocated to each experiment as well as usage per experiment. The breakdown of disk allocations installed or planned should therefore also be given per experiment.

 

Usability Efficiency Factor: H.Renshall assumes that sites report cpu, formatted disk space and tape space with no efficiency allowance. He will show this as installed/planned capacity, and also show the USABLE capacity (called “available” in the slides), obtained by applying the standard efficiency factors (Tier-1 cpu 0.85, disk 0.7, and tape 1.0). These latter numbers are the ones to be compared with the experiments’ requests.
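
A worked Python sketch of the conversion follows; the installed-capacity numbers for the example site are invented, and only the efficiency factors come from the text above:

    # Standard Tier-1 efficiency factors, as agreed for the TDRs.
    EFFICIENCY = {"cpu": 0.85, "disk": 0.70, "tape": 1.00}

    # Invented installed capacity for an example site.
    installed = {"cpu": 1000, "disk": 500, "tape": 800}  # kSI2K, TB, TB

    # USABLE ("available") capacity = installed capacity x efficiency factor.
    # These are the numbers to compare with the experiments' requests.
    usable = {res: cap * EFFICIENCY[res] for res, cap in installed.items()}
    print(usable)  # {'cpu': 850.0, 'disk': 350.0, 'tape': 800.0}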

 

Capacity should be that definitely planned to be usable at the beginning of the quarter in question, while the experiments’ requirements should be the peak required during the quarter.

 

The notion of “installed vs usable” capacity was discussed at the meeting. The differentiation is required because not all resources can actually be used by the experiments (for instance, various overheads in cpu and disk usage). Therefore an efficiency factor must be taken into account (Tier-1 cpu 0.85, disk 0.7, tape 1.0, as already agreed for the TDRs). The values in the MoU do not include the efficiency factors, but specify the whole amount of resources to be installed at the site. D.Barberis noted that in the TDR the experiments provided numbers without any efficiency factor. [The TDR requirements, and the subsequent updates, are therefore to be compared with the installed capacity, while the requirements in Harry’s table are the net requirements, to be compared with the usable capacity and with the capacity actually delivered and reported by the accounting systems.]


Decision: “Usable capacity” will be added to Harry’s Table, calculated from the installed capacity at the site.

 

The concept of “allocated disk resources” was discussed: this is the disk capacity installed and assigned to a specific VO. Sites may have installed resources that are available but not yet assigned to a VO. J.Templon noted that SARA was reporting what is allocated, not what is installed.

 

J.Gordon noted that the LCG experiments’ planning does not always match the timing of the “GridPP allocation committee” meetings, where allocations to experiments are decided. Therefore, RAL cannot guarantee to be able to allocate resources if the GridPP allocation committee meets before the experiments have made their requests for the next quarter.

 

Decision:

The MB members agreed on reporting disk resources by VO. Sites should report their total installed capacity and also the allocations to each experiment.

 

R.Tafirout noted that some of the benchmark and performance data provided by the vendors for the Woodcrest processors do not match the performance observed with ATLAS programs when compared with previous processors. Their recently installed Woodcrest processors were rated by the manufacturer, in SPECint2000 units, as 3 times faster than the previously installed processors, but were measured as only 2.3 times faster. What performance value should sites use to plan and report CPU resources? F.Hernandez added that IN2P3 observed the same behaviour when running its benchmark applications, and proposed that all sites use the same criteria.
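
As an arithmetic illustration of the discrepancy (only the x3 rated and x2.3 measured ratios come from the discussion; the sketch itself is an assumption):

    # Ratio of new to old processor performance.
    rated = 3.0      # manufacturer's SPECint2000 rating vs the old CPUs
    measured = 2.3   # ratio observed with ATLAS applications

    # Fraction of the rated capacity actually delivered to the experiment,
    # i.e. the source of the question of which value sites should report.
    print(f"{measured / rated:.0%}")  # ~77%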

 

There is a recently established HEPIX working group on benchmarking, chaired by H.Meinhard. L.Robertson will talk to him and report to the next meeting.

 

3.3         Process on Reporting Capacity and Usage Data (Note by L.Robertson)

During the week L.Robertson had distributed a document describing the overall process for reporting site capacity and usage data (Megatable, MoU Tables, Harry’s Table, Accounting Report, etc).

 

Action:

MB Members should review and send feedback on the note distributed by L.Robertson on the “process for reporting site capacity and usage data”.

 

An updated version is available in an email sent to the MB mailing list on 14 Feb 2007 (see the mail and replies in the Mailing List Archive; a NICE login is required).

 

4.      Targets and Milestones for 2007 – Proposal (Megatable, Slides) – A.Aimar

 

 

The previous presentation by J.Shiers on “Targets for 2007” (MB, 6 Feb 2007) had proposed some specific targets in the framework of the experiments’ activities (as on slide 2). But some of the activities mentioned, such as “Tier-1-Tier-1 and Tier-1-Tier-2 transfers” and “grid production at Tier-1 sites”, were not quantified.

4.1         Proposed Data Rate Targets

Further discussion suggested that it would be better to propose a single target, to be reached by June 2007, made up of several goals in terms of:

-          Tier-0-Tier-1 and Tier-1-Tier-2 transfers and

-          Aggregate milestones on the amount of data transferred

-          With the targets specified in the Megatable (link to Megatable, updated 24.1.2007)

 

The targets proposed, taking as an example CMS with numbers from the CMS CSA07 plan, are the following (numbers given as percentages are based on the rates in the Megatable; a worked numerical sketch follows the lists below):

-          65% Tier-0 to Tier-1 peak rate, for a week

-          Tier-1 to each Tier-2 sustain 10 MB/s, for 12 hours

-          Each Tier-2 to Tier-1 sustain 5 MB/s, for 12 hours

 

An aggregate milestone at each Tier-1 (all targets simultaneously for 12 hours):

-          Tier-0 to Tier-1 – 50% of average rate +

-          Tier-1 to Tier-2s – 50% of sum of average rate +

-          Tier-2s to Tier-1 – 50% of sum of average rate
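
A numerical sketch of how these targets would be derived for one Tier-1 site; all Megatable rates below are invented, and only the percentages and the fixed 10 and 5 MB/s figures come from the proposal:

    # Invented Megatable rates for one Tier-1 site, in MB/s.
    t0_t1_peak = 100.0               # Tier-0 -> Tier-1 peak rate
    t0_t1_avg = 60.0                 # Tier-0 -> Tier-1 average rate
    t1_t2_avg = [20.0, 15.0, 10.0]   # Tier-1 -> each associated Tier-2 (average)
    t2_t1_avg = [10.0, 8.0, 5.0]     # each Tier-2 -> Tier-1 (average)

    # Individual targets: 65% of the T0->T1 peak for a week, plus the fixed
    # 10 MB/s (T1 -> each T2) and 5 MB/s (each T2 -> T1) for 12 hours.
    print(0.65 * t0_t1_peak)         # 65.0 MB/s, sustained for a week

    # Aggregate milestone: all three of the following simultaneously, 12 hours.
    print(0.50 * t0_t1_avg)          # 30.0 MB/s, 50% of the T0->T1 average
    print(0.50 * sum(t1_t2_avg))     # 22.5 MB/s, 50% of summed T1->T2 averages
    print(0.50 * sum(t2_t1_avg))     # 11.5 MB/s, 50% of summed T2->T1 averages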

 

H.Renshall noted that, if we are to define similar targets for all VOs, it will be necessary to test with all sites and VOs active at the same time, and this will require very careful planning. L.Robertson noted that the organization of these tests is the responsibility of the experiments, although LCG can help if needed.

 

M.Kasemann explained that CMS has a 5-week cycle for its tests; every cycle tries to achieve, or get closer to, the goals specified in CSA07. Any other synchronization, with the activities of the other experiments, needs to be coordinated by the LCG.

 

I.Fisk noted that because the experiment’s sites are distributed across 24 hours of time zones, not all sites will be running simultaneously; each will participate in the test for a 12-hour period.

 

G.Merino asked whether the values in the Megatable are all “peak rate” values.

L.Robertson replied that the Megatable contains the peak values for Tier-0-Tier-1, and both the peak and average values for Tier-1-Tier-2.

 

22 Feb 2007 - Experiments (ALICE and LHCb in particular) should verify and update the Megatable (the Tier-0-Tier-1 peak value in particular) and inform C.Eck of any change.

 

J.Gordon asked for an Overview of the Experiments’ Plans for 2007. He added that it would be important to have a plan for 2007 in which the activities are shown clearly distributed over time.

 

Action:

H.Renshall agreed to summarize the experiments’ work for 2007 in an “Overview Table of the Experiments Activities”.

 

Slide 4 shows a possible visualization of the targets for a specific site and VO (ASGC and CMS in the example). All tests to pass are shown in the table, each with the target to be reached. Once a target is reached the corresponding cell is set to green (for “test passed”). Therefore by June 07 all cells with targets should be “green”, showing that the global target, for the VO at the given site, has been successfully completed.

4.2         Site Reliability Targets

The reliability of the sites will continue to be tested by the SAM monitoring system and the usual monthly reports will be produced.

 

The proposed 2007 targets for Tier-0 and Tier-1 sites are the following (a worked sketch follows the list):

-          Target for each site
91% - by Jun 07
93% - by Dec 07

-          Average of the 8 best sites
93% - by Jun 07
95% - by Dec 07
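
A sketch of how the “best 8 sites” figure could be computed from the monthly SAM numbers, assuming it means the average over the 8 most reliable sites (all values are invented):

    # Invented monthly reliability figures for the Tier-0 and Tier-1 sites.
    reliabilities = [0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88,
                     0.85, 0.82, 0.80, 0.78]

    # Average of the 8 best sites, to be compared against the 93% (Jun 07)
    # and 95% (Dec 07) targets.
    best8 = sorted(reliabilities, reverse=True)[:8]
    print(sum(best8) / len(best8))   # ~0.915 in this invented example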

 

For the Tier-2 sites the SAM reports will be generated on a monthly basis, but no targets are set for the time being.

L.Robertson noted that although the targets may not seem very high, they actually represent good progress over the values reached during 2006. J.Gordon added that there are some key issues not yet solved (BDII instabilities, SRM issues, etc.) and that until those are solved it is difficult to achieve site reliability much above 90%.

4.3         Other Milestones for 2007

Slide 6 proposes additional milestones on services that were already agreed by the sites and experiments and that should be in place in Q1 and Q2 2007:

 

Note: All dates refer to the end of the month mentioned below.

 

24x7 milestones

-          Definition of the levels of support and rules to follow, depending on the issue/alarm - February

-          Support and operation scenarios tested - April

 

VOBoxes

-          Service level, backup and restore defined - March

-          VOBoxes service implemented at the site - April

 

Job Priorities

-          Mapping of the job priorities onto the batch software of the site - April

-          Configuration and maintenance of the job priorities, as defined by the VOs - June

 

3D Service milestones

-          Oracle Service in production, and certified by the experiment(s) - streams interventions during CERN working hours only - February

-          Conditions DB in operation (ATLAS, CMS, and LHCb) - April

 

Slide 7 (from J.Shiers, for the 3D workshop) shows how the schedule for the database services is being prepared in a tabular form that could also be used for several other milestones that have to be implemented at all sites.

 

The milestones could be visualized as in slide 9, in a matrix that displays the status of each milestone at each site.

4.4         Milestones not included

J.Gordon asked why some milestones, such as SL4, gLite 3.2 and 64-bit porting, are not included.

A.Aimar replied that they were in previous versions of the proposal, but the dates are not clear at the moment; when available, they will be discussed and added.

I.Bird added that those are milestones of the Deployment Area; milestones for the sites’ implementations will be added after the certification.

 

M.Kasemann added that CMS will decide within 10 days, after some ongoing tests, whether to build only SL4 binaries or also SL3 binaries.

 

Milestones on SL4 migration, gLite 32- and 64-bit installation, and SRM 2.2 (installations and endpoints) for the sites will be specified once the certification of these components is complete and it is certain that their deployment will be done for the 2007 data taking.

 

J.Gordon asked why disk and tape reading rates are not included, but only network transfer rates. N.Brook replied that LHCb has given the sites its rates, as included in the TDR. M.Kasemann added that CSA07 will focus on reaching 50% of the 2008 performance by July 2007. D.Barberis agreed that reading from tape is not an important test for ATLAS for their FDR in Summer 2007; they will test recalling data from tape in early 2008. F.Carminati added that ALICE will do the same.

 

These experiment plans should also be visualized on a time chart (H.Renshall has already agreed to do this; see section 4.1 above).

 

5.      AOB

 

 

No other business.

 

6.      Summary of New Actions

 

Action:

MB Members should review and send feedback on the note distributed by L.Robertson on the “process for reporting site capacity and usage data”.

 

22 Feb 2007 - Experiments (ALICE and LHCb in particular) should verify and update the Megatable (the Tier-0-Tier-1 peak value in particular) and inform C.Eck of any change.

 

H.Renshall agreed to summarize the experiments’ work for 2007 in an “Overview Table of the Experiments Activities”.

 

 

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.