LCG Management Board

Date/Time:

Tuesday 21 November 2006 at 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=a063275

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 24.11.2006)

Participants:

A.Aimar (notes), D.Barberis, S.Belforte, I.Bird, N.Brook, F.Carminati, Ph.Charpentier, Di Quing, C.Eck, I.Fisk, B.Gibbard, J.Gordon, F.Hernandez, M.Kasemann, E.Laure, H.Marten, P.Mato, M.Mazzucato, G.Merino, B.Panzer, R.Pordes, L.Robertson (chair), O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 5 December 16:00-18:00 Face-to-face at CERN
NO MEETING NEXT WEEK:

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of last week were approved.

 

A discussion took place about the CE strategy of the EGEE TCG that was presented at the last GDB and discussed during the MB item on the GDB. M.Mazzucato considered that the TCG should first report decisions and planning to the EGEE PMB, where the resources are allocated, and only later present them to the LCG. This was not the opinion of other EGEE members. The MB agreed that this is an EGEE issue and the discussion was not continued.

1.2         "CPU time escaping jobs" Issue - J.Templon

J.Templon reported that there are two classes of applications that are “escaping” CPU time accounting and causing invalid results for the site:

-          ATLAS is using the Condor glide-in technology which starts a new session for every job. This causes CPU time for ATLAS to be incorrect.
Update: See the email by D.Barberis after the meeting.

-          LHCb have a new version of their software compiled with special flags that will not run on all grid nodes at SARA. If the job reaches a node that cannot run it, it waits 10 minutes and then exits. This means that the job blocks the scheduler slot, causing a loss of up to 10% of the available CPU time.

 

Ph.Charpentier explained that on new machines, compiling with this new flag allows a performance improvement of some 20%. But the resulting binaries will not run on legacy machines (P3, Athlon, etc).

 

J.Gordon remarked that some sites are not homogeneous and that there is no implementation of sub-clusters available, which could allow the use of different binaries depending on the node. This will also become an issue when 32-bit and 64-bit machines are installed at the same site.

 

I.Fisk noted that “wall clock” accounting is useful because it shows how many resources were made available to a VO. The fact that the VO actually used all CPUs, or did not, is a VO’s problem.

 

Ph.Charpentier raised the issue that at some sites part of the nodes have less memory (e.g. 512 MB vs. 1 GB). The sites should correctly declare the configuration of the nodes. Otherwise the batch system and the “blah” component (with gLite CE) cannot send the jobs to the right nodes.

 

S.Belforte noted that the sites should provide the details of their configurations and correct node descriptions, which is not the case at most sites.

 

D.Barberis mentioned that the ATLAS pilot jobs check the configuration and run small jobs where the nodes have less memory. This is possible because ATLAS discussed with the sites the machines that it uses, while LHCb is trying to use all resources available at the site.

 

New Action:

In 2 weeks D.Barberis and I.Bird will report on the issues with accounting vs. Condor glide-in job submission.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 13 Oct 2006 - Experiments should send to H.Renshall their resource requirements and work plans at all Tier-1 sites (cpu, disk, tape, network in and out, type of work) covering at least 2007Q1 and 2007Q2.

 

Done. H.Renshall presented the situation at the MB and will distribute the requirements.

 

  • 21 Nov 06 - F.Carminati agreed to prepare and propose (mail to the MB) the mandate, participation and goals of the working group on “Caching and Data Access”.

 

Ongoing. A proposal was distributed and was discussed later in the meeting (see AOB).

 

  • 21 Nov 06 - I.Bird will prepare the mandate, participation and goals of the working groups on “Monitoring Tools” and “System Analysis”.

 

Started, but not completed yet.

I.Bird is discussing the mandate with J.Casey and I.Neilson, proposed as joint chairs. It was decided to spend a few more days to carefully define the scope and the mandate of the working group on Monitoring Tools.

For the System Analysis group I.Bird will discuss with J.Andreeva, the proposed chair, during next week.

 

  • 25 Nov 06 - Sites should send to H.Renshall their procurement plans by end of next week.

 

Not done. Several sites reiterated their request to “first receive the summary of the requirements from H.Renshall and then provide the procurement plans”.

 

 

New Action:

28 November 2006 - H.Renshall will send the updated table of the experiments' requirements.

These requirements are urgently needed by the sites in order to define their procurement plans.

 

3.      Update on the Megatable - C.Eck

 

The Megatable, by C.Eck, was distributed to the GDB and the MB mailing lists.

 

It is also available on the LCG web site (under Planning - Technical Design Reports - Draft: Storage and Network Requirements for Tier-1s).

 

All changes will be maintained up to date in new versions of the document above.

 

4.      First Ideas for 2007 Targets and Milestones (Slides) - A.Aimar

 

A.Aimar presented a few initial ideas about which Level-1 targets and milestones could be defined for 2007.

4.1         General Remarks

Slide 3 shows the general ideas about the 2007 targets:

-          Define a ramp-up from the current values to the targets to be reached,
instead of checking progress in big steps every three months.

-          Define realistic targets needed for the end of 2007 and for the 2008 run.
Do not refer to the nominal rate if it is not needed yet, but take account of the expected machine and experiment efficiency.

-          Have a standard profile ramp-up starting in Jan 2007, reaching the target in Feb 2008 and with a pause in Sept-Dec 2008 for data taking.

-          Peak rates can be defined as 2 x the average rates.
Peak rates have to be demonstrated only for short periods (e.g. 3 days).
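
As an illustration of the ramp-up idea above, the following sketch interpolates a monthly target between a starting value and the final target, and derives the peak rate as 2 x the average rate. The linear shape, the function names and the 40 and 100 MB/s figures are assumptions made for this example only, and the sketch ignores the pause for data taking.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Number of whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def ramp_target(current: float, target: float,
                start: date, end: date, when: date) -> float:
    """Linearly interpolated target for the month of `when`.

    Before `start` the current value is returned, after `end` the
    final target is returned (the ramp is clamped at both ends).
    """
    total = months_between(start, end)
    elapsed = min(max(months_between(start, when), 0), total)
    return current + (target - current) * elapsed / total


def peak_rate(average_rate: float) -> float:
    """Peak rate defined as 2 x the average rate."""
    return 2 * average_rate


# Hypothetical site: 40 MB/s in Jan 2007, 100 MB/s required in Feb 2008.
RAMP_START = date(2007, 1, 1)
RAMP_END = date(2008, 2, 1)

if __name__ == "__main__":
    for month in (date(2007, 1, 1), date(2007, 7, 1), date(2008, 2, 1)):
        value = ramp_target(40.0, 100.0, RAMP_START, RAMP_END, month)
        print(month.strftime("%b %Y"), round(value, 1), "MB/s")
    print("peak:", peak_rate(100.0), "MB/s")
```

A stepwise profile (for instance for disk that is delivered in batches) would need a per-site table of values instead of the linear interpolation used here.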

 

J.Gordon asked whether the progress really has to be steady, as indicated in the slide, and be checked monthly.

A.Aimar answered that this is the proposal for the MB. L.Robertson noted that it is more important to show steady improvements than to fix targets that have to be met earlier than really required (difficult and expensive) in order to ensure that there is time for corrective actions.

 

D.Barberis noted that the ramp-up curves may change depending on the kind of target and on the resource. For instance, disks arrive in batches; therefore the ramp-up will not be a steady slope but will have steps, different at each site.

 

In Slide 4 the proposal is that the values from all sites are aggregated taking into account 70% of the sites, namely CERN + 8 Tier-1 sites.

 

The target values will be taken from the resources and network requirements, as specified by the Experiments in H.Renshall’s and C.Eck’s tables.

 

Question for the MB: Should the targets be measured for each VO? Or should they instead measure the aggregate for the site (to show that it can support all of its VOs simultaneously)? J.Templon suggested that separated values are better, but the sites will have to provide these values. L.Robertson noted that not all rates can be separated by VO. For instance IO rates may use equipment common to several VOs and so measuring for one VO says nothing about the capability when all VOs are active.

4.2         Targets for 2007

Slides 5 to 9 show examples of possible measurements and their targets.

 

The values proposed are there to stimulate the discussion. The MB members should send feedback to A.Aimar before 29 Nov 2006.

 

Targets for “Data Rate to Tape” (slide 5)

The target of I/O rates to tape storage should be defined at each site.

The peak rate should be demonstrated:

-          for 3 days before Jun 07

-          for 7 days before Sep 07

 

The average rates should be sustained at the Tier-0 and at each Tier-1, driven by the experiments (DAQ at the Tier-0, FTS at the Tier-1s) and background dteam load:

-          3 days at 50% - Feb 07

-          7 days at 70% - Jul 07 

-          10 days at 100% - Feb 08
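
A target of the form "N days at X%" can be checked against the daily average rates reported by a site. The sketch below is only an illustration: the function names and the sample rates are hypothetical, and it assumes that the days have to be consecutive.

```python
def longest_run_at_or_above(daily_rates, threshold):
    """Length of the longest run of consecutive days with rate >= threshold."""
    best = current = 0
    for rate in daily_rates:
        current = current + 1 if rate >= threshold else 0
        best = max(best, current)
    return best


def meets_sustained_target(daily_rates, nominal_rate, fraction, days):
    """True if `fraction` of the nominal rate was held for `days` days in a row."""
    threshold = fraction * nominal_rate
    return longest_run_at_or_above(daily_rates, threshold) >= days


# Hypothetical daily averages in MB/s for a site with a nominal rate of 100 MB/s.
SAMPLE_RATES = [40, 55, 60, 52, 30]

if __name__ == "__main__":
    # "3 days at 50%" target
    print(meets_sustained_target(SAMPLE_RATES, 100, 0.5, 3))
```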

 

Data is taken from the Megatable and the results should be monitored and reported by each site.

 

Should these tests be done locally or with data streamed from CERN?

Should the peak rates be tested by all sites simultaneously and be coordinated from CERN?

 

Targets for FTS Transfers (slide 6)

The peak 2008 rates should be sustained:

-          for 3 days before July 07

-          Tier-1 to/from Tier-0

-          Tier-1 to/from other Tier-1s

-          Tier-1 to/from all Tier-2s

 

The average rates should be sustained permanently from Tier-0 to each Tier-1, driven by the experiments and background dteam load

-          50% - Feb 07

-          70% - Jun 07

-          90% - Sep 07

-          100% - Oct 07

 

Targets from Megatable, results extracted from GridView

 

J.Gordon noted that the transfers to the Tier-1 and Tier-2 sites should be verified regularly because sometimes they stop working and are not maintained. Therefore, also after July 07, there should be a small, but sustained, transfer rate required every month, in order to make sure that the Tier-1 and Tier-2 channels continue to work properly.

 

J.Gordon and H.Marten suggested that the experiments should generate the network load (replacing the dteam load) because they are the ones that know when the load is appropriate. In addition, if dteam uses tapes to run these tests, those tapes will no longer be available to the experiments.

 

D.Barberis agreed that the experiments could have their own load generator and switch it on and off to test the transfers when they need it. S.Belforte would prefer a single solution rather than four load generators, one per experiment.

 

M.Mazzucato noted that dteam is not assigned the same site resources (servers, etc) as the experiments; therefore tests and load from the experiments would be more realistic.

 

Targets for data from DAQ to Tier-0 and to Tier-1 sites (slide 7)

This needs to reach 100% by November 2007, and obviously has to be driven by the experiments.

 

The metrics start low because these transfers (DAQ to Tier-0 to Tier-1) have not been tried before:

-          3 days 20% before Mar 07

-          5 days 50% before Jun 07

-          7 days 100% before Sep 07

 

Target values will be taken from the experiments and Megatable.

Results will be extracted from the Lemon monitoring system (CERN) and GridView.

 

Target for Job Success Rate (slide 8)

In order to measure job success quality one needs to:

-          set targets for job success rates, by site and by VO

-          perform job log analysis (ARDA dashboard work)

 

The measurement should include both the overall success rate (grid, network, site problems) and the site success rate (excluding non-site-related errors).

 

Ramp up based on these targets (percentage of the target, not in absolute %):

-          60% - Jan 07

-          70% - Mar 07

-          80% - May 07

-          90% - Sep 07

-          100% - Feb 08
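
Since the ramp-up is expressed as a percentage of the target and not in absolute terms, the success rate required at each checkpoint depends on the target chosen for the site and VO. The sketch below only illustrates the arithmetic; the 90% target and the function name are assumptions, not values proposed in the slides.

```python
# Ramp-up checkpoints from the list above, as a percentage of the target.
JOB_SUCCESS_RAMP = {
    "Jan 07": 60,
    "Mar 07": 70,
    "May 07": 80,
    "Sep 07": 90,
    "Feb 08": 100,
}


def required_success_rate(target_percent, checkpoint):
    """Absolute success rate (in %) required at the given checkpoint."""
    return target_percent * JOB_SUCCESS_RAMP[checkpoint] / 100


if __name__ == "__main__":
    # Hypothetical final target of 90% job success for one site and VO.
    for checkpoint in JOB_SUCCESS_RAMP:
        print(checkpoint, required_success_rate(90, checkpoint))
```

With the hypothetical 90% target, the Jan 07 checkpoint corresponds to an absolute success rate of 54%, not 60%.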

 

The discussion that followed suggested that the targets may depend on the VO and can be measured by the VOs. This probably does not need to be a Level-1 global target.

 

Target for SAM Reliability Tests

Reach 95% by Feb 2008

-          80% - Feb 07

-          85% - Jun 07

-          90% - Sep 07

-          95% - Feb 08

 

Always taking CERN + 8 best sites into account and using the monthly averages.
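
One possible reading of this rule is sketched below: the monthly averages of CERN and of the 8 best Tier-1 sites are combined and compared with the target for the checkpoint. The minutes do not say how the values are combined; the simple mean, the function names and the site names are assumptions made for this example.

```python
# SAM reliability targets from the list above, in percent.
SAM_TARGETS = {"Feb 07": 80, "Jun 07": 85, "Sep 07": 90, "Feb 08": 95}


def aggregate_reliability(cern, tier1, best_n=8):
    """Mean of the CERN value and the best_n highest Tier-1 monthly averages.

    `cern` is the monthly average for CERN in percent; `tier1` maps a
    site name to its monthly average in percent.
    """
    best = sorted(tier1.values(), reverse=True)[:best_n]
    values = [cern] + best
    return sum(values) / len(values)


def meets_sam_target(cern, tier1, checkpoint):
    """True if the aggregate reaches the target for the checkpoint."""
    return aggregate_reliability(cern, tier1) >= SAM_TARGETS[checkpoint]


if __name__ == "__main__":
    # Hypothetical monthly averages: eight sites at 90% and two at 50%.
    sites = {f"T1-{i}": 90 for i in range(8)}
    sites.update({"T1-8": 50, "T1-9": 50})
    print(aggregate_reliability(90, sites))
    print(meets_sam_target(90, sites, "Sep 07"))
```

In this example the two sites at 50% do not lower the aggregate, because only the 8 best Tier-1 sites are taken into account.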

 

L.Robertson noted that 95% could be too high. Having seen the results of the SAM tests until now (even taking into account the problems of the SAM system), about 90% could be the reachable limit.

 

Milestones for 2007

Slides 10 and 11 describe the proposed milestones and the MB should comment on them.

Here is the full list from the slides:

 

24x7 milestones

-          Definition of the levels of support and rules to follow,
depending on the issue/alarm                                                February

-          Testing the support and operation scenarios                         April

3D Project milestones

-          Phase 1 sites in production, used by the experiments          February

-          Tests of replication of experiments data among sites           March

-          Experiment condition DB in operations                                  April

SAM tests

-          SAM test complete and including VO tests                            March

SRM 2.2 and MSS Implementation

-          Tier-1 sites upgraded to SRM 2.2                                           March

-          Storage classes implemented at sites                                   April

-          Experiment software using storage classes                          May

gLite CE

-          Installed and available at the sites                                          February

-          Usage by the experiments                                                      April

VOBoxes

-          Service level, backup and restore defined                              February

-          VOBoxes service implemented at the sites                           April

Job Priorities

-          Mapping of the Job priorities on the batch software               April
of the sites

-          Configuration and maintenance of the jobs priorities             June
as defined by the VOs

CAF

-          Experiments define their needs for the CAF                          January

-          CERN implements the CAF for the experiments                   June

General Milestones

-          Targets for procurements.
Equipment for 2007 should be operational by                        May 

-          Define “operation challenges” to check tools                         April & September
and procedures for operation between VOs and sites

-          Targets on number of nodes available at each site               March
(statistics by the job wrappers?)

 

New Action:

29 Nov 06 – The MB members will send feedback to A.Aimar on the Targets and Milestones for 2007.

 

5.      AOB

 

5.1         Proposal for a DM working group – F.Carminati

F.Carminati distributed a proposal to the MB mailing list (see proposal). This was made available after the start of the meeting, and the discussion was not conclusive. Some members considered that the proposed mandate emphasized too much the re-assessment of requirements, while at this stage we should rather be focusing on how to use the functionality implemented in SRM 2.2, adapting the computing models if necessary.

 

Ph.Charpentier noted that there was already another GDB group on implementing SRM 2.2 storage classes, and the MB should rather use this.

 

L.Robertson added that this work could be done by the people now in the Storage Classes group, as an evolution of that group with these additional tasks and with representatives from the experiments added to it.

 

The discussion that followed confirmed the preference for expanding the mandate of the current Storage Classes group, instead of launching a new GDB working group.

 

New Action:

L.Robertson and F.Carminati will discuss changes to the mandate of the Storage Classes working group with K.Bos.

 

6.      Summary of New Actions 

 

 

28 November 2006 - H.Renshall will send the updated table of the experiments' requirements.

These requirements are urgently needed by the sites in order to define their procurement plans.

 

29 Nov 2006 - L.Robertson and F.Carminati will discuss changes to the mandate of the Storage Classes working group with K.Bos.

 

29 Nov 2006 – The MB members will send feedback to A.Aimar on the Targets and Milestones for 2007.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.