LCG Management Board

Date/Time:

Tuesday 18 April 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061497

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 24.4.2006)

Participants:

A.Aimar (notes), D.Barberis, S.Belforte, K.Bos, T.Cass, Ph.Charpentier, T.Doyle, I.Fisk, D.Foster, B.Gibbard, P.Mato, M.Mazzucato, G.Merino, B.Panzer, Di Qing, L.Robertson (chair), Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList 

Next Meeting:

Tuesday 25 April 2006 at 16:00

1.      Minutes and Matters arising (minutes) - A.Aimar

 

1.1         Minutes of the Previous Meeting

No comments received. Minutes approved.

 

1.2         2006Q1 Quarterly Reports (QR wiki page)

 

Some QR reports (due 11.04.2006) are still outstanding.

 

2.      Action List Review (list of actions)

 

 

  • 15 Apr 06 – D.Duellmann should add performance targets and tests in the LCG 3D project plan.

Done. Explained in the 2006Q1 Quarterly Report, which includes a milestone for a dedicated DB throughput phase in May.

 

 

3.      Status of Throughput Tests (more information) - J.Shiers

Please see the GridView site and select "Hourly report" from 18 April.

 

 

The status of the SC4 disk/disk transfer tests is available at this wiki page and the information is summarized in this presentation.

 

The goal of these tests is to reach the full nominal throughput rate. The schedule foresaw that in the period 3 April (Monday) to 13 April (the Thursday before Easter) an average daily rate to each Tier-1 at, or above, the full nominal rate should be sustained. In addition, any loss of average rate greater than 10% needs to be (1) accounted for, (2) explained in the operations log and (3) compensated for by a corresponding increase in rate in the following days.
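To make the rule concrete, the following is a minimal illustrative sketch (the site rates and the nominal value are hypothetical figures, not SC4 data) of how a day's shortfall under the 10% rule, and the compensation still owed, could be computed:

```python
# Illustrative sketch of the SC4 daily-rate rule (hypothetical numbers):
# a day whose average rate falls more than 10% below the nominal rate
# must be explained and compensated by a surplus on the following days.

NOMINAL = 200.0  # MB/s -- hypothetical nominal rate for one Tier-1 site

def shortfall(avg_rate, nominal=NOMINAL):
    """Return the MB/s lost on a day that broke the 10% rule, else 0."""
    loss = nominal - avg_rate
    return loss if loss > 0.10 * nominal else 0.0

# Hypothetical daily average rates (MB/s) over four days of the test.
daily_rates = [210.0, 150.0, 205.0, 240.0]

owed = sum(shortfall(r) for r in daily_rates)
surplus = sum(max(r - NOMINAL, 0.0) for r in daily_rates)

print("Days to explain in the operations log:",
      [day for day, r in enumerate(daily_rates, 1) if shortfall(r) > 0])
print("Shortfall to recover: %.0f MB/s-days" % owed)
print("Surplus accumulated:  %.0f MB/s-days" % surplus)
```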

 

Slide 4 shows the sites that meet or exceed the full nominal rates (marked with blue table cells) and those that are still not reaching their nominal rate. All details should be recorded in the SC4 blog (https://twiki.cern.ch/twiki/bin/view/LCG/ServiceChallengeFourBlog), but most sites do not use the blog consistently. Explanations are very important in order to learn from problems and retain the solutions, and should always be provided.

 

A 1 Gb/s network link (about 125 MB/s of theoretical throughput) seems to be a threshold for some individual sites, including sites that in the past have transferred data at higher rates than now. This issue needs to be investigated in detail at each site.

 

The FTS database was cleaned up on 12 April and the interruption is visible in the GridView histogram (slide 6). As shown, the clean-up improved the total rate by about 300 MB/s (from about 1300 MB/s before to 1600 MB/s after). In the future these clean-up operations will be done automatically; this is not urgent, however, as for now they can be done manually when needed.
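As an illustration of what such an automatic clean-up could look like (the table and column names below are assumptions made for the example, not the actual FTS schema), a scheduled job could simply purge records of transfers that finished more than a retention period ago:

```python
# Illustrative sketch of an automated transfer-database clean-up job.
# The table and column names (t_file, file_state, finish_time) are
# assumptions made for this example, not the actual FTS schema.
import datetime
import sqlite3

RETENTION_DAYS = 7  # hypothetical retention period

def purge_old_transfers(conn):
    """Delete records of transfers that finished before the cut-off date."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=RETENTION_DAYS)
    cur = conn.execute(
        "DELETE FROM t_file"
        " WHERE file_state IN ('FINISHED', 'FAILED')"
        " AND finish_time < ?",
        (cutoff.isoformat(),),
    )
    conn.commit()
    return cur.rowcount

# Local stand-in database; a daily cron entry could run this unattended.
conn = sqlite3.connect("fts_standin.db")
conn.execute("CREATE TABLE IF NOT EXISTS t_file"
             " (id INTEGER PRIMARY KEY, file_state TEXT, finish_time TEXT)")
print("Purged", purge_old_transfers(conn), "old transfer records")
```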

 

On Easter Sunday (slide 8) the rate was steady at about 1600 MB/s. Slides 9 and 10 show the rates for each site (in slide view mode); in some cases the rate is high but not very stable.

 

In summary, the aggregate target data rate was achieved for one day, but the average for the full week was only 80% of the target (about 1280 MB/s against the 1600 MB/s aggregate). Six of the 12 participating sites achieved their targets on average during the second week, with a seventh site coming within 10% of its target. On the other hand, three sites achieved less than 70% of their weekly target. While the results can be viewed as promising, some sites performed worse than they did in the SC3 tests in January, the start-up was slow, and at many sites the operation is far from stable and automated.

 

The disk/disk transfers will be stopped in order to perform tape testing (see below) and to give the sites with performance problems time to debug.

 

Simply reaching the nominal rate of 1600 MB/s will not be sufficient for running the LCG services in a sustainable mode. The target rates for the next phase (slide 11) will therefore be increased to 150% of the nominal rate (about 2400 MB/s in aggregate), in order to leave headroom for recovering from problems. The file sizes and the number of parallel streams will also be changed, to reflect realistic LHC conditions.

 

Disk/tape tests will be performed during this week at a reduced rate, in order to check the tape installations at the sites. Once this is achieved, it was agreed, the disk/disk tests will run as a continuous operation throughout SC4, allowing sites to work on performance improvements and on automating their operation. The goal is to be able to ramp the transfer rates up within a few hours whenever needed. All these transfers will be performed at lower priority, in order not to perturb the experiments' work on the grid.

 

4.      T1-T2 associations, and T0/1<-->T2 data transfer requirements - J.Shiers

 

 

At the GDB in Rome (April 2006) it was decided that the experiments would propose their T1-T2 associations.

 

ALICE:
ALICE asked each Tier-2 site which Tier-1 site it wants to be associated with. Negotiations with the Tier-1 sites will then follow.

 

CMS:
The associations have been defined; for now the Tier-2 sites are trying to access the different Tier-1 sites where the data is available. Access to the CERN Tier-0 from the Tier-2 sites is not foreseen in the CMS computing model.

 

Note: The OPN is configured to support only Tier-0/Tier-1 transfers. All other transfers (Tier-1/Tier-1, Tier-1/Tier-2, Tier-0/Tier-2) are assumed to use the general-purpose research networks.

 

ATLAS:
The sites association map is ready and will be discussed (and hopefully approved) next week at the ATLAS Tier-1/2 Coordination meeting.

 

LHCb:

The map is prepared; LHCb's use case is simpler because it involves only Monte Carlo data.

 

5.      gLite 3 status and PPS installations (transparencies) – M.Barroso

 

 

gLite Release Candidate 2 was made available to the Pre-Production Service on Tuesday 11 April 2006, before the Easter break. Currently four sites have been upgraded; the other twelve only started the upgrade on 18 April.

 

This Release Candidate still has two issues that could affect usage by the experiments:

-          Problem 15998 - A relocatable UI/WN is needed by the experiments. A relocatable WN is needed to make the gLite 3.0 client libraries available on the production batch farms; without it, only the LCG 2.7 clients are available on the production worker nodes. A relocatable UI is needed so that a UI can be placed under AFS for the VOs to use centrally (an illustrative sketch follows this list).

-          Bulk submission still has several major bugs; fixes or workarounds have been provided but have not yet been tested on the Pre-Production Service.
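As an illustration of what "relocatable" implies (the prefix, variable names and paths below are hypothetical, not the actual gLite 3.0 packaging), a relocatable client installation is one whose environment can be generated for an arbitrary installation prefix, such as a directory under AFS:

```python
# Illustrative sketch: write a sourceable setup script that points a
# client installation's environment at an arbitrary prefix (e.g. under
# AFS). The prefix, paths and variable names are hypothetical, not the
# actual gLite 3.0 packaging.

def write_env_file(prefix, out_path):
    """Generate a shell snippet relocating the client tools to `prefix`."""
    lines = [
        'export GLITE_LOCATION="%s"' % prefix,
        'export PATH="%s/bin:$PATH"' % prefix,
        'export LD_LIBRARY_PATH="%s/lib:$LD_LIBRARY_PATH"' % prefix,
        'export PYTHONPATH="%s/lib/python:$PYTHONPATH"' % prefix,
    ]
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Hypothetical AFS path that every VO member could source centrally.
write_env_file("/afs/cern.ch/project/gd/gLite-UI-3.0", "setup-ui.sh")
```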

 

This week the priority is to test these new fixes for the gLite RB on the Pre-Production Service, with a rapid cycle for fixing any problems found.

 

The experiments have not been able to do much testing until now as they are waiting for the known major bugs to be fixed.

 

A third Release Candidate is expected this week and will mostly affect DPM and dCache.

One more Release Candidate will be assembled next week; that one should include the “relocatable UI and worker node” features.

 

The next steps are:

-          Next week the situation will be assessed, and a status report with a proposal of the available options will be presented to the MB (by M.Schulz or I.Bird).

-          Within two weeks the MB will decide what will be released and installed at the Tier-1 sites, so that the experiments can schedule their work and their tests.

 

 

 

6.      CASTOR 2 migration status - T.Cass

 

 

Postponed to next MB meeting.

 

7.      AOB

 

 

No AOB.

 

8.      Summary of New Actions

 

 

 

 

The full Action List, with current and past items, will be available on this wiki page before the next MB meeting.