LCG Management Board

 

Date/Time:

Tuesday 10 January 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a057117

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 3 - 17.1.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, I.Bird, K.Bos, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, I.Fisk, B.Gibbard, J.Gordon, F.Hernandez, E.Laure, P.Mato, H.Marten, M.Mazzucato, G.Merino, B.Panzer, Di Quing, L.Robertson, J.Shiers, M.Schulz

Action List

https://uimon.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 17 January 2006 from 16:00

Minutes and Matters Arising (minutes)

 

Comments on the WLCG High Level Milestones
Changes proposed by FZK (more information)

 

The changes were approved by the MB.

 

News from OSG

Announcement: R.Pordes has been elected Executive Director of OSG.

There will be a meeting of the OSG Consortium in the week of the 23rd Jan 2006; many LCG and EGEE representatives will be present.

 

LCG Collaboration Board

The Collaboration Board has been formed and there will be representation for each Tier-0, Tier-1 and Tier-2 site or federation.

The first meeting will be on 3 February. The main goal of the meeting is the election of the Chair of the CB.

The list of members can be reached from the C-RRB page (http://lcg.web.cern.ch/LCG/Boards/crrb.html): the lists of Tier-1 Centres and Tier-2 Centres linked from that page mention their CB representatives.

 

Summary of Christmas Operations

 

The summary was distributed by J.Shiers to the MB (email).

 

Over Christmas, SC3 transfers ran using FTS to 9 of the 11 Tier-1 sites (all but FNAL and NDGF): CMS was not using FTS for transfers to FNAL, and there was no news from NDGF.

 

Few problems were encountered (with ASGC, INFN and IN2P3), and most of them were solved quite rapidly. The detailed log is available.

 

A more significant problem was that the network to SARA was down for most of the period. This highlighted that the procedures for handling such situations were not well defined: the contact details of the site's network operators were not available, and the site network tests only checked the general network, not the dedicated SC3 line.

 

ATLAS used the grid quite successfully during the period. It was noted that some jobs were scheduled for execution several days after they had been submitted, by which time their input files had (intentionally) been deleted.

 

The weekly Operations meeting should be attended by all experiments: only LHCb was present at the last meeting.

 

Action List Review (list of actions)

 

The action list will be reviewed outside this meeting, contacting the MB members involved.

 

Action:

15 Jan 06 - A.Aimar will contact the people with pending actions and will update the action list.

 

SC3 throughput tests re-run

Another attempt to agree on the re-run of the SC3 throughput tests. Does this include tape or not? If it does not, when does the tape path get tested before the SC4 throughput tests in April?

 

 

SC3 re-run tape tests

 

In December the MB agreed that the tape tests were important for the SC3 re-run and should be executed.

Since then some sites have announced that they would prefer not to participate, but it is too risky to wait until April when the tapes will have to be working for SC4. 

An assessment of this risk was requested from each Tier-1 site:

 

FZK

Additional tape drives have been ordered, so it may be difficult to execute the requested tests during the SC3 re-run. FZK will send more information to the MB.

 

ASGC

The procurement of four additional tape drives has started, and the new drives may be ready only at the end of March. ASGC will start migrating to CASTOR2 in February. It is therefore difficult for ASGC to participate in the tape test during the SC3 re-run, but it will do the tape test in SC4 starting from April.

 

SARA
Currently the tape equipment is shared between SARA's general usage and the LCG-specific needs. In January the two services and their equipment will be separated. It is not certain that this operation will be completed in time for the SC3 re-run. Once it is done, 3 drives will be available one month later, and another 6 will only be installed in June.


CNAF
January and February are devoted to debugging Castor 2. Additional drives will arrive in the meantime. CNAF will send dates and more information later.

 
IN2P3
IN2P3 will use the HPSS that is in production. 5 additional tape drives are needed to participate in the exercise and reach the planned 50 MB/s. Of these, 3 drives are currently available and temporarily installed in the test instance of HPSS, so 2 more are needed before the test (they have already been purchased and delivered). An HPSS outage is necessary to integrate the new drives into production. If this intervention can be scheduled in week 4, IN2P3 will participate and keep the objective of 50 MB/s; otherwise it will participate but reduce the target throughput to around 30 MB/s.


PIC
Will do the writing to tapes, but the target rate is reduced since the WAN upgrade is delayed.

 
BNL
Will participate in the SC3 re-run, but at a lower rate.

In February BNL will receive more equipment for SC4. SC4 will therefore use different equipment, but the same HPSS storage system.

FNAL
The target rates were already demonstrated in both SC2 and SC3, and the local community needs the equipment to do analysis for other experiments. Recycling the tapes used in the SC is very manpower-intensive at FNAL, and therefore FNAL will not participate in the new SC3 tape tests.

RAL
RAL already achieved 75 MB/s in SC3 in 2005. For SC4 they are installing a new robot and tape drives, and all resources are working on deploying Castor 2 at RAL.
RAL will not take part in the SC3 tape re-run tests or in the April SC4 tape reduced bandwidth tests. Some tests should be performed with CERN in June in order to be ready for July.

 

 

Action:

17 Jan 06 – Tier-1 sites send updated information and dates about their SC3 tape tests. If they cannot do those tests in time for SC3 they must send the recovery plans in order to perform their tape tests as soon as possible and before the April SC4 throughput run.

 

Castor 2 recording test at CERN


The Castor 2 tests involved 46 disk servers and 19 tape drives (of 3 different types), for a total of 230 TB of data. The test ran for a full week at 950 MB/s, exceeding the target rate of 750 MB/s.

 

Other tests with more streams reached 1.2 GB/s for a 48 h period.

 

There were only minor problems; the system behaved well and the results are very positive.

 

Debugging of SC3 disk/disk transfers 

The SC3 disk/disk tests are being debugged. Experiments are asked not to start data transfers during the coming week, to make it easier for the SC team to understand the problem.

 

LHCb (and ATLAS, which made the same request at the GDB) wants to repeat its data transfer tests from the Tier-0 to its Tier-1 sites after the SC3 tests (end of February).

 

SC4 plans and requirements (more information)

Agreement on the process and timetable for defining the services to be provided and building a detailed plan

 

The plans for SC4 need to be defined in greater detail before the CHEP workshop (10 Feb 06) so that they can be finalised by the end of February.

 

Not all requests from the experiments can be fulfilled for SC4; therefore there must be no ambiguity about what will be available.

 

The list provided by F.Donno is a good starting point for clarifying what can be implemented in time, with priorities and the effort required.

 

The plans (see the proposal in the MB agenda) should include:

-          an “experiment view” of services mentioning exactly which features within each service will be available to each experiment for SC4

-          a “site view” with details of the schedule of each service, and when it will be deployed, tested and in production at each site

-          a “schedule by services” that summarizes all activities (development, testing, deployment, commissioning, etc) that need to be executed in order to have each service tested and in production.

 

An initial proposal will be prepared by the SC4 team with the PO. It will then require several iterations with the experiments, sites, deployment team and development projects. All parties involved (experiments, sites, services, etc.) should give priority to the definition of this SC4 plan.

 

The plan will cover the Tier-1 sites first. Tier-2 sites will be included once the plan is better defined for the Tier-1 sites; in principle their installation schedules should not be very different from the Tier-1 plans.

 

A first version of the SC4 plan should be finished by the end of January, and therefore be ready for a detailed and conclusive discussion at the CHEP Workshop. In parallel the discussion at the EGEE TCG will continue, to prioritise the list of features and improvements, also derived from the same list compiled by F.Donno. This (TCG) process will deal also with the prioritising and planning of longer term developments that will not make it into SC4.

 

The MB agreed to endorse the proposal and experiments and Tier-1 sites will nominate, within 48h, the people able to help with the definition of the plan.

 

Action:

13 Jan 06 – Experiments and sites should all send urgently the name of one person with the authority for discussing the details of the plans for SC4.

                                             

AOB

 

 

The VO-boxes Workshop is going to take place on the 24-25 February 2006.

 

 

Summary of New Actions

 

 

13 Jan 06 – Experiments and sites should all send urgently the name of one person with the authority for discussing the details of the plans for SC4.

15 Jan 06 - A.Aimar will contact the people with pending actions and will update the action list.

 

17 Jan 06 – Tier-1 sites send updated information and dates about their SC3 tape tests. If they cannot do those tests in time for SC3 they must send the recovery plans in order to perform their tape tests as soon as possible and before the April SC4 throughput run.

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.