LCG Management Board

Date/Time:

Tuesday 14 March 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061508

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 - 20.3.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, D.Foster, B.Gibbard, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, E.Laure, M.Litmaath, P.Mato, G.Merino, B.Panzer, Di Qing, L.Robertson (chair), J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 21 March 2006 from 16:00 to 17:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting (more information)

Minutes approved.

1.2         GDB Search Committee members

Volunteers for the Search Committee were identified at the GDB meeting:

-            John Gordon

-            Simon Lin

-            Klaus-Peter Mickel

-            Jeff Templon

 

2.      Action List Review (list of actions)

 

 

No outstanding actions.

 

3.       Short Progress Reports (from last meeting)

 

3.1         Progress with data recording tests at CERN (transparencies) - Bernd Panzer

 

The network topology (slide 2) shows the internal LCG network, with 10 Gb connections between the CPU, disk and tape servers. The current setup includes about 50 disk servers and 40 tape servers.

 

Several tests have been performed over the last few months (slide 3) in order to validate the complete Tier-0 data flows. The SC3 throughput rerun (16 disk servers in Castor2, reaching 1 GByte/s to the Tier-1 sites) and the Castor2 Data Recording Challenge (reaching 950 MB/s) provided useful feedback and positive results.

 

Slide 4 describes the tests done to check the throughput rates to tape. These tests confirm good use of the tape drive capabilities (about 75% of nominal performance with 2 GByte files). The performance depends strongly on the number of streams involved, so it will be studied further in order to understand, and avoid, the risk of access congestion on the disk pools.

 

Slide 5: tests of the Castor2 disk server throughput (input/output) reached 4.4 GB/s, a rate comparable to the 4.5 GB/s expected for the proton-proton running of all four experiments. More tests will be performed with an increased number of data streams (currently 250).
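
As a rough illustration of the scale involved (our own back-of-the-envelope estimate, assuming the aggregate rate is shared evenly across the streams), 4.4 GB/s over 250 streams corresponds to roughly 18 MB/s per stream:

    # Rough per-stream estimate implied by the figures above, assuming the
    # aggregate rate is shared evenly across the concurrent streams.
    aggregate_rate_mb_per_s = 4.4 * 1000      # 4.4 GB/s expressed in MB/s
    streams = 250
    per_stream_mb_per_s = aggregate_rate_mb_per_s / streams
    print(f"~{per_stream_mb_per_s:.1f} MB/s per stream")   # prints ~17.6 MB/s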

 

In the experiments’ computing models there are several use cases that need to be studied and verified. For instance, some use cases first copy the files and then open them, while others open the files directly on the disk servers without copying. What are the implications of these different use cases? A future note will describe the results obtained in detail.

3.2         Rfio and/or rootd - how is the decision going to be made - Ian Bird

 

Rootd is currently available in Castor2 and dCache. The DPM implementation is being studied, and it is not yet clear whether rootd can replace rfio in all the experiments’ use cases and for other (non-HEP) applications.

 

If rootd cannot be used for all LHC applications, then rfio will be maintained and the conflicts between the different rfio versions used by DPM and Castor2 will have to be removed by assigning human resources to develop a common implementation.

 

4.      SRM for SC4 and long-term implementation (transparencies)
Invited: Maarten Litmaath

 

Maarten Litmaath presented the proposal for supporting two SRM storage classes for the SC4 run, and the MB also discussed longer-term strategies regarding SRM.

 

For SC4 the experiments need “permanent” and “durable” storage classes, with the following meaning:

-          Permanent:
Data is stored on tape and the system manages the disk cache; access can sometimes be slow if data has to be retrieved from tape.

-          Durable:
Data is stored on disk, which is managed exclusively by the VO; access is faster as the data is always disk-resident.

 

An additional SRM storage class "durable-permanent" could be needed in order to guarantee that "durable" data is also stored permanently on tape.

 

The attached transparencies are the support material for the discussion and the decisions that follow.

 

In order to reduce software development for SC4, the proposal (for SC4 only) is that storage systems may support the two classes in different ways. Castor2 will support only one storage class per SE; that is, there will be two different hostnames (castor-durable and castor-permanent), one for each of the two classes supported (see slide 4). dCache (slide 5) will use a single hostname with two different file paths. DPM (slide 6) currently uses the SAType attribute and publishes the same path for each SAType.
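
To make the three conventions concrete, the following minimal sketch shows how a client could derive the SURL prefix from the backend and the requested storage class. This is an illustration only: the castor-durable/castor-permanent hostnames come from slide 4, while the domain names and the dCache/DPM paths are hypothetical placeholders, not the real site configurations.

    # Sketch of the SC4 conventions described above. Only the two Castor2
    # hostnames (castor-durable, castor-permanent) come from slide 4; the
    # domain names and the dCache/DPM paths below are hypothetical.

    def sc4_surl_prefix(backend: str, storage_class: str) -> str:
        if storage_class not in ("durable", "permanent"):
            raise ValueError("SC4 foresees only 'durable' and 'permanent'")
        if backend == "castor2":
            # One class per SE: the class selects the hostname.
            return f"srm://castor-{storage_class}.example.ch/castor/example.ch/grid"
        if backend == "dcache":
            # Single hostname: the class selects the file path.
            return f"srm://srm.example-t1.org/pnfs/example-t1.org/{storage_class}"
        if backend == "dpm":
            # Single path: the class is advertised via the SAType attribute
            # (DPM will advertise itself as 'durable' for SC4).
            return "srm://dpm.example-t2.org/dpm/example-t2.org/home"
        raise ValueError(f"unknown backend: {backend}")

    print(sc4_surl_prefix("castor2", "durable"))    # hostname encodes the class
    print(sc4_surl_prefix("dcache", "permanent"))   # path encodes the class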

 

The future uniform solution may require data to be copied and data produced during SC4 to be re-catalogued (e.g. because of hostname or path changes). If needed, catalogue migration/clean-up tools will have to be implemented.

 

During the discussion the MB agreed that:

-            The idea of a new class “durable-permanent” will not be implemented for SC4; it will be discussed again for the long-term implementation. Some sites may nevertheless back durable storage with tape during SC4.

-            DPM should advertise itself as “durable” in order to match the above meaning of the SRM storage classes.

-            The “wantPermanent” attribute should be ignored.

 

On the client software: GFAL and lcg-utils will use the attribute “permanent” by default. FTS will continue to work with complete SURLs (as now), so the VO must choose the transfer end-points explicitly.

 

The usage of multiple SAPaths with the same SAType should be investigated, because some sites will probably implement such solutions. This use case is not needed for SC4 and will therefore be taken into account when defining the long-term solution.

 

Note: the example in slide 10 should contain different hostnames for the “-d” option, as it shows the commands referring to durable and permanent data used via Castor2.

 

If the SAType requested and the type of the SE addressed do not match, ideally an error should be returned (slide 11). However, CASTOR and dCache will ignore the flag (for SRM v1.1), so they will not return an error. Clients may therefore decide that a mismatch between the indicated class and the SAType of the indicated SAPath is an error.
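
A minimal sketch of such a client-side check (the function and argument names are illustrative, not part of any actual client tool): since the server will not flag the mismatch itself for SRM v1.1, the client compares the requested class with the SAType advertised for the chosen SAPath and decides whether a mismatch is fatal.

    # Illustrative client-side check: for SRM v1.1 the server ignores the
    # requested class instead of returning an error, so the client itself may
    # compare the requested class with the SAType advertised for the SAPath.
    def check_storage_class(requested: str, advertised_satype: str,
                            mismatch_is_error: bool = True) -> None:
        if requested == advertised_satype:
            return
        msg = (f"requested class '{requested}' does not match "
               f"advertised SAType '{advertised_satype}'")
        if mismatch_is_error:
            raise RuntimeError(msg)      # fail early, as slide 11 suggests
        print("warning:", msg)           # or merely warn and proceed

    check_storage_class("permanent", "durable", mismatch_is_error=False)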

The general “ontology for storage”, in terms of file retention time, quality of retention and transfer performance (slides 12, 13 and 14), should be clarified among the implementation projects, the experiments and the sites.

 

Slide 15 and the following slides describe the features agreed for SRM 2.1 by the WLCG Baseline Services Working Group (http://cern.ch/lcg/PEB/BS). They were originally planned to be implemented before WLCG Service Challenge 4, but were then delayed until Fall 2006. Since the workshop concluded, many of the features requested (file types, space reservation, quotas, permissions, etc.) need to be revised and clarified in order to match the recent discussions and the lessons learned since Summer 2005 (during the service and data challenges).

 

Therefore the MB agreed to appoint a permanent “SRM Coordination Committee” (slide 24), chaired by M.Litmaath, with the mandate to define the external details of the SRM 2.1 implementations and the storage classes to be used by LCG, and to monitor the evolution and testing of the corresponding implementations. The committee will include members from the SRM and mass storage system implementation projects, the experiments, the sites, deployment and middleware development. The last slide contains the list of members proposed: the parties involved can propose (before the next MB) to change their representatives, and additional sites can propose to join the group, but the number of members should not increase considerably.

 

Decision:

The MB agreed to form an “SRM Coordination Committee” (SRMCC).

 

Action:

21 Mar 06 - M.Litmaath will clarify the membership of the SRM Coordination Committee with the SRM and mass storage projects, the experiments and the sites, and circulate the SRMCC membership list to the Management Board.

 

5.      Progress Reports for this Quarter (2006Q1)

 

5.1         2005Q4 reports (documents)

 

Not discussed at this meeting.

 

6.      AOB

 

6.1         VOBoxes discussions

The Overview Board will discuss the VOBoxes next week; K.Bos is invited to the meeting because additional information may be needed.

 

There will be a second workshop on VOBoxes to reach a conclusion on the subject. The GDB decided that until then no new services should be deployed at the sites. LHCb and ATLAS stated that they need VOBoxes deployed for SC4 (from June 2006). Currently all Tier-1 sites except NIKHEF have VOBoxes installed; NIKHEF stated that it will provide VOBoxes for SC4.

 

7.      Summary of New Actions

 

 

Action:

21 Mar 06 - M.Litmaath will clarify the membership of the SRM Coordination Committee and circulate it to the MB list.

 

 

The full Action List, with current and past items, will be available on the wiki page above before the next MB meeting.