LCG Management Board

Date/Time:

Tuesday 25 April 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061498

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 - 2.5.2006)

Participants:

A.Aimar (notes), D.Barberis, M.Barroso, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, M.Delfino, I.Fisk, J.Gordon (chair), B.Gibbard, C.Grandi, M.Lamanna, J.Knobloch, H.Marten, P.Mato, B.Panzer, M.Schulz, Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList 

Next Meeting:

Tuesday 2 May 2006 at 16:00

1.      Minutes and Matters arising (minutes) - A.Aimar

 

1.1         Minutes of the Previous Meeting

Comments received from ATLAS are highlighted in blue in the minutes.

Minutes approved.

 

2.      Action List Review (list of actions)

 

No outstanding due actions.

 

3.      CASTOR Migration Status (more information) - T.Cass

 

 

 

ATLAS migrated to CASTOR2 on Thursday.

Update: The CASTOR1 service for ATLAS has been switched off.

 

The plot shows access to CASTOR2 by different groups:

-          ~ 200 jobs for SC4

-          ~ 400 jobs for ATLAS, with a peak of 1000 jobs on Monday

 

ALICE has already migrated to CASTOR2.

 

CMS and LHCb were supposed to migrate, but asked to delay the migration:

-          CMS will migrate starting on 2 May.

-          LHCb will use CASTOR2 as default in the coming days. A reduced CASTOR1 service will be kept until a CASTOR2 client for Windows is available, and for access by some LHCb legacy software that uses an old version of ROOT requiring CASTOR1 (because of the RFIO plugin).

 

4.      SC4 Progress and Issues - J.Shiers

-          site maintenance and support coverage during SC4 throughput tests

-          unplanned schedule changes

-          monitoring, showing the data rate to tape at remote sites and also of overall status of transfers

-          debugging of rates to specific sites

-          future throughput tests using more realistic scenarios

 

Site maintenance and support coverage during SC4 throughput tests

Some sites performed scheduled maintenance interruptions during the SC4 tests; the Service Challenge therefore has to be able to handle interruptions of up to 48 hours per site (no more than one site at a time). The interruptions need to be scheduled among the sites.
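As an illustrative sketch only (not an agreed tool; site names and dates are hypothetical), the two constraints above, at most 48 hours per site and no two sites down at once, could be checked over a proposed schedule like this:

```python
from datetime import datetime, timedelta

MAX_HOURS = 48  # maximum allowed interruption per site


def check_schedule(windows):
    """Validate proposed maintenance windows.

    windows maps a site name to a (start, end) pair of datetimes.
    Returns a list of problems: windows longer than MAX_HOURS, and
    pairs of sites whose downtime would overlap.
    """
    problems = []
    for site, (start, end) in windows.items():
        if end - start > timedelta(hours=MAX_HOURS):
            problems.append(f"{site}: interruption longer than {MAX_HOURS}h")
    sites = list(windows.items())
    for i, (s1, (a1, b1)) in enumerate(sites):
        for s2, (a2, b2) in sites[i + 1:]:
            if a1 < b2 and a2 < b1:  # the two intervals overlap
                problems.append(f"{s1} and {s2} overlap")
    return problems
```

An empty result means the proposed windows respect both constraints; anything else has to be renegotiated among the sites.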

 

Action:

31 May 06 – J.Shiers to define a mechanism to coordinate maintenance interruptions of the Tier-1 sites.

 

At some sites support coverage is very low; this is an area of concern, particularly during the upcoming holiday period.

 

Unplanned schedule changes

FZK could not take part in the tape tests.

 

Monitoring, showing the data rate to tape at remote sites and also of overall status of transfers

The current monitoring shows what is sent from CERN but not what is actually stored on tapes at the sites.

Sites monitor their tape storage and throughput independently. A central repository for this information must be defined.
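A minimal sketch of what such a repository could hold, assuming sites periodically report the interval covered and the megabytes actually written to tape (the interface is hypothetical, not a proposed design):

```python
from collections import defaultdict


class TapeThroughputRepository:
    """In-memory sketch of a central repository to which each site
    reports what was actually written to its tapes."""

    def __init__(self):
        # site name -> list of (interval_seconds, mb_written) reports
        self.reports = defaultdict(list)

    def report(self, site, interval_seconds, mb_written):
        """Record one monitoring interval reported by a site."""
        self.reports[site].append((interval_seconds, mb_written))

    def average_rate(self, site):
        """Average tape throughput for a site in MB/s over all reports."""
        total_s = sum(i for i, _ in self.reports[site])
        total_mb = sum(m for _, m in self.reports[site])
        return total_mb / total_s if total_s else 0.0
```

The point is only that a single store would let the rate to tape at the sites be compared directly with the rate sent from CERN, which the current monitoring cannot do.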

 

Action:

15 Jun 06 – Make available a central repository to store tape throughput monitoring information. (To be assigned.)

 

 

Debugging of rates to specific sites

Four sites are working on improving and debugging their throughput rates:

-          ASGC: working on stability; reached rates of 50 and 70 MB/s

-          INFN: will repeat the disk throughput tests after the upgrade to CASTOR2

-          IN2P3 and FZK: rates are being investigated

 

Future throughput tests using more realistic scenarios

Realistic scenarios need to be used for the tests. File sizes, the number of concurrent transfers, and the ramp-up time should reflect the future usage and needs of the system. The current number of simultaneous file transfers may be too high.
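The trade-off between these parameters can be illustrated with back-of-the-envelope arithmetic; the numbers below are illustrative only, not figures from the tests:

```python
def aggregate_rate_mb_s(concurrent_transfers, per_stream_mb_s):
    """Aggregate rate to a site if every stream sustains its per-stream rate."""
    return concurrent_transfers * per_stream_mb_s


def transfer_time_s(file_size_mb, per_stream_mb_s):
    """Time to move one file at a given per-stream rate."""
    return file_size_mb / per_stream_mb_s


# Illustrative: 30 concurrent streams at 5 MB/s each give 150 MB/s
# to the site; a 2 GB file then takes 400 s per stream.
```

Doubling the number of streams doubles the nominal aggregate rate but halves nothing: each file still occupies a tape drive or disk server for as long as its stream takes, which is why too many simultaneous transfers can hurt rather than help.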

 

A technical workshop in June will discuss these conditions. By then sufficient experience should be gained on which parameters can be tuned to improve performance in realistic scenarios.

 

Alternative implementations of the TCP/IP stack should allow higher throughput. The current tests show that reliability is good, but the implications for the configuration and tuning of the servers are still being studied.

 

5.      GLite 3.0 Status and Proposal of Available Options (document, transparencies) - I.Bird

 

5.1         GLite 3.0 Release Schedule and Next Steps

The transparencies show the schedule and the next steps.

 

RC2 has been installed on the PPS since the end of last week. The new RC3 is on the certification testbed, but it will not be deployed on the PPS because bugs in RC2 will only be fixed in RC4, which should be released on Wednesday 26 April.

 

If RC4 passes certification, it will be installed on the PPS on 2 May, and on the Tier-1 sites to upgrade the production service on 15 May.

 

The major changes vs. LCG-2.7.0 are:

-          gLite WMS (RB, CE, CEmon); this is the main new functionality

-          Updated FTS, deployed independently

-          Updated LFC and DPM (roles, groups, ACLs)

-          Updated VOMS (already in use) (Oracle)

-          Adapted dependencies to allow LCG-2.7 and gLite-1.5 to work together

 

The new RB can work with the old and new CE, but users of the old RB will be able to use only the LCG 2.7 CE. LCG 2.7 components will still be available. The gLite CE will be deployed for testing purposes, even if the VOs will still be using the LCG 2.7 CE. It is recommended that the Tier-1s maintain an old LCG 2.7 CE in addition to the new gLite CE.
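The compatibility just described can be summarised as a small lookup; this is only a restatement of the minutes in code form, not any actual middleware API:

```python
# RB flavour -> CE flavours it can submit to, per the statements above:
# the new gLite RB works with both CEs, the old RB only with the LCG 2.7 CE.
RB_TO_CE = {
    "gLite RB": {"gLite CE", "LCG 2.7 CE"},
    "LCG RB": {"LCG 2.7 CE"},
}


def can_submit(rb_flavour, ce_flavour):
    """True if the given RB flavour can submit to the given CE flavour."""
    return ce_flavour in RB_TO_CE.get(rb_flavour, set())
```

Keeping an LCG 2.7 CE alongside the new gLite CE at the Tier-1s, as recommended, covers every row of this matrix.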

 

The VOs will be able to choose their configuration via the JDL and, at the beginning, each VO will have one dedicated gLite RB.

 

A new installation system for gLite 3.0, based on YAIM, will be used. It was tested on the PPS and on 4 ROC sites; there was little feedback about the installation.

 

5.2         Open Issues and Fixes included in the Releases

The document provides details on the status of the open issues that are being fixed. The release in which each fix is included is indicated (last column on the right).

 

All issues indicated are fixed in RC3 or RC4, including the relocatable UI and worker node.

 

Not fixed:

-          Accounting will not work on the gLite CE in the releases for now (Savannah ID: 16425).

-          dCache should be installed from the dCache site and not from the release.

 

JRA1 stated that full support will be provided for the current releases and for bug fixing.

5.3         Steps after the Release

Stability and robustness must be the top priorities after the release. If RC4 is successful, only bug fixes will be applied afterwards, with no new functionality. Suspended issues and newly found bugs should be the top priorities.

 

Future features will be defined by updating "Flavia's list", and they will be implemented sequentially in order to control changes and allow rollback to the previous version. This process should be followed by the developers and verified by the certification team.

 

The sites should install gLite 3.0 on the production system (in a staggered manner for Tier-1 sites); some PPS systems will have to be available again at the sites for future releases and verifications.

 

Action:

The deployment group should define the sequence in which the Tier-1 sites install gLite 3.0.

 

Action:

15 May 06 – J.Shiers/M.Schulz: Flavia's list should be updated, maintained and used to control changes and releases. A fixed URL should be provided.

5.4         Conclusions

Decision:

The experiments and the Tier-1 sites strongly supported this process and the schedule proposed. The MB endorsed it.

 

This information should be communicated to the EGEE TCG by I.Bird and supported by the experiment representatives at the TCG.

 

The contact person for all issues with the release and deployment is M.Schulz.

 

6.      Deciding on how to proceed with Accounting - J.Gordon

-          follow-up to the GDB session in Rome and subsequent email discussions

 

There was no time to discuss Accounting at the meeting.

 

The main issues on the topic are:

-          Site accounting reports need to be completed by the end of April

-          The GDB stated that there is a requirement for user accounting; long-term and short-term solutions should be discussed

-          Keeping APEL working on the gLite CE

-          APEL+DGAS accounting

 

7.      AOB

 

 

Exploitation of roles and groups in the VOMS implementation should be reported to the MB (J.Templon).

 

8.      Summary of New Actions

 

 

Action:

21 May 06 – J.Shiers/M.Schulz: Flavia's list should be updated, maintained and used to control changes and releases. A fixed URL should be provided.

 

Action:

31 May 06 – J.Shiers to define a mechanism to coordinate maintenance interruptions of the Tier-1 sites.

 

 

Action:

15 Jun 06 – Make available a central repository to store tape throughput monitoring information. (To be assigned.)

 

 

Action:

The deployment group should define the sequence in which the Tier-1 sites install gLite 3.0.

 

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.