LCG Management Board

Date/Time:

Tuesday 10 July 2007 16:00-18:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13803

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 12.7.2007)

Participants:

A.Aimar (notes), I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, F.Donno, T.Doyle, M.Ernst, I.Fisk, F.Hernandez, J.Gordon, C.Grandi, M.Kasemann, J.Knobloch, E.Laure, U.Marconi, H.Marten, P.Mato, G.Merino, Di Qing, L.Robertson (chair), J.Shiers, Y.Schutz, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 17 July 2007 16:00-17:00 - Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received. Minutes of the previous meeting approved.

1.2      High Level Milestones for June 2007 (Site Reports; Updated HLM June 2007) - A.Aimar

A.Aimar updated the High Level Milestones dashboard for June 2007 and asked the sites to verify their values in the dashboard by Thursday 12 July 2007.

 

For the milestone “WLCG-07-12 - Site Reliability above 91%” it is proposed to show the performance during the previous 4 months according to the target applicable to each month. The current version would therefore show the period March-April-May-June. The MB approved the change.

 

For the Procurement milestones (WLCG-07-16/17) the MB agreed that these milestones are considered achieved (i.e. shown in green in the dashboard) only if a site has installed 100% of all pledged resources (CPU, disk and tape).

 

Update: HL Milestones updated on Thursday 12 July 2007

 

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 6 July 2007 - Tier-0 and Tier-1 sites complete the Site Availability Reports for June 2007.

Not completed by all sites. A.Aimar will produce the summary with the data available.

  • 10-July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

L.Dell’Agnello will distribute a summary in a few days.

  • 10 July 2007 - L.Robertson asked that, for next week, F.Donno present a detailed SRM roll-out plan with all the activities to be executed and deadlines for sites, developers and experiments.

Done. The SRM roll-out plan will be presented and discussed at this MB meeting.

 

3.    SRM Roll-out Plan (document) - F.Donno

 

F.Donno presented the details of the plan for testing and installing SRM 2.2 at some Tier-1 sites before general deployment in production. All details are in the document presented by F.Donno.

 

Note: The latest plan with the status of the milestones is always available at the GSSD wiki page.

 

The milestones below are divided into:

-       Installation of test systems at the Tier-1 sites that participate in the testing

-       S2 stress tests

-       High level tools tests

-       Experiment tests

-       Installations in production at Tier-1 and Tier-2 sites

 

The initial test installations at Tier-1 sites are for LHCb and ATLAS covering CASTOR, dCache and DPM. LAL and Edinburgh are also included as Tier-2 sites. See details in the table below.

 

Key Tier-1 sites
Details published here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSD

SRM-01

11.07.07

FZK configured for ATLAS and LHCb tests

In progress

People assigned to task: Doris Ressmann (FZK). Installation and configuration of dCache 1.8.0-7 with MSS connectivity. Total disk available 10.4TB. 3.3TB for LHCb exercises. 5TB for ATLAS exercise. The rest for dteam tests.

SRM-02

11.07.07

IN2P3 configured for ATLAS and LHCb tests

In progress

People assigned to task: Lionel Schwarz (IN2P3). Installation and configuration of dCache 1.8.0-7 with MSS emulation. Total disk available 20TB. 5.7TB for LHCb exercises. 13TB for ATLAS exercise. The rest for dteam tests.

SRM-03

11.07.07

BNL configured for ATLAS tests

In progress

People assigned to task: Carlos Fernando Gamboa (BNL). Installation and configuration of dCache 1.8.0-7 with MSS connectivity. Total disk available 20TB for ATLAS exercise.

SRM-04

18.07.07

SARA configured for LHCb tests

In progress

People assigned to task: Ron Trompert, Mark van de Sanden (SARA). Installation and configuration of dCache 1.8.0-7 with MSS connectivity. Total disk available 5.1TB for LHCb exercise.

SRM-05

11.07.07

CERN configured for LHCb tests

Done

People assigned to task: Jan Van Eldik (CERN). Installation and configuration of CASTOR version 2.1.3-15/17 with MSS connectivity. Total disk available 20.4TB for LHCb exercise.

SRM-06

18.07.07

NDGF configured for ATLAS tests

In progress

People assigned to task: Matthias Wadenstein (NDGF). Installation and configuration of dCache 1.8.0-7, disk only. Total disk available for ATLAS exercise: 2TB.

SRM-07

18.07.07

CNAF configured for ATLAS and LHCb tests

In progress

People assigned to task: Giuseppe Lo Re (INFN-CNAF). Upgrade and configuration of CASTOR 2.1.3-15/17. Total disk available for ATLAS and LHCb exercises 6TB. 3.1TB dedicated to LHCb, the rest for ATLAS.

SRM-08

11.07.07

LAL configured with DPM as a Tier-2 for ATLAS in production.

In progress

No experiments have asked for Tier-2s configured for testing. However, this instance is made available in production and in pre-production.

SRM-09

11.07.07

Edinburgh configured with dCache and DPM as a Tier-2 for ATLAS and LHCb

Done

No experiments have asked for Tier-2s configured for testing. However, this instance is made available in pre-production.

SRM-10

From
11.07.07
 to
31.07.07

Testing experiment scenarios with experiment-specific certificates. All sites should pass these tests.

New

People assigned to task: Flavia Donno. Covered by Lana Abadie and Stephen Burke while on vacation (14/7-3/8). This is preliminary to the experiment testing.

 

Functionality tests (SRM-10 above) verify that the endpoints are correctly installed and that VO-specific certificates are accepted, in order to prepare the environment as much as possible for the Experiments’ tests that will follow. The verification should be completed by the end of July.

 

S2 Stress Tests

SRM-11

31.10.07

S2 stress tests of SRM v2.2 development endpoints: CASTOR, dCache, DPM, StoRM.

In progress

People assigned to task: Flavia Donno, Giuseppe Lo Presti (CERN), Shaun De Witt (RAL), Timur Perelmutov (FNAL), Tigran Mkrtchyan (DESY), Jean-Philippe Baud (CERN), Luca Magnoni, Riccardo Zappi (INFN-CNAF). This activity is done in coordination with SRM v2.2 developers and Storage Service providers. Patches will be provided by the developers as soon as possible and a patch roll-out strategy published by them. Roll-out of new releases and patches will be announced and coordinated through GSSD.

SRM-12

31.10.07

S2 stress tests of SRM v2.2 dedicated CASTOR and dCache endpoints to simulate experiment patterns and traffic. Sites involved: DESY, Edinburgh, CERN.

New

People assigned to task: Flavia Donno, Lana Abadie (CERN), Stephen Burke (RAL), Mirco Ciriello (INFN), Patrick Fuhrman (DESY), Greig Alan Cowan (Edinburgh), Jan Van Eldik (CERN). This activity is done in coordination with SRM v2.2 developers and Storage Service providers. The goal is to reach and demonstrate the following:
    1. Determine which load can be handled without degradation for more than 7 days in a row. Demonstrate stability (no server crash, no memory leaks) over this period under the established load.
    2. A downtime of only one day is tolerated.
    3. A failure rate of less than 1%. A server should be able to protect itself under a load that exceeds its maximum manageable load. The server may deny access during peaks but should become available again after the peak; how long this takes depends on the peak value.
    4. Degradation of performance of less than 15% for requests in the queue.
A more detailed document is being drafted. Patches are provided and installed following the established strategy (SRM-11).

 

Stress tests (table above) will exercise the SRM 2.2 installations at endpoints dedicated to these tests, available until the end of October. The stress tests will verify reliability, stability and the non-degradation of the associated MSS services.

 

H.Marten asked whether the stress tests are done on the PPS systems at the sites.

F.Donno replied that these are ad-hoc endpoints, not on the PPS. These endpoints have been set up for the SRM 2.2 tests and are not published in the BDII but used directly.

 

L.Robertson noted that if the criteria in SRM-12 are reached earlier than October the milestone will be considered achieved and the next phase can start earlier.

 

M.Kasemann asked whether the experiments are participating in the stress tests.

F.Donno replied that the experiments will participate in the next phase but are not contributing to the S2 tests.

 

High-level Tools/APIs tests

SRM-13

31.07.07

Definition of tests to be performed. Definition of testing plan. This includes tests on SRM v1.

In progress

People assigned to task: Flavia Donno, Lana Abadie (CERN), Stephen Burke (RAL). This includes tests to demonstrate full compatibility between SRM v1 and v2.

SRM-14

From
1.08.07
to
31.10.07

Testing High-Level Tools/APIs as defined by the plan

New

People assigned to task: Flavia Donno, Lana Abadie (CERN), Stephen Burke (RAL). Problems reported to SRM developers, Storage Service Providers, High-Level Tools developers. Patches provided and installed following the established strategy (SRM-11)

SRM-15

31.10.07

High-level tools will be modified to set v2.2 as the default version of SRM

New

People assigned to task: High-level tools and APIs developers.

 

Tests for the high-level tools (GFAL, lcg-utils, etc.) are being defined in a separate document.

They also include tests of SRM 1.1 calls in order to verify backward compatibility.

The default version for these tools will be SRM 2.2.

 

Ph.Charpentier added that LHCb would like to be able to choose which version of SRM the tools use. The version of SRM to use should not be decided inside the tool implementation but should be configurable externally.

 

Experiments testing schedule

SRM-16

31.08.07

Experiments to provide details and plan for their tests

In progress

 

SRM-17

From
1.08.07
to
31.08.07

LHCb transfer exercise of production data from CASTOR@CERN with SRM v2 to CNAF, FZK, IN2P3, SARA (all with SRM v2) using the FTS 2.0 production service at CERN. Data reprocessing will also be done using high-level utilities, exercising pinning and metadata retrieval. Data will be registered in the production catalogue.

New

People assigned to task: various people from LHCb already involved in the production exercise, Nick Brook (Bristol). Details on the testing plan can be found at https://twiki.cern.ch/twiki/bin/view/LCG/GSSDLHCBPLANS.
Patches will be provided by the developers as soon as problems are reported and fixed. Roll-out of new releases and patches will be announced and coordinated through GSSD.

SRM-18

From
1.09.07
to
30.09.07

ATLAS transfer exercise from CASTOR@CERN with SRM v1 to BNL, FZK, IN2P3, NDGF (all with SRM v2) using FTS 2.0 PPS service at CERN.

New

People assigned to task: Kors Bos (NIKHEF), Miguel Branco (CERN), and Mario Lassnig (Innsbruck). Patches will be provided by the developers as soon as problems are reported and fixed. Roll-out of new releases and patches will be announced and coordinated through GSSD. The use of CASTOR@CNAF has to be negotiated.

SRM-19

After CSA07

CMS transfer exercises from Tier-1s to Tier-2s and between Tier-1s using PhEDEx and FTS 2.0.

New

People assigned to task: Daniele Bonacorsi (CNAF). Preliminary tasks will already be performed in August 2007 in coordination with Flavia Donno and some of the sites. The resources needed for this test have to be negotiated. The actual time window for the tests also has to be better defined with CMS.

 

L.Robertson commented that the ATLAS data transfer exercise is insufficient and that the tests should be extended to include production, reconstruction and analysis jobs.

 

F.Donno commented that the Experiments are asked to define their tests in more detail (SRM-16). Pages for the tests of ATLAS, CMS and LHCb are being prepared on the GSSD wiki.

 

M.Kasemann added that CSA07 ends in mid-October, so CMS should be able to test FTS 2.0 and SRM 2.2 in the second half of October. D.Bonacorsi is the interface for all these tests. M.Kasemann also requested that the deadline for providing the definition of the CMS tests be delayed until the end of August.

 

Deployment in production

SRM-20

From
15.10.07
to
30.11.07

Upgrade and configuration of the production Storage Instance to dCache 1.8.0-n at FZK and FNAL.

New

If no major show-stoppers are found.

SRM-21

30.11.07

Upgrade and configuration of the production Storage Instance at Key Tier-1 sites to the new versions of dCache and CASTOR.

New

If no major show-stoppers are found.

SRM-21

30.11.07

SRM v2.2 configuration for all VOs at Key Tier-1 sites.

New

 

SRM-22

15.10.07

Start the upgrade and configuration of Tier-2 sites using DPM and StoRM to SRM v2

New

To be finished in January 2008

SRM-23

From
05.01.08
to
31.01.08

Upgrade and configuration of the production Storage Instance with SRM v2.2 at all Tier-1 and Tier-2 sites.

New

 

SRM-24

28.02.08

Have all sites fully functional in production with SRM v2.2

New

 

 

The deployment of SRM 2.2 in production at the Tier-1 sites depends on the success of the previous milestones, but in principle all Tier-1 sites should be installed by the end of November. At the Tier-2 sites the upgrade can start in mid-October and be completed by January 2008.

 

By the end of January 2008 all LCG sites should have moved to SRM 2.2.

 

L.Robertson noted that if the tests are positive earlier than mid-October, then the installations can also start earlier.

 

F.Hernandez added that the installation of dCache 1.7 took almost 3 months. Therefore, at IN2P3, dCache 1.8 will be installed in production only after thorough testing of the system under real load.

 

M.Ernst added that BNL and ATLAS will do realistic load-testing of dCache within the timeline specified above. The success will depend on the problems encountered and on the responsiveness of the dCache development team.

 

H.Marten commented that there is also the constraint that, if the installation happens after the end of October, it will be difficult to be ready for cosmic data in November. He is worried that there is already a shift of one month compared to last week, and that there is a risk of having no contingency left for data taking in production in 2008.

 

L.Robertson proposed that one site could upgrade in production as soon as possible, to discover early any problems due to load and realistic usage patterns without endangering all of the resources available to the experiments. This site should not be penalised for poor levels of usage or reliability during this time, as it would be providing a necessary service to the experiments.

F.Donno confirmed that this scenario was discussed at the GSSD Workshop. Feedback from the experiments is expected.

She noted that CASTOR at CERN is already being installed in a realistic production configuration.

 

Ph.Charpentier agreed that one site upgrading soon for real production would be very useful for the Experiments.

 

M.Kasemann noted that CMS CSA07 should not be perturbed by one site being unstable in that period; if the upgrade happens after mid-October CMS is in favour. It should also not happen during the cosmic runs (end of November).

 

I.Fisk added that FNAL has a test instance of dCache but will not upgrade production until after CSA07. After CSA07 the upgrade at FNAL can be done quickly.

 

L.Robertson concluded that the deployment milestones above should be monitored and redefined according to the results of the tests and of the cosmic runs in November.

 

The MB recommended that, if possible, one site should upgrade to production earlier (mid-October). FZK and FNAL appear to be good candidates for this.

 

J.Shiers (or H.Renshall) will report to the MB about progress and issues of the milestones of the SRM plan.

 

 

4.    GDB Summary (document) - J.Gordon

 

J.Gordon only had time to focus on the SL4 discussion that took place at the GDB:

-       The SL4 WN release has been in production for a month but the uptake has been disappointing.

-       Sites were concerned that the SL4 middleware release did not contain all the rpms required by the experiments that were previously included in SL3. Markus said this was a deliberate choice to remove operating system components from the middleware packages.

-       Not all experiments had updated their VO Cards to include the extra rpms that they require. All experiments were to have updated their VO cards by the Operations Meeting on Monday 9 July (ALICE and CMS have not updated their VO cards).

 

L.Dell’Agnello asked why the rpms needed by the Experiments had not already been tested on the PPS; there, any rpm conflicts among the Experiments could have been detected and solved in advance.

 

All details are in the attached document.

 

 

5.    AOB

 

 

No AOB.

 

6.    Summary of New Actions

 

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.