LCG Management Board

Date/Time

Wednesday 22 October 2008 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39176

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 24.10.2008)

Participants

A.Aimar (notes), I.Bird(chair), D.Barberis, O.Barring, D.Britton, L.Dell’Agnello, S.Foffano, I.Fisk, F.Hernandez, M.Kasemann, M.Lamanna, P.Mato, G.Merino, Di Qing, Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 28 October 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting 

G.Merino sent some clarifications by email about the mandate of the working group on CPU Benchmarking. The changes were incorporated into the minutes of the previous week. The minutes of the previous MB meeting were then approved.

 

2.   Action List Review (List of actions) 
 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

DONE. SCAS was distributed; it still needs to be certified and deployed.

 

New Action:

SCAS certified and tested.

  • For the ATLAS Job Priorities deployment the following actions should be performed :

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end

 

Being discussed in ATLAS.

M.Lamanna reported that today’s system uses Panda to submit production jobs. For analysis there is progress using glexec, and participation in the WG on pilot jobs has been very useful for ATLAS. Analysis will probably be done using WMS submissions. The importance of the Job Priorities (JP) mechanism is decreasing. Sites will still need to distinguish queues and shares for production and analysis, with the switching done by checking VOMS roles, but the full JP system may not be necessary.

This progress ought to be confirmed after the ATLAS Software Week, next week.

 

I.Bird added that, if this strategy change takes place, ATLAS should urgently provide recipes to make sure that each site does not define its own specific solution.

  • Converting Experiments requirements and Sites pledges to new CPU units.

 

Working group launched.

  • Form a working group for User Analysis for a strategy including T1 and T2 Sites.

Done.

  • 23 Sept 2008 - Prepare a working group to evaluate possible SAM tests verifying the MoU metrics and requirements.

It is on the agenda for the WLCG Workshop in November.

  • 8 Oct 2008 – O.Keeble sends to the MB a proposal on possible upgrades to middleware because of the LHC delays.

 

Proposal distributed by O.Keeble.

  • 13 Oct 2008 - O.Keeble will send an updated proposal for software distribution and update at the Tier-1 Sites.

 

Proposal distributed and will be discussed at the GDB.

  • 22 Oct 2008 - By the last week of October the Experiments should provide their new estimates due to the delay in the LHC operations. The assumption is that LHC data taking starts in April-May 2009 as announced by the DG.
    On 22 October there will be a special meeting to prepare the Overview Board.

 

Done in this meeting.

 

 

3.   Preparation of the Overview Board (Slides) - I.Bird

I.Bird presented the main points and issues that will be discussed at the OB on Monday 27 October 2008.

3.1      Consequences of the LHC shutdown

The present shutdown of the LHC has a number of consequences for the planning of WLCG:

-       Capacities and Procurements for 2009

-       Software and service upgrades during the shutdown

-       (Re-)Validation of services for 2009 following changes

 

Each is described in the sections below.

3.2      Capacities and Procurement

The WLCG MB has agreed the following, given the information currently available to us and the present understanding of the accelerator schedule for 2009:

-       The amount of data gathered in 2009 is likely to be at least at the level originally planned; with pressure to run for as long a period as possible, it may be close to or exceed the amount originally anticipated in 2008 + 2009 together

-       The original planning meant that the capacity to be installed in 2009 was still close to a factor of two with respect to 2008, as part of the initial ramp up of WLCG capacity

-       Many procurement and acceptance problems arose in 2008 which meant that the 2008 capacities were very late in being installed; there is a grave concern that such problems will continue with the 2009 procurement

-       The 2009 procurement processes should have been well advanced by the time of the LHC problem in September

 

The WLCG MB thus does not regard the present situation as a reason to delay the 2009 procurements, and we urge the sites and funding agencies to proceed as planned. It is essential that adequate resources are available to support the first years of LHC data taking.

 

J.Templon asked for information about the rumours that repairing the LHC will take 8 months, and therefore longer than officially stated.

I.Bird replied that these are speculations and that the only information is what is on the CERN home page and in the press releases. There is no other internal information suggesting otherwise.

 

L.Dell’Agnello stated that INFN will deliver half of the 2009 disk pledges by April and the second half by July. The first half will be sufficient for the first three months (April-June). By the second half of the year all 2009 pledged capacity will be available. No changes for CPU.

 

F.Hernandez asked whether it would be better to standardize, from 2010, on two installation dates for the pledges instead of having 100% of the capacity by April. He proposed having 50% installed in April and the other 50% later, by the end of July for instance. This would allow the installation to be scheduled over a longer period and avoid having disk unused for several months.

 

J.Templon added that waiting and buying in two slots would also allow better use of power and space, and the purchase of more recent disks.

 

I.Bird reminded the MB that in 2008 the procurement delays were longer than expected. If the second slot is late, the installations will be in September or October, which is too late. Just-in-time procurement must be done very carefully because it leaves no time to verify and, if needed, change the hardware (e.g. in case of problems, as happened at several sites in 2008). Installing in two slots may help the sites to organize their work, but tendering, buying and testing in two separate slots can cause major problems if something goes wrong.

 

D.Barberis stated that ATLAS is not against installation in two batches if the second half is ready by July, without delays. Installations later in the autumn would not be acceptable to the Experiments.

M.Kasemann agreed that having the second slot just before August is very risky and could end up with the disk being installed in October. In addition, there is a lot of data taking activity in the summer, so the winter shutdown is still a better time for upgrades and installations. The split should probably not be 50/50 but 60/40 or higher.

3.3      Upgrade Plans

Several software upgrades were postponed in anticipation of LHC start-up.

We propose that the following changes are addressed in the coming months:

 

FTS/SL4

This was postponed and will now be deployed. It has been tested extensively.

 

WN/SL5

A first installation is already available at CERN, to be tested by the experiments.

Target – available on the infrastructure in parallel to SL4

 

glexec/SCAS

Target - enabling of multi-user pilot jobs via glexec. 

SCAS currently in testing.  Essential for analysis use cases with pilot jobs.

 

CREAM  

Here we should be more aggressive (the LCG-CE is problematic for analysis with many users).

If the use case is direct submission with no proxy renewal, CREAM is basically ready. Proxy renewal should be fixed in the simplest possible way (reproduce the lcg-CE solution).

WMS submission will come with ICE; the timescale is months.

Target – maximum availability in parallel with lcg-CE

 

WMS

Status: Patched WMS, fixing issues with proxy delegation, to be deployed now.

ICE to submit to CREAM. Not required for certification of CREAM.

ICE will be added in a subsequent update (but better before Feb. 2009)

 

Multiple parallel versions of middleware available on the WN

Status - at the moment it is not easy to install or use multiple parallel versions of the middleware at a site. While the multi middleware versions and multi compiler support are not disruptive, they require some changes on the packaging side and a small adaptation on the user side.

Target - it seems advisable to introduce this relatively shortly after the bare-bones WN on SL5.

 

Other anticipated upgrades:

-       Glue2 – deploy in parallel – provides better description of resources

-       CE publishing – better description of heterogeneous clusters

-       gridftp2 patches – these are being backported to VDT1.6; important for dCache and FTS

 

I.Fisk reminded the MB that it had been agreed that the WN should be installed in 64-bit mode at all Sites.

 

Another item missing from the list is the SRM upgrades and their short- and long-term solutions.

3.4      Re-validation of the Service

 

All experiments are continually running simulations, cosmics and specific tests at high workload levels (and have been since CCRC’08); this will continue.

-       A full CCRC’09 in the same mode as 2008 is not regarded as useful. Instead, we will perform specific tests/validations:

-       Service validation if software is changed/upgraded

-       Specific tests (e.g. throughput) to ensure that no problems have been introduced

-       Tests of functions not yet tested (e.g. Reprocessing/data recall at Tier 1s)

 

Details of the test programme will be discussed and agreed in the workshop already planned for November

 

M.Kasemann asked about the status of the proposal to move from 5-year planning to 3-year planning for Experiment requirements and procurement.

I.Bird replied that this was mentioned at the last RRB meeting and will be restated at the next RRB meeting. When it was presented informally, everybody agreed.

 

4.   CMS (Slides) – M.Kasemann

 

M.Kasemann had prepared some explanations on why the requirements for CMS will not change. See Slides.

 

I.Bird added that it would be good if ATLAS also had a similar explanation.

D.Barberis added that ATLAS has a short document with similar conclusions. It just needs to be approved within ATLAS and will then be distributed.

 

ATLAS and CMS should provide the assumptions and explanations of their 2009 requirements.

 

Y.Schutz added that ALICE also has such a document available, as it was prepared for the Scrutiny group, but it was written before the LHC delay.

 

5.   AOB

 

 

L.Dell’Agnello stated that he will send the explanation of the INFN reliability issues for the Overview Board before the end of the week.

 

6.    Summary of New Actions