LCG Management Board

Date/Time:

Tuesday 10 October 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063269 

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 11.10.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, M.Delfino, Di Quing, B.Gibbard, J.Gordon, I.Fisk, D.Foster, F.Hernandez, M.Lamanna, J.Knobloch, H.Marten, B.Panzer, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Soltz, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 17 October from 16:00 to 17:00, CERN time

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of the previous meeting had not yet been distributed at the time of the MB meeting.
They will be discussed at the next MB meeting.

 

Apologies.

1.2         Matters Arising

1.2.1          2006Q3 Quarterly Reports (documents)

Received: Applications Area, ARDA, ASGC, CERN, CMS, FZK, LHCb, SARA-NIKHEF, SC4, US-ATLAS, US-CMS

Missing: ALICE, ATLAS, CC-IN2P3, DB Services, Deployment Area, GDB, INFN, NDGF, PIC, RAL, TRIUMF

 

The reports are needed in order to organize the review within a week.

 

Note: several other reports have since been received.

 

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement them.

Done.

  • 31 Jul 06 - Experiments should express what they really need in terms of interoperability between EGEE and OSG. Experiments agreed to send information to J.Shiers.

Done by ATLAS. D.Barberis distributed a mail with the ATLAS proposal for interoperability across all grids (email). The action was closed, although no input was received from the other experiments.

  • 29 Sept 06 - MB members should send feedback and comments on the mandates of, and participation in, the ECM, OPS and SCM meetings (latest version).

Done.

  • 6 Oct 2006 - J.Templon distributes information, as used at SARA, on how to calculate the disk cache size for the case disk0tape1.

Done.

  • 6 October - B.Panzer distributes to the MB a document on “where disk caches are needed in a Tier-1 site”, with everything included (buffers for tapes, network transfers, etc.).

Postponed to end of October.

  • 10 Oct 2006 - The 3D Phase 2 sites should provide, in the Quarterly Report (due now), the 3D status and the time schedules for installations and tests of their 3D databases.

To do.

  • 13 Oct 2006 - Experiments should send to H.Renshall their resource requirements and work plans at all Tier-1 sites (cpu, disk, tape, network in and out, type of work) covering at least 2007Q1 and 2007Q2.

 

3.      US Computing for ALICE (transparencies) - R.Soltz

 

R.Soltz, from LLNL, the Computing Coordinator of the US institutions seeking admission to ALICE, gave a brief overview of the computing infrastructure that will be available to the ALICE experiment in the US.

 

Y.Schutz explained that the Ohio and Houston universities were already collaborating in ALICE and that this is a broader proposal, submitted for the construction of a new ALICE sub-detector (EMCAL). About 10 US institutes will provide resources proportional to the number of their collaborators who will join ALICE.

The US infrastructure will reach about 10% of the total ALICE computing resources (including both Tier-1 and Tier-2 resources).

 

The EMCAL project was approved by the LHCC two weeks ago; it is now going through the DOE procedures and has reached the step at which approval is granted.

3.1         Facilities Overview

The resources will be divided among four facilities (slide 2):

-          NERSC at LBNL (Berkeley) will be the main site and will cover half of the total contribution.

-          LC at LLNL (Livermore) will use existing facilities and needs no direct funding for now. Additional proposals may be submitted later, once the ALICE software is running.

-          OSC at Ohio State University, a smaller centre already in ALICE.

-          TLC2 at the University of Houston, a smaller centre already in ALICE.

3.2         PDSF at NERSC LBNL

PDSF is the cluster proposed for ALICE at NERSC:

-          Can already provide 500 CPUs, plus another 500 CPUs that will be requested from the DOE.
This will reach 5% of the total ALICE needs.

-          Has an HPSS storage system of 22 PB, which is sufficient for an LCG Tier-1 site.

-          Connects to the ESNET network via a 10 Gb/s link.

-          Has 24x7 support for critical components (interactive nodes, home file system, etc.); on-call support from system administrators and project experts is available if needed.

3.3         LC Serial Cluster at LLNL

The Serial Cluster is a new system being installed at Livermore; ALICE will use a subset of it:

-          640 CPUs will be available to ALICE

-          Will connect to the ESNET, as NERSC does

-          HPSS capacity is similar to that of LBNL

-          24x7 support is staffed and monitors network, batch, hardware, etc. On-call support is also available, as at LBNL.

3.4         Ohio State University and University of Houston

ALICE will have ~300 CPUs in total (200 in Ohio, 100 in Texas), requested in part by leveraging the NSF proposal for the “AliEn-OSG interface” development project.

 

L.Robertson asked whether the MSS systems provide standard SRM interfaces, as all LCG sites currently do. L.Bauerdick confirmed that NERSC provides the SRM interface. F.Carminati said that if sites do not have SRM installed, ALICE could use HPSS directly from xrootd.

 

L.Robertson recalled that SRM end-points are needed for FTS, and that this must therefore be analysed and decided.

 

L.Bauerdick asked whether they will use the US LHC transatlantic network, and pointed out that a meeting of the US LHC network WG will take place next week and could be used to discuss the routing of ESNET traffic on that link. R.Soltz said that the right contact for network issues is Doug Olson.

 

L.Robertson asked which of the sites will be the US Tier-1 site for ALICE. R.Soltz replied that NERSC will surely be a Tier-1 and that the others could also be candidates. Further discussions will take place with ALICE to define the Tier-1 site. I.Bird noted that having more than one Tier-1 would considerably increase (double or more) the work for configuration, monitoring, etc.

 

J.Gordon asked about the Tier-2 sites that will be connected to these US sites; for the moment this aspect has not been defined.

 

4.      Preparation of Milestones for 2007 (transparencies) - A.Aimar

 

Milestones and targets for 2007 (and 2008) need to be defined, both at Level 1 (high-level milestones) and at Level 2 (milestones and targets for sites, experiments and projects).

4.1         Proposal for 2007 Planning

A.Aimar proposed the next steps that could be undertaken in order to prepare plans with milestones and targets for 2007.

 

Most activities and products need to be delivered by Summer 2007 in order to be ready by the end of 2007, when all LCG components must work. Therefore clear milestones and targets need to be defined covering the period until commissioning.

 

In order to track progress better, intermediate milestones need to be defined; this will also allow corrective actions to be taken sooner. High-level milestones will group several Level 2 milestones (e.g. 80% of the sites must have reached 90% of the final target by a given date), as in the sketch below.
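As a purely illustrative sketch of how such a grouping rule could be evaluated automatically (the function, site names, numbers and thresholds below are invented for the example; they are not part of the proposal):

def level1_milestone_met(site_progress, target, site_fraction=0.80, target_fraction=0.90):
    # A Level 1 milestone is met when a given fraction of the sites
    # has reached a given fraction of the final target.
    reached = sum(1 for value in site_progress.values()
                  if value >= target_fraction * target)
    return reached >= site_fraction * len(site_progress)

# Hypothetical example: 4 of 5 sites (80%) are at or above 90% of a target of 1000 units.
progress = {"SiteA": 950, "SiteB": 900, "SiteC": 910, "SiteD": 980, "SiteE": 500}
print(level1_milestone_met(progress, target=1000))  # prints: True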

 

Last year most of the focus was on preparing the infrastructure at the sites and on deploying the middleware. Now, in addition, testing, deploying and operating the experiments' software at all sites is also important.

 

This year the milestones will be discussed with all parties involved and the proposal is to:

-          Meet all experiments and all (or voluntary?) sites

-          Collect milestones and targets, and together build a plan

-          Present Level 1 milestones to the MB for approval

-          Check that the corresponding Level 2 milestones are present in the individual plans

-          Monitor and track the progress (with a monthly report at the MB)

 

N.Brook asked whether the milestones are internal to the project or are those that have already been specified and sent to the LHCC. L.Robertson replied that these milestones will be used to plan and monitor the work towards commissioning. If the experiments have already defined suitable milestones for the LHCC, these can certainly be used; but it is important to have a sufficient number of milestones or targets to track progress and identify problems early.

 

L.Bauerdick said that the milestones on slide 3 look like CMS Level 2 milestones, whereas the CMS Level 1 milestones are very high level (e.g. of the kind “CMS computing system ready”).

 

J.Knobloch, who was at the LHCC, said that the LHCC was thinking more of combinations of the milestones on slide 3. For instance: “Total throughput from DAQ to Tier-1 sites at the rate …”, “Experiments' 3D integration completed”.

 

The intention is not to monitor too many milestones at the LHCC level, but to have a sufficient number to be able to track the global status. Intermediate targets could be defined to track progress internally at the MB level. An example of milestones for the LHCC was the “two phases for the preparation of the 3D databases”: having two milestones instead of a single “everything ready” milestone allowed the status (and the issues) of the 3D Phase 2 sites to be monitored better.

 

The goal of this presentation was to agree that the milestones will be prepared following the process described above. Slides 3, 4 and 5 only present some examples of possible milestones, in order to trigger further discussion.

 

The choice of meeting experiments and sites individually aims to allow more people to express their input than would a single common meeting, where input is sometimes limited. The MB agreed that individual meetings will be more useful for gathering ideas and better input for the 2007 milestones.

 

5.      AOB

 

5.1         Sites Reliability Metrics (document) - L.Robertson

L.Robertson presented the Site Availability metrics (for September and accumulated) and asked for comments.

 

For the first time both reliability and availability metrics are calculated (the distinction is sketched below). Sites should check that the values are correct and reasonable.
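For reference, the difference between the two metrics is, roughly, that reliability excludes scheduled downtime from the accounting. A sketch of the usual convention (an assumption for illustration, not a definition quoted at the meeting):

\[
\text{Availability} = \frac{T_{\text{up}}}{T_{\text{total}}},
\qquad
\text{Reliability} = \frac{T_{\text{up}}}{T_{\text{total}} - T_{\text{sched.down}}}
\]

where \(T_{\text{up}}\) is the time during which the site passed the SAM tests, \(T_{\text{total}}\) is the whole period considered, and \(T_{\text{sched.down}}\) is the announced scheduled downtime.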

 

The new VO used for the tests (changed from “dteam” to “ops”) is still not supported at all sites; therefore, exceptionally, both VOs are taken into account for September.

 

Sites should comment on relevant downtime periods; not in the same detail as for the GDB discussion last month, but long periods of downtime should be explained.

 

J.Gordon noted that in mid-September all sites were above the threshold. Was this because the sites were lightly loaded and the tests therefore more successful? Or is there an error in the graph? This will be checked.

 

I.Fisk asked whether it would be possible to know which tests have failed and which cause the whole site to fail. I.Bird replied that it is already possible to find the details of the services and tests using the GridView tool.

 

H.Marten noted that for this month, because of the use of both the dteam and ops VOs, it is even more difficult to find the details of a failure. The SAM output should be streamlined and should also provide a “per site” view, not only a “per service” one.

 

J.Templon said that some improvements to show more details are already being discussed with P.Nyczyk.

 

M.Delfino expected that sites would comment on their downtime periods every month, not just for July and August. L.Robertson replied that this was done only for those months, in preparation for the GDB discussion in September.

 

Some sites said that it is very tedious and difficult to track down month-old problems, and asked again that SAM be improved in this respect, for better retrieval of the causes of problems at a given site.

 

I.Bird asked the sites to send him input on their requests for improvements to the SAM system.

 

M.Delfino mentioned that PIC is now interfacing its alarm system to the SAM test results. This is quite easy because SAM returns an XML structure with all the information about successful or failed test completions; a rough sketch of this kind of integration follows.
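As an illustration only (the XML element and attribute names below are assumptions made for the example; the actual SAM output format may differ), a site could scan the returned structure and raise alarms on failed tests:

# Hedged sketch: the XML layout is invented for illustration, not the real SAM schema.
import xml.etree.ElementTree as ET

def failed_tests(xml_text, site):
    # Return the names of the SAM tests that failed for the given site.
    root = ET.fromstring(xml_text)
    return [test.get("name")
            for service in root.iter("service") if service.get("site") == site
            for test in service.iter("test") if test.get("status") != "ok"]

sample = """<results>
  <service site="PIC" name="SE">
    <test name="srm-put" status="ok"/>
    <test name="srm-get" status="error"/>
  </service>
</results>"""

for name in failed_tests(sample, "PIC"):
    print("ALARM: SAM test %s failed" % name)  # hand over to the local alarm system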

 

L.Robertson noted the lack of improvement in total site availability since May 2006. J.Gordon replied that some sites have understood the causes of their failures, but the problems have not yet been fixed; therefore the availability values cannot improve.

 

M.Delfino, as a member of the OB, underlined that the monthly report (with the sites' comments) is really useful and should be restarted if possible.

 

This suggestion, after a short discussion, was supported by the MB. Every centre should produce a monthly summary, of at most one page, of the site availability and reliability problems reported by the SAM system.

5.2         Move of LCG agendas to Indico

It is proposed to do the migration on Wednesday 18th starting at 6pm.

The Indico support team wants all LCG agendas to be moved.

 

All existing links to categories, agendas and attachments will be preserved.

No migration work is needed on the users' side, except where export scripts from CDS Agenda were used to create external web pages (for example, with lists of meetings).

 

The Indico guide is quite extensive and is linked in the footer of all pages.

 

6.      Summary of New Actions 

 

 

 

The full Action List, with current and past items, will be available on the wiki page above before the next MB meeting.