LCG Management Board

Date/Time:

Tuesday 10 April 2007 - 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11633

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 14.4.2007)

Participants:

A.Aimar (notes), D.Barberis, F.Carminati, J.Coles, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, D.Foster, F.Hernandez, M.Kasemann, G.Merino, Di Qing, H.Renshall, L.Robertson (chair), R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 17 April 2007 – 16:00-17:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of the previous meeting were distributed after this MB meeting (see the MB Minutes page).

Apologies for the late distribution. Comments will be discussed at the next MB meeting.

1.2         Reminder 2007Q1 QR reports (QR_2007Q1.zip)

Distributed on 2 April 2007; the reports should be completed and sent back to A.Aimar by Thursday 12 April 2007.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 15 Mar 2007 - CMS and LHCb should send to C.Eck their requirements up to 2011.

Not done. The ALICE values were sent to C.Eck. Still waiting for CMS (D.Newbold) and LHCb (N.Brook).

 

  • 3 Apr 2007 - A.Aimar will update the targets for all four LHC experiments and distribute them to the MB.

 

Done. Here are the Targets for 2007 (all 4 Experiments) (Experiments Targets 2007_070410.ppt).

They may be discussed at a future MB meeting.

 

 

3.      Accounting for March 2007 – L.Robertson

 

As previously agreed, the new accounting process started with March 2007: part of the data is collected from the accounting database and part from the resource planning table maintained by H.Renshall. The sites should have received a partially pre-filled form for the site accounting.

 

The pre-filled parts are not editable because the sites should already have filled in the accounting database or sent their data to H.Renshall. As agreed, the data was extracted one full week after the end of the month to allow time for checking and corrections to be made.

 

J.Templon asked how the data reported separately by SARA and NIKHEF in the accounting database is summed up in the report.

L.Robertson investigated after the meeting and established that the accounting database takes care of summing the data for the three sites: “SARA-NIKHEF = SARA MATRIX + SARA LISA + NIKHEF ELPROD”.
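
For illustration only, here is a minimal sketch of that aggregation, assuming one monthly CPU figure per sub-site (the numbers and the record format are invented; only the three site names come from the minutes above):

    # Hypothetical monthly CPU figures (arbitrary units); the three sub-site
    # names are those quoted in the minutes.
    monthly_cpu = {
        "SARA MATRIX": 120_000,
        "SARA LISA": 45_000,
        "NIKHEF ELPROD": 98_000,
    }

    # The federation entry is simply the sum of its constituent sites.
    sara_nikhef_total = sum(monthly_cpu.values())
    print("SARA-NIKHEF:", sara_nikhef_total)   # 263000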

 

F.Hernandez added that resources are allocated “on-demand” to the VOs and therefore the information given to H.Renshall is not always up to date.

L.Robertson replied that it was agreed that once a month each site updates the accounting database and provides the mid-term resources data to H.Renshall, at the latest during the first week of the following month.

 

L.Robertson proposed that a reminder be sent to all sites at the beginning of each month asking them to update the installed and allocated resources at the site.

 

F.Hernandez asked how sites can now specify “grid vs local/non-grid CPU usage” of their site resources.

L.Robertson replied that the MB had agreed to report total CPU usage; this is a temporary measure until the GOC database takes the difference between grid and non-grid usage into account.

 

G.Merino asked how “installed vs allocated” resources should be reported.

L.Robertson replied that, as agreed, each site should report their total resources installed and the allocations to each VO. In some cases the sum of allocated resources can be different from the resources actually installed.
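
A purely illustrative example of this last point (all numbers invented): with on-demand allocation a site may over-commit its capacity, so the per-VO allocations need not add up to the installed total.

    # Hypothetical installed capacity and per-VO allocations (kSI2K); invented
    # figures, used only to show that the two totals can legitimately differ.
    installed_cpu = 1000
    allocated = {"ALICE": 300, "ATLAS": 450, "CMS": 350, "LHCb": 100}

    total_allocated = sum(allocated.values())   # 1200, i.e. more than installed
    print(f"installed={installed_cpu}  sum of allocations={total_allocated}")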

 

4.      ALICE Top 5 Issues/Concerns (Slides) - F.Carminati

 

F.Carminati presented a summary of the top 5 issues and concerns for ALICE with respect to the LCG services.

 

1) Interface xrootd-CASTOR2

-          Essential for data access in our computing model / offline infrastructure at T0 and some T1s

-          Prototype “close to working” for some time

-          We now really need to have it working so that it can be tested meaningfully before data taking

-          This in addition to fixing CASTOR problems

-          In particular the unpredictable latency of file recalls from tape

-          A “recall estimate” has been requested at both CASTOR2 reviews

 

F.Carminati said that the initial implementation by A.Hanushevsky does not work and should be fixed. ALICE would like to have it investigated and solved soon. L.Robertson said that he assumed this must be a very recent discovery, as no problems were reported at the meeting between ALICE and the IT management on 12 March.

 

2) Interface xrootd-DPM

-          Important for data access in our computing model / offline infrastructure at T2s

-          Prototype “close to working” for some time

-          It was promised to us in September 2006

-          We could go xrootd-only in T2s, but we would like to use the “standard” DPM installations

 

The situation is similar to the one with CASTOR2; the initial prototype needs to be finalized.

ALICE would prefer not to have an xrootd-only solution, in order to avoid installing xrootd at all Tier-2 sites (where dCache, DPM, etc. are already installed).

 

The following mail was sent after the meeting (13 April 07) by Ian Bird, clarifying the situation.

 

Dear Federico,

 

I was not at the LCG MB last week and so could not respond to the question on support for xroot in DPM, so let me clarify the situation as I understand it now.

 

The DPM xroot software has been provided to ALICE who have built and run their own tests on it.  At the MB in September last year we estimated that we could provide a first prototype in mid-October.  It was delivered initially at the beginning of November – very close to our estimate, and subsequently some changes were made during December and January in response to  issues raised by ALICE in their testing.  This work was finished in January.

 

It has been distributed by ALICE to a number of their test sites since about a month or so.  David Smith met with Andreas Peters at that time to understand if anything more was required, for instance if the question of ALICE distributing it to sites directly or whether it should be included in our DPM distribution was likely to be raised.  At that time no further action was requested.

 

On April 3 Andreas reported the first problem (see below).  This is understood and is being addressed by David (delayed by David's work for Castor and Easter vacations).  No other problems have been reported.

 

The reported bug was due to the fact that the plugin was compiled for DPM 1.5.10 and used with DPM 1.6.3. This problem would not happen if we were building the xrootd plugin ourselves and distributing it as part of a DPM release.  This is certainly an option, but I understand that at the moment ALICE prefer to build the plugin themselves.

 

As far as I am aware, at the moment there are no outstanding issues of functionality or performance.  I can assure you that we will support the xroot implementation in DPM, and that we will continue to work with ALICE and Andy H. to understand how best to build and distribute the plugin.

 

Regards,

 

Ian

 

 

 

dCache with xrootd is tested and is already being deployed, with assurances that maintenance support will continue.

 

Slide 5 shows ALICE’s storage strategy, which is independent of the storage implementation. ALICE’s model depends on xrootd and cannot be changed now.

 

3) FTS

-          Stability and resilience of FTS and its information system

-          Including all underlying components (storage, DB, etc)

-          Establishment of FTS as a real “service” to experiments

-          More proactive problem follow up from service providers as opposed to experiments tracking problems

-          Essential for Tier-0 to/from Tier-1 transfers

-          We are not planning on it for Tier-1 to/from Tier-2

 

ALICE reached the required speed during the transfer tests. Stability is important because FTS is essential for the network transfers. FTS should be maintained and debugged by the FTS experts, who should track down problems together with the experiments; until now most of the debugging of specific problems has been done by the experiments.

 

F.Hernandez asked whether ALICE is going to use FTS for all of its transfers.

F.Carminati replied that another option could be to use xrootd for the transfers over the WAN.

 

L.Robertson asked whether this implies that ALICE is not expecting the Tier-1 sites to provide them with a service to transfer data to the Tier-2 sites.

F.Carminati replied that the Tier-1 sites are not expected to provide such a service for ALICE unless the FTS solution is chosen.

 

L.Dell’Agnello asked why FTS, which is already installed, is not also used for Tier-1 to/from Tier-2 transfers.

F.Carminati confirmed that ALICE may decide not to depend on FTS for those transfers, but all scenarios have to be evaluated.

 

T.Doyle stated that all Tier-1 and Tier-2 sites already have FTS available for network transfers, but any other arrangement is an overhead that the sites do not have resources for.

 

L.Robertson noted that the only agreed service is FTS, and underlined the risk of using an ad-hoc solution for which help in debugging will not be available. Given the constraints on human resources, sites will only be able to support, and have experience with, the FTS service. We already saw last year at CERN that network problems experienced by LHCb during direct gridftp transfers were not understood until they also occurred with the FTS service. ALICE must assume that they will have to do their own debugging if they do not use the FTS services at the Tier-1 sites.

 

4) gLite WMS

-          Improve speed of submission and efficiency of the RB information system(s)

-          Reduce time between “glite-job-submit” and the moment the job is registered in the local batch system

-          High hopes that gLite RB+CE with bulk submission is the solution to this issue

-          Job status information sources (RB L&B, IS, CE GRIS) not in sync and sometimes providing conflicting information

-          Need to test gLiteRB+CE as soon as possible

 

ALICE needs to test the gLite RB, WMS and CE.

 

5) VOMS

-          Deployment of the automatic proxy renewal mechanism with VOMS

-          Critical for operation in view of the deployment of gLite RB+CE

 

The automatic proxy renewal is crucial for being able to run long jobs.
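
For illustration only (this is not the WMS-side automatic renewal mechanism referred to above), here is a minimal client-side sketch that checks whether the current VOMS proxy still has enough lifetime and re-creates it if not; voms-proxy-info and voms-proxy-init are the standard VOMS client commands, while the VO name and the 6-hour threshold are assumptions:

    import subprocess

    MIN_SECONDS = 6 * 3600   # assumed threshold: renew when less than 6 hours remain

    def proxy_seconds_left() -> int:
        """Remaining lifetime of the current VOMS proxy in seconds (0 if none)."""
        result = subprocess.run(["voms-proxy-info", "--timeleft"],
                                capture_output=True, text=True)
        return int(result.stdout.strip()) if result.returncode == 0 else 0

    def ensure_proxy(vo: str = "alice") -> None:
        """Re-create the VOMS proxy interactively if it is about to expire."""
        if proxy_seconds_left() < MIN_SECONDS:
            subprocess.run(["voms-proxy-init", "--voms", vo], check=True)

    if __name__ == "__main__":
        ensure_proxy()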

 

5.      GDB Summary (document) – J.Coles

 

J.Coles presented a summary of the GDB meeting in Prague.

Most of the text below is taken from the attached document.

5.1         Introduction (SAM tests)

Future GDB meeting topics were requested. Some suggestions were received: Monitoring, aggregate top 5 experiment issues, WMS, gLite CE, T1-T2 interactions and testing, new FCR features and accounting policy approval.

 

Issues raised with the SAM availability tests indicated that a SAM BDII was actually responsible for 2-10% of the availability loss seen at some UK sites.

 

Quattor: The group now has 5 main active contributors, with M.Jouvin taking over the lead from C.Loomis. Quattor has been quite successful, with take-up at over 40 sites. Two issues for follow-up were raised:

-          Update to the group’s mandate to cover dissemination?

-          The group has relied on the configuration options of service components being listed in an XML file, which is no longer going to be supported. A request for other documentation options is therefore to be taken to the TCG.

 

File systems: A HEPiX group was set up to look at file systems following an autumn 2006 request from the IHEPCCC to report on distributed file systems (what is to follow AFS?). The group’s mandate is to understand how storage is accessed and used across sites and to review existing solutions (including price/performance).

A questionnaire on this has already been sent to T1 sites (http://hepix.caspur.it/storage/questionnaire1.php). The first group report will be at HEPiX in spring 2007. It was thought that there should be more experiment input to this working group.

 

SL4 status: L.Field gave an update from SA3. The iterative Build-Install-Test cycle was described. The status of the components was listed as: the WN was released onto the PPS on 2 April; the UI has been tested by Integration & Certification and has 4 packaging problems, 15 configuration problems and 4 runtime problems; the WMS is in testing, with work being done to address packaging issues; the LB, MON, CE and BDII are “ready to test”.

 

Castor status: T.Cass reviewed the status of Castor. He showed that performance had been demonstrated in many areas but that there are two significant software weaknesses: limitations in the scheduling of requests and support for the “disk1” storage class. The first is being addressed with an LSF plug-in (stress testing to start next week); the problems underlying the second are understood but this is not currently a priority area. It was also noted that the stager hardware platform at CERN is inadequate (new hardware is being deployed now) and that the software build and release process is complex (there needs to be a bug-fix release and a separate release with new functionality). ATLAS problems are currently being addressed. Demonstrating support for mixed experiment loads is a high priority.

 

Security policies & updates: D.Kelsey gave the update. A new Grid site operations policy (iterations presented at previous meetings) is to be approved in the next month, pending suitable coverage of an area recently raised – Intellectual Property Rights. Several other policy documents are under development and require feedback: the Grid Security Policy and the Logged Information Policy. Other areas under development: the Audit Policy and a new VO Policy document. The request from the meeting was for the GDB to approve the Grid Site Operations Policy (V1.3) when D.Kelsey emails the list shortly.

 

Job priorities: J.Templon gave an update on the implementation of the Job Priorities work at the T0 and T1s – this was coupled with an overview of why VOViews were required (in particular to move away from VO-specific queues). Most sites are now publishing, though specific questions remain for 4 sites (see slide 3 of the talk). During the meeting it was noted that in the short term there will be a need for both VOViews and VO-specific queues.

 

LCG planning: H.Renshall showed the new planning spreadsheet with installed capacity figures now being used in place of available capacity. ATLAS noted that the figures presented in the spreadsheets were off by 3 months – H.Renshall agreed to update the slides and spreadsheet. There were questions surrounding the need for mid-term updates – probably these are needed until April 2008.

 

Storage accounting: G.Cowan from GridPP presented recent work of the GridPP storage group covering such areas as optimisation tests on the WAN and LAN for T2 sites, more information on storage availability tests and issues with them. He also spent some time talking about the implementation of storage accounting and current issues with double/triple accounting depending on how disk is allocated at sites.
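
The double/triple counting can be pictured as follows; this is only an illustrative sketch of one way it can arise (a disk pool shared by several VOs being counted once per VO), not a description of the GridPP implementation, and all numbers are invented:

    # Hypothetical disk pools at a site (TB), each usable by one or more VOs.
    pools = [
        {"name": "shared-pool", "size_tb": 100, "vos": ["atlas", "cms", "lhcb"]},
        {"name": "alice-pool", "size_tb": 40, "vos": ["alice"]},
    ]

    # Naive per-VO summation counts the shared pool three times ...
    per_vo_sum = sum(p["size_tb"] * len(p["vos"]) for p in pools)   # 340 TB

    # ... whereas counting each pool once gives the disk actually installed.
    actual_total = sum(p["size_tb"] for p in pools)                 # 140 TB
    print(per_vo_sum, actual_total)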

 

VOMS coordination: D.Kelsey gave this talk on behalf of John Gordon. He wanted to propose a revised mandate for the VOMS group. Some alternative suggestions for rewording point 2 on slide 3 were given. A revised mandate will be circulated when J.Gordon returns.

 

 

6.      AOB

 

6.1         SAM Reliability for March and the OPN problems

Because of the OPN issues in March there are a few days when, for most sites, the SAM tests failed even though the services at the sites were running. The proposal is to remove these days from the reliability calculation.

 

Decision:

The MB agreed to remove the days when the OPN was not working from the sites’ reliability statistics for March.

 

J.Templon added that the overall reliability should also reflect transfer and network failures, but he agreed that for “site reliability” these days should not be counted.
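
For illustration, a minimal sketch of the arithmetic behind this decision, assuming one pass/fail SAM result per site per day (the real SAM availability calculation is finer-grained); the OPN-affected days are simply dropped from both numerator and denominator:

    # Hypothetical daily SAM results for one site (True = tests passed) and the
    # days on which the OPN outage masked the real site status; all invented.
    days_passed = {1: True, 2: True, 3: False, 4: False, 5: True}
    opn_down_days = {3, 4}

    counted = {d: ok for d, ok in days_passed.items() if d not in opn_down_days}

    raw = sum(days_passed.values()) / len(days_passed)   # 3/5 = 0.60
    corrected = sum(counted.values()) / len(counted)     # 3/3 = 1.00
    print(f"raw={raw:.2f}  corrected={corrected:.2f}")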

 

 

7.      Summary of New Actions

 

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.