LCG Management Board

Date/Time:

Tuesday 24 April 2007 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11635

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 3.5.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, D.Foster, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, M.Lamanna, J.Lee, E.Laure, H.Marten, P.Mato, R.Pordes, H.Renshall, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 8 May 2007 16:00-17:00

1.      Minutes and Matters arising (minutes) 

 

1.1         Minutes of Previous Meeting

The minutes of the previous MB meeting (17.4.2007) were distributed only after this meeting. Apologies.

 

Update: Here is the link to the Minutes of the MB Meeting of 17 April 2007. Please send comments via email to A.Aimar.

 

1.2         Coming High Level Milestones 2007 (wlcg_hlm_20070424.pdf) A.Aimar

A.Aimar presented the High Level Milestones dashboard and asked the sites to send, before the end of April, updates on all milestones due by the end of April 2007.

 

The MB agreed to make this a standard process for the verification of the High Level Milestones. Therefore, as already agreed the previous week, the High Level Milestones for the Tier-0 and Tier-1 sites are summarized in a dashboard table that will be:

-          updated at the end of every month;

-          presented and reviewed at the F2F MB Meeting of the following month;

-          added to the Quarterly Reports and submitted to the Overview Board.

1.3          Dates for the future 2007 Quarterly Reports L.Robertson

In 2007, the dates of the Quarterly Reports are not in sync with the Overview Board meetings.

 

The proposal, supported by D.Jacobs, the Secretary of the LCG OB, is to move the next quarterly reports for 2007 forward by one month:

-          Q2: end July 2007;

-          Q3: end October 2007;

-          Q4: end January 2008.

 

For 2008 the QR dates will be decided once the 2008 calendar of the Overview Board meetings is defined.

 

J.Gordon noted that many sites also report to other bodies (e.g. GridPP, EGEE) and this one-month difference would make it impossible to use the same report elsewhere. On the other hand he agreed that moving Q1 to January would avoid clashing with the Christmas holiday season and with the many activities at the end of each year.

 

Decision:

The MB agreed to move the 2007 QR reports forward by one month, to July, October and January.

1.4         Date for 2007 LCG-LHCC Comprehensive Review L.Robertson

 

The dates of the LCG-LHCC Comprehensive Review have been defined: 19-20 November 2007.

 

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 15 Mar 2007 CMS and LHCb should send to C.Eck their requirements until 2011.

 

Action removed. C.Eck proposed to have this item removed because a new update of the requirements (up to 2012) is already scheduled for July 2007.

 

  • 24 Apr 2007 In order to be in sync with the meetings of the Overview Board, L.Robertson and A.Aimar will propose new dates for the preparation of future QR Reports. 

 

Done. The MB agreed to move the 2007 QR reports forward by one month, to July, October and January.

 

3.      LHCb Top 5 Issues and Concerns wrt LCG Services (Slides) N.Brook

 

N.Brook presented the Top 5 issues for LHCb with respect to the current LCG Services.

 

Issue 1) Data Management System

 

Data access problems via remote protocol access from WN

-          For dCache sites srm_get was not staging a file: it just returned a TURL.
This is now fixed in the latest release of dCache.

-          Authorization issues associated with the use of VOMS
One associated with the Gridmap file
Configuration of gPlazma

-          Stability of dCache servers under heavy load has been a problem

-          Recently discovered a dCache problem with a file registered but not physically on disk.
Observed at 2 dCache sites. Still under investigation; it is not clear whether it is caused by the load from transfers or by a pool migration operation.

 

Transfer problems (efficiency below 50% using lcg_utils)

-          Many instabilities with gsidcap doors, but the current situation is a lot better.

 

-          Transfer failures to CASTOR: a “Resource Busy” message from CASTOR due to a corrupted entry (from previous transfers that timed out or failed) that CASTOR (for consistency) refuses to overwrite.

 

ROOT (AA) seems to be completely disconnected from SRM

-          Need to manipulate the TURL format before passing it to ROOT
dcap:gsidcap://… , castor://, number of slashes, rfio format

The problem is being addressed in the recent releases of ROOT.
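For illustration only, below is a minimal sketch (not LHCb's actual code) of the kind of TURL rewriting the experiment currently has to perform before handing a file to ROOT; the prefixes handled and the target forms are assumptions based on the examples listed above.

```python
# Hypothetical sketch, not LHCb code: the prefixes handled and the rewriting rules
# are assumptions based on the examples mentioned in the slides above.
import re

def normalise_turl(turl):
    """Rewrite an SRM-returned TURL into a form accepted by the ROOT I/O plugins."""
    # dCache can return a doubled protocol prefix such as "dcap:gsidcap://..."
    if turl.startswith("dcap:gsidcap://"):
        turl = turl[len("dcap:"):]
    # CASTOR TURLs may need to be presented to ROOT through the rfio protocol
    if turl.startswith("castor://"):
        turl = "rfio://" + turl[len("castor://"):]
    # Collapse duplicated slashes in the host/path part ("number of slashes" issue)
    scheme, sep, rest = turl.partition("://")
    if sep:
        turl = scheme + sep + re.sub(r"/{2,}", "/", rest)
    return turl

if __name__ == "__main__":
    # Example inputs with placeholder host names
    print(normalise_turl("dcap:gsidcap://door.example.org:22128/pnfs/site/lhcb/file.dst"))
    print(normalise_turl("castor://castorsrv.example.org//castor/cern.ch/lhcb/file.dst"))
```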

 

Issue 2) glexec on the Worker Nodes

 

LHCb are keen to exploit glexec

-          To control resources for analysis

-          To optimize ‘filling’ of the computing slot

-          And to help reduce the load on the LCG RB etc

 

Testing is currently ongoing with IN2P3 (Lyon)

-          An “ad-hoc” glexec was tested successfully and is working on the WNs there

-          The LHCb DIRAC WMS is ready to make use of this solution

 

C.Grandi asked for more information on what the “glexec” deployed in Lyon actually is.

F.Hernandez replied that it is a script that logs the user ID of the submitter of the real LHCb job, without changing the UID of the job executed. It is called “glexec” in order to be a placeholder for the future implementation, but it does not have all the glexec functionality.

Ph.Charpentier added that it is sufficient to have a solution that logs who is executing jobs, since it is the LHCb WMS that does the prioritization of the jobs.
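As an illustration of the behaviour F.Hernandez described, the following is a minimal, hypothetical sketch of a logging-only “glexec” placeholder: it records the identity of the real job submitter and then runs the payload unchanged, without any UID switch. The environment variable name, log path and command-line convention are assumptions, not details of the Lyon script.

```python
#!/usr/bin/env python
# Hypothetical logging-only "glexec" placeholder: NOT the actual Lyon script.
# It records who submitted the payload, then runs it under the pilot's own UID.
import os
import sys
import time

LOGFILE = "/var/log/glexec-placeholder.log"  # assumed log location

def main():
    if len(sys.argv) < 2:
        sys.stderr.write("usage: glexec <command> [args...]\n")
        return 1
    # The submitter's grid identity is assumed to be passed via an environment variable.
    submitter_dn = os.environ.get("GLEXEC_CLIENT_DN", "unknown")
    record = "%s pilot_uid=%d submitter_dn=%s cmd=%s\n" % (
        time.strftime("%Y-%m-%dT%H:%M:%S"),
        os.getuid(),
        submitter_dn,
        " ".join(sys.argv[1:]),
    )
    with open(LOGFILE, "a") as log:  # append an audit record of who ran what
        log.write(record)
    # No setuid call: the payload keeps the UID of the pilot job; only the identity is logged.
    os.execvp(sys.argv[1], sys.argv[1:])

if __name__ == "__main__":
    sys.exit(main())
```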

 

C.Grandi commented that this solution is equivalent to using “glexec in null mode”; the only difference is that the log information would be stored in an area not modifiable/editable by the pilot and user jobs.

 

I.Bird added that the “null mode” option, which does not switch the real UID of the job, is not accepted by some 50% of the sites, and consensus has not yet been reached on how glexec should behave with respect to the UID of the jobs.

F.Hernandez replied that most of the batch systems at sites cannot in any case handle the switching of the UID, and that is why the logging has been implemented with that ad-hoc solution.

 

I.Bird proposed that the solution with “standard logging with auditing but without user-switching” be put to the GDB in order to collect the opinions of all Tier-1 sites.

 

Concerns about whether glexec will be certified in time

-          Preview testbed version
Uses LCAS/LCMAPS information from local files rather than from a dedicated central service

-          Delays are also on the LHCb side
Postponed whilst a stager agent was developed to deal with file access issues

-          Will it be part of the deployed gLite release? Will any test site be available to LHCb?

-          What are the possibilities of rolling out Lyon’s “ad-hoc” glexec?

 

C.Grandi recognized that the use of LCAS/LCMAPS with a local file is not a valid deployment model, and that if this is a required service some work will be needed to find a centrally-supported solution.

 

R.Pordes added that OSG will deploy “glexec” services and that they are interested in participating in providing LCMAPS as a centrally-supported service.

E.Laure replied that this work is in fact already being done between EGEE and OSG, independently of the deployment decision of LCG (and a dedicated working group already exists).

 

Issue 3) Storage

 

SE instabilities

-          Last quarter of ’06
There was only a limited time when all 7 LHCb Tier-1s were available, and this caused a backlog in the replication of 80k DSTs

 

-          Sites were upgrading storage implementations
LHCb welcomes this, but it sometimes also resulted in the deployment of unstable versions

 

L.Robertson asked whether the situation has now improved.

N.Brook replied that many sites are more stable, while on others reliability has not improved.

 

SRM v2.2 deployment/testing

-          Relatively late deployment of SRM 2.2 w.r.t. real data taking
There is only a limited time-scale for testing the LHCb use cases

 

Interfaces to SRM2.2 (lcg-utils/GFAL)

-          Currently testing the python binding to lcg-utils/GFAL

-          Bulk operations are not clearly supported. For example, file removal, metadata queries, …
Discussions with developers will take place next week

 

I.Bird noted that the existing ATLAS proposal for bulk operations should also be taken into account in these discussions.

 

-          Currently there is no support for file pinning in GFAL or lcg-utils

-          GFAL use of “bring_online” for SRM2.2 is needed by LHCb.
Currently in SRM v1.1 there is no generic way to stage files and LHCb has its own solution

 

Issue 4) Deployment & Release Procedure

 

Use of LCG AA process

-          Early client exposure
Allows LHCb to test early in the production environment
Quick feedback for client developers. E.g. lcg copy allowing SURL-to-SURL copies, requested in July ’06, is only now being rolled out with gLite

-          Issues associated with compatibility
Recent problems associated with globus libraries & use of lcg-utils were found

 

Early exposure to the VO in parallel with later certification

-          Useful to allow VO to test in production environment

-          E.g. LHCb is still not using the gLite RB in production
The version of the LHCb gLite RB provided at CERN is known to have problems
A version known to be problematic was deployed for LHCb to test, and this wasted LHCb a lot of time

 

Central VOMS servers

-          Lack of central automatic propagation to sites is causing these problems

 

C.Grandi confirmed that the roles of a user and the policies are not propagated to the sites, and that this would require the use of a dedicated policy management service (i.e. Gpbox).

I.Bird replied that for now there are no plans to certify Gpbox, at least not until the current components and the SL4 migration are completed.

 

Issue 5) Information System

 

Consistency of information published by different top-level BDIIs (FCR or not FCR, order of various blocks, different pools of sites published)

-          This prevents a real load-balanced service spread across different sites.

 

Latency of the information propagation

-          This causes flooding of the CEs (the free slots published do not reflect the real situation on the LRMS)

-          The effect is amplified by using multiple RBs: PIC was brought to a halt.

-          The introduction of VOViews helps quite a lot

 

Instability/scalability of the system

-          lcg-utils failing just because a (working) SE was not published

-          CEs appearing/disappearing from the BDII (while they were working properly).

 

I.Bird noted that a proposed solution to this problem is to remove the information provider from the CE. These are typical cases of misconfiguration; see L.Field’s presentation at the last GDB.

 

Content of the information

-          Disk space consumption and space left published by SRM

-          Granularity of the information

-          OS and platform advertised in a coherent way (SL, SL4, SLC4, Scientific Linux 4, etc.)

 

Splitting (pseudo-)static information from dynamic information would be beneficial:

-          The amount of information shipped over the network (and hence the latency) could be reduced;

-          It would improve the stability of the system and allow more reliable access to the static information required by the DM clients.

 

I.Bird confirmed that this is a top priority among the improvements that will go into the next version of the Information System.
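To make the latency and information-content discussion above concrete, the sketch below shows one possible way a client can query a top-level BDII for the dynamic CE state being discussed. The host name is a placeholder and the attribute selection is illustrative; this is not the lcg-utils or RB implementation.

```python
# Illustrative sketch of querying a top-level BDII for dynamic CE information.
# The host name below is a placeholder and the attributes chosen are examples.
import subprocess

TOP_BDII = "ldap://top-bdii.example.org:2170"  # placeholder endpoint

def free_slots_per_ce():
    """Return the raw LDAP entries with the free/waiting job counts published per CE."""
    cmd = [
        "ldapsearch", "-x", "-LLL",        # anonymous simple bind, minimal output
        "-H", TOP_BDII,
        "-b", "o=grid",                    # base of the GLUE information tree
        "(objectClass=GlueCE)",            # select the Computing Element entries
        "GlueCEUniqueID",
        "GlueCEStateFreeCPUs",
        "GlueCEStateWaitingJobs",
    ]
    return subprocess.check_output(cmd).decode()

if __name__ == "__main__":
    print(free_slots_per_ce())
```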

 

Other Issues

 

These are issues that could be discussed at future MB or GDB meetings:

-          Stability of the shared software area.
This is needed in order to have a common area at sites, accessible from all WNs.
E.g. NFS is not scalable, and when access fails the job crashes during execution.

 

I.Bird noted that this issue should be reported and fixed on a site-by-site basis.

 

-          Authorisation/ACL issues surrounding data storage

 

-          VOMS storage of the LHCb “nickname”
This is needed in order to have an extra field to store a unique name for the job’s output across the grid.

 

 

L.Robertson concluded that, now that all experiments have presented their concerns, a summary of the Top 5 Issues from all the LHC experiments, together with the proposed follow-up actions, will be presented and discussed at the next GDB in May.

 

4.      March 2007 Accounting (and Automatic Accounting) L.Robertson

 

The March 2007 Accounting is produced by automatically extracting what has been reported in Harry’s tables and in the APEL repository. Much of the information is still missing, and this will have to improve for next month.

 

Many sites replied that the information extracted from the APEL repository is incorrect. This will require further investigation by the APEL accounting team; J.Gordon will report on its progress.

 

The proposal for providing and extracting the Accounting data is that:

-          On the 2nd of each month the sites will be reminded to update the data in the APEL repository and in the Mid-Term Planning tables (Harry’s tables).

-          On the 8th of each month the information is extracted from APEL and from the Mid-Term Planning tables, and the Accounting report is distributed.

-          Sites should then check the accounting values extracted and report any inconsistencies within a few days.

 

The automation of the accounting process is absolutely urgent because, once it is extended to the Tier-2 sites, it cannot continue to be done manually (as it has been until now for the Tier-1 sites).

 

Pre-filled cells - The pre-filled cells of the accounting sheets are not editable, but several sites have asked to be able to update that data. Editable cells, next to the non-modifiable values, will allow sites to correct inconsistencies without erasing the original (incorrect) values.

 

Grid and non-grid usage – The APEL repository does not differentiate between grid and non-grid usage. Therefore, at present there will be a single “CPU usage” value specifying the total CPU usage.

 

J.Gordon noted that the APEL sensor supports only work submitted through a CE. There is no general mechanism for reporting non-grid work.

L.Robertson replied that it was agreed that non-grid work should nevertheless be reported to the APEL system using the direct interface (as is already done by several sites). Alternatively, of course, sites could provide access to resources exclusively via the CE.

 

The MB noted that it was waiting for a formal proposal, to be made by the APEL team via J.Gordon, on how to represent non-grid usage in the APEL repository.

 

Action:

15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.

 

 

5.      AOB 

 

 

No meeting next week (1st of May).

 

6.      Summary of New Actions

 

 

 

Action:

15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.

 

 

 

The full Action List, with current and past items, will be available on this wiki page before the next MB meeting.