LCG Management Board
Tuesday 24 April 2007 16:00-17:00
(Version 1 3.5.2007)
A.Aimar (notes), D.Barberis, I.Bird, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, D.Foster, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, M.Lamanna, J.Lee, E.Laure, H.Marten, P.Mato, R.Pordes, H.Renshall, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
Mailing List Archive:
Next Meeting: Tuesday 8 May 2007 16:00-17:00
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting (17.4.2007) were distributed only after this meeting. Apologies.
Update: Here is the link to the Minutes of the MB Meeting of 17 April 2007. Please send comments via email to A.Aimar.
1.2 Coming High Level Milestones 2007 (wlcg_hlm_20070424.pdf) A.Aimar
A.Aimar presented the High Level Milestones dashboard and asked the sites to send, before the end of April, updates on all milestones due by the end of April 2007.
The MB agreed to make this a standard process for the verification of the High Level Milestones. Therefore, as already agreed the previous week, the High Level Milestones for the Tier-0 and Tier-1 sites are summarized in a dashboard table that will be:
- updated at the end of every month;
- presented and reviewed at the F2F MB Meeting of the following month;
- added to the Quarterly Reports and submitted to the Overview Board.
1.3 Dates for the future 2007 Quarterly Reports L.Robertson
In 2007, the dates of the Quarterly Reports are not in sync with the Overview Board meetings.
The proposal, supported by D.Jacobs, the Secretary of the LCG OB, is to move the next quarterly reports for 2007 forward by one month:
- Q2: end July 2007;
- Q3: end October 2007;
- Q4: end January 2008.
For 2008 the QR dates will be decided once the 2008 calendar of the Overview Board meetings is defined.
J.Gordon noted that many sites also report to other bodies (e.g. GridPP, EGEE) and this one-month difference would make it impossible to use the same report elsewhere. On the other hand, he agreed that moving Q1 to January would avoid clashing with the Christmas holiday season and with the many activities that always take place at the end of the year.
The MB agreed to move the 2007 QR reports forward by one month, to July, October and January.
1.4 Date for 2007 LCG-LHCC Comprehensive Review L.Robertson
The dates of the LCG-LHCC Comprehensive Review have been defined: 19-20 November 2007.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
Action removed. C.Eck proposed to have this item removed because a new update of the requirements (covering the period until 2012) is already scheduled for July 2007.
Done. The MB agreed to move the 2007 QR reports forward by one month, to July, October and January.
3. LHCb Top 5 Issues and Concerns wrt LCG Services (Slides) N.Brook
N.Brook presented the Top 5 issues for LHCb with respect to the current LCG Services.
Issue 1) Data Management System
Data access problems via remote protocol access from the WN:
- For dCache sites srm_get was not staging a file: it just returned a TURL.
- Problems associated with the use of VOMS.
- The stability of dCache servers under heavy load has been a problem.
- LHCb discovered a dCache problem with a file registered but not physically on disk.
Transfer problems (using lcg_utils, efficiency <50%):
- Many instabilities with gsidcap doors, but the current situation is a lot better.
- Transfer failures to CASTOR: a “Resource Busy” message from CASTOR due to a corrupted entry (from previous transfers that timed out or failed) that CASTOR (for consistency) refuses to overwrite.
ROOT (AA) seems to be completely disconnected from SRM:
- The TURL format needs to be manipulated before being passed to ROOT (see the sketch below).
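As an illustration of the kind of TURL manipulation involved, the sketch below rewrites the protocol prefix of a TURL before handing it to ROOT. The prefixes and the example TURL are illustrative assumptions, not the actual LHCb code.

    # Hypothetical sketch of rewriting a TURL before passing it to ROOT.
    # The protocol prefixes used here are illustrative assumptions.

    def turl_for_root(turl):
        """Rewrite an SRM-returned TURL into a form ROOT can open."""
        if turl.startswith("gsidcap://"):
            # Assume ROOT's dCache plugin is configured for plain dcap access.
            return "dcap://" + turl[len("gsidcap://"):]
        if turl.startswith("rfio://"):
            # Some ROOT versions expect "rfio:" followed by a plain path.
            return turl.replace("rfio://", "rfio:", 1)
        return turl

    print(turl_for_root("gsidcap://door.example.org:22128/pnfs/example.org/lhcb/file.dst"))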
Issue 2) glexec on the Worker Nodes
LHCb are keen to exploit glexec
- To control resources for analysis
- To optimize ‘filling’ of the computing slot
- And to help reduce the load on the LCG RB etc
Testing is currently ongoing with IN2P3:
- An “ad-hoc” glexec was tested successfully and is working on the WNs there
- The LHCb DIRAC WMS is ready to make use of this solution
C.Grandi asked for more information on what the “glexec” that has been deployed at IN2P3 actually is.
F.Hernandez replied that it is a script that logs the user ID of the submitter of the real LHCb job, without changing the UID of the job executed. It is called “glexec” in order to be a placeholder for the future implementation, but it does not have all the glexec functionalities.
Ph.Charpentier added that it is sufficient to have a solution that logs who is executing jobs while it is the LHCb WMS that does the prioritization of the jobs.
C.Grandi commented that this solution is equivalent to using “glexec in null mode”, the difference being that the log information would be stored in an area not modifiable/editable by the pilot and user jobs.
I.Bird added that the “null mode” option, which does not switch the real UID of the job, is not accepted by some 50% of the sites, and consensus has not yet been reached on how glexec should behave with respect to the UID of the jobs.
F.Hernandez replied that most of the batch systems at sites cannot in any case handle the switching of the UID, which is why the logging has been implemented with that ad-hoc solution.
I.Bird proposed that the solution with “standard logging with auditing but without user-switching” be brought to the GDB in order to collect the opinions of all Tier-1 sites.
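For illustration, a minimal sketch of such a logging-only wrapper follows, under several assumptions: the pilot passes the payload owner's proxy and command on the command line, the log file lives in an area the pilot cannot modify, and the standard voms-proxy-info client is available on the WN. This is a sketch of the "null mode" idea, not the IN2P3 script itself.

    #!/usr/bin/env python
    # Minimal sketch of a "null mode" glexec-like wrapper: it records who
    # owns the payload but does NOT switch the UID. The log path, argument
    # convention and use of voms-proxy-info are illustrative assumptions.

    import os
    import subprocess
    import sys
    import time

    LOG = "/var/log/glexec-null/audit.log"  # area not writable by pilot/user jobs

    def payload_identity(proxy):
        # Extract the owner DN from the payload proxy with the standard
        # VOMS client tool.
        out = subprocess.Popen(["voms-proxy-info", "-file", proxy, "-identity"],
                               stdout=subprocess.PIPE,
                               universal_newlines=True).communicate()[0]
        return out.strip()

    def main():
        proxy = sys.argv[1]        # proxy certificate of the payload owner
        command = sys.argv[2:]     # the real (e.g. LHCb) job to execute
        record = "%s uid=%d dn=%s cmd=%s\n" % (
            time.strftime("%Y-%m-%dT%H:%M:%S"), os.getuid(),
            payload_identity(proxy), " ".join(command))
        open(LOG, "a").write(record)      # audit trail for the site
        os.execvp(command[0], command)    # run the payload under the pilot's UID

    if __name__ == "__main__":
        main()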
Concerns:
- Will glexec be certified in time? Delays are also on the LHCb side.
- Will it be part of the deployed gLite release? Will any test site be available to LHCb?
- What are the possibilities to …
C.Grandi recognized that using LCAS/LCMAPS with a local file is not a valid deployment model; if this is a required service, some work will be needed to find a centrally-supported solution.
R.Pordes added that OSG will deploy “glexec” services and is interested in participating in providing LCMAPS as a centrally-supported service.
E.Laure replied that this work is actually already being done between EGEE and OSG, independently of the deployment decisions of LCG (a dedicated working group already exists).
Issue 3) Storage
Instabilities were observed while sites were upgrading their storage implementations.
L.Robertson asked whether now the situation has improved.
N.Brook replied that many sites are more stable, while on others reliability has not improved.
SRM v2.2 deployment/testing:
- Concern about the late deployment of SRM 2.2 with respect to real data taking.
Interfaces to SRM 2.2 (lcg-utils/GFAL):
- Currently testing the python binding to lcg-utils/GFAL (see the sketch after this list).
- Bulk operations are not clearly supported, for example file removal and metadata queries.
I.Bird recalled that the existing ATLAS proposal for bulk operations should also be taken into account in these discussions.
- Currently there is no support for file pinning in GFAL or lcg-utils; support for “bring_online” in SRM 2.2 is needed.
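As an illustration of the client side being tested, the sketch below drives the lcg-cp command-line tool (which the python bindings wrap) from Python, with a simple retry loop. lcg-cp and its --vo option are standard lcg-utils; the SURL, the VO value and the retry policy are illustrative assumptions.

    # Sketch of driving the lcg-utils command-line tools from Python.
    # The SURL and retry policy below are illustrative assumptions.

    import subprocess

    def lcg_cp(src_surl, dest, vo="lhcb", retries=3):
        """Copy a file with lcg-cp, retrying on transient failures."""
        for attempt in range(retries):
            rc = subprocess.call(["lcg-cp", "--vo", vo, src_surl, dest])
            if rc == 0:
                return True
        return False

    ok = lcg_cp(
        "srm://srm.example.org/castor/example.org/grid/lhcb/data/file.dst",
        "file:/tmp/file.dst")
    print("transfer succeeded" if ok else "transfer failed")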
Issue 4) Deployment & Release Procedure
Use of the LCG AA process: early client exposure.
Issues associated with early exposure to the VO in parallel to later certification:
- Useful to allow the VO to test in the production environment; e.g. LHCb is still not using the gLite RB in production.
- The lack of central automatic propagation to sites is causing these problems.
C.Grandi confirmed that user roles and policies are not propagated to the sites, and this would require the use of a dedicated policy management service (i.e. Gpbox).
I.Bird replied that for now there are no plans to certify Gpbox, at least not until the current components and the SL4 migration are completed.
Issue 5) Information System
Consistency of information published by different top-level BDIIs (FCR or not FCR, order of various blocks, different pools of sites published)
- It prevents a real load-balanced service spread among different sites.
Latency of the information propagation
- Causing flooding of the CEs (the free slots published do not reflect the real situation in the LRMS).
- The effect is amplified by the use of multiple RBs: PIC was brought to a halt.
- The introduction of VOViews helps quite a lot.
Instability/scalability of the system
- lcg-utils calls failing just because a (working) SE was not published.
- CEs appearing/disappearing from the BDII (while they were working properly).
I.Bird noted that a proposed solution to this problem is to remove the information provider from the CE; these are typical cases of misconfiguration (see L.Field's presentation at the last GDB).
Content of the information
- Disk space consumption and space left published by SRM
- Granularity of the information.
- OS and platform should be advertised in a coherent way (SL, SL4, SLC4, Scientific Linux 4, etc.); see the sketch below.
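As an illustration of the coherence problem just mentioned, the sketch below maps the differently advertised platform strings to one canonical name. The alias table is a hypothetical example, not an agreed WLCG convention.

    # Hypothetical sketch of normalizing published OS/platform strings;
    # the alias table is illustrative only.

    OS_ALIASES = {
        "sl4": "Scientific Linux 4",
        "slc4": "Scientific Linux 4",
        "scientific linux 4": "Scientific Linux 4",
        "sl3": "Scientific Linux 3",
    }

    def normalize_os(published):
        key = published.strip().lower()
        return OS_ALIASES.get(key, published)

    for raw in ("SL4", "SLC4", "Scientific Linux 4"):
        print("%s -> %s" % (raw, normalize_os(raw)))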
Splitting (pseudo-)static information from dynamic information would be beneficial:
- The amount of information shipped over the network (and hence the latency) could be reduced.
- The stability of the system would improve, and it would allow more reliable access to the static information required by the DM clients.
I.Bird confirmed that this is a top priority among the improvements planned for the next version of the Information System.
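To make the latency and consistency issues concrete, the sketch below queries a top-level BDII for the free-slot figures the CEs publish. The BDII is an LDAP server (conventionally port 2170, base "o=grid") serving the GLUE schema, and GlueCEStateFreeCPUs is the dynamic attribute whose staleness is discussed above; the host name is an illustrative assumption.

    # Sketch of reading the free slots each CE publishes in a top-level BDII.

    import subprocess

    def free_cpus(bdii_host):
        """Return {GlueCEUniqueID: GlueCEStateFreeCPUs} from a top-level BDII."""
        out = subprocess.Popen(
            ["ldapsearch", "-x", "-LLL", "-h", bdii_host, "-p", "2170",
             "-b", "o=grid", "(objectClass=GlueCE)",
             "GlueCEUniqueID", "GlueCEStateFreeCPUs"],
            stdout=subprocess.PIPE, universal_newlines=True).communicate()[0]
        result, ce = {}, None
        for line in out.splitlines():
            if line.startswith("GlueCEUniqueID:"):
                ce = line.split(":", 1)[1].strip()
            elif line.startswith("GlueCEStateFreeCPUs:") and ce:
                result[ce] = int(line.split(":", 1)[1].strip())
        return result

    print(free_cpus("bdii.example.org"))  # hypothetical host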
These are issues that could be discussed at future MB or GDB meetings:
- Stability of shared software
I.Bird noted that this issue should be reported and fixed on a site-by-site basis.
- Authorisation/ACL issues surrounding data storage
- VOMS storage of LHCb
L.Robertson concluded that, now that all experiments have presented their concerns, a summary of the Top 5 Issues from all the LHC experiments, and the proposed follow-up actions, will be presented and discussed at the next GDB in May.
4. March 2007 Accounting (and Automatic Accounting) L.Robertson
The March 2007 Accounting was produced by automatically extracting what has been reported in Harry's tables and in the APEL repository. Much of the information is still missing, and this will have to improve for next month.
Many sites replied that the information extracted from the APEL repository is incorrect. This will require further investigation by the APEL accounting team. J.Gordon will report on its progress.
The proposal for providing and extracting the Accounting data is that:
- On the 2nd of each month the sites will be reminded to update the data in the APEL repository and in the Mid-Term Planning tables (Harry’s tables).
- On the 8th of each month the information is extracted from APEL and from the Mid-Term Planning tables and the Accounting report distributed.
- Sites should then check the accounting values extracted and report any inconsistencies within a few days.
Automation of the accounting process is urgent because, when it is extended to the Tier-2 sites, it cannot still be done manually (as it has been until now for the Tier-1 sites).
Pre-filled cells - The pre-filled cells of the accounting sheets are not editable, but several sites have asked to be able to update that data. Editable cells, next to the non-modifiable values, will allow sites to correct inconsistencies without erasing the incorrect values.
Grid and non-grid usage – The APEL repository does not differentiate between grid and non-grid usage. Therefore at present there will be a single “CPU usage” value to specify the total CPU usage.
J.Gordon noted that the APEL sensor supports only work submitted through a CE; there is no general mechanism for reporting non-grid work.
L.Robertson replied that it was agreed that non-grid work should nevertheless be reported to the APEL system using the direct interface (as is presently done by several sites). Alternatively, of course, sites could provide access to resources exclusively via the CE.
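For illustration only, the sketch below shows the single "CPU usage" total that such combined reporting implies, summing grid records (as collected from the CE) with non-grid records (e.g. from local batch logs). The record format and numbers are invented; this is not the APEL schema.

    # Purely hypothetical sketch: sum grid and non-grid CPU usage into one
    # per-site, per-month total, as a single "CPU usage" value implies.

    grid_records = [                 # e.g. parsed from CE/APEL data
        {"site": "RAL", "month": "2007-03", "cpu_ksi2k_hours": 410.0},
        {"site": "RAL", "month": "2007-03", "cpu_ksi2k_hours": 95.5},
    ]
    nongrid_records = [              # e.g. parsed from local batch logs
        {"site": "RAL", "month": "2007-03", "cpu_ksi2k_hours": 120.0},
    ]

    def total_cpu(records):
        totals = {}
        for r in records:
            key = (r["site"], r["month"])
            totals[key] = totals.get(key, 0.0) + r["cpu_ksi2k_hours"]
        return totals

    combined = total_cpu(grid_records + nongrid_records)
    for (site, month), cpu in sorted(combined.items()):
        print("%s %s: %.1f KSI2K-hours" % (site, month, cpu))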
The MB noted that it was waiting for a formal proposal, to be made by the APEL team via J.Gordon, on how to represent non-grid usage in the APEL repository.
15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.
5. AOB
No meeting next week (1st of May).
6. Summary of New Actions
15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.
The full Action List, current and past items, will be in this wiki page before next MB meeting.