LCG Management Board
Tuesday 10 April 2007 - 16:00-17:00
(Version 1 – 14.4.2007)
A.Aimar (notes), D.Barberis, F.Carminati, J.Coles, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, D.Foster, F.Hernandez, M.Kasemann, G.Merino, Di Qing, H.Renshall, L.Robertson (chair), R.Tafirout, J.Templon
Mailing List Archive:
Tuesday 17 April 2007 – 16:00-17:00
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were distributed after this MB meeting (see the MB Minutes page).
Apologies. Comments will be discussed at the next MB meeting.
1.2 Reminder 2007Q1 QR reports (QR_2007Q1.zip)
The reports were distributed on 2 April 2007 and should be completed and sent back to A.Aimar by Thursday 12 April 2007.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
Done. Here are the Targets for 2007 (all 4 Experiments) (Experiments Targets 2007_070410.ppt).
Maybe to be discussed in a future MB.
As previously agreed, the new process for accounting started for March 2007. Some of the data is collected from the accounting database and from the resource planning table maintained by H.Renshall. The sites should have received a partially pre-filled form for the site accounting.
The pre-filled parts are not editable because the sites should have already filled the database or sent their data to H.Renshall. The data was, as agreed, extracted one full week after the end of the month to allow time for checking and corrections to be made.
J.Templon asked how their data, reported separately from SARA and NIKHEF in the accounting database, is summed up in the report.
L.Robertson investigated after the meeting and established that the accounting database takes care of summing the data for the three sites: “SARA-NIKHEF = SARA MATRIX + SARA LISA + NIKHEF ELPROD”.
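The summing rule above can be illustrated with a minimal sketch. The mapping table, the function name and the CPU figures are hypothetical and for illustration only; they do not reflect the actual schema of the accounting database or real usage data.

```python
from collections import defaultdict

# Hypothetical mapping from site names in the accounting database to the
# name used in the report. Only the SARA-NIKHEF composition is taken from
# the minutes; everything else here is an assumption.
REPORTED_AS = {
    "SARA MATRIX": "SARA-NIKHEF",
    "SARA LISA": "SARA-NIKHEF",
    "NIKHEF ELPROD": "SARA-NIKHEF",
}


def sum_by_reported_site(records):
    """Sum (site, cpu_hours) records under the name used in the report."""
    totals = defaultdict(float)
    for site, cpu_hours in records:
        # Sites without a mapping entry are reported under their own name.
        totals[REPORTED_AS.get(site, site)] += cpu_hours
    return dict(totals)


# Made-up figures, not real accounting data.
example_records = [
    ("SARA MATRIX", 100.0),
    ("SARA LISA", 50.0),
    ("NIKHEF ELPROD", 250.0),
    ("OTHER SITE", 10.0),
]
totals = sum_by_reported_site(example_records)
```

With these made-up records the three entries are reported as a single SARA-NIKHEF total, while the unmapped site keeps its own line.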
F.Hernandez added that resources are allocated “on-demand” to the VOs and therefore the information given to H.Renshall is not always up to date.
L.Robertson replied that it was agreed that once a month each site updates the accounting database and provides the mid-term resources data to H.Renshall, at the latest during the first week of the following month.
L.Robertson proposed that a reminder be sent to all sites at the beginning of each month asking them to update the installed and allocated resources at the site.
F.Hernandez asked how sites can now specify “grid vs local/non-grid CPU usage” of their site resources.
L.Robertson replied that the MB had agreed to report total CPU usage. This is a temporary measure until the GOC database takes the difference between grid and non-grid usage into account.
G.Merino asked how “installed vs allocated” resources should be reported.
L.Robertson replied that, as agreed, each site should report their total resources installed and the allocations to each VO. In some cases the sum of allocated resources can be different from the resources actually installed.
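As a rough illustration of the point that allocations need not add up to the installed total, the following sketch compares the two figures for one site. The function name, the VO allocations and the capacity values are hypothetical, and the units are left unspecified.

```python
def allocation_summary(installed, allocations):
    """Compare a site's installed capacity with the sum of its VO allocations.

    installed   -- total installed capacity (hypothetical units)
    allocations -- dict mapping VO name to allocated capacity
    """
    allocated = sum(allocations.values())
    return {
        "installed": installed,
        "allocated": allocated,
        # Positive: more allocated than installed; negative: spare capacity.
        "difference": allocated - installed,
    }


# Made-up figures for illustration only.
summary = allocation_summary(
    installed=1000,
    allocations={"ALICE": 300, "ATLAS": 400, "CMS": 250, "LHCb": 150},
)
```

Here the allocations sum to 1100 against 1000 installed, so the difference of 100 would be reported rather than treated as an error.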
4. ALICE Top 5 Issues/Concerns (Slides) - F.Carminati
F.Carminati presented a summary of the Top 5 issues and concerns for ALICE.
1) Interface xrootd-CASTOR2
- Essential for data access in our computing model / offline infrastructure at T0 and some T1s
- Prototype “close to working” since some time
- We really need now to have it working to be tested meaningfully before data taking
- This in addition to fixing CASTOR problems
- In particular unpredictable latency of files recall from tape
- A “recall estimate” has been requested at both CASTOR2 reviews
F.Carminati said that the initial implementation by A.Hanushevsky does not work and should be fixed.
2) Interface xrootd-DPM
- Important for data access in our computing model / offline infrastructure at T2s
- Prototype “close to working” since some time
- It was promised to us in September 2006
- We could go xrootd-only in T2s, but we would like to use the “standard” DPM installations
The situation is similar to the one with CASTOR2; the initial prototype needs to be finalized.
The following mail was sent after the meeting (13 April 07) by Ian Bird, clarifying the situation.
I was not at the LCG MB last week and so could not respond to the question on support for xroot in DPM, so let me clarify the situation as I understand it now.
The DPM xroot software has been provided to ALICE who have built and run their own tests on it. At the MB in September last year we estimated that we could provide a first prototype in mid-October. It was delivered initially at the beginning of November – very close to our estimate, and subsequently some changes were made during December and January in response to issues raised by
It has been distributed by
On April 3 Andreas reported the first problem (see below). This is understood and is being addressed by David (delayed by David's work for Castor and Easter vacations). No other problems have been reported.
The reported bug was due to the fact that the plugin was compiled for DPM 1.5.10 and used with DPM 1.6.3. This problem would not happen if we were building the xrootd plugin ourselves and distributing it as part of a DPM release. This is certainly an option, but I understand that at the moment
As far as I am aware, at the moment there are no outstanding issues of functionality or performance. I can assure you that we will support the xroot implementation in DPM, and that we will continue to work with ALICE and Andy H. to understand how best to build and distribute the plugin.
dCache with xrootd is tested and is already being deployed, with assurances that maintenance support will continue.
Slide 5 shows the
- Stability and resilience of FTS and its information system
- Including all underlying components (storage, DB, etc)
- Establishment of FTS as a real “service” to experiments
- More proactive problem follow up from service providers as opposed to experiments tracking problems
- Essential for Tier-0 to/from Tier-1 transfers
- We are not planning on it for Tier-1 to/from Tier-2
F.Carminati replied that another option could be to use xrootd for the transfers on WAN.
asked whether this implies that
replied that the Tier-1 sites are not expected to provide such service for
L.Dell’Agnello asked why FTS, which is already installed, should not also be used for Tier-1 to/from Tier-2 transfers.
T.Doyle stated that all Tier-1 and Tier-2 sites already have FTS available for network transfers, and that any other arrangement would be an overhead for which the sites do not have resources.
noted that the only agreed service is FTS, and underlined the risk of using an ad-hoc solution for which help in debugging will not be available. Given the constraints on human resources, sites will only be able to support and have experience in the FTS service. We have already seen last year at CERN that network problems experienced by LHCb during direct gridftp transfers were not understood until they occurred also with the FTS service.
4) gLite WMS
- Improve speed of submission and efficiency of the RB information system(s)
- Reduce time between “glite-job-submit” and the moment the job is registered in the local batch system
- High hopes that gLite RB+CE with bulk submission is the solution to this issue
- Job status information sources (RB L&B, IS, CE GRIS not in sync and sometimes providing conflicting information)
- Need to test gLiteRB+CE as soon as possible
- Deployment of the automatic proxy renewal mechanism with VOMS
- Critical for operation in view of the deployment of gLite RB+CE
The automatic proxy renewal is crucial for being able to run long jobs.
5. GDB Summary (document) – J.Coles
J.Coles presented a summary of the GDB
Most of the text below is taken from the attached document.
5.1 Introduction (SAM tests)
Future GDB meeting topics were requested. Some suggestions were received: Monitoring, aggregate top 5 experiment issues, WMS, gLite CE, T1-T2 interactions and testing, new FCR features and accounting policy approval.
Issues with the SAM availability tests were raised which indicated that a SAM BDII was actually responsible for 2-10% of the loss of availability seen at some
Quattor: The group now has 5 main active contributors with M.Jouvin now taking over the lead from C.Loomis. Quattor has been quite successful with take-up in over 40 sites. Two issues for follow-up were raised.
- Update to the group’s mandate to cover dissemination?
- The group have relied on configuration options of service components listed in an xml file which is no longer to be supported. A request for other documentation options is therefore to be taken to the TCG.
File systems: A HEPiX group was set up to look at file systems following an autumn 2006 request from the IHEPCCC to report on distributed file systems (what is to follow AFS?). The group's mandate is to understand how storage is accessed and used across sites and to review existing solutions (including price/performance).
A questionnaire on this has already been sent to T1 sites (http://hepix.caspur.it/storage/questionnaire1.php). The first group report will be at HEPiX in Spring 2007. It was thought that there should be more experiment input to this working group.
SL4 status: L.Field gave an update from SA3. The iterative Build-Install-Test cycle was described. The status of the components was listed as: WN – released 2nd April onto PPS. UI has been tested by Integration & Certification and has 4 packaging problems, 15 configuration problems and 4 runtime problems. The WMS is in testing – there is work being done to address packaging issues. The LB, MON, CE and BDII are “ready to test”.
Castor status: T.Cass reviewed the status of Castor. He showed that performance had been demonstrated in many areas but there are two significant software weaknesses: limitations in the scheduling of requests and support for the “disk1” storage class. The first is being addressed with an LSF plug-in (stress testing to start next week) and the problems underlying the second are understood but this is not currently a priority area. It was also noted that there is an inadequate stager hardware platform at CERN (new hardware is being deployed now) and the software build and release process is complex (there needs to be a bug fix release and a separate release with new functionality). ATLAS problems are currently being addressed. Demonstrating support for mixed experiment loads is a high priority.
Security policies & updates: D.Kelsey gave the update. A new Grid site operations policy (iterations presented at previous meetings) is to be approved in the next month pending suitable coverage on an area recently raised – Intellectual Property Rights. Several other policy documents are under development and require feedback: Grid Security Policy and the Logged Information Policy. Other areas under development: The Audit Policy and a new VO Policy document. The request from the meeting was for the GDB to approve the Grid Site Operations Policy (V1.3) when Dave emails the list shortly.
Job priorities: J.Templon gave an update on the implementation of the Job Priorities work at the T0 and T1s – this was coupled with an overview of why VOViews were required (in particular to move away from VO-specific queues). Most sites are now publishing though some specific questions remain against 4 sites (see slide 3 of the talk). During the meeting it was noted that in the short-term there will be a need for both VOViews and VO specific queues.
LCG planning: H.Renshall showed the new planning spreadsheet with installed capacity figures now being used in place of available capacity. ATLAS noted that the figures presented in the spreadsheets were off by 3 months – H.Renshall agreed to update the slides and spreadsheet. There were questions surrounding the need for mid-term updates – probably these are needed until April 2008.
Storage accounting: G.Cowan from GridPP presented recent work of the GridPP storage group covering such areas as optimisation tests on the WAN and LAN for T2 sites, more information on storage availability tests and issues with them. He also spent some time talking about the implementation of storage accounting and current issues with double/triple accounting depending on how disk is allocated at sites.
VOMS coordination: D.Kelsey gave this talk on behalf of John Gordon. He wanted to propose a revised mandate for the VOMS group. Some alternative suggestions for rewording point 2 on slide 3 were given. A revised mandate will be circulated when J.Gordon returns.
6.1 SAM Reliability for March and the OPN problems
Because of the OPN issues in March, there were a few days on which the SAM tests failed for most sites even though the service at the site was running. The proposal is to remove these days from the reliability calculation.
The MB agreed to remove the days when the OPN was not working from the site reliability statistics for March.
J.Templon added that the overall reliability also includes transfers and network failures, but he agreed that for “site reliability” these days should not be counted.
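The effect of removing the OPN outage days can be sketched as follows. This is a simplification that assumes one pass/fail result per day; the function name, the day numbers and the test outcomes are hypothetical and do not represent the actual SAM algorithm or real March data.

```python
def site_reliability(daily_ok, excluded_days=frozenset()):
    """Fraction of counted days on which the site passed its tests.

    daily_ok      -- dict mapping day of month to True (passed) or False (failed)
    excluded_days -- days to leave out, e.g. days of a network outage
    Returns None if no days remain after exclusion.
    """
    counted = [ok for day, ok in daily_ok.items() if day not in excluded_days]
    if not counted:
        return None
    return sum(counted) / len(counted)


# Made-up example: ten days, failures on days 3 and 4 (assumed to be
# outage days) and one genuine site failure on day 7.
daily_ok = {day: day not in (3, 4, 7) for day in range(1, 11)}

raw = site_reliability(daily_ok)
adjusted = site_reliability(daily_ok, excluded_days={3, 4})
```

In this example the raw figure is 7 passed days out of 10, while the adjusted figure is 7 out of the 8 days that remain once the outage days are removed.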
7. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.