LCG Management Board

Date/Time:

Tuesday 17 October 2006 at 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=a063270

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 19.10.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, T.Cass, Di Quing, B.Gibbard, J.Gordon, D.Foster, F.Hernandez, J.Knobloch, H.Marten, P.Mato, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 24 October from 16:00 to 17:00, CERN time

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of the previous meetings (3 and 10 October 2006) were approved without changes.

1.2         Update on the SAM availability and reliability tests (Slides) – Ian Bird

I.Bird presented an update on the SAM activities, which aim to improve the tests and to incorporate the advice volunteered by some sites.

1.2.1          General Points from Last Week

H.Marten and J.Templon discussed with the SAM developers (outside the MB meeting) the differences between the total SAM availability and the GridView individual site graphs. Work to study and resolve those differences is ongoing.

 

They also requested that SAM produce extended tables of the individual tests at the sites. This is already in the SAM team work plan.

1.2.2          BNL Issues

B.Gibbard had sent a list of issues with SAM usage at BNL, but it seems that almost all of them have already been resolved in SAM.

 

The current BNL status is that the failure is due to the “job-listmatch” error. This probably means that the “ops” VO is not supported at the site. The SRM/CE tests have all passed successfully.

1.2.3          SAM and Tests Status

First Priority:

-          Debugging & improving display/presentation

-          Implementation by VO (including the VO tests)

-          Existing site service tests: make them reliable (e.g. a test should not run if a dependency fails) and more verbose (details of what happened)

-          Add missing tests – can some service testing be more complete? Ideas and advice are welcome.
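
The request that a site service test "should not run if dependency fails" and should report "details of what happened" can be illustrated with a minimal sketch. This is not SAM code: the class, the function and the test names (ce-job-submit, wn-software, srm-put) are hypothetical and chosen only for illustration, and the sketch assumes that tests are listed after the tests they depend on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SiteTest:
    """Hypothetical description of one site service test."""
    name: str
    check: Callable[[], Tuple[bool, str]]  # returns (passed, details)
    depends_on: List[str] = field(default_factory=list)


def run_tests(tests: List[SiteTest]) -> Dict[str, Tuple[str, str]]:
    """Run tests in list order; skip a test if any dependency is not 'ok'."""
    results: Dict[str, Tuple[str, str]] = {}
    for test in tests:
        # A dependency that failed, was skipped, or never ran blocks this test.
        blocked = [d for d in test.depends_on
                   if results.get(d, ("missing", ""))[0] != "ok"]
        if blocked:
            results[test.name] = ("skipped",
                                  "dependency not ok: " + ", ".join(blocked))
            continue
        passed, details = test.check()
        results[test.name] = ("ok" if passed else "failed", details)
    return results


# Illustrative checks with canned outcomes (assumptions, not real probes).
tests = [
    SiteTest("ce-job-submit", lambda: (False, "job-listmatch found no resources")),
    SiteTest("wn-software", lambda: (True, "ran"), depends_on=["ce-job-submit"]),
    SiteTest("srm-put", lambda: (True, "file written and read back")),
]
results = run_tests(tests)
for name, (status, details) in results.items():
    print(f"{name}: {status} ({details})")
```

In this sketch the dependent test is reported as skipped rather than failed, so a single upstream failure is not counted several times, and every result carries a short explanation.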

 

Second priority – Missing sensors:

-          gLite WMS – work by EUMedGrid & PPS

-          VOMS – INFN, but this has been very slow (is there any progress?)

-          R-GMA registry – something basic from RAL exists, but not integrated in SAM

-          myProxy – missing tests

 

There is much room for collaboration between sites and the SAM team. Constructive feedback is always welcome, especially if it also includes suggestions for new tests or ideas on how to improve the current tests.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 6 October - B.Panzer distributes to the MB a document on “where disk caches are needed in a Tier-1 site”, with everything included (buffers for tapes, network transfers, etc.).

 

Postponed to end October.

 

  • 10 Oct 2006 - The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for installations and tests of their 3D databases.

 

Ongoing. This was discussed later during the meeting and additional information will be requested during the review of the QR reports.

 

  • 13 Oct 2006 - Experiments should send to H.Renshall their resource requirements and work plans at all Tier-1 sites (cpu, disk, tape, network in and out, type of work) covering at least 2007Q1 and 2007Q2.

 

Ongoing. H.Renshall is contacting the experiments in order to obtain information about resource requirements.

 

3.      Review Items to follow (in the MB Action List)

 

-          End-points for ALICE at RAL

-          CMS transfer rate to RAL

-          NDGF Status and requests from ALICE and ATLAS

-          Transfers of Job Log back to CERN

-          Data Transfers from the WN to CERN (LHCb)

 

These were critical, long-term outstanding actions that had to be followed up, even if not as part of the MB action list, to ensure that they are finally resolved.

 

J.Shiers noted that some of the items were also in the SC4 action list and had already been solved some time ago.

 

End-points for ALICE at RAL

DONE in July

 

CMS transfer rate to RAL

DONE in the summer

 

NDGF Status and requests from ALICE and ATLAS

Still outstanding.

 

Transfers of Job Log back to CERN

General issue not assigned to anyone.

 

Data Transfers from the WN to CERN (LHCb)

Solved in August.

 

4.      Summary of the Service Status and Experiments Activities (ATLAS slides, list of actions; Paper)

As agreed in the MB of 30 May, a combined action list was set up and reviewed weekly through the LCG ECM. The state of this list was regularly reported to the MB, and covered in the paper (see attachment) submitted to the LHCC comprehensive review.

 

4.1         Service Status Report – J.Shiers

J.Shiers presented the latest service report, available here: service-report-Oct17.pdf

 

The outstanding issues this week were:

-          SARA stability. After the GDB, J.Templon discussed the issue with J-Ph.Baud and J.Casey, and a solution is understood.

-          LFC 1.5.10 is released and being tested on pre-production. The installation of the server still needs to be agreed.

-          The SARA-NIKHEF network issues were solved some time ago, and yesterday the dCache version with the fixes was released. During the FTS workshop SARA and the SC team will agree when to proceed with the installation.

 

On page 2, the FTS report now clearly shows reliability by channel and by VO. The failing transfers appear in red, so one can select the ones to investigate and fix. P.Badino will give a tutorial at the FTS workshop on how to use the tool.

 

In the last couple of weeks the problems have been solved, and the process for resolving problems is now well defined.

 

R.Tafirout asked whether these reports are weekly documents and where they are all available.

J.Shiers replied that they are weekly documents and pointed out that all reports (only two for now, because the series has just started) will be on the SC4 wiki page: https://twiki.cern.ch/twiki/bin/view/LCG/WeeklyServiceReports. (Click “show attachments”).

 

O.Smirnova asked why NDGF is marked as not in production for any of the experiments.

J.Shiers explained that NDGF is not participating in any SC4 activity and therefore is not a production service site. It would actually be useful if NDGF participated in the SC, because the SC experience is an important step towards LCG service operations.

4.2         ATLAS Tier-0 Tier-1 Data Transfers – D.Barberis

D.Barberis presented the status of the data transfer tests being performed by ATLAS.

 

Slide 2 shows the nominal rates in the new ATLAS model. The total is 1.060 GB/s out of CERN, of which 320 MB/s has to go to tape (RAW data).

 

Slide 3 shows that the average rate achieved is 300 MB/s, which is about 3.5 times lower than the nominal rate.
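
The factor quoted above can be checked with a short calculation. This is only a sketch: it assumes that the 1.060 GB/s nominal rate corresponds to 1060 MB/s, and the function name is illustrative.

```python
# Rates quoted in the minutes, in MB/s (1.060 GB/s taken as 1060 MB/s).
NOMINAL_MB_S = 1060.0
ACHIEVED_MB_S = 300.0


def shortfall_factor(nominal: float, achieved: float) -> float:
    """Return how many times lower the achieved rate is than the nominal one."""
    if achieved <= 0:
        raise ValueError("achieved rate must be positive")
    return nominal / achieved


factor = shortfall_factor(NOMINAL_MB_S, ACHIEVED_MB_S)
print(f"{factor:.2f}")  # 3.53
```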

 

L.Robertson asked what the target of the tests actually was.

D.Barberis replied that it was the old nominal rate of 720 MB/s, and that ATLAS had reached 600 MB/s in July. Now, not counting NDGF and RAL (the former is not in production and the latter has reduced disk storage capacity), one should still reach 650 MB/s.

 

Looking at the graph on slide 3 one can see that all sites can reach their nominal rate but they do not sustain it for long (except two or three sites) and, therefore, the total never reaches the 600 MB/s value.

 

Slide 4 shows the problems encountered:

-          The tests involved transferring many small datasets onto DQ2 (testing different luminosity blocks), leading to more FTS jobs (an FTS job is a (sub)set of an ATLAS dataset)

-          Simultaneous use of the FTS server @ CERN by other experiments created additional problems

-          Many FTS contacts failed

-          Submitting new FTS jobs, polling FTS jobs, etc

-          Due to the increased load on the central servers

-          (FTS developers informed, and a notification service, as opposed to polling, requested)

 

J.Shiers said that he thought that the conclusions on the reasons for the problems were not supported by the evidence and that there should be a systematic investigation. Specific sites and services should not be singled out unless it is certain that they are responsible for the problems, and so responsible for fixing them.

 

-          DQ2 had to become considerably more robust in handling these failures
Some were particularly nasty (memory leaks, sockets left hanging, etc.)

 

-          Experimenting with ‘atldata’ (CERN disk-only area) caused several new problems
Disk-servers full/disabled can bring CASTOR to a halt.
Difficult to monitor ‘available’ disk space.
Using ‘atldata’ has proven to be an invaluable experience!

 

Some problems were also caused by unclear communication about “available” vs “accessible” disk-only space; this will be followed up at the Operations meeting.

 

T.Cass clarified that the problem of a full pool blocking the system is now solved, but the release with the solution is not yet installed on all stagers and sites (e.g. RAL has not upgraded yet), in order to keep the environment as stable as possible.

 

Slide 6 shows the number of pending jobs in a queue. ATLAS makes sure that the queue is never empty so that the system stays busy. The graph does not highlight any particular issue.

 

L.Robertson asked what is done to understand the problems, and whether the coordination meetings are used. D.Barberis replied positively.

 

L.Robertson asked whether the sites knew the level that they had to reach this time. D.Barberis confirmed that they did.

 

L.Robertson said that it would be better to have targets that are achievable rather than simply using the 2008 targets while some sites are not taking part and others do not have sufficient resources. Otherwise the overall test is bound to fail.

 

H.Marten said, about slide 5, that the sites did not know when an experiment wanted to do heavy transfers, and that, if they are informed in advance, they can monitor more carefully and better understand possible issues.

 

L.Robertson said that the claim that the problem lies in the FTS service is not yet established and that these problems need to be investigated in more detail when they occur. A service should be considered responsible only when there is evidence that the service (or site or VO) is the cause of the issue.

 

R.Tafirout noted that the fact that CMS managed to increase their rate quickly while ATLAS stayed constant (and low) is not necessarily due to the site, the FTS service or the VO software. It could also be due to a combination of some of those effects.

 

J.Shiers said that the activity of the experiment, and the problems encountered, should be communicated immediately to the SC Team. It is important (and easier) to investigate problems quickly while they occur, not afterwards.

 

 

5.      3D Phase 2 Sites - Update from the Sites - NDGF, PIC, SARA, TRIUMF

The 3D Phase 2 sites should specify milestones for installing and commissioning their 3D installations. TRIUMF and PIC did so. NDGF and SARA-NIKHEF should specify the milestones for hardware, software and operations.

 

L.Robertson reminded the MB that the 3D Phase 2 sites should put their plans for deploying the 3D service in the QR for 2006Q3.

NDGF and SARA-NIKHEF should complete their information on their 3D setup showing how they will get into production.

5.1         NDGF

O.Smirnova said that there are resources for procuring the hardware needed by 3D and the purchase process is ongoing. The hardware should arrive in early November. The software will be installed quite quickly because the license and skilled manpower are available. So a few weeks after that the 3D service should be operational.

 

NDGF should send (to A.Aimar) a more detailed list of milestones for the 3D project installation and operations in the Quarterly Report.

5.2         SARA-NIKHEF

K.Bos reported that the hardware is available and being installed. The DBA was on vacation and the licenses are missing. Installation should be done by end of October.

 

SARA-NIKHEF should send (to A.Aimar) a more detailed list of milestones for the 3D project installation and operations in the Quarterly Report.

 

 

6.      AOB

 

 

No AOB

 

 

7.      Summary of New Actions

 

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.