LCG Management Board

Date/Time:

Tuesday 2 October 2007 16:00-17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=18004

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 5.10.2007)

Participants:

A.Aimar (notes), D.Barberis, L.Betev, I.Bird, T.Cass, L.Dell’Agnello, T.Doyle, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, U.Marconi, H.Marten, P.Mato, G.Merino, G.Poulard, Di Qing, L.Robertson, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 9 October 2007 16:00-18:00 – F2F Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved.

1.2      Approval of the New GridView Algorithm (New; Old)

The new and old GridView reliability algorithms had been available for testing, at separate addresses, for the past two weeks. The MB was asked to compare the two algorithms in order to approve the switch by the end of September.

 

Decision:

The MB approved the new algorithm and the usual GridView URL (http://gridview.cern.ch) will point to the new implementation.

The new algorithm will be used from October 2007 onward. The old availability and reliability values will not be recalculated.

 

Update received by email from Z.Sekera:

 

[…]

following up on the decision of the PMB of Oct/2/2007,
the GridView application has been modified to take
advantage of the new algorithm for calculating the 
site availability and reliability.
 
The new GridView is now in production under the usual URL
 
         http://gridview.cern.ch
 
click on 'Service Availability' link.
 
The old algorithm is still available from the link
'Switch to Old Algorithm' which is displayed in the left
pane of every screen allowing comparisons if desired.
(From the "old" algorithm one can get back to the new
simply by clicking on the similar link 'Switch to New
Algorithm' in the left pane).
 
All Site Availability graphs are now displayed in
percentages.
 
Please, send all comments to 'gridview-admin@cern.ch'

 

[…]

1.3      Sites Reliability and Job Efficiency Reports for September 2007 (SR and JE Tables) - A.Aimar

A.Aimar distributed the summaries of Site Reliability and Job Efficiency for September 2007.

 

Site Reliability – September 2007

 

 

Site     OPS    ALICE   ATLAS   CMS    LHCb   GOCDB Id
ASGC     93%    -       98%     95%    -      Taiwan-LCG2
BNL      91%    -       72%     -      -      BNL-LCG2
CERN     100%   97%     100%    100%   96%    CERN-PROD
CNAF     80%    97%     85%     100%   66%    INFN-T1
FNAL     89%    -       -       38%    -      USCMS-FNAL-WC1
FZK      91%    95%     62%     99%    91%    FZK-LCG2
IN2P3    70%    45%     26%     8%     97%    IN2P3-CC
NDGF     97%    0%      76%     0%     -      NDGF-T1
NIKHEF   92%    96%     92%     53%    90%    SARA-MATRIX
PIC      93%    -       100%    100%   93%    pic
RAL      90%    96%     100%    100%   97%    RAL-LCG2
TRIUMF   95%    -       98%     -      -      TRIUMF-LCG2

 

F.Hernandez noted that the sites do not have enough information to monitor the VO-specific SAM tests. For example the SAM result of the ALICE tests at IN2P3 is 45% but only the Experiment knows which ALICE tests are running and what they are verifying.

 

L.Betev agreed that the VO-specific tests are written by ALICE for their verifications and it is ALICE, for example, which should investigate the causes of the ALICE SAM failures at IN2P3.

 

H.Marten proposed that the VOs should look at the VO-specific tests and report on them.

 

Agreement:

The MB agreed that SAM test failures should be investigated and commented on by Sites and Experiments:

-       OPS general SAM tests by the Sites

-       VO-specific SAM tests by the Experiments

 

Job Efficiency – September 2007

 

 

Site     ALICE   ATLAS   ATLAS   CMS    LHCb
         AGENT   GANGA   PROD    CRAB   PILOT
ASGC     -       22%     82%     90%    -
BNL      -       0%      0%      -      -
CERN     99%     50%     92%     76%    99%
CNAF     53%     52%     74%     97%    95%
FNAL     -       -       -       99%    -
FZK      96%     73%     93%     96%    93%
IN2P3    89%     77%     79%     99%    96%
NDGF     0%      0%      84%     0%     -
NIKHEF   100%    45%     84%     -      19%
PIC      -       7%      61%     100%   88%
RAL      99%     15%     93%     90%    90%
TRIUMF   -       4%      94%     0%     0%

 

H.Marten noted that, as with the VO-specific SAM tests, a site cannot really find out why an Experiment has problems submitting jobs. The VO should investigate and identify the issues, and ask the Site when intervention by the site admins is needed. The Sites do not have any tools for monitoring the Experiments' jobs; the Experiments do.

 

L.Robertson agreed that the Job Efficiency should be verified by the Experiment that runs the application, and that the Experiment can ask for the Site's intervention if needed.

 

J.Templon noted that until now there has been little monitoring of the VOs, both of their SAM tests and of their Job Efficiency. In the past months some issues have remained unsolved for days, if not weeks. The solution is to integrate this information into (1) the ARDA Dashboards for the Experiments and (2) the Site Monitoring tools that are being developed by the Monitoring working group.

 

I.Bird replied that SAM test failures are already sent to the site monitoring tools, but the Job Efficiency values are not.

 

Decision:

For the moment the Job Efficiency values will be presented at the LCG MB but not used in any report until they are clearly understood and adequately monitored by Sites and VOs.

1.4      VO-Specific SAM Tests (Information Page; Other Information)

Information about the SAM tests was distributed to the MB mailing list.

 

Email from A.Aimar:

[…]
  information about the VO-specific SAM tests is now being collected 
and organized by D.Vicinanza in this wiki page: 
 
https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests
 
Just as a reminder the above and other pages about SAM reports, 
documentation, MB reports, etc are all linked for the MB in this page: 
 

https://twiki.cern.ch/twiki/bin/view/LCG/SamMbReports

[…]

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 10-July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

Not done. Scheduled for the F2F MB Meeting in October.

  • 18 Sept 2007 - Next week D.Liko will report a short update about the start of the tests in the JP working group.

Not done. A.Aimar will ask for updated information.

  • 21 Sept 2007 - D.Liko sends to the MB mailing list an updated version of the JP document, including the latest feedback.

Not done. A.Aimar will ask for updated information.

  • 25 September – MB Members send feedback on the new GridView computation algorithm.

Done.

  • 25 Sept 2007 - J.Gordon distributes the Grid Security and Grid Operation policy documents for approval at the F2F MB meeting.

Done.

  • 1 Oct 2007 - A.Aimar will distribute to the MB the Sites Reliability table for September 2007 and Sites will respond at the F2F Meeting in October.

Done.

 

3.    SRM Update (Roll-Out Plan) – J.Shiers, F.Donno

J.Shiers could not be present at the meeting but reported (via J.Gordon) that there were no outstanding issues.

 

F.Donno confirmed that the plan is being followed.

 

The dCache developer team has fixed 4 of the 9 issues that affect the good behaviour of the lcg-utils (test passed). On October 8th a new release of dCache is expected that fixes all outstanding SRM issues with return codes, important for gfal/lcg-utils. dCache has also provided a proper implementation of space reservation for REPLICA-ONLINE (successfully tested). A preliminary plan for deployment of dCache 1.8 in production has been agreed with the sites.

 

The CASTOR team is working on fixing the last return codes needed for gfal/lcg-utils and other issues. A release that fixes most (7) of the 10 outstanding issues was installed yesterday on the CERN instance but has not yet been tested. The next release is due on October 15th, 2007.

 

Other information and details are in the Status Summary and the Roll-Out Plan.

 

4.    LHCC Comprehensive Review Agenda - L.Robertson

                                                                 

The Agenda was already discussed as an Action List item (see Section 2).

 

The complete Agenda will be distributed to the MB next week.

 

 

5.    Status of LCG Resources Planning (Slides) - S.Foffano

 

S.Foffano presented a summary of the current planning of the LCG Resources for the coming years (until 2012), as she will report it to the CRRB on October 23rd, 2007.

 

The initial request was sent on 24th August to the participants of the Computing Resources Review Board (CRRB), with copy to the Collaboration Board, GDB and MB members (deadline 28th September):

-       All external T1 or T2 sites were requested to update their pledges for 2008 to 2012 inclusive and to allocate the resources to the experiments for the reference year 2008.

-       In parallel, Experiments were asked to revise their requirements up to 2012.

 

Experiments’ replies were received between July and mid-September.

 

Only a minority of sites have replied. The status of responses as of 2nd October 2007 is:

-       4/11 T1 sites have responded

-       16/54 T2 sites have responded
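As a quick check of the counts above, the response rates can be expressed as percentages. The Python sketch below is illustrative only; the function name `response_rate` is an assumption made for this example and is not part of any LCG tool.

```python
def response_rate(responded: int, total: int) -> float:
    """Return the share of sites that responded, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * responded / total


# Counts reported to the MB as of 2nd October 2007.
t1_rate = response_rate(4, 11)
t2_rate = response_rate(16, 54)
overall_rate = response_rate(4 + 16, 11 + 54)

print(f"T1: {t1_rate:.0f}%  T2: {t2_rate:.0f}%  overall: {overall_rate:.0f}%")
# prints "T1: 36%  T2: 30%  overall: 31%"
```

On these counts, roughly one site in three had replied by the deadline.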

 

In cases where a site has replied, often the pledge values for 2011 and 2012 are the same as 2010.

A summary (Slide 4) shows the situation for 2008 including the split across experiments.

 

Pending the missing input, a simulation summary (Slide 5) shows the situation from 2008 to 2012 if all sites that have not yet replied maintain their last pledge values up to 2012. This shows problems ahead; however, it is hoped that the response rate will improve and that new, increased pledge values will arrive.
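The assumption behind this simulation (a site that has not replied keeps its last pledge value up to 2012) amounts to a simple carry-forward. The Python sketch below is a hypothetical illustration, not the method used for the slides: the function name, the data layout and the numbers in the example are invented and are not actual pledge values.

```python
from typing import Dict, Optional


def carry_forward(pledges: Dict[int, Optional[float]],
                  first_year: int = 2008,
                  last_year: int = 2012) -> Dict[int, float]:
    """Fill missing yearly pledges with the most recent known value."""
    filled: Dict[int, float] = {}
    last_known: Optional[float] = None
    for year in range(first_year, last_year + 1):
        value = pledges.get(year)
        if value is not None:
            last_known = value
        if last_known is None:
            # Nothing to carry forward yet: the site has no earlier pledge.
            raise ValueError(f"no pledge known up to {year}")
        filled[year] = last_known
    return filled


# Made-up example: a site that has pledged only for 2008 and 2009.
example = carry_forward({2008: 100.0, 2009: 150.0})
print(example)
```

Summing such filled-in series over all sites and comparing the totals with the Experiments' requirements per year would give the kind of overview described for Slide 5.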

 

L.Robertson suggested that the problems for 2008 to report to the CRRB are:

-       ALICE lack of resources

-       Disk issue for CMS

 

The MB noted that LHCb (on slide 4) has now requested half of what it is offered for CPU (1770 vs. 3544 kSI2K).

U.Marconi will investigate the issue in LHCb and send a clarification to L.Robertson.

 

L.Robertson: For the following years (slide 5) the situation gets progressively worse. How should this be highlighted to the CRRB?

T.Doyle explained that, for RAL for instance, it is at the moment more a planning problem than a lack of resources, which will likely be available.

M.Kasemann suggested that even for 2008 the resources need to be optimized and readjusted among experiments.

 

L.Robertson reminded the MB that at the CRRB the Resources Scrutiny Group will be launched.

 

H.Marten added that:

-       In the future the Experiments should express their requirements first, so that the sites can discuss and justify their requests to the funding agencies. Several Tier-2 sites in Germany are based at Universities that do not yet have their funding information available for 2010-2012. That can be one of the reasons for the lack of resources in that period.

-       For 2008, if the resources change, there is not much that the sites can do to adapt, as all purchasing is already being prepared.

 

G.Merino supported this last point, adding that PIC, for instance, has recently received disk requirements from CMS that are 30% higher than the previous ones. This is a very difficult situation to face with only 7 months before the resources are needed.

M.Kasemann agreed with the issue raised and added that CMS is also preparing a plan in case not all Disk resources are provided by the sites.

 

L.Robertson suggested that the re-evaluation of pledges and requirements planned for December 2007 should be skipped. He asked whether next year there will be enough experience to have the planning activity completed by July 2008.

M.Kasemann replied that in Summer 2008 there will not be enough experience to provide a realistic update, except extending the current estimates to 2013. D.Barberis supported this point.

 

Decision:

The MB supported L.Robertson's proposal to postpone the next update of the requirements and planned resources to December 2008.

 

 

6.    Overview Board Report - L.Robertson

 

The comments exchanged via mail are taken into account and will be distributed.

 

One issue (about Job Priorities) will be discussed at the F2F meeting next week.

 

7.    AOB

 

 

 

8.    Summary of New Actions

 

The full Action List, with current and past items, will be in this wiki page before the next MB meeting.