LCG Management Board

Date/Time:

Tuesday 20 November 2007 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22187

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 23.11.2007)

Participants:

A.Aimar (notes), I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, C.Grandi, F.Hernandez, J.Knobloch, G.Merino, Di Qing, L.Robertson (chair), J.Shiers, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 27 November 2007 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved.

1.2      24x7 and VO Box SLA Documents (http://cern.ch/lcg-mb-private)

Two documents, on 24x7 support and on VO Box SLAs, are available in the area reserved for the LCG MB members (http://cern.ch/lcg-mb-private).

 

Note: To view and upload documents a NICE account is required. If you need assistance with your NICE account, send an email to Lcg.Office@cern.ch.

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 21 October 2007 - Sites should send to H.Renshall their resources acquisition plans for CPU, disks and tapes until April 2008

 

Not done. Half of the sites have sent their acquisition plans (TW-ASGC, US-T1-BNL, ES-PIC, DE-KIT and FR-CCIN2P3). The missing sites should send theirs to H.Renshall.

 

3.    SRM Update (Slides; Agenda) – J.Shiers

The presentation attached is the one shown at the LHCC Comprehensive Review and contains all the details.

Attached there is also a presentation by M.Litmaath about the storage class implementations.

 

Schedule and Requirements: The crucial issue now is to understand the Experiments’ schedules and to adapt them to the SRM v2.2 timeline (understood to be ~end of year),

And how the Experiments require storage to be set up and configured at each site for the January 2008 testing, prior to the February CCRC’08 run.

Schedule, resources and setup at the sites must therefore be agreed by the December CCRC F2F Meeting.

 

dCache: The SRM 2.2 deployment is on schedule: SARA has done the installation, and next week it will be installed at IN2P3.

 

CASTOR: CERN has been set up for LHCb and ATLAS. The other Experiments will follow.

 

T.Cass added that CERN is waiting for input from CMS and ALICE about their setup.

 

Ph.Charpentier noted that even if the installation is considered “complete for LHCb”, it is not currently usable because, for instance, lcg-utils for SRM 2.2 is not released and the pre-release configuration is not correct for LHCb. LHCb can test the installation, but it is not in production status and the LCG tools are missing.

 

T.Cass replied that he will investigate why the CASTOR configuration is not fully ready for LHCb.

 

F.Hernandez asked whether the dCache sites should also re-install lcg-utils.

J.Shiers replied that all sites will have to upgrade the LCG tools to the versions supporting the SRM 2.2 features. He added that the availability of all SRM-dependent tools will be verified.

 

I.Bird added that the tools are being certified and no significant issues seem to have emerged from the certification, but the configuration should be followed up at all sites.

I.Bird agreed to summarise the status of the tools certification and inform the MB.

 

L.Robertson concluded that the sites should all verify their configuration vs. the setup required by LHCb.

Ph.Charpentier agreed to report the status of the sites’ configurations for LHCb at the next MB meeting.

 

L.Dell’Agnello added that at CNAF too there are still a few issues with the installation of CASTOR.

 

4.    CCRC Update (Slides; Agenda) – J.Shiers

The slides summarise the current status of the CCRC Planning.

 

At the last CCRC meeting only LHCb was present. The goal now is to set up the agenda for the F2F CCRC meeting, and input from the Experiments is important.

 

The main points to clarify are:

-       Conclude on the scaling factors to apply to the February and May challenges.

-       Conclude on the SRM v2.2 storage setup details at all sites with respect to the Experiments’ needs.

-       CDR challenge in December – clarify the separation between ‘temporary’ data (needed for the challenge but deletable afterwards) and permanent data.

 

T.Cass added that this clarification is important for the calculation of tape capacity and the organization at the sites.

 

-       ‘Walk-throughs’ of the Experiments’ activities, emphasising the “Critical Services” involved and the appropriate scaling factors.

-       Conclusions on the scope/scale of the February challenge (resource limited).

F.Hernandez and J.Templon asked for a clarification of the exact needs of CCRC at each site, in order to plan CCRC alongside the normal production activities at the sites.

-       Discuss the eventuality of “de-scoping” some of the CCRC activities. How would this be negotiated, if required?

 

L.Robertson noted that the Experiments are crucial to the CCRC preparation and their presence is really necessary.

J.Shiers added that only LHCb had been present at all the meetings.

 

L.Robertson will contact the Experiments about their participation in the CCRC preparation.

 

5.    LHCC Comprehensive Review – L.Robertson

L.Robertson asked whether the MB Members had comments about the LHCC Comprehensive Review that took place on the 19-20 November 2007.

 

Note: The MB members are welcome to attend the Closed Session of the Review, which takes place on Thursday morning.

 

No comments from the MB members about the review.

 

J.Knobloch added that the conclusions and the report of the review should be available within 10 days.

 

6.    Site Reliability Summary - October 2007 (Site Reports; Slides) - A.Aimar

6.1      Site Reliability from January 2007 to October 2007

This month only 5 sites are above the target (91%), but another 4 (FR-CCIN2P3, NL-T1, US-T1-BNL, NDGF) are very close to it and well above 90% of the target (82%).

 

Site                 Jan 07  Feb 07  Mar 07  Apr 07  May 07  Jun 07  Jul 07  Aug 07  Sept 07  Oct 07
CERN                     99      91      97      96      90      96      95      99      100      99
DE-KIT (FZK)             85      90      75      79      79      48      75      67       91      76
FR-CCIN2P3               96      74      58      95      94      88      94      95       70      90
IT-INFN-CNAF             75      93      76      93      87      67      82      70       80      97
UK-T1-RAL                80      82      80      87      87      87      98      99       90      95
NL-T1 (NIKHEF)           93      83      47      92      99      75      92      86       92      89
CA-TRIUMF                79      88      70      73      95      95      97      97       95      91
TW-ASGC                  96      97      95      92      98      80      83      83       93      51
US-FNAL-CMS              84      67      90      85      77      77      92      99       89      75
ES-PIC                   86      86      96      95      77      79      96      94       93      96
US-T1-BNL                90     57*      6*     89      98      94      75      71       91      89
NDGF                    n/a     n/a     n/a     n/a     n/a     n/a     n/a     n/a      n/a     89
Reliability Target       88      88      88      88      88      91      91      91       91      91
Target + 90% target     5+5     6+3     4+1     7+3     6+3     3+2     7+2     6+2      7+2     5+4

 

Avg. 8 best sites: Apr 92% May 94% Jun 87% Jul 93% Aug 94% Sept 93% Oct 93%

Avg. all sites:       Apr 89% May 89% Jun 80% Jul 89% Aug 88% Sept 89% Oct 86%
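The monthly averages above can be reproduced directly from the reliability table; the following minimal sketch does so for October, using the October column as listed (site names abbreviated):

```python
# Reproduce the October 2007 averages from the reliability table above.
# Values (%) are the October column for the 12 sites listed.
october = {
    "CERN": 99, "DE-KIT": 76, "FR-CCIN2P3": 90, "IT-INFN-CNAF": 97,
    "UK-T1-RAL": 95, "NL-T1": 89, "CA-TRIUMF": 91, "TW-ASGC": 51,
    "US-FNAL-CMS": 75, "ES-PIC": 96, "US-T1-BNL": 89, "NDGF": 89,
}

values = sorted(october.values(), reverse=True)
avg_best8 = round(sum(values[:8]) / 8)      # average of the 8 best sites
avg_all = round(sum(values) / len(values))  # average of all sites

print(avg_best8, avg_all)  # 93 86
```

These values match the "Oct 93%" and "Oct 86%" figures quoted above.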

 

Slide 6 summarizes the problems encountered. They are similar to those of the last few months and related to the SRM implementations, with the same issues (gridftp doors, pool servers blocking, etc.). Hopefully most of these issues will be solved by dCache 1.8, which is being installed.

6.2      VO-Specific SAM Tests

The comparison with the VO-specific tests is not very useful this month. The VO-specific tests are still incomplete and a clarification of their readiness is needed.

 

L.Robertson suggested that the readiness of the VO-specific SAM tests should be presented, by the Experiments, at the F2F Meeting in December (4.12.2007).

 

Ph.Charpentier agreed that LHCb will present their specific tests and the reasons for the failures in the table below.

The other Experiments are not present at the current meeting and will be informed via email.

 

F.Hernandez asked how the score in the table below is computed.

A.Aimar replied that it is calculated by SAM with the same reliability algorithm used for the OPS VO, as agreed by the MB.

 

October 2007      OPS   ALICE   ATLAS    CMS   LHCb
CERN             100%     67%     93%    98%    93%   CERN-PROD
DE-KIT            75%     72%     52%    98%    86%   FZK-LCG2
FR-CCIN2P3        90%      0%      9%     0%    53%   IN2P3-CC
IT-INFN-CNAF      97%     32%     94%    99%    42%   INFN-T1
NDGF              86%      0%      0%      -      -   NDGF-T1
UK-RAL            95%     82%     94%    97%    69%   RAL-LCG2
NL-T1             89%     69%     88%      -    89%   SARA-MATRIX
CA-TRIUMF         91%       -     92%      -      -   TRIUMF-LCG2
TW-ASGC           51%       -     81%    83%      -   Taiwan-LCG2
US-FNAL-CMS       73%       -      -     64%      -   USCMS-FNAL-WC1
ES-PIC            96%       -     94%    97%    55%   pic
US-T1-BNL         88%       -     75%      -      -   BNL-LCG2

 

G.Merino asked whether the values shown above are (or should be) the same as the ones in GridView.

A.Aimar replied that GridView uses the new algorithm, while the values above are calculated by SAM still using the old availability and reliability algorithm.

 

T.Cass noted that the only downtime for CERN was scheduled, and that CERN should therefore be changed to 100% in the table above.

A.Aimar updated the table above.

 

L.Robertson noted that the MoU had reliability targets that do not seem reachable and they should be redefined. There is no point in setting unreachable targets. By March next year new targets should be formally agreed by the sites.

6.3      Job Efficiency

L.Robertson also raised the issue of the Job Efficiency data published by ARDA. Currently it is calculated using only the WMS logs. It could also use the Job Wrapper logging, but this requires work and needs to be agreed in advance.

 

As the Experiments use other submission systems, the analysis of the WMS logs is not a very reliable representation of the perceived job efficiency.

Discussions about the algorithm should take place with the ARDA team in order to agree on how to calculate the efficiency and how to consider jobs not reporting a result, etc.

 

7.    AOB

 

7.1      LHC OPN Sites Representatives – I.Bird (for D.Foster)

The LHC OPN group is now discussing the process for providing organized OPN operations and support.

D.Foster asked for the name of the person responsible for OPN operations at each Tier-1 site: not the technical contact (which is already known) but the person responsible for the operations service (whether this is the same person or a different one). Sites should send this information to A.Aimar.

 

Action:

The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.

7.2      Issues for the Overview Board – L.Robertson

L.Robertson asked the MB members to send to him any issue they would like to be reported to the Overview Board meeting (on the 3rd December 2007).

7.3      Definitions of Availability and Reliability in the MoU – L.Robertson

As G.Merino noted, in the MoU the meaning of “availability” corresponds to what the MB and the SAM calculations call “reliability”: it excludes scheduled downtime from the calculation.

 

The definitions of availability and reliability as interpreted by the MB are clearer and should continue to be used. This redefinition should be reported in the update of the MoU.
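As a sketch of the distinction (hypothetical numbers; the function names are illustrative, not from the MoU): availability divides uptime by the whole period, while reliability, as used by the MB and SAM, removes scheduled downtime from the denominator.

```python
# Illustrative sketch of the two metrics as used by the MB / SAM.
# All figures are hypothetical; a month of 720 hours is assumed.

def availability(total_h, sched_down_h, unsched_down_h):
    """Uptime over the whole period, scheduled downtime included."""
    up = total_h - sched_down_h - unsched_down_h
    return up / total_h

def reliability(total_h, sched_down_h, unsched_down_h):
    """Uptime over the period minus scheduled downtime."""
    up = total_h - sched_down_h - unsched_down_h
    return up / (total_h - sched_down_h)

# Example: 24 h scheduled and 12 h unscheduled downtime in a 720 h month.
print(round(availability(720, 24, 12), 3))  # 0.95
print(round(reliability(720, 24, 12), 3))   # 0.983
```

With only scheduled downtime, reliability stays at 100% while availability drops, which is why the two figures must not be confused in the MoU targets.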

 

Action:

L.Robertson will prepare a note reporting and explaining the current definitions of “availability” and “reliability” wrt. the definitions in the MoU.

 

8.    Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.