LCG Management Board

Date/Time:

Tuesday 25 September 2007 16:00-17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=18003

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 27.9.2007)

Participants:

A.Aimar (notes), T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, S.Foffano, J.Gordon (chair), F.Hernandez, M.Kasemann, J.Knobloch, M.Lamanna, U.Marconi, H.Marten, G.Merino, R.Pordes, G.Poulard, H.Renshall, Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 2 October 2007 16:00-17:00 - Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

J.Gordon submitted a minor change to an unclear/incorrect sentence in Section 6; the paragraph is highlighted in blue (see link).

 

The minutes of the previous meeting were approved.

1.2      High-Level Milestones Update

For the information of the MB, A.Aimar provided the link to the High-Level Milestones as updated up to 25.9.2007 (see link).

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 10-July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

Not done. It could be presented at the F2F MB meeting in October.

  • 18 Sept 2007 - Next week D.Liko will give a short update on the start of the tests in the JP working group.

Not done.

Update from D.Liko via email:
We have tested the Job priority configuration on the certification testbed. We found several issues in the automatic yaim configuration and had to tweak the configuration by hand. The issues were reported to the yaim developer. When the changes are available we will try again.

  • 21 Sept 2007 - D.Liko sends to the MB mailing list an updated version of the JP document, including the latest feedback.

Not done. J.Templon sent his feedback to the authors again because it was not included in the circulated copy. A new version should be distributed.

  • 25 Sept 2007 - A.Aimar collects the information about the VO-specific tests (contacts, test documentation, etc.) and adds it to the SAM wiki page.

Done after the meeting:

Update from A.Aimar email:
Information about the VO-specific SAM tests is now being collected
and organized by D.Vicinanza in this wiki page:
https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests 

Just as a reminder, the above and other pages about SAM reports,
documentation, MB reports, etc. are all linked for the MB in this page:
https://twiki.cern.ch/twiki/bin/view/LCG/SamMbReports

  • 25 September – MB Members send feedback on the new GridView computation algorithm.

Unless there are major issues, the MB could approve the change at the next MB meeting.

T.Doyle asked that, once the new algorithm is in production, further improvements to and issues with the availability metrics be discussed.

J.Gordon replied that further improvements and issues will be collected and discussed at the GDB on 10 October.

  • 25 Sept 2007 - I.Bird will propose speakers for the Tier-1 and Tier-2 presentations at the LHCC Comprehensive Review.

Removed. The agenda for the LHCC Comprehensive Review is in any case discussed further in this meeting.

 

3.    Report from the LHCC Referees meeting - J.Shiers

3.1      Preparation of the Agenda for the Comprehensive Review (Draft agenda for Comprehensive Review)

J.Shiers discussed the agenda of the Comprehensive Review with F.Forti after the Referees Meeting.

 

Three issues remain open that the referees would like added to the agenda (in RED in the agenda below):

-       Overview of the status of storage at the Tier-0 and Tier-1 sites. This should be presented by someone from a site, not from CERN (maybe Greg Cowen or Graham Robinson?).

-       It was well received that for the Tier-1s there is a talk from the US (FNAL) and one from Europe.

-       For the Tier-2 presentations there are volunteers from OSG and from the UK. Still missing are one more Tier-2 talk and someone to give the Tier-2 summary.

 

NOTE – All time slots include a MANDATORY 25% for discussion; presentations take at MAXIMUM 75% of the slot.

 

Monday 19 November 2007

 

09:00 Overview - (40')  Les Robertson

09:40 Resources and Accounting (20')  Sue Foffano

 

10:00  – 10:15 Coffee (served outside IT amphitheatre)

 

10:15->12:45    Stream A - part 1 – Grid

10:15  Middleware (1h00')

-          Status of the middleware to support baseline services

--  general

--  EGEE specific

-- OSG specific

 

11:15 Grid Deployment (1h30')

-          Middleware Deployment (30')

-          Operations - EGEE and OSG (1h00')

 

10:15->12:45    Stream B - part 1 - Mass Storage, Fabric,  Networking

 

10:15  Mass Storage Progress (1h30')

-          Overview (15’) TBD

-          CASTOR (20')   Tony Cass

-          dCache (20')   Patrick Fuhrmann

-          DPM (20')  Jean-Philippe Baud

-          SRM v2.2 & Experiment Progress (30')  Flavia Donno

11:45   CERN Fabric - Tier-0 + CAF Status - Performance - Reliability (30')  Bernd Panzer

12:30   Networking including the LCG OPN (20')  David Foster

13:00  LUNCH (Sandwiches to be provided outside IT amphitheatre)

14:00->16:30 Stream A - part 2  - Applications, 3D

 

 

14:00  Application Area (1h40')   Pere Mato

-          Overview (20')

-          Simulation & generators (20')

-          Core Libraries and Services (20')

-          Persistency Framework Projects (20')

-          Software process (20’)

 

15:45  – 16:00 Coffee

 

 

16:00  3D project (30’)

 

 

14:00->17:30    Stream B - part 2 - Tier-1s, Tier-2s

 

14:00  Tier-1 Status (1h30')

-          Summary of the status of the Tier-1s (30')

-          Reports from 3 Tier-1 sites (1h00')

-          FNAL + 2 others

-          (TRIUMF – Reda Tafirout)

 

 

15:30 – 15:45   Coffee

 

16:00  Tier-2 Status (1h30')

-          Summary of the status of the Tier-2s (30')

-          Reports from 3 Tier-2 sites (1h00')

-          OSG Tier2s – Ruth Pordes

-          UK Tier2s – Andrew Sansum / John Gordon

-          Another Tier-2

 

17:30     Visit to the CERN Computer Centre (or immediately after lunch)

3.2      LCG-LHCC Referees Meeting Report (Agenda)

There were presentations about the SRM roll-out (J.Shiers), the FDR preparation (H.Renshall) and the plans for end-user analysis from ATLAS (K.Bos) and CMS (I.Fisk):

-       SRM (SRM v2.2 rollout plan; Slides) - Most sites will be ready by end 2007 and all by February 2008. The referees asked about the possibility of further delays in the SRM deployment. J.Shiers replied that testing has been ongoing throughout and that February 2008 seems a realistic expectation.

-       FDR of the Experiments - The presentation covered the plans presented by the Experiments at the WLCG Workshop and at CHEP. The questions were about resource readiness at the Sites and the installation of SRM at all sites. The 2008 resources should be available for use by 1 April 2008.

-       End-user Analysis – The presentations by ATLAS and CMS did not raise major questions or comments from the Referees.

3.3      LHCC Open Session (Agenda)

The Open Session presentations took place on Tuesday morning:

-       LHC Status Report (L.Evans)

-       ATLAS Status Report (P.Jenni)

-       LCG Status Report (J.Shiers) - The presentation (Slides) focused on the status of the LCG services and on the need for the CCRC combined challenges in 2008. The CCRC tests were very favourably received by the Experiments’ management; both ATLAS and CMS said they would fully support these challenges. J.Shiers also warned the Sites about the difficulties of ramping up their CPU and disk resources in 2008 (by a factor of 4 to 6).

 

4.    GDB Summary (Paper) - J.Gordon

 

J.Gordon distributed the Summary of the GDB Meeting in TRIUMF.

 

All details are in the Paper but at the MB meeting he highlighted the following future issues for the GDB:

-       The Grid Security Policy and the Grid Operation Policy documents were approved by the GDB and will be brought to the F2F MB meeting of the 9th October for MB approval.

 

New Action:

25 Sept 2007 - J.Gordon distributes the Grid Security and Grid Operation policy documents for approval at the F2F MB meeting.

 

-       Security Policies. The definition of these policies is progressing slowly. Members should encourage feedback and participation from their countries.

-       Authorisation. The list of issues prepared after the July GDB needs further discussion and was distributed to the MB. C.Grandi has asked to reply to and discuss the points distributed by J.Templon about authorisation (on VOMS, proxies, etc.) at the next meeting.

-       Job Submission Priorities are being progressed by a subgroup of the EGEE TCG; updates could be discussed at the GDB.

-       “glexec” issues are gradually becoming clear. There is still some work to do before taking them to the TCG.

-       SRM 2.2: GSSD has booked the pre-GDB afternoon on October 9th.

-       The Common Computing Readiness Challenge (CCRC) has been scheduled for February and May 2008 to test the full computing cycle for all four experiments running simultaneously. 

-       Availability monitoring. Further discussion of the GridView algorithm was requested: do the critical tests meet the requirements, and is the site-availability algorithm useful to the Experiments?

 

5.    Update on Automatic Accounting (Paper) - J.Gordon

 

J.Gordon summarized the progress of Automatic Accounting of the WLCG resources.

 

Of the 12 sites (CERN + Tier-1s) in the WLCG April report:

-       7 sites (BNL, CERN, CNAF, FNAL, FZK, PIC, and TRIUMF) showed complete agreement between their local records and APEL (the same number as in July, but not all the same sites). RAL did not publish correctly from their SL4 CE.
T.Cass reported that the CERN values were incorrect and were corrected when they were circulated.

-       5 sites (ASGC, Lyon, NDGF, NIKHEF, and RAL) published to APEL but reported discrepancies between APEL and their local records, compared with 4 in July. NDGF had correctly published accounts for ATLAS but had manually added ALICE.

-       All sites published into APEL; this is the first time. (The comparison behind these categories is illustrated in the sketch below.)
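
As an illustration only, the hypothetical Python sketch below (invented function and numbers, not the actual APEL tooling) shows the comparison behind the three categories above:

# Hypothetical sketch of the local-records-vs-APEL comparison; the real
# APEL checks differ. A site either agrees, shows a discrepancy, or has
# not published at all.
def classify_site(local_cpu_time, apel_cpu_time, tolerance=0.0):
    """Compare a site's local accounting records with what APEL holds."""
    if apel_cpu_time is None:
        return "not published"
    if abs(local_cpu_time - apel_cpu_time) <= tolerance * local_cpu_time:
        return "complete agreement"
    return "discrepancy"

# Illustrative numbers only (invented for the example):
print(classify_site(1200.0, 1200.0))  # complete agreement
print(classify_site(1200.0, 1050.0))  # discrepancy
print(classify_site(1200.0, None))    # not published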

 

T.Cass asked that the CPU vs. wall-clock efficiency be monitored more closely: what is the actual usage of the CPU resources at the Sites?

J.Gordon suggested this should be an item for the F2F MB meeting.

 

 

6.    Report on Site Reliability and Job Efficiency (JE Summary; SR Summary; Sites Reports Aug07; Slides) - A.Aimar

 

A.Aimar presented the monthly report on Site Reliability and Job Efficiency.

6.1      Site Reliability in 2007

The table below shows the monthly averages since the beginning of 2007.

In August there were 6 sites above the target (i.e. above 91%, in green) and 2 more above 90% of the target (i.e. above 82%, in orange); the counting rule is illustrated in the sketch after the table.

 

Site                    | Jan 07 | Feb 07 | Mar 07 | Apr 07 | May 07 | Jun 07 | Jul 07 | Aug 07
CERN                    |   99   |   91   |   97   |   96   |   90   |   96   |   95   |   99
GridKa/FZK              |   85   |   90   |   75   |   79   |   79   |   48   |   75   |   67
IN2P3                   |   96   |   74   |   58   |   95   |   94   |   88   |   94   |   95
INFN/CNAF               |   75   |   93   |   76   |   93   |   87   |   67   |   82   |   70
RAL                     |   80   |   82   |   80   |   87   |   87   |   87   |   98   |   99
SARA-NIKHEF             |   93   |   83   |   47   |   92   |   99   |   75   |   92   |   86
TRIUMF                  |   79   |   88   |   70   |   73   |   95   |   95   |   97   |   97
ASGC                    |   96   |   97   |   95   |   92   |   98   |   80   |   83   |   83
FNAL                    |   84   |   67   |   90   |   85   |   77   |   77   |   92   |   99
PIC                     |   86   |   86   |   96   |   95   |   77   |   79   |   96   |   94
BNL                     |   90   |   57*  |   6*   |   89   |   98   |   94   |   75   |   71
NDGF                    |   n/a  |   n/a  |   n/a  |   n/a  |   n/a  |   n/a  |   n/a  |   n/a
Reliability Target      |   88   |   88   |   88   |   88   |   88   |   91   |   91   |   91
Target + 90% of target  |  5 + 5 |  6 + 3 |  4 + 1 |  7 + 3 |  6 + 3 |  3 + 2 |  7 + 2 |  6 + 2
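
A minimal counting sketch (hypothetical Python, not the actual report tooling), using the thresholds stated above, of how the last row can be derived from the monthly values:

# Count the sites at or above the monthly reliability target (green) and
# those at or above 90% of the target (orange), as in the table's last row.
def classify(reliabilities, target):
    above_target = 0   # green: >= target
    near_target = 0    # orange: >= 90% of the target, but below it
    for site, value in reliabilities.items():
        if value is None:              # e.g. NDGF: no data published
            continue
        if value >= target:
            above_target += 1
        elif value >= 0.9 * target:
            near_target += 1
    return above_target, near_target

# August 2007 values from the table above (target 91; 90% of it is ~82):
aug07 = {"CERN": 99, "GridKa/FZK": 67, "IN2P3": 95, "INFN/CNAF": 70,
         "RAL": 99, "SARA-NIKHEF": 86, "TRIUMF": 97, "ASGC": 83,
         "FNAL": 99, "PIC": 94, "BNL": 71, "NDGF": None}
print(classify(aug07, 91))  # -> (6, 2), matching the "6 + 2" entry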

6.2      OPS Tests and VO-Specific SAM Tests

In addition to the generic SAM tests, executed using the “OPS” VO, the Experiments have written their own specific tests.

The table below shows the results for August 2007 of the VO-specific tests in comparison with the SAM tests.

 

AUG 2007        | OPS  | ALICE | ATLAS | CMS  | LHCb
CERN-PROD       | 99%  | 42%   | 99%   | 99%  | 95%
FZK-LCG2        | 67%  | 35%   | 35%   | 95%  | 97%
IN2P3-CC        | 95%  | 43%   | 12%   | 99%  | 94%
INFN-T1         | 70%  | 14%   | 73%   | 15%  | 65%
NDGF-T1         | 88%  | -     | 88%   | -    | -
RAL-LCG2        | 99%  | 85%   | 100%  | 100% | 98%
SARA-MATRIX     | 86%  | 42%   | 86%   | 70%  | 90%
TRIUMF-LCG2     | 97%  | -     | 94%   | -    | -
Taiwan-LCG2     | 83%  | -     | 94%   | 97%  | -
USCMS-FNAL-WC1  | 99%  | -     | -     | 54%  | -
pic             | 94%  | -     | 99%   | 100% | 100%
BNL-LCG2        | 71%  | -     | 66%   | -    | -

 

The table below shows the results for September 2007 (up to 24 September) of the VO-specific tests in comparison with the SAM tests.

 

SEPT 2007       | OPS  | ALICE | ATLAS | CMS  | LHCb
CERN-PROD       | 100% | 96%   | 100%  | 100% | 95%
FZK-LCG2        | 92%  | 100%  | 53%   | 100% | 91%
IN2P3-CC        | 64%  | 56%   | 7%    | 10%  | 96%
INFN-T1         | 77%  | 96%   | 85%   | 100% | 69%
NDGF-T1         | 98%  | -     | 94%   | -    | -
RAL-LCG2        | 88%  | 95%   | 100%  | 100% | 96%
SARA-MATRIX     | 94%  | 96%   | 91%   | 66%  | 88%
TRIUMF-LCG2     | 96%  | -     | 97%   | -    | -
Taiwan-LCG2     | 91%  | -     | 96%   | 97%  | -
USCMS-FNAL-WC1  | 92%  | -     | -     | 24%  | -
pic             | 91%  | -     | 100%  | 100% | 92%
BNL-LCG2        | 91%  | -     | 90%   | -    | -

 

A.Aimar proposed that Sites and Experiments investigate the causes of the issues highlighted by the low percentages (in “red”, below the targets) in the values for September 2007.

 

New Action:

1 Oct 2007 - A.Aimar will distribute to the MB the Sites Reliability table for September 2007 and Sites will respond at the F2F meeting in October.

 

J.Gordon explained that the RAL SAM results for September are below the target because the OPS certificate at RAL had expired, while the other VOs’ certificates were still valid.

H.Marten repeated the need for contacts and documentation for the VO-specific tests. They are needed in order to understand what is required by the Experiments. Currently the Sites do not know what the Experiments are verifying in their tests, nor how to interpret the errors of the VO-specific tests.

 

Update from A.Aimar email (as in Section 2 above):
Information about the VO-specific SAM tests is now being collected in this wiki page:

 

https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests  

 

F.Hernandez noted that the specific tests for ATLAS are no longer “critical” and this has improved the results at IN2P3, but the reason for the change is not clear. The process for adding and modifying the VO-specific tests should be presented and discussed at the MB.

6.3      Job Efficiency vs. Site Reliability for August 2007

The table below compares the Site reliability and job efficiency percentages for each Site and VO.

Note: For BNL and NDGF there are no Job Efficiency values available.

 

AUG 2007 | OPS  | ALICE        | ATLAS                 | CMS          | LHCb
Site     | SAM  | SAM  | AGENT | SAM  | GANGA | PROD   | SAM  | CRAB  | SAM  | PILOT
ASGC     | 83%  | -    | -     | 94%  | -     | 79%    | 97%  | 97%   | -    | -
BNL      | 71%  | -    | -     | 66%  | -     | -      | -    | -     | -    | -
CERN     | 99%  | 42%  | 93%   | 99%  | 98%   | 89%    | 99%  | 82%   | 95%  | 97%
CNAF     | 70%  | 14%  | 90%   | 73%  | 97%   | 80%    | 15%  | 95%   | 65%  | 92%
FNAL     | 99%  | -    | -     | -    | -     | -      | 54%  | 91%   | -    | -
FZK      | 67%  | 35%  | 87%   | 35%  | 94%   | 83%    | 95%  | 68%   | 97%  | 99%
IN2P3    | 95%  | 43%  | 100%  | 12%  | 100%  | 88%    | 99%  | 98%   | 94%  | 98%
NDGF     | 88%  | -    | -     | 88%  | -     | 90%    | -    | -     | -    | -
NIKHEF   | 86%  | 42%  | 7%    | 86%  | 100%  | 78%    | 70%  | -     | 90%  | 6%
PIC      | 94%  | -    | -     | 99%  | 100%  | 79%    | 100% | 96%   | 100% | 90%
RAL      | 99%  | 85%  | 97%   | 100% | 99%   | 77%    | 100% | 89%   | 98%  | 91%
TRIUMF   | 97%  | -    | -     | 94%  | 26%   | 87%    | -    | -     | -    | -

 

J.Templon commented that, for instance, the low Job Efficiency (6%) of the LHCb pilot jobs at NIKHEF was caused by an error in a script and was detected only a month later. NIKHEF did not even know that any VO test was being executed during that period, which is why the issue was discovered only several weeks later. Sites should be informed in detail of when and which tests are executed.

M.Lamanna noted that the Experiments’ Dashboards show the efficiency of the Experiments’ applications.

 

J.Templon replied that the Sites cannot follow the Experiments’ dashboards for all VOs, and in the case above they did not even know that such a test was running.

M.Lamanna noted that the values of the SAM CMS tests at FNAL are very low (54%) and do not match the Job Efficiency values (91%). He wondered whether some SAM tests are testing the FNAL PPS services rather than the production ones.

A.Aimar replied that he had asked A.Sciabá for information and will attach the reply to the minutes.

 

Email from A.Sciabá on the CMS tests at FNAL:

There are two problems with the availability of FNAL calculated for CMS.

The first problem is that the CEs are published under two sitenames, USCMS-FNAL-WC1-CE and USCMS-FNAL-WC1-CE2, while the SE and the SRM are published under the sitename USCMS-FNAL-WC1.
I hope that there is a good reason to do this, because it wrecks the calculation of the availability. The practical effect for WLCG, at this time, is that the FNAL availability is calculated only based on the availability of the SE and the SRM because apparently for WLCG FNAL is USCMS-FNAL-WC1.

The second problem is that the availability for FNAL is n/a because the availability for the SE and the SRM is n/a, and this is so because the CMS critical tests for the SE and SRM are submitted by CMS only to nodes at sites from a static list of sitenames:

https://twiki.cern.ch/twiki/bin/view/CMS/SitesList

As now the FNAL SE and SRM belong to a sitename, USCMS-FNAL-WC1, that is not in the list, the tests are not submitted and therefore the availability is undefined. The fact that there was a period where the availability was defined between two periods when it was undefined could be explained by some messing up of the information system, by which the SE and the SRM temporarily appeared in one site that was in the list.

I have now fixed the list, so the second problem should now be solved. The first one is the most important, but can be solved only by FNAL.
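
To illustrate the first problem A.Sciabá describes, here is a minimal hypothetical Python sketch (assumed data and a simplified aggregation rule, not the actual GridView/SAM code) of per-sitename availability: services published under a different sitename do not contribute to a site's availability, and a sitename whose services receive no critical tests ends up with an undefined (n/a) result.

# Hypothetical sketch of per-sitename availability aggregation; the actual
# GridView/SAM computation differs -- this only illustrates the failure mode.

# (sitename, service, fraction of critical tests passed; None = never tested)
results = [
    ("USCMS-FNAL-WC1-CE",  "CE",  0.91),  # CEs published under other sitenames
    ("USCMS-FNAL-WC1-CE2", "CE",  0.89),
    ("USCMS-FNAL-WC1",     "SE",  None),  # sitename absent from the static CMS
    ("USCMS-FNAL-WC1",     "SRM", None),  # list, so critical tests not submitted
]

def site_availability(results, sitename):
    """Availability of one sitename, computed from its own services only."""
    values = [v for (s, _svc, v) in results if s == sitename]
    tested = [v for v in values if v is not None]
    if not tested:
        return None  # all services untested -> availability is undefined (n/a)
    # Illustrative aggregate: a site is only as available as its worst service.
    return min(tested)

# For WLCG, FNAL "is" USCMS-FNAL-WC1, so only the SE/SRM are considered:
print(site_availability(results, "USCMS-FNAL-WC1"))     # None -> reported n/a
print(site_availability(results, "USCMS-FNAL-WC1-CE"))  # 0.91, invisible to WLCG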

7.    AOB

 

No AOB.

 

8.    Summary of New Actions

 

The full Action List, with current and past items, will be in this wiki page before the next MB meeting.