LCG Management Board

Date/Time:

Tuesday 19 February 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27470

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 22.2.2008)

Participants:

A.Aimar (notes), I.Bird (chair), Ph.Charpentier, T.Cass, L.Dell’Agnello, T.Doyle, M.Ernst, J.Gordon, C.Grandi, F.Hernandez, M.Lamanna, U.Marconi, P.Mato, G.Merino, R.Pordes, M.Schulz, C.Sehgal, J.Shiers, O.Smirnova

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 26 February 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Metrics on Tape Efficiency from the Tier-1

I.Bird reminded the sites that, although it is not mandatory, most sites had said they would manage to generate at least some of the efficiency metrics proposed by T.Bell, and that all sites would act so as to be able to record such metrics in the near future.

The metrics will not be ready for February, but they should be available by May; they were also requested at the LHCC Referees meeting held the same morning.

 

New Action:

26 Feb 2008 - Discuss and agree on a milestone for the Tier-1 sites to provide tape efficiency metrics before CCRC08 in May.

 

L.Dell’Agnello reported that CNAF has the scripts produced by T.Bell for CASTOR’s tape usage statistics. They have problems getting meaningful figures for data migration: apparently, an error in the migration log invalidates the log files of the “rtcpd” process, causing a complete mismatch with the actual situation (i.e. the whole stream is declared as failed even when just one migration fails).
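To illustrate why this makes the figures meaningless, the sketch below (purely illustrative, not CASTOR or rtcpd code; the sample data is made up) contrasts per-stream accounting, where one failed file marks the whole stream as failed, with the per-file accounting that a useful migration metric requires:

```python
# Illustrative sketch only (not CASTOR code): per-stream vs per-file
# accounting of tape migrations. The (file, succeeded) pairs are made up.
migrations = [("file1", True), ("file2", True), ("file3", False), ("file4", True)]

# Per-stream accounting (the problematic behaviour): one failure -> whole
# stream reported as failed, hiding the three successful migrations.
stream_ok = all(ok for _, ok in migrations)
print("stream status:", "OK" if stream_ok else "FAILED")  # FAILED

# Per-file accounting (what a meaningful efficiency metric needs).
succeeded = sum(1 for _, ok in migrations if ok)
print(f"migrated {succeeded}/{len(migrations)} files")  # migrated 3/4 files
```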

The tape servers at CNAF were still running CASTOR 2.1.3-23, while the core services and disk servers are on CASTOR 2.1.4-10-2. They tried upgrading one of the tape servers, but the problem is still present. CERN CASTOR support has been notified of the issue.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

·         12 Feb 2008 – Sites should start publishing their tape efficiency data in their wiki page (see https://cern.ch/twiki/bin/view/LCG/MssEfficiency)

 

To be replaced by a milestone for the sites to prepare tape efficiency metrics.

 

·         13 Feb 2008 - A.Aimar will ask for the Sites Reliability Reports January 2008. Sites should complete them by end of the week.

Done.

 

3.   CCRC08 Update (Slides) - J.Shiers

 

J.Shiers presented an update about the CCRC08 challenge.

3.1      Scope and Timeline

CCRC08 will not achieve the sustained exports from ATLAS+CMS (+others) at nominal 2008 rates for 2 weeks by the end of February 2008. There are also aspects of individual experiments’ work-plans that will not fit into the 4-29 February CCRC slot.

 

These goals can be achieved soon after February. Therefore the proposal is to proceed and continue through March, April and beyond. The WLCG Computing Service is in full production mode, and running permanently is its purpose. One needs to move from the mind-set of “challenge then relax” to “full production all the time”.

3.2      Reporting and Handling Problems

Current procedures for handling problems need to be clarified; there is some mismatch between expectations and reality.

 

Experiments should remember that there are no GGUS TPMs on weekends, holidays or nights. A problem submitted to GGUS on a Friday evening will be answered the following Monday morning.

As agreed, urgent matters should use the on-call services and expert call-out as appropriate: {alice-, atlas-}grid-alarm; {cms-, lhcb-}operator-alarm.

 

Contacts are needed on all sides (sites, services and experiments): whom does one call in case of problems?

 

New Action:

26 Feb 2008 - The Sites and Experiments should provide an updated list of their contacts (correct emails, grid operators’ phones, etc).

Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

In addition, complete and open reporting in case of problems is essential: only in this way can one learn and improve. It should not require an investigation to figure out what really happened when a problem occurred.

 

When MoU targets are not met, a post-mortem analysis should be triggered at the end of the CCRC phase. This should be a light-weight operation that clarifies what happened and identifies what needs to be improved for the future. Once again, the problem is at least partly about communication.

 

H.Renshall sent some informal observations:

-       The CCRC'08 e-logbook is for internal information and problem solving but does not replace, and is not part of, existing operational procedures.

-       Outside of normal working hours, GGUS and CERN Remedy tickets are not looked at. Currently the procedure for ATLAS to raise critical operations issues is to send an email to the list atlas-grid-alarm. This is seen by the 24-hour operator, who may escalate to the sysadmin piquet, who can in turn escalate to the FIO piquet. The users who can submit to this list are K.Bos, S.Campana, M.Branco and A.Nairz. It would be good for IT operations to know what to expect from ATLAS operations when something changes. This may already be in the dashboard pages.

 

The various views that are required need to be taken into account, e.g. sites (depending on the VOs supported), overall service coordination, production managers, and project management and oversight.

 

Reminder: there will be the March and April F2F meetings plus the collaboration workshop, and a review during the June CCRC’08 “post-mortem”.

3.3      FTS Corrupted Proxies and Other Issues

The FTS proxy is only delegated if required (remaining lifetime < 4 hours). The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be re-delegated is the one that does it; the proxy then stays on the server for ~8 hours.

 

There is a race condition in the delegation: if two clients detect at the same time (as is likely) that the proxy needs to be renewed, they both try to renew it. This can result in the delegation requests being mixed up, so that what finally ends up in the DB is the certificate from one request and the key from the other. This situation is not detected, and the proxy remains invalid for the next ~8 hours.

The real fix requires a server-side update, so a quick fix will have to be deployed in the meantime.
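The race is a classic check-then-act problem. The sketch below is purely illustrative (it is not the gLite/FTS code; the ProxyStore class and all names are hypothetical): it models how storing a certificate without checking which delegation request it answers can pair the key of one request with the certificate of another, and how validating the request ID and updating the pair atomically avoids the mismatch.

```python
# Illustrative model of the delegation race described above; NOT the
# actual gLite/FTS implementation. All names are hypothetical.
import threading

class ProxyStore:
    """Server-side store holding a delegated proxy as a key/cert pair."""

    def __init__(self):
        self._lock = threading.Lock()
        self._request_id = None  # delegation request the stored key belongs to
        self._key = None
        self._cert = None

    def start_delegation(self, request_id, key):
        # A client asks the server to start a new delegation request; the
        # server remembers which request the private key belongs to.
        with self._lock:
            self._request_id = request_id
            self._key = key

    def finish_delegation_racy(self, cert):
        # Broken variant: the certificate is stored without checking which
        # request it answers. If two clients interleave, the store ends up
        # with the key from one request and the certificate from the other.
        self._cert = cert

    def finish_delegation_safe(self, request_id, cert):
        # Safer variant: accept the certificate only if it matches the
        # pending request, and update the pair atomically under a lock.
        with self._lock:
            if request_id != self._request_id:
                raise RuntimeError("stale delegation request - client must retry")
            self._cert = cert
```

Under such a scheme the losing client's stale completion is rejected and it can simply reuse the now-valid proxy, instead of silently corrupting the stored key/certificate pair.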

 

M.Ernst noted that this issue was not promptly communicated to the Sites.

J.Shiers replied that this was immediately reported in the CCRC e-logbook but not communicated further.

Slides 6-8 report on the list of problems encountered by ATLAS that are being solved.

The issue encountered by ATLAS on Friday would have required calling the on-call service in order to get a quick response; a GGUS ticket does not provide a quick response to urgent issues.

3.4      Priorities of the SRM 2.2 Issues

J.Shiers reminded the Experiments that feedback is expected on the priorities of the SRM 2.2 issues to be solved:

-       Protecting spaces from (mis-)usage by generic users. Concerns dCache, CASTOR

-       Tokens for PrepareToGet/BringOnline/SRMCopy (input). Concerns dCache, DPM, StoRM

-       Implementations fully VOMS-aware. Concerns dCache, CASTOR

-       Correct implementation of GetSpaceMetaData. Concerns dCache, CASTOR. Correct size to be returned at least for T1D1

-       Selecting tape sets. Concerns dCache, CASTOR, StoRM. Selection by means of tokens or directory paths?

3.5      Service Summary

From a service point of view, services are running reasonably smoothly and progressing (reasonably) well. There are issues that need to be followed up (e.g. post-mortems in case of “MoU-scale” problems, problem tracking in general) but these are both relatively few and well understood.

 

But CCRC08 needs to stress all aspects of the service as hard as is required for 2008 production, to ensure that the service can handle the load.

 

J.Gordon noted that the post-mortems could compare the results with the MoU expectations and be used to trigger actions at the Sites.

 

4.   LHCC Referees Meeting – I.Bird

 

The meeting with the LHCC Referees took place the same morning (see Agenda).

 

The questions and concerns of the referees were about:

-       Status of the Sites and the SRM installations

-       Why the Experiments were not all running at the same time; the Referees would like concurrent running to be demonstrated.

-       Extend the tape efficiency metrics to all the Sites.

-       Define overall metrics for Site performance.

 

They also reviewed the Tier-1 status, in particular the 24x7 and VOBoxes milestones, and expressed concern about the delays reported by some sites.

 

5.   Tier-1 and Tier-2 Reliability and Availability - Jan 2008 (Site Reports; Slides; Tier-1 Data; Tier-2 Data) – A.Aimar

 

5.1      Tier-1 Reliability and Availability

Below are the reliability metrics since April 2007.

 

One should note that the target has moved to 93%; with the old 91% target, 10 sites would have been above it.

 

Site                   Apr 07  May 07  Jun 07  Jul 07  Aug 07  Sept 07  Oct 07  Nov 07  Dec 07  Jan 08

CERN                     96      90      96      95      99     100       99      98     100      99
DE-KIT (FZK)             79      79      48      75      67      91       76      85      90      94
FR-CCIN2P3               95      94      88      94      95      70       90      84      92      95
IT-INFN-CNAF             93      87      67      82      70      80       97      91      96      70
UK-T1-RAL                87      87      87      98      99      90       95      93      91      92
NL-T1                    92      99      75      92      86      92       89      94      50      57
CA-TRIUMF                73      95      95      97      97      95       91      94      96      97
TW-ASGC                  92      98      80      83      83      93       51      94      99      97
US-FNAL-CMS              85      77      77      92      99      89       75      79      88      93
ES-PIC                   95      77      79      96      94      93       96      95      96      93
US-T1-BNL                89      98      94      75      71      91       89      93      44      91
NDGF                    n/a     n/a     n/a     n/a     n/a     n/a       89      98     100      92

Target                   88      88      91      91      91      91       91      91      93      93
Sites above target
+ above 90% of target   7+3     6+3     3+2     7+2     6+2     7+2      5+4     9+2     6+4     7+3

 

The averages for the best 8 sites and for all sites are the following:

 

Avg. 8 best sites: Jun 87%  Jul 93%  Aug 94%  Sept 93%  Oct 93%  Nov 95%  Dec 96%  Jan 95%

Avg. all sites:    Jun 80%  Jul 89%  Aug 88%  Sept 89%  Oct 86%  Nov 92%  Dec 87%  Jan 89%
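These averages can be reproduced directly from the table above. The following sketch (illustrative only; the dictionary is simply the January 2008 column transcribed) computes the "best 8 sites" and "all sites" figures for January:

```python
# Minimal sketch: reproduce the January 2008 reliability averages from the
# table above (values in percent).
jan_2008 = {
    "CERN": 99, "DE-KIT (FZK)": 94, "FR-CCIN2P3": 95, "IT-INFN-CNAF": 70,
    "UK-T1-RAL": 92, "NL-T1": 57, "CA-TRIUMF": 97, "TW-ASGC": 97,
    "US-FNAL-CMS": 93, "ES-PIC": 93, "US-T1-BNL": 91, "NDGF": 92,
}

# Average over the 8 highest-scoring sites, then over all 12 sites.
best8 = sorted(jan_2008.values(), reverse=True)[:8]
print(f"Avg. 8 best sites: {sum(best8) / len(best8):.0f}%")              # 95%
print(f"Avg. all sites: {sum(jan_2008.values()) / len(jan_2008):.0f}%")  # 89%
```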

 

In slide 2 the reliability values for NL-T1 (57%) need to be verified.

 

New Action:

26 Feb 2008 - A.Aimar will verify the values of the GridView reliability calculations for NL-T1.

 

L.Dell’Agnello explained that a power outage caused the unavailability over a full weekend.

In addition, after their installation of gLite 2.1 update 12, some tests kept failing at the end of January, and also during the first half of February, because CNAF had not moved the classic SE to another SE.

 

For the OPS VO some centres still use the classic SE. M.Schulz reminded everyone that the classic SE has been deprecated for more than a year and that sites should upgrade.

L.Dell’Agnello agreed and said that CNAF will move to StoRM v1 soon.

 

I.Bird added that using an old SE for the OPS VO amounts to testing an unrealistic situation. The tests should be redesigned to also check, for instance, which version of the SE is used.

 

The MB proposed that M.Schulz present a summary of the status of the SAM tests and of what they currently test, e.g. which SRM endpoints are tested.

 

M.Schulz will present the status of the SAM tests in a couple of weeks.

 

M.Ernst asked that the values for BNL be recalculated taking into account the values of the last few months; BNL is incorrectly underestimated.

 

New Action:

29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

5.2      Tier 2 Reliability and Availability (Tier-2 Data)

I.Bird proposed that, starting next month, the MB also verify the reliability and availability status of the Tier-2 sites.

Attached is the data collected for January 2008.

 

R.Pordes added that OSG needs contacts with D.Collados and J.Templon for the tests that OSG is preparing; they should review this development. R.Quick should contact them.

5.2.1       US Sites

I.Bird asked the opinion of M.Ernst, as representative of one of the two US Tier-1 sites (FNAL was not represented), about R.Gardner’s request to account for all the US sites but to check the reliability only of the sites mentioned in the MoU. This would imply that the MoU pledges could be met while counting resources that are not at the required level of availability.

 

M.Ernst replied that everything that is pledged should be provided by the sites that signed the MoU. All other resources are the result of opportunistic usage and should not be counted when checking the MoU pledges.

5.2.2       Italian Sites

J.Gordon noted that all Italian sites are included in all Italian federations, as almost all of them provide resources to the four LHC Experiments. This is an unusual situation: not all the sites have signed the MoU, but INFN has signed it for all its sites.

I.Bird recommended that APEL make sure that resources are not double-counted across VOs.

 

L.Dell’Agnello agreed to provide feedback on the accounting of the Italian sites and their assignment to VO-specific federations.

 

G.Merino noted that the reliability calculations seem incorrect and should be verified.

 

New Action:

29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (they are lower than the availability values).

 

6.   AOB

 

No AOB.

 

7.   Summary of New Actions

 

New Action:

26 Feb 2008 - Discuss and agree on a milestone for the Tier-1 sites to provide tape efficiency metrics before CCRC08 in May.

 

New Action:

26 Feb 2008 - The Sites and Experiments should provide an updated list of their contacts (correct emails, grid operators’ phones, etc).

Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

New Action:

26 Feb 2008 - A.Aimar will verify the values of the GridView reliability calculations for NL-T1.

 

New Action:

29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

 

New Action:

29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (they are lower than the availability values).