LCG Management Board

Date/Time:

Tuesday 15 January 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22195

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 18.1.2008)

Participants:

A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, F.Carminati, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, U.Marconi, H.Marten, P.Mato, H.Meinhard, G.Merino, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 22 January 2008 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Issues during the Holidays Period

No relevant issues were reported about operations during the holiday period.

1.3      Metrics for measurement of tape performance: Feedback from Tier-1 Sites (Proposal)

During the previous meeting T.Bell presented the metrics used at CERN and proposed that, if possible, some of those metrics should also be collected by the Tier-1 sites. The Tier-1 sites were asked to comment on whether they can collect the same metrics.

 

J.Templon and F.Hernandez commented that they have asked their storage experts for feedback.

M.Ernst added that the HPSS monitoring system is already collecting the same metrics. BNL will collect data in the next few months and make it available.

J.Gordon noted that RAL, since it runs CASTOR, can probably just reuse the same scripts used at CERN.

 

M.Kasemann asked how these metrics are going to be made available to the Experiments. This would allow the Experiments to see the performance of the system and the effects of the changes they make to their applications.

 

New Action:

21 Jan 2008 - The LCG Office should define where (a web area, wiki, SharePoint?) the Sites can upload their statistics about their tape storage performance and efficiency.

1.4      High Level Milestones: Update needed (HLM Dashboard)

In order to prepare for the QR reports at the end of January A.Aimar asked for an update of the High Level Milestones.

 

New Action:

24 Jan 2008 – Sites should update the High Level Milestones. A.Aimar will send a reminder via email during the week.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 18 Dec 2007 - LHCb should nominate who is responsible for the benchmarking of their applications on the machines made available by the HEPiX Benchmarking Working Group.

Done. Ph.Charpentier will be included in the mailing list until LHCb nominates someone.

  • 11 Jan 2008 - H.Meinhard will distribute the information about HEP Benchmarking to the contacts for benchmarking in the Experiments.

Done.

  • 10 Jan 2008 - A.Aimar will ask for the Site Reports for December 2007. Experiments should review their specific SAM tests and see whether the GridView summary is correct.

Done.

  • 14 Jan 2008 - A.Aimar will communicate the names of the OPN Reps. to D.Foster and ask him which Tier-1 sites are not properly represented, in order to address them directly.

Done.

  • 9 Jan 2008 - T.Bell will distribute to the MB a set of metrics that could also be measured by all Tier-1 Sites in order to describe the performance of the tape storage systems in a uniform way.

Done.

  • 13 Jan 2008 – Tier-1 Sites should report on whether they can collect the metrics on tape performance as proposed by the Tier-0 (T.Bell).

In progress. Sites should report at the next meeting.

 

3.    SRM v2.2 Production Deployment Update - J.Shiers

 

Report from J.Shiers:

 

 

The target of deploying SRM v2.2 at the Tier0 and all "major" (self-defining) Tier1s by end-2007 has been met.

 

Configuration, testing and production use by the experiments are now being followed as part of CCRC'08 preparation (and execution).

 

We will need to revisit SRM v2.2 in this context after the February run of CCRC'08 to discuss and agree priorities and schedules for fixes and - if necessary - enhancements that are consistent with the time and manpower available.

 

 

From next week this item will be part of CCRC'08. The SRM meetings will continue to be held every two weeks.

 

M.Ernst noted that not all sites have enabled dCache space management.

J.Shiers replied that this kind of issue will be part of the CCRC preparation work.

 

4.    CCRC-08 Update (Slides; Agenda) - J.Shiers

 

4.1      Overview

CCRC’08 planning has changed to more frequent follow-up of progress and issues:

-       Monthly F2F

-       Weekly planning con-calls

-       Daily follow-up calls with ‘fixed’ agenda.
Maximum 15’, but agenda can be adjusted based on experience.

 

The list of middleware and storage software required at sites is now defined.

 

Deploy in 3 phases:

1.    What is available now (CE, WN, WMS, RB, dCache, CASTOR)

2.    What can be done now in preparation (LFC -> SL4).
This is an error: the first version supported on SL4 is the one in certification.

 

M.Schulz explained that actually the 32-bit version is certified but CERN chose to install the 64-bit version, which is not certified yet.

 

3.    What needs to wait for production release (DM client tools, LFC, DPM & FTS).
Expectation is for client tools to be released to production next Monday, following testing by experiments in PPS.

 

M.Schulz clarified that FTS is certified and LFC will be certified within a week.

 

The Experiments’ requirements for storage configuration are also defined; some information is still missing from CMS.

 

The draft agenda for the in-depth review at the WLCG Collaboration Workshop is available:

http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=6552#2008-04-22

4.2      Critical Issues and Goals

The critical issues below will be addressed in upcoming weekly planning meetings:

1.    FTS bug fix patch level 1589 (files transferred are treated as volatile – this is fine for February but does not test how we will really run)

2.    LFC bulk deletes are also important but could, if necessary, come a little later.

3.    Site configuration for storage still needs further clarification.

-       This is a difficult area and clearly one where we expect to gain much from the experience of the February run and its preparation!

 

There are still some areas where there are no clear metrics (waiting to test in May against an agreed metric leaves no time for correction). In addition, the specific goals for the Tier-0 and Tier-1 sites need to be defined.

 

The goal for the February challenge is to move to the next level of problems, discovering and fixing them for the May challenge.

 

For reference here are the GSSD-CCRC08 links for the LHC Experiments:

 

https://twiki.cern.ch/twiki/bin/view/LCG/GSSDALICECCRC08 (ALICE)

https://twiki.cern.ch/twiki/bin/view/LCG/GSSDATLASCCRC08 (ATLAS)

https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCMSCCRC08 (CMS)

https://twiki.cern.ch/twiki/bin/view/LCG/GSSDLHCBCCRC08 (LHCB)

 

 

5.    WLCG Service Interventions / Interruptions (Slides) - J.Shiers

J.Shiers summarized the issues of (1) the response times expected by the Experiments and (2) what could be achieved by defining clear procedures and common standards.

 

Note: The Slides contain more detail than what was actually discussed at the meeting, which is summarized below.

 

Some Experiments (CMS and LHCb) have requested a maximum downtime of 30’ for their most critical services.

 

As has been stated on several occasions, including at the WLCG Service Reliability workshop and at the OB, a maximum downtime of 30’ is impossible to guarantee at an affordable cost. Even intervening within 30’ is not obvious.

 

A realistic time for intervention (out of hours) is 4 hours, and even this does not guarantee that a solution is found within a certain time.

But much can be done in terms of reliability – by design.

 

There are standard techniques that work very well and are extensively used all over the world (a small sketch of the first one follows the list):

-       DNS load balancing

-       Oracle “Real Application Clusters” & Data Guard

-       H/A Linux (less recommended, because it is not really H/A)
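
For illustration only, a minimal sketch of how a client sees a DNS load-balanced service alias: the alias publishes several A records, so a client that cannot reach one address can retry another. The alias name used below is hypothetical and is not taken from the minutes.

```python
# Minimal illustrative sketch (not from the minutes): resolving a DNS
# load-balanced alias. A load-balanced alias publishes several A records,
# so a client that fails to reach one address can retry another one.
# "myservice.example.org" is a hypothetical name.
import socket

def resolve_all(alias, port=443):
    """Return every IP address currently published behind the alias."""
    infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for ip in resolve_all("myservice.example.org"):
        print(ip)
```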

 

Standard operations procedures must be in place:

-       Contact name(s); basic monitoring & alarms; procedures; hardware matching requirements;

-       Sound hardware naming and labeling:

-       Make use of aliases to facilitate hardware replacement.

-       Have a “good” name on the sticker (e.g. all lxbiiii machines may be switched off by hand in case of a cooling problem).

 

Work must be done right from the start (design) through to operations (much harder to retrofit if done too late).

-       Reliable services take less effort to run than unreliable ones

 

The middleware of at least one WLCG service (VOMS) does not currently meet the stated service availability requirements.

 

Also, ‘flexibility’ not needed by this community has sometimes led to excessive complexity (“the enemy of reliability”), as in the case of the WMS.

One will also need to work through the experiment services using a ‘service dashboard’, as was done for WLCG services (see the draft service map).

 

There will be one day (Monday 21st April) dedicated to following up on the measured improvements in service reliability.

 

For the services where the guidelines are already implemented, production experience is consistent with the requests from the Experiments. We should use the features of the Grid to ensure that the overall service reliability is consistent with the requirements (i.e. no SPOFs). Individual components may fail, but the overall service can and should continue.

 

I.Bird asked the Experiments which issues need to be answered/fixed within 30’ and whether these should not rather be addressed by operational solutions (e.g. bigger buffers, multiple instances of services, etc.).

 

CMS: M.Kasemann replied that their data-taking can continue for a longer time without depending on IT services, thanks to their buffers at the pit. For CMS the impact of storage problems is on the access to the data for analysis at the Tier-0. Their judgement was that the interruption should not be more than 30 minutes; if this cannot be guaranteed, CMS will need to analyze how to tackle longer response times.

I.Fisk added that the input buffer into IT holds about a day of data; the size of the buffer should therefore be correlated with the response time. For instance, the buffer should cover at least twice the expected response time.
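
To make the buffer-sizing argument concrete, the following back-of-the-envelope sketch applies the "at least twice the expected response time" rule; the 4-hour response time comes from the discussion above, while the data rate is an illustrative assumption, not a CMS figure.

```python
# Back-of-the-envelope sketch of the buffer-sizing argument above.
# The 4 h response time is quoted in the minutes; the 300 MB/s rate is
# an illustrative assumption, not an official CMS number.

def min_buffer_hours(expected_response_h, safety_factor=2.0):
    """The buffer should cover at least 'safety_factor' times the expected response time."""
    return safety_factor * expected_response_h

def buffer_size_tb(data_rate_mb_s, hours):
    """Convert a buffering time into a capacity, given an average data rate."""
    return data_rate_mb_s * 3600 * hours / 1e6  # MB -> TB

if __name__ == "__main__":
    response_h = 4      # realistic out-of-hours intervention time (from the discussion)
    rate_mb_s = 300     # assumed average transfer rate into IT (illustrative)
    hours = min_buffer_hours(response_h)
    print(f"Buffer should cover >= {hours:.0f} h, "
          f"i.e. about {buffer_size_tb(rate_mb_s, hours):.1f} TB at {rate_mb_s} MB/s")
```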

 

G.Merino noted that the CMS list of critical services only included Tier-0 services: are the Tier-1 services not critical to CMS?

M.Kasemann replied that their document should be extended to services at the Tier-1 sites.

 

J.Shiers also added that a proper estimate should be made taking into account the frequency of problems that could cause a total blockage. In addition, he reminded the MB that “the solution” cannot be guaranteed within a given time; only the intervention can be guaranteed, and then a workaround must be found and deployed as quickly as possible. Automatic recovery and fail-over must be put in place in all cases where crucial downtimes are possible.

 

Follow-up in a few weeks:

I.Bird concluded that the way the Experiments want to address the problems of delayed interventions and solutions should be monitored in future meetings. Acceptable downtimes, buffer sizes and, above all, automated solutions should be defined in order to properly address these issues.

 

J.Templon noted that the EGEE TCG is going to discuss and decide on the scheme for assigning dynamic fair-shares to the VOs. This is a kind of flexibility that has never been used or configured by any VO and should not be a priority.

C.Grandi replied that sharing across groups inside a VO was requested by the Experiments at the TCG, and the TCG is discussing longer-term features.

 

I.Bird added that the LCG is not going to change any configuration because of a new feature introduced just now. The solution of using VOViews is sufficient for ATLAS. The TCG discussion is about a much longer-term scenario and will be evaluated by the LCG in due time.

 

D.Barberis clarified that ATLAS has given up some past requirements (on quotas, fair-shares, etc.) and this reduction of priorities should be reported to the TCG.

 

6.    Update on CPU Benchmarking - H.Meinhard

 

H.Meinhard reported on the progress of the HEP CPU Benchmarking working group:

-       Several different systems (seven different processors) are available at CERN

-       SPEC standard benchmarks have been run at CERN on those seven hosts over Christmas

-       Experiment representatives are now known and discussions have started with them in order to run their specific benchmark applications

-       Some initial problems in providing easy access for the Experiments are being addressed

 

7.    NDGF SAM Tests (Slides) - O.Smirnova

 

O.Smirnova presented an update about the specific SAM tests that have been developed in order to calculate the standard Site Availability at NDGF.

 

General Availability tests:

ARCCE-auth

-       Tests if the DN running the SAM tests can authenticate against the ARCCE (without sending a job)

ARCCE-caver

-       Tests if the list of supported CAs published by the ARCCE is consistent with the latest IGTF release

ARCCE-job-submit

-       Submits a test job to the cluster

ARCCE-softver

-       Finds the middleware version installed on the ARCCE

ARCCE-status

-       Checks ARCCE status published in the information system

 

These tests are run on the worker node from within the ARCCE-job-submit test (a minimal sketch of such a version probe follows the list):

ARCCE-csh

-       Tries to run a csh script

ARCCE-gcc

-       Finds the default installed gcc version

ARCCE-perl

-       Finds the default installed perl version

ARCCE-python

-       Finds the default installed python version
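
As a rough illustration of what such worker-node probes could look like (the actual NDGF SAM test scripts are not reproduced here), here is a minimal sketch that invokes each tool with its version flag and reports the first line of output.

```python
# Minimal illustrative sketch of an ARCCE-gcc / -perl / -python style probe:
# run the tool with its version flag and report the first line of output.
# This is not the actual NDGF SAM test code.
import shutil
import subprocess

def tool_version(cmd, flag="--version"):
    """Return the first line of '<cmd> <flag>' output, or None if cmd is absent."""
    if shutil.which(cmd) is None:
        return None
    result = subprocess.run([cmd, flag], capture_output=True, text=True)
    text = (result.stdout or result.stderr).strip()
    return text.splitlines()[0] if text else None

if __name__ == "__main__":
    for tool in ("gcc", "perl", "python"):
        print(f"{tool}: {tool_version(tool)}")
```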

 

Data management is done by the Grid Manager on the ARC CE; the tests check that it works and do not move any files themselves:

ARCCE-gridftp

-       Stage-in from and stage-out to a gridftp server

ARCCE-rls

-       Stage-in of a file registered in RLS, stage-out to one of the RLS servers default output locations with RLS registration

ARCCE-srm

-       Stage-in from and stage-out to an SRM server

ARCCE-wn-gridftp, ARCCE-wn-rls, ARCCE-wn-srm

-       Parts of the above tests that are run at the worker node

 

The ARC CE tests have also been developed to be equivalent to the standard LCG CE tests, as listed below.

 

ARCCE tests

-       ARCCE-auth

-       ARCCE-status

-       ARCCE-caver

-       ARCCE-csh

-       ARCCE-job-submit

-       ARCCE-gridftp, -rls, -srm

-       ARCCE-softver

 

LCG CE tests

-       CE-host-cert-valid

-       CE-sft-brokerinfo

-       CE-sft-caver

-       CE-sft-csh

-       CE-sft-job

-       CE-sft-lcg-rm

-       CE-sft-softver

 

Not everything can be tested by sending jobs. The tests are run via a client installation at grid.tsl.uu.se:

-       proxy of the server certificate

-       the list of services to test is obtained from the SAM database

 

The SAM database is filled from the ARC BDII by the central installation at CERN.

 

The following are exactly the same tests as for all the EGEE/WLCG sites, and run with OPS VO credentials:

-       Storage Element (SE)

-       SRM (SRM)

-       Site BDII (sBDII)

 

The NDGF tests are integrated in GridView.

 

J.Templon, who was one of the referees, said that he will look at the latest release, but it seems that all his comments were taken into account.

 

8.    Site Reliability Report - Dec 2007 (Site Reports 200712; Slides) – A.Aimar

 

All site reports for December 2007 are available (Link).

 

Sites > target (> 90% target): 8 + 2

Avg. 8 best sites:
Jun 87%, Jul 93%, Aug 94%, Sept 93%, Oct 93%, Nov 95%, Dec 96%

Avg. all sites:
Jun 80%, Jul 89%, Aug 88%, Sept 89%, Oct 86%, Nov 92%, Dec 87%

 

The table below shows the reliability data for the last 10 months.

 

Site                  Mar 07  Apr 07  May 07  Jun 07  Jul 07  Aug 07  Sept 07  Oct 07  Nov 07  Dec 07
CERN                      97      96      90      96      95      99      100      99      98     100
DE-KIT (FZK)              75      79      79      48      75      67       91      76      85      90
FR-CCIN2P3                58      95      94      88      94      95       70      90      84      92
IT-INFN-CNAF              76      93      87      67      82      70       80      97      91      96
UK-T1-RAL                 80      87      87      87      98      99       90      95      93      91
NL-T1                     47      92      99      75      92      86       92      89      94      50
CA-TRIUMF                 70      73      95      95      97      97       95      91      94      96
TW-ASGC                   95      92      98      80      83      83       93      51      94      99
US-FNAL-CMS               90      85      77      77      92      99       89      75      79      88
ES-PIC                    96      95      77      79      96      94       93      96      95      96
US-T1-BNL                 6*      89      98      94      75      71       91      89      93      44
NDGF                     n/a     n/a     n/a     n/a     n/a     n/a      n/a      89      98     100
Target                    88      88      88      91      91      91       91      91      91      91
Target + 90% target      4+1     7+3     6+3     3+2     7+2     6+2      7+2     5+4     9+2     8+2

 

The two sites below target were NL-T1 and US-T1-BNL, due to specific problems which have now been fixed:

-       SARA: the new version of dCache opened too many connections to PostgreSQL and the database had to be killed. The recovery took almost two weeks, to find the solution and to manually fix the database.

-       BNL: on the 14th of December the SAM tests started failing because of a change in the SRM configuration and publishing at the site. Help was requested directly from P.Nyczyk, who had meanwhile left for 3 weeks, instead of from the SAM support. The issue should be fixed by the end of the day.

 

F.Hernandez recalled that IN2P3 finds that their scheduled downtimes are not accounted for correctly by GridView. This has been reported several times in the last 6 months and is recorded in GGUS.

A.Aimar will check the exact situation with the GridView team and report back.

 

9.    AOB

 

The next meeting with the LHCC Referees is on 19 February 2008 at 12:00.

The agenda is being defined. Most likely the meeting will focus on (1) CCRC preparation and (2) metrics for CASTOR performance and reliability.

 

10.    Summary of New Actions

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.

 

New Action:

21 Jan 2008 - The LCG Office should define where (a web area, wiki, SharePoint?) the Sites can upload their statistics about their tape storage performance and efficiency.

 

New Action:

24 Jan 2008 – Sites should update the High Level Milestones. A.Aimar will send a reminder via email during the week.