LCG Management Board

Date/Time:

Tuesday 11 December 2007 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22190

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 13.12.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, S.Foffano, C.Grandi, F.Hernandez, R.Kalmady, M.Kasemann, M.Lamanna, E.Laure, H.Marten, H.Meinhard, G.Merino, P.Nyczyk, B.Panzer, L.Robertson (chair), A.Sciabá, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 18 December 2007 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

A few comments were received; they are highlighted in blue in Section 2 of the MB minutes of 4 Dec 2007:

-       RAL: OPN representative specified (J.Gordon)

-       DE-KIT: clarification about the 2008 pledges (H.Marten)

1.2      Matters Arising

Two other issues followed last week’s meeting and generated some email exchanges during the week:

-       What happens if:
a) Sites reduce their pledges, and how should they inform the Experiments?
b) Experiments change or rebalance their requirements with respect to the Sites’ pledges?
c) By when, and how, must these changes be announced?


L.Robertson noted that it seems that on a couple of occasions the computing coordinators were not informed of pledge reductions at Tier-1 sites.

L.Dell’Agnello clarified that the national representatives for the Experiments were informed of the changes at CNAF, well before the RRB meeting.

D.Barberis added that the rebalancing by the Experiments should usually stay within the pledges. If the requirements exceed the agreed pledge of a site, this must be specifically addressed with that site.

-       Many sites have not replied to S.Foffano about their pledges; therefore their pledges are considered confirmed.

L.Robertson reminded the MB that the agreed pledges are considered a commitment and should be honored. Any change will have to be discussed at the RRB meeting in October of the previous year.
J.Templon clarified that the NL-T1 site will have the necessary resources and the pledges will normally be fulfilled, unless the use of resources at the site is drastically lower than what is already available.

1.3      Site Reliability Report - November 2007 (Site Reports; Slides)

A.Aimar distributed the Site Reports for November 2007.

 

The table below shows the sites’ reliability since January 2007.

In November 2007, 9 sites were at or above the 91% target and another 2 were above 82% (90% of the target).

 

Site            Jan07  Feb07  Mar07  Apr07  May07  Jun07  Jul07  Aug07  Sep07  Oct07  Nov07

CERN               99     91     97     96     90     96     95     99    100     99     98
DE-KIT (FZK)       85     90     75     79     79     48     75     67     91     76     85
FR-CCIN2P3         96     74     58     95     94     88     94     95     70     90     84
IT-INFN-CNAF       75     93     76     93     87     67     82     70     80     97     91
UK-T1-RAL          80     82     80     87     87     87     98     99     90     95     93
NL-T1              93     83     47     92     99     75     92     86     92     89     94
CA-TRIUMF          79     88     70     73     95     95     97     97     95     91     94
TW-ASGC            96     97     95     92     98     80     83     83     93     51     94
US-FNAL-CMS        84     67     90     85     77     77     92     99     89     75     79
ES-PIC             86     86     96     95     77     79     96     94     93     96     95
US-T1-BNL          90    57*     6*    89     98     94     75     71     91     89     93
NDGF              n/a    n/a    n/a    n/a    n/a    n/a    n/a    n/a    n/a     89     98

Target             88     88     88     88     88     91     91     91     91     91     91
Target
+ 90% target      5+5    6+3    4+1    7+3    6+3    3+2    7+2    6+2    7+2    5+4    9+2

 

The averages are 95% for the 8 best sites and 92% across all sites:

 

·         Avg. 8 best sites: May 94% Jun 87% Jul 93% Aug 94% Sept 93% Oct 93% Nov 95%

·         Avg. all sites:       May 89% Jun 80% Jul 89% Aug 88% Sept 89% Oct 86% Nov 92%
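As a cross-check of the figures above, a minimal Python sketch (illustrative only, not part of the report) showing how the November averages and the “9+2” count against the 91% target follow from the table:

# Illustrative only: recomputes the November 2007 summary figures quoted above
# from the reliability table (values in percent; the month's target is 91%).
november = {
    "CERN": 98, "DE-KIT (FZK)": 85, "FR-CCIN2P3": 84, "IT-INFN-CNAF": 91,
    "UK-T1-RAL": 93, "NL-T1": 94, "CA-TRIUMF": 94, "TW-ASGC": 94,
    "US-FNAL-CMS": 79, "ES-PIC": 95, "US-T1-BNL": 93, "NDGF": 98,
}
target = 91

values = sorted(november.values(), reverse=True)
avg_all = sum(values) / len(values)                               # ~92%
avg_best8 = sum(values[:8]) / 8                                   # ~95%
n_at_target = sum(v >= target for v in values)                    # 9 sites
n_near_target = sum(0.9 * target <= v < target for v in values)   # 2 sites

print(f"avg all sites: {avg_all:.0f}%, avg 8 best: {avg_best8:.0f}%")
print(f"at/above target: {n_at_target}, above 90% of target: {n_near_target}")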

 

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

·         21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disks and tapes until April 2008

 

NDGF and NL-T1 should send to H.Renshall and S.Foffano an estimate about the delivery of their 2008 capacity.

 

Update: NL-T1 sent a mail giving November 2008 as the estimate for making the 2008 capacity available.

 

·         30 November 2007 - The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.

 

Not done. Received so far from TW-ASGC, FR-CCIN2P3 (Jerome Bernier), IT-INFN (Stefano Zani) and RAL (Robin Tasker).

 

·         14 Dec 2007 - L.Dell’Agnello, F.Hernandez and G.Merino prepare a questionnaire/checklist for the Experiments in order to collect the Experiments’ requirements in a form suitable for the Sites.

 

In preparation. Being discussed with the storage and system administration experts at the respective sites.

 

3.    SRM 2.2 Weekly Update (Slides) - J.Shiers

 

J.Shiers presented a summary of the SRM 2.2 deployment progress.

 

The tables below are examples of the Sites’ and the Experiments’ respective views of the SRM installations and configurations. These tables will be used to summarize the status of all sites in a concise view.

 

Site   SRM v2.2?   Space Management?   Endpoints published?   Space configuration per token & VO
XXXX   Y|N         Y|N                 Y|N                    Link

 

Experiment   DM framework SRM v2.2 ready   Tested client tools against production endpoints   Tested FTS transfers against production endpoints
YYY          Y|N                           Y|N                                                 Y|N

 

The main points described are:

-       More than 50% of the sites now have SRM 2.2 deployed.

-       No new bugs have been found with the client tools.

-       There is a bug in the FTS (not marking the file as permanent); it is solved but not released yet.

-       There is a bug in SRMCopy; the temporary solution is to use Globus URLCopy.

-       The Experiments have specified their storage class requirements.

-       There is an issue about the space for recalled files. It will not be solved for now: first the issue must be understood and experience collected; later a standard solution will be designed for all implementations.

 

Below is the slide presented.

 

       SRM v2.2 production deployment is proceeding on schedule without hiccoughs

      Typically 1 to 1.5 days per (dCache) site, including other housekeeping operations

       NDGF, FZK, SARA, IN2P3 (done), FNAL (deferred), others in coming week.

       CERN, RAL (done), CNAF?

       Bugs tracked by the EMT with high priority

      No new bugs with client tools; Bug fixed in FTS (not yet through release procedure), new issue when using SRMCopy between sites with different SRM versions. A working solution is to use Globus URLCopy at this stage.

       Information from experiments on storage classes at sites now available!

      Needs to be collated in a single place

      See next slide for status

Issue: space for files recalled does not seem to be well defined (knowledge is “lost”)

      Workaround then long-term solution

      Standard behaviour is essential! (and was (re-)agreed at con-call)

      We need practical experience – we cannot make decisions based on interpretation alone!

       (More a CCRC’08 issue: have to expect fast-track bug fixes during CCRC’08 and run-up, i.e. AA copy of client tools and/or pilot FTS etc.)

 

 

L.Robertson asked whether the installation at CNAF is completed or not (because there was a question mark in the slide above).

L.Dell’Agnello explained that the endpoints for CASTOR SRM 2.2 are working at CNAF. They still need to agree with the Experiments that these should use the new endpoints. The agreement must be made before 17 December because no changes can be performed during the Christmas period.

 

L.Robertson asked whether the Experiments now have enough sites with SRM 2.2 to start using them for test transfers.

J.Shiers replied that some combinations useful to the Experiments are now possible. The tables above should help visualize the situation by Site and by Experiment.

 

4.    Update on CCRC-08 Planning (Slides) - J.Shiers

The action list agreed at the F2F CCRC Meeting of the previous week is shown below.

 

1.     Fabio, Gonzalo & Luca to produce their view of requirements from Tier1 sites

 First iteration presented at ATLAS Tier1 Jamboree

2.     Consolidation of SRM v2.2 space token requirements

To be prepared this week, before the next SRM v2.2 con-call

See also SRM report

3.     Decision on monitoring / logging / reporting

We have all the input – meeting of the experts to make concrete proposal → next Monday’s con-call + iteration at F2F

4.     Information flow during the challenge

Most likely an SC-style daily meeting for Jan / Feb, until it can be folded back into the daily operations meeting

Put in place reporting from VOs → weekly operations meeting as from start of 2008

5.     Last planning meeting of 2007 – Monday 17th @ 17:00 – update on the above plus planning for January 10th F2F

 

 

 

5.    Status of the VO-specific SAM tests (VO tests results - new and old)

5.1      CMS (Slides) - A.Sciabá

5.1.1       SAM in the CMS Tests

SAM is used in CMS to test the basic functionalities which are needed by the CMS workflows both for Monte Carlo production and for Analysis. In this context the SAM tests are used to test both EGEE and OSG sites.

 

Submission is in all cases done through the LCG Resource Broker. Two sensors are used so far: the CE and the SRM sensors. The tests are run only at specific sites, essentially all CMS Tier-N sites plus a few others.

 

CE Tests

CMS has been submitting custom tests for the CE since the beginning of 2007. Tests are submitted every two hours. All tests are run with the lcgadmin role, except the MC test, which is run with the production role.

 

Test name        What it does
job submission   Submits a job to the CE
Basic            Checks that the CMS local site configuration is OK
Swinst           Checks that the CMSSW installations are OK and that all versions needed for MC production are there
Monte Carlo      Checks that it is possible to stage out a file from the worker node to the local storage
Squid            Checks that the local Squid server works
FroNtier         Reads calibration data using CMSSW via the local Squid server

 

SRM Tests

Since June 2007, CMS has used custom tests for SRM v1. File transfers are done via srmcp and the “production” role is used.

They have a dependency on the PhEDEx database.

 

Tests for SRM v2 are in development. There are no tests for the SE, and CMS does not see a reason to use both the SE and the SRM sensors in SAM.

 

Test name          What it does
get-pfn-from-tfc   Finds the LFN-to-PFN translation rule for that SRM in the PhEDEx database
Put                Copies a local file to the remote SRM via srmcp
Get-metadata       Queries the file metadata from SRM
Get                Copies back the remote file and compares it with the original one
advisory-delete    Tells the SRM that it can delete the test file

 

5.1.2       WLCG Availability Calculation

The WLCG availability is determined by the choice of the critical tests.

 

CE

-       Job submission: the CE is unavailable if it cannot run a CMS job via RB [run by CMS]

-       CA certs: the CE is unavailable if it does not have the correct CA certificates [run by ops]

-       VO tag management: the CE is unavailable if the publication of experiment tags does not work [run by ops]

 

SRM

-       Put: the SRM is unavailable if it is not possible to copy a file on it via srmcp [since 10/12/07]

 

SE, FTS, RB, etc.

-       No critical tests defined

 

There are some problems with the WLCG availability calculation:

The availability calculation in GridView is wrong (bug #31233). If a service type stops having critical tests, all its instances will have status UNKNOWN, but the combined service status is not updated any more.

This is serious: CMS stopped having critical tests for the SE on 13/11, and since then the SE status has been frozen at what it was immediately before.

 

The impact on the Tier-1 global availability is "random":

-       ASGC: SE always available, no impact

-       CERN-PROD: SE always unavailable, serious impact (always red)

-       FNAL: SE status UNKNOWN, no impact

-       FZK: SE status UNKNOWN, serious impact (always grey)

-       IN2P3: SE always on maintenance (not for real!), serious impact (always yellow)

-       INFN-T1: same as FZK

-       PIC: same as FZK

-       RAL: same as IN2P3

 

In other words, the Tier-1 WLCG availability for CMS has been wrong for about one month.

 

R.Kalmady explained that this problem is fixed and a new GridView version will be released in the next few days.

The graphs will be regenerated when possible. In some cases the hosts have been modified and there is no data for some of them.

 

There is also a problem in the new WLCG availability algorithm (bug #31233)

-       If a service has no critical tests defined, the status is UNKNOWN, but

-       If a VO says that no test is critical for a service type, it means that that service is always available for them (unless it is on maintenance of course)

-       Therefore, if e.g. the SE has no critical tests, all SEs should be always OK

 

R.Kalmady asked what should be assumed if a service has no critical tests defined: should it be considered up?

A.Sciabá and Ph.Charpentier proposed to consider the service as working.

 

R.Kalmady asked whether it should be considered “green” or whether there should be a new state.

L.Robertson and Ph.Charpentier noted that there should be a status showing that the service is “not tested”, but that it should be propagated as “OK” in the calculations.

 

Decision:

The MB agreed that the service should be in a state “irrelevant” or “not tested” and should be ignored (considered “up”) in the calculations.

 

R.Kalmady asked what happens if a not-tested service is scheduled down: should it be marked as such?

 

Decision:

The MB decided that when a service without critical tests is scheduled down it should be marked “scheduled down” and that status should be propagated.
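To make the agreed behaviour concrete, a minimal Python sketch of the status logic decided above (illustrative only; the function and status names are assumptions, not the actual GridView/SAM implementation):

# Minimal sketch (illustrative only) of the status logic agreed above:
# a service with no critical tests defined is marked "not tested" but counts
# as available, a failed critical test makes the service unavailable, and a
# scheduled downtime is always propagated.
def service_status(critical_test_results, scheduled_down=False):
    """critical_test_results: dict mapping critical test name -> True (pass) / False (fail)."""
    if scheduled_down:
        return "SCHEDULED_DOWN"   # propagated even for services without critical tests
    if not critical_test_results:
        return "NOT_TESTED"       # no critical tests defined
    return "UP" if all(critical_test_results.values()) else "DOWN"

def counts_as_available(status):
    # "NOT_TESTED" is ignored in the calculation, i.e. treated as available.
    return status in ("UP", "NOT_TESTED")

# Example with the CMS CE critical tests listed in section 5.1.2:
print(service_status({"job submission": True, "CA certs": True, "VO tag management": True}))  # UP
print(service_status({}))                       # NOT_TESTED
print(service_status({}, scheduled_down=True))  # SCHEDULED_DOWN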

 

G.Merino asked why the SE service is still included, given that the SRM service is provided by all the Tier-1 sites and the SE service is a legacy.

 

There are BDII inconsistencies at FNAL:

-       FNAL publishes its resources under different "GLUE" sites:
USCMS-FNAL-WC1: contains the SE and the SRM. This is the only site known to GridView.

-       uscms-fnal-wc1-ce: contains one CE

-       uscms-fnal-wc1-ce2: contains another CE

The effect: the FNAL WLCG availability ignores the status of the CEs; therefore there is a possible overestimation of the availability if a CE is down.

5.1.3       CMS Availability

CMS has been using its own custom definition of availability for internal use.

It is calculated as the daily fraction of CMS SAM tests for the CE that were successful:

-       No test is really "critical"; every failure just slightly degrades the estimate.

-       The SRM tests are not included in the calculation.

The calculation is done by a script run by hand.
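A minimal sketch (Python, illustrative only; not the actual CMS script) of the internal availability figure described above:

# Illustrative sketch only: the internal CMS availability is the daily fraction
# of successful CE SAM test results, with no "critical" tests and SRM left out.
def cms_daily_availability(ce_test_results):
    """ce_test_results: list of booleans, one per CE SAM test result that day."""
    if not ce_test_results:
        return None            # no results: availability undefined for that day
    return sum(ce_test_results) / len(ce_test_results)

# e.g. 7 successes out of 8 results on a given day -> 0.875
print(cms_daily_availability([True] * 7 + [False]))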

 

A new, more WLCG-like calculation has been implemented in the ARDA dashboard:

-       The algorithm is very similar to the WLCG one.

-       All CMS tests are taken as critical, for now.

 

See slides 13 and 14 for the differences in the availability calculated with the new and the old definitions.

 

The choice of critical tests for WLCG is constrained by the fact that if a CE fails a critical test it is also removed from the BDII by FCR. Therefore the choice must be careful and conservative.

For CMS, a test might be critical if its failure prevents some high-level workflow from working. Therefore the choice can include other tests (e.g. jobs run, MC is OK, access to the calibration DB).

5.1.4       SAM Tools

The FCR is the only tool to set the critical tests and to whitelist or blacklist specific site instances.

There are two problems:

-       OSG sites are not included any more, although they were some time ago.

-       The only service types supported are the CE and the SE, but not the SRM.

 

The standard SAM web interface is inadequate and has been basically frozen for several months. It does not show EGEE and OSG sites together, does not allow showing only "real" CMS sites, and has some bugs in the history view.

 

CMS has turned to the ARDA dashboard team in order to have a better graphical interface. It is very easy to put new features in place, and the work can easily be reused by other VOs.

5.1.5       Future Plans

-       Add more tests, in particular "Analysis" tests including read access to local data and stage out to remote storage.

-       Feed the CMS availability back into SAM as another SAM test for easy viewing. Plug the CMS SAM tests into the site monitoring; tools are in development in the SAM group at CERN.

 

F.Hernandez noted that the Experiments are using several visualization tools and that this diversity is not easy for the sites to use.

I.Bird commented that the Sites should include the SAM results in their site monitoring systems, as is also suggested by the Monitoring Working Group.

J.Templon added that there should be tests corresponding to the MoU requirements so that a site can monitor whether it is within the MoU agreements.

5.2      LHCb (Slides) - Ph.Charpentier

5.2.1       LHCb and SAM Tests

 

LHCb Critical Tests:

Dedicated jobs (SAM jobs)

-       Check site capabilities

SW repository access rights

Check correct sgm account mapping

Verify the platform, deployed middleware (lcg-utils version)

-       Installs LHCb software

from a list of current releases of applications

installation on the SW repository (shared area)

-       Runs test applications

simulation, digitisation, reconstruction on 10 events

analysis

 

Not yet in Production:

Dedicated tests (cron jobs)

-       FTS transfers (full matrix)

-       SRM tests (response time)

 

Information gathered by operations

-       SE performance (staging time, SRM response)

-       transfer error analysis

5.2.2       Tests Execution and Logging

Slide 3 shows how the tests are submitted through DIRAC. They target all CEs accepting LHCb jobs, report to the SAM DB and upload the log files to the DIRAC log system.


Slide 4 shows how the history of the job can be retrieved and one can also check the log file of the execution.

 

Slide 5 shows that GridView is reporting a difference with the new algorithm. There is a bug in the reliability calculation.

 

R.Kalmady acknowledged that there is a bug in the reliability display.

 

Slide 6 shows that LHCb reports only the status of the CE, but, in order to correct for the services that have no tests, LHCb has added some dummy tests (as already discussed).

 

The SAM DB keeps knowledge of all past tests; a clean-up is really needed.

5.2.3       Comments on GridView

-       The list of Tier-1 sites does not correspond to those serving LHCb. How can one restrict the Tier-1 sites to those relevant?

 

-       Why are NIKHEF and SARA shown as two Tier-1 sites? It should be the NIKHEF CE with the SARA SE.

 

J.Templon added that NL-T1 actually also has CEs at SARA, and that the combination is more complicated than having CEs at NIKHEF and SEs at SARA.

P.Nyczyk added that the calculation in GridView is simpler than the one the VOs and the sites seem to expect, but making it more customizable is a lot of work (which is being started).

 

-       SE and SRM states should NOT be the “OR” of instances. It is not enough.
For a site to be usable, one needs all its SEs and SRMs to be available. How can one define those that are needed by a given VO?

More generally: would it be possible to define the “availability logic”? Why should there be CE, SE and SRM and only those?
E.g. how to include availability of VO services (LFC, VOBOXes)?

 

P.Nyczyk added that these principles are the ones he presented at the GDB, but they imply major changes in SAM and it will take time to study and implement them. Maybe the right place for these calculations is the dashboard, where this is already implemented.
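As an illustration of the per-VO “availability logic” being requested, a minimal Python sketch (purely illustrative; the instance names and required-service lists are hypothetical, not SAM or GridView identifiers):

# Purely illustrative sketch of a per-VO "availability logic": a site is usable
# for a VO only if every service instance that the VO needs at that site is up,
# instead of taking the OR over all instances of a given service type.
REQUIRED_INSTANCES = {
    ("LHCb", "NL-T1"): ["CE@NIKHEF", "CE@SARA", "SE@SARA", "SRM@SARA"],
    # VO-specific services (e.g. an LFC or a VOBOX) could be listed the same way.
}

def site_usable_for_vo(vo, site, instance_status):
    """instance_status: dict mapping instance name -> True (up) / False (down)."""
    needed = REQUIRED_INSTANCES.get((vo, site), [])
    return all(instance_status.get(name, False) for name in needed)

# Example: the site counts as unusable if any required instance is down.
print(site_usable_for_vo("LHCb", "NL-T1",
                         {"CE@NIKHEF": True, "CE@SARA": True,
                          "SE@SARA": True, "SRM@SARA": False}))  # False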

 

-       Ergonomics of the queries: the menus are not really convenient and selecting time ranges is cumbersome (limited to 31 days!).

 

J.Templon questioned the fact that VOBOXes could be used for site availability while they are completely under the responsibility of the VO. NL-T1 had a problem with the ALICE VOBOX at SARA. If the VOBOX reliability is low, unless it is a hardware problem, the responsibility should not lie with the site but with the VO.

 

 

6.    HEP Benchmarking (Slides) - H.Meinhard

H.Meinhard presented a summary of the activities of the HEPiX working group on Benchmarking.

 

In autumn 2006 the IHEPCCC chair contacted the HEPiX conveners about help on two technical topics: file systems and CPU benchmarking.

The CPU Benchmarking group was set up in April 2007, but its work mostly started at the HEPiX meeting in St Louis in November 2007.

The focus is initially on benchmarking the processing power of worker nodes. People who can temporarily spare a WN machine will announce this to the list.

 

A standard set of benchmarks (the SPEC benchmarks) should be run, and the group also seeks the collaboration of the Experiments in order to check how well real HEP code scales with the industry-standard benchmarks.

 

The environment is fixed to the one in use by the Experiments (SL4 x86_64, 32-bit applications compiled with gcc 3.4.x), with the compilation options agreed by the LCG Architects’ Forum, and should also be used to evaluate multi-threaded benchmarks vs. multiple independent runs. H.Meinhard proposes an interim report at the HEPiX meeting at CERN in May 2008.

 

L.Robertson reminded the MB that the adoption of new benchmarks is very urgent for the Experiments and for the Sites in order to specify the requirements and proceed with the tenders. In October, after the presentation by M.Michelotto, it was agreed to prepare a range of machines on which to run the SPEC 2006 benchmarks and the applications from the Experiments. The current time scale seems too late for the needs of the LCG.

 

H.Meinhard agreed that SPEC 2006 is the most interesting test to verify and should be run sooner. The Experiments will initially run their jobs themselves on the benchmarking machines. Later they could package their benchmark applications so that they can be run without the Experiments’ experts.

 

L.Robertson reminded the MB that the new benchmarks should be adopted by early March because they are needed for the Resources Scrutiny Group meeting in March.

 

Ph.Charpentier noted that the benchmarks are needed also because:

-       the Experiments will have to re-calculate their requirements using the new unit and

-       the Sites will have to use it for their pledges and tenders.

 

Action:

H.Meinhard agreed to ask the working group to proceed quickly with the preparation of some hosts and to report progress to the LCG MB. He will report to the MB in January on the setup and the initial benchmarking.

 

Ph.Charpentier proposed that, when the new unit is known, all worker nodes should be evaluated under this new unit.

 

Action:

Experiments should nominate who is responsible for the benchmarking of their applications on the machines made available by the HEPiX Benchmarking Working Group.

 

7.    AOB

 

 

No AOB.

 

8.    Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.