LCG Management Board

Date/Time:

Tuesday 25 March 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27475  

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 27.3.2008)

Participants:

A.Aimar (notes), I.Bird (chair), T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, X.Espinal, I.Fisk, S.Foffano, F.Hernandez, C.Grandi, M.Lamanna, M.Livny, H.Marten, R.Pordes, R.Quick, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 1 April 2008 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       26 Feb 2008 - The Sites and Experiments should confirm to A.Aimar that they have updated the list of their contacts (correct emails, grid operators’ phones, etc). Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

Information confirmed only by:
Sites:                           ASGC, CERN, CNAF, FZK, NDGF, PIC, SARA, BNL
Experiments:               ALICE

 

Will be checked next week at the F2F Meeting.

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

Not done. The GridView team was asked, but the recalculation still needs to be implemented.

-       29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (being lower than availability).

 

In progress; will be fixed for March.
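For context, the relationship this action refers to can be sketched with the usual WLCG-style definitions (an illustrative sketch with made-up numbers, not GridView's actual implementation): availability counts all time in the denominator, while reliability excludes scheduled downtime, so reliability should never be lower than availability.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of all time the site was up."""
    return up_hours / total_hours

def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Uptime over the time the site was supposed to be up:
    scheduled downtime is excluded from the denominator."""
    return up_hours / (total_hours - scheduled_down_hours)

# Example: 600 h up out of a 720 h month, with 48 h of
# scheduled maintenance.
a = availability(600, 720)      # 600/720 ≈ 0.833
r = reliability(600, 720, 48)   # 600/672 ≈ 0.893
assert r >= a  # a reliability below availability signals a calculation bug
```

This is why reliability values lower than availability point to an error in the calculation rather than in the sites themselves.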

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed. Experiments should provide the read and write rates they expect to reach, in clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.

 

 

-       31 March 2008 - OSG should prepare site monitoring tests equivalent to those included in the SAM testing suite. J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.


 

Discussed later in the agenda.

 

3.   CCRC08 Update (Slides) – J.Shiers
 

J.Shiers presented an update of the CCRC08 activities.

3.1      Service Activities

Service

Issue

DBs

Disk failure on the ATLAS integration RAC → storage rebalancing. During this operation another disk failed → the rebalancing had to be redone. To protect against further failures this was done offline, and the ATLAS integration RAC was not available on Wednesday from 11:00 to 20:00 UTC+1.

Oracle published the 64-bit version of 10.2.0.4; it is deployed on a test RAC at CERN. The first standard tests were performed without problems. Some issues with OEM agents remain – the schedule for May is tight.

3D

PIC was down from Monday at 14.30 to Wednesday 15:00 (UTC+1).  Streams was re-synchronized after PIC came back. [ See below ]

CASTOR

Upgraded to 2.1.6-11 on ATLAS, CMS and LHCb.

GGUS

A bug in GGUS was found, sometimes blocking updates to GGUS coming from PRMS/Remedy. The bug is understood – fix pending.

ASGC

Start of downtime [UTC]: 18-03-2008 23:30 (Oracle upgrade + CASTOR2)

End downtime      [UTC]: 19-03-2008 10:00

CNAF

Start of downtime [UTC]: 19-03-2008 14:55 (Upgrade)

End downtime      [UTC]: 07-04-2008 14:55

 

Note: this downtime spans 20 days.

L.Dell’Agnello added that he will check this downtime.

 

Action:

L.Dell’Agnello will verify the downtime at CNAF (19.3 until 7.4.2008).

PIC

Start of downtime [UTC]: 15-03-2008 08:00 (Annual power maintenance)

End downtime      [UTC]: 20-03-2008 07:00

 

3.2      Experiments Activities

Not a lot of reports from the Experiments.

 

Experiment

Day

Issue

ATLAS

Mon

ATLAS is shifting its announced schedule to continue T1-T1 debugging this week and perform throughput tests next week (after Easter). They would like to see a simple interface to the T1 GOCDB downtimes where all T1 sites are summarised on one page.

Tue

Two activities: setting up the SRM2 CERN CASTOR environment and T1-T1 functional tests with NL-T1 as target. Tests to IN2P3 were mildly successful (two DDM problems found) while those to PIC will have to be repeated. Some channels are slow, so they are trying to tune the FTS configurations. The intention is to build a wiki page of suitable configurations for the T1-T1 combinations.

Wed – Fri

Nothing.

 

CMS

Mon

CMS has released ProdAgent 7.1 with improvements on cleanup at sites. There are still delays getting global-run data into the CAF. In T1-T1 commissioning only RAL-ASGC is missing.

Tue

This morning patches were applied to the CERN CMS Castor instance to improve the performance of their CAF data access.

Wed – Fri

Nothing.

 

ALICE

Mon - Fri

Nothing.

 

LHCb

Mon

An LHCb software week is taking place, so there is not much activity on the grid. Testing migration of T1D1? data from StoRM to tape under the IBM Tivoli Storage Manager.

Tue – Fri

Nothing.

 

3.3      Reminder of Next Meeting

Next CCRC’08 Face-to-Face Tuesday 1st April:

-       http://indico.cern.ch/conferenceDisplay.py?confId=30246

-       Site focussed session in the morning then Experiment and Service focussed session in the afternoon.

21st – 25th April WLCG Collaboration Workshop (Tier0/1/2) in CERN main auditorium (90 people registered at 12:00 UTC):

-       http://indico.cern.ch/conferenceDisplay.py?confId=6552

-       Possible themes:

-       WLCG Service Reliability: focus on Tier2s and progress since November 2007 workshop

-       CCRC'08 & Full Dress Rehearsals - status and plans (2 days)

-       Operations track (2 days, parallel) – including post-EGEE III operations!

-       Analysis track (2 days, parallel)?

-       12th – 13th June CCRC’08 Post Mortem:
http://indico.cern.ch/conferenceDisplay.py?confId=26921 is in preparation. “Globe event” foreseen for week of 23 June.

 

4.   HEP Benchmarking (Slides) - H.Meinhard
 

H.Meinhard presented a summary of the status and progress of the HEP CPU Benchmarking working group.

 

The working group has regular phone conferences once every two weeks, more frequently when needed.

Currently there is a dedicated test cluster at CERN (of 7 machines) and a few machines available for benchmarking elsewhere (DESY Zeuthen, RAL).

 

CERN

-       2 x Nocona 2.8 GHz, 2 x 1 GB, E7320 board

-       2 x Irwindale 2.8 GHz, 4 x 1 GB, E7320 board

-       2 x Opteron 275, 4 x 1 GB, AMD-8132/8111 board

-       2 x Woodcrest 2.66 GHz, 8 x 1 GB, 5000P board

-       2 x Woodcrest 3 GHz, 8 x 1 GB, 5000P board

-       2 x Opteron 2218, 8 x 1 GB, HT2000/1000 board

-       2 x Clovertown 2.33 GHz, 8 x 2 GB, 5000P board

 

DESY Zeuthen:

-       2 x Harpertown 2.83 GHz

RAL:

-       2 x AMD Barcelona – not yet stable

 

The working group agreed on boundary conditions (OS, gcc, options, how to run SPECcpu), and the participation of the Experiments is very active.

 

The Experiments representatives are:

-       Atlas: Alessandro di Salvo, Franco Brasolin

-       CMS: Gabriele Benelli et al.

-       Alice: Peter Hristov

-       LHCb: Hubert Degaudenzi

 

Below are the benchmark results for the CERN cluster (7 machines) using SPECint2000 and SPECint2006.

 

Note: all plots below are normalized to the first machine (lxbench01).
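The normalization used in the plots is straightforward; a minimal sketch with illustrative scores (not the measured values from the slides):

```python
# Raw benchmark scores per machine (illustrative numbers only,
# not the measurements shown in the plots).
scores = {
    "lxbench01": 1408.0,
    "lxbench02": 1580.0,
    "lxbench03": 1702.0,
}

# Normalize every score to the reference machine lxbench01,
# as done for all plots below.
reference = scores["lxbench01"]
scaled = {host: s / reference for host, s in scores.items()}

assert scaled["lxbench01"] == 1.0  # the reference machine scales to 1
```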

 

[Plot: benchmark results scaled to lxbench01]

 

Below are details of SPECint2006, in 32 and 64 bits for integer and floating points.

 

[Plot: SPECint2006 results]

 

[Plot: SPECfp2006 results]

 

ATLAS and CMS have produced their first results (see graphs below).

 

The light blue bar is SPECint2000, which seems closer to the ATLAS applications and to the ATLAS Total than SPECint2006. These results are quite surprising and need to be verified carefully.

 

J.Templon noted that the other benchmarks are also, within a few percent, close to the ATLAS Total, and that therefore SPECint2006 could also qualify as a metric for CPU benchmarking.

 

 

Also for CMS (see graph above), SPECint2000 seems to scale more like the CMS applications.

 

More results from Experiments are expected in the coming weeks:

-       ATLAS is moving to the standard benchmark suite and plans to redo its tests

-       CMS: Analysis of the current results is ongoing

-       Results from ALICE and LHCb are awaited

 

The aim is also to extend measurements to more machines and run with perfmon (CPU performance/event counters).

 

I.Bird noted that the sites urgently need advice from this working group in order to decide which benchmarks to use for their procurement.

 

J.Templon asked whether the SPECint variant to use should be the “rate” version rather than the “base” version used by the working group.

H.Meinhard replied that the difference is not big, and that running multiple SPEC “base” copies (one per core) is similar to the way the Experiments run, i.e. as many applications as there are cores on each node.

 

The interim report on HEP Benchmarking will be presented at HEPiX in May and then to the F2F MB Meeting the week after.

 

5.   OSG Site Functional Tests (Status of OSG RSV) - Rob Quick (OSG)

 

R.Quick presented the status of the integration of the OSG test system (RSV) with the SAM site monitoring system developed at CERN.

 

The RSV records are uploaded into SAM; the process has been stable for several weeks. The few problems encountered have been solved.

An updated list of OSG resources is monitored and published to WLCG; each resource manager approves sending the test results to WLCG.

 

This is a combination of resources publishing information from the OSG BDII to WLCG (mainly US CMS) and resources that do not want to publish BDII information but do want to publish Site Functional Tests to WLCG SAM (mainly US ATLAS).

Of the 30 resources listed in total, 13 sites are reporting: http://oim.grid.iu.edu/publisher/get_osg_interop_monitoring_list.php

 

The SAM tests seem to be accurate, but only a subset of the full test set is published on GridView.

A Probe Equivalency page is located at https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvEquivalency (Thanks to D.Collados for his help).

 

The SE probes are now in VDT and will be tested in the OSG-ITB over the next several weeks. When testing is completed the results will start being sent to SAM; testing during the ITB phase will make sure that storage probe information reaches WLCG SAM correctly.

 

 

The table above shows the test results at each of the 13 sites that accept reporting to SAM. Below is an example of the tests run at one of those sites (MWT2_IU).

 

OSG is now testing V2 of RSV, which will include:

-       Probe Set: https://rt-svn.uits.indiana.edu/svn-web/osg/listing.php?repname=rsv&path=%2F&sc=0

-       Nagios wrappers will be included with V2 for publishing probe results into existing Nagios instances.

-       Improved proxy and configuration procedures; the number of sites should increase.

 

Availability calculations are understood and under discussion; reliability still needs to be implemented. Open questions include, for example, how OSG scheduled maintenance windows are calculated and reported to SAM.

 

They are also collecting feedback from OSG collaborators about statistics and publishing in SAM/GridView.

The next group meeting between EGEE and OSG developers is requested for later this week.

 

I.Bird asked whether the equivalence between the critical tests and the basic tests is positive.

R.Quick replied that the equivalence is available in the web page https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvEquivalency

The SE tests do not apply to OSG, but the other tests have an almost one-to-one match with the SAM tests.

 

I.Bird asked whether the timeline follows the milestone proposed by OSG for the end of March. It seems that reliability will be implemented only in May. The Review Board could ask for explanations of why the OSG sites are not reporting like the other sites.

M.Livny replied that OSG needs to collect reliability information about its sites and then forward it to the GOCDB. This is an ongoing effort and will be included in a future release.

 

I.Bird also noted that the scheduled downtime needs to be collected by OSG and communicated to SAM because it is needed for the Site Reliability calculations.

 

New Action:

M.Livny and R.Pordes agreed to present a new timescale for the OSG milestones on site reliability reporting at next week’s F2F MB Meeting.

 

6.   AOB

 

No AOB.

 

7.   Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.