LCG Management Board

Date/Time

Tuesday 13 May 2008, 16:00-18:00 – F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=31117

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 16.5.2008)

Participants  

A.Aimar (notes), L.Betev, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, J.Gordon (chair), F.Hernandez, M.Kasemann, M.Lamanna, E.Laure, U.Marconi, P.Mato, H.Meinhard, G.Merino, A.Pace, R.Pordes, G.Poulard, A.Sciabá, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 20 May 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

J.Templon stated that the sentence in the minutes, quoting D.Barberis, which presented the discussions at the ATLAS Jamboree as the clarification of ATLAS's needs, is not sufficient for the sites. In his opinion the issue still needs to be clarified further by ATLAS.

J.Gordon agreed and added that the discussion at the ATLAS Jamboree had not ended with a clear agreement and several issues were left open. The sites would still like a summary for each site separately; for instance, it is not clear to the sites which data flows overlap and which are separate.

Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf. Similar information from ALICE, ATLAS and CMS is still needed.

1.2      Tier-1 and Tier-2 Reliability Reports (drafts) (Tier-1 Reliab; Tier-2 Reliab)

The Tier-1 and Tier-2 Reports are being updated following the comments received (just one, from F.Hernandez) and will now be distributed to the WLCG boards.

1.3      Weighted Reliability of Tier-2 Sites - J.Gordon

As requested by several parties, the reliability of Tier-2 sites and federations should be weighted by the capacity of the sites. This information is not currently available in the information system, so its collection needs to be planned.

 

M.Kasemann added that weighted reliability would make it easier to see which sites are more important and to give them priority. The Experiments could perhaps produce this data; in some cases they know the resources installed at the sites because they use them and account for the usage by their applications.
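
For illustration only (a hypothetical sketch, not an agreed WLCG algorithm; the capacity data is precisely what is missing from the information system), a capacity-weighted federation reliability could be computed as below. All site names and figures are invented:

```python
# Hypothetical sketch: capacity-weighted reliability of a Tier-2 federation.
# Site names, capacities and reliabilities are invented examples.
sites = [
    ("Site-A", 1200, 0.97),  # (name, installed capacity, monthly reliability)
    ("Site-B",  300, 0.80),
    ("Site-C",  150, 0.55),
]

total = sum(cap for _, cap, _ in sites)
weighted = sum(cap * rel for _, cap, rel in sites) / total
unweighted = sum(rel for _, _, rel in sites) / len(sites)

# The large, reliable site dominates the weighted figure, whereas the plain
# average treats every site equally regardless of its importance.
print(f"weighted: {weighted:.1%}  unweighted: {unweighted:.1%}")
```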

1.4      Tier-2 Accounting – J.Gordon

Some sites deliver 10% of their pledges and others more than 150%: either the accounting is wrong or the real delivery of resources is unrelated to the pledges. This could be discussed at the GDB on Tier-2 sites in June.
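
As an illustration, a minimal consistency check of the kind described above might look as follows; the site names, figures and thresholds are invented:

```python
# Hypothetical sketch: flag Tier-2 sites whose delivered/pledged ratio is
# implausible. All names and numbers are invented examples.
pledged   = {"T2-X": 500, "T2-Y": 400, "T2-Z": 600}   # pledged capacity
delivered = {"T2-X":  50, "T2-Y": 640, "T2-Z": 590}   # accounted delivery

for site, pledge in pledged.items():
    ratio = delivered[site] / pledge
    if not 0.5 <= ratio <= 1.5:   # arbitrary thresholds for this example
        print(f"{site}: delivered {ratio:.0%} of pledge -> check accounting or installation")
```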

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Done. The sites that cannot implement the metrics have reported the reasons, and an HL milestone has been introduced to track the situation.

Here is the link to the wiki page: https://cern.ch/twiki/bin/view/LCG/MssEfficiency

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |

MSS/Tape Metrics

| WLCG-08-03 | April 2008 | Tape Efficiency Metrics Published: metrics are collected and published weekly | | | | | | | | | | | | |

-       31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite.

-       J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

This will be discussed at the MB meeting on 20 May 2008.

D.Collados distributed an email with the link to the wiki page where the information is available.

He and R.Quick will present it at the MB on 20 May. The OSG tests are described here: http://rsv.grid.iu.edu/documentation/help/.

 

The proposed new list of critical tests is available here:

https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes#Proposed_Critical_Probes_for_OSG

 

-       31 March 2008 - ALICE, ATLAS and CMS should provide the read and write rates that they expect to reach, in terms of clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.

 

Not done. It was agreed that the Experiments will provide the same information in the format used by LHCb.

Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf

 

  • 30 Apr 2008 – Sites send to H.Renshall their plans for the 2008 installations: what will be installed by May and when the full capacity for 2008 will be in place.

 

H.Renshall will be absent for several weeks in May; sites should therefore send the information to S.Foffano in addition to H.Renshall.

 

J.Gordon added that the sites should report what is in place in May. If there is no change since the WLCG Workshop there is no need to report.

 

  • 6 May 2008 - Sites should confirm to the MB that they will define a digitally-signed email mechanism for their alarm system submission.

 

R.Tafirout noted that if some sites can do it, all sites should be able to implement this solution.

 

  • 18 May 2008 - Sites should send the email address where they are going to receive the alarms, and confirm that they can achieve this.

 

A new page collecting the sites' alarm email addresses needs to be prepared. It is not the same as the contact page already in place (https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails): it should be a separate list, used only for alerts and only by a few people from each Experiment.

 

New Action:

A.Aimar will talk to WLCG Operations in order to arrange the set-up of a page for the site operations alerts.

 

  • 16 May 2008 - Each Experiment proposes 4 users who can raise alarms at the sites and are allowed to mail to the sites' alarm mailing lists.

 

Not done.

 

  • 9 May 2008 - Sites should send comments about the New HL Milestones.

Done. Will be discussed during this meeting.

 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

This action may not be necessary: the MB discussion last week did not agree on a working solution to implement. A couple of sites said they could try it; it will probably be discussed at the GDB the following day.

 

J.Templon reported that the TCG is not in favor of a solution using LCAS, as otherwise a solution using SCAS would never be installed at the sites. The only installation of LCAS could be on the pre-production service at CERN, for the Experiments to try their procedures while the developers move forward with SCAS.

 

M.Schulz added that a solution based on LCAS would make it very complex (if possible at all) to manage the same users with different usernames at different sites. SCAS will provide a centrally managed and scalable solution in a few weeks.

 

J.Gordon noted that, as there is no agreement on the issue, the action stays and will be discussed again next week.

 

3.   CCRC08 Update (Agenda June PM Workshop; Minutes week19; Minutes week20; Slides) - J.Shiers

 

J.Shiers presented the weekly summary of the status and progress of the recent CCRC08 activities.

 

The activity clearly exceeds that of February – and it is still only week 1 of the challenge. Problem resolution is significantly faster than in February, but there are still communication problems between Experiments and Sites to be resolved.

3.1      Week 19 Minutes

The minutes of last week’s daily calls run to 8 pages and were not commented on in detail.

 

In general, ATLAS, CMS and LHCb are active. ALICE is preparing for their third Commissioning exercise.

Their primary requirement is the upgrade of the VO Boxes to SLC4; the status is at https://pcalimonitor.cern.ch:8443/admin/linux.jsp.

L.Betev added that the status of the site installations is positive.

 

More sites than in recent weeks participated in the daily calls. In general, problems were resolved rather rapidly. The mailing lists, elogs, hyper-news, etc. were all very active, including over the (long) weekend. Progress on various infrastructure issues will be reported at the GDB.

Most services were running smoothly.

3.2      CERN Services

Monday:

-       Intervention on VOMS services (lcg-voms.cern.ch and voms.cern.ch).

 

Tuesday:

-       DB services in SLS (http://cern.ch/sls/): drill-down to the experiment databases, including streams.
Problem: databases are listed under “Services for engineers” and the drill-down is about as clear as mud.
Problem: the CERN DB monitoring pages (next slide) are only visible within CERN.

-       “vo-operator-alarm@cern.ch” mailing lists created and configured.

-       RSS feeds for elog & GGUS published + feedback URL

 

Thursday:

-       gLite 3.1 update 22 delayed due to the EGEE II-III transition meeting.

 

Friday:

-       C2ALICE on 15th May;

-       AFS problem in AFS22 - ~95% of volumes recovered, ~5% (232) to be restored from backup tape (affected job monitoring application);

-       C2CMS – patch 2.1.7-6 applied at 10:00, on C2PUBLIC Tuesday(?);

-       streams monitoring down – being investigated

 

Slide 4 shows the screen shots for the DB monitoring system.

3.3      Sites Issues

RAL: power outage in the RAL region on Tuesday morning (explained on slide 6). It took some time (~8 hours) to bring back all services. This also affected the GOCDB, which suffered a DB corruption (see slide from Andrew).

 

NDGF: ATLAS throughput issues. Faster disks in front of tape will be installed this week; more tape libraries will be installed during the summer.

 

BNL: disk space issues. Had not yet received FY08 resources prior to start of CCRC’08 (1.5PB raw). Cleanup / additional disk space (110TB) plus deployment of 31 Thumpers (FY08 resources)

 

NL-T1: dCache library inconsistency – solved on Wednesday.

 

FZK: transfer problems (LHCb) rapidly solved – SRM load

 

PIC: tiny files from ATLAS SAM tests were going to tape. The test directory will be configured not to be migrated (as at other sites).

3.4      Experiments

The detailed reports are available in the minutes.

 

LHCb: start up delayed due to issues on online system. Massive cleanup of existing data at 2Hz. Deployment of needed s/w successful. Things running ‘happily’ by end of week.

 

CMS: quite some problems with CASTOR configuration (garbage collection policies etc.) Some actions (done) on CMS side. Discussions with experts. “Reaching stable operations after 4 days”

 

ATLAS: next week’s focus is T1-T1 transfer. The requirements were circulated. Post-mortem of week 1 will be done this week. An LFC intervention to correct SURLs being prepared.

 

J.Templon commented that the Experiments need to have a single point of communication to the Sites; otherwise the requests and their importance are not clear to a site.

 

4.   HEPIX Benchmarking (Slides) - H.Meinhard

 

H.Meinhard presented a summary of the progress of the HEP Benchmarking working group. This is an update to the presentation on 25-Mar-2008.

 

The Interim report was presented at HEPiX at CERN the previous week:

http://indico.cern.ch/sessionDisplay.py?sessionId=11&confId=27391

-       Status report (Helge Meinhard)

-       CERN benchmarking cluster, including SPECcpu (Alex Iribarren)

-       Atlas (Franco Brasolin)

-       CMS (Gabriele Benelli)

-       Alice (Peter Hristov)

-       LHCb (Hubert Degaudenzi)

-       CPU-level performance monitoring (Andreas Hirstius)

 

The benchmarking cluster at CERN is now made up of 8 permanent machines (a 2.33 GHz Harpertown was added) and 3 temporary nodes.

Other sites also provide specific hardware: DESY, Harpertown 2.83 GHz; Padua, Harpertown 2.33 GHz; Barcelona, 2.1 GHz.

 

All Experiments have run a variety of benchmarks on most benchmarking nodes. The Experiments are moving towards freezing their benchmarking applications and packaging them for easy distribution and running by non-experts. The goal is that the Experiments' benchmarks can be run by the sites on future new machines without an Experiment expert.

 

Perfmon analysis was performed on the SPECcpu runs and found that:

-       SI2000 is 1% FP (while SI2006 is 0.1%)

-       Average bus utilisation is low (~10%), but “Bus Not Ready” is at the percent level: the memory requests come in bursts.

-       The L2 cache miss rate is 2x higher on SI2006 than on SI2000. This means SPEC2K6 has a much larger memory footprint than SPEC2K.

-       SPEC2006 C++: About 10-14% FP

 

Running Perfmon requires a new kernel as it cannot run on the standard SLC4 2.6.9-xx kernel.

Even with perfmon enabled, systems running 2.6.24-xx are ~ 10% faster than with 2.6.9-xx.

4.1      Methodology

A full least-squares fit is awaiting a consistent estimation of the measurement errors. Meanwhile the working group assumed a linear dependency and checked Pearson’s correlation coefficient.

Slides 6-11 show how close to 1 the correlation between each application and the benchmarks is.
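
As a sketch of this check (invented numbers, not the working group's data): Pearson's r between an application's per-machine score and a candidate benchmark's score can be computed as below; a value close to 1 is consistent with the assumed linear dependency.

```python
# Hypothetical sketch: Pearson correlation between application and benchmark
# scores measured on the same machines. All numbers are invented.
import math

app   = [12.1, 18.4, 22.0, 25.3, 30.8]   # e.g. events/sec per machine
bench = [ 9.8, 15.2, 18.9, 21.0, 26.5]   # e.g. SPEC score per machine

n = len(app)
ma, mb = sum(app) / n, sum(bench) / n
cov = sum((a - ma) * (b - mb) for a, b in zip(app, bench))
r = cov / math.sqrt(sum((a - ma) ** 2 for a in app) *
                    sum((b - mb) ** 2 for b in bench))
print(f"Pearson r = {r:.3f}")
```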

 

ALICE and CMS: The correlation values are all above 95-97%.

ATLAS: The values are all at ~65% and this needs to be understood; there must be some error in the calculations.

LHCb: No results to present for the moment.

 

The first chi-square fit only takes into account the spread of multiple runs on the same machine; as expected, the error is clearly underestimated (see the sketch after the list below).

Other calculations are being considered:

-       Runs on different HW of supposedly same configuration

-       Runs on different HW of “similar” configuration

-       Runs with different seeds

-       Spread between cores running SPEC cpu base in parallel

-       Multiple runs on the same machine
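
A minimal sketch of the chi-square fit mentioned above (a straight line y = a + b*x weighted by per-point errors), assuming the errors come from the spread of repeated runs; all numbers are invented:

```python
# Hypothetical sketch: error-weighted least-squares (chi-square) fit of
# application score y against benchmark score x. All numbers are invented.
import numpy as np

x     = np.array([ 9.8, 15.2, 18.9, 21.0, 26.5])  # benchmark scores
y     = np.array([12.1, 18.4, 22.0, 25.3, 30.8])  # application scores
sigma = np.array([ 0.4,  0.5,  0.4,  0.6,  0.7])  # spread of multiple runs

w = 1.0 / sigma**2                       # chi-square weights
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy  = (w * x * x).sum(), (w * x * y).sum()
delta     = S * Sxx - Sx**2
a = (Sxx * Sy - Sx * Sxy) / delta        # intercept
b = (S * Sxy - Sx * Sy) / delta          # slope
chi2 = (w * (y - a - b * x)**2).sum()

# If sigma reflects only the same-machine spread, chi2/ndf >> 1 signals
# underestimated errors, as noted above.
print(f"a = {a:.2f}, b = {b:.2f}, chi2/ndf = {chi2 / (len(x) - 2):.2f}")
```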

4.2      Conclusions

The preliminary results indicate that standard benchmarks are probably adequate.

The rather poor ATLAS correlation and the treatment of errors need to be understood.

The next step is to run the experiments’ code with perfmon (CPU performance/event counters).

 

L.Dell’Agnello asked whether the SPECint 2006 is going to be adopted and can be used by the sites in their next tenders.

H.Meinhard replied that SPECint 2006 is likely to be the selected benchmark, but the final checks still need to be done. CERN still used SPECint2K for its recent tender.

 

M.Kasemann asked why the applications correlate well with both INT and FP; maybe this just shows that the tests executed do not discriminate much.

H.Meinhard agreed and added that perfmon will probably provide the answers to this kind of question. He also added that all the runs are parallel SPEC rate runs.

 

L.Betev noted that all the Sites should use the same units and that processors could be benchmarked centrally.

H.Meinhard replied that it is easy for all suppliers and sites to run the 2K6 benchmarks, and that checking the real node is much better than relying on pre-calculated tables.

 

J.Templon asked that the MoU pledges be converted to the new unit.

Ph.Charpentier noted that the sites must account for each node in detail; otherwise the lack of precision in the resources reported by a site is much larger than the ~5% precision of these benchmarks.

 

Decision:

The MB asked H.Meinhard to present the recommendations in one month and to report them to the GDB.

 

5.   CMS-specific SAM Tests (Slides; VO_SAM_Tests) - A.Sciabá

 

A.Sciabá presented the SAM test results for CMS in April and provided an overview of the CMS tests.

5.1      Critical CMS Tests

Below are the CMS critical tests, as selected with the SAM FCR tool. There are only 3 critical tests in total for CMS in SAM.

 

| Test name | Run by | Meaning |
| Computing Element | | |
| CE-sft-job | CMS | If it fails, no CMS job can be run |
| CE-sft-caver | OPS | If it fails, CA certs or CRLs are not updated |
| SRMv2 | | |
| SRMv2-lcg-cp | CMS | If it fails, no file can be copied from the UI to SRMv2 using lcg-cp |

 

CMS is decommissioning SRMv1; therefore there have been no critical tests on SRMv1 since the end of April 2008.

 

SRMv2 is not a critical service in GridView; therefore SRMv2 failures do not affect site availability. This means that availability could be seriously misrepresented from now on.
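
To illustrate the effect (a simplified, hypothetical model of the availability calculation, not GridView's actual algorithm): per time bin a site counts as up only if every critical service passes, and bins with UNKNOWN status are dropped from the denominator. Both effects can inflate the published number; the statuses below are invented.

```python
# Simplified, hypothetical availability model (not GridView's real code).
# A bin counts as UP only if all *critical* services are UP; bins where a
# critical service is UNKNOWN are excluded from the denominator.
bins = [
    {"CE": "UP",      "SRMv2": "DOWN"},
    {"CE": "UP",      "SRMv2": "DOWN"},
    {"CE": "UNKNOWN", "SRMv2": "DOWN"},
    {"CE": "UP",      "SRMv2": "UP"},
]

def availability(bins, critical):
    known = [b for b in bins if all(b[s] != "UNKNOWN" for s in critical)]
    up = [b for b in known if all(b[s] == "UP" for s in critical)]
    return len(up) / len(known) if known else None

print(availability(bins, critical=["CE"]))           # 1.00: SRMv2 failures invisible
print(availability(bins, critical=["CE", "SRMv2"]))  # 0.33: failures now counted
```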

 

The many CMS-specific SAM tests (for FroNTier, software installation, CMS local site configuration, local file stage out etc.) are not critical in FCR because in most cases they do not indicate a problem at the Grid level but in the CMS setup.

 

L.Dell’Agnello noted that tests for SRMv2 should urgently be added to the standard SAM tests.

5.2      Reliability by Site (see slides 5-8 for details)

 

 

| Site | OPS | ALICE | ATLAS | CMS | LHCb | SAM site name |
| CERN-PROD | 95% | 100% | 1% | 80% | 59% | CERN-PROD |
| DE-KIT | 95% | 97% | 50% | 84% | 34% | FZK-LCG2 |
| FR-CCIN2P3 | 98% | 100% | 5% | 98% | 42% | IN2P3-CC |
| IT-INFN-CNAF | 73% | 89% | 31% | 76% | 0% | INFN-T1 |
| NDGF | 85% | 100% | - | - | - | NDGF-T1 |
| ES-PIC | 90% | 100% | 41% | 98% | 94% | pic |
| UK-RAL | 93% | 97% | 0% | 70% | 34% | RAL-LCG2 |
| NL-T1 | 90% | 86% | 0% | - | 33% | SARA-MATRIX |
| TW-ASGC | 97% | 100% | 22% | 96% | - | Taiwan-LCG2 |
| CA-TRIUMF | 96% | - | 19% | - | - | TRIUMF-LCG2 |
| US-FNAL-CMS | 77% | - | - | 96% | - | USCMS-FNAL-WC1 |
| US-T1-BNL | 93% | - | N/A | - | - | BNL-LCG2 |

 

CERN: 80% (incorrect)

-       SRM: the status is mostly unknown because the SRMv1 endpoint has not been used by CMS for a long time

-       The value of 80% is artificial: the periods in status UNKNOWN are ignored and are predominant.

 

FNAL: 96% (incorrect)

-       The calculations are wrong because they take into account CEs that are not there anymore. This will be communicated to GridView.

 

ASGC: 96%

-       CE: a few intermittent job submission errors

-       SRM: sporadic CASTOR errors (“device too busy”)

 

IN2P3: 98%

-       SRM: very few connection timeouts; generally it is working well.

 

CNAF: 76%

-       CE: a few job submission errors

-       SRM: some communication errors and timeouts

 

PIC: 98%

-       CE: some errors due to a network intervention, but working well

 

RAL: 70%

-       CE: a few disappearances from BDII

-       SRM: status unknown because SRMv1 endpoint not used anymore by CMS

-       The overall reliability is statistically insignificant (as for CERN) because the status is often “unknown”.

5.3      Conclusions

For three sites the reliability is not meaningful:

-       CERN, RAL: because CMS was no longer testing a service (SRMv1) that still had critical tests. Fixed from the end of April.

-       FNAL: GridView seems to be picking up the wrong CEs

 

SRMv2 should become a critical service in GridView. Otherwise SRM failures will not appear and all sites will look unrealistically good.

 

Reminder: many CMS tests intentionally do not affect the WLCG availability; e.g. a site cannot be blamed if some CMSSW version is missing, or if the Trivial File Catalogue has a typo.

 

Ph.Charpentier asked whether the CMS tests use any space token.

A.Sciabá replied that the “CMS-default” space token is used; if that fails, the copy is retried without a space token specified.

 

M.Schulz added that currently the default SRM tests use the SRM 1 interface but the default should be changed to SRM 2. An additional test could be added and made critical.

A new list of SAM tests should be presented to the MB, and a date should be set for the sites to move to SRM2, for instance.

 

New Action:

M.Schulz should present an updated list of SAM tests, for instance testing SRM2 and not SRM1.

 

6.   High Level Milestones for 2008 (Comments_Received; HLM_20080502) - A.Aimar

The Milestones were distributed and a couple of comments were received:

-       R.Pordes asked to review the OSG RSV milestones. This will be done next week at the MB.

-       J.Templon asked that the milestones on the preparation of the 2009 tenders be clarified.

 

J.Gordon proposed this new phrasing of the 08-04 milestone.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-08-04 | Sep 2008 | Sites Report on the Status of the MoU 2009 Pledges: each site reports whether it is on track with the MoU pledges by April; if not, the date when the pledges will be fulfilled and the equipment installed | | | | | | | | | | | | |

The MB started reviewing incomplete and new milestones present in the HLM dashboard.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |

24x7 Support

| WLCG-07-02 | Apr 2007 | 24x7 Support Tested: support and operation scenarios tested via realistic alarms and situations | | | | Apr 2008 | Apr 2008 | | | | | | | |
| WLCG-07-03 | Jun 2007 | 24x7 Support in Operations: the sites provide 24x7 support to users as standard operations | | | | Apr 2008 | Apr 2008 | | Mar 2008 | | Apr 2008 | | | |

VOBoxes Support

| WLCG-07-04 | Apr 2007 | VOBoxes SLA Defined: sites propose and agree with the VO the level of support (upgrade, backup, restore, etc.) of VOBoxes | Mar 2008 | Apr 2008 | | | | | Mar 2008 | | | | | |
| WLCG-07-05 | May 2007 | VOBoxes SLA Implemented: VOBoxes service implemented at the site according to the SLA | Apr 2008 | Apr 2008 | Mar 2008 | | | Mar 2008 | Mar 2008 | | Apr 2008 | | | |
| WLCG-07-05b | Jul 2007 | VOBoxes Support Accepted by the Experiments: VOBoxes support level agreed by the experiments | | | | | | | | | | | | |
| - ALICE | | | n/a | | | | | | n/a | | | n/a | n/a | n/a |
| - ATLAS | | | | | | | | n/a | n/a | | | | | n/a |
| - CMS | | | | | | | | n/a | | | n/a | n/a | n/a | |
| - LHCb | | | n/a | | | | | n/a | | | | n/a | n/a | n/a |

 

WLCG-07-02/03:

-       FZK not present.

-       INFN no news.

 

WLCG-07-04/05b:

-       IN2P3: They are preparing the VOBoxes SLA with the Experiments; it should be ready in the next couple of weeks.

-       ASGC not present.

-       CERN: will send an email. They are working on arranging the final version of the SLA with the Experiments.

-       NDGF: Will check what the status is.

-       PIC: LHCb is done. CMS was contacted but there has been no answer in the last 2 weeks.

-       NL-T1: They are agreeing the SLA with the VOs, but it is still being defined. WLCG-07-04 should be set back to red.

 

MSS/Tape Metrics

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-08-03 | April 2008 | Tape Efficiency Metrics Published: metrics are collected and published weekly | | | | | | | | | | | | |

These metrics are due and sites not providing them should be marked as red.

 

 

Tier-1 Procurement

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-07-17 | 1 Apr 2008 | MoU 2008 Pledges Installed: to fulfill the agreement that all sites procure their MoU pledges by April of every year | Apr 2008 | Apr 2008 | Apr 2008 | | CPU Apr 08, Disk May 08 | CPU Apr 08, Disk Sep 08 | CPU Apr 08, Disk Jun 08 | March 2008 | Nov 2008 | | | |

Sites should send this information about what they have installed so far. A single date in a cell means that CPU and disk are due by the same date.

 

These are the updated values for the reliability, including April.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | FZK GridKa | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |

Tier-1 Sites Reliability - June 2008

| WLCG-08-06 | Jun 2008 | Tier-1 Sites Reliability above 95%: considering each Tier-0 and Tier-1 site | | | | | | | | | | | | |
| | | Jan 93% | | | | | | | | | | | | |
| | | Feb 93% | | | | | | | | | | | | |
| | | Mar 93% | | | | | | | | | | | | |
| | | Apr 93% | | | | | | | | | | | | |
| | | May 93% | | | | | | | | | | | | |
| | | June 95% | | | | | | | | | | | | |

F.Hernandez noted that site reliability also depends on the reliability of the middleware.

J.Gordon noted that many sites are green with the same middleware; it must therefore be possible to achieve a reliability above the target.

M.Schulz added that the tests just check that a service is running; these are tests that all sites should pass.

 

7.   AOB
 

 

J.Gordon noted that the MoU commitments are still not measured by any specific SAM test.

 

New Action:

J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

 

8.   Summary of New Actions

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.

 

New Action:

M.Schulz should present an updated list of SAM tests, for instance testing SRM2 and not SRM1.

 

New Action:

J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.