LCG Management Board

Date/Time:

Tuesday 8 May 2007 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13794

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 12.5.2007)

Participants:

A.Aimar (notes), D.Barberis, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, D.Foster, J.Gordon, J.Knobloch, H.Marten, G.Merino, R.Pordes, Di Quing, H.Renshall, L.Robertson (chair), Y.Schutz, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 15 May 2007 16:00-17:00

1.      Minutes and Matters Arising (minutes) 

 

1.1         Minutes of Previous Meeting

Minutes of the 17 April approved.

 

Minutes of the 24 April still to be approved. Comments are welcome.

 

1.2         Matters Arising

-          Site Reliability Reports for April 2007 (Reliability Data April 2007)
Reliability Data for April 2007 was distributed last week. Site Availability Reports are expected by Wed. 9 May 2007.

-          QR Report Reviewed (Review Document)
Some QR reports are still missing; they are urgently needed to complete the report to the Overview Board.

 

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

No actions due for this week.

 

3.      GDB Summary (agenda) - J.Gordon

 

 

The GDB Summary will be presented next week. The action list of the GDB will be reviewed too.

The Top 5 Issues were discussed and the next step is to propose a plan of actions for the MB.

 

New Action:

15 May 2007 - L.Robertson will present a summary of the actions proposed at the GDB and the issues that will be followed up.

 

J.Templon noted that there were no comments/issues raised by the sites at the GDB. In EGEE there is an attempt to collect the Top 5 issues from the ROC managers, but it concerns longer-term requirements.

Should the LCG Tier-1 sites define their top issues too? There are some issues that have a major impact on site reliability (for example, the gridftp door issues in dCache).

 

J.Gordon replied that it is a good item for next GDB Meeting: sites will be asked to present their top issues at the GDB in June.

Someone should collect the top issues from the sites, so that all eleven sites present their issues separately.

 

4.      March 2007 Accounting Report (Paper; Slides) - J.Gordon

 

 

The March accounting report was populated with grid CPU accounting data from the APEL Portal. The differences between the data taken from APEL and the corrections reported by the sites in their manual reports have been discussed, though not yet with every site.

 

 

Site   | APEL cpu | APEL wall | Grid cpu | Grid wall | Non-Grid cpu | Non-Grid wall | Total cpu | Total wall | Notes
ASGC   |    1,904 |     3,542 |    5,120 |     7,048 |              |               |     5,120 |      7,048 | Log files overflowed. Since corrected? APEL does not agree.
BNL    |          |           |   16,834 |    21,794 |              |               |    16,834 |     21,794 | Now publishing to APEL as unregistered
CERN   |      655 |     1,184 |   45,223 |    92,499 |       43,528 |        89,850 |    88,751 |    182,349 | Not all CEs seen by APEL
CNAF   |        - |         - |   32,276 |    35,847 |           18 |           181 |    32,294 |     36,028 | No Grid cpu/wall in either email or s/sheet but subsequently uploaded to APEL. Still slight discrepancy.
FNAL   |        - |         - |   24,869 |    29,871 |       10,717 |        12,873 |    35,586 |     42,744 | Now reporting in APEL via OSG but numbers do not agree.
FZK    |   24,538 |    30,407 |   24,538 |    30,407 |        1,988 |         2,478 |    26,526 |     32,885 | APEL OK
IN2P3  |    9,909 |    16,852 |    7,814 |    13,969 |        2,236 |         3,098 |    10,050 |     17,067 | APEL slightly low but close to grid+non-grid
NIKHEF |    7,605 |     9,131 |   13,083 |    17,540 |          724 |           998 |    13,807 |     18,538 | Don't understand differences. Multiple CEs?
NDGF   |          |           |    7,845 |     8,573 |              |               |     7,845 |      8,573 | Not publishing to APEL.
PIC    |    8,915 |    13,556 |    8,915 |    13,556 |              |               |           |            | APEL OK
RAL    |    6,890 |    19,633 |    6,890 |    19,633 |           14 |            45 |     6,904 |     19,678 | APEL OK
TRIUMF |      103 |       343 |    2,939 |     2,320 |              |               |     2,939 |      2,320 | Gap in March

 

Of the 12 sites (CERN + Tier1s) in the WLCG March Report:

-          3 sites (FZK, PIC, RAL) showed complete agreement between their local records and APEL.

-          1 site (CC-IN2P3) showed agreement to within 1%.

-          4 sites (BNL, CNAF, FNAL, NDGF) had no March data published in the APEL Tier1 Table.

-          4 sites (ASGC, CERN, NIKHEF, TRIUMF) published to APEL but reported large discrepancies between APEL and their local records.
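The comparison behind this grouping can be sketched as a relative difference between the APEL figure and the site's own figure. The snippet below is only an illustration: the function name and the data layout are assumptions, not part of the accounting tools, and the numbers are the cpu values from the March table for the sites that had APEL data. For IN2P3 the local value used is the Total cpu, following the note in the table that APEL is close to grid+non-grid; for the other sites it is the Grid cpu.

```python
# Illustrative sketch only: names and layout are hypothetical, figures are
# the cpu columns of the March 2007 table in these minutes.

def relative_discrepancy(apel: float, local: float) -> float:
    """Return |apel - local| / local, the fractional APEL vs. site difference."""
    if local == 0:
        raise ValueError("local value must be non-zero")
    return abs(apel - local) / local


# site -> (APEL cpu, site-reported cpu used for the comparison)
MARCH_CPU = {
    "ASGC": (1904, 5120),
    "CERN": (655, 45223),
    "FZK": (24538, 24538),
    "IN2P3": (9909, 10050),   # compared against Total cpu (grid + non-grid)
    "NIKHEF": (7605, 13083),
    "PIC": (8915, 8915),
    "RAL": (6890, 6890),
    "TRIUMF": (103, 2939),
}

# Print the sites ordered from best to worst agreement.
for site, (apel, local) in sorted(
    MARCH_CPU.items(), key=lambda item: relative_discrepancy(*item[1])
):
    print(f"{site:7s} {relative_discrepancy(apel, local):7.1%}")
```

Sites with a zero difference correspond to the "APEL OK" rows; large values point to the under-reporting cases discussed in this section.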

 

Since the March report was produced the following progress has been made.

-          CNAF are now publishing into APEL with only small remaining discrepancies.

-          FNAL are reporting from Gratia (OSG Accounting tool) into the OSG tree in APEL.

-          BNL are partially reporting into Gratia but it appears in APEL as an unregistered site, not under the OSG Tree (but under a tree called “unregistered”).

-          ASGC had a problem with overflow of logs. This is believed to be fixed. They haven’t fully republished the data yet.

-          TRIUMF have a gap in March data and should republish before we look for further discrepancies. April data looks OK. 

-          There was a known problem for sites with multiple CEs which can cause under-reporting. This has affected results for CERN and NIKHEF. CERN are publishing correctly and fully for April.

 

There is a reasonable chance of correct APEL reporting in April from 9 sites.

-          This will leave only TRIUMF and NIKHEF with incomplete reporting.
The NIKHEF discrepancy could be due to multiple CEs, R-GMA failures, or a VO not recorded as ATLAS. There is confidence that this can be fixed soon.

-          NDGF is not publishing at all.

 

The data for April will be extracted on the 10th of May 2007.

 

5.      Status of the High Level Milestones (HL Milestones 2007) - Roundtable

 

 

ASGC (information provided by email):

-          WLCG-07-01: 24x7 Support Definition
Still ongoing and is expected to be complete before the end of May.

-          WLCG-07-02: 24x7 Support Testing
After drafting the standard operation procedure, it will be tested at 16x7 support level.

-          WLCG-07-04: VOBoxes SLA Defined:
ASGC will propose the support level and send to experiments for agreement.

-          WLCG-07-06: Job Priorities Available at Site
ASGC has completed the mapping of the job priorities with Maui and is also publishing the information.

 

CC-IN2P3

Not represented at the MB meeting.

 

CERN

-          WLCG-07-08: Accounting Data in APEL:
The accounting data for March was incorrect but is fixed for April.

 

FZK/GridKa

-          WLCG-07-01: 24x7 Support Definition
WLCG-07-02: 24x7 Support Testing
FZK will be delayed for some time. The milestones are not in line with the internal milestones that FZK had already defined. Holiday and vacation support was defined in the past and worked.

-          WLCG-07-04: VOBoxes SLA Defined
Currently in a draft and will be discussed with the Technical Advisory Board where the VOs are also represented.

-          WLCG-07-06: Job Priorities Available at Site
Being installed on the PPS but is not production ready yet.

 

INFN

-          WLCG-07-02: 24x7 Support Testing
Testing is late; the 24x7 support has only just been defined.

-          WLCG-07-04: VOBoxes SLA Defined
Discussing with the Experiments how to define the services.

-          WLCG-07-08: Accounting Data in APEL:
The issue for March is understood and for April the data should be published correctly.

 

At CNAF the hardware and the standard installation are provided by the site to the Experiments; then the local VO representative installs all VO-specific software.

How is this going to be possible on a 24x7 basis? Just as an example, the ALICE xrootd installation, unless otherwise agreed with the site, will have to be performed by ALICE locally or remotely on a 24x7 basis. The VO must agree with the site how this should be done.

 

 

NDGF

Not represented at the MB meeting.

 

PIC

-          WLCG-07-01: 24x7 Support Definition
WLCG-07-02: 24x7 Support Testing
Setting up on-call service with the current staff; late due to administrative issues and legal verifications.

-          WLCG-07-04: VOBoxes SLA Defined
Discussion is on-going with the experiments to determine their requirements in terms of reaction time and support.

 

G.Merino asked whether the experiments are discussing the SLA separately with each site, and why there is not a single agreement for all sites (at least on response times, etc.).

L.Robertson replied that the sites should agree independently because they all have slightly different local procedures.

 

RAL

-          WLCG-07-01: 24x7 Support Definition
WLCG-07-02: 24x7 Support Testing
Funding for call-out staff has been approved from April 2008. A plan will be provided on how to achieve 24x7 support until that date.

-          WLCG-07-04: VOBoxes SLA Defined
The action is being implemented; discussions with the Experiments have started.

 

SARA NIKHEF

-          WLCG-07-01: 24x7 Support Definition
WLCG-07-02: 24x7 Support Testing
SARA intends to provide the level of support agreed by default in the MoU. If it is not adequate, the VOs should contact the site. It is not always clear who the VO contacts are, and there are problems with hiring staff for such support.

 

L.Robertson stated that the site has the responsibility (and it is in its interest) to define an agreement on 24x7 support, covering the use cases and needs of the Experiments. Support over the weekend will be needed: the MoU says that during accelerator operations acceptance of data must be supported within 12 hours.

 

-          WLCG-07-04: VOBoxes SLA Defined
The action is being implemented; discussions with the Experiments have started.

 

TRIUMF

Not represented at the MB meeting.

 

BNL

-          WLCG-07-04: VOBoxes SLA Defined
The support is being defined with ATLAS. BNL has a hot spare CPU available to replace the VOBox in case of failure.

-          WLCG-07-06: Job Priorities Available at Site
The priorities are implemented; the only missing step is the publishing into the Information System.

 

FNAL

-          WLCG-07-06: Job Priorities Available at Site
The priorities are implemented; the only missing step is the publishing into the Information System because it was not a priority for CMS.

 

6.      AOB

 

6.1         Recent Job Priorities Issues

J.Templon reported that Job Priority was interpreted in different ways by different services (WMS, BDII) and this caused serious problems for the ATLAS jobs.

BTW: Why was the discussion done “privately” without involving the MB or any other LCG group/body?

 

A solution was found and it is being deployed and tested on the PPS at CERN. ATLAS asked the sites to stop any activity in upgrading the Job Priority services. Sites can continue the deployment but should not put it in production. Sites that have it deployed should NOT un-deploy it.

6.2         LHCC Referees Meeting in the Morning

Not all referees were present at the meeting, and it was organized too late to book a suitable room with a remote phone connection. Apologies.

6.3         Next Meetings

J.Templon asked for an update on the status of the work on how the VOMS group/roles interlock into the different services.

L.Robertson replied that it could be done in two weeks' time, because the EGEE review takes place before then.

 

Next meeting: Because of the EGEE Review at CERN a meeting room for the MB will be booked.

Update: Salle B has been booked for the MB on the 15th of May. Phone conferencing will also be available as usual.

 

7.      Summary of New Actions

 

 

No new actions.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.