LCG Management Board

Date/Time

Tuesday 1 December 2009 16:00-18:00 – F2F Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=71054

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 12.12.2009)

Participants

A.Aimar (notes), I.Bird (chair), K.Bos, M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, I.Fisk, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Litmaath, H.Marten, P.Mato, G.Merino, A.Pace, B.Panzer, M.Schulz, Y.Schutz, H.Renshall, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Invited

J.Casey, S.Graeme

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 15 December 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

A.Heiss agreed to remove from the minutes his references to the urgency of the recent updates, as they would lead to an underestimation of the issue.

 

The minutes of the previous meeting were approved by the WLCG MB.

 

2.   Action List Review (List of actions)

 

·         I.Bird will prepare the answer to the Scrutiny group about their request to have the 2011 Experiments' Requirements by early 2010.

To be done.

·         OPN Mandate: Experiments and some Sites should provide names for working with the OPN on the needs and actions required for the Tier-1 to Tier-2 links.

To be done.

 

3.   LCG Operations Weekly Report (Slides) – D.Duellmann
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

This summary covers the period from 23 November to 29 November, which was the first week with LHC beam, collisions and the ramp-up to 1.18 TeV.

 

Despite the excitement and the level of activity, operations were relatively smooth.

Only two incidents led to (eventual) Service Incident Reports (SIRs):

-       CMS data loss at IN2P3

-       SIR received for CMS Dashboard upgrade problems

 

Meeting Attendance

 

Site      Days attended (of 5, Mon-Fri)

CERN      5
ASGC      5
BNL       3
CNAF      3
FNAL      0
FZK       5
IN2P3     4
NDGF      3
NL-T1     5
PIC       1
RAL       5
TRIUMF    n/a

 

NDGF has started attending the meetings. For the OSG Sites the attendance should perhaps be reviewed; the agreement could be that they participate when needed (as TRIUMF does).

 

GGUS Summary

The number of tickets is increasing, especially for ATLAS.

 

VO       User   Team   Alarm   Total

ALICE    0      0      0       0
ATLAS    9      66     1       76
CMS      1      2      0       3
LHCb     0      3      1       4

Totals   10     71     2       83
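 

The totals can be cross-checked from the per-VO rows; a minimal Python sketch using the numbers in the table above:

```python
# Ticket counts per VO, copied from the table above.
tickets = {
    "ALICE": {"user": 0, "team": 0, "alarm": 0},
    "ATLAS": {"user": 9, "team": 66, "alarm": 1},
    "CMS":   {"user": 1, "team": 2, "alarm": 0},
    "LHCb":  {"user": 0, "team": 3, "alarm": 1},
}

# Per-VO totals (right-hand column) and per-type totals (bottom row).
for vo, counts in tickets.items():
    print(vo, sum(counts.values()))

per_type = {t: sum(c[t] for c in tickets.values()) for t in ("user", "team", "alarm")}
print(per_type, sum(per_type.values()))  # {'user': 10, 'team': 71, 'alarm': 2} 83
```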

 

There were 2 alarm tickets:

-       LHCb: Streams replication to several LHCb Tier-1 Sites had stopped. Work had already started when the ticket was received; a streamlining of the GGUS integration with the CERN DB services was initiated.

-       ATLAS: a successful ALARM test by ATLAS, following the GGUS issues of the week before.

 

J.Gordon added that there might be a need for 24/7 GGUS support; they will report on the current situation to the GDB.

3.2      SIRs and VO Availability

Slide 6 shows a very positive VO SAM reliability for the week.

 

SRM Problems at CERN

The steps followed were those suggested in the recent SIR: SRM request mixing was observed and protective measures were put in place, the number of SRM threads was increased, and timeout problems for the SRM ping operation were fixed.

An SRM upgrade was performed and operation has been stable since then. Some infrequent call-back problems are still being investigated by the CASTOR/SRM team.

 

FTS Problems at CERN for ATLAS

Outgoing transfers from CERN were severely affected. ATLAS had to revert to FTS 2.1 after problems with 2.2 resulting in core dumps. Other Sites running 2.2 (they upgraded prematurely) are so far not affected, but they need to be cautious until the problem is fully understood.

FTS development is analysing the case.

 

I.Bird asked whether the issue is clearly understood.

D.Duellmann replied that it is not yet fully understood; it is only happening for ATLAS and only at CERN.

 

CMS Data Loss at IN2P3

11 Nov: a cleanup of unwanted CMS files resulted in a larger deletion than expected.

 

Cause: communication problems between the CMS and IN2P3 teams

-       660 TB (480 TB custodial) were erroneously deleted

-       100 TB could not be retransmitted from CERN or other T1s

 

All event samples could be re-derived later should the need arise for CMS. Procedures have been reviewed to avoid similar problems. Full details are at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents

3.3      Miscellaneous Reports

Even though the SAM report was almost all green, there were other issues:

-       ALICE stress tested new Alien release – no issues reported

-       NSCD daemon problems caused ROOT failures for LHCb at NIKHEF - fixed

-       Regression of dcap VOMS awareness after dCache upgrade for users with multi-VO certificates

-       Several investigations due to low transfer performance for ATLAS (increased FTS slots, also issues in gridftp transfer phase, clock skew)

-       Timeout issues with large CMS files – retuning done

-       SRM@RAL performance impacted by low DB memory – fixed by moving DB processes to larger node

J.Gordon added that they are not yet sure what the causes of the problems are; RAL is investigating the issues.

 

-       Additional DB server node added to ATLAS offline DB

-       Repeated unavailability of CREAM CE for ALICE due to failure + upgrade of fall-back node.

3.4      Proposal from GGUS Development

Periodic ALARM ticket testing rules: the GGUS development team proposes to run GGUS ALARM tests periodically. The proposed procedure is at:

 

https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru

 

ATLAS and CMS agree with the proposal but feedback from the Tier-1 Sites is welcome.

Would one test ticket per month be acceptable?

 

R.Tafirout asked that the test alarms be sent at a reasonable time, near working hours for the Sites, and not at 2 AM for TRIUMF.

J.Gordon added that GOCDB stores the working hours of the Sites and this information could be used.

He also noted that the tests should be done using the VO certificates, not by the GGUS developers; otherwise it is not a realistic alarm and may confuse the Sites instead of checking them. Sometimes it is not clear when an alarm is a test, and some Sites respond only after a few hours, not detecting immediately that it was a test alarm.
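 

As an illustration of how the working hours stored in GOCDB could be used, the sketch below holds a periodic test ALARM back until the target Site's local working hours. The SITE_HOURS data and submit_test_alarm() are placeholders, not the real GOCDB or GGUS interfaces.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Placeholder data: in reality the time zone and working hours would come
# from what GOCDB records for each Site (values below are invented).
SITE_HOURS = {
    "TRIUMF": ("America/Vancouver", 9, 17),
}

def within_working_hours(site, now_utc=None):
    """True if the current time falls inside the Site's declared working hours."""
    tz_name, start_hour, end_hour = SITE_HOURS[site]
    now_utc = now_utc or datetime.now(ZoneInfo("UTC"))
    local = now_utc.astimezone(ZoneInfo(tz_name))
    return start_hour <= local.hour < end_hour

def submit_test_alarm(site):
    # Stub standing in for the actual GGUS test-ALARM submission.
    print(f"Submitting test ALARM to {site}")

# Only send the periodic test ALARM when the Site is likely to be staffed,
# rather than, say, at 2 AM local time.
if within_working_hours("TRIUMF"):
    submit_test_alarm("TRIUMF")
```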

3.5      Conclusions

 

This was the first week with beams and collisions. All Experiments reported success and rapid turnaround of their local and grid-wide data management and processing systems. A big success for the Experiments and the grid computing infrastructures!

 

A few new problems appeared, mainly in the file transfer area:

-       Quickly being picked up by concerned Sites

-       Good attendance at daily meetings

-       Good fraction of “nothing to report” statements from T1 Sites

-       Definitely still many areas to improve but a smooth and controlled first week with LHC data.

 

 

4.   Multi-User Pilot Jobs (MUPJ) Summary (Summary ) – I.Bird

 

 

I.Bird summarized the recent discussions and emails regarding MUPJ. He noted that a few people had sent many comments while the majority had remained rather silent.

 

- It will be acceptable for the Experiments to run multi-user pilot jobs without requiring identity switching for a period of 3 months (i.e. until end of February 2010)

 

- This means that we are temporarily and exceptionally suspending the identity-switching requirement of the existing JSPG Policy on Grid Multi-User Pilot Jobs: https://edms.cern.ch/file/855383/2/PilotJobsPolicy-v1.0.pdf

 

- During this time problems with workloads at a Site will be the responsibility of the VO, i.e. the entire VO could be banned from a Site;

 

- The situation will be reviewed after 3 months, or earlier if needed due to operational or other circumstances.

 

- The deployment of glExec and SCAS (or equivalent) should proceed as rapidly as possible at all Sites. The versions of both components for SL5 are now available. Note that SCAS is the solution for EGEE Sites; other implementations of this function may be used elsewhere, but glExec must be deployed as the interface for the pilot jobs to use (see the sketch after this list).

 

- The Experiments that propose to submit multi-user pilot jobs should endeavour to ensure that their frameworks make use of these tools on this same timescale.

 

- We will implement a test to validate the availability and usability of a glExec installation at a Site.

 

- The long term policy requirements of traceability and the ability to ban individual users from a Site remain unchanged, but we agreed to start a review of how these requirements could be better managed and implemented in the long term to satisfy the needs and constraints of Sites and Experiments.
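 

As a rough sketch of what using glExec means for a pilot framework: before running a payload, the pilot points glExec at the payload owner's proxy and lets glExec (with SCAS or GUMS behind it) switch to the mapped local identity. The glexec path, the exact environment variables and the proxy locations should all be checked against the Site's installation; this is an illustration, not the Experiments' actual framework code.

```python
import os
import subprocess

def run_payload_via_glexec(payload_cmd, user_proxy, glexec="/opt/glite/sbin/glexec"):
    """Run a payload under the identity mapped from the payload owner's proxy.

    The pilot presents the payload owner's credential via GLEXEC_CLIENT_CERT;
    glExec (with SCAS or GUMS behind it) decides the mapping and executes the
    command as the target user. Variable names and the glexec path should be
    verified against the Site's glExec installation.
    """
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = user_proxy    # proxy of the payload owner
    env["GLEXEC_SOURCE_PROXY"] = user_proxy   # proxy to hand over to the target account
    return subprocess.call([glexec] + payload_cmd, env=env)

# Example (placeholder paths, commented out since glexec is Site-specific):
# rc = run_payload_via_glexec(["/bin/sh", "payload_wrapper.sh"],
#                             "/tmp/x509up_payload_user")
```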

 

K.Bos stated that ATLAS, and probably other Experiments too, cannot deploy a new tool: their framework works with the current solution and the new solutions need to be tested. ATLAS will have no time for these tests; they are under pressure for the Physics conferences next year.

I.Bird agreed but noted that it is very serious to neglect possible security issues; at any moment all Physics activities could be blocked for a long time. Therefore the Experiments should work towards the solution of these issues.

 

M.Schulz added that glExec with GUMS has been running for a long time and is in use by CMS. He asked that the usage of glExec be discussed and encouraged at the Technical Forum.

M.Kasemann replied that glExec is not used everywhere: it is used in OSG but not in many EGEE Sites. He agreed that the Experiments should endeavour to ensure the implementation of the glExec solution. If Sites do not proceed with the installations, there should be a review until it is done.

I.Bird added that if these changes do not take effect they will have to be enforced.

 

J.Templon proposed that the GDB in March in Amsterdam discuss the issue and that it be assessed and reviewed in advance, in January and February, by the MUPJ Working Group and the Technical Forum.

 

Ph.Charpentier added that the proposed solution works correctly only if the Sites are correctly configured; otherwise it can take weeks to solve a problem, and it is not a VO's task to debug a Site's configuration. Moving more than one hundred Sites is an impossible effort, and it is surprising that Sites that do not want to perform any other upgrade are now asking for this major change. The MUPJ working group should re-assess the risks of not using the frameworks the VOs have developed.

 

I.Fisk added that the problem is to “track who executed a particular job”; the implementation should be left to the Sites. OSG has been running the GUMS backend since the beginning and the US Tier-2 Sites are running it. If it is useful, they can probably work together in defining SAM tests to check the Sites.

 

Decision:

The MB concluded that the MUPJ Working Group should address the situation in the next months while, in parallel, the Experiments make progress in the usage of glExec.

 

J.Gordon added that many Sites keep refusing or postponing MUPJs; the question should be clearly stated and it should then be seen who accepts or refuses it. We must know the amount of resources that accept or refuse running MUPJs.

Ph.Charpentier asked that Sites that refuse MUPJ present their objective reasons for the refusal.

 

S.Graeme stated the UK Sites are not refusing this solution.

J.Gordon and I.Bird replied that an official answer from the Sites is needed, because the position does not seem the same when it comes via the Experiments and when it comes directly from the Site responsibles. There is a mismatch in the position of the Sites depending on whom one asks. We need the official reply at the GDB and maybe at the CB.

 

M.Kasemann added that if many Sites are going to refuse the solution, why should the Experiments work to adapt to it?

M.Schulz replied that it is mostly due to packaging issues and configuration complexity.

 

O.Smirnova added that in NDGF the administrators are reluctant to use glExec, but she agreed that the position should be officially clarified.

 

 

5.   Issues with SAM/Nagios Migration (Slides) – J.Casey

 

J.Casey introduced a few issues that appeared in the migration from SAM to Nagios.

 

Currently SAM merges several overlapping information sources when creating the list of sites and services to test, namely:

-       OSG OIM

-       EGEE GOCDB

-       BDII

 

This has caused problems in the past, for instance services flipping from one Site to another, which:

-       Impacts availability

-       Creates a large operational load

 

For the future it is proposed that only sites and services from GOCDB+OIM will be tested; no other sources.

Currently there is about a 15% difference in the results, with less in Nagios than currently in SAM.

 

Jobs from the VOs can still go to the resources in the BDII, but those resources are not tested.
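 

A minimal sketch of the proposed change, with placeholder data standing in for the real GOCDB and OIM queries: the list of (site, service) pairs to test is the union of the GOCDB and OIM entries, and anything known only to the BDII is deliberately left out.

```python
def build_test_targets(gocdb_services, oim_services):
    """Union of the services registered in GOCDB (EGEE) and OIM (OSG).

    Each input is an iterable of (site, service_endpoint) pairs. Anything that
    appears only in the BDII is left out, so a service cannot 'flip' between
    Sites depending on which source happened to be read last.
    """
    return set(gocdb_services) | set(oim_services)

# Placeholder data standing in for what the real GOCDB/OIM queries would return.
gocdb = [("SITE-A", "srm.site-a.example.org"), ("SITE-B", "cream.site-b.example.org")]
oim = [("SITE-C", "srm.site-c.example.org")]

for site, endpoint in sorted(build_test_targets(gocdb, oim)):
    print(site, endpoint)
```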

 

J.Gordon noted that VOs have their own list of Sites and check the GOCDB status themselves. Having a single source would be positive.

 

 

6.   From GDB: Installed Capacity Reporting (Slides) – J.Gordon

 

 

At the previous GDB there was a presentation of Gstat 2.0.

 

Gstat 2.0 was described by Laurence Field at the November GDB
 http://indico.cern.ch/materialDisplay.py?sessionId=6&materialId=0&confId=45481 

 

There are site views of the Tier-0, 1 and 2 Sites for WLCG, and the data published is now in the process of being validated.

 

There are SAM tests to check that sites publish and sanity checks on the values published. But only sites know what they should be publishing.

 

J.Gordon compared some data with those published in the WLCG accounting. Slide 3 shows a typical Gstat screen; selecting “level 1” shows the Tier-1 Sites. One can see the usage of CPUs, storage and the Grid jobs running.

 

The table below shows the comparison between the pledges and what is measured by gstat:

-       In bold are the main agreements and discrepancies

-       In green the main agreements (e.g. NDGF)

-       In red the main discrepancies

 

 

CPU (SI2K)

SITE-NAME   Pledge 2009 (kSI2K)   Monthly Accounts (kSI2K)   Monthly Accounts (SI2K)   gstat (SI2K)
CERN        32,970                37,594                     37,594,000                13,476,648
FZK-LCG2    10,355                10,355                     10,355,000                17,127,600
IN2P3       7,998                 7,998                      7,998,000                 9,342,180
INFN-T1     5,300                 3,000                      3,000,000                 4,607,820
NDGF-T1     3,030                 3,030                      3,030,000                 3,047,280
NL-T1       7,544                 3,833                      3,833,000                 19,816,720
PIC         2,591                 2,591                      2,591,000                 1,595,181
RAL-LCG     5,775                 5,775                      5,775,000                 4,304,000
Taiwan      5,000                 2,432                      2,432,000                 3,747,170
TRIUMF      1,420                 905                        905,000                   3,384,240
BNL         7,337                 7,337                      7,337,000                 N/A
FNAL        5,100                 5,100                      5,100,000                 N/A
Total       94,420                89,950                     89,950,000                80,448,839

Disk (TB)

SITE-NAME   Pledge 2009   Accounts Installed   Accounts Allocated   gstat
CERN        10,065        7,774                7,774                0.609
FZK-LCG2    5,142         5,142                3,868                3,498
IN2P3       4,600         4,600                2,553                0
INFN-T1     2,300         1,379                1,300                220
NDGF-T1     1,580         1,580                1,202                1,583
NL-T1       3,753         1,110                940                  1,229
PIC         1,702         1,042                1,067                1,100
RAL-LCG     3,401         2,684                2,508                3,082
Taiwan      3,000         2,500                1,340                220
TRIUMF      990           500                  500                  0
BNL         5,822         3,800                3,000                N/A
FNAL        2,600         2,600                2,600                N/A
Total       44,955        34,711               28,652               10,933

Tape (TB)

SITE-NAME   Pledge 2009   Accounts Installed   gstat
CERN        25,063        25,083               -
FZK-LCG2    7,190         7,190                -
IN2P3       5,175         5,200                -
INFN-T1     2,600         1,500                9,510
NDGF-T1     1,660         1,660                -
NL-T1       3,548         460                  8,000
PIC         1,844         1,050                47,550
RAL-LCG     4,045         2,070                3,318
Taiwan      5,000         3,000                9,510
TRIUMF      750           750                  -
BNL         3,277         3,000                -
FNAL        7,100         9,771                -
Total       67,252        60,734               77,888

 

Sites should explain the discrepancies:

-       PIC and RAL do not use the correct SPECint value.

-       IN2P3 and FZK report all capacity, not only what is used for the LCG.

-       T.Cass reported that CERN considers both grid and non-grid capacity for CPU.

-       The US Tier-1 Sites seem to be missing.
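 

As an illustration of the comparison being made, a small script can flag the Sites whose gstat CPU figure deviates strongly from the monthly accounts, after converting the kSI2K accounts to SI2K. The numbers are copied from the CPU columns of the table above; the 20% threshold is arbitrary and only for illustration.

```python
# CPU figures from the table above: (monthly accounts in kSI2K, gstat in SI2K).
cpu = {
    "CERN": (37594, 13476648),
    "FZK-LCG2": (10355, 17127600),
    "IN2P3": (7998, 9342180),
    "INFN-T1": (3000, 4607820),
    "NDGF-T1": (3030, 3047280),
    "NL-T1": (3833, 19816720),
    "PIC": (2591, 1595181),
    "RAL-LCG": (5775, 4304000),
    "Taiwan": (2432, 3747170),
    "TRIUMF": (905, 3384240),
}

# Flag Sites whose gstat value differs from the monthly accounts by more than 20%.
for site, (accounts_ksi2k, gstat_si2k) in cpu.items():
    accounts_si2k = accounts_ksi2k * 1000  # kSI2K -> SI2K
    ratio = gstat_si2k / accounts_si2k
    if abs(ratio - 1.0) > 0.20:
        print(f"{site}: gstat/accounts = {ratio:.2f}")
```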

 

J.Templon commented that NL-T1 has installed a lot of resources recently.

J.Gordon replied that the goal of this proposal is to see whether a report like this could be adequate. At the same time Sites should correct their information, contacting him if needed.

 

I.Bird proposed that Tier-1 Sites report why they cannot provide gstat with the correct values.

 

 

Slide 7 shows the same information for the Tier-2 Sites. Tier-1 Sites should check their Tier-2 Sites. Some Sites are not flagged as Tier-2.

 

Action:

Tier-1 Sites should explain the differences between the pledges, monthly accounts and the gstat values (see presentation at the MB 1.12.2009). They should also check their Tier-2 Sites.

 

R.Tafirout asked why the TRIUMF values present in gstat 1 are not in gstat 2.

J.Gordon replied that the way to publish the data has changed; he should talk to S.Traylen.

 

 

7.    AOB

 

 

 

Next MB in 2 weeks: 15 December 2009

 

If the GDB in March is held at NIKHEF, the MB should also meet there.

 

 

 

8.    Summary of New Actions

 

 

 

Action:

Tier-1 Sites should explain the differences between the pledges, monthly accounts and the gstat values (see presentation at the MB 1.12.2009). They should also check their Tier-2 Sites.