LCG Management Board

Date/Time

Tuesday 23 June 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55745

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 27.6.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), K.Bos, D.Britton, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, M.Ernst, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, G.Merino, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Invited

A.Sciabá

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 30 June 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

 

H.Marten commented (via email) on the earlier issues with scheduled downtimes at FZK being announced at too short notice.

I.Bird noted that, even if there was some misunderstanding, Sites should provide more information about their plans and interventions.

A.Heiss added that no interruptions were expected; the only problem was a few minutes of interruption in the 3D services for CMS. In the GOCDB one cannot declare a downtime only for the tape back-ends or for 3D; a downtime can be declared only for the whole SE. So FZK would have to declare everything down even if only a small fraction of the services was affected.

1.2      Quarterly Report Preparation

A.Aimar asked that the Experiments give the usual short presentations for the coming quarterly reports.

 

The proposed schedule is:

-       30 June: ALICE, ATLAS

-       7 July: CMS, LHCb

 

WLCG Operations, GDB and Applications Area should also provide their contributions.

 

Note: After the meeting Ph.Charpentier asked that LHCb also present its QR on 30 June.

 

2.   Action List Review (List of actions)

  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.

CMS:
No progress for CMS.

ALICE:
NL-T1 had reported that ALICE replied positively with a few comments and NL-T1 has only to implement some minor changes.
NDGF is waiting for the approval by ALICE.

  • 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

L.Dell’Agnello stated that CNAF completed their internal tests and will send a report to R.Wartel.

  • Tier-1 Sites should start publishing the UserDN information.

Done. J.Gordon asked the portal developers to display an anonymous list on the web.

 

  • 9 Jun 2009 - Sites should report to the MB whether now, after the GDB presentations, the situation of the data rates is clear.

 

ALICE data rates presented in this meeting.

 

  • Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

 

Not done by: DE-KIT, FR-CCIN2P3, NDGF, NL-T1, US-FNAL-CMS

Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs to existing information until they can provide the required information.

 

New Action:

A.Aimar to find out how to display SLS information from all Sites directly, without using the SLS interface, for July’s F2F Meeting, and also which metrics Sites are currently displaying.

 

3.   LCG Operations Weekly Report (Slides) – D.Duellmann
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

Activity is ramping down after the main STEP’09 exercise. Attendance at the OPS meetings is also dropping, which is not a good sign.

 

Service incident reports for:

-       CERN power cut

-       CORAL/LFC issues affecting LHCb.

 

Slide 3 shows only one ticket, from ALICE.

 

On slide 4 there are the availability plots for the 4 SAM VO reports for each Tier-1 Site.

Issues for LHCb at FZK and a scheduled downtime of RAL impacting all Sites.

3.2      CORAL/LFC Degradations (LHCb)

Date: June 11

Description: Scalability issues with CORAL credential look up from LFC

Impact: LHCb LFC degraded

 

Timeline of the Incident

11 June: The R/O instance of LFC (1 node) for LHCb was degraded. Connections were refused because the server was busy servicing over 60 connections and there were no more threads available for new requests.

12 June: LHCb moved to conditions data access via SQLite files

16 June: LHCb reverted to CORAL access to conditions, but using a different method to obtain DB credentials

Full report at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents

 

CORAL changes to avoid blocking LFC threads during credential lookup have been agreed between the CORAL and LFC teams.

Development to fix LFC look up from CORAL planned for July.

 

3.3      Power Cut at CERN

 

Date: June 18

Description: CERN-wide power cut

Impact: Batch, CASTOR and FTS services interrupted for 2 hours

Timeline of the Incident

 

12:05 - Initial brief CERN-wide power cut. CC services continue on UPS battery power.

12:40 - Second power cut; CC running on UPS battery power

12:45 - Power restored, but batteries for one of the 3 UPS units without diesel backup power are exhausted leading to service outage for parts of the batch, CASTOR and database services. Failure of UPS supply leads to automatic trip of associated switchboard in CC substation.

13:10 - Substation switchboard re-enabled.

13:15 - CERN Control Centre confirms power is stable; power-up starts for affected systems

13:20 - batch scheduling disabled

13:30 - Service restart commences

14:00 - FTS database back - FTS service resumes

14:10 - All LFC nodes have been rebooted - all LFC services probably back then

14:26 - Load balancing service back a first time after restart by operator (probably stuck due to high number of nodes down)

14:30 - CASTOR and SRM services restored

14:40 - batch queues re-opened

14:40 - Normal service is resumed, myproxy-fts service still affected

15:45 - Load balancing service back again after being stuck again

Full report at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents

 

Preliminary analysis: believed to have been caused initially by a short circuit followed by a thermal cut out when rerouted supplies caused a cable overload.

3.4      Twiki Outage – June 15

The outage affected the Experiments’ reports and also attendance at the daily operations meeting, as the connection information was temporarily unavailable. An alternate copy of this information is in the ops list archive and at

http://indico.cern.ch/conferenceDisplay.py?confId=62132

 

Twiki content was recovered up to 15 minutes before the outage, but some Twiki attachments were lost (morning of 15 June).

3.5      Update: Oracle BIGID problem in CASTOR

The patch for the Oracle internal consistency problem has been received and is now available for 64-bit and 32-bit clusters.

Test deployments of the patch ongoing at CERN; the original test case is now fixed.

 

ASGC, who have been most significantly affected, are advised to confirm the patch before applying the CASTOR workaround.

3.6      Other Experiment News

CMS tape tests at CERN and RAL without ATLAS.

-       CERN: more than 1 GB/s, backlog digested over a 12h period

-       RAL: 200 MB/s

 

F.Hernandez asked how many drives were used for the tests at RAL and whether they were read or write tests. This does not seem to be visible in the metrics collected in SLS.

D.Duellmann and O.Barring replied that the tests were “write” tests but that it is not known how many drives were used.

J.Gordon added that CMS had up to 10 drives during their read and write tests, and that this information should perhaps be visible in the SLS metrics.

 

ALICE transfer tests successful at 6 Tier-1 Sites. Initially some efficiency issues at FZK, but redistribution of data worked well

 

Post-STEP deletion is ongoing (ATLAS: ~5PB).

 

All experiments declared the end-of-STEP and are returning to standard production activities.

They are analysing the results and preparing for the STEP post-mortem Workshop.

3.7      Other Site News

RAL proceeding with scheduled outage for machine room move. Detailed schedule at http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/

 

ASGC moving back to main site

 

PIC: issues with the dCache load balancing configuration, similar to what NIKHEF had before. Should there be a repository collecting deployment recipes per storage project?

 

SARA performed the DMF upgrade in preparation for the tape drives update.

 

 

 

4.   ALICE Dataflow and Rates at Tier-1 Sites (Paper) – Y.Schutz

 

 

The Sites were only asked to check whether the information is now clear.

 

F.Hernandez asked whether ALICE will ship data from Tier-2 to Tier-1 Sites at 100-120 MB/s. Can this be tuned in order to avoid overloading the network?

 

Y.Schutz answered that the rate can go up to 360 MB/sec to all the Tier-1 Sites (all or each?).

 

New Action:

30 Jun 2009 - Sites comment on the ALICE dataflow and rates.

 

 

5.   High Level Milestones (PDF) – A.Aimar

 

 

The MB discussed and updated the High Level milestones still due. Here is the HLM dashboard BEFORE and AFTER the discussion.

 

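VOBoxes Support

| ID | Date | Milestone | Experiment | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-07-05b | Jul 2007 | VOBoxes Support Accepted by the Experiments. VOBoxes support level agreed by the experiments | ALICE | n/a | | | | | | n/a | | | n/a | n/a | n/a |
| | | | ATLAS | | | | | | n/a | n/a | | | | | n/a |
| | | | CMS | | | | | | n/a | | | n/a | n/a | n/a | |
| | | | LHCb | n/a | | | | | n/a | | | | n/a | n/a | n/a |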

 

WLCG-07-05b: Still to be completed by ALICE and CMS.

 

Tier-2 and VOs Sites Reliability Reports

| ID | Date | Milestone | Month | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-08-09 | Jun 2008 | Weighted Average Reliability of the Tier-2 Federation above 95% for 80% of Sites. Weighted according to the sites’ CPU resources | See separate table of Tier-2 Federations. 80% of the Sites above 95% reliability |
| WLCG-08-11 | Apr 2009 | VO-Specific Tier-1 Sites Reliability. Considering each Tier-0 and Tier-1 site (and by VO?) | Apr 2009 | | | | | | | | | | | | |
| | | | May 2009 | | | | | | | | | | | | |
| | | | Jun 2009 | | | | | | | | | | | | |

 

WLCG-08-09: There is a separate monthly report for the Tier-2 Sites. The latest, for May, is here:

https://twiki.cern.ch/twiki/bin/viewfile/LCG/SamMbReports?filename=Tier2_Reliab_200905.pdf

 

I.Bird noted that only 60% of the Sites are above 95% reliability. The percentage by federation is difficult to see in the current Tier-2 report.
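
As an illustration of how the two figures in WLCG-08-09 relate, the following minimal sketch computes a CPU-weighted reliability for one federation and then the fraction of entries that reach the 95% target. The function names and all sample numbers are hypothetical and are not taken from the Tier-2 report; reading "above 95%" as "greater than or equal to 0.95" is also an assumption.

```python
def weighted_reliability(sites):
    """CPU-weighted average reliability of one federation.

    `sites` is a list of (cpu_capacity, reliability) pairs, with
    reliability expressed as a fraction between 0 and 1.
    (Hypothetical helper, not part of any WLCG tool.)
    """
    total_cpu = sum(cpu for cpu, _ in sites)
    if total_cpu <= 0:
        raise ValueError("federation has no CPU capacity to weight by")
    return sum(cpu * rel for cpu, rel in sites) / total_cpu


def fraction_meeting_target(reliabilities, target=0.95):
    """Fraction of entries whose reliability reaches the target.

    Assumption: "above 95%" is read as >= 0.95.
    """
    if not reliabilities:
        raise ValueError("no reliabilities given")
    meeting = sum(1 for rel in reliabilities if rel >= target)
    return meeting / len(reliabilities)


# Made-up federation: a small site at 90% and a large one at 98%.
example_federation = [(100, 0.90), (300, 0.98)]
print(round(weighted_reliability(example_federation), 4))

# Made-up reliability values, one per entry being counted.
print(fraction_meeting_target([0.96, 0.97, 0.90, 0.99, 0.80]))
```

With these made-up inputs the large site dominates the average, giving a weighted value of 0.96, and 3 of the 5 sample values (60%) reach the target.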

 

WLCG-08-11: A.Aimar will update the milestones with the values for April, May and June. Values for each VO will have to be added.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-22 | Jul 2009 | SL5 Deployed by the Sites (64-bit nodes). Assuming the tests by the Experiments were successful. Otherwise a real gcc 4.3 porting of the WN software is needed. | | | | | | | | | | | | |

 

WLCG-09-22: The status of the SL5 milestone will be tracked at the GDB in July.

 

L.Dell’Agnello asked when Sites are expected to have SL5 available. Experiments should define a clear calendar.

J.Gordon replied that CERN makes a small cluster available with the metapackage defined at the AF installed. Sites and Experiments should check that it is suitable for them. At the next GDB it will be agreed how to proceed. Each Site should provide a CE pointing at some SL5 WNs by July, with a non-trivial number of WNs on SL5.

 

Tier-1 Sites Procurement - 2009

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-01 | Sept 2009 | MoU 2009 Pledges Installed. To fulfill the agreement that all sites procure their MoU pledges by April of every year | | | | | | | | | | | | |

 

WLCG-09-01: Will wait for the meeting with the RSG and the Referees.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-18 | Apr 2009 | SCAS Verified by the Experiments. Experiments verify that the SCAS implementation is working (available at CNAF and NL-T1) | ALICE: n/a | ATLAS | CMS: n/a ? | LHCb |
| WLCG-09-19 | 09-18 + 1 Month | SCAS + glExec Deployed and Configured at the Tier-1 Sites. SCAS and glExec ready for the Experiments. | | | | | | | | | | | | |
| WLCG-09-20 | 09-18 + 3 Month | SCAS + glExec Deployed and Configured at the Tier-2 Sites. SCAS and glExec ready for the Experiments. |

 

 

I.Bird noted that the MB should check the status of the patch solving the issues about the passing of the environment.

 

New action:

M.Schulz should report about the status of the gLExec patch on passing the environment.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-03 | TBD | Tier-2 Sites Report Installed Capacity in the Info System. Both CPU and Disk Capacity is reported in the agreed GLUE 1.3 format. | % of T2 Sites Reporting |
| WLCG-09-04 | TBD | User Level Accounting (verify with the Experiments) |

 

 

WLCG-09-03: S.Traylen should come to the F2F to provide a summary of the situation about reporting installed capacity.

 

WLCG-09-04: Are the Experiments satisfied with the information available? To be split into 2 milestones, one by Experiment and one by Sites, to see whether they publish the information.

 

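STEP 2009 - Tier-1 Validation

| ID | Date | Milestone | Experiment | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-23 | Jun 2009 | Tier-1 Validation by the Experiments | ALICE | n/a | | | | | | n/a | | | n/a | n/a | n/a |
| | | | ATLAS | | | | | | n/a | n/a | | | | | n/a |
| | | | CMS | | | | | | n/a | | | n/a | n/a | n/a | |
| | | | LHCb | n/a | | | | | n/a | | | | n/a | n/a | n/a |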

 

WLCG-09-23: Will be updated at the STEP09 Post-mortem Workshop.

 

CREAM CE Rollout

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-26 | May 2009 | All European T1 + TRIUMF and CERN at least 1 CE. 5 T2s supporting 1 CE | | | | | | | | | | | | |
| WLCG-09-27 | Jul 2009 | 2 T2s for each experiment provide 1 CREAM-CE each. | ALICE | ATLAS | CMS | LHCb |
| WLCG-09-28 | Sep 2009 | 50 sites in addition to the ones above |

 

 

WLCG-09-26: J.Gordon asked who is actually testing CREAM at the Tier-1 Sites apart from ALICE. There are currently 7 Tier-2 Sites installed.

 

IN2P3 and BNL noted that they do not have the CREAM CE installed.

G.Merino added that PIC has the CREAM CE installed but that it is not clear whether the Experiments are testing, or planning to test, CREAM at other Sites.

R.Tafirout added that TRIUMF has installed the CREAM CE but it seems that ATLAS is not going to use it soon.

 

I.Bird asked about ATLAS’ plans for testing the CREAM CE. They will be discussed in the GDB.

 

SRM Milestones

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-05 | Dec 2008 | SRM Short-Term Solutions Available for Deployment | CASTOR | dCache | DPM | StoRM | BestMan |
| WLCG-09-06 | TBD | SRM Short-Term Solutions Deployed at Tier-1 Sites. Installation at the Tier-1 Sites | | | | | | | | | | | | |

 

WLCG-09-05 and -06: There seems to be no outstanding development planned. The remaining list was presented and no Experiment considered the remaining items needed.

I.Bird suggested, and the MB agreed, marking these milestones as done.

 

Metrics and Monitoring Milestones

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-09 | TBD | Tier-1 Sites Define Their MSS Metrics. Tier-1 Sites specify which metrics are going to be collected to demonstrate the dataflow supported | | | | | | | | | | | | |
| WLCG-09-10 | TBD | Tier-1 Sites Show Their MSS Metrics. Tier-1 Sites specify where their MSS metrics are available | | | | | | | | | | | | |
| WLCG-09-11 | TBD | Automatic Alarms (SAM, etc) at the Tier-1 Sites. Tier-1 Sites should be able to automatically send, receive and handle alarms and problem notifications | | | | | | | | | | | | |
| WLCG-09-12 | TBD | Monitoring of the Storage Systems. The Storage systems used provide monitoring information to Sites and Experiments | CASTOR | dCache | DPM | StoRM | BestMan |
| WLCG-09-13 | TBD | Performance Metrics? User Response, Services Downtimes, Operations KPI | | | | | | | | | | | | |

 

WLCG-09-09 and -10: Replaced by the status of the ongoing SLS integration activity.

 

WLCG-09-11: Experiments should be able to send Nagios messages to Sites. Sites can decide what to do with the message. Should be discussed at the GDB.

 

WLCG-09-12: D.Duellmann noted that CASTOR is going to release some monitoring features, but the SRM-related monitoring should be standardized. A defined list of metrics has not been agreed yet. At the WLCG Post-mortem Workshop there will be reports covering the issues. This could be a topic for the Technical Forum.

 

I.Bird noted that this milestone should stay because there are still serious problems/issues about behaviour of the SRM/SEs.

K.Bos added that at the Tier-2 Sites it was impossible to measure the I/O between CPU and storage systems, which is very important in the analysis use cases.

 

WLCG-09-13: Some metrics should be defined and stored. Session at the GDB and the Technical Forum.

 

| ID | Date | Milestone | ASGC | CC IN2P3 | CERN | DE-KIT | INFN CNAF | NDGF | PIC | RAL | SARA NIKHEF | TRIUMF | BNL | FNAL |
| WLCG-09-16 | Apr 2009 | New Experiments Requirement in HEPSPEC-06. Experiments should convert their requirements to the new unit (or by LCG Office) | ALICE | ATLAS | CMS | LHCb |
| WLCG-09-24 | May 2009 | Sites Report Capacity in the HEPSPEC-06. Resources from the Sites should be converted to the new unit | | | | | | | | | | | | |

 

WLCG-09-16: This milestone is done. Experiments have their requests in HEPSPEC-06 units.

 

WLCG-09-24: This should mean that the Sites re-express their capacity in the new unit. They should actually benchmark their capacity and publish the values.

 

MILESTONES AS UPDATED DURING THE MEETING: https://twiki.cern.ch/twiki/pub/LCG/MilestonesPlans/WLCG_High_Level_Milestones_20090623.pdf

 

 

6.   AOB

 

 

 

J.Gordon noted that the Post-mortem could ask for changes, such as moving everything to SL5, and that this could endanger stability. The Post-mortem should conclude on common measures.

 

The GDB and MB will cover topics complementary to the Post-mortem Workshop.

 

 

7.    Summary of New Actions

 

 

New Action:

A.Aimar to find out how to display SLS information from all Sites directly, without using the SLS interface, for July’s F2F Meeting, and also which metrics Sites are currently displaying.

 

New Action:

30 Jun 2009 - Sites comment on the ALICE dataflow and rates.

 

New Action:

M.Schulz should report about the status of the gLExec patch on passing the environment.