LCG Management Board

Date/Time

Tuesday 1 September 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62556

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 4.9.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, L.Bauerdick, I.Bird, M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Litmaath, G.Merino, A.Pace, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers (chair), O.Smirnova

Invited

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 8 September 2009 16:00-17:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes.

 

L.Dell’Agnello asked to be removed from the attendance list. DONE.

 

The minutes of the previous MB meeting were approved.

1.2      2009-2010 Requirements/Procurements (Document) – I.Bird

I.Bird summarized the situation with the 2009 Experiments requirements and Sites procurement.

He attached to the agenda:

-       The report of the LHCC Review of the Computing Resources. It is final, but will not be approved until the end of September.

-       The Scrutiny Group report. Also not final, but it describes the current status.

 

I.Bird and S.Foffano will meet S.Bertolucci during the week and will agree on what to tell the funding agencies about 2010 before the next RRB at the end of the month.

 

Ph.Charpentier asked whether it is clear to the Sites that all 2009 resources should be installed before the end of September.

I.Bird replied that this should have been clear for several months, as agreed at the MB. He asked whether there is any Tier-1 that has not understood this.

 

L.Dell’Agnello noted that CNAF had already said that it will not be able to install additional capacity for 2009 and will be below the pledges.

I.Bird replied that this was a known situation.

 

M.Bouwhuis added that NL-T1 has purchased all the hardware but will not have it all installed by the end of September. The material will be received, accepted and installed by early November.

 

D.Britton added that the UK Tier-1 will only have half of the disk pledges in September because the rest still has to be validated; it will be available by the end of November.

 

A.Heiss reported that all hardware for 2009 is installed but not yet made available. When it is needed it can all be brought online in a few days.

 

D.Britton asked where the updated 2010 requirements are collected; for instance, there is a discrepancy for LHCb.

I.Bird replied that it is specified in the attached document. The LHCC has agreed on the LHCb request, while the Scrutiny report is not definitive yet. The discussion with S.Bertolucci will clarify what the official request is.

 

F.Hernandez noted that on the LCG Web there is a Resources page, which shows the last RRB document.

I.Bird replied that that was the past reference and that the LHCC Review in July was done to clarify these numbers. Since it is not yet approved, it is not on the official WLCG Web Site.

 

S.Foffano added that she will send a request for input from each Tier-1 Site.

1.3      EGI FTE at the Tier-1 Sites

I.Bird added that the EU Tier-1 Sites should confirm whether they want 50% of one FTE as part of the EGI project.

 

New Action:

Each EU Site should send an email to J.Shiers stating whether or not they want the (50%) FTE for Services Support in the EGI Proposal.

 

 

2.   Action List Review (List of actions)

 

  • 5 May 2009 - CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

R.Wartel told A.Aimar that the action is still incomplete and the information is insufficient.

L.Dell’Agnello replied that he will send additional information.

  •  Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

Not done by: ES-PIC, FR-CCIN2P3, NDGF, NL-T1.
Sites can provide what they have at the moment.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs to the information they already have until they can provide the required metrics (a sketch of how such a feed could be consumed follows this action list).

  • The MB sends to I.Bird input on topics and priorities for the Technical Forum.

Done.

  • Sites will be asked to report on the items resulting from the STEP09 Workshop. STEP09_Actions.pdf

Done.

  • A.Aimar will contact T.Bell for MSS real-time metrics.

Done. Scheduled for the F2F next week.

  • Experiments send to A.Sciabá their ranking for the features in the SRM MoU addendum.

Done. Scheduled for the F2F next week.
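
Regarding the tape-metrics action above, the sketch below shows how a published XML file could be fetched and its numeric metrics printed. It is a minimal illustration only: the URL is a placeholder and the element layout (an SLS-style data block containing numericvalue elements) is an assumption, not the agreed schema.

    # Minimal sketch: fetch a site's tape-metrics XML and print its numeric values.
    # The URL and element names are illustrative assumptions (SLS-style layout);
    # the schema actually agreed with A.Aimar may differ.
    import urllib.request
    import xml.etree.ElementTree as ET

    METRICS_URL = "https://tier1.example.org/tape_metrics.xml"  # placeholder URL

    def fetch_tape_metrics(url):
        """Return a {metric_name: value} dict from an SLS-style XML update."""
        with urllib.request.urlopen(url, timeout=30) as response:
            root = ET.fromstring(response.read())
        metrics = {}
        for element in root.iter():
            # Assumed layout: <data><numericvalue name="...">value</numericvalue></data>
            if element.tag.endswith("numericvalue"):
                metrics[element.get("name", "unnamed")] = float(element.text)
        return metrics

    if __name__ == "__main__":
        for name, value in fetch_tape_metrics(METRICS_URL).items():
            print(f"{name}: {value}")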

 

3.   LCG Operations Weekly Report (Slides) – O.Barring
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Tickets and Alarms

The report covers the two weeks from 17 to 29 August. Attendance at the daily meetings was good: the first week was affected by the holiday period, with a significant ramp-up and high attendance in the second week.

 

The main events were:

-       BNL site name change: BNL-LCG2 → BNL-ATLAS

-       Kernel patch for published local exploits

 

Two alarm tickets were sent during the two weeks:

-       51206 ATLAS → BNL: TEST ALARM Ticket after name change

-       51172 CMS → CERN: All dedicated CMS LSF queues at CERN disabled with no clear pre-warning.

 

Only one incident led to a service incident report:

-       26/8/2009 CERN: Closing of public and production queues for an emergency batch reboot to apply a kernel upgrade

 

VO        User    Team    Alarm    Total
ALICE        8       0        0        8
ATLAS       21      56        1       78
CMS          2       0        1        3
LHCb         0      12        0       12
Totals      31      68        2      101

3.2      SAM VO Reliability

Slide 4 shows the reliability of each Site in terms of SAM VO test results.

 

One can notice that

-       NIKHEF completed the move ~24/8

-       SARA: the ongoing ALICE downtime is due to failing ALICE VOBoxes there.
GGUS ticket 51238 was submitted on the 31st and is probably solved.

-       RAL was back in full production on Tuesday 18th, after the air-conditioning stoppages due to water chiller failures that had started on the 12th (see H.Renshall’s report at the last MB)

3.3      SIR: Closing batch queues at CERN

All details are here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090826

 

On the advice of the CERN security team and Grid security team, and due to two severe kernel vulnerabilities (one with no workaround), the batch services had to be reinstalled with a new kernel and rebooted. The system was partially drained first.

 

Wednesday

          16.00 - Long running public batch queues set Inactive to drain servers.

          16.50 - Mail sent to info-experiments describing this and the plan to reboot at 10.00 next day.

          18.40 - All batch queues set Inactive.

          18.49 - Mail sent for the SSB to Computer Operations describing the current situation (posted 18.55)

          20.11 - MOD sees the original info-experiments mail and moves the 18.49 update [saying it is all batch queues] to the usual place on the SSB, but links from it the original mail sent at 16.50, which says it is only the public batch queues

          21.38 - CMS "T0 operations list" mail describing the problem of the Inactive T0 queue and complaining about the confused messages ("public" queues vs. "all" queues)

          22.26 - Update to computer operations on SSB saying that public and grid queues are now open with limited capacity.

          22.38 - GGUS alarm ticket from CMS arrives detailing the issue with Inactive CMS T0 queue

          23.06 - Operator responds to alarm ticket, asking for clarification

          23.30 - CMS T0 hosts rebooted and queues reopened.

 

Thursday

          09.45 - Remaining batch jobs killed. Service rebooted.

          12.00 - Reboots ongoing (30% already available). All queues reactivated.

          16.00 - All machines rebooted and in production (except for a few outliers).    

 

 

The patch for the WN was ready but needed to be propagated, which meant stopping or draining the queues.

CMS did not notice the announcement and this cost them some resources (about 6 hours).

 

The MOD and the CC Operator were updating the information at the same time and the communication was not handled properly. The options for the future are:

-       Drain the queues and wait for the jobs to end

-       Reboot the machines and the jobs will be reported as failed to the VO.

To drain or not? Only ~650 jobs out of 17,000 were lost, but closed queues mean lost processing time. Some VOs would prefer the queues to be kept open and the nodes not drained while rolling out the new kernel, with the running jobs killed instead.
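
For illustration only, the drain option could look roughly like the sketch below, which deactivates a set of LSF queues so that no new jobs start and then waits for the running jobs to finish before the worker nodes are rebooted. The queue names are placeholders and this is not the procedure FIO actually used; it only makes the trade-off between draining and killing jobs concrete.

    # Sketch of the "drain first" option: deactivate queues so no new jobs start,
    # then wait for running jobs to finish before rebooting the worker nodes.
    # Queue names are placeholders; this is not the actual FIO procedure.
    import subprocess
    import time

    QUEUES = ["grid_cms", "grid_atlas", "public_long"]  # hypothetical queue names

    def deactivate(queues):
        for queue in queues:
            # "badmin qinact" stops LSF from dispatching new jobs to the queue;
            # jobs that are already running keep running.
            subprocess.run(["badmin", "qinact", queue], check=True)

    def running_jobs(queue):
        # "bjobs -r" lists running jobs; the first line of any output is a header.
        result = subprocess.run(["bjobs", "-u", "all", "-r", "-q", queue],
                                capture_output=True, text=True)
        lines = [line for line in result.stdout.splitlines() if line.strip()]
        return max(len(lines) - 1, 0)

    def wait_until_drained(queues, poll_seconds=300):
        while any(running_jobs(q) > 0 for q in queues):
            time.sleep(poll_seconds)

    deactivate(QUEUES)
    wait_until_drained(QUEUES)
    print("Queues drained; worker nodes can be rebooted with the new kernel.")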

 

The CERN batch configuration is highly customized, with more than 70 dedicated queues. With a simplified configuration of only a few large resources it would have been possible to handle the rollout almost transparently, without closing the queues.

 

During the emergency rollout there was no time to negotiate how to handle each dedicated queue. FIO is considering different options for better handling similar situations in the future and will make a proposal to the VOs later this month.

 

D.Barberis noted that it is bad practice to send the description of the exploit in the information email.

O.Barring agreed that it was a serious mistake and that this is why the update was forced immediately everywhere.

3.4      BNL Site Name Change

Site name change in the BDII on 25/8 at 15:00 CEST (09:00 EDT):

 

-       BNL-LCG2 → BNL-ATLAS

 

A few minutes (~15’) later the ATLAS Tier-1 sites could start updating their FTS channel definitions, following a procedure provided by the FTS team. The update of the channel definitions had to be synchronized with the BDII updates.

 

RAL and TRIUMF have disabled their daily BDII updates, therefore the update is still pending there. In addition, some sites did not actively confirm the update even though it was successful.

 

BNL planned the operation and executed it across the grid with the assistance of the Tier-0 FTS team. The operation was transparent for ATLAS and transfers did not suffer. The site name change was also made in GGUS, and a TEST ALARM ticket was injected by ATLAS in order to confirm that the workflow was working.

 

I.Bird added that RAL and TRIUMF should report why they did not update the BDII database.

3.5      Miscellaneous Reports

A few other smaller issues were reported during the week:

-       ALICE VOBoxes issue at CERN. Problem had been reported from several sites where ALICE VOBoxes couldn't connect to the Alien DB at CERN. The port 8084, needed by Alien, had been closed recently and must now be re-opened.

-       RAL lost one disk server while recovering from the cooling problems (99k files out of 4M in the MCDISK space token).

-       CMS tape migrations stalled at ASGC, ongoing for two weeks. The local CASTOR staff is investigating. Help from the CASTOR teams at CERN may be required; if so, the castor-operation-external@cern.ch mailing list should be used.

-       OPN problems at PIC. Connectivity between PIC and CNAF did not work over the backup route.

L.Dell’Agnello added that the network provider of the Tier-1 uses static routing, and this caused the problem in moving to the backup route.

 

-       Early announcement of downtime registration no longer possible in GOCDB? Reported by PIC, who wanted to register a downtime of several hours a week ahead. The option to send an advance announcement seems to have disappeared.

 

 

4.   CASTOR Versions (Slides) – A.Pace

 

 

A.Pace reported on the current and coming versions of CASTOR.

 

The current versions and their status are:

-       2.1.7 Stable, released in March 2008.
In production at Tier 1 sites (RAL, CNAF, ASGC)
No major issues reported from Tier-1

-       2.1.8 Stable, released November 2008.
In production at CERN.
Better support for analysis requirements and tape repack functionality.
Will be needed when there is massive recovery of files (e.g. in analysis).

-       2.1.9 Planned – no date available
Major consolidation aiming at operational cost reduction, no extra features
Database schema changes to allow future support of new tape format.

 

Statement about support.

As long as the Tier-1 sites do not need the extra functionality of 2.1.8 / 2.1.9 they can remain on 2.1.7.

-       Several bug fixes cannot be back ported to 2.1.7.

-       2.1.8 and 2.1.9 functionalities will not be back ported to 2.1.7

 

The amount of support requests from the Tier-1 sites in recent months has been very low, which shows that 2.1.7 is reliable. CERN will try to continue supporting 2.1.7 but has concerns about whether this can continue in 2010.

 

There is no 2.1.7 instance running at CERN, and several operational issues requiring support have been addressed in 2.1.8 or 2.1.9.

 

I.Bird noted that Sites should be pushed to move, but it is not so urgent.

D.Barberis added that it should depend on the Site, but Sites that stay on 2.1.7 will not have the improvements if these turn out to be needed.

J.Shiers added that there are operational reasons for doing the update: the upgrades are not cumulative, and if all pending updates become needed they will have to be done in a rush.

 

D.Britton stated that in April the decision to stay on 2.1.7 was taken together with the Experiments and CERN. RAL will not make any update in 2009; it could do it in 2010 and this will be discussed with the Experiments.

O.Barring reminded the MB that the longer a Site waits the more difficult the upgrade will be, because the updates are not cumulative.

 

Ph.Charpentier asked why RAL waits until data taking for the upgrade instead of doing it before.

D.Britton replied that the situation was analysed and agreed. RAL will upgrade in 2010 only if needed. The current situation is very stable and there are no compelling features that are needed by ATLAS.

 

M.Kasemann asked when the support for 2.1.7 will end.

A.Pace replied that CERN will try to continue supporting 2.1.7 as long as necessary. But if a major fix is needed it will not be ported to 2.1.7, and issues cannot be reproduced at CERN since no 2.1.7 instance runs there. In addition, some features do not work on 2.1.7, and if the system is loaded more these issues will show up.

 

D.Britton replied that STEP09 validated a realistic load on the Sites, and the Sites were validated in STEP09 with 2.1.7. Moving to 2.1.8 would result in a setup that has not been properly verified.

 

J.Gordon added that each Experiment has its own CASTOR instance and LHCb could be upgraded separately, but not in 2009.

 

 

 

5.   High Level Milestones Update (HLM_20090724.pdf) – A.Aimar

 

 

A.Aimar reviewed the status of the High Level Milestones with the MB.

 

Pilot Jobs Frameworks

WLCG-08-14 (May 2008): Pilot Jobs Frameworks studied and accepted by the Review working group. Working group proposal complete and accepted by the Experiments.
     ALICE: -    ATLAS: -    CMS: -    LHCb: Nov 2007

 

WLCG 08-14: M.Litmaath added that there is no progress.

 

Tier-2 and VOs Sites Reliability Reports

WLCG-08-09 (Jun 2008): Weighted Average Reliability of the Tier-2 Federation above 95% for 80% of Sites, weighted according to the sites' CPU resources.
     See the separate table of Tier-2 Federations: 80% of the Sites above 95% reliability.

WLCG-08-11 (Apr 2009): VO-Specific Tier-1 Sites Reliability, considering each Tier-0 and Tier-1 site (and by VO?).
     Tracked monthly: Jul 2009, Aug 2009, Sep 2009.

Currently about 80% of the Sites are above 90% reliability, not 95%.

See the latest report here: https://twiki.cern.ch/twiki/pub/LCG/SamMbReports/Tier2_Reliab_200907.pdf
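
As an illustration of how the WLCG-08-09 target is evaluated, the sketch below computes the CPU-weighted average reliability of a federation and the fraction of sites at or above the 95% threshold. Site names and numbers are invented for the example.

    # Illustration of the WLCG-08-09 metric: CPU-weighted average reliability of a
    # Tier-2 federation and the fraction of sites above the 95% target.
    # Site names and values are invented for the example.

    sites = {
        # site: (reliability, CPU capacity used as the weight)
        "T2_Example_A": (0.97, 4000),
        "T2_Example_B": (0.92, 1500),
        "T2_Example_C": (0.99, 2500),
    }

    total_cpu = sum(cpu for _, cpu in sites.values())
    weighted_reliability = sum(rel * cpu for rel, cpu in sites.values()) / total_cpu

    above_target = sum(1 for rel, _ in sites.values() if rel >= 0.95)
    fraction_above = above_target / len(sites)

    print(f"CPU-weighted reliability: {weighted_reliability:.1%}")   # 96.7% here
    print(f"Sites at or above 95%:    {fraction_above:.0%} (target: 80%)")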

 

SL5 Milestones

WLCG-09-21 (Mar 2009): SL5 gcc 4.3 (WN 4.1 binaries) tested by the Experiments. Experiments should test whether the MW on SL5 supports their grid applications.
     ALICE: -    ATLAS: -    CMS: -    LHCb: -

WLCG-09-22 (Jul 2009): SL5 deployed by the Sites (64-bit nodes), assuming the tests by the Experiments were successful. Otherwise a real gcc 4.3 porting of the WN software is needed.

M.Schulz noted that the SL5 deployment has only just become ready. The percentage of Sites that have moved to SL5 can be extracted from the database.

J.Gordon added that Sites will report at the GDB and then one can update these milestones.

 

SCAS/glExec Milestones

WLCG-09-17 (Jan 2009): SCAS Solutions Available for Deployment. Certification successful and SCAS packaged for deployment.
     Done in March 2009.

WLCG-09-18 (Apr 2009): SCAS Verified by the Experiments. Experiments verify that the SCAS implementation is working (available at CNAF and NL-T1).
     ALICE: n/a    ATLAS: -    CMS: n/a?    LHCb: -

WLCG-09-19 (09-18 + 1 month): SCAS + glExec Deployed and Configured at the Tier-1 Sites. SCAS and glExec ready for the Experiments.

WLCG-09-20 (09-18 + 3 months): SCAS + glExec Deployed and Configured at the Tier-2 Sites. SCAS and glExec ready for the Experiments.

M.Schulz noted that the Sites' move to SL5 will be in conflict because there is no glExec on the SL5 WN. SCAS can instead be deployed as a service at the Site.

J.Gordon will also ask the Tier-1 Sites about this.

 

Accounting Milestones

WLCG-09-02 (Apr 2009): Wall-Clock Time Included in the Tier-2 Accounting Reports. The APEL Report should include CPU and wall-clock accounting.
     APEL

WLCG-09-03 (Jul 2009): Tier-2 Sites Report Installed Capacity in the Info System. Both CPU and Disk Capacity are reported in the agreed GLUE 1.3 format.
     % of T2 Sites Reporting

WLCG-09-04a (Jul 2009): Sites publishing the User Level Accounting information.

WLCG-09-04b (Jul 2009): User Level Accounting verified and approved by the Experiments.
     ALICE: -    ATLAS: -    CMS: -    LHCb: -

 

WLCG-09-03: J.Gordon will find the information.

 

STEP 2009 - Tier-1 Validation

WLCG-09-23 (Jun 2009): Tier-1 Validation by the Experiments.
     Per-Tier-1 status for ALICE, ATLAS, CMS and LHCb (n/a where an experiment does not use the site); see the attached HLM document for the individual entries.

 

Can be considered done by all Experiments.

 

 

CREAM CE Rollout

WLCG-09-25 (Apr 2009): Release of CREAM CE for deployment.

WLCG-09-26 (May 2009): All European Tier-1 sites + TRIUMF and CERN with at least 1 CE; 5 Tier-2 sites supporting 1 CE.
     Per-site status (n/a where not applicable); see the attached HLM document for the individual entries.

WLCG-09-27 (Jul 2009): 2 Tier-2 sites for each experiment provide 1 CREAM-CE each.
     ALICE: -    ATLAS: -    CMS: -    LHCb: -

WLCG-09-28 (Sep 2009): 50 sites in addition to the ones above.

 

 

M.Schulz reported that NL-T1 and FR-IN2P3 do not yet support the CREAM CE.

 

J.Gordon added that the WMS needs to be installed in order to use the CREAM CE.

M.Schulz replied that it is certified and available; Sites should install it in order to test it.

I.Bird added that one needs to have a realistic test with many Sites and workload.

 

M.Litmaath added that the tests need to be done before October. A.Retico is discussing it with CMS.

 

M.Schulz will report on the preparation of the CREAM CE tests with CMS next week.

 

CPU Benchmarks/Units Milestones

WLCG-09-14 (Dec 2008): CPU New Unit Working Group Completed. Agreement on benchmarking methods, conversion proposal and test machines.
     CPU New Benchmarking Unit Working Group

WLCG-09-15 (Feb 2009): Sites Pledges in HEPSPEC-06. Pledges from the Sites should be converted to the new unit.
     LCG Office

WLCG-09-16 (Apr 2009): New Experiments Requirements in HEPSPEC-06. Experiments should convert their requirements to the new unit (or this is done by the LCG Office).
     ALICE: -    ATLAS: -    CMS: -    LHCb: -

WLCG-09-24 (May 2009): Sites Benchmark their Capacity in HEPSPEC-06. Resources from the Sites should be converted to the new unit.

Will be asked at the GDB.
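
For the conversions behind WLCG-09-15 and WLCG-09-16, a minimal worked example is shown below. It assumes the commonly quoted factor of 4 HEP-SPEC06 per kSI2K; the factor officially agreed by the working group should be used in practice, and the pledge numbers are invented.

    # Worked example: converting pledges/requirements from kSI2K to HEP-SPEC06
    # (WLCG-09-15 / WLCG-09-16). The factor 4.0 is the commonly quoted
    # kSI2K -> HEP-SPEC06 conversion; the officially agreed value should be used.

    KSI2K_TO_HEPSPEC06 = 4.0

    def to_hepspec06(value_ksi2k):
        return value_ksi2k * KSI2K_TO_HEPSPEC06

    # Hypothetical pledges in kSI2K
    pledges = {"Tier1_Example": 5000.0, "Tier2_Federation_Example": 1200.0}

    for site, ksi2k in pledges.items():
        print(f"{site}: {ksi2k:.0f} kSI2K -> {to_hepspec06(ksi2k):.0f} HEP-SPEC06")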

 

Here is the HLM table as updated after the discussion at the meeting:

https://twiki.cern.ch/twiki/pub/LCG/MilestonesPlans/WLCG_High_Level_Milestones_20090901.pdf

 

 

6.    AOB

 

 

J.Gordon: No Pre-GDB next week.

I.Bird: The next Referees' meeting is on 21 September at lunchtime. An agenda will be distributed soon.

 

 

7.    Summary of New Actions

 

 

 

No new actions.