LCG Management Board

Date/Time

Tuesday 10 November 2009 16:00-18:00 – F2F Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=71051

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 – 9.11.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird (chair), K.Bos, M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell'Agnello, D.Duellmann, Qin Gang, M.Girone, J.Gordon, A.Heiss, F.Hernandez, M.Litmaath, G.Merino, A.Pace, B.Panzer, M.Schulz, Y.Schutz, O.Smirnova, R.Tafirout

Invited

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 24 November 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

Feedback received from RAL and D.Barberis about the minutes of the previous meeting.

Minutes updated (changes are in blue).

 

2.   Action List Review (List of actions)

 

  • Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

NDGF provided a URL and all issues are resolved.
FR-CCIN2P3 will start publishing the first set of metrics on November 16th (Update: done on the 16th).
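
For sites still preparing their XML file, a quick sanity check of the published URL may help before sending it to A.Aimar. The following minimal Python sketch is illustrative only: the element names it looks for ("availability" and "numericvalue") are assumptions, not the actual SLS schema.

    # Minimal sketch: fetch a site's tape-metrics XML and print a few fields.
    # The element names below are assumptions for illustration; the real SLS
    # schema may differ.
    import urllib.request
    import xml.etree.ElementTree as ET

    def check_tape_metrics(url):
        with urllib.request.urlopen(url, timeout=30) as response:
            root = ET.parse(response).getroot()
        print("availability:", root.findtext(".//availability"))
        for value in root.iter("numericvalue"):
            print(value.get("name"), value.text)

    # Hypothetical URL, for illustration:
    # check_tape_metrics("http://example-site.org/tape_metrics.xml")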

3.   LCG Operations Weekly Report (Slides) – D.Duellmann
 

 

Summary of the status and progress of LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

This summary covers the period from 25th October to 6th November. There were a few incidents leading to SIR reports:

-       ASGC ATLAS conditions database re-synchronization.

-       ASGC CASTOR Service outage

-       SARA FTS and LFC database inconsistency after h/w move

-       Large number of batch queries at CERN from LHCb

-       New Linux exploits and associated patch campaigns

-       SIR for IN2P3 cooling incident on Nov 3rd 

 

The GGUS summary is shown below:

VO        User   Team   Alarm   Total
ALICE        2      0       0       2
ATLAS       23     98       0     121
CMS         11      1       0      12
LHCb         1     18       0      19
Totals      37    117       0     154

 

No alarm tickets in the period.

3.2      Meeting Attendance

Site      Attendance (Mon-Fri)
CERN      5/5
ASGC      5/5
BNL       5/5
CNAF      5/5
FNAL      0/5
FZK       4/5
IN2P3     3/5
NDGF      0/5
NL-T1     5/5
PIC       2/5
RAL       5/5
TRIUMF    n/a

3.3      Availability (slide 5) and Service Incidents

One can notice that TW-ASGC was unavailable to CMS for the whole two-week period.

 

Security Intervention against new Linux Exploits

From the EGEE broadcast last Thursday:

A severe vulnerability in the 2.4 and 2.6 Linux kernels (CVE-2009-3547) was published, which leads to a kernel NULL pointer dereference, i.e. it falls into the same category as the previous vulnerabilities that have led to a number of successful root-level intrusions. A public exploit has been released, and all sites are asked to URGENTLY APPLY the relevant security patches.

 

A large number of installations were affected (including WNs):

-       A rapid "patch and reboot" campaign was started at CERN and other sites (a sketch of a kernel version check follows after this list).

-       The VO responsibles also reacted quickly for the VOBoxes.

-       The intervention at CERN was concluded last Friday.
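
As an illustration of the kind of check run during such a campaign, the sketch below compares the running kernel release against a minimum patched version. The threshold value is a placeholder assumption, not the actual fixed kernel version for CVE-2009-3547.

    # Minimal sketch: flag a host whose running kernel is older than a given
    # patched release. PATCHED_RELEASE is a placeholder, not the actual fixed
    # kernel version for CVE-2009-3547.
    import platform

    PATCHED_RELEASE = (2, 6, 18, 164)  # hypothetical minimum patched version

    def release_tuple(release):
        """Turn '2.6.18-164.6.1.el5' into a comparable tuple of integers."""
        digits = []
        for part in release.replace("-", ".").split("."):
            if not part.isdigit():
                break
            digits.append(int(part))
        return tuple(digits)

    def needs_reboot():
        return release_tuple(platform.release()) < PATCHED_RELEASE

    print(platform.release(),
          "-> needs patch/reboot" if needs_reboot() else "-> OK")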

 

BNL suggested that the security team might be able to provide certification information for urgent fixes together with the incident announcements.

 

ASGC DB Problems

Two major DB problems were encountered:

-       ATLAS conditions DB: after a long outage of the ATLAS conditions DB, a complete re-instantiation was performed using transportable tablespaces. Thanks to BNL, which acted as the source DB.

-       CASTOR DB: several dedicated phone meetings were held with experts from CERN and ASGC. All CASTOR DBs (apart from the monitoring DB) have been recovered. The DB configuration review resulted in the setup of two DB clusters using ASM, for the CASTOR name server and the other CASTOR services.

 

The lessons learned about the recovery procedure and setup will be part of the upcoming Distributed DB Operations workshop: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=70892

 

ASGC CASTOR Service

Despite the large efforts by ASGC and CERN experts to re-establish the CASTOR service, the experiments still have problems getting a reliable service for data transfers. The ASGC team is working hard to resolve the issues.

 

The existing Service Incident Report from ASGC still needs to be completed. CERN experts offer to provide technical input on the DB and CASTOR side. A higher rate of emergency situations like this is not sustainable for the site teams, or for the CERN teams.

 

Cooling Incident at IN2P3 SIR

Cooling outage at CC-IN2P3

Tuesday 3 November 2009

Author: Marc Hausard

 

Description

Unexpected outage of the cooling service while work was being performed on the heating system.

 

Timeline

15:15  A circuit breaker powers off the building control unit and the water pump.

15:25  An abnormal rise in temperature is noticed.

15:30  Staff power off the worker nodes. Actions are taken to lower the room temperature by extracting warm air.

15:50  Critical services are back. Worker nodes are gradually brought back into production.

16:24  The incident is reported to users through the newsgroup.

19:00  Batch is re-opened.

[….]

Follow up

The incident showed that the time available for reaction is very short in this kind of failure.

It has been agreed to set up an automated mechanism to switch off worker nodes based on temperature rise; a sketch of such a mechanism follows.
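
The sketch below is illustrative only, assuming a hypothetical temperature source and a hypothetical remote power-off command; a real deployment would use the site's own monitoring and power control, and would likely drain the batch system first.

    # Minimal sketch: poll a room-temperature reading and power off worker
    # nodes when a threshold is exceeded. The sensor file, the threshold and
    # the ssh power-off command are all hypothetical placeholders.
    import subprocess
    import time

    THRESHOLD_C = 30.0    # assumed trip point, for illustration only
    POLL_SECONDS = 60

    def read_room_temperature():
        # Hypothetical sensor file; a site would read its own BMS/IPMI source.
        with open("/var/run/room_temp_celsius") as f:
            return float(f.read())

    def shut_down_worker_nodes(nodes):
        for node in nodes:
            subprocess.run(["ssh", node, "poweroff"], check=False)

    def watch(nodes):
        while True:
            if read_room_temperature() > THRESHOLD_C:
                shut_down_worker_nodes(nodes)
                return
            time.sleep(POLL_SECONDS)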

 

 

Miscellaneous Issues

A few other miscellaneous reports were sent during the period:

-       Tape migration problems under investigation at RAL

 

J.Gordon added that the issue was due to a tape drive not writing properly; the issue is now solved.

 

-       FTS 2.2 is being tested by ATLAS (the main reason being checksum support), using a patched version at CERN. Unexpected FTS upgrades to 2.2 took place at BNL, FZK, and NL-T1.

-       dCache upgrades at many sites; so far no major problems.

-       New VOBox version tested by ALICE.

-       Interventions at PIC and BNL are planned with the experiments, affecting part of the site storage.

3.4      Summary and Conclusions

The long-standing problems at ASGC (CASTOR and conditions database) are hopefully resolved now. The conditions DB and CASTOR services have reopened.

 

NL-T1 had difficulties moving a consistent DB version for the LFC to the new DB cluster. A coordinated DB recovery validation exercise will take place on 26th November; all Tier-1 sites are encouraged to participate. Additional documentation on application schema migration is in preparation.

 

The urgent security issue triggered a rapid reaction at CERN and the other WLCG sites.

 

The activity is rising – in some areas above the sustainable level for existing staffing. 

 

 

4.   Technical Forum Update (Slides) – M.Litmaath

 

 

M.Litmaath presented an update on the activity of the WLCG Technical Forum, focusing in particular on the data access efficiency for jobs.

4.1      Data Access Efficiencies

 

ATLAS

User Analysis Test: the users were not able to saturate the system, so HammerCloud was used to top it up. This may be because not enough users tried the system.

HammerCloud tests are ongoing: https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloudDataAccess

 

Results have only been reported for some of the clouds so far, and they are very good: most sites tested show big improvements with FileStager, which pre-stages the files needed by the next job.

Support for FileStager is being fully implemented in both PanDA and Ganga, and sites can choose that behaviour as the default. A sketch of the pre-staging idea follows.
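
The pre-staging idea can be illustrated with a short, self-contained sketch: while the current file is processed, the next one is copied in the background. The stage() and process() functions are placeholders; the real FileStager is part of the ATLAS tooling and would use grid copy commands rather than a local copy.

    # Minimal sketch of the FileStager idea: overlap the copy of the next
    # input file with the processing of the current one.
    import shutil
    import threading

    def stage(src, dst):
        shutil.copy(src, dst)       # stand-in for a grid stage-in command

    def process(path):
        pass                        # stand-in for the actual event processing

    def run(files):
        staged = {}

        def start_stage(i):
            dst = "/tmp/input_%d" % i
            t = threading.Thread(target=stage, args=(files[i], dst))
            t.start()
            staged[i] = (t, dst)

        if files:
            start_stage(0)
        for i in range(len(files)):
            if i + 1 < len(files):
                start_stage(i + 1)  # pre-stage the next file in the background
            thread, dst = staged[i]
            thread.join()           # wait until the current file is local
            process(dst)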

 

CMS

In the October Exercise CMS found that most problems came from remote stage-out.

An intermediate stage-out to the local SE is foreseen as a fallback solution.

 

A.Sciabà presented the ATLAS HammerCloud tests at a CMS PADA meeting. HammerCloud will be adapted to support CRAB, and a Fellow will start working on it in January.

4.2      Technical Forum Progress

A wiki for the various working groups has been created; it is derived from the successful JSPG wiki.

 

It is hosted by A.McNab at the University of Manchester: https://wlcg-tf.hep.ac.uk/wiki/Main_Page

 

It was bootstrapped with two discussion topics:

-       Proper use of the "file" protocol.
This will greatly benefit job efficiency at StoRM sites. RFEs are open for GFAL/lcg-utils, but experiment applications will have to be adjusted to make use of the enhancements (see the sketch after this list).

-       Future of grid storage systems.
This is a longer-term discussion. The TWiki page will have a categorized and ranked list of "top-3" topics from the EGEE'09 presentation and the mailing list discussions: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGTechnicalForum
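
To illustrate the first topic, the sketch below shows the intended behaviour: when the SE returns a file:// TURL, as StoRM can on a shared POSIX filesystem, the job opens the file in place instead of copying it to the worker node first. The TURL and path are purely illustrative.

    # Minimal sketch: open an input directly when it is advertised with the
    # "file" protocol; otherwise a grid copy tool would be needed.
    from urllib.parse import urlparse

    def open_input(turl):
        parsed = urlparse(turl)
        if parsed.scheme == "file":
            return open(parsed.path, "rb")  # direct POSIX access, no copy
        raise NotImplementedError(
            "stage-in needed for protocol " + parsed.scheme)

    # Hypothetical example:
    # data = open_input("file:///storage/atlas/data/file.root")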

 

 

5.   Tier 1 Status of Resource Installation – Roundtable

 

5.1      Status of 2009 and 2010 Installations

I.Bird asked each site about the status of the 2009 installations, which should all be done by now, and about the 2010 plans.

-       CA-TRIUMF:
The 2009 and 2010 CPUs are both purchased and are being installed in the next two weeks. For storage they are waiting for 2 TB drives in the first week of December. Of the 2009 resources they have installed about 850 TB of the 900 TB due.

-       DE-KIT:
The 2009 resources have been installed since May 2009.

-       FR-IN2P3
The 2009 CPU is all installed. Disk storage is partially allocated. Tape is also available when needed.
The 2010 tender for disk is out, with delivery at the end of January. CPU will be ordered in 2010Q1.

-       NL-T1
The CPU is all available; not all of it is installed yet, but it will be within two weeks and will also cover the 2010 pledges. Disk goes into production next week, with the 2009 pledges covered in two weeks. The 2010 disk will be available in January 2010.

-       IT-INFN-CNAF
INFN did not run tenders for the 2009 pledges.
The 2010 tenders are ongoing and will also cover the 2009 pledges. CPU will be in place in 2010Q1; storage in 2010Q2.

-       NDGF
The 2009 CPU is installed; disk and storage are not all online but are available on demand.
Norway has not yet committed to the 2010 pledges. All the rest will be installed by June 2010.

-       ES-PIC
The 2009 CPU is there. Disk is waiting for 2 TB drives; it will be online in the next two weeks and will also cover 2010.
The 2010 CPU will be online in early 2010.

-       TW-ASGC
The 2009 pledges are delayed until mid-December, storage until the end of December. All should be there by the end of the year.
For 2010 it must still be checked when the procurement process can restart; it has not started yet and usually takes about two months. They should manage to have everything by June.

-       UK-RAL-T1
2009: CPU is in place; 50% of the extra disk is in place, while the other 50% had hardware problems.
2010: the disk tender is under way; delivery in January 2010 will meet the June 2010 deadline. The CPU tender is closing; it should be delivered by February and installed by April.
-       US-BNL-T1
Sent an email to I.Bird.
Status 2009:

-       CPU [kHS-06]              33.30

-       Disk [TB]                       3,800  (usable capacity as seen by ATLAS)

-       Tape [TB]                      3,000

Plan for 2010

-       CPU [kHS-06]              50.00 (available to ATLAS by June 2010)

-       Disk [TB]                       5,800    by January 2010

-                                              8,000    by September 2010

-       Tape [TB]                      4,000    (media only, additional SL8500/10k slot library already in place)

-       US-FNAL-CMS
The 2009 pledges have all been in place for a few months. The CMS tape requirements are above the pledge but will be covered (25% more).
For 2010, one third of the resources are already purchased; all will be there by 2010Q2.

-       CERN
The 2009 CPU is on site and needs to be allocated; the delay is due to security. The disk servers had delivery problems because of incompatibilities between the power supply and a motherboard upgrade. The 2009 disk will be there only by the end of the year.
The 2010 order (9 TB disk and CPU) is out, for delivery in January and commissioning in March.

5.2      C-RSG Scrutiny Group Requests

The RSG recommended that the 2010 pledges should be there by June and not July.

 

Their request for information by March needs a reply; I.Bird will prepare it.

 

 

6.   Experiment Round Table

-       Status of resource ramp-up planning profiles (i.e. the Experiments had previously stated that they would provide quarterly profiles).

 

The profiles of requirements had been requested by the Sites.

 

I.Bird asked whether these profiles are still necessary.

F.Hernandez replied that it is of great help for the Sites to know the needs in advance, towards June.

D.Barberis noted that the Sites have their schedules already in place and the Experiments know them.

 

I.Bird noted that the profile is useful for the deployment, in particular for disks.

D.Barberis replied that it is probably easier if the VO deals directly with the Sites.

 

J.Gordon added that, especially for disk, Sites would like to have the (minimum) requirements for each quarter of the year.

 

K.Bos agreed that a profile could be distributed; it will be discussed with J.Gordon and F.Hernandez, representing the Tier-1 sites that asked for the profile. The other Sites did not intervene and seem fine with the current information about the Experiments' requirements.

 

 

7.    AOB

 

 

 

No AOB.

 

 

8.    Summary of New Actions

 

 

 

No new actions for the MB.