LCG Management Board

Date/Time

Tuesday 18 August 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62554

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 22.8.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, D.Britton, Ph.Charpentier, D.Duellmann, M.Ernst, I.Fisk, D.Foster, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, M.Lamanna, P.Mato, G.Merino, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers (chair), O.Smirnova, R.Tafirout

Invited

J.Andreeva, L.Perini  

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 1 September 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

1.2      ALICE Requirements on SL5 and CREAM Deployment (Slides) – Y.Schutz

ALICE requested that all WLCG Sites provide SL5 Worker Nodes and the CREAM CE as soon as possible.

 

SL(C)5

ALICE has, since last year, expressed its wish to have the WNs on SL5 as soon as possible.

All ALICE Sites should have migrated to SL5 by mid-September 2009.

 

Sites that have not completed the migration will not be able to participate in the production until the migration is complete. ALICE does not wish to maintain double support at the sites (both SL5 and SL4), nor hybrid setups (e.g. WNs on SL5 and the VOBOX on SL4).

 

VOBOX

Therefore the same deadline applies also to the VOBOXES. The migration of the gLite-VOBOX service to SL5 is being carried out by both the IT-GD and IT-GS developers.

 

CREAM-CE

ALICE has expressed its interest in having the CREAM-CE at all sites (in parallel with the LCG-CE) before real data taking, and would like the CREAM-CE system in place, in parallel with the LCG-CE, at ALL sites by November 2009.

 

The current deployment (CREAM-CE, plus a VOBOX with CREAM clients and a GridFTP server) is not evolving; the sites providing it are:

CERN, RAL, Torino, CNAF, Legnaro, SPBu, KISTI, Kolkata, FZK, IHEP, Prague, Subatech, SARA

This is the same situation as at CHEP09.

 

ALICE would like to know from the Sites by when they expect to meet the agreed deadlines.

 

J.Gordon noted that the request is in reality about the resources allocated to ALICE at the Sites.

 

M.Schulz added that the VOBOX on SL5 is currently being clarified, in terms of the packages to include, and will be defined by the end of the week; testing will start right after. The SL5 WNs, instead, are ready for deployment.

 

A.Heiss added that FZK will move all WNs for ALICE to SL5. For the VOBOXES they are not sure they will manage by September, as it depends on hardware arrival and on the availability of the people responsible.

 

Ph.Charpentier asked that the Sites give a new name to the CEs pointing to SL5 batch queues: in DIRAC each CE refers to a specific platform, and the platform should be properly advertised in the BDII.
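
As an illustration of the kind of check this implies, the sketch below lists the operating system each GLUE 1.3 SubCluster advertises in a top-level BDII. This is a minimal sketch, assuming the python-ldap package and the CERN top-level BDII endpoint and base DN conventional at the time; none of these specifics come from the minutes themselves.

    # Minimal sketch: list the OS name/release advertised by each GLUE 1.3
    # SubCluster, i.e. the information a platform-aware broker such as
    # DIRAC relies on. Endpoint and base DN are assumptions.
    import ldap

    BDII_URL = "ldap://lcg-bdii.cern.ch:2170"  # assumed top-level BDII
    BASE_DN = "mds-vo-name=local,o=grid"       # conventional GLUE base DN

    conn = ldap.initialize(BDII_URL)
    entries = conn.search_s(
        BASE_DN,
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueSubCluster)",
        ["GlueSubClusterUniqueID",
         "GlueHostOperatingSystemName",
         "GlueHostOperatingSystemRelease"],
    )

    def first(attrs, name):
        # GLUE attributes are multi-valued; take the first value as text.
        return attrs.get(name, [b"?"])[0].decode()

    for dn, attrs in entries:
        print(first(attrs, "GlueSubClusterUniqueID"),
              first(attrs, "GlueHostOperatingSystemName"),
              first(attrs, "GlueHostOperatingSystemRelease"))

A site that renames its SL5 CEs but keeps publishing SL4 operating-system attributes would show up immediately in such a listing.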

 

J.Gordon noted that WLCG had asked that a number of Sites run the CREAM-CE, and that target was achieved; but maybe not enough of them support ALICE.

 

Ph.Charpentier asked what happens now with the request. Who is going to follow it up?

Y.Schutz asked for a report at the GDB about the progress of these requests.

J.Gordon replied that the request had already been made and will also be reported at the GDB. The MB endorses the request, but the VOs should follow its progress.

M.Schulz added that a global target can be added to the milestones for the Sites.

 

 

 

2.   Action List Review (List of actions)

 

 

The MB Action List was not reviewed this week.

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Tickets and Alarms

There were three alarm tickets:

-       From CMS to CERN on 12 August, when an ETICS virtual machine flooded the network with DHCP requests (not reproducible), overloading a network switch and then a router in front of castorcms and castoralice. The switch was stopped overnight.

-       From ATLAS to DE-KIT on 9 August (when MC disk space filled up), sent exceptionally to keep MC Production running. 2 TB were added (with thanks from ATLAS). KIT, however, think this did not warrant an alarm ticket – experiment production space should be well planned – and ATLAS agree, but at the time they did not know the status of some obsolete-data deletion services at FZK.

-       From ATLAS to RAL on 7 August for hanging LFC connections. The front-end servers were rapidly restarted, fixing the problem.

 

Incidents leading to service incident reports:

-       RAL air conditioning (chiller) failure from 12-17 August. Draft SIR available (see later report).

-       Fibre cut between Madrid and Geneva – LHCOPN primary and secondary down. SIR requested.

 

VO        User   Team   Alarm   Total

ALICE        4      0       0       4

ATLAS       10     32       2      44

CMS          8      1       1      10

LHCb         0     19       0      19

Totals      22     52       3      77

 

Slide 4 shows the VO Availability results and one can notice:

-       NIKHEF's scheduled move to its new computer centre took place from the 10th to the 14th; normal servers were back up on the 14th. This week disk servers and worker nodes are being moved.

-       The earlier, but still ongoing, ALICE downtime at SARA was due to a failed ALICE VObox there. There is a separate ALICE VObox at NIKHEF – it is not clear if/how the two are coupled.

-       RAL had air conditioning stoppages, due to water chiller failures, starting on the 12th and continuing until Monday 17th.

 

On the RAL Air Conditioning Stoppage the following report was sent. The SIR from the LHC-OPN is still expected.

 

The RAL Tier1 (RAL-LCG2) carried out an emergency power-down following an air conditioning failure during the night of Tuesday-Wednesday 11-12 August; this was the second event in two days. All batch and CASTOR services had to be halted (and remained down); other critical services such as RGMA, the LFC and FTS remained up the whole time. On Thursday 13th we also suffered a water leak (condensation) onto our main CASTOR tape robot.

 

Status as of 14:00 today is that all services are now back up and the downtime was ended in the GOCDB by 10:00 this morning. The over-pressure sensor that took down the Tier-1 has been re-configured to provide an alarm only, but it is not yet completely clear whether there actually was an over-pressure (in the chilled water system) and, if so, what caused it. We are actively seeking answers to these questions but have to work with the contractors and within the warranty constraints. Our best assessment is that there is a 5-10% chance of a recurrence.

 

On restart we lost one D1T0 disk array, containing 99,000 files, from an ATLAS MCDISK pool. The list is being passed to ATLAS.

 

A rough draft of an SIR is on the GridPP website at http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090810

 

The water leak onto the tape robot was traced to an overflowing drip tray on an air conditioning unit in the upper floors of the building. The water had made its way some distance and had leaked through a crack in the ceiling into the machine room. The drip tray has been replaced with a larger one and the reasons for the overflow are being investigated; an inventory of all water sources is being made. Some damaged electronics have been replaced, and various tapes and drive heads have been examined; we believe they suffered only superficial splash marking. We will monitor the drive error rate for any increase, since we believe the leak had been ongoing for some time.

 

3.2      Miscellaneous Reports

CERN

CERN responded to the Red Hat 4 and 5 zero-day kernel exploit, reachable via loaded modules, that was exposed on the morning of Friday 14th August. The workaround of modifying /etc/modprobe.conf was validated as good enough for the weekend to protect lxplus/lxbatch. It was rapidly propagated, but nodes with a suspect module already loaded (mostly bluetooth was found) needed a reboot – one lxplus node (of 50) and 150 lxbatch nodes (of 1000).
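
For reference, a sketch of the style of workaround involved (the exact module list applied at CERN is not recorded in these minutes, so the entries below are illustrative): each rarely used protocol module through which the flaw could be reached is mapped to /bin/true in /etc/modprobe.conf, so that it can no longer be auto-loaded.

    # Illustrative /etc/modprobe.conf additions (assumed list, not CERN's):
    # prevent auto-loading of unneeded protocol modules. Modules already
    # loaded (e.g. bluetooth) cannot be blocked this way, hence the reboots.
    install bluetooth /bin/true
    install appletalk /bin/true
    install ipx /bin/true
    install irda /bin/true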

 

The experiment-info list was informed of the reboots, with apologies for any losses, and also of the later decision to reboot the lxbuild machines. The Linux-for-controls list was also advised, so as to include WAN-accessible DAQ and accelerator services. The CERN actions were minuted in the daily report.

 

LHC-OPN

The fibre from Madrid to Geneva was cut at about 12:30 on Tuesday 4th August due to public construction work, taking out both the primary and secondary OPN links. The first indication was a GGUS ticket from an ATLAS user at 17:04 on 4 August reporting file transfer errors, showing in the dashboard DDM web page for PIC_DATADISK, where PIC thought the problems lay with FZK and INFN. The reply from the OPN was then confusing, as it referred to previous non-OPN routing problems between PIC and FZK and CNAF. Did PIC know the OPN was down? Probably not. And why not?

 

This ticket was superseded by an ATLAS team ticket at 20:00. The reply on 5 August did mention that the OPN was down and referred to a GGUS LHCOPN ticket; these cross-references are part of the agreed procedure. The Tier-1s were confused between the issue of the routing when the OPN is down and the OPN failure itself, and the tickets reflect this.

 

The main concern is why there was no GGUS LHCOPN ticket earlier, given that the break happened at 12:30 on the 4th, as this would have saved the Experiments and Sites a lot of time. We cannot see any dashboard-style monitoring of the OPN available to us – some of the monitoring is password-protected.

 

Total downtime was 26 hours and a SIR has been requested.

 

DE-KIT

Several FZK disk servers (for ALICE and LHCb) were hit over several days by a BIOS bug that falsely detected overheating and shut the machines down.

 

In addition, one of the three FZK tape robots broke down several times from 8 August, leaving the tape service intermittently degraded; it was finally fixed on 13 August.

 

WLCG is now formally requesting all sites to upgrade their worker nodes to SL(C)5, following ATLAS' confirmation.

3.3      Summary

In summary the main issues of the last two weeks were:

-       Should space exhaustion warrant an alarm ticket to save production time?

-       This month’s computer centre infrastructure failure was air conditioning at RAL.

-       LHCOPN monitoring/alarms/information flow for the sites and experiments needs improvement.

-       Serious zero-day exploit on SL4 and 5 diverted a large amount of expertise.

-       Universal migration to SL(C)5 worker nodes requested.

 

D.Barberis thanked FZK for adding disk upon ATLAS' request. A disk space request should not be an alarm, but the persons on shift were not experts and that was the only solution they found. The reason for the lack of disk space was that the deletion scripts did not work.

 

J.Gordon added that installations of Red Hat 5 that do not run SELinux may have to install, as critical, additional security updates that would otherwise not be needed. If ATLAS allowed the use of SELinux it would help the Sites.

D.Barberis replied that the only reasons are the error-code libraries.

D.Duellmann added that the problem is due to the Oracle client, and the new Oracle client 11 still has to be validated in the Applications Area.

 

Ph.Charpentier added that the AF stated the request, and that it is independent of which Experiment needs it: it should apply to all Sites, for all VOs.

J.Gordon noted that if ATLAS requires something, a Site that does not support ATLAS could avoid applying the patch.

P.Mato noted that other VOs may also use the same Oracle client (e.g. LHCb), so a Site should not look only at the specific VO's request.

J.Shiers suggested that the issue be clarified and reported to the MB.

 

D.Barberis highlighted the fact that even if a new release is deployed, the old ones will still be required for 12 months (i.e. until the end of 2010).

J.Gordon noted that in this way old security issues will remain open until all Sites can enable the SELinux extensions.

 

 

4.   Update on the EU Proposals (Slides) – J.Shiers

 

 

Additional Material:

-       Wiki

-       "P0" SSC preparation page (Indico);

-       EGI SA1 Draft - as an example;

-       HEP brief input to EGI editorial board (end July);

-       MB Agenda July 7th (previous update);

-       P0 SSC explanatory mail;

-       SA4 outline;

-       Agenda of HEP SSC call August 17;

-       Draft WLCG SA4 task descriptions

 

J.Shiers presented the status of the EU proposals and asked for input from the Experiments. There were references to several documents and materials (linked above), but the situation is summarized in the attached Slides.

 

The main repository of information is the HEP SSC wiki: https://twiki.cern.ch/twiki/bin/view/LCG/HEPSSCPreparationWiki

From this page one can find pointers to the Indico category, the mailing list and all documents and presentations (check uploaded material on the wiki page).

4.1      Overview

The previous update to MB was made on July 7th (link).

Since that date several events have happened:

-       The call has officially opened (July 30)

-       HEP has provided input to the EGI editorial board on its outline plans (July 31)

-       A “HEP” member of the editorial board has been proposed (J.Shiers)

-       Timetables for preparing the relevant proposals in response to 1.2.1.1 (EGI) & 1.2.1.2 (Service Deployment) + 1.2.3 (Virtual Research Communities) have been prepared

 

-       Laura Perini is coordinating the preparation of the single proposal to cover 1.2.1.1 & 1.2.1.2

-       Cal Loomis is coordinating the “P0” proposal (see e-mail link ) for an SSC against 1.2.3

4.2      Manpower Estimates

The table below was used in the response to the editorial board. The funding from the EU will be 50% of the resources below (except for row 11, which already gives the EU-funded totals).

 

Key   Activity                                       Effort 1.2.1.2   Effort 1.2.3

1     Operations and user support liaison            1

2     Middleware liaison                             1

3     Services for heavy users (“core”, i.e. FTS)    6

4     Additional VO-specific Services                10

5     Dashboard, Ganga, Diane, AMGA                  4 + 2 + 1 + 1

6     WLCG “EIS”                                                      8

7     Analysis tasks for WLCG                                         2

8     Grid technology outside LHC                                     4 + 2 + 4 + (4 to 8)

9     International communities                                       1.5

10    TOTAL FTE                                      26 (+4?)         ~21.5 – 25.5

11    TOTAL EU Funded                                13 (+2?)         11 – 13

 

Following yesterday’s call there are the following changes:

-       Move 1 and 2 (liaison functions) to SSC

-       Diane and AMGA: combination of other communities (e.g. AMGA) and / or SSC

 

The remaining estimated effort (6 + 10 + 4 + 2) is “about right” but needs further (“final”) discussion plus justification.

-       22 FTE at 50% EU funding over 3 years = 11 x 3 x EUR100K = EUR3.3M

-       1.2.1.2 total is expected to be capped at EUR5M (out of EUR25M total for 1.2.1.1 & 1.2.1.2)

 

Ph.Charpentier asked about the resources for Diane and AMGA.

J.Shiers replied that they have been removed from the table above and will be covered by other sciences if those want them. The total of the table is thus 22, and not 26 as it was before the changes.

4.3      Draft “EGI Proper” Proposal

“EGI Proper”, coordinated by L.Perini, will cover 1.2.1.1 & 1.2.1.2. It is a working title and will change to a meaningful acronym.

 

From the Activity SA4 (“Services for Communities of Heavy Grid Users”) document (link), the Communities identified as Heavy User Communities (HUCs) are:

-       High Energy Physics (HEP)

-       Life Sciences (LS)

-       Astronomy and Astrophysics (AA)

-       Computational Chemistry and Material Sciences (CMST)

-       Earth Sciences (ES)

 

The tasks currently foreseen in the SA draft are as follows (all prefixed by TSA4., e.g. TSA4.1 for management):

-       Management

-       Hosting of Community Specific Services (TSA4.2)

-       Hosting of VO Specific Services (TSA4.3)

-       Support of Frameworks (TSA4.4)

-       Support of Scientific Gateways

-       Support for interoperability within SA4 services

-       Support for transition to SA4 services

 

Task breakdown (manpower in FTE; N.B. 50% to be requested from the EU):

TSA4.2 – Services: FTS, LFC – Manpower: 6

Provision and operation, by a small number of NGIs, of Core Grid Services (O-N-8) explicitly needed to support this user community, but of potential benefit to other communities. These centres will be experts and will provide an SLA around the hosting of services such as FTS, LFC, Hydra, AMGA and VO-specific services.

TSA4.3 – Services: VO Services – Manpower: 10

Provision and operation, by a small number of NGIs, of Core Grid Services (O-N-8) explicitly needed to support this VO, but of potential benefit to other users. Justified if the VO's users are a relevant fraction of the Grid users and/or use a relevant fraction of the Grid resources, or if the service is foreseen to become of more general interest during this Project.

TSA4.4 – Services: Dashboards, GANGA – Manpower: 4 (Dashboards) + 2 (GANGA)

The frameworks integrate different components and services to perform functions tailored to specific communities or VOs. An example are the VO Dashboards, which large VOs have found very useful for providing a VO view of the infrastructure for their community. Other examples may be GANGA, PhEDEx, DDM and WISDOM.

In the case of the Dashboards this task includes the hosting of the service and the integration and development of VO-specific tests, driven by the particular user community, necessary to verify the correct functioning of the infrastructure for their work. This will also draw on the generic service-monitoring infrastructure and tests maintained by the NGIs. The content is analogous for the other Frameworks and should be described by their respective writers.

TOTAL Manpower: 22

4.4      Next Steps

There is too little information on the VO services needed; input from the VOs is needed urgently.

We should write 3 task descriptions covering the above areas, of about one page each (perhaps more for the VO services):

-       Deadline is Friday 21st August

-       Drafts are attached (pages 3 to 6 in link)

 

The next con-call is Thursday at 16:00; from Thursday's call onwards the focus moves to SSC preparation.

-       SSC deadline: Monday 31st August

 

Propose to use next Tuesday’s MB slot for further SSC discussions, plus probably Thursday too (both at 16:00 UTC+2).

There is still time for further iteration prior to, during and even after EGEE’09 in Barcelona! (HEP SD & SSC sessions).  

 

The current SSC proposals are:

-       EGI SSC P0 (this proposal): High-Energy Physics (HEP), Life Science (LS), Comp. Chem./Material Science (CMST), Complex Systems (CS), and Grid Observatory (GO) (C. Loomis, CNRS, FR)

Deadline: end August 2009 

-       EGI SSC P1: Astron/Astrophysics, Earth Science, Fusion (Claudio Vuerli, INAF, IT)

-       CUE: Dissemination, Training, Outreach to business (Roberto Barbera, INFN, IT)

 

The future timetable is:

-       Thursday 20th Aug: “HEP SSC” con-call @ 16:00

      Iterate on SA4 input to 1.2.1.2

      Start to discuss plans for P0 SSC for 1.2.3

-       Tuesday 25th Aug: “HEP SSC” con-call in MB slot

      Any feedback from SA4 input to 1.2.1.2? Actions?

      First drafts of sub-contributions to HEP SSC: WLCG; non-LHC HEP; Photon Science; FAIR

 

Further calls and meetings are to be defined, and several sessions are already scheduled at EGEE’09.

 

Ph.Charpentier asked whether the Hosting of Community Specific Services (TSA4.2) and the Hosting of VO Specific Services (TSA4.3) include providing the software being run and the client software needed to use such Services (e.g. FTS, DPM, LFC, etc.). The support of the dashboard Services also needs to be clarified. He also noted that the expected deadline was September, not the end of August.

 

J.Shiers replied that the proposals all together must provide all that is needed, but not everything has to be in a single proposal. Development should not be included, but support and evolution of the needed software can be.

L.Perini added that since the call specifies Service Deployment, it may include the support of the software needed. Further development can be added to the JRA (Joint Research Activities), not to the SA activities.

 

M.Kasemann added that the liaison functions with the Experiments should also be added to the manpower requests.

J.Shiers agreed, but reminded the MB that major changes to the manpower data cannot be made.

 

Ph.Charpentier asked what the Analysis Tasks for WLCG are, and whether the VO services should all be listed (PanDA, DIRAC, PhEDEx, AliEn, etc.).

J.Shiers replied that the Analysis Support seems to require 2 more people. The list of what each VO needs should be provided as its input: each Experiment could specify its requests in a page listing all the software and services expected.

 

 

5.    AOB

 

 

5.1      Visits to the Tier-1 Sites

M.Kasemann proposed organising visits to the Tier-1 Sites. The goal is not to scrutinize the situation but to tune expectations to the current status of the Sites and to meet the people there.

 

J.Shiers replied that WLCG would be interested, and D.Barberis added that ATLAS would also participate and supported the proposal.

Which Sites could be visited should be discussed at the next meeting, in two weeks.

 

 

6.    Summary of New Actions

 

 

 

No new actions.