LCG Management Board

Date/Time

Tuesday 16 September 2008, 16:00-17:00 - Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39171

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 25.9.2008)

Participants

A.Aimar (notes), I.Bird (chair), K.Bos, D.Britton, S.Campana, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, F.Giacomini, A.Heiss, F.Hernandez, M.Kasemann, H.Marten, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 30 September 2008 16:00-18:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

  

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

 

FZK sent a correction to the previous week's minutes concerning the Tier-1 2008 procurement: according to the MoU, DE-KIT has a second milestone in October 2008, mainly to increase the ALICE resources. There are no problems fulfilling this milestone; all hardware is on site, running in test mode, and will be provided on 1 October.

 

L.Dell’Agnello noted that site LFC servers and read-only replicas are not distinguished in the SAM tests: the LFC read-only replicas are considered unavailable if they are tested for writing. He suggested that this information could perhaps be recorded in the GOCDB.

M.Schulz replied that currently this information (whether an LFC is a read-only replica or not) is not available in the IS. The IS is being modified to store this information.
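
As an illustration, a minimal sketch (in Python, using the python-ldap client) of how a SAM-style test could skip write probes on read-only replicas once the IS publishes such a flag. The BDII endpoint and, in particular, the read-only marker queried here (a GlueServiceData key named lfc_mode) are assumptions for illustration only, not the agreed schema change.

    # Sketch: query the top-level BDII for LFC services and check a
    # hypothetical read-only marker, so that write tests can be skipped
    # for read-only replicas. The attribute names below are assumptions.
    import ldap

    BDII_URL = "ldap://lcg-bdii.cern.ch:2170"   # top-level BDII (assumed endpoint)
    BASE_DN = "o=grid"

    def to_str(value):
        # python-ldap returns bytes under Python 3 and str under Python 2
        return value.decode() if isinstance(value, bytes) else value

    conn = ldap.initialize(BDII_URL)
    services = conn.search_s(
        BASE_DN, ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueService)(GlueServiceType=lcg-file-catalog))",
        ["GlueServiceEndpoint"])

    for dn, attrs in services:
        endpoint = to_str(attrs.get("GlueServiceEndpoint", [b"?"])[0])
        # Hypothetical: an lfc_mode key published alongside the service entry
        data = conn.search_s(
            dn, ldap.SCOPE_ONELEVEL,
            "(&(objectClass=GlueServiceData)(GlueServiceDataKey=lfc_mode))",
            ["GlueServiceDataValue"])
        mode = to_str(data[0][1]["GlueServiceDataValue"][0]) if data else "read-write"
        if mode == "read-only":
            print("Skipping write test for read-only replica: %s" % endpoint)
        else:
            print("Scheduling write test for: %s" % endpoint)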

 

H.Renshall asked FNAL for an update on their 2008 resources.

I.Fisk replied that the 2008 pledges are complete for both CPU and disk. 

1.2      QR Preparation (June-September)

Proposed calendar: ALICE: 23/9, CMS: 30/9, LHCb: 7/10, ATLAS 14/10.

 

M.Kasemann and Ph.Charpentier agreed on the calendar.

Y.Schutz was not present at the meeting but later also agreed to 30/9 (the 23/9 slot has since been cancelled).

 

New Action:

Experiments’ QR presentations at the MB.

Agreed calendar: ALICE: 30/9, CMS: 30/9, LHCb: 7/10, ATLAS 14/10.

1.3      2009 Procurement (HLM_20080909)

FNAL: On target for all installations in place by April 2009. Will verify that the disk order is in place.

TRIUMF: The tender will be released in early October; the material will be received by February 2009. Therefore April 2009 seems reasonable.

IN2P3: Installed a second tape library that will provide 10 PB of tape storage. Received 4 PB of disk to complete the 2008 pledges and half of 2009 pledges. Launching a framework tender for CPU purchasing and will receive 30-40% of the 2009 pledges by December. For budget reasons, the rest can only be ordered in early 2009 and be in place by May.

FZK: Expect no delays for CPU, disk and tape. All pledges will be installed by April 1st 2009. The disk tender is finished and orders are being placed. The tape infrastructure is sufficient and future upgrades are being studied. Details are in an email sent to the MB mailing list.

CNAF: The tape library has 10 PB and the tapes for 2009 are already purchased. The disk and CPU tenders will be sent by end of September and the material will be delivered by end of February, so it will probably be only partially installed by 1 April.

NL-T1: Power and cooling issues could cause problems. Both SARA and NIKHEF are in the process of acquiring additional space, power and cooling. The NIKHEF addition is scheduled to be completed by April and is being handled by an external company. The tender process is being re-evaluated and new tenders will be sent in October.

NDGF: CPU pledges will be fulfilled earlier. Disk and tape resources will also likely be ready by April.

ASGC: Funding for 2009 will only be available in 2009 and therefore purchases could be delayed. They will send information after the meeting.

RAL: The disk tender is being completed; the material will arrive by January and be installed by April. The CPU tender is a month later and the material will arrive by end of January. The second tape robot will also arrive in time. The new building at RAL should be delivered by December; if it is late, it could delay the installations at RAL. 

BNL: The tender is ready but exact dates for spending are still being discussed in detail.

CERN: The tape robot expansion has been ordered. For CPU and disk the procurement is split into two parts:

-       First tenders: the CPU orders are ready; delivery is expected by November and installation by early 2009. The disk servers will be delivered in December and in service by end of February.

-       Second part of the tenders: responses will be evaluated and approved in December; delivery is expected in February and the resources will be ready by April 2009.

 

 

2.   Action List Review (List of actions) 
 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm this.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

No news for this week about LCAS and SCAS.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

ATLAS will report on the status of the tests. No news at the F2F meeting.

 

S.Campana reported that job priorities have been deployed in Milan but not in Naples yet. The instructions written after the PPS experience worked well.

The open issues are:

-       what to do for sites using LSF (e.g. Rome)?

-       in Italy they tested the gLite release “repackaged” by INFN-Grid; it would be good to test it on other sites (NL-T1 and Edinburgh will do it).

 

S.Campana will distribute the instructions on job priorities to the test sites.

 

3.   LCG Operations Weekly Report (Slides) - J.Shiers  

 

Summary of status and progress of the LCG Operations. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Sites Reports

The Oracle problems were solved, but the report is not sufficiently clear. Reports on corrections must be detailed in order to be useful in the future and to other sites.

3.2      First Beam Day

The pertinent question now is: were we ready? It is clear that some of the services (even key ones) still need further hardening, and more besides. This is also true for some procedures, e.g. alarm mail handling, which were put in place too late to be fully debugged during CCRC’08.

3.3      Service Load Problems

Service load problems have been seen on a number of occasions, including on the ATLAS conditions DB at various sites, as well as on the ATLAS CASTOR instance at CERN (a post-mortem, including discussion of the GGUS alarm ticket follow-up, is available). The timings for the CASTOR incident were:

-       10’ for CASTOR expert to be called

-       10’ for intervention to start

-       18’ for problem to be identified

-       < 3 hours total from start of problem to confirmation of resolution

 

These problems are likely to persist for at least weeks (months?). We should understand what usage patterns cause them, as well as why they were not included in the CCRC’08 tests. Why were these usage patterns not tested, or why have they changed since?

 

As the post mortem below shows, the reaction time was very satisfactory.

 

          18:10 - problem started

          19:34 - GGUS ALARM TICKET submitted by ATLAS shifter:

          19:35 - mail received by CERN Computer Centre operator

          From: GGUS [mailto:helpdesk@ggus.org] Sent: Wednesday 10 September 2008 19:35 To: atlas-operator-alarm; support@ggus.org Subject: GGUS-Ticket-ID: #40726 ALARM CH-CERN Problems exporting ATLAS data from CASTOR@CERN

          19:45 - CASTOR expert called

          19:55 - CASTOR expert starts investigating

          20:13 - CASTOR expert identifies that the problem is due to a hotspot. The resolution is applied (see below) and ATLAS informed.

          20:47 - The CASTOR expert re-enabled the disk server after confirming that the requests were better load-balanced over all servers in the pool.

          20:57 - ATLAS confirms that situation is back to normal
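
As a cross-check, a small sketch (in Python) that reproduces the reaction-time figures quoted earlier directly from the timestamps in this post-mortem; the variable names are labels chosen here for illustration.

    # Reproduce the reaction-time figures from the post-mortem timestamps above.
    from datetime import datetime

    def t(hhmm):
        return datetime.strptime("2008-09-10 " + hhmm, "%Y-%m-%d %H:%M")

    problem_started      = t("18:10")
    alarm_submitted      = t("19:34")
    expert_called        = t("19:45")
    expert_starts        = t("19:55")
    cause_identified     = t("20:13")
    resolution_confirmed = t("20:57")

    print("Expert called after:        %s" % (expert_called - alarm_submitted))        # ~10'
    print("Intervention started after: %s" % (expert_starts - expert_called))          # ~10'
    print("Problem identified after:   %s" % (cause_identified - expert_starts))       # ~18'
    print("Total, start to resolution: %s" % (resolution_confirmed - problem_started)) # < 3 hours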

3.4      Network Problems

BNL often reports network-related problems. The primary OPN link failed on Thursday night when a fibre bundle was cut on Long Island; a manual failover to the secondary link was performed. Such failovers need to be automated, and the (relatively) high rate of problems seen with this link needs continued follow-up.

 

M.Ernst added that BNL, CERN and ESNET are looking for a clear collaborative solution to the issues.

I.Bird asked why the automatic fail-over did not work properly.

M.Ernst replied that the current solution requires communication between the sites and the network infrastructure. A different, more automated solution is being studied.

 

A network problem at CERN on Monday caused a 3.5-hour degradation affecting CASTOR.

T.Cass added that the causes are being studied with the network support at CERN.

3.5      Database Service Enhancements

Support (on a best-effort basis) for the CMS and LHCb online databases was added to the service team's responsibilities.

 

Oracle Data Guard stand-by databases are now in production for all the LHC experiments' production databases, using hardware that goes out of warranty by the end of the year (a small verification sketch follows the list below). They provide additional protection against:

-       human errors 

-       disaster recoveries 

-       security attacks  
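
As an illustration, a minimal monitoring sketch (Python with the cx_Oracle client) of how the role and protection mode of such a stand-by database can be verified from the v$database view. The host, SID and account used here are placeholders, not the actual CERN database names.

    # Sketch: confirm that a Data Guard stand-by database is running in the
    # expected role and protection mode. Connection details are placeholders.
    import cx_Oracle

    dsn = cx_Oracle.makedsn("standby-db.example.cern.ch", 1521, "ATLR_STB")  # hypothetical host/SID
    conn = cx_Oracle.connect("monitoring", "secret", dsn)                    # hypothetical account

    cursor = conn.cursor()
    cursor.execute("SELECT database_role, protection_mode FROM v$database")
    role, mode = cursor.fetchone()
    print("role=%s protection_mode=%s" % (role, mode))

    # A physical stand-by protecting the primary against human errors,
    # disasters and security incidents would report PHYSICAL STANDBY here.
    if role != "PHYSICAL STANDBY":
        raise RuntimeError("database is not running as a stand-by: %s" % role)

    cursor.close()
    conn.close()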

3.6      WLCG at EGEE08

The idea is to have a panel / discussion session with 3 main themes:

-       Lessons learned from the formal CCRC'08 exercise and from production activities

-       Immediate needs and short-term goals: the LHC start-up, first data taking and (re-)processing of 2008 data;

-       Preparation for 2009, including the CCRC'09 planning workshop.

In each case the topic will be introduced with a few issues, followed by a wider discussion also involving people from the floor. We are not looking for ‘official statements’; the opinions and experience of all are valid and important.

These panels have worked well at previous events (WLCG workshops, GridPP, INFN etc.) and do not require extensive preparation. It is probably useful to write down a few key points / issues in a slide or two (not a formal presentation!)

It is also an opportunity to focus on some of the important issues that may not have been fully discussed in previous events.

3.7      Post Mortem Reports

The participants in the Operations meeting are now preparing timely and detailed post-mortems. But what happens next? For example, both the CASTOR/ATLAS and CASTOR/RAL post-mortems propose actions and other follow-up. Without inventing a complex procedure, how do we ensure that this happens? When should items go to the MB action list?

 

I.Bird replied that the MB should be involved only when the parties involved do not follow up and solve the issue; items should not be inserted in the MB action list from the beginning.

J.Templon added that a Savannah tracker could be used to follow the open issues of the daily Operations meeting (as is done at the TMB).

3.8      Security Issues

Security at the WLCG sites was discussed briefly at the MB.

The MB concluded by urging all Sites and Experiments to give maximum attention to the security initiatives that are being discussed outside the MB, through the appropriate channels.

 

4.   GDB Summary (Report) - J.Gordon

 

 

The MB did not discuss the report below, but the actions mentioned should be followed up at the next meetings.

 

From J.Gordon, chairman of the GDB:

 

1.  Storage – we need to check whether the tiered model of support is working for storage. Local and regional support is working in some places but do all sites have reasonably local support for their storage solution of choice?

The observed dCache problems would be helped if the dCache T1s attended the weekly phone conferences more often, especially when they have issues. As well as this weekly meeting, there is a need for regular planning meetings to consider the longer-term issues. The CASTOR sites have regular meetings, but their DBAs would also benefit from meeting to compare experiences.

 

2. WN Client Installation. There was very little support for the proposal. The issue of support for multiple versions was agreed to be a responsibility of the VOs. CERN GD will work with the experiments to consider whether a standard solution could meet their requirements.

 

3. Other Middleware. There are several changes in the pipeline (SCAS, SL5, CREAM, etc) but there was no consensus on how/when they would be deployed – during shutdown? All sites?  

 

4. 24x7. Most sites expect to have scheduled downtimes and expect that these would be outside of beam time. A means of proposing and advertising T1 downtimes is needed. Experiments already populate calendars with downtime alerts from GOCDB for all their sites. Will this be sufficient, or should something similar be deployed just for T1s?

(Ian, were there any other questions/actions from this session? Sites who did not report, perhaps?)

 

5. Benchmarking. A group has been formed, chaired by Gonzalo Merino, which will consider how the agreed benchmark can be introduced for WLCG.

 

6. I will try to agree themes for GDBs longer in advance so that appropriate OSG people may attend.

 

 

5.   Update on End User Analysis (Wiki_page) - B.Panzer

 

B.Panzer organized meetings and prepared a proposal with a working group on end user analysis at CERN.

CERN IT will provide a CASTOR instance from 1 October for user analysis. It will provide a pool of 100 TB for ATLAS and CMS. The exact functionality will be announced later in the week.

 

A.Pace added that the future new version will prevent users from staging files from tape and will have strong authentication. If this new version cannot be deployed, the current version will be used with only the xrootd interface added.

Ph.Charpentier asked why ALICE and LHCb are not covered; LHCb has its analysis pools in its current CASTOR setup.

B.Panzer replied that this is set up for end-user analysis, providing 1 TB each for a few hundred users.

 

I.Fisk noted that the wiki page does not explain what the Experiments considered highly desirable.

I.Bird replied that the Experiments' requirements belong in the longer-term plan; this is just an immediate, short-term solution at CERN. No accounting or per-user quota will be provided, but access to files is protected.

 

F.Hernandez asked whether the IN2P3 group could be an observer in the group at CERN.

I.Bird replied that the general strategy for T1 and T2 Sites needs to be discussed and a group should be started for that. The current group's task was to find an immediate solution at CERN, and its mandate can now be considered completed.

 

K.Bos added that the scenarios should be tested by some representatives of the Experiments.

A.Pace agreed that the participation and testing of the Experiments are necessary.

 

New Action:

Form a working group for User Analysis with a strategy including T1 and T2 Sites.

6.   AOB

 

No meeting next week. The next MB meeting will be on the 30 September 2008.

 

 

7.    Summary of New Actions

 

Form a working group for User Analysis with a strategy including T1 and T2 Sites.

 

Experiments’ QR presentations at the MB.

Agreed calendar: ALICE: 30/9, CMS: 30/9, LHCb: 7/10, ATLAS 14/10.