LCG Management Board

Date/Time

Tuesday 30 September 2008, 16:00-17:00 - Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39173

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 4.10.2008)

Participants

A.Aimar (notes), I.Bird(chair), K.Bos, D.Barberis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, F.Giacomini, A.Heiss, F.Hernandez, M.Kasemann, O.Keeble, M.Lamanna, H.Marten, G.Merino, A.Pace, B.Panzer, Di Qing, Y.Schutz, J.Shiers, R.Tafirout, J.Templon 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 7 October 2008 15:00-17:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

1.2    QR Preparation (Previous QR)

A.Aimar reminded the MB that by mid-October the QR report (June-September 2008) has to be sent to the Overview Board, which is meeting on 27 October 2008.

 

2.   Action List Review (List of actions) 
 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

No news for this week about LCAS and SCAS.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

ATLAS will report on the status of the tests; there was no news at the F2F meeting.

The open issues are:

-       What to do for sites using LSF (e.g. Rome)?

-       In Italy the gLite release “repackaged” by INFN-Grid was tested. It needs to be tested at other sites (NL-T1 and Edinburgh will do it).

  • Converting the Experiments’ requirements and the Sites’ pledges to the new CPU units.

 

M.Kasemann asked whether a committee had been nominated at the GDB.

G.Merino replied that the work is just starting.

I.Bird added that the mandate should be verified at the next F2F MB meeting (the following week).
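
As an illustration of the arithmetic involved (a hypothetical sketch: the conversion factor and pledge numbers below are placeholders, since the working group had not yet defined the new unit or its relation to the old one):

    # Hypothetical sketch of converting CPU pledges to new units.
    # FACTOR is a placeholder, not a value agreed by the working group.
    FACTOR = 4.0  # assumed: new units per old unit

    # Made-up example pledges expressed in the old unit.
    pledges_old = {"Site-A": 1000.0, "Site-B": 2500.0}

    pledges_new = {site: value * FACTOR for site, value in pledges_old.items()}
    print(pledges_new)  # {'Site-A': 4000.0, 'Site-B': 10000.0}

Once the committee fixes the benchmark and the factor, the same scaling applies to both the Experiments’ requirements and the Sites’ pledges.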

  • Agree on the software distribution and update at the Tier-1 Sites.

O.Keeble explained that the proposal for the Experiments and Sites needs to be reformulated.

 

New Action:

13 Oct 2008 - O.Keeble will send an updated proposal for software distribution and update at the Tier-1 Sites.

  • Form a working group for User Analysis for a strategy including T1 and T2 Sites.

I.Bird proposed that this be discussed at the F2F meeting the following week. Sites and Experiments can send their comments before then.

  • 23 Sept 2008 - Prepare a working group to evaluate possible SAM tests verifying the MoU requirements.

Not done.

 

3.   LCG Operations Weekly Report (Slides) - J.Shiers  

 

Summary of the status and progress of the LCG Operations. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

This report covers the last 2 weeks; there was no MB meeting on the 23rd.

3.1      Summary of Week 1

Several relevant service problems occurred in the first week:

-       High load from the ATLAS conditions DB was seen at several Tier-1 sites – technical discussions were held, a plan for resolution is still pending; follow-up on the cursor-sharing bug. There is a task force including ATLAS and IT-DM.

-       Some cases of information being hard-wired into experiment code (in both cases, CEs).

-       Reminder that problems raised at the daily operations meeting should have an associated GGUS ticket/elog entry.

3.2      Summary of Week 2

The week was overshadowed by news bulletins (from the DG) about the LHC delays. The clear message at the WLCG session at EGEE’08 (Neil Geddes et al.) was to continue with the service as it is now. But some pending items can now be planned and scheduled in a non-disruptive way, e.g. the migration of the FTS services at the Tier-0 and Tier-1s to SL(C)4.

 

The need for a more formal contact between WLCG service and LHC operations is apparent:

-       Propose to formalize the existing informal arrangement with Roger Bailey / LHC OPS, building on roles established during the LEP era. E.g. attendance at appropriate LHC OPS meetings with a report to the LCG SCM, plus more timely updates to the daily OPS as needed.

-       R.Bailey was invited (some time ago) to give a talk on the 2009 outlook at the November “CCRC’09” workshop.

3.3      Services Reports

The problems related to CASTOR2 and Oracle continue – this is reminiscent of the problems seen with Oracle 10.2.0.2 and the “cached cursor syndrome”, which had reputedly been fixed.
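
As an aside, a minimal diagnostic sketch of the kind of check involved (an assumption, not the actual procedure used by IT-DM: it presumes the cx_Oracle client and read access to the v$ dynamic performance views, and the credentials/DSN are placeholders). Statements suffering from this cursor-sharing problem typically accumulate very many child cursors:

    # List SQL statements with unusually many child cursors, a typical
    # symptom of the cursor-sharing problems mentioned above.
    # Assumes cx_Oracle and read access to v$sqlarea; DSN is a placeholder.
    import cx_Oracle

    conn = cx_Oracle.connect("monitor", "secret", "dbhost/orcl")
    cur = conn.cursor()
    cur.execute(
        "SELECT sql_id, version_count, executions "
        "FROM v$sqlarea WHERE version_count > :t "
        "ORDER BY version_count DESC",
        t=100,
    )
    for sql_id, versions, execs in cur:
        print(sql_id, versions, execs)
    conn.close()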

 

In the past, we had {management, technical} review boards with major suppliers and representatives from key user communities. Given again the criticality of Oracle services in particular for many WLCG services, should these be re-established on a quarterly or monthly basis? The interim post-mortem from RAL is available here.

 

On-going discussions with(in) ATLAS on conditions DB issues

Reminder – service changes, in particular those that {can, do} affect multiple services and VOs, should be discussed and agreed in the appropriate operations meetings.

This includes both the 3D meetings for purely DB matters and the regular daily and weekly operations meetings for additional service concerns.

-       Some emergency restarts of conditions DBs were reported on Wednesday (BNL, CNAF, ASGC) for a variety of reasons.

-       Network (router) problems affected CMS online on Thursday/Friday, then DNS problems all weekend – fixed on Monday morning.

-       The LFC stream to SARA aborted on Friday night. Some rows at the destination are being fixed – data was changed at the destination but should be read-only!

-       On-going preparation of the LHCb LFC updates for the migration away from the SRM v1 endpoint. One hour of downtime is needed to run the script at CERN and at the Tier-1s.

-       Oracle patches have been installed on the validation DBs and are scheduled on the production systems over the coming weeks.

-       SLC4 services for FTS are now open for experiment use at CERN

3.4      Post Mortem Reports

Three incidents required a post-mortem report:

-       Post-mortem on network-related incident (major incident in a Madrid data-centre) to be prepared.

-       The interim post-mortem on RAL CASTOR + Oracle is available here.

-       On September 7 at CNAF, a CASTOR problem occurred (see the slide notes for details).

CNAF: On September 7 (Sunday) we experienced a complete outage of CASTOR.

 

The problem was fixed on September 8 in the morning.

The downtime was caused by a known Oracle bug (affecting version 10.2.0.3) that causes the Oracle management agent to consume up to 100% CPU time and subsequently hang. Due to the lack of response from the management agent, Oracle starts spawning new agent processes, which degrade in the same way.

 

This caused all 4 CASTOR RAC nodes to hang and prevented any attempt to reboot them, even from the KVM console, hence requiring an on-site intervention.

 

The "culprit" cluster is composed of 4 nodes, each host having 2 dual-core CPUs and 8 GB of RAM. The operating system is Red Hat Enterprise Linux 5, kernel 2.6.18-53.el5.

The Oracle version is 10.2.0.3.

 

The cluster contains 3 databases:

- stager (host1, host2): 2 instances, one of them preferred

- name server (host3, host4): 2 instances, host3 preferred

- dlf (host3, host4): 2 instances, host4 preferred

 

The problem was solved by upgrading the agent to version 10.2.0.4.
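
For illustration only, a hypothetical watchdog of the kind a site could run to catch such a runaway agent before it hangs the node. This is not what CNAF deployed (their fix was the 10.2.0.4 upgrade); the process name "emagent", the threshold and the reliance on Oracle's emctl utility being on PATH are assumptions about a typical 10.2 setup:

    # Hypothetical watchdog: restart the Oracle management agent if its
    # CPU usage stays pegged. Assumes the agent shows up in ps as
    # "emagent" and that the agent's emctl is on PATH.
    import subprocess
    import time

    CPU_THRESHOLD = 95.0  # percent, illustrative
    INTERVAL = 60         # seconds between checks

    def agent_cpu():
        """Return the highest %CPU among emagent processes, or 0.0."""
        out = subprocess.run(["ps", "-eo", "pcpu,comm"],
                             capture_output=True, text=True,
                             check=True).stdout
        usages = []
        for line in out.splitlines()[1:]:
            fields = line.split()
            if len(fields) == 2 and fields[1] == "emagent":
                usages.append(float(fields[0]))
        return max(usages, default=0.0)

    while True:
        if agent_cpu() > CPU_THRESHOLD:
            subprocess.run(["emctl", "stop", "agent"])   # Oracle agent control
            subprocess.run(["emctl", "start", "agent"])
        time.sleep(INTERVAL)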

 

3.5      Experiments Reports

Mostly routine operations from the Experiments – a mix of cosmics and functional tests. Longer-term plans have to be decided, including internal software development schedules etc.

 

Reprocessing tests are continuing – this could be a major goal of an eventual CCRC’09 – the overlap of multiple VOs is important.

 

The planning workshop for the latter is on November 13-14. A draft agenda – to be revised in the light of recent news – is available here. Registration is now open.

 

M.Kasemann noted that, even if the LHC is not running in 2008, the Experiments will continue their tests and development, and the procurement schedules should not be changed.

 

4.   Accelerator schedule delays (Slides) – I.Bird

 

4.1      Consequences of the LHC Delays

The MB should proactively face the new scenario:

-       Avoid/minimize effects on 2009 hw procurement: We need an agreed MB position that avoids delaying procurements

-       Decide which upgrades of software and services that had been postponed, or were not ready, are now possible.

-       Define the need for a concrete CCRC’09 challenge, in particular if many updates are made.

4.2      2009 Procurement

There is not yet an official statement of a revised LHC schedule for 2009, nor a reassessment of the experiments’ requirements based on such changes.

 

BUT:

-       2009 data taking could be at the anticipated levels (or extended?). The amount of 2008 data missed could then be small in comparison and the longer-term implications would be negligible.

-       In the resource ramp-up to 2009 levels, the jump from 2008 to 2009 will still be x2. The risk of problems in procurement/delivery is high and therefore one must not ease the pressure to have resources in place as planned.

-       The Experiments will continue to work with cosmics for calibration etc. All Experiments have stated that their needs will not change considerably.

 

Therefore we need to resist the temptation to allow procurements to be delayed, with the possible exception of tape purchases.

Unless we get a clear message about long delays from CERN management we have to maintain the existing plan.

 

J.Templon agreed with the conclusions (and in any case NL-T1 has to catch up on their 2008 procurement). The problems of ramping up have been shown to be very serious and therefore procurement should not be delayed. In addition, there is going to be more MC data to be produced and stored. On the other hand, all estimations were done in 2005, and even though the LHC is more than 2 years late with respect to the original schedule, the requirements have never been reduced.

 

J.Gordon and I.Fisk agreed that it can be difficult to defend that the requirements are independent of whether LHC data is collected or not. One must provide the funding agencies with clear calculations to support the statement that requirements and timelines for procurement stay basically unchanged.

 

I.Bird agreed with the statements, but said that current procurement should not be changed until a new schedule is clearly available.

 

I.Fisk suggested highlighting that in 2008 one would have had just 45 days of data taking. And even a 100 TB reduction should be reported: it would be a sign that the requirements have been recalculated and changed.

 

M.Kasemann noted that the changes must be known as soon as possible. In addition, the 2009 runs could start earlier and therefore the period missed in 2008 could be caught up in early 2009. Will some information be available by the RRB in November?

 

J.Gordon added that one should highlight the factual changes: how much less data is taken in 2008 and how much more data will be taken in 2009.

 

F.Hernandez and G.Merino noted that for the estimations due by 20 October the Sites can only keep the current pledges.

I.Bird replied that on the 20th the pledges should not change and should assume the current requirements. The Experiments need to do their re-assessment before then.

 

Decision:

I.Bird proposed that while the arguments are prepared for the RRB nothing should be changed. The MB agreed on the proposal.

 

New Action:

20 Oct 2008 - By the last week of October the Experiments should provide their new estimates due to the delay in the LHC operations. They should also clarify the assumptions on data taking made in their calculations.

 

An MB meeting on Wednesday 22 October should discuss this; the 21st is not possible.

4.3      Upgrades

The delay does allow the possibility to finish deployments and upgrades that had been postponed or were not ready, but they should be finished by the end of this year or January 2009. For instance, IN2P3 will use the opportunity for infrastructure work, and CERN has similar plans (discussed later).

 

Concerning middleware and services (see Oliver Keeble’s talk) one should discuss:

-       SL5 migration

-       FTS porting to SL4/SL5

-       Deployment of SCAS/glexec for the multi-user pilot jobs framework

-       Updates of storage software (Castor 2.1.8/SRM 2.8; dCache). Especially Castor deployment for user analysis

-       Assess the situation for a CREAM certification/deployment

 

O.Keeble agreed to send a proposal for what to do on each of the items above, with the constraint that all must be done by January and followed by a real CCRC challenge.

 

New Action:

8 Oct 2008 – O.Keeble sends to the MB a proposal on possible upgrades to middleware because of the LHC delays.

 

Ph.Charpentier and I.Fisk noted that the SLC5 migration needs to be done now, as there will not be many occasions in 2009 with a denser LHC schedule in view.

 

I.Fisk proposed that one goal is to remove all SL3 machines (now still used for FTS and RB). SL3 should be officially deprecated.

 

F.Hernandez noted that migrating the WNs to SL5 cannot be done without taking into account the other VOs, which cannot be asked to move to SL5 just because the WLCG VOs need it. Migrations should happen even during operations, with a mixed environment, migrating gradually over several months.

 

I.Bird agreed but stressed that this is a very good occasion to upgrade as much as possible.

 

Ph.Charpentier asked whether we are talking about migrating to a 64-bit SL5 OS instead of 32-bit.

O.Keeble replied that the OS on the WN is already 64-bit and the 32-bit code is run with 32-bit compatibility libraries.

 

Ph.Charpentier replied that this is true only at CERN, and it should be a goal for many Sites: to have 64-bit OSes with 32-bit libraries available for all VOs that need them.

I.Fisk added that 64-bit applications gain about 30% in performance. Therefore some VOs can benefit from running their applications in 64-bit mode.
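
As a side note, a quick hedged sketch (an illustration, not an official WLCG probe) of how a VO job could verify what a worker node actually provides, since a 64-bit kernel can still be running a 32-bit userland or a 32-bit build of an application:

    # Check a worker node's bitness: report the kernel architecture and
    # the pointer size of the running interpreter. On a 64-bit OS with
    # 32-bit compatibility libraries, a 32-bit build would report 32 here
    # while the kernel still reports x86_64.
    import platform
    import struct

    print("kernel architecture :", platform.machine())        # e.g. x86_64
    print("runtime pointer size:", struct.calcsize("P") * 8)  # 64 or 32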

 

I.Bird proposed that the MB state that the OS on the WNs should be in 64-bit mode at all Sites.

 

O.Keeble noted that the EGEE TMB had refused it, but the LCG could re-propose the statement.

 

 

5.   Service Changes at CERN (Slides) – T.Cass

 

T.Cass presented the changes that could take effect because of the delays.

 

Foreseen Changes - These changes are going to be possible at CERN:

-       Worker nodes migrated to SL(C)5 @ CERN

All new hardware is to be installed with SLC5, with rapid migration of existing nodes once the certification is complete.

-       FTS 2.1 deployment @ CERN

Will test on SLC5, but the priority is the deployment of FTS 2.1, not the migration to the SLC5 platform.

 

Ph.Charpentier asked whether the gcc libraries will be compatible with the SL4 ones.

T.Cass replied that the standard gcc will be installed (gcc 4.1) but the Application Area can provide the gcc libraries (gcc 3.4) they need.

 

O.Keeble added that the middleware (for the WN) will also be compiled with the same gcc version needed by the Experiments (Application Area).

 

CASTOR Issues – The main actions that will take place are:

-       SRM stability – Focus on the deployment of CASTOR SRM 2.7.

Delay the (full) implementation of purgeFromSpace until CASTOR SRM 2.7 is fully in production.

-       @ CERN. Deploy 2.1.8 on all instances

Support for quota on analysis instance from January

Strong authentication enabled, but perhaps only fully implemented on analysis instance

 

Ph.Charpentier asked whether this specific analysis instance is temporary or permanent.

T.Cass replied that there is the issue of how analysis can impact the existing CASTOR instances of the VOs. Separating the instances guarantees the performance of the Tier-0 functions. If possible, in a few years one can evaluate whether to merge the analysis and Tier-0 instances.

 

CERN Infrastructure – There will be several service upgrades at CERN:

-       Aggressive repack of media to a high-density format able to write 1 TB on the same media (3592C, T10KB). This will free tapes and storage space in the future (see the back-of-envelope sketch after this list).

Will require drive dedication to the migration, but can free drives if tape queues build up.

-       Transform all IBM robots to full-length, high-density models

Will require a 1-2 day interruption per robot.

-       Migrate disk and tape server infrastructure to SLC5.

Certainly for new hardware, but the migration of existing servers is disruptive (for D1T1), i.e. it requires a drain and re-install.
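
For a feel for what the repack frees up (a back-of-envelope sketch; the capacities and data volume below are illustrative assumptions, not CERN's real inventory figures):

    # Back-of-envelope: cartridges freed by repacking onto 1 TB media.
    # All numbers are illustrative assumptions.
    OLD_TB = 0.5     # assumed capacity of current cartridges
    NEW_TB = 1.0     # high-density format (1 TB per cartridge)
    DATA_TB = 5000   # assumed volume of data to repack

    before = DATA_TB / OLD_TB   # 10000 cartridges
    after = DATA_TB / NEW_TB    # 5000 cartridges
    print("library slots freed:", int(before - after))  # 5000

Doubling the per-cartridge capacity halves the cartridge count for the same data, which is what frees library slots and storage space for the future.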

 

I.Bird noted that even if the changes are disruptive, it will be more complicated to do them between 2009 and 2010.

 

Ph.Charpentier and M.Kasemann noted that each intervention on the pools should be agreed with the VO concerned well in advance.

T.Cass agreed on the statement.

 

6.   AOB

 

LHCC Review Meeting – I.Bird reported on the Review meeting:

-       The reviewers asked why all the resources in place are not used. For them, all resources should be used or the purchases postponed.

-       The (new) reviewers will need to be informed about the situation outside of the short review meetings. They could be invited to the WLCG Workshop.

 

QR Presentations - The QR presentations will be done quickly at the F2F meeting (5 minutes each, covering the essential points).

 

Next F2F Meeting at 15:00 - The F2F meeting will be at 15:00-17:00 because there is L.Robertson’s farewell drink at 17:00.

 

Pre-GDB Next Week – The GDB will be devoted to the Tier-2 discussions. The Pre-GDB will just be from 13:00 to 15:00 and could cover Tier-1 issues.

 

 

 

7.    Summary of New Actions

 

 

8 Oct 2008 – O.Keeble sends to the MB a proposal on possible upgrades to middleware because of the LHC delays.

 

 

New Action:

13 Oct 2008 - O.Keeble will send an updated proposal for software distribution and update at the Tier-1 Sites.

 

 

New Action:

20 Oct 2008 - By the last week of October the Experiments should provide their new estimates due to the delay in the LHC operations. They should also clarify the assumptions on data taking made in their calculations.