LCG Management Board

Date/Time:

Tuesday 27 March 2007 - 16:00–17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11631

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 29.3.2007)

Participants:

A.Aimar (notes), L.Betev, I.Bird, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Quing, C.Eck, I.Fisk, S.Foffano, D.Foster, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, H.Marten, G.Merino, H.Renshall, L.Robertson (chair), O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 3 April 2007 - 16:00-18:00 – F2F Meeting in Prague

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments received. Minutes approved.

1.2         Matters Arising

No matters arising.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 27 Feb 2007 - H.Renshall agreed to summarize the experiments' work for 2007 in an “Overview Table of the Experiments Activities”.

 

Done. The calendars by experiment and site are already available on the SC4 Experiments Plans Wiki.

 

For example, inside that page there are links to all the plans:

AlicePlans - AtlasPlans - CmsPlans - LhcbPlans - CernPlans - SiteASGC - SiteBNL - SiteCNAF - SiteFNAL - SiteFZK - SiteIN2P3 - SiteNDGF - SiteNIKHEF - SitePIC - SiteRAL - SiteTRIUMF - SiteUSALICE

 

VOs should always send all updates to H.Renshall (via the ECM meeting).

 

  • 10 Mar 2007 - ALICE and LHCb should send to A.Aimar their targets for 2007, in a similar format to those specified for ATLAS and CMS.

Done. ALICE and LHCb sent their targets for 2007.

 

New Action:

3 Apr 2007 - A.Aimar will update the targets for all four LHC experiments and distribute them to the MB.

 

  • 15 Mar 2007 - CMS and LHCb should send to C.Eck their requirements until 2011.

Not done. Waiting for CMS (D.Newbold) and LHCb (N.Brook).

 

  • 16 Mar 2007 - Tier-0 and Tier-1 sites should send to the MB List their Site Reliability Reports for February 2007

Done. Summary presented at this MB meeting.

 

  • 20 Mar 2007 – L.Robertson will present the agreement(s) to propose to the different suppliers of external software for the LCG.

Not done. L.Robertson will propose a new date for this milestone (end of May 2007).

 

  • 31 Mar 2007 - Experiments to provide very long term (2010-2020) estimates of their computing needs at CERN

Cancelled. L.Robertson had discussed the issue with J.Engelen. Beyond 2009 there are several options about the evolution of the LHC, but nothing has been decided yet and so computing estimates must remain speculative.

 

3.      Site Reliability Reports for February 2007 – Summary (Reliability Data; Site Reports; Slides) – A.Aimar

 

All sites commented on the Reliability Data document distributed at the end of February (pages 3 to 5 refer to February 2007).

 

The attached Slides show the February daily reliability values (slides 2 and 3), which are summarized and compared to the January values in slide 4 and in the table below (sites ordered as in the Reliability Data document):

 

Site            Jan 07    Feb 07

CERN              99        91
GridKa/FZK        85        90
IN2P3             96        74
INFN/CNAF         75        93
RAL               80        82
SARA-NIKHEF       93        83
TRIUMF            79        88
ASGC              96        97
FNAL              84        67
PIC               86        86
BNL               90        57
NDGF             n/a       n/a

 

Reliability >= 88%   (>= Target)

Reliability >= 79%   (>= 90% of Target)

Reliability < 79%   (< 90% of Target)

 

The target of 88% for the best 8 sites was not reached:

-          5 sites >= 88% (target)

-          5+3 sites >= 79% (90% of target)

 

In January we had 5+4 Sites > 79%.
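As an illustration of how these counts follow from the table above, below is a minimal Python sketch; the values are copied from the February column, the 79% line is 90% of the 88% target, and NDGF is excluded because no data is available:

# Illustrative only: recompute the "sites at target" counts from the
# February 2007 reliability values listed in the table above.
feb_reliability = {
    "CERN": 91, "GridKa/FZK": 90, "IN2P3": 74, "INFN/CNAF": 93,
    "RAL": 82, "SARA-NIKHEF": 83, "TRIUMF": 88, "ASGC": 97,
    "FNAL": 67, "PIC": 86, "BNL": 57,   # NDGF is n/a and excluded
}

TARGET = 88            # reliability target for the best 8 sites (%)
NEAR = 0.9 * TARGET    # 90% of the target, i.e. 79.2%

at_target = [s for s, r in feb_reliability.items() if r >= TARGET]
near_target = [s for s, r in feb_reliability.items() if NEAR <= r < TARGET]

print(len(at_target), "sites at or above the target:", sorted(at_target))
print(len(near_target), "more sites at or above 90% of the target:", sorted(near_target))
# Expected output: 5 sites at the target, 3 more above 90% of it (5+3 = 8)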

 

J.Gordon asked why the NDGF data is shown as “n/a” when reliability data for NDGF is actually available in GridView.

 

L.Robertson replied that it was agreed to wait for a proposal from NDGF on the use of tests that they would develop and which would be equivalent to those of SAM. The results for NDGF will be included when this has been agreed and the new tests developed.

 

O.Smirnova confirmed that some SAM tests execute successfully (e.g. BDII) but that, for instance, those for the CE and the WN need to be developed specifically by NDGF because of the structure of the NDGF Tier-1 site; those tests are not ready yet.

 

The table below summarizes the Site Reports received, in a “Problem, Solution” format where a solution is available:

 

SRM/MSS

-          IN2P3: SRM/dCache failures, upgraded SRM server
-          CERN: Unalarmed CASTOR failure, fixed the configuration
-          INFN: CASTOR overload, increased pool disk space
-          FZK: SRM not responsive, dCache patch applied, needs to be verified
-          RAL: CASTOR log files too large, solved with log file rotation
-          TRIUMF: SRM overloaded, dCache patch applied
-          BNL: SRM overloaded and unresponsive, patches applied + SRM DB upgraded

BDII overload/timeout

-          RAL: BDII timeouts and RB problems, added 2 nodes, situation improved
-          TRIUMF: CERN top-level BDII timeouts, moved to a local top-level BDII

Other Load Problems, upgrading hardware

-          IN2P3: CE overloaded (not due to IN2P3), fixed RB config
-          PIC: CE overload, added nodes and temporarily banned some users
-          FNAL: Gatekeeper instabilities, upgraded hardware

Not understood

-          FZK: RM failures (RM or BDII timeouts)
-          RAL: Unspecified Gridmanager error, GGUS ticket submitted

Operational Issues

-          IN2P3: Misconfigured RB (not due to IN2P3), fixed
-          CERN: Misconfigured test systems, fixed
-          FNAL: CSH misconfigured after cluster upgrade, fixed
-          BNL: Moved to SL4, SAM and gLite do not run until the new release
-          NDGF: Not reporting

SAM

-          IN2P3: SAM messages are not clear enough
-          SARA: SAM modifications needed to access resources at NIKHEF

 

One can see that the main issues were related to the SRM/MSS systems at the sites:

-          CASTOR:
Running SRM 1.1. Some overload required resizing the hardware (INFN) or corrective actions after failures (CERN, RAL).

-          dCache:
Several sites upgraded to version 1.8, which supports SRM 2.2. This required patching the system and caused unscheduled downtime, leaving some sites well below the target (BNL, IN2P3).

 

J.Templon added that the dCache patch fixed the “gridftp doors” problem, but only partially; that is why further intervention was required at the site.

 

F.Hernandez pointed out that the “misconfigured RB” was not due to IN2P3, but IN2P3 was overloaded because of this problem.

 

There were no outstanding issues, but it can be noted that:

-          Several problems were due to the upgrades of the dCache systems to SRM 2.2.

-          The SRM and MSS services are starting to be overloaded at many sites, probably due to more realistic usage by the experiments.

-          Individual site BDIIs are needed (decided at the Operations Meeting). Sites that have installed them have solved their timeout problems.

-          A hardware upgrade was sufficient in many cases of service overloading (SRM, CE, etc.).

 

-          There were no “false positives”, so the reliability of the SAM tests seems to have improved. However, site reliability did not improve in February compared to January. This may be due to teething troubles with the dCache upgrades at several sites, but the 88% target seems difficult to reach for 8 sites. Other upgrades (SL4 migration, gLite 3.1, SRM 2.2, etc.) have not yet been carried out and could introduce new problems that further reduce the reliability and stability of the sites.

 

Work on checking and reporting site reliability in the weekly site reports to the Operations Meeting is ongoing. It will not be ready for March, but probably by the end of April 2007.

 

F.Hernandez noted that, in spite of past requests, SAM had not evolved to be more “site-friendly” with readable failure messages. This makes it very difficult to understand the causes of the failures.

I.Bird replied that this work is on the list of things to do and will be addressed after other more urgent issues.

 

Received by M.Ernst:

As it is correctly stated in the table on "Operational Issues", SAM tests are failing for BNL since 19 February because of problems with the gLite middleware (in particular with the CE) since the farm was upgraded to SL4 (and a new Condor version).

I think the real issue here is we need to correct the metrics. Monitoring the LCG/gLite CE at BNL means we are probing the availability of a component that is not needed in the (U.S.) ATLAS operations model. No production/analysis job ever gets dispatched through this CE.

Jamie and I discussed the matter last week and concluded that the metrics will be corrected for BNL such that the CE will be taken off the list of components to be tested.

 

4.      VO Boxes - Sites Support Level (Slides) – G.Merino

 

G.Merino was asked to present the level of support of VOBoxes at PIC in order to trigger discussion and clarification of those issues between sites and experiments. A high-level milestone is set for April 2007: “Sites should propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes”.

4.1         VOBoxes at PIC

PIC supports VOBoxes for three experiments: ATLAS, CMS and LHCb.

 

For each of the experiments PIC has one person on-site acting as VO-liaison who:

-          knows and operates the VOBox services (CMS)

-          knows whom to contact in the experiment for VOBox issues (ATLAS, LHCb)

 

The VOBoxes at PIC run respectively:

-          ATLAS: Data Transfer agents (Don Quijote)

-          CMS: Phedex Agents

-          LHCb: DIRAC Configuration Service and Transfer Agents

4.2         VOBoxes Operation

Backup: No backup is requested from the site

-          CMS: The only important files to keep are Phedex agents’ configuration files. These are kept in an external (CMS-owned) CVS repository.

-          ATLAS: The MySQL tables are re-generated centrally in case of failure.

 

Monitoring: No VO-specific sensors are provided for sites to show the VOBox status. All experiments have VO-specific sensors external to the site that detect whether the VO-specific agents are working correctly.

-          If any problem occurs, the VO contacts the local VO-liaison

-          Site operators do not get alerted

 

The site should only monitor basic metrics (host is alive, CPU load, etc.). If any of these alarms is triggered, the local VO-liaison is notified.
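As an illustration of the kind of basic, VO-agnostic check meant here, below is a minimal sketch; the hostname, addresses and mail relay are hypothetical placeholders, not PIC's actual setup:

# Minimal sketch of site-level VOBox monitoring: ping the host and notify
# the local VO-liaison if it does not answer. A CPU-load check would be
# added in the same way, e.g. via a local sensor on the VOBox.
# All names below (host, addresses) are illustrative placeholders.
import smtplib
import subprocess
from email.message import EmailMessage

VOBOX_HOST = "vobox-atlas.example.org"    # hypothetical VOBox hostname
LIAISON = "atlas-liaison@example.org"     # hypothetical local VO-liaison address

def host_alive(host):
    """One ICMP ping with a 2-second timeout; True if the host answers."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0

def notify_liaison(text):
    """Send a short alarm mail to the local VO-liaison via a local relay."""
    msg = EmailMessage()
    msg["Subject"] = "VOBox alarm: " + VOBOX_HOST
    msg["From"] = "site-monitoring@example.org"
    msg["To"] = LIAISON
    msg.set_content(text)
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    if not host_alive(VOBOX_HOST):
        notify_liaison(VOBOX_HOST + " does not answer ping")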

 

Recovery: No special recovery action is agreed.

-          If a major problem happens, the site admin tries to reboot the VOBox node

-          If the machine is not responsive, the site reinstalls a default VOBox; the local experiment liaison will then redeploy all the VOBox software.

 

L.Robertson asked what “response time” has been agreed with the VOs.

 

G.Merino replied that there is no formal agreement; the VOBoxes are “ping-ed” like any other host. PIC has no dedicated spares for VOBoxes or other specific purposes; it has generic spare equipment, comparable to a “powerful WN” rather than a full server, which would be used to replace a VOBox if needed.

As of today, preparing a VOBox from scratch would take about a full day. Is this adequate for the experiments?

4.3         Specific Issues on VOBoxes at PIC

 

1. The CMS VOBox is still performing quite a lot of "class 2" functions. Today it would not work if taken out of the site LAN.

For instance, it checks whether a pre-staged file is already on disk by using "local" commands (CASTOR 1 commands, for instance). Is this just a temporary solution until SRM v2.2 is available?

 

Nobody from CMS was present to reply.

 

2. Many of the LHCb transfers hitting PIC's SRMs do not use FTS

The LHCb Transfer Agents in the VOBox seem to play the FTS role (transfer queue, retries, etc.). This prevents the PIC system administrators from having any control over the LHCb data flow. When transfers are FTS-driven, one can administer the FTS channels (in particular, close them when there are problems). For LHCb, PIC cannot do this since only LHCb people can interact with the transfer agents in the VOBox.

 

For PIC it was impossible to recover a backlog of work because all the retries were done directly from the WNs, which would overload and block all network transfers.

 

N.Brook replied that the VOBoxes retry the transfers when they fail. LHCb will continue to use the WN only for the first copy attempt, but in the future failures will be retried by the VOBox using FTS channels. He noted, however, that the transfer rate integrated over all Tier-1 sites is less than 10 MB/sec for LHCb.

 

3. There are some LHCb Tier-2s with no disk at all

The data flow from some LHCb Tier-2 sites into PIC will always come directly from the WN to the SRM. PIC lacks any FTS “operations” control. Is this a temporary situation, or will every Tier-2 site have some small SRM-disk to act as a buffer for those transfers?

 

N.Brook replied that in this case too the VOBoxes will use FTS; therefore the site will be able to control those transfers via normal FTS service administration.

 

L.Robertson asked for comments from other sites:

-          H.Marten reported that similar informal procedures are in place at FZK, with VO-liaisons and basic backup/recovery as at PIC.

-          J.Gordon added that RAL has similar agreements but nothing is formally specified.

 

H.Marten added that a written agreement common to all sites would help to make sure that all issues are clear between sites and VOs and that all sites use the same standards. The proposal was not discussed further for now.

L.Robertson concluded that it is important that all sites know what to do if there is a VOBox failure. For now sites should verify this independently; it will be checked via the high-level milestone scheduled for April.

 

5.      Mid-Term Resource Planning (Slides; document, 2Q2007 Req. Table) – H.Renshall

 

 

The document attached is a new version of L.Robertson’s document “Summary of the Process for Reporting Experiment Requirements, and Site Capacity and Usage Data for CERN and the Tier-1 Centres (version 7)”. Changes are highlighted in yellow in the document.

 

From 1 April we will start the new resource reporting process, using the values in the Mid-Term Resource Planning tables. The 2Q2007 table of WLCG Service Coordination planning uses values for ‘installed’ capacity taken from the end of January accounting plus any increments announced at the January workshop. Tier-1 sites should verify their values.

 

Other miscellaneous changes to the 2Q2007 table were:

-          new values from ALICE (Y.Schutz, 9 Feb.) with reduced tape requirements

-          addition of LHCb requirements for June and July

-          addition of new columns of Allocated disk space (in italics) per site per experiment taken from the end of February accounting

-          updated pledges for 2007, which need to be available to experiments from 1 July 2007 till 1 April 2008; the 2006 pledges should be available up till 30 June 2007

-          added CERN Tier-0 and CAF resources for completeness

 

Issues for discussion:

-          New proposal that Tape1Disk0 disk buffer size be quantified and be added to disk requirements.

-          The attached draft 2Q2007 table shows discrepancies that we will follow up individually (after sites have verified their installed capacities).

-          For ATLAS and CMS the allocated disk capacity is much larger than that formally required, so this needs to be understood (we know, for example, that ATLAS MC event sizes are bigger than predicted).

-          For ALICE the required disk space is much greater than that allocated. ALICE wants to ramp up to these values, but with what time profile?

Ph.Charpentier added that LHCb has fewer resources than required and is using all available resources.

 

-          We have used the averages of the site offers of disk and CPU to give a single site share for each experiment, and used that to calculate all the site requirements (see the sketch after this list). Is this an oversimplification (tape offers are often much lower)? This point needs to be investigated further.

-          In the longer term we are assuming that the full 2007 requirements, as in the TDR addenda, need to be available to experiments from 1 October 2007 (corrected during the meeting to 1 July 2007), and that the full 2008 requirements, as per the ‘Summary of Regional Centre Capacities 01/02/2007’, need to be available from 1 April 2008, at which point the mid-term planning tables converge with the annual planning and probably no longer need a separate existence.

-          The ramp-ups to 4Q2007 (double of 2Q2007) and then to 2Q2008 (double of 4Q2007) are very steep. A lot of hardware needs to be bought, installed and commissioned.
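To make the site-share averaging mentioned above concrete, here is a minimal sketch with purely illustrative numbers (the real inputs are the site pledge and experiment requirement tables):

# Illustrative sketch: derive a single site share per experiment as the
# average of the site's fractional disk and CPU offers, then apply it to
# the experiment's total requirement. All numbers below are invented.
def site_share(disk_offer, total_disk, cpu_offer, total_cpu):
    # Average of the fractional disk and CPU offers of one site
    return 0.5 * (disk_offer / total_disk + cpu_offer / total_cpu)

# Hypothetical site offering 8% of the disk and 12% of the CPU for one experiment
share = site_share(disk_offer=80, total_disk=1000, cpu_offer=120, total_cpu=1000)

experiment_total_tape = 500.0    # illustrative total tape requirement (TB)
site_tape_requirement = share * experiment_total_tape

print("site share = %.1f%%, tape requirement = %.0f TB" % (100 * share, site_tape_requirement))
# -> site share = 10.0%, tape requirement = 50 TB
# The open question above is whether applying this averaged share to tape
# is an oversimplification, since tape offers are often much lower.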

 

L.Robertson reminded the sites that from March 2007 the available and installed data for resource accounting will be taken from these tables as at the end of each month (i.e. March values from the 2Q2007 table, taken on the 8th of April 2007) and from the APEL repository for automatic resource accounting.

 

Decision:

Following the email exchange preceding the MB meeting, L.Robertson asked the MB to agree (again) that the requirements from the experiments include the efficiency factors agreed in the TDR and that all values are gross values. The MB agreed.

 

The efficiency factors already include all the disk needed for buffers, etc., so the experiments should request gross values (including efficiency), which is what the sites will install and report on.
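As a purely illustrative example of the gross/net distinction, the sketch below assumes a 70% disk efficiency factor; the factors actually used are those agreed in the TDR (with tape at 100%, as discussed below):

# Illustrative sketch: convert a net (usable) requirement into the gross
# value that a site installs and reports, by dividing by the efficiency
# factor. The 70% figure is illustrative; the agreed TDR factors apply in practice.
def gross_request(net_requirement, efficiency):
    # Gross capacity to install so that net_requirement is effectively usable
    return net_requirement / efficiency

net_disk_tb = 350.0       # illustrative usable-disk requirement (TB)
disk_efficiency = 0.70    # example efficiency factor (buffers, overheads, ...)

print("gross disk request: %.0f TB" % gross_request(net_disk_tb, disk_efficiency))
# -> 500 TB installed, of which 350 TB is effectively usable by the experiment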

 

N.Brook added that LHCb will send to H.Renshall their updated requests, in order to include the efficiency factors.

 

Ph.Charpentier expressed scepticism that tape efficiency can be 100%, without any overhead, as agreed in the TDR.

H.Marten replied that this was probably decided taking into account that data compression by the tape equipment can give 15-20% compensation. The MB agreed to discuss this in the future if the 100% factor for tape proves to be incorrect.

 

6.      AOB

 

 

No AOB.

 

7.      Summary of New Actions

 

 

3 Apr 2007 - A.Aimar will update the targets for all four LHC experiments and distribute them to the MB.

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.