LCG Management Board

Date/Time:

Tuesday 27 February 2007 - 16:00 

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11627

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 2.3.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, T.Cass, L.Dell’Agnello, T.Doyle, C.Eck, M.Ernst, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, E.Laure, H.Marten, P.Mato, P.McBride, G.Merino, B.Panzer, H.Renshall, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 6 March 2007 - 16:00-18:00 – F2F Meeting at CERN

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

Some comments were received asking whether the “experiments’ resource requirements” include the “efficiency factors” or whether it is up to the sites to apply them. The topic needs further clarification; therefore an updated proposal is discussed at this meeting.

 

Minutes approved.

1.2         Matters Arising

The latest Targets and Milestones for 2007 (document) were distributed at the MB and ECM and should be commented on by the experiments and sites.

CMS milestones are included; similar information is needed for the other experiments.

The document will be attached to the agendas every week until it is stable (hopefully very soon).

 

Action:

10 Mar 2007 - ALICE, ATLAS and LHCb should send to A.Aimar their targets for 2007, in a similar format to those specified for CMS.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 9 Feb 2007 - All Tier-1 Sites should send to the MB their Site Reliability Report for December 06 and January 07.

Done. A short summary is presented later in this meeting.

 

  • 20 Feb 2007 - MB Members should send feedback to L.Robertson on the list of products and on the kind of agreement to reach with the external software providers.

Done. L.Robertson will send a proposal on how to proceed with the different suppliers of software.

 

New action:

15 Mar 2007 – L.Robertson will present the agreement(s) to propose to the different suppliers of external software for the LCG.

 

  • 22 Feb 2007 - Experiments (ALICE and LHCb in particular) should verify and update the Megatable (the Tier-0-Tier-1 peak value in particular) and inform C.Eck of any change.

The Megatable values will be discussed today and it will then be decided how to proceed.

 

  • 27 Feb 2007 - MB Members should review and send feedback on the note distributed by L.Robertson on the “process for reporting site capacity and usage data”.

Done. Feedback received and an extended version is discussed in this meeting.

 

  • 27 Feb 2007 - H.Renshall agreed to summarize the experiments’ work for 2007 in an “Overview Time Table of the Experiments Activities”.

Not done.

 

3.      Reliability Reports Jan 2007 Summary (Site Reliability Data; Site Reports Collection; Slides) - A.Aimar

 

A.Aimar presented a short summary of the Site Reports (link) that all Tier-0 and Tier-1 sites provided.

See also the Site Reliability Data (link) previously distributed.

 

Slides 2 and 3 show the daily reliability of each site, in tabular and graphical format.

Sites were asked to send a report commenting on all days in which they did not reach the current reliability target (88%).

All sites replied. CERN was always above the reliability target and NDGF still has no reliability data available.

 

The WLCG target was “88% for the best 8 sites” and was almost reached:

-          5 Sites were above the target

-          9 Sites were above 90% of the target (i.e. 79%)

 

Even if the target was not reached, this is still the best reliability result so far.
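As an illustration, the “best 8 sites” metric above can be sketched as follows. The reliability figures below are invented for illustration (only the counts mirror the January figures reported here, not the real per-site data), and reading the target as the average of the best 8 sites is an assumption:

```python
# Sketch of the "best 8 sites" reliability metric. Per-site figures
# are hypothetical, NOT the actual January 2007 numbers.
TARGET = 0.88

reliabilities = {
    "CERN": 0.95, "SiteA": 0.91, "SiteB": 0.90, "SiteC": 0.89,
    "SiteD": 0.88, "SiteE": 0.85, "SiteF": 0.82, "SiteG": 0.80,
    "SiteH": 0.795, "SiteI": 0.75,
}

# Sites individually at or above the monthly target (88%)
above_target = [s for s, r in reliabilities.items() if r >= TARGET]

# Sites at or above 90% of the target (0.9 * 0.88 = 0.792, i.e. ~79%)
above_90pct = [s for s, r in reliabilities.items() if r >= 0.9 * TARGET]

# Assumed reading of the WLCG metric: average reliability of the best 8 sites
best8 = sorted(reliabilities.values(), reverse=True)[:8]
best8_avg = sum(best8) / len(best8)
print(f"best-8 average: {best8_avg:.1%}, target met: {best8_avg >= TARGET}")
```

With these invented figures 5 sites meet the target and 9 are above 79%, matching the counts quoted in the minutes, while the best-8 average still falls just short of 88%.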

The targets will be raised in 2007 to 91% (June) and 93% (December).

 

As the LCG is now a running service, there are fewer milestones to complete and we focus more on performance and reliability metrics, so additional monitoring information will be provided in the QR reports.

 

All site reports are in the attached document (link) and slide 5 provides a simple classification of the issues encountered and whether they were solved or not. From the reports received it is not always clear whether an issue is completely fixed.

3.1         Summary of Reported Issues and Solutions

 

SAM info not found (30 days passed)

- Fixed: IN2P3 (but found the information elsewhere)

- Not fixed yet: NDGF (not reporting); SARA (SAM results not found)

 

BDII overload/timeout

- Fixed: TRIUMF (installed their own site BDII)

- Not fixed yet: FZK (BDII timeouts); RAL (BDII timeouts)

 

MSS

- Fixed: BNL (dCache down for upgrade)

- Not fixed yet: RAL, PIC (dCache gridftp doors issues; upgrade to 1.7 should fix them); CNAF (sporadic CASTOR failures)

 

Load problems, increasing hardware

- Fixed: FNAL (batch and file system scaling problems; gatekeeper replaced); INFN (dedicated LFC hardware, increased CASTOR pool size); PIC (CE overload; added 2 CEs)

- Not fixed yet: ASGC (possible BDII and CE overload; manual restart)

 

Not understood (not fixed yet)

- BNL (timing of certificates and AFS tokens; manual fix for now); IN2P3 (proxy expired + LFC timeout); PIC (possible SAM false positive)

 

Operations issues (all fixed)

- INFN (CASTOR config issues); FZK (dCache config issues); RAL (faulty script executed); ASGC (endpoint and NFS config, host certificate expired); SARA (file system full during SAM tests); BNL (FTS info in Globus different from the BDII db)

 

The issues in the Site Reports were grouped into six families (as in the table above):

 

-          SAM Information not found - After 30 days the detailed SAM information stored in the database can no longer be seen via GridView. Therefore, if sites do not verify their reliability data regularly, it will be more difficult for them to trace the cause of an issue later.

F.Hernandez, referring to the IN2P3 issues, added that they found information in the site report to the Operations Meeting to correlate with their failures, but they do not know which SAM tests had failed at their site in early January.

 

-          BDII overload or timeouts – TRIUMF solved the issue by installing a site BDII (they had previously used one of the CERN BDIIs). RAL had timeouts and has now added two more servers to see whether this solves the timeout problems.

J.Gordon added that some service failures at the Tier-1 sites are due to overloading of the CERN BDII; this could be the cause of some random timeouts. In addition, some Tier-2 sites in the regions point to the CERN BDII even though they could (and should) instead point to the BDII available at their regional Tier-1 site.

I.Bird added that the site’s BDII should not be installed on the CE (see the ASGC BDII overload problem) and that, even if the CERN top-level BDII was upgraded, Tier-1 sites should consider also installing their own top-level site BDII. It is believed that the current architecture can cope with the LCG needs for the next couple of years and meanwhile we should see how to improve the Information System services.
L.Field will give a presentation about BDII, status and current configuration, at the next GDB meeting.

-          MSS-related issues – BNL was down during their upgrade to dCache 1.7. Other sites (RAL and PIC) had issues with the “dCache gridftp doors” that should be resolved by upgrading to dCache 1.7. INFN has sporadic CASTOR failures, but it is unclear from their report whether these are fixed.

-          Load Problems – ASGC had BDII and CE overload problems that required periodic manual restarts of the services. FNAL had problems with the scaling of their batch system and their file system, which they solved by replacing the gatekeeper. INFN had overloads of the LFC (to be addressed with dedicated hardware) and of CASTOR (solved by increasing the pool size); both issues seem solved now but are not completely understood. PIC added two CEs to resolve the overload of their existing CE.

L.Dell’Agnello added that there was an issue with the new LFC hardware, which did not pass certification and should be solved soon by their hardware supplier. For CASTOR they increased the storage pool size for the experiments in order to eliminate the sporadic timeouts that were occurring.

-          Problems not yet understood – BNL had timing problems between certificates and AFS tokens. IN2P3 experienced problems with proxy expiration and LFC timeouts. PIC suspects that they could have some false positives from SAM (the only case found this month).

-          General operations issues – Typically wrong configurations (of MSS, NFS, etc), sysadmin scripts that did not work properly, file systems full or certificates left to expire.

3.2         Conclusions and Next Steps

Some issues remain, but none seems either outstanding or unsolvable:

-          Individual site BDIIs seem to be needed (already discussed and agreed at Operations meeting)

-          dCache version 1.7 solves gridftp doors issues that a few sites had

-          Hardware upgrades helped in many cases of timeout and overload of some services 

 

Site reliability has improved and the target is almost reached for the best 8 sites. But upcoming major upgrades could have an impact on reliability (SL4, gLite 3.1, SRM 2.2, etc.).

 

Following J.Templon’s proposal it was agreed that from now on reliability information should be verified (via GridView) and the reasons for problems reported to the weekly Operations Meeting in the weekly site report.

 

The Reliability Data Report and the Site Reports will be included in the Quarterly Reports, starting from the current quarter.

 

4.      LCG Resources Tables (Slides; document) – L.Robertson

 

The MB discussed further which resources, both experiment requirements and installed site equipment, should be reported and updated, and how. The detailed proposal is in the attached document (link).

 

Note: Slide 1 says “Experiments Requests” but this should read “Experiment Requirements”.

 

Experiments Requirements

-          Are those mentioned in the TDR

-          Specified in gross units – calculated by the experiments as “real needs plus an allowance for efficiency”.

-          Standard efficiency factors are those specified in the TDR

 

The Experiments Requirements should formally be updated in:

-          July - for the October C-RRB at which the pledges will be committed for the following year

-          January – with the experience of the previous year’s run

 

 

Regional Centre Resource Table

-          Defines the current status of the resources pledged or planned by funding agencies, for each regional centre

-          Formally reported by the funding agencies to the Project Resource Manager (C.Eck).
It is assumed that the pledges have been agreed in advance between experiments and sites.

-          Pledges are in gross units

 

Megatable (link)

-          Specifies the relationships between Tier-1 and Tier-2 centres for each site & experiment – according to the way in which the experiment expects to be able to use its resources at the site during the 2008 run

-          Inter-site data rates are specified as follows:

    • Tier-0 to Tier-1  - peak
    • Tier-1 to/from Tier-1  - aggregate peak
    • Tier-1 to/from Tier-2  – peak and average for each T1,T2

-          Also contains sizing and utilisation of storage classes at the Tier-1

-          The values for storage specified in the Megatable for the Tier-2 are net units in order to compare them with the network rates.

-          The sum has the efficiency factors applied to obtain the gross units to be used for the pledges and procurements.

 

D.Barberis suggested focusing on the data rates because the storage size will not be clear until the LHC is running. L.Robertson replied that it is important to define realistic requests assuming a realistic efficiency of the LHC; otherwise the resources will not be there when needed, because procurement takes time.

 

Y.Schutz said that ALICE does not yet have the resources it requires at both the Tier-1s and the Tier-2s. It is therefore difficult to specify the data rates that a Tier-1 must support (any additional Tier-2 resources that become available will increase the load on the Tier-1s). ALICE therefore base their Megatable numbers on what they would like to see rather than on what the sites currently offer.

C.Eck said that this results in ALICE asking for about twice the resources sites are planning to provide. In practice sites will have to apply a reduction when considering how to organise their ALICE resources.

 

Harry’s Table (modified during the meeting)

-          Current and following (1-3) quarters

-          A set of tables that show the experiment requirements at each Tier-1 site for computing and storage by storage class

-          Requests are expressed in gross units – what is expected to be installed

-          The table also shows the installed (current quarter) or planned (future quarters) capacity at the site as reported by the site

-          Installed capacity is also specified in gross units, and so can be compared directly with the experiments requests

 

Accounting is in net units and therefore the efficiency factors must be applied in order to compare resource usage with the planning values in Harry’s table.
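The net/gross conversion described above can be sketched as follows. The efficiency factors shown are illustrative placeholders, not the actual TDR values:

```python
# Sketch of the gross/net unit conversion implied above.
# Efficiency factors are ASSUMED placeholders, not the TDR values.
EFFICIENCY = {"cpu": 0.85, "disk": 0.70, "tape": 1.00}

def net_to_gross(net: float, resource: str) -> float:
    """Convert a net (usable) requirement into the gross capacity
    to pledge/procure, by dividing by the efficiency factor."""
    return net / EFFICIENCY[resource]

def gross_to_net(gross: float, resource: str) -> float:
    """Convert installed gross capacity into usable (net) capacity,
    e.g. to compare accounting data against the planning tables."""
    return gross * EFFICIENCY[resource]

# Example: a 700 TB net disk requirement with a 0.70 efficiency
# factor implies 1000 TB of gross capacity to install.
print(net_to_gross(700.0, "disk"))
```

The same factors applied in the opposite direction let accounting figures (net) be compared with the planned/installed values (gross) in the tables above.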

 

D.Barberis asked for site allocation graphs to be provided “by experiment” not only the total.

J.Gordon replied that this is no longer implemented in the Accounting Portal but should be put back in the new accounting reports.

 

H.Marten noted that gross units are “usable units”, e.g. excluding RAID and file system overheads, and are specified in decimal units.

 

H.Renshall asked that the name of “Harry’s Table” be changed.

 

F.Hernandez asked that the update frequency of each of the tables above be clarified.

 

C.Eck noted that experiment requirements are needed up to 2011. Only ATLAS has provided them.

 

Action:

15 Mar 2007 - ALICE, CMS, LHCb should send to C.Eck their requirements until 2011.

 

 

5.      AOB

 

 

Experiments were requested to provide requirements until 2020, in order to support the discussion about the future CC at CERN.

The calculations should be done using the same models, involving someone from the accelerator side.

 

Action:

6 Mar 2007 - L.Robertson will organize a meeting with someone from the accelerator to discuss the planning until 2020.

 

Now that J.Gordon has been appointed GDB chairman, with an ex officio place in the MB, Tony Doyle will represent RAL at the MB meetings.

 

6.      Summary of New Actions

 

 

6 Mar 2007 - L.Robertson will organize a meeting with someone from the accelerator to discuss the planning until 2020.

 

10 Mar 2007 - ALICE, ATLAS and LHCb should send to A.Aimar their targets for 2007, in a similar format to those specified for CMS.

 

15 Mar 2007 - ALICE, CMS, LHCb should send to C.Eck their requirements until 2011.

 

15 Mar 2007 – L.Robertson will present the agreement(s) to propose to the different suppliers of external software for the LCG.

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.