LCG Management Board

Date/Time

Tuesday 9 December 2008 16:00-18:00 – F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=45196

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 14.12.2008)

Participants

A.Aimar (notes), O.Barring, I.Bird, Ph.Charpentier, L.Dell’Agnello, F.Donno, M.Ernst, S.Foffano, J.Gordon, F.Giacomini, F.Hernandez, M.Kasemann, U.Marconi, H.Marten, G.Merino, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, J.Shiers (chair), R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 16 December 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments. The minutes of the previous MB meeting were approved. 

1.2      Site Reliability Reports for OPS and VOs (Reliability Reports Nov 2008)

A.Aimar distributed the Availability and Reliability reports for November 2008. Comments will be collected for the next MB meeting.

 

New Action:

16 Dec 2008 - Experiments should comment on the VO-specific Reliability Reports

 

 

 

2.   Action List Review (List of actions) 
 

  • SCAS Testing and Certification

First testing was done but problems were found.

M.Schulz reported that they are waiting for fixes from the developers; once the fixes arrive, testing and preparation for distribution should be quick unless new issues are discovered.

 

  • 30 Nov 2008 - Sites should configure (if needed) their SRM V2 before the end of November. The SRM V2 SAM results are already available in the SAM tests and will be used for the December GridView Reports.

DONE.

  • 9 Dec 2008 – I. Bird will check the SRM V2 link in the action list with the SAM team and report back to the MB.

To be done. I.Bird reported that he will check it next week when he is back at CERN.

  • 20 Nov 2008 - VOBoxes SLAs:
    • Experiments should answer to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

F.Hernandez reported that there is no news from CMS. M.Kasemann will follow up on the issue.

NL-T1 has not completed its SLA yet. NDGF is still modifying the SLA with ALICE.

  • 7 Dec 2008 - Sites should comment on the proposal by F.Donno for reporting installed capacity

In the agenda today.

L.Dell’Agnello added that CNAF would still like to send some comments.

  • 09 Dec 2008 - Comments to the proposal of new High Level Milestones should be sent to the MB mailing list.

In the agenda today.

  • 16 Dec 2008 – P.Mato and M.Schulz to follow up on the GCC 4.3 discussion, agree on co-ordination and timescale, and report back to the MB.

Ongoing. For next week.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The information is best provided in the form of the existing “Data flows from the Experiments” documents (e.g. the Dataflow from LHCb).

Not done yet.

  • The dCache team should report on the client tools, presenting estimated timelines and issues in porting them to gcc 4.3.

Not done. To be followed up by the Applications Area?

  • Experiments should verify whether the gcc 4.1 binaries work and report on issues and problems. P.Mato will report in a couple of weeks.

REMOVED. This action is replaced by the one where M.Schulz and P.Mato agree on a plan for preparation and testing of the SLC5 platform.

Done for ATLAS. To be done for the other Experiments.

  • 09 Dec 2008 - Comments to the proposal for collecting Installed Capacity (F.Donno's document) should be sent to the MB mailing list.

In the meeting today.

Note: OSG asked for one more month, until end of January.

 

 

3.   LCG Operations Weekly Report (Draft pre-CHEP workshop agenda; Minutes; Slides) – J.Shiers

Summary of the status and progress of the LCG Operations. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      FTS on SLC3 RIP

Access to the CERN-PROD SLC3-based FTS services (FTS-T0-EXPORT, FTS-T1-IMPORT, FTS-T2-SERVICE) was STOPPED at 10:00 on Monday 8 December. All production activity should now be on the new SLC4-based FTS services. See https://twiki.cern.ch/twiki/bin/view/FIOgroup/FtsServices for details.

3.2      Week Summary

The number of tickets is not changing and is relatively small. Alarms were rarely needed.

-       The week was smooth until a major CASTOR ATLAS issue occurred.

-       The scheduled intervention on the ATLAS DB ran largely overtime: 4 hours instead of 1. The Oracle version upgrade required the system to be stopped, which was not made clear in the announcement (slide 6).

-       There was a fire in the CC in Taipei. No site updates were received until they were requested via Di Qing (slide 7).

-       The ATLAS end-point at CNAF was in unscheduled downtime on Friday evening due to GPFS problems; it was fixed over the weekend (very early Saturday morning).

 

Slides 5 and 6 show the details on the CASTOR ATLAS SRM and the ATLAS DB issues.

 

J.Shiers pointed out that interventions should be announced and executed without extensions: it is better to announce a longer intervention than a short one that is then not completed on time. Slide 10 shows the advantages and the guidelines that Managed Services should follow.

3.3      Miscellaneous

-       WLCG Operations Page: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsWeb

-       WLCG Operations mailing list: wlcg-operations@cern.ch

-       WLCG “Service Coordinator on Duty”: wlcg-scod@cern.ch

3.4      WLCG Collaboration Workshop pre-CHEP

-       Registration deadline is December 28th 2008.

-       People are already asking about the agenda before they register for CHEP and book flights and hotels.

 

4.   Reporting Installed Capacity (More Information; Slides) – F.Donno

 

F.Donno summarized the comments received to the document about reporting installed capacity at the WLCG Sites.

4.1      Computing Capacity – changes

Some GLUE attributes were changed:

-       GlueSubClusterLogicalCPUs = Total number of core/hyperthreaded CPUs in the subcluster (including machines in state offline or down)

-       GlueHostBenchMarkSI00 = Average SpecInt2000 rating per LogicalCPU

-       CECapability: Fairshare=<VO>:<share> ; This value is used to express a specific VO share if VO shares are in operation.

-       CECapability: CPUScalingReferenceSI00=<refCPU SI00> the CPU SI00 for which the published GlueCEMaxCPUTimes are valid.

 

Below is the formula to calculate the installed capacity at a site.
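
A plausible rendering of the formula, based on the attribute definitions above (the authoritative expression is given in F.Donno’s document), is:

Installed Capacity for a VO = Σ over sub-clusters ( GlueSubClusterLogicalCPUs × GlueHostBenchMarkSI00 × Share(VO) )

where Share(VO) is the fraction published via CECapability: Fairshare=&lt;VO&gt;:&lt;share&gt; (presumably 1 when no per-VO share is in operation).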

 

 

M.Kasemann noted that if the share value is zero it means that the VO has no share at that Site.

 

The total installed capacity is given by the formula below.
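
Rendered in the same way from the definitions above (again, the exact expression is in F.Donno’s document), the total is the sum over all sub-clusters without the per-VO share factor:

Total Installed Capacity = Σ over sub-clusters ( GlueSubClusterLogicalCPUs × GlueHostBenchMarkSI00 )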

 

 

Ph.Charpentier noted that what is requested is the integral of the power provided over time, not an instantaneous value; the situation changes during a month or a year.

 

G.Merino asked if these attributes are also used by APEL to normalize the accounting values.

J.Gordon replied that they are, but he will verify it with the APEL team, with particular attention to the scaling factors used.

 

J.Templon noted that WLCG Tier-2 sites that do not implement fair shares publish a share value of 1; any resources they provide to local users are then not reflected.

 

L.Dell’Agnello noted that the document, by asking for homogeneous sub-clusters, effectively recommends having separate CEs and queues for different hardware. CNAF does not agree with this approach: the local configuration at CNAF has one queue per Experiment, not one per WN type.

 

F.Donno replied that the sub-clusters are mostly useful for proper matchmaking of jobs and resources.

4.2      Storage Capacity – changes

No changes to the calculations.

-       GlueSACapability: Installed[Online|Nearline]Capacity = Online or Nearline space part of a Storage Area. This attribute has been introduced for accounting purposes only.

 

Installed Capacity = Σ GlueSACapability(InstalledCapacity)

 

Storage areas can now overlap; for overlapping storage areas the installed capacity is set to zero in all but one of them, to avoid double counting:

GlueSACapability: Installed[Online|Nearline]Capacity = 0
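
As an illustration (a hypothetical configuration, not taken from the document): if two storage areas of a VO share the same 100 TB disk pool, one of them publishes InstalledOnlineCapacity = 100 TB and the other publishes InstalledOnlineCapacity = 0, so that the site total remains 100 TB.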

 

RAL has a single SRM with one endpoint per VO; this setup is not covered by the current model.

 

On request of OSG, if the ReservedSize is greater than zero then the TotalSize must not be larger than that value:

 

IF ReservedSize > 0  THEN  TotalSize <= ReservedSize

4.2      Status

The new version of the document (v1.8-1) has been published with all details, operational instructions for site administrators and explanatory examples (OSG examples still missing): https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf

 

There is technical agreement on the essential content of the document; the last phone conference on storage information providers was held on Friday 5 December.

 

OSG requested one more month to review the proposals. OSG informed the MB about the concept of a “site sponsor” that publishes global (?) fair shares for all VOs (i.e. the VOs are called “US-ATLAS” and “US-CMS” instead of ATLAS and CMS, which can cause problems). This request needs to be accommodated, but more details are needed.

 

Minor changes are still expected; operational instructions and examples from OSG are needed.

4.4      Plans

An operational plan is in preparation. Work will start as soon as the document is officially approved and the OSG details are received and agreed. It will imply work on several applications:

-       Client tools

-       Gstat sanity checks are already available.

-       GridMAP is being adapted. Some sites now appear smaller in GridMAP because the values have been updated and are now correct. The Reliability reports will also use these values.

-       The client tools gfal/lcg-utils will improve their resource selection algorithm in the next releases.

-       Reviewing SAM tests

-       Nothing has been done yet for APEL

-       What else? OSG?

 

Coordination with the WN working group is needed:

-       A few bug reports have already been submitted.

-       Many site admins ask for support; a support group is needed.

 

Storage is mostly automatic:

-       GRIF and RAL have new information providers for DPM and CASTOR; those for dCache and StoRM are expected in January.

-       dCache requires the information about nearline storage to be set by hand.

-       Gstat checks should ensure coherence of published information.

 

H.Marten asked that the document not require the Sites to use a central table (page 6) for benchmarking: Sites must run the benchmarks themselves, as agreed at the MB.

 

O.Smirnova commented that NDGF’s values are converted by scripts, so it may be difficult to find problems in case of incorrect reporting.

J.Gordon replied that, at least initially, sites ought to check that the data they publish is correct.

4.5      OSG Comments

R.Pordes reported the main OSG comments also mentioned in the email to the MB.

 

1. Move approval of the document to end of January 2009 – OSG, US ATLAS and US CMS are asking for an extension until the end of January before approving the document. The main reason is to evaluate the impact of the document on the current OSG setup.

 

M.Ernst supported the request for US ATLAS. The implementation plan should start when OSG has analyzed the impact.

M.Kasemann stated that CMS also supports this request. It is important to have OSG and EGEE start implementing when the document is agreed by all.

 

2. Separate Requirements from Implementation(s) – While the requirements are the same for all infrastructures, the solutions they choose could be different. Therefore the document should be split into two distinct sections: one with the requirements and another with the solutions chosen by the infrastructures. This comment is valid for all WLCG documents: requirements should be kept separate from the recommended implementations.

 

J.Templon noted that the initial section about the goals of the document could be modified to express the requirements.

R.Pordes agreed and asked that this become a separate document.

 

J.Gordon also asked that, after the discussions at the MB, the document be modified and a new version produced.

 

New Action:

6 Jan 2009 - F.Donno, with the help of J.Gordon and R.Pordes, will separate the requirements from the recommended implementations in the document about Reporting Installed Capacity.

 

3. Timeline for Implementation and Validation – There should always be enough time for implementation and validation of the data and of the results published in the reports; this is a human-intensive activity. This document will require a lot of work and verification, which must therefore be done in discussion with all teams throughout the implementation. For instance, the use of the CPU count has been introduced in the monthly report for the federations’ weighted reliability but is not written in any document.

 

A.Aimar noted that the need for weighted averages was discussed several times and agreed, and is in the High Level Milestones. Sites have actually asked for this (including B.Bockelman for OSG) since the first reports appeared. In addition, nothing changes for federations that do not yet provide the CPU count: they are still accounted as before, with the simple average of the reliability of all Sites belonging to the federation.
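
For clarity, the weighting presumably follows the usual form (the precise definition should be taken from the reliability report documentation):

Federation Weighted Reliability = Σ over Sites ( CPUCount(Site) × Reliability(Site) ) / Σ over Sites CPUCount(Site)

Federations that publish no CPU counts keep the simple average over their Sites, as stated above.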

 

5.   HL Milestones for 2009 (HLM20081125) – A.Aimar

 

The MB reviewed the High Level Milestones for 2009.

 

Note: Changes agreed during the discussion are in RED in the tables below.

5.1      VOBoxes

The VOBoxes-related milestones need to be updated, but they are already followed in the Action List.

 

ID

Date

Milestone

ASGC

CC IN2P3

CERN

DE-KIT

INFN CNAF

NDGF

PIC

RAL

SARA NIKHEF

TRIUMF

BNL

FNAL

VOBoxes Support

WLCG-07-04

Apr
2007

VOBoxes SLA Defined
Sites propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes

Aug
2008

Aug
2008

 

 

 

 

Aug
2008

 

 

 

 

 

WLCG-07-05

May 2007

VOBoxes SLA Implemented
VOBoxes service implemented at the site according to the SLA

Aug
2008

Aug
2008

 

 

 

Mar 2008

Aug
2008

 

Apr 2008

 

 

 

WLCG-07-05b

Jul 2007

VOBoxes Support Accepted by the Experiments
VOBoxes support  level agreed by the experiments

ALICE

n/a

 

 

 

 

 

n/a

 

 

n/a

n/a

n/a

ATLAS

 

 

 

 

 

n/a

n/a

 

 

 

 

n/a

CMS

 

 

 

 

 

n/a

 

 

n/a

n/a

n/a

 

LHCb

n/a

 

 

 

 

n/a

 

 

 

n/a

n/a

n/a

 

5.2      SCAS and gLexec

 

gLexec/Pilot Jobs 

WLCG-08-14

May 2008

Pilot Jobs Frameworks studied and accepted by the Review working group
Working group proposal complete and accepted by the Experiments.

ALICE

ATLAS

CMS

LHCb

 

Jan 2009

SCAS Solutions Available for Deployment
Certification successful and SCAS packaged for deployment

 

 Delete

Jan 2009

SCAS Verified by the Experiments
Experiment verify that the SCAS implementation is working (available at CNAF and NL-T1)

ALICE
n/a

ATLAS

CMS
n/a ?

LHCb

 

Avail + 1 month

SCAS + glExec Deployed and Configured at the Tier-1 Sites
SCAS and glExec ready for the Experiments.

 

 

 

 

 

 

 

 

 

 

 

 

Delete 

Feb 2009

SCAS + glExec Deployed and Configured at the Tier-2 Sites
SCAS and glExec ready for the Experiments.

 

 

Milestone: WLCG-08-14 – Working group on Pilot Jobs.

 

LHCb’s framework was approved long ago. The other VOs’ proposals are not yet officially accepted.

 

Milestone: SCAS Solutions Available for Deployment: Certification successful and SCAS packaged for deployment.

 

End of January was proposed as a milestone by M.Schulz and agreed by J.Templon.

Ph.Charpentier asked why the MB is discussing the schedule of the SCAS progress, and why the dates for SCAS are constantly moved forward.

A.Aimar replied that the reason is that SCAS was not delivered as planned (summer 2008) and it will now be followed at the MB level. The date will not be moved forward; HLM dates are never changed.

 

Milestone: gLexec Verified by the Experiments - Experiments verify that the SCAS implementation is working (available at CNAF and NL-T1)

 

M.Schulz noted that this must be verified in production, on a few Sites and at scale.

Ph.Charpentier proposed that this milestone be organized by the Architects Forum.

 

Milestone: SCAS + glExec Deployed and Configured at the Tier-1 Sites - SCAS and glExec ready for the Experiments.

 

Ph.Charpentier proposed that, once available, it should be installed at the Sites within one month.

 

Milestone: SCAS + glExec Deployed and Configured at the Tier-2 Sites - SCAS and glExec ready for the Experiments.

 

M.Schulz proposed that deployment at the Tier-2 Sites be left to the Experiments; the MB has no actual control over the Tier-2 Sites.

5.3      VO Specific Tests

SAM VO-Specific Tests

WLCG-08-08

Jun  2008

VO-Specific SAM Tests in Place
With results included every month in the Site Availability Reports.

ALICE

ATLAS

CMS

LHCb

 

J.Gordon asked that the Sites also sign up to these tests, if the tests are to be considered valid for the Sites too.

J.Templon suggested adding a new milestone that the Sites agree on those VO tests.

The issues should be discussed with I.Bird.

5.4      Tier-2 Reliability

Tier-2 Federations Milestones

WLCG-08-09

Jun
2008

Weighted Average Reliability of the Tier-2 Federation above 95% for 80% of Sites
Average of each Tier-2 Federation weighted according to the sites resources

See separated table of Tier-2 Federations.

 

WLCG 08-09 - Weighted Average Reliability of the Tier-2 Federation above 95% for 80% of Sites - Average of each Tier-2 Federation weighted according to the sites’ resources

For the moment the milestone above is not achieved and will be marked red.

A.Aimar will add the values reached in the last few months, during which Tier-2 reliability has been collected.

5.5      SL5 Milestones

SLC5 Milestones

 

Dec 2008

SLC5 gcc 4.3 (WN 4.1 binaries) Tested by the Experiments
Experiments should test whether the MW on SL5 supports their grid applications

ALICE

ATLAS

CMS

LHCb

 

Jan 2009

SLC5 Deployed by the Sites (64 bits nodes)
Assuming the tests by the Experiments were successful. Otherwise a real gcc 4.3 porting of the WN software is needed.

 

 

 

 

 

 

 

 

 

 

 

 

 

These SL5 milestones were added when the situation seemed clear. Experiments have expressed different timelines since then.

The SL5 move will now be discussed again by P.Mato and M.Schulz. They will propose a plan (see action list) and these milestones will be set according to that plan.

 

M.Kasemann stated that a milestone on SLC5 + gcc 4.3 installations at the Sites should be present. Even if the migration to SLC5 will be progressive, there should be an initial cluster of SLC5 machines. Full migration will take longer but is necessary. SLC5 should be installed with SLC4-compatible libraries for the Experiments to use.

 

Ph.Charpentier noted that it would be good, for testing with all main SEs, to have sites providing CASTOR, dCache and DPM with SLC5, gcc 4.3 and Python 2.5.

 

 

 

AOB

 

 

F.Hernandez asked about the status of the services at the Sites during the Christmas period.

The Experiments had already replied the week before. CERN and the Tier-1 Sites will announce their support level this week.

 

6.    Summary of New Actions

 

 

New Action:

6 Jan 2009 - F.Donno, with the help of J.Gordon and R.Pordes, will separate the requirements from the recommended implementations in the document about Reporting Installed Capacity.

 

New Action:

16 Dec 2008 - Experiments should comment on the VO-specific Reliability Reports