LCG Management Board

Date/Time

Tuesday 29 April 2008, 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=31115

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 2.5.2008)

Participants  

A.Aimar (notes), I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, A.Di Girolamo, M.Ernst, I.Fisk, S.Foffano, J.Gordon, F.Hernandez, M.Lamanna, H.Marten, P.Mato, G.Merino, A.Pace, M.Schulz, Y.Schutz, J.Shade, J.Shiers, O.Smirnova, R.Tafirout, J.Templon, J.White

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 6 May 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

 

Not done. The GridView team will produce the recalculation within the next week.

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Not Done. A.Aimar will send a message to the list in order to have the metrics ready by the F2F meeting.

 

-       Experiments should provide the read and write rates that they expect to reach, in clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.

 

Not Done. J.Templon suggested that the Experiments provide the same information in the format used by LHCb. Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf

 

-       31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite. J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

Ongoing. The equivalence of the OSG tests has not yet been officially confirmed.

 

A.Aimar received an email from D.Collados and will distribute it to the MB mailing list.

 

-       18 April 2008 - I.Bird and A.Aimar will propose new milestones to the Management Board.

 

Done. Later in this meeting.

 

-       15 Apr 2008 - Experiments should confirm that the alert/contact mailing lists are open to posts submitted from the sites.

 

-       30 Apr 2008 – Sites send to H.Renshall their plans for the 2008 installations, stating what will be installed for May and when the full capacity for 2008 will be in place.

 

H.Renshall will be absent for several weeks in May; Sites should therefore also put A.Aimar in CC.

 

3.   CCRC08 Update (Revised Agenda for June Post-Mortem; Slides) - J.Shiers

 

J.Shiers presented a summary of the status and progress of the recent CCRC08 activities and of what was discussed at the workshop.

3.1      Critical Services Summary

J.Shiers summarized the key techniques (D.I.D.O.) and the status of the deployed middleware. The techniques are known and widely used, in particular for the Tier-0 (“WLCG” and experiment services), but there are still quite a few well-identified weaknesses; an action plan is probably needed.

It is also important to measure what is delivered, with clear service availability targets.

 

Testing of the experiment critical services should follow the same guidelines.

There was an update on the work on the ATLAS services (to be extended to all 4 LHC experiments). For instance, for the top two ATLAS services, “ATLONR” and the DDM catalogs, a request was made during the DB sessions to increase the availability of the latter; a possible option is the use of DataGuard to another location in the CC.

3.2      Storage Issues

The workshop covered not only the baseline versions of storage-ware for the May run of CCRC’08, but also the timeline for the phase-out of the SRM v1.1 services and the schedule (and some details) of the SRM v2.2 MoU Addendum.

 

Storage performance, optimization and monitoring were discussed at the workshop. Some new metrics were agreed for the May run (illustrated by the sketch after this list), e.g.:

-       Sites should split the monitoring of the tape activity by production and non-production users;

-       Sites should measure number of files read/write per mount (this should be much greater than 1);

-       Sites should measure the amount of data transferred during each mount.
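
As an illustration only, the following minimal sketch computes the two per-mount figures from hypothetical per-mount records (user class, files transferred, bytes moved), split by production and non-production use; the record format and numbers are assumptions, not an agreed WLCG format.

    from collections import defaultdict

    # Hypothetical per-mount records: (user class, files transferred, bytes moved).
    # In practice each site would extract these from its own tape-system logs.
    mounts = [
        ("production", 42, 120e9),
        ("production", 3, 9e9),
        ("non-production", 1, 0.7e9),
    ]

    stats = defaultdict(lambda: {"mounts": 0, "files": 0, "bytes": 0.0})
    for user_class, files, nbytes in mounts:
        stats[user_class]["mounts"] += 1
        stats[user_class]["files"] += files
        stats[user_class]["bytes"] += nbytes

    # The agreed metrics: files read/written per mount (should be well above 1)
    # and data transferred per mount, split by production / non-production use.
    for user_class, s in stats.items():
        print(f"{user_class}: {s['files'] / s['mounts']:.1f} files/mount, "
              f"{s['bytes'] / s['mounts'] / 1e9:.1f} GB/mount")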

 

There is a clear willingness by the Experiments to work on addressing these problems; e.g. ALICE plans to use files of ~10 GB.

3.3      Baseline Services for CCRC08-May

Below is a summary of what should be deployed for May (very minor changes).

 

Implementation and issues addressed:

CASTOR (SRM v1.3-21, b/e 2.1.6-12):

-       Possible DB deadlock scenarios; srmLs return structure now conforms; various minor DB fixes; fix for leaking sockets when srmCopy is attempted; correct user mappings in PutDone; log improvements; better bulk deletion(?)

dCache (1.8.0-15 p1, p2, p3, cumulative; http://trac.dcache.org/trac.cgi/wiki/manuals/CodeChangeLogs):

-       T1D1 can be set to T1D0 if the file is NOT in a token space; a new PinManager is available (improved stability); Space Tokens can be specified on a directory, becoming the token used for writing if no other information is provided; dCache provides to the HSM script: directory/file, Space Token, Space Token Description (new)

DPM (1.6.7-4):

-       Issues fixed in 1.6.7-4: pool free space correctly updated after file system drain and removal; SRM 2.2 srmMkdir now creates directories that are group writable (if the default ACL of the parent gives that permission)

-       Known issues or points: no srmCopy is available; only round-robin selection of file systems within a pool; no transfer stream limit per node

StoRM (1.3.20; http://storm.forge.cnaf.infn.it/documentation/storm_release_plan):

-       Upgraded information providers, both static and dynamic; fix of the file size returned by srmPrepareToGet/srmStatusOfPtG for files > 2 GB; support for the ROOT protocol; “default ACLs” on Storage Areas automatically set for newly created files; srmGetSpaceMetaData bound with quota information; improved support for the Tape1Disk1 storage class with GPFS 3.2 and TSM

 

SRM 1.1 should be phased out.

Experience from May will presumably (re-)confirm that SRM v2.2 is (fully) ready for business. The removal of SRM 1.1 will be decided at June’s post-mortem workshop; a tentative target could be the end of June.

 

SRM Addendum Timeline

The agreed SRM 2.2 MoU Addendum dates are:

-       19 May – technical agreement with implementers

-       26 May – experiment agreement

-       2 June – manpower estimates / implementation timelines

-       10 June – MB F2F: approval!

 

The target for the production version is CCRC’09; therefore the test versions must be available well beforehand.

It is essential that representatives have the authority to sign off on behalf of their site / experiment.

3.4      CCRC08-Feb and CCRC08-May

The CCRC08 February run has already been extensively reviewed; for May, the middleware and storage-ware versions are all defined.

 

Numerous “operational” improvements need to be implemented:

-       A (consistent) Tier-0 operator alarm mailing list for all experiments, co-owned by IT and the experiments: vo-operator-alarm@cern.ch (ATLAS and ALICE currently use grid-alarm)

-       NL-T1 alarm for the Tier-1s – sites that cannot implement this should say so as soon as possible

New Action:

6 May 2008 - Sites should confirm to the MB that they will accept digitally signed email submissions to their alarm system.

 

-       Possible GGUS improvements – on-hold until after May…

-       Better follow-up on “MoU targets” and other issues at ‘daily’ meetings

-       Monitoring improvements – still on-going.

-       Clarification of roles / responsibilities of Tier2 coordinators / contacts.

 

 

Below are the patch levels agreed for the major middleware components.

 

3.5      Requirements on the Tier-2 Sites

Even after last week’s workshop, there are still questions regarding what the Tier-2 sites are required to do for CCRC’08.

 

The baseline middleware and storage-ware have just been shown. Other information can be found in the presentations from the experiments at the workshop. But all sites must stay tuned to the experiments’ wikis and other pages, which are the definitive source of experiment-specific needs.

 

The same is true for the Tier-1s and the Tier-0:

-       ALICE: http://twiki.cern.ch/twiki//bin/view/ALICE/CCRC08

-       CMS: T2 workflows & data transfers – in this presentation

-       LHCb: https://twiki.cern.ch/twiki/bin/view/LHCb/CCRC08

 

I.Bird added that the Collaboration Board of the week before had discussed the (lack of) participation of (some of) the Tier-2 sites.

The agreement of Sites and Experiments was that Tier-1 sites should take the lead in involving their Tier-2 sites.

 

J.Templon asked what exactly the Tier-2 sites need to receive. Some Tier-2 sites have asked very basic questions (like “what is an FTS channel?”).

J.Shiers replied that some information is available in the slides and each Experiment has prepared information for the Tier-2 sites (e.g. storage classes to implement, specific software to deploy, FTS setups, etc).

 

I.Bird added that the Tier-2 sites are expected to be proactive, to read the minutes of the MB and to participate in the GDB.

For operational support the sites should actively use the EGEE support channels.

3.6      Future of WLCG Operation

At the workshop it was clear that continuity is essential for WLCG: 2010 data-taking and 2009 re-processing cannot be placed at risk.

 

EGI_DS should engage pro-actively with those who are actively involved in today’s operations (and other aspects of production Grids) to define a smooth and timely transition. This is not the case today. Is it time for a letter from the WLCG Project Leader, the GDB chair and the EGEE director to EGI_DS?

 

There is the need to be (re-)assured of this by the June EGI_DS workshop (at CERN), with the planning well advanced by EGEE ’08.

 

One must also remember that “WLCG” and experiment operations will be needed in any case – they are not “replaced” by EGI operations.

3.7      Future Events

-       CCRC’08 post-mortem workshop 12-13 June at CERN

-       HEP application cluster at EGEE ’08. Around that time – overlapping with data taking – we will have to start thinking about 2009: middleware versions, storage-ware, DB services, resources, etc. and the plans for testing them!

-       CCRC’09 planning workshop 13-14 November at CERN

-       WLCG Collaboration workshop in Prague, prior to CHEP ’09 – number of parallel sessions? (DM? DB? OPS?)

 

-       Possible workshop prior to EGEE ’09

-       WLCG Collaboration workshop at CERN, Spring 2010

-       Pre-CHEP workshop in Taipei prior to CHEP ’10

-       WLCG Collaboration workshop at CERN, Spring 2011

-       Pre-CHEP workshop prior to CHEP ’12

-       EGI-related events?

 

I.Bird and the MB thanked J.Shiers for the excellent organization and the success of the WLCG Workshop of the week before.

 

4.   LCG-LHCC Referees Meeting (5 May 2008) (10') (Agenda) I.Bird

 

The meeting with the LHCC Referees is scheduled for Monday 5 May 2008 at lunchtime.

The agenda is defined as follows:

 

Monday 05 May 2008

12:00   Status of CCRC'08 – Jamie Shiers (CERN)
        Summary of the February phase, and preparations for May and later

12:40   Status of SRM deployments – Flavia Donno (CERN)
        Status of SRM v2.2, including experience from CCRC08, and preparations for data taking

13:00   CASTOR status – Tony Cass (CERN)
        CASTOR performance in the Tier-0 during CCRC'08

13:25   CASTOR development status – Alberto Pace (CERN)
        Overall situation with the CASTOR preparation for data taking

13:40   Middleware overview – Markus Schulz (CERN)
        Status of deployed middleware, updates required for May and data taking

14:10   High Level Milestone summary – Ian Bird (CERN)
        Status of existing milestones, proposed additions

5.   New Milestones and Future Topics (Future_Milestones_Topics)

One of the items to provide to the LHCC Referees is a first version of the High Level Milestones for the coming months.

5.1      Completed Milestones

The completed (and a few removed) milestones have been moved to page 3 of the HLM Dashboard.

 

Completed / Cancelled High Level Milestones
(Sites tracked: ASGC, CC IN2P3, CERN, FZK GridKa, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL)

WLCG-07-09 – Mar 2007 – 3D Oracle Service in Production: Oracle Service in production, and certified by the Experiments. (FNAL: squid frontier)

WLCG-07-10 – May 2007 – 3D Conditions DB in Production: Conditions DB in operations for ATLAS, CMS and LHCb; tested by the Experiments. (FNAL: squid frontier)

Site Reliability - June 2007

WLCG-07-12 – Jun 2007 – Site Reliability above 91%: considering each Tier-0 and Tier-1 site. Monthly targets: Apr 88%, May 88%, Jun 91%, Jul 91%, Aug 91%, Sept 91%.

WLCG-07-13 – Jun 2007 – Average of Best 8 Sites above 93%: eight sites should reach a reliability above 93%. Averages of the 8 best sites Apr-Sept 2007: Apr 92% – May 94% – Jun 87% – Jul 93% – Aug 94% – Sept 93%.

Procurement

WLCG-07-16 – 1 Jul 2007 – MoU 2007 Pledges Installed: to fulfill the agreement that all sites procure the 2007 MoU pledges by July 2007.

FTS 2.0

WLCG-07-18 – Jun 2007 – FTS 2.0 Tested and Accepted by the Experiments: in production at CERN and tested and accepted by each Experiment (ALICE, ATLAS, CMS, LHCb).

WLCG-07-19 – Jun 2007 – Multi-VO Tests Executed and Tested by the Experiments: scheduled at CERN for the last week of June (will be part of CCRC in February and May 2008).

WLCG-07-20 – Sept 2007 – FTS 2.0 Deployed in Production: installed and in production at each Tier-1 site.

BDII

WLCG-07-21 – Jun 2007 – BDII Guidelines Available: on how to install the BDII on a separate node. (EGEE-SA1: not requested)

WLCG-07-22 – Jun 2007 – Top-Level BDII Installed at the Site: for each Tier-1 site.

GLExec

WLCG-07-24 – Jul 2007 – Decision on Usage of gLExec and Guidelines to Follow. (GDB)

MSS Main Storage Systems

WLCG-07-28 – Sept 2007 – Demonstrated Tier-0 Performance (Storage, DM): demonstration that the highest throughput (ATLAS 2008) can be reached. (CERN Tier-0)

WLCG-07-28b – Sept 2007 – Demonstrated Tier-0 Export to Tier-1 Sites: demonstration that the highest throughput (ATLAS 2008) can be reached. (CERN Tier-0)

WLCG-07-29 – Feb 2008 – SRM: CASTOR 2.1.6/dCache in Production at T1: from the SRM Roll-Out Plan (SRM-20 to -21a).

WLCG-07-30 – Dec 2007 – SRM Implementations with HEP MoU Features: with the features agreed in the HEP MoU (srmCopy, etc.). (CASTOR, dCache, DPM)

WN and UI

WLCG-07-31 – Jun 2007 – WN Installed in Production at the Tier-1 Sites: WN on SL4 installed at each Tier-1 site, with the configuration needed to use SL4 or SL3 nodes. (NDGF, TRIUMF: n/a)

WLCG-07-32 – Jun 2007 – UI Certification and Installation on the PPS Systems. (EGEE-SA1-PPS; done: Jul 2007)

WLCG-07-33 – Aug 2007 – UI Tested and Accepted by the Experiments. (ALICE, ATLAS, CMS, LHCb)

Xrootd

WLCG-07-41 – Jul 2007 – xrootd Interfaces Tested and Accepted by ALICE. (ALICE)

SAM VO-Specific Tests

WLCG-07-39 – Sept 2007 – VO-Specific SAM Tests in Place: with results included every month in the Site Availability Reports. POSTPONED TO 2008 AND REPLACED BY A NEW MILESTONE (WLCG-08-08).

Site Reliability - Dec 2007

WLCG-07-14 – Dec 2007 – Site Reliability above 93%: considering each Tier-0 and Tier-1 site. Monthly targets: Aug 91%, Sept 91%, Oct 91%, Nov 91%, Dec 93%, Jan 93%, Feb 93%.

WLCG-07-15 – Dec 2007 – Average of Best 8 Sites above 95%: eight sites should reach an average > 95%. Averages of the 8 best sites Sept 2007 - Jan 2008: Sept 93% – Oct 93% – Nov 95% – Dec 96% – Jan 95% – Feb 96%.

 

5.2      Future Milestones

Pages 1 and 2 contain the milestones still to be completed and the new milestones proposed for 2008.

Sites are asked to comment on these milestones before the F2F meeting in May; the dates will not be officially shown to the Referees until they are approved.

 

29-Apr-08 – WLCG High Level Milestones – 2007
Legend: Done (green) – Late < 1 month (orange) – Late > 1 month (red)
(Sites tracked: ASGC, CC IN2P3, CERN, FZK GridKa, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL)

24x7 Support

WLCG-07-01 – Feb 2007 – 24x7 Support Definition: definition of the levels of support and rules to follow, depending on the issue/alarm.

WLCG-07-02 – Apr 2007 – 24x7 Support Tested: support and operation scenarios tested via realistic alarms and situations. (FZK GridKa: Apr 2008; INFN CNAF: Apr 2008)

WLCG-07-03 – Jun 2007 – 24x7 Support in Operations: the sites provide 24x7 support to users as standard operations. (FZK GridKa: Apr 2008; INFN CNAF: Apr 2008; PIC: Mar 2008; RAL: Mar 2008; SARA NIKHEF: Apr 2008)

VOBoxes Support

WLCG-07-04 – Apr 2007 – VOBoxes SLA Defined: sites propose and agree with the VO the level of support (upgrade, backup, restore, etc.) of the VOBoxes. (ASGC: Mar 2008; CC IN2P3: Apr 2008; PIC: Mar 2008)

WLCG-07-05 – May 2007 – VOBoxes SLA Implemented: VOBoxes service implemented at the site according to the SLA. (ASGC: Apr 2008; CC IN2P3: Apr 2008; CERN: Mar 2008; NDGF: Mar 2008; PIC: Mar 2008; SARA NIKHEF: Apr 2008)

WLCG-07-05b – Jul 2007 – VOBoxes Support Accepted by the Experiments: VOBoxes support level agreed by the experiments. ALICE: n/a at ASGC, PIC, TRIUMF, BNL, FNAL; ATLAS: n/a at NDGF, PIC, FNAL; CMS: n/a at NDGF, SARA NIKHEF, TRIUMF, BNL; LHCb: n/a at ASGC, NDGF, TRIUMF, BNL, FNAL.

VOMS Job Priorities
(The VOMS milestones below are suspended until the VOMS Working Group defines new milestones.)

WLCG-07-06b – Jun 2007 – New VOMS YAIM Release and Documentation: VOMS release and deployment; documentation on how to configure VOMS for sites not using YAIM. (EGEE-SA1)

WLCG-07-06 – Apr 2007 – Job Priorities Available at Site: mapping of the job priorities onto the batch software of the site completed and information published.

WLCG-07-07 – Jun 2007 – Job Priorities of the VOs Implemented at Site: configuration and maintenance of the job priorities as defined by the VOs; job priorities in use by the VOs.

Accounting

WLCG-07-08 – Mar 2007 – Accounting Data Published in the APEL Repository: the site is publishing the accounting data in APEL; monthly reports extracted from the APEL Repository.

MSS Main Storage Systems

WLCG-07-25 – Jun 2007 – CASTOR 2.1.3 in Production at CERN: MSS system supporting SRM 2.2 deployed in production at the site. (CERN Tier-0)

WLCG-07-26 – Nov 2007 – SRM: CASTOR 2.1.6 Tested and Accepted by the Experiments at all Sites: from the SRM Roll-Out Plan (SRM-16 to -19). (ALICE: n/a; ATLAS: Nov 2007; CMS: Nov 2007; LHCb: Nov 2007)

WLCG-07-27 – Nov 2007 – SRM: dCache 1.8 Tested and Accepted by the Experiments: from the SRM Roll-Out Plan (SRM-16 to -19). (ALICE: n/a; ATLAS: Nov 2007; CMS: Nov 2007; LHCb: Nov 2007)

WLCG-07-30b – May 2008 – SRM Missing MoU Features Implemented: with the full features agreed in the HEP MoU (srmCopy, etc.). (CASTOR, dCache, DPM)

gLite CE
(The gLite CE will not be deployed on SL4; the porting of the LCG-CE is in progress (21.9.2007).)

WLCG-07-35 – Sept 2007 – gLite CE Development Completed and Component Released. (EGEE-JRA1)

WLCG-07-36 – +4 weeks – gLite CE Certification and Installation on the PPS Systems. (EGEE-PPS)

CAF CERN Analysis Facility

WLCG-07-40 – Oct 2007 – Experiments Provide the Test Setup for the CAF: specification of the requirements and setup needed by each Experiment. (ALICE; ATLAS: May 2008; CMS: May 2008; LHCb: May 2008)

 

WLCG High Level Milestones - 2008
(Sites tracked: ASGC, CC IN2P3, CERN, FZK GridKa, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL)

OSG SAM Tests

WLCG-08-01 – Mar 2008 – OSG RSV Tier-2 Reliability Tests in Place: OSG tests equivalent to those in WLCG SAM. (OSG-RSV)

WLCG-08-02 – Jun 2008 – OSG Tier-2 Reliability Published: OSG RSV information published in the SAM and GOCDB databases; reliability reports include the OSG Tier-2 sites. (OSG-RSV)

MSS/Tape Metrics

WLCG-08-03 – April 2008 – Tape Efficiency Metrics Published: metrics are collected and published weekly.

Tier-1 Procurement

WLCG-07-17 – 1 Apr 2008 – MoU 2008 Pledges Installed: to fulfill the agreement that all sites procure their MoU pledges by April of every year. (ASGC: Apr 2008; CC IN2P3: Apr 2008; CERN: Apr 2008; INFN CNAF: CPU Apr 08, disk May 08; NDGF: CPU Apr 08, disk Sep 08; PIC: CPU Apr 08, disk Jun 08; RAL: March 2008; SARA NIKHEF: Nov 2008)

WLCG-08-04 – Sep 2008 – Status of the MoU 2009 Pledges: to fulfill the agreement that all sites procure their MoU pledges by April of every year.

WLCG-08-05 – 1 Apr 2009 – MoU 2009 Pledges Installed: to fulfill the agreement that all sites procure their MoU pledges by April of every year.

gLExec/Pilot Jobs

WLCG-08-04 – May 2008 – gLExec and Pilot Jobs Implemented at the Tier-1 Sites: metrics are collected and published weekly.

Tier-1 Sites Reliability - June 2008

WLCG-08-06 – Jun 2008 – Tier-1 Sites Reliability above 95%: considering each Tier-0 and Tier-1 site. Monthly targets: Jan 93%, Feb 93%, Mar 93%, Apr 93%, May 93%, June 95%.

WLCG-08-07 – Jun 2008 – Average of Best 8 Sites above 97%: the average of the eight best sites should reach a reliability above 97%.

SAM VO-Specific Tests

WLCG-08-08 – Jun 2008 – VO-Specific SAM Tests in Place: with results included every month in the Site Availability Reports. (ALICE, ATLAS, CMS, LHCb)

Tier-2 Federations Milestones

WLCG-08-09 – Jun 2008 – Weighted Average Reliability of the Tier-2 Federation above 95%: average of each Tier-2 Federation weighted according to the sites' pledges (see the separate table of Tier-2 Federations, and the sketch after this table).

WLCG-08-10 – Jun 2008 – Installed Capacity above 2008 Pledges of the Tier-2 Federation: capacity at each Tier-2 Federation vs. the Federation's pledges (see the separate table of Tier-2 Federations).

Tier-1 Sites Reliability - Dec 2008

WLCG-08-11 – Dec 2008 – Tier-1 Sites Reliability above 97%: considering each Tier-0 and Tier-1 site. Monthly targets: Jul 95%, Aug 95%, Sept 95%, Oct 95%, Nov 95%, Dec 97%.

WLCG-08-12 – Dec 2008 – Average of ALL Tier-1 Sites above 97%: the average across ALL Tier-1 sites should reach a reliability above 97%.
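
To make the WLCG-08-09 metric concrete, the minimal sketch below shows one way the pledge-weighted average reliability of a Tier-2 Federation could be computed; the federation sites, pledge figures and reliabilities are purely hypothetical.

    # Minimal sketch (hypothetical data): pledge-weighted average reliability of
    # a Tier-2 Federation, as in milestone WLCG-08-09. The pledge unit cancels out.
    sites = [
        # (site, pledged capacity, monthly reliability)
        ("T2-Site-A", 400, 0.97),
        ("T2-Site-B", 250, 0.91),
        ("T2-Site-C", 150, 0.99),
    ]

    total_pledge = sum(pledge for _, pledge, _ in sites)
    weighted = sum(pledge * rel for _, pledge, rel in sites) / total_pledge

    # 0.955 with the numbers above, i.e. this hypothetical federation would
    # meet the 95% target.
    print(f"Federation weighted reliability: {weighted:.1%}")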

 

 

 

New Action:

9 May 2008 - Sites should send comments about the New HL Milestones.

5.3      Topics to Discuss in the Future

There are issues to clarify at the MB in the next few weeks. They may result in future 2008 milestones:

 

-       VOMS-Accounting (user, groups, roles, etc)

-       Job Priorities

ATLAS should clarify these two topics above and whether they are still priorities.

-       HEP Benchmarking and move to new Accounting Unit

A decision will be taken after HEPiX, in May.

-       Site-Oriented Dashboards

Julia and J.Casey could report on them.

-       Cream CE

M.Schulz and F.Giacomini could report on this topic.

-       Job Reliability and Efficiency

M.Schulz and J.Casey  

-       File Transfers Efficiency

IT/DM

-       SAM Tests vs. MoU Commitments

SAM team and some sites

-       Accounting of Installed Capacity

The Tier-2 sites should be asked to do so.

 

J.Templon asked to add a topic on communication.

-       Define and track communication across Sites and Experiments.

 

6.   LCAS/LCMAPS configuration options for pilot jobs (Slides) J.White

 

“Pilot jobs” need to perform an identity switch on the WN; the mapping decision is therefore determined by LCMAPS.

6.1      Execution of LCAS/LCMAPS

In order to run gLExec on the WN (on a short time scale), there are two choices:

 

1. Run LCAS/LCMAPS at a site-central location.

-       Single point of failure for site.

-       Scalability not guaranteed.

-       Not certified for NFS.

-       MUCH testing needed.

-       Only positive point: Consistent mapping decisions.

 

2. Run LCAS/LCMAPS on each WN;

-       More durable.

-       No single point of failure

-       Possible clash of FQAN mappings over WNs.

-       Synchronization of mapping configurations.

 

Eventually a Site-Central Authorization Service (SCAS) should be used:

-       Technically best solution.

-       Consistent, synchronized mapping decisions.

-       Authorizes user credentials. Returns mapped uid/gid.

 

J.Templon noted that for solution 1, NFS is certified for thousands of WNs.

J.White clarified that the intention is to mount the runtime library on NFS, which can cause problems.

6.2      Centralized Configuration

Solution 2 can also have a centralized configuration:

Based on experience at NIKHEF, this should scale to O(1000) nodes.

Run LCAS/LCMAPS on each WN with centralized configuration. Have a shared home directory system (NFS?) and provide the following files:

-       /etc/grid-security/gridmapdir

-       /etc/grid-security/grid-mapfile

-       /etc/grid-security/groupmapfile

 

Alternatively, run LCAS/LCMAPS on each WN with local configurations (preferred; see the sketch after this list):

-       Use WN-local, VO-agnostic, generic pool accounts. (e.g. pool0000 to pool0032).

-       WN-local home directory. Quick recycling.

-       No need for shared gridmapdir etc.

-       Jobs cannot “escape” from the WN; the pilot job is contained and traceable.

-       Need to synchronize the LCAS/LCMAPS configurations (with YAIM or Quattor?)
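
As a purely illustrative sketch of the WN-local pool-account idea above (this is not the LCMAPS implementation, and the lease directory and account names are hypothetical), a node-local gridmapdir-style lease directory can keep the DN/FQAN-to-account mapping consistent on the WN and allow quick recycling:

    import os

    # Illustrative only: hypothetical WN-local lease directory and generic pool
    # accounts; the real mapping is performed by LCAS/LCMAPS, not by this script.
    LEASE_DIR = "/var/lib/poolmap"
    POOL_ACCOUNTS = [f"pool{i:04d}" for i in range(33)]  # pool0000 .. pool0032

    def lease_pool_account(dn: str, fqan: str) -> str:
        """Map a (DN, FQAN) pair to a WN-local pool account, reusing an existing
        lease if present, otherwise taking the first free account."""
        os.makedirs(LEASE_DIR, exist_ok=True)
        key = (dn + ":" + fqan).replace("/", "%2F")  # crude, gridmapdir-like encoding
        lease_path = os.path.join(LEASE_DIR, key)

        if os.path.exists(lease_path):               # credential already leased here
            return open(lease_path).read().strip()

        leased = {open(os.path.join(LEASE_DIR, f)).read().strip()
                  for f in os.listdir(LEASE_DIR)}
        for account in POOL_ACCOUNTS:
            if account not in leased:                # first free pool account on this WN
                with open(lease_path, "w") as fh:
                    fh.write(account)
                return account
        raise RuntimeError("no free pool account on this WN")

    # Example: a pilot-carried user credential gets a node-local account.
    print(lease_pool_account("/DC=ch/DC=cern/CN=Some User", "/atlas/Role=pilot"))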

 

Ph.Charpentier noted that the pool accounts cannot be VO-agnostic but must be VO-related.

Otherwise the accounting will not be correct and the VO-specific software will not be selected properly.

6.3      Logging

gLExec will log the uid switch based on the FQAN. Ideally it would use the (remote) syslog facility (gLExec >= 0.5.25).

 

In solution 1:

-       Will log FQAN/uid/gid switches from central configuration.

-       Consistent logging if syslogging centrally.

 

In solution 2:

-       Will log FQAN/uid/gid switches within the WN user space.

-       Some WN-specific information is also needed if syslogging is done centrally; otherwise a clash of uids from various WNs is possible. The syslog collector includes the machine name and a timestamp, which alleviates the problem (see the sketch below).

-       Accounting on the pilot FQAN/uid: the VOs are responsible, within their frameworks, for tracking the individual users.
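
As an illustration of central syslog collection (an assumption-based sketch, not gLExec’s own logging code: the collector host, port and message fields are hypothetical), a WN-side record of a uid switch could be forwarded like this, with the WN host name included so that records from different nodes do not clash:

    import logging
    import logging.handlers
    import socket

    # Hypothetical central collector; in practice the address would come from the
    # site configuration. The standard syslog UDP port 514 is assumed.
    COLLECTOR = ("syslog-collector.example.org", 514)

    handler = logging.handlers.SysLogHandler(address=COLLECTOR)
    # Include the WN host name so records from different WNs cannot clash,
    # even if the same pool uid is in use on several nodes at once.
    formatter = logging.Formatter(f"glexec-demo {socket.gethostname()}: %(message)s")
    handler.setFormatter(formatter)

    log = logging.getLogger("glexec-demo")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # Example record of an identity switch (illustrative fields only).
    log.info("uid switch: fqan=/atlas/Role=pilot pilot_uid=1234 target_uid=40567")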

6.4      SCAS Status

The SCAS client code has been completed and is being tested by the developers and by OSG.

 

The SCAS server code is written; it is important site-central code and is still being tested by the developers. The interoperability library is now stable (beta version).

 

The SCAS delivery estimates, as of 29 April 2008, are:

-       Between the second week of May and the end of June 2008.

-       It must be certified at the level of a WMS/CE.

 

I.Bird asked for a firm delivery date; the range above spans about 7-8 weeks.

J.Templon replied that the package is a crucial component and, if a bug is found, fixing it can take a long time.

The developers want to test it further themselves; the probable date for starting certification is the end of June.

I.Bird added that the Experiments could try the system even while it is not fully tested.

 

For short-term tests of gLExec on the WN, the suggestion is to run gLExec/LCAS/LCMAPS on the WN, with:

-       Centrally-maintained LCAS/LCMAPS configuration.

-       LCAS/LCMAPS configuration on WN.

-       Maintained through central service.

 

I.Bird asked whether some sites could test the solution with LCAS, since the SCAS solution can take several months.

J.Templon replied that SARA will do it once they have completed the installation of the 2007 pledges.

F.Hernandez added that IN2P3 is testing the possibility of running gLExec in “logging only” mode.

 

M.Schulz asked whether sites, once they have the LCAS solution in place, might not want to move to the future SCAS solution.

 

New Action:

9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

 

 

7.   AOB

7.1      Reporting Tier-2 Changes

J.Gordon reported that he and S.Foffano had asked the CB to remind the Tier-2 Federations of their responsibility to keep the MoU information up to date (changing site resources, renaming sites, etc.).

He also asked the Tier-1 sites to pass this information on to their Tier-2 sites.

 

8.   Summary of New Actions

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.


New Action:

6 May 2008 - Sites should confirm to the MB that they will accept digitally signed email submissions to their alarm system.

 

New Action:

9 May 2008 - Sites should send comments about the New HL Milestones.

 

New Action:

9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).