LCG Management Board

Date/Time:

Tuesday 11 March 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27473  

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 12.3.2008)

Participants:

A.Aimar (notes), I.Bird (chair), Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, M.Litmaath, U.Marconi, H.Marten, P.Mato, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 18 March 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

Missing in the previous MB minutes:

A clean-up and update of the SAM tests was proposed by M.Schulz and should be done in the next few weeks.

1.2      Tape Efficiency Metrics

Not all sites can produce the proposed metrics. Those that cannot should provide alternative metrics suitable for measuring and reporting on the performance of their MSS.

 

M.Ernst commented that it is not easy to obtain the metrics from HPSS because they have a single HPSS cache and cannot see the metrics for multiple writes. This information is not obtainable from HPSS, but this should not be an issue for BNL.

 

I.Bird suggested that the sites using HPSS could skip the metrics that are not relevant and propose their alternatives.

 

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       26 Feb 2008 - The Sites and Experiments should confirm to A.Aimar that they have updated the list of their contacts (correct emails, grid operators’ phones, etc). Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

Information confirmed only by:
Sites: ASGC, CERN, CNAF, FZK, NDGF, PIC, SARA
Experiments: ALICE

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

Not done. The GridView team was asked, but the recalculation still needs to be implemented.

-       29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (being lower than availability).

 

In progress. Being verified.

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Will be verified next week.
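The Tier-2 action above treats a reliability value lower than the availability value as suspicious. A minimal sketch of why, assuming the usual definitions (availability = time up / total time; reliability = time up / (total time - scheduled downtime)). The function names and the figures below are illustrative only and are not taken from GridView:

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period during which the site was up."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Fraction of the period, excluding scheduled downtime, with the site up."""
    usable_hours = total_hours - scheduled_down_hours
    if usable_hours <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / usable_hours


def looks_consistent(avail: float, rel: float, tolerance: float = 1e-9) -> bool:
    """Under the assumed definitions reliability can never be below availability."""
    return rel + tolerance >= avail


# Illustrative month: 720 hours, 600 hours up, 48 hours of scheduled downtime.
a = availability(600, 720)
r = reliability(600, 720, 48)
print(f"availability={a:.3f} reliability={r:.3f} consistent={looks_consistent(a, r)}")
```

Because scheduled downtime only shrinks the denominator, a published pair that fails this check points to a problem in the input data or in the calculation rather than in the site itself.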

 

3.   CCRC08 Update (Slides) – H.Renshall

 

H.Renshall presented the weekly update on CCRC08.

 

CCRC'08 is now in phase 1.5 (i.e. between the February phase 1 run and phase 2 in May). There are no formally coordinated activities or metrics in this phase as yet. Currently it involves individual Experiments doing functionality, throughput and stress testing of their computing model components and sites.

 

ALICE continue to exercise their data export reaching up to 300 MB/sec to their 5 major Tier1 sites, well above the required 60 MB/sec for p-p running. They plan to add RAL in the near future (this or next week).

 

ATLAS performed their M6 detector cosmics run, including data storage at Tier 0. 40 TB of data was stored on tape between Friday and Monday, and calibration streams were sent to the 4 planned Tier 2 sites of Naples, Rome, Munich and Michigan. This week they are functionality testing Tier-1 to Tier-1 data movement to PIC. They will then distribute 20 datasets of 100 files each to the other Tier-1 sites for multiple Tier-1 to PIC tests next week. This exercise will rotate around all Tier-1 sites over the next 2 months. Intensive MC production for CCRC’08 phase 2 continues.

 

G.Merino warned that next week PIC will have a long scheduled electrical shutdown. The ATLAS activities will not be possible during it and ATLAS should be aware of this.

 

ATLAS has also proposed a draft plan for March and April. See slide 5.

 

CMS continue preparations for the May run and a prior CMS global run in March. Reprocessing is going well, with some site issues (dCache was slow at FNAL and IN2P3 had a disk pool problem). In Tier-1 to Tier-1 commissioning only the RAL to ASGC pair is missing.

Their Tier 0 operations suffered from a one-week long instability in LSF which had hit a performance-related bug/feature in synchronising to the failover server. Automatic failover is currently disabled while this is investigated but CMS are suggesting a separate LSF instance for their Tier 0 operations.

 

M.Kasemann added that the agreement is that for the next 4 weeks they will try not to set up a new gateway for local submission and will try to use the single common gateway. If after 4 weeks this solution proves inadequate, it will be changed for May’s CCRC.

 

LHCb are preparing the workflow for their stripping jobs to be part of the 4 weeks steady running at nominal rate in May. They are setting up to evaluate and act on CPU time remaining for their grid jobs and setting up SRMv2 endpoints to go into their SAM test suite.

 

Future meetings coming up are the following:

 

-       Next CCRC’08 Face-to-Face Tuesday 1st April:
http://indico.cern.ch/conferenceDisplay.py?confId=30246

Site focused session in the morning then Experiment and Service focused session in the afternoon.

-       21st – 25th April WLCG Collaboration Workshop (Tier0/1/2) in CERN main auditorium:
http://indico.cern.ch/conferenceDisplay.py?confId=6552

Possible themes:

WLCG Service Reliability: focus on Tier2s and progress since November 2007 workshop (1 day?)

CCRC'08 & Full Dress Rehearsals - status and plans (2 days?)

Operations track (2 days, parallel)

Analysis track (2 days, parallel)

-       12th – 13th June CCRC’08 Post Mortem:
http://indico.cern.ch/conferenceDisplay.py?confId=26921 is in preparation.

 

J.Templon commented that the Site-focused session needs the presence of the Experiments. The session is focused on presenting how the sites work and how best to use their resources. For the session to be useful, the Experiments’ presence is therefore really necessary.

 

J.Templon noted that all the data that ALICE is sending to SARA consists of “0” data, which is either a mistake or means that it is not real ALICE raw data but just test data for the transfer.

 

4.   Update of the HL Milestones (HLM 11.03.2008)

 

The MB verified all due milestones in the High Level Milestones dashboard HLM 11.03.2008.

 

Here is the updated dashboard (HLM 15.3.2008), reflecting the discussion below:

 

 

WLCG-07-01 (Feb 2007): 24x7 Support Definition
Definition of the levels of support and rules to follow, depending on the issue/alarm

WLCG-07-02 (Apr 2007): 24x7 Support Tested
Support and operation scenarios tested via realistic alarms and situations

WLCG-07-03 (Jun 2007): 24x7 Support in Operations
The sites provide 24x7 support to users as standard operations

 

WLCG-07-01

-       FZK: H.Marten confirmed that it will be ready by the end of March 2008.

 

WLCG-07-02

-       ASGC: Done.

-       CNAF: 24x7 support is provided for all critical services but not for the non-critical ones. The timescale is by the end of April.

-       PIC: Done.

-       RAL: Not represented at the meeting.

 

WLCG-07-03

-       ASGC: Done.

-       NDGF: Done.

-       SARA: In the process of changing to a single SARA/NIKHEF help desk. It will be ready by end of April.

 

WLCG-07-04 (Apr 2007): VOBoxes SLA Defined
Sites propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes

WLCG-07-05 (May 2007): VOBoxes SLA Implemented
VOBoxes service implemented at the site according to the SLA

WLCG-07-05b (Jul 2007): VOBoxes Support Accepted by the Experiments
VOBoxes support level agreed by the experiments (ALICE, ATLAS, CMS, LHCb)

 

WLCG-07-04/05/06

-       ASGC: No SLA defined yet. Will be done by March.

-       IN2P3: Not defined yet. After the proposal the steps should be quick because the procedures are actually already in place.

-       CERN: The VOBox SLA is being discussed actively with the experiments so we are on the way to turning this green.

-       FZK: Only the agreement from CMS is missing.
M.Kasemann commented that this does not seem to be a problem, but the CMS responsible will reply officially.

-       NDGF: For ATLAS there are no VO Boxes to run at the sites. For ALICE they have 7 VO Boxes and the SLA is being prepared.

-       PIC: Done for LHCb. ATLAS has no VOBoxes. For CMS an SLA similar to the LHCb one is being proposed and should be done by the end of March.

-       SARA: Document is ready but not implemented. Will be checked in the next two weeks.

 

WLCG-07-08 (Mar 2007): Accounting Data Published in the APEL Repository
The site is publishing the accounting data in APEL. Monthly reports extracted from the APEL Repository.

 

WLCG-07-08

-       CERN: the uploading of accounting data is now probably OK, but we are seeing discrepancies of around 10% between the APEL and local accounting, most probably due to the single normalisation factor used by APEL.

 

WLCG-07-16 (1 Jul 2007): MoU 2007 Pledges Installed
To fulfill the agreement that all sites procure the 2007 MoU pledges by July 2007

WLCG-07-17 (1 Apr 2008): MoU 2008 Pledges Installed
To fulfill the agreement that all sites procure their MoU pledges by April of every year

 

WLCG-07-17

-       IN2P3: The supplied CPU material was not adequate and had to be sent back to the supplier. Hopefully it will be ready for May.
Purchases of more than 1M EUR will take one month longer than before.

-       CERN: Also has delivery problems, and a supplier had to be replaced. Will be ready by the end of April.

-       CNAF: Will have the CPUs by mid-May and the storage by the beginning of May. The delay is due to administrative issues.

-       NDGF: CPU will be there by April and storage by September.

-       PIC: Will have CPU by the end of April and storage by June 2008.

 

WLCG-07-19 (Jun 2007): Multi-VO Tests Executed and Tested by the Experiments
Scheduled at CERN for last week of June

 

WLCG-07-19

-       The tests are being done in CCRC'08, and some have been done recently.

 

WLCG-07-26 (Nov 2007): SRM: CASTOR 2.1.3 Tested and Accepted by the Experiments
From the SRM Roll-Out Plan (SRM-16 to -19)

WLCG-07-27 (Nov 2007): SRM: dCache 1.8 Tested and Accepted by the Experiments
From the SRM Roll-Out Plan (SRM-16 to -19)

 

CASTOR is now at version 2.1.6 instead, but it was tested only at CERN, by ATLAS and CMS.

 

WLCG-07-28b (Sept 2007): Demonstrated Tier-0 Export to Tier-1 Sites
Demonstration that the highest throughput (ATLAS 2008) can be reached.

 

WLCG-07-28b: Done

 

WLCG-07-39 (Sept 2007): VO-Specific SAM Tests in Place
With results included every month in the Site Availability Reports.

 

WLCG-07-39: This is not complete yet and needs to be reviewed in the next few weeks.

 

WLCG-07-40 (Oct 2007): Experiments Provide the Test Setup for the CAF
Specification of the requirements and setup needed by each Experiment

 

WLCG-07-40

-       CMS: They ramped up the resources but will do stress tests in May.

-       LHCb: This means running analysis at CERN and will be done in May 2008.

 

WLCG-08-01 (Mar 2008): OSG RSV Reliability Tests in Place
OSG tests equivalent to those in WLCG SAM and results available via GridView

 

WLCG-08-01

-       OSG: There will be a VDT release with the SE tests. The Availability and Reliability calculations are progressing with the SAM team.

 

J.Templon added that it seems that the issues are minor and will be verified in April when D.Collados is back.

 

Action:

Verify whether there are issues with the NDGF SAM tests. Some comments from J.Templon and D.Collados were not answered by M.Eller.

 

5.   Multi-Users Pilot Jobs Working Group (Slides) - M.Litmaath

 

 

The Pilot Jobs Frameworks working group, launched by the GDB, was mandated by WLCG MB on Jan. 22, 2008.

 

Its mission is to:

-       Review security issues in the pilot job framework of each experiment.
Pilot jobs are taken as multi-user in this context

-       Define a minimum set of security requirements

-       Advise on improvements

-       Consider the use of a common library or tool set for proxy management, although this seems unlikely.

-       Report to the GDB and MB within a time frame of a few months

 

The members of the working group are:

-       ALICE: Predrag Buncic

-       ATLAS: Torre Wenaus

-       CMS: Igor Sfiligoi

-       LHCb: Andrei Tsaregorodtsev

-       EGEE: David Groep

-       FNAL: Eileen Berman

-       OSG: Mine Altunay

-       WLCG: Maarten Litmaath (chair)

 

Three phone conferences have been held, and a 4th call is planned for the end of March (Friday March 28).

 

The discussion progresses mostly via the mailing list.

Each experiment is to provide a document about their system:

-       LHCb were the first, and the next version will incorporate feedback from the discussions so far. They already had the document before the meeting and have set the tone for the quality and content of the documents.

-       CMS provided a first version last week

-       ALICE and ATLAS needed more time and have not provided any document yet.

 

A security questionnaire is being discussed

-       Currently at v0.4

-       Agreement on the relevance/scope of a question is not always evident

-       Each document should provide the Experiment’s answers in an annexe.

 

Some experiments do not agree on some questions.

-       E.g. How user tokens are used by the proxies, from submission until the job is started on the WN. What happens if the job crashes and how the clean-up is done?

-       The Experiments replied that these requirements are not imposed on the general gLite components (e.g. WMS has a lot of proxies).

 

M.Schulz noted that the TCG has launched the security verification of LFC and other components. WMS will be reviewed as soon as the next version is out.

M.Litmaath added that the fact that some gLite components have not yet been verified is not a good reason not to verify the security of the VOs’ frameworks.

 

I.Bird proposed that VOs that provide the document and the questionnaire and pass the security check should be allowed to use their framework, while those not passing or not providing the information should be on hold until they do so.

M.Litmaath replied that it is possible to configure gLExec to allow only some groups or users, but this will require configuration at each single site. In practice this is very difficult to achieve.

 

I.Bird replied that the PJF working group should report on each VO separately and then the GDB and MB could decide what to do.

 

J.Gordon added that in the future other applications (Bio-Med, etc) should go through the same verifications (but this is not a WLCG matter).

 

Ph.Charpentier noted that the sites should already have gLExec installed so that they are ready when the solutions are approved.

J.Templon added that gLExec can be configured at the sites only if they define all the details, and that the temporary space can be created in different ways at the sites (e.g. job’s subdirectory, permissions to protect the proxy files, etc).

Ph.Charpentier replied that guidance is needed from the gLExec experts.

 

M.Litmaath added that all these issues are covered by the discussions that are taking place in the working group.

 

I.Bird proposed that a general recommendations document on security should be provided even before the frameworks are all certified.

 

Ph.Charpentier proposed that one site could provide an example installation so that all VOs can test the environment and the configuration in depth before it is deployed elsewhere.

 

 

 

 

No AOB.

 

6.   Summary of New Actions

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.