LCG Management Board

Date/Time:

Tuesday 24 July 2007 16:00-17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=17193

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 19.7.2007)

Participants:

A.Aimar (notes), O.Barring, I.Bird, N.Brook, L.Dell’Agnello, T.Doyle, M.Ernst, F.Hernandez, I.Fisk, J.Gordon (chair), M.Kasemann, P.Mato, G.Merino, P.Nyczyk, R.Pordes, H.Renshall, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 7 August 2007 16:00-17:00 - Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

N.Brook asked to change the date of the current LHCb requirements from “until 2009” to “2010” in section 1.2 of the minutes of last week.

Note: Change done.

 

Minutes of the previous meeting approved.

1.2      SRM Roll-out Plan Update (SRM Roll-out Plan; Status Update) - H.Renshall

H.Renshall updated the status of the milestones of the SRM2 Roll-out plan.

 

All due milestones are completed.

The sites have done their SRM installations; the issues found have been reported to the developers and fixed.

 

See the details in the text below.

 

-------------------------- 
Here is the status for the MB heads-up today, 24 July 2007:
 
Due 11/07 and completed by MB of 17 July:
 
SRM-01: done (FZK)
SRM-02: done (IN2P3)
SRM-05: done (CERN)
SRM-09: done (Edinburgh)
 
Due 11/07 and not completed by MB of 17 July:
 
SRM-08: done except storage tokens still to be configured (LAL)
SRM-03: done. Exposed known dCache bug keeping data as 'precious'
        after migration to hpss. Fix is already available (BNL)
 
Due 18/07:
 
SRM-04 done (SARA)
SRM-06 done (NDGF)
SRM-07 done. Both CASTOR and STORM endpoints published (CNAF)
 
Summary: all sites meeting the schedule.
 
Due 31/07:
 
SRM-10 testing experiment scenarios with experiment certificates
       - action on LHCb to assign lhcb/lcgprod role to Lana and Flavia
       (for Lana will be done today, 24 July)
       - Lana will send requested information today to LHCb so that they
can prepare for their testing
SRM-13 definition of tests, incl. SRM V1
       - See ATLAS and LHCb plans on GSSD page:
https://twiki.cern.ch/twiki/bin/view/LCG/GSSD
 
Replies and more details in the web archive:
 
https://mmm.cern.ch/public/archive-list/s/storage-classes-wg/
-------------------------- 

 

1.3      VO-specific SAM Results (June and July 2007) (VO SAM July 2007; VO SAM June 2007) - A.Aimar

A.Aimar showed the VO-specific SAM results for June and July 2007. Not all VOs are providing SAM tests and therefore some of the values are currently not relevant. More details are in the talk by P.Nyczyk.

 

The goal is to start to publish the general and VO-specific monthly reliability data and then also search for correlations with the job reliability data.

 

J.Gordon asked for the comparison of VO tests and general SAM tests side-by-side for each site.

1.4      QR Report for 2007 Q2 (April-July 2007) (QR 2007Q2 to complete) - A.Aimar

A.Aimar reminded the MB that the QR reports should be completed and sent by Monday 30 July 2007 so that they can be reviewed during August 2007.

1.5      Computing Capacity Requirements

L.Robertson had distributed the status about the upgrade of the computing capacity required by the Experiments.

 

----------------------------- 
Following the action placed on me at the MB meeting of 17 July I sent an email to the computing coordinators 
asking about the availability of their revised computing capacity requirements and the period that would be covered.
 
I have not received formal replies in all cases, but my understanding is as follows:
 
 
ALICE - as stated by FC in last week's meeting - expect revised requirements to be available this week, up to 2012
CMS - working on requests to 2012; expect to complete by end of month or early August.
ATLAS - no contact, but last October's numbers are already up to 2012.
LHCb - no date, but only until 2010.
 
Chris Eck confirmed that several agencies refuse to provide capacity planning data if there are no formal requirements.
----------------------------- 
 

N.Brook added that LHCb will provide the estimations to 2012 before the October CRRB.

The requirements for 2008 and 2009 in the Megatable are up to date for LHCb.

 

M.Kasemann added that CMS had also not changed their estimates for 2008 and 2009.

 

1.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 10 July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

Not done. L.Dell’Agnello will distribute a summary by end of July.

  • 12 July 2007 - L.Robertson will appoint assessors to review the equivalence of the OSG tests and the WLCG test set agreed in April 2006.

Not done. Waiting for a candidate reviewer to return from holiday.

  • 12 July 2007 - Progress of the Job Priorities working group and the progress of the working solution for ATLAS. D.Liko has been asked to provide a summary, on status and plans of the JP Working Group, which will be distributed to the MB by I.Bird.

Not done.

2.    SAM Sites Reports (Site Reports; Slides) - P.Nyczyk

 

 

In the previous MB meeting some sites had reported problems with the SAM results and expressed doubts about how downtimes were taken into account in the SAM availability calculations. For this reason P.Nyczyk participated in the meeting to explain the SAM issues of June 2007 and to summarise the status of the VO-specific SAM tests.

2.1      SAM-related Issues

Submitter DN Problem - Affected sites: ASGC, PIC, SARA and TRIUMF (1-5 June)

At the beginning of June SAM had to switch the certificate used for test submission (the old certificate was expiring).

The switch should have been “transparent”, but at the same time we forced the use of the SGM role (more powerful).

After this operation a “directory permissions” problem on DPM occurred, because of wrong ACL settings.

 

Scheduled downtime anomalies - Affected site: IN2P3 (11, 19/20 June)

Bug: the current availability calculation algorithm sometimes misses (or wrongly interprets) scheduled maintenance periods.

Solution: fixed in new GridView release, not yet in production (serious algorithm change, not yet approved?).
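To illustrate why the treatment of scheduled maintenance matters for the published figures, the sketch below computes availability and reliability from hourly site statuses. It is not the GridView algorithm: the status labels, the function name and the rule that scheduled maintenance is excluded from the reliability denominator are assumptions made for this example.

```python
from collections import Counter


def availability_and_reliability(hourly_status):
    """Return (availability, reliability) for a list of hourly statuses.

    Assumed labels (hypothetical, not taken from SAM or GridView):
    "UP", "DOWN" (unscheduled failure) and "SCHED" (scheduled maintenance).
    """
    total = len(hourly_status)
    if total == 0:
        raise ValueError("no status samples")
    counts = Counter(hourly_status)
    up = counts["UP"]
    scheduled = counts["SCHED"]
    availability = up / total
    # Scheduled maintenance is removed from the denominator, so announced
    # downtime does not lower the reliability figure.
    usable = total - scheduled
    reliability = up / usable if usable else 0.0
    return availability, reliability


# One day: 18 h up, 4 h scheduled maintenance, 2 h unscheduled failure.
day = ["UP"] * 18 + ["SCHED"] * 4 + ["DOWN"] * 2
availability, reliability = availability_and_reliability(day)
print(f"availability={availability:.2f} reliability={reliability:.2f}")
```

In this example the reliability is 0.90. If the four maintenance hours were wrongly interpreted as unscheduled failures, the same day would yield 0.75, which shows how a missed maintenance period can distort a site's result.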

 

FNAL: OPS vs. CMS results - Affected site: FNAL (several days during June 2007)

FNAL remarked several times in their site report: “USCMS was fully operational, test defect”.

This is not a useful explanation for investigating the issue: which type of service was affected? Which test?

In fact, the CMS-specific test results also show problems in June (Section 1.3).

One example is the job submission problem (CMS) on 24 June. The test is not critical for CMS and fails for OPS, where it is considered critical.

In some cases the CMS tests are executed with the OPS VO and this should be reviewed.

The suggestion is that the reports should be generated on a per-VO basis and that the critical test selection for CMS should perhaps be reviewed.
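The effect of the critical test selection on a per-VO report can be sketched as follows. This is only an illustration: the test names, the outcomes and the rule that a site passes when all of its critical tests pass are assumptions, not the actual SAM or FCR configuration.

```python
def site_status(results, critical_tests):
    """Return True if every critical test that was run has passed.

    results: mapping of test name -> True (passed) / False (failed).
    critical_tests: set of test names that the VO treats as critical.
    """
    return all(results[name] for name in critical_tests if name in results)


# Hypothetical test names and outcomes, for illustration only.
results = {"job-submit": False, "replica-mgmt": True, "srm-put": True}

critical = {
    "OPS": {"job-submit", "replica-mgmt", "srm-put"},
    "CMS": {"replica-mgmt", "srm-put"},  # job submission is not critical here
}

for vo, tests in sorted(critical.items()):
    state = "OK" if site_status(results, tests) else "FAILED"
    print(f"{vo}: {state}")
```

With the same raw results the site is reported as failed for OPS and as OK for CMS, which is why a site report based on one VO's selection can disagree with another.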

 

Minor Issues - Affected sites: CERN and PIC.

Temporary Tomcat downtimes reported by CERN-PROD (20-21 June)

Possible glitches of the BDII monitoring, presumably due to Gstat, reported by PIC (14 June).

2.2      Status of VO specific SAM monitoring

 

ATLAS

-       Submits default set of tests for all services

-       Custom critical test selection

-       VO specific tests development is in progress

 

ALICE

-       Custom VOBOX monitoring suite publishing to SAM

-       SAM Programmatic Interface used to feed Monalisa.

-       The availability numbers for ALICE are invalid for now because the real SAM tests are not executed.

 

CMS

-       Submits default set of tests extended by a few VO-specific tests for CE and SRM

-       Custom critical test selection, but relies on OPS tests and doesn’t use VO-specific ones

-       There are few VO-specific tests for the time being.

-       The values for CMS are representative of the CMS availability.

 

LHCb

-       Submits a custom set of VO-specific tests (CE, SRM), all of which are critical tests.

-       Custom critical test selection

-       The values are representative of the LHCb availability.

 

J.Gordon noted that on Slide 5 CMS is failing both on the OPS and the CMS VO execution.

I.Fisk replied that in this case the behaviour is the same, but as the CMS tests are executed less frequently the results can differ. In addition, a test may not be critical for CMS while being critical in the general SAM tests.

 

Issue to Follow:

The list of critical tests in SAM should be re-evaluated.

 

J.Gordon noted that a site should pass all the tests because the general SAM tests check general conditions that will also influence the CMS availability. The SAM tests should be applied to all sites in the same way; otherwise sites will pick the tests they want and the tests will no longer be a common standard.

 

I.Fisk added that services that are not relevant to CMS should not be counted. The issues should be clarified by comparing the results over the coming months.

 

N.Brook reported that LHCb is working with the SAM team in order to solve the issues about retrieving log files of installations.

P.Nyczyk added that SAM is not designed for verifying installation logs, but that a workaround was found.

 

F.Hernandez asked where the VO-specific tests are visible to the sites and where the tests are described.

P.Nyczyk replied that the information on each test and about the tests selected by each VO (via the FCR tool) is available and will be distributed to the MB list. The test results can be seen using GridView.

 

R.Tafirout asked why the SAM tests do not have a retry. The errors are usually sporadic and a retry would succeed, so it would be a good idea to introduce one.

I.Bird noted that a “retry” could be introduced in the tests only if the applications have such a mechanism implemented; otherwise the retry would artificially increase the reliability of the sites. The goal of the tests is to show realistically the reliability seen by users' jobs, not to reach the target artificially.
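The size of the effect described by I.Bird can be estimated with a short calculation. The sketch assumes that each attempt fails independently with the same probability and that a test is recorded as passed if any attempt succeeds; both are simplifying assumptions, and the 10% failure rate is an arbitrary example value.

```python
def pass_probability(p_fail, retries):
    """Probability that a test is recorded as passed.

    Assumes each attempt fails independently with probability p_fail and
    that the test counts as passed if at least one attempt succeeds.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if retries < 0:
        raise ValueError("retries must not be negative")
    return 1.0 - p_fail ** (retries + 1)


# Arbitrary example: 10% of single attempts fail.
for retries in (0, 1, 2):
    print(retries, round(pass_probability(0.10, retries), 4))
```

Under these assumptions a single retry turns a 10% failure rate into a reported 1%, while a user job without a retry mechanism would still fail 10% of the time.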

 

O.Barring noted that when the SE at CERN is stopped for interventions, the CE continues to run the replica management test, which then always fails even though the rest of the CE is working and being used.

P.Nyczyk replied that the test checks the “access from CE to SE” and is a legitimate CE test. Real jobs need and use such access; therefore the only solution when the CERN SE is unavailable is to change the default SE to another SE, so that the CE test still works.

 

Action:

A.Aimar will distribute the information on how to find information about the SAM and VO-specific tests and FCR.

 

3.    Status of Automatic Accounting (Paper) - J.Gordon

 

 

 

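Summary of Tier1 Accounting Status 2007

[Table: one row per month (March, April, May, June) and one column per site (ASGC, BNL, CERN, CNAF, FNAL, FZK, Lyon, NDGF, NIKHEF, PIC, RAL, TRIUMF); the colour-coded cell values are not reproduced.]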

J.Gordon showed the summary of the Accounting Status since March 2007.

 

In June there are six sites publishing correctly; in May there were eight.

The issues with CERN and CNAF should be solved. For NIKHEF it is not clear why there is no value in the table.

 

For the moment we will continue with the manual verification of the values, in parallel with the automatic accounting procedures until all sites report correctly.

 

F.Hernandez noted that Lyon is not green because they inadvertently reported Lyon Tier2 data with the Tier1. The Tier1 data was published correctly.

 

 

4.    AOB

 

 

Meetings in August

The MB agreed that in August the MB meetings will be held every two weeks (i.e. 7 and 21 August).

 

5.    Summary of New Actions

 

 

 

Action:

31 Jul 2007 - A.Aimar will distribute the information on how to find information about the SAM and VO-specific tests and FCR.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.