LCG Management Board

Date/Time

Tuesday 6 May 2008, 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=31116

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 9.5.2008)

Participants  

A.Aimar (notes), D.Barberis, O.Barring, L.Betev, I.Bird (chair), D.Britton, Ph.Charpentier, L.Dell’Agnello, A.Di Girolamo, J.Gordon, M.Lamanna, H.Marten, P.McBride, G.Merino, A.Pace, R.Pordes, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 13 May 2008 16:00-17:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

During the week G.Merino mailed the MB asking whether the previous meeting had discussed a standard name for the sites’ “alert mailing list”.

 

The MB had only agreed on standard names for the Experiments. The standard is vo-alarm@cern.ch, but no standard was defined for the sites.

Having a standard email address format at all sites seems difficult because each site has its own conventions, and the domain names do not always match the name of the WLCG Tier-1 site. Therefore, to be sure, one should in any case check the address on the contact page.

 

J.Shiers added that after CCRC08 the situation will be reassessed, including possible communication problems. For now the usage of mailing lists is the only possible solution.

H.Marten reported that a solution for sending emails automatically will be available in GGUS in Release 7. He also suggested that a solution based on web services would allow external tools to manage and follow up the tickets, integrating them into the sites’ dashboards.

 

J.Templon proposed that the alarm lists be limited to a fixed group of (four) users (DNs) in each Experiment. The MB agreed.

 

Decision:

The MB agrees that defining a standard naming scheme for the sites’ alarm mailing lists is not possible. Sites should instead keep the usual contact page up to date with the correct email address.

 

New Action:

16 May 2008 - Each Experiment proposes 4 users who can raise alarms at the sites and are allowed to mail to the sites’ alarm mailing lists.
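
For illustration only, the restriction to four users (DNs) per Experiment could be checked along the lines of the minimal sketch below. It assumes a hypothetical in-memory table that maps each Experiment to its authorised DNs. The DN strings, the `ALARM_USERS` table and both helper functions are invented for this example; they are not part of GGUS or of any site's mailing list software.

```python
MAX_ALARM_USERS = 4  # limit agreed by the MB: four users (DNs) per Experiment

# Hypothetical allow-list; the DNs below are placeholders, not real users.
ALARM_USERS = {
    "atlas": {
        "/DC=example/CN=atlas alarm user 1",
        "/DC=example/CN=atlas alarm user 2",
    },
    "lhcb": {"/DC=example/CN=lhcb alarm user 1"},
}


def validate_alarm_users(table, limit=MAX_ALARM_USERS):
    """Return the Experiments whose list exceeds the agreed limit."""
    return sorted(vo for vo, dns in table.items() if len(dns) > limit)


def may_raise_alarm(vo, dn, table=ALARM_USERS):
    """Return True if the DN is on the alarm list of the given Experiment."""
    return dn in table.get(vo, set())
```

A site-side filter could call `may_raise_alarm` on the sender's DN before accepting a message, and `validate_alarm_users` when the lists are updated.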

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

 

Done. The BNL values have been recalculated and will be included in the Tier-1 reliability report.

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Not Done. A.Aimar will send a message to the list in order to have the metrics ready by the F2F meeting.

 

A.Aimar noted that some sites fill the metrics wiki every week, some are updating it irregularly and others have never provided any metric information. Here is the link to the wiki page: https://cern.ch/twiki/bin/view/LCG/MssEfficiency

 

I.Bird reported that the LHCC Referees are asking for these MSS metrics in order to see the efficiency of the Experiments at the different Sites. He asked the sites that are not reporting regularly to comment on the reasons.

 

Some of the sites commented on their (lack of) progress on tape metrics:

-       FZK: H.Marten explained that those metrics at FZK still need to be implemented and will not be ready for CCRC08-May. They could publish other metrics in the wiki page, but he would first like to make sure that the data is correct.

-       ASGC: Have received the scripts from CERN in order to extract the metrics from their CASTOR installation. They still have to modify them in order to make them work with their specific setup.

-       PIC: Has not updated the wiki page since February. The collection and processing of the logs is a lengthy manual procedure. They will try to update it more frequently.

-       SARA: As at PIC, the manual procedure has not been done for a few weeks.

-       IN2P3 was not present but F.Hernandez sent an email after the meeting:

FR-CCIN2P3 is one of the sites not yet publishing this information. For your information, we are currently collecting the data for extracting the proposed metrics as closely as possible. The work on the tools to automate the extraction of the relevant information and generation of the metrics was more time-consuming than what we initially anticipated. We expect to have this initial phase finished by the end of this week and start publishing the available data in the wiki by the end of next week.

 

-       Experiments should provide the read and write rates that they expect to reach, in terms of clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.

 

Not Done. J.Templon suggested that the Experiments provide the same information in the format used by LHCb. Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf

 

LHCb: Done.

ATLAS: D.Barberis reported that ATLAS considers that it has replied to this action at its Jamboree meeting.

CMS: P.McBride reported that an email explaining the rates and the custody of data will be sent to the sites.

 

-       31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite.

-       J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

Ongoing. D.Collados distributed an email with the link to the wiki page where the information is available. He and R.Quick will present it at the MB on 20 May.

 

The OSG tests are described here: http://rsv.grid.iu.edu/documentation/help/.

 

The proposed new list of critical tests is available here: https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes#Proposed_Critical_Probes_for_OSG

 

-       30 Apr 2008 - Sites send to H.Renshall their plans for the 2008 installations, stating what will be installed for May and when the full capacity for 2008 will be in place.

 

H.Renshall will be absent for several weeks in May; Sites should send their plans to S.Foffano in addition to H.Renshall.

 

3.   CCRC08 Update (Slides) - J.Shiers

 

J.Shiers presented a summary of the status and progress of the recent CCRC08 activities and of what was discussed at the last WLCG workshop.

 

February has been extensively discussed; CCRC08-May has started and is now the focus. Last week’s daily (15:00) meeting minutes are available here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08DailyMeetingsWeek080428

 

3.1      Baseline Versions

Here are the baseline versions that have been discussed and agreed.

 

Storage-ware – CCRC’08 Versions by Implementation

CASTOR: SRM: v 1.3-21,  b/e: 2.1.6-12

dCache:  1.8.0-15, p1, p2, p3 (cumulative)

DPM: (see below)

StoRM: 1.3.20

 

M/W component     Patch #        Status
LCG CE            Patch #1752    Released gLite 3.1 Update 20
FTS (T0)          Patch #1740    Released gLite 3.0 Update 42
FTS (T1)          Patch #1671    Released gLite 3.0 Update 41
gFAL/lcg_utils    Patch #1738    Released gLite 3.1 Update 20
DPM 1.6.7-4       Patch #1706    Released gLite 3.1 Update 18

3.2      Interventions / Status

Date: 5 May

Intervention: Intervention on VOMS services (lcg-voms.cern.ch and voms.cern.ch)

Reason / Affecting:

1.    Transparent intervention to voms-core (Proxy Generation) and voms-admin (GridMap File Generation) at 13:00 UTC for 2 hours.

2.    Short complete stop of vomrs at 14:00 UTC for ~30 minutes (registration processing stopped).

 

Status of CASTOR2 at CERN:

-       C2ATLAS & C2CMS have been upgraded to 2.1.7

-       C2ALICE & C2LHCB remain at 2.1.6 unless they request an upgrade, as do external sites

 

The DB version at Tier-0 is now Oracle 10.2.0.3 with January CPU*

-       Increase of usage of FTS agents and webservices since yesterday morning (links in slide notes);

-       On experiments’ databases no load changes (yet) seen.

 

Slide 5 shows the summary of the interventions reported at the Operations meeting. Many scheduled and unscheduled interventions are continually being performed.

 

Slides 6 to 11, which provide the reports and plans of the LHC Experiments, were not commented on at the meeting.

 

Slide 12 shows the GridView plots of the export rate from the Tier-0, reaching 900 MB/s.

 

 

 

This report will be prepared on Tuesday mornings and mailed to wlcg-ccrc08@cern.ch for verification.

 

J.Gordon asked who is registered on that mailing list.

J.Shiers replied that the list is open but includes all relevant Site and Experiment contacts.

 

4.   Job Priorities for ATLAS (Text) - D.Barberis

                                                                                             

The MB had asked ATLAS to clarify whether Job Priority was still an important issue for the Experiment.

 

Text from D.Barberis

 

ATLAS stated at the MB on 1st April that the DPM certification (and in general anything related to data management) is for ATLAS the first priority (the updated info. is that today the DPM certification is well advanced).

 

Still, ATLAS is definitely interested in a priority (or share) system that can be published in the information system (the VOviews). This system does not need to be dynamic (as the priorities will not be changed frequently) and needs only a few roles, production and analysis for a start.

 

Such a system is also needed because the pilot jobs are used for production but not yet for the analysis jobs submitted with Ganga, and the two activities have to be prioritized. Moreover, for the analysis user it is important to know how many jobs are running at a site in the analysis share available to her/him. Thus the publication in the information system is needed.

 

 

D.Barberis added that as the work is almost done it should be completed, and is really needed by ATLAS.

 

I.Bird concluded that the milestones about Job Priorities should remain in the HLM dashboard.

 

5.   ATLAS SAM Tests (Slides; VO_SAM_200803; VO_SAM_200804) - A.Di Girolamo

A.Di Girolamo summarized the status of the ATLAS-specific SAM tests and commented on the results of the tests for April 2008.

5.1      SAM Critical Tests - Current Status

SAM is running ATLAS-specific tests together with the standard tests, but all ATLAS tests use ATLAS credentials.

 

The site and endpoint definitions are taken as the intersection of GOCDB and TiersOfATLAS, an ATLAS-specific site configuration file.

 

Different services and endpoints need to be tested using different VOMS credentials:

-       The ATLAS endpoints and paths must be explicitly tested

-       The LFC of the Cloud (residing in the T1) is used. Only Clouds using LFC are tested with ATLAS specific tests

 

For the moment the FCR is used but not enforced, i.e. there is no banning even if sites are failing Critical Tests.

 

SE & SRM (centrally from SAM UI)

-       SE-ATLAS-lcg-cr: copy and register (with the cloud LFC) a file from the SAM UI to the endpoint. For the Tier1s both Disk and Tape areas are tested.

-       SE-ATLAS-lcg-cp: copy back the file from the SE to the UI. Verification of the integrity of the file copied.

-       SE-ATLAS-lcg-del: delete the files from the storage and from the LFC.
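
The three steps above (copy and register, copy back with an integrity check, delete) can be sketched as follows. This is only an illustration under strong assumptions: a local directory stands in for the SE, a Python dictionary stands in for the LFC, and the integrity check is a SHA-256 comparison. The function names are hypothetical and the sketch does not call the lcg-* commands used by the real tests.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def se_round_trip(source, storage_dir, catalogue, lfn):
    """Copy-and-register, copy back with integrity check, then delete."""
    results = {}
    stored = Path(storage_dir) / Path(source).name

    # Step 1: stand-in for the copy-and-register test.
    shutil.copyfile(source, stored)
    catalogue[lfn] = str(stored)
    results["cr"] = stored.exists()

    # Step 2: stand-in for the copy-back test, comparing checksums.
    with tempfile.TemporaryDirectory() as scratch:
        back = Path(scratch) / "copy_back"
        shutil.copyfile(catalogue[lfn], back)
        results["cp"] = sha256(back) == sha256(source)

    # Step 3: stand-in for the delete test (storage and catalogue).
    stored.unlink()
    del catalogue[lfn]
    results["del"] = not stored.exists() and lfn not in catalogue
    return results
```

Each entry of the returned dictionary corresponds to one of the three tests, so a failure can be attributed to a single step.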

 

CE (job submitted on the CE):

On all the ATLAS CEs that are in production and certified (from the BDII):

-       Running part of the OPS suite under ATLAS credentials:

– Job Submission

– Certification Authority version

– VO software directory

-       ATLAS specific test: ATLAS-vo-lcgTag: Check VO tag management (lcg-tags).

 

Only for ATLAS Tier1 and Tier2 (from the ToA):

-       Ganga Robot: Compile and execute a real analysis job based on a sample dataset.

 

G.Merino asked whether the Ganga Robot is working or not. On three identical CEs at PIC, two always fail and one always succeeds. It is not clear at PIC whether it works, because the log files are very difficult to understand.

A.Di Girolamo replied that the GR is launched by the author of that test and SAM only collects the results. In case of problems the sites should contact him (A.Di Girolamo) and he will inform the author of the tests. The goal of the publication, even if there are still many failures, is to improve and fix the problems in the test and at the sites.

 

LFC:

-       lfc ls: list entries in /grid/atlas

-       lfc wf: create an entry in the LFC

 

FTS:

List FTS channels: glite-transfer-channel-list, Information System configuration and publication

 

Other SAM Tests

ATLAS-lcg-versions:

-       Check the version of lcg-utils running on the WNs

ATLAS-swdirspace:

-       Check the size of the ATLAS software installation area

5.2      ATLAS April 2008 SAM Results

The ATLAS tests had submission problems:

-       Migration of ATLAS tests into the SAM prod machines (DONE)

-       lock files blocked (now monitored with lemon sensor)

-       machines SL4 upgrade (DONE)

-       big load on the machines (used also for other VOs, under investigation)

 

And for each site in particular:

-       BNL & NDGF: will be included soon (with their local LFC)

-       CERN: VOTag errors on CEs (SOLVED)

-       IN2P3: SE disappeared from the SAMDB (SOLVED)

-       RAL: GangaRobot errors (under investigation)

-       SARA: SE tests problems (SAM side, old endpoint tested, needs to be fixed)

 

Below is the availability at the end of April and beginning of May: the problems have been solved at ASGC, CERN, FZK, IN2P3, INFN and PIC. They are not yet solved at BNL, NDGF, SARA and RAL.

 

5.3      Work in Progress

The SAM tests are being added to the ATLAS Dashboard so that failures are detected immediately.

 

The Tier0/Tier1/Tier2 sites have many intrinsic differences and should not all run the same tests. Site granularity in the SAM DB should be increased so that results are not mixed. More flexibility is needed to set critical tests. This has been requested from the SAM developers.

 

J.Templon added that in SAM the fact that SARA and NIKHEF are two separate sites causes some tests to fail.

J.Gordon added that SAM knows only about individual nodes and not about the concept of multi-node sites. This should be discussed more extensively at the GDB with the SAM developers.

M.Lamanna added that the SAM developers are already discussing these issues with the Experiments.

 

SE: SRM2 tests for each space token of each endpoint: Tests already developed, to be integrated in the framework

CE: (1) Increase the GangaRobot granularity and (2) retrieve Panda production system information

 

A new GUI for the ATLAS SAM tests is being developed. It is a general solution for VO-specific SAM display, i.e. immediately usable by the other VOs (extending and improving an early prototype for CMS).

 

 

R.Pordes asked whether there are discussions on how to include the T2 OSG sites in the VO-specific tests.

A.Di Girolamo replied that there is no contact with OSG for the moment.

 

6.   LCG-LHCC Referees Meeting – I.Bird
Tape Efficiency Metrics (Wiki)

I.Bird reported on the meeting with the LHCC Referees held the day before.

 

The reviewers asked for more plots and metrics in order to more easily understand the status and performance of the Sites and Experiments.

 

At the forthcoming reviews, the first of which is on 30 June 2008, they expect the results of CCRC08-May and a follow-up on the storage efficiency, the tape metrics and the data rates reached.

 

7.   Reliability/Availability April 2008 (T1_200804; T2_200804; VO_SAM_Tests) - A.Aimar

 

The Reliability and Availability data for Tier-1 and Tier-2 sites are available, as well as the VO-specific test results.

 

They will be distributed to the MB for further comments.

 

CMS agreed to comment on their April tests at the next F2F Meeting.

 

 

8.   AOB
 

 

No AOB.

 

9.   Summary of New Actions

 

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.

 

New Action:

16 May 2008 - Each Experiment proposes 4 users who can raise alarms at the sites and are allowed to mail to the sites’ alarm mailing lists.