LCG Management Board

Date/Time:

Tuesday 15 May 2007 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13795

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 16.5.2007)

Participants:

A.Aimar (notes), D.Barberis, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, S.Foffano, D.Foster, J.Gordon, F.Hernandez, J.Knobloch, H.Marten, R.Pordes, Di Qing, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 22 May 2007 16:00-17:00

1.      Minutes and Matters arising (minutes) 

 

1.1         Minutes of Previous Meeting

Minutes of the 24 April 2007 approved.

The minutes of 8 May 2007 will be distributed after the MB meeting.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.

Not done. J.Gordon said that he will send a proposal to the MB in a couple of weeks.

 

3.      Site Reliability Reports (Reliability Data; Site Reports; Slides) - A.Aimar

 

All sites commented on the Reliability Data document distributed at the beginning of May (pages 3 to 5 refer to April 2007). Here are their Site Reports.

The attached Slides show the daily reliability values for April (slides 2 and 3), which are summarized and compared with the values since January in slide 4 and in the table below (sites ordered as in the Reliability Data document):

Reliability >= 88%   (>= Target)

Reliability >= 79%   (>= 90% of Target)

Reliability < 79%   (< 90% of Target)

Site reliability (%)

Site           Jan 07   Feb 07   Mar 07   Apr 07
CERN               99       91       97       96
GridKa/FZK         85       90       75       79
IN2P3              96       74       58       95
INFN/CNAF          75       93       76       93
RAL                80       82       80       87
SARA-NIKHEF        93       83       47       92
TRIUMF             79       88       70       73
ASGC               96       97       95       92
FNAL               84       67       90       85
PIC                86       86       96       95
BNL                90       57        6       89
NDGF              n/a      n/a      n/a      n/a

 

The target of 88% for the best 8 sites was not reached, but 10 sites were within 90% of the target:

-          7 Sites >= 88% (target)

-          7+3 Sites >= 79% (90% of target)
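
The counts above can be reproduced from the April 2007 column of the table. The following is a minimal sketch, not the tool used to produce the report: the names (APRIL_2007, band, sites_by_band) are illustrative, and the 79% threshold is the whole-percent value used in the legend for 90% of the 88% target.

```python
# April 2007 site reliability (%), copied from the table (NDGF omitted: n/a).
APRIL_2007 = {
    "CERN": 96, "GridKa/FZK": 79, "IN2P3": 95, "INFN/CNAF": 93,
    "RAL": 87, "SARA-NIKHEF": 92, "TRIUMF": 73, "ASGC": 92,
    "FNAL": 85, "PIC": 95, "BNL": 89,
}

TARGET = 88        # reliability target (%)
NEAR_TARGET = 79   # 90% of the target, as given in the legend (0.9 * 88 = 79.2)


def band(reliability: int) -> str:
    """Return the legend band for a monthly reliability value."""
    if reliability >= TARGET:
        return "target"
    if reliability >= NEAR_TARGET:
        return "within 90% of target"
    return "below 90% of target"


def sites_by_band(values: dict[str, int]) -> dict[str, list[str]]:
    """Group site names by legend band, keeping table order."""
    grouped: dict[str, list[str]] = {}
    for site, reliability in values.items():
        grouped.setdefault(band(reliability), []).append(site)
    return grouped


bands = sites_by_band(APRIL_2007)
for name, sites in bands.items():
    print(f"{name}: {len(sites)} sites ({', '.join(sites)})")
```

With the table values this gives 7 sites at or above the target and 3 more within 90% of it, leaving TRIUMF as the only site below 79%.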

 

The table below summarizes the issues and solutions sent by the sites.

 

Issues by category (SITE: problem, solution)

SRM/MSS

-          INFN: CASTOR lack of D1T0 resources

-          FZK: dCache GridFTP doors

-          RAL: CASTOR tape stager crashed

-          BNL: dCache GridFTP doors filled up and write pools were disabled; killed processes, restarted and reduced the TCP buffer size

-          TRIUMF: dCache SRM server overload; cleaned historical log/error entries from the postgres database

-          SARA: SE downtime, reason unknown for now

BDII overload/timeout

-          FZK: BDII timeouts with CERN (sam-bdii.cern.ch)

-          TRIUMF: BDII timeouts with CERN (sam-bdii.cern.ch)

-          IN2P3: Local top-level BDII timeouts; re-indexed the database

CE issues

-          PIC: CE blocked; restarted Maui

-          RAL: More unspecified GridManager errors and CE de-queued jobs; upgraded torque and moved to TCP for job communication

-          INFN: sporadic alarms on the CE, no evidence of problems

-          SARA: 4 clusters available, not all down so SAM CE test should pass

Operational Issues

-          PIC: Network problems, caused by the regional network provider

-          CERN: Incorrect enabling of batch work, human error

SAM

-          ASGC: discussing some of the errors with the SAM team

-          IN2P3: SRM unavailability in SAM, not understood

-          sam-bdii.cern.ch timeouts/problems?

-          SARA special configuration/domains/clusters

 

One can note that:

-          dCache problems affecting several sites are still related to GridFTP door issues

-          sam-bdii.cern.ch timeouts can be the cause of several issues and need to be investigated

-          the rest are mostly sporadic alarms or operational issues

 

J.Gordon added that the sam-bdii problems cause the tests to be in an unknown state and this should not be considered as a failure for the site. A monitoring session will take place at the next GDB, where these issues will be discussed.
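
The effect of J.Gordon's remark can be illustrated with a small sketch. The status names ("ok", "failed", "unknown") and the formula are assumptions made for illustration only; this is not the algorithm used by SAM or GridView.

```python
from collections import Counter
from typing import Iterable, Optional


def availability(results: Iterable[str],
                 count_unknown_as_failure: bool = False) -> Optional[float]:
    """Fraction of successful tests among the tests that count.

    By default "unknown" results (e.g. the test could not be run because
    the information system timed out) are left out of the denominator.
    Returns None when no test result counts at all.
    """
    counts = Counter(results)
    ok = counts["ok"]
    failed = counts["failed"]
    if count_unknown_as_failure:
        failed += counts["unknown"]
    total = ok + failed
    if total == 0:
        return None
    return ok / total


# Hypothetical day of hourly tests: 20 passed, 3 unknown, 1 failed.
hourly = ["ok"] * 20 + ["unknown"] * 3 + ["failed"]
penalised = availability(hourly, count_unknown_as_failure=True)
excluded = availability(hourly)
print(f"unknown counted as failure: {penalised:.1%}")
print(f"unknown excluded:           {excluded:.1%}")
```

In this made-up example the same day scores about 83% when unknown results count against the site and about 95% when they are excluded, which is the size of difference that matters near an 88% target.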

 

In conclusion, the average values (excluding NDGF) are above the 88% target:

-          Average 8 best sites: 92%

-          Average all sites: 88%
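
The two averages can be recomputed from the whole-percent April values in the table. This is only a sketch: the table values are already rounded, and truncating the result to a whole percent is an assumption that happens to reproduce both figures; the report itself may have been computed from unrounded data.

```python
# April 2007 reliability (%) for the 11 sites with data, in table order
# (NDGF is n/a and therefore excluded).
APRIL_2007 = [96, 79, 95, 93, 87, 92, 73, 92, 85, 95, 89]


def mean(values: list[int]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


best_8 = sorted(APRIL_2007, reverse=True)[:8]
avg_best_8 = mean(best_8)      # 739 / 8
avg_all = mean(APRIL_2007)     # 976 / 11

# Assumed convention: truncate to a whole percent.
print(f"Average 8 best sites: {int(avg_best_8)}%")
print(f"Average all sites:    {int(avg_all)}%")
```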

 

Down-time is lower and therefore there are fewer issues such as dCache (GridFTP doors), CASTOR (pools resources), BDII timeouts, CE (unclear) and sam-bdii.cern.ch.

 

We also started collecting reliability data at the Operations meeting every week. For now the reports from the sites are not sufficiently clear (to say the least). They will be sent back to the authors. The MB members are invited to ask their representatives at the Operations meeting to fill in the weekly reports more carefully. If the weekly reports do not improve, the monthly reliability reports will continue for May 2007 and for subsequent months until the weekly reports become adequate.

 

H.Marten noted that CMS is running their own site availability tests with different values although they are apparently using SAM for the tests.

I.Fisk explained that SAM is a framework and CMS has added five VO-specific tests and removed some general SAM tests not relevant for CMS.

 

Action:

 I.Fisk will circulate to the MB the description of the CMS tests.

 

F.Hernandez noted that (as written in the IN2P3 and BNL reports) there are differences between the Reliability Data distributed and the same data as seen in GridView. This issue will be followed up by verifying how the reliability data is generated by GridView.

 

Issue to follow:

The situation of the SAM tests should be followed and a presentation organized at the F2F MB, including the status of the VO-specific tests and the comparison with GridView data.

 

4.      Job Reliability Reports (document) - L.Robertson

 

A first proposal was discussed at the MB in Prague; since then the initial table describing the job attempts has been removed.

 

The Job Reliability reports (document) now include:

-          CMS: CRAB user analysis jobs

-          ALICE: job agents

-          LHCb: pilot jobs

 

The ARDA team is also working with ATLAS to start providing similar data.

 

The reports will be generated monthly from now on. The MB is asked to check the values and send feedback and comments to L.Robertson.

 

L.Dell’Agnello asked whether it is possible to investigate further the reasons for the failures shown in the graphs, as one can now do with the site reliability reports.

L.Robertson replied that there are operations interfaces to enable the errors to be investigated, and he will ask M.Lamanna to contact L.Dell’Agnello to explain how to use these features.

 

Decision:

The MB agreed to start distributing the Job Reliability reports every month and review the situation after a few months.

 

5.      GDB Summary (document) - J.Gordon

 

J.Gordon presented a summary of the May GDB meeting; details are in the attached document and in the GDB agenda.

5.1         Organizational Issues

The August 1st meeting is cancelled.

 

Proposal to move the October meeting from the 3rd, where it clashes with the EGEE Conference, to the 10th.

 

The LHC Experiments agreed to move the October GDB to the 10th.

 

Countries are encouraged to keep their GDB membership details up to date. Countries with a Tier1 should nominate a Tier2 representative too if appropriate.

5.2         Middleware Issues

 

See the document for more details.

 

SL4 - The WN has been through one round of testing in 32-bit mode on PPS; the bugs found have been fixed and it is due for release. The UI and WN in 32-bit mode are the top priority. They are now in PPS and ready for release, and the experiments should test them urgently.

 

WMS - At the March MB Ian proposed some evaluation criteria for WMS and he has now given a status report. gLite WMS 3.1 has achieved 15,000 jobs/day for 7 days (the criterion was 10,000 jobs/day for 5 days), with 0.3% failures with no restart; all jobs completed when restarted. The Logging and Bookkeeping service is capable of much higher rates and is not a bottleneck. This is encouraging.
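
The comparison between the reported WMS result and the throughput criterion can be written down as a short check. This is a sketch only: the class and function names are illustrative, and the estimate of failed jobs assumes that the 0.3% applies to all jobs submitted over the 7 days, which the report does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputCriterion:
    jobs_per_day: int   # minimum sustained rate
    days: int           # minimum duration of the test


@dataclass(frozen=True)
class ThroughputResult:
    jobs_per_day: int
    days: int
    failure_fraction: float   # fraction of jobs failing with no restart


def meets(result: ThroughputResult, criterion: ThroughputCriterion) -> bool:
    """True if both the rate and the duration reach the criterion."""
    return (result.jobs_per_day >= criterion.jobs_per_day
            and result.days >= criterion.days)


criterion = ThroughputCriterion(jobs_per_day=10_000, days=5)
reported = ThroughputResult(jobs_per_day=15_000, days=7, failure_fraction=0.003)

total_jobs = reported.jobs_per_day * reported.days
# Assumption: the failure fraction applies to every job in the test.
failed_jobs = round(total_jobs * reported.failure_fraction)

print(f"criterion met: {meets(reported, criterion)}")
print(f"total jobs: {total_jobs}, failed before restart: about {failed_jobs}")
```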

 

CE - Current status of the gLite CE: close to 100% success of job submission, after resolving a number of timing issues with Condor. Submissions of 6,000 jobs to a CE (max ~3000 at any time). Several Condor issues were found and the timescale for resolving them is not yet clear. The gLite CE is not proven yet.

 

The fallback proposal could be:

-          Keep the LCG-CE “as-is” - there is no effort to port to SL4 (which implies GT4 and potentially many issues)

-          Deploy on SL3 nodes (or on SL4 with Xen/SL3). Contrary to previous reports, SL3 support will not stop in October 07 (SLC3 could also be continued for exceptional cases, like the CE). RHEL3 security patches will continue (until 2010), so it is feasible to continue with LCG-CE on SL3.

-          Set up a CREAM instance in parallel and subject it to the same testing procedure because JRA1 effort is focusing on CREAM and not on improving the gLite CE.

 

 

J.Templon noted that a CREAM-only solution will not have a Globus interface and some sites need it because it is used by some of the VOs hosted at the sites.

5.3         Top 5 Issues

Technical summaries of groupings of issues were presented to GDB.

 

Castor - Tier0 issues are being addressed by a special task force. A new LSF Plug-in should address many of these issues. For Tier1s D1 storage classes are the highest priority. There is no firm plan for the remaining issues yet but it is being reviewed. Castor SRMv2.2 implementation is proceeding on track after a slow start. The Xrootd development is done by SLAC so negotiations are required.

 

Integration & testing of data and storage management components - The main outstanding issue that has not yet been addressed is multi-VO testing of Tier0-Tier1 transfers to demonstrate the nominal rates. This should become feasible soon, when ATLAS restarts bulk transfers. CMS is transferring repeatedly.

 

SRMv2.2 - The main implementations are tested for the functionalities requested. SRM v2.2 is available for the experiments to test. It is very important to have the experiments on the pre-production test-bed testing the environment as soon as possible in order to understand if SRM v2.2 is ready for production.

 

Job Management - WMS 3.1 is making good progress. It addresses many of the issues. An outstanding issue (for a future GDB?) is the deployment of glexec, firstly on the CE and then on the Worker Nodes.

 

J.Templon added that nothing should prevent the installation of glexec at the sites (in two possible configurations, one limited to log-only mode) but this should be discussed in detail and explained to the sites at the GDB.  Also the scalability should be tested.

 

FTS and Data Storage Management - FTS version 2.0 is certified – pilot service used by experiments for testing. It has interfaces to both SRM v1.1 and SRM v2.2 and includes VOMS-aware proxy renewal (ALICE) and delegation (avoids need to send passwords).

 

ALICE made a statement that FTS will not be used for T1 to/from T2 transfers. This has to be contrasted with WLCG’s view that FTS is the only supported transfer mechanism.

 

N.Brook added that, like ALICE, for T1 to/from T2 transfers LHCb will not use FTS.

 

F.Hernandez added that sites need control of the dataflow in and out of the site in order to tune it and limit it in case one VO blocks all others or when the site needs to reduce the incoming flow.

If there is no flow control, as is provided in FTS, the only solution will be to close the VO’s dataflow channels to the site.

 

J.Templon added that if transfers are executed using VO-specific tools and configurations the problems will be difficult to investigate.

 

N.Brook noted that LHCb needs communication from 80 T2 sites to all T1 sites; a complete FTS solution would require defining 80 FTS channels at each Tier-1 site.

 

L.Robertson stated that problems with non-standard solutions will be much more complicated to solve and will be lower priority for the sites compared to standard FTS transfers.

 

Top5 Summary - Experiments expressed satisfaction with the technical status reported but would await progress in coming months. There will be future reports to GDB but the Management Board should consider and present a management plan (see next section).

5.4         Other Topics

Security - The final version of the Grid Site Operations Policy was agreed. A good draft version of the Grid Security Policy top-level document was agreed and will be sent to the MB for approval.

Grid Policy on Handling Logged Personal Information, which is relevant to user-level accounting privacy issues: this has not yet been discussed by JSPG. It is foreseen that OSG and EGEE have top-level documents.

 

Does WLCG need a user level privacy document for the sites neither in EGEE nor OSG?

L.Robertson replied that once the common EGEE and OSG document is available D.Kelsey should come to the MB with a proposal for the LCG, after having discussed it with EGEE and OSG.

 

File systems Working Group - M.Jouvin reported from the workshop held during HEPiX in April at DESY. A work plan has been agreed and evaluation has started. Most Tier1s and many other sites are involved. The target is a final report for HEPiX in spring 2008.

 

 

6.      Follow-up to Top 5 Issues from the Experiments (Slides) - L.Robertson

 

L.Robertson went through the main items and presented how the MB will follow them.

 

CASTOR

Task Force established to understand problems, prepare plan for addressing them

Report to MB once per month.

Specific issues:

-          Tier-0/CAF: first step LSF plug-in (delivered); next step is load balancing

-          SRM 2.2: Being monitored as part of the SRM 2.2 activity. Basic and use case tests passed; stress tests starting.

-          Support for Disk 1 (abort request when disk full)

-          Improved Tier-1 deployment model

 

Longer term developments

-          Review in September

-          Access control/VOMS – current expectation 2Q08

-          Quotas – current expectation 2009. Need to agree specification

 

dCache

Review at end of each quarter.

-          Reliability issues

-          Named experiment contacts needed

-          Define problems; agree priorities with the experiment contacts

SRM 2.2: Being monitored as part of the SRM 2.2 activity. Basic and use case tests passed; stress tests under way.

 

As for Castor there are longer term issues: Access control/VOMS; quotas, etc.

 

SRM 2.2

Monthly review by MB already established. Available for experiment testing on PPS. Test and deployment plan being agreed with experiments, sites

 

Additional functionality

-          Access control/VOMS (request by CMS). Supported by DPM, in development for dCache. Need to agree that these are consistent and satisfy use cases; then deployment/development plans from dCache, Castor

-          Quotas (request by ATLAS). First agree on requirements/feasibility

-          File pinning (LHCb): This is not part of the agreed SRM 2.2 functionality

 

Organise GSSD/pre-GDB meeting(s) to agree use cases, functionality; then formal agreement in GDB

 

L.Dell’Agnello added that INFN will use StoRM, which should support VOMS like DPM.

 

N.Brook stated that “file pinning” is part of SRM 2.2 but that CASTOR had agreed to implement it by allowing life-time extension on files.

T.Cass confirmed that file “life-time extension” is the approach chosen by CASTOR.

 

BDII reliability

Sites should ensure that they are following deployment guidelines. I.Bird and M.Schulz should prepare proposals for a longer term solution. A new version using indices to provide significant speed-up expected soon. Separation of static and dynamic data should be considered

 

FTS 2.0

Reviewed monthly by MB. Agreed functionality in final stages of deployment

 

WMS

In final test prior to distribution in gLite release. In production at CERN. Functionality, performance as already agreed

 

glexec

Policy to be agreed at the GDB. Deployment of logging-only system may be acceptable to all sites

Full user switch is unlikely to be agreed.

 

Storage Accounting

The pilot project under way will be presented to the GDB for review and, if it satisfies the requirements, for agreement to proceed with the deployment plan.

 

LFC Bulk operations

Agreed bulk operations have been deployed. Additional functionality would have to be agreed prior to establishing development plan

 

File Management tools (ATLAS)

Basic tools are already in SRM 2.2.

Disk 1 management tools: Consistency of SE and experiment catalog is an application responsibility.

A utility to extract list of all files in SE is missing for all three MSS implementations

 

xrootd (ALICE)

-          CASTOR prototype/evaluation implementation: Developed by SLAC. Progress slow – long delays in testing, few SLAC resources to react to problems. Being tested by ALICE.

-          DPM prototype: Available for testing by ALICE – specific build as it includes xrootd code and dependencies. Proposal for eliminating xrootd dependencies has been designed and is being implemented by ALICE.

-          dCache prototype: Independent of SLAC code and in test by ALICE.

 

ALICE should test the prototypes thoroughly before they are deployed.

 

Need to establish SLAC commitment to support before including it in the WLCG planning.

Investigate possibility of US-ALICE providing resources

 

Action:

A.Aimar and L.Robertson will circulate follow-up milestones on the VOs Top 5 Issues to the MB.

 

7.      AOB

 

 

L.Dell’Agnello asked for news about the HEPiX working group on benchmarking. INFN needs to launch some tenders and hardware providers no longer publish SpecInt2k benchmarks. What should sites do?

 

Action:

L.Robertson will distribute a proposal that had been prepared after the MB presentation on benchmarking in March.

 

A presentation on benchmarking will be scheduled for the next F2F MB meeting.

 

8.      Summary of New Actions

 

 

22 May 2007 - I.Fisk will circulate to the MB the description of the CMS SAM tests.

 

22 May 2007 - L.Robertson will distribute a proposal that had been prepared after the MB presentation on benchmarking in March.

 

29 May 2007 - A.Aimar and L.Robertson will circulate follow-up milestones on the VOs Top 5 Issues to the MB.

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.