LCG Management Board

Date/Time:

Tuesday 20 June 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061506

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 22.06.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, L.Betev, I.Bird, S.Belforte, T.Cass, L.Dell’Agnello, T.Doyle, D.Foster, B.Gibbard, C.Grandi, I.Fisk, F.Hernandez, J.Knobloch, M.Lamanna, H.Marten, G.Merino, Di Qing, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon, J.White

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 27 June 2006 at 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of the Previous Meeting

Minutes approved.

1.2         SRM Planning

An SRM Management Overview Group has been formed. The members are T.Cass (CERN), J.Gordon (RAL), V.Guelzow (DESY), and R.Kennedy (FNAL). The group will have a phone call with M.Litmaath once a month and escalate any issues in the SRM implementations.

1.3         Implications of SRM V2 Proposal on the Sites

A note should be prepared by K.Bos and distributed to the MB.

1.4         3D Project installations

The ORACLE installation at FZK is progressing; the streams to CERN and the firewall settings still have to be configured. Someone from FZK will contact D.Duellmann for the set-up.

 

Action:

27 Jun 2006 - FZK sets up the ORACLE streams to CERN for the LCG 3D project.

1.5         Support for MySQL in FTS

E.Laure distributed a message clarifying the situation (mail message).

 

I.Bird mentioned that a Tier-2 site (Glasgow) is working on the support of MySQL in FTS. T.Doyle (Glasgow) said that they would not be able to ensure support for the high-performance requirements of Tier-1 sites (e.g. FNAL). Therefore high-performance sites should not rely on the MySQL version of FTS.

 

 

2.      Action List Review (list of actions)

 

Note: Actions in RED are due.

 

 

  • 23 May 06 – Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers from all Tier-1 sites and to/from Tier-2 sites. It is not sufficient to set up the channels; the action requires confirmation via email that transfers from all Tier-1 sites and to/from the "known" Tier-2 sites have been tested.

Not done: ASGC, BNL, FNAL, INFN, IN2P3, NDGF, SARA-NIKHEF, and TRIUMF.
Done: RAL, FZK and PIC.

  • 30 May 06 - ALICE and LHCb will send to J.Shiers the list of the Tier-2 sites to monitor in SC4.

Received the CMS list of Tier-2 sites participating in SC4: https://twiki.cern.ch/twiki/bin/view/CMS/SWIntSC4SiteStatus

The list for ALICE is a subset of the sites in the list of Tier-1 and Tier-2 associations. ALICE agreed to send the list of Tier-2 sites that will be able to participate in SC4.

  • 31 May 06 – K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities, and to provide information as needed by the VOs. The goal is then to make it possible to have a central repository storing effective tape-throughput monitoring information.

Not done.

  • 13 Jun 06 – D.Liko to distribute the Job Priority WG report to the MB.

Not done.

  • 31 May 06 – C.Grandi presents to the MB the EGEE middleware priorities and development of the features needed by the LHC (in Flavia's list).

Done.

  • 13 Jun 2006 - I.Bird to add the discussion on the SAM tests and results to the Operations Workshop agenda.

Done.

  • 15 June 06 – I.Bird reports to the TCG the need for APIs to verify the working status of the middleware services.

Done.

  • 21 May 06 – J.Shiers and M.Schulz: Flavia’s list should be updated, maintained and used to control changes and releases. A fixed URL link should be provided to that list.

Done. The TCG list is the reference list and should be monitored carefully.

  • 30 May 06 - J.Shiers and N.Brook will add a note in the document recalling the need to announce in advance the draining of jobs at the sites.

Done. The final version is to be approved by the next MB and then distributed.


  • 10 Jun 06 - CMS will send to J.Shiers some defined milestones for the CMS SC4 exercises.

Done. The CMS milestones are here: https://twiki.cern.ch/twiki/bin/view/CMS/SWIntSC4Mile

  • 10 Jun 06 – J.Shiers will add an Action List to the SC4 Plans wiki page.

Done. But there are not many clear actions.

  • 16 Jun 06 – CERN+Tier-1 sites - The sites should send the accounting data for May to lcg.office@cern.ch

Done.

  • 20 Jun 06 - E.Laure prepares a clear statement on the status of support for MySQL in FTS

Done. Bugs will be fixed, but there will be no testing of a MySQL-based solution for high-performance usage.

 

3.      gLite 3: JRA1 Middleware development plans (more information, transparencies) - C.Grandi

 

3.1         Common activities for all components

Slide 2. JRA1 will provide:

-          Bug fixing and support for the production infrastructure (2nd line support for GGUS)

-          Participation in task forces with applications and sites

-          Support for new platforms (SLC4, x86-64 and IA64), which will imply migration to VDT 1.3 and to Globus Toolkit 4 (GT4).

-          The EU is also requesting that funded projects support the IPv6 protocol.  

 

The requirements in the TCG list are referenced in the presentation, via their ID value (see example in slide 3, the left-most column).

 

Note: The dates below are the development dates as given by the developers; the components are tested by the developers but not certified.

3.2         Security

The work on VOMS and the VOMS-related utilities will be:

-          VOMS validator – ongoing

-          VOMSAdmin memory leak in util-java – ongoing

-          VOMS server cert in Attribute Certificate in Java API – ongoing (done on server and C/C++ API)

-          VOMSes directory structure VO name restriction support in Java API – ongoing (done on server and C/C++ API)

 

On glexec:

-          Test glexec+LCAS+LCMAPS on a CE head node – end-July (313). On a WN with the same code – during the summer

-          Call-out to remote authZ/credential mapping services – end August

-          Distinguish error reported by glexec and by the command – end August

-          Support for file copying and ownership change – end-September

-          Fine grained error codes – end-September (313)

-          Use configuration files at hard coded locations – autumn

 

On proxy renewal:

-          VOMS-aware renewal within the service – library ready to be certified (103a)

-          Establishment of trust between service and myproxy – ongoing (103b)

 

Trustmanager

-          Namespace enforcement – mid-August

-          Hostname checking on client-side in handshake – end-August

 

LCAS/LCMAPS

-          Refine proxy lifetime checking (end-July)

-          Implement Globus C authZ call-out interface to be able to plug LCAS and LCMAPS into GT3,GT4 services – end-August (313)

-          Fine grained error codes – end-September

-          Service front-end which allows for a centralized management – not planned yet, 1 month work (313; needed if glexec runs on worker nodes)

-          Wildcard matching in grid/group map files – end-November

-          Call-out to remote authZ and credential mapping services – not planned yet (313)

 

Job Repository

-          Integration in gLite – end-July

-          Tests with glexec – 1 week work

3.3         Information Systems

BDII is not reported here because the component is directly supported by LCG.

 

The work on RGMA will consist of:

-          Support for AuthZ – end-2006 (101)

-          Architectural changes to improve usability – end-2006

-          Support for namespaces (multiple virtual-DB) – end-2006

-          Queries across namespaces – long term

-          Support for registry and schema replication – long term

-          Support for ORACLE and other RDBMS – long term

 

Service discovery

-          Query to multiple back ends in parallel – end-2006

-          Bootstrapping (avoiding the usage of a config file) – end-2006

-          Caching – the issues need to be understood; it is probably better to let clients implement their own caching mechanisms (111)

 

J.Templon noted that the middleware components should not expect the information to be accurate in real time; caching is needed in several components and should always be taken into account.
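
A minimal sketch (illustrative only, not part of the JRA1 plans) of the client-side caching J.Templon refers to: service-discovery results are kept with a time-to-live, so components tolerate information that is not accurate in real time. The lookup function, the names and the TTL value are hypothetical stand-ins for a real information-system query.

    import time

    CACHE_TTL = 300  # seconds; illustrative value only
    _cache = {}      # service type -> (timestamp, endpoints)

    def lookup_endpoints(service_type):
        """Hypothetical stand-in for a real service-discovery query."""
        return ["https://example.cern.ch:8443/" + service_type]

    def cached_lookup(service_type):
        """Return possibly stale endpoints, refreshed after CACHE_TTL."""
        entry = _cache.get(service_type)
        if entry is None or time.time() - entry[0] > CACHE_TTL:
            entry = (time.time(), lookup_endpoints(service_type))
            _cache[service_type] = entry
        return entry[1]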

3.4         Data Management

The Data Management group in EGEE JRA1 counts 3 FTEs; the other developers come from SA1 and SA3. The CERN DM group works as a single group, regardless of the activity to which its members belong. This summary includes all DM activities in EGEE and LCG.

 

Fireman/gLite IO

-          To be replaced by adding functionalities to LFC and GFAL.

-          Limited support (for the Biomed community).

SRM

-          SRMv2 compatibility tests – ongoing (122)

-          Add lookup in the information system to determine SRM v1/v2 for a given SE – end-June (123)

DPM

-          Migration to gLite build system

-          Common rfio library with Castor – ongoing

-          xrootd plugin and modifications in DPM to support xrootd – mid-July

-          Migration to gridftp v.2 – end-July

-          DB backup – start in July

GFAL

-          SRMv2 support, completing the POSIX calls not supported by SRMv1 – ongoing (123)

-          Support for Encrypted Data Storage – end-June (580,581)

-          Revision of GFAL information system access to provide cache – start in July

LFC

-          Migration to gLite build system

-          Database replication tests – ongoing (263)

-          Provide a POOL file catalog plugin that supports the new-style POOL file catalog interface – ongoing (244)

-          DB backup – start in July

-          Revision of GFAL information system access to provide cache – start after July

FTS

-          Revision of agents to remove need for database locking – ongoing (216)

-          Testing retry logic for srm-cp – ongoing

-          Adding delegation – testing to be completed by end-July (216)

-          Remove build dependencies on Fireman from FTS & Hydra – end-June

-          Proxy renewal agent – testing to be completed by end-July (216)

-          Re-factoring of components – end September (123)

lcg-utils

-          Retries with different replica(s) in case of failure – end-June

Hydra

-          Key splitting over multiple servers – start in July (582)

AMGA

-          supported by NA4

 

I.Bird noted that lcg-utils should be merged with the gLite tools.

 

Check-sum verification

Checksum verification is not going to be added to the FTS layer, because FTS performs third-party transfers and because verification can be a considerable overhead. There is also a proposal to verify the checksums in parallel (“out of band”), in order not to slow down the transfers. Checksum calculation is still an open issue.
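
A minimal sketch of the “out of band” proposal, assuming the file is readable on a local path after the transfer: the checksum is computed in a background thread, off the transfer path. The choice of Adler-32 and all names here are illustrative assumptions, not the agreed design.

    import threading
    import zlib

    def adler32_of(path, blocksize=1 << 20):
        """Compute the Adler-32 checksum of a file, block by block."""
        value = zlib.adler32(b"")
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return value & 0xFFFFFFFF

    def verify_out_of_band(path, expected, on_mismatch):
        """Verify the checksum in parallel, without slowing the transfer."""
        def worker():
            if adler32_of(path) != expected:
                on_mismatch(path)
        threading.Thread(target=worker).start()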

3.5         Computing Resources

CE (gLite)

-          Log files for accounting – in certification

-          Information pass through, to transport application data – to be certified

-          Migration to Condor 6.7.19 – ready to start as soon as requested (304).
It was tested with gLite by F.Prelz and there are no known issues.

-          Access to the job via a restricted shell – needs more specification (310)
- 3 months of work if there is outbound connectivity on the WN and inbound on the UI
- more work on the security side if not via a restricted shell

CE (CREAM)

-          Bulk submission from WMS to CE – available for preview (304,311)

GPBOX

-          Available for preview (101/309)

APEL

-          Supported directly by LCG

DGAS

-          To be certified (331)

-          Work plan:
- certify and activate local sensors and site HLR in parallel with APEL
- replace APEL sensors with DGAS (DGAS2APEL)
- certify and activate central HLR. Test scalability to the PS scale

CEMon

-          Available on PS (311,312)

3.6         Workload Management and LB

WMS/UI

-          Support for GLUE 1.2 and VOViews – in certification (101/309)

-          Support for GPBOX removed; it may be added back on request (101/309)

-          Migration to Condor 6.7.19 – ready to start as soon as requested

-          Job prologue/epilogue, i.e. the possibility to execute some code before the job (e.g. checks and preparation before execution) and after it (e.g. checking the exit status of the application) – to be certified (523); see the sketch after this list

-          Support for Short Deadline Jobs – to be certified (512)

-          Bulk match-making (end-October) (304)

-          Hot standby – 3 months work (end-October) (303)

-          High availability – not planned yet (302)

-          Improve input file download (http cache) – end October (306)

-          Improve output file upload – not planned yet (306)

-          Improve round-robin on UI - 1 month work (301)
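
As an illustration of the prologue/epilogue item above (523), a minimal sketch with hypothetical script names: a wrapper runs the prologue before the user job and passes the application exit status to the epilogue afterwards.

    import subprocess
    import sys

    def run_with_hooks(job_cmd, prologue="./prologue.sh", epilogue="./epilogue.sh"):
        """Run the prologue, then the job, then the epilogue."""
        if subprocess.call([prologue]) != 0:
            sys.exit("prologue failed; job not started")
        status = subprocess.call(job_cmd)
        # The epilogue can inspect the application exit status.
        subprocess.call([epilogue, str(status)])
        return status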

 

LB and JP

-          Hot standby – 2 months work (sync. with WMS: end-October) (303)

-          Remove performance bottlenecks (logging events via logd and registering large collections) – end-October (304)

-          Allow queries from authorized users and not only from the owner – end-July (324)

-          Produce job statistics on LB (e.g. for JRA2) – end October

-          Job Provenance - available for preview

 

3.7         Conclusions and Issues

 

Incremental Releases

The ideal scenario is that components with new features are deployed in incremental upgrades, as they are tested and become certified.

 

SLC4 introduction

A scenario with mixed SLC3 and SLC4 nodes has to be planned and certified (as happened for RH73 and SLC3). There is a recipe from GRIF (Paris), but this is not a supported solution and will require proper certification.

 

Decision on glexec on WNs

If glexec is to be used on the WN, JRA1 will devote resources to it. Some sites will not accept the “identity switch” performed by glexec; therefore the functioning of the WNs should not depend on the existence or usage of glexec.
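
As an illustration of that constraint (a sketch only, assuming the usual invocation of glexec in front of a command line; the path and fallback policy are illustrative assumptions): a pilot can launch its payload through glexec when it is present and directly otherwise, so the WN works with or without the identity switch.

    import os
    import subprocess

    GLEXEC = "/opt/glite/sbin/glexec"  # illustrative location, site-dependent

    def run_payload(command):
        """Run the payload via glexec if available, directly otherwise."""
        if os.access(GLEXEC, os.X_OK):
            # Identity switch: the payload runs as the mapped user.
            return subprocess.call([GLEXEC] + command)
        # Site without glexec: WN operation must not depend on it.
        return subprocess.call(command)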

 

Migration to GT4

When gLite is migrated to GT4, it will not be backward compatible.

 

Preview System

There is a need for a “preview system” in order to show several new components (CREAM, GPBOX, Job Provenance, etc.) without installing them in any existing PPS or PS service. A preview system is an installation accessible to the developers and to early users where new components and ideas can be tested. The preview system is updated by the developers and is open to the users, so that they can see whether the service is useful and which improvements are needed. It will be maintained by the JRA1 developers and is not related to the certification or PPS services.

 

Support for old Encrypted Data Storage

The old encrypted data storage will be maintained and supported until the new one is ready. In particular, this is a requirement of the Biomed community.

 

4.      AOB

 

4.1         LHCC Referees Meeting

There was a meeting with the LHCC referees, with a presentation of the status of the Tier-2 sites.

See agenda: http://agenda.cern.ch/fullAgenda.php?ida=a057190

4.2         Meeting Timing

Maintain the current start time - BUT

-          assume that everyone connects to the meeting BEFORE 16:00

-          the roll call starts at 16:00 precisely

-          meeting cut-off at 17:05

4.3         Other AOB

Other AOB items will be distributed by L.Robertson via email.

 

5.      Summary of New Actions

 

 

Action:

27 Jun 2006 - FZK sets up the ORACLE streams to CERN for the LCG 3D project.

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.