LCG Management Board

Date/Time:

Tuesday 3 July 2007 16:00-18:00 - F2F Meeting at CERN

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13802

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 5.7.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, N.Brook, F.Carminati, T.Cass, L.Dell’Agnello, F.Donno, M.Ernst, I.Fisk, D.Foster, F.Hernandez, J.Gordon, C.Grandi, M.Kasemann, J.Knobloch, E.Laure, H.Marten, G.Merino, P.McBride, R.Pordes, Di Quing, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 10 July 2007 16:00-17:00 - Phone Meeting

1.      Minutes and Matters arising (Minutes)

 

1.1         Minutes of Previous Meeting

No comments received. Minutes of the previous meeting approved.

1.2         Sites Availability Reports for June 2007 (Site Availability Reports) - A.Aimar

Several site availability reports are not adequately completed at the Operations meeting. Until this improves, the MB members will have to be asked every month to complete them.

 

The deadline is Friday 6 July 2007. A.Aimar will send a reminder after the meeting.

 

New Action

6 July 2007 - Tier-0 and Tier-1 sites complete the Site Availability Reports for June 2007.

 

2.      Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 3 July 2007 - I.Bird and C.Grandi will report on the structure and leadership of the reformed Job Priorities Working Group.

Done.

I.Bird reported that the Working Group has been reformed. For the moment the group is discussing a short-term solution for ATLAS, not the more general long-term solutions. It is led by D.Liko and has already met on a couple of occasions.

 

C.Grandi added that all focus is on the short-term solution for ATLAS. There was a presentation on the “gpbox” solution, but this will not be discussed until the working solution for ATLAS is implemented and available on the PPS.

 

The solution for ATLAS is expected in a couple of months from now (September GDB). Some initial testing already started this week.

 

Issue to follow:

12 July 2007 - Progress of the Job Priorities Working Group and of the working solution for ATLAS. D.Liko has been asked to provide a summary of the status and plans of the JP Working Group, which will be distributed to the MB by I.Bird.

 

3.      VO box SLAs (Slides) H.Marten

 

H.Marten presented his experience and the issues encountered while preparing the SLA for the support of VO Boxes at GridKa.

 

More details are on the Slides that he presented.

3.1         Introduction

The “GridKa SLA for the VO Boxes” is one generic SLA for all VOs, not specific to the HEP VOs.

 

It is based on two additional documents:

-          “LCG VO Box Recommendations and Questionnaire” (S.Traylen) [1]

-          “VO Box Security Recommendations and Questionnaire” (D.Groep) [2]

 

The SLA contains not only a description of service levels but also a detailed workflow (e.g. due to the split responsibilities, two persons, one from the site and another from the VO must work together to set up the services after a hardware failure), with clear responsibilities for the site and for the VO.

 

The current SLA proposal has been sent to the GridKa Technical Advisory Board (TAB) for comments.

3.2         People/Roles Involved (Too Many?)

According to the documents referenced above ([1], [2]) several persons, roles and groups are involved in the VO Box support and maintenance.

 

Site admins

This role and function is clear at the site.

 

VO Maintainer

The function in [1] is vague: “By submitting this questionnaire or by using any VO box service, the VO maintainer agree that their personal information will be shared among all sites …”

 

GridKa (currently) defines and requires that the VO maintainer is exactly one named VO (contact) person who summarizes and writes down the VO box requirements in [1] and is the owner of and responsible for any activities on the local account <vo>sgm.

 

Anonymous (sgm-) accounts are prohibited at FZK. The sgm account is local and can be used by several persons, but a clearly defined individual is responsible. CMS (Germany) suggested using the already existing “VO manager” role, instead of VO maintainer for this.

 

It was pointed out that the “VO Manager” is the person appointed by the Computing Coordinator to manage the contents of the VOMS database, in particular overseeing the registration process and the assignment of people to groups and roles. After some discussion it was agreed that the “VO Maintainer” is appointed by the VO Manager.

 

VO Security Responsible

Their functions are better defined in [2]: “… both (the VO maintainer and the security responsible) take responsibility for all services running on the VO box under the VO’s systems credentials, and for all actions, events and incidents resulting directly or indirectly running on the VO box under the VO’s user and group system identity”

 

Do VOs expect to be informed about security incidents through their security responsible? That is not in line with the EGEE procedures, where security incidents propagate via the sites. The VO security responsible could/should register with the security response team to receive all security issues.

 

GridKa (currently) defines that the VO maintainer is exactly one named VO (contact) person and that security incidents will be handled, according to the EGEE incident handling procedures, by the local Site Security Officer. Following these procedures, the Grid Security Officer decides about the confidentiality of security incidents (together with the incident handling response team).

 

J.Templon noted that a problem with a VO Box concerns only that VO, not all the other VOs in general. Therefore, if the problem is limited to one VO, there is no need to propagate the issue via EGEE, which deals with the security of the middleware.

I.Bird replied that creating a second channel for security issues would be a problem. A security incident on a VO Box may concern all the other VOs, and the sites as well; therefore the channel should be the same as for all security incidents, while the responsibility for the incidents remains with the VO operating that VO Box.

 

It was agreed that the “VO Security Responsible” as defined in [2] is the person who has a responsibility for ensuring that the usage of the VO Box conforms to all applicable security policies. This is not the person that intervenes in the event of an incident. It was also agreed that this role should be combined with that of the “VO Maintainer” defined in [1].

 

T.Cass noted that having a single contact is a problem because nobody is reachable or can act on a 24/7 basis; a replacement must be available. (See next point.)

 

VO service intervention contact

Defined in [1]; depending on the VO, this is a person, a group of persons or several groups of persons to be informed in case of service interventions.

 

Why this? Why implement several additional workflows on top of the EGEE broadcast (“scheduled/unscheduled downtime”)?

Suggestion: these people or groups should simply register to receive the service broadcasts.

 

It was agreed that it is the responsibility of the VO to ensure that their “Service Intervention Contact” subscribes to the broadcast service used by the site (the EGEE broadcast service in the case of EGEE sites). The Service Intervention Contact is responsible for initiating the recovery process for the VO Box.

 

VO Administrators

GridKa separated this role into:

-          VO Software Managers

-          VO Admins

 

VO Software managers

Sometimes also referred to as “sgms” = “software grid managers”.

They are responsible for installing VO-specific software and services on the VO box through remote grid procedures in user space. They are (currently) all mapped to one local account <vo>sgm which, according to GridKa local policies, must be owned by a single person (currently the VO maintainer; see above).

 

-          Depending on the VO, these are currently 7-30 people that can modify whatever they find on account <vo>sgm.

-           Is this what the experiments want and what the VO maintainer agrees to be responsible for?

-           For sites, their availability and communication paths this is horrible – if allowed at all.

 

GridKa accepted this for the time being, but it urgently calls for single accounts and clear responsibilities through groups & roles.

 

VO Admins

These are people working either locally or through remote access to help debug experiment-specific problems in close collaboration with the site admins. These people are defined by the GridKa TAB and must be personally known to the GridKa admins at all times.

 

R.Pordes suggested that the VO Box roles and agreements should be addressed as “VO Box Maintainers, VO Box Admins, etc.” and that all roles be limited to “VO Box” roles and services, not to general “VO” ones.

Here only specific VO Box services are being discussed; therefore the names above should be “VO Box”-related, not “VO”-related.

 

The MB agreed that the roles mentioned before are to be considered “VO Box Admins, Maintainers, etc”.

3.3         Example of Work Flow (in slide 8)

A site operator must

-           announce an unscheduled downtime through EGEE broadcast

-           inform the VO service intervention contact

-           appoint a site admin for the repair process

The VO service intervention contact (remember it’s a group of people!) must

-           assign a VO software admin to work with the appointed site admin

The site admin(s) must (there could be more than one: for hardware, OS, middleware, backup, etc.)

-           install new hardware and recover the OS

-           send a “go ahead” to the VO software admin

The VO software admin must

-           recover the VO services

-           test and iterate with the site admin

-           agree with the site admin about readiness of the new system

A site operator must

-           announce the end of the unscheduled downtime through EGEE broadcast

-           inform the VO service intervention contact about the success
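
The workflow above can be read as an ordered list of handoffs between roles. Purely as an illustration (not part of the SLA, and not an agreed interface), a minimal Python sketch restating the slide-8 steps as data:

```python
# Illustrative restatement of the slide-8 recovery workflow, to make the
# number of actors and handoffs visible. Actor and step names are taken
# from the slide; nothing here is an agreed interface or implementation.
RECOVERY_WORKFLOW = [
    ("site operator", "announce unscheduled downtime via EGEE broadcast"),
    ("site operator", "inform the VO service intervention contact"),
    ("site operator", "appoint a site admin for the repair process"),
    ("VO service intervention contact", "assign a VO software admin to the site admin"),
    ("site admin(s)", "install new hardware and recover the OS"),
    ("site admin(s)", "send a 'go ahead' to the VO software admin"),
    ("VO software admin", "recover the VO services"),
    ("VO software admin", "test and iterate with the site admin"),
    ("VO software admin", "agree with the site admin on readiness of the new system"),
    ("site operator", "announce end of downtime via EGEE broadcast"),
    ("site operator", "inform the VO service intervention contact of the success"),
]

if __name__ == "__main__":
    roles = {actor for actor, _ in RECOVERY_WORKFLOW}
    print(f"{len(RECOVERY_WORKFLOW)} steps across {len(roles)} roles")
    for step, (actor, action) in enumerate(RECOVERY_WORKFLOW, start=1):
        print(f"{step:2d}. [{actor}] {action}")
```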

 

H.Marten concluded by highlighting that the procedures for operating VO boxes are extremely complicated and communication intensive. The split responsibilities between sites and VOs, and the many (groups of) people involved, lead to complicated, error-prone workflows and require the writing of sophisticated SLAs.

 

L.Robertson asked why the site does not simply:

-          prepare the node (installing hardware and OS) for the VO, and

-          let the VO install the VO Box software without any further involvement of the site (a process initiated by the VO Service Intervention Contact).

This would be a simpler model to follow for handling VO Box support.

 

T.Cass confirmed that CERN, for instance, knows who the contacts are for each VO when an intervention is needed. The VO then follows the case, interacting with CERN whenever needed.

 

L.Dell’Agnello commented that at CNAF there is a contact for each VO, and it is the same person who follows VO Box problems. If that person is not available there is an alternative contact (in some cases remote). There is no formal description of the SLA at CNAF for the moment.

 

G.Merino explained that at PIC they are trying to keep the SLA as simple as possible: quickly installing the node within a given time and handing it over to the VO for the VO Box installation.

 

J.Templon added that the site does not know exactly whom to contact (admin, maintainer, security responsible, etc.); therefore a single entry point for “all VO-related issues” (a mailing list) is better. It is then the responsibility of the VO to follow up the issue.

3.4         Future of VO Boxes Services

H.Marten concluded by reminding the MB of the conclusions of the WLCG OB meeting of 20 March 2006 (below) and asked about the status of the work on general services to replace VO Boxes.

 

The OB endorses the GDB proposals regarding VO boxes as follows:

-          The experiments must not enhance their usage of VO-specific services until the VO box working group has drawn its conclusions.

-          After the final report of this group, decisions will have to be made and a timetable established for the implementation of these services in the general middleware.

-          Until that is done, the Tier-1 centres are requested to allow the deployment of VO-specific services so that the experiments can fully participate in SC4.

-          It is left for the individual Tier-2 centres to decide if they can provide these services or not.

-          Ultimately the OB wishes to see all deployments of VO-specific services replaced by generic middleware.

The overall message is that the OB strongly supports the line of doing things in common and does not support the line of experiment-specific implementations.

 

L.Robertson commented that the future of VOBoxes (and their possible removal) should be discussed when the existing solution is implemented and working.

 

4.      Draft Policy on UK Sites Stopping Stalled Jobs (document) J.Gordon (for T.Doyle)

 

J.Gordon distributed the document (version 0.5) prepared by T.Doyle about efficient usage of site resources.

 

This issue is important because sites are also going to be measured by their efficiency. In the UK the funding agencies ask for efficient usage of resources in order to provide more funds. There are cases in which 20% of the jobs have only 2% efficiency.

 

Sites do not have the ability to easily inform the user, but they must have a means to avoid such waste of resources. By stopping inefficient jobs the site frees more resources for other jobs of the VO.

 

The actions to take are negotiable but some action needs to be taken when some jobs occupy CPU slots but do not use the CPU. The VO will always be informed of the actions taken; if a user email is retrievable the user also will be informed.

 

If a VO knows that the efficiency of its jobs is low, it should inform the site so that the site can run several jobs per core.
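
As a purely illustrative sketch of the criterion discussed above (jobs that occupy a CPU slot but barely use the CPU), the following Python fragment flags jobs whose CPU-time to wall-clock-time ratio stays below a threshold after a grace period. The thresholds, the Job fields and the way running jobs are obtained are hypothetical placeholders, not the GridPP policy; a real site would query its batch system instead.

```python
# Hypothetical stalled-job check: flag jobs whose CPU efficiency
# (CPU time / wall-clock time) stays below a threshold after a grace period.
# Thresholds and field names are illustrative only.
from dataclasses import dataclass

EFFICIENCY_THRESHOLD = 0.02   # e.g. flag jobs below 2% CPU efficiency
GRACE_PERIOD_HOURS = 6.0      # ignore jobs younger than this

@dataclass
class Job:
    job_id: str
    vo: str
    cpu_time_hours: float
    wall_time_hours: float

def stalled_jobs(running_jobs: list) -> list:
    """Return the jobs that occupy a slot but make almost no use of the CPU."""
    flagged = []
    for job in running_jobs:
        if job.wall_time_hours < GRACE_PERIOD_HOURS:
            continue  # too early to judge
        efficiency = job.cpu_time_hours / job.wall_time_hours
        if efficiency < EFFICIENCY_THRESHOLD:
            flagged.append(job)
    return flagged

# The site would then inform the VO (and, if an email address can be derived
# from the DN, the user) before taking any action on the flagged jobs.
```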

 

L.Robertson asked whether this policy is already applied in the UK.

J.Gordon replied that this is being agreed; it was started because the UK sites have asked for standard policies for dealing with stalled jobs, backed by the GridPP policies.

 

D.Barberis asked for the breakdown by site of the ATLAS data in the document and, if possible, to know the behaviour of the jobs that perform inefficiently. Is it a time-out issue, an application problem, or waiting on some resources? ATLAS is interested in debugging these issues; efficiency is very important for the VO.

 

J.Templon asked how the user's email could be retrieved; going from the DN to the email of the user can be a manual and lengthy procedure.

 

I.Fisk added that the sites provide resources as defined in the MoU, and a certain level of job efficiency is something that should be assumed by the sites. Otherwise it looks as if the sites are failing and providing less than the agreed resources.

J.Gordon replied that if such inefficiency had been known it should have been written into the MoU; the sites' procurements are based on efficient usage of the resources.

 

F.Carminati added that there should be a standard escalation procedure with the VO before any drastic action is taken. A VO contact person should be informed, in addition to GGUS tickets submitted.

 

M.Kasemann added that the VO is interested in understanding the reasons of the stalled jobs in order to take actions at the VO level. Therefore any information about stalled jobs is useful to the VO investigations.

 

F.Carminati stressed the fact that if a job is killed by a site the VO must be informed. Otherwise, not knowing that it was stopped by the site, the VO will not debug the job failure.

 

L.Robertson concluded by suggesting that it is good to gain experience for the time being. Sites should interact with the experiments when inefficiency is noticed, and later standard policies will be defined together by sites and VOs.

 

 

5.      High Level Milestones Update (Feedback Received; Milestones) - A.Aimar

 

 

Presentation cancelled. A.Aimar will ask the sites to comment via email on the status of the milestones and will present the summary at one of the next MB meetings.

 

6.      April and May Accounting Report (document) J.Gordon

 

 

Presentation postponed to next MB meeting.

 

7.      OSG Sites Validation and SAM Interface (Slides) R.Pordes

 

 

R.Pordes presented the status and plans for the OSG sites validation and the interface to the SAM sites testing system.

7.1         Scope/Goals

OSG has decided to develop an OSG infrastructure for site validation tests, reporting:

-          locally to the Site Administrators,

-          in a central OSG repository,

-          and with filtered movement of data as needed to WLCG SAM

 

There is a commitment to a common record structure with the WLCG joint monitoring group; the discussions about this structure are still ongoing.
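
Since the common record structure is still under discussion, the following is only a guess at the kind of fields such a test-result record might carry; none of the field names below are agreed with the joint monitoring group.

```python
# Purely hypothetical sketch of a site-validation test result record of the
# kind under discussion with the WLCG joint monitoring group. The field names
# are invented for this illustration and do not represent an agreed schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SiteTestResult:
    site_name: str        # e.g. an OSG resource name
    test_name: str        # name of the probe that was run
    timestamp: datetime   # when the probe ran
    status: str           # e.g. "OK", "WARNING", "CRITICAL"
    detail: str = ""      # free-text output from the probe

# Such records would be reported locally to the site administrators, collected
# in the central OSG (Gratia) repository, and a filtered subset forwarded to
# WLCG SAM.
```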

7.2         Status

The 1st round of OSG probes is in the VDT version that will be released next week.

The OSG system has a scheduling structure using Condor, and the results are collected in a “Gratia” repository.

 

The transfer of information to SAM is ready and is waiting for a joint effort to make this happen and to validate the information sent to SAM. It will be deployed in OSG over the next few months; September is the goal for the first official reporting month.

7.3         Once data is in SAM

Still a lot of effort is needed to check information, compare it to existing SAM information and see how this information feeds into the high level availability summaries reported to the oversight bodies.

 

I.Bird asked whether September is a realistic date for having the OSG data in the SAM repository.

R.Pordes replied that OSG is in principle ready for publishing the data and only has to agree with the SAM team about the scripts that will send the data to SAM.

 

I.Bird also noted that OSG had agreed to present the OSG tests in order to make sure they are equivalent to the ones used by the standard SAM system.

R.Pordes agreed that OSG will talk to the SAM team in order to evaluate the current OSG tests.

 

New actions:

12 July 2007 - L.Robertson will appoint assessors to review the equivalence of the OSG tests and the WLCG test set agreed in April 2006.

R.Pordes agreed to report to the MB about the comparison of the OSG tests with respect to the SAM tests.

 

8.      SRM 2.2 Update (Slides) F.Donno

 

 

F.Donno presented an update on status and plans of the SRM 2.2 implementation and its deployment at the WLCG sites.

8.1         Status of the Tests

All the SRM 2.2 implementations pass the functionality tests; the details are available here: Wiki SRM Tests Page

 

The Use-case test family was enhanced with more tests. The remaining issues are:

-          CASTOR:
Disk1 is implemented by switching off the garbage collector (not gracefully handled by CASTOR).
Proper Disk1 support will be in the next CASTOR version (2.1.4), to be finished in July and deployed in the following month(s).

-          DCache/StoRM:
Very different ways to “reserve” space for the experiments and provide space token descriptions.

 

The stress tests started on all development endpoints, using 9 client machines.
Small server instances are preferred in order to reach their limits more easily.

In parallel, stress-testing activities are also ongoing by the EIS team with GSSD input (an illustrative sketch of such a load test follows the goals below).

 

The initial goals are:

-          Understand the limits of the instance under test

-          Make sure it does not crash or hang under heavy load

-          Make sure that the response time does not degrade to an “unreasonable” level

 

And later the goals will be:

-          Make sure there are no hidden race-conditions for the SRM calls that are the most used

-          Understand server tuning

-          Learn from stress testing
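
As an illustration of the kind of load generation and response-time measurement described above, the sketch below fires concurrent requests at an endpoint and records the latencies so that limits, hangs and response-time degradation can be observed. The srm_request() function and the example endpoint URL are placeholders, not a real SRM client API.

```python
# Illustrative stress-test skeleton: several concurrent clients hit one
# endpoint and response times are recorded. srm_request() is a hypothetical
# placeholder for a real SRM v2.2 client call against the test endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

def srm_request(endpoint: str) -> None:
    """Placeholder for one SRM v2.2 client call (e.g. an ls or a put cycle)."""
    time.sleep(0.01)  # simulate a round trip; replace with a real client call

def timed_request(endpoint: str) -> float:
    start = time.monotonic()
    try:
        srm_request(endpoint)
    except Exception:
        return float("inf")  # count failures as unbounded response time
    return time.monotonic() - start

def stress(endpoint: str, clients: int = 9, requests_per_client: int = 100) -> None:
    """Run clients x requests_per_client calls and report response times."""
    calls = [endpoint] * (clients * requests_per_client)
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(timed_request, calls))
    ok = [t for t in latencies if t != float("inf")]
    if ok:
        print(f"{len(ok)}/{len(latencies)} requests succeeded; "
              f"max response time {max(ok):.3f}s")
    else:
        print("all requests failed")

if __name__ == "__main__":
    stress("httpg://srm-test.example.org:8443/srm/managerv2")  # placeholder endpoint
```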

8.2         First Results of Stress Test Activities

For each implementation the initial results are the following.

 

CASTOR:

-          Race conditions found. Working with developers to address problems.

-          Good handling of heavy-load: requests are dropped if server busy (the client can retry)

-          Response time for the requests being processed is good.

 

dCache:

-          Authorization module crash

-          Server very slow or unresponsive (restart cures the problem)

-          Working with developers to address problems

 

DPM:

-          No failures

-          Good handling of heavy-load: requests are dropped if server busy (the client can retry)

-          Response time for the requests being processed is good.

 

StoRM

-          Response time degrades with load. However, it recovers after the crisis.

-          Working with developers to address problems

 

BeStMan

-          Server unresponsive under heavy load. It does not resume operations when load decreases.

-          Working with the developers to address problems

8.3         Outcome of the Storage Workshop

Participants agreed that there is a need to prove that SRM v2 is “better” than SRM v1.

The metrics for this purpose are:

-          New functionalities (functionality tests) which are available.

-          Stability
Will be proven by stress testing endpoints for a considerable amount of time (one week) and by measuring the server response.

 

Note: At the workshop there was a major commitment from sites and developers to roll out SRM v2.2 in production by the end of 2007.

 

It was noted that sites do not have special resources for the PPS services and the tests; those resources should be taken out of the MoU-agreed resources for a VO.

Until now the experiments have not agreed to take the PPS resources for the tests out of their production resources.

8.4         SRM 2.2 Roll-out Plan

In addition a clear roll-out plan has been agreed.

 

Plan for Tier-1s:

The focus is on the following sites: BNL, FZK, IN2P3, NDGF, SARA, RAL, CNAF, and CERN. Starting with the dCache sites.

 

The plans are to:

-          Differentiate between a testing phase (till October 15th,2007) and a deployment phase (from 15 Oct. 2007 to January 2008)

-          Have dCache test instances properly configured for ATLAS and LHCb by mid-July 2007.
All sites confirmed availability except BNL; SARA is probably also in question, possibly ready by mid-July.

-          Have DESY (running as Tier-1 with tape) and Edinburgh (as disk-only Tier-2) properly configured with dCache by middle of next week.

 

M.Ernst stated that BNL will provide enough hardware to accommodate the tests.

 

Tests will cover:

-          Sustained stress tests on these 2 instances, using several certificates at the same time and performing a mixture of SRM v2.2 requests. Prove stability over a period of at least a week under heavy load. Test to be performed over the summer.

-           SRM v1-v2 interoperability for all possible implementations with high-level tools/APIs, using entries in production catalogues with multiple certificates. Also checking accessibility/manageability of SRM v2 data through SRM v1 endpoints. Test to be performed over the summer.

 

-          Push experiment active testing by next week, July 9th 2007. Start with whatever site is available. (ATLAS with FZK?).

-          Continue stress tests on development instances (not only DESY and Edinburgh) and experiment testing till middle October 2007.

 

L.Robertson asked how the experiment testing will be monitored and executed at the sites. Are the experiments committed to working with so many sites?
F.Donno replied that GSSD will verify which experiment tests have been executed, and the experiments have provided the list of sites.

 

J.Shiers added that CMS will execute their tests and plans to have their metrics in order to verify performance.

N.Brook added that LHCb is interested in trying SRM 2 on the production configuration during the stress tests for transfers and does not want to repeat the SRM tests separately.

-          Roll-out patches as they come out (a strategy for rolling out patches is being established by the developers). GSSD will coordinate installation of patches at key sites.

-          Start deployment of SRM v2.2 at FZK, IN2P3, and DESY October 15th, 2007 (in agreement with the experiments). The other key Tier-1 dCache sites will follow.

 

-          CASTOR at CERN will make available a test endpoint for LHCb testing. This can be properly configured and ready by middle of July 2007.

-          CASTOR will be deployed at RAL for ATLAS by the middle of July 2007. RAL is planning to support LHCb by August 2007. They can also set up SRM v2 endpoints on production instances.

-          CASTOR at CNAF will also make available a test instance for LHCb/ATLAS by middle of July 2007.

-          Stress and experiment tests as for dCache. S2 stress only on development instances.

-          Acquire experience and start wide deployment of CASTOR in production by October 2007.

 

Plan for Tier-2s:

-          Tier-2s using DPM can migrate to SRM v2 as of now. The configuration can be coordinated centrally by GSSD with the input from the experiments.

-          Tier-2s using dCache can upgrade as soon as a confidence level for configuration, management, and operations has been reached at dCache key Tier-1 sites.

-          Tier-2s using StoRM can migrate to SRM v2 as soon as a space token description can be associated to Storage Area (middle July 2007).

-          Roll-out patches as they come out with the coordination of GSSD.

-          Deployment of SRM 2.2 should be completed by end of January 2008.

 

Tools and Support

-          By the 15th of October, new releases of the higher-level tools (GFAL, lcg-utils) will offer the possibility of configuring the version of SRM to use by default. FTS 2.0 already offers the possibility to configure the version of SRM to use per channel.

 

-          Building experience at Tier-1 sites to offer support to Tier-2s

-          Exercise GGUS channels already during the testing phase

-          The developers will be the last level of the escalation process during normal operations but not during the testing and start-up phase.

-          Organize SRM specific tutorials/workshops covering configuration and management for all Storage implementations. This will happen after October 2007.

 

Action:

L.Robertson asked that, for next week, a detailed SRM roll-out plan be prepared with all the activities that have to be executed and the deadlines for sites, developers and experiments.

 

9.      AOB

 

 

 

 

 

10. Summary of New Actions

 

New Action

6 July 2007 - Tier-0 and Tier-1 sites complete the Site Availability Reports for June 2007.

 

New action:

10 July 2007 - L.Robertson asked that, for next week, a detailed SRM roll-out plan be prepared with all the activities that have to be executed and the deadlines for sites, developers and experiments.

 

New actions:

12 July 2007 - L.Robertson will appoint assessors to review the equivalence of the OSG tests and the WLCG test set agreed in April 2006.

R.Pordes agreed to report to the MB about the comparison of the OSG tests with respect to the SAM tests.

 

Issue to follow:

12 July 2007 - Progress of the Job Priorities Working Group and of the working solution for ATLAS. D.Liko has been asked to provide a summary of the status and plans of the JP Working Group, which will be distributed to the MB by I.Bird.

 

The full Action List, with current and past items, will be in this wiki page before the next MB meeting.