LCG Management Board

Date/Time:

Tuesday 29 January 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27467

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 31.1.2008)

Participants:

A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, T.Hesselroth, E.Laure, M.Livny, H.Marten, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, R.Quick, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 5 February 2008 16:00-18:00 – F2F Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      High Level Milestones Update

The High Level Milestones have been updated.

 

R.Pordes asked to add the milestone for OSG RSV tests in the High Level Milestones dashboard.

 

Update: Here is the High Level Milestones dashboard (PDF file) updated after the meeting.

1.3      Quarterly Reports Preparation (Previous QR Report)

A.Aimar reminded the MB that the QR report for the quarter (Nov 07 - Jan 08) is being prepared and that he will ask for contributions about:

-       LCG Services (J.Shiers) and GDB Activities (J.Gordon).

-       Sites will comment on their due milestones and on the results below target (i.e. values in “red” in the dashboard and in the reliability reports).

-       ARDA, DB, Applications Area for their milestone plans.

-       Experiments will present a short summary at the MB that A.Aimar will report in the QR.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 21 Jan 2008 - The LCG Office should define where (a web area, wiki, share point?) the Sites can upload their statistics about their tape storage performance and efficiency.

Not done.

Open issue to clarify:

J.Templon asked for some clarification about the NDGF tests but there was no representative.

 

New Action:

31 March 2008 - OSG should prepare site monitoring tests equivalent to those included in the SAM testing suite.

J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

1.    CCRC08 Update (CCRC'08 Service Issues; CCRC'08 Site Concerns/Issues; CCRC'08 Wiki; Draft F2F agenda; Paper) – H.Renshall

 

 

The attached paper, which H.Renshall read to the MB, is reproduced here.

 

Papers concerning current issues of concern, discussion and ongoing
work are attached to the agenda. They cover service issues and site
concerns. We also attach a link to the ccrc08 Wiki which has links
to the agenda and minutes of the daily and weekly meetings and a draft
agenda of the next face-to-face ccrc08 planning meeting, the pre-GDB
meeting of 5 Feb when phase 1 should have started.

The ccrc08 Wiki now includes a tabular calendar (thanks to P.Mendez)
of major activities where we have invited experiments to send us their
additions such as other significant activities, software release dates
etc. This calendar will be extended to cover the whole year. It is more
lightweight than the correlated experiment and sites view in the SC4
experiment plans Wiki and is probably more appropriate for 2008.

There is a new issue of Tier 2 sites migrating to srm2, where there was
said to be a correlation with an OSG rollout in the US. ATLAS would like
as many Tier 2 sites on srm2 as possible. The FTS team think about 100
already are, but we will check. Also WLCG is proposing that regional or cloud Tier 2
coordinators be appointed to ensure bi-directional information flow.
A mailing list for them, and also for database service coordinators,
has been set up. This will be followed up at the Feb 5 pre-GDB.

The DM group have now proposed missing metrics for the conditions databases
(attached to the Jan 28 weekly ccrc08 meeting) and experiments and Tier 1
are invited to comment.

Among the service issues, the second SRM point (choosing the correct pool
for a bring online operation in dCache) is seen as not being resolved for
the February run. It is not likely to have a big impact, as Tier 1
reprocessing is not expected to be strongly exercised (lack of cpu
resources and lower priority), but it must be resolved for May. There is
an associated service interventions Wiki where we will be adding storage
systems intervention plans.

The Feb 5 pre-GDB meeting will review the remaining service, site and
experiment concerns and issues.

 

Ph.Charpentier reminded the MB that the LHCb processing also takes place at the Tier-1 sites; therefore the problem of choosing the correct dCache pool for “bring online” is also crucial for the LHCb Tier-1 operations.

LHCb are currently having problems with this processing, and for LHCb this is a show stopper. If these functions, which are needed for analysis tasks, are not fixed, only the transfers to Tier-0 and to Tier-1 sites will be tested.

 

J.Templon added that it seems the T1D0 pools cannot be protected from read access. If this access cannot be limited, such operations could block the site by tying up its tape equipment. H.Renshall replied that this issue should be reported and verified.

 

H.Renshall noted that this was supposed to be a week of “software stability” before CCRC, but updates are still being applied (e.g. CASTOR for ATLAS). J.Gordon added that an FTS update also took place this week.

 

2.    LHCC Referees Meeting (Agenda) - I.Bird

 

 

The “Experiments' status and progress” talk still needs a speaker. If nobody volunteers this week, I.Bird will propose a speaker.

 

New Action:

5 Feb 2008 - I.Bird will find a speaker for the Experiments Status at the LHCC Referees Meeting.

 

Tuesday 19 February 2008

12:00->12:55    Status of CCRC08

12:00    Status and progress of sites & services (20') - Jamie Shiers (CERN)

12:20    Experiments' status and progress (20')

12:40    Summary of SRM v2.2 deployment (15')

12:55->13:55    Castor2 and storage metrics

12:55    Proposed metrics for Castor performance and reliability (15')

13:10    Metrics for T0 + T1 MSS performance (15'): how we monitor drive and system performance - small files, etc.

13:55->14:55    Overview of Tier 1 status for all Tier 1s

Description: One talk with comparison tables; no need to cover T1s already examined in comprehensive review

 

 

3.    OSG Update: RSV, Site Availability and Storage Services (RSV OSG SAM; Storage) – R.Quick and R.Pordes

 

3.1      OSG RSV and SAM Availability Plots Update – R.Quick

In the last couple of weeks OSG started:

-       Uploading to SAM the RSV records received from the US Tier-2 production resources listed in the WLCG MOUs with ATLAS and CMS.

-       Reporting 7 sites so far using an automated transport.

The logs on the SAM side show that the uploads are steady and are arriving correctly.

 

The SAM web interface still does not present the expected results. Joint debugging is actively taking place (see slides 4 and 5 for details).

 

From testing and checking the transmission logs it seems that there are no issues with the transport layer.

OSG and SAM/GridView are working together to solve the issue.

 

The RSV Storage Probes will be available in the first full week of February and should then be tested and interfaced to SAM so that SAM can publish the results.

 

I.Bird asked about the tests of the Information Services (BDII or the IS used).

R.Quick replied that only a BDII server is used and that a BDII test will be available by March.

 

M.Livny asked whether this test should be available sooner, since OSG can change the priorities for the RSV tests.

I.Bird replied that the tests are needed as soon as possible.

 

J.Templon added that the Information System test checks whether the Information System is correctly identified and whether it replies correctly.

 

M.Livny asked whether the definitions of availability and reliability had been discussed and reviewed at the MB.

J.Gordon replied that the reliability algorithms were approved and had been updated a few months ago and are available in the minutes of the MB.

 

J.Templon agreed to check the equivalence of the tests and to discuss whether all current tests are needed in the OSG.

 

I.Bird added that the VO-specific tests will become important but are not ready for the moment. However, OSG should plan to be able to support VO-specific tests.

 

J.Templon added that VO-specific test failures cannot be considered a metric for the reliability of the site only, but also of the VO software. In some cases the failure of a VO-specific test is not caused by the site; this should be made clear and should not affect the reliability of the site.

 

I.Bird concluded that J.Templon and D.Collados should do the same review as they did for the NDGF tests: check the equivalence of the tests and report to the MB for approval.

 

R.Pordes asked that a milestone be added to the High Level Milestones dashboard for completing the integration of the RSV tests into SAM.

3.2      OSG Storage Services for WLCG and CCRC

R.Pordes summarized the services that OSG provides to the WLCG and to the Experiments.

 

OSG and ATLAS and CMS

US ATLAS and US CMS Tier-1s (BNL, Fermilab) install and support their own SEs.

-       The Facility managers (Ian/Jon, Michael) have upgraded to dCache 1.8 and work within the GSSD for testing and CCRC for the upcoming computing challenges.

-       Both US ATLAS and CMS have recently added effort to these activities.

 

US ATLAS and US CMS Tier-2s receive installation and support from the OSG VDT for dCache.

-       Significant testing of dCache 1.8 has been done within the VDT in good communication with the GSSD and developers - with many bugs and issues found and resolved.

-       CMS sites rely on the replica manager.

-       ATLAS Michigan and Indiana/Chicago sites install dCache. Others are testing with xrootd as well.

 

OSG sites offer opportunistic storage. For all OSG VOs they will support space reservation as the first step in supporting opportunistic storage.

 

Release Candidates

OSG has published release candidates for dCache 1.8 from VDT caches. Another RC is due this weekend.

The milestone was to release a VDT/OSG version with dCache 1.8 by 15 February 2008. OSG is currently about one week behind schedule for this delivery.

They plan to release the needed information services in steps. Static information needed by the RB will be released in Feb.

 

The client packages work against BestMan and dCache. They are tested against STORM and DPM, but there are no SEs with these implementations on OSG sites.

 

Information srm/dCache 1.8

OSG has been part of the information attribute discussions within the GSSD and development groups.

They are currently integrating the changes (for static and dynamic information) into the OSG GIPs, and these will be included in the February release.

 

SE Implementations in OSG 0.8.X and OSG 1.0

OSG 1.0 is delayed past the original February milestone due to the volume of VDT support and the number of features and updates that need to go into this major release (e.g. new platforms, dependence on system ssl, voms/voms-admin upgrade).

 

They will release OSG 0.8.2 by the end of February to include dCache 1.8.

 

Some Tier-2s are installing dCache 1.8 from the developers’ web site with space reservation disabled (meeting WLCG MOU). The problems and issues they are finding are well coordinated with the VDT work. The VDT Storage group supports and works with those sites that install dCache 1.8 early.

 

4.    Interim User Accounting Policy (Policy; Slides) - J.Gordon

 

J.Gordon presented the status of user level accounting and the current plans.

4.1      Background Information

The Experiments expressed their need for user level accounting. But first, many legal issues concerning the handling of personal data had to be discussed and analyzed.

 

The GDB proposed encryption of user data with access restricted to a ‘VO Resource Manager’ (VRM). This role is distinct from the VO Manager, which is a mainly administrative role controlling who joins the VO.

 

A policy for access to user data was to be developed and has to be signed by the VO Resource Manager.

 

APEL already collects UserDN information from jobs and stores it locally at the site. A site can switch on publishing of this data to the central repository. The transfer is encrypted as is the subsequent storage.

 

The Accounting Portal has developed views to allow the VO Resource Manager to view UserDN data for their VO. However, there is not yet a policy in place for them to sign (a draft document is available upon request).

4.2      Proposed Interim Solution

The proposal is to have an interim solution approved by the MB. ATLAS would like to be granted access to their user data. ATLAS also use VOMS Roles/Groups (a related issue).

 

Under the interim policy (see Policy Document below), selected ATLAS people must agree to the policy before being given access to ATLAS data. Sites will be informed that, if they publish UserDN data, the ATLAS VRM will have access to the ATLAS data.

 

This can be extended to other VOs if really required. This should be a short-term solution so we ask the policy people to expedite the agreement of a formal policy.

 

The MB is asked to

-       AGREE the interim policy document (see agenda)

-       INFORM WLCG sites of the access ATLAS will have

-       RECOMMEND that they publish UserDN data subject to these restrictions

 

For the moment, access via the web portal is given to individuals (via the certificates of specific users). There is no interface between certificates and VOMS roles.

 

D.Barberis noted that ATLAS needs this data and that publishing it should not be “optional” for the sites.

J.Gordon replied that for the time being a few sites are sufficient, in order to test the collection of the data.

 

J.Gordon then read the Policy Document for comments from the MB.

 

D.Barberis asked that the MB be given a week to review the document and that the approval be discussed at the F2F MB meeting the following week.

 

WLCG Interim Policy Document on User Level Accounting

John Gordon, STFC, 29 January 2008

This Policy governs access to data in the APEL Accounting Portal http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.html

This portal holds cpu data for jobs run on EGEE sites for EGEE VOs and for WLCG VOs from the NorduGrid Tier1 and the Open Science Grid (OSG). EGEE sites also publish information on the VOMS Roles and Groups under which jobs run. They also have the ability to publish the identity of the user who ran the job.

This document lays out the access control to the data held in the portal.

1. Aggregated data at the VO level (summarised per VO, per site, per month) is public information and requires no access control.

2. Access to VO group/role aggregated data should be restricted to members of that VO.

3. The aggregated data of an individual user must be properly protected. All user data in the database is anonymous in the sense that the user data must not easily be connected to a user name.

4. Access to a portal that allows the decoding of the anonymised name into a person’s DN is restricted to individuals in the VO appointed to be "VO Resource Managers".

5. It is strictly forbidden for VO Resource Managers to expose any user’s name to any un-authorised person. The personal information must be handled with care. The user-level accounting is only to be used to allow the VO resource manager(s) to understand and control how many and which individuals within the VO, group or role are using resources.

6. A user should have access to summaries of their own jobs

7.  Site System Administrators have access to summary data about jobs run at their site. This includes the UserDN and VOMS information. The System Administrators should not make information obtained through the portal available to any un-authorised person.

Notes:

1.      Access control listed here is implemented according to the certificate loaded into the browser when viewing the portal.

2.      The identities of members of a VO are obtained from the relevant VOMS Server.

3.      VO Resource Managers and Site Administrators will only be given appropriate access to the portal if they affirm that they have read this document and will abide by the relevant restrictions

4.    Site Admins already have access to the raw accounting and identity information at their sites. The portal is only giving them an easier way of viewing it. One hopes that ethical policies are in place at sites to prevent them making inappropriate use of user data obtained directly at the site.
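The access rules in points 1 to 7 of the policy can be summarised as a small decision function. The following Python fragment is only an illustrative sketch and not the portal's actual implementation: the names `View`, `Requester` and `may_view` are hypothetical, and it assumes that the requester's identity, VO membership and roles have already been derived from the browser certificate and the VOMS server, as described in Notes 1 and 2.

```python
from dataclasses import dataclass
from enum import Enum, auto


class View(Enum):
    """Hypothetical names for the kinds of data listed in the policy."""
    VO_SUMMARY = auto()          # point 1: per VO, per site, per month
    GROUP_ROLE_SUMMARY = auto()  # point 2: VO group/role aggregates
    USER_DN_DECODE = auto()      # points 4-5: anonymised name -> DN
    OWN_JOBS = auto()            # point 6: a user's own job summaries
    SITE_JOBS = auto()           # point 7: jobs run at a given site


@dataclass(frozen=True)
class Requester:
    """Identity assumed to come from the certificate and VOMS (Notes 1, 2)."""
    dn: str
    vos: frozenset = frozenset()       # VOs the requester is a member of
    vrm_for: frozenset = frozenset()   # VOs for which they are VO Resource Manager
    admin_of: frozenset = frozenset()  # sites they administer
    accepted_policy: bool = False      # Note 3: has affirmed the policy


def may_view(req, view, vo=None, site=None, subject_dn=None):
    """Return True if the policy, as read here, allows this view."""
    if view is View.VO_SUMMARY:
        return True                                   # public information
    if view is View.GROUP_ROLE_SUMMARY:
        return vo in req.vos                          # members of that VO
    if view is View.USER_DN_DECODE:
        return req.accepted_policy and vo in req.vrm_for
    if view is View.OWN_JOBS:
        return subject_dn is not None and subject_dn == req.dn
    if view is View.SITE_JOBS:
        return req.accepted_policy and site in req.admin_of
    return False


# Example with made-up identities.
vrm = Requester(dn="/DC=example/CN=Some VRM", vos=frozenset({"atlas"}),
                vrm_for=frozenset({"atlas"}), accepted_policy=True)
member = Requester(dn="/DC=example/CN=Some Member", vos=frozenset({"atlas"}))

print(may_view(member, View.GROUP_ROLE_SUMMARY, vo="atlas"))
print(may_view(member, View.USER_DN_DECODE, vo="atlas"))
print(may_view(vrm, View.USER_DN_DECODE, vo="atlas"))
```

In this reading, a VO Resource Manager or Site Administrator who has not affirmed the policy is treated like any other VO member, which is one possible interpretation of Note 3.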

 

 

Ph.Charpentier asked how an “authorized” person is defined.

I.Bird replied that the Experiments will define their authorized people, but they should strictly protect the user information.

 

Ph.Charpentier asked who is accounted for when the jobs are run using gLExec.

J.Templon replied that the jobs will be attributed to the person running the pilot job. The Experiments had agreed to do their own accounting in that case.

 

Ph.Charpentier asked if this policy applies to VO-level user accounting. Should the Experiments use encrypted UserDNs for their own accounting?

J.Gordon replied yes, the accounting by the VOs also must follow this policy in order to protect user privacy.

 

R.Pordes added that OSG needs the list of security requirements and information on data retention periods. They are outsourcing the development of the OSG accounting to an external company, so this information needs to be formally specified.

J.Gordon replied that the list of requirements is being defined; it will be completed for MB approval and then provided to the OSG.

 

M.Livny added that OSG needs to know the privacy requirements in order to understand whether they want to be responsible for storing and retaining the data.

I.Bird replied that, if OSG is ready to provide the data, it should not be a problem to keep the information in the EGEE repository.

 

J.Gordon agreed to include a data retention policy in the policy for user level accounting.

 

I.Bird proposed that:

-       The MB should approve the interim solution at its meeting next week, and

-       A future GDB, possibly the GDB next week, should discuss the long-term policy.

 

5.    AOB

 

 

No AOB.

 

6.    Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.

 

5 Feb 2008 - I.Bird will find a speaker for the Experiments Status at the LHCC Referees Meeting.

 

31 March 2008 - OSG should prepare site monitoring tests equivalent to those included in the SAM testing suite.

J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.