LCG Management Board

Date/Time:

Tuesday 27 November 2007 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22188

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 2.12.2007)

Participants:

A.Aimar (notes), D.Barberis, S.Campana, T.Cass, L.Dell’Agnello, T.Doyle, S.Foffano, J.Gordon, J.Knobloch, M.Lamanna, P.Mato, G.Merino, R.Pordes, Di Qing, L.Robertson (chair), J.Shiers, Y.Schutz, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 4 December 2007 16:00-18:00 – F2F Meeting at CERN

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments were received on the minutes of the previous meeting. The minutes were approved.

1.2      Matters Arising – L.Robertson

L.Robertson reminded the MB that it is important that the Experiments participate in the CCRC meetings.

 

ALICE and CMS still have to clarify to CERN how they want SRM 2.2 set up for their space token information.

 

The issues raised by LHCb about the configuration of CASTOR at CERN have been solved.

Ph.Charpentier confirmed that the system now works correctly for LHCb.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disks and tapes until April 2008

 

Not done. Half of the sites have sent their acquisition plans (TW-ASGC, US-T1-BNL, ES-PIC, DE-KIT and FR-CCIN2P3). The missing sites should send theirs to H.Renshall.

 

An update should be provided by H.Renshall at the next meeting.

 

L.Robertson asked SARA, NDGF and ASGC to clarify their estimates for the installation of the 2007 and 2008 capacities:

-       SARA: the tender for the 2007 capacity will be completed in March 2008 and the installations by May 2008. No deadline specified for the 2008 pledges.

-       NDGF: the 2007 capacity should be installed by January 2008, the 2008 capacity by May 2008.

-       ASGC: The 2007 capacity will be installed by January 2008.

 

  • 30 November 2007 - The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.

Received from ASGC and INFN.

ASGC: Min Tsai (mtsai at twgrid dot org), Aries Hung (aries at twgrid dot org)
INFN: Stefano Zani (stefano dot zani at infn dot it)

  • L.Robertson will prepare a note reporting and explaining the current definitions of “availability” and “reliability” with respect to the definitions in the MoU.

Done.

 

3.    CCRC Update (Slides) – J.Shiers

J.Shiers described the status of the actions at the CCRC working group (slide 2).

 

Actions:

-       Conclude on scaling factors – done

-       Conclusions on scope / scale of February challenge (resource limited)

§  Feedback from the sites has been received and the scope can now be defined: (partial) reprocessing at sites with sufficient resources; a read pass over data files, including calibration DB lookup, for those without

 

-       Conclude on SRM v2.2 storage setup details – to complete

§  Experiments to come with details on storage classes by Tier at December F2F

§  This will define what is available for pre-challenge testing in January and the February challenge itself – iteration possible prior to May – practical experience is certainly needed

 

-       ‘Walk-throughs’ by experiments of ‘blocks’ of activities, emphasizing “Critical Services” involved and appropriate scaling factors

§  Essential for sites to understand which are the services involved and the necessary parameters for sizing / configuring the services appropriately

§  Also needed to understand how well we do wrt the 2008 requirements

-       Monitoring, logging and reporting the (progress of the) challenge

§  First discussion during this week’s WLCG Service Reliability workshop, conclude during F2F and GDB (with a possible iteration in January?)

 

-       CDR challenge in December – splitting out ‘temporary’ (challenge) and permanent data

 

-       Other tests that can be done prior to February?

 

 

-       Important issue to discuss: De-scoping? How do we negotiate this if required?

This was also raised by the reviewers at the LHCC Review. It should be discussed at the Overview Board, for instance. Examples: Should reducing the rate be the solution? In equal ratio for all Experiments?

 

In summary, the planning is on track, but it is important that the F2F meeting in December clarifies all open issues about the Experiments’ requirements at the sites, temporary vs. permanent data, and de-scoping decisions.

 

L.Robertson added that the CCRC Planning group should propose a solution for a possible de-scoping.

The group should then present this solution to the MB, which will propose it to the Overview Board and to CERN. If there is agreement at the CCRC meeting, it should be proposed as the agreed solution, avoiding further discussions.

 

4.    SRM Update (Slides) – J.Shiers

 

The SRM v2.2 production deployment is proceeding on schedule without hiccoughs

-       Typically 1 to 1.5 days per (dCache) site, including other housekeeping operations

      NDGF, FZK, SARA done; IN2P3, FNAL (this week), RAL, TRIUMF, BNL, PIC.

-       A log is kept of each upgrade and the issues found are reported.

 

Bugs in the client tools are being tracked by the Engineering Management Taskforce (EMT) with high priority

-       Patch certified; contents already in App. Area – see notes for details

 

Note: The recent schedule has disrupted the weekly SRM management con-calls: there is a real need to restart them from next week.

 

Issue: Space for files recalled does not seem to be well defined (i.e. knowledge about the files is “lost”)

-       Should we agree on a workaround for now, while agreeing on a long-term solution?

-       Standard behaviour is essential across all implementations.
If each implementation chooses a different method, it will be a major problem that is difficult to change at this last stage before the challenges in 2008.

 

Ph.Charpentier reminded the MB that it is important that the Experiments have a clear view of how their disk space is used and of what has been recalled from tape and is occupying their disk pools. This issue was discussed several times in the past with the developers of the different implementations, but may again need a push for a common clarification and a single solution.

 

L.Robertson agreed with the statement and asked where this discussion will take place.

J.Shiers replied that it is in the SRM management meetings that a common behaviour should be agreed, and then implemented. This may involve a short-term solution for February while the longer-term solution is agreed and implemented.

 

J.Templon asked that this also be discussed at, or reported to, the GDB meeting. J.Gordon added that there is usually a GSSD summary at every GDB.

 

5.    Job Priorities Update (Slides) - S.Campana

S.Campana presented an update on status and progress of the work of the Job Priority WG.

 

Status

The Job Priorities mechanism has been tested on SA3 Certification test beds:

-       Installed with a “branched” version of YAIM

-       No remaining issues to report

 

Next Steps

The next steps are:

-       Configuration of JP in 3 PPS sites. At least 1 with Torque+Maui and 1 with LSF

-       Testing with 2 VOs and 2 Roles/Groups per VO, with larger-scale tests over longer periods of time

 

The PPS sites have been installed. Some configuration issues were spotted and fixed promptly. They were not related to JP.

 

The next activities are planned for this week, in collaboration with Andrea Sciabá, in order to make a combined test with CMS and present the result at the GDB next week.

 

6.    GDB Summary (Paper; More Information) - J.Gordon

J.Gordon presented a summary of the issues discussed at the GDB and an update on the Pilot Jobs progress.

6.1      GDB Summary

 

Here is the document that J.Gordon distributed and read at the meeting:

 

Summary from GDB November 2007

 

John Gordon 27/11/2007

 

The main issue at the November GDB was the strategy for multi-user pilot jobs. A separate paper on this will be discussed at the MB.

 

Other issues presented were:-

 

SL4 Oliver Keeble reported on the status of the SL4 and 64-bit releases of the various middleware services. Components are split roughly equally between released; PPS; Certification; Configuration; and Integration. Only CREAM is still at the build stage. Nothing seems critical to make it into production for the CCRC in February.

 

The strategy for 64 bit (x86_64) is prioritised in order: WN; Torque client (distributed with middleware); DPM_disk; UI. Other services will follow depending on the advantage to be gained by 64 bit. The 64 bit WN is undergoing runtime testing.

 

VOM(R)S Maria Dimou reported from the recent workshop at CERN. A new minor release of VOMRS and a major release of VOMS will both go into production in November. Apart from problems with hardware upgrades at CERN, her main concern was the loss of key staff in the next few months and how testing and support would suffer.

 

Accounting Dave Kant reported on recent progress. UserDN accounting has been in place for some time but by default the data does not leave the site. Sites can change this manually but the default will be restored the next time YAIM is run. Changes are required to YAIM to change this. 33 sites are currently publishing and the portal has a user view but we are still waiting for the policy to be agreed before VO Resource managers can be allowed to see their VO’s UserDN data.

 

The VOMS FQAN accounting (Roles and Groups) is included in gLite 3.0.35. This release also contains a checksum of published records both centrally and at the site so a SAM test can highlight discrepancies. http://goc02.grid-support.ac.uk/rss/GRIF_Sync.html

 

The APEL Portal now publishes a report that mirrors Sue Foffano’s Tier2 report. There are still some residual difficulties with sites that belong to one T2 for one VO and a different T2 for another VO. We can’t just report pledged VOs for each T2 as most T2s accept all LHC VOs.

 

Job Priorities Simone Campana reported on progress. The ‘short-term solution’ is being tested on the PPS. The longer term solution will wait for the report by Christophe Witzig on VOMS Authorisation, due early next year.

 

CCRC & GSSD both reported on their meetings at the pre-GDB day but since they also report direct to MB I won’t repeat.

 

December Pre-GDB The December pre-GDB will be a whole day devoted to CCRC. The GDB will have a Monitoring session and a report on this week’s Service and Reliability Workshop.

 

 

6.2      Pilot Jobs Policy Version 2 (PDF)

The MB Policy and its pre-requisites were discussed at the GDB on 7th November and below is the update considering those comments.

 

John Gordon

Les Robertson

Version 2

26 November 2007

 

WLCG policy on pilot jobs submitting work on behalf of third parties

 

The topic of pilot jobs has been discussed several times in the GDB, and in particular at the last two meetings. At the October meeting it was agreed to make a proposal to the MB to adopt a policy requiring that sites support pilot jobs submitting work on behalf of third parties.

 

A summary note was prepared by J.Gordon (17/10/07) and presented to the MB on 23 October. This identified a number of issues and made recommendations for a pilot job policy. After discussion the following policy statement was proposed and endorsed by the MB meeting on 6 November. Minor changes to the text were made following the discussion in the GDB on 7 November.

 

WLCG sites must allow job submission by the LHC VOs using pilot jobs that submit work on behalf of other users. It is mandatory that the experiments' pilot job frameworks use the approved identity-changing mechanism in order to change the job identity to that of the real user, and it is mandatory that the site actually causes the identity to be changed whenever the mechanism is invoked in a valid way. These two requirements are made in order to avoid the security exposure of a job submitted by one user running under the credentials of the pilot job user.

 

Implementation of this policy is subject to the following pre-requisites:

 

1. The identity change and sub-job management must be executed by a commonly agreed mechanism that has been reviewed by a recognized group of security experts. At present the only candidate is glexec, and a positive review by the security teams of each of the grid infrastructures (OSG, EGEE) would be sufficient.

 

2. All experiments wishing to use this service must publish a description of their pilot job frameworks. A positive recommendation to the MB on the security aspects of the framework by a team of experts with representatives of OSG and EGEE is required. The frameworks should be compatible with the draft JSPG Grid Multi-User Pilot Jobs Policy document.

 

3. gLExec testing: gLExec must be integrated and successfully tested with the commonly used batch systems (BQS, PBS, PBS pro, Condor, LSF, SGE).

 

4. LCAS/LCMAPS: the server version of LCAS/LCMAPS must be completed, certified and deployed.

 

The policy will come into effect when the MB agrees that all of the above pre-requisites have been met.

 

In addition J.Gordon summarized the Multi-User Pilot Jobs discussion at the GDB in November.

The points he mentioned at the MB were:

 

-       One error was spotted in the item on review of the experiment frameworks.  Reviewers should consider the entire framework, not just the distributed parts.

-       The JSPG had discussed Multi-User Pilot Jobs. The view of the participants in the meeting room at CERN was that there are significant security risks in not switching identity. We should therefore require identity switching.

-       JSPG decided they should concentrate on the requirements for traceability and logging. These are general requirements which apply not only to multi-user pilot jobs, but also to all other forms of job submission including, for example, Grid portals.

-       John White of EGEE reported that two security experts (Andrei Kruger and Alexander Yu) had reviewed gLExec. They were JRA1 security developers in EGEE. They raised a number of issues, which have been passed to the developer, but no showstoppers.

-       Volunteers were sought to test gLExec with different batch systems and the following were identified: BQS (CC-IN2P3), LSF (CERN), PBS (NIKHEF, CERN), SGE (CESGA), PBSpro (??), Condor (??).

-       The service version of LCAS/LCMAPS will be required for scalability before general deployment but this should not hold up testing with the shared file system version.

-       The frameworks of the 4 LHC experiments need to be reviewed by a small panel. A small group of Ian Bird, Don Petravic, Dave Kelsey and John Gordon were actioned to choose a panel to review the frameworks. The first step should be for the experiments to present documentation of their architectures. The panel will then review this and then interview the relevant experts, perhaps with a questionnaire first. Having all 4 experiments on the panel might make it large but would share experiences.

6.3      Status of the Actions Related to Pilot Jobs

Current status of the pre-requisites from the WLCG policy:

 

-       gLExec must be reviewed by a recognized group of security experts. Status: done

-       Document pilot job frameworks. Status: not all done (only LHCb has done it)

-       Frameworks to be reviewed. Status: team still to be formed

-       The frameworks should be compatible with the draft JSPG Grid Multi-User Pilot Jobs Policy document. Status: not tested

-       gLExec tested with the commonly used batch systems (BQS, PBS, PBS pro, Condor, LSF, and SGE). Status: not tested

-       LCAS/LCMAPS: the server version of LCAS/LCMAPS must be completed, certified and deployed. Status: planned
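
For readers who want to track these items, the status list above could be recorded roughly as follows. This is only an illustrative sketch in Python: the class name, function names and shortened descriptions are hypothetical and not part of any WLCG tool. It treats an item as met only when its status is "done". The policy itself comes into effect only when the MB agrees that all pre-requisites have been met, so the check below is a bookkeeping aid, not the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prerequisite:
    """One pre-requisite of the pilot jobs policy (hypothetical helper type)."""
    description: str
    status: str  # free-text status as recorded in these minutes


# Statuses transcribed from the list above (MB of 27 November 2007).
PREREQUISITES = [
    Prerequisite("gLExec reviewed by security experts", "done"),
    Prerequisite("Pilot job frameworks documented", "not all done"),
    Prerequisite("Frameworks reviewed", "team still to be formed"),
    Prerequisite("Frameworks compatible with draft JSPG policy", "not tested"),
    Prerequisite("gLExec tested with common batch systems", "not tested"),
    Prerequisite("Server version of LCAS/LCMAPS deployed", "planned"),
]


def outstanding(prereqs):
    """Return the pre-requisites whose status is anything other than 'done'."""
    return [p for p in prereqs if p.status != "done"]


def all_met(prereqs):
    """True only when no pre-requisite is outstanding."""
    return not outstanding(prereqs)


if __name__ == "__main__":
    for p in outstanding(PREREQUISITES):
        print(f"OPEN: {p.description} ({p.status})")
    print("All pre-requisites met:", all_met(PREREQUISITES))
```

With the statuses recorded at this meeting, five of the six items remain open.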

 

Ph.Charpentier asked by which date the panel for reviewing the Experiments’ frameworks will be formed.

J.Gordon replied that the goal is to have the panel defined by the next GDB on 5 December.

 

7.    AOB

 

 

Storage Accounting - L.Robertson

Postponed to the F2F meeting next week.

 

8.    Summary of New Actions

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.