LCG Management Board

Date/Time:

Tuesday 14 November 2006 at 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=a0632704  

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 16.11.2006)

Participants:

A.Aimar (notes), D.Barberis, S.Belforte, I.Bird, K.Bos, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, Di Quing, C.Eck, B.Gibbard, J.Gordon, F.Hernandez, M.Kasemann, J.Knobloch, H.Marten, G.Merino, B.Panzer, H.Renshall, L.Robertson (chair), M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 21 November from 16:00 to 17:00, CERN time

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments. Minutes approved.

1.2         Priorities in the FTS development (slides) - M.Schulz

Follow-up of the SRM 2.2 discussion at the MB of 7 November, covering the priorities of FTS development (support for SRM 2.2, the FTS dashboard, etc.).

1.2.1          Priorities

The top priorities for FTS are:

-          Keep the FTS service running, because it is used in the Service Challenge and by the VOs.
This includes monitoring the FTS instance at CERN and troubleshooting the whole “data moving” service.

-          Develop some monitoring tools for the service.

-          FTS Version 2, which includes support for SRM 2.2.

1.2.2          Status

Currently two people are working on the FTS (P.Badino, G.McCance) and they provide:

-          3rd level support

-          training for 2nd level support

-          procedures for 1st level support

-          bug fixing and analysis of problems in the LCG data moving service

-          development of FTS monitoring tools

-          development of FTS Version 2, for only 10-20% of their time.

 

Additional part-time help and user support is also provided by a few others (see slide 3).

 

The next steps for the FTS team are:

-          Concentrate the support and service work on G.McCance

-          P.Badino will focus on FTS 2.0 and the development of the SRM 2.2 support in FTS

-          Monitoring will be done by the Indian team and P.Millar

-          Train new team member arriving next month

 

Currently the status is:

 

-          The SRM client libraries are coded and will go to unit testing this week.

 

-          FTS 2.0 will be ready in production by end of January 2007 (tested, certified and deployable).

 

-          A prototype of FTS 2.0 will be available to the experiments in the 2nd week of December.

1.2.3          SRM 2.2 Tests

In view of the delays in development of the SRM 2.2 support by FTS, until mid-December user access to the SRM 2.2 endpoints is limited to gfal and lcg-utils.

 

General tests of SRM 2.2 are constantly run (by F.Donno) using the S2 test suite developed at RAL. Here are the results of the tests (executed every 30 minutes):

http://gdrb02.cern.ch:25000/srms2test/scripts/protos/srm/2.2/basic/history

 

Tests by the VOs of SRM 2.2 can start now. There are SRM-2.2-compatible RPM files for lcg-utils and gfal. ATLAS (S.Burke) and LHCb (G.Cowan and A.Osorio) have started testing SRM 2.2.

 

Slide 6 shows the status of the SRM endpoints. No endpoint is completely working (green) and this indicates that there is still a lot of work to get functioning SRM 2.2 endpoints at the sites.

 

R.Tafirout: How do we deploy the new FTS version while the current servers are used in production?

M.Schulz replied that a second service should be installed, the requests on the old FTS drained, and the load then switched to the new FTS.
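The drain-and-switch upgrade described above can be illustrated with a toy model (all class and method names here are hypothetical, for illustration only; real FTS administration tooling is not shown): a second instance is installed, the old instance stops accepting new transfer requests, in-flight transfers are allowed to finish, and clients are then repointed at the new instance.

```python
# Toy sketch of a drain-then-switch service upgrade (hypothetical names).

class TransferService:
    def __init__(self, name):
        self.name = name
        self.accepting = True   # whether new submissions are allowed
        self.in_flight = []     # transfers currently running

    def submit(self, transfer):
        if not self.accepting:
            raise RuntimeError(f"{self.name} is draining; use the new instance")
        self.in_flight.append(transfer)

    def drain(self):
        """Stop accepting new requests; existing transfers keep running."""
        self.accepting = False

    def wait_until_empty(self):
        """Placeholder: in reality, block until queued transfers complete."""
        self.in_flight.clear()


def switch_over(old, new, alias):
    old.drain()                    # 1. stop new submissions on the old service
    old.wait_until_empty()         # 2. let in-flight transfers finish
    alias["endpoint"] = new.name   # 3. repoint clients to the new service
    return alias


old_fts = TransferService("fts-1.x")
new_fts = TransferService("fts-2.0")
alias = {"endpoint": old_fts.name}
alias = switch_over(old_fts, new_fts, alias)
print(alias["endpoint"])  # fts-2.0
```

The key property of this scheme is that no transfer is interrupted: the switch happens only once the old instance's queue is empty.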

 

L.Robertson proposed that, in order to monitor the progress, every 2-3 weeks a short update on the FTS is presented to the MB.

 

J.Gordon asked why the FTS work was not scheduled earlier; the work was known about, and the current delay could have been avoided.

M.Schulz replied that the original schedule was slowed down by the involvement of the FTS developers in supporting and maintaining the existing service.

J.Gordon commented that nevertheless the MB should have been warned of these delays.

 

Ph.Charpentier asked what happened to the changes agreed at the Mumbai workshop. Are they included in FTS 2.0?

M.Schulz replied that the changes agreed in Mumbai are included in the FTS 2.0 release.

1.3         Disk cache discussion

The note distributed by B.Panzer last week (see note) was discussed at length, with several emails on the MB mailing list.

 

Therefore F.Carminati proposed the launch of a working group on “Caching and Data Access”, following the pattern of the Baseline Services working group. This should be a group of experts who will provide (1) an assessment of the situation, (2) recommendations on what is available and (3) proposals for what can be done now and in the future.

 

K.Bos noted that there may be an overlap with the Storage Classes working group.

F.Carminati replied that there is the need to look at what is available now, and define the next practical steps. The Storage Classes working group started from a more general approach.

 

N.Brook noted that experiments should be invited to the working groups and informed of the meetings where their input is needed (he was referring to the SRM and Storage Classes pre-GDB meetings).

 

J.Templon stated that sites, experiments and developers should all be represented by the same number of people. Past groups were sometimes too unbalanced and lacking representatives of one of the parties involved.

 

S.Belforte asked that the MB prepare a clear set of questions for this WG to answer.

 

Action:

21 Nov 06 - F.Carminati agreed to prepare and propose (mail to the MB) the mandate, participation and goals of the working group on “Caching and Data Access”.

1.4         Status of the Site Monitoring Tools working groups – I.Bird

Names of candidates for coordination and participation in the three groups (Site Management, Monitoring and System Analysis).

 

The names proposed for the coordination of the working groups are:

-          Monitoring Tools and Dashboard: J.Casey and I.Neilson

-          System Analysis: J.Andreeva; it would be good to have a co-chair.
Several people have offered assistance: IN2P3, P.Millar, the EGEE Logging and Bookkeeping group, the RB team at Imperial College, and the GridIce team. It seems that there are enough people to form this group.

-          Site Management: Hepix proposed IT/GD for leading such a group. I.Bird will go back to Hepix reporting that the coordination of this WG is still unassigned, and asking for suggestions and volunteers from Hepix.

 

Action:

21 Nov 06 - I.Bird will prepare the mandate, participation and goals of the working groups on “Monitoring Tools” and “System Analysis”.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 13 Oct 2006 - Experiments should send to H.Renshall their resource requirements and work plans at all Tier-1 sites (cpu, disk, tape, network in and out, type of work) covering at least 2007Q1 and 2007Q2.

 

Ongoing. H.Renshall presents the situation at this MB.

 

  • 27 Oct 2006 - The MB members should send to I.Bird names of candidates for coordination and participation in the three groups (Site Management, Monitoring and System Analysis).

 

Done.

 

3.      Resource Requirements from the Experiments (slides, document) - H.Renshall

-          Summary of the requirements received for 2007.

-          Proposal of a standard table for representing the requirements.

 

 

 

H.Renshall provided a summary of the experiments’ requests for 2007Q1 and Q2.

 

Only FZK, NIKHEF and PIC have updated their resource and procurement plans. The other sites should send their information urgently.

 

The Tier-1 percentage shares do not add up to 100% of the experiment requirements, but the values are converging towards C.Eck’s MegaTable totals for 2007.

 

H.Renshall also provided an example of how experiments should report their requirements (using the report from LHCb).

3.1         ALICE

The data was sent by F.Carminati on the 20th October. The total disk space is half the total mentioned in the values given to C.Eck; this needs to be understood.

 

ALICE also mentioned their US Tier-1 site requirements, but the site and commitments have not yet been decided.

3.2         ATLAS

The data is as received from G.Poulard on the 13th November. ATLAS plans 40M events in 2007Q1 and 80M in Q2, with 40% of them to be generated at the Tier-1 sites. The disk and tape requirements for this activity were added to the 2006Q3 base.

 

The Tier-0 data export (Jan 2007) and the full-scale rehearsal (Jun/Jul 2007) are not included, because they use scratch resources that are going to be reused.

3.3         CMS

The data was received from M.Ernst on the 10th November. CMS will generate 30M events per month at a steady rate, 20% of which will be at Tier-1 sites. They also noted that their requirements are within the original pledges, so they should not be a problem for the sites. The events will be copied to tape at the Tier-1 where they are generated; therefore disk staging must be available for disk-tape activities. H.Renshall left the CPU and disk values unchanged and added a tape requirement of 75 TB/month. CMS should confirm this estimate.
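As a plausibility check on the figures above (my arithmetic, not stated in the minutes): 75 TB of tape per month for 30M events per month implies roughly 2.5 MB per event.

```python
# Back-of-envelope check of the 75 TB/month tape estimate for CMS
# (the per-event size is implied by the two figures, not given in the minutes).
events_per_month = 30e6          # 30M events/month, steady rate
tape_bytes_per_month = 75e12     # 75 TB/month (decimal terabytes assumed)

bytes_per_event = tape_bytes_per_month / events_per_month
print(bytes_per_event / 1e6)     # 2.5 (MB per event)
```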

3.4         LHCb

The data was received from N.Brook at the end of September but was unchanged from the previous data. They show both the incremental and total requirements for each resource. They have a clear estimate only until April 2007; therefore information for the months after April is needed.

 

The format used by LHCb was proposed as the standard for all experiments. (see spreadsheet)

 

3.5         CONCLUSIONS

For 2007Q1 the global CPU and tape match the requests and are sufficient, but the resources are very unevenly distributed over the sites. The increments from Q1 to Q2 are small, mostly for ATLAS.

 

The major request in 2006Q4 is the request for tape space by ALICE.

 

The available disk is about 80% too low (!!); therefore the site procurement plans for 2007 are needed urgently from all the grid sites.

 

Note: the data will be used to update the ExperimentPlans wiki pages.

 

J.Templon asked that the experiments specify total “cumulative” amounts. Incremental values are confusing because one must know all of the history of the request updates in order to understand how much to purchase.
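J.Templon’s point can be made concrete: with incremental figures, the total a site must procure at any quarter is the running sum of every earlier increment, so all past updates must be known. A minimal sketch, with illustrative numbers only (not from the minutes):

```python
from itertools import accumulate

# Hypothetical quarterly disk increments (TB) as an experiment might report them.
increments = {"2006Q4": 100, "2007Q1": 50, "2007Q2": 30}

# The cumulative totals a site must actually provide are the running sums;
# losing any one earlier increment makes the later totals wrong.
cumulative = dict(zip(increments, accumulate(increments.values())))
print(cumulative)  # {'2006Q4': 100, '2007Q1': 150, '2007Q2': 180}
```

Reporting the cumulative column directly, as LHCb does alongside the increments, removes this dependence on the update history.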

 

J.Gordon asked H.Renshall to update the values from RAL.

 

H.Marten suggested using the accounting summaries provided by the sites every month.

 

J.Templon noted that experiments should specify what is needed and when they are going to use it.

 

N.Brook noted that 2007Q3 and Q4 should also soon be estimated by the experiments and sites, because more resources will also be needed in Q3 and Q4 next year.

 

Action:

25 Nov 06 - Sites should send to H.Renshall their procurement plans by end of next week.

 

4.      GDB Wrap-up and Next Steps - K.Bos

 

 

The GDB took place on the 8th November 2006. See also the meeting agenda and the Summary by K.Bos.

4.1         GLite CE

The CE strategy was discussed at the GDB, starting from the strategy agreed by the EGEE TCG. The GDB agreed with the TCG that the work should focus on getting the gLite CE into production as soon as possible.

 

The LCG CE will not be ported to SL4 or further developed, but it will be maintained until July 2007.

 

J.Templon said that until July 2007 the new components will be compatible with the LCG CE, so that they can be introduced smoothly.

 

Ph.Charpentier said that SL3 software runs on SL4. M.Schulz replied that the dependencies on external packages will require some configuration shortcuts for the libraries, and this will probably make execution on SL4 insufficiently reliable.

 

M.Schulz noted that the “GT4-pre-WebService-GRAM job manager” will have to be phased-out over the next 6-8 months.

 

J.Gordon asked what happens to the “GT4-WebService-GRAM” solution (not “pre”), which is used in the UK and is based on Globus. The UK (and OSG) will have to maintain resources to support it.

I.Bird replied that this point was discussed at the TCG. The new gLite CE will still enable workload managers designed to use Globus, but the recommended solution (by EGEE JRA1) is to use CondorC as the interface. Cream will follow the OGF interface for job submission, so, with Cream, the GT4 compatibility will be removed and only the ICE interface will be supported.

4.2         ALICE Security Model

The discussion focused on the authorization scheme proposed by ALICE. Concern was raised on whether this supports “incontrovertible identification” of the user by the site. A.Peters said, at the GDB, that he understood the concern of the sites and he will propose a suitable technical solution by email.

 

I.Bird noted that this is a second security model to support, which will be a continuing operational load.

F.Carminati said that the only issue open was access to and ownership of data.

 

The solution that will be proposed by A.Peters will provide the real user’s certificate to FTS, which will solve the problem. The detailed proposal is awaited.

 

F.Carminati said that no other changes will be requested to support this new security model. ALICE will use only xrootd to access the data. [The xrootd implementations that are being developed for DPM, dCache and Castor support only the ALICE security model, and so will not be usable by other experiments.]

4.3         Glexec

At the GDB J.Templon explained how glexec works: it basically allows pilot jobs to change the identity of the user. The glexec solution has been asked for by LHCb and ALICE, but many sites are not in favour of such a solution. It will still require at least 6 months to complete the prototyping and testing phase, and it is clear that it will take time to agree on a strategy for deploying it as a general facility.

 

L.Robertson mentioned that there should be a clear list of the relative priorities of the new features that are being requested, in order to understand how to plan resources.

 

J.Templon said that this work is being developed by EGEE (at NIKHEF), and also for other purposes (OSG, etc.). Therefore the priorities and issues should also be discussed (Spring 2007) in the context of the EGEE TCG.

 

N.Brook asked for a clear survey of the sites. It must be known whether it is worth waiting for a solution involving the use of glexec or not. M.Schulz replied that a survey was made by A.Forti and is available in the TCG minutes; the response of the sites was rather negative about the glexec solution. Link to the TCG survey.

 

N.Brook said that LHCb needs to have the Tier-1s supporting glexec.

 

J.Templon noted that glexec can also be run without changing the job’s identity, only doing the user bookkeeping.

N.Brook said that in this case LHCb would do the user level accounting themselves.

F.Carminati asked that if glexec is used at a site, then glexec should be interfaced to the local site accounting. It is believed that this is not supported by most batch systems, and the MB did not support any developments that would be required for this.

 

J.Gordon questioned whether, in view of the discussion, general user level accounting is still urgent for the experiments. 

 

The MB recommended that the GDB in January should come back on the issue of accounting and glexec.

 

4.4         Storage Classes

The Storage Classes group wants to continue its work. A number of sites presented their models at the November pre-GDB meeting, with several differences noticed. More sites should present their models, and the experiments should be present in the discussions - only G.Poulard of ATLAS was present at the pre-GDB meeting.

 

N.Brook asked that, when the input from an experiment is needed by a working group, the experiments should be invited for the specific meeting.

4.5         Megatable

When all of the experiments have provided data the table will be distributed to the GDB, following which sites and experiments should analyse the relationships and numbers. At the December pre-GDB meeting the experiments will be asked to explain how they arrive at the numbers, but sites are encouraged to discuss issues with experiments before this meeting.

 

C.Eck noted that some values need to be updated urgently, in particular ATLAS, and the ALICE numbers need to be checked (F.Carminati).

 

Action:

K.Bos will review the text of the GDB Summary with I.Bird and M.Schulz before distributing it.

 

 

5.      AOB

 

5.1         "cpu time escaping jobs" Issue - J.Templon

 

Postponed to next MB meeting.

 

 

6.      Summary of New Actions 

 

 

Action:

21 Nov 06 - F.Carminati agreed to prepare and propose (mail to the MB) the mandate, participation and goals of the working group on “Caching and Data Access”.

 

Action:

21 Nov 06 - I.Bird will prepare the mandate, participation and goals of the working groups on “Monitoring Tools” and “System Analysis”.

 

Action:

25 Nov 06 - Sites should send to H.Renshall their procurement plans by end of next week.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.