LCG Management Board

Date/Time:

Tuesday 6 February 2007 - 16:00-18:00 – Face-to-face Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11624

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 10.2.2007)

Participants:

A.Aimar (notes), D.Barberis, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, F.Donno, M.Ernst, I.Fisk, J.Gordon, C.Grandi, F.Hernandez, M.Lamanna, E.Laure, S.Lin, M.Litmaath, M.Kasemann, H.Marten, P.Mato, G.Merino, R.Pordes, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 13 February 2007 - 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments. Minutes approved.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 19 December - J.Shiers and H.Renshall will report on the progress on the definition of targets and milestones for 2007 at the LCG ECM meeting.

 

Discussed further in this meeting. Action closed.

 

 

  • 31 Jan 2007 - All Tier-1 Sites should send to H.Renshall their procurement plans for 2007.

 

To be completed.

 

J.Templon noted that the plans from the experiments do not go beyond June 2007 (as requested). He said that NIKHEF does not want to install equipment before there is a clear requirement from the experiments, and so the needs for the second half of the year should be specified in the table. He added that he would like to have a single place to submit the site plans: at present they are submitted to H.Renshall, but also to Fabienne as part of the monthly accounting report, and again in the quarterly progress report.

 

The discussion that followed concluded that the reference values are those on H.Renshall's pages. Both sites and experiments should submit their values there and keep them up to date. The accounting and quarterly reports will simply use the latest numbers from there.

 

Discussion:

L.Robertson noted that the resources required for the startup in November must be available some time in advance, to ensure that any installation problems can be resolved and the experiments can test in a stable environment. It had been agreed that the target date was 1 July this year. While there is still a possibility that the accelerator may not have colliding beams at the beginning of November, we should not gamble on delays and risk not being ready if the beam comes on schedule. There is of course a cost benefit in delaying purchases, but this should be viewed in the context of the overall cost of the preparations for LHC.

 

M.Kasemann noted that in order to have the equipment in production in April 2008, one would need to have the resources installed by January 2008. H.Marten said that the agreement was that equipment should be ready by April 2008, not by January 2008. The sites already committed to have the equipment installed at the beginning of 2008 in order to have it available in production for 1 April 2008. L.Robertson noted that this fits our current knowledge of the accelerator schedule for 2008, but we should be prepared to reconsider the date for availability in subsequent years when there is better knowledge of the accelerator timetable.

 

The agreement is that:

-          resources for November 2007 should be ready for use by the experiments by July 2007

-          resources for the 2008 run should be ready by April 2008

-          dates and resources for 2009 will have to be discussed once the accelerator schedule for 2009 is clearer

 

  • 5 Feb 2007 - L.Robertson will send a starter list of software involved to the MB.

 

Done. Discussed further in this meeting.

 

  • 13 Feb 2007 – All Tier-1 Sites should send to the MB their plans for the implementation of job priorities.

 

Done. An update is provided today via the round table in this MB meeting.

 

  • 9 Feb 2007 – All Tier-1 Sites should send to the MB their Site Reliability Report for December 06 and January 07.

 

Not done. Several sites still have to do it.

 

L.Robertson said that the original distribution of the reliability report for January, using the data from SAM, again showed differences from GridView, which had the correct values. From now on the values from GridView will be used. The report produced by Fabienne will shortly be replaced by a similar report produced by GridView. The calculation used at present for reliability simply counts scheduled down time as available time. It is proposed to change this to the definition in the MoU. L.Robertson will circulate a note on this, along with a recalculation of the September-January numbers, so that members can see the impact before a decision is made at next week's meeting.
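For reference, one reading of the two calculations is sketched below (the exact MoU wording is not reproduced in these minutes), with T_up the time the site passes the tests, T_sched the scheduled down time, and T_total the whole reporting period:

\[
R_{\mathrm{current}} = \frac{T_{\mathrm{up}} + T_{\mathrm{sched}}}{T_{\mathrm{total}}}
\qquad\longrightarrow\qquad
R_{\mathrm{MoU}} = \frac{T_{\mathrm{up}}}{T_{\mathrm{total}} - T_{\mathrm{sched}}}
\]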

 

In addition, it was agreed that the “dteam” VO will no longer be used (until now the best result each day from the “ops” and “dteam” VOs was taken). Only the “ops” VO will be used for the SAM tests from now on.

 

  • 31 Mar 2007 – Experiments to provide very long-term (2010-2020) estimates of their computing needs at CERN.

 

3.      Future Maintenance and Support of External Software (Document) – L.Robertson

 

L.Robertson raised the question of whether it is necessary to have more formal support agreements for the external software on which we depend. He had distributed an initial list of some of these components.

 

Various external projects and organizations provide software for the LCG. In some cases there are agreements with the developers, but in most of the cases there is no formal support agreement. Should the LCG have agreements with these sources of software?

 

Popular open source products (Apache, Squid, etc.) are not included because they are widely used outside the LCG community and are not considered a major risk.

 

J.Templon asked why CERN products are included. He also noted that some project-specific products will have no support once the project that funds them ends (for example, EGEE-funded products). L.Robertson said that CERN products were included for completeness, and that the concern over long-term support for tools developed by projects with a limited lifetime was one of the reasons for raising this issue.

 

D.Barberis said that a written agreement can at least help to make sure the provider keeps the LCG informed, with reasonable advance warning, if support is to be stopped.

 

F.Carminati added that even if a product is not going to be supported one could adopt it anyway. But the survey is useful because the situation should be known, so that the LCG can take corrective action if/when support for a given product stops.

 

Ph.Charpentier said that “porting to future LCG platforms” should be included in a support agreement. Sometimes there is limited platform support, and if the porting is needed only for the LCG it may be an issue that requires negotiation with the provider; it could also influence the compiler and platform choices of the Applications Area. This could become a nightmare.

 

The Applications Area products are not included in L.Robertson's list, but P.Mato will consider whether there are products that should be included. P.Mato added that the selection process for external products took into consideration the extent of their adoption in the open source community and their future support. But further investigation will be done on the Applications Area external software.

 

Action:

The MB Members should send feedback to L.Robertson on the list of products and on the kind of agreement to reach with the external software providers.

 

A questionnaire/survey should then be prepared and distributed (also talking with the major providers directly) in order to assess the current situation. A risk analysis will then be performed once more information is available.

 

4.      Update of the Requirements of the Experiments (6 months period) – L.Robertson

 

 

The experiments are progressively improving the calculation of their requirements, but the sites need stable values for their procurement. The last update was in October 2006; since then some values have changed, but sites cannot take such changes into account quickly.

 

 

The Overview Board recommended that the experiments' requirements be changed only twice a year.

 

Experiments will check once more that the current requirements published on the LCG Web in October are the latest estimates (see http://lcg.web.cern.ch/LCG/MB/revised_resources/). In future the estimates will be verified, and changed if needed, every June (in good time for the October C-RRB) and December (at the end of the annual data taking run).

 

J.Templon and O.Smirnova noted that sites typically already have fixed budgets until 2010; therefore the sites do not have much flexibility, and only the distribution of procurement over these years can now be changed.

 

5.      Update/Issues on the 2007 Targets and Megatable Rates (Slides) – J.Shiers

 

 

J.Shiers presented an update on the work being done to define the Targets and Milestones for 2007.

 

The discussion moved immediately to Slide 7 where the data and rates required by the experiments are compared with those mentioned in the TDRs of the experiments. Some changes from ATLAS (ESD size) and CMS (trigger rate) are highlighted.

 

Slides 8 to 12 show the Megatable values for Tier-0-Tier-1 data rates. Not all experiments applied the same criteria [although this is not documented in the explanations sheet in the workbook]: some include catch-up rates, others average or peak rates, others machine efficiency, etc. The values collected therefore seemed inconsistent. The following clarifications were made (a sketch of the resulting normalisation follows the list):

         ALICE – numbers are long term averages, taking account of detector efficiency and with no allowance for catch-up after a problem. The “peak” can be considered to be twice this value.

         ATLAS – assumes 100% efficiency of the accelerator – and so the numbers can be considered as peak numbers.

         CMS – takes account of accelerator efficiency but includes an allowance for catch-up – so this is the peak rate.

         LHCb – gives the long-term averages, and so the peak can be assumed to be twice the numbers given in the Megatable.
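To make the differing conventions concrete, the sketch below shows how a site could normalise the published figures to peak rates. It is a hypothetical illustration: the rates are invented placeholders, not actual Megatable values.

# Hypothetical sketch: normalise published Tier-0 -> Tier-1 rates (MB/s) to
# peak rates, following the per-experiment conventions clarified above.
# The input numbers are invented placeholders, not real Megatable values.
published = {"ALICE": 60, "ATLAS": 100, "CMS": 80, "LHCb": 20}

# ALICE and LHCb publish long-term averages: peak is about twice the value.
# The ATLAS and CMS figures can already be read as peak rates.
to_peak = {"ALICE": 2, "ATLAS": 1, "CMS": 1, "LHCb": 2}

peak = {exp: rate * to_peak[exp] for exp, rate in published.items()}
print(peak)  # {'ALICE': 120, 'ATLAS': 100, 'CMS': 80, 'LHCb': 40}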

 

L.Robertson noted that the sites want to know whether these values are “peak” or “average”. The sites must know the peak rate that they have to achieve; the average rate is implied by the amount of data that must be accepted.

 

Decision and Action:

The conclusion of the MB is that the experiments should define the peak data rate for Tier-0-Tier-1 transfers.

 

N.Brook noted that this standardization should apply to all transfers, including Tier-1 to Tier-2 transfers.

He also added that the peaks could occur all at the same time, and this should be considered by the sites. [It should be noted that the Megatable already includes average and peak values for Tier-1-Tier-2 transfers, allowing Tier-1 sites to estimate the likely peak rates.]

 

6.      Proposal for Job Reliability Statistics (Slides) – M.Lamanna

 

M.Lamanna presented a proposal to report the sites’ performance from the users’ point of view, using the real behaviour of the system with real user jobs. There are two aspects to consider:

-          Quality of Service (QoS):
The focus is on the “job”. “Is the job successfully executed?” is the crucial question. Maybe the job was re-submitted several times, but this is not relevant for the user.

-          Grid Reliability:
Here it is important to know if the execution required several attempts or was immediately successful. This verifies the good usage of resources (no wasted attempts).

 

One could have a high QoS and a low GR. This would simply mean that the success rate is high but only after many execution attempts: users are satisfied, but resources are badly utilised.
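As a minimal sketch of the distinction (with invented job records; this is not the actual monitoring code), the two metrics could be computed as follows:

# Each job is a list of attempt outcomes; True means the attempt succeeded.
jobs = [
    [False, False, False, True],  # succeeds on the 4th attempt (cf. slide 3)
    [True],                       # succeeds immediately
    [False, False],               # never succeeds
]

# Quality of Service: fraction of jobs that eventually succeeded,
# however many attempts were needed.
qos = sum(any(job) for job in jobs) / len(jobs)

# Grid Reliability: fraction of individual attempts that succeeded,
# so wasted resubmissions lower the score.
attempts = [a for job in jobs for a in job]
gr = sum(attempts) / len(attempts)

print(f"QoS = {qos:.2f}, GR = {gr:.2f}")  # QoS = 0.67, GR = 0.29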

 

J.Templon noted that a low GR will use up the user's quota of resources for executing jobs.

 

Slide 3 shows an example of a job that is only executed completely after 4 attempts with recovery handled by the system, but the user sees this as a successful execution.

 

J.Templon noted that the sites are interested in the state transitions that result in the failed attempts and asked M.Lamanna to provide such information.

 

Slide 4 shows an example of users’ jobs executed at DESY.

Slide 5 shows the summary by VO and site, with efficiency and with the possibility to “zoom in” and see the reason for each failure.

 

There is a large amount of additional data accessible via the web pages producing these reports. M.Lamanna proposed to sit down with people from the sites to explain how to obtain this information, and to collect their feedback on how to improve the system.

 

For the regular reporting M.Lamanna proposed to start with two job sets: CMS analysis jobs (CRAB) and LHCb pilot jobs.

 

F.Hernandez asked whether an API is available, so that a site can automate the extraction and verification of its job success rates. M.Lamanna replied that a command-line interface is available and could be used. But this needs to be verified in practice.

 

J.Templon and G.Merino agreed to discuss with M.Lamanna on which information should be available in the “job reliability GUI” and distributed in the “job reliability monthly reports”.

It was agreed to begin the regular reporting of the two job sets proposed, covering CERN and the Tier-1 sites initially.

 

 

7.      OSG Accounting (Slides) – R.Pordes

 

The plan is that OSG will follow the MoU agreements of US ATLAS and US CMS, and will report to the WLCG both grid and local usage of the OSG resources.

 

Currently the report is done manually (via email), but the goal is now to automate it, using Gratia, by March 2007.

The contact for this automation is Ph.Canal.

 

J.Gordon noted that Ph.Canal has probably not yet started the work. Intermediate milestones should allow verification of the actual progress of this work.

 

OSG Accounting

OSG is currently running (or testing, in some cases) accounting plug-ins for Condor, PBS, LSF, SGE, glexec and dCache.

 

14 OSG sites are currently reporting, including the CMS Tier-1 and all Tier-2s, and the OU (Oklahoma Univ.) ATLAS Tier-2; reporting is in test at UTA (Univ. Texas Arlington).

 

The information is collected at batch level and includes locally submitted work (i.e. not through grid interfaces). For jobs submitted through a grid interface the submission DN is captured.

 

VO-level reporting from ATLAS Panda is also available.

 

J.Gordon added that some work on storage accounting was done in the UK, by G.Cowan at Edinburgh. J.Gordon will send additional information to R.Pordes.

 

Loading of APEL

The options available are:

-          write direct SQL statements to upload to LCG, which requires development only on the OSG side

-          upload OGF UR XML files, which requires some development on the OSG side and development on the LCG side (done anyway for the Resource Usage Service, RUS)

 

After the meeting J.Gordon clarified that the work on the RUS will not be ready within the next six months. Therefore the first option (direct SQL statements) is the only one available.
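As an illustration of this option, the sketch below inserts one monthly usage record with a direct SQL statement. The table and column names are invented for illustration (the real APEL schema is not reproduced here), and SQLite stands in for the production database.

# Hypothetical sketch of the direct-SQL option: push one monthly usage
# record into an accounting table. The schema is invented for illustration;
# the real APEL schema may differ.
import sqlite3  # stand-in for the production database connection

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_record "
    "(site TEXT, vo TEXT, njobs INTEGER, cpu_hours REAL, month TEXT)"
)
record = ("OSG_EXAMPLE_SITE", "cms", 12500, 98000.0, "2007-02")
conn.execute("INSERT INTO usage_record VALUES (?, ?, ?, ?, ?)", record)
conn.commit()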

 

The user privacy issues have to be sorted out before any user-level OSG data is published, but the data should already be collected, with restricted access.

 

8.      SRM 2.2 Update (Slides) – F.Donno

 

 

Note: At every face-to-face meeting during the coming months F.Donno or M.Litmaath will present an update of the SRM 2.2 development and deployment status.

 

A report was presented at the last MB meeting; therefore most of the considerations still apply.

 

Slide 3 shows that the Basic tests of the MoU methods are mostly passing. Some problems still remain with CASTOR.

 

Slide 4 shows the status of the SRM implementations:

-          DPM: All MoU methods pass the Basic tests. Three use cases were fixed in 1.6.2.
Major problems this week, since the machine was accidentally wiped out by an operational error.

-          DRM and StoRM: No news. StoRM is still missing the implementation of ExtendFileLifeTime for SURLs.
Copy in PULL mode is not available in StoRM. The implementations are rather stable.
No more communication issues with DRM.

-          dCache: Still missing ExtendFileLifeTime.
Problems with use cases and interoperability tests have been reported but are not yet fixed.

-          CASTOR: Still some problems with the basic tests, reported to S.De Witt. This week some SRM methods were hanging. Use-case analysis has not yet started.

 

The GSSD working group was discussed at the pre-GDB and has started work. M.Litmaath presented an example of configuring LHCb at FZK.

 

M.Kasemann noted that it would be useful to have the old test-result tables together with the latest results. This would make it easier to evaluate progress and spot long-standing problems.

 

Before every face-to-face meeting F.Donno will distribute the SRM test results, and compare them with the results of the previous months.

 

 

9.      Sites Plans on Group/Role and Job Priority – Round table

 

 

L.Robertson proposed a round table of all Tier-1 sites on their implementation of job priorities based on groups and roles.

 

CERN: Looking at the issue, but needs to discuss further with J.Templon how to implement the mapping of VOMS roles and groups onto the LSF system.
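As a rough illustration of the kind of mapping involved (all FQANs, group names and share values below are invented; this is not CERN's actual configuration):

# Rough sketch: map VOMS FQANs onto local scheduler groups and shares.
# All names and numbers are invented for illustration only.
fqan_to_group = {
    "/atlas/Role=production": "atlas_prod",
    "/atlas/Role=lcgadmin":   "atlas_sgm",
    "/atlas":                 "atlas_users",
}

group_share = {"atlas_prod": 70, "atlas_users": 25, "atlas_sgm": 5}

def local_group(fqan: str) -> str:
    """Return the local group for a VOMS FQAN, longest prefix first."""
    for prefix in sorted(fqan_to_group, key=len, reverse=True):
        if fqan.startswith(prefix):
            return fqan_to_group[prefix]
    raise ValueError(f"no mapping for {fqan}")

grp = local_group("/atlas/Role=production")
print(grp, group_share[grp])  # atlas_prod 70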

 

RAL: They believe it is done.

 

ASGC: Will ask for more information and inform via email. [After the meeting Jason Shih reported that ASGC had already applied the job priority settings for production users before the start of CSA06 and the ATLAS SC4. Fair-share scheduling has also been implemented for the two VOs, such that each VO can benefit from the free resources that the other VO is not using.]

 

IN2P3: Not done yet. Development is ongoing in order to (1) modify the information system and (2) later implement the prioritization inside the BQS scheduler. For now it is done externally, by changing the BQS priority of the job.

 

PIC: Installed on the PPS a few weeks ago, but not tested by users from the experiments. It will be introduced into production next month, even if not fully tested, and then fixed as problems show up.

 

CNAF: The plugin for LSF is needed and is being developed; it will be tested in a few weeks.

 

T.Cass noted that the dynamic info provider is a problem for CNAF and CERN; the contact at CERN is U.Schwickerath.

 

SARA: Done.

 

NDGF: Done, in the sense that 80% of the sites publish their information, but it needs a cleverer implementation (no further explanation was provided at the meeting).

 

TRIUMF: Implemented priorities at the scheduler level. Not in the information system. It is configured in order to provide 80% of the resources to ATLAS production and 20% to general users.

 

US CMS: They have not updated the batch system, but they provide 50% for production and 50% for general users, and can tune this easily. They have also implemented the mapping of priorities on the storage element using dCache features.

 

US ATLAS and FZK: Their representatives had already left the meeting.

 

J.Templon noted that, in spite of the apparent lack of problems mentioned by some sites, nobody except SARA is publishing data in the information system. To publish data in the information system, sites need to install (and use) the RPM that is available and described in J.Templon's email.

 

10. AOB

 

 

No other business.

 

11. Summary of New Actions

 

 

20 Feb 2007 - The MB Members should send feedback to L.Robertson on the list of products and on the kind of agreement to reach with the external software providers.

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.