LCG Management Board

Date/Time:

Tuesday 9 January 2007 at 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=9140

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 9.1.2007)

Participants:

A.Aimar (notes), D.Barberis, L.Betev, I.Bird, N.Brook, K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, B.Gibbard, J.Gordon, C.Grandi, M.Lamanna, E.Laure, M.Kasemann, J.Knobloch, H.Marten, P.Mato, M.Mazzucato, G.Merino, B.Panzer, Di Qing, L.Robertson (chair), J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 16 January 2007 - 16:00-17:00 - Phone Meeting

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments. Minutes approved.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 25 Nov 06 - Sites should send to H.Renshall their procurement plans.

 

New deadline end of January agreed by the MB.

 

  • 19 December - J.Shiers and H.Renshall will report on the progress on the definition of targets and milestones for 2007 at the LCG ECM meeting.

 

Under way. Discussed during this meeting.

 

  • 19 Dec 2006 - The proposal is that ALICE consider, as in the TDR, a value of 10^6 for the ALICE ion runs. L.Betev agreed that ALICE would confirm it within a week.

 

Done.

 

 

3.      Update on the Search for the GDB Chair - G.Merino

 

 

The Search Committee is preparing the list of candidates and will send it to the GDB mailing list in the next 10 days.

 

The election will take place during the February GDB meeting.

 

The GDB Meeting in March will be run by K.Bos and the new chairperson. After that meeting the new chairperson will be in charge.

 

4.      Decisions on Accounting (Slides, Updated Proposal 8.1.2007)

(Additional Material: Document MB 19.12.06; Slides MB 19.12.06; Document: DGAS Status)

 

 

L.Robertson presented the slides summarizing the decisions proposed concerning accounting (Updated Proposal distributed via email).

 

The goal is to agree on and close the points discussed during J.Gordon’s presentation on Accounting at the MB meeting of 19 Dec 2006.

4.1         Automated reporting of CPU (usage and wall clock)

 

Do we still agree that non-grid usage should be accounted (and reported to the C-RRB)?

Yes. The MB agreed that this should continue to be accounted because at some sites it is considerable.

 

K.Bos reminded the MB that in the GDB in Rome it was agreed that “by Dec 2006 all processing should be submitted via the grid”.

 

Do all sites report both grid jobs and non-grid jobs to the APEL repository at the GOC? If not when will this be done?

L.Robertson noted that the values in the APEL repository do not match the ones reported manually every month by the sites, and asked for the status of grid vs. non-grid accounting at each site.

 

In order to be able to produce APEL reports and statistics on non-grid usage, the values must be inserted into the APEL repository. The sites have to write the scripts that publish non-grid usage to the APEL repository, following the documented interface to APEL.
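
Purely as an illustration of what such a site-local script could look like, here is a minimal sketch. The batch-log format, the field names and the CSV staging step are assumptions invented for the sketch; the real record schema and transport are defined by the documented APEL interface.

    # Hypothetical sketch of a site script preparing non-grid CPU usage
    # for publication into the APEL repository. All formats and field
    # names below are illustrative assumptions, not the APEL schema.
    import csv
    import sys

    NORMALISATION_FACTOR = 1.5  # assumed SpecInt2000-based factor for this pool

    def parse_batch_log(path):
        """Yield (job_id, user, vo, cpu_s, wall_s) tuples from a local
        batch accounting log (assumed format: one job per line, fields
        separated by '|')."""
        with open(path) as log:
            for line in log:
                job_id, user, vo, cpu, wall = line.strip().split("|")
                yield job_id, user, vo, float(cpu), float(wall)

    def main(log_path, out_path):
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["LocalJobId", "LocalUser", "VO",
                             "NormCpuSeconds", "NormWallSeconds"])
            for job_id, user, vo, cpu, wall in parse_batch_log(log_path):
                writer.writerow([job_id, user, vo,
                                 cpu * NORMALISATION_FACTOR,
                                 wall * NORMALISATION_FACTOR])
        # A real script would push these records through the documented
        # APEL interface rather than leaving them in a staging CSV file.

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])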

 

User-level data for grid jobs can be accounted automatically by the APEL sensors. In order to have details on the users and DNs for non-grid jobs, the sites must insert that information themselves.

 

PIC: Has been using APEL for a long time and it runs automatically. They checked it manually one year ago and assume the tool works correctly. They only have grid-submitted jobs.

 

FZK: Using APEL for grid-submitted jobs; the values have been verified manually.

Non-grid usage is reported only manually. No plans for automated reporting at present.

 

INFN: Using DGAS at all sites. They are completing the scripts that automatically publish data to the APEL repository. DGAS will also be used for non-grid jobs in the near future. The DGAS status is described in detail in the mail from M.Mazzucato (Document: DGAS Status).

 

NIKHEF/SARA: Publish in the APEL repository the grid jobs. The non-grid usage is very small and limited to ALICE. SARA will try to have only grid jobs at their site and not develop non-grid accounting.

 

CERN: Reports grid usage in the APEL repository and the intention is to report non-grid jobs by 1st March 2007. Normalization is applied.

 

ASGC: Publishes in the APEL repository the grid usage. Update via email after the meeting:

“As checked in last MB meeting, this is to confirm that ASGC can publish all accounting data into APEL repository for LHC VOs. Long time ago there were non-grid users for LHC VOs, but now all local LHC users only submit grid jobs to ASGC, thus ASGC can publish all accounting data into GOC DB by APEL.”

 

BNL: Scripts publish local accounting (grid and non-grid) into the APEL system. The same scripts are now also generating the manual monthly accounting data, therefore there is no difference.

 

FNAL: Uses the Gratia accounting system for grid and non-grid usage. This is also used to generate the manually reported data. No reporting to the APEL repository yet, but it could be done by 1st March 2007.

 

RAL: All use of the RAL resources will be via the grid for the LHC VOs. Therefore no non-grid jobs to account.

 

IN2P3, NDGF: Not present at the MB.

 

J.Gordon noted that the experiments will probably want to know the details (user, resources, etc.) of the non-grid jobs as well, and that information should therefore also be published.

D.Barberis noted that it is important to record ANY usage of resources, independently of the way jobs are submitted. How the job was submitted (grid or non-grid) should also be recorded.

 

L.Robertson noted that in the long term it may be difficult to maintain all the site-specific scripts for non-grid accounting, and that in Rome the experiments and sites agreed to move towards grid jobs instead of local non-grid queues.

 

Is the normalisation factor published in the information system correct for each site?

The “granularity” of the normalization varies from site to site: some sites have a value for EACH WN; at other sites the normalization is done per pool of resources; sometimes non-homogeneous resources share a single normalization factor.

 

L.Betev noted that, except at NIKHEF and IN2P3 where each WN has an individual normalization factor, at most sites the resources could be incorrectly normalized by 20-30%. This is a huge amount of resources accounted to a VO.
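
A small worked example (with invented numbers) shows how a single pooled factor goes wrong as soon as a VO's jobs are not spread evenly over a heterogeneous pool:

    # Hypothetical illustration of the normalization error: 100 WNs, half
    # old (factor 1.0) and half new (factor 2.0), published with a single
    # average factor instead of per-WN factors.
    pool = [
        # (wn_count, true_per_wn_factor, cpu_seconds_consumed_per_wn)
        (50, 1.0, 1_500_000),  # older nodes, running most of the VO's jobs
        (50, 2.0, 500_000),    # newer nodes, twice as fast
    ]

    # Correct accounting: each WN normalized with its own factor.
    per_wn = sum(n * f * cpu for n, f, cpu in pool)

    # Pooled accounting: one average factor applied to everything.
    avg_factor = sum(n * f for n, f, _ in pool) / sum(n for n, _, _ in pool)
    pooled = sum(n * avg_factor * cpu for n, _, cpu in pool)

    print(f"per-WN total: {per_wn:.3e}")                  # 1.250e+08
    print(f"pooled total: {pooled:.3e}")                  # 1.500e+08
    print(f"error: {abs(pooled - per_wn) / per_wn:.0%}")  # 20%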

 

J.Templon stated that SARA is ready to provide other sites with their system, which normalizes per WN and reports into APEL.

 

T.Cass noted that at CERN the LSF-based accounting normalizes individually for each WN.

 

J.Gordon added that a SAM test will statistically compare the local log values with those in the central APEL repository, in order to detect incorrect publishing into the APEL repository.
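
Schematically, such a check could look like the sketch below; the tolerance and the input numbers are invented for illustration, and this is not the actual SAM test:

    # Hypothetical consistency check: compare the CPU total from a site's
    # local batch logs with the total held in the central APEL repository
    # for the same period, and flag silent publishing losses.
    TOLERANCE = 0.05  # assumed threshold: flag deviations above 5%

    def check_site(site, local_cpu_hours, apel_cpu_hours):
        """Return True if the published APEL total matches the local total."""
        if local_cpu_hours == 0:
            return apel_cpu_hours == 0
        deviation = abs(apel_cpu_hours - local_cpu_hours) / local_cpu_hours
        status = "OK" if deviation <= TOLERANCE else "FAIL"
        print(f"{site}: local={local_cpu_hours:.0f} "
              f"apel={apel_cpu_hours:.0f} deviation={deviation:.1%} {status}")
        return deviation <= TOLERANCE

    # Made-up numbers: records lost on the way to the repository show up
    # as a deficit on the APEL side.
    check_site("SITE-A", 120_000, 118_500)  # within tolerance
    check_site("SITE-B", 120_000, 80_000)   # silent loss detected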

 

When do we move to all automatic accounting?

L.Robertson proposed the following timetable:

-          1 March 07 - All Tier-1 sites reporting both grid and non-grid CPU usage to the APEL repository.
For the time being without user-level accounting information.

-          31 May 07 - All Tier-1 sites to input historic data from January 07 to the APEL repository.

-          31 May 07 - No further possibility of providing manual CPU input for the LCG accounting report.
Manual and automatic will overlap for 3 months (March-May) and then manual reporting will be stopped.

-          October 07 C-RRB - Only automatically reported data will be used for the C-RRB accounting report.

 

From March the automatic system will be used to publish the CPU accounting data for CERN and all Tier-1 sites, unless a site specifically asks to submit the data manually. After May only the automatic data will be used, including for the reports to the OB and the C-RRB.

 

J.Templon asked that, well before May, the “SAM reliability test checking the published values in the APEL repository” be in production, because values sent to the APEL repository are sometimes lost without generating any error.

 

C.Grandi asked that the automatic accounting be tested until May 2007, with successive improvements and tests of the components.

 

G.Merino asked how the “installed capacity” will now be reported: the GOC DB does not have fields for such data. The MB agreed that sites will report the installed capacity every quarter in the QR reports.

4.2         Reporting CPU usage for Tier-2s

 

When will we be able to identify sites as Tier-2s in the GOC?

Only data reported to the APEL repository will be included in LCG Tier-2 reports. J.Gordon noted that there is a field in GOC DB defining the “Tier Level” and that a mechanism, perhaps manual, will be implemented to define appropriately all of the Tier-2 sites that have been formally declared to Chris Eck.

 

J.Gordon noted that in the UK the Tier-2 federations provide aggregated accounting, but this is not the case everywhere. The Tier-2s that are not federated should report individually.

 

When will DGAS2APEL be supported for INFN Tier-2s?

Discussed earlier in the “INFN status” section. This will be done by the end of February.

 

When will OSG accounting be linked to the APEL repository?

R.Pordes said (at the 19 December MB) that it would take 3 months to implement the OSG accounting tools, but she was not present at this MB to answer the question.

 

Proposed Next Steps on Tier-2 Accounting

-          1 April 07 - LCG should start reporting the Tier-2 data that is in the APEL repository

-          1 September 07 – All Tier-2 sites are required to report CPU usage to the APEL repository
This proposed date should be presented at the Tier-2 Workshop in order to discuss it with the Tier-2 representatives.

 

The MB agreed on the proposed dates.

4.3         User Accounting

The document on the confidentiality and security policies for collecting and maintaining user data is expected to be approved at February’s GDB.

 

When will EGEE sites commit to deploying and activating user accounting?

The MB agreed on June 2007 for “user accounting” of the grid jobs.
The sites with non-grid usage will have to provide this information soon after June, by writing their own scripts.

 

Initially this will be done only for jobs at Tier-1 sites. Tier-2 user-level accounting will be done once Tier-1 reporting is verified and reliable.

When will OSG sites (BNL and FNAL) and NDGF provide user accounting?

The plans for OSG user-level accounting need to be checked with R.Pordes, and for NDGF with M.Grønager.

4.4         Storage Accounting

GridPP has been collecting storage accounting data for a few months, and since the last GDB the data from all EGEE sites has been harvested.

Sites should check the validity of their Storage Accounting data (SEs monitored, accounting data values, units used, etc) in the APEL repository.

The decision on when to move to automated storage accounting will be made when there is more experience of the automated collection system.

4.5         LCG Reports

A small group in EGEE will define the format of the reports and the appearance of the portal.

Representatives for sites and VOs should define the kind of information needed by the different parties involved (sites, experiments, operations).

J.Gordon and I.Bird volunteered to participate. D.Barberis agreed to find a representative for ATLAS.

 

5.      Update on VOMS Roles and Groups + Scheduling Priorities (Slides) – J.Templon

 

 

The information and all the parts (software, documentation, etc) needed are (almost) available:

-          The information provider is (almost) certified and soon could be in production.

-          Torque should also be ready, but it had to be repackaged using YAIM and is being certified.

 

There is a document explaining how to specify groups and roles at a given site (Unix groups matching roles). It has been tried at SARA (on their batch system only), and sites should proceed with its implementation.
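
As an illustration only, such a mapping associates a VOMS role (FQAN) with a local Unix group, in the style of the groupmap files used on LCG sites; the FQANs and group names below are invented examples, and the authoritative syntax is the one in the document:

    "/atlas/Role=production" atlasprd
    "/atlas/Role=lcgadmin"   atlassgm
    "/atlas"                 atlas

Batch scheduler shares (e.g. fair-share priorities) are then attached to these Unix groups, which is what turns the VOMS roles into actual scheduling priorities.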

 

The status for other batch systems (LSF, BQS, etc.) was not clear. Each site needs to implement the appropriate code by itself.

-          With LSF the system was not working, due to small issues that could be easily solved. For instance, the INFN issues were easily solved during EGEE 07 together with M.Serra and C.Alfimei, and they could now help other LSF sites.

-          With BQS the status is unknown and the site (IN2P3) should come back with a status report.

-          With Condor, PBS, Sun Grid Engine, etc there is no feedback from the sites.

 

Sites with routing queues also had problems; they should contact SARA, where the issue was solved.

 

Once the information provider and Torque are certified and released, the sites should be informed at the Operations meeting. Sites should implement the Job Priorities and contact J.Templon in case of any problem.

 

D.Barberis noted that this is very important for the experiments, and at least for the Tier-1 sites it is urgent.

I.Bird said that he will report next week on the schedule of the release and whether Tier-1 sites should wait for the release or proceed before so that experiments can test their job priority strategies.

It was agreed that all Tier-1 sites will provide a plan to implement the support, as a matter of priority.

 

6.      Update on Targets for 2007 - (Paper, Slides) J.Shiers

 

6.1         Introduction

J.Shiers proposed some draft high-level milestones complementary to the experiment-specific milestones.

The intention is to have more detailed goals and objectives listed in the WLCG Draft Service Plan (Paper).

The milestones are going to be:

-          Similar to those prepared and maintained in previous years

-          Regularly reviewed and updated through LCG ECM

-          Regular reports on status and updates to WLCG MB / GDB

Focus is on real production scenarios and on (moving rapidly to) end-to-end testing. The time for component testing is over, and the time before data taking is very short.

 

For all milestones proposed:

-          All data rates refer to the Megatable and to pp running

-          Any ‘factor’, such as accelerator and/or service efficiency, is mentioned explicitly (‘catch-up’ is a proven feature of the end-to-end FTS service).

6.2         Milestones for Q1 2007

This quarter the tests will run with all VOs and all activities simultaneously; the target values are therefore somewhat lower than in 2006. The tests will use SRM v1, as installed at the sites.

 

FTS 65% T0/T1

Demonstrate Tier0-Tier1 data export at 65% of full nominal rates per site using experiment-driven transfers

-          Mixture of disk / tape endpoints as defined by experiment computing models, i.e. 40% tape for ATLAS

-          Period of at least one week; daily VO-averages may vary (~normal)

 

FTS 50% T0/T1 + T1/T1 + T1/T2

Demonstrate Tier0-Tier1 data export at 50% of full nominal rates (as above) in conjunction with T1-T1 / T1-T2 transfers

-          Details in the future

 

FTS 35% T0/T1 + T1/T1 + T1/T2 + Reprocessing and Analysis at T1

Demonstrate Tier0-Tier1 data export at 35% of full nominal rates (as above) in conjunction with T1-T1 / T1-T2 transfers and reprocessing / analysis stripping at Tier1s

-          Details in the future

 

SRM V2.2 Tests

Provide SRM v2.2 endpoint(s) implementing all methods defined in the SRM v2.2 MoU, or at least all methods critical for the experiments

-          Details in the future

6.3         Milestones for Q2 2007

Still to be clarified, in general the focus should be:

 

FTS Tests

Same as Q1, but using SRM v2.2 services at Tier0 and Tier1, gLite 3.x-based services and SL(C)4 as appropriate

 

Services

Provide services required for Q3 dress rehearsals

 

Note: Slide 5 shows that the values for ALICE in the Megatable seem incorrect.

 

L.Betev noted that these are mostly FTS tests. J.Shiers replied that this will be the first time that real scenarios are run with all VOs simultaneously.

 

Note: It is important that the experiments attend the ECM meeting so that Q1 scenarios are defined and can then be presented at January’s Workshop.

 

L.Robertson asked whether job success rates will also be tested. J.Shiers replied that some targets will also be set for the sites, and some dashboard-like tools should be put in place.

 

The experiments will have to see that a sufficient number of jobs/day is completed.

I.Fisk added that CMS has some simple tests that are “sure to complete”. They are used to check that a site is not having major problems and to measure the site’s performance.

 

J.Shiers concluded that the Q1 milestones must be clear before the workshop. Update next week.

 

7.      Long Term Planning for the CERN Computing Centre (Slides, Document) – T.Cass, L.Robertson

 

7.1         Introduction

L.Robertson presented the strategy for planning the long-term evolution of the CERN Computing Centre.

 

In order to ensure that CERN is able to fulfil its responsibilities for LHC computing, some long term planning is under way. One of the issues that emerges is the apparently (at least at present) inexorable growth of energy requirements as computing capacity increases.

 

CERN’s estimates are that the long-term energy requirements cannot be met by simply adding power and cooling to the current computer centre. For some time now there has been an ongoing discussion on this, trying to identify a suitable building, etc.

 

This has now been written up formally, and further justification of the needs has been requested by higher CERN management.

 

Note: The details of the explanation are in the attached document.

7.2         Uncertainty of the parameters

Estimating the evolution of both the requirements of the experiments and the characteristics of the technology beyond 2010 necessarily involves a considerable element of guesswork.

 

The requirements of the experiments in the early years of the next decade will depend to a large degree on:

-          the efficiency of the accelerator operation

-          the speed with which the design energy and luminosity are reached

-          the timing of the programme to upgrade the machine parameters, and

-          the way in which the computing model of each of the experiments is adapted to the new opportunities that will be offered by the evolution of computing and networking technology.

 

Similarly, while there are already many approaches at the R&D stage to evolve high throughput computing technology over the next five to ten years, the reality will be strongly dependent on market factors.

7.3         Requirements Estimates

The major constraint is to respond to the needs of the experiments within the anticipated long term budget.

 

The initial approach is simply to extrapolate the current requirements, assuming the same percentage increase in computational and storage capacity as in 2009-2010 for each subsequent year until 2020.

 

The result is considerably lower than what would have been obtained with a Moore’s Law extrapolation, which has been used effectively in recent years to predict the capacity that can be purchased within a fixed budget (see slide 4). Therefore in principle there is budget for the infrastructure (energy, etc.), although it may imply investing in a new building, etc.
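
As a rough illustration of the two extrapolations being compared (the growth rate and doubling period below are hypothetical placeholders, not the figures behind slide 4):

    # Hypothetical comparison: requirements growing at a fixed annual rate
    # (the 2009-2010 increase carried forward) versus a Moore's-Law-style
    # doubling of what a fixed budget buys every ~18 months.
    annual_growth = 0.30   # assumed 2009->2010 capacity increase
    doubling_months = 18   # assumed doubling period at fixed budget

    capacity = 1.0         # capacity in 2010, arbitrary units
    for year in range(2011, 2021):
        capacity *= 1 + annual_growth
        moore = 2 ** ((year - 2010) * 12 / doubling_months)
        print(f"{year}: requirements={capacity:6.1f}  fixed-budget={moore:6.1f}")
    # By 2020 the requirements path (~14x) sits an order of magnitude below
    # the fixed-budget Moore's Law path (~102x), leaving budget headroom
    # for infrastructure such as energy and buildings.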

 

Because of this it is expected that this could be achieved (purchase plus energy costs) within the fixed budget constraint, while keeping the total energy requirements within reason (~20% of the accelerator’s).

 

The CERN management asked for more details on these estimates.

The experiments are also required to provide their estimates (to 2020) and their scenarios. The deadline is the end of March.

 

The experiments should provide a single, agreed input and set of estimates, by discussing among themselves first.

 

T.Cass noted that the estimates must go to 2020 because the current infrastructure can probably be expanded until 2010, but not further.

T.Cass is organizing a group to study all implications on the different services and equipment needed in the CC (space, power, cooling, etc).

 

8.      AOB

 

 

B.Panzer distributed a note (document) reminding the Board of the agreed policy statement for the CERN CC:

 

Extract:

 

The following three policy statements for the CERN computer center are not new and have been mentioned already several times, but I want to make sure that they are officially recognized, thus this re-iteration.

 

1. The revised resource planning for the next years in the CERN computer center has confirmed a problem with the power and cooling needs in 2009-2010. As a first measure the computer center will not accept any request from outside institutes for the installation of extra computing equipment.

 

2. Requests from the experiments for extra services and/or enlargements (VO boxes, build-systems, larger data bases, PROOF, etc.) are taken from the requested resources for CPU and disk in Lxbatch and CASTOR and thus will reduce the available production capacity.

 

3. All new resources for Lxbatch in 2007 will only be available under SLC4. This is based on the resource numbers given by the experiments in October and presented to the c-RRB.

 

------

 

K.Bos asked that the Storage Classes working group should be presented (and appointed) by the MB at next meeting.

 

9.      Summary of New Actions

 

 

Additional information to be provided on accounting:

         IN2P3 – status of automated reporting of non-grid CPU

         NDGF – status of automated reporting of CPU usage (grid and non-grid); plans for user level accounting

         OSG – plan for the OSG accounting system to report CPU accounting data automatically to the APEL repository; plans for user level accounting

         Ian Bird – to report on the deployment status of the scheduling priority scheme information providers

         All Tier-1 sites – to plan as a matter of priority the deployment of the VOMS groups and roles priorities scheme; sites not using Torque to organise integration with their batch system

 

  • 31 Mar 2007 – experiments to provide very long term (2010-2020) estimates of their computing needs at CERN

 

  • 31 Jan 2007 - Sites should send to H.Renshall their procurement plans.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.