LCG Management Board

Date/Time:

Tuesday 9 October 2007 16:00-18:00 – F2F Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=18005

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 12.10.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, D.Foster, J.Gordon, F.Hernandez, M.Kasemann, D.Kelsey, J.Knobloch, U.Marconi, H.Meinhard, M.Michelotto, H.Marten, P.Mato, G.Merino, Di Qing, L.Robertson, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 16 October 2007 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

A minor change sent by H.Marten was added in Section 4 (text in blue).

The minutes of the previous meeting were approved.

1.2      LHCC Comprehensive Review Preparation (Agenda) - L.Robertson

The agenda of the Comprehensive Review is now available as an Indico Agenda.

Some speakers still need to be defined, in particular for the Middleware and Deployment sessions.

 

Tier-1 Sites – One Tier-1 site is still missing; the proposed sites were FZK or CNAF.

H.Marten agreed that FZK will be one of the Tier-1 sites.

 

Tier-2 Sites - For the Tier-2 sites, J.Coles will present the UK sites, R.Pordes the OSG sites, and a summary from the Asia-Pacific Workshop will be given by someone from an Asian site.

 

Experiments Presentations - For the Experiments the reviewers were in favour of live demos. The complexity is that it is difficult to prepare and run a realistic job submission with meaningful results in 20 minutes. This should be discussed with the Reviewers again.

 

Action:

D.Barberis agreed to clarify with the Reviewers the kind of presentations and demos that they are expecting from the Experiments at the Comprehensive Review.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 10-July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

Done at the F2F Meeting 9 October 2007.

  • 18 Sept 2007 - Next week D.Liko will report a short update about the start of the tests in the JP working group.

Not done. A.Aimar will ask for updated information.

  • 21 Sept 2007 - D.Liko sends to the MB mailing list an updated version of the JP document, including the latest feedback.

Not done. D.Liko will also provide a status summary at the GDB the following day.

I.Bird added that the installation is now on the certification testbed and available for testing.

 

I.Bird agreed to report to the Management Board about the progress of the Job Priority working group.

 

 

D.Liko sent this information:

 

The document has been updated with the comments from Jeff. It is available from the Working Group Wiki.

http://egee-intranet.web.cern.ch/egee-intranet/NA1/TCG/wgs/Job%20Priorities%20Implementation%20Plan.doc

The feedback was implemented. The new installation scripts are available only with the new version of these tools and require a reinstallation of the certification testbed (at least that is my understanding). I assume that this reinstallation can take place when the responsible persons have returned from the EGEE conference next week.

Next actions will be decided depending on the results of this step. I would expect that we have this information by the end of next week.

 

2.1      Distribution of the 24x7 and VO Boxes SLAs to the MB Members

J.Templon asked the MB members to share their 24x7 and VO Boxes SLA agreements in order to learn from other sites’ experience and to possibly define similar support levels at different sites.

F.Hernandez and J.Gordon supported the request, but the documents should be stored in an area accessible only to the MB members because they contain confidential information (host names, etc.) or private information (home phone numbers, etc.).

New Actions:

Sites should send pointers to their 24x7 and VO Boxes SLA documents to A.Aimar.

A.Aimar will prepare a protected web area for confidential documents of the LCG Management Board.

 

3.    SRM Update (Agenda of Edinburgh workshop; Slides) – J.Shiers

J.Shiers provided a short update of the progress with SRM 2.2 deployment.

 

ATLAS Tests – Using SRM 2 as a replacement for SRM 1 was not a problem at most sites for dCache and StoRM. The issues with other implementations are being solved.

 

Tier-1 Sites Deployment – Slide 2 shows the deployment schedule at the Tier-1 sites: starting with NDGF on the 29th October, one site per week, until PIC on the 16th December 2007.

 

CASTOR Deployment - CERN: by end of October 2007; RAL: by end of November 2007; CNAF: once the current problems are solved, expected during November; ASGC: not clear yet.

 

Tier-2 Sites Deployment – The DPM sites can already upgrade now. The dCache sites can upgrade once Version 1.8 is released.

 

4.    CCRC Planning Preparation (Slides) – J.Shiers

                                                                 

 

CCRC’08 is by definition a combined effort between the Experiments (online and offline) and the sites (Tier-0, Tier-1, Tier-2) of the WLCG. Its goal is to verify and measure the readiness for pp data taking from 7+7 TeV collisions in the LHC (July 2008).

 

The goal of the challenge is to identify problems early and allow time for these to be fixed. It is complementary to the on-going challenges / Dress Rehearsals of the experiments and builds on these activities.

4.1      Planning Meetings

The proposal is to use the pre-GDB time slots, on Tuesday mornings, at least until the end of the year and probably until the end of the CCRC programme. More frequent weekly checkpoints will also be needed (every Monday at 17:00 Geneva time, agenda driven). It will be essential to have active participation from all parties, as well as a common schedule (e.g. no independent planning for CASTOR, SL4 releases, etc.).

4.2      Possible CCRC’08 Targets

February ‘pre-challenge’: primary goal is to get all 4 experiments online – out to sites with as much of the chain as possible

-       Data rate is less important – the priority is to see all components working successfully together

-       Not all resources will be in place at this time – target April 1st 2008 for the resources to be installed

 

May ‘challenge’: goal is the successful operation of all components at the expected data rates for 2008

-       All 2008 resources should be fully available for production use at this time

-       Background rate: cosmics (100Hz for ATLAS); peak rate: pp collisions

-       LHC target: averaging a 10-hour fill per 24-hour day ‘would be good’ (40% efficiency for physics)

-       Suggest no need for additional recovery – pp data has priority over cosmics

-       Experiment (ATLAS, LHCb) assumptions: 50% efficiency

4.3      Participants (so far)

It would be important that every representative has a deputy, in order to assure participation at every meeting.

 

Experiments:

-       ATLAS: Kors Bos

-       CMS: Matthias Kasemann

-       ALICE: Latchezar Betev

-       LHCb: Nick Brook

 

Sites:

-       ASGC: Di Qing

-       BNL: Michael Ernst (Gabriele Carcassi)

-       CNAF: Luca dell’Agnello

-       FNAL: Ian Fisk

-       FZK: Andreas Heiss

-       IN2P3: Fabio Hernandez (Lionel Schwarz)

-       NDGF: Matthias Wadenstein

-       NIKHEF/SARA: Mark van de Sanden

-       PIC: Gonzalo Merino

-       RAL: Andrew Sansum (Derek Ross)

-       TRIUMF: Reda Tafirout (Rod Walker)

-       CERN/Tier0: Miguel Coelho

 

WLCG Services:

-       WLCG service coordination: Harry Renshall, Jamie Shiers

-       WLCG service / LHC operations link: Maria Girone

 

L.Robertson proposed that the Tier-2 involvement be managed by the Experiments’ Tier-2 coordinators and, when possible, by the countries that have Tier-2 coordinators.

F.Hernandez added that he will discuss how to involve the French Tier-2 coordinator in the CCRC Planning.

4.4      Key Dates

-       Nov 6/7: (pre-) GDB

-       Dec 4/5: (pre-) GDB

-       Jan 8/9: (pre-) GDB

-       Mid- Jan: ATLAS M6 complete

-       February: first combined challenge

-       March: cosmics runs; analysis of Feb run

-       April: machine closed; preparation for May run

-       May: second combined challenge; first beam

-       June: any residual problems? De-scoping?

-       July: first collisions scheduled

 

M.Kasemann agreed that any contingency can only be achieved by de-scoping the requirements if needed, but the timeline should be respected.

4.5      Proposed Schedule

Phase 1 - February 2008:

Possible scenario: blocks of functional tests.

 

Try to reach 2008 scale for tests at:

-       CERN: data recording, processing, CAF, data export

-       Tier-1’s: data handling (import, mass-storage, export), processing, analysis

-       Tier-2’s: Data Analysis, Monte Carlo, data import and export

 

Phase 2 - May 2008:

Duration of challenge: 1 week setup, 4 weeks challenge

 

Ideas:

-       Use the February (pre-)GDB to review the metrics, the tools to drive the tests, and the monitoring tools

-       Use March GDB to analyse CCRC phase 1

-       Launch the challenge at the WLCG workshop (April 21-25, 2008)

-       Schedule a mini-workshop after the challenge to summarize and extract lessons learned

-       Document performance and lessons learned within 4 weeks.

-       Basic Scaling Items to Check in CSA08

4.6      Explicit Requirements

-       ATLAS, CMS, LHCb: SRM v2.2

-       ALICE: xrootd interface; gLite3.1 VO box & WMS

-       LHCb: generic agents, R/O LFC at Tier1s

-       ATLAS, LHCb: conditions DB

-       CMS: will use only the commissioned links

4.7      Summary

CCRC’08 is the final readiness check after many years of preparation and testing. The current technical scope and timeline are already very ambitious, and CCRC’08 is also testing/measuring service and operations readiness, usable capacity at sites, etc.

 

Recent experience (e.g. SRM v2.2) tells us we must be realistic about availability and schedules. All parties must be very open and clear about the resources that will really be available, and at what level, to work on this much needed activity. More details will be discussed at tomorrow’s GDB.

 

5.    Approval of Grid Security Policy and Site Operations Policy documents (Grid Security Policy; Site Operations Policy) – D.Kelsey

 

D.Kelsey briefly summarized the security documents about Site Operations and Grid Security Policies.

5.1      Site Operations Policy

The document has been approved by EGEE and by the GDB, and is up for approval by OSG.

 

There will be new versions in the future but the MB is asked to approve the current version.

 

H.Marten asked for an explanation of point 6:

6. Your participation in the Grid as a Site shall not create any intellectual property rights in software, information and data provided to your Site or in data generated by your Site in the processing of jobs.

 

D.Kelsey replied that this does not remove any other property rights or copyright; it simply states explicitly that running an application at a site does not entitle that site to any intellectual property related to that application.

 

Decision:

The WLCG Management Board approved the Site Operations Policy document.

5.2      Grid Security Policy

This document replaces the previous version, which is now 4 years old. It was approved by EGEE, while OSG is still discussing it and may request some, hopefully small, changes.

 

D.Kelsey asked the MB to approve the current version, so that further improvements can be included in future versions without stopping the approval process.

L.Robertson supported this request and noted that it is a major improvement over the existing version of the document.

 

Decision:

The WLCG Management Board approved the Grid Security Policy document.

 

6.    Report on Benchmarking at CNAF (Slides) M.Michelotto

 

M.Michelotto summarized the benchmarking activities that were undertaken at CNAF, during summer 2007, in order to better estimate CPU resources.

6.1      Background Information and Motivation

SPECInt 2000 is the benchmark used up to now to measure the computing power of all sites for the HEP experiments:

-       Computing power requested by experiment

-       Computing power offered by a Tier-0/Tier1/Tier2 site

 

SI2K is the SPEC CPU Int 2000 benchmark. It was declared obsolete by SPEC in 2006 and replaced by the new SPEC benchmark, CPU Int 2006 (SI2006).

 

It is now impossible to find SPEC Int 2000 results for new processors (e.g. even for the not-so-new Clovertown 4-core) and to find SPEC Int 2006 results for old processors (built before 2006). In addition, SI2K does not match the behaviour of the HEP applications as it did in the past.

6.2      Using SI2K + 50%

The SI2K results for the last generation of processors are incorrect. To try to correct this problem, CERN (and FZK) started to use a new benchmark measured with gcc, the GNU C compiler, using two flavours of optimization:

-       Real SI High tuning: gcc -O3 -funroll-loops -march=$ARCH

-       Real SI Low tuning: gcc -O2 -fPIC -pthread

 

The CERN proposal is to use as the site rating the “Real SI Low” value increased by 50%, for a short period of time and for the last generation of processors (a short numerical sketch follows the examples below).

 

For example a worker node (32 bits) with two Intel Woodcrest dual core 5160 at 3.06 GHz:

-       SI2K nominal: 2929 – 3089 (min – max)

-       SI2K on 4 cores: 11716 – 12536

-       SI2K gcc-low: 5523

-       SI2K gcc-high: 7034

-       SI2K gcc-low + 50%: 8284

 

On a worker node (64 bits) with two Intel Woodcrest dual core 5160 at 3.06 GHz:

-       SI2K nominal: 2929 – 3089 (min – max)

-       SI2K on 4 cores: 11716 – 12536

-       SI2K gcc-low: 6021

-       SI2K gcc-high: 6409

-       SI2K gcc-low + 50%: 9031
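
As a minimal illustration of the arithmetic behind this interim rating (the measured gcc-low values are taken from the examples above; the helper function and the truncation are added here only for illustration and are not part of the proposal):

def si2k_cern_rating(si2k_gcc_low):
    # Interim CERN rating: the "Real SI Low" (gcc -O2) result increased by 50%,
    # truncated here to match the figures quoted above (an assumption about rounding).
    return int(si2k_gcc_low * 1.5)

print(si2k_cern_rating(5523))   # 32-bit Woodcrest node: 8284
print(si2k_cern_rating(6021))   # 64-bit Woodcrest node: 9031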

 

In addition, the ratio depends on the CPU manufacturer (Intel vs. AMD) and on the processor generation (old Xeon vs. new “core” Xeon), as shown in slides 9 and 10.

 

Slide 11 shows the CERN-proposed solution (SI2K-cern), where there are still differences between Intel and AMD at the same clock speeds and between 32-bit and 64-bit.

In slide 12, the SI2006 results show a smaller difference between the Intel and AMD benchmarks.

6.3      Benchmarking Real HEP applications

At CNAF they have benchmarked real HEP applications from Babar, CMS MC and Pythia.

 

Slide 13 shows the comparison using BABAR as a normalized benchmark:

-       SI2006 matches this pattern (published and gcc ratios are constant)

-       SI2K clearly does not work

-       SI2K-cern is better than SI2K nominal

 

J.Templon asked how the normalization was performed.

M.Michelotto explained that it was done by dividing by the number of cores and by the clock speed (GHz) of the processors; one processor in the Babar tests (the Opteron 275) was taken as 100%.
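
As a sketch of this normalization (only the rule “divide by the number of cores and by the clock speed, with the Opteron 275 set to 100%” comes from the discussion; the throughput figures and helper names below are invented for illustration):

def per_core_per_ghz(throughput, cores, ghz):
    # Normalized figure of merit: application throughput per core and per GHz.
    return throughput / (cores * ghz)

# Hypothetical Babar-like throughput numbers, not taken from the slides.
reference = per_core_per_ghz(throughput=100.0, cores=2, ghz=2.2)   # Opteron 275 = 100%
candidate = per_core_per_ghz(throughput=230.0, cores=4, ghz=3.0)   # some newer node

print(f"{100.0 * candidate / reference:.0f}% of the Opteron 275 reference")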

 

Slide 14 shows the comparison using CMS MC and Pythia as benchmarks:

-       Both SpecInt 2006 published and SpecInt 2006 with gcc show the same behaviour

-       SI2K is clearly wrong.

-       SI2K-cern is better but not as good as SI2006

6.4      Proposal

The use of SI2K should be stopped because its results are incorrect. The CERN-proposed solution is better but uses obsolete and unmaintained benchmarks.

 

The proposal is to use Spec INT 2006 RATE measured with GCC:

-       Spec INT 2006 because it is new, one can find published results, it will have bug fixes, and it will be portable to new platforms and compilers

-       RATE because one can take into account scalability issues with the whole machine architecture

-       Measured with GCC: to keep the environment as close as possible to that of the HEP experiments.

 

The problem is that all the official requirements and pledges are still expressed in terms of SI2K (nominal).

Therefore funding agencies are still bound to use SI2K for all formal agreements and purchasing. A general agreement on the next benchmark to use is still missing.

CNAF will add more comparisons using ATLAS and BABAR applications in the future.

 

H.Marten asked whether the SI2K can be rescaled to the SI2006 with a clear scaling factor.

M.Michelotto replied that, as shown, this is not possible because the scaling is different for Intel vs. AMD and for 32-bit vs. 64-bit CPUs.

 

D.Barberis said that in April it was agreed at the MB that the sites would provide “benchmark installations” for the experiments in order to re-evaluate their resource requirements. What should the Experiments do now?

M.Michelotto replied that the proposal is to use SI2006 with gcc and with the flags of the Experiments.

L.Robertson added that some machines should be configured and provided as reference machines for the benchmarking.

 

T.Doyle asked, regarding the high and low tuning, which one should be used for the SI2006 benchmarks.

M.Michelotto replied that with SI2006 the differences are smaller and one should choose the flags closest to the ones used by the Experiments. In addition, for the many applications that are I/O bound, the benchmarks will behave differently between VOs.

 

Should one continue to use SPEC Int or floating point benchmarks?

M.Michelotto replied that the FP behaviour differs considerably by manufacturer and by specific CPU and board. Benchmarks from the OpenLab show that HEP code is still closer to the SPEC Int benchmark applications than to the FP ones (±10%).

 

H.Meinhard reminded the MB that more benchmarking should be completed, evaluating different flags and platforms. These benchmarks are being evaluated in the HEPiX group for different types of HEP applications (simulation, reconstruction, analysis, etc.).

T.Doyle noted that HEP should in any case converge on SI2006 as soon as possible because this is what will be available. Therefore the evaluation by the Experiments is urgent.

J.Templon agreed that SI2K-cern is better than SI2K, but SI2006 is even better and therefore one should move quickly. SI2006 is what is available on the market; only the scaling factors need to be better understood.

 

M.Michelotto asked that other sites (CERN, FZK) make further measurements and confirm or refute the conclusions reached at CNAF.

H.Meinhard said that for the moment CERN has not done such tests with SI2006.

H.Marten added that FZK has done tests (M.Alef) but not on HEP applications yet.

 

I.Bird noted that future tenders will require the suppliers to execute the HEP benchmark applications.

L.Robertson agreed that this is done currently and will still be needed in the future.

 

L.Robertson proposed that a set of machines with the different configurations (Intel, AMD, 32-bit, 64-bit, etc.) be set up with the compiler flags agreed by the Architects Forum. The goal is to find the ratio between the HEP applications and the SI2006 benchmarks. When the Experiments run their applications on those same platforms, one can calculate the scaling factors to apply.
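
A sketch of how such scaling factors could then be derived (the platform names and scores below are invented; only the idea of taking the per-platform ratio between a HEP application score and the SI2006 score comes from the proposal):

# Hypothetical per-platform results measured on the same reference machines.
reference_machines = {
    "intel_woodcrest_32bit": {"si2006_gcc": 20.0, "hep_app": 18.5},
    "intel_woodcrest_64bit": {"si2006_gcc": 22.0, "hep_app": 19.0},
    "amd_opteron_64bit":     {"si2006_gcc": 19.0, "hep_app": 18.0},
}

# One scaling factor per platform: HEP application units per SI2006 unit.
for name, scores in reference_machines.items():
    factor = scores["hep_app"] / scores["si2006_gcc"]
    print(f"{name}: HEP/SI2006 scaling factor = {factor:.2f}")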

 

H.Meinhard added that at HEPiX there is a session on benchmarking and the AF should provide the flags to be used before that date.

 

7.    CPU Usage Efficiency (Slides) T.Cass

 

T.Cass presented a set of measurements about “wall-clock time vs. CPU time” usage of computing resources by the Experiments.

 

The applications seem to have had a lower CPU efficiency in the last few months (slide 2). In general the efficiency seems to be heading towards 50% by December (black line), or at best stabilizing around 65%, which is still a low value.
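
For reference, the efficiency discussed here is the usual ratio of consumed CPU time to elapsed wall-clock time; a minimal sketch (the example numbers are illustrative only):

def cpu_efficiency(cpu_time_sec, wall_time_sec):
    # CPU efficiency as plotted in the slides: CPU time divided by wall-clock time.
    return cpu_time_sec / wall_time_sec if wall_time_sec > 0 else 0.0

# e.g. a job that used 6.5 hours of CPU in a 10-hour slot is 65% efficient.
print(f"{cpu_efficiency(6.5 * 3600, 10 * 3600):.0%}")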

 

Slides 3 to 6 show that, for instance, at RAL, FZK, IN2P3 and CERN several VOs have an efficiency very close to 100% (PhenoGrid, MINOS, etc.), probably because they have CPU-bound applications, but there are also some HEP VOs like D0. Slides 6 and 7 show the CPU efficiency at CERN during the last year, comparing jobs submitted via the Grid with jobs submitted locally.

 

Patterns are not clearly visible; the VOs should investigate the main cases of low efficiency.

 

Slide 8 shows that most jobs have very low efficiency (CMS at RAL, but just as an example).

Jobs that have zero CPU efficiency for hours or days could be stopped, because they are occupying CPU slots at the site without any real usage.

In many cases those could be jobs waiting for data to be moved to the site where they are running. Tape pre-staging and data caching should also be verified.

 

P.Mato and Ph.Charpentier suggested that slide 8 show the wall-clock time instead of the CPU time, so that one could see how many jobs are 0% efficient.

 

J.Knobloch asked whether, in addition to the CPU time, one can see other parameters (page faults, memory, etc.).

T.Cass replied that this information is probably not collected at the moment.

 

D.Barberis noted that grid jobs are not worse than local jobs; therefore the low efficiency does not seem to be caused by the submission mechanism.

 

Ph.Charpentier noted that if users submit jobs to lxbatch directly there is no pre-staging (which is done by DIRAC for LHCb), and therefore the job efficiency drops.

 

G.Merino added that efficiency can be improved if the VOs copy the data with FTS before executing their jobs.

Ph.Charpentier replied that the VOs use the SRM and the tools have to provide an efficient movement of data.

T.Cass added that the situation should first be assessed and then re-discussed.

 

The MB discussed how to progress in monitoring the low CPU efficiency. J.Templon proposed (a sketch of such a sensor follows the list below):

-       The CPU efficiency should be monitored via sensors integrated into the standard Nagios information/warnings to the sites

-       Experiments should be warned in case of very low values
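
A minimal sketch of what such a sensor could look like (the thresholds and the way the CPU and wall-clock times are obtained are assumptions and were not agreed by the MB; only the standard Nagios exit-code convention OK/WARNING/CRITICAL = 0/1/2 is used):

#!/usr/bin/env python
# Hypothetical Nagios-style check of the aggregated CPU efficiency of a VO's recent jobs.
import sys

WARNING_THRESHOLD = 0.50    # assumed value; the MB left real thresholds to be understood later
CRITICAL_THRESHOLD = 0.25   # assumed value

def check_efficiency(cpu_time_sec, wall_time_sec):
    efficiency = cpu_time_sec / wall_time_sec if wall_time_sec > 0 else 0.0
    if efficiency < CRITICAL_THRESHOLD:
        status, code = "CRITICAL", 2
    elif efficiency < WARNING_THRESHOLD:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0
    print(f"{status}: CPU efficiency {efficiency:.0%}")
    return code

if __name__ == "__main__":
    # In a real sensor these figures would come from the batch system accounting;
    # here they are passed on the command line for illustration.
    sys.exit(check_efficiency(float(sys.argv[1]), float(sys.argv[2])))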

 

Ph.Charpentier asked that important issues be sent to an email address that the Experiments will choose; the Experiments will then decide how to act depending on the priority of the issue.

 

J.Gordon proposed to use the GGUS ticketing system.

I.Bird explained that the work of interfacing the ticketing systems to the Experiments’ mailing lists is being discussed (by M.Dimou). For now one should proceed with emails to the Experiments.

 

L.Robertson proposed that for the moment the most serious problems be followed up by sending (non-automatic) emails for the main inefficiencies. With time, reasonable thresholds for automatic alarms will become clear, but one should first understand the extreme cases.

 

8.    AOB

 

 

Presentations postponed to next week:

-       Report from the EGI Workshop (Slides) - J.Knobloch

-       VO-specific SAM Tests (VO-specific SAM Results) - Experiments

 

 

9.    Summary of New Actions

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.