WLCG Management Board

Date/Time

Tuesday 7 April 2009 – F2F Meeting - 16:00-18:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55733

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 – 14.4.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), K.Bos, F.Carminati, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, Qin Gang, J.Gordon, A.Heiss, M.Ernst, I.Fisk, S.Foffano, M.Kasemann, M.Lamanna, S.Lin, H.Marten, P.Mato, P.McBride, G.Merino, B.Panzer, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Invited

F.Gianotti, P.Kuijer, J.Virdee

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 14 April 2009 16:00-17:00 – Phone Meeting

1.   Experiments Requirements for 2009-10 (Slides; documents)

 

The attached documents are the latest distributed by the Experiments. I.Bird summarized, in the attached slides, what will be reported to the RRB.

1.1      Resource Requirements and Data Taking

For planning purposes we assume two resource periods (although there is no break between them):

-       “2009”: Oct’09 → March’10

-       “2010”: April’10 → March’11 (as before)

 

For data taking the periods are:

-       Apr’09 – Sep’09: no LHC (simulation and cosmics)

-       Oct’09 – Mar’10: 1.7x10^6 sec of physics

-       Apr’10 – Oct’10: 4.3x10^6 sec of physics

-       Nov’10 – Mar’11: LHC shutdown (simulation, reprocessing, etc)
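(Note that the two physics periods together give 1.7x10^6 + 4.3x10^6 = 6x10^6 seconds of physics, which appears to be the figure used in the experiments' event-count estimates below.)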

 

Energy is limited to 5+5 TeV and there will be a heavy-ion run at the end of 2010.

1.2      General Comments

Overall there is less LHC running time anticipated in this period than was originally planned for 2009+2010. However, we must ensure that computing is not a limiting factor when data arrives. See the LHCC conclusions of the WLCG mini-review in February.

 

A significant effort is going into detector understanding now using cosmic ray data. Early in 2009 the WLCG relaxed the requirement to have the 2009 resources in place by April, although many of the (Tier 1) resources are actually in place now. In some cases this has allowed delayed procurement for better equipment.

 

Now the situation is different: Sites will need to install new resources while data taking is in progress. The intention is eventually to provide a ramp-up profile of resources (quarterly?), but for this discussion only the total needs for the two resource periods are presented.

 

J.Virdee and F.Gianotti noted that the amount of data will not be reduced compared to before and there will not be any long period of down time. It is better if the installations are done beforehand.

1.3      Comparisons for Each Experiment 

The comparison is not easy. For each experiment the tables below present the updated requirements for 2009 and 2010, compared with the existing 2009 pledges and the old 2010 requirements (since the split of pledges between experiments is not available after 2009).

 

The new requirements have been reviewed by neither the C-RSG nor the LHCC, although the C-RSG may meet before the RRB.

 

The pledges do not take into account the change in INFN planning, nor the delays at NL-T1 and other sites where the 2008 pledges are not fully installed.

 

Highlighted in pink below are the cases in which the new requirements are higher than the old ones.

1.4      ATLAS

In parentheses is the request made last year and seen by the C-RSG.

For instance, ATLAS had already doubled their request for CPU at CERN to 53.6.

 

ATLAS      | 2009 req | 2009 pledge  | 2010 req | Old 2010 req
CERN CPU   | 57       | 26.5 (53.6)  | 67       | 43
CERN disk  | 3.7      | 2.075 (3.95) | 5.1      | 3.67
CERN tape  | 7.8      | 6.21 (9.69)  | 9.9      | 13
T1 CPU     | 90       | 120.9        | 227      | 198.3
T1 disk    | 24       | 19.86        | 36.7     | 40.35
T1 tape    | 11.3     | 14.72        | 14.8     | 29.9
T2 CPU     | 108      | 114          | 240      | 206
T2 disk    | 13.3     | 11.2         | 24.8     | 22.32

 

The justifications are:

-       Cosmic ray data in Q3 2009 will produce 1.2 PB (same as Aug-Nov 2008)

-       In 6x10^6 sec ATLAS will collect 1.2x10^9 events → 2 PB of raw data (see the worked check below)

-       Raw data stored on disk at Tier-1s for a few weeks

-       Plan for 990M full-simulation events and 2200M fast-simulation events

-       The CERN request was updated last August and was seen by the C-RSG
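A minimal back-of-the-envelope check of the raw-data figure, assuming the 2 PB corresponds only to the 1.2x10^9 physics events; the average rate and per-event size below are derived quantities, not official ATLAS parameters:

```python
# Rough consistency check of the ATLAS raw-data estimate (derived figures, not official parameters).
live_time_s = 6e6        # physics seconds assumed for the 2009-10 run
events = 1.2e9           # expected number of recorded events
raw_volume_pb = 2.0      # quoted raw-data volume in PB

avg_rate_hz = events / live_time_s              # ~200 Hz average recording rate
event_size_mb = raw_volume_pb * 1e9 / events    # ~1.7 MB per raw event (1 PB = 1e9 MB)
print(f"average rate ~{avg_rate_hz:.0f} Hz, raw event size ~{event_size_mb:.1f} MB")
```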

 

D. Barberis stated that ATLAS had slightly different numbers for 2010 (and beyond) that were presented as part of their updated request to the C-RRB in August 2008, and re-stated in their document of 23 October 2008, submitted to the C-RRB Scrutiny Group. The last column of the table is therefore wrong and the correct numbers should be used; these are the same as in the ATLAS document with new requests. If the correct numbers for 2010 are used, the whole right-hand part of the table turns green.

 

D. Barberis also stated that it is incorrect to compare old requests and pledges for Q2-2009 with updated requests for Q1-2010. It is exactly to avoid such mix-ups that ATLAS provided a quarterly breakdown of requests for 2009 and 2010. Looking at the table in the ATLAS document, one can notice that there is a global shift of 6 to 9 months in the new resource requests, which leads to considerable savings in terms of costs.

 

J.Virdee reminded the Board that one should check the numbers presented last time and make sure that the ones presented now are consistent with them.

1.5      CMS

 

CMS        | 2009 req | 2009 pledge | 2010 req | Old 2010 req
CERN CPU   | 48.1     | 54.8        | 112.9    | 115.2
CERN disk  | 1.9      | 2.5         | 4.6      | 3.8
CERN tape  | 9.5      | 9.3         | 15.3     | 14.3
T1 CPU     | 53.5     | 63.7        | 119      | 139
T1 disk    | 6.5      | 8.4         | 14.1     | 15.4
T1 tape    | 10.5     | 16          | 21.6     | 23.2
T2 CPU     | 54.1     | 116         | 209.6    | 306
T2 disk    | 5        | 8.4         | 11.3     | 7.6

 

Requirements are smaller than before, except for tapes and disks at the Tier-2 sites.

 

There are new parameters to take into account for CMS:

-       300 Hz data-taking rate

-       There are now 3 re-reconstruction passes in each of 2009 and 2010

-       CPU times assume higher luminosity in 2010: reco CPU 100 → 200 HS06.s, sim CPU 360 → 540 HS06.s (see the sketch at the end of this subsection)

-       40% overlap in the primary datasets (PD)

 

CMS also added storage needs for the 2009 cosmics data.

 

At the Tier-0 the changes are:

-       Added 1 re-reco in each year

-       Capacity for express stream

-       Reco to finish in 2x runtime

-       Monitoring + commissioning is now 25% of total (was 10%)

 

T1:

-       Finish re-reco in 1 month (was spread over full year)

T2:

-       Require 1.5x more MC events than raw: software changes and bug fixes

-       MC events produced in 8 months (can only start after Aug’09)
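A minimal sketch of how the parameters above translate into a Tier-0 prompt-reconstruction CPU estimate, assuming steady-state running and ignoring efficiency factors, the express stream, re-reconstruction and the monitoring/commissioning share; this is an illustration of the scaling, not an official CMS calculation:

```python
# Illustrative Tier-0 prompt-reconstruction sizing from the CMS parameters above
# (scaling sketch only; real CMS planning includes many more factors).
trigger_rate_hz = 300      # data-taking rate
reco_cpu_hs06_s = 200      # assumed reco CPU per event for 2010 (was 100 HS06.s)
reco_time_factor = 2       # reconstruction allowed to finish in 2x the run time

reco_power_khs06 = trigger_rate_hz * reco_cpu_hs06_s / reco_time_factor / 1e3
print(f"prompt reconstruction alone needs ~{reco_power_khs06:.0f} kHS06 while running")  # ~30 kHS06
```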

1.6      ALICE

 

ALICE      | 2009 req | 2009 pledge | 2010 req | Old 2010 req
CERN CPU   | 42.8     | 46.4        | 46.8     | 49.4
CERN disk  | 2.4      | 4.5         | 4.5      | 4.7
CERN tape  | 3.7      | 7.3         | 6.7      | 11.6
T1 CPU     | 42.8     | 40.9        | 102.4    | 94
T1 disk    | 4.3      | 3.9         | 9.9      | 12
T1 tape    | 5.9      | 6.2         | 11.6     | 19.7
T2 CPU     | 36       | 39.9        | 80.8     | 100
T2 disk    | 4.4      | 2.82        | 12.4     | 4.3

 

The main change is that ALICE will collect pp data at close to the maximum rate: 1.5x10^9 events at 300 Hz (see the check below). The initial running will give the required luminosity without special machine tuning, yielding cleaner data for many physics topics. The first pp run energy is important for interpolating results to the full Pb-Pb energy.
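A one-line check of the implied live time, assuming the events are collected uniformly at the quoted rate (a derived figure, not an ALICE statement):

```python
# Implied pp live time from the ALICE figures above (derived, assuming a uniform 300 Hz rate).
events = 1.5e9
rate_hz = 300
live_time_s = events / rate_hz
print(f"~{live_time_s:.1e} s of effective pp data taking")  # ~5.0e+06 s
```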

 

They plan to collect large pp statistics in 2009-10 and assume one month of Pb-Pb running at the end of 2010.

 

Main changes:

The 2009 requests are within the pledges except for Tier-2 disk. For 2010 the actual pledge for ALICE is not known, but in general the pledges are significantly lower than the requirements (so the final column should be mostly pink for Tier-1 + Tier-2).

 

J.Gordon noted that the T1 Disk for 2009 should not be green. I.Bird agreed.

1.7      LHCb

LHCb       | 2009 req | 2009 pledge | 2010 req | Old 2010 req
CERN CPU   | 11.4     | 4.2         | 19.2     | 6.12
CERN disk  | 0.78     | 0.99        | 1.47     | 1.28
CERN tape  | 1.2      | 2.27        | 2.3      | 4.2
T1 CPU     | 16       | 20.2        | 34       | 27.36
T1 disk    | 2.8      | 2.7         | 4.4      | 3.25
T1 tape    | 1.3      | 3.2         | 2.9      | 5.86
T2 CPU     | 21.9     | 35.4        | 31.5     | 45.5
T2 disk    | 0.02     | 0.37        | 0.02     | 0.02

 

The main change for 2009 is an increase of CPU at CERN.

 

There is uncertainty in the running mode (pile-up) → contingency is added on event sizes and simulation time.

-       2009 Simulation with assumed running conditions

-       Early data with loose trigger cuts and many reprocessing passes – alignment/calib+early physics

-       2010 – there will be several reprocessing passes and many stripping passes

-       Simulation over full period

 

Main changes:

-       CERN increase due to need for fast feedback to detector of alignment/calibration + anticipation of local analysis use

-       T1 CPU increase in 2010 due to more reprocessing

-       T2 requirements decrease as less overall simulation needed

 

I.Bird noted that LHCb reports a ramp-up with the needs at beginning and end of the periods.

Ph.Charpentier replied that the table reports the integrated amount of CPU even if the need is unequal during the period. LHCb will not use the resources all the time. Sites and funding agencies should be aware of this.

I.Bird noted that the Sites then have to procure for the maximum value requested, even if it is not used uniformly during the period. For example, the LHCb CPU request at CERN is 17 instead of 11.4 (comparing slides 7 and 8), as illustrated in the sketch below.
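A minimal illustration of the point: sites must install the peak requirement, not the time-averaged one. The quarterly values below are invented for illustration; only the 11.4 average and the 17 peak come from the slides.

```python
# Hypothetical quarterly profile for LHCb CPU at CERN (illustrative values only;
# the real profile is in the LHCb document).
quarterly_need = [5.0, 9.0, 14.6, 17.0]

average_need = sum(quarterly_need) / len(quarterly_need)   # what an integrated table reports: 11.4
installed_capacity = max(quarterly_need)                   # what a site actually has to install: 17.0
print(f"average ~{average_need:.1f}, installed capacity must cover the peak of {installed_capacity}")
```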

1.8      Global Summary

The summary over all Experiments is below. It may not be very significant, but it shows the overall discrepancies between the new requirements and the pledges, rather than against the previous 2010 requirements as in the other tables.

 

 

Except for CERN there are no major differences in 2009. Most changes are in 2010.

 

J.Virdee noted that the question “why are the requests the same or higher if the running time is significantly shorter?” must be addressed clearly during the presentation. For instance, explanations from the Experiments are:

-       CPU must be adequate for the data taking in 2010 and therefore must be available as originally planned.

-       The rate is the same as was planned, therefore CPU and disk should be available in adequate amounts.

-       There must be disk available for immediate analysis, so as not to lose any of the data collected.

-       There must be sufficient disk for data taking and data analysis/debugging in parallel.

 

F.Gianotti added that in 2009 cosmics data has been collected and was very useful for calibration and alignment. This data, about 2 PB, should be kept on disk.

 

J.Virdee added that the last quarter of the year is usually the most demanding, both for data taking and for analysis. In April the resources should all be available; installing them later will be difficult.

 

D.Barberis noted that the new requests are the same as before, just shifted by 9 months.

 

I.Bird noted that ALICE, ATLAS and LHCb provided a ramp-up profile, and the Sites could accommodate the installations in two steps. Originally these were August and April.

M.Kasemann noted that the two-step approach was agreed before it was known that the run would be continuous, as in the new schedule.

F.Gianotti added that probably funding agencies will check the behaviour in 2009 before buying for 2010.

 

J.Templon noted that funding agencies and Sites want to provide what is needed, but waiting allows them to choose more cost-effective and power-efficient hardware.

 

J.Virdee noted that for each category one must have an argument:

-       CPU is needed because all incoming data must be processed immediately.

-       Long-term storage is needed, together with immediate access for analysis and debugging.

-       The experiments took 20 years to prepare and data should not be deleted, not even the cosmics data.

-       The costs are negligible compared to the cost of the Experiments. The stored data is the history of the detector since the beginning: it must not be lost by deleting it.

 

F.Gianotti added that cosmics data is the only data available when there is no beam; the Experiments have used such data and it will also be useful in the future.

 

M.Kasemann agreed that this is the first year of data taking and one should have resources available for anything unforeseen. It would be a disaster not to have resources for some physics needs during data taking, after so many years of preparation.

 

F.Carminati added that the scenarios for analysis and data taking have been tried and show that as much data as possible must be on disk; retrieving from tape is very complex. F.Gianotti supported this argument.

 

J.Templon, supported by I.Fisk, added that the number of users will increase and this will require more resources. Even if the data volume is smaller, the users are as many as originally planned; their activity is independent of the amount of data.

 

I.Bird added that working with cosmics and simulated data led to the changes in the requirements as now specified; the old requirements are now better understood and have been updated.

F.Gianotti added that this is indeed what happened in ATLAS, when they saw that more capacity is needed at CERN, that staging is not easy, and that disk is much more effective for the users’ work.

 

J.Virdee added that when the change is within 10% it does not really need to be justified, as it is quite normal. The first run is very important and the resources should be there; analysis is done on disk, tapes are just there for archiving.

 

J.Templon noted that the Sites have been following their pledges since 2005 and would like to see that data is actually coming before procuring.

J.Gordon and I.Bird replied that procurement takes one year and therefore waiting for the data would be too risky and too late. Disks must be tried well in advance: firmware problems have proven to be a major recurring issue, and all Sites had problems with hardware last year. Sites must check their hardware well in time to have it fixed.

 

J.Virdee added that the final conclusion is that it would be a shame to lose data after 20 years of work.

The same slides and material will be shown to the C-RSG the following day; support from the C-RSG would be very important.

 

Action:

I.Bird will send an updated version of the slides attached.

 

2.   Minutes and Matters Arising (Minutes)

 

2.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved without comments.

2.2      QR Reports 2009Q1

A.Aimar will collect the QR contribution from LCG Operations, GDB and Applications Area.

The agreed dates for the Experiments’ presentations are:

-       7 April 2009: ATLAS

-       14 April 2009: ALICE, LHCb, CMS

2.3      New proposal for Availability Reports (Sample_Tier1_Report.pdf)

A.Aimar distributed the proposal, which now computes availability without taking into account the “unknown” time (see the sketch below).
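A minimal sketch of the proposed calculation, assuming availability is computed over the known time only; the exact formula in the proposal document may differ:

```python
# Sketch of an availability calculation that excludes "unknown" time from the denominator
# (an illustration of the idea, not the exact algorithm of the proposal).
def availability(up_hours, down_hours):
    known = up_hours + down_hours    # "unknown" time is simply excluded
    return up_hours / known if known else None

# Example month of 744 h: 600 h up, 100 h down, 44 h unknown
print(f"{availability(600, 100):.1%}")   # 85.7%, versus 600/744 = 80.6% if unknown time counted as down
```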

 

3.   Action List Review (List of actions)

 

 

Not discussed this week.

 

4.   LCG Operations Weekly Report (ALICE Alarm Test summary; CMS GGUS Alarm Test results; GGUS summary; LHCb GGUS Alarm Test results; Slides) - J.Shiers

 

Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

J.Shiers included a few remarks about the quarter.

 

Below are the Service Incident Reports (SIRs) produced during the quarter. Not all sites provided adequate reports, nor were they always sent to the right lists and channels.

 

Site | When  | Issue
CERN | 08/01 | Many user jobs killed on lxbatch due to memory problems
CERN | 17/01 | FTS transfer problems for ATLAS
CERN | 23/01 | FTS / SRM / CASTOR transfer problems for ATLAS
CERN | 26/01 | Backwards-incompatible change on SRM affected ATLAS / LHCb
CERN | 27/02 | Accidental deletion of RAID volumes in C2PUBLIC
CERN | 04/03 | General CASTOR outage for 3 hours
CERN | 14/03 | CASTOR ATLAS outage for 12 hours
CNAF | 21/02 | Network outage to Tier-2s and some Tier-1s
FZK  | 24/01 | FTS & LFC down for 3 days
ASGC | 25/02 | Fire affecting the whole site – services temporarily relocated
RAL  | 24/03 | Site down after power glitches; knock-on effects for several days

 

4.1      QR Conclusions

Not all sites are yet reporting problems consistently – some appear ‘only’ in broadcast messages, which makes them very hard to track and impossible to learn from. For example, from a single joint operations meeting (Indico 55792):

SARA: OUTAGE: From 02:00 4 April to 02:00 5 April. Service: dCache SE.
SARA: OUTAGE: From 09:30 30 March to 21:00 30 March. Service: srm.grid.sara.nl.
SARA: OUTAGE: From 15:13 27 March to 02:00 31 March. Service: celisa.grid.sara.nl. Fileserver malfunction.
CERN: At Risk: From 11:00 31 March to 12:00 31 March. Service: VOMS (lcg-voms.cern.ch).
FZK: OUTAGE: From 14:21 30 March to 20:00 30 March. Service: fts-fzk.gridka.de
INFN-CNAF: OUTAGE: From 02:00 28 March to 19:00 3 April. Service: ENTIRE SITE.
INFN-T1: OUTAGE: From 16:00 27 March to 17:00 3 April. Service: ENTIRE SITE.
NDGF-T1: At risk: From 12:31 27 March to 16:31 30 March. Service: srm.ndgf.org (ATLAS).
NDGF-T1: At risk: From 12:31 27 March to 13:27 31 March. Service: ce01.titan.uio.no.

 

As per previous estimates, one site outage per month (Tier-0 + Tier-1s) due to power and cooling is to be expected. It is very important to keep some track of these through the daily operations meetings and weekly summaries.

 

We must improve on this in the current (STEP’09) quarter – all significant service / site problems need to be reported and some minimal analysis – as discussed at the WLCG Collaboration workshop – provided spontaneously. There should be some SERVICE METRICS – as well as experiment metrics – for STEP’09 which should reflect the above.

 

The number of CASTOR and other data-management-related issues at CERN is still too high – we need to monitor this closely and (IMHO again) pay particular attention to CASTOR-related interventions. The data management piquet (DM PK) is called about once per week, mainly for CASTOR and sometimes for the CASTOR DBs; more statistics are needed but the frequency is painfully high. And we are not even taking data yet.

 

ASGC fire – what are the lessons and implications for other sites? Do we need a more explicit (and tested) disaster recovery strategy for other sites and CERN in particular?

 

Otherwise the service is running smoothly and improvements can clearly be seen on timescales of months / quarters. The report from this quarter is significantly shorter than that for previous quarters – this does not mean that all of the issues mentioned previously have gone away!

 

A HEPiX Questionnaire is being proposed and the suggestion is that the Tier-0 and Tier-1 Sites complete this questionnaire.

4.2      GGUS Summary

 

VO concerned | USER | TEAM | ALARM | TOTAL
ALICE        | 2    | 2    | 7     | 11
ATLAS        | 20   | 14   | 11    | 45
CMS          | 43   | 0    | 8     | 11
LHCb         | 10   | 1    | 9     | 20
Totals       | 35   | 17   | 35    | 87

 

Alarm testing was done last week: the goal was that alarms were issued and analysis was complete well in advance of these meetings.

From Daniele’s summary of the CMS results: “In general, overall results are very satisfactory, and in [the] 2nd round the reaction times and the appropriateness of the replies were even more prompt than in 1st round”. Slides 8 and 9 show some examples.

 

Links to detailed reports are available on the agenda page – or use GGUS ticket search for gory details.

 

Discussions during the daily meetings revealed a mismatch between the data management piquet (DM PK) coverage and expectations: there is no piquet coverage for services other than CASTOR, SRM, FTS and LFC. For all other services, alarm tickets will be handled on a best-effort basis.

 

J.Templon commented that the tests were too realistic and the risk is that in the future the Sites will mistake real alarms for tests. The VOs should not all test all the Tier-1 Sites simultaneously.

J.Shiers replied that the tests will only be done once per quarter. He is ready to receive comments from the Sites about the test alarms.

 

Slide 10 shows the progress of the SAM tests at the Sites. The days masked are those where the downtime was announced or the problems were understood. Some ATLAS tests at NDGF are failing but there are no messages or emails explaining it.

 

“The priority to improve the site readiness is there NOW and not only in May/June when ATLAS and the other VOs are actively scale-testing at the same time”

 

5.   ATLAS QR Report 2009Q1 (Slides) – D.Barberis

 

D.Barberis presented the ATLAS QR for 2009Q1.

5.1      Tier-0 and Data-taking Activities

It was a quiet period during the 1st quarter 2009. Partial read-out tests ("slice weeks") started in March and will continue for another month. These are mostly DAQ tests, turning into more complete tests (including Trigger) later this month.

 

Global cosmic data-taking runs will restart during May 2009. Initially with partial read-out, later on with the complete detector.

 

ATLAS plans the STEP'09 exercise from end-May to mid-June. Late June onwards they see almost continuous cosmic data-taking and will be ready for collisions in Autumn 2009.

5.2      Data Reprocessing

During the Christmas and New Year period ATLAS ran a reprocessing campaign for the single-beam and cosmic data taken in August-November 2008.

This was partially forced by circumstances: the software and calibrations were only ready in mid-December. Most sites ran on a “best effort” basis during the holiday period.

 

In the end, 500 TB of raw data were processed at 8 Tier-1 sites and CERN.

-       FZK failed the validation in December: test jobs produced different results from all other sites (under investigation, the local Linux build is suspected).

-       ASGC had storage troubles in December when we launched the jobs

-       “Slow motion” in PIC and NIKHEF is under investigation

 

Reconstructed data were merged and distributed to the other Tier-1/2 sites. A second reprocessing campaign has just started, this time reading back raw data from tape (where possible). Database access issues are believed to be sorted out; this is being checked.

5.3      Data Functional Tests

ATLAS continues running data export functional tests at low level to keep checking the health of the whole system. Site contacts are promptly notified of problems and we follow up all troubles together.

 

The picture below shows the Tier-1 to Tier-1 transfers for the first week of April.

 

 

One can see that ASGC and CNAF were off for the whole week.

CNAF was still working as a receiver because some Tier-2 Sites were working, the LFC having been moved to Rome.

5.4      Simulation Production

The MC activities continue constantly, as shown below, and are only limited by physics requests and the availability of disk space for output data.

 

5.5      Plans

There are upcoming software releases:

-       Release 15.0.0: March 2009. Includes feedback from 2008 cosmic running. Base release for 2009 operations.

-       Releases 15.X.0. Once/month (or 6 weeks). Incremental code improvements.

 

The cosmic runs are in progress with partial read-out and will restart with the complete detector in April-May 2009.

 

ATLAS will be ready for STEP'09 in June (see GDB presentation) and for collision data in the Autumn.

 

6.   User Analysis Support - WLCG Wrap Up (Slides) – M.Lamanna

 

M.Lamanna reported on the progress in Support for User Analysis in IT/GS, and as a follow up of the analysis support session at the WLCG Workshop.

6.1      Background Information

Similar activities are ongoing in all Experiments. In IT/GS they started an “inventory” of useful tools/strategies. There are lots of similarities and complementary approaches. Examples of tools:

-       Shift system (e.g. ATLAS)

-       Job robots (e.g. CMS)

-       Site commissioning and monitoring

-       Stress generators (e.g. ATLAS)

 

Detailed studies of site and code performance are also performed. Analysis support is essential for 2009/2010 data taking, so that the experiments and IT/GS learn from each other, with clear metrics to evaluate the user support.

6.2      Further Steps

In the framework of STEP’09, IT/GS will start fostering further exchanges between the experiments and IT/GS in the area of user analysis support.

They will complete the inventory of useful tools/strategies, explore the possibility of extending the usage of tools (like robots and stress tests), and devise sensible metrics to evaluate the user support.

 

J.Templon noted, from the WLCG Workshop, that users do not care about Site monitoring and the tools must take care of this for them. For instance, black lists should be maintained by the VO for the users, in a transparent way (see the sketch below).

M.Lamanna agreed, but added that some power users will want total control instead. Both scenarios should be allowed and supported.
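A minimal sketch of the kind of transparent blacklisting being discussed, assuming a hypothetical VO-maintained blacklist and a simple site-filtering step in a submission tool; the names and structure are illustrative, not any experiment’s actual implementation:

```python
# Illustrative only: a hypothetical VO-maintained blacklist applied transparently at job submission.
vo_blacklist = {"SITE_B", "SITE_D"}     # maintained centrally by the VO, invisible to ordinary users

def usable_sites(candidate_sites, user_override=None):
    # Power users can pass an explicit list to keep full control (the second scenario above);
    # everyone else gets the blacklist applied without having to know about it.
    if user_override is not None:
        return list(user_override)
    return [s for s in candidate_sites if s not in vo_blacklist]

print(usable_sites(["SITE_A", "SITE_B", "SITE_C"]))   # ['SITE_A', 'SITE_C']
print(usable_sites(["SITE_A", "SITE_B"], ["SITE_B"])) # power user keeps control
```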

 

M.Schulz added that an analysis in OSG showed that the error rate of end-user applications is well above 10-20%, and support will be complex with so many failures.

M.Lamanna added that it can even reach 50% when new data sets are analysed, and users often have problems/bugs in their applications. To help users, CMS will provide some “reference Sites and data sets” where users can test their applications before doing real analysis.

 

M.Kasemann added that users need to have different levels of support and examples. Several levels of support for simple and advanced questions must be provided. Experience and tools could be shared across VOs and IT Services.

M.Lamanna added that this is exactly the purpose of this activity.

 

I.Bird proposed that the MB receive an update in the next few weeks.

M.Lamanna agreed on end of April as a good time for a summary.

 

 

7.   AOB

 

7.1      Proposal for a technical discussion group (Slides)

I.Bird presented a proposal for a forum where technical discussions of interest to the WLCG could take place.

 

WLCG does not have a forum where all of the technical people discuss critical and strategic technical issues. The AF is for Applications Area architects and some offline people. The pre-GDB and GDB tend to involve mainly sites and a few “grid-specific” experiment people. In the Baseline Services working group there was a better representation of experiment offline people.

 

For instance, consider the recent discussions on SL5 – online vs. offline vs. experiment-grid vs. middleware vs. site managers. No forum exists where all these parties can discuss a given issue, even though there are implications for all of them. WLCG needs a place where strategic technical issues, priorities and roadmaps can be discussed with all relevant stakeholders, and from which requirements can then be provided to grid projects, developers, etc.

 

The proposal could be to make better use of the pre-GDB or to create a new group. In either case it must be led by someone driving the discussion.

 

The questions are:

-       Is such a forum/group needed? 

-       Is there time for it?

-       What format? (Pre-GDB or a new group – or are they the same?)

 

M.Kasemann added that an AF for middleware would also be useful, where new features or requests could be discussed and agreed with the middleware developers and the Sites.

I.Bird agreed and asked for comments by next week.

 

8.   Summary of New Actions

 

 

New Action:

14 Apr 2009 - Sites and Experiments should comment on the need for, and functions of, a WLCG technical group.