WLCG Management Board

Date/Time

Tuesday 31 March 2009 – Phone Meeting – 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49395

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 5.4.2009)

Participants

A.Aimar (notes), L.Betev, I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Ernst, I.Fisk, S.Foffano, M.Kasemann, U.Marconi, P.Mato, P.McBride, G.Merino, A.Pace, B.Panzer, R.Pordes, H.Renshall, M.Schulz, J.Shiers, R.Tafirout, J.Templon

Invited

N.Brook, J.Casey, M.Livny

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 7 April 2009 16:00-18:00 – F2F Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved without comments.

1.2      QR Reports 2009Q1

A.Aimar will collect the QR contributions from LCG Operations, the GDB and the Applications Area.

The agreed dates for the Experiments’ presentations are:

-       7 April 2009: ATLAS

-       14 April 2009: ALICE, LHCb, CMS

1.3      High Level Milestones (updated) (HLM_20090323.pdf)

A.Aimar distributed the HLM updated after the previous MB Meeting.

1.4      Availability Reports (reporting "unknown" time) (Example; Slides) - J.Casey

The high number of “unknown” results in the availability graphs causes the reported availability to be low, even when the Sites are not responsible for the “unknown” results of the VO SAM tests. See page 2 of the Example.

 

At the MB there were complaints about the availability shown in the reports when a lot of UNKNOWN time is present (e.g. the February LHCb VO tests for NL-T1). The new proposal assumes that “the uptime during the 'unknown' time has the same behaviour as during the known time” and applies the same logic in the reports and in the GridView portal.

 

In parallel, the “unknown” time should be monitored and reduced to low levels, as is done for the OPS SAM tests.

 

Current Availability:  Avail = Uptime / (Total time)

Proposed Availability: Avail = Uptime / (Uptime + Downtime + ScheduledDowntime)

or, equivalently, with all quantities expressed as fractions of the total time:

Avail = Uptime / (1 - Unknown)

Down = Downtime / (1 - Unknown)

Scheduled_down = Scheduled_downtime / (1 - Unknown)

 

Slide 4 shows an example of the new calculations.

 

                  OLD             NEW
Availability      12/24 = 0.500   12/19 = 0.632
Down               3/24 = 0.125    3/19 = 0.158
Scheduled Down     4/24 = 0.167    4/19 = 0.210
Unknown            5/24 = 0.208    (excluded from the known period)
Total                      1.000           1.000

 

In this example the known period is only 19 hours, not 24 hours. The availability is 63% over the measured period (19 hours), and the measured period is 79% of the reporting period (24 hours).
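The following Python sketch (an illustration of the proposed formula only, not the actual GridView implementation) reproduces the numbers of this example and guards against the 0/0 case discussed below.

    def new_availability(up, down, sched, unknown):
        # All arguments are hours within the reporting period.
        total = up + down + sched + unknown
        known = total - unknown  # the measured period, i.e. total * (1 - Unknown)
        if known == 0:
            # Status "Unknown" for the whole period: 0/0, so the
            # availability is reported as undefined rather than zero.
            return None
        return {
            "availability": up / known,       # 12/19 = 0.632 in the example
            "down": down / known,             # 3/19  = 0.158
            "scheduled_down": sched / known,  # 4/19  = 0.210
            "known_fraction": known / total,  # 19/24 = 0.79 of the reporting period
        }

    print(new_availability(up=12, down=3, sched=4, unknown=5))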

 

With the new formula there is now the possibility of an undefined availability, because the denominator in the above equations can be zero.

 

Example 1:

If the status is 'Unknown' for the whole hour, then Unknown will be 1 and the availability will be 0/0, which is “undefined”. With the old algorithm it was zero.

 

Example 2:

If the site is scheduled down for the entire known period, the availability for that period is 0; no change from the earlier algorithm.

 

If there are no results during a scheduled downtime, the period is marked as SCHEDULED_DOWN: the expected state is known even if no metrics were received.

 

The Known Interval (the measured period, i.e. the quantity 1 - Unknown) should be quoted together with the Availability and Reliability numbers to indicate how good they are, e.g. “99% available, with results known for 80% of the period”.

 

Note: The Reliability numbers remain unchanged.

 

New Action:

An updated SAM report will be distributed to the MB mailing list.

 

Ph.Charpentier asked whether this will be implemented both in the dashboards and in GridView.

J.Casey replied that it will be done in both: initially in GridView, later in the dashboards too.

 

A.Pace proposed using error bars on the availability bars in order to show the impact of the unknown time on the value.

 

2.   Action List Review (List of actions)

 

 

Not discussed this week.

 

3.   LCG Operations Weekly Report (Slides; Weekly minutes) - J.Shiers

 

Summary of the status and progress of LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Main Issues

 

Site – Date – Issue

RAL – 24/3: 2 power glitches resulted in a major site outage. CASTOR was up at 14:30 on 25/3, other services soon after. The ATLAS replication had to be restarted due to DB corruption, with some other knock-on effects… Move to the new machine room & STEP’09?

CNAF – 27/3 – 03/04: Scheduled downtime (power supply and air conditioning) due to the interconnection of the existing services to the new infrastructure system. The LFC has been replicated in Roma (CHEP).

PIC – 06/04 12h – 08/04 15h: Annual power supply maintenance – scheduled downtime. [Impact on the ATLAS ES(?) cloud?]

ASGC – 27/3 (?): Have now hired a full-time DBA (who?). Still in the process of relocating services to the IDC. Communication is still an issue here – (for example) many days of delay in responding to problems with the Oracle DB & streaming for ATLAS – this is not compatible with the response times discussed here (MB), nor with a reliable service. The commissioning for STEP’09 needs to be understood!

 

Slide 3 contains more information on ASGC.

3.2      GGUS Tickets

No alarm tickets were submitted during the period (2 weeks). The two tables below show the tickets for each of the two weeks.

 

VO concerned   USER   TEAM   ALARM   TOTAL
ALICE             1      0       0       1
ATLAS            14     15       0      29
CMS               2      0       0       2
LHCb             11      2       0      13
Totals           28     17       0      45

 

VO concerned   USER   TEAM   ALARM   TOTAL
ALICE             2      0       0       2
ATLAS            12     11       0      23
CMS               4      0       0       4
LHCb              8      2       0      10
Totals           26     13       0      39

 

Slide 5 shows the reasons for the downtimes on each of the graphs.

 

Slide 6 shows that most downtimes are understood.

 

J.Templon noted that some CE failures at SARA were due to a user overloading the SRM.

3.3      Summary

“Transitory” problems (“here today and gone tomorrow”) continue, possibly at a higher rate than in recent weeks, and remain unexplained.

 

Masking out scheduled or understood issues leaves a relatively good service view for this period – is this representative of the situation?

 

Scale Testing for the Experiment Program (STEP ’09) is possible in May/June, to finish by the end of June.

 

 

4.   ALICE Requirements for 2009-10 (Slides) – L.Betev

 

L.Betev described the attached document. Below is the summary he presented.

 

Page 2 contains the scenario for 2009-10.

 

 

ALICE would like to store as much data as possible, as early as possible.

 

Page 3 contains the split for CPU, disk and tape, for the new and old requirements, and the variation for the T0, T1, T2 and CAF.

 

 

The requirements for disk have increased at CERN.

 

 

 

 

I.Bird noted that ALICE should prepare strong arguments for these increased requirements; some funding agencies will challenge all increases they see.

L.Betev replied that the reasons are explained on Page 1. The document was approved by the ALICE MB and can therefore be quoted.

Ph.Charpentier noted that the changes are within the error margin of these estimations.

 

5.   ATLAS Requirements for 2009-10 (Slides) – K.Bos

 

The data presented are still preliminary and ATLAS might still change them.

 

The scope of the requirements goes from April 2009 to April 2011.

Below, the combined tests with cosmic-ray and other types of data are shown.

 

Note: The width of the arrows below shows the priority and resources of each activity.

 

 

Usually Q1 and Q4 are considered less heavy because they are around the end of the year.

 

The Tier-0, Tier-1, Tier-2 and CAF requirements are shown in slides 3 to 6.

 

In conclusion K.Bos reminded the MB that:

-       The numbers presented had not yet gone through all relevant ATLAS management bodies. More details and updated numbers will follow during this week and next.

-       The resources requested for 2009/2010 are needed later, and are marginally less than before

 

 

-       The rate determines the resources just as much as the livetime

-       More cosmic ray running and more simulation in Q3+4 2009

-       Compared to 2008 requests: tape requirements for Tier0 down by 30%

-       Disk and tape requirements for Tier-1s down by ~20%

-       Requirements for Tier2’s are unchanged

-       Simulation requirements have not changed: 900M new G4 and 2.2B ATLFAST2 events

-       Analysis ramping up independently of beams

 

 

I.Bird noted that the data expected from the LHC is now about 1/3 of what was assumed before. What arguments can be used to defend these requests?

K.Bos replied that reprocessing and cosmics data will require resources. He has a more detailed explanation available.

 

I.Bird asked that strong justifications be prepared for the RRB by next week.

 

J.Templon noted that cosmics runs would probably not be a well-received justification.

 

 

6.   CMS Requirements for 2009-10 (Slides) – M.Kasemann

 

M.Kasemann also clarified that these requirements are still under discussion inside the Experiment.

They will be presented for final approval next week.

6.1      Planning Values

CMS assumes:

-       LHC running time: 6 × 10^6 s

-       Oct09 - Mar10: 1.7 × 10^6 s, duty cycle 0.2

-       Apr10 - Sep10: 4.3 × 10^6 s, duty cycle 0.5

 

-       Data taking rate: 300 Hz

-       Re-reconstructions: 3 in 2009, 3 in 2010

-       Event size: constant for 2009 & 2010

 

CPU times: assume higher lumi in 2010

-       RecoCPU increases from 100 to 200 HS06.s in 2010

-       SimuCPU increases from 360 to 540 HS06.s in 2010

 

CMS assumes a 40% overlap in the primary-dataset (PD) sets, because the MC studies have uncertainties; the final target is actually 20%.

 

Added storage for COSMICS runs in 2009.

 

For this planning exercise:

-       2009 run: ends March 2010,

-       2010 run: starts April 2010

 

The conversion factor to HEP-SPEC06 is 4 HEP-SPEC06 = 1 kSI2k (see the Jan 27th WLCG-MB).
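As a minimal illustration of this conversion (the helper function below is hypothetical):

    def ksi2k_to_hepspec06(ksi2k):
        # Agreed conversion factor: 4 HEP-SPEC06 = 1 kSI2k
        return 4.0 * ksi2k

    # e.g. the 2009 RecoCPU of 100 HS06.s per event corresponds to 25 kSI2k.s:
    print(ksi2k_to_hepspec06(25.0))  # 100.0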

6.2      Changes for 2009

T0 CPU:

-       added 1 re-reco each for 2009 and 2010

-       added capacity for instantaneous reconstruction of the express stream

-       reco to finish in 2 × runtime (as if the duty cycle were 0.5)

-       first data taking: increase the CAF-based monitoring & commissioning activity to 25%/20% of the total (nominal was 10%)

 

T0 2009 tape:

-       show only the increment on top of what is stored now

 

T1 2009 CPU:

-       finish each re-reco in 1 month (previously the re-recos were spread over the full year)

 

T2 2009 CPU:

-       Require 1.5× more MC events than RAW, because of frequent software changes and bug fixes

-       Produce the MC events in 8 months: resources and CMSSW are available in August

 

6.3      CMS Requirements

Here is the result of the planning exercise.

 

The following slides show the same information as graphs.

 

6.4      Comments on the Requirements

At CERN the resources match the requirements well:

-       Need 88/100% (in 2009/2010) of the CPU requested for the “old” running scenario

-       Need 75/115% of the disk requested for the “old” running scenario

-       Tape: only the increment for 2009/2010 is shown; we need to review which of the current data can be deleted

In summary: the CERN resources match the pledges reasonably well, with some surplus in CPU resources and increased disk needs in 2010.

 

T1:

-       CPU pledges are reasonably justified by our request (note: our 2009 request goes up because we want re-reco to finish in 1 month)

-       Disk pledges are reasonably justified by our request

-       Tape pledges are reasonably justified by our request (we need to review which of current data can be deleted)

 

T2:

-       2009: Disk & CPU have to build up for 2010; the 2010 MC will be started in 2009

-       2010: Disk & CPU request matches the pledges reasonably well

The T2 resource pledges are still incomplete.

 

Important Comments

Pledges are official for 2009, as agreed at the C-RRB; the 2010 pledges will be agreed at the next C-RRB.

For now these are educated guesses, based on direct information from some sites and on scaling the total upgrade plans to the CMS part.

 

Pledges are not corrected for the Italian (non-)upgrade plans:

-       Increase for 2009 not confirmed: 3.2 MHEPSPEC06, 640 TB disk, 750 TB tape

 

F.Hernandez noted that if CMS wants to concentrate the CPU usage at the Sites over a shorter period, this will have an impact on the network needed at the Sites.

M.Kasemann replied that the global impact should still be well below the 2010 running conditions. 

 

J.Gordon noted that the request for disk at the Tier-1 Sites has increased by 25%, which is above the error tolerance.

M.Kasemann agreed to verify and correct the values in the CMS write-up for next week.

 

7.   LHCb Requirements for 2009-10 – N.Brook

 

 

As slide 2 shows, there will be a slight increase in event size.

 

 

MC 2009: LHCb assumes that in 2009 it will generate ~600M b-type events and ~2000M non-b-type events (min bias). Generation takes place at all centres.

 

Reconstruction and Stripping 2009:

It assumes:

-       ~5 × 10^5 s of DAQ time in 2009

-       1.0 × 10^9 events (a rough rate check is sketched after this list)

-       4 “full” reconstruction passes, 1 of which will be in 2010

-       4 stripping passes, 1 of which will be in 2010; no event reduction, but a microDST output is assumed for some physics WGs (factor 10 reduction in event size)
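As a rough consistency check (an inference for illustration, not taken from the slides), the quoted event counts correspond to the DAQ live time multiplied by an average trigger rate of about 2 kHz:

    # Implied average trigger rate = events / DAQ live time.
    for year, daq_seconds, events in [("2009", 5.0e5, 1.0e9), ("2010", 5.6e6, 1.1e10)]:
        print(f"{year}: {events / daq_seconds:.0f} Hz")  # 2009: 2000 Hz, 2010: 1964 Hz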

 

Reconstruction of TED data etc. is assumed before the first “physics” beam, with many reconstruction passes.

 

Analysis 2009

It is assumed that in 2009 ~40% of the analysis takes place outside of CERN.

There are two periods in the year:

-       MC analysis

-       Analysis of first data

 

Grid analysis runs at Tier-2 sites; about 50% is ToyMC-type activity.

 

Storage 2009

-       7 copies of the “stripping” output on disk (CERN + 6 Tier-1 centres); also archived at CERN + 1 other T1 centre

-       10% of RAW & rDST on disk at CERN

-       ~300 TB for analysis storage (~10% of the total requirements)

-       2 copies of RAW & 1 copy of rDST of each reconstruction pass on tape

 

Reconstruction and Stripping 2010

Assume in 2010 5.6 × 10^6 s of DAQ time:

-       1.1 × 10^10 events

-       3 “full” reconstruction passes, 1 of which is outside of beam time

-       4 stripping passes, 1 of which is with the reconstruction outside of beam time

-       Unable to make use of the LHCb online farm, but will continue to investigate this option

 

MC 2010

-       Assume in 2010 the generation of ~1000M b-type events and ~2000M non-b-type events (min bias)

-       Generation at Tier-2 centres only

 

Analysis 2010

-       Assume in 2010 50% of the analysis outside of CERN. A typical user analysis (“job(s)”) processes ~2.5 × 10^6 events; ~1k “jobs”/week

-       Grid analysis at Tier-2 sites; still 50% ToyMC activity

 

Storage - 2010

-       7 copies of the “stripping” output on disk (CERN + 6 Tier-1 centres). Latest version + next-to-latest version everywhere. All stripping passes archived at CERN + 1 other T1 centre

-       10% of RAW & rDST on disk at CERN

-       ~600 TB for analysis storage (~10% of the total disk needs)

-       2 copies of RAW & 1 copy of rDST of each reconstruction pass on tape

 

Slides 10 to 17 show the requests for 2009 and 2010 and the variation vs. the previous requirements.

 

CPU 2009: 30.7 k HEPSPEC06.years

CPU 2010: 79.5 k HEPSPEC06.years

 

Disk 2009: 2.38 PB

Disk 2010: 5.19 PB

 

Tape 2009: 0.83 PB

Tape 2010: 4.3 PB

 

7.1      Summary

 

LHCb updated planning of resource usage for 2009 & 2010:

-       Reduced 2009 needs in light of the reduced accelerator activity; the CERN CPU activity increased as activities moved from external sites

-       2010 needs increased slightly, due to a greater understanding of the analysis patterns, the increased output of stripping, and the play-off between physics & “calibration” channels

 

Several aspects of the computing model still under investigation:

-       MC selection at Boole stage

-       Reduction factors at stripping stage

-       Fraction of ToyMC analysis

-       Realistic LHC beam operation needs to be folded in

 

Below are the integrated resource needs.

 

 

 

 

8.   General Comments

 

I.Bird summarized the requirements with a few general comments:

-       All Experiments should assume the same running periods, with two periods of 6 months

-       There will not be a shutdown in January and February.

 

-       LHC running time: 6 × 10^6 s

-       Oct09 - Mar10: 1.7 × 10^6 s, duty cycle 0.2

-       Apr10 - Sep10: 4.3 × 10^6 s, duty cycle 0.5

 

Ph.Charpentier noted that there is no clear schedule and the Experiments have to prepare some common assumptions, even if they are not agreed by the LHC experts.

I.Bird replied that all should make the same assumptions; if the LHC schedule changes, the estimates will be adjusted accordingly. If the assumptions differ, the distribution of resources will be incorrect.

 

The discussion on these topics continued for a while; the decisions taken are reported below:

 

The MB agreed that all estimations should assume the same parameters mentioned above.

 

I.Bird will send an email summarizing what needs to be done for next week.

 

9.   AOB

 

 

M.Kasemann asked that future agendas include the following topics:

-       Status of the Analysis Working Group

-       The situation regarding keeping MW-related discussions out of the MB (Tech. Forum?)

-       Changing MW, and other ideas from the Workshop; these should be discussed at the MB or at least clarified.

 

10.   Summary of New Actions