WLCG Management Board

Date/Time

Tuesday 14 April 2009 – MB Meeting – 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55734

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 19.4.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), D.Britton, L.Dell’Agnello, M.Ernst, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, I.Fisk, M.Kasemann, P.McBride, U.Marconi, G.Merino, A.Pace, H.Renshall, M.Schulz, R.Tafirout, J.Templon

Invited

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 21 April 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

D.Barberis sent some modifications to the minutes. A.Aimar will update them. Changes will be highlighted in blue.

Unless there are other changes in the next few days, the minutes will be considered approved by the MB.

1.2      Preparation for the RRB Meeting

I.Bird clarified with D.Barberis their differences on the ATLAS resource requirements. I.Bird will send updated versions of the presentation and of the document to the MB list after this meeting.

1.3      WLCG Technical Forum 

The only comment received on the proposal for a WLCG Technical Forum, made at last week’s MB meeting, came from M.Kasemann.

 

From: Matthias Kasemann

Sent: 14 April 2009 15:17

To: Ian Bird

Cc: Alberto Aimar

Subject: [Fwd: Re: WLCG forum to discuss MW related topics]

 

Hi Ian,

 

here is feedback on the MW technical discussion forum from one of our developer (Simon Metson/Bristol). I share his concerns:

 

- unless there is a very good chair can end up wasting a lot of time of many people

 

- there is not much room for change, but this change needs good coordination

 

The idea of Hypernews + 'virtual meeting' is worthwhile considering, still it needs a good chair and moderator, without that it will decay soon.

 

Regards  Matthias

 

 

I.Bird confirmed that he will look for the chairperson and propose him/her to the MB.

 

M.Kasemann noted that the forum could use Hypernews and other virtual meeting facilities to reduce the need to meet physically.

 

J.Templon asked whether the function of this forum is not already in the GDB mandate.

I.Bird replied that the GDB's mandate is rather to drive the discussions and progress related to deployment at the Sites. The GDB's mandate does not cover technical decisions and proposals.

J.Templon asked that, if a group is formed, the WLCG Sites be well represented. The forum also needs the Sites’ input on what they can deploy and what is acceptable to them, before anything is decided in the forum.

 

 

New Action:

I.Bird will look for the chairperson and also distribute a proposal for the mandate of the WLCG Technical Forum, with a reminder of the GDB’s mandate.

1.4      New Updated High Level Milestones (HLM_20090406.pdf)

A.Aimar distributed an updated version of the HLM; they will be discussed next week.

 

2.   Action List Review (List of actions)

 

  • VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs.
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

CMS: Several SLAs still to approve.
ALICE: Still to approve the SLA with NDGF. Comments sent to NL-T1.

J.Templon added that NL-T1 had sent some alternative solutions to ALICE and is waiting for ALICE’s feedback.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. It would be best to have the information in the form of the data flows provided by the Experiments (see: Dataflow from LHCb).

The dataflow and rates were discussed at the WLCG Workshop before CHEP. STEP09 will test the situation. There is a table ready to be filled and the LHCb template should be used.

J.Templon noted that the information is not really available to the Sites.

J.Gordon noted that CMS is using the LHCb format and M.Kasemann agreed that that is the format they want to follow.

D.Barberis noted that for ATLAS there are no Tier-2 Sites as in the other Experiments (the ATLAS cloud).

M.Kasemann proposed that the Experiments present their dataflow and rates at the next GDB meeting.

The Experiments agreed to present their dataflow and rates at May’s GDB.

  • 4 March 2009 - M.Schulz to present the list of priorities for the Analysis working group.

The list of priorities was presented at the WLCG Workshop and then discussed, but the priorities are not fully agreed.
R.Pordes volunteered to co-chair the group and will re-open the discussion on the priorities.

M.Kasemann asked for a clear mandate for this User Analysis working group. J.Gordon agreed.

M.Schulz noted that some Experiments have done measurements of the needs of their user analysis.
I.Bird proposed that the action is closed and M.Schulz will summarize the situation as it is now.

ACTION REMOVED

F.Hernandez asked that the models from the Experiments be clarified.
M.Schulz replied that the TDRs describe them; what is missing is the workload on the Sites, which is still uncertain by a factor of 2.

New Action:
M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.

  • 14 Apr 2009 - CNAF reports on how they plan to handle the security incidents report and periodic tests

Not done yet. Will be discussed next Monday.

  • 14 Apr 2009 - Sites and Experiments should comment on the need and functions of a WLCG technical group.

Done.

  • 17 Mar 2009 - Sites should report whether GFAL and lcg-utils can start using SRM V2 by default without impacting VOs outside WLCG.

Done.

  • 17 Mar 2009 - Experiments should confirm whether the schedule for the changes regarding “Busy” Storage Services is acceptable.

Done. Only CMS has replied.

On User Accounting:

  • Tier-1 Sites should start publishing the UserDN information.
  • Countries should comment on the policy document on user information accounting.
  • The CESGA production Portal should be verified. J.Gordon sent information to the MB on which portal to use (prod or pre-prod).
  • J.Gordon and R.Pordes will write a requirements document on user information accounting.

J.Gordon and R.Pordes will distribute the document.

  • J.Gordon will send to the GDB the message about the need for 32-bit compatibility libraries on SL5 64-bit installations.

Not followed by the MB.

  • Sites should report whether (or when) FTS is deployed on SL4.

Done at all Sites.

  • A.Aimar will verify with M.Litmaath the situation for the pilot jobs frameworks of ALICE, ATLAS and CMS.

LHCb’s is approved; the others have to make modifications.
ACTION REMOVED. Already followed monthly at the GDB.

  • Actions for moving to the new CPU unit
    • Convert the current requirements to the new unit.
    • Tier-1 Sites and main Tier-2 Sites should buy the license for the benchmark.
    • A web site at CERN should be set up to store the values from WLCG Sites.
    • A group should prepare the plan for the migration of the CPU power published by Sites through the Information System (J.Gordon replied that this will be discussed at the MB next week).
    • Pledges and Requirements need to be updated.

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall

 

Summary of status and progress of the LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

GGUS Tickets

Very quiet week. Usually there are about 100 tickets; this week there were only about 15.

 

VO concerned    USER    TEAM    ALARM    TOTAL
ALICE           1       0       0        1
ATLAS           4       7       0        11
CMS             0       0       0        0
LHCb            3       0       0        3
Totals          8       7       0        15
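As a cross-check of the ticket table above, the row and column totals can be recomputed from the per-VO counts. The following is only a minimal sketch: the counts are copied from the table, while the dictionary layout and the helper names (`row_totals`, `column_totals`) are illustrative assumptions and not part of any GGUS tool.

```python
# GGUS ticket counts per VO as (USER, TEAM, ALARM), copied from the table above.
# The data layout and helper names are hypothetical, chosen for illustration.
tickets = {
    "ALICE": (1, 0, 0),
    "ATLAS": (4, 7, 0),
    "CMS": (0, 0, 0),
    "LHCb": (3, 0, 0),
}


def row_totals(counts):
    """Return the total number of tickets for each VO."""
    return {vo: sum(row) for vo, row in counts.items()}


def column_totals(counts):
    """Return the totals per ticket type as (USER, TEAM, ALARM)."""
    return tuple(sum(col) for col in zip(*counts.values()))


per_vo = row_totals(tickets)
per_type = column_totals(tickets)
grand_total = sum(per_type)

print(per_vo)       # totals per VO
print(per_type)     # totals per ticket type
print(grand_total)  # overall number of tickets
```

The recomputed values agree with the TOTAL column and the Totals row reported in the slides.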

 

Slide 2 shows RAL's updated reply to the test alarm of two weeks earlier.

RAL agreed that the response should be within 2 hours.

 

Tier-1 VO Availability

Slide 3 shows the Tier-1 availability, by VO, as detected by the VO SAM tests:

-       ATLAS had problems with the NDGF SRM, but there were no reports about it.

-       LHCb had problems at CNAF.

 

ASGC

After relocating the facilities from the ASGC DC to the IDC, it took another week to complete the power-on trial before entering the IDC. In addition, the complex local management policies delayed progress by another week. All T1 services should now have been restored at ASGC.

ATLAS decided to restart ASGC from a clean state, removing all SE files and catalogue entries.

 

SIRs

One report received from IN2P3. Batch unavailable for any job depending on the robotic storage system. Local backup service interrupted during the outage. Estimated 25% loss of running jobs during outage (jobs locked in queue).

 

Expected SIRs on:

-       Power cooling problem at TRIUMF, causing a reduction in the batch capacity.

-       Incidents at CERN: after a glibc update the LHCb online DB was not working properly.
There was also intermittent degradation of srm-lhcb.cern.ch and srm-cms.cern.ch for around 1 hour.

 

4.    GDB Summary - April 2009 (Paper) – J.Gordon

 

J.Gordon summarized the April GDB meeting. The details are in the attached paper.

 

During the MB meeting he highlighted the following issues (in yellow in the document):

-       (Test) alarms at CNAF: this was already discussed and is in the action list.

-       Only 8-12 June seem possible dates for STEP09. Experiments and Sites should prepare for those dates.

-       The GDB supported the proposal about identity management, with a real name included in the DN.

-       The WN on SL5 has been released since March and Tier-1 Sites should confirm the deployment, even if running SL4 binaries.

-       Experiments should comment on the acceptability of the SRM MoU 2.2 Extensions.

-       Tier-1 Sites will be asked how they will provide support for their users and whether they are clarifying the issue with the NGI.

-       Next GDB meeting Wednesday 12th May at CERN

 

5.   CMS QR Report 2009Q1 (Slides) – M.Kasemann

 

M.Kasemann presented the CMS QR for 2009Q1. He summarized the activities during the quarter.

5.1      Tier-1 and Tier-2 Availability

The CMS Tier-1 Sites have improved their average availability in the last 30 days. Now 5 Sites are above 90%.

In Slide 2 one can see the comparison with the same Sites at the end of March.

 

 

The same is shown for the Tier-2 SAM availability results, with the improvements vs. the end of March in slide 4.

 

 

A.Heiss noted that some Sites are using dedicated services to answer to the SAM requests. This could be faking some results.

M.Kasemann asked for more details because this is not correct behaviour and distorts the test results.

5.2      Tier-1 Activities

At the Tier-1 Sites re-reconstruction with new conditions/software was performed, both for Cosmics and Monte Carlo data.

 

CMS is stress testing its Tier-1 Sites continuously. It maintains processing loads at each Site with low priority jobs (“backfill jobs”) that fill the queues when the Site is not fully used by normal jobs. This activity classifies and reports any errors observed. Below is the number of backfill jobs completed at every CMS Tier-1 Site.

 

 

The planning for STEP09 is ongoing and the priority is to test:

-       tape handling together with ATLAS (I/O, staging)

-       important network transfer tests (T0-T1, T1-T1)

5.3      Data Production in 2009Q1

Most production was from the re-reconstructions of CruZet & CRAFT Cosmics data (~700 TB of RAW, RECO, Skims):

-       Second re-reconstruction of CRAFT completed in February

-       Second re-reconstruction of CruZet just completed

 

A large Monte Carlo production was completed, but new requests are ongoing: Summer08 lasted until February 2009, as shown in the graph below, with 360 different samples/bins.

 

 

The production rate is quite good (~500M FullSim / 5 months + 350M FastSim) and is not limited by resources.

 

The validation samples were made available quickly, but this is a manpower-intensive operation.

5.4      Plans for 2009

Slide 6 shows the overall CMS plans for 2009:

-       the right side shows the software releases and computing plans

-       on the left the production and analysis activities

 

5.5      CMS STEP09

In CMS, STEP09 is coordinated by Daniele Bonacorsi and Oliver Gutsche.

STEP09 is a set of CMS functionality and scale tests, aligned with the tests of the other Experiments:

-       T0: tape writing, transfers

-       T1: tape write/read/staging, transfers

-       Transfers: T0-T1, T1-T1, T1-T2

-       Analysis at T2’s


As shown in slide 7, the week for STEP09 is the week of 8 June.

 

The draft plan is in place; for the status of CMS planning see: https://twiki.cern.ch/twiki/bin/view/CMS/PADASiteCommissioning?topic=step09

5.6      Outlook and Summary

Data processing of CMS Global and Cosmics Runs is working well. Data were re-reconstructed twice with latest software and calibrations.

 

Monte Carlo production at Tier2 sites is well established (~500M FullSim / 5 months + 350M FastSim, several CMSSW versions, not resource limited)

 

The availability of the Tier-1/Tier-2 infrastructure is monitored closely. Stress testing of the Tier-1 sites has started.

 

STEP09 and the Analysis End-to-End test are being planned.

 

Resource requirements for 2009/10 assessed based on LHC schedule.  Answering questions to the C-RSG

 

CMS Computing & Offline workshop next week @ San Diego will finalize planning until start of data taking.

 

M.Schulz asked how CMS makes sure that at the Tier-2 Sites the stage-out is realistic.

M.Kasemann replied that the situation will be evaluated in May-June, when users will make heavy use of the results of the MC production.

 

D.Barberis confirmed that the best week for ATLAS is the week of the 8th June and all should be over by the 15th June.

 

 

6.   LHCb QR Report (Slides) – U.Marconi 

 

U.Marconi presented the LHCb QR for 2009Q1. Slides from Ph.Charpentier.

6.1      Production activities

The LHCb production activities started in July with simulation, reconstruction and stripping, including the file distribution strategy and the failover mechanism.

File access uses local access protocols (rootd, rfio, (gsi)dcap, xrootd), and an alternative method was commissioned: copy to local disk.

 

The failover solution is using VOBoxes with file transfers (delegated to FTS), LFC registration and internal DIRAC operations (bookkeeping, job monitoring).

 

Analysis started in September and Ganga became available for DIRAC3 in November. DIRAC2 was de-commissioned on January 12th.

 

The graph below shows the sites used for production (about 100).

 

[Figure: chep09-15kjobs2009.png]

 

The breakdown by type of job shows that MC simulation and user analysis are the main types of jobs.

 

 

6.2      Activity in 2009Q1

The focus of the quarter was on improving Data Management (testing pinning, debugging with each site) and on running large simulation productions to certify the largest possible number of sites. The certification concerned the software repository availability, the settings of batch queues and memory limitations; many GGUS tickets were sent to small sites to fix their configuration.

 

The final commissioning of the new bookkeeping system was completed. It has a new schema, new user interface and is used also for processing production requests.

 

The simulation activity focused on the certification of applications using the new version of Geant4 (bugs found, fixed and reported) and the new reconstruction software (performance tests).

 

An important activity was the preparation, in collaboration with ATLAS, of a new major release of Gaudi, to be ready in May.

6.3      Issues and Successes in 2009Q1

 

As usual, data access and storage stability were the main issues. Several tickets are open at Tier-1s for storage issues. The configuration of sites, hardware setup and storage-ware versions are still unstable.

 

SRM v2.2 is still not uniformly implemented, and LHCb fully relies on SRM.

 

Workload management issues: there were several outstanding issues with gLite WMS in 2008. They were due to be fixed by the “mega patch” but several were not. However, the impact on LHCb is minor, thanks to the usage of pilot jobs.

 

During Q1, LHCb started to use generic pilots wherever possible and it is working extremely well. LHCb expects that the VOMS role=pilot will be implemented everywhere. GlExec was tested on test sites and was working adequately.

6.4      SAM Jobs and Reports

The SAM tests need to report LHCb’s view of usability. The tests must reproduce standard use cases and should run as normal jobs, i.e. not in a special clean environment.

There are some changes that LHCb would like:

 

-       Reserve lcg-admin for software installation; it needs a dedicated mapping for permissions to the repository.

-       Use normal accounts for running tests, which run as “Ultimate Priority” DIRAC jobs.

-       These jobs are matched by the first pilot job that starts. It scans the WN domain and often sees WN-dependent problems (bad WN configuration).

-       SAM should allow for longer periods without a report. Queues may be full, which is actually a good sign, but then no new SAM job can start.

6.5      Plans for 2009

Simulation and its analysis will start in May. The main goal is to test and tune stripping and HLT for 2010 (MC09).

Will simulate the 2009-10 scenarios:

-       4/5 TeV, 50 ns (no spillover), 10^32 cm^-2 s^-1

-       Benchmark channels for first physics studies (~10 Mevts).

-       Large minimum bias samples (~30 min of LHC running, 10^9 events). Estimate 2 to 3 months for simulation

-       Stripping performance required: ~ 50 Hz for benchmark channels

-       Tune HLT: efficiency vs. retention, optimisation

 

The Physics studies (MC09-2) will focus on signal and background samples (~500 Mevts) and will be used for CP-violation performance studies.

 

Long term physics studies (DC09) with the nominal LHC settings (7 TeV, 25 ns, 2 x 10^32 cm^-2 s^-1).

 

Preparation for very first physics (if required)

-       450 GeV to 2 TeV, low luminosity

-       Minimum bias sample (10^8 events)

 

Commissioning for 2009-10 data taking (FEST’09), using simulated data as LHCb cannot take cosmic rays.

 

7.   AOB

 

 

J.Templon asked to have the test alarms of the VOs spread every 3 weeks over the three months, not all in the same week.

J.Gordon and I.Bird suggested raising this request at the meeting about alarms.

 

 

8.   Summary of New Actions

 

New Action:

I.Bird will look for the chairperson and also distribute a proposal for the mandate of the WLCG Technical Forum, with a reminder of the GDB’s mandate.

 

New Action:
M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.