LCG Management Board

Date/Time

Tuesday 13 January 2009 - 16:00-18:00 – F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=45201

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 17.1.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird (chair), K.Bos, D.Boutigny, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, S.Lin, U.Marconi, H.Marten, P.McBride, H.Meinhard, G.Merino, A.Pace, B.Panzer, P.Mato, Di Qing, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon, M.Vetterli

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 20 January 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments. The minutes of the previous MB meeting were approved. 

 

2.   Action List Review (List of actions) 
 

  • SCAS Testing and Certification

 

Testing is progressing, but problems were found that required the intervention of the developers.

 

J.Templon reported that a memory leak was found at CERN. Several other issues were found at NIKHEF and fixed.

M.Schulz added that the SCAS testing expert will only be back next week.

  • 20 Nov 2008 - VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

J.Templon reported that the SLA for NL-T1 is now completed and he will send it to the Experiments.

 

M.Kasemann reported that D.Bonacorsi has been nominated as contact person for all the SLAs with the Sites, and he is completing the process with all CMS Sites.

 

O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.

 

  • 16 Dec 2008 – P.Mato and M.Schulz to follow up on the GCC 4.3 discussion, agree on coordination and timescale, and report back to the MB.

To be removed. This is now followed up by the AF, which will keep the MB informed.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The best is to have information in the form provided by the Data flows from the Experiments. Dataflow from LHCb

Not done yet.

I.Bird proposed that someone be appointed to complete this action.

  • The dCache team should report on the client tools, presenting estimated timelines and issues in porting them to gcc 4.3.

Not done. To be followed up by the Applications Area, or possibly by F.Donno?

  • 13 Jan 2009 - G.Merino to present a new CPU unit conversion factor proposal to the MB.

Will be presented next week.

  • 13 Jan 2009 - Sites present their MSS metrics to the F2F MB.

Will be discussed at the GDB. The metrics originally proposed are not suitable for all Sites; Sites should therefore propose other equivalent metrics.

 

3.   LCG Operations Weekly Report (Slides) – J.Shiers

Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Overview

The last service report was given on 16 December 2008, one month ago. Since that time the service has run reasonably, even commendably, well.

 

Experiments are preparing “post-mortem” analyses of their experiences over Christmas and there is a slot on this at the GDB tomorrow.

There are WLCG operations pages (wiki) for the weeks over Christmas and New Year, and people were encouraged to add comments and incident reports directly. None did, so it is assumed there were no relevant incidents.

3.2      GGUS Summary

Very few GGUS tickets were submitted during the last three weeks. Two of them were alarm tickets (ALICE and LHCb).

 

VO        USER    TEAM    ALARM   TOTAL
ALICE        1       0       0       1
ATLAS       26       5       0      31
CMS          3       0       0       3
LHCb         2       0       1       3

 

ALICE Ticket (45061)

This is a ticket from a Site (NL-T1) to ALICE.

 

Detailed description: https://gus.fzk.de/ws/ticket_info.php?ticket=45061 

The experiment-specific SAM tests for ALICE are failing as such:

Event: Abort
- Arrived = Tue Jan 6 19:27:04 2009 CET
- Host = wms103.cern.ch
- Reason = X509 proxy not found or I/O error
- Source = WorkloadManager
- Src instance = 24407
- Timestamp = Tue Jan 6 19:27:02 2009 CET

However this is reflected in a "failure" of the SAM test and is erroneously attributed to the site.

Note as well that according to SAM, this test has run only four times in the past five days.

Used command line: looked at our sites Nagios setup
Received error message: see above

Problem affects the whole VO: ALICE

 

J.Templon reported that the error was raised by Nagios because the test simply failed to run. This particular test should not alert the Site admin when it cannot execute.

 

LHCb Alarm Ticket (45112)

There were problems registering files in the LFC. The target response time was respected.

In the end the cause seems to have been in the LHCb application.

 

-       santinel, 2009-01-08 16:20: assigned (ROC_CERN). Problem affects the whole site: CH-CERN. Problem affects the whole VO: LHCb. Sent ALARM mail to lhcb-operator-alarm@cern.ch.

-       aretico, 2009-01-08 16:31: in progress (ROC_CERN).

-       Computer Operations, 2009-01-08 16:47: added attachment mailbody.Thu._8_Jan_2009_17.45.47_.0100.txt.

-       Computer Operations, 2009-01-08 16:47: Public Diary: “For your information, the Data Operations Piquet has been called. Please standby. Regards, Pascal Sicault, operator on duty.”

-       Uli, 2009-01-08 16:53: checked with the LFC expert, who had already started to investigate this (copy and paste from offline conversation).

-       Slemaitr, 2009-01-09 16:05: solved (ROC_CERN). “Problem solved. The issue was coming from DIRAC not creating unique GUIDs. Closing ticket.”

-       santinel, 16:11: verified.

 

J.Templon noted that some of the ATLAS tickets were due to misspelled file names. Such tickets should not be counted against, nor sent to, the Site admins.

3.3      Service Incidents to Follow

Some issues still need to be followed up:

-       ATLAS conditions streaming issue, pending from 2008: online-offline was resynchronized before Christmas, but it was not possible to also re-sync the ATLAS Tier-1 sites at that time. This will be done this week (tomorrow, Wednesday 14th).

-       CASTOR-related issues at ASGC (16-17 December) and more DB-related issues early in January: CASTOR, SRM, and LFC/FTS.

-       CASTOR-related issues at CERN: Name server problems last night resulted in an ATLAS team ticket. A weak spot in the software has been reported to the developers.

-       Network router problems at CERN. See http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ScheduledInterventionsArchive/090112-LCG-Router.htm

3.4      Other Issues

Some issues have shown up recently but cannot be solved by Operations alone:

-       Issues related to the shared s/w areas have come up a couple of times recently. These need to be taken up and addressed in an appropriate forum, e.g. HEPiX (techniques for running large-scale shared file systems).

-       s/w installation and other tools should avoid the use of hard-coded absolute paths.

-       AF – compiler version(s) required for SL5

 

I.Bird asked why some Sites have problems with a central s/w area while other Sites do not.

J.Gordon replied that it may depend on the hardware and on how the file systems are organized. Experience should be shared at the Operations meetings and, more extensively, at GDB meetings.

3.5      Outlook

The amount of (valuable) information reported in the daily operations meetings continues at a high level – the calls can last up to 30’ these days.

How will this “scale” to LHC running? It should not become a much longer meeting.

J.Templon noted that some Experiments present overly detailed reports on the data and transfers performed. This information is not really relevant for the Sites’ operations but makes the meetings last much longer than the half hour originally agreed.

 

It is important to continue following up on significant service degradations and incidents, i.e. services that are significantly degraded and/or degraded for a long period of time. There is increasing evidence that more “best practices” knowledge sharing would help many sites and hence the overall service. Should there be more cooperation with HEPiX on specific issues?

 

How do we ensure adequate cross-reporting between Site-internal meetings and “the Grid”? At present, as far as is known, only the reports from CERN and GridPP are accessible.

M.Schulz added that SA3 is improving its change management and is looking at how to make releases less disruptive.

 

4.   Resources Procurement in 2009 (Slides) – I.Bird

 

Before Christmas, at a meeting with the new CERN Director for Physics & Computing (S.Bertolucci), the new estimate for first collisions for physics was around September 2009, assuming that the present schedule for re-installation, cool-down, testing, etc. progresses without problems. Additional information from that meeting was that the accelerator is unlikely to run past mid-November.

Based on that information it seemed unreasonable, and irresponsible, to keep to the presently agreed resource deployment schedule. That is why an email about it was sent before the end of 2008.

 

Subsequently, in January, additional information from CERN management (also mentioned in the DG’s speech to staff and users) became available. The Chamonix workshop (2-6 February) will discuss the 2009 and 2010 running schedules in detail, based on expected capabilities and the Experiments’ wishes. There may be no winter shutdown this year (but a longer shutdown later): the EDF contract is to be retendered, and it is important to do as much physics as early as possible.

4.1      Next Steps

The known steps in the near future are:

-       Feb 6: Await results of Chamonix workshop to understand better the likely running schedule for 2009 and 2010

-       Prepare an updated plan for resource procurement/installation/commissioning, taking into account the new schedule and site constraints

-       Feb 16 and following days: discuss this plan at the LHCC mini-review

-       April: Present this to the RRB

4.2      Outlook

For 2009 the requirement to have all equipment installed by April should be relaxed; this is clearly not essential now. It could open the possibility of getting the next generation of equipment at the same cost. But all Sites must be ready in good time for September 2009 (i.e. installed in July/August).

 

What else is required really depends on what comes from the Chamonix discussion:

-       If there is a shutdown in mid-November and a restart in April, the resources needed for 2009 and 2010 must be questioned

-       If there is likely to be no (or a much reduced) shutdown then we probably need to keep to the original planning (but with an adjusted or staged deployment)

 

Hopefully the following will also be clarified at Chamonix:

-       Energy and Heavy-ion running

-       The Experiments’ desire to take data (even if they cannot yet analyse it)

4.3      Open Issues

There are questions to be answered in the next few weeks:

-       What are Sites already committed to, in terms of what was planned for April 2009?

-       What could be delayed?

-       What are the deadlines for restarting the procurements so that the full set of commissioned resources is in place by April 2010?

 

D.Barberis added that there will be data in Spring 2009, from cosmic rays.

I.Bird replied that the current resources are likely sufficient for that task.

 

M.Kasemann added that CMS is going to re-assess its needs every time new information becomes available.

Ph.Charpentier confirmed that LHCb is also reviewing its requests. The first collisions are needed for a good estimate; until then LHCb is running more simulation jobs, which will require considerable resources. LHCb needs the information before the RRB in April.

 

A.Heiss reported that DE-KIT has already placed its orders for 2009; the same is true for RAL and IN2P3.

H.Meinhard clarified that CERN has postponed some of its orders: the October orders were placed, while the December orders were postponed.

J.Gordon noted that the information was distributed too late and that RAL had already completed its tenders.

D.Britton noted that this should not happen again; in the future, information should be distributed to the Sites more promptly.

 

I.Bird replied that in 2008 the information was not available until CERN’s Christmas closure. Sites can continue with their orders, but the equipment does not all need to be installed by April.

 

J.Templon added that the delay is about one year compared to the original schedule.

I.Bird agreed but reminded that NL-T1 still has to provide the resources pledged for 2008.

 

L.Dell’Agnello added that INFN unilaterally stopped its tenders in October and did not purchase the 2009 resources.

 

F.Hernandez asked that any future change be communicated immediately.

G.Merino stated that the CPU tender is progressing, while the disk tender is delayed with respect to the original date.

 

M.Ernst and I.Fisk both reported that the CPU is ordered and that the disk purchase will be delayed until September in order to wait for 2 TB drives.

O.Smirnova and Di Qing reported a similar situation for NDGF and ASGC respectively.

 

I.Bird summarized that one should wait for the workshop in Chamonix. If there is only a 6-week run in 2009, the resources are sufficient; but if there is a longer run in 2010, the resources will be needed earlier.

 

Action:

I.Bird will send an email summarizing this proposal.

 

After the MB meeting, I.Bird sent a proposal to the MB mailing list, which was commented on by the MB Members.

 

Here is the latest version, dated 16 January 2009:

 

[…]

Following the discussion in the Management Board on Jan 13th, the following is what we agreed would be the process now for updating the resource procurement planning for 2009 and 2010.

1)      It now appears unlikely that we will see collisions for physics much before September this year, assuming that according to the presently understood schedule the accelerator complex restarts in the summer once the repairs are complete.  However, the experiments continue to run with cosmic data, as well as simulation productions in preparation for data taking and analysis.  Although significant portions of the 2009 resources have already been procured and installed at many sites,  we nevertheless feel that for this year it is reasonable to relax the requirement to have the full 2009 resources commissioned by April, and to push this back to July for CPU and the first part of disk commitment, and later in the year for the remainder of the disk.

a.       This raises the opportunity to get next generation equipment in some cases where the procurement process allows it;

b.      In many cases, changing the process at this late date is not possible as commitments have already been made (as sites were requested to do given the information previously available);

2)      The LHC workshop to be held in Chamonix on Feb 2-6 will discuss the details of the accelerator schedule for 2009 and 2010, taking into account updated input from the experiments.  It is expected to also provide information on likely running scenarios including energy, and timescales for heavy ion running. 

3)      Once there is a better understanding of the expectations for the accelerator and experiments for 2009 and 2010 coming from the Chamonix workshop we can then update the experiment requirements and thus the resource procurement plans for 2009 and 2010 (we should now treat these together).  The timescale on which this planning should be updated is the following:

a.       Feb 6: input from Chamonix workshop,

b.      Feb 16: discuss the resource planning with the LHCC referees during the mini review of WLCG.  Since this is only 10 days after the workshop we probably will not have the final numbers available;

c.       First week of April: meet with experiment management and jointly plan the presentation to the RRB of an updated resource schedule for 2009/10; inform C-RSG of the result of the discussions;

d.      April 28: presentation to the Computing RRB for agreement of the updated plan

4)      All WLCG sites will be informed of changes in the anticipated schedule through the Management Board.

[...]

 

 

5.   Preparation of the Mini Review (Slides) – I.Bird

 

I.Bird discussed with the reviewers the goals and the agenda for the next Mini Review.

5.1      Mini-Review Goals

The main goals of the review this time would be:

-       1.1  Very short summary of 2008 activities

-       1.2  Plans for 2009 in view of the schedule discussed in Chamonix

-       1.3  Clear picture of the current performance compared to that anticipated in the TDR

-       1.4  List of pending issues for 2009 running

 

Concerning point 1.3, at the last LHCC a few points were raised that should be addressed during this meeting:

-       How the detectors now perform compared to the TDR

-       How the computing models of the Experiments have evolved

-       Efforts carried out by the Experiments to optimize the performance of their software

5.2      Proposed Agenda

The proposed speakers are indicated in parentheses.

 

16 February 2009

 

09:00-11:20 - Part 1. Convener: Mario Martinez-Perez (IFAE)

09:00 Project status (40') 2009 planning and resources (I.Bird)

09:40 Summary of 2008 LCG operation (20') - Performance and experience  (J.Shiers)

10:00 Status of Application Area (30') (P.Mato)

 

10:30  break (15')

 

10:45 Middleware (EGEE, OSG, NDGF) (30') (F.Giacomini, plus someone from OSG and NDGF)

11:15 Networking (10') (J.M.Juanigot)

11:25-14:10 - Part 2. Convener: Chris Hawkes (Birmingham)

11:25 Status of Tier-0 (30') - resources, performance of mass storage, CASTOR, etc. (T.Cass)

11:55 CAFs Status at CERN (20') (covered in the Experiments talks)

 

12:15 Lunch

 

13:10 Status of Tier-1s (30') - in addition to the general status it would be interesting to focus on a couple of examples; a report on performance is also expected

13:40 Status of Tier-2s (30') (M.Jouvin or M.Vetterli)

 

14:10-15:50 - Part 3. Convener: Jean-Francois Grivaz (LAL)

-       Review of Experiments (20') - 2008 summary and 2009 plans; current status/performance vs. the Computing TDR. It is particularly relevant to get a clear picture of how close the Experiments now are to their original TDR operational model and anticipated performance.

-       ATLAS (25')

-       CMS (25')

-       LHCb (25')

-       ALICE (25')

 

 

I.Bird noted that the reviewers should receive clear resource requests from the Experiments. The scientific case should be presented in order to obtain their support.

 

M.Kasemann suggested that the first presentations could be given by the Experiments, since the CAF usage and other talks depend on the models they present. On the other hand, as it is a WLCG review, perhaps the Experiments should not go first.

 

6.   ATLAS Quarterly Report (Slides) – D.Barberis

D.Barberis presented the ATLAS quarterly report for 2008Q4.

6.1      Tier-0 Activities

ATLAS took cosmic-ray data continuously for several months, until 3 November, with only short breaks for detector work (and LHC data). There was an additional week with the Inner Detector only at the end of November.

 

The Tier-0 coped well with nominal data rates and processing tasks. A few Castor glitches were usually sorted out with the Castor team within a very reasonable time.

 

In November detector hardware commissioning work restarted; it will continue throughout the winter and spring of 2009. The global cosmic data-taking runs will restart during April-May, initially with partial read-out and later with the complete detector.

6.2      Data Reprocessing

ATLAS launched a reprocessing campaign for the single-beam and cosmic data taken in August-November 2008. It was an ambitious plan: running in partially attended mode over the New Year period, partly forced by circumstances, as the software and calibrations were only ready in mid-December.

 

Most sites ran on a “best effort” basis during the holiday period. Nevertheless, 500 TB of raw data were processed at 8 Tier-1 sites and CERN. There were a few outstanding issues:

-       FZK failed the validation in December: test jobs produced different results from those at all other sites (under investigation; the local Linux build is suspected).
A.Heiss added that the issue is probably solved and was due to an incorrect compilation flag.

 

-       ASGC had storage troubles in December when the jobs were launched.

-       “Slow motion” at PIC and NIKHEF is under investigation.

G.Merino clarified that the slowness of the CPU farm is due to jobs accessing SQLite files via NFS while running.

 

Reconstructed data were merged and distributed to the other Tier-1/2 sites. A few remaining jobs still have to be run here and there.

6.3      Data Export Functional Tests

ATLAS continues running data-export functional tests at a low level to keep checking the health of the whole system. Site contacts are promptly notified of problems and all troubles are followed up together.

 

Below are the data transfers over the last 30 days (see slides).

 

 

And here are the statistics over 24 hours (see slides).

 

6.4      Simulation Production

Simulation production continues in the background all the time. It is only limited by physics requests and the availability of disk space for the output files.

 

Below is the summary for the whole of 2008 (see slides).

 

6.5      Plans

The ATLAS upcoming software releases are:

-       Release 15.0.0 - February 2009. It includes feedback from the 2008 cosmic running and is the base release for 2009 operations.

-       Releases 15.X.0 - once per month, with incremental code improvements.

 

The Cosmic Runs will be the following:

-       Complete detector: Restarting April-May 2009

-       Partial read-out: Restarting late Winter 2009

 

Collision data planning is waiting for news from the forthcoming Chamonix workshop. Resource needs will also be re-evaluated after the accelerator schedule is published.

7.   AOB

 

7.1      Xrootd Writing over WAN

ALICE asked NL-T1 for xrootd write access over the WAN. J.Templon asked whether the other ALICE Sites are opening xrootd for write access over the WAN.

 

O.Smirnova clarified that NDGF is still discussing this option; it is not yet enabled.

F.Hernandez added that IN2P3 could allow this on a separate xrootd installation, but access to dCache data via xrootd is read-only. He expressed the concern that write access over the WAN is a feature IN2P3 would prefer to avoid (even for SRM access). In addition, local jobs can be tuned according to the limits and performance of the MSS.

A.Heiss clarified that DE-KIT also has an independent xrootd cluster, with access to the MSS limited to the IP ranges of the Tier-2 Sites.

 

I.Bird added that this is not a requirement originally agreed but a request from ALICE to the Sites. Sites should agree with ALICE whether it is possible and what level of access they can safely provide.

7.2      Move to 64-bits WNs

J.Templon noted that if the CPU resources are installed in 64-bit mode, memory consumption could double for some applications. This will cause problems on the Worker Nodes; a memory cap should be introduced.

 

I.Bird replied that this issue will be discussed at the GDB on the following day.

7.3      GDB and MB Times and Topics

I.Bird asked the MB whether the time of the GDB (and of the F2F MB) should be changed, and whether there should be a clearer separation of which topics are treated in the two meetings. Perhaps the GDB agendas should be agreed at the MB a week or two in advance, so that subjects are not discussed twice. Should the meeting times also be changed?

 

J.Templon proposed to hold the GDB before the MB. In this way issues are first discussed at the GDB and then approved at the MB.

F.Hernandez noted that the MB cannot decide immediately after a GDB discussion; a week is needed to check the GDB’s proposal with the experts.

J.Gordon recalled that the pre-GDB is also useful and usually reports to the GDB.

 

M.Kasemann proposed that the GDB could run until 4 PM on Tuesdays, followed by the MB.  

 

Note: at the GDB on the following day it was agreed that it would be possible to hold the GDB and the F2F MB both on a Tuesday, but not for the next meetings, as room availability needs to be checked and additional feedback gathered.

7.4      SE at the Beijing Tier-2

At IHEP, ATLAS would like to have DPM as the SE; currently dCache is available. dCache suits CMS well and they do not want to change SE. Who should decide whether to change the SE at a Site?

 

K.Bos clarified that ATLAS’ preferred SE is DPM, but they can also work with other SE implementations.

M.Kasemann noted that CMS could also have provided its SE preferences but did not, because it is the Site that chooses which SE implementation to provide. Changing the SE now would be a pity, given that an SE is already installed and working properly at IHEP.

 

The conclusion was that Sites should decide which SE to install. If IHEP prefers to stay with dCache, because they are more familiar with it and it is already installed, then they should not change to DPM.

 

 

8.    Summary of New Actions

 

 

No new actions.