LCG Management Board
Tuesday 13 January 2009 - 16:00-18:00 – F2F Meeting
(Version 1 – 17.1.2009)
A.Aimar (notes), D.Barberis, I.Bird (chair), K.Bos, D.Boutigny, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, S.Lin, U.Marconi, H.Marten, P.McBride, H.Meinhard, G.Merino, A.Pace, B.Panzer, P.Mato, Di Qing, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon, M.Vetterli
Next Meeting: Tuesday 20 January 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments. The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
Testing is progressing, but problems were found that required the intervention of the developers.
J.Templon reported that a memory leak was found at CERN. Several other issues were found at NIKHEF and fixed.
M.Schulz added that the SCAS testing expert will only be back next week.
J.Templon reported that the SLA for NL-T1 is now completed and he will send it to the Experiments.
M.Kasemann reported that D.Bonacorsi has been nominated as contact person for all the SLAs with the Sites, and he is completing the process with all CMS Sites.
O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.
To remove: this item is now followed by the AF, which will keep the MB informed.
Not done yet.
I.Bird proposed that someone be appointed to complete this action.
Not done. Will be followed by the Applications Area? Or ask F.Donno?
· 13 Jan 2009 - G. Merino to present a new CPU unit conversion factor proposal to the MB.
Will be presented next week.
· 13 Jan 2009 – Sites present their MSS metrics to the F2F MB.
Will be discussed at the GDB. The metrics originally proposed are not suitable for all Sites and therefore they should propose other equivalent metrics.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
The last service report was given on December 16th 2008 – one month ago. Since that time the service has run reasonably – even commendably – well.
Experiments are preparing “post-mortem” analyses of their experiences over Christmas and there is a slot on this at the GDB tomorrow.
There are WLCG operations pages (wiki) for the weeks over Christmas and New Year and people were encouraged to add comments / incident reports directly. But none did. So it is assumed there were no relevant incidents.
3.2 GGUS Summary
The GGUS tickets submitted during the last 3 weeks were very few. Two of them were alarm tickets (ALICE and LHCb).
ALICE Alarm Ticket
This was a ticket from a Site (NL-T1) to ALICE.
J.Templon reported that the error was reported by Nagios because the test simply failed to run. This particular test should not report to the Site admin when it cannot execute.
LHCb Alarm Ticket (45112)
There were problems registering files in the LFC. The target response time was respected.
In the end, it seems that the cause was in the LHCb application.
J.Templon noted that some of the ATLAS tickets were due to spelling mistakes in file names. Such tickets should not be accounted, nor sent to the Site admins.
3.3 Service Incidents to Follow
Some issues still need to be followed up:
- ATLAS conditions streaming issue – pending from 2008: online–offline was resynchronized before Christmas, but it was not then possible to also re-sync the ATLAS Tier-1 sites. This will be done this week (tomorrow, Wednesday 14th).
- CASTOR-related issues at ASGC (Dec 16 – 17) and more DB-related issues early Jan: CASTOR, SRM + LFC/FTS
- CASTOR-related issues at CERN: Name server problems last night resulted in an ATLAS team ticket. A weak spot in the software has been reported to the developers.
- Network router problems at CERN. See http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ScheduledInterventionsArchive/090112-LCG-Router.htm
3.4 Other Issues
Some issues have shown up recently but cannot be solved by Operations alone:
- Issues related to shared s/w areas have come up a couple of times recently. This needs to be taken up and addressed in an appropriate forum, e.g. HEPiX – techniques for running large-scale shared file systems
- s/w installation and other tools – avoid use of hard-coded absolute paths
- AF – compiler version(s) required for SL5
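The point above about avoiding hard-coded absolute paths can be illustrated with a minimal sketch. It assumes the per-VO software-area environment variable convention (VO_<VO>_SW_DIR) used on EGEE/WLCG worker nodes; the fallback path is purely illustrative, not a real site configuration.

```shell
# Minimal sketch, assuming the site publishes the per-VO software-area
# variable (VO_<VO>_SW_DIR convention). The fallback path is illustrative
# only; installation tools should never bake such a path into scripts.
SW_DIR="${VO_ATLAS_SW_DIR:-/opt/exp_soft/atlas}"
echo "software area: ${SW_DIR}"
```

A job or installation script that resolves the area this way keeps working when a site relocates or reorganizes its shared file system.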
I.Bird asked why some Sites have problems with a central s/w area while others Sites do not.
J.Gordon replied that it may depend on the hardware and on how the file systems are organized. Experience sharing should take place at the Operations meetings and, more extensively, at GDB meetings.
The amount of (valuable) information reported in the daily operations meetings continues at a high level – the calls can last up to 30’ these days.
How will this “scale” to LHC running? It should not become a much longer meeting.
J.Templon noted that some Experiments present overly detailed reports on the data and transfers done. This information is not really relevant for the Sites' Operations but makes the meetings last much longer than the half hour originally decided.
It is important that we continue to follow up on significant service degradations and incidents, i.e. services that are significantly degraded and/or degraded for a long period of time. There is increasing evidence that more "best practices" knowledge sharing would help many sites and hence the overall service. More cooperation with HEPiX on specific issues?
How do we ensure adequate cross-reporting between site internal meetings and “the Grid”? AFAIK I only have access to those from CERN & GridPP.
M.Schulz added that SA3 is improving change management. They are looking at how to make releases less destructive.
4. Resources Procurement in 2009 (Slides) – I.Bird
Before Christmas, in a meeting with the new CERN Director for Physics & Computing (S.Bertolucci), the newest estimates put first collisions for physics at about September 2009, assuming that the present schedule for re-installation, cool-down, testing, etc. progresses without problems. Additional information from that meeting was that the accelerator is unlikely to run past mid-November.
Based on that information it seemed unreasonable, even irresponsible, to keep to the presently agreed resource deployment schedule. That is why an email about it was sent before the end of 2008.
Subsequently, in January, additional information from the CERN management was made available (and mentioned in the DG's speech to staff and users). The Chamonix workshop (Feb 2-6) will discuss the 2009 and 2010 running schedules in detail, based on expected capabilities and experiment desires. It could happen that there is no winter shutdown this year (but a longer shutdown later): the EDF contract is to be retendered, and it is important to do as much physics as early as possible.
4.1 Next Steps
The known steps in the near future are:
- Feb 6: Await results of Chamonix workshop to understand better the likely running schedule for 2009 and 2010
- Prepare an updated plan for resource procurement /installation/ commissioning taking into account new schedule and site constraints
- Feb 16 and following days: Discuss this plan with the LHCC mini-review
- April: Present this to the RRB
For 2009 one should relax the requirement to have all equipment installed by April – this is clearly not essential now. It could open the possibility to get the next generation of equipment for the same cost. But all Sites must be ready in good time for September 2009 (i.e. installed in July/August).
What else is required really depends on what comes from the Chamonix discussion:
- If there is a shutdown in mid-November and a restart in April, the resources needed for 2009 and 2010 must be questioned
- If there is likely to be no (or a much reduced) shutdown then we probably need to keep to the original planning (but with an adjusted or staged deployment)
Hopefully Chamonix will also clarify:
- Energy and Heavy-ion running
- The Experiments' desire to take data (even if they cannot yet analyse it)
4.2 Open Issues
There are questions to be answered in the next few weeks:
- What are you already committed to in terms of what was planned for April 2009?
- What could be delayed?
- What are the deadlines for restarting the procurements so that the full set of commissioned resources is in place by April 2010?
D.Barberis added that data will already be there in Spring 2009, from cosmic rays.
I.Bird replied that the current resources are likely sufficient for that task.
M.Kasemann added that CMS is going to re-assess their needs every time new information becomes available.
Ph.Charpentier confirmed that also LHCb is reviewing their requests. The first collisions are needed for a good estimate. Until then they are executing more simulation jobs and this will require considerable resources. LHCb needs information before the RRB in April.
A.Heiss reported that DE-KIT has already sent their orders for 2009. Same for RAL and IN2P3.
H.Meinhard clarified that CERN has postponed their orders: the October orders were placed, while the December orders were postponed.
J.Gordon noted that the information was distributed too late and RAL had already completed their tenders.
D.Britton noted that this should not happen again in the future: information should be distributed to the Sites more promptly.
I.Bird replied that in 2008 the information was not available until CERN's stop for Christmas. Sites can continue their orders, but not everything needs to be installed by April.
J.Templon added that the delay is of about one year compared to the original schedule.
I.Bird agreed but reminded that NL-T1 still has to provide the resources pledged for 2008.
L.Dell’Agnello added that INFN stopped their tenders in October unilaterally and did not purchase the resources for 2009.
F.Hernandez asked that any future change is communicated immediately.
G.Merino stated that the CPU tender is progressing, while the tender for disk is delayed with respect to the original date.
M.Ernst and I.Fisk both reported that the CPU is ordered and the disk purchase will be delayed until September in order to wait for 2 TB drives.
O.Smirnova and Di Qing reported a similar situation for NDGF and ASGC respectively.
I.Bird summarized that one should wait for the workshop in Chamonix. If there is only a 6-week run in 2009, the current resources are sufficient. But if there is a longer run in 2010, the resources will be needed earlier.
I.Bird will send an email summarizing this proposal.
After the MB meeting, I.Bird sent a proposal to the MB mailing list, which was commented on by the MB members.
Here is the latest version, dated 16 Jan 2009:
5. Preparation of the Mini Review (Slides) – I.Bird
I.Bird discussed with the reviewers the goals and the agenda for the next Mini Review.
5.1 Mini-Review Goals
The main goals of the review this time would be:
- 1.1 Very short summary of 2008 activities
- 1.2 Plans for 2009 in view of the schedule discussed in Chamonix
- 1.3 Clear picture of the current performance compared to the one anticipated in the TDR
- 1.4 List of pending issues for 2009 running
Concerning 1.3, during the last LHCC there were a few points that should be addressed during this review:
- How the detectors now perform compared to the TDR
- How the computing models of the experiments have evolved
- Efforts carried out by experiments to optimize the performance of their software
5.2 Proposed Agenda
The proposed speakers are indicated in red.
I.Bird noted that the reviewers should receive clear resource requests from the Experiments. The scientific case should be presented in order to obtain their support.
M.Kasemann suggested that the first presentations could be given by the Experiments. The CAF usage and other talks depend on the models presented by the Experiments. On the other hand, as it is a WLCG review, maybe the Experiments should not come first.
6. ATLAS Quarterly Report (Slides) – D.Barberis
D.Barberis presented the ATLAS quarterly report for 2008Q4.
6.1 Tier-0 Activities
ATLAS took cosmic-ray data continuously for several months, until 3 November, with only short breaks for detector work (and LHC data). An additional week with the Inner Detector only took place at the end of November.
The Tier-0 coped well with nominal data rates and processing tasks. A few Castor glitches were usually sorted out with the Castor team within a very reasonable time.
In November hardware detector commissioning work restarted. Detector work will continue throughout the Winter and Spring 2009. The global cosmic data-taking runs will restart during April-May. Initially with partial read-out, later on with the complete detector.
6.2 Data Reprocessing
ATLAS launched a reprocessing campaign for the single-beam and cosmic data taken in August-November 2008. It was an ambitious plan: running in partially attended mode during the New Year period. This was partially forced by circumstances: the software and calibrations were only finally ready in mid-December.
Most sites ran on a "best effort" basis during the holiday period. Nevertheless, 500 TB of raw data were processed at 8 Tier-1 sites and CERN. There were a few outstanding issues:
- FZK failed the validation in December: test jobs produced different results from all other sites (under investigation; the local Linux build is suspected).
- ASGC had storage troubles in December when the jobs were launched.
- "Slow motion" at PIC and NIKHEF is under investigation.
G.Merino clarified that the slow CPU farm is due to jobs accessing SQLite files online via NFS.
Reconstructed data were merged and distributed to the other Tier-1/2 sites. A few remaining jobs are still to be run here and there.
6.3 Data Export Functional Tests
ATLAS continues running data export functional tests at low level to keep checking the health of the whole system. Site contacts are promptly notified of problems and we follow up all troubles together.
Below are the data transfers in the last 30 days.
And here are the statistics over 24 hours.
6.4 Simulation Production
Simulation production continues in the background all the time. It is only limited by physics requests and the availability of disk space for the output files.
Below is the summary of the whole 2008 year.
The ATLAS upcoming software releases are:
- Release 15.0.0 - February 2009. Includes feedback from the 2008 cosmic running. It is the base release for 2009 operations.
- Releases 15.X.0 - Once per month. Incremental code improvements.
The Cosmic Runs will be the following:
- Complete detector: Restarting April-May 2009
- Partial read-out: Restarting late Winter 2009
Collision data planning is waiting for news from the forthcoming Chamonix workshop. Resource needs will also be re-evaluated after the accelerator schedule is published.
7. AOB
7.1 Xrootd Writing over WAN
ALICE asked for xrootd access over WAN to NL-T1. J.Templon asked whether the other ALICE Sites are opening xrootd to write access over WAN.
O.Smirnova clarified that NDGF are still discussing the possible option, but it is not yet enabled.
F.Hernandez added that IN2P3 could allow this on a separate installation of xrootd, but access to dCache data via xrootd is read-only. He expressed the concern that write access over WAN is a feature that IN2P3 would like to avoid (even for SRM access); in addition, local jobs can be tuned according to the limits and performance of the MSS.
A.Heiss clarified that DE-KIT also has an independent xrootd cluster, and access to the MSS is limited to the IP ranges of the Tier-2 Sites.
I.Bird added that this is not an originally agreed requirement but a request from ALICE to the Sites. Sites should agree with ALICE whether it is possible and what level of access they can safely provide.
7.2 Move to 64-bits WNs
J.Templon noted that if the CPU resources are installed in 64-bit mode, the memory consumption could double for some applications. This will cause problems on the Worker Nodes; a memory cap should be introduced.
I.Bird replied that this issue will be discussed at the GDB on the following day.
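A memory cap of the kind mentioned above could be sketched as follows. This is a hypothetical illustration, not a mechanism agreed by the MB: the 4 GB value and the job-wrapper structure are assumptions made for the example.

```shell
# Hypothetical job-wrapper fragment: cap the payload's virtual memory so a
# 64-bit build with a doubled footprint fails fast instead of exhausting
# the worker node. The 4 GB value is illustrative only.
MEM_CAP_KB=$((4 * 1024 * 1024))      # cap in kB
(
  ulimit -S -v "${MEM_CAP_KB}"       # soft limit, inherited by child processes
  echo "job virtual memory cap: $(ulimit -S -v) kB"
  # exec "$@"                        # the real payload would run here
)
```

Setting the limit in a subshell keeps the cap per job slot while leaving the batch daemon itself unrestricted.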
7.3 GDB and MB Times and Topics
I.Bird asked the MB whether the time of the GDB (and of the F2F MB) should be changed and whether the topics treated in the two meetings should be separated. Maybe the GDB agendas should be agreed at the MB a week or two in advance, so that subjects are not discussed twice? Should the times also be changed?
J.Templon proposed to have the GDB before the MB. In this way issues are first discussed at the GDB and then approved at the MB.
F.Hernandez noted that the MB cannot decide immediately after a GDB discussion. It needs a week to check the GDB's proposals with the experts.
J.Gordon reminded that also the pre-GDB is useful and then usually reports to the GDB.
M.Kasemann proposed that the GDB could run until 4 PM on Tuesdays, followed by the MB.
Note: At the GDB on the following day it was agreed that it could be possible to hold the GDB and the F2F MB both on a Tuesday, but not for the next meetings, because room availability and additional feedback need to be checked.
7.4 SE at the Beijing Tier-2
At IHEP, ATLAS would like to have DPM as the SE; currently dCache is available. dCache suits CMS well and they do not want to change SE. Who should decide whether to change the SE at a Site?
K.Bos clarified that ATLAS’ favourite SE is DPM. But they can also work with other SE implementations.
M.Kasemann noted that CMS could also have provided their SE requests, but did not do so because it is the Site that chooses which SE implementation to provide. Changing the SE now would be a pity given that an SE is already installed and working properly at IHEP.
The conclusion was that Sites should decide what SE to install. If IHEP prefers to stay with dCache because they are more familiar with it and dCache is already installed then they should not change to DPM.
8. Summary of New Actions
No new actions.