LCG Management Board
Tuesday 22 July 2008, 16:00-17:00 - Phone Meeting
(Version 3 - 1.8.2008)
A.Aimar (notes), D.Barberis, T.Cass, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, J.Gordon (chair), F.Hernandez, M.Kasemann, P.Mato, A.Pace, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout
Mailing List Archive
Tuesday 5 August 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 Tier-1 and Tier-2 Reliability and Availability - June 2008 (Tier-1_SR_200806; Tier-2_SR_200806)
A.Aimar attached the final Tier-1 and Tier-2 Reliability and Availability Reports for June 2008.
The values for the OSG sites were removed and will be published from next month onwards.
Action List Review (List of actions)
Ongoing: it is installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should announce it.
The only information still missing is the CMS list of four users' DNs that can post alarms to the sites' email addresses.
2. Services Weekly Report (Slides)
H.Renshall presented a summary of the status and progress of the LCG Services and of the other LCG Boards. The report covers the last two weeks.
2.1 Resources Scrutiny Group
All RSG reviewers have had one or more meetings with their Experiments and are filling in a common, but adaptable, template leading to 2009 resource requirements.
A short summary per Experiment:
- ALICE: Email exchanges and a teleconference have taken place; ALICE have completed their template but follow-up is needed.
- ATLAS: Only one reviewer was available. A first iteration of the template was done but only partially completed. The second reviewer is now active.
- CMS: Template fully completed; it maps directly onto the CMS computing model. Heavy-ion (HI) running is not being reviewed at this time because it is funded separately, outside CERN.
- LHCb: Full information given to enable template to be adapted/completed.
The RSG notes that it will have to renormalize the resulting experiment numbers to a common set of assumptions on the LHC running conditions. The plan is to report on the scrutiny of the validity of the 2009 resource requests in August.
The CSO has agreed that these can already be made public though more detail may be added for the November C-RRB. In future years there may be a C-RRB in the summer to review the Scrutiny Group reports for the following year given the need to start hardware procurements well in advance of need.
The group also received a report on the results of the CCRC’08 Common Computing Readiness Challenge at its fourth meeting. The group will meet in August to finalise the 2009 reports, and will then decide the dates for one or more autumn meetings once it sees how the LHC is performing, bearing in mind that it must report to the C-RRB meeting of 11 November.
J.Gordon asked whether the pledges for 2009 have to be expressed by August for the RSG, whereas the original deadline for the C-RRB was the end of September.
S.Foffano replied that the specifications will have to be done in parallel and will have to be consolidated into final pledges for the C-RRB meeting in November.
J.Gordon asked whether planning the pledges 3 years in advance, instead of 5 as now, is going to be proposed.
S.Foffano replied that this will be requested at the next C-RRB but, until it is agreed, the pledges should remain 5 years in advance, up to 2013. The C-RRB will hopefully agree at the November meeting to change to a 3-year advance period for the sites’ pledges.
2.2 Overview Board
The OB heard an LCG project status report from I.Bird, a CCRC’08 post-mortem report from H.Renshall, including a SWOT analysis, and a report on the procedural progress of the C-RSG, also from H.Renshall.
The weaknesses seen are:
- Some of the services, including but not limited to storage / data management, are still not sufficiently robust.
- Communication is still an issue / concern. This requires work / attention from everybody; it is not a one-way flow.
- Not all activities (e.g. reprocessing, chaotic end-user analysis) were fully demonstrated even in May, nor was there sufficient overlap between all experiments and all activities.
The main threat perceived by the WLCG management is that of falling back from reliable service mode into “fire-fighting” at the first sign of serious problems.
However, a consistent message is being given that Experiments, Sites and WLCG Services are ‘more or less’ ready for the expected 2008 data taking although constant attention will be needed at all levels.
2.3 Site Reports
- 10 July: a post-mortem on the recent power and network switch problems was submitted. Full services were reported running by 19 July.
D.Barberis noted that there were also problems the following weekend, and that the LFC is restarted automatically every few minutes.
- 7 July: the primary link to TRIUMF failed due to an outage in the Seattle area, and failover to the secondary link via the CERN OPN did not come up. A workaround is to turn off the primary interface at BNL or TRIUMF, but a proper solution is still being worked on.
- 9 July: a storage server network connection failure took some time to solve, with various components being changed. It left some ATLAS files inaccessible.
- 14 July: the inaccessible-file problem was understood and put down to a problem introduced by dCache patch level 8. Files which for some reason failed to transfer out of BNL were pinned by dCache; an SRM transfer first tries to pin files and gives up when it cannot, although other access methods work. The workaround is to periodically look for such pinned files and unpin them; there is no long-term solution yet. Sites have been alerted, but the problem is probably now being seen at IN2P3 after their patch level 8 upgrade.
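The unpinning workaround described above amounts to a periodic sweep over the pin list. A minimal sketch follows; the listing format, paths, and age threshold are all hypothetical, and the actual unpin operation (which would go through the site's dCache administration interface) is deliberately not shown.

```python
from datetime import datetime, timedelta

def find_stale_pins(pinned_files, now, max_age_hours=24):
    """Return the paths of files pinned longer than max_age_hours.

    pinned_files: list of (path, pinned_since) tuples, as might be
    extracted from a dCache pin listing (the format is hypothetical).
    """
    cutoff = now - timedelta(hours=max_age_hours)
    return [path for path, since in pinned_files if since < cutoff]

# Hypothetical sweep: files found here would then be unpinned via the
# site's dCache admin interface (site-specific command, not shown).
now = datetime(2008, 7, 15, 12, 0)
listing = [
    ("/pnfs/example.org/data/atlas/file1", datetime(2008, 7, 13, 9, 0)),
    ("/pnfs/example.org/data/atlas/file2", datetime(2008, 7, 15, 11, 0)),
]
stale = find_stale_pins(listing, now)
```

In this toy listing only the file pinned for more than a day would be selected for unpinning.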
- 19 July: at about 19:20 a major network router failed and almost all services were affected. Some services were up again on Sunday but some are still degraded or unavailable (as of 13:00 Monday); in particular, some dCache pool nodes are not yet available. Work is ongoing and a post-mortem analysis will follow.
- 17 July: GGUS conducted the first service verification of the Tier-1 site operator alarm ticket procedure. Failures of the procedure at NDGF and CERN are understood and being fixed.
2.4 Experiment Reports
- Production hit by MyProxy problems – see PM at https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScLCGPxOperations
- Working on integration of CREAM-CE with ALIEN.
- CERN CASTORATLAS upgraded to 2.1.7-10 on 14 July to avoid a fatal data size overflow problem.
- ATLAS taking cosmics with test triggers produced some very large datasets (16 TB in files of a few GB), which were successfully distributed to BNL. This was not a scheduled test, but it was successful.
- ATLAS is now running cosmics at weekends. On 20 July the ATLAS CERN site services got stuck, but the resulting T0-to-T1 catch-up, when the services were restarted on Monday morning, reached an impressive 2.5 GB/s. Clearly more process-monitoring alarms are needed.
- ATLAS workflow management bookkeeping needs process-level access to their elog instance (via an elog call); this is about to be made available after a security analysis.
- CRUZET 3 cosmics ran from 7 to 14 July. Quite a good experience, more mature in terms of data handling in general. Reconstruction submissions to all Tier-1 sites are ongoing. CMS is preparing for the next global cosmics exercise in the second half of August, but expects weekly cosmics data tests on Wednesdays and Thursdays.
- Work on the P5->CERN transfer system has been finalized, and a repacker replay is now running (since 17 July), redoing the repack for the CRUZET-3 data.
- CMS expects a centrally-triggered large transfer load of many CSA07 MC datasets to the CMS Tier-2 sites, as a necessary step to complete the migration of user analysis to the Tier-2s. Each Tier-2 should expect to be asked to host a fraction of the ~30 TB of these datasets.
- CMS have a CASTOR directory of 2.3 million files of 160 KB each, which are webcam dumps and have gone to tape. They are looking at deleting them and stopping new ones from being written.
- DC06 simulation is running smoothly under DIRAC3; reconstruction and stripping tests are still ongoing, so there is no official date yet for moving production fully to DIRAC3. Some activities, mainly analysis, are still using DIRAC2.
J.Gordon asked whether there is any coordination in the ATLAS and CMS cosmic runs.
Both Experiments replied that for the moment they run their tests without coordination, and that this is not necessary at present.
3. GDB Summary (Paper) - J.Gordon
J.Gordon presented his report summarizing the July GDB meeting.
The main general points at the GDB were:
- Feedback was requested on meeting dates for 2009.
- Should the GDB stay with the second Wednesday of the month, or revert to the first?
- Where should the GDB meeting be held in March 2009, when the car show is on in Geneva?
3.2 CPU Benchmarking
H.Meinhard repeated, in more detail, the benchmarking proposal that he gave to the MB. This has now been agreed by the MB, so the details need to be worked out; a small group should do this. There were no volunteers, but J.Templon suggested that all experiments should be involved, with a matching number of sites and some of the original HEPiX group. This seems a good suggestion and J.Gordon will progress it.
The group should at least look at:
- Agreeing the exact conditions under which the chosen benchmark will be run;
- Producing an understandable paper for wider circulation;
- Producing a proposal for evaluating existing capacity;
- Producing a proposal for translating the requirements.
Agreement is required in time for input to the 11 November C-RRB.
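Translating existing requirements into the new benchmark's units is, at bottom, a scaling exercise. A minimal sketch, with an entirely hypothetical conversion factor (the group had yet to agree on one):

```python
def translate_requirement(ksi2k, factor=4.0):
    """Convert a requirement expressed in kSI2k into new-benchmark units.

    factor is a hypothetical kSI2k -> new-unit conversion; the real
    factor would follow from the agreed benchmark run conditions.
    """
    return ksi2k * factor

# e.g. a hypothetical site requirement of 1500 kSI2k
new_units = translate_requirement(1500)
```

The harder part, as the bullet list suggests, is agreeing the run conditions that fix the factor, not applying it.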
3.3 Reporting on Installed Capacity
F.Donno expanded on the technical details of her proposal to gather information on the CPU and storage installed at sites, for comparison with the MoU pledges. The storage accounting portal is undergoing an overhaul, and work with OSG is needed on publishing their data.
Issues outstanding include:
- There seems to be no standard advice on how sites with clusters shared with non-LHC work should publish their capacity. Some publish everything on the grounds that it is all theoretically available so their installed capacity can greatly exceed their pledge. Others publish a VO’s share so usage can exceed installed capacity.
- The MoU definition of ‘installed capacity’ includes disk that is in the machine room but has not been deployed/configured for any experiment; it may be allocated to one but may not even be powered on. Such disk is not currently known to the Information Service. One proposal was to publish this capacity in a virtual storage space so that it could be included in any capacity measurement; J.Gordon saw this as being difficult to maintain manually. Another proposal was to publish the ‘total installed capacity’ in a virtual storage space, which would only be used for the collection of management information. OSG had problems with this concept.
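The divergence between the two publishing conventions can be illustrated with hypothetical numbers: a site with a 1000 TB shared cluster, a 400 TB WLCG pledge, and a 25% share actually allocated to the VO.

```python
def published_capacity(total, vo_share_fraction, convention):
    """Installed capacity a site would publish under each convention.

    'everything': publish the whole shared cluster, on the grounds
                  that it is all theoretically available;
    'vo_share':   publish only the VO's allocated share.
    All figures used here are hypothetical.
    """
    if convention == "everything":
        return total
    elif convention == "vo_share":
        return total * vo_share_fraction
    raise ValueError(convention)

total_tb, pledge_tb, share = 1000, 400, 0.25
pub_all = published_capacity(total_tb, share, "everything")  # exceeds the pledge
pub_share = published_capacity(total_tb, share, "vo_share")  # usage may exceed this
```

Under the first convention the published capacity (1000 TB) greatly exceeds the 400 TB pledge; under the second (250 TB) actual usage can exceed the published figure, which is exactly the ambiguity noted above.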
It was proposed that the working group prepare a proposal and present it to the MB for approval.
3.4 Tier-1 Readiness
J.Gordon reported that the Tier-1 sites generally feel confident thanks to their experiences in CCRC’08 in May. There were problems, but in general the service was recovered quickly. Concerns remain over communications and middleware limitations.
K.Bos noted that short breaks can have major impact on the users.
- Sites did not recognize the effect that even very short breaks in service have on the experiments. It took a lot of manual work by the experiments to recover.
- ATLAS recognized that they had only addressed half of the problem. When users run jobs in large numbers a different set of problems will be seen.
This could also be a limitation of the experiments’ frameworks: grids are inherently unreliable and the frameworks must take this into account. Sites should try to minimize the number of breaks as well as maximizing reliability.
One part of the improvements in reliability has been through better monitoring and call-out. The next step needs to be in resilience to reduce the number of breaks in service.
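The point that frameworks must tolerate transient service breaks can be sketched as a simple retry-with-backoff wrapper. This is illustrative only, not any experiment's actual framework code; the operation, delays, and retry count are all assumptions.

```python
import time

def with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky grid operation, backing off between attempts.

    Illustrative only: a real framework would distinguish error types,
    cap the total wait time, and log failures for the shifters.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

# Usage: simulate a service that fails twice, then recovers
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient service break")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
```

The wrapper absorbs the two simulated breaks and the call still succeeds, which is the kind of resilience the paragraph above asks of the experiment frameworks.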
3.5 Tier-2 Readiness
M.Vetterli identified a problem with communications, specifically between Tier-2 sites and their associated Tier-1 site when they are in different countries. A concern was how user support would be provided for end-user analysis.
3.6 Machine Readiness
J.Shiers reported that the date for first circulating beam is 8 August. The Experiments expect to start taking calibration data with single beams, so everything should be in place before then. Tier-1 sites would like a period of stability before beam, but it is almost too late for this already.
J.Gordon would like the MB to track outstanding issues and apply pressure where necessary. In particular:
- Storage Tokens to be defined
- Storage required for each token
- Complete list of minimum middleware releases required.
- Anything else?
D.Barberis and M.Kasemann replied that for their experiments the storage tokens are defined and the sites know how many resources they need to allocate.
3.7 Middleware Status
M.Schulz reported on forthcoming developments:
- SCAS is still with the NIKHEF developers for ‘deep certification’. This is the holding item on Multiuser Pilot Jobs for LHCb.
- The WMS/LB for gLite 3.1/SL4 was released; it is recommended as being much better.
- Job Priorities were released but no one seems to have used them yet. ATLAS are organizing further tests with some of their Tier-2s.
- Within six months we should see a new version of FTS addressing: the split of SRM negotiations and gridFTP transfers; improved throughput and logging; and full VOMS support.
- U.Schwickerath reported on his tests of the CREAM CE. Its basic functionality was working, but it had a number of limitations and failed to meet the reliability target.
M.Schulz reported that gLExec testing is in progress, but there is not enough information from the developers to configure gLExec in logging-only mode.
5. Summary of New Actions
WLCG Grid Deployment: all software required at Tier-1 and Tier-2 sites should be described in a wiki page.