LCG Management Board
Tuesday 04 November 2008 16:00-17:00 – Phone Meeting
(Version 1 – 07.11.2008)
A.Aimar (notes), I.Bird(chair), T.Cass, M.Cattaneo, Ph.Charpentier, L.Dell’Agnello, S.Foffano, A.Heiss, F.Hernandez, J.Gordon, M.Lamanna, G.Merino, A.Pace, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, R.Tafirout, J.Templon
Mailing List Archive
Tuesday 11 November 2008 16:00-17:00 – F2F Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
G.Merino asked for a clarification of the goal of the document on installation accounting (see action list): it should also include rules on how to represent, in terms of CPU values, inhomogeneous sub-clusters.
The change was made to the minutes right away. The minutes of the previous MB meeting were then approved.
Ph.Charpentier noted that it would be easier if the Experiments first converged on a common technical position at the Architects Forum and then presented it to the MB for endorsement.
I.Bird agreed that the AF should promptly come up with a proposal and a plan.
T.Cass highlighted that the MB should in any case discuss whether Sites can actually migrate a fraction of their WNs to SL5 and what timescale they can follow.
M.Cattaneo reported that the Experiments have discussed SL5 porting but also have other issues at the moment. ATLAS found problems with ROOT SL4 binaries on SL5. Until now nobody has managed to run any major application on SL5.
Y.Schutz reported that the ALICE tests on SL5 have just started. But ALICE had already ported their software to SL4 with gcc 4.3 and therefore they do not expect any problem with SL5 and gcc 4.3.
Y.Schutz asked that the migration to SL5 be done by the Sites as quickly as possible once it is agreed at the MB.
T.Cass replied that Sites will migrate according to their own schedules and to the other VOs they support. Not all Sites can migrate at the same pace.
I.Bird added that it is the HEP VOs that should push the Sites to upgrade. CERN and the LCG MB do not have a key role in this kind of deployment issue at Sites. CERN will prepare and certify all the middleware needed, but the deployment is a Site’s decision.
32-bit and 64-bit Libraries and OSs
Ph.Charpentier added that Sites had actually agreed to support the two platforms with backward compatibility libraries and also with 32-bit and 64-bit compatibility libraries. This never happened for the migration to SL4: there was only support for 32-bit SL4.
Ph.Charpentier also noted that all or most Sites install only 32-bit operating systems and not the 64-bit version. Applications running on 64-bit operating systems are 20% faster; thus this is a considerable waste of CPU resources.
F.Hernandez noted that perhaps some Sites do not advertise their hosts correctly. IN2P3 is running a fraction of its hosts with a 64-bit OS.
Ph.Charpentier also reported that the AF has decided to use gcc 4.3 and not the SL5 native gcc 4.1 compiler. This implies that the middleware and all the storage-ware (CASTOR, dCache, DPM, StoRM) should be ported to SL5 running gcc 4.3.
M.Schulz replied that EGEE must support the native compiler gcc 4.1 for all other EGEE VOs. If the HEP VOs choose another compiler, a second port has to be done and additional distributions have to be packaged. The same applies if the Python version chosen is different.
I.Bird added that he will discuss the issues with the new EGEE Technical Director (Steven Newhouse) and invite him to participate in the MB, as his predecessor did.
11 Nov 2008 - P.Mato should report to the MB the progress on the SL5 testing by the Experiments.
2. Action List Review (List of actions)
- DONE. A document describing the shares wanted by ATLAS
- DONE. Selected sites should deploy it and someone should follow it up.
- ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end
Being discussed in ATLAS.
M.Lamanna reported that today’s system uses Panda to submit production jobs. For analysis there is progress using gLexec, and participation in the WG on pilot jobs has been very useful for ATLAS. Analysis will probably be done using WMS submissions. The importance of the JP mechanism is decreasing. It will still be necessary for sites to distinguish queues and shares for production and analysis, with the switching done by checking VOMS roles, but the full JP system may not be necessary.
This progress ought to be confirmed after the ATLAS Software Week, next week.
3. Operations Weekly Report (Slides)
Summary of the status and progress of the LCG Operations. It actually covers the last two weeks.
The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 Service Incident Reports
The number of SIRs has increased in the last few weeks. Is this because not everything was reported before, or because the issues have increased as the services are used more thoroughly?
Just as examples:
- NDGF has finished a downtime that started on 28 October and was well in excess of what is sustainable.
- The CASTOR problems at ASGC have not been fixed yet.
3.2 General issues
There is a set of general issues that seem to have introduced these delays:
- Even though not necessarily visible through standard monitoring, events which make one (or more) sites effectively unusable are not that uncommon.
- At least a fraction (naturally) of the problems occur shortly before or during weekends, and are often not fully resolved until the following week.
- There is a real chance of two or more Tier1 sites being “down” at the same time (for example during a weekend), which would be extremely painful during data taking and / or reprocessing.
- We need to address this issue with priority – there are a number of chronic issues – some “trivial” that are not being addressed.
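As a rough illustration of the point about simultaneous outages, the sketch below estimates the chance that two or more Tier1 sites are down at once. It assumes that sites fail independently and with the same probability, which understates the risk when problems cluster around weekends. The site count of 11 and the availability values are illustrative assumptions, not figures from the meeting.

```python
from math import comb


def prob_at_least_k_down(n_sites: int, p_down: float, k: int = 2) -> float:
    """Probability that at least k of n_sites are down at the same time.

    Simplifying assumption: each site is down independently with the
    same probability p_down (binomial model).
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    # Sum the probabilities of having fewer than k sites down ...
    p_fewer = sum(
        comb(n_sites, i) * p_down**i * (1.0 - p_down) ** (n_sites - i)
        for i in range(k)
    )
    # ... and take the complement.
    return 1.0 - p_fewer


# Hypothetical inputs: 11 Tier1 sites and three example availabilities.
for availability in (0.99, 0.95, 0.90):
    p = prob_at_least_k_down(n_sites=11, p_down=1.0 - availability)
    print(f"availability {availability:.2f}: P(>=2 sites down) = {p:.3f}")
```

Even under the optimistic independence assumption, the probability grows quickly as per-site availability drops, which is why correlated weekend outages deserve priority.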
3.3 Example of a Shift Report from ATLAS
This is an extract from the ATLAS shift log
There were many issues at once and they lasted all weekend. ASGC has been down for 10 days, still without ANY information.
3.4 Suggested Actions
Things are currently very calm with respect to what we must expect when the machine is running. If we don’t get this under control in the next months “sustainable operations” will be very difficult.
Sites should spontaneously provide information and this is not always happening.
- No news e.g. from ASGC about the prolonged downtime of the CASTOR services;
- There have been several incidents at SARA in the past days affecting the storage services
- Quite a few sites rarely or never attend the operations meetings
Are sites deploying services using the techniques that have been repeatedly described and / or with adequate resources? Is there sufficient documentation on how to install and operate the services?
Some recent problems had already been solved in the past at some sites, but others have not taken this into account. Is this caused by the Sites’ lack of participation or by a lack of communication and distribution of information?
I.Bird asked what could be actually done in the short term.
J.Shiers replied that some training could be defined if Sites are ready to follow it. But it is absolutely important that Sites always come to the Operations Meeting and participate, ask and listen about problems and solutions, even when they have nothing important to report.
I.Bird added that, for instance, the lack of response from Sites should become visible and reported.
He will follow up the specific issue with ASGC, but there is a general lack of participation from some Sites.
F.Hernandez reported that IN2P3 had several Java VMs on the dCache pools that inexplicably blocked, all at the same time, over the weekend. The previous weekend the PNFS server for dCache was blocked and only the Tier-0 channels could be kept open. But IN2P3 has no control over the requests the site receives. The site can only stop some PNFS channels.
J.Gordon added that not all Sites can apply the same techniques presented by CERN. There are other constraints and systems in place at Sites (batch systems, alarms, hardware, etc) and other VOs to support.
Ph.Charpentier added that, for instance, when files are unavailable (or even lost) at a Site this is not spotted by the reliability tests and therefore not reported in the reliability reports.
J.Shiers noted that a probe in some cases could check that disks and pools are responding adequately.
Are there changes that can be made that expose experiments less to the inevitable problems? The situation will probably take at least months before any significant improvement can be seen – i.e. we may have to live with this for 2009!
3.5 Weekly Update: Experiments
Just as for Sites, it is not uncommon for an Experiment to run into a problem previously seen, analyzed and resolved by another.
More sharing of “solutions” would mean fewer difficulties and a better use of the available effort. As an example, it seems that CMS had already seen the PNFS overloads and modified their applications to limit directory entries below 1000. But the PNFS problem was never reported and solved.
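The kind of application-side workaround described here, keeping every directory below 1000 entries, can be sketched as follows. This is only an illustration of the general idea, not the CMS implementation: the function name, the base path and the numbered bucket layout are all hypothetical.

```python
from pathlib import PurePosixPath

# Keep each data directory strictly below 1000 entries.
FILES_PER_DIR = 999


def sharded_path(base: str, file_index: int, filename: str,
                 files_per_dir: int = FILES_PER_DIR) -> PurePosixPath:
    """Place the file_index-th file in a numbered sub-directory of base.

    Consecutive files fill one bucket directory (0000, 0001, ...) up to
    files_per_dir entries before the next bucket is used. Note that the
    base directory itself gains one entry per bucket, so this simple
    one-level scheme only helps up to files_per_dir * files_per_dir files.
    """
    if file_index < 0:
        raise ValueError("file_index must be non-negative")
    if files_per_dir < 1:
        raise ValueError("files_per_dir must be at least 1")
    bucket = file_index // files_per_dir
    return PurePosixPath(base) / f"{bucket:04d}" / filename


# Hypothetical base path, used only for illustration.
for i in (0, 998, 999, 2500):
    print(sharded_path("/pnfs/example/data", i, f"file_{i}.root"))
```

A layout like this only hides the overload from the application; it does not replace reporting the underlying PNFS problem so that other Experiments and Sites can benefit.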
- On-going work on WMS migration – essentially out of RB now; Continuing with the implementation of the WMS into ALICE s/w.
- A pilot version of the submission module of ALICE has been implemented in Torino and at CERN to ensure load balancing among different WMS per site. Presented at ALICE TF meeting;
- Problems with file registration continued and were then solved – on the ATLAS side (not LFC)
- When solved the backlog drained away rapidly.
- The Conditions DB access & related stress tests – on-going discussions within ATLAS and with service providers on the way forward; access to conditions from Tier2s most likely requires a revision of the current production model.
- ATLAS is looking at FroNTier / Squid, which might have (minimal?) service impact.
- Still some cosmic data being collected but not automatically distributed – sites have to ask
- ATLASMCDISK space in CASTOR at the RAL Tier1 lost several files.
- Run "CRAFT" on-going - magnet still on since yesterday afternoon. Everything basically fine except a backlog of queued transfers to CASTOR tapes over the weekend every time a new run started. From the CASTOR point of view "nothing wrong" - CMS is trying different patterns of copying data out to CASTOR.
- CAF: problem with low free disk space. CASTOR team gave +150TB to CAF. CMS will still make an effort to delete as much data as possible.
- Problem with CASTOR at ASGC still not solved, in contact with Oracle global support.
- The global cosmics run continues until 11 November.
3.6 Service Incidents Reports
3.7 Useful Validations and Needed Tests
What could be done to validate the Services at the Sites?
- Service validation any time software is changed/upgraded
- Specific tests (e.g. throughput) to ensure that no problems have been introduced
- Tests of functions not yet tested (e.g. Reprocessing/data recall at Tier 1s)
In particular we should do the following tests:
- A simulated downtime of 1-3 Tier1s for up to – or exceeding – 5 days to understand how the system handles export, including recall from tape.
- Extensive concurrent batch load – do shares match expectations?
- Extensive overlapping “functional blocks” – concurrent production & analysis activities (inter & intra – VO)
- Reprocessing and analysis use cases (Tier1 & Tier2) and conditions "DB" load - validation of current deployment(s)
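For the concurrent batch load test, the question of whether shares match expectations could be checked along the lines of the sketch below. It is a minimal example under stated assumptions: the function name, the 5% tolerance, and the VO shares and CPU-hour figures are invented for illustration and do not come from any Site's accounting.

```python
def share_deviations(target_shares: dict, cpu_hours: dict,
                     tolerance: float = 0.05) -> dict:
    """Compare the agreed shares with the observed usage per VO.

    Returns, for each VO in target_shares, a tuple
    (expected fraction, observed fraction, within tolerance?).
    """
    total_used = sum(cpu_hours.values())
    total_target = sum(target_shares.values())
    if total_used <= 0 or total_target <= 0:
        raise ValueError("shares and usage must sum to a positive value")
    report = {}
    for vo, target in target_shares.items():
        expected = target / total_target
        observed = cpu_hours.get(vo, 0.0) / total_used
        report[vo] = (expected, observed, abs(observed - expected) <= tolerance)
    return report


# Hypothetical shares (percent) and measured CPU hours during the test.
targets = {"alice": 10, "atlas": 50, "cms": 30, "lhcb": 10}
usage = {"alice": 40.0, "atlas": 700.0, "cms": 200.0, "lhcb": 60.0}
for vo, (expected, observed, ok) in share_deviations(targets, usage).items():
    print(f"{vo}: expected {expected:.2f}, observed {observed:.2f}, ok={ok}")
```

Such a comparison is only meaningful while all VOs keep the queues full, which is exactly the condition the extensive concurrent load test is meant to create.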
J.Shiers concluded that making the Services more reliable should have priority over the upgrades and enhancements that Sites may have planned. The real issues seem to be about Data Management and DB Services.
J.Gordon noted that the FS probes only detect data corruptions but not disk failures. And the solutions to the disk failures problems still need to be investigated.
M.Schulz noted that there is also a high number of power failures at the Sites.
J.Shiers added that Sites should learn how to recover quickly from power failures and make sure that recovery does not take several hours or days.
I.Bird added that the metrics and milestones should reflect this need to improve site reliability for the VOs.
J.Templon added that the Experiments should also build their software without assuming that everything is working 100% all the time.
4. Preparation of the LHCC Referees Meeting (Agenda) – I.Bird
The LHCC Referees asked in particular for a report from each Experiment and for the experience with cosmics.
In the open sessions there will also be a report by I.Bird.
J.Templon reported that, in order to understand whether their machines can reboot without problems, NL-T1 plans to reboot random machines for one hour each week. Should they put the site as “at risk” for that hour?
Several MB members suggested being very careful that many jobs are not going to fail or that key servers are not going to be stopped or slowed down by this. Otherwise this operation can cause major unavailability.
I.Bird concluded that this topic should be discussed more at length in a future meeting also looking at what other big service providers do.
- Disk Procurement
- Site Reliability Reports for OPS and VOs
6. Summary of New Actions
11 Nov 2008 - P.Mato should report to the MB the progress on the SL5 testing by the Experiments.