WLCG Management Board
Tuesday 16 December 2008 16:00-17:00 – Phone Meeting
(Version 1 – 29.12.2008)
A.Aimar (notes), D.Barberis, O.Barring, I.Bird(chair), F.Carminati, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, S.Foffano, A.Heiss, M.Kasemann, A.Pace, R.Pordes, M.Schulz, J.Shiers, O.Smirnova
Mailing List Archive
Tuesday 13 January 2009 16:00-18:00 – F2F Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments. The minutes of the previous MB meeting were approved.
1.2 Comments on the HL Milestones of last week – I.Bird
I.Bird, who could not connect during the HLM discussion at the previous MB meeting, had some comments on the milestones discussed the week before.
I.Bird would like the milestones that were marked as deleted to stay in the dashboard. Even if they are managed by the Architects Forum (“SCAS Verified by the Experiments”) and by the GDB (“SCAS + gLexec Deployed and Configured at the Tier-2 Sites”), both should be reported to the MB as High Level Milestones.
New dates for the milestones need to be proposed.
1.3 Quarterly Reports 2008Q4
A.Aimar will ask for the Quarterly report contribution of the Application Area (P.Mato) and the GDB (J.Gordon). LCG Operations (J.Shiers) already sent its contribution.
Experiments will be asked to prepare their usual short report for the MB meetings, starting on January 13th 2009.
1.4 Status of the Implementation of the SRM Addendum (Slides) – F.Donno
F.Donno, who could not participate in the meeting, distributed the slides that summarize the status of the implementation of the features requested in the SRMV2 Addendum document.
In short, the short-term features will be completed by all implementations in their January 2009 releases.
The only exceptions, not yet implemented, are:
- the “srmLs” in Castor and
- the “SpaceTokens for operations other than write” in dCache
2. Action List Review (List of actions)
M.Schulz reported that the latest patch was rejected because it was leaking memory and would occupy all the memory on the machine. The service would crash after a very short time.
DONE. I.Bird reported that the page reports the AND of all results and the V2 tests.
CERN: O.Barring reported that CERN has asked each VO for a contact in order to have the SLA approved officially. Replies were received from ALICE and ATLAS, but not from CMS and LHCb.
M.Kasemann reported that D.Bonacorsi has been nominated contact person for all the SLAs with the Sites.
O.Smirnova reported that NDGF has sent their SLA proposal to ALICE.
In the agenda today.
M.Schulz reported that the SA3 person (O.Keeble) will relay information between the AF and the MB, and between the Applications Area and the EGEE project. In the spring the new software from the Experiments will be available and gcc 4.3 support will be needed.
For now the gcc 4.1 middleware binaries are compatible. The only exception is VOMS that is partially written in C++. There is also an SL5 installation with gcc 4.3 and python 2.5 ready for the porting (in ETICS).
Not done yet.
I.Bird proposed that someone be appointed to complete this action.
D.Barberis noted that this information is already provided to the Sites at least twice per year; it is usually updated and is therefore not the same every time. In addition, the timescale changed during 2008 because of the LHC issues and others.
I.Bird replied that a table where this information is available should be maintained and kept up to date. Sites do not receive the information (data flows, network, storage, etc.) in a form that they can understand and implement.
F.Carminati stated that ALICE is willing to clarify all unclear issues.
M.Kasemann added that CMS has appointed someone for such a document and will provide it.
Not done. Will be followed by the Applications Area? Or F.Donno?
Same action as the one above about SRM V2.
DONE for all Experiments.
Note: OSG asked for one more month, until end of January.
DONE. Received by email from all Experiments. Link
3. Operations Weekly Report (Draft pre-CHEP workshop agenda; Minutes)
Summary of status and progress of the LCG Operations. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings.
The weekly GGUS Summary is now also available. There was only one alarm this week, from LHCb (see slide 2).
3.1 Services Summary
Many activities are on-going, such as:
- CMS PhEDEx 3.1.1 deployment
- preparation for Christmas activities by all Experiments
- deployment of new versions of Alien & AliRoot
- preparation of ATLAS 10M files test
The number of issues discussed at the daily meeting has increased quite significantly since earlier in the year. Cross-Experiment and Site discussions are healthy and valuable. Often similar problems have already been seen by others, and possible solutions are proposed.
Services are (disconcertingly) still fragile under significant load, and this does not really seem to improve from one week to another: DM services often collapse, rendering a site practically unusable. At least some of these problems are attributed to the DB backend; there are also DB-related issues in their own right.
The high rate of both scheduled and unscheduled interventions continues – a high fraction of these (CERN DB+CASTOR related) overran significantly during this last week.
3.2 Examples of Service Issues
The ATLAS “10M files test” stressed many DM-related aspects of the service. 100K files are copied from CERN to each Tier-1. This caused quite a few short-term problems during the course of last week, plus larger problems over the weekend:
- ASGC: the SRM is unreachable for both put and get. Jason sent a report this morning. It looks like they again had DB problems, in particular ORA-07445.
- IN2P3: SRM unreachable over the weekend
- NDGF: scheduled downtime
- In addition, RAL was also showing SRM instability this morning (and earlier according to RAL – DB-related issues). These last issues are still under investigation.
ATLAS online (conditions, PVSS) capture processes aborted – the upgrade operation was not fully tested (!!!) by the DB admin before it was run on the production system.
The Oracle bug was corrected but there is no fix for this problem – other customers have also seen the same problem but there has been no progress since July (service requests). The WLCG Oracle reviews were proposed several times. This situation will take quite some time to recover from, and it is unlikely that this can be done prior to Christmas.
But there will be no action from Oracle unless we push Oracle for solutions to our problems.
A.Pace added that when a bug is discovered a quick workaround must be found while waiting for the fix that might take months. When a short term solution is found the Sites should contribute and apply the workaround and not be opposed to it.
The synchronization of the conditions DBs between ATONR and ATLR is done; for PVSS it is on-going. The synchronization with the T1s is instead postponed.
Other DB-related errors are affecting dCache, e.g. at SARA (PostgreSQL), and replication to SARA (Oracle).
3.3 Next Actions
Is the current level of service issues acceptable to:
- Sites – do they have the effort to follow-up and resolve this number of problems?
- Experiments – can they stand the corresponding loss / degradation of service?
If the answer to either of the above is negative, what can we realistically do?
- DM services need to be made more robust to the (sum of) peak and average loads of the Experiments. This may well include changing the way Experiments use the Services (“being kind” to the services)
- DB services (all of them) are clearly fundamental and need the appropriate level of personnel and expertise at all relevant sites
- Procedures are there to help us – much better to test carefully and avoid messy problems rather than expensive and time consuming cleanups which may have big consequences on an experiment’s production
I.Bird suggested a review of the DB services and practices at the Tier-1 Sites for early 2009.
J.Shiers agreed and added that many of the typical mistakes could already be fixed at the Sites: these are well-known issues and there are written solutions in the DB admin “literature”.
Ph.Charpentier added that there are scalability problems with the shared software area. It is not a service that was originally requested, but it is a de-facto service. When many applications access the same shared area (e.g. starting applications, run-time libraries, etc.) the central shared file system does not scale and fails or hangs.
Sites reported different issues with these central shared software areas; it probably depends on the technologies used. For instance at FZK an application, when starting, opens up to 200 files (e.g. 99 ROOT macros). If this is on a shared file system, 500 jobs can cause one hundred thousand open file handles in the shared area.
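The FZK figure can be checked with a small sketch. It assumes, as a simplification not stated in the discussion, that every job holds all of its files open at the same time; the function name is hypothetical and only illustrates the arithmetic.

```python
def open_file_handles(files_per_job: int, concurrent_jobs: int) -> int:
    """Upper bound on simultaneously open handles in the shared area.

    Assumption: each job keeps all of its files open at once, so the
    total is simply the product of the two numbers.
    """
    return files_per_job * concurrent_jobs


# Figures quoted for FZK: up to 200 files per application start, 500 jobs.
print(open_file_handles(200, 500))  # 100000
```

With these figures, halving the number of files opened at start-up would halve the load on the shared area for the same number of jobs.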
Ph.Charpentier stated that the way the Experiment software is written is not in the scope of the MB discussions, but he agreed that the Experiments should have better specified their needs. This could be reviewed in the future but changing it now is too difficult, if not impossible.
I.Bird concluded that the “shared software area” is a GDB and Applications Area issue. Different Sites adopt different solutions (NFS, GPFS, etc) and the situation should be assessed at each of the Sites. Experiments should take these constraints at the Sites into account when updating their applications.
3.4 Conclusions and Outlook
The new GGUS weekly summary gives a convenient weekly overview. Other “key performance indicators” could include a similar summary of scheduled / unscheduled interventions, plus those that run into “overtime”.
A GridMAP style summary – preferably linked to GGUS tickets, Service Incident Reports and with (as now) click through to detailed test results – could also be a convenient high-level view of the service.
But this can only work if the number of problems is relatively low.
4. HL Milestones for 2009 (HLM20081215) – A.Aimar
Below are the changes agreed at the meeting (in addition to the ones in section 1.2):
- Jan 2009 - VO-Specific Tier-1 Sites Reliability Considering each Tier-0 and Tier-1 site
- Tier-1 Sites Procurement – 2009, stays unchanged for 2009
- Accounting milestones will have to be followed with J.Gordon.
- SRM milestones were discussed. The deployment of the short-term solution needs to be scheduled.
- FTS milestones stay as they are.
The MSS metrics need to be defined by the Sites and they need to be reached. There is no longer a single common set of metrics because some Sites cannot collect those specific metrics. For the F2F in January the Sites should present their MSS metrics.
13 Jan 2009 – Sites present their MSS metrics to the F2F MB.
The CPU milestones will be proposed by G.Merino after the working group delivers its report.
The next MB meeting will be on 13 January. The Agenda for the Mini-Review needs to be discussed then.
Merry Christmas and Happy New Year to all members of the WLCG MB!
6. Summary of New Actions
13 Jan 2009 – Sites present their MSS metrics to the F2F MB.