LCG Management Board
Tuesday 28 October 2008 16:00-17:00 – Phone Meeting
(Version 1 – 31.10.2008)
A.Aimar (notes), I.Bird(chair), D.Barberis, D.Britton, T.Cass, L.Dell’Agnello, F.Donno, D.Duellmann, M.Ernst, A.Heiss, F.Hernandez, J.Gordon, M.Kasemann, M.Lamanna, P.Mato, G.Merino, Di Qing, M.Schulz, J.Shiers, R.Tafirout, J.Templon
Mailing List Archive
Tuesday 4 November 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
L.Dell’Agnello sent some clarifications by email about the split of the procurement for CNAF in future years; he explained that they will have 50% of the disk resources by April and the other 50% only by Q3.
I.Bird noted that Q3 means by October at best and this can be considered too late.
D.Barberis expressed the Experiments’ worry that there will probably already be delays to Q3 in 2009, whereas users need disk for analysis of past data all the time, not only during the data-taking period.
The MB decided that dates and percentages for disk instalments for future years should be agreed by end of 2008 (for 2010 procurement onwards).
The minutes of the previous MB meeting were then approved.
2. Action List Review (List of actions)
- DONE. A document describing the shares wanted by ATLAS
- DONE. Selected sites should deploy it and someone should follow it up.
- ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.
Being discussed in ATLAS.
M.Lamanna reported that today’s system uses Panda to submit production jobs. For analysis there is progress using glexec, and the participation in the WG using pilot jobs has been very useful for ATLAS. Analysis will probably be done using WMS submissions. The importance of the mechanism for JP is decreasing. It will still be necessary for sites to distinguish queues and shares for production and analysis, with the switching done by checking VOMS roles, but a full JP system may not be necessary.
This progress ought to be confirmed after the ATLAS Software Week, next week.
Is on the agenda for the WLCG Workshop in November.
Proposal distributed and will be discussed at the GDB.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. It actually covers the last two weeks.
The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 Service Incident Reports (SIRs)
Service Incidents that have triggered a “Post-Mortem” report (it was agreed to call these “Service Incident Reports” from now on) have averaged about 1/week since June.
The last 2 weeks – and in particular the last few days – have been well over this average. In addition, a number of these incidents (not just those of this weekend!) have still not been fully understood or resolved. It is essential to have input from all parties involved.
We now have a web page (wiki) where we will keep pointers to these. We will progressively add the SIRs submitted since 1st June 2008 (when we started to be more systematic about it…) plus pointers to earlier reports.
These work well for incidents that occur over a relatively short period of time (hours – days) and can be promptly diagnosed
They work less well for on-going problems – e.g. those “around” the ATLAS conditions service(s) & e.g. storage-related services at RAL, with multiple events over an extended period of time which may or may not all be linked.
As was suggested at a previous MB, maybe simply keep these as “open items” with a regular update.
Probably need to be rather generic in the description – as is attempted above – to avoid too many discussions about details.
It should also take a “service viewpoint” – if it’s the same service that is affected, then it’s in the same “dossier”.
3.2 Recent Major Incidents
CASTOR at ASGC - On Friday the CASTOR services at ASGC started degrading and were essentially unusable for ATLAS and CMS (100% failure) from Saturday on. Numerous mails on castor-operation-external (CASTOR operation issues at institutes outside CERN) about “ORA-600 Errors in Castor rhserver”.
ORACLE Services for CASTOR - There are still questions about the required patch levels for Oracle services for CASTOR outside CERN.
Unscheduled Downtime at SARA - On Monday, SARA announced an unscheduled downtime of the storage services, following earlier errors (also seen over the weekend).
Power Outage at NIKHEF - No specifics are available on exactly what went wrong with the power, but many of the services did not come back up. Many of the services had been virtualized, but the VM configurations were not set up in a way that made it possible for them to auto-start. Restoration of service is continuing; it has gotten far enough that the Nagios harness is working, giving a reasonable overview of what is not working.
J.Templon clarified that the UPS batteries caused the problem. The systems not on UPS continued working, but the crucial services were obviously on UPS and went down. When power came back the virtual machines running the services did not auto-restart.
J.Shiers noted that incidents occurring during the weekend are not fixed until the following Monday.
CASTOR at RAL - On Saturday, A.Sansum circulated a preliminary “SIR” on an incident at RAL affecting the CASTOR service for ATLAS with reported 55h duration. See http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018
J.Gordon added that a higher number of SRM requests (2 times higher) caused the issues. User analysis jobs can also cause overloads that cannot be handled at some point.
J.Templon added that services should have a way to set a threshold for when they are used more than they can handle.
I.Bird replied that we should be sure that the reason is an uncommon usage; and even uncommon usage should not bring down the services like that.
J.Gordon agreed to send more information in the following days (to the Operations meeting).
FTS and VOMS at CERN - FTS and VOMS services at CERN also suffered on Friday night with a 3-4 hour downtime on some channels and complete failure on others (FTS), only a 5’ interrupt for VOMS.
“Triggered” by security scan – follow-up to further understand and avoid; follow-up on alarm handling “got stuck”
Good news on Oracle Streams – It seems that there is now a “good patch” for a bug impacting the Oracle Streams service (several iterations – took 2 years in total!). Details are on Slide 6.
3.3 Experiments Activities and Reports
As noted in the report to the OB, the Experiments have – judged by what is reported at the daily meetings – ramped up their activities since “09/19”. It is probably not realistic to report on even the high-level messages at every MB, whereas just mentioning the “SIRs” is probably too high-level.
What follows is a chronological walk-through of some of the main points affecting the services. The list is given here, but it was not discussed in detail at the MB Meeting:
- 13/10: LHCb requested an investigation into LSF@CERN problems – post-mortem produced
- 13/10: LHCb “data integrity checks” and CASTOR@CERN service – PM produced, operations procedures to pro-actively drain servers out of warranty
- 13/10: various comments about announcements for scheduled & unscheduled maintenance
- 14/10: problems with SARA tape b/e for several hours
- 14/10: gLite 3.1 update 33 withdrawn – bdii problems
- 15/10: more downtime discussions
- 15/10: extended downtime at RAL; CNAF squid server down
- 16/10: RAL intervention – memory upgrade problematic due to faulty module; longer than expected to import/export DBs; test of failover of FC connecting disk server to Oracle RACs gave problems
- 17/10: ATLAS & LHCb CASTOR RAL services down ~1 hour – DBs bounced. New occurrence of “bad identifiers” in CMS CASTOR DB as seen in August.
- 20/10: ATLAS report >90% failures to RAL – Oracle error. RAL have 2 F/E m/cs for ATLAS; slow DB queries – attempt to improve load balancing went “horribly wrong” – should now be ok
- 20/10: CMS report v high load on ASGC SRM v2 server
- 21/10: discussions on ATLAS conditions, streams et al
- 22/10: “transparent” intervention on Lyon DB cluster went wrong ~22:00 – impacted LFC, FTS & CIC portal. Not announced.
F.Hernandez clarified that internally the administrators of the service had not announced that “transparent” intervention. This will be investigated and avoided in the future.
- 22/10: Errors seen with Oracle b/e to ASGC storage services. Experts at HEPiX “at hand”.
- 22/10: high load on dCache at SARA. Changed behaviour in gplazma? Users in multiple groups?
- 23/10: performance problems with ATLAS online-offline streams due to delete operation on all rows in a table with neither indices nor primary key.
- The ATLAS conditions papers
- The RAL CASTOR-Oracle conundrum
3.4 Up-coming workshops
- Distributed Database Operations, 11-12 November. Will cover many of the points mentioned above plus also requirements gathering for 2009 (experiments’ input)
- Pre-GDB on storage + GDB also on these days
- Data Taking Readiness Planning, 13-14 November. Volunteers for non-CERN speakers & session chairs welcome.
I.Bird noted that maybe the weekly report should just include the main issues to follow up and not the list of all incidents.
J.Gordon and M.Schulz noted that the incidents could be shown by services and by sites.
J.Shiers agreed that a summary table with “Services and Sites” could provide a good visualization of the situation.
4. Installations Accounting Status (Slides) – F.Donno
F.Donno presented an update concerning CPU and storage capacity at the Sites.
4.1 Computing Capacity
The source to calculate computing capacity at sites should be the information system but the information is often incomplete.
The computed capacity should then be compared against the declared pledges. Therefore, it should be expressed in KSpecInt2000.
The publishing vector should be the APEL portal by CESGA.
The relevant values in the Glue attributes are:
- Total CPUs = Total number of assigned Job Slots in the queue
- Physical CPUs = Total number of real CPUs/physical chips in the sub cluster
- Logical CPUs = Total number of core/hyper threaded CPUs in the sub cluster
A sub cluster is a homogeneous set of nodes.
The installed capacity is calculated:
Installed Capacity = BenchMarkSI00 x Physical CPUs
This model is good but has several issues at the moment:
- Published numbers are mostly filled in by hand by site administrators. Better information providers and validation tools can cure the situation.
- SubClusters are not homogeneous. Published average should be OK
- Fairshare not published. Is it OK to publish the total?
- Normalized values. If CPU speed is scaled up to some value then also SubCluster's Physical and Logical CPU count must be scaled so that the total power is correct.
G.Merino noted that most sites do not have sub clusters but have a mix of hardware, and therefore they must calculate the normalization factor.
J.Templon noted that scaling and normalizing was never proposed at the Operations meeting as the correct approach.
F.Donno replied that if there are subclusters the normalization is not the proposed solution.
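For sites that do publish a scaled CPU speed, the constraint in the "Normalized values" item can be illustrated with a minimal Python sketch. The function name and the sample numbers are hypothetical and do not come from the slides; the only point shown is that the product of benchmark and CPU count must stay unchanged.

```python
def scale_cpu_count(physical_cpus, actual_ksi00, published_ksi00):
    """Return the CPU count to publish when the per-CPU benchmark is scaled
    from actual_ksi00 to published_ksi00, so that
    published_ksi00 * scaled_count == actual_ksi00 * physical_cpus."""
    if published_ksi00 <= 0:
        raise ValueError("published benchmark must be positive")
    return physical_cpus * actual_ksi00 / published_ksi00


# Hypothetical sub cluster: 100 CPUs at 1.5 KSI00 each, published at 2.0 KSI00.
scaled = scale_cpu_count(100, 1.5, 2.0)
total_before = 100 * 1.5
total_after = scaled * 2.0
print(scaled, total_before, total_after)
```

The scaled count is generally not an integer, which is one reason a published figure of this kind can be hard to validate against the real hardware.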
- Benchmark=KSI00 most problematic to check. Retired as of February 2007
- Most sites refer to spec.org. SPEC.ORG reports CPU power per chip and not per core
I.Bird said that a new benchmark is already agreed and will be used soon (a working group has been started).
The current status of the Tier-2 Sites is:
- 124 WLCG T2 Sites
- 13 WLCG T2 Sites not yet in GOCDB
- 21 WLCG T2 Sites not answering
- 103 WLCG T2 Sites were responding
- 78 WLCG T2 Sites running PBS (and its flavors) - others mostly running condor (sge and lsf). Of these, 27 WLCG T2 PBS Sites do not publish Physical CPUs
F.Donno found the real information at the sites by using the PBS commands directly: “pbsnodes -a” and “qmgr -c print server/queue <queue>” were run through globus-job-run on the CE as validation.
She also compared the Processor Model/Speed with what is published by SPEC.ORG to find the correct KSI00 per CPU.
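The minutes name the commands but not how their output was processed. As a rough illustration only, the Python sketch below parses text laid out the way Torque's "pbsnodes -a" typically prints it (a node name followed by indented "key = value" lines) and sums the "np" attribute over usable nodes. The sample text, the node names and the choice of states to skip are all hypothetical, and "np" is the number of processors configured per node in the batch system, which may differ from the number of physical chips.

```python
# Hypothetical sample imitating the usual "pbsnodes -a" layout;
# real output carries more attributes per node.
SAMPLE = """\
wn001
     state = free
     np = 4
     ntype = cluster

wn002
     state = down,offline
     np = 8
     ntype = cluster

wn003
     state = job-exclusive
     np = 4
     ntype = cluster
"""


def parse_nodes(text):
    """Map each node name to a dict of its 'key = value' attributes."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            current = line.strip()  # unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def count_slots(nodes, skip_states=("down", "offline")):
    """Sum 'np' over nodes whose state contains none of skip_states."""
    total = 0
    for attrs in nodes.values():
        states = attrs.get("state", "").split(",")
        if any(state in skip_states for state in states):
            continue
        total += int(attrs.get("np", 0))
    return total


nodes = parse_nodes(SAMPLE)
print(len(nodes), count_slots(nodes))
```

A count obtained this way would still have to be combined with the per-CPU benchmark before it could be compared with the published values.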
Slide 6 shows, just as example, the Canada West Federation Tier-2 Sites. The computed capacity is well above the pledges:
- Pledges 2008 = 300KSI00
- Computed Installed capacity = 90*1.5 (135) + 64*2.7 (172.8) + 420*1.5 (630) = 937.8 KSI00
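The arithmetic above can be checked with a short Python sketch that applies the formula Installed Capacity = BenchMarkSI00 x Physical CPUs to each sub cluster and sums the results. Reading the first factor of each product as the CPU count and the second as the KSI00 per CPU is an assumption (the total is the same either way), and the variable and function names are illustrative only.

```python
PLEDGE_2008_KSI00 = 300.0

# (physical CPUs, KSI00 per CPU) for each sub cluster, as quoted in the minutes.
# Which factor is the count and which the benchmark is an assumption.
SUB_CLUSTERS = [(90, 1.5), (64, 2.7), (420, 1.5)]


def installed_capacity(sub_clusters):
    """Sum benchmark * physical CPUs over all sub clusters (KSI00)."""
    return sum(cpus * ksi00 for cpus, ksi00 in sub_clusters)


capacity = installed_capacity(SUB_CLUSTERS)
ratio = capacity / PLEDGE_2008_KSI00
print(f"installed: {capacity:.1f} KSI00, "
      f"pledged: {PLEDGE_2008_KSI00:.1f} KSI00, ratio: {ratio:.2f}")
```

With these figures the computed capacity comes out a little over three times the 2008 pledge.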
But the data published in GOCDB by those sites is missing or in some cases incorrect.
The data is maintained manually and therefore is not kept up to date, as nodes can frequently be moved out of the clusters, switched off, maintained, etc.
I.Bird proposed that the information in GOCDB/Glue and the information extracted from the sites should both be reported, in order to ask the Tier-2 Sites to update their information in the GOCDB. Meanwhile, sites are requested to insert and maintain up-to-date information for the Logical CPUs, Physical CPUs and Benchmark values.
F.Donno will distribute a document describing how the installed accounting is collected.
J.Templon warned that the working group on Glue and the EGEE TMB are also working on the same issue.
F.Donno clarified that she is already in contact with S.Traylen, who works on the Glue schema.
F.Hernandez asked for very clear instructions on how each value should be interpreted and filled in by the Sites, in particular for heterogeneous clusters.
4.2 Storage Capacity
The MSS provide needed info with no sys admin intervention.
CASTOR - The CASTOR information providers were deployed at RAL and they pass the validation procedure – minor changes are needed and a precise schedule is needed.
DPM – The information providers are already deployed at a few sites (UK and France). They are in certification as a patch release for DPM 1.6.11.
dCache – The dCache information providers are available with dCache 1.9.2, but there are some implementation problems and a phone conference is scheduled for Thursday, 30 October 2008. OSG is invited as well.
StoRM – The StoRM information providers will be available at the end of November 2008.
6. Summary of New Actions