LCG Management Board
Tuesday 29 August 2006 at 16:00
(Version 2 - 6.9.2006)
A.Aimar (notes), M.Barroso Lopez, L.Bauerdick, K.Bos, N.Brook, F.Carminati, T.Cass, M.Delfino, D.Foster, B.Gibbard, F.Hernandez, J.Knobloch, R.Jones, M.Lamanna, H.Marten, P.Mato, B.Panzer, Di Quing, H.Renshall, L.Robertson (chair), M.Schulz, Y.Schutz, R.Tafirout, J.Templon
LCG Management Board at BNL
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
Comments from M.Ernst and K.Bos received and added to the minutes.
Minutes approved by the MB.
1.2 Sites Reliability
Mail 24.8.2006: Each site (CERN + Tier-1s) to provide a report with the reasons for each failure of the SAM basic tests at their site in July and the first half of August, to be emailed to the MB before the end of the month.
Update: The answers to the SAM reports of each site are available at this page: https://twiki.cern.ch/twiki/bin/view/LCG/SamReports
1.3 Sites and Experiments Contacts (more information)
Sites and Experiments should check and update the information.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
J.Gordon will present some use cases at the GDB in September.
Not done. J. Shiers notes that the SC Tech Day, to be held at CERN on September 15th 2006, has a session devoted to this issue.
I.Bird proposed to HEPIX the organization of a workshop during the next HEPIX meeting in October. For the moment it is not clear whether this will be possible; more news from I.Bird in the coming weeks.
Not done. Is the current situation satisfactory, so that this action is no longer needed? One more week for the experiment to answer; then the action will be removed.
Not complete. One more week, then the values will be considered agreed by sites and experiments.
For PIC the procurement is somewhat late; in October it will not be possible to carry out the upcoming CMS activities while also continuing to provide resources for ATLAS. N.Brook noted that the same applies for LHCb, and H.Renshall has to update the table with the latest values.
Done. The revised numbers on experiments’ resource requirements for 2007 and 2008 were sent to C.Eck (CMS will do so before the end of the week).
1.1 CERN (transparencies) - T.Cass
The basic support arrangements provided are:
- Redundancy & Load Balancing.
- Administration Team (Technicians).
- Support is best effort outside working hours.
Some services, especially at the grid level, are not yet documented well enough for Operators & System Administrators to be able to intervene. Work is in progress to bring the level of support to the standard of the other services provided by the CC.
Some services with defined response times in the MoU are complex, e.g. “Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation”. Problems depend on remote sites and/or services run by the experiments, so the necessary contact details are needed.
Note: There is an action in the action list requiring that the contact details be maintained on the Contacts page; sites and experiments should keep it up to date.
- The basic support structure is in place and will improve as experience is gained during the ongoing service challenges.
- The MoU targets for the delay in responding to operational problems can certainly be met, and at more than a trivial “acknowledgement” level.
- An ability to fix problems rapidly enough to ensure that the average availability targets can be met has not been demonstrated. In particular, there are concerns about (1) the extent to which engineer level expertise will be required (especially in the database area), and (2) the dependence on experiment servers and services at the remote end of any data transfers.
J.Knobloch added that 24x7 support of the databases, which requires advanced DBA skills and cannot be done by a technician, is still at the “best effort” level by the Service Managers. Discussions are underway with CERN IT in order to guarantee a better support level.
1.2 IN2P3 (transparencies) - F.Hernandez
The CC at IN2P3 provides:
- An “On-Call Service” maintains the core site services at the best possible level of operation during non-office hours. The strategy is to detect the dysfunctional component automatically, as early as possible, and (if needed) trigger human intervention.
- Coordination, planning, documentation, monitoring tools, the web portal, etc. are the responsibility of the Operations Team. The On-Call Service is provided by one engineer on duty for a full week (Thursday to Thursday), with one cycle every 4 months, and covers the period covered neither by office hours nor by the operators.
The main role of the on-call person is to:
- identify the affected service/component and assess the critical level of the incident
- isolate it or trigger immediate or delayed intervention by the experts
- coordinate the intervention until the incident is completely handled
The intervention of the on-call engineer can be triggered by:
- the site’s guardian (for power, cooling, fire or flooding incidents)
- the on-site operator (if an incident cannot be fully handled by them)
- the alarm system, through text messages sent to the mobile phone as a result of a detected abnormal situation requiring immediate human intervention
The on-call person also performs active monitoring, i.e. connects to the site to monitor the operational status of the core services, reads e-mails and browses tickets opened by end users.
Tools available to the on-call person are:
- an account for e-mail and for interactive intervention on critical services
- an internet connection from home for the people likely to intervene remotely on the services
- a site-managed VPN already in place
There is a separate on-call service (and procedures) for facility-related incidents such as cooling, power, fire, flooding, etc.
Monitoring & Alarms
There is a collection of web-based tools for having a quick overview of the operational status of the services, for example:
- Number of queued and running jobs over the last 24 hours
- Details of jobs that ended early (probably abnormally)
- List of jobs that look stuck (from the batch system point of view)
- Global status of core services like batch, AFS, HPSS, dCache, cartridge library, …
- Connectivity status of the site.
There are also home-grown tools for centralizing messages from every service/machine, and a web-based interface with links to hints associated with each message and the actions to be triggered.
The NGOP-based alarm system extracts information from this message repository and triggers alarms (if needed) such as SMS messages or e-mail.
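The flow described above can be pictured with a minimal sketch in Python; the file locations, matching rules and addresses below are hypothetical illustrations, not the actual IN2P3/NGOP configuration. A process scans the centralized message repository and notifies the on-call e-mail (and, for critical severities, the mobile phone) when a message matches a rule requiring immediate human intervention.

#!/usr/bin/env python3
# Minimal sketch of an alarm filter over a centralized message repository.
# File names, rules and addresses are hypothetical; the real IN2P3 system
# is based on NGOP and home-grown tools.
import re
import smtplib
from email.mime.text import MIMEText

# Each rule: (pattern matched against a message, severity, hint page for the on-call person)
RULES = [
    (re.compile(r"HPSS .* not responding"), "critical", "https://example.org/hints/hpss"),
    (re.compile(r"dCache pool .* offline"), "critical", "https://example.org/hints/dcache"),
    (re.compile(r"AFS server .* timeout"),  "warning",  "https://example.org/hints/afs"),
]

def notify(severity, message, hint):
    """E-mail the alarm; a 'critical' severity would additionally trigger an
    SMS to the on-call mobile phone (gateway not shown in this sketch)."""
    mail = MIMEText("%s\nHint: %s" % (message, hint))
    mail["Subject"] = "[%s] site alarm" % severity.upper()
    mail["From"] = "alarms@example.org"        # hypothetical addresses
    mail["To"] = "on-call@example.org"
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(mail)

def scan(repository_path):
    """Scan the centralized message repository and raise matching alarms."""
    with open(repository_path) as repo:
        for line in repo:
            for pattern, severity, hint in RULES:
                if pattern.search(line):
                    notify(severity, line.strip(), hint)

if __name__ == "__main__":
    scan("/var/log/site/messages")             # hypothetical repository location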
Introducing new services (e.g. grid components) into their system is greatly facilitated if there are:
- automatic ways to remotely query/detect the operational status of the service
- interface to interact with the service (start, stop, drain, shutdown, etc).
Standardizing this interface (protocol and actions) for all the grid services would be extremely helpful, as would, for instance, standardized formats and locations of log files.
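As an illustration of the kind of standardized interface being asked for, the following minimal Python sketch defines a uniform set of control actions and a status query that every grid service could expose. The class and method names, and the log path, are hypothetical and do not correspond to any existing middleware API.

# Illustrative sketch only: a uniform control/status interface for grid
# services, following the actions listed above. Names are hypothetical.
from abc import ABC, abstractmethod

class GridServiceControl(ABC):
    """Interface a site operator (or a remote monitoring tool) could call
    in the same way for any grid service."""

    @abstractmethod
    def status(self) -> str:
        """Return 'up', 'degraded' or 'down', for remote querying."""

    @abstractmethod
    def start(self) -> None:
        """Start the service."""

    @abstractmethod
    def stop(self) -> None:
        """Stop the service immediately."""

    @abstractmethod
    def drain(self) -> None:
        """Stop accepting new work but let running work finish."""

    @abstractmethod
    def shutdown(self) -> None:
        """Shut down cleanly, e.g. before a scheduled intervention."""

    def log_file(self) -> str:
        """Standardized log location, as suggested in the minutes."""
        return "/var/log/grid/%s.log" % type(self).__name__.lower()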
L.Robertson asked how problems involving other sites or experiments are solved. F.Hernandez explained that the telephone number in the Contacts page is the PBX number of the site: calls to this number will be answered by the guardian (who does not necessarily speak English) during non-office hours.
Note: Actually the language can be a general issue, if the operators or guardians do not speak English.
1.3 BNL (transparencies) – B.Gibbard
The ATLAS Tier-1 at BNL is co-located and co-operated with the RHIC Computing Facility (RCF), and many facilities and procedures are shared:
- use of redundancy in critical elements
- 24 x 7 operational coverage of fabric services is jointly maintained by RCF and ATLAS Tier 1 staff
RHIC has been fully operational for five years, including 24x7 raw data recording in ~real time at the RCF at rates up to 300 MB/sec, during accelerator runs of 25-30 weeks/year.
The coverage includes:
- 16 x 7 on-site staff coverage: 2 operators extend coverage to weekends and evenings
- 24 x 7 on-call staff coverage for critical services. On-call activation can come from automated systems, operators/other staff members, or selected representatives of critical user communities (experiment production groups, DAQ groups, etc.)
The BNL CC uses the RT problem-tracking system and Nagios-based monitoring by individual subsystems; planning for facility-wide unification is underway.
The Processor Farms already use a Nagios-based monitoring system, providing automated monitoring and paging of staff for failures of power/cooling, Condor and the management databases.
The Storage Systems (Tape/HPSS) have performance and monitoring displays available, but these are not integrated into a unified (Nagios) automated monitoring framework; therefore people are required for early failure detection.
The Central Disk farm has performance and monitoring displays available and some are integrated into Nagios.
The dCache Disk farm, used by both RHIC & ATLAS, has performance and monitoring information available, but no automated monitoring is yet in place, so human intervention is needed for early failure detection.
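To make concrete what integrating dCache into such a framework could look like, here is a minimal sketch of a Nagios-style check plugin in Python; the pool-status file and thresholds are hypothetical, not BNL's actual setup. Nagios plugins only need to print a one-line status and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

#!/usr/bin/env python3
# Sketch of a Nagios-style check for dCache pool availability.
# The pool-status file and the warning threshold are hypothetical.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_pool_states(path="/var/run/dcache/pool-status.txt"):
    """Read 'poolname state' pairs produced by a local collector (hypothetical)."""
    states = {}
    try:
        with open(path) as f:
            for line in f:
                name, state = line.split()
                states[name] = state
    except OSError as err:
        print("UNKNOWN - cannot read pool status: %s" % err)
        sys.exit(UNKNOWN)
    return states

def main():
    states = read_pool_states()
    if not states:
        print("UNKNOWN - no pools reported")
        sys.exit(UNKNOWN)
    down = sorted(p for p, s in states.items() if s != "up")
    if not down:
        print("OK - all %d dCache pools up" % len(states))
        sys.exit(OK)
    if len(down) <= 2:                       # hypothetical warning threshold
        print("WARNING - pools down: %s" % ", ".join(down))
        sys.exit(WARNING)
    print("CRITICAL - %d pools down: %s" % (len(down), ", ".join(down)))
    sys.exit(CRITICAL)

if __name__ == "__main__":
    main()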
There are relatively few issues of interoperability (OSG ↔ EGEE) at the data transfer, storage and management levels:
- But the ATLAS DDM system is a critical element with a complex interaction with underlying facility services and no automated monitoring
- And externally reported failures and locally detected problems still need to be integrated into the overall monitoring framework
Significant interoperability issues with accounting, allocation, work flow management, etc. are still present:
- Work toward interoperability in these areas is ongoing
- Integration with local monitoring under review but more problematic
The primary mechanism for grid production and analysis is the US-ATLAS-specific “PanDA” job management system. Facility functions (and problems) are, or will be, masked by PanDA, and reported problems will need interpretation involving the PanDA team’s expertise.
For SC and ATLAS CSC activities, an on-call list for critical services will be prepared and made accessible to the OSG GOC, the PanDA operations team and other “power” users.
The Tier-1 infrastructure will evolve from the current RHIC-centric 24x7 operations by:
- Increasing automation (especially in dCache and Grid related services, which have immediate impact on operations)
- Unifying monitoring (under Nagios umbrella)
- Adding an additional operator, allowing extension of on-site coverage to ~24 x 7
- Integrating problem report/tracking systems (RT – Footprint (OSG GOC at IU) and Footprint – GGUS)
- Codifying the discrimination of PanDA and/or DDM problems versus underlying Tier-1 operations problems
The target is to establish comprehensive ATLAS-directed ~24x7 operations by Jan ’07.
M.Delfino proposed that other sites also monitoring dCache should work together in order to share tools and experience. B.Gibbard agreed, and F.Hernandez stated that IN2P3 is also interested.
First discussion on what accounting data should be shown to the C-RRB in October. A proposal has to be made to the Overview Board on 11 September.
The data accounted until now needs to be shown to the C-RRB. The MB should send feedback to L.Robertson and prepare for the MB meeting at BNL, where it will be decided what is going to be shown (for instance, wall-clock vs. CPU time, etc.).
M.Delfino noted that the period to report on must also be agreed. L.Robertson answered that the full data is available only from April 2006 and could be provided through August for the C-RRB.
3.1 Next week MB at BNL (agenda)
The proposed agenda is available and includes further 24x7 support talks; the proposed sites are FZK, SARA and TRIUMF.
Update: The presentations on 24x7 support may be moved to the GDB in BNL.
The meeting will start at 9 AM at BNL, but the participants will have to check in and therefore should arrive before 8 AM.
B.Gibbard has distributed more information:
For those of you who will be attending the Sept GDB meeting at BNL.
In answer to a question that was asked at the MB meeting today: you should not arrive at the BNL gate before 7:00 on Tuesday morning, since that is when the person who will be checking people in and issuing badges arrives.
4. Summary of New Actions
The full Action List, with current and past items, will be on this wiki page before the next MB meeting.