LCG Management Board

Date/Time:

Tuesday 29 August 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063099

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 - 6.9.2006)

Participants:

A.Aimar (notes), M.Barroso Lopez, L.Bauerdick, K.Bos, N.Brook, F.Carminati, T.Cass, M.Delfino, D.Foster, B.Gibbard, F.Hernandez, J.Knobloch, R.Jones, M.Lamanna, H.Marten, P.Mato, B.Panzer, Di Quing, H.Renshall, L.Robertson (chair), M.Schulz, Y.Schutz, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

LCG Management Board at BNL
Tuesday 5 September from 9:00 to 11:00 (15:00 to 17:00 CERN time)

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

Comments from M.Ernst and K.Bos received and added to the minutes.

Minutes approved by the MB.

1.2         Sites Reliability

Mail 24.8.2006: Each site (CERN + Tier-1s) is to provide a report with the reasons for each failure of the SAM basic tests at their site in July and the first half of August, to be emailed to the MB before the end of the month.

 

Update: The answers to the SAM reports of each site are available at this page: https://twiki.cern.ch/twiki/bin/view/LCG/SamReports

1.3         Sites and Experiments Contacts (more information)

Sites and Experiments should check and update the information.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement it.

J.Gordon will present some use cases at the GDB in September.

  • 31 Jul 06 - Sites and experiments should check and update the data in the "Contact Page" page below.

From J.Shiers: The contacts page on the SC Wiki has been used for this purpose for >1 year and is regularly updated.

  • 31 Jul 06 - Sites should exchange more information about monitoring, alarming and 24x7 support in the framework of HEPIX.

Not done. J.Shiers notes that the SC Tech Day, to be held at CERN on September 15th 2006, has a session devoted to this issue.

I.Bird proposed to HEPIX the organization of a workshop during the next HEPIX meeting in October. For the moment it is not clear whether this will be possible; more news from I.Bird in the coming weeks.

  • 31 Jul 06 - Experiments should express what they really need in terms of interoperability between EGEE and OSG. Experiments agreed to send information to J.Shiers.

Not done. Is the current situation satisfactory, so that this action is no longer needed? The experiments have one more week to answer; then the action will be removed.

Not complete. One more week, then the values will be considered agreed by the sites and experiments.

For PIC the procurement is somewhat late, and in October it will not be possible to carry out the upcoming CMS activities while also continuing to provide resources for ATLAS. N.Brook noted that the same applies to LHCb, and H.Renshall has to update the table with the latest values.

  • 25 Aug 06 - All experiments - to provide by the end of the week (24 August) for each of their Tier-1 sites their requirements for T1-T1 bandwidth, T0-T1 bandwidth and storage space by mass storage class for purposes other than those already included in the T1/T2 relationship table (see email from Chris Eck).

Done. The revised numbers on the experiments’ resource requirements for 2007 and 2008 were sent to C.Eck (CMS will do so before the end of the week).

 

3.      24x7 Support

 

3.1         CERN (transparencies) - T.Cass

 

The basic support arrangements provided are:

-          Service Redundancy & Load Balancing.
To prevent simple hardware problems from reducing the service below LCG thresholds.

-          Console Operators (Contract).
Present 24x7x52 and able to follow basic procedures to resolve known problems (e.g. rebooting or removing nodes, etc)

-          System Administration Team (Technicians).
On call 24x7x52 and able to follow more complex procedures to resolve problems and to diagnose more complex issues before calling the next level.

-          Service Managers (Engineers).
No formal callout arrangements and, if possible, they are not called overnight. Their skill is required to restore service in the event of complex problems.

-          Network support is best effort outside working hours.
It is handled only by network experts and is not covered by the system administrators on piquet duty; the operator will only be able to call the network experts if there is a problem.

 

Some services, especially at the Grid level, are not yet documented well enough for Operators and System Administrators to be able to intervene. Work is in progress to bring their level of support up to the standard of the other services provided by the CC.

 

Some services with defined response times in the MoU are complex, e.g. “Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation”. Problems may depend on remote sites and/or on services run by the experiments, so the necessary contact details are needed.

 

Note: There is an item in the action list requiring that the contact details be maintained in the Contacts page. Sites and experiments should keep it up to date.

 

In conclusion:

-          The basic support structure is in place and will improve as experience is gained during the ongoing service challenges.

-          The MoU targets in terms of the delay in responding to operational problems will certainly be met, and at more than a trivial “acknowledgement” level.

-          An ability to fix problems rapidly enough to ensure that the average availability targets can be met has not been demonstrated. In particular, there are concerns about (1) the extent to which engineer level expertise will be required (especially in the database area), and (2) the dependence on experiment servers and services at the remote end of any data transfers.

 

J.Knobloch added that 24x7 support of the databases, which requires advanced DBA skills and cannot be provided by a technician, is still at the “best effort” level of the Service Managers. Discussions are underway with CERN IT in order to guarantee a better support level.

 

3.2         IN2P3 (transparencies) - F.Hernandez

 

The CC at IN2P3 provides:

-          Round-the-clock site operation.
Public holidays, Christmas and New Year days included

-          Office time
Monday to Friday: 8h-18h

-          Operator presence
Monday to Friday: 7h30-21h
Saturday, Sunday: 10h-17h45

-          Guardian presence
Monday to Friday: 18h-8h
Saturday, Sunday and public holidays: 24h

 

On-Call Service

An “On-Call Service” maintains the core site services at the best possible level of operation during non-office hours. The strategy is to automatically detect malfunctioning components as early as possible and, if needed, trigger human intervention.

Coordination, planning, documentation, monitoring tools, the web portal, etc. are the responsibility of the Operations Team. The On-Call Service is provided by one engineer on duty for a full week (Thursday to Thursday), with the rota cycling every 4 months, and covers the periods not covered by office hours or by the operators.

 

The main role of the on-call person is to:

-          identify the affected service/component and assess the criticality of the incident

-          isolate it or trigger immediate or delayed intervention by the experts

-          coordinate the intervention until the incident is completely handled

 

The intervention of the on-call engineer can be triggered by:

-          the site’s guardian (for power, cooling, fire or flooding incidents)

-          the on-site operator (if they cannot fully handle an incident)

-          the alarm system, through text messages sent to the mobile phone when an abnormal situation requiring immediate human intervention is detected

 

The on-call person also performs active monitoring, i.e. connects to the site to monitor the operational status of the core services, reads e-mails and browses tickets opened by end users.

 

Tools available to the on-call person are:

-          Dedicated account, for e-mail and for interactive intervention on critical machines/services.
Appropriate privileges are granted to this account.

-          Dedicated mobile phone
With access to the personal phone numbers of the entire site’s staff.
Some level of organization is needed to know whom to call, and when, for each service.

-          Instant messaging
Jabber-based private server managed by the site
For coordinating remote interventions by experts

-          Broadband internet connection from home for people likely to intervene remotely on the site
Provided by any commercial ISP
The connection bill is paid by the individual and reimbursed by the site

-          Site-managed VPN in place
An internal web portal gives access to a collection of monitoring tools

 

There is a separate on-call service (and procedures) for facility-related incidents such as cooling, power, fire, flooding, etc.

 

Monitoring & Alarms

There is a collection of web-based tools for getting a quick overview of the operational status of the services, for example:

-          Number of queued and running jobs over the last 24 hours

-          Details of jobs that ended early (probably abnormally)

-          List of jobs that look stuck (from the batch system point of view)

-          Global status of core services like batch, AFS, HPSS, dCache, cartridge library, …

-          Connectivity status of the site.

 

There are also home-grown tools for centralizing messages from every service/machine, and a web-based interface with links to hints associated with each message and to the actions to be triggered.

 

The NGOP-based alarm system extracts information from this message repository and, if needed, triggers alarms such as SMS messages or e-mails.
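
The minutes do not describe the internals of this NGOP-based chain. As an illustration only, the Python sketch below shows the general idea of an alarm filter that scans a centralized message repository and notifies the on-call engineer by e-mail or SMS; the file path, message patterns and addresses are hypothetical placeholders, not IN2P3 configuration.

# Minimal sketch of an alarm filter over a centralized message repository.
# All paths, patterns and addresses below are hypothetical placeholders.
import re
import smtplib
from email.mime.text import MIMEText

MESSAGE_LOG = "/var/log/site/messages.log"         # hypothetical repository export
ALARM_RULES = [
    # (regex on the message, severity, short hint for the on-call engineer)
    (re.compile(r"HPSS .* mover down"), "CRITICAL", "Check HPSS movers"),
    (re.compile(r"dCache pool .* offline"), "CRITICAL", "Restart or drain the pool"),
    (re.compile(r"batch queue .* stuck"), "WARNING", "Inspect the batch scheduler"),
]
ONCALL_EMAIL = "oncall@example.org"                # hypothetical address
ONCALL_SMS_GATEWAY = "0600000000@sms.example.org"  # hypothetical e-mail-to-SMS gateway


def send_mail(to_addr: str, subject: str, body: str) -> None:
    """Send a plain-text alarm through the local mail relay."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "alarms@example.org"
    msg["To"] = to_addr
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def scan_repository() -> None:
    """Scan the messages and raise alarms for those matching the configured rules."""
    with open(MESSAGE_LOG) as log:
        for line in log:
            for pattern, severity, hint in ALARM_RULES:
                if pattern.search(line):
                    subject = f"[{severity}] {line.strip()}"
                    send_mail(ONCALL_EMAIL, subject, f"Hint: {hint}\n\n{line}")
                    if severity == "CRITICAL":
                        # Critical alarms also reach the on-call mobile phone.
                        send_mail(ONCALL_SMS_GATEWAY, subject, hint)
                    break


if __name__ == "__main__":
    scan_repository()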

 

New services

Introducing new services (e.g. grid components) into their system is greatly facilitated if there are:

-          automatic ways to remotely query/detect the operational status of the service

-          an interface to interact with the service (start, stop, drain, shutdown, etc.)

 

Standardizing this interface (protocol and actions) for all the grid services would be extremely helpful, as would, for instance, standardized formats and locations for log files.
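
As an illustration of the kind of uniform interface discussed above, the Python sketch below defines a hypothetical common wrapper with start/stop/drain/status actions and an agreed log location. Neither the class nor the log-file convention is an existing grid standard; it only shows what such a convention could look like.

# Hypothetical example of a uniform control interface for grid services.
# Neither the class nor the log-file convention is an existing standard.
import subprocess
from abc import ABC, abstractmethod


class GridServiceControl(ABC):
    """Common actions every service wrapper would have to implement."""

    # Agreed location for service logs, e.g. /var/log/grid/<service>.log (hypothetical).
    LOG_DIR = "/var/log/grid"

    def __init__(self, name: str):
        self.name = name
        self.log_file = f"{self.LOG_DIR}/{name}.log"

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def drain(self) -> None:
        """Stop accepting new work but let running work finish."""

    @abstractmethod
    def status(self) -> str:
        """Return a short machine-readable status string (e.g. 'up', 'down')."""


class InitScriptService(GridServiceControl):
    """Adapter for services already managed by a classic init script."""

    def __init__(self, name: str, init_script: str):
        super().__init__(name)
        self.init_script = init_script

    def _run(self, action: str) -> str:
        result = subprocess.run([self.init_script, action],
                                capture_output=True, text=True)
        return result.stdout.strip()

    def start(self) -> None:
        self._run("start")

    def stop(self) -> None:
        self._run("stop")

    def drain(self) -> None:
        # Many services have no native drain; falling back to stop is one possible choice.
        self._run("stop")

    def status(self) -> str:
        return "up" if "running" in self._run("status").lower() else "down"

A real standard would also have to specify the drain semantics explicitly for services that have no native drain action; the fallback to stop shown here is just one possible choice.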

 

L.Robertson asked how problems involving other sites or experiments are solved. F.Hernandez explained that the telephone number in the Contacts page is the PBX number of the site: calls to this number are answered, during non-office hours, by the guardian, who does not necessarily speak English.

A suggested way of solving this problem, within the framework of the current implementation of the on-call service, would be to agree on well-defined operational messages sent to a particular e-mail address (or to the central User Support address), so that the alarm system in place can trigger an intervention by the on-call engineer if needed.
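
As an example of what such well-defined operational messages could look like, the Python sketch below parses a hypothetical subject-line convention that an alarm system could match against; the format, severities and example text are invented for illustration and are not an agreed standard.

# Sketch of one possible convention for well-defined operational messages.
# The subject-line format and the example are hypothetical, not an agreed standard.
import re
from typing import NamedTuple, Optional

# e.g. "GRID-ALARM CRITICAL IN2P3 SRM: transfers from CERN failing since 02:00"
SUBJECT_FORMAT = re.compile(
    r"^GRID-ALARM (?P<severity>CRITICAL|WARNING) "
    r"(?P<site>\S+) (?P<service>[^:]+): (?P<summary>.+)$"
)


class OperationalMessage(NamedTuple):
    severity: str
    site: str
    service: str
    summary: str


def parse_subject(subject: str) -> Optional[OperationalMessage]:
    """Return the parsed message, or None if the subject does not follow the convention."""
    match = SUBJECT_FORMAT.match(subject.strip())
    if match is None:
        return None
    return OperationalMessage(**match.groupdict())


if __name__ == "__main__":
    msg = parse_subject("GRID-ALARM CRITICAL IN2P3 SRM: transfers from CERN failing since 02:00")
    if msg and msg.severity == "CRITICAL":
        # A real alarm system would page the on-call engineer at this point.
        print(f"Would page on-call engineer: {msg.site}/{msg.service} - {msg.summary}")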

Note: Language can be a general issue if the operators or guardians do not speak English.

3.3         BNL (transparencies) - B.Gibbard

 

The ATLAS Tier-1 at BNL is co-located and co-operated with the RHIC Computing Facility (RCF), and many facilities and procedures are shared:

-          Common use of redundancy in critical elements
Fail-over and/or graceful degradation of services
RAID controllers/servers, gateway system, LDAP servers, etc.

-          24 x 7 operational coverage of fabric services is jointly maintained by RCF and ATLAS Tier 1 staff

 

24x7 Coverage

RHIC has been fully operational for five years, including 24x7 raw data recording in near real time at the RCF at rates of up to 300 MB/sec during accelerator runs, which last 25-30 weeks/year.

 

The coverage includes:

-          16x7 on-site staff coverage: 2 operators extend coverage to weekends and evenings

-          24x7 on-call staff coverage for critical services. On-call activation is sometimes done by automated systems, sometimes by operators/other staff members or by selected representatives of critical user communities (experiment production groups, DAQ groups, etc.)

 

Monitoring

The BNL CC uses the RT problem-tracking system and Nagios-based monitoring of individual subsystems; planning for facility-wide unification is underway.

 

The Processor Farms already use a Nagios-based monitoring system providing automated monitoring and paging of staff for failures of power/cooling, Condor and the management databases.

 

The Storage Systems (Tape/HPSS) have performance and monitoring displays available, but these are not integrated into a unified (Nagios) automated monitoring framework; therefore people are required for early failure detection.

 

The Central Disk farm has performance and monitoring displays available and some are integrated into Nagios.

 

The dCache Disk farm, used by both RHIC and ATLAS, has performance and monitoring information available, but no automated monitoring is in place yet, so human intervention is needed for early failure detection.
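
As an illustration of the kind of automated check that could close this gap, the sketch below is a Nagios-style plugin written in Python; the status URL, the page-scraping heuristic and the thresholds are hypothetical, since the minutes give no details of the dCache interfaces at BNL. Only the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is standard.

# Sketch of a Nagios-style plugin that could automate part of the dCache checks.
# The status URL, heuristic and thresholds are hypothetical; only the standard
# Nagios exit codes (0/1/2/3) are a real convention.
import sys
import urllib.request

STATUS_URL = "http://dcache-head.example.org:2288/"   # hypothetical info page
TIMEOUT_SECONDS = 15

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def main() -> int:
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=TIMEOUT_SECONDS) as response:
            page = response.read().decode("utf-8", errors="replace")
    except Exception as exc:  # network error, timeout, HTTP error, ...
        print(f"CRITICAL - cannot reach dCache status page: {exc}")
        return CRITICAL

    # Hypothetical heuristic: count pools reported as offline on the page.
    offline = page.lower().count("offline")
    if offline == 0:
        print("OK - no offline pools reported")
        return OK
    if offline <= 2:
        print(f"WARNING - {offline} pool(s) reported offline")
        return WARNING
    print(f"CRITICAL - {offline} pool(s) reported offline")
    return CRITICAL


if __name__ == "__main__":
    sys.exit(main())

Such a plugin would then be wired into Nagios through an ordinary command and service definition, in the same way as the existing Processor Farm checks.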

 

Grid-related Monitoring

There are relatively few interoperability issues (OSG-EGEE) at the data transfer, storage and management levels:

-          But the ATLAS DDM system is a critical element with complex interactions with the underlying facility services and no automated monitoring

-          And externally reported failures and locally detected problems still need to be integrated into the overall monitoring framework

 

Significant interoperability issues with accounting, allocation, workflow management, etc. are still present:

-          Work toward interoperability in these areas is ongoing

-          Integration with local monitoring is under review but more problematic

 

The primary mechanism for Grid production and analysis is the US-ATLAS-specific “PanDA” job management system. Facility functions (and problems) are or will be masked by PanDA, and reported problems will need interpretation involving the PanDA team’s expertise.

 

For the SCs and the ATLAS CSC activities, an on-call list for critical services will be prepared and made accessible to the OSG GOC, the PanDA operations team and other “power” users.

 

Evolution

The Tier-1 infrastructure will evolve from the current RHIC-centric 24x7 operations by:

-          Increasing automation (especially in dCache and Grid related services, which have immediate impact on operations)

-          Unifying monitoring (under Nagios umbrella)

-          Adding an additional operator, allowing extension of on-site coverage to ~24x7

-          Integrating the problem report/tracking systems (RT – Footprint (OSG GOC at IU) and Footprint – GGUS)

-          Codifying the discrimination of PanDA and/or DDM problems versus underlying Tier-1 operations problems

 

Target

The target is to establish comprehensive ATLAS-directed ~24x7 operations by Jan ’07.

 

M.Delfino proposed that other sites also monitoring dCache should work together in order to share tools and experience. B.Gibbard agreed, and F.Hernandez stated that IN2P3 is also interested.

 

4.      Accounting data for the C-RRB - L.Robertson

This was a first discussion on what accounting data should be shown to the C-RRB in October. A proposal has to be made to the Overview Board on 11 September.

 

 

The data accounted until now need to be shown to the C-RRB. The MB should send feedback to L.Robertson and prepare for the MB meeting at BNL, where it will be decided what is going to be shown (for instance, wall-clock vs. CPU time, etc.).

 

M.Delfino noted that the period to report on must also be agreed. L.Robertson answered that full data are available only from April 2006 and could be provided through August for the C-RRB.

 

5.      AOB

 

5.1         Next Week’s MB at BNL (agenda)

 

The proposed agenda is available and includes further 24x7 support talks, the proposed sites being FZK, SARA and TRIUMF.

 

Update: The presentations on 24x7 support may be moved to the GDB in BNL.

 

The meeting will start at 9 AM at BNL, but participants will have to check in and should therefore arrive before 8 AM.

 

B.Gibbard has distributed more information:

For those of you who will be attending the Sept GDB meeting at BNL.

In answer to a question that was asked at the MB meeting today: you should not arrive at the BNL gate before 7:00 on Tuesday morning, since that is when the person who will be checking people in and issuing badges arrives.
We are advising the Hampton Inn shuttle bus for participants of the GDB meeting to leave the Hampton Inn at 7:15 on Tuesday, arriving at the BNL gate at ~7:30. If all goes well, this should allow people to get over to Berkner Hall at 8:00 when the continental breakfast is served. If things go more slowly they should still arrive by 8:30 allowing them to get something to eat before the MB meeting starts at 9:00.

 

6.      Summary of New Actions

 

 

 

The full Action List, with current and past items, will be in this wiki page before the next MB meeting.