Top-tips for ROCs

The weighted regional availability and reliability are calculated by giving more importance to larger sites. Site size is currently determined by the KSI2K value that each site publishes in the Information System (in the future, installed capacity will be measured in HEPSpec2006). Therefore, if you have large, stable sites, make sure that they're publishing the correct figures (if nothing is published, the weight of the site is 1).
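
As a rough illustration of the weighting described above, the sketch below computes a capacity-weighted mean. The exact formula used in the reports is not given on this page, so the weighted mean itself, the function name `weighted_regional_figure` and the sample numbers are all assumptions; only the default weight of 1 for sites that publish nothing comes from the text.

```python
def weighted_regional_figure(sites):
    """Capacity-weighted mean of per-site availability (or reliability).

    `sites` is a list of (percentage, ksi2k) pairs. ksi2k is None when the
    site publishes no value, in which case its weight is 1.
    NOTE: the weighted mean is an assumed formula, not the official one.
    """
    if not sites:
        raise ValueError("at least one site is required")
    weighted_sum = 0.0
    total_weight = 0.0
    for percentage, ksi2k in sites:
        # Nothing published -> weight of 1, as stated in the text above.
        weight = 1.0 if ksi2k is None else float(ksi2k)
        weighted_sum += weight * percentage
        total_weight += weight
    if total_weight == 0:
        raise ValueError("total weight is zero; check the published values")
    return weighted_sum / total_weight


# Hypothetical region: two large sites and one small site publishing nothing.
region = [(99.0, 2000), (95.0, 500), (0.0, None)]
regional = weighted_regional_figure(region)
plain_mean = sum(p for p, _ in region) / len(region)
print(f"weighted: {regional:.2f}%  unweighted: {plain_mean:.2f}%")
```

Under this assumed formula, a large, stable site that publishes nothing counts no more than the smallest site in the region, which is why correct published figures matter.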

Weed out the sites that have recurring 0% availability or very low reliability! Flag them as either non-production or uncertified in GOCDB: a site must be both in production and certified to be included in the reports, so either flag removes it. A few badly performing sites will bring down your region's figures! Sites that provide very few resources and do not have good availability figures might be more trouble than they're worth. Consider the support burden versus the benefits of including such sites in the production infrastructure.

Top-tips for site administrators

Your main priority should be to keep the user community happy with an available grid! The reason that SA1 management and your ROC are checking your site's performance is not to administer punishment, but to identify problem areas and to seek solutions. If you need help, don't hesitate to ask your ROC - that's what it's there for!

If you have not already done so, please install the latest EGEE Nagios site monitoring package (information at GridMonitoringNcgYaim). This will provide you with fabric monitoring and rapid notification in case of problems. Early detection of problems is key to providing a good service!

Use fabric monitoring to ensure that your redundant hardware configurations are not running in degraded mode (e.g. if a redundant power supply is disconnected, you don't have redundancy!).

If you have any planned interventions to make (software upgrades, new air-conditioning etc.), don't forget to declare a downtime in GOCDB beforehand! This should be done at least 24 hours prior to the outage, and will ensure that your reliability figures don't suffer. Remember to drain queues to avoid user jobs being aborted!
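
The 24-hour rule above can be checked before an intervention is announced. The following is a minimal sketch, not a GOCDB client: the function name `has_enough_notice` and the example timestamps are hypothetical, and the 24-hour threshold is taken from the paragraph above.

```python
from datetime import datetime, timedelta

# Minimum notice for a downtime to count as scheduled (from the text above).
MIN_NOTICE = timedelta(hours=24)


def has_enough_notice(declared_at, starts_at, min_notice=MIN_NOTICE):
    """Return True if the downtime is declared at least `min_notice` ahead."""
    return starts_at - declared_at >= min_notice


# Hypothetical intervention: declared on 9 July at 09:00.
declared = datetime(2009, 7, 9, 9, 0)
print(has_enough_notice(declared, datetime(2009, 7, 10, 9, 0)))   # exactly 24 h ahead
print(has_enough_notice(declared, datetime(2009, 7, 10, 8, 0)))   # only 23 h ahead
```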

If the security of your site is compromised, or if you suffer from unexpected hardware or software failures, make sure that you declare an unscheduled downtime to inform your user community. Your availability and reliability will suffer, but your users are more likely to be forgiving when they know what's happening. If you expect the outage to last more than a day, also declare a scheduled downtime as soon as you can, starting once the required 24 hours' notice has passed. This minimizes the impact on your site's reliability: the unscheduled downtime covers the first 24 hours and the scheduled downtime covers the rest. You can shorten declared downtimes if you see that the service will be restored sooner, but you can't lengthen them (other than by declaring a new one) - so err on the safe side!
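
The split between unscheduled and scheduled downtime for a long outage can be sketched as follows. This is an illustration only: `plan_downtimes` and the labels `"UNSCHEDULED"` and `"SCHEDULED"` are hypothetical names, not GOCDB fields, and the example times are made up.

```python
from datetime import datetime, timedelta


def plan_downtimes(now, expected_end, notice=timedelta(hours=24)):
    """Split an unexpected outage into the downtimes to declare.

    The first `notice` period can only be unscheduled; anything beyond it
    can be declared now as a scheduled downtime.
    Returns a list of (kind, start, end) tuples.
    """
    if expected_end <= now:
        raise ValueError("expected_end must be later than now")
    scheduled_start = now + notice
    if expected_end <= scheduled_start:
        # Short outage: too late to schedule any part of it.
        return [("UNSCHEDULED", now, expected_end)]
    return [
        ("UNSCHEDULED", now, scheduled_start),
        ("SCHEDULED", scheduled_start, expected_end),
    ]


# Hypothetical failure at noon, expected to take 60 hours to repair.
failure = datetime(2009, 7, 9, 12, 0)
for kind, start, end in plan_downtimes(failure, failure + timedelta(hours=60)):
    print(kind, start, end)
```

Since declared downtimes can be shortened but not lengthened, it is safer to pass a generous `expected_end`.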

Please note that declaring AT_RISK downtimes is only useful for warning the user community of interventions that are meant to be transparent. AT_RISK interventions are completely ignored by SAM and the GridView availability calculations. Similarly, unscheduled downtimes will help neither your availability nor your reliability. Plan ahead and make sure that your downtimes for maintenance activities or planned upgrades are all scheduled (declared in GOCDB) at least one day in advance!

This is a Twiki, so feel free to add to this list.

N.B. Some useful material was presented at the EGEE'08 conference in Istanbul, in the session Production Sites Best Practices. In summary, the following points were deemed important:

  • Homogeneous, dedicated hardware
  • Planned maintenances
  • Pre-testing of installation on a separate set of VMs
  • Communication: online chat (IRC, Jabber) between ROC and sysadmins, Twiki, mailing lists
  • Training for site-admins
  • Keep up to date: track LCG_ROLLOUT, Operations meetings
  • Good operational procedures (including backups)
  • Lots of monitoring!

-- JohnShade - 09 Jul 2009

Topic revision: r4 - 2009-11-18 - JohnShade
 