VOMS Post Mortem for December 1st 2008.

There were in fact three separate incidents on the 1st of December. The first two followed the service upgrade that took place from 08:30 UTC to 10:30 UTC. The last was unrelated.

Reduced Proxy Lengths

Following the VOMS service upgrade at 10:30 UTC, the lifetime of voms-proxies was reduced to the default for all VOs, i.e. 86400 seconds (24 hours), whereas CMS, for instance, should get 691200 seconds (8 days). This was corrected around 22:00 after notification from Andreas.

Comments

Bad logic from me in a YAIM script: the defaults were used in place of the VO-specific values. Of course it should not have happened, but it would have been detected in 30 seconds if any of CMS, LHCb or ATLAS had tested the VomsPilot.
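
The failure mode above (a default silently winning over a VO-specific value) can be illustrated with a minimal sketch. It is written in Python and is not the actual YAIM script; the names DEFAULT_LIFETIME, VO_LIFETIMES, proxy_lifetime and check_lifetimes are hypothetical, and only the two values 86400 and 691200 are taken from this incident.

```python
# Hypothetical illustration only; this is not the actual YAIM script.
DEFAULT_LIFETIME = 86400  # seconds, the default quoted in this report

# VO-specific overrides; only the CMS value is taken from this report.
VO_LIFETIMES = {"cms": 691200}


def proxy_lifetime(vo, overrides=VO_LIFETIMES, default=DEFAULT_LIFETIME):
    """Return the VO-specific lifetime if one is set, else the default."""
    return overrides.get(vo.lower(), default)


def check_lifetimes(observed, overrides=VO_LIFETIMES, default=DEFAULT_LIFETIME):
    """Return the VOs whose observed lifetime differs from the expected one."""
    return sorted(
        vo
        for vo, seconds in observed.items()
        if seconds != proxy_lifetime(vo, overrides, default)
    )
```

A post-upgrade check along these lines, fed with the lifetimes actually issued for each VO, would presumably have flagged CMS straight away: check_lifetimes({"cms": 86400}) reports "cms" as wrong.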

Misconfigured voms-admin Using Read Only Databases.

Following the VOMS service upgrade at 10:30 UTC, voms-admin was misconfigured with read-only database accounts. The consequence was that, while user requests could be processed, the resulting changes were not actually implemented. This was fixed around 21:00 after notification from Tanya. All pending registration updates will now have been implemented.

Comments

Only oversight by me, nothing else.
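
A write probe run after an upgrade is one way such a misconfiguration might be caught early. The sketch below is an assumption-laden illustration, not part of voms-admin: it uses Python's sqlite3 module as a stand-in for the production database, and the function name account_can_write and the table name write_probe are hypothetical. A real check would need adapting to the database and accounts actually used.

```python
import sqlite3


def account_can_write(conn):
    """Return True if this connection is allowed to write, else False.

    The probe creates a scratch table inside an explicit transaction
    and always rolls back, so nothing is left behind when writing works.
    """
    try:
        conn.execute("BEGIN")
        conn.execute("CREATE TABLE write_probe (id INTEGER)")
        return True
    except sqlite3.OperationalError:
        # Raised by sqlite3 when the database is opened read-only.
        return False
    finally:
        conn.rollback()
```

With sqlite3, a read-only connection can be simulated by opening the file with a URI of the form file:PATH?mode=ro and uri=True.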

Security Scan Caused voms116 Crash

At around 23:13 a security scan started on voms116, part of lcg-voms.cern.ch. About 30 minutes later the node, and lcg-voms.cern.ch as a whole, became unavailable. A high_load alarm was raised and the piquet (presumably) rebooted the node around 00:30 to restore service.

Comments

  • The good news is that, unlike last time, an alarm was raised and the operator and piquet restored the service.
  • The bad news is that, even in this situation, HALinux should have detected the failure and failed the service over to the hot standby voms115.cern.ch within a matter of minutes.

Review of HALinux Configuration.

I reviewed the HALinux configuration for voms116 (master at the time) and voms115 (slave at the time). The logs for the slave voms115 clearly show entries such as

Dec  2 00:13:43 voms115 heartbeat: [3585]: WARN: node voms116.cern.ch: is dead

but there was still no failover. After rereading the documentation, it turns out that the HALinux configuration is missing something, and has been from the start. The addition of

ping 128.142.166.195                                     # Ip Address of Switch serving voms115 and 116.
respawn hacluster /usr/lib/heartbeat/ipfail
should detect this problem. voms115 and voms116 compare their ping contact to 128.142.166.195 and let the other node know the result via the private crossover cable. If one host is bad (e.g. its network cable has been pulled out), the service will be stolen from the bad node. Documentation for this is here: ipfail.
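
The comparison described above can be modelled with a short sketch. This is a simplified illustration of the idea for a single ping node, not the ipfail implementation; the function names should_give_up_resources and service_holder are hypothetical.

```python
def should_give_up_resources(my_ping_ok, peer_ping_ok, i_hold_resources):
    """Simplified model of the comparison for one ping node (the switch).

    The node holding the service releases it only when it has lost
    contact with the ping node while its peer still has contact.
    """
    return i_hold_resources and not my_ping_ok and peer_ping_ok


def service_holder(master, slave, master_ping_ok, slave_ping_ok):
    """Return the node that should hold the service after the comparison."""
    if should_give_up_resources(master_ping_ok, slave_ping_ok, True):
        return slave
    return master
```

In this model, service_holder("voms116", "voms115", False, True) returns "voms115", while a switch outage seen by both nodes leaves the service where it is, since moving it would not help.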

The MyProxy service managers have also been informed of this. These extra configuration options are implemented in a new ncm-linuxha-1.1.2-1.noarch.rpm.

-- SteveTraylen - 02 Dec 2008

Topic revision: r1 - 2008-12-02 - SteveTraylen
 