VOMS Post Mortem, July 1st 2008

Hardware migration and upgrade of voms-admin and voms-core.

  • The hardware migration and the voms-admin upgrade were both transparent.
  • The voms-core upgrade was not transparent: a change in behaviour forced a downgrade 6 hours later.

Details of the VOMS Proxy Generation Change

The upgrade was from VOMS 1.7 to VOMS 1.8, followed by a downgrade back to 1.7. The FQANs in generated VOMS proxies changed from (1.7):
/dteam/cern/Role=ftsmaster/Capability=NULL
/dteam/Role=NULL/Capability=NULL
/dteam/cern/Role=NULL/Capability=NULL

to (1.8):
/dteam
/dteam/cern
/dteam/cern/Role=ftsmaster

There are two distinct changes here.
  1. The Role=NULL/Capability=NULL suffix was dropped. This broke glite-renewd and also gPlasma in dCache.
    • The change was not meant to happen yet: the new behaviour should have required a configuration option to enable it, but it was enabled by default and could not be switched off.
    • Annoyingly, SARA had reported this before the upgrade (GGUS:36587, BUG:37008), but I had not read the reports closely or considered their significance.
    • We expect to move to this new behaviour with VOMS 1.9, at which point glite-renewd and gPlasma must be altered, and possibly others.
  2. The predictable order of the FQANs changed. The specification allows them to appear in any order, but the previously predictable order has been relied upon in many places.
    • This impacted PANDA production in the US.
    • It also affected WMS interactions for production role users.
    • I expect further breakage, especially for role users.
    • The predictable order will be restored in a future version. BUG:38506 has been accepted and the fix will be integrated before any further upgrade.
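To illustrate how a consumer could tolerate the first change, here is a minimal Python sketch that drops Role=NULL and Capability=NULL components and compares FQAN lists as sets. It is not code from glite-renewd, gPlasma or VOMS; normalize_fqan and same_fqans are hypothetical helpers. A set comparison deliberately ignores order, so it does not help code that treats the first FQAN as the primary one; such code still depends on the ordering that BUG:38506 restores.

```python
# Minimal sketch (hypothetical helpers, not VOMS or client code):
# compare FQAN lists independently of the long (1.7) vs short (1.8)
# form and of their order.

NULL_COMPONENTS = {"Role=NULL", "Capability=NULL"}


def normalize_fqan(fqan: str) -> str:
    """Drop Role=NULL and Capability=NULL components from one FQAN."""
    parts = fqan.strip().split("/")
    return "/".join(p for p in parts if p not in NULL_COMPONENTS)


def same_fqans(a: list[str], b: list[str]) -> bool:
    """True if both lists carry the same FQANs, ignoring form and order."""
    return {normalize_fqan(f) for f in a} == {normalize_fqan(f) for f in b}


# The example FQANs from this post mortem.
FQANS_1_7 = [
    "/dteam/cern/Role=ftsmaster/Capability=NULL",
    "/dteam/Role=NULL/Capability=NULL",
    "/dteam/cern/Role=NULL/Capability=NULL",
]
FQANS_1_8 = ["/dteam", "/dteam/cern", "/dteam/cern/Role=ftsmaster"]

if __name__ == "__main__":
    print(same_fqans(FQANS_1_7, FQANS_1_8))  # True
    # The first entry still differs, even after normalization.
    print(normalize_fqan(FQANS_1_7[0]) == FQANS_1_8[0])  # False
```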

Why Was This Not Detected?

  • These changes were not in the release notes: the first was a bug and was never meant to be present, and the second met the specification, so in principle nothing had changed.
  • They were not picked up during certification because there was some confusion about what we were aiming for: short or long FQANs.
  • There is little certification testing against, for instance, dCache or long-running jobs.
  • They were not picked up within the PPS, since the PPS uses the production VOMS server.

Avoiding This in the Future

  • Create a voms-pilot.cern.ch service running the next version of the software against the production database. This is safe, since it is essentially a read-only operation. This is now done; see VomsPilot.
    • Our US colleagues can point to it.
    • The PPS services can use it.
  • Create a list of middleware that uses VOMS functionality, so that its maintainers can be kept informed of upgrades through the pilot -> production cycle.
  • A bug had already been submitted that should have put the brakes on the upgrade. I've told the EMT that developers who see serious bugs like this in released production services should alert the EMT, so that it can pull the software back or recommend that it not be installed. A similar problem hit the LFC (BUG:38459): CERN detected the problem, and some time later it was discovered again at RAL. Site managers also have a responsibility here, which I will take care of.
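A client of the pilot, such as a PPS service or a test script, could look for the two kinds of change seen in this incident by comparing the FQANs that production and the pilot return for the same request. The Python sketch below is a hedged illustration under that assumption: compare_fqans is a hypothetical helper, not part of VOMS, and how the two FQAN lists are collected is deliberately left out.

```python
# Minimal sketch of a pilot-vs-production FQAN regression check.
# Assumption: the caller has already obtained the ordered FQAN list
# from each server for the same request (collection not shown).

NULL_COMPONENTS = {"Role=NULL", "Capability=NULL"}


def _short_form(fqan: str) -> str:
    """Drop Role=NULL and Capability=NULL components from one FQAN."""
    return "/".join(p for p in fqan.strip().split("/") if p not in NULL_COMPONENTS)


def compare_fqans(production: list[str], pilot: list[str]) -> list[str]:
    """Describe how the pilot's FQANs differ from production's (empty if identical)."""
    if production == pilot:
        return []
    short_prod = [_short_form(f) for f in production]
    short_pilot = [_short_form(f) for f in pilot]
    if set(short_prod) != set(short_pilot):
        return ["different FQANs"]
    changes = []
    # Same attributes, but spelled differently (e.g. NULL suffixes dropped).
    if sorted(production) != sorted(pilot):
        changes.append("format changed")
    # Same attributes, but in a different sequence.
    if short_prod != short_pilot:
        changes.append("order changed")
    # Lists differ but neither check fired: only the raw order differs.
    return changes or ["order changed"]


if __name__ == "__main__":
    production = [
        "/dteam/cern/Role=ftsmaster/Capability=NULL",
        "/dteam/Role=NULL/Capability=NULL",
        "/dteam/cern/Role=NULL/Capability=NULL",
    ]
    pilot = ["/dteam", "/dteam/cern", "/dteam/cern/Role=ftsmaster"]
    print(compare_fqans(production, pilot))  # ['format changed', 'order changed']
```

Reporting format and order changes separately mirrors the two distinct changes described above, since either one alone was enough to break some clients.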
Topic revision: r1 - 2008-12-18 - SteveTraylen