LCGSCM Authentication Authorisation Services Status

voms/vomrs report

October 15th 2008

Submitted by SteveTraylen
  • No changes to service.
  • Problems
    • Oracle returning wrong results.
      • e.g. after 13,000 trivial selects there are occurrences where 0 results are returned rather than 1.
      • This is on the validation cluster as well.
      • Miguel and Tanya are investigating; Tanya is providing a simpler code example (a reproduction sketch in the same spirit follows this list).
  • Workshop
  • Pilot
    • PATCH:2390 looks like a reasonable candidate for eventual certification. It will be deployed on the pilot this week.
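
The reproduction idea, as a minimal sketch rather than Tanya's actual code (the cx_Oracle driver, connect string and iteration count are all assumptions): repeat a trivial single-row select and flag any iteration where Oracle returns a different number of rows.

import cx_Oracle  # assumption: the repro is written against the Python cx_Oracle driver

# Hypothetical connect string; the real one would point at the VOMS database.
conn = cx_Oracle.connect("voms_reader/secret@voms_db")
cur = conn.cursor()

for i in range(50000):
    cur.execute("SELECT 1 FROM dual")  # a trivial select that must return exactly one row
    rows = cur.fetchall()
    if len(rows) != 1:
        print("iteration %d: %d rows returned instead of 1" % (i, len(rows)))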

October 1st 2008

Submitted by SteveTraylen
  • No change to service.
  • NA48 VO now completely configured and they are active.
  • New software now available to be added to the VOMS pilot.
    • Will be added in the next week or so.
  • Bizarre problem with VOMRS where the expiry of a CA in country X caused users from country Y to be expired.
    • Has affected 20 users in the last couple of weeks.
    • Fermilab developers investigating.

September 17th 2008

Submitted by SteveTraylen
  • No change or incidents on the service.
  • ATLAS reported yesterday a one-off generation of a corrupted proxy. Nothing obvious, but to be checked.

September 3rd 2008

Submitted by SteveTraylen
  • Very little to report, all VOMS service development is currently being ignored.
  • VomsPilot service is running and documented but no suitable software yet for testing.
  • A 30-second outage and automatic failover happened on August 31st for VOMRS.
    • Not obvious what caused it; possibly a security scan, but the timing is apparently wrong.

August 20th 2008

Submitted by SteveTraylen
  • Very little to report, all VOMS service development is currently being ignored.
  • Addition of NA48 VO is half complete.
  • VomsPilot service is running and documented but no suitable software yet for testing.
  • Next week MariaDimou will be in charge.

August 6th 2008

Submitted by SteveTraylen
  • Intervention completed on Monday August 4th.
    • VOMS and VOMRS are now 100% in quattor and CDB. smile
    • There was a problem with new users registering from 09:00 on Monday until 19:30 on Tuesday.
      • While they could complete the process, they were not actually added to their VO.
      • This is now corrected and everything has caught up.
      • The reason: a VOMRS configuration script mistake that set VO=test in one place for all VOs. All my testing was with the test VO, so the problem was not seen since that one worked.
      • In reality I doubt many users noticed. There was only one report from a dteam member.
      • All is now well.
  • The voms-pilot.cern.ch is now completely set up with hardware and configuration. It is waiting for a VOMS patch to make it through certification without being rejected.

July 23rd 2008

Submitted by SteveTraylen
  • No service incidents or significant interventions
  • Non-interactive deployment of vomrs is progressing nicely. Non-critical deployment bugs created for vomrs.
  • Hardware (virtual) requested for voms-pilot.cern.ch
  • PATCH:2063 (voms-admin) and PATCH:2061 (voms-core) have now entered certification with proposed fixes for July's upgrade and downgrade. Depending on progress, the earliest deployment is early September, assuming a useful pilot phase.
  • DOE support problem solved. DOE reissued the CA with the same DN and a new expiry date, which is perfectly normal. It seems that an expired CA with the same DN in the web browser causes the server to fail authentication, despite the server having only the new CA. Understood, but confusing.
  • It's my birthday.

July 8th 2008

Submitted by SteveTraylen

Report from Intervention on Monday 1st July

Hardware migration and upgrade of voms-admin and voms-core.
  • Hardware migration was transparent; the voms-admin upgrade was also transparent.
  • voms-core migration was not transparent and a downgrade followed 6 hours later due to a change in behaviour. frown

Details of voms proxy generation change

This was from voms 1.7 to voms 1.8, and then a downgrade back to 1.7. The voms-proxies that were generated changed from
/dteam/cern/Role=ftsmaster/Capability=NULL
/dteam/Role=NULL/Capability=NULL
/dteam/cern/Role=NULL/Capability=NULL
 
to
/dteam
/dteam/cern
/dteam/cern/Role=ftsmaster
 
Two distinct changes here.
  1. The Role=NULL/Capability=NULL was lost. This broke glite-renewd and also gPlasma in dCache (a small sketch of the normalisation that consumers relied on follows this list).
    • The change was not meant to happen, in the sense that a configuration option should have been required to enable this new behaviour, but it was enabled by default and could not be switched off.
    • Annoyingly, these had been reported by SARA (GGUS:36587, BUG:37008) before the upgrade, but I had not really read them or considered their significance.
    • We expect to go to this new behaviour with VOMS 1.9, when glite-renewd and gPlasma must be altered, maybe others too?
  2. The predictable order of the FQANs changed. The specification states they can be in a random order, but the previous predictable order has been built upon in many places.
    • This impacted PANDA production in the US.
    • WMS interactions for production-role users.
    • I expect more, especially for role users.
    • The predictable order will be restored in a future version. BUG:38506 is accepted and will be integrated before any upgrade.
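
As a minimal illustration of change 1 (in Python, not the actual VOMS C++ code; the helper name is made up): consumers such as glite-renewd effectively relied on every attribute being expanded to the long form with explicit Role=NULL/Capability=NULL parts.

def to_long_fqan(fqan):
    # Expand a short-form FQAN such as '/dteam/cern' or
    # '/dteam/cern/Role=ftsmaster' into the long form with explicit
    # Role and Capability components.
    group, role = fqan, "NULL"
    if "/Role=" in fqan:
        group, role = fqan.split("/Role=", 1)
    return "%s/Role=%s/Capability=NULL" % (group, role)

for short_fqan in ["/dteam", "/dteam/cern", "/dteam/cern/Role=ftsmaster"]:
    print(to_long_fqan(short_fqan))
# -> /dteam/Role=NULL/Capability=NULL
#    /dteam/cern/Role=NULL/Capability=NULL
#    /dteam/cern/Role=ftsmaster/Capability=NULL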

Why not Detected Anywhere?

  • These changes were not in the release notes, since the first change was a bug and not planned to be present, and the second met the specification, so in principle there was no change.
  • Not picked up during certification, since there was some confusion about what we were aiming for: short or long FQANs.
  • There is little certification testing against dCache or long jobs for instance.
  • Not picked up within the PPS. They use the production VOMS server.

Avoiding in the Future.

  • Create a voms-pilot.cern.ch service with the next version of the software to be deployed, running against the production database. Safe; it's essentially a read operation.
    • Our US colleagues can point to it.
    • The PPS services can use it.
  • Create a list of middleware using voms functionality so they can be kept informed of the pilot -> production upgrade cycle.
  • Concerning the fact that a bug was submitted that should have put the brakes on: I've alerted the EMT that developers seeing large bugs like this in released production software should alert the EMT so they can pull the software back or recommend it not be installed. A similar problem hit the LFC (BUG:38459): in that case CERN detected the problem and then, some time later, the problem was discovered again at RAL. The site managers also have responsibility here, which I will take care of.

Other Items

  • Since the hardware upgrade (rather than the software one) the number of Oracle sessions has jumped. It has been reduced but is still much higher than it was. Under investigation.
  • For the first time the nodes behind voms.cern.ch are configured with YAIM and NCM-YAIM and are completely deployable without manual intervention after install by the admin. The same must now be repeated for lcg-voms.cern.ch including vomrs, which is harder.
  • The VOMS service now raises alarms to operators for things that can be detected. This had never been the case before. Operator instructions are updated.

June 24th 2008

Submitted by SteveTraylen
Transparent intervention plans for hardware migration of voms.cern.ch
  • voms.cern.ch, currently on voms101 and voms102, will be migrated to voms113 and voms114.
  • voms101 will be destroyed on 15th July, so ideally the migration happens before then, else we drop to one node.
  • The new voms deployment will be YAIM rather than gLite configured.
  • Phased migration, with dates to be decided at the end of this week, but:
    • Wednesday 2nd July: expand voms.cern.ch to voms101, voms102, voms113 and voms114.
    • Monday 7th July: shrink voms.cern.ch to voms113 and voms114.

Recent Problems
  • Failure to recover after planned DB intervention.
    • Would basically have been avoided with better communication from the VOMS service manager.
  • Discussion with developers to change two things.
    • BUG:38130 - Voms core will do explicit database re-connections.
    • BUG:19770 - The case of the user not being in the VO and the case of the database being down will be distinguished to the client (a sketch of the idea follows at the end of this entry).

Completed:
  • VOMS now in SLS. http://sls.cern.ch/sls/service.php?id=VOMS
  • More monitoring of tomcat added.

Ongoing Work:
  • Lots of testing of VOMS YAIM module.
  • Updates to NCM-YAIM to be added now.
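
The sketch below illustrates the reconnect-and-distinguish pattern behind BUG:38130 and BUG:19770. It is illustrative Python with an assumed Oracle driver and a made-up function name, not the actual voms-core implementation; the point is that an empty result ("user not in VO") and a dead database session must surface differently to the client.

import time
import cx_Oracle  # assumption: illustrated with the Python Oracle driver

def lookup_with_reconnect(connect_string, sql, retries=3):
    # Explicitly rebuild the connection on database errors so that a
    # broken DB session is never reported to the client as "no such user".
    for attempt in range(retries):
        try:
            conn = cx_Oracle.connect(connect_string)
            try:
                cur = conn.cursor()
                cur.execute(sql)
                return cur.fetchall()  # an empty list here genuinely means "not in VO"
            finally:
                conn.close()
        except cx_Oracle.DatabaseError:
            time.sleep(2 ** attempt)  # back off, then re-connect from scratch
    # Retries exhausted: the database is down, which must look different
    # to the client than "user not in VO".
    raise RuntimeError("database unavailable")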

June 4th 2008

Submitted by SteveTraylen
Comments
  • From a reboot (e.g. a power cut) it takes nearly an hour for the service to start! Similarly, a service restart without expert intervention also takes an hour. All this makes automatic interventions impossible.
Work in Progress
  • Testing the new YAIM component as a replacement for the gLite configuration scripts; it should remove the hour-long service start-up and improve deployment of voms-core and voms-admin at least. I hope to deploy with YAIM for VOMS on the new hardware now available.
  • Numerous requests for VOMS to appear in SLS. Not as urgent as fixing the above point, but soon.
  • 6-hour spikes in load are still present but greatly reduced since CNAF Castor randomized their grid-mapfile generation (a sketch of the splay idea follows this list). More work needed.
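
A hypothetical wrapper in the spirit of CNAF's fix: sleep a random amount before regenerating the grid-mapfile so that many clients do not all hit voms.cern.ch at the same moment. The command path is invented for illustration.

import random
import subprocess
import time

# Random splay of up to one hour before doing the real work, so that
# hundreds of hosts do not all query voms.cern.ch at the same instant.
time.sleep(random.uniform(0, 3600))

# Hypothetical command; a real site would call whatever tool it uses
# to regenerate its grid-mapfile from the VOMS server.
subprocess.call(["/usr/sbin/update-grid-mapfile"])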

May 21st 2008

Submitted by Maria Dimou:
  • Restarted voms-core on both hosts behind voms.cern.ch yesterday due to a problem reported by A.Sciaba when obtaining a proxy for CMS. "Middleman" errors were found in the logs. Wondering why the almost-two-year-old bug https://savannah.cern.ch/bugs/?19770 is still causing trouble.
  • Ran spma_wrapper.sh on all 4 hosts of the vom(r)s service to push the new CAs. Restarted tomcat and voms-core on all of them as well due to the appearance of a new CA. Did this urgently because of a complaint in the rollout list from a user with a UK certificate (the UK CA caused this exceptional CA update).

April 30th 2008

  • Recent Failures
    • voms-admin exited again on Friday due to too many open files within tomcat.
      • A large spike in load occurs exactly every six hours.
        • This coincides with
          1. Increased ALICE activity in the DB: the VOBoxes are checked every 6 hours for signs of activity. BUG:36161
          2. All the CNAF Castor servers appear to download grid-mapfiles at the same time. GGUS:35880
  • Configuration Changes to HA-Linux
    • tomcat is now stopped on the vomrs slave node.
    • voms-core is not stopped on the slave node.
    • The results are:
        1. Failover takes 1 minute instead of 30.
        2. Users are no longer able to connect to the broken slave node, avoiding confusion.
  • 4 new boxes are now available for VOMS migration.
    • Work is underway to consider migration, maybe at the same time as the new YAIM configuration of VOMS is complete.

  • Possible Upgrades.
    • We can now upgrade vomrs to 1.3.2 and voms-admin to 2.0.13.
    • Monday 5th May, afternoon, is a proposed date.
      • No expected downtime for voms-admin or voms-core.
      • 20 minutes downtime for vomrs, i.e. only registrations will be disabled.
    • Changes from upgrade.
      • Mostly contains fixes for me, the admin, which have been worked around anyway.
      • VOMRS is, though, more efficient in its calls to voms-admin.
    • Detailed Plan.
      1. Upgrade vomrs on voms105, the standby node, without the schema upgrade.
      2. Remove voms101 from voms.cern.ch alias.
      3. Upgrade voms-admin on voms101, check some simple queries.
      4. Shutdown vomrs on master of lcg-voms.cern.ch
      5. Atomically remove voms104 from voms.cern.ch alias and add voms101 back to voms.cern.ch.
      6. Upgrade vomrs schema on voms105.
      7. Fail vomrs over to voms106 from voms105 which will start it up.
      8. Upgrade vomrs on voms106 the new standby node.
      9. Upgrade voms-admin on voms104
      10. Add voms104 back to the voms.cern.ch voms-admin alias.
      11. Check that it is working (see the sketch below this list).
      12. Wait for the fallout.
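
As a sanity check for step 11 (and around the alias changes in steps 2 and 5), something like the following could confirm which hosts the voms.cern.ch alias currently resolves to; the host names come from the plan above, but the snippet itself is only a sketch.

import socket

# Resolve the alias and list the distinct addresses currently behind it.
addrs = sorted({info[4][0] for info in socket.getaddrinfo("voms.cern.ch", 443)})
print("voms.cern.ch ->", addrs)

# Only proceed with upgrading a node once its address has dropped out of
# this list (allowing for the DNS TTL to expire).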

April 16th 2008

  • Recent Failures
    • The limit of open files was reached for the tomcat user, causing voms-admin to exit for CMS.
      • Limit increased; it was lost during the SL3->SL4 migration. Should not happen again.
      • Also a new metric introduced which shows large spike every 6 hours. Under investigation.
    • voms-core failed on lcg-voms.cern.ch, which currently has no self-healing for this. It should not be a problem since voms.cern.ch is still there, but glite-proxy-renewal only uses the one original host. Bugs are all in... and voms-core on lcg-voms will be considered for removal at some point, post replication to CNAF.
      • A new cron was introduced to keep the service alive (a sketch of the idea follows at the end of this entry); hopefully it is not needed with the upcoming new release.
  • lcg-voms.cern.ch cert expires on 5th May.
    • For smooth operations the new one must be deployed at sites a week before.
    • Update has been released to production today.
  • New releases expected of voms-core, vomrs and voms-admin shortly. Will decide an intervention once released.
    • Certainly there will be some short downtime for vomrs upgrade.
  • Lots of cleanup completed recently.
    • Two mechanisms were updating CRLs. Cleaned up.
    • Moving Remi's existing and new hacks into an RPM to ease deployment.
      • voms-admin no longer broken after a tomcat restart.
  • Discussion with the lemon folks about a sensible way to publish per-VO values within a single metric rather than a separate metric for every VO.
  • Discussions going on between vomrs and voms-admin to potentially drop vomrs once voms-admin does everything.
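
The keep-alive cron mentioned above might look something like this hypothetical sketch; the process pattern and init script name are assumptions, not the actual CERN cron job.

import subprocess

def voms_core_running():
    # pgrep exits with 0 when at least one process matches; "voms" is a
    # hypothetical match pattern, not the real one from the cron job.
    return subprocess.call(["pgrep", "-f", "voms"]) == 0

if not voms_core_running():
    # Hypothetical init script name; restart the daemon if it has died.
    subprocess.call(["/sbin/service", "voms", "restart"])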

February 27th 2008

  • Thanks to Sophie Lemaitre, David Gutierez and Vladimir Bahyl, there is now a dynamic load-balanced DNS alias (vomslb.cern.ch) for voms101 and voms104.
    • The best host is determined with some VOMS-related lemon metrics.
    • voms.cern.ch now points to vomslb instead of voms101.
  • Steve Traylen will be my successor; the handover will be done during the coming weeks.

February 13th 2008

  • VOMS replication tests didn't reveal any problems.
  • The VOMS patch to fix the crash issue is in the certification process.

January 23rd 2008

  • The bug which makes the voms-core server crash is now fully understood by the developer and will be fixed in 1.8.2 (AFAIK, no patch yet).

January 16th 2008

  • For one month we've been suffering from a bug which makes voms-core processes crash regularly.
    • The problem is under investigation by the developer.
    • For the time being, a cronjob restarts the processes once per day.
    • The next release of voms-core will not fix this bug.

Old Reports
