LCGSCM Authentication Authorisation Services Status

voms/vomrs report

Oct 7th 2009

Submitted by SteveTraylen
  • VOMS has been running without incident since April.
  • As of 30th September there has been a change in behaviour.
    • A voms-proxy-init across 12 VOs used to take around 15 seconds and now takes around 70 (see the timing sketch after this list).
    • VomsPostMortem2009x10x01 - work in progress.
  • No updates are pending at the moment.
    • A voms and voms-admin release is in certification, but not for SL5.
  • A new VO, vo.delphi.cern.ch (?), looks likely to be added in the near future.
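
To narrow down where the extra time goes, one option is to time voms-proxy-init separately for each VO. A minimal sketch in Python (the VO names are placeholders, not the actual 12 VOs, and passphrase handling is elided):

    import os
    import subprocess
    import time

    VOS = ["dteam", "atlas", "cms"]  # placeholders; substitute the 12 real VOs

    devnull = open(os.devnull, "w")
    for vo in VOS:
        start = time.time()
        # -voms selects the VO to contact; output is discarded, only timing matters
        rc = subprocess.call(["voms-proxy-init", "-voms", vo],
                             stdout=devnull, stderr=devnull)
        print("%-10s rc=%d %6.2fs" % (vo, rc, time.time() - start))

If the slowdown is spread evenly across VOs it points at the client or network; if one or two VOs dominate, it points at those servers.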

VOMS

May 27th 2009

  • As last time, very little is expected to change before or during STEP09.
    • An update before the end of the year now looks possible.
  • The DBAs (EVA) are waiting on (Steve) for an up-to-date problem report on VOMRS getting wrong results.

Not VOMS

There are two completely unrelated items, both involving the CERN CA and consequently affecting CERN users and services in particular.
CERN CA CRL is the shortest at five days
Currently within IGTF most CRLs are 30 days in length; quite a few are around 10 to 15 days, and CERN's, at five days, is the shortest. The consequence is that whenever there is a CRL download problem anywhere on the planet, CERN CA users and services are the first to suffer. My enquiries and requests to the CERN CA about extending this, so that at least expiries at weekends can be avoided, have returned the following. This is all valid; we have to find the trade-off.
    The CRL lifetime should not be too long otherwise its purpose becomes
    useless. The initial lifetime was 48 hours and it was already
    extended to meet Grid requirements, where scripts are updating this
    CRL only once per day or once per 2 days. Extending even more means
    that Grid members will not get the information soon enough if a
    certificate is being revoked for security reasons, and it might
    compromise security. As a result, we consider that 5 days is already
    long and it should not be extended more.
   
I will continue to follow up, but this is why CRL problems anywhere on the planet hit CERN so badly. More monitoring will help and will clearly be done, but having someone else break first would be great. A sketch of such a check follows.
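
As a flavour of what that monitoring could look like: a minimal sketch (assuming PEM-format CRLs under the standard /etc/grid-security/certificates directory and openssl on the PATH; the two-day threshold is an arbitrary choice) that warns when any installed CRL is close to its nextUpdate:

    import calendar
    import glob
    import subprocess
    import time

    CRL_DIR = "/etc/grid-security/certificates"  # standard grid CA directory
    WARN_SECS = 2 * 24 * 3600                    # warn two days ahead (arbitrary)

    for crl in sorted(glob.glob(CRL_DIR + "/*.r0")):
        out = subprocess.Popen(
            ["openssl", "crl", "-in", crl, "-noout", "-nextupdate"],
            stdout=subprocess.PIPE).communicate()[0].decode("ascii")
        # openssl prints a line like: nextUpdate=Oct 12 10:00:00 2009 GMT
        stamp = out.strip().split("=", 1)[1]
        expiry = calendar.timegm(time.strptime(stamp, "%b %d %H:%M:%S %Y %Z"))
        left = expiry - time.time()
        if left < WARN_SECS:
            print("WARNING: %s nextUpdate in %.1f hours" % (crl, left / 3600.0))

Hooking something like this into the existing monitoring would give the early warning we currently lack.
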
CERN CA certs and mod_ssl problem
BUG:48458 is hitting only CERN at the moment, such that some CAs have been removed from the set of CAs that CERN trusts. For now this is fine as a workaround, but as the number of LHC-active countries grows we will hit a point where the workaround fails. Maarten and others are clearly working on it, but this is a brick wall we are walking towards, albeit slowly.

April 29th 2009

Submitted by SteveTraylen
  • Very little is expected to change before or during STEP09.
  • Areas of note.
    • CMS and ALICE make 10x as many requests to the VOMS server as LHCb or ATLAS. There is no actual problem; it would just be nice to explain. I'll produce some accurate figures as a start.
    • Support out of hours for VOMS is still limited.

November 26th 2008

Submitted by SteveTraylen
  • No changes to service.
  • Upgrade Plans.
    • Hopefully voms-core, voms-admin and vomrs upgrade on Monday 1st December at 08:30 UTC.
    • It is pending feedback from OSG today. We will decide tomorrow whether to go ahead or not; a broadcast will be sent.
    • Changes for users:
      • vomrs sends notification emails with better information in them, which should reduce confusion.
      • voms-core (the bulk of queries and updates) changes to a read-only operation.
      • voms-core should not crash as often.
      • voms-core should reconnect following a database outage (a sketch of the pattern follows this report).
  • Other items currently being discussed and then addressed.
    • Users currently have no way to delete themselves from a VO.
  • Reports in from the database folks that VOMS is still making many short queries.
    • To be investigated after Monday's intervention.
    • Since VOMS core (read-only) and VOMS admin (read-write) will be using different accounts after Monday, this will be easier to resolve.
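
For illustration, the reconnect behaviour we are after looks roughly like this (a sketch of the pattern in Python with cx_Oracle, not the actual voms-core C++ code; the DSN is a placeholder):

    import time
    import cx_Oracle

    DSN = "voms_reader/secret@lcgdb"  # placeholder account/DSN

    class ReconnectingDB(object):
        """Run queries, re-opening the connection if Oracle drops it."""

        def __init__(self, dsn):
            self.dsn = dsn
            self.conn = cx_Oracle.connect(dsn)

        def query(self, sql, retries=3):
            for attempt in range(retries):
                try:
                    cursor = self.conn.cursor()
                    cursor.execute(sql)
                    return cursor.fetchall()
                except cx_Oracle.DatabaseError:
                    time.sleep(2 ** attempt)  # simple backoff between attempts
                    try:
                        self.conn = cx_Oracle.connect(self.dsn)
                    except cx_Oracle.DatabaseError:
                        pass  # database still down; loop and try again
            raise RuntimeError("database unreachable after %d attempts" % retries)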

November 5th 2008

Submitted by SteveTraylen
  • No changes to service.
  • The VomsPilot service is little used.
    • Potentially some problems submitting jobs, but hopefully my fault.
  • Problem of Oracle returning wrong results.
    • The situation only arises for LHC VOs where VOMRS connects both to its own database and the CERN HR database.
    • Reducing the poll frequency to the CERN HR database so that the situation never arises makes the problem vanish.
      • Not a viable fix or workaround, but it gives us ideas on how to proceed. The developers are busy until the 14th of November, so this is on hold.

October 15th 2008

Submitted by SteveTraylen
  • No changes to service
  • Problems
    • Oracle returning wrong results.
      • e.g. after 13,000 trivial selects there are occurrences where 0 rows are returned rather than 1 (see the reproduction sketch after this list).
      • This is on the validation cluster as well.
      • Miguel and Tanya investigating, Tanya providing simpler code example.
  • Workshop
  • Pilot
    • PATCH:2390 looks like a reasonable candidate for eventual certification. It will be deployed on the pilot this week.
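
For context, the kind of simple reproducer being prepared looks roughly like this (a sketch in Python with cx_Oracle; the DSN is a placeholder): run a trivial select that must always return exactly one row and log every call that does not:

    import cx_Oracle

    DSN = "voms_test/secret@validation"  # placeholder account/DSN

    conn = cx_Oracle.connect(DSN)
    cursor = conn.cursor()
    bad = 0
    for i in range(100000):
        cursor.execute("SELECT 1 FROM dual")  # must always return exactly one row
        rows = cursor.fetchall()
        if len(rows) != 1:
            bad += 1
            print("iteration %d: %d rows instead of 1" % (i, len(rows)))
    print("%d wrong results in 100000 selects" % bad)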

October 1st 2008

Submitted by SteveTraylen
  • No change to service.
  • NA48 VO now completely configured and they are active.
  • New software now available to be added to the VOMS pilot.
    • Will be added in the next week or so.
  • Bizarre problem with VOMRS: the expiry of a CA in country X causes users from country Y to be expired.
    • Has affected 20 users in the last couple of weeks.
    • Fermilab developers investigating.

September 17th 2008

Submitted by SteveTraylen
  • No change or incidents on the service.
  • ATLAS reported yesterday a one-off generation of a corrupted proxy. Nothing obvious, but to be checked.

September 3rd 2008

Submitted by SteveTraylen
  • Very little to report, all VOMS service development is currently being ignored.
  • VomsPilot service is running and documented but no suitable software yet for testing.
  • A 30 second outage and automatic fail over happened on August 31st for VOMRS.
    • It is not obvious what caused it; possibly a security scan, but the timing is apparently wrong.

August 20th 2008

Submitted by SteveTraylen
  • Very little to report, all VOMS service development is currently being ignored.
  • Addition of NA48 VO is half complete.
  • VomsPilot service is running and documented but no suitable software yet for testing.
  • Next week MariaDimou will be in charge.

August 6th 2008

Submitted by SteveTraylen
  • Intervention completed on Monday August 4th.
    • VOMS and VOMRS are now 100% in quattor and CDB. :-)
    • There was a problem with new users registering after 09:00 on Monday till 19:30 on Tuesday.
      • While they could complete the process they were not actually added to their VO.
      • This is now corrected and everything has caught up.
      • The reason: a configuration script mistake from VOMRS set VO=test in one place for all VOs. All my testing was with the test VO, so this was not seen since that VO worked.
      • In reality I doubt many users noticed. There was only one report from a dteam member.
      • All is now well.
  • voms-pilot.cern.ch is now completely set up with hardware and configuration. It is waiting for a VOMS patch to make it through certification without being rejected.

July 23rd 2008

Submitted by SteveTraylen
  • No service incidents or significant interventions
  • Non-interactive deployment of vomrs is progressing nicely. Non-critical deployment bugs created for vomrs.
  • Hardware (virtual) requested for voms-pilot.cern.ch
  • PATCH:2063 (voms-admin) and PATCH:2061 (voms-core) have now entered certification with proposed fixes for July's upgrade and downgrade. Depending on progress earliest deployment early September assuming a useful pilot phase.
  • DOE support problem solved. DOE reissued the CA with the same DN and a new expiry date, which is perfectly normal. It seems that an expired CA with the same DN in the web browser causes the server to fail authentication, despite the server having only the new CA. Understood, but confusing.
  • It's my birthday.

July 8th 2008

Submitted by SteveTraylen

Report from Intervention on Monday 1st July

Hardware migration and upgrade of voms-admin and voms-core.
  • The hardware migration was transparent; the voms-admin upgrade was transparent.
  • The voms-core migration was not transparent, and a downgrade followed 6 hours later due to a change in behaviour. :-(

Details of voms proxy generation change

This was an upgrade from voms 1.7 to voms 1.8, followed by a downgrade back to 1.7. The FQANs in generated voms proxies changed from
/dteam/cern/Role=ftsmaster/Capability=NULL
/dteam/Role=NULL/Capability=NULL
/dteam/cern/Role=NULL/Capability=NULL
 
to
/dteam
/dteam/cern
/dteam/cern/Role=ftsmaster
 
Two distinct changes here.
  1. The Role=NULL/Capability=NULL was lost. This broke glite-renewd and also gPlasma in dCache.
    • The change was not meant to happen, in the sense that a configuration option should have been required to enable the new behaviour; instead it was enabled by default and could not be switched off.
    • Annoyingly, these had been reported by SARA (GGUS:36587, BUG:37008) before the upgrade, but I had not really read them or considered their significance.
    • We expect to go to this new behaviour with VOMS 1.9, at which point glite-renewd and gPlasma must be altered, and maybe others too.
  2. The predictable order of the FQANs changed. The specification states they can be in a random order, but the previous predictable order has been built upon in many places (see the sketch after this list).
    • This impacted PANDA production in the US.
    • WMS interactions for production role users.
    • I expect more, especially for role users.
    • The predictable order will be restored in a future version. BUG:38506 is accepted and will be integrated before any upgrade.
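
To make concrete why both changes bite, here is a sketch of the two fragile assumptions (the helper functions are mine for illustration, not code from glite-renewd or gPlasma):

    # Sketch of the assumptions consumers had built on; not real middleware code.

    def is_member(fqans, group):
        # Fragile: matches only the long form emitted by voms 1.7.
        return (group + "/Role=NULL/Capability=NULL") in fqans

    def primary_fqan(fqans):
        # Fragile: assumes the first FQAN is the one requested, as voms 1.7
        # reliably ordered it; the spec allows any order.
        return fqans[0]

    old = ["/dteam/cern/Role=ftsmaster/Capability=NULL",
           "/dteam/Role=NULL/Capability=NULL",
           "/dteam/cern/Role=NULL/Capability=NULL"]
    new = ["/dteam",
           "/dteam/cern",
           "/dteam/cern/Role=ftsmaster"]

    print(is_member(old, "/dteam"))  # True
    print(is_member(new, "/dteam"))  # False: membership check breaks (change 1)
    print(primary_fqan(old))         # the ftsmaster role FQAN
    print(primary_fqan(new))         # plain /dteam: the role is ignored (change 2)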

Why not Detected Anywhere?

  • These changes were not in the release notes, since the first change was a bug and not planned to be present, and the second met the specification, so in principle there was no change.
  • Not picked up during certification, since there was some confusion about what we were aiming for: short or long FQANs.
  • There is little certification testing against dCache or long jobs for instance.
  • Not picked up within the PPS. They use the production VOMS server.

Avoiding in the Future.

  • Create a voms-pilot.cern.ch service running the next version of the software to be deployed, against the production database. Safe; it's essentially a read operation.
    • Our US colleagues can point to it.
    • The PPS services can use it.
  • Create a list of middleware using voms functionality, so it can be kept informed of the pilot -> production upgrade cycles.
  • Concerning the fact that a bug was submitted that should have put the brakes on: I've alerted the EMT that developers who see large bugs like this in released production software should alert the EMT, so they can pull the software back or recommend it not be installed. A similar problem hit the LFC (BUG:38459). The site managers also have responsibility here, which I will take care of. In the case of the LFC, CERN detected the problem and then, some time later, the problem was discovered again at RAL.

Other Items

  • Since the hardware upgrade (rather than the software one) the number of Oracle sessions has jumped. It has been reduced but is still much higher than it was. Under investigation.
  • For the first time the nodes behind voms.cern.ch are configured with YAIM and NCM-YAIM and are completely deployable without manual intervention after install by the admin. The same must now be repeated for lcg-voms.cern.ch, including vomrs - harder.
  • The VOMS service now raises alarms to operators for things that can be detected. This had never been the case before. Operator instructions are updated.

June 24th 2008

Submitted by SteveTraylen
Transparent intervention plans for hardware migration of voms.cern.ch
  • voms.cern.ch, currently on voms101 and voms102, will be migrated to voms113 and voms114.
  • voms101 will be destroyed on 15th July, so ideally the migration happens before then; otherwise we drop to one node.
  • The new voms deployment will be configured with YAIM rather than gLite.
  • Phased migration, dates to be decided end of this week but:
    Wednesday 2nd July
    Expand voms.cern.ch to voms101, voms102, voms113 and voms114.
    Monday 7th July
    Shrink voms.cern.ch to voms113 and voms114.

Recent Problems
  • Failure to recover after planned DB intervention.
    • Would basically have been avoided with better communication from the VOMS service manager.
  • Discussion with developers to change two things.
    • BUG:38130 - voms-core will do explicit database re-connections.
    • BUG:19770 - The case of the user not being in the VO and that of the database being down will be distinguished to the client.
Completed:
  • VOMS is now in SLS. http://sls.cern.ch/sls/service.php?id=VOMS
  • More monitoring of tomcat added.
Ongoing Work:
  • Lots of testing of the VOMS YAIM module.
  • Updates to NCM-YAIM to be added now.

June 4th 2008

Submitted by SteveTraylen
Comments
  • From a reboot (e.g. after a power cut) it takes nearly an hour for the service to start! Similarly, a service restart without expert intervention also takes an hour. All this makes automatic interventions impossible.
Work in Progress
  • Testing the new YAIM component to replace the gLite configuration scripts; it should remove the hour-long service start-up and improve deployment of at least voms-core and voms-admin. I hope to deploy voms with YAIM on the new hardware that is now available.
  • Numerous requests for VOMS to appear in SLS. Not as urgent as fixing the above point, but soon.
  • The 6-hourly spikes in load are still present, but greatly reduced since CNAF Castor randomized their grid-mapfile generation. More work is needed; a sketch of the idea follows.
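
The randomisation amounts to adding a random splay before each client contacts the server, so the downloads stop lining up on the same six-hour boundary. A minimal sketch of the idea (the grid-mapfile command and its flags are a placeholder for whatever each site actually runs):

    import random
    import subprocess
    import time

    # Sleep a random fraction of an hour so that hosts sharing the same cron
    # schedule do not all hit the VOMS server at the same instant.
    time.sleep(random.uniform(0, 3600))
    subprocess.call(["edg-mkgridmap", "--output", "/etc/grid-security/grid-mapfile"])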

May 21st 2008

Submitted by Maria Dimou:
  • Restarted voms-core on both hosts behind voms.cern.ch yesterday, due to a problem reported by A. Sciaba when obtaining a proxy for CMS. "Middleman" errors were found in the logs. Wondering why the soon-two-years-old bug https://savannah.cern.ch/bugs/?19770 is so unfortunate.
  • Ran spma_wrapper.sh on all 4 hosts of the vom(r)s service to push the new CAs, and restarted tomcat and voms-core on all of them as well, due to the appearance of a new CA. Did this urgently because of a complaint in the rollout list from a user with a UK certificate (the UK CA caused this exceptional CA update).

April 30th 2008

  • Recent Failures
    • voms-admin exited again on Friday due to many open files within tomcat.
      • A large spike in load occurs exactly every six hours.
        • This coincides with
          1. Increased ALICE activity in the DB: the VOBoxes are checked every 6 hours for whether they are doing something. BUG:36161
          2. All the CNAF Castor servers appear to download grid-mapfiles at the same time. GGUS:35880
  • Configuration Changes to HA-Linux
    • tomcat is now stopped on the vomrs slave node.
    • voms-core is now stopped on the slave node.
    • The results are:
        1. Failover takes 1 minute instead of 30.
        2. Users are no longer able to connect to the broken slave node, avoiding confusion.
  • 4 new boxes are now available for VOMS migration.
    • Work is underway to consider migration, maybe at the same time as the new YAIM configuration of VOMS is complete.

  • Possible Upgrades.
    • We can now upgrade vomrs to 1.3.2 and voms-admin to 2.0.13.
    • Monday 5th May, Afternoon is a proposed date.
      • No expected downtime for voms-admin or voms-core
      • 20 minutes of downtime for vomrs, i.e. only registrations will be disabled.
    • Changes from upgrade.
      • Mostly contains fixes for me, the admin, which have been worked around anyway.
      • Vomrs is, though, more efficient in its calls to voms-admin.
    • Detailed Plan.
      1. Upgrade vomrs on voms105, the standby node, without the schema upgrade.
      2. Remove voms101 from voms.cern.ch alias.
      3. Upgrade voms-admin on voms101, check some simple queries.
      4. Shut down vomrs on the master of lcg-voms.cern.ch
      5. Atomically remove voms104 from voms.cern.ch alias and add voms101 back to voms.cern.ch.
      6. Upgrade vomrs schema on voms105.
      7. Fail vomrs over to voms106 from voms105 which will start it up.
      8. Upgrade vomrs on voms106 the new standby node.
      9. Upgrade voms-admin on voms104
      10. Add voms104 back to the voms.cern.ch voms-admin alias.
      11. Check if it is working
      12. Wait for the fallout.

April 16th 2008

  • Recent Failures
    • The limit on open files was reached for the tomcat user, causing voms-admin to exit for CMS.
      • Limit increased; it had been lost during the SL3->SL4 migration. Should not happen again.
      • Also a new metric was introduced, which shows a large spike every 6 hours. Under investigation.
    • voms-core failed on lcg-voms.cern.ch, which currently has no self-healing for this. It should not be a problem since voms.cern.ch is still there, but glite-proxy-renewal only uses the one original host. Bugs are all in... and voms-core on lcg-admin will be considered for removal at some point, post replication to CNAF.
      • A new cron has been introduced to keep the service alive (a sketch follows this report); hopefully it will not be needed with the new upcoming release.
  • lcg-voms.cern.ch cert expires on 5th May.
    • For smooth operations the new one must be deployed at sites a week before.
    • Update has been released to production today.
  • New releases expected of voms-core, vomrs and voms-admin shortly. Will decide an intervention once released.
    • Certainly there will be some short downtime for vomrs upgrade.
  • Lots of cleanup completed recently.
    • Two mechanisms were updating CRLs. Cleaned up.
    • Moving Remi's existing and new hacks into an RPM to ease deployment.
      • voms-admin no longer broken after a tomcat restart.
  • Discussion with the lemon folks about a sensible way to publish per-VO metrics, rather than defining a separate metric for every VO.
  • Discussions going on between vomrs and voms-admin to potentially drop vomrs once voms-admin does everything.
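
The keep-alive cron amounts to a small watchdog run every few minutes; roughly the following sketch (the process name and init script are assumptions for illustration, not the real ones):

    import os
    import subprocess

    devnull = open(os.devnull, "w")
    # If no voms-core process is running, restart it via the init script.
    # "voms" and "/etc/init.d/voms" are assumed names, not the real ones.
    if subprocess.call(["pgrep", "-x", "voms"], stdout=devnull) != 0:
        subprocess.call(["/etc/init.d/voms", "restart"])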

February 27th 2008

  • Thanks to Sophie Lemaitre, David Gutierez and Vladimir Bahyl, there is now a dynamic load-balanced DNS alias (vomslb.cern.ch) for voms101 and voms104
    • The best host is determined using some VOMS-related lemon metrics
    • voms.cern.ch is now pointing to vomslb instead of voms101
  • Steve Traylen will be my successor; the handover will be done during the coming weeks

February 13th 2008

  • The VOMS replication tests didn't reveal any problems
  • The VOMS patch to fix the crash issue is in the certification process

January 23rd 2008

  • The bug which makes the voms-core server crash is now fully understood by the developer, and will be fixed in 1.8.2 (AFAIK, no patch yet)

January 16th 2008

  • For the past month we have been suffering from a bug which makes voms-core processes crash regularly
    • problem under investigation by the developer
    • for the time being, a cronjob restarts the processes once per day
    • the next release of voms-core will not fix this bug

Old Reports

