Unexpected problems during Qualiac migration to Safe Host (DB nodes AND Application servers).

Description

  • Dates proposed (taking into account absences, the upcoming Easter holidays and future interventions):
    • 21.03 (Wednesday), followed by 2 days of possible troubleshooting by Artur and Andre
    • 23.03 (Friday), with the weekend for testing, but without Andre and Artur, who would be absent on the following days
  • Intervention scheduled for 21st March.
  • During the migration, several unexpected problems occurred:
    • The WOS application was not able to work with 4 Tomcat servers.
    • The tnsnames.ora file on DFS was not updated.
    • Some configuration scripts also failed and Datasphere was not working.

Impact

  • WOS: Service ran on only one Tomcat server on Thursday and Friday. Performance was constantly monitored and no issues were observed.
  • Business Objects: Service unavailable until Thursday morning (because of the tnsnames propagation issue).
  • Datasphere: Service unavailable until Friday afternoon, affected by 2 different issues.
  • XML order sending: Not working until Thursday morning (firewall issue).

Timeline of the incident

  • 07-Mar-12 16:34 - Nicolas Marescaux provided AIS with future production machines as beta environment for testing.
  • 09-Mar-12 15:55 - Artur Wiecek configured the application layer on beta.
  • 21-Mar-12 12:14 - Ivica Dobrovicova confirmed beta environment works fine and gave the GO for migration starting at 6.00 pm.
  • 21-Mar-12 20:57 - Migration was completed and the Qualiac application was back in service on the new hardware.
  • 21-Mar-12 21:00 - Ivica reported an issue: WOS application not working. A problem appeared when running 4 JVMs simultaneously.
  • 21-Mar-12 22:19 - Regis Buffet reported an issue: file transfer to Datasphere failed.
  • 21-Mar-12 23:00 - WOS web server up and running on a single Tomcat with 3.2 GB of memory (supports around 50 concurrent users), with support from the Qualiac consultant.
  • 21-Mar-12 23:36 - Andre and Ivica decided to GO with the migration.
  • 21-Mar-12 23:45 - Copy of the tnsnames.ora file to DFS failed. New entries were manually copied to the local tnsnames.ora file as a workaround for Readsoft Invoice scanning. For DFS, problem resolution had to be postponed to Thursday morning, as DFS support needed to be contacted.
  • 22-Mar-12 08:11 - Andre Regelbrugge fixed the issue with tnsnames.ora file and Business Objects application was back to service operation.
  • 22-Mar-12 09:07 - Ivica reported an issue with XML order sending.
  • 22-Mar-12 09:27 - Giacomo Tenaglia fixed the issue with XML sending (aisproxy firewall configuration).
  • 22-Mar-12 16:28 - Artur Wiecek found the cause of the issue when running on 4 Tomcats and proposed to use only 2 containers out of 4 for now.
  • 23-Mar-12 11:00 - Artur Wiecek fixed a newly reported issue affecting Datasphere (but a wrong certificate was then used to sign the files).
  • 23-Mar-12 15:00 - Datasphere certificate issue fixed.
  • 23-Mar-12 15:00 - Core JVM issue understood, final configuration was agreed.
  • 23-Mar-12 19:30 - Artur Wiecek changed the configuration to run with 2 containers only.
  • 23-Mar-12 20:00 - Ivica confirmed that the new configuration works.

Analysis

  • Issue with WOS application not running on multiple Tomcats
    • Due to a misconfiguration, the load balancing mechanism was not able to distribute the sessions across the 4 Tomcats.
    • This issue was observed on the test environment, but it was not reported correctly by IT-DB and was not followed up.
    • Tests were running using only one Tomcat server (nobody noticed that the other 3 Tomcats were down).
    • Decision to go live with just one JVM (DB and AIS): according to the Qualiac consultant on the phone, the 64-bit JVM with 3.2 GB of memory could easily handle 50 simultaneous users, which was sufficient for the CERN scenario. The risk was accepted.
    • By Thursday lunchtime, the problem was fully understood. The fix was postponed until Friday evening.
    • Updating the proxy configuration fixed the issue.
    • Had everyone known that the load balancing issue was not fixed, the migration would have been postponed.
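
A pre-go-live check along the following lines could expose the situation described above, where tests pass while only one of the expected Tomcats is answering. This is a minimal sketch and not the procedure that was used: the idea of probing each backend with a plain TCP connect is an assumption, and the caller has to supply the real host names and ports, which are not given in this report.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def down_backends(backends, probe=is_listening):
    """Return the (host, port) pairs for which the probe fails."""
    return [(host, port) for host, port in backends if not probe(host, port)]


def go_no_go(backends, probe=is_listening):
    """Return ("GO", []) only if every expected backend answers the probe."""
    down = down_backends(backends, probe)
    return ("GO", down) if not down else ("NO-GO", down)
```

Calling go_no_go with the list of all 4 expected Tomcat endpoints returns "NO-GO" together with the endpoints that did not answer, which makes a partially running cluster visible before the GO decision. A TCP connect only shows that a port is open; it does not prove that sessions are actually distributed by the load balancer.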

  • Issue with tnsnames.ora file
    • The propagation script finished with errors due to locks on the file.
    • The file could be edited manually on Thursday morning.
    • First occurrence of this problem. In the past we also had problems distributing the tnsnames.ora file, but they were completely unrelated.
    • The DFS version of the tnsnames.ora file is for Windows/NICE clients, like Business Objects and the Readsoft invoice scanning software. The Qualiac software relies on the tnsnames.ora files which are propagated to AFS and to our machines, and does not depend on the DFS version of tnsnames.ora.
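
Since several copies of tnsnames.ora exist (DFS, AFS, local machines), a quick comparison of the net service names they define can show which copy is missing the new entries. The sketch below is illustrative only: it assumes that each alias starts in column 0 and that nested parameters are indented or start with a parenthesis, and the function names are hypothetical.

```python
import re

# Assumption: a net service name starts in column 0 and is followed by "=".
# Indented lines and lines starting with "(" or "#" are not matched.
ALIAS_RE = re.compile(r"^([\w.\-]+)\s*=", re.MULTILINE)


def aliases(text):
    """Return the set of alias names found in tnsnames.ora content (upper-cased)."""
    return {match.group(1).upper() for match in ALIAS_RE.finditer(text)}


def missing_aliases(reference_text, copy_text):
    """Return the aliases defined in the reference file but absent from the copy."""
    return sorted(aliases(reference_text) - aliases(copy_text))
```

Reading the reference file and the DFS copy and passing both contents to missing_aliases gives the list of entries that were not propagated. This only compares alias names, not the connect descriptors behind them.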

  • Issue with Datasphere
    • The new script for Datasphere uses the old machine as a transfer agent. The effort required to test it was judged higher than the risk justified.
    • There were 2 problems: first, a simple syntax error, which was fixed on Thursday morning; second, a typo in the certificate name, which was fixed on Friday once it was reported.

  • Issue with XML order sending
    • The functionality was tested and the need for an additional firewall rule was identified. However, due to a misunderstanding in IT-DB, the firewall configuration on aisproxy was not updated to allow machines from the new network (10.18.6.6/255.255.255.0) to connect to the proxy service. Updating this firewall configuration fixed the issue.
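
The address/mask pair quoted above can be checked against a list of allowed networks before the migration. The following is a small sketch using Python's standard ipaddress module; the is_allowed helper and the idea of keeping the firewall's allowed networks in a list are assumptions made for illustration, not a description of how aisproxy is configured.

```python
import ipaddress

# Address and mask exactly as quoted in the report. strict=False masks off
# the host bits, giving the enclosing network.
NEW_NETWORK = ipaddress.ip_network("10.18.6.6/255.255.255.0", strict=False)


def is_allowed(client_ip, allowed_networks):
    """Return True if client_ip belongs to at least one allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowed_networks)
```

With this, is_allowed("10.18.6.42", [NEW_NETWORK]) is True, while the same address checked against a list that does not contain the new network is rejected, which corresponds to the failure seen on Thursday morning.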

Follow up

  • Considering the complexity of the migration, an overall plan would have helped to identify the potential issues and to coordinate the preparation work, such as setting up the test system in advance and agreeing on the migration date, the go/no-go criteria and the maximum duration.
  • Improve communication between the people involved and the follow-up on issues, for example by using JIRA to log all steps, issues and solutions during the testing phase.
  • Whenever a major intervention is discussed, the people involved from AIS should be invited to the coordination meetings and participate in them, in order to facilitate planning and coordination.

  • Issue with tnsnames.ora file
    • Investigate with IT/OIS why the lock was present on the file and why it was released between Wednesday evening and Thursday morning. (IT-DB)
    • Find out whether a workaround is possible if this happens again in the future.
    • List this as a possible issue in future migrations and take it into account in the GO/NOGO decision or in an alternate plan.
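
One possible workaround to evaluate is to retry the copy for a limited time when the target is locked. The sketch below is an assumption about how such a retry could look and is not the existing propagation script: it supposes that the lock surfaces as an OSError (for example PermissionError) from the copy call, and the function name and parameters are hypothetical.

```python
import shutil
import time


def copy_with_retry(src, dst, attempts=5, delay=1.0,
                    copy=shutil.copyfile, sleep=time.sleep):
    """Copy src to dst, retrying on OSError.

    Returns the attempt number on which the copy succeeded and re-raises
    the last error if all attempts fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            copy(src, dst)
            return attempt
        except OSError as exc:  # assumed to cover a locked target file
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

A retry only helps if the lock is short-lived. In this incident the lock persisted from Wednesday evening until Thursday morning, so a failed retry should still be reported clearly and feed into the GO/NOGO decision.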

  • Issue with Datasphere
    • Currently, payments are transferred using the old Qualiac machine. The transfer has to be moved to a new location (like aisproxy) with Internet connectivity. (IT-DB)
Topic revision: r10 - 2012-04-27 - EricGrancher
 