
Report on the upgrade of the Oracle physics databases (T0 and T1) to Oracle 11gR2

Upgrades at T0

Databases

  • ATLDSC, LHCBDSC, ATLARC, ATONR, ATLR, ADCR, CMSARC, CMSR, CMSONR, LHCBR, LHCBONR, LCGR, ALIONR, PDBR, COMPR databases upgraded successfully.
  • Each upgrade required about 4 hours of downtime. Hardware migration was also performed, using Oracle Data Guard (a switchover sketch follows this list). No issues were found.
  • The LCGR upgrade took 15 minutes more than scheduled due to a misconfiguration of the SCAN listeners.
  • After the LCGR migration to new hardware, a problem with the eth2 interface was found on node 2. The node was taken down for further investigation and brought back on 16 February.
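
The hardware migrations mentioned above relied on Oracle Data Guard. As a rough, hypothetical sketch (the exact procedure and names used are not detailed in this report), a switchover to a physical standby built on the new hardware looks roughly like this in SQL*Plus:

-- On the current primary: check readiness, then convert it to a standby.
SELECT switchover_status FROM v$database;   -- expect TO STANDBY or SESSIONS ACTIVE
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;

-- On the standby running on the new hardware: take over the primary role.
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
ALTER DATABASE OPEN;

-- On the old primary (now a standby): resume redo apply until it is decommissioned.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;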

Timeline

ATLARC 24-11-2011
ATLDSC and LHCBDSC 14-12-2011
ATONR (ATLAS online db) 11-01-2012
ATLR (ATLAS offline db) 17-01-2012
ADCR (ATLAS offline db) 17-01-2012
CMSARC 17-01-2012
LHCBR (LHCb offline db) 24-01-2012
LCGR 31-01-2012
CMSR 01-02-2012
CMSONR 02-02-2012
LHCBONR (LHCb online db) 06-02-2012
PDBR 08-02-2012
ALIONR (ALICE online db) 08-02-2012
COMPR 27-02-2012

Streams environments

Problems identified during testing phase

  • DDL failing in cascade Streams replication from 10g to 11g.
    • problem reported to Oracle (SR# 3-3378025701) and fixed by patch 12593103, which is included in 11.2.0.3
  • Redo logs from the primary database are not registered on the downstream capture database and cannot be removed by RMAN
    • according to Oracle Support (SR# 3-4893515191) this is a feature and the customer should implement its own solution for redo removal
    • cron jobs were implemented to keep a reasonable redo window on the downstream capture databases
  • A single capture process (and source queue) cannot serve 10g and 11g targets in parallel
    • reported to Oracle (#3-4497346601)
    • WORKAROUND: separation of 10g and 11g targets by creating a new capture process and source queue dedicated to the 11g targets
  • 11g propagation processes randomly stop sending messages to 10g targets
    • reported to Oracle (#3-4497346601)
    • WORKAROUND: implementation of a job that restarts the propagation processes (a sketch follows this list)
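
The propagation-restarting workaround can be implemented as a scheduler job. A minimal, hypothetical sketch (job name, schedule and restart condition are assumptions; the actual job used is not reproduced in this report):

-- Hypothetical scheduler job that bounces Streams propagations that are no longer ENABLED.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'STRMADMIN.RESTART_PROPAGATIONS',    -- placeholder name
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[
      BEGIN
        FOR p IN (SELECT propagation_name
                    FROM dba_propagation
                   WHERE status <> 'ENABLED') LOOP
          DBMS_PROPAGATION_ADM.STOP_PROPAGATION(p.propagation_name, force => TRUE);
          DBMS_PROPAGATION_ADM.START_PROPAGATION(p.propagation_name);
        END LOOP;
      END;]',
    repeat_interval => 'FREQ=MINUTELY; INTERVAL=10',         -- arbitrary 10-minute check
    enabled         => TRUE,
    comments        => 'Restart propagations that stopped sending to 10g targets');
END;
/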

Timeline

Date Action
14-Dec-2011 Upgrade and migration of downstream capture databases (plus implementation of workarounds for all identified problems)
11-Jan-2012 - 17-Jan-2012 ATLAS online-offline upgrade
18-Jan-2012 - now ATLAS T0 - T1s upgrade: RAL 18-Jan-2012, TRIUMF 23-Jan-2012, KIT 24-Jan-2012, BNL 22-Feb-2012
24-Jan-2012 - 06-Feb-2012 LHCB online-offline upgrade
30-Jan-2012 - now LHCB T0 - T1s upgrade: PIC 30-Jan-2012, CNAF 31-Jan-2012, RAL 31-Jan-2012, SARA 01-Feb-2012, KIT 31-Jan-2012
01-Feb-2012 - 02-Feb-2012 CMS online - offline upgrade
08-Feb-2012 ALICE online - offline upgrade
08-Feb-2012 - 27-Feb-2012 COMPASS online - offline upgrade

Problems faced during the upgrades

  • The capture process has difficulties (and in most cases crashes) while mining redo logs generated during the database upgrade.
    • observed on all source databases
    • SOLUTION: recreation of the capture process to start just after execution of the upgrade scripts - no data are lost, as the upgrade is done in restricted mode and users cannot access the database to modify the data
  • The capture process aborts because unsupported changes were made in one of the replicated schemas (ORA-26767), and/or the apply process fails with ORA-26714 and ORA-00942
    • the problem was observed after all source database upgrades (starting from ATONR)
    • a new feature of the segment advisor in 11g creates temporary compressed tables that are not supported by Streams
    • SOLUTION [see metalink note 1082323.1]: additional filter rules were added to the capture processes to prevent changes on those tables (the temporary tables have fixed names: DBMS_TABCOMP_TEMP_CMP and DBMS_TABCOMP_TEMP_UNCMP) from being captured (a sketch follows this list)
  • Changes on compressed tables are not replicated when the source database is 11.x.x and the compatible parameter is 10.x.x (which is the case just after the upgrade)
    • observed only during the ATLAS upgrade: a few replicated schemas had old compressed partitions on a few tables. Changes on those tables stopped being replicated once ATONR was upgraded (while the compatible parameter stayed unchanged).
    • SOLUTION: the old partitions were dropped from ATONR (copies of them are on ATLR) and the tables were reinstantiated. The compatible parameter was increased to 11.2.x a few weeks after the database upgrades.
  • The LogMiner/capture process crashes with an ORA-600 error when it attempts to read archived logs generated by a database with a different compatible version
    • observed during the upgrade of ATLR on the downstream capture (17-Jan) and CMSONR (02-Feb)
  • No redo continuity is provided for standby and downstream databases when the primary database is failed over.
    • Most likely it is a bug, as it is fully reproducible. It is being followed up with Oracle Support (SR# 3-5370026131).
    • WORKAROUND: recreation of the downstream capture process with a start SCN after the failover. This can lead to loss of transactions on the replicas if the old capture process was not up to date with capturing data changes from the old primary database.
    • observed during the upgrade of LHCBR (24-Jan). The capture process could not start as one log was missing. After recreation of the capture process, all LHCb conditions T1 replicas aborted (18:35) as a few transactions were lost. The missing data was resynchronized manually using the source database at 21:15.
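
The filter rules against the segment advisor temporary tables can be added as negative (exclusion) table rules on the capture side. A minimal, hypothetical sketch along the lines of metalink note 1082323.1 (the schema, capture and queue names below are placeholders; see the note for the exact recipe):

-- Exclude DML/DDL on the segment advisor temporary tables from capture
-- by placing table rules in the capture process's negative rule set.
BEGIN
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name     => 'APP_SCHEMA.DBMS_TABCOMP_TEMP_CMP',   -- placeholder schema
    streams_type   => 'capture',
    streams_name   => 'STREAMS_CAPTURE',                    -- placeholder capture name
    queue_name     => 'STRMADMIN.STREAMS_QUEUE',            -- placeholder queue
    include_dml    => TRUE,
    include_ddl    => TRUE,
    inclusion_rule => FALSE);  -- FALSE puts the rules into the negative rule set

  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name     => 'APP_SCHEMA.DBMS_TABCOMP_TEMP_UNCMP',
    streams_type   => 'capture',
    streams_name   => 'STREAMS_CAPTURE',
    queue_name     => 'STRMADMIN.STREAMS_QUEUE',
    include_dml    => TRUE,
    include_ddl    => TRUE,
    inclusion_rule => FALSE);
END;
/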

Upgrades at T1

RAL

  • ATLAS database successfully upgraded on the 18th January.
  • The intervention over-ran for a few reasons:
    • the backup of the database over-ran (block change tracking was not enabled, and the incremental level 1 backup took more than 1 hour of the time allocated for the intervention).
    • the pre-check was run on the day of the intervention: it found that some objects were invalid, and it took ~1 hour to recompile/rebuild/drop them.
    • cluvfy (the cluster verification tool) was also run on the day of the intervention: it found some problems, for example with the Voting Disk configuration (one voting disk was on NFS and cluvfy didn't like it, so Carmine had to drop it).
  • Unfortunately these problems were not found during testing because RAL's test hardware configuration differs from the production one (so the Voting Disk problem could not be spotted) and the test database is much smaller than the OGMA one, so the backup was faster and there were no invalid objects.
  • Lesson learned: run the checks well in advance! (A sketch of such pre-checks follows this list.)
  • LHCb database upgrade went smoothly.
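
A hypothetical sketch of the kind of checks that, per the lesson above, should be run well before the intervention (the file location is a placeholder, and the exact RAL checklist is not part of this report):

-- Enable block change tracking so incremental level 1 backups read only changed blocks.
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+DATA/change_tracking.chg';

-- List invalid objects days in advance ...
SELECT owner, object_type, object_name
  FROM dba_objects
 WHERE status = 'INVALID'
 ORDER BY owner, object_type, object_name;

-- ... and recompile them (then rebuild or drop whatever stays invalid).
@?/rdbms/admin/utlrp.sql

-- The clusterware side (e.g. the voting disk layout) is checked by running cluvfy
-- against the production configuration ahead of the intervention, not on the day itself.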

TRIUMF

  • ATLAS database successfully upgraded on the 23rd January.
  • The database upgrade had to be restarted because the shared_pool_size and java_pool_size parameters were not set to the minimum required values. This delayed the upgrade by 15 minutes. (A sketch of the parameter change follows this list.)
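
For reference, the kind of change involved is sketched below; the values are placeholders, and the 11.2 pre-upgrade information tool (utlu112i.sql) reports the minimums actually required for a given database:

ALTER SYSTEM SET shared_pool_size = 600M SCOPE = SPFILE;   -- placeholder value
ALTER SYSTEM SET java_pool_size   = 128M SCOPE = SPFILE;   -- placeholder value
-- Restart the instance before (re-)running the upgrade so the new values take effect.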

KIT

  • ATLAS database successfully upgraded on the 24th January.
  • The intervention took longer than scheduled due to some firewall problems.
  • LHCb databases were migrated and upgraded on the 1st February. Data Pump export and import was used instead of the Data Guard procedure recommended by CERN (a Data Pump sketch follows this list).
  • The intervention took 12 hours to complete:
    • Data Pump requires more downtime than Data Guard.
    • Configuration issues were found with the new hardware.
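
As a rough, hypothetical sketch of the Data Pump approach (schema, directory and file names below are placeholders, not KIT's actual ones; the import on the new hardware is analogous with operation => 'IMPORT'):

-- Schema-mode Data Pump export using the DBMS_DATAPUMP PL/SQL API.
DECLARE
  h     NUMBER;
  state VARCHAR2(30);
BEGIN
  h := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'SCHEMA');

  DBMS_DATAPUMP.ADD_FILE(
    handle    => h,
    filename  => 'lhcb_schemas.dmp',
    directory => 'DATA_PUMP_DIR',
    filetype  => DBMS_DATAPUMP.KU$_FILE_TYPE_DUMP_FILE);

  DBMS_DATAPUMP.ADD_FILE(
    handle    => h,
    filename  => 'lhcb_schemas.log',
    directory => 'DATA_PUMP_DIR',
    filetype  => DBMS_DATAPUMP.KU$_FILE_TYPE_LOG_FILE);

  -- Export only the application schemas to be moved (placeholder names).
  DBMS_DATAPUMP.METADATA_FILTER(
    handle => h,
    name   => 'SCHEMA_EXPR',
    value  => q'[IN ('LHCB_SCHEMA1', 'LHCB_SCHEMA2')]');

  DBMS_DATAPUMP.START_JOB(h);
  DBMS_DATAPUMP.WAIT_FOR_JOB(h, state);   -- blocks until the export completes
END;
/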

SARA

  • LHCb database successfully upgraded on the 1st February.
  • The upgrade of the clusterware (from 11.1 to 11.2.0.3) took about 4 hours.
  • The database upgrade itself took about 2 hours.
  • No problems were found, but the process was slower than expected.
  • CERN documentation was used as a guide. References to local CERN scripts generated some confusion.
  • Lesson learned: CERN documentation is provided only as a reference; site-specific documentation and steps must be prepared beforehand.

CNAF

  • LHCb and LFC databases successfully upgraded on the 31st January.
  • Progress was slow due to the lack of preparation (no test system) and a big snowstorm that made it difficult for Alessandro to reach the CNAF offices - the intervention took 26 hours in total (7 hours were scheduled).

PIC

  • LHCb database upgraded successfully on the 30th January.
  • The intervention took a bit longer due to some problems found during the post-installation steps while registering the database with the clusterware (a mistake in the instance configuration).
  • Elena also found some problems while deinstalling the old Oracle software. These were fixed using the note "RAC Instance Fails to Start With Error ORA-27504 [ID 1059831.1]".

BNL

  • ATLAS database successfully upgraded on the 22nd February. Hardware migration was also performed and the January 2012 CPU (Critical Patch Update) was applied, using Data Guard. No issues were found. The intervention took about 2 hours and 10 minutes.

IN2P3

  • Upgrades are still pending due to issues found with the Oracle installer and the database storage.
  • Bug 13731278 prevents the CRS upgrade from 10.2.0.5 to 11.2.0.3.
  • A Service Request (3-5225032577) has been opened and escalated. According to Oracle Support there is no way to get past the bug without a patch, which is being developed by Oracle (no timeline has been provided).

-- ZbigniewBaranowski - 29-Feb-2012
