WARNING: This web is not used anymore. Please use PDBService.MinuteS18September08 instead!
 

Minutes 18 September 08

  • Phone:
    • Gridka: Doris, Kamil
    • RAL: Carmine
    • BNL: Carlos
    • NDGF: Olli
    • CNAF: Barbara
    • PIC: Luis

  • CERN:
    • 3D: Dirk, Maria, Jacek, Lisa, Eva
    • ATLAS: Gancho, Florbela
    • CASTOR: Eric, Nilo

  • Unable to join today (reports sent by email):
    • TRIUMF: 'Out of Sessions' errors during FDR2 testing. Too many sessions were being submitted, which overloaded the database and impacted Streams performance. No planned interventions.

  • Sites status:
    • NDGF (Olli):
      • During ATLAS stress tests, the single-instance database was overloaded (I/O limitation), which decreased the replication rate. Olli tried to implement consumer groups, but this caused Streams replication to get stuck. The number of ATLAS_COOL_READER sessions is now limited to 16 active sessions.
      • A new cluster is being approved. No final date yet.
    • RAL (Carmine): nothing to report
    • CNAF (Barbara):
      • Grid cluster upgraded to 10.2.0.4.
      • Internal problem with backups on the ATLAS and LHCB clusters. Upgrade to 10.2.0.4 postponed until October. The July CPU patch is still pending. CERN proposed doing the upgrade and applying the CPU patch as soon as possible.
    • PIC (Luis):
      • Problem with raw devices on the ATLAS cluster: the ASM disk group metadata was not updated correctly. Fixed with the help of Oracle support. Eric proposed asking support for the cause/bug behind this problem so that it can be communicated to the other sites.
      • Databases upgraded to 10.2.0.4 and July CPU patch applied.
    • GRIDKA (Doris):
      • During high ATLAS load, several problems related to COOL queries were observed. Fixed by increasing the undo retention.
      • LFC replication problem to Gridka. The propagation job was disabled due to the error "connection lost contact". Working on this problem together with Oracle support. No solution yet. Workaround: recreate the Gridka configuration, split from the main Streams setup.
      • CPU patch not applied – waiting for October patch
      • Storage intervention planned for next month - Doris will ask how long it will take. Split might be needed.
    • BNL (Carlos):
      • July CPU patch applied on the FTS database. Storage firmware upgraded. OS patches still to be applied (Luca sent an email with the information about the bug).
      • Plan to upgrade the agents to 10.2.0.4. Databases on version 10.2.0.4 require agents on version 10.2.0.4.
      • Last week a user reported problems with the COOL application: long database response times. Carlos has been running several SQL*Net tests to identify the problem.

  • ATLAS status:
    • Tier1 to Tier2 replication plans:
      • It is up to the Tier1 sites to decide. There is no similar configuration yet.
      • No Streams expertise at Tier1 sites.
      • Streams replication is a chain: if one point in the chain fails, the complete chain fails.
      • Do the Tier1 sites need downstream configurations? Only one Tier2 is to be added for now, but what if there are more in the coming months?
      • Also changes in the current configuration will be needed.
    • The online database will no longer be accessible from outside. Several accounts are being added to the replication so that they can be accessed from the offline database. Before outside access to the online database can be closed, several tests must be performed.
    • Tier1 site problems during ATLAS FDR stress tests:
      • Tests are run without notifying anyone: neither the ATLAS DBAs, nor the Tier1 site DBAs, nor Eva.
      • Tier1 databases are overloaded with ATLAS_COOL_READER sessions and Streams performance is impacted.
      • From the first checks, systems are I/O limited.
      • Requirements for the Tier1 sites covered only data volume. The stress tests are showing other limitations. Should new requirements be added?
      • Gancho and Florbela will notify Sasha that the test schedule must be circulated and that the Tier1 site DBAs must be involved.

  • Eva reminded the Tier1 sites that they must check the OEM configuration for their targets. Several agents are inaccessible and several databases are not configured, which makes OEM monitoring of little use. She has already sent an email.

  • CASTOR:
    • Barbara:
      • After the upgrade to 10.2.0.4, the agent upgrade was postponed and this caused a problem on the database. The agents are upgraded now.
      • Castor middleware upgrade: several sites reported problems after the upgrade when running more than one VO per cluster. Workaround proposed by Nilo: set the init parameter "_kks_use_mutex_pin" to false. Sites observing the problem should open an SR reporting it so that Oracle support can track the problem and produce the correct fix patch to be included in 10.2.0.4 and 10.2.0.5.
    • Carmine:
      • Getting ORA-600 errors. Oracle support suggested upgrading to 10.2.0.4. CNAF is already running on 10.2.0.4.
      • The patches identified and circulated by Nilo (for Oracle version 10.2.0.3) must also be identified for version 10.2.0.4.
    • Eva will collect a summary of the configurations at the Tier1 sites.

  • Next meeting on 9th October.

Topic attachments

  • cnaf_castor.rtf (r1, 1.2 K, 2008-09-19 17:15, EvaDafonte): Database configuration for castor at CNAF
  • ral_castor.rtf (r1, 6.4 K, 2008-09-19 17:15, EvaDafonte): Database configuration for castor at RAL
Topic revision: r2 - 2008-09-19 - EvaDafonte
 