WARNING: This web is not used anymore. Please use PDBService.MinuteS30October08 instead!
 

Minutes 30 October 08

  • Phone:
    • Gridka: Andreas
    • RAL: Carmine
    • BNL: Carlos
    • CNAF: Barbara
    • SARA: Alexander
    • TRIUMF: Andrew
    • IN2P3: Jean René, Osman Oaidel

  • CERN:
    • 3D: Maria, Jacek, Lisa, Luca, Eva
    • ATLAS: Florbela
    • CASTOR: Dirk, Eric Grancher

  • Unable to join today - reports sent by email:
    • NDGF (Olli): problems with the SAN system. An urgent intervention is under way to fix them.

  • Sites status:
    • SARA: network failure, database unavailable for 40 min.
    • CNAF: intervention to change a parameter next Monday. Next week they will configure new storage and set up new RACs; test results will be published.
      • Castor: a job is failing every hour; application and users are not affected. Under investigation.
    • RAL: two interventions: the first to set the memory parameter related to Streams, the second to increase the number of sessions (from 150 to 300).
      • Castor: upgrade to 10.2.0.4. Many problems have been fixed. One problem remains with a process trying to insert data. This behavior affects only one of the running sessions. Resetting the session fixes the problem. RAL is the only site observing this problem. CERN has tried to reproduce it, without success.
      • FTS and LFC database for ATLAS: Oracle recommended installing one patch. Carmine will send an email with the patch number so we can have a look.
    • TRIUMF: nothing to report – no interventions scheduled
    • IN2P3: network failure last Monday linked to a power cut. The CPU patch was applied; a problem was found but its cause is unknown (local LFC database). For the LHCb and ATLAS databases, there is no plan yet.
      • Will attend the workshop only on the 12th, because the 11th is a holiday in France.
    • BNL: one of the nodes crashed last week because of an ASM instance problem. The instance could not be restarted, so the node was rebooted to fix the problem. Oracle identified it as a bug, but there is no patch for the BNL architecture yet (assigned to development).
      • No news from the SR on the apply problem (apply process getting stuck).
      • Reconstruction job results:
        • A user from a Tier-2 or Tier-3 site asked why his jobs were running slowly and whether the database was slow. He was opening 8 sessions: 1 session was constantly communicating between database and client, while the others were querying the database. Data was not being retrieved quickly from the database over the network. Carlos suggested some network parameters to improve the rate. This showed that the database is not the problem over long network distances.
    • Gridka: full downtime for the ATLAS and LHCb databases; physical storage moved successfully.
    • CERN:
      • Problem with the Streams replication for ATLAS during the unavailability of NDGF. The NDGF database was in an unresponsive state: the propagation job could not report any problem, but LCRs were not consumed. The queue filled up during the weekend, causing the whole replication system to get stuck one day later due to lack of memory.
        • Running memory tests to identify how the Streams memory is used by the spilled LCRs; consumption increases linearly with the number of spilled LCRs in the queue.
        • We have allocated a new node to be added to the downstream cluster. With 4 nodes, we will move the downstream databases to run on separate nodes, which will allow us to add more memory to the Streams pool.
      • We have received a new patch to fix the ORA-600 error when dropping the propagation job, and it has been tested successfully. We will apply it to the production databases during the next interventions.

  • ATLAS report:
    • Replication from online to offline: one application without primary keys caused problems with the replication. Apply performance was completely killed by the resulting full table scans. ATLAS has implemented a procedure to check that all the schemas have primary keys.
    • ATLAS_SFO_T0 schema still pending replication.
    • AMI replication, status:
      • Will contact Eva to ask some questions.
      • Will present the plans/report on this during the workshop.
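
The primary-key check mentioned in the ATLAS report could be sketched roughly as follows. This is an illustrative sketch only, not the procedure ATLAS implemented. It assumes the table and constraint lists have already been fetched, for example from the Oracle dictionary views ALL_TABLES and ALL_CONSTRAINTS (where CONSTRAINT_TYPE = 'P' marks a primary key). The function name, schema name and table names are hypothetical.

```python
from typing import Iterable, List, Tuple

# (owner, table_name), e.g. rows selected from ALL_TABLES
Table = Tuple[str, str]
# (owner, table_name, constraint_type), e.g. rows selected from ALL_CONSTRAINTS
Constraint = Tuple[str, str, str]


def tables_without_primary_key(
    tables: Iterable[Table], constraints: Iterable[Constraint]
) -> List[Table]:
    """Return the tables that have no primary-key constraint, sorted."""
    # Collect every table that owns a constraint of type 'P' (primary key).
    with_pk = {(owner, name) for owner, name, ctype in constraints if ctype == "P"}
    return sorted(t for t in set(tables) if t not in with_pk)


# Hypothetical sample data standing in for the dictionary-view query results.
sample_tables = [("ATLAS_DEMO", "RUNS"), ("ATLAS_DEMO", "EVENTS_LOG")]
sample_constraints = [
    ("ATLAS_DEMO", "RUNS", "P"),        # primary key present
    ("ATLAS_DEMO", "EVENTS_LOG", "C"),  # check constraint only, no primary key
]

missing = tables_without_primary_key(sample_tables, sample_constraints)
for owner, name in missing:
    print(f"{owner}.{name} has no primary key")
```

Running such a check before a schema is added to the replication setup would flag tables for which the apply process might otherwise fall back to full table scans.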

  • CMS report:
    • Request to replicate data from the offline to the online. Tests failed because of missing primary keys.

  • Highlights from Oracle Open World for Streams

  • Agenda for the workshop
    • Report on reconstruction jobs to be added
    • Florbela will talk about TAGS together with Gancho (hw requirements)

Topic revision: r1 - 2008-11-03 - EvaDafonte