Minutes of the WLCG Tier1 Service Coordination Meeting of 26 August 2010

Participation

  • Local: Patricia, Peter Love, MariaDZ, Andrea Sciaba, Massimo, Jamie, Roberto, Marie-Christine, Nicolo, Dirk, Andrea Valassi, Alexander, Dario, Eric, Manuel, Julia, Kors, Alexander, Massimo, Maria Alandes, Pablo, Jacek, Luca
  • Remote: Rolf, Tiju, Michael, Catalin, Stefano, Angela, Barbara & Alessandro, Andrew (TRIUMF), Patrick, Massimo Sgaravatto, Felix Lee (ASGC), Carmine, BNL, Andrew, Ron, Pepe, Jon, Tiziana, Shaun, Alexander Verkooijen

Release update

Status of open GGUS tickets

Michael (BNL) stressed the need to better understand where to assign tickets that turn out to be network issues. MariaDZ will discuss with the Tier0 network team what a workable GGUS work-flow would look like (ticket GGUS:61440 could not be assigned to the LHCOPN).

End-to-end GGUS alarm reliability

GGUS developer Guenter Grein collected data from the Tier0 and the Tier1s on ALARM ticket mail handling and will present it at the next T1SCM on September 23. Details in https://savannah.cern.ch/support/?116430

Review of recent / open SIRs

  • SIR received for CERN vault cooling issues
  • SIR being prepared for LHCb online DB problem after power cut
  • SIR requested for LFC/FTS DB problem at NL-T1

Scheduled / Unscheduled downtime handling - update

Data Management & Other Tier1 Service Issues

Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.9-5 (all); SRM 2.9-3 (all) | - | 1/9: upgrade of the CASTOR nameserver to 2.1.9-8 to reduce the load on the nameserver DB; "at risk" for 4 hours from 09:00 CEST
ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tape server); SRM 2.8-2 | 23/8: unscheduled power outage | 31/8: scheduled downtime for network construction and server BIOS upgrades
BNL | dCache 1.9.4-3 (PNFS) | none | none
CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE) | - | -
FNAL | dCache 1.9.5-10 (admin nodes, PNFS); dCache 1.9.5-12 (pool nodes) | none (most components have been running since last December) | investigating the latest 1.9.5 release for a possible upgrade on Nov 2
IN2P3 | dCache 1.9.5-11 (Chimera) | - | -
KIT | dCache 1.9.5-15 (admin nodes, Chimera); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | none | none
NDGF | dCache 1.9.7 (head nodes, Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | none | none
NL-T1 | dCache 1.9.5-19 (Chimera) at SARA; DPM 1.7.3 at NIKHEF | none | none
PIC | dCache 1.9.5-21 (PNFS) | none | restart of some disk servers during the next scheduled downtime (by September, date to be decided) to apply a network patch for the X4500 servers; will take about 30 minutes
RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on the SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | none | upgrade to CASTOR 2.1.9-6; tentative dates: 27/9 LHCb, 25/10 GEN (ALICE), 8/11 ATLAS, 22/11 CMS; subject to successful VO testing and VO approval
TRIUMF | dCache 1.9.5-17 with Chimera namespace | - | 1/9: dCache upgrade to 1.9.5-21

CASTOR news

xrootd news

dCache news

A new golden release update (1.9.5-222) was released yesterday with minor bug fixes.

StoRM news

DPM news

LFC news

LFC deployment
Site | Version | OS | Backend | Upgrade plans
ASGC | 1.7.2 | SLC4 | Oracle | -
BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in September
CERN | - | - | - | -
CNAF | - | - | - | -
FNAL | N/A | - | - | not deployed at Fermilab
IN2P3 | - | - | - | -
KIT | - | - | - | -
NDGF | - | - | - | -
NL-T1 | - | - | - | -
PIC | 1.7.4-7 | SL5 64-bit | Oracle | LFC_mysql of a non-LHC project still on SLC4; to be upgraded to SL5 64-bit in a few weeks
RAL | 1.7.3 | SLC4 | Oracle | upgrade by September to 1.7.4 on SLC5; possibly a mixed frontend for 1-2 weeks
TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | -

FTS news

Conditions Data Access and related services

COOL, CORAL and POOL

  • Followed up (thanks to help from IT-DB and the ATLAS DBAs) on the problems observed in June on the ATLAS and LHCb servers (server-side ORA-07445 process crashes) after applying the Oracle June PSU.
    • A COOL test suite was prepared to try to reproduce the error, based on one COOL test that had been observed to fail 15 times in June and July on the database used for the COOL nightly tests. Using this test suite, the ORA-07445 error could be reproduced several times per hour on the ATLAS integration RAC.
    • By switching connection sharing on and off in the COOL test suite, the problem was shown to be present only if client sessions share one physical connection to the Oracle server (a sketch of the two modes follows this list). The ORA-07445 problem, however, is not specific to COOL: it was also observed a few times in other tests executed by ATLAS (e.g. using the TAGS application).
    • The COOL-based test suite was then used to validate possible fixes. It was successfully shown that patch 6196748 solves the Oracle bug, whereas the July PSU alone does not fix it.
    • A summary of the COOL tests for the ORA-07445 problem was presented at the ATLAS database meeting on August 23.
    • The COOL-based test suite was then repackaged and documented so that it can be used by the physics database team in IT-DB to test similar issues in the future.
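
For reference, the connection-sharing switch used in these tests is a standard option of the CORAL connection service. Below is a minimal sketch, assuming the public CORAL C++ API; the connection alias "test_db" is a placeholder and this is not the actual test suite code:

    // Minimal sketch (not the actual COOL test suite): toggling CORAL
    // connection sharing, the switch used to isolate the ORA-07445 crash.
    #include "RelationalAccess/AccessMode.h"
    #include "RelationalAccess/ConnectionService.h"
    #include "RelationalAccess/IConnectionServiceConfiguration.h"
    #include "RelationalAccess/ISessionProxy.h"
    #include <memory>

    int main()
    {
      coral::ConnectionService connSvc;

      // Shared mode: several logical sessions multiplex one physical
      // Oracle connection -- the configuration under which the
      // server-side ORA-07445 crash was reproducible.
      connSvc.configuration().enableConnectionSharing();
      // Unshared mode (one physical connection per session), under
      // which the crash was not observed:
      // connSvc.configuration().disableConnectionSharing();

      // "test_db" is a placeholder alias, resolved via the usual CORAL
      // lookup/authentication XML files.
      std::auto_ptr<coral::ISessionProxy> s1( connSvc.connect( "test_db", coral::ReadOnly ) );
      std::auto_ptr<coral::ISessionProxy> s2( connSvc.connect( "test_db", coral::ReadOnly ) );
      // ... run concurrent queries through s1 and s2 to stress the
      // shared physical connection ...
      return 0;
    }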

  • Followed up the gssapi issues in Globus (incompatibility between the two implementations of gssapi symbols in the Globus and system libraries).
    • After an emergency release of CORAL for ATLAS and LHCb, the upgrade to Globus 4.2, which uses 'versioned symbols' for gssapi, had been proposed by the CORAL team as a possible long-term fix for the problem.
    • New builds of Globus 4.0.8 and of the Grid middleware, including a backport of the Globus 4.2 gssapi 'versioned symbols' patch, have been received from IT-GT (thanks to Oliver Keeble). These new versions have been successfully tested against CORAL, showing that they provide a valid solution to the problem (an illustration of symbol versioning follows this list).
    • The timescales for the deployment of these new software versions are now being discussed by the CORAL team with the experiments, the SPI team in PH and IT-GT.
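
As an illustration of what 'versioned symbols' mean here (a minimal sketch, not the actual Globus patch; the dummy function signature and the version tag GLOBUS_GSS_1 are invented for the example): a library can export its gssapi entry points under a private version tag, so that the dynamic linker no longer confuses them with identically named symbols from another library.

    // mygss.cpp -- stand-in for a library exporting a gssapi symbol;
    // the dummy signature is for illustration only.
    extern "C" int gss_display_status() { return 0; }

    // Version script "mygss.map" passed to the linker
    // (hypothetical version tag GLOBUS_GSS_1):
    //
    //   GLOBUS_GSS_1 { global: gss_display_status; local: *; };
    //
    // Build:
    //   g++ -shared -fPIC mygss.cpp -Wl,--version-script=mygss.map -o libmygss.so
    //
    // "nm -D libmygss.so" then shows gss_display_status@@GLOBUS_GSS_1:
    // clients linked against this library bind to the versioned symbol
    // and can no longer silently resolve to an unversioned
    // implementation of the same name from another library.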

  • While investigating the gssapi issue in Globus, a problem in the Oracle client library was identified: Oracle also provides its own implementation of the gssapi symbols (the third, after those of Globus and of the system/MIT libraries).
    • The problem has been reported to Oracle as a Service Request. It has been suggested to Oracle that implementing gssapi 'versioned symbols' in the Oracle client library could be a solution to the problem, as this was shown to successfully solve the same issue in Globus.
    • This bug has not been reported to cause any issue so far, but tests showed that it could lead to problems in client applications, depending on the order in which libraries are loaded at runtime (a probe sketch follows this list).
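
A hypothetical probe (the library names are placeholders assuming an Oracle client and the system/MIT gssapi are both installed; this is not a tool discussed at the meeting) showing the load-order dependence: whichever library exporting the symbol is loaded into the global scope first is the one all later lookups resolve to.

    // probe.cpp -- build with: g++ probe.cpp -ldl -o probe
    // (g++ on Linux defines _GNU_SOURCE, needed for RTLD_DEFAULT/dladdr)
    #include <dlfcn.h>
    #include <cstdio>

    int main()
    {
      // Swapping these two dlopen calls changes which implementation
      // of the gssapi symbol wins in the global symbol scope.
      dlopen( "libclntsh.so", RTLD_NOW | RTLD_GLOBAL );      // Oracle client
      dlopen( "libgssapi_krb5.so", RTLD_NOW | RTLD_GLOBAL ); // system/MIT

      // Ask the dynamic linker which library now provides the symbol.
      void* sym = dlsym( RTLD_DEFAULT, "gss_display_status" );
      Dl_info info;
      if ( sym && dladdr( sym, &info ) )
        std::printf( "gss_display_status resolved from %s\n", info.dli_fname );
      else
        std::printf( "symbol not found\n" );
      return 0;
    }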

  • A new LCGCMT_59a release is being prepared for ATLAS. Its main motivation is the upgrade to ROOT 5.26.00d, but it will also contain a large number of fixes and improvements across CORAL, COOL and POOL.
    • The versions of all Grid middleware packages will also be updated (on SLC5) to those from the latest gLite 3.2.8 UI, including python2.6 support. In the context of this change, the voms clients will be upgraded to version 1.9.17-1, which also contains some functionality enhancements needed for certificate authentication and authorization in the CORAL server.

  • CMS is also preparing a new software release, which will include the upgrade to the current CORAL_2_3_10. The new CMS release will no longer contain a dependency on POOL, which will be replaced by a custom CMS component derived from the POOL relational packages.

Experiment Database Service Issues

  • Summary of the PSU discussion:
    • At CERN neither the April nor the July PSU can be applied, as the issue with a possibly non-rolling application is not fixed (an SR is open with Oracle and testing is still in progress).
    • We recommend that Tier1 sites use the same Oracle version as the Tier0; our recommendation is therefore that the July PSU be put on hold for the moment.

  • Experiment reports:
    • ALICE:
      • A transient hardware issue caused one storage array to go out of service temporarily; DB storage redundancy was rebuilt on the surviving nodes.
      • ALICE plans a replacement of the current DB storage arrays, as their warranty is expiring.
    • ATLAS:
      • The previously reported issue of the April (and July) PSU with the conditions workload (high load and ORA-07445) has been fully reproduced with the help of the Persistency team and ATLAS. Oracle patch 6196748 (on top of the April and/or July PSU) fixes the problem.
      • ATLAS offline, and in particular the ATLAS DQ2 workload, is affected by sporadic reboots. Extra monitoring has been deployed recently to help understand the problem. High memory usage from some DQ2 activities has been identified as a possible trigger of heavy paging activity, which makes server nodes unresponsive until they are eventually evicted from the cluster. Work is in progress with the ATLAS DQ2 developers and the DBAs.
      • An SR is open with Oracle on a few occurrences of a possible bug that has blocked standby redo apply in recent months.
      • The ATLAS online PVSS archiver clients had problems reconnecting automatically after a node reboot (they had to be restarted by the ATLAS expert on shift). This is being investigated together with ATLAS and EN-ICE; tuning of the ATLAS PVSS client configuration is being considered (in particular changing the client keep-alive timeout to 60 seconds).
      • ATLAS PVSS reports sporadic spikes of load related to commit time (log file sync) on the DB, occurring about once per week and lasting 1-2 minutes. During such an interval the PVSS clients buffer to disk, writing to the DB once the load is back to normal. Following an investigation with the ATLAS PVSS experts and the DBAs, the proposed change is to split the PVSS load across nodes 2 and 3 of the cluster (currently PVSS runs on node 3 during normal operations).
    • CMS:
      • The CMS offline database is still affected by occasional reboots of individual nodes; the root cause is still not clear. Extra monitoring has been deployed recently to help understand the problem.
      • Over the last few weeks there were a few problems with streaming from online to offline. Most of these were caused by user mistakes and are well understood. However, last Friday, following a node reboot on the offline database, the propagation got unexpectedly stuck; the root cause of this incident is still not clear.
      • During the technical stop next week the physical memory of the servers hosting the CMS online database will be upgraded; the exact date and time are still being discussed with CMS. The intervention will require a few hours of full database downtime.
    • LHCb:
      • Two power cuts at the pit (on different dates) brought down the online DB server. In the first instance (see the post mortem) the LHCb online DB got corrupted and a switchover to the standby was necessary; the issue with the storage BBUs has since been fixed.

  • Site reports
Site | Status, recent changes, incidents | Planned interventions
TRIUMF | Migrated the ATLAS 3D Oracle RAC to RHEL 5 and to new SAS drives (previously on SATA drives). | Planned outage on Wed Sep 1 to re-configure ASM multipathing.
NDGF | none | July PSU to be discussed
RAL | none | July PSU to be discussed
BNL | none | Patch 6196748 (the ORA-07445 fix for ATLAS conditions); July 2010 PSU
SARA | DB outage and backup corruption | -
GRIDKA | none | July PSU to be discussed
IN2P3 | On 4 August added 2 new schemas to the AMI Streams configuration | none
ASGC | Power surge on 24/8; the DBs were recovered within half an hour. The RAC testbeds were patched to 10.2.0.4, then to the April 2010 PSU, then to the July 2010 PSU. | Scheduled downtime 2010-08-31 01:00-10:00 UTC for data-centre network reconfiguration, enabling an additional 10 Gb switch and upgrading the BIOS on most servers.
CNAF | Aug 9: scheduled FTS downtime in order to enable the FTS Oracle service on a single node (out of three), with the other two cluster nodes defined as "available". Aug 17: transparent intervention on the LHCb cluster (LFC and conditions DB), replacing the motherboard of one cluster node to solve the not yet understood network problems. | An intervention on the network switch hosting the private connections of half of the Oracle servers still has to be scheduled (no precise date yet; within the next 15 days); it will be transparent for users, but the services will run on half of their hardware resources for a few minutes. An LFC upgrade (both ATLAS and LHCb) to version 1.7.4-7 will be scheduled within the next two weeks; it will be a transparent intervention.

Additional notes from CNAF:
Summary of the recent LHCb cluster problems:

The first network problem appeared on June 3 (ticket 58770), then on June 28 (ticket 59486), on July 25 (ticket 60458) and on August 6 (Streams propagation failure). The problem appeared as a network connection freeze: even though, from the operating system's point of view, the Ethernet interface seemed up and running, no bits were sent or received by the host. After various investigations, also involving the network switch and the operating system, we decided to replace the motherboard of the faulty server (on August 17). Since the hardware intervention the problem has not recurred.

Dates & topics for future meetings

  • The Site Status Board - Feedback from Sites
  • Prolonged site & service downtimes - strategies
  • Testing of Oracle patches - conclusions of ATLAS/IT-DB meeting

AOB

-- JamieShiers - 25-Aug-2010
