WLCG Tier1 Service Coordination Minutes - March 31 2011


  • Local: Ian F, Oliver, Gavin, Stephane, Hans, Maria, Jamie, Maarten, Maite, Arne, Massimo, Stefan, Eva, Giuseppe, Andrea S, Luca, Alexei, Simone, Dario, Dirk, Nicolo
  • Remote: Michael, Gonzalo, Elena, Jon, Patrick, Rolf, Elizabeth, Carlos, Andrew, Felix, John de Stefano, Carmine, Jhen-Wei, Tiju, Christian, Pierre

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site | Status | Recent changes | Planned changes

  • CERN: CASTOR 2.1.10 (all); SRM 2.10-x; xrootd: ALICE 2.1.10 update 1, others 2.1.9-7. Recent: all SRM instances will be moved to 2.10-2 due to issues in 2.10-1. Planned: the CASTOR central services (VMGR, VDQM and Cupv) will be upgraded to version 2.1.10-1; the goal of the change is to enable support for tapes larger than 2 TB.
  • ASGC: CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2; DPM 1.8.0-1. Recent: 31/03: 3-hour downtime for work on the electrical system and construction. Planned: none.
  • BNL: dCache 1.9.5-23 (PNFS, Postgres 9). Recent: none. Planned: none.
  • CNAF: StoRM 1.5.6-3 SL4 (CMS, LHCb, ALICE); StoRM 1.6 SL5 (ATLAS). Recent: 29/03: upgraded the StoRM ATLAS end-point to 1.6, the first one on SL5. Planned: 1.6 SL5 deployment on the other end-points during the next LHC technical stop.
  • FNAL: dCache 1.9.5-23 (PNFS), httpd 1.9.5-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3. Recent: DNS round-robin SRM service deployed today. Planned: none.
  • IN2P3: dCache 1.9.5-24 (Chimera) on all core servers and pool nodes.
  • KIT: dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24.
  • NDGF: dCache 1.9.12.
  • NL-T1: dCache 1.9.5-23 (Chimera) (SARA); DPM 1.7.3 (NIKHEF).
  • PIC: dCache 1.9.5-25 (PNFS, Postgres 9). Recent: none. Planned: none.
  • RAL: CASTOR 2.1.10-0; 2.1.9-1 (tape servers); SRM 2.8-6. Recent: upgraded all instances to 2.1.10. Planned: none.
  • TRIUMF: dCache 1.9.5-21 with Chimera namespace. Recent: none. Planned: none.

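FNAL's move of the SRM service to a DNS round-robin alias spreads client connections across several front-end hosts: DNS rotates the order of the A records it returns, so successive clients land on different machines. A minimal sketch of the idea (the IP addresses are invented for illustration, not FNAL's actual setup):

```python
from itertools import cycle

# Hypothetical A records behind a single SRM alias; round-robin DNS
# rotates the answer order, so successive clients hit different hosts.
srm_frontends = ["131.225.0.11", "131.225.0.12", "131.225.0.13"]

rotation = cycle(srm_frontends)

def next_endpoint():
    """Return the next front end, emulating rotating DNS answers."""
    return next(rotation)

# Three consecutive lookups cycle through all front ends.
picks = [next_endpoint() for _ in range(3)]
print(picks)
```

The operational benefit is that a front end can be drained or replaced by editing the DNS entry, without clients changing the service name they contact.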

CERN operations


xrootd news

dCache news

StoRM news

FTS news

DPM news

  • DPM 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)
    • Edinburgh verified the memory leak is gone when those libraries are installed instead of the ones currently provided by the patch

LFC news

  • LFC 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)

LFC deployment

Site Version OS, n-bit Backend Upgrade plans
ASGC 1.7.4-7 SLC5 64-bit Oracle None
BNL 1.8.0-1 SL5, 64-bit Oracle None
CERN 1.7.3 SLC4 64-bit Oracle Upgrade to SLC5 64-bit pending
CNAF 1.7.4-7 SL5 64-bit Oracle  
FNAL N/A     Not deployed at Fermilab
IN2P3 1.8.0-1 SL5 64-bit Oracle 11g Oracle DB migrated to 11g on Feb. 8th
KIT 1.7.4 SL5 64-bit Oracle  
NDGF Ubuntu 9.10 64-bit MySQL None
NL-T1 1.7.4-7 CentOS5 64-bit Oracle  
PIC 1.7.4-7 SL5 64-bit Oracle  
RAL 1.7.4-7 SL5 64-bit Oracle  
TRIUMF 1.7.3-1 SL5 64-bit MySQL  

Experiment issues

WLCG Baseline Versions

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services

  • Experiment reports:
    • ALICE:
      • NTR
    • ATLAS:
      • New schemas have been added to the ATLAS conditions Streams replication setup on Tuesday 15th of March
    • CMS:
      • The CMS offline production database (CMSR) crashed on Friday 11th March at 10:20 due to a local power cut in the critical area of the CERN CC. The power cut took down several disk arrays, which in turn caused the failure of the whole database. After power was restored the database still could not be restarted, because of the failure of 2 disks containing mirror copies of data. A decision was taken to fail the service over to the standby hardware, and the service was restarted there at 12:30. Unfortunately, applications connecting from outside CERN could not reach the database until 3:10, when the CERN central firewall was re-configured to allow such access. A detailed post-mortem is available at https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11
      • The CMS offline production database (CMSR), running on the standby hardware, failed on Tuesday 15th March at 11:45. The failure was caused by an issue in the SAN. The database was failed back to the original hardware, on which it had been running until 11th March, and the service was fully restored at 13:20. The SAN issue was caused by a bug in the Emulex driver; a fixed version has been identified in a newer kernel (2.6.18-238.1.1.el5).
      • For the past week, CMS PVSS replication has been suffering from repeated data latency caused by degraded apply-process performance. The root cause is not obvious and not yet known. Some extra workload from certain schemas has been observed, but not enough to saturate Streams throughput. The problematic schemas have been separated from the main replication by a new Streams configuration, and a service request has been opened with Oracle Support.
    • LHCb:
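The PVSS replication problem above comes down to the apply side falling behind the capture side. How such lag can be flagged is sketched generically below; this is not the actual CERN monitoring, and the function names, sample data, and 30-minute threshold are all invented for illustration:

```python
from datetime import datetime, timedelta

def apply_lag(capture_time, apply_time):
    """Latency between a change being captured and being applied."""
    return apply_time - capture_time

def flag_latency(samples, threshold=timedelta(minutes=30)):
    """Return the (capture, apply) pairs whose lag exceeds the threshold."""
    return [s for s in samples if apply_lag(*s) > threshold]

# Hypothetical (capture_time, apply_time) pairs for replicated changes.
samples = [
    (datetime(2011, 3, 25, 10, 0), datetime(2011, 3, 25, 10, 5)),   # 5 min: healthy
    (datetime(2011, 3, 25, 11, 0), datetime(2011, 3, 25, 12, 30)),  # 90 min: lagging
]
print(flag_latency(samples))
```

Separating heavy schemas into their own Streams configuration, as CMS did, keeps one misbehaving workload from inflating the apply lag of everything else sharing the same apply process.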

  • Site reports:
Site Status, recent changes, incidents, ... Planned interventions
ASGC   * New DB instance for the CASTOR 2.1.10 upgrade: in progress.
* DB configuration optimizations for CASTOR: pending.
BNL Nothing to report None
KIT Nothing to report * Migration of LFC/FTS RAC to new hardware postponed from March 31 to April 7
IN2P3 Nothing to report None
NDGF Nothing to report None
PIC LHCb DB was upgraded on the 28th of March.  
RAL Castor schemas have been upgraded to 2.1.10  
SARA Nothing to report None
TRIUMF Nothing to report None


-- JamieShiers - 30-Mar-2011

Topic revision: r16 - 2011-04-19 - GonzaloMerino