WLCG Tier1 Service Coordination Minutes - 17 March 2011

Attendance

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site Status Recent changes Planned changes
CERN CASTOR 2.1.10 (all)
SRM 2.9-4 (ALICE, CMS, LHCb), SRM 2.10 (ATLAS)
xrootd 2.1.9-7
SRM 2.10 rolled out for ATLAS; issues observed and under investigation
ASGC CASTOR 2.1.7-19 (stager, nameserver)
CASTOR 2.1.8-14 (tapeserver)
SRM 2.8-2
17/2: unscheduled downtime from 06:50 to 09:40 UTC due to a network issue which caused data transfers from EU and US to be slow
1/3: "at risk" intervention from 06:00 to 10:00 UTC on the tape system for data centre construction works; tapes may be inaccessible, D1T0 data unaffected
BNL dCache 1.9.5-23 (PNFS, Postgres 9) None None
CNAF StoRM 1.5.6-3 (ATLAS, CMS, LHCb, ALICE)
FNAL dCache 1.9.5-23 (PNFS)
Scalla xrootd 2.9.1/1.4.2-4
Oracle Lustre 1.8.3
None None
IN2P3 dCache 1.9.5-24 (Chimera) on all core servers and pool nodes Upgrade from 1.9.5-22 to 1.9.5-24 on 2011-02-08 None
KIT dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS)
dCache (pool nodes): 1.9.5-9 through 1.9.5-24
   
NDGF dCache 1.9.12    
NL-T1 dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF)    
PIC dCache 1.9.5-24 (PNFS)    
RAL CASTOR 2.1.9-6 (stagers)
2.1.9-1 (tape servers)
SRM 2.8-6
NS now upgraded to 2.1.10-0. During March we want to upgrade the CASTOR clients on the WNs to 2.1.9-6 and the stagers to 2.1.10-0.
TRIUMF dCache 1.9.5-21 with Chimera namespace    

CASTOR news

— CERN operations

— Development

xrootd news

dCache news

StoRM news

FTS news

DPM news

  • DPM 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)
    • Edinburgh involved as Early Adopter site

LFC news

  • LFC 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)

LFC deployment

Site Version OS, n-bit Backend Upgrade plans
ASGC 1.7.4-7 SLC5 64-bit Oracle None
BNL 1.8.0-1 SL5, 64-bit Oracle None
CERN 1.7.3 SLC4 64-bit Oracle Will upgrade to SLC5 64-bit by the end of January or beginning of February.
CNAF 1.7.4-7 SL5 64-bit Oracle  
FNAL N/A     Not deployed at Fermilab
IN2P3 1.8.0-1 SL5 64-bit Oracle 11g Oracle DB migrated to 11g on Feb. 8th
KIT 1.7.4 SL5 64-bit Oracle  
NDGF 1.7.4.7-1 Ubuntu 9.10 64-bit MySQL None
NL-T1 1.7.4-7 CentOS5 64-bit Oracle  
PIC 1.7.4-7 SL5 64-bit Oracle  
RAL 1.7.4-7 SL5 64-bit Oracle  
TRIUMF 1.7.3-1 SL5 64-bit MySQL  

Experiment issues

WLCG Baseline Versions

  • WLCG Baseline versions: table

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

Conditions data access and related services

COOL, CORAL and POOL

  • A new LCGCMT_60b has been prepared for ATLAS. The main motivation for this release is the upgrade to newer versions of the ROOT (5.28.00b) and Qt (4.6.3p2) external dependencies. The new release also includes a few bug fixes and enhancements in CORAL and POOL, while it is based on the same COOL code base as LCGCMT_60a. The full release notes are available at https://twiki.cern.ch/twiki/bin/view/Persistency/PersistencyReleaseNotes

Frontier/Squid

Database services

  • Experiment reports:
    • ALICE:
      • Intervention on Monday 7th from 9:15 to 10:15 to fix an electricity problem with one of the power boxes used by the ALICE online database. The database was unavailable during the intervention.
    • ATLAS:
      • Muon calibration data replication from three muon sites (Michigan, Rome and Munich) to CERN (ATLAS offline database) has moved to production.
      • New schemas have been added to the ATLAS conditions Streams replication setup on Tuesday 15th March.
    • CMS:
      • CMS integration database affected by Oracle bug 7612454 (more "direct path read" operations / OERI:kcblasm). A fix patch was applied and tested. Recently the same issue affected one application on the CMS offline production database (CMSR); waiting for a green light from CMS to apply the patch in production (rolling intervention).
      • CMS online production database got stuck during a scheduled rolling reboot on Friday 4th February at 14:10, as part of a disk replacement. The hang was caused by ASM locking and the database was unavailable to users for 45 minutes.
      • CMS offline production database (CMSR) crashed on Friday 11th March at 10:20 due to a local power cut in the critical area of the CERN CC. Several disk arrays went down, which in turn caused failure of the whole database. After the power was restored the database still could not be restarted due to the failure of 2 disks containing mirror copies of data. The service was restarted on the standby hardware at 12:30. Unfortunately, applications connecting from outside CERN could not reach the database until 3:10, when the CERN central firewall was reconfigured to allow such access. A detailed post-mortem is available at https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11
      • CMS offline production database (CMSR), running on standby hardware, failed on Tuesday 15th March at 11:45. The failure was caused by an issue in the SAN which is still being investigated. The database has been failed over to the original hardware. The service was fully restored at 13:20.
    • LHCb:
      • On Saturday 26th February around 01:00, the LHCb offline production database hung due to a problem with the controller for one of the disk arrays attached to it. The service was fully recovered by 02:10. The faulty controller has been replaced by the vendor.

  • Site reports:
Site Status, recent changes, incidents, ... Planned interventions
CERN High memory consumption by the qmon processes on the downstream databases, being investigated by Oracle None
ASGC DB upgrades: CASTOR NS: new RAC with Oracle 10.2.0.5 ready; CASTOR STAGER/SRM: new RAC with Oracle 10.2.0.5 ready; CASTOR testbed upgrade to 2.1.10 in progress. None
BNL Conditions DB: deployed the Oracle Enterprise Manager agent patches 10.2.0.5 PSU3 (patch 9282414) + patch 10170020 reporting to the 3D OEM; applied the Oracle Enterprise Manager production agent patch 10170020 reporting to the BNL OEM. Data Guard physical standby database enabled for the LFC and FTS production cluster. None
CNAF LHCb and FTS clusters updated to 10.2.0.5; ATLAS skipped because the CNAF COND and LFC databases will be decommissioned, as planned by the experiment.  
KIT ATLAS 3D RAC: enabled AUDIT_TRAIL (3rd March) Migration of the LFC/FTS DB to new hardware during the week 28.03-01.04 (using Data Guard)
IN2P3 Due to a failure of a network switch, the 3D databases were unreachable on 14th March between 20:00 and 20:30 UTC. None
NDGF ntr None
PIC ntr None
RAL shmmax kernel parameter changed on the 3D database to allow increasing the SGA size from 4 to 6 GB. Renamed ~3M SFNs in the ATLAS file catalog for the CASTOR pool merge. None
SARA ntr None
TRIUMF ntr None
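A change like RAL's shmmax adjustment can be sketched as follows. This is a minimal illustration of raising the kernel shared-memory ceiling for a larger Oracle SGA, assuming a 6 GB target; the exact value and procedure used at RAL are not stated in the minutes, and the commands here are only printed, not executed.

```shell
# Hypothetical sketch: compute a kernel.shmmax value large enough for a
# 6 GB Oracle SGA and show the commands an administrator would run.
# (Values are illustrative; the commands are echoed rather than executed.)
NEW_SHMMAX=$((6 * 1024 * 1024 * 1024))   # 6 GiB expressed in bytes
echo "sysctl -w kernel.shmmax=${NEW_SHMMAX}"                      # apply at runtime
echo "kernel.shmmax = ${NEW_SHMMAX}   # add to /etc/sysctl.conf"  # persist across reboots
```

On SL5 the `sysctl -w` form takes effect immediately without a reboot, while the `/etc/sysctl.conf` entry makes the setting survive one; the database instance itself must be restarted before it can allocate the larger SGA.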

AOB

-- JamieShiers - 14-Mar-2011

Topic revision: r16 - 2011-03-17 - EvaDafonte