WLCG Tier1 Service Coordination Minutes - 17 March 2011

Attendance

Action list review

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (all); SRM 2.9-4 (ALICE, CMS, LHCb), SRM 2.10-1 (ATLAS); xrootd 2.1.9-7 | SRM 2.10-1 working OK for ATLAS | xrootd 3.0.2 for ALICE; SRM 2.10-1 for the other experiments, dates being negotiated |
| ASGC | CASTOR 2.1.7-19 (stager, name server); CASTOR 2.1.8-14 (tape server); SRM 2.8-2 | March 13 power maintenance, during which storage services were not available | No planned downtime. ATLAS D1T0 data being moved from CASTOR to DPM (v1.8.0), mostly done; new data going to DPM directly. |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | None |
| CNAF | StoRM 1.5.6-3 (ATLAS, CMS, LHCb, ALICE) | Installation and validation of the storage for 2011 completed; the file systems are now being enlarged (done for LHCb, next week for the others). | Still waiting for the last INFN Grid StoRM release (1.6.2, the first one on SL5; the next one will be included in EMI 1), which was due about one month ago. The deployment plans have therefore changed: as negotiated with the LHC experiment representatives, the StoRM end-points will be upgraded during the LHC technical stops: March 28 for ATLAS, March 29 for LHCb, and in May for CMS and ALICE (exact days to be agreed). |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | CVMFS now deployed on all 1200 WNs, working OK | DNS-load-balanced SRM in final testing, to be deployed Monday March 21 |
| IN2P3 | dCache 1.9.5-24 (Chimera) on all core servers and pool nodes | Upgrade from 1.9.5-22 to 1.9.5-24 on 2011-02-08 | None |
| KIT | dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24 | | |
| NDGF | dCache 1.9.12 | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA); DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-24 (PNFS, Postgres 9) | | |
| RAL | CASTOR 2.1.9-6 (stagers), 2.1.9-1 (tape servers); SRM 2.8-6 | Name server upgraded to 2.1.10-0 | During March: upgrade the CASTOR clients on the WNs to 2.1.9-6 and the stagers to 2.1.10-0 |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |
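A DNS-load-balanced endpoint such as FNAL's new SRM alias works by publishing several A records behind a single hostname, so clients are spread across the member hosts. A minimal sketch of how a client can inspect such an alias (the SRM hostname in the comment is hypothetical; `localhost` is used only so the example runs anywhere):

```python
import socket

def resolve_all(alias):
    """Return every IPv4 address published behind a DNS alias.

    A DNS-load-balanced service registers multiple A records for one
    name; resolvers hand them back in rotated order, which is what
    spreads client connections across the member hosts.
    """
    # gethostbyname_ex returns (canonical_name, alias_list, ip_list)
    _, _, addresses = socket.gethostbyname_ex(alias)
    return sorted(addresses)

# A real check would query the (hypothetical) SRM alias, e.g.
# resolve_all("srm.example.org"); here we use a name that always resolves.
print(resolve_all("localhost"))
```

For a production alias the returned list would contain one entry per back-end node, which makes this a quick way to verify that all members are still registered.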

CASTOR news

— CERN operations

— Development

xrootd news

dCache news

StoRM news

FTS news

DPM news

  • DPM 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)
    • Edinburgh involved as Early Adopter site

LFC news

  • LFC 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |
| ASGC | 1.7.4-7 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Upgrade to SLC5 64-bit pending |

Experiment issues

WLCG Baseline Versions

  • WLCG Baseline versions: table

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

Conditions data access and related services

COOL, CORAL and POOL

  • A new LCGCMT_60b has been prepared for ATLAS. The main motivation for this release is the upgrade to newer versions of the ROOT (5.28.00b) and Qt (4.6.3p2) external dependencies. The new release also includes a few bug fixes and enhancements in CORAL and POOL, while it is based on the same COOL code base as LCGCMT_60a. The full release notes are available at https://twiki.cern.ch/twiki/bin/view/Persistency/PersistencyReleaseNotes

Frontier/Squid

Database services

  • Experiment reports:
    • ALICE:
      • Intervention on Monday 7th from 9:15 to 10:15 to fix an electricity problem with one of the power boxes used by the ALICE online database. The database was unavailable during the intervention.
    • ATLAS:
      • Muon calibration data replication from the three muon calibration sites (Michigan, Rome and Munich) to CERN (ATLAS offline database) moved to production.
      • New schemas have been added to the ATLAS conditions Streams replication setup on Tuesday 15th March.
    • CMS:
      • CMS integration database affected by Oracle bug 7612454 (more "direct path read" operations / ORA-00600 [kcblasm]). A fix patch was applied and tested. Recently the same issue affected one application on the CMS offline production database (CMSR); waiting for a green light from CMS to apply the patch in production (rolling intervention).
      • The CMS online production database got stuck during a scheduled rolling reboot on Friday 4th February at 14:10, as part of a disk replacement. The hang was caused by ASM locking, and the database was unavailable to users for 45 minutes.
      • The CMS offline production database (CMSR) crashed on Friday 11th March at 10:20 due to a local power cut in the critical area of the CERN CC. Several disk arrays went down, which in turn caused failure of the whole database. After the power was restored, the database still could not be restarted due to the failure of 2 disks containing mirror copies of data. The service was restarted on the standby hardware at 12:30. Unfortunately, applications connecting from outside CERN could not reach the database until 3:10, when the CERN central firewall was re-configured to allow such access. A detailed postmortem is available at https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11
      • CMS offline production database (CMSR), running on standby hardware, failed on Tuesday 15th March at 11:45. The failure was caused by an issue in the SAN which is still being investigated. The database has been failed over to the original hardware. The service was fully restored at 13:20.
    • LHCb:
      • On Saturday 26th February around 1am, the LHCb offline production database hung due to a problem with the controller of one of the disk arrays attached to it. The service was fully recovered by 2:10am. The faulty controller has been replaced by the vendor.

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| CERN | High memory consumption by the qmon processes on the downstream databases, being investigated by Oracle | None |
| ASGC | DB upgrades: CASTOR NS: Oracle 10.2.0.5 new RAC ready; CASTOR STAGER/SRM: Oracle 10.2.0.5 new RAC ready; CASTOR testbed upgrade to 2.1.10 in progress. | None |
| BNL | Conditions DB: deployed the Oracle Enterprise Manager agent patches 10.2.0.5 PSU3 (patch 9282414) + patch 10170020 that report to the 3D OEM. Applied the Oracle Enterprise Manager production agent patch 10170020 that reports to the BNL OEM. Data Guard physical standby database enabled for the LFC and FTS production cluster. | None |
| CNAF | LHCb and FTS clusters updated to 10.2.0.5; ATLAS skipped because the CNAF COND and LFC databases will be decommissioned, as planned by the experiment. | |
| KIT | ATLAS 3D RAC: enabled AUDIT_TRAIL (3rd March) | Migration of the LFC/FTS DB to new hardware in the week 28.03-01.04 (using Data Guard) |
| IN2P3 | Due to a failure of a network switch, the 3D databases were unreachable on 14th March between 20:00 and 20:30 (UTC) | None |
| NDGF | ntr | None |
| PIC | ntr | None |
| RAL | shmmax kernel parameter changed on the 3D database to allow increasing the SGA size from 4 to 6 GB. Renamed ~3M SFNs in the ATLAS file catalogue for the CASTOR pool merge. | None |
| SARA | ntr | None |
| TRIUMF | ntr | None |
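RAL's shmmax change is the standard prerequisite for growing an Oracle SGA: the SGA lives in a System V shared-memory segment, so the kernel's `kernel.shmmax` limit must be at least the target SGA size before the instance can start with it. A minimal sketch of that sanity check (the helper name is illustrative; the 4 GB and 6 GB figures come from the report above):

```python
def shmmax_accommodates_sga(shmmax_bytes, sga_gib):
    """Check whether kernel.shmmax is large enough for a target SGA.

    Oracle allocates its SGA in a System V shared-memory segment, so
    an instance cannot start with an SGA larger than kernel.shmmax.
    """
    return shmmax_bytes >= sga_gib * 1024**3

# On Linux the current limit can be read from /proc:
#   with open("/proc/sys/kernel/shmmax") as f:
#       current = int(f.read())
# Mirroring the RAL report: a 4 GiB shmmax is too small for a
# 6 GiB SGA, so the parameter had to be raised first.
print(shmmax_accommodates_sga(4 * 1024**3, 6))  # False: shmmax must be raised
print(shmmax_accommodates_sga(8 * 1024**3, 6))  # True
```

Raising the limit is then a matter of setting `kernel.shmmax` via sysctl and making it persistent in `/etc/sysctl.conf`, which is presumably what was done at RAL.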

AOB

-- JamieShiers - 14-Mar-2011


Topic revision: r19 - 2011-03-18 - MaartenLitmaath
 