WLCG Tier1 Service Coordination Minutes - 24 February 2011

Attendance

Action list review

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (all); SRM 2.9-4 (ALICE, CMS, LHCb), SRM 2.10 (ATLAS); xrootd 2.1.9-7 | SRM 2.10 rolled out for ATLAS; issues observed, under investigation | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tape servers); SRM 2.8-2 | 17/2: unscheduled downtime from 06:50 to 09:40 UTC due to a network issue that slowed data transfers from the EU and US | 1/3: "at risk" intervention from 06:00 to 10:00 UTC on the tape system for data-centre construction works; tapes may be inaccessible, D1T0 data unaffected |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | None |
| CNAF | StoRM 1.5.6-3 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | DNS load-balanced SRM service March 1; adding CVMFS service March 3 |
| IN2P3 | dCache 1.9.5-24 (Chimera) | | |
| KIT | dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24 | | |
| NDGF | dCache 1.9.11 | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-23 (PNFS) | | |
| RAL | CASTOR 2.1.9-6 (stagers), 2.1.9-1 (tape servers); SRM 2.8-6 | All disk servers now upgraded to SL5 64-bit | CASTOR 2.1.10-0 upgrade during March |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |

CASTOR news

CERN operations

Development

xrootd news

dCache news

StoRM news

FTS news

DPM news

  • DPM 1.8.0-1 for gLite 3.1: expected to be released soon after correction of the VOMS libraries

LFC news

  • LFC 1.8.0-1 for gLite 3.1: expected to be released soon after correction of the VOMS libraries

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.4-7 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of January or the beginning of February |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL4 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |

Experiment issues

WLCG Baseline Versions

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

Conditions data access and related services

COOL, CORAL and POOL

  • A new LCGCMT_60a has been prepared for ATLAS and LHCb. The main motivation for this release is the upgrade to newer versions of the ROOT (5.28.00a) and frontier_client (2.8.0) external dependencies. The frontier_client upgrade includes fixes and improvements that will be useful for ATLAS data access. The new release also includes a POOL patch (needed by the ROOT upgrade) and minor bug fixes in COOL, while it is based on the same CORAL code base as LCGCMT_60. The full release notes are available at https://twiki.cern.ch/twiki/bin/view/Persistency/PersistencyReleaseNotes.

Frontier/Squid

  • ATLAS weekly Frontier meetings
  • Please note the proposals to consolidate the number of ATLAS Frontier service sites, and to give a small team of ATLAS administrators access to the service configuration files at each site.

Database services

  • Experiment reports:
    • ALICE:
    • ATLAS:
      • Testing of the DB switchover to the standby for the ATLAS online DB was scheduled for Thursday 17th. The tests were successful, although some listener-related problems were observed in connections from the HLT system; it is not yet fully clear what caused them or what made them disappear in the end.
      • The 3rd node of the ATLAS offline production database (ATLR) rebooted on Tuesday 22nd February around midnight due to high load generated by clients connecting to the ATLAS_COOL_READER account. Although the machine itself was up again after approximately 45 minutes, one of its local file systems, essential for the proper operation of an Oracle RDBMS instance, could not be mounted read-write due to inconsistencies. The inability to write to the affected file system made crash recovery of a few Oracle data files impossible, which in turn affected online-to-offline data replication. The machine with the faulty file system was removed from the ATLR cluster at 05:20, and all outstanding problems with the unrecovered data files and the replication were fixed by 06:00. The problematic file system was checked by the sysadmins, and at 14:00 on the 23rd the machine was re-added to the cluster. The issue introduced a few hours of delay to the online->offline replication.
      • Muon Calibration data has been replicated in production from the Muon site in Michigan to CERN (ATLAS offline database) since Thursday 17.02.
    • CMS:
      • Spontaneous reboot of the 4th node of the CMSR database (CMS offline) on Monday 21st February at 5pm. The reboot was caused by excessive PGA utilization by the CMS DBS application.
      • CMS PVSS replication latency increased to one day due to a huge transaction (more than 175 million changes) executed in a replicated schema without tagging. This caused a reboot of CMSR node 3 due to high swap usage and exhausted the space in the SYSAUX tablespace during the night (15.02).
      • On Monday 21st, CMS PVSS replication aborted due to unsupported user operations (compiling synonyms). The Streams error-handling procedures have been updated to avoid such problems in the future.
    • LHCb:

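As context for the PVSS tagging incident above (a hedged sketch of the general Oracle Streams mechanism, not necessarily the exact procedure used by the CMS DBAs): Oracle Streams lets a session mark its redo with a tag via DBMS_STREAMS.SET_TAG, and if the capture/propagation rules are configured to discard tagged changes, bulk maintenance work done under a non-NULL tag is not replicated. The tag value '1D' below is an arbitrary illustrative choice.

```sql
-- Hypothetical maintenance session: set a non-NULL Streams tag so that
-- rules configured to discard tagged redo skip the following bulk change.
BEGIN
  DBMS_STREAMS.SET_TAG(tag => HEXTORAW('1D'));  -- any agreed non-NULL value
END;
/

-- ... run the large update in the replicated schema here ...

BEGIN
  DBMS_STREAMS.SET_TAG(tag => NULL);  -- restore normal replication for this session
END;
/
```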
  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | Decommissioning of TAGS services: iELSSI browser, TAGS database services (cluster database and data) | |
| CNAF | | |
| KIT | | |
| IN2P3 | Upgraded the LHCb and ATLAS DBs to 10.2.0.5; enabled the audit trail on the ATLAS DB | |
| NDGF | | |
| PIC | Nothing to report | None |
| RAL | All our 10.2.0.4 databases have been upgraded to 10.2.0.5; patch 9184754 has been applied on 3D and FTS/LFC as a precaution, and there are plans to apply it on the CASTOR DBs; testing CASTOR 2.1.10 | |
| SARA | Nothing to report | No interventions |
| TRIUMF | Applied Oracle 10.2.0.5 PSU3 (patch 9282414) plus patch 10170020 on the OEM agents | |

AOB

-- JamieShiers - 23-Feb-2011

Topic revision: r15 - 2011-03-15 - MaartenLitmaath