WLCG Tier1 Service Coordination Minutes - 15 September 2011

Attendance

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site Status Recent changes Planned changes
CERN CASTOR 2.1.11-6 (SL5); CASTOR-SRM 2.10-x (SL5); xrootd: 2.1.11-1
FTS: 5 nodes in SLC5 3.7.0-3; 7 nodes in SLC4 3.2.1
EOS -0.1.0/xrootd-3.0.4
CASTOR-SRM on SL5 EOS-SRM (bestman) update to 2.1.3
ASGC CASTOR 2.1.11-2
SRM 2.11-0
DPM 1.8.0-1
None None
BNL dCache 1.9.5-23 (PNFS, Postgres 9) None Transition to Chimera during next TS (Nov)
CNAF StoRM 1.7.0    
FNAL dCache 1.9.5-23 (PNFS) httpd=1.9.5.-25
Scalla xrootd 2.9.1/1.4.2-4
Oracle Lustre 1.8.3
   
IN2P3 dCache 1.9.5-26 (Chimera) on core servers. Mix of 1.9.5-24 and 1.9.5-26 on pool nodes    
KIT dCache (admin nodes): 1.9.5-27 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera) 1.9.5-26 (LHCb, PNFS)
dCache (pool nodes): 1.9.5-6 through 1.9.5-27
   
NDGF dCache 1.9.12    
NL-T1 dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF)    
PIC dCache 1.9.12-10 (last upgrade to patch release on 13-Sep); PNFS on Postgres 9.0   Planning intervention in the MSS: upgrade to Enstore2. Possible date 28-Sep. In contact with experiments to check if this is ok.
RAL CASTOR 2.1.10-2
2.1.10-0 (tape servers)
SRM 2.10-0
  7/9: will apply DB patches to 3D ATLAS and LHCb, FTS and LFC. Services will be "at risk"
TRIUMF dCache 1.9.5-28 with Chimera namespace None None

Other site news

CASTOR news

CERN operations and development

EOS news

xrootd news

dCache news

StoRM news

FTS news

DPM news

LFC news

LFC deployment

Site | Version | OS, n-bit | Backend | Upgrade plans
ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None
BNL | 1.8.0-1 | SL5, 64-bit | Oracle | None
CERN | 1.8.2-0 | 64-bit SLC5 | Oracle | Upgrade to SLC5 64-bit only pending for lfcshared1/2
CNAF | 1.7.4-7 (ATLAS, to be dismissed); 1.8.0-1 (LHCb, recently updated) | SL5 64-bit | Oracle |
FNAL | N/A | | | Not deployed at Fermilab
IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th
KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending
NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None
NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle |
PIC | 1.7.4-7 | SL5 64-bit | Oracle |
RAL | 1.7.4-7 | SL5 64-bit | Oracle |
TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL |

Experiment issues

WLCG Baseline Versions

Status of open GGUS tickets

  • All 4 experiments confirmed they had no issues. ATLAS shifters are reminded to use TEAM tickets so that ticket ownership is shared across shifts.
  • The "Type of Problem" field for TEAM and ALARM tickets will be introduced with the 2011/09/28 GGUS release, as planned in Savannah:117206. The agreed field values are:
    • Infrastructure (File transfer/access, Batch, Monitoring)
    • Storage Systems
    • Databases
    • Network problem
    • Middleware

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services

  • Experiment reports:
    • ATLAS:
      • Due to high load from the COOL application, the third node of the ATLAS offline database (ATLR) rebooted on Thursday (8th of Sept) around 2 AM. After the third node restarted, the first ATLR instance crashed due to a clusterware error (according to Oracle documentation, this is a known problem in 10g, tracked as an unpublished bug). All services were relocated to the second node, which survived. All ATLR instances were back in operation after about 20 minutes.
      • As requested by the ATLAS experiment, the new schema ATLAS_CONF_TRIGGER_REPR was added to the ATLAS T0-T1 replication on Wednesday (14th of Sept). The intervention was not transparent and required a 1-hour downtime (from 10:30 until 11:30) of the whole ATLAS replication service between T0 and the T1s.
    • CMS:
      • There were several failures of the CMS PVSS stream replication on Monday, Tuesday and Wednesday (12-14th of Sept) due to user mistakes: users created a view with wrong syntax and tried to recompile non-existing views.
    • WLCG:
      • LCGR database services were unreachable for 1 hour on Sunday (11th Sept, ~19:30). The problem was caused by the database reaching its maximum number of open sessions. It was triggered by a VOMS application that stopped disconnecting from the database but kept opening new sessions. To make the database available again, instance no. 4 was restarted.

  • Site reports:
Site Status, recent changes, incidents, ... Planned interventions
BNL - Resynchronization of a table in the Conditions DB due to a streams instantiation problem.
- Renew DOE certificates in the BNL Oracle Enterprise Manager Grid Control.
- Apply CPU patches, initially to the Conditions Database, and the patch proposed by Oracle (P6011045).
CNAF   Applying CPU July patches on 20th and 21st of September
KIT ATLAS 3D DB: due to the migration to new hardware, disk group names changed and we had to re-create a few dba_directories. None
IN2P3    
NDGF    
PIC   We're going to apply the July CPU patch on all databases next week, between Tuesday and Wednesday.
RAL Nothing to report None
SARA Nothing to report (Not attending) None
TRIUMF Nothing to report None
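
Illustration of the LCGR incident reported above: the failure mode (a client that keeps opening sessions without disconnecting until the database reaches its session limit) can be sketched with a toy model. This is only a minimal sketch, not the actual VOMS or Oracle code. The classes ToyServer and ToySession, the limit MAX_SESSIONS and both client functions are hypothetical names introduced here for illustration.

```python
from contextlib import closing

MAX_SESSIONS = 5  # hypothetical limit, standing in for the database's session cap


class SessionLimitError(RuntimeError):
    """Raised when the toy server has no free session slots."""


class ToyServer:
    """Stand-in for a database that caps the number of open sessions."""

    def __init__(self, max_sessions=MAX_SESSIONS):
        self.max_sessions = max_sessions
        self.open_sessions = 0

    def connect(self):
        if self.open_sessions >= self.max_sessions:
            raise SessionLimitError("maximum number of sessions reached")
        self.open_sessions += 1
        return ToySession(self)


class ToySession:
    """A session that frees its server slot only when close() is called."""

    def __init__(self, server):
        self._server = server
        self._closed = False

    def query(self):
        return 1  # placeholder for real work

    def close(self):
        if not self._closed:
            self._server.open_sessions -= 1
            self._closed = True


def leaky_client(server, requests):
    """Opens a new session per request and never disconnects."""
    held = []
    for _ in range(requests):
        session = server.connect()
        session.query()
        held.append(session)  # session stays open, slot is never freed
    return held


def well_behaved_client(server, requests):
    """Opens a session per request and always closes it, even on errors."""
    for _ in range(requests):
        with closing(server.connect()) as session:
            session.query()
```

In this model the well-behaved client can issue any number of requests, while the leaky client raises SessionLimitError once the cap is reached, and the slots stay occupied for every other client until the sessions are closed or the server is restarted.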

AOB

-- JamieShiers - 14-Sep-2011

Topic revision: r13 - 2011-10-12 - AndrewWong
 