WLCG Tier1 Service Coordination Minutes - 17 November 2011

Attendance

Local: Mike Lamont (BE), Massimo (IT-DSS), Ian (CMS), Jorge (CMS), Pepe (CMS), Dirk (IT-DSS), Dawid (IT-DB), Oliver (IT-GT), Eva (IT-DB), Stefan (IT-ES), Jamie (IT-ES), MariaG (IT-ES), AndreaS (IT-ES), Alessandro (IT-ES), Maarten (IT-ES), Maite (IT-PES), DavidT (IT-ES), Torre (ATLAS), Stephan (ATLAS), MariaD (IT-ES), Julia (IT-ES)

Remote: Carlos (BNL), Gonzalo (PIC), Joel (LHCb), Jhen-Wei (ASGC), Maria Francesca (NDGF), Rob (OSG), Alexander (NL-T1), Ron (NL-T1), Andrew (TRIUMF), Alessandro (CNAF), Andreas (KIT), JohnK (RAL), Burt (FNAL)

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.11-8 for all main instances; SRM 2.10-2 (2.11 on PPS); xrootd 2.1.11-1; FTS: 5 nodes on SLC5 (3.7.0-3), 7 nodes on SLC4 (3.2.1); EOS 0.1.0 / xrootd 3.0.4 | As announced, all instances have been upgraded (transparently) to 2.1.11-8, including the name server machines. 2.1.11-8 contains the Tape Gateway software, which will allow more stable and efficient tape handling. | Next week we plan to start upgrading SRM to 2.11 and to start using the Tape Gateway (public instance only, no LHC activity). CASTOR 2.1.12 will be released on 1 December. Tests of the usage of Oracle 11g are continuing and a plan for the upgrade is being prepared together with the DB group.
ASGC | CASTOR 2.1.11-6; SRM 2.11-0; DPM 1.8.0-1 | 1-3/11: downtime for construction work on the data-centre power system (all storage affected). 7-9/11 (last week): downtime for the CASTOR upgrade. Hardware reconfiguration of disk servers to provide full capacity to CMS. | 
BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | Upgraded to dCache 1.9.12-10 and switched from PNFS to Chimera | None
CNAF | StoRM 1.8.0 (ATLAS, CMS, LHCb) | 9/11: fixed memory leak in 1.8 and upgraded the ATLAS and LHCb instances to 1.8 | 17/11: upgrade CMS to 1.8
FNAL | dCache 1.9.5-23 (PNFS), httpd 1.9.5-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | | 
IN2P3 | dCache 1.9.5-29 (Chimera) on core servers and pool nodes | | 
KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera); cmssrm-fzk.gridka.de head nodes 1.9.5-26 (Chimera), pool nodes 1.9.5-6 through -25; gridka-dcache.fzk.de head nodes 1.9.5-26 (PNFS), pool nodes 1.9.5-24 and -25 | Upgraded all nodes of atlassrm-fzk.gridka.de to 1.9.12-11 on 15 November 2011 | 
NDGF | dCache 1.9.14 (Chimera) on core servers; mix of 1.9.13 and 2.0.0 on pool nodes | | 
NL-T1 | dCache 1.9.12-10 (Chimera) at SARA; DPM 1.7.3 at NIKHEF | | 
PIC | dCache 1.9.12-10 (last patch-release upgrade on 13 September); PNFS on Postgres 9.0 | | 
RAL | CASTOR 2.1.10-1, 2.1.10-0 on tape servers; SRM 2.10-2 | None | None
TRIUMF | dCache 1.9.5-28 with Chimera namespace | None | None

Other site news

CASTOR news

CERN operations and development

EOS news

  • EOS 0.1.0-43 installed on ATLAS
    • namespace boot performance boost (6x faster)
    • small bug fixes
  • EOS 0.1.1-1 / xrootd 3.1.0 under test
    • GPL v3 license
    • unordered adler checksumming (see the sketch after this list)
    • extension of the fsck tool
    • r/w lock implementation on the namespace
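
As a rough illustration of the idea behind unordered adler checksumming (this is not the EOS implementation, and the helper name below is invented): the adler-32 of a complete file can be assembled from per-block checksums computed in any order, e.g. as blocks are written out of order, using the same combination arithmetic as zlib's adler32_combine().

    import zlib

    BASE = 65521  # adler-32 modulus (largest prime below 2**16)

    def adler32_combine(adler1, adler2, len2):
        # Combine adler32(A) and adler32(B) into adler32(A + B), given len(B).
        # Same arithmetic as zlib's adler32_combine(); it lets per-block
        # checksums computed in any order be stitched together afterwards.
        rem = len2 % BASE
        sum1 = adler1 & 0xFFFF
        sum2 = (rem * sum1) % BASE
        sum1 += (adler2 & 0xFFFF) + BASE - 1
        sum2 += ((adler1 >> 16) & 0xFFFF) + ((adler2 >> 16) & 0xFFFF) + BASE - rem
        if sum1 >= BASE:
            sum1 -= BASE
        if sum1 >= BASE:
            sum1 -= BASE
        if sum2 >= 2 * BASE:
            sum2 -= 2 * BASE
        if sum2 >= BASE:
            sum2 -= BASE
        return sum1 | (sum2 << 16)

    # Checksum two blocks independently (e.g. arriving out of order), then
    # combine them into the checksum of the concatenated data.
    block_a, block_b = b"hello ", b"world"
    combined = adler32_combine(zlib.adler32(block_a), zlib.adler32(block_b), len(block_b))
    assert combined == zlib.adler32(block_a + block_b)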

xrootd news

dCache news

StoRM news

FTS news

  • FTS 2.2.8 is now installed on the CERN pilot service. PhEDEx transfers are running and ATLAS intends to start testing. Transfer publishing to messaging is working.
  • FTS 2.2.7 has been cancelled; the next version will be FTS 2.2.8.
  • FTS 2.2.6 was released in EMI-1 Update 6 on 1 September.

DPM news

LFC news

LFC deployment

Site | Version | OS, n-bit | Backend | Upgrade plans
ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | LFC service moved to CERN
BNL | 1.8.0-1 | SL5 64-bit | Oracle | None
CERN | 1.8.2-0 | SLC5 64-bit | Oracle | Upgrade to SLC5 64-bit only pending for lfcshared1/2
CNAF | 1.8.0-1 | SL5 64-bit | Oracle | None
FNAL | N/A | | | Not deployed at Fermilab
IN2P3 | 1.8.2-0 | SL5 64-bit | Oracle 11g | 
KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending
NDGF | 1.7.4.7-1 | Ubuntu 10.04 64-bit | MySQL | None
NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | 
PIC | 1.7.4-7 | SL5 64-bit | Oracle | 
RAL | 1.7.4-7 | SL5 64-bit | Oracle | 
TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | None

Experiment issues

WLCG Baseline Versions

Status of open GGUS tickets

Review of recent / open SIRs and other open service issues

Conditions data access and related services

  • Work has resumed on COOL performance validation on Oracle 11g servers. Different results with a better performance have been observed on the server used for previous tests after its upgrade from 11.2.0.2 to 11.2.0.3. For comparison, tests have been repeated on another server still running 11.2.0.2, confirming the observation of worse performance than on 10g or 11.2.0.3. More tests are however needed to confirm whether 11.2.0.3 solves all problems previously reported, or whether the use of the 10g optimizer (previously planned on 11.2.0.2 servers) should still be recommended. The interpretation of results is complicated by the new 'SQL plan baseline feature' on Oracle 11g: the new tests on the 11.2.0.2 server showed that this feature can lead to worse performance and less transparency of query execution plans, so that it presently seems better to disable this feature for COOL.
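
For illustration only, a minimal sketch of how the two optimizer settings discussed above could be toggled in a test session; the account name and database alias are made up, and this is not how COOL itself configures its sessions.

    import cx_Oracle  # Oracle client bindings, used here only for illustration

    # Hypothetical test account and 11g database alias.
    conn = cx_Oracle.connect("cool_validation", "secret", "test11g")
    cur = conn.cursor()

    # Disable the 11g SQL plan baseline feature for this session, which the
    # tests reported above suggest is currently preferable for COOL.
    cur.execute("ALTER SESSION SET optimizer_use_sql_plan_baselines = FALSE")

    # Alternative previously considered for 11.2.0.2 servers: make the
    # optimizer behave as in 10g.
    # cur.execute("ALTER SESSION SET optimizer_features_enable = '10.2.0.5'")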

Database services

  • Experiment reports:
    • Status report: all integration and production DBs have been patched with the October CPU (Critical Patch Update) and all 11.2.0.2 integration DBs have been migrated to 11.2.0.3.

  • Site reports:
Site | Status, recent changes, incidents, ... | Planned interventions
BNL | Nov 7 2011 (network intervention): due to a site network intervention the propagation and apply processes had to be stopped for its duration; the LFC/FTS and VOMS/Priority Stager Oracle clusters were moved to the new data centre with no problems observed. Preventive stop of the 3D OEM conditions database Oracle agents due to problems with the 3D OEM certificate. Nov 8 2011: due to a site network intervention the propagation and apply processes had to be stopped for ~1.5 hours (10:00 PM - 11:30 PM EST); following up with Oracle via SR 3-4849304931 on the results after applying the 11.2.0.3 patch to the 11g test RAC, specifically that GSD resources do not start after the upgrade from 11.2.0.2 to 11.2.0.3. | None
CNAF | | 
KIT | Nov 15-17: ATLAS LFC migration to CERN (ATLAS still testing). | 
IN2P3 | At the end of July and at the beginning of October, some databases running different releases (10.2, 11.1 and 11.2) were affected by physical corruption. A SR was opened with Oracle and with Pillar (storage unit) support. All investigations led by Oracle concluded that it was a hardware problem, whereas Pillar support did not diagnose any problem. Therefore, to protect all databases including 3D, we moved all data to another storage unit and took the Pillar unit out of production. An expert from Pillar support came to examine our hardware and, after some tests, many malfunctions were found. We are working in close collaboration with Oracle EMEA top support management so that they provide us with a solution as soon as possible. | 
NDGF | | 
PIC | Nothing to report | Planning to install the October CPU patch before the end of November, no exact dates yet.
RAL | Nothing to report | None
SARA | Nothing to report | None
TRIUMF | Nothing to report | None

AOB

-- AndreaSciaba - 16-Nov-2011
