WLCG Operations Coordination Minutes - November 21, 2013

Agenda

Attendance

  • Local:
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Tier-1 Grid services

Storage deployment

CERN
  • Status:
    • CASTOR: v2.1.14-3 and SRM-2.11-2 on all instances
    • EOS:
      • ALICE (EOS 0.2.37 / xrootd 3.2.8)
      • ATLAS (EOS 0.3.2 / xrootd 3.3.4 / BeStMan2-2.2.2)
      • CMS (EOS 0.3.2 / xrootd 3.3.4 / BeStMan2-2.2.2)
      • LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2)
  • Recent changes:
    • CASTOR central servers/nameservers/headnodes updated to SLC6
    • EOS 0.3 for ATLAS, CMS
  • Planned changes: EOS 0.3 for LHCb on 2013-11-26

ASGC
  • Status:
    • CASTOR 2.1.13-9
    • CASTOR SRM 2.11-2
    • DPM 1.8.6-1
    • xrootd 3.2.7-1
  • Recent changes: None
  • Planned changes: DPM upgrade to 1.8.7-3 on Nov 25th; 9-hour intervention, 01:00 to 10:00 UTC

BNL
  • Status:
    • dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup)
    • http (aria2c) and xrootd/Scalla on each pool
  • Recent changes: None
  • Planned changes: dCache upgrade to v2.6 on Dec 17 for SHA-2 compatibility

CNAF
  • Status: StoRM 1.11.2 emi3 (ATLAS, CMS, LHCb)
  • Recent changes:
    • Removed ALICE endpoint
    • Added xrootd to LHCb endpoint
  • Planned changes: None

FNAL
  • Status:
    • dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3
    • Scalla xrootd 2.9.7/3.2.7.slc
    • Oracle Lustre 1.8.6
    • EOS 0.3.2-4/xrootd 3.3.3-1.slc5 with Bestman 2.2.2.0.10
  • Recent changes: EOS 0.3.2-4
  • Planned changes: Will test EOS 0.3.3 and possibly upgrade next week; dCache 2.2 hopefully by end of 2013

IN2P3
  • Status:
    • dCache 2.6.15-1 (Chimera) on SL6 core servers and pool nodes
    • Postgres 9.2
    • xrootd 3.0.4
  • Recent changes: dCache upgrade to 2.6.15 on 15/11/2013 (SHA-2 compliant)
  • Planned changes: None

KISTI
  • Status:
    • xrootd v3.2.6 on SL5 for disk pools
    • xrootd 20100510-1509_dbg on SL6 for tape pool
    • DPM 1.8.6

KIT
  • Status:
    • dCache
      • atlassrm-fzk.gridka.de: 2.6.5-1
      • cmssrm-fzk.gridka.de: 2.6.5-1
      • lhcbsrm-kit.gridka.de: 2.6.5-1
    • xrootd
      • alice-tape-se.gridka.de 20100510-1509_dbg
      • alice-disk-se.gridka.de 3.2.6
      • ATLAS FAX xrootd proxy 3.3.1-1
  • Recent changes: None
  • Planned changes: We want to update all dCache setups to at least 2.6.15 this year; no dates fixed yet.

NDGF
  • Status: dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes.

NL-T1
  • Status: dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF)

PIC
  • Status:
    • dCache head nodes (Chimera) and doors at 2.2.17-1
    • xrootd door to VO servers (3.3.1-1)
  • Recent changes: None
  • Planned changes: None

RAL
  • Status:
    • CASTOR 2.1.13-9
    • 2.1.13-9 (tape servers)
    • SRM 2.11-1

TRIUMF
  • Status: dCache 2.2.18

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 for T1 and US T2s | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | About to deploy 1.8.7-3 on Puppet-managed OpenStack nodes |

Other site news

Status of Tier-1 WN deployment on OPN now tracked in this survey:

NOTE: this is NOT a request for deployment, it is a survey of the current status to facilitate experiment operations planning.

Experiments operations review and Plans

ALICE

  • CERN
    • SLC6 jobs (all running in VMs) have higher failure rates and typically 10-20% lower CPU/wall-time efficiency than SLC5 jobs (all running on physical nodes)
      • still to be understood
  • CVMFS
    • 59 sites using it in production
    • 21 in various stages of preparation
    • sites, please ensure the WNs run CVMFS version 2.1.15 or higher
  • SAM
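
A minimal sketch of the CVMFS version check requested above. It assumes the installed version is already available as a dotted string; how a site obtains that string is not covered here. The function names and the sample versions are hypothetical. Comparing integer tuples avoids the trap of plain string comparison, where "2.1.9" would sort after "2.1.15".

```python
MINIMUM_CVMFS = "2.1.15"  # minimum version requested in the ALICE report


def parse_version(text):
    """Turn a dotted version such as '2.1.15' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum=MINIMUM_CVMFS):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Hypothetical sample versions, for illustration only.
    for version in ["2.1.9", "2.1.14", "2.1.15", "2.1.17"]:
        status = "ok" if meets_minimum(version) else "needs upgrade"
        print(version, status)
```

Version strings with a release suffix (for example "-1") would need that suffix stripped before parsing.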

ATLAS

NTR

CMS

Not much to report

  • Ongoing projects proceeding
    • Disk-Tape separation at Tier-1s, SL6 migration, Savannah-GGUS, CVMFS, Multi-core scheduling
    • See last meeting for a few more details

  • gLite-WMS Decommissioning
    • CRAB users warned that support for gLite-WMS submission is being reduced
    • Use of the Glidein scheduler recommended (for more than a year now)
    • No centrally managed WMS list distributed with CRAB client any longer
      • CRAB client uses locally configured UI values for CMS

  • Sites interested in becoming a CMS Tier-3
    • T3_IN_PUHEP - Chandigarh, INDIA
    • T3_HR_IRB - Zagreb, Croatia

LHCb

  • Fall incremental stripping campaign finished within 6 weeks, many thanks to all T1 sites for the excellent performance
  • New LFC instances (puppetized and openstack based) have been put into production
  • A plan for the migration of grid user files from CERN CASTOR to EOS has been defined and will be executed on 9 and 10 December
    • A minimum downtime of 2 days will be required for the migration of the files
    • Once this migration is finished, no essential data will be left on the disk-only CASTOR storage (some more cleanup needed though)
  • LHCb will build only SLC6 binaries as of January 2014; applications deployed after that will no longer run on SL5 resources

Ongoing Task Forces Review

Middleware readiness

FTS3

perfSONAR

SHA-2

  • sites are steadily upgrading affected services to versions supporting SHA-2
    • in particular dCache and StoRM instances
    • EGI have a ticket open for each non-compliant service
    • OSG sites that need to upgrade include BNL and FNAL
  • the experiments have tested extensively and appear ready
  • timelines
    • by Dec 1 the WLCG infrastructure is expected to be mostly ready
      • remaining upgrades in Dec and possibly Jan
    • SHA-2 certs are unlikely to appear before the end of this year
      • the OSG CA foresees starting mid Jan
      • the CERN CA will switch when WLCG is ready
  • VOMRS
    • a VOMS-Admin test setup has been successfully loaded with the VOMRS data of ALICE
    • the setup will be redone after its configuration has been cleaned up
    • it is expected to become available for testing by the experiments in the course of next week
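
A rough sketch of how one might tell whether a certificate is already signed with a SHA-2 family digest. It assumes the certificate has been dumped to text in the style of `openssl x509 -text`, which includes a "Signature Algorithm:" line. The function names and the sample text are hypothetical, and this is not a tool used by the task force.

```python
import re


def signature_algorithm(cert_text):
    """Return the first 'Signature Algorithm' value found, or None."""
    for line in cert_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Signature Algorithm":
            return value.strip()
    return None


def uses_sha2(algorithm_name):
    """Return True if the name mentions SHA-224, -256, -384 or -512."""
    return re.search(r"sha-?(224|256|384|512)", algorithm_name.lower()) is not None


if __name__ == "__main__":
    # Hypothetical excerpt of a certificate text dump.
    sample = """
        Version: 3 (0x2)
        Signature Algorithm: sha1WithRSAEncryption
    """
    name = signature_algorithm(sample)
    print(name, "SHA-2" if uses_sha2(name) else "not SHA-2")
```

The check only looks at the algorithm name; it says nothing about whether a given service can actually validate such a certificate.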

WMS decommissioning

  • there has been progress for CMS - thanks!
    • users have been informed that support of the gLite WMS is ramping down and they have again been advised to use CRAB's scheduler=remoteglidein option instead
    • the CRAB-2 client will no longer use a centrally distributed list of WMS hosts
  • we will monitor the effect on the remaining CMS usage of the WMS nodes and try to progress further accordingly
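
A small sketch of how a user or support person might check whether a CRAB configuration already selects the scheduler recommended above. It assumes an INI-style configuration in which the option lives in a `[CRAB]` section; that section name, the helper names, and the sample content are assumptions for illustration. The comparison ignores case.

```python
import configparser

RECOMMENDED = "remoteglidein"  # value quoted in the minutes above


def scheduler_from_config(text):
    """Return the scheduler value from the assumed [CRAB] section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("CRAB", "scheduler", fallback=None)


def uses_recommended_scheduler(text):
    """Return True if the configured scheduler matches the recommendation."""
    scheduler = scheduler_from_config(text)
    return scheduler is not None and scheduler.strip().lower() == RECOMMENDED


if __name__ == "__main__":
    # Hypothetical configuration content, for illustration only.
    sample = "[CRAB]\njobtype = cmssw\nscheduler = glite\n"
    print(scheduler_from_config(sample), uses_recommended_scheduler(sample))
```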

gLExec

  • 58 tickets closed and verified, 36 still open
    • various sites waiting to finish their SL6 migration first
    • some difficult cases being debugged
  • EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tarball WN and do not have the Perl module Time/HiRes.pm installed (GGUS:98767)
    • a solution is being looked into
  • Deployment tracking page
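
Until a fix is available, a site could check for the missing Perl module itself. The sketch below assumes `perl` is on the PATH of the worker node and relies on `perl -MTime::HiRes -e 1` exiting non-zero when the module cannot be loaded. The helper name and the injectable `run` parameter are hypothetical and exist only to make the function easy to test.

```python
import subprocess


def perl_module_available(module, run=subprocess.run):
    """Return True if `perl -M<module> -e 1` exits with status 0."""
    try:
        result = run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # No perl executable found at all; treat the module as unavailable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    ok = perl_module_available("Time::HiRes")
    print("Time::HiRes available" if ok else "Time::HiRes missing")
```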

Tracking tools evolution

xrootd deployment

UDP collector (a.k.a. GLED) for detailed monitoring

  • an additional instance of the collectors has been enabled at CERN for FAX
    • ATLAS EU sites should configure detailed monitoring to send their monitoring stream to this new collector
      • migration will be coordinated within FAX
  • The current number of running collectors is 5
    • FAX: 1 Collector in US, 1 Collector at CERN, 1 Collector at CERN for EOS
    • AAA: 1 Collector in US, 1 Collector at CERN for EOS

IPv6

-- SimoneCampana - 19 Nov 2013

Topic revision: r20 - 2018-02-28 - MaartenLitmaath
 