WLCG Operations Coordination Minutes - February 20, 2014

Agenda

Attendance

  • Local:
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • OpenSSL issue
    • An EGI broadcast was sent on Feb 4 describing the current state of affairs and recipes for cures
    • Sites using HTCondor as their batch system may need to apply one of these configuration changes for now (a sketch follows this list):
      • DELEGATE_JOB_GSI_CREDENTIALS = False
      • GSI_DELEGATION_KEYBITS = 1024
    • In HTCondor v8.0.6 the default has been increased to 1024
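
A minimal illustrative sketch (not taken from the broadcast itself; the file name is an assumption) of how one of the settings could be applied on an HTCondor site and picked up without a restart:

    # /etc/condor/config.d/99-openssl-workaround.conf   (hypothetical file name)
    # Apply ONE of the two changes recommended in the EGI broadcast:
    # either switch off delegation of the job GSI credentials ...
    # DELEGATE_JOB_GSI_CREDENTIALS = False
    # ... or raise the delegated proxy key length to 1024 bits
    # (already the default from HTCondor v8.0.6 onwards)
    GSI_DELEGATION_KEYBITS = 1024

    # then make the running daemons re-read the configuration:
    # condor_reconfig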

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.14-5 and SRM-2.11-2 on all instances. EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4), ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0), CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0), LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.2.2) | | |
| ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1 | None | None |
| BNL | dCache 2.6.18 (Chimera, Postgres 10 w/ hot backup), http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.11.3 emi3 (ATLAS, LHCb), StoRM 1.11.2 emi3 (CMS) | | |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM), httpd=2.2.3, Scalla xrootd 2.9.7/3.2.7.slc, EOS 0.3.15-1 / xrootd 3.3.6-1.slc5 with Bestman 2.3.0.16 | Upgrade to EOS and xrootd | Will begin migrating pool nodes to FNAL_Disk dCache instance |
| IN2P3 | dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes, Postgres 9.2, xrootd 3.0.4 (ALICE T1), xrootd 3.3.4 (ALICE T2) | xrootd 3.3.4 (ALICE T2) | None |
| JINR-T1 | dCache: srm-cms.jinr-t1.ru 2.6.21, srm-cms-mss.jinr-t1.ru 2.2.24 with Enstore; xrootd federation host for CMS: 3.3.3 | | |
| KISTI | xrootd v3.2.6 on SL5 for disk pools, xrootd 20100510-1509_dbg on SL6 for tape pool, DPM 1.8.7-3 | | |
| KIT | dCache: atlassrm-fzk.gridka.de 2.6.17-1, cmssrm-kit.gridka.de 2.6.17-1, lhcbsrm-kit.gridka.de 2.6.17-1; xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.3-1 | | |
| NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes | | |
| NL-T1 | dCache 2.2.17 (Chimera) (SURFsara), DPM 1.8.7-3 (NIKHEF) | | |
| PIC | dCache head nodes (Chimera) and doors at 2.2.17-1, xrootd door to VO servers (3.3.4) | | |
| RAL | CASTOR 2.1.13-9, 2.1.14-5 (tape servers), SRM 2.11-1 | | |
| TRIUMF | dCache 2.2.18 | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 (for T1 and US T2s) | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.7-3 | SLC6, EPEL | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Experiments operations review and Plans

ALICE

  • job efficiencies will fluctuate with the amount of analysis jobs being run in preparation for the Quark Matter 2014 conference (May 19-24)
  • CERN
    • Wigner vs. Meyrin job failure rates and CPU/wall-time efficiencies
      • no change
      • next meeting Feb 28
  • KIT
    • the final number of corrupted files was 26126
      • 21k other files were salvaged, thanks very much!
    • many jobs were reading a lot of data remotely from CERN
      • resolved?
  • CNAF
    • tape SE updated to xrootd v3.3.4 with new checksum plugin successfully validated with test transfers, thanks!
  • RRC-KI-T1
    • careful memory tuning for jobs done, thanks!

ATLAS

  • Nothing to add since the WLCG F2F meeting

CMS

  • DBS3 was put into production! This required a week-long stop of production activity: production was drained for Feb. 10th and started to ramp back up on Feb. 12th in the CERN evening
  • CVMFS
    • CMS wants to access Integration Builds (~nightly builds) via CVMFS
      • Needs to be separated from production software releases because of very high activity (2 builds per day)
      • Will start working on establishing an additional CVMFS instance and repository only for nightly builds
    • CVMFS at CERN
      • testing a few LSF nodes with CVMFS instead of AFS
      • If OK, we will inform the user community and then plan the switch to CVMFS at CERN
  • RAL FTS3 scale test
    • Sites have to change their local configuration to use the RAL FTS3 server
    • We will start pushing now that DBS3 is in production
  • EOS redirector needed to be restarted, GGUS:101414
    • work has started to introduce redundancy and to provide proper critical-service instructions
  • T1 scheduling policy
    • reintroducing the 5% share for analysis jobs (/cms/Role=pilot)
    • needed so that SAM tests do not time out and to allow more analysis
    • VO card updated, will discuss with T1 sites soon (a possible batch configuration sketch follows this list)
  • CNAF T1 is in downtime to upgrade storage, production activities were stopped but queues were kept open for analysis reading input via AAA, seems to work fine
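
How the 5% share is implemented is up to each site and its batch system. Purely as an illustration, a hedged sketch of how a site running HTCondor might express such a split with accounting groups; the group names, the file name, the 95% production value, and the mapping of /cms/Role=pilot proxies to the analysis group are assumptions, not CMS prescriptions:

    # /etc/condor/config.d/50-cms-shares.conf   (hypothetical example)
    # Two accounting groups for CMS work on this site:
    GROUP_NAMES = group_cms_prod, group_cms_analysis
    # Dynamic quotas are fractions of the pool: 95% production, 5% analysis.
    GROUP_QUOTA_DYNAMIC_group_cms_prod = 0.95
    GROUP_QUOTA_DYNAMIC_group_cms_analysis = 0.05
    # Allow either group to use idle slots beyond its quota when the other is empty.
    GROUP_ACCEPT_SURPLUS = True
    # The CE / pilot mapping (site-specific) must tag jobs submitted with
    # /cms/Role=pilot proxies as AccountingGroup = "group_cms_analysis.<user>".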

LHCb

  • Started the incremental stripping campaign, expected duration 6-8 weeks; heavy loads on the stager systems are foreseen.
    • LHCb is asking the T1 sites for the stager throughput stated in the table below (same as for the last campaign)
  • WN power at Wigner was found to be significantly lower than at CERN/Meyrin and at other grid sites. As a consequence, single payloads will take more wall time to execute for the same work than at other sites.
  • EOS/SRM endpoint upgraded to the new SHA-2 compliant version
  • WMS decommissioning: no progress in the last two weeks because of other priorities. As agreed before, decommissioning of the CERN instance by April is OK for LHCb.

Stager throughput at T1 sites for incremental stripping campaign

| Site | Total Rate (MB/s) |
| CERN | 50 |
| CNAF | 153 |
| GRIDKA | 124 |
| IN2P3 | 134 |
| PIC | 39 |
| RAL | 111 |
| SARA | 104 |

Ongoing Task Forces and Working Groups

FTS3 Deployment TF

  • Deployment proposal presented at F2F
  • RAL FTS3 incident on Feb 18th - case for multiple server deployment

gLExec deployment TF

IPv6 validation and deployment TF

Machine/Job features TF

  • For batch we have a prototype installation ready for testing (see the sketch after this list for how a payload could read the features). For cloud usage there was a concern about using the "meta-data service" for providing the job features
    • A prototype based on CouchDB has been tested and, after some stress testing, proved to be reliable. Currently discussing with CERN/OpenStack the further steps for installing this tool.
  • The remaining VOs were asked to give feedback by the end of February (as agreed in the TF) on the current prototype installation on the CERN batch infrastructure.
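
Purely for illustration, a minimal sketch of how a payload could read features from the batch prototype, assuming it follows the usual machine/job features convention of MACHINEFEATURES and JOBFEATURES environment variables pointing to directories with one value per key file; the key names used (hs06, jobslots, wall_limit_secs) are examples and may differ in the prototype:

    #!/usr/bin/env python
    # Minimal sketch, not the TF's reference client: read a few machine/job
    # features from the directories pointed to by the MACHINEFEATURES and
    # JOBFEATURES environment variables (one file per key, one value per file).
    import os

    def read_feature(base_env, key):
        """Return the value of a feature key, or None if it is not published."""
        base = os.environ.get(base_env)
        if not base:
            return None
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except IOError:
            return None

    if __name__ == "__main__":
        # Key names below are illustrative; the prototype defines the actual set.
        print("HS06 of this machine: %s" % read_feature("MACHINEFEATURES", "hs06"))
        print("Job slots:            %s" % read_feature("MACHINEFEATURES", "jobslots"))
        print("Wall limit (s):       %s" % read_feature("JOBFEATURES", "wall_limit_secs"))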

Middleware readiness WG

  • Minutes of the Feb 6 meeting are linked from the agenda. They include conclusions on repositories, answering the concerns raised on 2014/02/19 in the e-group.
  • The next meeting is to be decided in the Doodle poll. It currently shows that we would meet on Mon March 10th at 3:30pm CET. If you wish to change this, please keep voting.

Multicore deployment

  • Reported in detail last week at the F2F meeting:
    • First round of meetings with initial reports from experiment activities status (ATLAS and CMS).
    • First impressions from sites.
  • Next meetings dedicated to batch system mini-workshops (HTCondor, SGE, Torque/Maui, Slurm).

perfSONAR deployment TF

SHA-2 Migration TF

  • EOS SRM for LHCb OK since Feb 17
  • VOMRS
    • the VOMS-Admin test cluster is available since Feb 17
      • experiment VO admins are trying it out and reporting feedback, thanks!
    • host certs of our future VOMS servers are from the new SHA-2 CERN CA
      • we will do a campaign to get the new servers recognized across the grid well before we start using them
      • for convenience we will also provide the configuration files in RPMs

Tracking tools evolution TF

WMS decommissioning TF

  • all looks fine for decommissioning the WMS instances for experiments by the end of April
    • draining as of early April
  • the SAM instances have their own timeline

xrootd deployment TF

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated, current solution to be validated
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress - action item to be expanded

-- NicoloMagini - 17 Feb 2014
