WLCG Operations Coordination Minutes - 3 October 2013

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=272668

Attendance

  • Local: Andrea Sciabà (secretary), Nicolò Magini, Maarten Litmaath, Stefan Roiser, Oliver Keeble, Felix Lee, Ikuo Ueda, Simone Campana, Maria Dimou, Markus Schulz, Alberto Aimar, Pablo Saiz, Alessandro Di Girolamo, Maite Barroso Lopez, Domenico Giordano
  • Remote: Pepe Flix (chair), Javier Sanchez, Alessandra Doria, Alessandra Forti, Jeremy Coles, Massimo Sgaravatto, Alessandro Cavalli, Burt Holzman, Gareth Smith, Marc Caubet Serrabou, Daniele Bonacorsi

News

No news.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

No changes.

Tier-1 Grid services

Storage deployment

Per-site status, recent changes and planned changes:

  • CERN
    • CASTOR: v2.1.13-9-2 and SRM-2.11-2 for all instances
    • EOS: ALICE (EOS 0.2.37 / xrootd 3.2.8), ATLAS (EOS 0.2.38 / xrootd 3.2.8 / BeStMan2-2.2.2), CMS (EOS 0.2.38 / xrootd 3.2.8 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2)
    • EOS: PUBLIC (EOS 0.3.1 / xrootd 3.3.3)
  • ASGC
    • CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.6-1, xrootd 3.2.7-1
    • Recent changes: none; planned changes: none
  • BNL
    • dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool
    • Recent changes: none
    • Planned changes: upgrade of dCache to a SHA-2 capable version in late October
  • CNAF
    • StoRM 1.11.2 emi3 (ALICE, ATLAS, CMS, LHCb)
    • Recent changes: updated ALICE and CMS
  • FNAL
    • dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM), httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; Oracle Lustre 1.8.6; EOS 0.3.1-12 / xrootd 3.3.3-1.slc5 with Bestman 2.2.2.0.10
    • Recent changes: upgraded to EOS 0.3.1-12
    • Planned changes: upgrade to a new EOS version in 2 weeks; upgrade to dCache 2.2 before the end of the year
  • IN2P3
    • dCache 2.2.12-1 (Chimera) on SL6 core servers and 2.2.13-1 on pool nodes; Postgres 9.1; xrootd 3.0.4
  • KISTI
    • xrootd v3.2.6 on SL5 for disk pools; xrootd 20100510-1509_dbg on SL6 for the tape pool; DPM 1.8.6
  • KIT
    • dCache: atlassrm-fzk.gridka.de 2.6.5-1, cmssrm-fzk.gridka.de 2.6.5-1, lhcbsrm-kit.gridka.de 2.6.5-1
    • xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.1-1
  • NDGF
    • dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes
  • NL-T1
    • dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF)
  • PIC
    • dCache head nodes (Chimera) and doors at 2.2.17-1; xrootd door to VO servers (3.3.1-1)
  • RAL
    • CASTOR 2.1.13-9, 2.1.13-9 (tape servers), SRM 2.11-1
  • TRIUMF
    • dCache 2.2.13 (Chimera), pool/door 2.2.10

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1    
ASGC 2.2.8 - transfer-fts-3.7.12-1    
BNL 2.2.8 - transfer-fts-3.7.10-1 None None
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
IN2P3 2.2.8 - transfer-fts-3.7.12-1 FTS3 instance available since Sept. 4th. Currently used by CMS. Available for all VOs.
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1    
RAL 2.2.8 - transfer-fts-3.7.12-1    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL 1.8.3.1-1 for T1 and US T2s SL6, gLite ORACLE 11gR2 ATLAS None
CERN 1.8.6-1 SLC6, EMI2 Oracle 11 ATLAS, LHCb, OPS, ATLAS Xroot federations  

Other site news

CASTOR 2.1.14 will be deployed at CERN within the next two weeks. The CASTOR team contacted the experiments and a date should be agreed next Monday.

Data management provider news

gfal/lcg_util

A bugfix 1.16.0 release is being pushed to EPEL.

https://svnweb.cern.ch/trac/lcgutil/wiki/GFALRelease_1_16_0

The CERN DM clients follow the same release strategy as DPM, i.e. the packages have been removed from EMI and are released directly to EPEL. This is completely transparent, also for sites that use EMI.

A presentation will be made at the next GDB on medium term planning, including a discussion on timescales for the retirement of gfal/lcg_util in favour of gfal2/gfal-util. gfal-util is not yet released: we would like to have one more iteration on the functional scope, and then release it for testing.
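
For illustration, a minimal sketch of what client code looks like with the gfal2 Python bindings (gfal2-python) that the retirement plan points towards; the endpoints and file paths below are placeholders, not real services.

```python
# Minimal sketch using the gfal2 Python bindings; all URLs are placeholders.
import gfal2

ctx = gfal2.creat_context()

# Stat a remote file (hypothetical SRM endpoint).
info = ctx.stat('srm://example-se.cern.ch/dpm/example.org/home/vo/data/file.dat')
print('size: %d' % info.st_size)

# Third-party copy with transfer parameters (hypothetical endpoints).
params = ctx.transfer_parameters()
params.overwrite = True
params.timeout = 300
ctx.filecopy(params,
             'srm://source-se.example.org/path/file.dat',
             'srm://dest-se.example.org/path/file.dat')
```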

Experiment operations review and plans

ALICE

  • CVMFS
    • 41 tickets closed, 11 open, 3 done since last meeting
    • 27 sites have been switched already, many others are in progress
      • T1 done: CNAF, IN2P3, KIT, RAL
  • NDGF
    • Mon Sep 30: 35k files no longer found in dCache
    • Tue Oct 01: file name mapping turned out to be broken due to a misconfiguration a few weeks ago; some renaming to do, after which the missing files should be available again
  • Russian sites
    • network problems partially resolved. The T2s have 100 Mb/s, the proto-T1 follows a different route and has a better connection.
    • 4 sites operational: JINR, RRC-KI, SPbSU, IHEP (limited at 150 jobs)
    • 4 sites still closed for ALICE jobs: PNPI, Troitsk, ITEP, MEPHI

ATLAS

  • the BNL VOMS server had to be switched off because of the new CA that now provides the host certificate (/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=vo.racf.bnl.gov), resulting in a different DN for the VOMS server. This DN is hardcoded in a file at the EGI sites; we are working on a broadcast to have this file updated.
  • CERN SHA-2 CA: we understand that the new SHA-2 CERN CA is a pilot version (as clearly stated), but should it be the experiments who tell their users that some services may still not work with SHA-2 (so that those users also need a SHA-1 certificate), or should the CA itself give an even stronger warning?
  • 21-25 October is the ATLAS SW week

Maarten agrees that it should be stated much more strongly on the new CA site that it is still a pilot service and MUST NOT yet be used to get certificates for Grid usage, at least before December 1st. He will reiterate the request.
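
As a practical aside (not discussed at the meeting), users can check whether a given certificate is SHA-1 or SHA-2 signed by looking at its signature algorithm; a minimal sketch using the Python cryptography package, with a placeholder certificate path:

```python
# Print the issuer and the signature hash algorithm of a certificate,
# to see whether it is SHA-1 or SHA-2 signed.
# Requires the "cryptography" package; the file path is a placeholder.
from cryptography import x509
from cryptography.hazmat.backends import default_backend

with open('usercert.pem', 'rb') as f:
    cert = x509.load_pem_x509_certificate(f.read(), default_backend())

print('issuer     : %s' % cert.issuer.rfc4514_string())
print('signed with: %s' % cert.signature_hash_algorithm.name)  # e.g. 'sha1' or 'sha256'
```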

CMS

  • Processing activities
    • 2011 data reprocessing ongoing

  • CVMFS
    • A handful of sites not yet migrated
    • Sending of installation jobs was supposed to stop now
    • It was agreed to keep sending installation jobs to selected sites for another month

LHCb

  • GGUS statistics for the last 2 weeks: 15 GGUS tickets opened (3x T1, 12x T2), mostly about
    • pilots aborting (7) because of site configuration issues
    • CVMFS (2), with a black-hole WN and performance problems because of increased load
    • wrong publishing of queue information (1), i.e. MaxCPUTime set to 999.999.999
  • Decommissioning of the lhcb-conddb CVMFS repository: due to changes in the reconstruction workflows, this "fast updating" repository will not be used in production.
    • All sites that have mounted this repository may remove it as a mount point on their worker nodes whenever they prefer
  • LHCb asks for the migration of WN resources to SL6 by the end of October, as agreed
    • the focus should be on T1 sites and "T2D" sites, where at least a part of the infrastructure should be made available on SL6
    • the main reason is the planned move to C++11, which will only be supported on the SL6 platform

Task Force reports

Machine/Job Features

  • A tool, "mjf.py", has been developed that can be called on the command line or imported in Python and returns the machine/job features
    • Tested on CERN physical batch nodes and in the OpenStack environment
    • Next steps are to generate the information automatically within CERN/OpenStack, package the tool, deploy it on lxbatch and make it available to the VOs for integration into their VMs for testing

Pepe asks how the HS06 values are extracted. Stefan explains that for physical hosts the values are set by the sysadmins and are reasonably accurate, but for virtual machines it is still an open question whether they are needed and, if so, how to define them (for example, using a mean value). Maarten adds that this issue is much more general and HEPIX experts will be involved. Concerning the validation of the published numbers, a SAM test is going to be used.
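
As an illustration only (the actual mjf.py was not shown), a minimal sketch of reading machine/job features, assuming the usual convention that the MACHINEFEATURES and JOBFEATURES environment variables point to directories in which each feature (e.g. hs06, jobslots) is a small text file:

```python
# Minimal sketch of reading machine/job features; this assumes the
# MACHINEFEATURES/JOBFEATURES directory convention and is not the
# actual mjf.py tool mentioned above.
import os

def read_features(env_var):
    """Return a dict of feature name -> value for one feature directory."""
    path = os.environ.get(env_var)
    features = {}
    if not path or not os.path.isdir(path):
        return features
    for name in os.listdir(path):
        try:
            with open(os.path.join(path, name)) as f:
                features[name] = f.read().strip()
        except IOError:
            pass  # unreadable entries are simply skipped
    return features

if __name__ == '__main__':
    print('machine features: %s' % read_features('MACHINEFEATURES'))
    print('job features    : %s' % read_features('JOBFEATURES'))
```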

SHA-2 migration

Maarten explains that the incident with the BNL VOMS server reported above serves as a "heads-up" for CERN, as in the coming months the two CERN VOMS services will also need to change CA (the host DNs would stay the same, but that is not sufficient). We are working on a plan that minimises the disruption for the infrastructure. A possible solution would be to deploy a third VOMS server with the new CA in production, make the infrastructure aware of it and then remove the old ones. This is not urgent, but will need to happen within the next half year.

Nicolò asks Burt what the plans are for upgrading dCache at FNAL to a SHA-2 ready version. Burt says that they plan to upgrade to version 2.2 by the deadline, but considering that the upgrade is coupled to the disk/tape separation and the migration to Chimera, it is particularly tricky.

FTS-3

  • Issues with priorities and some stuck file transfers were observed by ATLAS on the RAL FTS3. Both problems were understood and fixed:
    • Sorting files based on priority - from the devs: there was a misplaced ORDER BY clause; instead of being in the outer SELECT, it was in a sub-query
    • Stuck file transfers - from the devs: we recently introduced a way to load-balance FTS3 servers based on hash keys, so that FTS3 servers do not race over the same records in the database. While the range of the hash keys should have been 0-FFFF, it was 0-FFFE, so the records assigned to FFFF were never executed; that is why not all transfers were affected (see the sketch after this list)
  • FTS3 server at CCIN2P3 stable for CMS Debug transfers, ready to accept more Tier-2s. Increased load on selected links on CERN FTS3 server.
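
A small sketch of the hash-range bug described above (based on the developers' description, not actual FTS3 code): records are assigned to servers by a 16-bit hash key, and a key space ending at 0xFFFE instead of 0xFFFF leaves records hashing to 0xFFFF with no server to execute them.

```python
# Illustrative only: hash-based load balancing with an off-by-one key range.
import hashlib

def record_hash(transfer_id):
    """16-bit hash key for a transfer record (illustrative)."""
    return int(hashlib.md5(transfer_id.encode()).hexdigest(), 16) & 0xFFFF

def split_ranges(n_servers, upper):
    """Split the key space [0, upper] into n_servers contiguous ranges."""
    step = (upper + 1) // n_servers
    return [(i * step, upper if i == n_servers - 1 else (i + 1) * step - 1)
            for i in range(n_servers)]

buggy = split_ranges(3, 0xFFFE)  # key space wrongly ends at 0xFFFE
fixed = split_ranges(3, 0xFFFF)  # key space correctly ends at 0xFFFF
key = 0xFFFF                     # e.g. a value record_hash() may return
print(any(lo <= key <= hi for lo, hi in buggy))  # False: such transfers stay stuck
print(any(lo <= key <= hi for lo, hi in fixed))  # True: every key has an owner
```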

Pepe asks if and when CMS plans to use more FTS-3 servers in the PhEDEx debug instance for testing. Nicolò answers that the focus right now is on testing the scalability with the lowest possible number of servers. Alessandro adds that for ATLAS the priority is to use a few (ideally one) well-supported servers, so that problems can be understood and fixed quickly; testing e.g. the Oracle backend is less important. Stefan says that LHCb uses only the FTS-3 at CERN, with RAL as backup. Oliver comments that the goal would actually be to support the whole infrastructure with a single server. It is clear though that sites would like to move away from FTS-2, because it is costly in terms of operations and licenses. Answering Burt, Alessandro thinks that FTS-3 is not quite ready to replace FTS-2 instances yet, but we can expect it to be in 1-2 months.

gLExec

  • 46 tickets closed and verified (1 done since last meeting), 48 still open, most on hold until mid/late autumn in line with SL6 migration.
    • ATLAS sites are lagging behind more than sites for other VOs.
  • Deployment tracking page

XrootD Deployment

dCache xrootd door monitoring plugin (from I. Vukotic)

  • Versions of the third-party plugin for the XRootD detailed monitoring of dCache sites have been released in the CERN WLCG repository http://linuxsoft.cern.ch/wlcg/, compatible with dCache versions v1.9, v2.2 and v2.6
    • For dCache versions <= v2.2, the installation of the xrootd4j-backport plugin is additionally needed
      • now distributed in the same WLCG repository
Summary of RPMs
  • dCache 1.9
    • dcache19-plugin-xrootd-monitor-5.0.0-1.noarch.rpm
    • dcache19-plugin-xrootd4j-backport-2.4-1.noarch.rpm
  • dCache 2.2
    • dcache22-plugin-xrootd-monitor-5.0.0-1.noarch.rpm
    • dcache22-plugin-xrootd4j-backport-2.4-1.noarch.rpm
  • dCache 2.6
    • dcache26-plugin-xrootd-monitor-5.0.0-1.noarch.rpm

Pepe asks if CMS is already using these plugins. Domenico answers that some CMS dCache sites are, but he will further push the others. It is easy to find out which sites are using the plugins by comparing the list of known dCache sites to the sites publishing xrootd monitoring information.
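
The comparison mentioned is essentially a set difference; a trivial sketch with placeholder site names:

```python
# Placeholder site lists; in practice they would come from the dCache site
# inventory and from the xrootd monitoring stream respectively.
dcache_sites = {'T2_EXAMPLE_A', 'T2_EXAMPLE_B', 'T2_EXAMPLE_C'}
monitored_sites = {'T2_EXAMPLE_A', 'T2_EXAMPLE_C'}

missing = sorted(dcache_sites - monitored_sites)
print('dCache sites not publishing xrootd monitoring: %s' % ', '.join(missing))
```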

IPv6

  • The relevant scenarios have been identified (in order of increasing importance):
    • testing dual-stack services with IPv4 clients
    • testing dual-stack services with dual-stack clients
    • testing dual-stack services with IPv6 clients
  • Those scenarios will apply to the following use cases
    • Job Submission
    • Data transfer (including third party)
    • Condition data access
    • Access to Experiment Software
    • Information System
    • Monitoring
  • Finally, full experiment workflows (involving some of the use cases above) should also be tested
  • More information at https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6
  • We are testing a recipe to provide dual-stack nodes on the CERN Agile Infrastructure, useful for testing. In progress.

The really important thing to test as soon as possible is that moving existing services to a DualStack configuration will not break anything in production.
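
A minimal client-side sketch of such a check, independent of any WLCG test framework: it simply verifies that a host resolves and accepts TCP connections over both IPv4 and IPv6 (host and port are placeholders).

```python
# Check reachability of a service over IPv4 and IPv6 separately.
import socket

def reachable(host, port, family):
    """Return True if the host accepts a TCP connection over the given family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no A/AAAA record for this address family
    for fam, stype, proto, _, addr in infos:
        s = socket.socket(fam, stype, proto)
        s.settimeout(5)
        try:
            s.connect(addr)
            return True
        except socket.error:
            continue
        finally:
            s.close()
    return False

host, port = 'dualstack-service.example.org', 443  # placeholders
print('IPv4 reachable: %s' % reachable(host, port, socket.AF_INET))
print('IPv6 reachable: %s' % reachable(host, port, socket.AF_INET6))
```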

Pepe asks why tape systems are not in the list of services to test. Simone answers that any IPv4 address shortage will likely not affect such service nodes and that anyway tape systems are never directly exposed to the outside.

Andrea reiterates the invitation for sites to join the testing activities in collaboration with the HEPIX IPv6 working group (where several T1s and some T2s are already active).

perfSONAR

The procedure to register perfSONAR instances in OSG has been revisited. OSG sites are invited to follow the new instructions.

Tracking tools

Middleware verification

See the Middleware Readiness Verification page for the details.

Alberto proposes to maintain a table with all products, the names of the people doing the testing and an indication of whether a given version is verified. Maarten thinks keeping track of this information should be a reasonably light task. In most cases, finding the right sites and people to test a given product should be straightforward, as many are already active in validation (e.g. for DPM, dCache, StoRM, etc.). The EGI staged rollout should be taken into account for as long as it exists, but we cannot automatically assume that a given version is good for WLCG when it appears in the UMD. Markus points out that the role of the experiments should not be underestimated, because they should eventually give the green light, and this was never really done in the context of the staged rollout. Therefore their participation in the process is essential.

SL6 migration

28 days to the deadline (31 October)

  • Total number of Tier1s Done: 10/16 (Alice 5/10, Atlas 8/13, CMS 4/8, LHCb 5/9)
  • Total number of Tier1s not Done: 6/16 (3 with a plan, 3 in progress)
    • KIT: working on a last-minute software version conflict; should start migrating next week. VO representatives will take care of integrating the site in the experiment frameworks.
    • FNAL: working on NFS configuration and performance; the first node should be ready on Monday, with a rolling migration from then on.
    • RRC-KI-T1: no report since August.
    • CERN: 13.5% migrated.
    • IN2P3-CC: 50% migrated
    • NDGF: 85% migrated

  • Total number of Tier2s Done: 74/130 (Alice 20/40, Atlas 46/89, CMS 36/65, LHCb 19/45) - ~56.9%
  • Total number of Tier2s not Done: 56/130 (1 without a plan, 43 with a plan, 12 in progress)
    • 3 more sites are scheduled to be done by Friday.
    • There might be a fat tail after 31 October.

AOB

Jeremy asks if any issues with GOCDB v5 have been noticed after its release yesterday. Pepe reports a small problem in declaring a downtime, but otherwise it looks fine. Stefan asks if this GOCDB version implements the GLUE 2 schema: the answer is yes, but it is not exposed. In any case, it is agreed that this is a good time to resume the discussion on how to declare downtimes for tape systems (see action no. 2 below).

Action list

  1. Tracking tools TF members who own Savannah projects to list them and submit them to the Savannah and JIRA developers if they wish to migrate them to JIRA. AndreaV and MariaD to report on their experience from the migration of their own Savannah trackers. Further discussion expected at the next meeting, after a dedicated meeting about the migration of the GGUS Savannah tracker to JIRA. Maria clarifies that 83 trackers need a decision, and trackers that will not be migrated will be gone for good. Maarten suspects the migration cannot be finished this year, but will need to stretch a few months into next year.
  2. Investigate how to separate Disk and Tape services in GOCDB
  3. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress
  4. Experiments interested in using WebDAV should contact ATLAS to organise a common discussion
    • input welcome by the next GDB
  5. Contact the storage system developers to find out which are the default/recommended ports for WebDAV

-- AndreaSciaba - 01-Oct-2013
