WLCG Operations Coordination Minutes - April 3rd, 2014



  • Maria Alandes (chair), Nicolo Magini (secretary)
  • local:
  • remote:


  • The next WLCG Collaboration Workshop will take place from the 7th to the 9th of July in Barcelona. Some details are already available in the Indico Agenda. Registration will open next week. More details will be announced at the GDB next Wednesday. As far as the agenda is concerned, we are working on it and we will contact experiments and the relevant people in the upcoming weeks.

  • The next WLCG Operations Planning meeting will take place on 17 April 2014. Experiments should present their plans for the next three months.

Middleware news and baseline versions


Tier-0 and Tier-1 Grid services

Storage deployment

Site Status Recent changes Planned changes
CERN CASTOR v2.1.14-11 and SRM-2.11-2 on ATLAS, ALICE, CMS and LHCb
ALICE (EOS 0.3.4 / xrootd 3.3.4)
ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0)
CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0)
LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0 (OSG pre-release))
ASGC CASTOR 2.1.13-9
DPM 1.8.7-3
None None
BNL dCache 2.6.18 (Chimera, Postgres 9.3 w/ hot backup)
http (aria2c) and xrootd/Scalla on each pool
None None
CNAF StoRM 1.11.3 emi3 (ATLAS, LHCb, CMS)    
FNAL dCache 2.2 (Chimera, postgres 9) for disk instance; dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) for tape instance; httpd=2.2.3
Scalla xrootd 3.3.6-1
EOS 0.3.21-1/xrootd 3.3.6-1.slc5 with Bestman
  Planning to upgrade dCache tape instance to 2.2+Chimera soon
IN2P3 dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes
Postgres 9.2
xrootd 3.3.4 (Alice T1), xrootd 3.3.4 (Alice T2)
JINR-T1 dCache
  • srm-cms.jinr-t1.ru: 2.6.23
  • srm-cms-mss.jinr-t1.ru: 2.2.24 with Enstore
xrootd federation host for CMS: 3.3.6
KISTI xrootd v3.3.4 on SL6 (redirector only; servers are still 3.2.6 on SL5 to be upgraded) for disk pools (ALICE T1)
xrootd 20100510-1509_dbg on SL6 for tape pool
xrootd v3.2.6 on SL5 for disk pools (ALICE T2)
dpm 1.8.7-4
xrootd v3.3.4 installed on SL6; upgrade to 3.3.4 on SL6 planned for all xrootd nodes (including the tape pool; waiting for procurement of a SAN switch)
KIT dCache
  • atlassrm-fzk.gridka.de: 2.6.21-1
  • cmssrm-kit.gridka.de: 2.6.17-1
  • lhcbsrm-kit.gridka.de: 2.6.17-1
  • alice-tape-se.gridka.de 20100510-1509_dbg
  • alice-disk-se.gridka.de 3.2.6
  • ATLAS FAX xrootd redirector 3.3.3-1
NDGF dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes.    
NL-T1 dCache 2.2.17 (Chimera) (SURFsara), DPM 1.8.7-3 (NIKHEF)    
PIC dCache head nodes (Chimera) and doors at 2.2.23-1
xrootd door to VO servers (3.3.4)
RAL CASTOR 2.1.13-9
2.1.14-5 (tape servers)
SRM 2.11-1
RRC-KI-T1 dCache 2.2.24 + Enstore (ATLAS)
dCache 2.6.22 (LHCb)
xrootd - EOS 0.3.19 (Alice)
TRIUMF dCache 2.6.21    

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1    
ASGC 2.2.8 - transfer-fts-3.7.12-1 None None
BNL 2.2.8 - transfer-fts-3.7.10-1 None None
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
IN2P3 2.2.8 - transfer-fts-3.7.12-1    
JINR-T1 2.2.8 - transfer-fts-3.7.12-1    
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1    
RAL 2.2.8 - transfer-fts-3.7.12-1    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL for T1 and US T2s SL6, gLite ORACLE 11gR2 ATLAS None
CERN 1.8.7-3 SLC6, EPEL Oracle 11 ATLAS, LHCb, OPS, ATLAS Xroot federations  

Oracle deployment

  • Note: only Oracle instances with a direct impact on offline computing activities of LHC experiments are tracked here
  • Note: an explicit entry for specific instances is needed only during upgrades, listing affected services. Otherwise sites may list a single entry.

Site Instances Current Version WLCG services Upgrade plans
CERN CMSR CMS computing services Done on Feb 27th
CERN CASTOR Nameserver CASTOR for LHC experiments Done on Mar 04th
CERN CASTOR Public CASTOR for LHC experiments Done on Mar 06th
CERN CASTOR Alicestg, Atlasstg, Cmsstg, LHCbstg CASTOR for LHC experiments Done: 10-14-25th March
CERN LCGR All other grid services (including e.g. Dashboard, FTS) Done: 18th March
CERN LHCBR LHCb LFC, LHCb Dirac bookkeeping Done: 24th of March
CERN ATLR, ADCR ATLAS conditions, ATLAS computing services Done: April 1st
CERN HR DB VOMRS upgrade to planned for 14th April
CERN CMSONR_ADG CMS conditions (through Frontier) TBA: upgrade to (tentatively May)
BNL ATLAS LFC, ATLAS conditions TBA: upgrade to (tentatively June)
RAL ATLAS conditions TBA: upgrade to (tentatively June)
IN2P3 ATLAS conditions TBA: upgrade to (tentatively April)
TRIUMF TRAC ATLAS conditions Done

T0 news

  • A series of meetings has been held since the end of 2013 with IT experts and the 4 experiments to measure batch job efficiency depending on where the jobs are executed: Meyrin or Wigner. The conclusions are being drafted and will be discussed at the next ISM (IT Services Meeting) on the 14th of April; after that they will be reported elsewhere. Stay tuned.
    • Some derived actions are already in place: job efficiencies on lxbatch are now monitored systematically, with automated notification.

  • The WMS service at CERN is being decommissioned according to the announced timeline. The CMS and shared instances are being drained this week. The SAM instance will be kept until the replacement mechanism is in place, estimated by the end of June.

  • 65% of the batch capacity has been migrated to SLC6; a further 5% is expected by the end of next week.

  • Batch and lxplus upgraded to CVMFS client 2.1, in line with the 1st of April deadline.

  • All middleware services are up to date with the WLCG baseline versions.

  • The central JIRA instance currently runs JIRA 6.1.7 and hosts more than 230 projects. For Savannah project migrations, we recommend that Open Source projects (notably Grid software) go to the "Open" instance, and other projects to the central instance. (The PH/SFT Savannah team is handling the migration.)

  • After some weeks of testing VOMS admin, and after finding some bugs to be fixed and some features that need further understanding or changes (because they behave differently from VOMRS), it has been agreed with the developers that a new voms-admin release is needed before going to production.

Other site news

Data management provider news

Experiments operations review and Plans


  • Steady activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • KIT
    • around 14:00 CEST on Apr 1 the number of running jobs shown in MonALISA started dropping from 5k+ to ~1k for unknown reasons; since then the numbers have fluctuated wildly around that low level, while the batch system typically sees ~3k jobs running at any time; no changes were made by ALICE and other sites are working OK; experts are investigating
  • CERN
    • SLC6 job efficiencies
      • Geneva jobs looked better in the last few days, but got worse again starting around 03:00 today; the cause of these changes is unknown
      • next meeting tomorrow


  • MC production and analysis: activity as usual
  • ADCR DB upgrade went smoothly.
  • Rucio migration: the migration from LFC to Rucio as file catalog is proceeding; more than half of the ATLAS sites have been migrated. No show-stoppers seen so far.
  • Rucio commissioning: commissioning of the various Rucio services started in the last few days: injection of data (Automatix), creation of subscriptions and their translation into per-file rules (Transmogrifier), rule evaluation (Judge), and data transfer (Conveyor, through FTS3).
  • Data transfer issues: a few links with "slow transfers" (order of 0.5 MB/s) were observed. Our understanding was presented at an ADC weekly meeting a few days ago. We are discussing with ATLAS and perfSONAR experts how to identify, track and follow up such issues; we believe this could be a WLCG activity.
  • CERN low-efficiency jobs: under investigation, with 2 issues under the spotlight: 1) ATLAS pilot jobs are (were) not cleaning up properly after the main process finishes (successfully); leftover processes (effectively daemons) were keeping the batch system slot busy. 2) CREAM reports back to ATLAS that a job finished successfully while LSF reports it still running. This is under investigation by Ulrich et al.
  • CVMFS cache: failures were observed at CERN because one of the ATLAS files is 2.2 GB while the default shared cache was set to 2 GB. A slightly different but related issue: with CVMFS < 2.1.17, if the cache quota does not leave headroom on the partition for the biggest file that may be transferred, failures can occur when the partition fills up; 2.1.17 should have fixed this.
  • FTS3: now using both the CERN and RAL FTS3 servers, with an 80%/20% split.
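The CVMFS cache pitfall reported above can be avoided by keeping the client quota well above the largest file the repositories may serve, with headroom left on the cache partition. A minimal sketch of a site-local client configuration; `CVMFS_QUOTA_LIMIT` and `CVMFS_CACHE_BASE` are the actual parameter names, but the values below are illustrative, not a recommendation:

```shell
# /etc/cvmfs/default.local -- illustrative site-local CVMFS client settings.
# Keep the soft cache quota (in MB) well above the largest expected file
# (a 2.2 GB ATLAS file broke a 2 GB default cache), and size the cache
# partition with headroom beyond the quota.
CVMFS_QUOTA_LIMIT=20000
CVMFS_CACHE_BASE=/var/lib/cvmfs
```

After changing these settings, the client caches need to be reloaded (e.g. via `cvmfs_config reload`) for the new quota to take effect.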


  • DBS2 will be switched off April 7th
  • FTS2 decommissioning at CERN in August 2014 ok for CMS
  • FTS3: sites should switch their PhEDEx "Debug" agents to use the RAL FTS3 server
  • Brief production and processing overview:
    • Heavy-ion re-reco pass launched last week
      • Currently using HLT resources at CERN, working to expand further
      • Quite demanding in memory (3-4 GB) and CPU (96 h jobs); playing with job splitting to mitigate
    • Phase II upgrade MC
    • soon 13 TeV MC digi/reco
  • CVMFS switch at CERN: Monday, April 14th


  • Many thanks to all the sites that have already installed the pledged disk capacities
  • The incremental stripping campaign is almost finished; some remaining files still to be processed at NL-T1, GRIDKA and IN2P3
    • Many thanks to all T1 sites !!!!
  • The future VOMS2 server was added to the VO card
  • PES informed LHCb about possible LSF jobs with low efficiency. An example job was investigated with the information available from LHCb/Dirac, and no inefficiency of the payloads within this LSF job was found. Investigations ongoing...

Ongoing Task Forces and Working Groups

Tracking tools evolution TF

  • New release of GGUS on 26 March
    • New features: multiple site notification, CMS specific SU and forms
    • The initial test of the alarm notification failed; it was repeated on 31 March and then worked
  • 'GGUS Shopping list' tracker migrated from Savannah to JIRA on 1st April
    • All users/tickets imported.
    • One field ('Planned Release', aka 'Fix Version') was not imported properly. Still under investigation

FTS3 Deployment TF

  • New version deployed on pilot, containing among others:
    • https://svnweb.cern.ch/trac/fts3/wiki/ReleaseNotes#Releasecandidate
    • Features requested in TF
      • bandwidth restrictions
      • transfer multi-hop
      • stage-in from tape only
      • activity VO shares
      • bulk cancellations via a file
    • Fixes for issues seen in CMS ASO scale testing
      • delegation issues related to voms roles, server-side timeouts with legacy client
    • Improvements to optimizer plots (increasing time range up to 7 days)
  • CMS proceeding with migration to FTS3

gLExec deployment TF

Machine/Job Features

  • Meeting last week: a detailed plan for bare-metal, cloud, client and bi-directional developments was discussed and agreed within the TF
  • A SAM probe checking the availability and coarse correctness of features has been deployed in pre-production (for LHCb)
  • Deployment of the MJF client has been done in the LCG/AA area
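For context, the Machine/Job Features proposal has each worker node publish one small file per key in directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables, which a client then reads. A minimal sketch of such a reader, using a mocked directory; the key names (`hs06`, `jobslots`) follow the draft specification and may differ per site:

```python
import os
import tempfile

def read_features(directory):
    """Read every key/value file in an MJF directory into a dict."""
    features = {}
    for name in os.listdir(directory):
        with open(os.path.join(directory, name)) as f:
            features[name] = f.read().strip()
    return features

# Mock a machinefeatures directory as a batch node would publish it
# (one file per key; names taken from the MJF draft specification).
mock = tempfile.mkdtemp()
for key, value in {"hs06": "11.2", "jobslots": "8"}.items():
    with open(os.path.join(mock, key), "w") as f:
        f.write(value)

# A real client would take this path from the batch system's environment;
# it is set here only for the demo.
os.environ["MACHINEFEATURES"] = mock
print(read_features(os.environ["MACHINEFEATURES"]))
```

The same reader works unchanged for $JOBFEATURES, which holds per-job limits rather than per-machine capacities.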

Middleware readiness WG

  • The minutes of the 3rd meeting, held on March 18, are now available on Indico (direct link to the minutes here). MariaD will compile the list of confirmed volunteer sites for the April 15 WLCG MB, relying on the experiment twikis linked from the "Documentation pages" column of the Experiment Workflows table, provided they are also used by the Product Teams for testing: as agreed, the tests should stay close to both the MW providers and the VOs. Candidate (but are they confirmed?) volunteer sites so far:
    • ATLAS: Triumf, Edinburgh, QMUL, OSG, CERN, INFN-T1.
    • CMS: Grif, CERN, Legnaro.
    • ALICE: not yet defined. We have an action on Maarten from the 18 March minutes
    • LHCb: we have no site mentioned, although their doc is quite detailed.

Multicore deployment

  • News from experiments concerning multicore:
    • CMS: the multicore project was discussed during the last CMS Computing & Offline week. Work on the multithreaded application is ongoing, so CMS will test its multicore submission pilots with single-core jobs. Tier-1 representatives have been contacted to start these tests immediately, with a two-step plan: functional tests, followed by submission at real scale.
    • ATLAS: ATLAS can already pass arguments to the batch system, but this requires creating a new PanDA queue for each set of parameters one may want. ATLAS is now working on making queue creation dynamic, based on site characteristics. The current problem is that at most CREAM sites the BLAH scripts do not pass the parameters on to the batch system; only SGE and ARC-CE sites have this capability. This needs to be looked into at the WLCG level.
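The parameters in question are CREAM JDL attributes for multicore requests. A sketch of the relevant fragment, assuming the standard CREAM JDL attribute names (`CpuNumber`, `SMPGranularity`); whether these values actually reach the batch system depends on the site's BLAH scripts, which is exactly the gap noted above:

```
// Illustrative CREAM JDL fragment requesting an 8-core slot
// (attribute names as in the CREAM JDL guide; values are examples)
CpuNumber = 8;
SMPGranularity = 8;
```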

  • The review of site/batch-system experiences with multicore jobs continues: Condor at RAL, UGE at KIT and Torque/Maui at Nikhef are done. Next week: SLURM.
  • Next step: review experience in CMS and ATLAS shared sites when handling multicore jobs from both VOs.

perfSONAR deployment TF

  • The deadline for perfSONAR installation (April 1st) has passed; 9 sites out of 111 are still missing:
    • BelGrid-UCL: asked for SLC6 installation, pointed to https://code.google.com/p/perfsonar-ps/wiki/Level1and2Install RPM bundle (unsupported).
    • GR-07-UOI-HEPLAB: no hardware, on hold.
    • GoeGrid: no reply, 4 reminders
    • ICM: "We do not have free resources to deploy perfSonar", ticket closed.
    • MPPMU: procuring hardware
    • RO-11-NIPNE: site under upgrade on 09/01/2014, no news since then (2 reminders)
    • T2_Estonia: under installation/configuration
    • TECHNION-HEP: first reply yesterday (3 reminders).
    • USCMS-FNAL-WC1: installed and configured (for a long time already), but not publishing in OIM
  • Still many services need attention (wrong configuration/old releases/firewalls).
  • Future of TF should be discussed at the planning meeting.

SHA-2 Migration TF

  • The EGI Operations Portal VO cards for the experiments have been updated with the details of the future VOMS servers

WMS decommissioning TF

  • CERN WMS instances for experiments are being drained as of 13:53 CEST on April 1
    • no issue reports were received
  • SAM instances have their own timeline

xrootd deployment TF

IPv6 validation and deployment TF

  • A BeStMan SE and an OSG CE with IPv6 have been kindly set up at Nebraska for testing in CMS. They will soon be added to the transfer tests and the Condor-G submission tests.
  • The PanDA dev instances are being made dual-stack. The next step is for JEDI development (JEDI is a component of the new ATLAS production system for Run 2) to progress further before trying to run jobs.

HTTP proxy discovery TF

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated, current solution to be validated.
      • Some of the T1 sites are adding SRM.nearline entries as desired.
      • Downtime declaration tests to be done.
      • Experiences to be reported in the ticket.
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin


-- NicoloMagini - 31 Mar 2014

Topic revision: r31 - 2014-04-03 - MaartenLitmaath