WLCG Operations Coordination Minutes, January 26th 2017

  • local: Maarten (chairperson), Alberto Aimar, Julia Andreeva, Marian Babik, Jérôme Belleman, Marcelo Soares, Andrea Manzi, Alexander Kryukov, Mark Slater, Andrew McNab
  • remote: John Gordon, Alessandra Doria, Catherine Biscarat, Stephan Lammel, Di Qing, Dave Mason, Kyle Gross, Thomas Hartmann, Ulf Tigerstedt, Frédérique Chollet, Alessandra Forti, Christoph Wissing, Massimo Sgaravatto, Oliver Keeble, Renaud Vernet, Antonio Perez-Calero Yzquierdo, Carlos Acosta, Hung-te Lee, Robert Ball, Pepe Flix, Elena Korolkova.
  • Apologies: Nurcan Ozturk

Operations News

  • Next WLCG Ops Coord meeting will be on March 2nd

Middleware News

  • T0 and T1 services
    • CERN
      • check T0 report
      • FTS upgrade to v3.5.8 and CentOS 7.3 planned for next week
    • FNAL
      • FTS upgrade to v. 3.5.7
    • IN2P3
      • dCache upgrade 2.13.32 -> 2.13.49 in Dec 2016
    • JINR
      • dCache minor upgrade 2.13.49 -> 2.13.51, xrootd minor upgrade 4.4.0 -> 4.5.0-2.osg33
    • NDGF-T1
      • Upgrade to dCache 3.0.5 next week, to fix a rare communication bug
    • NL-T1
      • SURFsara upgraded dCache from 2.13.29 to 2.13.49 on Dec 1
    • RAL
      • Castor upgrade to v 2.1.15-20 ongoing; tape servers upgraded to v 2.1.16
      • Almost completed the migration of LHCb data to T10KD drives.
      • An update of the SRMs to version 2.1.16-10 has been planned
    • RRC-KI-T1
      • Upgraded dCache for tape instance to v 2.16.18-1

Tier 0 News

  • Apex 5.1, released in December, has been installed on the development instance; the upgrade of the production environments is scheduled for February.
  • The main compute services for the Tier-0 are now provided by 31k cores in HTCondor, 70k cores in the LSF main instance, 12k cores in the dedicated ATLAS Tier-0 instance, and 21.5k cores in the CMST0 compute cloud. Some CC7-based capacity is available in HTCondor for user testing. We are in contact with the major user groups to offer consultancy on migrating to HTCondor.
  • 2016 LHC data taking ended with the p-Pb run; about 5 PB have been collected in December. Since then there has been a lot of consolidation activity.
  • The p-Pb run for ALICE was performed using a 1.5 PB Ceph-based staging area, without incident or slow-down. This configuration, offering increased flexibility and easier operation, is now considered production-ready for CASTOR.
  • Mostly transparently to users, EOS and CASTOR instances have been rebooted. On 23 January the failover of an EOS ATLAS headnode failed, causing service disruption; the root cause is being investigated. Some EOS CMS instability requiring a headnode to be rebooted has been due to heavy user activity.
  • New storage hardware will be installed in February as soon as it becomes available to the service.
  • The FTS service has been upgraded to v3.5.8, on a refreshed VM image based on CentOS 7.3 that fixes vulnerabilities.
  • In January the mark of one billion files in EOS at CERN was crossed.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports


ALICE

  • High to very high activity since the end of the proton-ion run
    • Also during the end-of-year break
    • In particular for Quark Matter 2017, Feb 5-11
    • Thanks to the sites for their performance and support!
  • On Wed Jan 4 the AliEn central services suffered a scheduled power cut
    • Normally such cuts are short and do not present a problem
    • This time the UPS was exhausted many hours before the power was restored
    • All ALICE grid activity started draining away
    • Fall-out from the incident took ~2 days to resolve
  • A recent MC production is using very high amounts of memory
    • Large numbers of jobs ended up killed by the batch system at various sites
    • Experts are looking into reducing the memory footprint further


ATLAS

  • Very smooth production operations during the winter break; MC simulation, derivations and user analysis have been running at full speed since then. Derivations use up to 100k slots to finish the full processing of data15+data16 and their MC samples. Very high user analysis activity for Moriond and the spring conferences.
  • The ATLAS Sites Jamboree took place at CERN on January 18-20, with good feedback and discussions from the sites. Sites showed interest in virtualization (Docker containers and Singularity); a dedicated meeting is being planned before the Software & Computing week in March.
  • A tape staging test was run at 3 Tier-1s two weeks ago, to understand the performance in case running derivations from tape input becomes a possibility. The results will be presented today.
  • Global shares are being implemented to better manage the required resources among the different workflows (MC, derivations, reprocessing, HLT processing, Upgrade processing, analysis). More intensive productions will run in the future. Sites will be asked to get rid of the fair shares set at the site level. This is currently under review, and some sites that only ran evgensimul (or part of it) in the past will be affected.


CMS

  • Good, steady progress on Monte Carlo production for Moriond 2017
    • the tape backlog at sites has been worked down
    • JINR is looking into its tape setup to improve performance
  • PhEDEx agents were updated in time at all but two sites, ready for the switch to FTS3 and the new API
  • Dedicated high-performance VMs for GlideinWMS are being evaluated using the Global Pool scalability test
  • Slots were overfilled at sites due to an HTCondor bug (triggered by nodes with high network utilization/errors); the fix will be included in v8.6, to be released in early February
  • CentOS 7 plans: no general upgrade planned; sites are free to upgrade and provide SL6 via containers, etc.; CMS software is also released for CentOS 7; physics validation is expected soon; discussions and tests are ongoing to move the HLT farm to CentOS 7
  • Pilots sent to Tier-1 sites have been consolidated: one pilot with role "pilot" instead of two pilots with roles "pilot" and "production"


Ongoing Task Forces and Working Groups

Accounting TF

Information System Evolution

IPv6 Validation and Deployment TF

Machine/Job Features TF

  • Further updates to DB12-at-boot in the mjf-scripts distribution; see the discussion at the HEPiX benchmarking WG meeting. These values are made available as $MACHINEFEATURES/db12 and $JOBFEATURES/db12_job.
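The bullet above can be made concrete with a small POSIX-shell sketch. The `read_mjf` helper name is ours, and it assumes $MACHINEFEATURES and $JOBFEATURES point at local directories (on sites that publish them as HTTP URLs, `cat` would be replaced by `curl`):

```shell
# read_mjf DIR KEY: print the value of a Machine/Job Features key, if readable.
# The helper name is illustrative; MJF itself only defines the key files.
read_mjf() {
    dir=$1
    key=$2
    [ -n "$dir" ] && [ -r "$dir/$key" ] && cat "$dir/$key"
}

# On a worker node, a job could read the DB12 benchmark values like this:
machine_db12=$(read_mjf "$MACHINEFEATURES" db12)    # whole-machine DB12 score
job_db12=$(read_mjf "$JOBFEATURES" db12_job)        # DB12 share for this job's slot
```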


MW Readiness WG

Network and Transfer Metrics WG

Squid Monitoring and HTTP Proxy Discovery TFs

  • CERN now has the first site-specific http://grid-wpad/wpad.dat service, and CMS is using it for its jobs
    • hosted on the same 4 physical 10 Gbit/s servers that CMS uses for its squid service
    • supports both IPv4 and IPv6 connections, in order to determine whether the squids in Wigner or Meyrin should be used first
    • for CMS destinations (Frontier or cvmfs) it directs to the CMS squids; otherwise it defaults to the IT squids
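As a rough illustration, a WPAD/PAC file like the one above can be fetched and inspected with standard tools; the `fetch_wpad` helper below is ours, and http://grid-wpad/wpad.dat only resolves inside the site that publishes it:

```shell
# fetch_wpad URL: download a WPAD/PAC file and list the PROXY entries that
# its FindProxyForURL() function can return.
fetch_wpad() {
    curl -s "$1" | grep -o 'PROXY [^";]*'
}

# Inside CERN one would run, e.g.:
#   fetch_wpad http://grid-wpad/wpad.dat
```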

Traceability and Isolation WG

  • A tool, Singularity, was identified as a possible solution for isolation (without traceability):
    • the WG is now evaluating the tool: a small test cluster is being built
    • a security review of this tool is needed (the transition plan might require SUID)
  • Next meeting: Wednesday 1 Feb 2017 (https://indico.cern.ch/event/604836/)

Theme: Downtimes proposal followup

Theme: Tape usage performance analysis

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01.09.2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF will stay on SL6 for now, but plan to go directly to EL7 in early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments that can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. |
| 03.11.2016 | Review the VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Searching for the document location and editor. Is it in the EGI wiki? |
| 03.11.2016 | Discuss internally how to follow up the long-term strategy on experiments' data management, as raised by ATLAS | WLCG Operations | Pending | |
| 03.11.2016 | Check the status, action items and reporting channels of the Data Management Working Group | WLCG Operations | Pending | Julia gives an update on behalf of Oliver |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify the HTCondor CE type name in the experiments' VO feeds | all | - | Proposal to use HTCONDOR-CE. In progress for ALICE. Raja will ask about the status for LHCb. | | Ongoing |
| 03.11.2016 | Proposal for advance warning of long site downtimes | All | - | Dec GDB proposal | Dec 12, 2016 | DONE |
| 01.12.2016 | Open tickets to sites for moving to the FTS3 client | | | There are PhEDEx prerequisites | Year End 2016 | January 2017 |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 01.12.2016 | Proposal for advance warning of long site downtimes | All | - | Please give feedback on the Dec GDB proposal | 20 January 2017 | In progress |


Topic revision: r11 - 2017-01-26 - VincentBrillault