WLCG Operations Coordination Minutes, May 21st 2015


Agenda

Attendance

  • local:
  • remote:

Operations News

Middleware News

  • Baselines:
    • EMI update today containing StoRM 1.11.8. This version has already been verified by MW Readiness and will be set as baseline as soon as it gets into UMD (end of May)
    • dCache 2.10.28/2.12.8 verified by MW Readiness and set as baseline (fixes for DB leak)
    • as discussed in the previous meeting, torque 2.5.13 has been added to the baselines table

  • MW Issues:
    • NTR

  • T0 and T1 services
    • CERN
      • CASTOR for LHC has been updated to 2.1.15. Small delta releases (2.1.15-8) are being rolled out
      • SRM validation is ongoing. xroot is now the main access protocol (RFIO is obsolete; its possible decommissioning will be discussed at the end of 2015, to take place in 2016 or later)
    • KIT
      • Updated all dCache setups to 2.11.19 last week. Very urgent update due to a leak in the Chimera database
    • IN2P3
      • plan to upgrade dCache to 2.10.30+ on core servers (16/06/2015)
    • RRC-KI-T1
      • dCache upgrade to 2.10.29

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • normal to high activity
  • no major operational issues

ATLAS

  • Very high activity (> 200k running job slots today)
  • Only possible through mix of single and multi-core jobs (see later)
  • Running final stress tests of data transfer this week (inside CERN and outside) before data taking
  • Request T1 sites to avoid major downtimes from now until late summer

CMS

  • Production overview
    • Finished with Upgrade DIGI-RECO (for now)
      • More memory intense production compared to usual workloads
      • Needed to run in multi-core pilots not using all cores
    • Finally started Run2 DIGI-RECO campaign
      • Will keep resources rather busy for next weeks/months
      • Successfully extended to stronger Tier-2 sites

  • Global Xrootd redirector at CERN
    • Is an important component for CMS
    • Increased usage and higher dependency on the service
      • Many users requesting files via Global redirector
      • Production jobs occasionally sent to sites not hosting data
    • CMS requests that the WLCG critical service ratings be raised: impact to 8 (from 5) and urgency to 6 (from 5)
    • It recently took quite a long iteration to get a configuration settled (GGUS:113032)

LHCb

  • LHCb Computing workshop going on, so not much operations follow-up
  • Downtime today, though, due to the Oracle DB upgrade
  • T1
    • Problem contacting the SARA SRM; it seems to be back. They claim that the fetch-crl problem could now be on the CERN side. Upgrade on the FTS servers and our VOBOX planned.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • SAM-Nagios refresh_proxy probe should handle RFC proxies transparently when UMD-3 is used
    • preprod already has UMD-3, prod still has UMD-2
    • we will try this out on ALICE preprod
  • the readiness of sites can then be observed per experiment
    • no failures are expected to be due to the change of proxy type
  • when the sites are checked OK, the central services of each experiment are next
    • plus pilot factories at T1 etc.
  • experiment status
    • ALICE: done
    • ATLAS:
      • check central services
      • ensure all pilot factories run a recent Condor-G with a UMD-3 CREAM client
    • CMS: ditto
    • LHCb: check DIRAC

Machine/Job Features

Middleware Readiness WG


Multicore Deployment


  • ATLAS deployment:
    • Goal: 80% of production resources usable by multicore. This corresponds roughly to 150k slots. The max achieved has been 110k slots. It is not a problem of sites not having a queue as only 9 smaller sites are still missing. It's a matter of tuning the system.
    • ATLAS is looking at increasing the job length to about 10-15h for 8-core slots to improve efficiency, and is also looking at a global fairshare to avoid sending too many low-priority single-core jobs when there are high-priority multicore jobs in the system.
    • Sites, for their part, should look at their setup and try to prioritize multicore over single core. The base TF solutions can be found here. In particular, Torque, HTCondor and SGE all have a solution.
      • Sites that are already dynamically configured but still cap the multicore slots should also revise this number to respect the 80% share. As a reminder, the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s.
  • CMS status:
    • Consolidating deployment to T1s: working together with CNAF and CCIN2P3 to improve the multicore resource allocation results
    • Analysis of multicore pilots internal (in)efficiencies ongoing
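
The ATLAS share arithmetic above (80% of the production share at each tier) can be worked through as a minimal sketch. The function names and the example site size of 1000 slots are hypothetical and used only for illustration; the percentages are the ones quoted in the minutes.

```python
# Shares quoted in the minutes, as fractions of a site's ATLAS allocation.
PRODUCTION_SHARE = {"T1": 0.95, "T2": 0.50}
# Goal from the minutes: 80% of production resources usable by multicore.
MULTICORE_TARGET = 0.80


def multicore_fraction(tier: str) -> float:
    """Fraction of a site's ATLAS allocation that multicore should be able to use."""
    return MULTICORE_TARGET * PRODUCTION_SHARE[tier]


def multicore_slots(tier: str, atlas_slots: int) -> int:
    """Multicore slot target for a hypothetical site with `atlas_slots` ATLAS slots."""
    return round(multicore_fraction(tier) * atlas_slots)


if __name__ == "__main__":
    for tier in ("T1", "T2"):
        # 1000 slots is an illustrative site size, not a figure from the minutes.
        print(f"{tier}: {multicore_fraction(tier):.0%} of ATLAS slots, "
              f"{multicore_slots(tier, 1000)} of 1000")
```

Under these numbers, multicore should be able to use 76% of the ATLAS slots at a T1 and 40% at a T2, which is the figure a site capping its multicore slots would compare its cap against.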

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • No news, main developers haven't been able to give it any time lately because of other priorities

Network and Transfer Metrics WG


  • perfSONAR status
    • Security: New SSL vulnerability dubbed Logjam: https://weakdh.org/sysadmin.html. WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
  • Network performance incidents process - new GGUS SU (WLCG Network Throughput) will become available on 24th of June.
  • Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.

HTTP Deployment TF

The first TF meeting has taken place. The minutes are attached to the agenda, and everything is reachable from the TF home page: https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment

Some first steps were agreed, in particular identification of an appropriate monitoring solution and the compilation of a full list of functionality that the experiments would like to see delivered via HTTP. The next meeting will focus on the latter topic.

Action list

| Description | Responsible | Status | Comments |
| CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red. | C. Wissing | ONGOING | |

AOB

-- MariaALANDESPRADILLO - 2015-05-06

-- MariaALANDESPRADILLO - 2015-05-20

Topic revision: r13 - 2015-05-21 - ChristophWissing