WLCG Operations Coordination Minutes - October 2nd, 2014



Operations News

Middleware News

  • Baselines:
    • News from the EGI URT meeting on Monday:
      • New versions of the UI and WN will hopefully be included in the next UMD release, foreseen for the end of October.
      • The dCache 2.2.x decommissioning deadline is 31-10-2014. The baseline for the 2.6.x series is 2.6.34, which fixes issues with Brazilian CA certificates.
      • Globus 6 is in epel-testing and PTs are invited to test compatibility. We are aware of FTS and DPM being tested; no blocking issues have been discovered so far, so Globus 6 will soon be promoted to stable (the exact date will be discussed at the next URT).
    • ShellShock: being followed up by the EGI Security group. In particular it impacted perfSONAR deployments, with several vulnerabilities discovered. More info from the Network and Transfer Metrics group: https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Announcements

  • MW Issues:
    • The xrootd package deployed with ROOT 6 breaks access to dCache storage, affecting LHCb. The problem is on both the client and the server side; a fix for dCache has been developed but not yet released, so for the moment there will be a workaround fix in the xrootd clients.
    • Installation of several grid products is broken: CREAM, WMS, L&B and WN cannot currently be installed because the classADS package (a dependency of all of them) was declared orphaned in EPEL and retired from all EPEL repositories. For now the package will be included in the UMD and EMI third-party repositories while a maintainer is sought. CESNET should take care of it, but they are not happy with this extra effort.

  • T0 and T1 services
    • IN2P3
      • dCache upgrade to 2.6.34
    • NL-T1
      • dCache upgrade to 2.6.34
    • KIT
      • Update of dCache for CMS and LHCb to 2.6.34
      • Update xrootd configuration for FAX and AAA to respect EU privacy policy Thursday 08:00 - 08:30 UTC.
      • Update for LHCb dCache to next version that fixes issues with ROOT6 not scheduled yet (new dCache release required first).
    • JINR-T1
      • one dCache instance upgrade to 2.6.34
      • one dCache instance running 2.2.27 to be upgraded to 2.6 or 2.10 in early November
    • BNL
      • FTS upgrade to 3.2.27

Oracle Deployment

  • IT-DB is installing new hardware in the CERN computer centre and at Wigner.
  • Timeline: testing in October; move to production by the end of 2014. The schedule will be updated accordingly.
  • The following table includes only those DB services that concern WLCG:

| Database | Comment | Destination | Upgrade plans, dates |
| ATONR | Data Guard for Atlas Online | Wigner | |
| ATLR | Data Guard for Atlas Offline | Wigner | |
| ADCR | Data Guard for ADC | Wigner | |
| CMSR | Data Guard for CMS Offline | Wigner | |
| LHCBR | Data Guard for LHCB Offline | Wigner | |
| LCGR | Data Guard for WLCG | Wigner | |
| CMSONR | Data Guard for CMS Online | Wigner | |
| LHCBONR | Data Guard for LHCB Online | Wigner | |
| CASTORNS | Data Guard for Castor Nameserver | Wigner | |
| ATONR | Active Data Guard for Atlas Online | CERN CC | |
| ALIONR | Active Data Guard for Alice Online | CERN CC | |
| ADCR | Active Data Guard for ADC | CERN CC | |
| LHCBONR | Primary Database for LHCB Online | CERN CC | |

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports


ALICE

  • CERN: investigation of job failure rates and inefficiencies
    • the batch team adjusted the parameters of the ALICE queue - thanks!
    • the effects should become visible over the coming days
    • the pilot ("job agent") logic has been checked and a potential improvement is being looked into
  • CERN: HLT farm running as an ALICE site since Sep 24
    • being ramped up over the coming months


ATLAS

  • DC14
    • Run 2 digi+reco in 8-core mode will finish in about a week
    • some more simulation samples launched
    • further AOD2AOD on reprocessed data will be launched in one week (0.5PB)
  • Multicore recommendations for 8-core reconstruction
    • Allocate 16GB physical memory per job
    • if limiting memory per process: 3GB RSS and/or 5GB VMEM
    • For cgroup-enabled sites: total RSS of the job should be 16GB
  • Deployment of ATLAS MCORE queues
    • more than 70k cores were used this week for multicore jobs exclusively
    • after fixing initial memory issues, the digi+reco processing is progressing very fast, up to 50M events per day - thanks to the sites for the fast action
    • to all the sites, please continue the deployment of multicore queues
    • serial production tasks in the future will be limited
  • SAM3 tests
    • all ARC-CE sites fixed the configuration and the ATLAS_CRITICAL tests are effective since 1.10.
  • Rucio and Prodsys-2 commissioning ongoing, still no fixed date for deployment


CMS

  • Processing overview:
    • Not much work in the system
    • Preparations for new campaign (PHYS14) ongoing
  • Scale testing of HTCondor and GlideinWMS by OSG colleagues
    • Launch many pilots on one acquired job slot to reach high scales
    • Caused some trouble at sites
      • Firewalls not able to handle that many connections
      • Maximum number of NAT connections exhausted
      • Report problems to CMS (via ticket or HyperNews) - CMS will negotiate with the testers
  • Problems with Dashboard reporting
    • Caused by underlying ML infrastructure
    • Dashboard team presently working on improvements
  • AFS-UI at CERN
    • Analysis of AFS access logs
      • ~45 individual analysis users - usage will likely decrease after the closing of lxplus5
      • ~5 users from central production
    • Extending the UI availability beyond Oct 31st is still preferred
      • Perhaps even needed - migration path of some services still to be understood
  • Reminders for sites
    • Participate in space monitoring (compare last meeting)
    • Update xrootd fallback configuration
      • Opened tickets to various sites - quite some took action - thanks!
    • Add "PhEDEx Node Name" to site configuration


Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features

Middleware Readiness WG

Multicore Deployment

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • node firewalls are being opened selectively per experiment
      • ALICE OK
      • LHCb looking good
      • CMS in progress
        • FNAL FTS-3 config needed fixing
      • ATLAS in progress
        • BNL FTS-3 config needed fixing
    • EGI will soon conclude their campaign to get all EGI sites to recognize the new servers for the Ops VO
      • the port for Ops will be opened at that time
    • our special routing rules have been extended until Tue Nov 18 (sic)
      • those rules allow remote sites to get "Connection refused" instead of timeouts
      • by that time we still have 1 week to fix unwanted behavior
      • we should have things in good shape long beforehand...

WMS Decommissioning TF

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • No progress to report again this meeting

Network and Transfer Metrics WG

  • Details on the ShellShock vulnerabilities and their impact on perfSONAR are available at https://twiki.cern.ch/twiki/bin/view/LCG/ShellShockperfSONAR
  • We recommend that ALL sites that did not patch bash before Friday Sep 26 terminate their instances and wait until perfSONAR 3.4 is released
  • perfSONAR 3.4 to be released on Mon Oct 6, WLCG and EGI broadcasts will be sent with the installation instructions
  • perfSONAR operations meeting this Friday (Oct 3 at 3PM), agenda at https://indico.cern.ch/event/342995/

Action list

  1. ONGOING on the WLCG middleware officer: to take the steps needed to enable the CVMFS UI distribution as a viable replacement for the AFS UI.
    • The CVMFS repository grid.cern.ch contains the emi-ui-3.7.3 SL6 UI (path /cvmfs/grid.cern.ch/emi-ui-3.7.3-1_sv6v1) and also provides CA certificates, CRLs and VOMS lsc files. Given the new UI release, we can also plan to upload UI v3.10.0.
    • TODO: clarify the responsibilities (including ticketing etc.) for the maintenance of the CVMFS UI, in particular running fetch-crl
    • UPDATE: Steve said that the grid.cern.ch CVMFS server maintenance is under PES responsibility, as is the fetch-crl update process. In case of issues the Configuration Management SE should be addressed.
  2. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
  3. ONGOING on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs. Agreed by Peter Solagna.
  4. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE and report on any showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for CMS the VO feed will be taken directly from OIM. The plan is to first test HTCondor-CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
    • No showstopper for SAM. Need to discover topology; publishing queues in BDII not necessary for SAM probes since Condor can choose the queue based on the proxy.
  5. ONGOING on Alessandro DG: find out from OSG about plans for publication of HTCondor CE in information system, and report findings to WLCG Ops. To be followed up with Michael Ernst and Brian Bockelman.


-- NicoloMagini - 19 Sep 2014

Topic revision: r11 - 2014-10-02 - MarcinBlaszczyk