WLCG Operations Coordination Minutes - June 5th, 2014

Agenda

Attendance

  • local: Andrea Sciaba (chair), Nicolo Magini (secretary), Maria Alandes Pradillo, Stefan Roiser (LHCb), Maarten Litmaath (ALICE), Giuseppe Bagliesi (CMS), Andrea Manzi (MW officer), Rene Meusel, Alberto Aimar, Felix Lee (ASGC), Marian Babik, Zbyszek Baranowski, Oliver Keeble, Maria Dimou, Steve Traylen
  • remote: Yury Lazin (RRC-KI-T1), Maite Barroso (CERN), Rob Quick (OSG), Alessandra Doria, Alessandro Cavalli (INFN-T1), Di Qing (TRIUMF), Frederique Chollet (IN2P3), Jeremy Coles (GridPP)

News

  • Andrea Sciaba presents the calendar of the upcoming meetings: August 7th is cancelled, see agenda for details.
  • Andrea presents the proposed changes to the WLCG operations meetings, see agenda for details.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Highlights: CVMFS updated; FTS3 added; fix for DPM 1.8.8
  • New section added to track known issues, currently: ARGUS; RFC proxies in HTCondor; SHA512 VOMS host certificates in job submission. See twiki for details.

Tier-0 and Tier-1 Grid services

Storage deployment

For each site: deployed storage versions, then recent and planned changes (listed only where reported).

CERN
  CASTOR: v2.1.14-11 and SRM-2.11-2 for ATLAS, ALICE, CMS and LHCb
  EOS:
    ALICE (EOS 0.3.4 / xrootd 3.3.4)
    ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0)
    CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0)
    LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0 (OSG pre-release))

ASGC
  CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1
  Recent changes: none. Planned changes: none.

BNL
  dCache 2.6.18 (Chimera, Postgres 9.3 with hot backup)
  http (aria2c) and xrootd/Scalla on each pool
  Recent changes: none. Planned changes: none.

CNAF
  StoRM 1.11.3 emi3 (ATLAS, LHCb, CMS)
  xrootd: 3.3.4-1 sl6 (ALICE), 3.3.3-1 slc5 (ATLAS), v20130511 sl5 (CMS), 3.3.4-1 slc5 (LHCb)

FNAL
  dCache 2.2 (Chimera, Postgres 9) for both disk and tape instances; httpd 2.2.3
  Scalla xrootd 3.3.6-1
  EOS 0.3.21-1 / xrootd 3.3.6-1.slc5 with BeStMan 2.3.0.16
  Recent changes: dCache 2.2 upgrade. Planned changes: testing newer EOS versions, will upgrade soon.

IN2P3
  dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes
  Postgres 9.2
  xrootd 3.3.4 (ALICE T1), xrootd 3.3.4 (ALICE T2)

JINR-T1
  dCache: srm-cms.jinr-t1.ru 2.6.25; srm-cms-mss.jinr-t1.ru 2.2.24 with Enstore
  xrootd federation host for CMS: 3.3.6

KISTI
  xrootd v3.3.4 on SL6 (redirector only; servers still on 3.2.6 / SL5, to be upgraded) for disk pools (ALICE T1)
  xrootd 20100510-1509_dbg on SL6 for tape pool
  xrootd v3.2.6 on SL5 for disk pools (ALICE T2)
  DPM 1.8.7-4

KIT
  dCache: atlassrm-fzk.gridka.de 2.6.28; cmssrm-kit.gridka.de 2.6.28; lhcbsrm-kit.gridka.de 2.6.28
  xrootd: alice-tape-se.gridka.de 20100510-1509_dbg; alice-disk-se.gridka.de 3.2.6; ATLAS FAX xrootd redirector 3.3.3-1
  Recent changes: minor dCache upgrade.

NDGF
  dCache 2.8.2 (Chimera) on core servers and pool nodes

NL-T1
  dCache 2.2.17 (Chimera) at SURFsara; DPM 1.8.7-3 at NIKHEF

PIC
  dCache head nodes (Chimera) and doors at 2.2.23-1
  xrootd door to VO servers (3.3.4)
  Recent changes: none. Planned changes: upgrade to dCache 2.6 scheduled for 10th June.

RAL
  CASTOR 2.1.13-9, 2.1.14-5 (tape servers), SRM 2.11-1

RRC-KI-T1
  dCache 2.2.24 + Enstore (ATLAS) tape instance
  dCache 2.2.24 (ATLAS) disk instance
  dCache 2.6.22 (LHCb) disk instance
  xrootd: EOS 0.3.19 (ALICE) disk instance

TRIUMF
  dCache 2.6.21
  Recent changes: none. Planned changes: none.

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1    
CERN 3.2.22    
ASGC 2.2.8 - transfer-fts-3.7.12-1 None None
BNL 2.2.8 - transfer-fts-3.7.10-1 None In the process of setting up FTS3 Production instance
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
FNAL fts-server-3.2.3-5    
IN2P3 2.2.8 - transfer-fts-3.7.12-1    
JINR-T1 2.2.8 - transfer-fts-3.7.12-1    
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1 None None
RAL 2.2.8 - transfer-fts-3.7.12-1    
RAL 3.2.22    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL 1.8.3.1-1 for T1 and US T2s SL6, gLite ORACLE 11gR2 ATLAS None
CERN 1.8.7-4 SLC6, EPEL Oracle 11 ATLAS, OPS, ATLAS Xroot federations being decommissioned
CERN 1.8.7-4 SLC6, EPEL Oracle 12 LHCb  

Oracle deployment

  • Note: only Oracle instances with a direct impact on offline computing activities of LHC experiments are tracked here
  • Note: an explicit entry for specific instances is needed only during upgrades, listing affected services. Otherwise sites may list a single entry.

Site Instances Current Version WLCG services Upgrade plans
CERN CMSR 11.2.0.4 CMS computing services Done on Feb 27th
CERN CASTOR Nameserver 11.2.0.4 CASTOR for LHC experiments Done on Mar 04th
CERN CASTOR Public 11.2.0.4 CASTOR for LHC experiments Done on Mar 06th
CERN CASTOR Alicestg, Atlasstg, Cmsstg, LHCbstg 11.2.0.4 CASTOR for LHC experiments Done: 10-14-25th March
CERN LCGR 11.2.0.4 All other grid services (including e.g. Dashboard, FTS) Done: 18th March
CERN LHCBR 12.1.0.1 LHCb LFC, LHCb Dirac bookkeeping Done: 24th of March
CERN ATLR, ADCR 11.2.0.4 ATLAS conditions, ATLAS computing services Done: April 1st
CERN HR DB 11.2.0.4 VOMRS Done: April 14th
CERN CMSONR_ADG 11.2.0.4 CMS conditions (through Frontier) Done: May 7th
BNL   11.2.0.3 ATLAS LFC, ATLAS conditions TBA: upgrade to 11.2.0.4 (tentatively September)
RAL   11.2.0.3 ATLAS conditions TBA: upgrade to 11.2.0.4 (tentatively September)
IN2P3   11.2.0.4 ATLAS conditions Done: 13th of May
TRIUMF TRAC 11.2.0.4 ATLAS conditions Done

  • Zbyszek reports on the recent workshop on Oracle replication technologies. Streams will be phased out everywhere, replaced by GoldenGate, ADG or custom replication solutions. The timeline is the end of July for online-to-offline replication and September for Tier-0 to Tier-1 replication.

T0 news

  • As previously announced, CERN is ramping down SLC5 resources and will finally disable Grid submission to the remaining SLC5 resources on the 19th of June. The remaining CEs submitting to SLC5 (ce201-ce207) will be put into draining mode on that day, and no new submissions will be allowed.
  • LFC decommissioning for ATLAS: the daemons have been stopped and the data is frozen (users locked and tablespace set to read-only, as discussed and agreed with ATLAS).

Other site news

Data management provider news

An update to DPM's GridFTP server has been released to fix issues encountered with FTS2 transfers. http://dl.fedoraproject.org/pub/epel/6/x86_64/repoview/dpm-dsi.html

Release of davix 0.3.1, major bugfix and functionality release: http://dmc.web.cern.ch/release/davix-031

Migration of the CVMFS servers

  • Steve presents the schedule for the upcoming CVMFS repository migrations to version 2.1, see agenda for details.
  • Experiments are asked to send feedback about the proposed dates.
  • All sites need to upgrade the CVMFS client to version 2.1.19 by August 5th.
    • Fixes a known bug in the 2.1.17 client affecting job efficiency when running against a server upgraded to 2.1
    • The upgrade from 2.1.X is straightforward
    • Andrea Manzi (middleware officer) will follow up on the upgrade progress; the status can easily be checked with the existing SAM probes (a minimal local version check is sketched after this list). He needs the privilege to send mass notifications in GGUS; this will be followed up offline.
    • Stefan comments that the lhcb-condb repository is no longer used and can be retired
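
For concreteness, here is a minimal sketch of how a site could locally compare the installed cvmfs client against the 2.1.19 baseline. This is not the SAM probe mentioned above; the rpm query it relies on assumes a standard RPM-based (SL5/SL6) worker node.

```python
#!/usr/bin/env python
# Minimal sketch: compare the locally installed cvmfs client version
# against the 2.1.19 baseline requested for the server migration.
# Assumes an RPM-based node where the client is packaged as "cvmfs";
# this is NOT the official SAM probe, just an illustrative local check.

import subprocess

BASELINE = (2, 1, 19)

def installed_cvmfs_version():
    """Return the installed cvmfs version as a tuple, or None if not found."""
    try:
        out = subprocess.check_output(
            ["rpm", "-q", "--qf", "%{VERSION}", "cvmfs"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return tuple(int(x) for x in out.decode().strip().split(".")[:3])

def main():
    version = installed_cvmfs_version()
    if version is None:
        print("cvmfs client not installed (or not an RPM-based host)")
    elif version >= BASELINE:
        print("OK: cvmfs %s meets the 2.1.19 baseline" % ".".join(map(str, version)))
    else:
        print("WARNING: cvmfs %s is older than the 2.1.19 baseline" %
              ".".join(map(str, version)))

if __name__ == "__main__":
    main()
```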

Experiments operations review and Plans

ALICE

  • KIT
    • network load due to continued use of old ROOT versions by users
      • a campaign is ongoing to get all users to fix their JDL files
      • jobs with bad versions are reported to the offending users
      • as a mitigation the concurrent jobs cap has been kept low
  • CERN
    • SLC6 job efficiencies:
      • various data analytics and comparison efforts ongoing
      • another new VOBOX has been added to target SLC6 VM hosts only

ATLAS

  • Monte Carlo production and analysis: stable load in the past week
    • Monte Carlo production workload is available until the start of DC14 (now scheduled to start on the 1st of July), but single-core only
    • occasional multi-core validation tasks
  • Rucio full chain testing started 2 weeks ago. Preliminary results were shown this week at the ATLAS SW week; data is currently being injected at Tier-0 at approximately 2% of the maximum file throughput observed in the past year. Files are 1 MB, so in terms of data volume this is negligible (see the rough estimate after this list). Today replication is only to the 10 Tier-1s; in the coming weeks we foresee increasing the number of files created at Tier-0 and including other Tiers in the replica distribution.
  • The CERN LFC has been decommissioned. The daemons have been switched off, but the nodes are still available. (The issue reported and discussed at this meeting 2 weeks ago about an NDGF file not being properly recorded was understood: it was internal to the special NDGF ARC configuration.)
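
To make the "negligible" statement concrete, here is a back-of-the-envelope sketch. The peak file rate used below is a placeholder assumption; the minutes only give the 2% fraction and the 1 MB file size.

```python
# Back-of-the-envelope sketch of the Rucio full-chain injection load.
# The peak file rate is a PLACEHOLDER assumption for illustration; the
# minutes only state "approx 2% of the max file throughput observed in
# the past year" with 1 MB files.

PEAK_FILES_PER_DAY = 1000000     # assumed peak file rate (placeholder)
INJECTION_FRACTION = 0.02        # ~2% of the peak, as stated in the minutes
FILE_SIZE_MB = 1.0               # 1 MB test files, as stated in the minutes

files_per_day = PEAK_FILES_PER_DAY * INJECTION_FRACTION
volume_gb_per_day = files_per_day * FILE_SIZE_MB / 1024.0
throughput_mb_s = files_per_day * FILE_SIZE_MB / 86400.0

print("Injected files/day : %.0f" % files_per_day)
print("Volume             : %.1f GB/day" % volume_gb_per_day)
print("Average throughput : %.2f MB/s" % throughput_mb_s)
# With these assumed numbers: ~20k files/day, ~20 GB/day, ~0.23 MB/s,
# i.e. negligible compared with normal Tier-0 export rates.
```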

CMS

  • Processing overview: Grid resources basically busy
    • 13 TeV MC in 3 scenarios for CSA14 challenge
    • Phase 2 Upgrade MC
    • LHC run 1 legacy MC
  • We will now ramp up scale of Tier-0 tests on AI and HLT clouds
  • ARGUS problem
    • Mainly affects glexec
    • Triggered when a CA has problems publishing its CRLs
    • An ARGUS update exists, but it has apparently not been released yet
    • Important for CMS to have this fixed (glexec is being used!)
  • DPM
    • Fix for FTS2 issues available
    • Released to EPEL
      • Sites should update now
      • or switch to FTS3 (contact the CMS Transfer Team for help on how to configure production transfers in FTS3)

  • Andrea Manzi reminds everyone that the fix for ARGUS is in testing, but there is no release date yet.

LHCb

  • Operations
    • Mainly Monte Carlo and user jobs executed since the last meeting
    • Reprocessing of 2010 data currently in preparation
  • CVMFS switch over to new stratum infrastructure as announced by PES
    • Recommended (i.e. new baseline) client version 2.1.19 installed by 20 sites so far
    • Remaining sites are mostly on 2.1.17 (25 sites), 2.1.15 (16) or 2.1.14 (6); this may be risky
    • A dashboard view is available

Ongoing Task Forces and Working Groups

Tracking tools evolution TF

  • The GGUS releases of June and July will be combined into a single release on the 16th of July.
    • The automatic creation of tickets through mail will be stopped. It will still be possible to update tickets through mail.
    • The GGUS certificate used to sign ALARM email notifications will be renewed with this release. No DN/CA change, normally transparent.

FTS3 Deployment TF

  • A new release is upcoming in July with minor bugfixes and bulk srmBringOnline support; FTS3 is moving to the EPEL repository (a submission sketch using the Python bindings follows this list).
  • Discussion on new feature request: multi-destination transfer with automated rerouting.
  • IPv6 will be enabled on the pilot service on Friday the 6th
  • Experiment status:
    • ATLAS - no issue with FTS3 in recent months. Will start to test bringOnline and activity shares in FTS3 in ~4 weeks.
    • CMS - opened tickets to the sites still using FTS2 in PhEDEx Debug; only ~10 T2s remaining. Migration of PhEDEx Prod transfers is starting. Scale tests with ASO are OK.
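
As an illustration of the bulk srmBringOnline feature and of the bring-online testing mentioned above, the sketch below submits a small staging job with the fts3-rest Python "easy" bindings. The endpoint and SURLs are placeholders, and parameter names such as bring_online and copy_pin_lifetime should be checked against the FTS3 version actually deployed.

```python
# Sketch of submitting a staging (bring-online) job with the fts3-rest
# Python "easy" bindings. The endpoint and SURLs below are placeholders,
# and parameter names such as bring_online / copy_pin_lifetime should be
# verified against the deployed FTS3 release.

import fts3.rest.client.easy as fts3

# FTS3 REST endpoint (placeholder)
context = fts3.Context('https://fts3-pilot.cern.ch:8446')

# Two tape-resident source files and their destinations (placeholders)
transfers = [
    fts3.new_transfer('srm://tape-se.example.org/path/file1',
                      'srm://disk-se.example.org/path/file1'),
    fts3.new_transfer('srm://tape-se.example.org/path/file2',
                      'srm://disk-se.example.org/path/file2'),
]

# Group the files into one job so FTS3 can issue a bulk srmBringOnline
# request to the source SE before copying.
job = fts3.new_job(transfers,
                   bring_online=3600,        # seconds to wait for staging
                   copy_pin_lifetime=3600,   # pin lifetime on the source
                   verify_checksum=False)

job_id = fts3.submit(context, job)
print('Submitted job %s' % job_id)
```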

gLExec deployment TF

  • 85 tickets closed and verified, 10 still open (-2)
  • Deployment tracking page
  • other activities on hold until resolution of Argus stability issues

Machine/Job Features

  • "batch system implementations"
    • SGE deployed on two sites (GRIDKA, Imperial)
    • SLURM first implementation currently being developed
    • LSF, deployed at CERN, second site to be tested is CNAF - currently debugging some wrongly reported numbers
    • CONDOR, implementation done, to be deployed on first site, i.e. UCSD
    • TORQUE, implementation done, to be deployed on first site, i.e. NIKHEF
  • cloud implementation, little progress since last meeting and no conclusion yet on a final implementation
  • Project Plan uploaded

  • Stefan gives details about the TF plan: after each batch system implementation is validated at a minimum of 2 sites, the TF will contact the remaining sites asking for deployment by the deadline (around the end of the year). A sketch of how a job reads the published values is given after this list.
  • A SAM nagios probe is available to test the installations.
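
For context, a job discovers the published values through directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables, one key per file. The sketch below is a minimal reader along those lines; the key names used (hs06, jobslots, wall_limit_secs) are examples taken from the draft specification and may differ between batch system implementations.

```python
# Minimal sketch of how a job can read Machine/Job Features values.
# Mechanism: directories pointed to by $MACHINEFEATURES and $JOBFEATURES,
# one key per file, the value being the file content. The key names below
# are examples and may differ per implementation.

import os

def read_feature(env_var, key):
    """Return the value of one machine/job feature, or None if unavailable."""
    directory = os.environ.get(env_var)
    if not directory:
        return None
    path = os.path.join(directory, key)
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return None

if __name__ == '__main__':
    print('HS06 per job slot : %s' % read_feature('MACHINEFEATURES', 'hs06'))
    print('Job slots on host : %s' % read_feature('MACHINEFEATURES', 'jobslots'))
    print('Wall limit (secs) : %s' % read_feature('JOBFEATURES', 'wall_limit_secs'))
```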

Middleware readiness WG

  • Good progress in the "WLCG Package Reporter" development. (Lionel)
  • Important steps taken in establishing the MW Readiness process, using DPM as the 'pilot' product, and in populating the CVMFS grid.cern.ch area. (MW Officer, AndreaM)
  • Created the WG's task overview table for this meeting. (MariaD with input from the above)
  • Added more MW products to the PT table of the WG twiki. (MariaD with input from the product owners)
  • Preparing content for the 5th WG meeting on July 2nd at 4pm CEST. (MariaD with input from the WG)
  • Planning a discussion with OSG and experts in the WG on the recent Condor issues. (Maarten & MariaD)

  • For Condor, Rob Quick is going to contact Tim Cartwright.
  • Maarten explains that the long-term plan is to understand with ATLAS and CMS how to validate Condor as a common tool; Rob Quick from OSG is also involved.
  • In the short term there are two issues to solve:
    • Issue with RFC proxies due to use of an old version of the CREAM client.
    • Higher priority: the CREAM GAHP component used by Condor to submit to CREAM has been crashing repeatedly since last week, both on pilot factories and in SAM pre-prod. For unknown reasons it affects only ATLAS, not CMS.

Multicore deployment

  • NTR

SHA-2 Migration TF

  • UK CA switched to SHA-2 on May 28; no issues so far
  • introduction of the new VOMS servers
    • the fix for blocking issue GGUS:104768 was announced by EMI on May 27
      • bouncycastle-mail-1.46-2
      • affected node types: ARGUS, CREAM, UI, WN
      • the experiments will have to update their affected UI instances
    • we now need the fix to become available in UMD
      • a release is expected any day now
    • all sites need to update their affected hosts
    • we will define a new timeline and send a broadcast accordingly
  • RFC proxies (a quick proxy-type check is sketched after this list)
    • CREAM client in Condor dev versions >= 8.1.5 supports RFC proxies
      • but it currently is an EMI-2 client, i.e. unsupported
      • and a quick test by ATLAS failed (GGUS:105188)
      • to be followed up further
    • CMS tried RFC proxies on their SAM preprod instance
      • all tests failed, because the probe that refreshes the proxy only supports legacy proxies
        • easy to fix
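
Since several of the items above depend on whether a service sees an RFC or a legacy proxy, here is a small sketch that queries voms-proxy-info for the proxy type. The "-type" option and the exact wording of its output are assumptions about typical voms-proxy-info versions and should be verified locally.

```python
# Small sketch: determine whether the current grid proxy is an RFC 3820
# proxy or a legacy Globus proxy by calling voms-proxy-info. The "-type"
# option and the exact output strings are assumptions about typical
# voms-proxy-info versions.

import subprocess

def proxy_type():
    """Return 'rfc', 'legacy' or 'unknown' for the current user proxy."""
    try:
        out = subprocess.check_output(['voms-proxy-info', '-type'])
    except (OSError, subprocess.CalledProcessError):
        return 'unknown'
    text = out.decode().lower()
    if 'rfc' in text:
        return 'rfc'
    if 'legacy' in text or 'globus' in text:
        return 'legacy'
    return 'unknown'

if __name__ == '__main__':
    print('Current proxy type: %s' % proxy_type())
    # e.g. a SAM probe could warn if it is asked to run with a legacy proxy
    # against a service that only accepts RFC proxies.
```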

WMS decommissioning TF

  • NTR

IPv6 validation and deployment TF

  • NTR

HTTP proxy discovery TF

  • NTR

Network and transfer metrics WG

Action list

  1. CLOSED on the middleware readiness working group: include Condor in the readiness verification; report at the next meeting about the status of the issue with RFC proxies. Done.
  2. ONGOING on the WLCG monitoring team: status of the CondorG probes for SAM to be able to decommission SAM WMS
    • SAM release with Condor job submission probes was deployed in preproduction for both ATLAS and CMS on Monday 2nd of June.
    • Validation has started and work is ongoing to fix several issues already identified, as well as to follow up on a particular issue with Condor (core dumps from the Condor GAHP for ATLAS).
    • Marian says that he should be able to estimate the timeline for deployment to production at the next meeting. Maarten comments that September would be better than August.
  3. NEW on the middleware officer: report about progress in CVMFS 2.1.19 client deployment

AOB

  • Jeremy Coles informed us about the UK HEP system administration meeting held this week - thanks! The agenda linked below contains many talks of general interest:

-- NicoloMagini - 03 Jun 2014
