WLCG Operations Coordination Minutes - May 22nd, 2014

Agenda

Attendance

  • local: Maria Alandes (chair), Nicolo Magini (secretary), Alberto Aimar, Simone Campana (ATLAS), Zbigniew Baranowski, Marian Babik, Oliver Keeble, Maarten Litmaath (ALICE), Hassen Riahi
  • remote: Christoph Wissing (CMS), Di Qing (TRIUMF), Jeremy Coles, Kyle Gross (OSG), Maite Barroso (Tier-0), Massimo Sgaravatto, Thomas Hartmann (KIT), Burt Holzmann (FNAL)

News

  • 2014 WLCG Workshop in Barcelona (7-9 July):
    • The WLCG workshop agenda is now available in Indico. We will start contacting the speakers to define the contents and details of each talk, and we will also contact the experiments to start thinking about the experiment session. We would like to see part of each experiment session dedicated to the long term future and hear about computing model evolution for Run3/Run4.
    • Please register for the workshop if you are planning to come! The registration deadline is the 9th of June.

  • Simone comments that the ATLAS presentation for computing model evolution for Run3/Run4 would be more about questions than solutions.
  • Alberto proposes a standard skeleton for experiment presentations, unlike last year when each experiment used a different format.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Introducing Andrea Manzi, who will take over the maintenance of the baseline versions as WLCG Middleware Officer.
  • The latest DPM 1.8.8 has issues with FTS2 transfers (currently affecting only some CMS sites); more details in the Data Management section.

Tier-0 and Tier-1 Grid services

Storage deployment

Per-site storage status (recent and planned changes noted where reported):

  • CERN
    • CASTOR: v2.1.14-11 and SRM-2.11-2 for ATLAS, ALICE, CMS and LHCb
    • EOS:
      • ALICE: EOS 0.3.4 / xrootd 3.3.4
      • ATLAS: EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0
      • CMS: EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0
      • LHCb: EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0 (OSG pre-release)
  • ASGC
    • CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1
    • Recent changes: none. Planned changes: none.
  • BNL
    • dCache 2.6.18 (Chimera, Postgres 9.3 with hot backup); http (aria2c) and xrootd/Scalla on each pool
    • Recent changes: none. Planned changes: none.
  • CNAF
    • StoRM 1.11.3 EMI-3 (ATLAS, LHCb, CMS)
  • FNAL
    • dCache 2.2 (Chimera, Postgres 9) for both disk and tape instances; httpd 2.2.3
    • Scalla xrootd 3.3.6-1
    • EOS 0.3.21-1 / xrootd 3.3.6-1.slc5 with BeStMan 2.3.0.16
    • dCache tape instance upgrade to Chimera/2.2 (see comment below)
  • IN2P3
    • dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes; Postgres 9.2
    • xrootd 3.3.4 (ALICE T1), xrootd 3.3.4 (ALICE T2)
    • FAX federation enabled; CMS (T1+T2) federation enabled
    • Planned changes: perhaps a dCache upgrade on some Solaris servers holding staging pools
  • JINR-T1
    • dCache: srm-cms.jinr-t1.ru 2.6.25; srm-cms-mss.jinr-t1.ru 2.2.24 with Enstore
    • xrootd federation host for CMS: 3.3.6
    • Minor dCache upgrade
  • KISTI
    • xrootd v3.3.4 on SL6 (redirector only; servers are still 3.2.6 on SL5, to be upgraded) for disk pools (ALICE T1)
    • xrootd 20100510-1509_dbg on SL6 for the tape pool
    • xrootd v3.2.6 on SL5 for disk pools (ALICE T2)
    • DPM 1.8.7-4
  • KIT
    • dCache: atlassrm-fzk.gridka.de 2.6.21-1; cmssrm-kit.gridka.de 2.6.17-1; lhcbsrm-kit.gridka.de 2.6.17-1
    • xrootd: alice-tape-se.gridka.de 20100510-1509_dbg; alice-disk-se.gridka.de 3.2.6; ATLAS FAX xrootd redirector 3.3.3-1
    • Planned changes: downtime on 26th/27th May to update to the latest available dCache release in the 2.6 branch, probably 2.6.28
  • NDGF
    • dCache 2.8.2 (Chimera) on core servers and pool nodes
  • NL-T1
    • dCache 2.2.17 (Chimera) (SURFsara); DPM 1.8.7-3 (NIKHEF)
  • PIC
    • dCache head nodes (Chimera) and doors at 2.2.23-1
    • xrootd door to VO servers (3.3.4)
    • Recent changes: xrootd proxy deployed for ATLAS, being deployed for CMS
    • Planned changes: dCache 2.6 before the summer, most likely 2.10 after the summer
  • RAL
    • CASTOR 2.1.13-9; 2.1.14-5 (tape servers); SRM 2.11-1
    • Planned changes: upgrade to CASTOR 2.1.14; nameserver planned for 10th June (date to be confirmed), stagers over the following 2 to 3 weeks
  • RRC-KI-T1
    • dCache 2.2.24 + Enstore (ATLAS); dCache 2.6.22 (LHCb)
    • xrootd: EOS 0.3.19 (ALICE)
  • TRIUMF
    • dCache 2.6.21
    • Recent changes: none. Planned changes: none.

  • Burt comments that the upgrade of the FNAL tape instance is essentially complete; tests are OK and the instance will be opened up later today.
  • Nicolo asks if the compatibility issues between dCache 2.6 and Enstore, which were previously preventing the upgrade at PIC, have been solved. To be followed up offline, since PIC is not connected.
    • After the meeting, PIC confirmed that they have been testing dCache 2.6 with the new Enstore software and everything is working as expected. The upgrade to dCache 2.6 is tentatively scheduled for June 10th; the site downtime will be announced next week.

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1    
CERN 3.2.22    
ASGC 2.2.8 - transfer-fts-3.7.12-1 None None
BNL 2.2.8 - transfer-fts-3.7.10-1 None None
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
FNAL fts-server-3.2.3-5    
IN2P3 2.2.8 - transfer-fts-3.7.12-1    
JINR-T1 2.2.8 - transfer-fts-3.7.12-1    
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1 None Deprecation by August 2014
RAL 2.2.8 - transfer-fts-3.7.12-1    
RAL 3.2.22    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

  • Now tracking FTS3 servers at RAL and CERN, which are running the latest release candidate.

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL 1.8.3.1-1 for T1 and US T2s SL6, gLite ORACLE 11gR2 ATLAS None
CERN 1.8.7-4 SLC6, EPEL Oracle 11 ATLAS, OPS, ATLAS Xroot federations  
CERN 1.8.7-4 SLC6, EPEL Oracle 12 LHCb  

Oracle deployment

  • Note: only Oracle instances with a direct impact on offline computing activities of LHC experiments are tracked here
  • Note: an explicit entry for specific instances is needed only during upgrades, listing affected services. Otherwise sites may list a single entry.

Site Instances Current Version WLCG services Upgrade plans
CERN CMSR 11.2.0.4 CMS computing services Done on Feb 27th
CERN CASTOR Nameserver 11.2.0.4 CASTOR for LHC experiments Done on Mar 04th
CERN CASTOR Public 11.2.0.4 CASTOR for LHC experiments Done on Mar 06th
CERN CASTOR Alicestg, Atlasstg, Cmsstg, LHCbstg 11.2.0.4 CASTOR for LHC experiments Done: 10-14-25th March
CERN LCGR 11.2.0.4 All other grid services (including e.g. Dashboard, FTS) Done: 18th March
CERN LHCBR 12.1.0.1 LHCb LFC, LHCb Dirac bookkeeping Done: 24th of March
CERN ATLR, ADCR 11.2.0.4 ATLAS conditions, ATLAS computing services Done: April 1st
CERN HR DB 11.2.0.4 VOMRS Done: April 14th
CERN CMSONR_ADG 11.2.0.4 CMS conditions (through Frontier) Done: May 7th
BNL   11.2.0.3 ATLAS LFC, ATLAS conditions TBA: upgrade to 11.2.0.4 (tentatively September)
RAL   11.2.0.3 ATLAS conditions TBA: upgrade to 11.2.0.4 (tentatively September)
IN2P3   11.2.0.4 ATLAS conditions Done: 13th of May
TRIUMF TRAC 11.2.0.4 ATLAS conditions Done

T0 news

  • Quattor phase-out: CERN is currently migrating all centrally managed services from Quattor to a new Puppet-based configuration management system. This migration is meant to be fully finished by 31st October 2014. On that date, all components of the Quattor infrastructure (CDB, CDBSQL, CDBWeb, SINDES, SWREP, SMS, LEAF tools and CLUMAN) will stop working.
  • SHA-2 certificates have been automatically added for all users in the 4 LHC VOs.
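  • As an aside, a minimal sketch (an illustration only, not part of any official procedure) of how one can check whether a given certificate is signed with a SHA-2 family digest, using the Python cryptography package; the file path is a hypothetical example:

      # Minimal sketch: report whether an X.509 certificate uses a SHA-2 family
      # signature digest (sha224/sha256/sha384/sha512) or a legacy one (e.g. sha1).
      # Assumes the 'cryptography' package is installed; the path is an example.
      from cryptography import x509
      from cryptography.hazmat.backends import default_backend

      CERT_PATH = "usercert.pem"  # hypothetical example path

      with open(CERT_PATH, "rb") as f:
          cert = x509.load_pem_x509_certificate(f.read(), default_backend())

      algo = cert.signature_hash_algorithm.name  # e.g. 'sha1', 'sha256'
      if algo in ("sha224", "sha256", "sha384", "sha512"):
          print("%s: SHA-2 family signature (%s)" % (CERT_PATH, algo))
      else:
          print("%s: not SHA-2 (%s)" % (CERT_PATH, algo))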

Other site news

Data management provider news

  • DPM 1.8.8 was released last week, including a new GridFTP component that has issues with the GridFTP implementation in FTS2 (currently still used only by CMS); FTS3 transfers are OK. Since CMS is in any case pushing for the switch to FTS3, the affected sites that have already upgraded are asked to change their PhEDEx configuration to use FTS3. Sites that have not yet upgraded are encouraged to wait for the DPM fix, which is in testing and will probably be released next week.
  • Nicolo will inform the CMS Tier-2s through the appropriate mailing list.

Experiments operations review and Plans

ALICE

  • activities for Quark Matter 2014 have ramped down
  • high production activity has taken over
  • CERN
    • SLC6 job efficiencies:
      • various data analytics and comparison efforts ongoing
      • a new VOBOX has been set up to target physical SLC6 hosts only

ATLAS

  • MC production and analysis: stable load in the past week
    • MC prod workload is available until the start of DC14 (mid/end June), but single-core only
    • occasional multi-core validation tasks
  • Rucio full-chain testing is starting now, ramping up over the next 4-6 weeks to stress-test the Rucio components. Proper monitoring is still missing; more news in one week at the ADC weekly. This will not impact normal ATLAS activities in terms of data transfers, and therefore neither the other experiments nor the sites.
  • Migration of sites from LFC to Rucio: all clouds migrated except the US, which is ongoing. The CERN LFC is not used anymore. We will discuss with CERN-IT in the coming week about snapshotting the DB and closing the frontends.
  • Issue with RFC proxy support in Condor CREAM submission (GGUS:105188).

  • About the catalog migration, Maarten suggests keeping the LFC running in parallel during the migration, to allow comparisons and to debug issues recently seen at NDGF with files that were never written to the storage but were present in the ATLAS catalogs. Simone comments that this can be done for the next weeks, but in the meantime the decommissioning campaign can be planned.
  • Condor is using a very old version of the CREAM client; the issue can be fixed by repackaging with a recent version, and the fix is currently being tested in the development trunk. Simone and Maarten suggest following Condor in the Middleware Readiness Working Group, now that it is used as common middleware by two experiments. Maria opens an ACTION item on the Middleware Readiness Working Group to include Condor and report on the status at the next meeting.

CMS

  • High-priority production and processing
    • Heavy Ion MC (almost done now)
    • Upgrade MC
    • CSA14 preparation (13 TeV MC)
  • The CMS SAM test for glexec was made critical on May 19th
  • SAM test for xrootd fallback
    • Not yet critical
    • Still waiting (mainly) for RAL to fix some issues
  • FTS3 for PhEDEx Debug transfers is becoming mandatory now
    • Have sent tickets to sites
  • Need to deprecate the Savannah-to-GGUS bridge in the GGUS May release (May 26th)
    • It relies on the old CMS SiteDB API, which will be decommissioned on June 3rd
    • Changing CMS shifter instructions to use GGUS directly
    • Operations effort is moving to GGUS anyway

LHCb

  • 5 GGUS tickets in total (3 x pilot problems, 1 x glitch in CVMFS, 1 x "Brazilian proxies")
  • Mostly Monte Carlo production and user analysis during the last 2 weeks
  • The problem with Brazilian certificates not being able to access dCache storage was understood and will be fixed by the product team

Ongoing Task Forces and Working Groups

Tracking tools evolution TF

  • GGUS release on May 26th. The alarms for the UK and USA will be done on the 27th; the rest on the 26th.
  • The Savannah-GGUS bridge for CMS will be decommissioned in this release.

FTS3 Deployment TF

  • CMS opening GGUS tickets to sites to complete migration to FTS3 in PhEDEx Debug.
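  • For illustration, a minimal sketch (not an official CMS or FTS procedure) of how a site could check that it can reach an FTS3 server through its REST interface with a grid proxy; the endpoint, the REST port 8446 and the /whoami resource reflect the usual FTS3 REST setup, and the CA path is an assumption about the local installation:

      # Minimal sketch: contact an FTS3 server's REST interface with a grid proxy.
      import os
      import requests

      proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
      endpoint = "https://fts3.cern.ch:8446"  # example FTS3 server

      # The proxy is used as both client certificate and key; the CA directory
      # is the usual grid trust store (adjust if the local setup differs).
      resp = requests.get(endpoint + "/whoami",
                          cert=(proxy, proxy),
                          verify="/etc/grid-security/certificates")
      resp.raise_for_status()
      print(resp.json())  # identity and delegation info as seen by the server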

gLExec deployment TF

  • No follow-up yet with ATLAS and LHCb on the adoption of glExec.
  • On a related topic, Maarten comments that there have been many tickets for ARGUS instabilities recently, but the experts are actively following up and hopefully a solution will be found soon.

Machine/Job Features

  • NTR

Middleware readiness WG

  • 4th meeting took place on May 15
    • agenda
    • minutes will be announced
  • next meeting on Wed July 2, 16:00-17:30 CEST

  • Maarten comments that an internally developed solution will be used to track middleware versions instead of Pakiti, since the latter would have needed to be extended anyway. A prototype is expected to be ready by the next meeting, to be deployed initially at the sites involved in middleware validation.
  • Marian asks if there is a plan to extend the monitoring to the entire infrastructure. Maarten answers that this has been a long-time goal; the monitoring will be demonstrated on the validation infrastructure and then proposed for wider adoption e.g. at the MB.

Multicore deployment

  • NTR

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • blocking issue - job submission to CREAM fails when the proxy was signed by a VOMS server with a SHA512 host certificate (GGUS:104768)
    • the fix has been put into the EMI-3 third-party repo on May 16
      • bouncycastle-mail-1.46-2
    • no official announcement yet about which node types are affected
      • CREAM and/or UI
      • others?
    • we also need the fix to become available in UMD
    • all sites then need to update their affected hosts
    • we will define a new timeline accordingly
  • RFC proxies
    • ATLAS discovered that Condor still uses an old CREAM client that does not support RFC proxies
      • their pilot factories thus need to keep using legacy proxies for the time being
      • this matter will be followed up with the Condor devs
    • CMS intended to switch their SAM preprod instance to RFC proxies
      • that should still work now, but would fail when WMS submission is replaced by Condor-G submission

  • Maria asks about the criticality of the issue with job submission to CREAM. Maarten comments that it is top priority because it blocks the introduction of the new VOMS servers; however, users are currently not affected because the new VOMS servers are not yet in production.
  • About the RFC proxy issue, Nicolo comments that he is not aware of such issues in the CMS CRAB3 tests, which also use RFC proxies with Condor; this is possibly related to a different usage of Condor by the two VOs (a sketch of the difference between the two proxy flavours is given below).
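  • For reference, a minimal sketch (an illustration only, not a tool used by the experiments) of how the two proxy flavours can be told apart: RFC 3820 proxies carry the ProxyCertInfo extension (OID 1.3.6.1.5.5.7.1.14), while legacy Globus proxies end their subject with "CN=proxy" or "CN=limited proxy". Written with the Python cryptography package; the proxy path is the conventional default:

      # Minimal sketch: distinguish an RFC 3820 proxy from a legacy Globus proxy.
      import os
      from cryptography import x509
      from cryptography.x509.oid import NameOID
      from cryptography.hazmat.backends import default_backend

      PROXY_CERT_INFO_OID = x509.ObjectIdentifier("1.3.6.1.5.5.7.1.14")

      path = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
      with open(path, "rb") as f:
          # The proxy file starts with the proxy certificate itself.
          cert = x509.load_pem_x509_certificate(f.read(), default_backend())

      cn = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[-1].value
      try:
          cert.extensions.get_extension_for_oid(PROXY_CERT_INFO_OID)
          print("RFC 3820 proxy (ProxyCertInfo present), last CN = %s" % cn)
      except x509.ExtensionNotFound:
          print("legacy Globus proxy (no ProxyCertInfo), last CN = %s" % cn)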

WMS decommissioning TF

  • Maite asks about the timeline for the SAM WMS decommissioning by the end of June. Marian answers that the deployment of the new Condor-G probes is late; the CMS probes are currently being finalized with Andrea Sciabà. The new probes will be in SAM preprod by the beginning of June, so realistically the commissioning will complete by the end of August.
  • Maite proposes to check back on the status in mid-summer; the hard deadline is the 31st of October, since the SAM WMS machines are Quattor-managed.

IPv6 validation and deployment TF

HTTP proxy discovery TF

  • NTR

Network and transfer metrics WG

  • Mandate:
    • Ensure all relevant network and transfer metrics are identified, collected and published
    • Ensure sites and experiments can better understand and fix networking issues
    • Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows
  • Objectives:
    • Identify and continuously make available relevant transfer and network metrics
    • Document metrics and their use
    • Facilitate their integration in the middleware and/or experiment tool chain
    • Coordinate commissioning and maintenance of WLCG network monitoring
  • Homepage at https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics; linked from WLCG OPS coordination page
    • Added initial information, mandate and objectives proposed at the last WLCG ops coordination meeting
  • Mailing list: wlcg-ops-coord-wg-metrics@cern.ch
  • Meetings:
    • Thu 22nd May: internal meeting held at LAPP, Annecy
  • Discussed members to invite and started contacting them
  • Proposing to organize the work in two sub-groups:
    • Deployment, commissioning and maintenance of the tools providing network and transfer metrics (FAX, AAA, FTS, perfSONAR, etc.) - versions and configuration tracking (parameter tuning), operational issues, etc.
    • Higher-level services that will make use of the provided metrics (PhEDEx, Panda, Rucio) - technical aspects of existing metrics (latency, measurement methodology, API access, etc.), identification of missing metrics, data analytics on archived data
  • Discussed current issues and short-term tasks in the perfSONAR deployment and commissioning objective:
    • Identified near-term operational tasks needed to finalize deployment of perfSONAR (tracking the versions, identifying and resolving firewall issues, etc.)
    • Discussed plans for the OSG data store hosting and archiving perfSONAR data
  • Next steps:
    • Invite all the proposed members and organize kick-off meeting

  • Concerning perfSONAR, Marian mentions that a new version is coming out with a REST API (a minimal query sketch is given below).
  • Maria asks about the kick-off meeting for the working group; Marian answers that the date will be agreed when all WG members are invited, the aim is for June.
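  • As a pointer to what that REST API could offer, a minimal sketch of a query against an esmond-style perfSONAR measurement archive; the host name is a placeholder, and the /esmond/perfsonar/archive/ path and filter parameters are assumptions based on the announced release:

      # Minimal sketch: list recent throughput measurements from a perfSONAR
      # measurement archive (esmond-style REST API, assumed endpoint and filters).
      import requests

      host = "ps-archive.example.org"  # placeholder, not a real WLCG instance
      url = "http://%s/esmond/perfsonar/archive/" % host
      params = {"event-type": "throughput", "time-range": 86400}  # last 24 hours

      resp = requests.get(url, params=params, timeout=30)
      resp.raise_for_status()
      for entry in resp.json():
          print(entry.get("source"), "->", entry.get("destination"))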

Action list

  1. NEW on the Middleware Readiness Working Group: include Condor in the readiness verification; report at the next meeting on the status of the issue with RFC proxies.
  2. NEW on the WLCG monitoring team: report on the status of the Condor-G probes for SAM, needed to be able to decommission the SAM WMS.

AOB

  • Reminder: the next WLCG Operations Coordination meeting will be on June 5th

-- NicoloMagini - 19 May 2014
