WLCG Operations Coordination Minutes - 24 October 2013

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=272669

Attendance

  • Local: Andrea Sciaba', Simone Campana, Maria Dimou, Oliver Keeble, Alessandra Forti, Felix Lee, Nicolo' Magini, Michail Salichos, Markus Schulz, Alberto Aimar, Manuel Guijarro, Jan Iven
  • Remote: Burt Holzman, Alessandro Cavalli, Antonio Maria Perez Calero Yzquierdo, Christoph Paus, Christoph Wissing, Frederique Chollet, Di Qing, Alessandra Doria, Peter Solagna, Nilsen (KIT), Philippe Charpentier, Jeremy Coles

News

  • Simone announces that he will succeed Maria Girone as WLCG Operations working group coordinator and presents the schedule for the upcoming meetings. Note: the slides in the agenda were updated after the meeting to fix a typo in the "daily" meeting schedule.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Updated baseline versions for BDII-top, EMI-3 CREAM CE, frontier-tomcat (affecting CMS only)
  • Nicolo': EMI-UI version 2.0.3 was released, restoring the missing grid-cert-info command; if the command is important for the VOs, this version will be set as the baseline (see the sketch below). Simone: action to collect feedback from the VOs.
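As a minimal illustration of the kind of check a VO could run on an upgraded UI (this is only a sketch; it assumes a Python 3 environment and the standard grid-cert-info options):

    import shutil
    import subprocess

    def check_grid_cert_info():
        """Return the user certificate subject via grid-cert-info, or None if
        the command is missing (as it was on earlier EMI-UI 2.0.x releases)."""
        if shutil.which("grid-cert-info") is None:
            return None
        # -subject prints the certificate DN (other options such as -enddate exist)
        result = subprocess.run(["grid-cert-info", "-subject"],
                                capture_output=True, text=True)
        return result.stdout.strip() if result.returncode == 0 else None

    if __name__ == "__main__":
        subject = check_grid_cert_info()
        print(subject or "grid-cert-info not available on this UI")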

Tier-1 Grid services

Storage deployment

  • CERN
    • Status: CASTOR v2.1.13-9-2 and SRM-2.11-2 on ALICE, CMS and LHCb; EOS: ALICE (EOS 0.2.37 / xrootd 3.2.8), ATLAS (EOS 0.2.38 / xrootd 3.2.8 / BeStMan2-2.2.2), CMS (EOS 0.2.38 / xrootd 3.2.8 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2)
    • Recent/planned changes: CASTOR v2.1.14 on ATLAS 21/10; EOS 0.3 for ATLAS 23/10; CASTOR v2.1.14 for the others: 28-29/10
  • ASGC
    • Status: CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.6-1, xrootd 3.2.7-1
    • No recent or planned changes
  • BNL
    • Status: dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool
  • CNAF
    • Status: StoRM 1.11.2 EMI-3 (ALICE, ATLAS, CMS, LHCb)
  • FNAL
    • Status: dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; Oracle Lustre 1.8.6; EOS 0.3.1-12 / xrootd 3.3.3-1.slc5 with BeStMan 2.2.2.0.10
    • Recent changes: EOS 0.3.1-12
    • Planned changes: upgrade to a new EOS version in 2 weeks; dCache 2.2 hopefully by the end of 2013
  • IN2P3
    • Status: dCache 2.2.12-1 (Chimera) on SL6 core servers and 2.2.13-1 on pool nodes; Postgres 9.1; xrootd 3.0.4
    • Planned changes: dCache upgrade to 2.6.10+ by end of Nov (SHA-2 support)
  • KISTI
    • Status: xrootd v3.2.6 on SL5 for disk pools; xrootd 20100510-1509_dbg on SL6 for tape pool; DPM 1.8.6
  • KIT
    • Status: dCache (atlassrm-fzk.gridka.de: 2.6.5-1, cmssrm-fzk.gridka.de: 2.6.5-1, lhcbsrm-kit.gridka.de: 2.6.5-1); xrootd (alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.1-1)
    • No recent or planned changes
  • NDGF
    • Status: dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes
  • NL-T1
    • Status: dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF)
  • PIC
    • Status: dCache head nodes (Chimera) and doors at 2.2.17-1; xrootd door to VO servers (3.3.1-1)
    • No recent or planned changes
  • RAL
    • Status: CASTOR 2.1.13-9, 2.1.13-9 (tape servers), SRM 2.11-1
  • TRIUMF
    • Status: dCache 2.2.13 (Chimera), pool/door 2.2.10
    • Planned changes: dCache upgrade to support SHA-2 by end of Nov

  • Burt comments that the migration to dCache 2.2 at FNAL is progressing in parallel with the disk/tape separation.

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1    
ASGC 2.2.8 - transfer-fts-3.7.12-1    
BNL 2.2.8 - transfer-fts-3.7.10-1    
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
IN2P3 2.2.8 - transfer-fts-3.7.12-1    
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1    
RAL 2.2.8 - transfer-fts-3.7.12-1    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

LFC deployment

Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans
BNL | 1.8.3.1-1 for T1 and US T2s | SL6, gLite | Oracle 11gR2 | ATLAS |
CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | About to deploy 1.8.7-3 on Puppet-managed OpenStack nodes

  • IT-PES has prepared Puppet-managed OpenStack nodes with the current LFC version (1.8.7-3) and the same configuration as the current CERN production nodes. An added bonus of this version is that dmlite WebDAV access is available (see the sketch below). The change should otherwise be transparent. ATLAS and LHCb data management people have been given access to a couple of nodes that are already configured (but not yet in the load-balanced alias). If things look OK for them, we will add one node to each of the production aliases and, if everything is still OK after a few days, progressively replace all Quattor nodes.
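A minimal sketch of such a WebDAV access for testing purposes; the endpoint URL, namespace path and proxy location below are purely illustrative, not the actual CERN configuration:

    import requests  # third-party; any HTTP client supporting custom methods works

    # Hypothetical dmlite WebDAV endpoint and grid proxy path -- adjust to the real setup
    ENDPOINT = "https://lfc-webdav.example.cern.ch/grid/atlas/"
    PROXY = "/tmp/x509up_u1000"

    def list_directory(url):
        """Issue a WebDAV PROPFIND with Depth: 1 to list the immediate children."""
        response = requests.request(
            "PROPFIND",
            url,
            headers={"Depth": "1"},
            cert=PROXY,                                 # client authentication with the grid proxy
            verify="/etc/grid-security/certificates",   # CA directory
        )
        response.raise_for_status()
        return response.text                            # multi-status XML, one entry per child

    if __name__ == "__main__":
        print(list_directory(ENDPOINT))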

Other site news

Data management provider news

Experiment operations review and plans

ALICE

  • CVMFS
    • 40 sites have been switched already, many others are in progress
  • CERN
    • Mon Oct 7: lcg-voms.cern.ch host certs updated with wrong DNs, causing job submission failures for ALICE (and other experiments) around the grid (alarm GGUS:97815)
      • as a side effect it was discovered that alarm tickets did not work (GGUS:97817); fixed Fri Oct 11
    • significant fraction of jobs (~35%) running on SLC6 with CVMFS now
      • so far the efficiencies have been lower and the failure rates have been higher than on SLC5 with Torrent
        • the issue looks systematic; under investigation
      • SLC5 jobs are still using Torrent for now
  • all ALICE activity stopped on Thu Oct 10 due to the unexpected expiration of the AliEn CA at 11:33 CEST that day
    • fixed in the late afternoon; all production sites were working again by the afternoon of the next day
  • US T2 LLNL needed to be switched off for a few days due to the government shutdown!

ATLAS

  • ATLAS encourages sites to deploy CVMFS 2.1

CMS

  • IPv6 (CMS - Tony Wildish, Andrea Sciaba)
  • Overview of workflows
    • legacy 7 TeV data re-reco pass restarted last week (still issues with production infrastructure reliability)
    • running 8 TeV MC GEN-SIM and started 7 TeV legacy MC GEN-SIM
  • CVMFS: working with the last remaining ~10 T1/T2 sites to switch to CVMFS
    • two more sites moved to CVMFS
    • instructed site admins to prepare debug tarballs and send them to the CVMFS development team
    • starting to push the Tier-3 sites harder as well (many have already converted)
    • sometimes seeing black holes due to CVMFS (can be quite annoying)
      • need to collect the various tricks for fixing these issues from the sites (see the node-check sketch after this list)
      • will propagate this information via this meeting and the CVMFS team
  • Savannah to GGUS move
    • most technical issues are understood
    • the ball is presently in our court to properly formulate the various conversion solutions
    • will communicate with the GGUS team
  • checking in
    • SAM tests: condor_g mode, is there any progress?
    • glexec: waiting for report from WLCG task force
  • older items
    • an MC workflow loaded a large gridpack (700 MB) via the Squids - launchpads started to die
    • the post mortem is close to complete, but not complete yet
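As an illustration of the kind of recipe to be collected, a site can let the worker node verify its CVMFS mounts before accepting work, so that a broken client does not become a black hole. The sketch below is only an example (the repository list is illustrative); it relies on the standard cvmfs_config probe command:

    import os
    import subprocess

    # Repositories the CMS jobs need; adjust to the site's configuration
    REPOSITORIES = ["cms.cern.ch", "grid.cern.ch"]

    def cvmfs_healthy():
        """Return True if the required CVMFS repositories are present and the
        client responds. Intended as a node health check so that a broken
        client does not fail every job sent to the worker node."""
        for repo in REPOSITORIES:
            # with autofs, this access also triggers the mount
            if not os.path.isdir(os.path.join("/cvmfs", repo)):
                return False
        # 'cvmfs_config probe' touches every configured repository
        probe = subprocess.run(["cvmfs_config", "probe"],
                               capture_output=True, text=True)
        return probe.returncode == 0

    if __name__ == "__main__":
        raise SystemExit(0 if cvmfs_healthy() else 1)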

  • Simone and Andrea answer about the SAM condor_g probe: the activity should start in IT-SDC-MI (Julia's section) in November. Simone proposes to discuss the timeline at the next planning meeting. Alessandra reminds everyone that the experiment tests will be used for official availability calculations from Jan 2014. ChristophW and Simone: this fits with the decommissioning of the gLite WMS.
  • Simone comments that there will be no report from the glExec task force at this meeting due to Maarten's absence; to be followed up at the next meeting

LHCb

  • GGUS statistics: 8 tickets in the last two weeks
    • mostly pilot problems (2x RAL with ARC CE), 2x CVMFS, 1x staging pools went offline at IN2P3
  • All WAN transfers are now executed by FTS3 in production (using the CERN instance, with RAL as backup)
  • the 2013 fall incremental stripping campaign is progressing well; approximately half of the data has been processed

  • Philippe: KIT fixed staging problems seen last spring, OK now for LHCb
  • Philippe: LHCb now uses FTS3 for all transfers, including internal ones (e.g. CERN-CERN); a minimal submission sketch follows below
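As a minimal sketch of how a transfer can be submitted to such an FTS3 instance, assuming the fts-rest Python "easy" bindings are installed and a valid grid proxy is available; the endpoint and file URLs below are illustrative, not the actual LHCb configuration:

    import fts3.rest.client.easy as fts3  # fts-rest Python bindings

    # Illustrative endpoint and SURLs
    ENDPOINT = "https://fts3.cern.ch:8446"
    SOURCE = "srm://source.example.org/lhcb/data/file.root"
    DESTINATION = "srm://dest.example.org/lhcb/data/file.root"

    # The context authenticates with the grid proxy found in the environment
    context = fts3.Context(ENDPOINT)

    # One job can carry many transfers; here a single source/destination pair
    transfer = fts3.new_transfer(SOURCE, DESTINATION)
    job = fts3.new_job([transfer], verify_checksum=True, retry=3)

    job_id = fts3.submit(context, job)
    print("Submitted FTS3 job", job_id)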

Task Force reports

FTS-3

  • Nicolo': next task force meeting and demo on Wed 30th October, items to discuss will include the bandwidth limitation feature proposed by some sites at the last GDB.

Machine/Job Features

  • finalizing the packaging of the "mjf.py" tool, so that it can be put into the WLCG RPM repository to allow deployment on the first grid sites (an illustrative reader sketch follows below)
  • discussing with Tim and Ulrich the development and deployment of machine/job features in the CERN OpenStack environment
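For illustration, the sketch below reads the key files published under the directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables, as foreseen by the machine/job features proposal; it is not the actual mjf.py implementation, and the key names used at the end are only examples:

    import os

    def read_features(env_var):
        """Read all key files from the directory pointed to by $MACHINEFEATURES
        or $JOBFEATURES; each file name is a key and its content the value."""
        directory = os.environ.get(env_var)
        if not directory or not os.path.isdir(directory):
            return {}
        features = {}
        for key in os.listdir(directory):
            with open(os.path.join(directory, key)) as handle:
                features[key] = handle.read().strip()
        return features

    if __name__ == "__main__":
        machine = read_features("MACHINEFEATURES")
        job = read_features("JOBFEATURES")
        # Example keys (illustrative): HS06 power of the node, job wall-clock limit
        print("hs06:", machine.get("hs06"))
        print("wall_limit_secs:", job.get("wall_limit_secs"))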

SL6 migration

The target was lowered to 90% for Tier-1s and 66% for Tier-2s.

2013-10-24

  • Total number of Tier1s Done: 12/16 (Alice 7/10, Atlas 10/13, CMS 5/8, LHCb 6/9)
  • Total number of Tier1s not Done: 4/16 (2 with a plan, 2 in progress)
    • CERN: moved 22% of batch resources.
    • KIT: moved 55% of the resources and hopes to finish by the end of the month
    • FNAL: have had problems with the migration; two nodes are in test and, if they work correctly, they will be increased to 10 nodes, taking it from there.
    • RRC-KI-T1: no answer since August.

  • Total number of Tier2s Done: 95/130 (Alice 29/40, Atlas 62/89, CMS 47/65, LHCb 31/45), ~73%
  • Total number of Tier2s not Done: 35/130 (24 with a plan, 11 in progress)
    • Will surely finish by the end of October (visible progress): 7
    • Plan to finish before the end of October: 4
    • Should make it by the end of October but cannot guarantee it: 6
    • Have not confirmed their status so far: 13
    • Post-deadline tail of resources: 5

  • Sites that do not finish by the end of October will be ticketed.
  • Need to decide the future of the TF after 31 October.

  • Alessandra, Andrea and Simone agree to consider the task force CLOSED, stopping the active push on sites. Tickets to sites not yet migrated will be submitted at the end of October, and statistics on tickets will be reviewed after one month.
  • Manuel: what is the target to consider a site "migrated"? Alessandra: we allow for ~5-10% of resources in the tail.
  • Philippe asks for the CERN migration plan. Manuel: there is a plan, >80% before the end of 2013. Actions in progress: recycling CERN SLC5 nodes to SLC6 and allocating additional resources in Wigner. The migration is coupled to the move to OpenStack and Wigner.
  • Jeremy: what are the experiment plans for using SLC5/SLC6 after the end of October? Philippe: LHCb will use what they have; on SLC6 they see a 10-15% improvement, and plan to discontinue SLC5 builds by the end of the year. Simone: the hard experiment deadlines for stopping the use of SLC5 resources will be discussed at the next planning meeting.

Tracking tools

  • Three meetings took place in October. Most recent first:
  1. 2nd meeting on the savannah-to-jira migration of the GGUS dev. tracker, 2013/10/09:
    Our tracker was not migrated to JIRA6. This will be done by Benedikt after CHEP, and testing will restart. Agenda of this 2nd meeting on the savannah-to-jira migration of the GGUS dev. tracker 2013/10/09 and Minutes.

  2. 3rd CMS-GGUS meeting on migrating the savannah-ggus bridge:
    The meeting defined with great precision the new forms, fields and their values for use by CMS GGUS users only, and concluded on the January 2014 GGUS release as the date of entering production. Agenda of the 3rd CMS-GGUS meeting on migrating the savannah-ggus bridge 2013/10/08 and Notes/Actions/Decisions.

  3. Tracking Tools Evolution TF meeting on savannah and GGUS dev. issues 2013/10/08:

    The meeting decided that:
    1. Experiments should discuss their savannah trackers in the Librarians & Integrators' meeting.
    2. Discussion on the IT savannah trackers has started among Tracking Tools members (namely Maarten & Benedikt).
    3. A GGUS submission form for experts, able to open multiple identical tickets to multiple sites and NGIs/ROCs, was specified in detail.
    Agenda of this Tracking Tools Evolution TF meeting on savannah and GGUS dev. issues of 2013/10/08 and Minutes.

  • Pablo Saiz replaces Maria Dimou in the GGUS development team and in chairing the Tracking Tools TF. There is no cut-off date, so that the transition can be smooth.
  • The Tracking Tools TF twiki is up-to-date.
  • The results of 4+ years of ALARMs' drills, the ALARMs' workflow and the GGUS infrastructure are on this CHEP 2013 Poster.

CVMFS Deployment

  • 6 ALICE sites left to deploy CVMFS (5 done since last meeting)

SHA-2 migration

  • Peter announces that alarms for SHA-2 compliance are now raised for all services by EGI. Fixes for StoRM will only be available in EMI-3/UMD-3; StoRM for EMI-2/UMD-2 is now UNSUPPORTED, since sites need to upgrade to EMI-3 anyway for SHA-2 support.
  • Manuel: voms-admin at CERN is being worked on and is available in Puppet. Still a couple of weeks before it is ready to be exposed to the experiments. Simone reminds that careful experiment validation is needed. Peter: the dteam VO managers were happy with the recent migration from VOMRS to voms-admin
  • Andrea: how many SEs are not SHA-2 compliant and would be put in downtime on December 1st? Peter will check the numbers; for dCache, 70% of the endpoints were not compliant 3 weeks ago. Peter will meet with the NGIs next week to review the situation. Andrea: let's review at the next WLCG Ops Coord meeting. Simone proposes to keep statistics, as for the SLC6 migration. (A minimal certificate-check sketch follows below.)
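As a minimal sketch of how one could check whether an endpoint still presents a SHA-1 signed host certificate (the hostname and port below are illustrative; full compliance also depends on the middleware versions, not only on the certificate):

    import ssl
    from cryptography import x509  # third-party 'cryptography' package

    def host_cert_hash_algorithm(host, port=8443):
        """Fetch the server certificate and return the name of its signature
        hash algorithm, e.g. 'sha1' or 'sha256'."""
        pem = ssl.get_server_certificate((host, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        return cert.signature_hash_algorithm.name

    if __name__ == "__main__":
        # Illustrative endpoint -- replace with a real SE host and port
        print(host_cert_hash_algorithm("se.example.org", 8443))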

AOB

Action list

  1. Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers. Further discussion is expected at the next meeting, after a dedicated meeting about the migration of the GGUS Savannah tracker to JIRA. Maria clarifies that 83 trackers need a decision, and trackers that will not be migrated will be gone for good. Maarten suspects the migration cannot be finished this year, but will need to stretch a few months into next year. Maria thinks we can close this action, as decided in the Tracking Tools' Evolution TF meeting of 2013/10/08. See the Minutes HERE. Closed.
  2. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated
  3. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress
  4. Experiments interested in using WebDAV should contact ATLAS to organise a common discussion
    • closed
  5. Collect feedback from VOs about the need for grid-cert-info and about setting EMI-UI 2.0.3 as the baseline.
    • new

-- NicoloMagini - 24-Oct-2013
