
WLCG Operations Coordination Minutes - December 19, 2013

Agenda

Attendance

  • Local:
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.14-5 and SRM-2.11-2 on all instances. EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4), ATLAS (EOS 0.3.4 / xrootd 3.3.4 / BeStMan2-2.2.2), CMS (EOS 0.3.2 / xrootd 3.3.4 / BeStMan2-2.2.2), LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.2.2) | | |
| ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1 | none | none |
| BNL | dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup), http (aria2c) and xrootd/Scalla on each pool | | dCache upgrade to v2.6 for SHA-2 compliance |
| CNAF | StoRM 1.11.2 emi3 (ATLAS, CMS, LHCb) | none | none |
| FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3, Scalla xrootd 2.9.7/3.2.7.slc, EOS 0.3.2-4 / xrootd 3.3.3-1.slc5 with Bestman 2.2.2.0.10 | Lustre decommissioned in favor of EOS | Will upgrade xrootd/EOS after the next EOS release (if FUSE bugs are fixed); a dCache disk pool (Chimera + dCache 2.2) is up, links are being commissioned and all data to be migrated is being pinned on the existing dCache 1.9 instance |
| IN2P3 | dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes, Postgres 9.2, xrootd 3.0.4 | dCache 2.6.15-1 --> 2.6.18-1 | xrootd 3.3.4 |
| KISTI | xrootd v3.2.6 on SL5 for disk pools, xrootd 20100510-1509_dbg on SL6 for tape pool, DPM 1.8.7-3 | DPM 1.8.6-1 --> 1.8.7-3 | |
| KIT | dCache: atlassrm-fzk.gridka.de 2.6.17-1, cmssrm-kit.gridka.de 2.6.17-1, lhcbsrm-kit.gridka.de 2.6.17-1. xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.3-1 | | |
| NDGF | dCache 2.3 (Chimera) on core servers, mix of 2.3 and 2.2 versions on pool nodes | | |
| NL-T1 | dCache 2.2.17 (Chimera) (SURFsara), DPM 1.8.7-3 (NIKHEF) | | |
| PIC | dCache head nodes (Chimera) and doors at 2.2.17-1, xrootd door to VO servers (3.3.4) | | Today: upgrade of the DBs to PostgreSQL 9.2 and of dCache to 2.2.21 |
| RAL | CASTOR 2.1.13-9 (2.1.13-9 on tape servers), SRM 2.11-1 | | CASTOR 2.1.14 in testing |
| TRIUMF | dCache 2.2.18 | | |
| JINR-T1 | dCache: srm-cms.jinr-t1.ru 2.6.19, srm-cms-mss.jinr-t1.ru 2.2.23 with Enstore. xrootd federation host for CMS: 3.3.3 | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | We plan to install an additional FTS3 server on a separate host |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 for T1 and US T2s | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.7-3 | SLC6, EPEL | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Experiments operations review and Plans

ALICE

  • plans for the end-of-year break (reminder)
    • MC production at all sites
    • we do not expect to run RAW reconstruction
    • the user/organized analysis will naturally diminish in intensity
    • the usual 'best effort' support from the sites, which worked so well in the past years, will be appreciated!
  • CERN
    • SLC6 vs. SLC5 job failure rates and CPU/wall-time efficiencies
      • 4 VOBOXes have been submitting to different sets of CERN resources for 3 weeks
      • the job failure rate at Wigner was 55%, compared to 18% for SLC6 jobs in Meyrin and 30% for SLC5 jobs
      • the average efficiency of SLC6 jobs was 20% lower than the average for SLC5 jobs
      • similar comparisons for various classes of ATLAS and CMS jobs suggest differences ranging from 0 to 20% depending on the type of job
      • a queue targeting only physical SLC6 nodes would help to understand if the differences are due to SLC6 or due to the VM infrastructure
      • to be continued...
  • RRC-KI-T1
    • commissioning activities ongoing since late Nov - thanks!
      • EOS, VOBOX, CEs
  • CVMFS
    • 64 sites using it in production
    • 8 in various stages of preparation
    • sites, please ensure the WNs have version 2.1.15 (or higher); a minimal version-check sketch follows this list
  • SAM (reminder)
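
A minimal sketch, assuming an RPM-based worker node where the client is installed as the "cvmfs" package, of how a site could verify that it meets the 2.1.15 minimum mentioned above; the script and package name are illustrative assumptions, not an official ALICE probe.

```python
#!/usr/bin/env python
# Minimal sketch (not an official probe): check that the CVMFS client on an
# RPM-based worker node meets the 2.1.15 minimum requested by ALICE.
# Assumes the client is installed as the "cvmfs" package.
import subprocess

MINIMUM = (2, 1, 15)

def installed_cvmfs_version():
    """Return the installed cvmfs version as a tuple, e.g. (2, 1, 15)."""
    out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", "cvmfs"])
    return tuple(int(p) for p in out.decode().strip().split(".")[:3])

if __name__ == "__main__":
    version = installed_cvmfs_version()
    verdict = "OK" if version >= MINIMUM else "too old, please upgrade"
    print("cvmfs %s: %s" % (".".join(map(str, version)), verdict))
```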

ATLAS

  • ATLAS xmas break plans
    • MC production (single-core jobs): produce 100M events, which corresponds to approximately 8-10 days of ATLAS Grid production resource utilization as of today.
    • MC production (MultiCore): produce 150M events; tasks are being tested now. If everything goes as expected, the message to sites is:
      • Configure MCORE queues fully dynamically if experienced with it,
      • Static allocation otherwise: 50% of production resources for T1 and big T2
      • if for some reason the MultiCore configuration is not production ready at the site, do not increase the share now (before/during XMas) if you think the system stability could be endangered. Please communicate with ATLAS on when you think you can do these changes.
    • Reprocessing: a reprocessing campaign has started. The total input is 2.2 PB on tape, with small output (2%). This corresponds to approx. 30 days for 20% of the T1s. Pre-staging of the data is handled automatically by PanDA. During the Xmas break only a part of it is foreseen to be done (approx. 500 TB of input).
    • Group production: the NTUP_COMMON v2 campaign is now starting. It corresponds to 35% of all the resources for approx. 5 weeks.
    • analysis as usual
    • check more details on Tuesday 17 December ADC Weekly agenda
  • Issues: Openssl
  • information for sites
    • Rucio Renaming: deadline 1st of February. Sites not migrated (or that have not started/agreed a plan) will be excluded from ATLAS DDM. ADC Weekly 17 December - Rucio Renaming: Deadline
      • Exceptions can be discussed for sites with migration in progress with clear plans agreed beforehand with DDM/Rucio teams
      • What we expect from the not yet renamed sites:
        • DPM and dCache sites must provide WebDAV access before this date to allow remote renaming; a WebDAV probe sketch follows this list. If they cannot or do not want to, they have to contact atlas-dq2-ops@cern.ch and will have to take care of the renaming themselves.
        • StoRM sites: we notice that the performance of the current WebDAV implementation is not good enough. The StoRM developers are working on an improved version, but it might be tight to have it deployed on all StoRM sites by February 1st. Sites have to be ready to upgrade their storage as soon as possible (e.g. at the beginning of January) if the StoRM release is ready as expected.
    • MultiCore allocations for production (as described above under XMas activity) :
      • Configure MultiCore queues fully dynamically if experienced with it,
      • Static allocation otherwise: 50% of production resources for T1 and big T2
      • if for some reason the MultiCore configuration is not production ready at the site, do not increase the share now (before/during XMas) if you think the system stability could be endangered. Please communicate with ATLAS on when you think you can do these changes.
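
As referenced in the Rucio renaming item above, a minimal sketch of how a DPM or dCache site could confirm that its WebDAV door answers before the February 1st deadline. The host name, port, path and proxy location are hypothetical placeholders, and certificate verification is deliberately skipped for brevity; this is not an ATLAS DDM tool.

```python
#!/usr/bin/env python
# Minimal sketch: probe a storage element's WebDAV door ahead of the Rucio
# renaming. Host, port, path and proxy location below are placeholders, not
# real ATLAS endpoints; a real check would also verify the IGTF CA bundle.
import http.client
import ssl

HOST = "webdav.example-site.org"                # hypothetical WebDAV door
PORT = 443
PATH = "/dpm/example-site.org/home/atlas/"      # hypothetical VO base path
PROXY = "/tmp/x509up_u1000"                     # grid proxy (cert + key)

context = ssl.create_default_context()
context.load_cert_chain(PROXY)                  # authenticate with the proxy
context.check_hostname = False                  # CA verification skipped in
context.verify_mode = ssl.CERT_NONE             # this sketch only

conn = http.client.HTTPSConnection(HOST, PORT, context=context)
conn.request("PROPFIND", PATH, headers={"Depth": "0"})
response = conn.getresponse()
# 207 Multi-Status means the WebDAV door answered the PROPFIND
print(response.status, response.reason)
conn.close()
```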

CMS

  • Reminder CMS holiday break plans:
    • Production and digitization-reconstruction of Run 2 preparation MC samples
    • Digitization-reconstruction of 7 TeV MC for 2011 data legacy re-reconstruction pass
  • Reminder: Best-effort operations during holiday break as every year
    • We appreciate all the support from the sites we can get, but don't expect normal levels of support, especially from T2 sites
      • Will still send tickets though

LHCb

  • Plans for the Xmas break
    • A proton-ion/ion-proton reprocessing has started.
      • RAW data is staged at CERN & GRIDKA: at CERN all data is staged, GRIDKA is progressing well
      • the bulk of the work should be finished in ~10 days
    • Otherwise, Monte Carlo productions at all sites / Tier levels
  • GGUS statistics: 17 tickets opened in the last 2 weeks, mainly problems with aborted pilots (9)
  • A fix for the correct resolution of the xroot address by SRM on DPM sites has been tested successfully at CBPF.
  • LHCb was hit by the problem caused by the latest openssl version on Red Hat Linux

Ongoing Task Forces and Working Groups

SHA-2 Migration TF

  • sites are steadily upgrading remaining affected services to versions supporting SHA-2 (a certificate-check sketch follows this list)
    • Operations update in Dec 19 EGI OMB meeting mentioned 6 sites with non-compliant services remaining
    • OSG T1 sites
      • BNL ready since Dec 17
      • FNAL hopefully OK by the end of Dec
        • cmssrmdisk.fnal.gov seems OK
        • cmssrm.fnal.gov not yet ready
  • EOS SRM instances not yet ready!
    • updated version tested OK on eospps.cern.ch and standby nodes for the experiments
      • can be switched quickly if needed
    • updates of the production instances early Jan
  • a newer dCache SRM client, v2.2.22, able to handle SHA-2 host certificates, was released on Dec 16
  • timelines
    • by mid January the WLCG infrastructure is expected to be essentially ready
      • we may be able to ignore any remaining stragglers by the end of Jan
    • it is unlikely that SHA-2 certificates will still appear this year
      • the OSG CA foresees starting mid Jan
      • the CERN CA will switch when WLCG is ready
  • VOMRS
    • VOMS-Admin test setup should become available for testing by VO managers early Jan
    • VOMS-Admin instability being looked into (GGUS:99327)
      • thanks to the VOMS developers for their prompt efforts!
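
A minimal sketch of how a site could inspect the signature algorithm of a service's host certificate, to see whether its CA has already issued it with SHA-2. It assumes the openssl command-line tool is available; the endpoint is a placeholder and this is not an official compliance test.

```python
#!/usr/bin/env python
# Minimal sketch (endpoint is a placeholder): report the signature algorithm
# of a service's host certificate. Assumes the openssl CLI is installed.
import subprocess

HOST, PORT = "srm.example-site.org", 8443       # hypothetical service endpoint

# Grab the server certificate presented during the TLS handshake.
s_client = subprocess.Popen(
    ["openssl", "s_client", "-connect", "%s:%d" % (HOST, PORT)],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
pem, _ = s_client.communicate(input=b"")

# Decode it and print the signature algorithm:
# "sha256WithRSAEncryption" indicates SHA-2, "sha1WithRSAEncryption" SHA-1.
x509 = subprocess.run(["openssl", "x509", "-noout", "-text"],
                      input=pem, stdout=subprocess.PIPE, check=True)
for line in x509.stdout.decode().splitlines():
    if "Signature Algorithm" in line:
        print(line.strip())
        break
```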

Tracking tools evolution TF

  • GGUS: reminder for the year-end period: GGUS is monitored by a system connected to the on-call service. In case of total GGUS unavailability, the on-call engineer (OCE) at KIT will be informed and will take appropriate action. If GGUS is available but there is a problem with the workflow (e.g. an ALARM to CERN doesn't generate an email notification to the operators), then WLCG should submit an ALARM ticket notifying site DE-KIT, which triggers a phone call to the OCE.

gLExec deployment TF

  • 64 tickets closed and verified, 31 still open
    • some sites still waiting to finish their SL6 migration first
    • progress for some difficult cases being debugged
  • The EMI gLExec probe (in use since SAM Update 22) crashes at sites that use the tarball WN and do not have the Perl module Time/HiRes.pm installed (GGUS:98767); see the check sketch after this list
    • installation of that dependency now looks sufficient to cure the problem
    • a proper fix is still to be decided
  • Deployment tracking page
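
The check referenced above, as a minimal sketch: it only verifies that the Perl module the probe needs can be loaded, assuming perl is on the PATH of the tarball WN. It is not the SAM probe itself.

```python
#!/usr/bin/env python
# Minimal sketch: check whether the Perl module needed by the EMI gLExec probe
# (GGUS:98767) can be loaded on a tarball worker node. Assumes perl is on PATH.
import subprocess

status = subprocess.call(["perl", "-MTime::HiRes", "-e", "1"])
if status == 0:
    print("Time::HiRes is available - the gLExec probe should run here")
else:
    print("Time::HiRes is missing - install it (e.g. the perl-Time-HiRes RPM)")
```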

Middleware readiness WG

  • first meeting happened on Dec 12
    • the discussion was mostly on repositories and processes
      • how to involve experiments and sites should be discussed in the next meeting, which is planned for Feb 6 at 16:00 CET
    • please consult the minutes for a detailed summary

WMS decommissioning TF

  • usage of the CMS WMS at CERN has remained lower since CMS users were informed that support of the gLite WMS is ramping down and they should use CRAB's scheduler=remoteglidein option instead
    • the CRAB-2 client also no longer uses a centrally distributed list of WMS hosts
  • to be continued after the break

FTS3 Deployment TF

  • FTS3 servers were affected by the issue with the new openssl version; for now they have been rolled back to SLC6.4, with a permanent fix to come when the new gridsite version is released in EPEL
  • FTS3 performance comparison tests (fixed configuration vs autoconfiguration) are ongoing - some bugs have been reported to the developers; preliminary results are being collected on the FTS3 twiki

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress
  3. Collect feedback from VOs about the need for grid-cert-info and about setting EMI-UI 2.0.3 as the baseline.

-- SimoneCampana - 12 Dec 2013
