WLCG Operations Coordination Minutes - February 20, 2014

Agenda

Attendance

  • Local: Simone, Nicolo, Alessandro D.G., Maite, Alberto, Antonio, Fabrizio, Stefan, Luca, Felix, Oliver K., Oliver G., Maarten, Maria
  • Remote: Burt, Alessandro C., Alexander, Antonio P., Carlos, Cheng-Hai, Cristoph, Di Qing, Jeremy, Thomas, Massimo

News

  • Simone: the next meeting is on March 6th; dates of future meetings, including the planning meeting, are to be announced and should follow the regular agenda.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Highlights include a StoRM security fix release, and the release of the proxyrenewal daemon to fix proxy key strength issues.
  • Added note about end of security updates for EMI-2 in April 2014.

  • OpenSSL issue
    • EGI broadcast sent Feb 4 describing current state of affairs and recipes for cures
    • Sites using HTCondor as batch system may need to apply one of these configuration changes for now:
      • DELEGATE_JOB_GSI_CREDENTIALS = False
      • GSI_DELEGATION_KEYBITS = 1024
    • HTCondor v8.0.6 increases the default of GSI_DELEGATION_KEYBITS to 1024
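
A site can check mechanically whether one of the two workarounds listed above is in place. The following is a minimal sketch, assuming the relevant HTCondor settings are available as plain `NAME = VALUE` text. The helper names `parse_condor_config` and `openssl_workaround_applied` are hypothetical and not part of HTCondor, and the parsing ignores macros, includes and other features of real HTCondor configuration files.

```python
def parse_condor_config(text):
    """Parse simple 'NAME = VALUE' lines; later assignments override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Upper-casing the name is a simplification made by this sketch.
        settings[name.strip().upper()] = value.strip()
    return settings


def openssl_workaround_applied(text):
    """Return True if either workaround from the minutes is present in the text."""
    settings = parse_condor_config(text)
    # Workaround 1: credential delegation switched off.
    if settings.get("DELEGATE_JOB_GSI_CREDENTIALS", "").lower() == "false":
        return True
    # Workaround 2: delegated proxy keys of at least 1024 bits.
    keybits = settings.get("GSI_DELEGATION_KEYBITS", "")
    return keybits.isdigit() and int(keybits) >= 1024


if __name__ == "__main__":
    sample = "GSI_DELEGATION_KEYBITS = 1024\n"
    print(openssl_workaround_applied(sample))
```

The check only recognises the literal value `False` (in any letter case) for the first setting; other spellings of a boolean would need to be added if a site uses them.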

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.14-5 and SRM-2.11-2 on all instances; EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4), ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0), CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0), LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0) | LHCb EOS SRM upgraded to SHA-2 compliant version | |
| ASGC | CASTOR 2.1.13-9; CASTOR SRM 2.11-2; DPM 1.8.7-3; xrootd 3.3.4-1 | None | None |
| BNL | dCache 2.6.18 (Chimera, Postgres 10 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.11.3 emi3 (ATLAS, LHCb); StoRM 1.11.2 emi3 (CMS) | None | None |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; EOS 0.3.15-1/xrootd 3.3.6-1.slc5 with Bestman 2.3.0.16 | Upgrade to EOS and xrootd | Will begin migrating pool nodes to FNAL_Disk dCache instance |
| IN2P3 | dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes; Postgres 9.2; xrootd 3.0.4 (Alice T1), xrootd 3.3.4 (Alice T2) | xrootd 3.3.4 (Alice T2) | None |
| JINR-T1 | dCache: srm-cms.jinr-t1.ru 2.6.21, srm-cms-mss.jinr-t1.ru 2.2.24 with Enstore; xrootd federation host for CMS: 3.3.3 | minor dCache version upgrades | None |
| KISTI | xrootd v3.2.6 on SL5 for disk pools; xrootd 20100510-1509_dbg on SL6 for tape pool; dpm 1.8.7-3 | | |
| KIT | dCache: atlassrm-fzk.gridka.de 2.6.17-1, cmssrm-kit.gridka.de 2.6.17-1, lhcbsrm-kit.gridka.de 2.6.17-1; xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.3-1 | None | None |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.17 (Chimera) (SURFsara), DPM 1.8.7-3 (NIKHEF) | None | None |
| PIC | dCache head nodes (Chimera) and doors at 2.2.17-1; xrootd door to VO servers (3.3.4) | None | Scheduled downtime for upgrade to dCache 2.6 cancelled for incompatibility with Enstore. |
| RAL | CASTOR 2.1.13-9; 2.1.14-5 (tape servers); SRM 2.11-1 | | |
| TRIUMF | dCache 2.2.18 | | |

  • Stefan confirms that EOS SRM for LHCb was upgraded.
  • PIC: no information about dCache2.6-Enstore compatibility yet.

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 for T1 and US T2s | SL6, gLite | ORACLE 11gR2 | ATLAS | None |
| CERN | 1.8.7-3 | SLC6, EPEL | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Oracle upgrades

  • Points from the discussion:
    • After the upgrades, the old RACs (RAC10/RAC11) will be without critical power and will host only standby databases; later these will be moved back to critical power on RAC51 (together with DBs currently at SafeHost)
    • Maarten: status of streams decommissioning? Simone, Alessandro: multiple instances still needed for ATLAS conditions DB; number of streams reduced from 10 to 4, but not planning to switch off completely.
    • LHCBR is a candidate to migrate to Oracle12 in 2014: it is less busy and hosts few applications (LHCb LFC, LHCb Dirac bookkeeping), so validation is easier. Both applications need to be migrated simultaneously. The downtime should be short (~2h).
      • CAPTURE and REPLAY test worked.
      • LFC developers will test LFC using Oracle11 clients against an Oracle12 test instance. LFC is a simple application, so problems are not expected. Stefan, Oliver K.: LHCb will stop using LFC before Run2, so the effort should be kept to a minimum.
      • DIRAC developers currently validating DIRAC bookkeeping.
      • Rollback is possible because new Oracle12 features will not be enabled.
      • Stefan to check if the March schedule is OK for LHCb. The upgrade cannot go beyond mid-May because of the unavailability of critical power.
      • Simone: we agree to proceed with LHCBR upgrade to Oracle12 when we get green light from all of the above.
    • Other DBs (e.g. AIS, LCGR, other experiments) will be migrated to 11.2.0.4 - each instance will have its own schedule, agreed individually. Test instances available. Migration to Oracle12 in winter.
    • Maite: make the DB migration schedule a recurring item on the WLCG Ops Coord meeting agenda. An update on testing is also expected at the next meeting.

Experiments operations review and Plans

ALICE

  • job efficiencies will be fluctuating with the amounts of analysis jobs being run in preparation for the Quark Matter 2014 conference (May 19-24)
  • CERN
    • Wigner vs. Meyrin job failure rates and CPU/wall-time efficiencies
      • no change
      • next meeting Feb 28
  • KIT
    • the final number of corrupted files was 26126
      • 21k other files were salvaged, thanks very much!
    • many jobs were reading a lot of data remotely from CERN
      • resolved?
  • CNAF
    • tape SE updated to xrootd v3.3.4 with new checksum plugin successfully validated with test transfers, thanks!
  • RRC-KI-T1
    • careful memory tuning for jobs done, thanks!

ATLAS

  • Nothing to add since the WLCG F2F meeting

CMS

  • DBS3 was put into production! This needed a week-long stop of production activity: production was drained for Feb. 10th and started to ramp back up on Feb. 12th in the CERN evening.
  • CVMFS
    • CMS wants to access Integration Builds (~nightly builds) via CVMFS
      • Needs to be separated from production software releases because of very high activity (2 builds per day)
      • Will start working on establishing an additional CVMFS instance and repository only for nightly builds
    • CVMFS at CERN
      • testing a few LSF nodes with CVMFS instead of AFS
      • If ok, will inform user community and then plan to switch to CVMFS at CERN
  • RAL FTS3 scale test
    • Sites have to change their local configuration to use the RAL FTS3 server
    • We will start pushing now that DBS3 is in production
  • EOS redirector needed to be restarted, GGUS:101414
    • work to introduce redundancy and provide proper critical service instructions started
  • T1 scheduling policy
    • reintroducing 5% share for analysis jobs (/cms/Role=pilot)
    • needed to keep SAM tests from timing out and to allow more analysis
    • VO card updated, will discuss with T1 sites soon
  • CNAF T1 is in downtime to upgrade storage. Production activities were stopped, but queues were kept open for analysis jobs reading input via AAA; this seems to work fine.

  • Simone: criticality of glExec SAM test still on hold for T1 analysis scheduling policy? Oliver: yes.

LHCb

  • Started incremental stripping campaign, expected duration of 6-8 weeks, heavy loads on the stager systems foreseen.
    • LHCb is asking T1 sites for the stager throughput stated in the table below (same as last campaign)
  • WN power at Wigner found to be significantly lower than CERN/Meyrin and other grid sites. As a consequence single payloads will take more wall time to execute for the same work as on other sites.
  • EOS/SRM endpoint upgraded to new SHA2 compliant version
  • WMS decommissioning: no progress in the last two weeks because of other priorities. As agreed before, decommissioning of the CERN instance by April is OK for LHCb.

Stager throughput at T1 sites for incremental stripping campaign

| Site | Total Rate (MB/s) |
| CERN | 50 |
| CNAF | 153 |
| GRIDKA | 124 |
| IN2P3 | 134 |
| PIC | 39 |
| RAL | 111 |
| SARA | 104 |
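
For a sense of scale, the per-site rates in the table above can be added up. The sketch below does only that arithmetic; the conversion to a daily volume assumes decimal units (1 TB = 10^6 MB) and that the requested rate is sustained around the clock. Neither assumption is stated in the minutes.

```python
# Requested stager throughput per T1 site, in MB/s, copied from the table above.
STAGER_RATE_MB_S = {
    "CERN": 50,
    "CNAF": 153,
    "GRIDKA": 124,
    "IN2P3": 134,
    "PIC": 39,
    "RAL": 111,
    "SARA": 104,
}


def aggregate_rate(rates):
    """Sum of the per-site rates in MB/s."""
    return sum(rates.values())


def volume_tb_per_day(rate_mb_s):
    """Data volume staged in one day at a constant rate (assumes 1 TB = 10**6 MB)."""
    seconds_per_day = 24 * 60 * 60
    return rate_mb_s * seconds_per_day / 1e6


if __name__ == "__main__":
    total = aggregate_rate(STAGER_RATE_MB_S)
    print(total)  # 715
    print(f"{volume_tb_per_day(total):.1f} TB/day if sustained")
```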

  • Simone: the power of CERN WNs differs vastly between Meyrin and Wigner. Consider using Machine/Job Features reports to choose the correct timeout on individual WNs.
  • Maarten: the values should also be published correctly in the information system.
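
The suggestion to pick timeouts per WN can be sketched as follows. This is an illustration under stated assumptions, not the TF's implementation: it assumes the `$MACHINEFEATURES` directory contains files named `hs06` (total HS06 of the node) and `jobslots` (number of job slots), as in the draft Machine/Job Features proposal. The function names and the reference power value are hypothetical.

```python
import os


def read_feature(directory, key):
    """Return the number stored in <directory>/<key>, or None if it cannot be read."""
    try:
        with open(os.path.join(directory, key)) as handle:
            return float(handle.read().strip())
    except (OSError, ValueError):
        return None


def scaled_timeout(base_timeout_s, reference_power, node_power):
    """Stretch a timeout tuned for reference_power to a slower or faster node."""
    if not node_power or node_power <= 0:
        return base_timeout_s  # no usable report: keep the default
    return base_timeout_s * reference_power / node_power


def timeout_for_this_node(base_timeout_s, reference_power_per_slot, features_dir=None):
    """Scale the timeout by the per-slot HS06 reported by the node, if available."""
    features_dir = features_dir or os.environ.get("MACHINEFEATURES")
    if not features_dir:
        return base_timeout_s
    # File names 'hs06' and 'jobslots' are assumptions taken from the draft proposal.
    hs06 = read_feature(features_dir, "hs06")
    slots = read_feature(features_dir, "jobslots")
    if not hs06 or not slots:
        return base_timeout_s
    return scaled_timeout(base_timeout_s, reference_power_per_slot, hs06 / slots)


if __name__ == "__main__":
    # A payload tuned for 12.5 HS06 per slot gets more wall time on a 10 HS06 slot.
    print(scaled_timeout(3600, 12.5, 10.0))
```

Falling back to the unscaled timeout when no report is available keeps the behaviour unchanged on nodes that do not publish machine features.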

Ongoing Task Forces and Working Groups

FTS3 Deployment TF

  • Deployment proposal presented at F2F
  • RAL FTS3 incident on Feb 18th - case for multiple server deployment

  • Oli: upcoming RAL FTS3 downtime? Andrew: rescheduled, no planned FTS3 downtime at the moment
  • Alessandro: current DDM requires a couple of hours for manual FTS3 server switch in case of incident. Upcoming discussion with FTS3 developers on how to perform FTS3 server switch in Rucio.

gLExec deployment TF

IPv6 validation and deployment TF

Machine/Job features TF

  • For batch we do have a prototype installation ready for testing. For cloud usage there was a concern about the usage of the "meta-data service" for providing the job features
    • A prototype based on couchdb has been tested and after some stress testing proved to be reliable. Currently discussing with CERN/Openstack on further steps of installing this tool.
  • Asked remaining VOs to give feedback by end of February (as agreed in the TF) on the current prototype installation on the CERN batch infrastructure.

  • Also good progress in interaction with Igor Sfiligoi's project (for bidirectional communication)

Middleware readiness WG

  • Minutes of the Feb 6 meeting are linked from the agenda. They include conclusions on repositories, answering concerns raised on 2014/02/19 in the e-group, namely:
  • Next meeting to be decided in this doodle. It currently shows that we meet on Mon March 10th at 3:30pm CET. If you wish to change this, please keep voting.

Multicore deployment

  • Reported in detail last week at the F2F meeting:
    • First round of meetings with initial reports from experiment activities status (ATLAS and CMS).
    • First impressions from sites.
  • Next meetings dedicated to batch system mini-workshops (HTCondor, SGE, Torque/Maui, Slurm).

  • Next TF meeting to be skipped because of a clash with the GDB in Bologna. The pre-GDB will be about batch systems, so it will touch multicore scheduling among other topics.

perfSONAR deployment TF

  • No news since F2F.
  • Wait before declaring perfSONAR 3.3.2 as baseline.

SHA-2 Migration TF

  • EOS SRM for LHCb OK since Feb 17
  • VOMRS
    • the VOMS-Admin test cluster has been available since Feb 17
      • experiment VO admins are trying it out and reporting feedback, thanks!
    • host certs of our future VOMS servers are from the new SHA-2 CERN CA
      • we will do a campaign to get the new servers recognized across the grid well before we start using them
      • for convenience we will provide the configuration files also in rpms

  • Need to make sure that the new VOMS services are not used too early. Clients accidentally connecting to the new VOMS servers should get an error, not hang.

Tracking tools evolution TF

  • No update since F2F

WMS decommissioning TF

  • all looks fine for decommissioning the WMS instances for experiments by the end of April
    • draining as of early April
  • the SAM instances have their own timeline

xrootd deployment TF

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated, current solution to be validated. PIC reported that they had issues, make sure that their experience is reported to the developers in the ticket.
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin

-- NicoloMagini - 17 Feb 2014

Topic revision: r21 - 2018-02-28 - MaartenLitmaath
 