
WLCG Operations Coordination Minutes - 21st February 2013

Agenda

Attendance

  • Local:
  • Remote:

News

  • Started a discussion with EGI to see how to improve the communication between WLCG and EGI.
  • A planning meeting needs to be scheduled around end of March.
  • A new TF has been set up to discover, from within a job, more detailed configuration information about the squid server that is not available in GOCDB/OIM.
  • A new TF for SL6 migration has been announced at the GDB. Looking for representatives from T0/T1/T2 and Experiments.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Tier-1 Grid services

Storage deployment

Site Status Recent changes Planned changes
CERN CASTOR 2.1.13-6.1; SRM-2.11 for all instances.

EOS:
ALICE (EOS 0.2.20 / xrootd 3.2.5)
ATLAS (EOS 0.2.27 / xrootd 3.2.7 / BeStMan2-2.2.2)
EOS:
CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2)
LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2)
EOS upgrades to 0.2.29 for ALICE and ATLAS will be scheduled in agreement with the experiments
IN2P3 dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-24 on pool nodes
Postgres 9.1
xrootd 3.0.4
   
RAL CASTOR 2.1.12-10
2.1.12-10 (tape servers)
SRM 2.11-1
  CASTOR upgrade to 2.1.13 planned but not yet scheduled, some tape servers already migrated
ASGC CASTOR 2.1.11-9
SRM 2.11-0
DPM 1.8.5-1
None Tuesday Feb 26 (if agreed by the experiments): CASTOR upgrade to 2.1.13-9
DPM upgrade to EMI2 1.8.6-1
UPS construction and storage firmware upgrade to castor and dpm disk servers.
BNL dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup)
http (aria2c) and xrootd/Scalla on each pool
None None
CNAF StoRM 1.8.1 (Atlas, CMS, LHCb)    
FNAL dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3
Scalla xrootd 2.9.7/3.2.4-1.osg
Oracle Lustre 1.8.6
EOS 0.2.22-4/xrootd 3.2.4-1.osg with Bestman 2.2.2.0.10
None None
KIT dCache
  • atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera)
  • cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera)
  • lhcbsrm-kit.gridka.de: 1.9.12-24 (Chimera)
xrootd (version 20100510-1509_dbg)
None None
NDGF dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes.    
NL-T1 dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF)   Feb Mon 25th-Tue 26th: major upgrade to grid network infrastructure. WNs at SARA will be migrated to SL6, NIKHEF will stay at SL5
PIC dCache 1.9.12-20 (Chimera) - doors at 1.9.12-23 None March 26th: full site downtime for electrical maintenance
TRIUMF dCache 1.9.12-19(Chimera) None None

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | None |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Data management provider news

Experiment operations review and plans

ALICE

  • CERN: job submission to the CREAM CEs was often very slow on Feb 9-11, leading to a large shortfall in the use of CERN resources by ALICE at that time, as the submission could not keep up with the rate of jobs finishing. As of Feb 11 ~13:00 things looked normal again, but the problem is still not understood (GGUS:91376). We thank the IT-PES team and the CREAM developers, who are also looking into other, possibly related performance issues affecting the CEs.

ATLAS

status:

  • ATLAS is running very important production/analysis jobs for winter conferences
  • End of data taking, finishing up the T0 processing
  • ATLAS has started integrating the RU-T1 prototype (RRC-KI-T1) in ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.

Issues:

  • lcg-cp issues with EMI releases -- continuing
    • the same issue observed at TW (reported last meeting WLCGOpsMinutes130207) was observed also at RAL (GGUS:91223)
    • posting to the ticket has been restricted, so our ATLAS colleagues can no longer add information to it.
  • EMI-WN tarball release
    • sites using the WN tarball fail a nagios/SAM test because it looks in /etc/emi-version (GGUS:91655)
  • PROOF usage on the grid
    • ATLAS and ATLAS sites have observed PROOF usages (many processes/threads created per job) in analysis jobs on the grid (WLCGDailyMeetingsWeek130211#Friday)
    • we would like to avoid putting sites into trouble, and basically would restrict such usage (many processes per job) except for the whole node queues, but it is not simple
      • $ROOTSYS/etc/system.rootrc is on CVMFS
      • current idea is to have the pilot create a local ./.rootrc
    • we need to understand the use cases and find a solution for them
    • Do other experiments have similar use cases? We would be interested to understand how they handle them and what their policy is
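
The idea of letting the pilot create a local rootrc file could be sketched roughly as below. This is only an illustration, not the ATLAS pilot code: it assumes ROOT picks up a `.rootrc` file in the job's working directory in addition to the system-wide one, the function name `write_local_rootrc` is hypothetical, and the setting name used in the usage note is a placeholder rather than a real ROOT or PROOF option.

```python
import os


def write_local_rootrc(workdir, settings):
    """Write a .rootrc file into workdir with one 'Key: value' line per setting.

    The keys in 'settings' are supplied by the caller; this sketch does not
    know which ROOT/PROOF options would actually limit processes per job.
    """
    path = os.path.join(workdir, ".rootrc")
    with open(path, "w") as handle:
        # Sort the keys so the generated file is reproducible between runs.
        for key, value in sorted(settings.items()):
            handle.write(f"{key}: {value}\n")
    return path
```

A pilot might call `write_local_rootrc(job_dir, {"Example.MaxWorkersPerJob": 1})` before starting the payload, where `Example.MaxWorkersPerJob` stands in for whichever real setting is eventually chosen.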

CMS

  • 2013 data reprocessing campaign continues to use T1 resources well
    • Switch of IN2P3 to use xrootd for reading files from MSS successful
  • CMS T2 sites are asked to check their BDII reporting of the max wall time in the queues
    • CMS has long requested in its VO card support for jobs of at least 48h
    • We are in the process of enabling pilots to run for 48h; the majority of the jobs within the pilot will still be 8-12 hours, but we need the possibility to run up to 48 hours
  • CMS T2 sites using DPM are asked to switch to use xrootd for file reading instead of rfio, using dpm-xrootd
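
For the max wall time check requested above, a site could compare its published values against the 48h requirement along the following lines. This is a minimal sketch under stated assumptions: the values are taken to have been read from the BDII already and to be expressed in minutes (as is, to our understanding, the case for the GLUE 1.3 attribute GlueCEPolicyMaxWallClockTime), and the function name and queue names are hypothetical.

```python
# 48 hours expressed in minutes; the 48h figure comes from the CMS request.
REQUIRED_WALL_MINUTES = 48 * 60


def queues_below_requirement(published_minutes, required=REQUIRED_WALL_MINUTES):
    """Return the sorted names of queues whose published limit is too short.

    published_minutes maps a queue name to its published max wall time,
    assumed here to be in minutes.
    """
    return sorted(
        queue
        for queue, minutes in published_minutes.items()
        if minutes < required
    )
```

Calling `queues_below_requirement({"cms_long": 2880, "cms_short": 720})` would flag only `cms_short`, since 2880 minutes is exactly 48 hours.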

  • CMS is looking into the CERN resource setup and a re-optimization for LS1 and 2015 data taking
    • LSF asked to move all resources to the public queues and get rid of special queues; CMS is trying to accommodate this and will contact CERN-IT in a separate mail thread
  • In general, CMS wants to use EOS primarily for all T0 workflows in the new running period and only write to tape through PhEDEx subscriptions
    • We are currently preparing a request to move to EOS all Castor disk pools that are no longer needed for the T0; the request will follow after clean-up
  • In LS1, CMS wants to use EOS and the EOS srm for all processing and analysis workflows running on CERN resources (lxbatch, HLT cloud, AI cloud, etc.) and subscribe data via PhEDEx to T1 sites or Castor srm at CERN for archiving on tape
    • During LS1, we will lift the restriction that the Castor and EOS srm endpoints transfer only to T1 sites, so that the full mesh to all T1 and T2 sites can be used
    • This will be adapted appropriately for the data taking period starting 2015

  • We are continuing to work with sites to bring up glexec. Current focus: T2_IT_Bari (SAV:129297), T2_PK_NCP (SAV:129307), T2_CN_Beijing (GGUS:88988)

LHCb

  • No major operational issues during last 2 weeks
  • Castor -> EOS migration progressing well, executed by LHCb and estimated to take another ~ 5-6 weeks.
  • The current main data processing operation, i.e. 2011 data reprocessing, is close to completion. The last 2 sites (CERN, IN2P3) should be finished by the end of the week.
  • Switching to MC productions at all Tier levels + HLT farm

Task Force reports

CVMFS

  • 97 sites targeted by task force
    • 52 sites deployed CVMFS (+ 5 since last meeting)
  • Target deployment date 30 April 2013 for sites supporting ATLAS and/or LHCb
    • 27 sites in this category have not yet deployed and will be contacted by GGUS with a reminder by end Feb
  • Two sites which had not initially replied have now deployed CVMFS
    • leaves 3 sites with no info provided (INSU01-PARIS, NCP-LCG2, ru-Moscow-SINP-LCG2)
      • no further effort will be made to contact those sites

gLExec

  • ATLAS: a much more robust design for the new implementation of gLExec support has been agreed by the PanDA team - a big step forward, thanks!

SHA-2 migration

  • The HW module for the new CERN CA arrived only a few days ago and is currently being tested. The new CA is foreseen to become available for tests by the end of next week, while proper web pages etc. may take longer. As soon as the new CA certificate is available, we will try to get a dedicated VOMS server instance equipped that would allow SHA-2 certificates of testers to be registered in WLCG VOs. To be continued...

Generic links:

Middleware deployment

FTS 3 integration and deployment

  • FTS3 CERN pilot instance is used to commission the RRC-KI-T1 by ATLAS. The CERN FTS3 server VM configuration has been modified to reflect the increasing load.

Xrootd deployment

  • Xrootd deployment at WLCG sites continues both for ATLAS and CMS.
    • Systematically monitoring the health of the service at each site will soon be crucial.
    • We intend to study in the task force the approaches to follow to instrument SAM tests, both for ATLAS and CMS
    • A proposal will then be presented in this forum in the coming weeks.

  • Xrootd detailed monitoring:
    • Collectors of the xrootd detailed monitoring have been upgraded to a new version of the software, which can publish directly to AMQ.
    • There are different instances of these collectors, monitoring respectively: AAA, FAX, EOS-CMS, EOS-ATLAS.
    • We plan to install a new collector at CERN to follow the xrootd deployment of the EU sites of ATLAS (FAX)
    • The consumers of the collector information are currently the data popularity and dashboard transfer monitoring. An effort is under way to unify the two monitoring workflows (unify the database schemas and the Web UI) to guarantee the future maintenance of these two services.

PerfSONAR

Tracking tools

The TF members suggested that direct contact with the savannah developers to plan the migration of their trackers would be more effective than holding a meeting, as had been decided on 2012/12/05 (Minutes here). The course of action circulated in the e-group was to:

  1. Read the 2013/02/18 presentation of the savannah-to-jira migration experts https://indico.cern.ch/materialDisplay.py?contribId=0&materialId=slides&confId=223661
  2. Get in touch with these experts, who are expecting the 'list of projects to migrate'; see why here: https://savannah.cern.ch/support/?134651#comment9 . NB! Even if archiving a tracker would be enough, IF internal links should remain active, then the tracker MUST be migrated.
  3. Envisage a TF meeting on WLCG-specific use cases of this issue, and leave the migration to each savannah tracker owner and the savannah/jira experts.

News from other WLCG working groups

AOB

Action list

  • ALICE, ATLAS, CMS and LHCb should give feedback on the lxplus migration to SL6 (twiki). The plan is to move a significant part of lxbatch to SLC6 before the alias switch takes place. The amount of SLC6 WN resources provided by other sites has not been taken into account for planning this alias switch (M. Guijarro). DONE
  • Maarten will look into SHA-2 testing by the experiments when the new CERN CA has become available.
  • MariaD will update the WLCGCriticalServices twiki. DONE: the twiki is superseded by this.
  • MariaD will convey to the savannah developers OliverK's idea to place a banner on every savannah ticket warning about the switch-off date. DONE: Savannah:134651#comment14
  • Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers.

Chat room comments

cwalker There are (or at least were) bugs in the info reporters for some CEs - that result in them not publishing correctly. If you want sites to publish correctly, you should get the middleware fixed.

-- AndreaSciaba - 18-Feb-2013

Topic revision: r22 - 2013-02-21 - AlessandraForti
 