WLCG Operations Coordination Minutes - 21st February 2013

Agenda

Attendance

  • Local: Maria Girone, Andrea Valassi, Andrea Sciabà, Steve Traylen, Oliver Keeble, Nicolò Magini, Oliver Gutsche, Maria Dimou, Jan Iven, Michail Salichos, Felix Lee, Ikuo Ueda, Alessandro Di Girolamo, Maarten Litmaath
  • Remote: Alessandra Forti (chair), Pavel Weber, Dimitri Nilsen, Christoph Wissing, Di Qing, Gareth Smith, Ron Trompert, Rob Quick, Burt Holzman, Josep Flix, Jeremy Coles, Christopher Walker, Ewan Mc Mahon, Ian Collier, John Green, Peter Gronbech, Robert Frank, Michel Jouvin, Steve Jones

News (M. Girone)

  • Started a discussion with EGI on how to improve the communication between WLCG and EGI. EGI will give a presentation once per month about middleware updates, UMD, etc. and will co-lead the SHA-2 task force. Related to this, it may be time to get the middleware deployment TF started, with involvement from sites, EGI and OSG.
  • A planning meeting needs to be scheduled around end of March.
  • A new TF will look into ways to discover more detailed squid server configuration information from the WN.
  • A new TF for SL6 migration has been announced at the GDB. Looking for representatives from T0/T1/T2 and Experiments. Contact Alessandra if you are interested. Participation of IT-PES would be useful, at least as observer.

Middleware news and baseline versions (N. Magini)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Highlights:

  • new Frontier/squid release (mostly repackaging). The previous update already contains an ACL change that will become mandatory in April, when the new monitoring nodes at CERN replace the current ones. ACTION: make sure the sites are informed of this.
  • known issues in the EMI-2 UI with the status command in direct CREAM submission and an intended change in the JDL parsing for CREAM
  • there is a beta version of the UI tarball (link to GGUS ticket in the above twiki)

Tier-1 Grid services

Storage deployment

Per-site status, recent changes and planned changes:

  • CERN: CASTOR 2.1.13-6.1; SRM 2.11 for all instances. EOS: ALICE (EOS 0.2.20 / xrootd 3.2.5), ATLAS (EOS 0.2.27 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2). Planned: EOS upgrades to 0.2.29 for ALICE and ATLAS will be scheduled in agreement with the experiments.
  • ASGC: CASTOR 2.1.11-9, SRM 2.11-0, DPM 1.8.5-1. No recent changes. Planned for Tuesday Feb 26 (if agreed by the experiments) [experiments agree]: CASTOR upgrade to 2.1.13-9, DPM upgrade to EMI-2 1.8.6-1, UPS construction and storage firmware upgrade for the CASTOR and DPM disk servers.
  • BNL: dCache 1.9.12.10 (Chimera, Postgres 9 with hot backup); http (aria2c) and xrootd/Scalla on each pool. No recent or planned changes.
  • CNAF: StoRM 1.8.1 (ATLAS, CMS, LHCb).
  • FNAL: dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3; Scalla xrootd 2.9.7/3.2.4-1.osg; Oracle Lustre 1.8.6; EOS 0.2.22-4 / xrootd 3.2.4-1.osg with BeStMan 2.2.2.0.10. No recent or planned changes.
  • IN2P3: dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-24 on pool nodes; Postgres 9.1; xrootd 3.0.4.
  • KIT: dCache 1.9.12-11 (Chimera) on atlassrm-fzk.gridka.de, 1.9.12-17 (Chimera) on cmssrm-fzk.gridka.de, 1.9.12-24 (Chimera) on lhcbsrm-kit.gridka.de; xrootd (version 20100510-1509_dbg). No recent or planned changes.
  • NDGF: dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes.
  • NL-T1: dCache 2.2.4 (Chimera) at SARA, DPM 1.8.2 at NIKHEF. Planned for Mon Feb 25 - Tue Feb 26: major upgrade to the grid network infrastructure; the WNs at SARA will be migrated to SL6, NIKHEF will stay on SL5.
  • PIC: dCache 1.9.12-20 (Chimera), doors at 1.9.12-23. No recent changes. Planned for March 26: full site downtime for electrical maintenance.
  • RAL: CASTOR 2.1.12-10 (also on the tape servers), SRM 2.11-1. CASTOR upgrade to 2.1.13 planned but not yet scheduled; some tape servers already migrated.
  • TRIUMF: dCache 1.9.12-19 (Chimera). No recent or planned changes.

FTS deployment

All sites run FTS 2.2.8, with no recent or planned changes reported:

  • CERN, ASGC, CNAF, FNAL, IN2P3, KIT, NDGF, NL-T1, PIC, RAL, TRIUMF: transfer-fts-3.7.12-1
  • BNL: transfer-fts-3.7.10-1

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s SL5, gLite Oracle ATLAS None
CERN 1.8.6-1 SLC6, EMI2 Oracle 11 ATLAS, LHCb, OPS, ATLAS Xroot federations  

Other site news

Data management provider news

Experiment operations review and plans

ALICE (M. Litmaath)

  • CERN: job submission to the CREAM CEs has often been very slow on Feb 9-11, leading to a large shortfall in the use of CERN resources by ALICE at that time, as the submission could not keep up with the rate of jobs finishing. As of Feb 11 ~13:00 things looked normal again, but the problem remains not understood (GGUS:91376). We thank the IT-PES team and the CREAM developers who are looking also into other, possibly related performance issues affecting the CEs.

ATLAS (I. Ueda)

Status:
  • ATLAS is running very important production/analysis jobs for winter conferences
  • End of data taking, finishing up the T0 processing
  • ATLAS has started integrating the RU-T1 prototype (RRC-KI-T1) in ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.

Issues:

  • lcg-cp issues with EMI releases -- continuing
    • the same issue observed at TW (reported last meeting WLCGOpsMinutes130207) was observed also at RAL (GGUS:91223)
    • posting to the ticket has been restricted, so our ATLAS colleagues can no longer add information. [Maria D. suggests asking to become a supporter in GGUS in order to be able to add comments.]
  • EMI-WN tarball release
    • sites using the WN tarball fail a Nagios/SAM test because it looks for /etc/emi-version, which tarball installations do not provide (GGUS:91655)
  • PROOF usage on the grid
    • ATLAS and ATLAS sites have observed PROOF usage (many processes/threads created per job) in analysis jobs on the grid (WLCGDailyMeetingsWeek130211#Friday)
    • to avoid causing trouble for sites, ATLAS would basically restrict such usage (many processes per job) to whole-node queues, but this is not simple
      • $ROOTSYS/etc/system.rootrc is on CVMFS
      • the current idea is to have the pilot create a local ./rootrc
    • the use cases need to be understood and a solution found for them
    • Do other experiments have similar use cases? ATLAS would be interested to understand how they treat it and what the policy is. [None of the experiments seem to have a similar use case.]

CMS (O. Gutsche)

  • The 2013 data reprocessing campaign continues to use T1 resources well
    • The switch of IN2P3 to xrootd for reading files from MSS was successful
  • CMS T2 sites are asked to check the max wall time their queues report in the BDII
    • The CMS VO card has long requested support for jobs of at least 48 hours
    • CMS is in the process of enabling pilots to run for 48 hours; the majority of the jobs within a pilot will still be 8-12 hours, but the possibility to run up to 48 hours is needed. ACTION: make sure that all CMS sites have queues of at least 48 hours.
  • CMS T2 sites using DPM are asked to switch to xrootd (via dpm-xrootd) for file reading, instead of rfio
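The BDII check requested above can be sketched as follows. This is a minimal, hypothetical example (the CE names and LDIF sample are invented): it parses LDIF output such as `ldapsearch` returns from a site BDII and flags queues whose published GlueCEPolicyMaxWallClockTime (GLUE 1.3, expressed in minutes) is below 48 hours:

```python
# Sketch: flag CE queues publishing a max wall time below 48 hours
# (2880 minutes) in their GLUE 1.3 BDII information.
# The LDIF sample below is hypothetical; the attribute names are the
# standard GLUE 1.3 ones.

sample_ldif = """\
dn: GlueCEUniqueID=ce.example.org:8443/cream-pbs-cms,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce.example.org:8443/cream-pbs-cms
GlueCEPolicyMaxWallClockTime: 4320

dn: GlueCEUniqueID=ce.example.org:8443/cream-pbs-short,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce.example.org:8443/cream-pbs-short
GlueCEPolicyMaxWallClockTime: 720
"""

def short_queues(ldif, minimum_minutes=2880):
    """Return the CE IDs whose published max wall time is below the minimum."""
    flagged, ce_id = [], None
    for line in ldif.splitlines():
        if line.startswith("GlueCEUniqueID:"):
            ce_id = line.split(":", 1)[1].strip()
        elif line.startswith("GlueCEPolicyMaxWallClockTime:"):
            if ce_id and int(line.split(":", 1)[1]) < minimum_minutes:
                flagged.append(ce_id)
    return flagged

print(short_queues(sample_ldif))
# -> ['ce.example.org:8443/cream-pbs-short']
```

In practice the LDIF would come from an `ldapsearch` query against the site BDII on port 2170, selecting the two attributes above.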

  • CMS is looking into the CERN resource setup and a re-optimization for LS1 and 2015 data taking
    • The LSF team asked that all resources be moved to the public queues, getting rid of the special queues; CMS is trying to accommodate this and will contact CERN-IT in a separate mail thread
  • In general, CMS wants to use EOS primarily for all T0 workflows in the new running period and only write to tape through PhEDEx subscriptions
    • Currently CMS is working on a request to move all Castor disk pools no longer needed for the T0 to EOS; a clean-up request will follow
  • In LS1, CMS wants to use EOS and the EOS srm for all processing and analysis workflows running on CERN resources (lxbatch, HLT cloud, AI cloud, etc.) and subscribe data via PhEDEx to T1 sites or Castor srm at CERN for archiving on tape
    • During LS1, the restriction that the Castor and EOS SRM endpoints transfer only to T1 sites will be lifted, to be able to use the full mesh to all T1 and T2 sites
    • This will be adapted appropriately for the data taking period starting 2015
  • We are continuing to work with sites to bring up glexec. Currently in the focus: T2_IT_Bari (SAV:129297), T2_PK_NCP (SAV:129307), T2_CN_Beijing (GGUS:88988)

Oliver K. asks to confirm that CMS is asking the DPM sites to enable the xrootd interface, not to install xrootd. The answer is yes.
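As a purely illustrative sketch of what the rfio-to-xrootd switch means on the client side (hostname and path below are invented; real access URLs come from the site's storage configuration), the change is essentially one of URL scheme:

```python
# Hypothetical illustration only: host and path are invented.
# Switching DPM file reads from rfio to xrootd essentially changes the
# scheme of the access URL handed to the application.

def rfio_to_xrootd(turl):
    """Rewrite an rfio:// DPM access URL into the corresponding root:// URL."""
    prefix = "rfio://"
    if not turl.startswith(prefix):
        raise ValueError("not an rfio URL: " + turl)
    return "root://" + turl[len(prefix):]

print(rfio_to_xrootd(
    "rfio://dpm.example.org//dpm/example.org/home/cms/store/file.root"))
# -> root://dpm.example.org//dpm/example.org/home/cms/store/file.root
```

On the server side the site still has to enable the xrootd door (dpm-xrootd), as confirmed above.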

Jan warns against relying too much on the SRM interface in EOS, because it is not very reliable, and Alessandro adds that ATLAS indeed observed a 1 Hz bottleneck for WAN transfers and is therefore using GSIFTP instead, which is much more scalable. Oliver G. and Nicolò clarify that SRM is never used for local access and that WAN transfers have been using GSIFTP for more than a year; the only usage of SRM is for remote stage-out to EOS from external sites.

LHCb (S. Roiser)

  • No major operational issues during last 2 weeks
  • Castor -> EOS migration progressing well, executed by LHCb and estimated to take another ~ 5-6 weeks.
  • The current main data processing operation, i.e. the 2011 data reprocessing, is close to the end. The last 2 sites (CERN, IN2P3) should be finished by the end of the week.
  • Switching to MC productions at all Tier levels + HLT farm

Task Force reports

CVMFS (S. Roiser)

  • 97 sites targeted by task force
    • 52 sites deployed CVMFS (+ 5 since last meeting)
  • Target deployment date 30 April 2013 for sites supporting ATLAS and/or LHCb
    • 27 sites in this category have not yet deployed and will be contacted by GGUS with a reminder by end Feb
  • two sites which had not initially replied have now deployed CVMFS
    • leaves 3 sites with no info provided (INSU01-PARIS, NCP-LCG2, ru-Moscow-SINP-LCG2)
      • no further effort will be made to contact those sites

gLExec (M. Litmaath)

  • ATLAS: a much more robust design for the new implementation of gLExec support has been agreed by the PanDA team - a big step forward, thanks!

Maarten adds that he hopes that ATLAS can start with some tests in a matter of weeks and that in general the deployment can ramp up after the winter conferences.

Alessandra asks whether sites could combine the gLExec installation with the SL6 migration. Maarten sees the two things as orthogonal.

SHA-2 migration (M. Litmaath)

  • The HW module for the new CERN CA arrived only a few days ago and is currently being tested. The new CA is foreseen to become available for tests by the end of next week, while proper web pages etc. may take longer. As soon as the new CA certificate is available, we will try to get a dedicated VOMS server instance equipped with it, allowing SHA-2 certificates of testers to be registered in WLCG VOs. To be continued...


Middleware deployment

No report.

FTS 3 integration and deployment (A. Di Girolamo, N. Magini)

  • FTS3 CERN pilot instance is used to commission the RRC-KI-T1 by ATLAS. The CERN FTS3 server VM configuration has been modified to reflect the increasing load.
  • A CMS Tier-2 volunteered to help test the "channel-less" configuration by running low-rate transfers from any other CMS site (~50 sites).

Xrootd deployment (D. Giordano)

  • Xrootd deployment at WLCG sites continues both for ATLAS and CMS.
    • Systematically monitoring the health of the service at each site will soon be crucial.
    • The task force intends to study the approaches to follow to instrument SAM tests, both for ATLAS and CMS.
    • A proposal will then be presented in this forum in the coming weeks.

  • Xrootd detailed monitoring:
    • The collectors of the xrootd detailed monitoring have been upgraded to a new version of the software, able to publish directly to AMQ.
    • There are different instances of these collectors, monitoring respectively: AAA, FAX, EOS-CMS, EOS-ATLAS.
    • We plan to install a new collector at CERN to follow the xrootd deployment of the EU sites of ATLAS (FAX)
    • The consumers of the collector information are currently the data popularity and the dashboard transfer monitoring. An effort has been put in place to unify the two monitoring workflows (unifying the database schemas and the Web UI) to guarantee the future maintenance of these two services.

PerfSONAR (A. Forti)

The perfSONAR dashboard provided an RPM to insert into the Nagios infrastructure so that SAM tests can be run by the experiments.

Tracking tools

The TF members suggested that direct contact with the Savannah developers to plan the migration of their trackers would be more effective than holding a meeting, as decided on 2012/12/05 (minutes here). The course of action circulated in the e-group was to:

  1. Read the 2013/02/18 presentation of the savannah-to-jira migration experts https://indico.cern.ch/materialDisplay.py?contribId=0&materialId=slides&confId=223661
  2. Get in touch with these experts; they are expecting the 'list of projects to migrate' (see why here: https://savannah.cern.ch/support/?134651#comment9). NB: even if archiving a tracker would be enough, if internal links are to remain active then the tracker MUST be migrated.
  3. Envisage a TF meeting on WLCG-specific use cases, leaving the migration itself to each Savannah tracker owner and the Savannah/JIRA experts.

The deadline to migrate to JIRA is the end of 2013. Savannah projects specific to single experiments may be migrated by category rather than individually, but some projects (e.g. ROOT, CORAL, Grid middleware) do not belong to any experiment. The Savannah-JIRA experts presented to the experiments a tool to migrate a project, so project owners can do it by themselves. This seems to be the best approach; Alessandra suggests putting a link to the tool documentation on the Savannah web page.

News from other WLCG working groups

No report.

AOB

Action list

  • Inform sites that they need to install the latest Frontier/squid RPM by April at the latest.
  • Inform CMS sites that they must configure a queue with a length of at least 48 hours, if they have not done it already.
  • Inform CMS DPM sites that they should enable the xrootd interface.
  • Maarten will look into SHA-2 testing by the experiments when the new CERN CA has become available.
  • MariaD will convey to the Savannah developers OliverK's idea to place a banner on every Savannah ticket warning about the switch-off date. DONE: Savannah:134651#comment14
  • Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers.

Chat room comments

cwalker There are (or at least were) bugs in the info providers for some CEs - that result in them not publishing correctly. If you want sites to publish correctly, you should get the middleware fixed.

-- AndreaSciaba - 18-Feb-2013

Topic revision: r30 - 2013-03-07 - NicoloMagini