WLCG Operations Coordination Minutes - 22nd November 2012

Agenda

Attendance

  • Local: Andrea Sciabà (secretary), Maria Girone, Giuseppe Bagliesi, Jerome Belleman, Maria Dimou, Ikuo Ueda, Ian Fisk, Simone Campana, Andrea Valassi, Alessandro Di Girolamo, Felix Lee, Wei-Jen, Massimo Lamanna, Maite Barroso, Nicolò Magini, Stefan Roiser
  • Remote: Gonzalo Merino (chair), Stephen Burke, Andreas Petzold, Ulf Tigerstedt, Daniele Bonacorsi, Jeremy Coles, Di Qing, Michel Jouvin, Ron Trompert, Ian Collier, Massimo Sgaravatto, Oliver Keeble

Task Force reports

CVMFS (Stefan)

Progress since last meeting

  • 96 sites targeted by this task force
  • 18 sites have deployed CVMFS
    • +3 sites since last meeting (WEIZMANN-LCG2, CIEMAT-LCG2, EELA-UTFSM)
  • 9 sites have set deployment target for Nov
  • 12 sites have set deployment target for end 2012
  • 41 sites have not provided information about deployment plans or obstacles (another reminder sent this week)
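
As a quick cross-check of the figures above, the following minimal sketch tallies the reported categories against the 96 targeted sites. The category labels and the helper `share` are illustrative only; the minutes do not say what the status of the remaining, unlisted sites is.

```python
# Figures quoted in the progress list above.
targeted = 96
reported = {
    "deployed": 18,
    "target November 2012": 9,
    "target end of 2012": 12,
    "no information provided": 41,
}

# Sites not covered by any category listed in the minutes.
unlisted = targeted - sum(reported.values())

def share(count, total=targeted):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)

for label, count in reported.items():
    print(f"{label}: {count} sites ({share(count)}%)")
print(f"not listed in the minutes: {unlisted} sites ({share(unlisted)}%)")
```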

Maria G. proposes to use the GDB slot to push other sites to deploy or at least announce their plans. Michel agrees but he does not expect it to be very efficient. It would be useful to make the list of missing sites by country.

Ian C. asks if sites were contacted by email. Stefan confirms, and he thinks that this is preferable to opening 40 tickets. Sites can answer either via email or by directly editing the twiki.

gLExec (Maarten)

  • On Nov 5 the "ops" gLExec test was added to the EGI ROC_OPERATORS profile so that failures may raise alarms in the EGI operations dashboard (GGUS:88095).
  • The idea is for EGI ROD teams to try and get the relevant sites in their regions to reach 75% availability as a first goal; this effort may start in Dec (GGUS:86372), but it might end up delayed to the beginning of 2013.

Gonzalo asks about the status of the VO tests. Ian F. explains that in CMS the pilot infrastructure supports gLExec and the only missing step is to make the CMS SAM test critical. Maarten adds that some CMS (and other) Tier-2 sites are in good shape, but this is not the case for WLCG overall. It is hoped that EGI can help with this, as passing the ops test should already solve the major hurdles.

Maria G. asks how many sites have deployed it. Maarten says that there are not many; a big effort and a few more months will be required.

Concerning ATLAS, Simone explains that 1) the ATLAS SAM test is very similar to the CMS one and will not be critical for quite some time; 2) the work to integrate gLExec in the ATLAS pilot is ongoing and it will take time.

PerfSonar (Simone)

It is now possible to publish PS instances in GOCDB and OIM, and all sites were invited to do so via a WLCG broadcast. The spreadsheet linked at the last meeting is still relevant and is used to mark where PS is deployed and where it is published. At the next meeting Simone plans to spend more time on this point and again invite sites to deploy and publish PS. The TF is currently also testing the mesh configuration before pushing it for broader deployment.

Maria G. asks whether there will be more news by the December GDB, and Simone's answer is yes.

Tracking tools (Maria D.)

Needed participants' feedback on:
  • Savannah:131988 Last chance to comment on GGUS Reminders for slow response to tickets to be sent also to Notified sites (not only Support Units). If no comment is received, agreement will be assumed and the functionality will be added in a near release.
  • Savannah:133041#comment4 Setting GGUS tickets to 'solved' if 'waiting for reply' for more than 15 working days. Savannah:133041#comment9 is the summary of discussion on this at the meeting.
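
The proposed rule in the second item can be sketched as follows. This is only an illustration of the "15 working days" threshold, not the GGUS implementation: the function names are hypothetical, working days are assumed to be Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta

# Threshold taken from the proposal; everything else here is illustrative.
MAX_WAITING_WORKING_DAYS = 15

def working_days_between(start, end):
    """Count Monday-Friday days after `start` up to and including `end`.

    Public holidays are ignored (an assumption of this sketch).
    """
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days += 1
    return days

def should_auto_solve(status, waiting_since, today):
    """Return True if a ticket has waited for a reply longer than the threshold."""
    if status != "waiting for reply":
        return False
    return working_days_between(waiting_since, today) > MAX_WAITING_WORKING_DAYS
```

Under these assumptions, a ticket waiting since Thursday 1 November 2012 has accumulated exactly 15 working days on 22 November and would only qualify for 'solved' from 23 November onwards.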

SHA-2 migration (Maarten)

Generic links:

No news, apart from the fact that we are working with CERN and EGI to set up a test infrastructure, but not before January.

FTS 3 integration and deployment (Nicolò)

  • Task-force meeting and demo were held on Nov 7th, main items:
  • Functional testing:
    • ATLAS: enabled transfers between all sites in UK cloud, found issue when destination is StoRM and space token is specified GGUS:88607 (looks like a StoRM bug rather than FTS3 bug)
    • CMS: started functional tests of transfers from srm to gridftp (EOS), no issue seen so far
  • Tentative date for next meeting: Wed 28th 17:00 CERN time

Ian F. suggests that FNAL could help with testing the FTS 3 configuration.

Maria G. asks if there is a date for testing the centrally deployed FTS. Alessandro says that there is none yet: the TF is still working on functional tests for the FTS2-like configuration and thinking about how multiple FTS instances can be used to mitigate problems that a single Tier-1 has with its FTS.

Middleware deployment (Maarten)

EMI-2 has been deployed on the CERN WNs for a couple of weeks and no problems have been seen. As gLite 3.2 support ends this month, EMI is urging sites to upgrade before the end of Jan. Only 2 ATLAS sites that use gsidcap should wait until a patch is released in about 2 weeks [released in EMI-2 Update 6 on Nov 26]. A few more sites need to wait for the tarball distribution: significant progress was made by CERN IT-GT and GridPP, but Jeremy reports problems with Python 2.6 and a few weeks are still needed.

Jeremy asks how many non-UK sites need the tarball; Maarten answers that there is a small minority with examples in France and Spain. Gonzalo asks if there is any monitoring that can be used to monitor progress: Maarten says that EGI is doing it and there are SAM tests that can be used but maybe a summary page is missing.

About the UI, a new update may arrive in a few days [not part of EMI-2 Update 6]: we will need to check if the gLite WMS submission has been fixed and if there are other blockers. Then it will have to be made relocatable as well. Gonzalo asks if it could be made available in the CVMFS repository: yes for the WN, not clear for the UI.

Simone issues a strong warning against upgrading to SL6, because it has not yet been validated by ATLAS: in fact, some ATLAS workflows CANNOT run on SL6 for some releases! Upgrading an ATLAS site to SL6 would break the site for ATLAS.

Ian F. says that for CMS, by contrast, the SL5 binaries are fully validated on SL6. [Maarten: SL6 is also OK for ALICE and LHCb.]

XrootD

No report.

Squid monitoring

No report.

WMS future

No report.

News from other WLCG working groups

Experiment operations review and plans

ALICE (Maarten)

  • KIT: the mechanism for providing SW to jobs was switched to Torrent on Tue Nov 6, but had to be reverted to the shared SW area on Sat Nov 10 after various inconclusive attempts to debug massive job failures due to timeouts; the performance recovered only by Wed Nov 14. Further tests are foreseen in a way that should keep the impact limited.

Andreas warns that just repeating the tests will not help: the problem needs to be understood by an expert. Maarten explains that Woo Jin is working on it and that the feeling is that the issue is in inter-WN communication. A simple test was suggested and could be tried next week if there is time.

ATLAS (Ueda)

  • Reprocessing of 2012 data on-going, towards the end of November - beginning of December.
  • FZK-LCG2: tape
  • INFN-T1: disk space publication (WLCGDailyMeetingsWeek121119#Tuesday)
  • RAL-LCG2: down (WLCGDailyMeetingsWeek121119#Tuesday)
    • All the best for them, thanks for the efforts and bringing services up.
    • No complaint to RAL, but this incident reminded us of the following:
  • GOCDB: WLCG-ops should review the fall-back system / procedure
  • FTS: affects T2 activities largely. ATLAS requests WLCG-ops to address fall-back solution
  • Frontier: to avoid default TCP timeout in case of service down (WLCGDailyMeetingsWeek121119#Wednesday)
    • There is a discussion within ATLAS, more details would be reported later.

The discussion focused on the VOMS-GGUS synchronisation, the GOCDB fallback, the FTS fallback and the OPN alerts.

Maria D. apologizes for failing to notice the ticket about the GGUS-VOMS issues on November 1. The problem caused some UK ATLAS members to be unable to edit tickets. On the GGUS side, the problem was immediately fixed, but on the VOMS side it has not been properly followed up. Maria will follow up with IT-PES and Tanya Levshina.

Jeremy says that he looked into the GOCDB fallback, which exists and is in Germany, but it was out of sync because its DB password had expired and apparently there was also a disk (space?) issue. Next time it should work.

About the FTS fallback, Gonzalo asks if people are working on it and what the general idea is of how it should work. Ueda says that the idea is simply that another FTS should be used if one goes down. Alessandro and Maarten explain that this is difficult with FTS 2 but entirely possible with FTS 3; the FTS TF is working on it. Nicolò adds that different servers could even synchronise their configuration via a messaging system, but this has to be discussed with the developers.
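
The basic idea described by Ueda ("use another FTS if one goes down") can be illustrated with a minimal sketch. This is not an FTS interface: the function `pick_fts_server`, the health-check callable and the endpoint names are all hypothetical, and the sketch says nothing about how configuration would be synchronised between servers.

```python
def pick_fts_server(servers, is_alive):
    """Return the first server in priority order for which is_alive(server) is true.

    `servers` is an ordered list of endpoint names and `is_alive` is a
    caller-supplied health check; both are hypothetical, not FTS interfaces.
    """
    for server in servers:
        if is_alive(server):
            return server
    raise RuntimeError("no FTS server available")

# Hypothetical endpoint names, used only to exercise the function.
servers = ["fts-primary.example.org", "fts-fallback.example.org"]
down = {"fts-primary.example.org"}
chosen = pick_fts_server(servers, lambda s: s not in down)
print(chosen)
```

Keeping the health check as a parameter leaves open how "down" is detected, which the discussion did not specify.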

About the OPN alerts, Gonzalo explains that the operational procedures prescribe opening a GGUS ticket in case of incidents. It turns out that the experiments are not aware of these tickets and this information needs to be more visible. Andrea S. suggests using SLS. Maarten suggests that the matter be looked into further offline.

CMS (Ian F.)

CMS would like all sites to activate the xrootd fallback. From next spring CMS would also encourage sites to put the WNs in the OPN, from which all VOs should benefit.

Another request is for Tier-1 sites to implement the disk-tape separation. This is already the case at RAL and CERN (here, via EOS). The goal is to have it completed by mid 2013. We should discuss how to do it in the best possible way.

Gonzalo asks if it is foreseen to organize discussions with the service providers. Ian answers that it could be useful to do that if it helps sites, but sites are free to choose the implementation that they think is most appropriate.

Andreas warns that putting the WNs in the OPN requires extensive changes to the network setup. Ian is aware of that and he did not imply that it will be easy or that it should be done immediately, but firewalls do cause inefficiencies. Moreover, the OPN is normally used at 15% of its capacity. Of course, other solutions are welcome if they exist.

Andreas asks if enabling the xrootd fallback requires sites to enable the xrootd protocol in their storage. Ian says that the two things are decoupled and only 1/3 of CMS sites have enabled xrootd access, but this already corresponds to tens of PB of data in EOS, FNAL and several Tier-2s. Fallback also benefits sites that do not provide xrootd access. Ian also points out that there is a lot of monitoring (for example for the Data Popularity) that can be used to monitor xrootd fallback.

Simone comments that there is no reason not to extend the discussion about the WNs in the OPN to LHCONE. Maria G. proposes to bring this up during the network workshop that will be held at CERN on December 13-14, organized by the WLCG Networking working group.

LHCb (Stefan)

At the end of his report, Stefan asks if experiments can coordinate the usage of resources. Ian F. answers that CMS is now doing organized processing at KIT and CERN, but not so much elsewhere.

GGUS tickets

Sites/VOs were prompted on Tuesday but no tickets were submitted for analysis and discussion.

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.13-6.1; SRM-2.11 for all instances. EOS 0.2.21/xrootd-3.2.5/BeStMan2-2.2.2 for all instances except ALICE (0.2.20) | CASTOR 2.1.13-5 → -6.1 | EOS headnode hardware replacement |
| ASGC | CASTOR 2.1.11-9; SRM 2.11-0; DPM 1.8.2-5 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) | | |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.2-1.osg; Oracle Lustre 1.8.6; EOS 0.2.20/xrootd 3.2.2-1.osg with Bestman 2.2.2.0.10 | | |
| IN2P3 | dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-24 on pool nodes; Postgres 9.1; xrootd 3.0.4 | none | Site downtime from 10th to 12th of December 2012 |
| KIT | dCache; atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera); cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera); gridka-dcache.fzk.de: 1.9.12-17 (PNFS); xrootd (version 20100510-1509_dbg) | | Downtime from 15th to 17th of Jan 2013 (to be confirmed by LHCb): Segregation of LHCb from gridka-dcache.fzk.de, move to lhcbsrm-kit.gridka.de. Also, migrate lhcbsrm-kit.gridka.de, gridka-dcache.fzk.de to Chimera. The latter however won't have any WLCG-VOs in it after the downtime. |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) | | Wednesday November 28th there will be a short outage because the FTS database will be moved to another server |
| PIC | dCache 1.9.12-20 (Chimera) | | |
| RAL | CASTOR 2.1.11-8/2.1.12-10; 2.1.11-8/2.1.12-10 (tape servers); SRM 2.11-1 | | |
| TRIUMF | dCache 1.9.12-19 with Chimera namespace | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.11-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | updated transfer-fts | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | updated on 2012-11-07 | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | updated transfer-fts | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | updated transfer-fts | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

Most sites have upgraded FTS to the latest patch, which fixes the problems in delegation and grid-mapfile generation. CERN is still running the next-to-last version, and indeed there are still errors in user mapping. BNL and NDGF did not report their status before the meeting, but Ulf announced that NDGF is running the latest version; the twiki should be updated [done].

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | None |
| CERN | 1.8.2-0 | SLC5, gLite | Oracle | ATLAS, LHCb | |
| CERN, test | lfc-server-oracle-1.8.3.2-1 | SLC6, EMI2 | Oracle | ATLAS | Xroot federations |

Other site news

Data management provider news

CASTOR news

CERN operations and development

EOS news

xrootd news

dCache news

StoRM news

DPM news

FTS news

LFC news

gfal/lcg_util news

Middleware news and baseline versions (Nicolò)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

No changes to the recommended versions. There is a notification about a security advisory from EGI regarding the dCache client on the UI and WN: sites with EMI-2 should upgrade the dCache client RPM. Instructions are on the twiki above. The new RPM has been released both in EMI and UMD.

AOB

Maria G. announces that the first WLCG operations planning meeting should be held before Christmas to discuss plans and requests like the CMS ones. As no firm date could be agreed during the meeting, it will be announced later. [Update: the date has been set to November 29.]

Action list

  • Maria D. will follow up with PES and Tanya about the VOMS-GGUS synchronisation problem.

-- AndreaSciaba - 26-Oct-2012

Topic revision: r23 - 2012-11-26 - MaartenLitmaath
 