WLCG Operations Coordination Minutes - 1st November 2012

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Wei-Jen, Felix Lee, Alessandro Di Girolamo, Maria Dimou, Stefan Roiser, Jan Iven, Maarten Litmaath, Maite Barroso Lopez, Nicolò Magini, Simone Campana, Ikuo Ueda, Massimo Lamanna, Michail Salichos, Andrea Valassi, Ulrich Schwickerath, Ian Fisk, Oliver Gutsche, Helge Meinhard, Domenico Giordano
  • Remote: Stephen Burke, John Gordon, Si Liu, Alexei Klimentov, Burt Holzman, Dave Dykstra, Gareth Smith, Rob Quick, Di Qing, Ian Collier, Ron Trompert, Peter Solagna

Communication tools

Andrea announces that the wlcg-tier1-contacts@cern.ch list has been checked and can be used to contact the Tier-1 sites. He proposes to return wlcg-operations@cern.ch to its original, "pre-working group" status and to create wlcg-ops-broadcast@cern.ch for broadcasts. The proposal meets no objection. Maria D. proposes adding the new list to the CIC operations portal so that broadcasts can be sent from the web portal.

Task Force reports

CVMFS

  • A reminder was sent to all site admins to fill in the twiki page; so far we have responses from about 1/3 of the sites
  • Sites that deployed CVMFS since the last meeting
    • DESY-ZN (LHCb)
    • IN2P3-CC (CMS)
    • IN2P3-CC-T2 (CMS)
    • ru-PNPI (ATLAS, CMS, LHCb)
    • RU-Protvino-IHEP (CMS)
    • UAM-LCG2 (ATLAS)
  • Total deployment status (since start of task force): 94 sites contacted, 13 sites deployed, 1 site decommissioned
  • Sites asking for new CVMFS version 2.1 with shared cache and NFS export (release foreseen for Q1/2014)

gLExec

  • ATLAS: integration of glexec usage into the pilot code has restarted; testing expected in the coming weeks.

PerfSonar

Simone presented some slides.
  • The PS service type has been added to GOC; Simone and Rob will discuss later today how to do it for OSG.
  • Now we have the list of the most important sites and channels to test; the next step is to:
    • register existing PS instances in GOC/OIM
    • install PS at sites which do not have it
Oliver asks whether there is any recommendation to use a special hardware setup, as is the case in the US. Simone answers that there is none for now (although it is recommended to use different nodes for the bandwidth and latency measurements), but he will discuss it with Shawn.

Tracking tools

ALARM raising for additional CERN services?

This item concerns the Tier0 only. The presentation http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=1&materialId=slides&confId=215003 was prepared at the request of ATLAS. The 2012/11/20 WLCG MB will discuss which additional services (if any) should be added to the list of services eligible to raise ALARMs. The "criticality" of e-groups, in particular, should be well justified, since the service is supported by another department (not CERN IT). Implementation-wise, there is consensus that GGUS ALARMs are the way to go.

Maria G. encourages the experiments to give her the list of additional services for which it should be possible to submit ALARM tickets. She will bring this up at the MB in two weeks during her update on the service criticality.

Maite points out that ALARM tickets have never been abused and that it is correct to use them for any serious incident.

Maarten clarifies that sending an alarm ticket does not necessarily wake up an expert at night: only the operator reacts immediately, following the published procedures. Helge says that for most services an SMS is sent to the expert at any time, and then he/she decides how to proceed. The general consensus is that the operator's response time is always very short and the experts' response more than adequate in most cases, even when support is best effort (as for the majority of the services).

Rob asks why the twiki is considered so critical and Ian explains that it's basically because almost all experiment documentation is on twiki, including the procedures to restart services!

Send GGUS reminders for outstanding tickets to "Notified Sites" as well (not only ROCs/NGIs)?

This item concerns all sites except the Tier0. The presentation http://indico.cern.ch/getFile.py/access?subContId=0&contribId=1&resId=3&materialId=slides&confId=215003 was prepared to get WLCG feedback for Savannah:131988. First reactions were in favour of this development because reminders group all outstanding tickets for a given Support Unit (SU) and are sent only 1-2 times per week, i.e. not too much traffic. We left 1 month for more detailed comments.

SHA-2 migration

  • EGI validation infrastructure time frame?
  • Try to let the experiments profit from it.

Generic links:

FTS 3 integration and deployment

  • Functional tests started on fts3-pilot-service.cern.ch, some bugs observed and promptly fixed by FTS developers:
    • "Connection reset" errors in communication between client and server: TCP keepalive disabled (already included in FTS3 version deployed at other sites)
    • Server repeatedly crashing when checksumming was issued for bulk transfers: fix deployed on Monday 22nd together with other bugfixes
  • Tests will resume now that the stability issue is solved.

Middleware deployment

  • Worker Node testing for WLCG
  • CERN: EMI-2 WN deployed in preprod (~10% of the farm), allowing ATLAS to verify compatibility also with the EOS SRM (BeStMan)
  • WLCG Software Life Cycle Process beyond EMI: please look at the document and/or presentation attached to the Oct 16 Management Board meeting agenda
    • please send comments or questions to Markus and Maarten

XrootD

Squid monitoring

It is agreed that the squid monitoring support that CMS provides also for ATLAS and CVMFS can be considered a CMS contribution to WLCG.

WMS future

News from other WLCG working groups

Experiment operations review and plans

ALICE

  • The old SE hosting conditions data was retired last week: for that type of data ALICE mainly relies on EOS-ALICE now, with backup replicas at other sites.
  • Migration from CASTOR ALICE_DISK to EOS ongoing, foreseen to be finished by Nov 30. Thanks to the CASTOR/EOS team for their efforts in this matter!
  • On Nov 1 between 00:00 and 01:00 local time job submissions were found failing on multiple (possibly all) gLite 3.2 VOBOX nodes at various sites, with the CREAM client complaining that the proxy had supposedly expired while it was fine. This happened for different DNs from different CAs. GGUS:87997 opened for the CREAM developers.

ATLAS

  • The bulk reprocessing will start soon, today at the earliest.
  • Some of the urgent simulation jobs have been stuck due to FZK TAPE+DISK problems

CMS

Ian brings up the topic of xrootd fallback to Tier-1 sites. Normally the WNs are not in the OPN, so the fallback might generate very heavy traffic on the site firewall. Alessandro adds that in fact CERN and FNAL have the WNs in the OPN, and Oliver says that KIT agreed to accommodate the higher load on the firewall but that the official approval from the site management will require at least three weeks.

Maria G. proposes to start xrootd discussions in the working group to decide how best to deal with this kind of issue.

LHCb

  • Reprocessing of 2012 data has progressed very well so far; the first part (1.2 /fb) has been processed. Currently waiting for new conditions to be deployed next week; until then the activities will be ramping down
  • Discussion with FZK next week on how to improve the staging performance at the site
  • EOS: files were deleted because of a buggy script and a problem with SRM upload and concurrent writing to the same file. Some files could be recovered, some reproduced, some lost.

GGUS tickets

No VO or Site was unhappy about support provided to tickets of their concern.

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.13-5; SRM-2.11 for all instances. EOS 0.2.21/xrootd-3.2.5/BeStMan2-2.2.2 for all instances except CMS (0.2.16) | EOS-0.2.19/20/21 | CASTOR - deploy 2.1.13-6 over next weeks; EOS - get CMS onto current version, deploy prototype readonly slave |
| ASGC | CASTOR 2.1.11-9; SRM 2.11-0; DPM 1.8.2-5 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) | | |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.2-1.osg; Oracle Lustre 1.8.6; EOS 0.2.20/xrootd 3.2.2-1.osg with Bestman 2.2.2.0.10 | | |
| IN2P3 | dCache 1.9.12-16 (Chimera) on core servers and 1.9.12-24 and pool nodes. New hardware (more RAM, SSD disks) for Chimera and SRM servers (with SL6). Postgres 9.1; xrootd 3.0.4 | None | None |
| KIT | dCache: atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera); cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera); gridka-dcache.fzk.de: 1.9.12-17 (PNFS); xrootd (version 20100510-1509_dbg) | | |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) | | |
| PIC | dCache 1.9.12-20 (Chimera) | None | None |
| RAL | CASTOR 2.1.11-8/2.1.12-10; 2.1.11-8/2.1.12-10 (tape servers); SRM 2.11-1 | | |
| TRIUMF | dCache 1.9.12-19 with Chimera namespace voms+kpwd authentication and tape recycling postgres9 upgrade | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.11-1 | applied new proxy delegation patch on Mon 29th | |
| ASGC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | Planning to apply proxy delegation patch next week |
| CNAF | 2.2.8 - transfer-fts-3.7.10-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.10-1 | applied patch1 last week | will test patch2 on cmsfts2.fnal.gov next week and cmsfts1.fnal.gov after 1 week of stable running |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | Last patch applied on Oct. 31st | |
| KIT | 2.2.8 - transfer-fts-3.7.10-1 | | will apply new patch on 06.11. |
| NDGF | 2.2.8 - transfer-fts-3.7.10-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | applied the patch during the ops coordination phone conf november 1st :-) | |
| PIC | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | applied new patch on Nov 1st | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.10-1 | | new proxy delegation patch next week |
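
For tracking the patch rollout, the table above can be checked mechanically. The following is only a minimal sketch: the version strings are copied from the table, but the rule that a site still at transfer-fts-3.7.10-1 has not yet applied the new proxy delegation patch is an assumption inferred from the table rows, not something stated in the minutes. The names FTS_VERSIONS, parse_version and sites_pending are hypothetical.

```python
# transfer-fts package version per Tier-1 site, copied from the FTS deployment table.
FTS_VERSIONS = {
    "CERN": "3.7.11-1",
    "ASGC": "3.7.10-1",
    "BNL": "3.7.10-1",
    "CNAF": "3.7.10-1",
    "FNAL": "3.7.10-1",
    "IN2P3": "3.7.12-1",
    "KIT": "3.7.10-1",
    "NDGF": "3.7.10-1",
    "NL-T1": "3.7.12-1",
    "PIC": "3.7.10-1",
    "RAL": "3.7.12-1",
    "TRIUMF": "3.7.10-1",
}

# ASSUMPTION (inferred from the table, not stated in the minutes):
# packages at or below this version do not contain the new patch.
LAST_UNPATCHED = "3.7.10-1"


def parse_version(version):
    """Turn 'X.Y.Z-R' into a tuple of integers so versions compare numerically."""
    main, release = version.split("-")
    return tuple(int(part) for part in main.split(".")) + (int(release),)


def sites_pending(versions, last_unpatched=LAST_UNPATCHED):
    """Return the sorted list of sites whose package is not newer than last_unpatched."""
    limit = parse_version(last_unpatched)
    return sorted(site for site, v in versions.items() if parse_version(v) <= limit)


if __name__ == "__main__":
    for site in sites_pending(FTS_VERSIONS):
        print(site)
```

Versions are compared as integer tuples rather than strings so that, for example, 3.7.10 sorts after 3.7.9.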

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | None |
| CERN | 1.8.2-0 | SLC5, gLite | Oracle | ATLAS, LHCb | |
| CERN, test | lfc-server-oracle-1.8.3.2-1 | SLC6, EMI2 | Oracle | ATLAS | Xroot federations |

Other site news

Data management provider news

CASTOR news

CERN operations and development

EOS news

xrootd news

dCache news

StoRM news

DPM news

FTS news

A new patch for the proxy expiration issue is available and has been deployed at CERN on Monday 29th.

The patch fixes the following issues:

  • submitters proxy expiration (GGUS:81844)
  • jobs wrongly allocated to different VOs (GGUS:87929)
  • VOMS attrs can't be read from CRLs, delegation ID can't be generated (GGUS:87975, GGUS:86775)
  • mkgridmap cron job has old gLite paths, thereby preventing the addition of new VO members to the submit-mapfile

VOs report that the frequent transfer submission and allocation errors seen with the previous patch have disappeared, and the proxy expiration issue has not yet resurfaced. Therefore we would like to ask all T1s to deploy the patch starting from next Monday; please update the table with the deployment schedule.

Installation instructions:

It is agreed that, given the simplicity of the patch installation process, the target date for having it installed at all Tier-1 sites is the end of next week. An official release (2.2.9) including all the patches released so far will take longer, but there won't be any difference between installing it or installing the individual patches. It is also agreed that rollback instructions will be provided, as requested in particular by Burt.

After the meeting the instructions were consolidated on the FTS 2.2.8 admin documentation page:

LFC news

gfal/lcg_util news

Middleware news and baseline versions (Nicolò)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Nicolò repeats some recommendations for sites:

  • In general, if a site wants to upgrade a service, the EMI-2 version should be chosen
  • Sites with a gLite WMS should upgrade to version 3.3.8, which fixes important bugs
  • From today, EMI-1 gets only security updates: therefore, important bug fixes will be available only for EMI-2
  • Now the baseline version for the WN is the EMI-2 one: sites are strongly encouraged to upgrade if they still have a gLite WN, but it's less urgent if they have EMI-1 (check the WLCG WN testing page for the most recent information)
  • The EMI-2 UI has a serious bug affecting submission to the gLite WMS; moreover, the tarball distribution is not yet available
Peter brings up the fact that the SAM/Nagios boxes still use the gLite UI. Anyway, this is not considered a problem because it's a very controlled environment and the required functionality works as needed with gLite.
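
The WN recommendation above can be summarised as a small lookup. This is only an illustrative sketch of the guidance as recorded in these minutes; the function name wn_upgrade_advice and the flavour labels used as keys are hypothetical, and the WLCG WN testing page remains the reference for current information.

```python
# Advice per worker-node middleware flavour, paraphrasing the recommendations above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WN_ADVICE = {
    "gLite": "strongly encouraged to upgrade to the EMI-2 WN",
    "EMI-1": "upgrade to the EMI-2 WN is less urgent; EMI-1 gets only security updates",
    "EMI-2": "already at the baseline version for the WN",
}


def wn_upgrade_advice(flavour):
    """Return the recommendation for a site's current WN flavour."""
    try:
        return WN_ADVICE[flavour]
    except KeyError:
        raise ValueError("unknown WN flavour: %r" % (flavour,))


if __name__ == "__main__":
    for flavour in ("gLite", "EMI-1", "EMI-2"):
        print(flavour, "->", wn_upgrade_advice(flavour))
```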

AOB

It is agreed that the next meeting will be on November 22, in order to avoid the GDB week.

Action list

-- AndreaSciaba - 26-Oct-2012

Topic revision: r28 - 2012-11-06 - AndreasPetzold
 