WLCG Tier1 Service Coordination Minutes - 11/11/2010

Attendance

Local: Ricardo, Oliver, Dirk, Kors, Gavin, Alexei, Flavia, Roberto, Massimo, Andrea V, Maarten, Miguel, Manuel, Alessandro, Nicolo, Simone, Jamie, Maria A, Maria D, Maria G, Huang.

Connected: Michael, Carlos, Jon B, Jon NDGF, John DS, John K, Felix, Elena, Xavier, Gonzalo, Andrew, Ron, Rolf, Carmine, Andrew S, Andreas, Alexander, Jeremy, Foued, Joel, Ian, Paul, Jhen-Wei, Dave.

Release Update

WLCG Baseline Versions

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-8 (ATLAS); CASTOR 2.1.9-9 (ALICE, CMS and LHCb); SRM 2.9-4 (all); xrootd 2.1.9-7 | LHCb upgraded on 8-NOV-2010 | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | 11/11 00:00-06:00 UTC: scheduled downtime for network maintenance | none |
| BNL | dCache 1.9.4-3 (PNFS) | | |
| CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4 | Upgraded dCache on Nov 8; xrootd read access opened to CMS VO. WARNING: new (gsi)dcap client stricter on correct use of slashes in URLs! | |
| IN2P3 | dCache 1.9.5-22 (Chimera) | The recent performance and stability problems with the storage seem to be due to problems with Solaris on the disk servers, i.e. not due to dCache or its configuration; the issues can be reproduced with "iperf". Paul: PIC have seen similar issues with Solaris, getting better performance using Linux instead. | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-19 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-21 (PNFS) | | |
| RAL | CASTOR 2.1.7-27 and 2.1.9-6 (stagers); 2.1.9-1 (tape servers); SRM 2.8-2 and SRM 2.8-6 | | CMS upgrade to 2.1.9-6 on 16-18/11/10 and ATLAS to 2.1.9-6 on 6-8/12/10. Upgrades to 2.1.9-10 early 2011 |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | none | none |

CASTOR news

CERN operations

LHCb has been upgraded. We are following up on an alarm ticket (not necessarily linked to the upgrade). A post-mortem (upgrade + ticket) is planned for the end of the week.

We are preparing for the DB upgrades on the stagers (after the HI run but before Christmas). The DB upgrade related to the NS is definitely scheduled for January 2011; it affects all instances (hence all VOs) and will require some downtime.

Development

Release 2.1.9-10 has been produced, which mainly targets issues found at RAL following their upgrade to 2.1.9. Full release notes and upgrade instructions are available.

xrootd news

dCache news

  • Installed an experimental RSS feed for the different dCache downloads. It is linked from the dCache download pages and the feeds are organized per target (e.g. clients, server versions, etc.)
  • New Golden Release, 1.9.5-23 (see the release notes). Highlights:
    • Space Manager Database access was optimized to make better use of indexes.
    • Fixed read from pool that was disabled with the -rdonly flag of the pool disable command.
    • Fixed xrootd mover TCP port allocation to avoid reusing a port until previous transfers on the port have finished.
    • Fixed implementation of rep ls -l=c command.
  • Recommended feature release: 1.9.10-2 (release notes). Highlights:
    • Mostly speed-ups of the SRM, xrootd and NFS 4.1 protocols.
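The xrootd mover fix in 1.9.5-23 (no reuse of a TCP port until previous transfers on it have finished) can be pictured with a small sketch. This is not dCache code: the class name PortAllocator, its methods and the port range are hypothetical, chosen only to illustrate the idea of counting unfinished transfers per port.

```python
class PortAllocator:
    """Hypothetical illustration, not the dCache implementation.

    Hands out ports from a fixed range and only returns a port to the
    free list once every transfer registered on it has finished.
    """

    def __init__(self, first_port, last_port):
        if first_port > last_port:
            raise ValueError("empty port range")
        self._free = list(range(first_port, last_port + 1))
        self._active = {}  # port -> number of unfinished transfers

    def acquire(self):
        """Reserve a port that has no unfinished transfers."""
        if not self._free:
            raise RuntimeError("no free port in range")
        port = self._free.pop(0)
        self._active[port] = 1
        return port

    def attach(self, port):
        """Register an additional transfer on an already reserved port."""
        if port not in self._active:
            raise KeyError(port)
        self._active[port] += 1

    def release(self, port):
        """Mark one transfer as finished; free the port when none remain."""
        remaining = self._active[port] - 1
        if remaining:
            self._active[port] = remaining
        else:
            del self._active[port]
            self._free.append(port)
```

In this sketch a port with two registered transfers stays reserved after the first release and becomes available again only after the second.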

StoRM news

  • GGUS:64107 was opened about a possible incompatibility of gLite 3.2 VOMS proxies with StoRM, but the problem turned out to be due to an incorrect configuration of the StoRM server.

FTS news

DPM news

  • DPM 1.8.0-1 has been certified
    • fixing memory leaks and thread safety issues in VOMS library (used e.g. in SRM v2.2 daemon)
    • allowing admin to ban users (DN or VOMS FQAN)
  • dpm-xrootd 2.2.0-1 has been certified
  • Working on DPM 1.8.1:
    • faster dpm-drain and replication
    • refactoring of Name Server code for better performance of SRM, xroot and NFS 4.1

LFC news

  • LFC 1.8.0-1 has been certified
    • fixing memory leaks and thread safety issues in VOMS library (used in LFC daemon)
    • allowing admin to ban users (DN or VOMS FQAN)

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | Testing on a database dump, upgrade will be scheduled after tests, no date for now |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in November |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in November |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | | | | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | |

Experiment issues

GGUS Issues

Maria D. will update the CMS ticket about IN2P3 to MIT transfers to clarify that the investigation is expected from IN2P3. She will also reassign the Finnish T2 ticket from Gstat to NDGF, to prompt them to change the configuration parameters so that the Finnish T2 is attributed its real resources in the WLCG monitoring. KIT was reminded to finish configuring the CE and batch nodes with the new CMS VOMS Roles on hiproduction.

Outstanding SIRs

Three reports were discussed (see agenda):

  1. The RAL - storage degradation for LHCb Full Report.
  2. An interim report from IN2P3 about shared area problems (see attachments below).
  3. The need for a report addressing the problems seen by LHCb "since the recent CASTOR upgrade" (although it has since been confirmed that the specific problem reported pre-dated this upgrade).

Conditions data access and related services

COOL, CORAL and POOL

Frontier/Squid

  • Squid service deployment discussion
    • Kors Bos (ATLAS) announced that ATLAS had taken the decision to not share Squids with other experiments. Discussion closed.

Database services

  • Topics of general discussion
    • Distributed Database Operations Workshop - please register if attending social dinner:
http://indico.cern.ch/conferenceDisplay.py?confId=111194

  • Experiment reports:
    • ALICE:
      • Nothing to report
    • ATLAS:
      • Conditions replication to SARA fixed.
      • ATLAS PanDA applications suffered from transaction locking issues on Wednesday afternoon (3rd Nov). DBAs had to intervene and kill user sessions to unblock the application. We are following up with ATLAS on the issue.
    • CMS:
      • On Tuesday morning (9th Nov) CMS PVSS replication from the online to the offline database was unexpectedly disabled for 2 hours due to the failure of one of the weekly automatic maintenance procedures (shrinking of the LogMiner table). The procedure has been modified to inform DBAs via email whenever its execution is unsuccessful.
      • On Tuesday (9th Nov) CMS PVSS replication was affected once again, for 30 minutes, because of a user error (use of a table without a primary key), which caused the apply process to abort.
    • LHCb:
      • Conditions replication to SARA fixed.
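The change described in the CMS report (a maintenance procedure that informs DBAs by email whenever it fails) follows a common pattern, sketched below under stated assumptions. The function name run_with_alert and the notify callback are hypothetical; the actual procedure at CERN is not described in these minutes and may work quite differently.

```python
import subprocess
import sys


def run_with_alert(command, notify):
    """Run a maintenance command and call notify(subject, body) on failure.

    Hypothetical sketch: `command` is an argument list for the job and
    `notify` is any callable that delivers the alert (e.g. by email).
    Returns True if the command exited with status 0.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        notify(
            "Maintenance job failed (exit code %d)" % result.returncode,
            result.stderr or result.stdout,
        )
    return result.returncode == 0


if __name__ == "__main__":
    # Demonstration only: a child Python process that exits with status 3
    # stands in for a failing maintenance job, and print() for the mailer.
    run_with_alert(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        lambda subject, body: print(subject),
    )
```

Passing the notifier as a callable keeps the sketch testable; in practice it could wrap the standard smtplib module to send the message to a DBA mailing list.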

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | Nothing to report | Install and set up TSM; Data Guard studies, testbed creation and implementation plans |
| BNL | Nothing to report | Deployment of PSU OCT 2010 in TAGS test cluster; performance tests of data replication using Transportable Tablespaces between TRIUMF and BNL for TAGS database |
| CNAF | Nothing to report | None |
| KIT | Nothing to report | None |
| IN2P3 | Nothing to report | None |
| NDGF | Nothing to report | None |
| PIC | Nothing to report | None |
| RAL | Nothing to report | Planning to apply October PSU on 3D DBs at the next CERN technical stop |
| SARA | Nothing to report | November 16th, 7:00 to 17:00 UTC: intervention caused by maintenance on network infrastructure |
| TRIUMF | Nothing to report | None |

AOB

Action List

| Action number | Description | Announced | Due | Last Update | Status |
| 20101028_01 | RAL out of production due to ATLAS upgrade. Being an announcement and not an action, it will not be further followed up | 20101028 | 20101206-08 | 20101119 | Open |
| 20101028_02 | Configure new ASGC T2 channels. Done at: ASGC, CERN, IN2P3. KIT: tomorrow | 20101028 | 20101104 | 20101111 | Open |
| 20101028_03 | CMS to decide on redirector fix of GGUS:62696 | 20101028 | a.s.a.p. | 20101028 | Open |
| 20101028_04 | IN2P3 (P.Girard)-CERN (H.Renshall) WG to address AFS issues of LHCb shared area GGUS:59880, GGUS:62800. The WG is active (thanks to Harry) and an intermediate report is attached | 20101028 | a.s.a.p. | 20101111 | Closed |
| 20101028_05 | Invite Dave Dijkstra to discuss FroNTier/squid sharing by ATLAS and CMS sites | 20101028 | 20101111 | 20101028 | Open |

-- AndreaSciaba - 10-Nov-2010

Topic attachments
| Attachment | Size | Date | Who | Comment |
| CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111-0.doc | 38.5 K | 2010-11-11 12:21 | AndreaSciaba | Intermediate report for the LHCb software area AFS problem |
Topic revision: r15 - 2010-11-19 - AndreaSciaba
 