WLCG Tier1 Service Coordination Minutes - 28/10/2010

Attendance

Ian, Federico, Dawid, Gavin, Alexei, Lola, Andrea V., Andrea S., Alessandro, Nicolo, Simone, Harry, Jamie, Maria A., Maria D., Maria G., Jacek, Luca. Connected: Michael, Carlos, Jon B., Jon (NDGF), John DS, John K, Felix, Elena, Xavier, Gonzalo, Elena, Andrew, Ron, Rolf, Carmine, Andrew S., Andreas, Alexander.

Release Update

VO questionnaire on EMI middleware transition and features

  1. The EMI project would like to move from gLite 3.1 to gLite 3.2 and gradually replace all the 3.1 components with their respective 3.2 versions. The transition of gLite grid sites from gLite 3.1 to gLite 3.2 implies an update from Scientific Linux 4 to Scientific Linux 5 on those sites that are still running version 4 of the operating system. Because of the operating system update, grid applications that use dynamically linked shared libraries may need to be recompiled (a small illustration follows after this list). Do you see any issue with the recompilation of any of the grid applications of your VO? A: no
  2. gLite 3.2 includes the CREAM CE, but not the LCG-CE service for job execution. Although the two services should be identical for users' jobs, given that jobs are submitted through the Workload Management Service (WMS), do you see the replacement of LCG-CEs with CREAM CEs as an issue? A: no, but not all jobs go via the WMS
  3. The European Middleware Initiative (EMI) project is collecting requirements to define the development plans for the ARC, gLite and UNICORE middleware stacks for the first year of EMI (planned release in April 2011). The EMI year 1 developments focus on service enhancements (security, consolidation of services and libraries, usability) rather than on adding completely new functionality. The EGI user community is expected to provide input to this plan. Please list and explain what enhancements your VO would like to see in the new middleware release. A: the schedule is not compatible with LHC operation in 2011. More details via direct WLCG<->EMI channels.
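As a purely illustrative aside on question 1 (not part of the questionnaire), one way to judge whether an SL4-built application may need recompiling on SL5 is to inspect its dynamic library dependencies. A minimal sketch, assuming Python, the standard ldd tool and a hypothetical binary name:

  #!/usr/bin/env python
  # Sketch: list the shared libraries a binary is dynamically linked against,
  # to judge whether an SL4 build may need recompilation against SL5 libraries.
  # "./my_grid_app" is a hypothetical binary name used only for illustration.
  import subprocess

  def shared_libraries(binary):
      """Return the ldd output lines that describe shared-library dependencies."""
      out = subprocess.check_output(["ldd", binary]).decode()
      return [line.strip() for line in out.splitlines() if "=>" in line]

  if __name__ == "__main__":
      for dep in shared_libraries("./my_grid_app"):
          print(dep)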

WLCG Baseline Versions

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-8 (ATLAS and LHCb); CASTOR 2.1.9-9 (ALICE and CMS); SRM 2.9-4 (all); xrootd 2.1.9-7 | 19-20/10: upgrade of the ALICE and CMS instances to 2.1.9-9 | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | none | 3/11: network maintenance also affecting storage, in particular connections to European Tier-1s (but not links to US Tier-1s) |
| BNL | dCache 1.9.4-3 (PNFS) | | |
| CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | Final assignment of disk to ATLAS and ALICE this week |
| FNAL | dCache 1.9.5-10 (admin nodes) (PNFS); dCache 1.9.5-12 (pool nodes) | | Upgrade to 1.9.5-22 (or 23) on Nov 8 |
| IN2P3 | dCache 1.9.5-22 (Chimera) | | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-19 (Chimera) (SARA); DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-21 (PNFS) | | Upgrade to dCache 1.9.5-23 planned for the 2 Nov scheduled downtime |
| RAL | CASTOR 2.1.7-27 and 2.1.9-6 (stagers); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | LHCb and ALICE upgraded to 2.1.9 | CMS and ATLAS to be upgraded to 2.1.9 during November |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |

CASTOR news

  • CASTOR 2.1.9-9 has been released and has already been deployed for an HI test run for ALICE and CMS. Its main improvements include better management of the tape infrastructure, targeting the upcoming heavy-ion (HI) run.

xrootd news

dCache news

StoRM news

FTS news

DPM news

Certification of 1.8.0 is on hold waiting for a new version of the VOMS library without the memory leak.

LFC news

Certification of 1.8.0 is on hold waiting for a new version of the VOMS library without the memory leak.

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | None |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in November |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in November |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | | | | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | |

Experiment issues

Following the discussion with ASGC, ATLAS started testing the use of DPM in TW-FTT as the only T0D1 storage element at ASGC (removing the split between T1 and T2 storage). ATLAS asks all T1s and CERN to configure FTS so that the site TW-FTT is treated like another T1 (setting up the proper T1-T1 channels).
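A minimal sketch of how a T1 could check which FTS channels involving TW-FTT still need to be created. It assumes the gLite FTS admin CLI command glite-transfer-channel-list (printing one channel name per line) and uses purely illustrative site aliases; the real aliases and channel names depend on each site's FTS configuration.

  #!/usr/bin/env python
  # Hypothetical helper (a sketch, not an agreed procedure): report the FTS
  # channels to/from TW-FTT that are not yet defined in the local FTS instance.
  import subprocess

  TIER1S = ["CERN", "ASGC", "BNL", "CNAF", "FNAL", "IN2P3", "KIT",
            "NDGF", "SARA", "PIC", "RAL", "TRIUMF"]   # illustrative aliases
  NEW_SITE = "TWFTT"                                  # assumed alias for TW-FTT

  def existing_channels():
      """Channel names currently defined in the local FTS instance."""
      out = subprocess.check_output(["glite-transfer-channel-list"])
      return set(out.decode().split())

  def missing_channels():
      """Channels needed so that TW-FTT is treated like another T1."""
      wanted = set()
      for site in TIER1S:
          wanted.add("%s-%s" % (site, NEW_SITE))   # outbound to TW-FTT
          wanted.add("%s-%s" % (NEW_SITE, site))   # inbound from TW-FTT
      return sorted(wanted - existing_channels())

  if __name__ == "__main__":
      for name in missing_channels():
          print("channel still to be created: %s" % name)

The actual creation of the channels and their share settings would still be done with the usual FTS administration tools at each site.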

GGUS Issues

The open tickets were presented in http://indico.cern.ch/materialDisplay.py?contribId=6&materialId=slides&confId=111906. The actions derived from them are listed as 20101028_03 and 20101028_04 below. The new Support Units in GGUS 8.0 were explained. Input for the new hierarchical structure of Support Unit levels (1st = TPM, 2nd = DMSU et al., 3rd = developers) was requested via Savannah:117155.

Outstanding SIRs

Since the last MB report, no SIRs are outstanding and two have been received:

  1. Severely degraded response from CERN Batch Service
  2. CMS storage down due to GPFS bug

Full reports available here

Conditions data access and related services

COOL, CORAL and POOL

Frontier

  • LHCb has shown some interest in testing Frontier for accessing their COOL conditions data. Discussions are ongoing inside LHCb and with several people in IT and the Persistency team (AndreaV, DaveD, Flavia, Roberto) about the technical details (infrastructure for the tests, software changes in LHCb to make the COOL queries more cacheable, security model).
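As background to the cacheability point above (an illustration only, not LHCb's design): a Frontier/squid layer can only reuse a cached result when repeated requests are byte-identical, so conditions queries that embed a changing quantity (for example an open-ended "up to now" validity) defeat the cache. The toy Python sketch below, with hypothetical query strings, shows the effect of keying a cache on the exact query text:

  # Toy illustration (not the Frontier implementation): a cache keyed on the
  # exact query text only helps when repeated queries are byte-identical.
  import time

  cache = {}   # query text -> result string

  def lookup(query):
      """Return a (possibly cached) result for a query string."""
      if query in cache:
          return cache[query] + " [cache hit]"
      result = "payload for: " + query   # stand-in for a real COOL/Frontier call
      cache[query] = result
      return result + " [cache miss]"

  # A query with a fixed validity interval repeats identically -> second call hits.
  print(lookup("conditions?tag=COND-01&until=1288224000"))
  print(lookup("conditions?tag=COND-01&until=1288224000"))

  # Embedding "now" makes every request unique -> the cache never helps.
  print(lookup("conditions?tag=COND-01&until=%d" % int(time.time())))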

Database services

  • Topics of general discussion
    • 10.2.0.5 and PSU testing
      • We are validating Oracle 10.2.0.5 and 10.2.0.4 with the October PSU on our integration services
      • We are still waiting for an answer from Oracle Support about the April PSU non-rollingness issue
      • Application of the October PSU to the production DBs has been postponed because the technical stop came earlier than expected, leaving not enough time to validate it on the integration DBs
    • Distributed Database Operations Workshop - please register and send comments on the agenda:
http://indico.cern.ch/conferenceDisplay.py?confId=111194

  • Experiment reports:
    • ALICE:
      • Nothing to report
    • ATLAS:
      • Problems with conditions replication to SARA: some tables were not instantiated after the last re-synchronization with RAL. This was traced to a documentation bug; the procedures will need to be updated.
      • ATLAS has initially decided to drop Streams replication to ASGC and the conditions database there; they will use Frontier to access conditions data instead.
    • CMS:
      • During the last few weeks we encountered an unpublished Oracle bug in the Streams environment for the CMS PVSS replication. All transactions executed on the destination database during the recent spontaneous reboots of the CMS nodes were marked as still to be applied even though they had already been applied. Fortunately this issue did not affect the running of the PVSS replication. Oracle confirmed that the problem needs to be fixed manually by cleaning the apply queues; this was done this week.
      • The second node of the CMSR database rebooted on Thursday evening (21st Oct) due to excessive memory utilization by a database session of the PhEDEx application. The problematic queries have been reported to the PhEDEx developers, who are addressing the issue.
    • LHCb:
      • On Monday afternoon (11th Oct) the replication of the LHCb conditions to the SARA site was broken for 4 hours because the apply process aborted. The problem was caused by a missing metadata dictionary, which defines the structure of the replicated tables, on the destination site. Resending the dictionary and restarting the process solved the issue.

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | SRM: problems with the backup deletion scripts, under investigation; expired backup sets cleaned manually. The new TSM set-up is now in place. A Data Guard implementation is under investigation; a test bed is to be set up. | None |
| BNL | Nothing to report | None |
| CNAF | Nothing to report | None |
| KIT | Nothing to report | None |
| IN2P3 | Nothing to report | None |
| NDGF | Firmware and Linux kernel upgrade on the NDGF ATLAS 3D database servers, completed today, 28 October 2010, 08-10 UTC | None |
| PIC | Nothing to report | None |
| RAL | New hardware for CASTOR is arriving; it will be set up in Data Guard mode and a long testing plan has been prepared for it. CASTOR Gen and LHCb upgraded to 2.1.9. | Apply the October PSU on the 3D, LFC and FTS databases at the next CERN technical stop. Apply the Jan PSU 10.2.0.4.3 on CASTOR as soon as possible to fix bug 8437213 (bitmap block corruption). |
| SARA | Corruption of the LHCb and ATLAS conditions databases, caused by the missing instantiation of tables created between SARA's last restore and the re-synchronization with RAL. A partial re-synchronization from CERN using export/import is now running. | |
| TRIUMF | Nothing to report | None |

AOB

Action List

| Action number | Description | Announced | Due | Last Update | Status |
| 20101028_01 | RAL out of production due to the ATLAS upgrade | 20101028 | 20101124-27 | 20101028 | Open |
| 20101028_02 | Configure new ASGC T2 channels | 20101028 | 20101104 | 20101028 | Open |
| 20101028_03 | CMS to decide on the redirector fix of GGUS:62696 | 20101028 | a.s.a.p. | 20101028 | Open |
| 20101028_04 | IN2P3 (P. Girard) / CERN (H. Renshall) working group to address the AFS issues of the LHCb shared area (GGUS:59880, GGUS:62800) | 20101028 | a.s.a.p. | 20101028 | Open |
| 20101028_05 | Invite Dave Dijkstra to discuss FroNTier/squid sharing by ATLAS and CMS sites | 20101028 | 20101111 | 20101028 | Open |

-- AndreaSciaba - 27-Oct-2010
