WLCG Tier1 Service Coordination Minutes - 7 July 2011

Attendance

  • Remote: Stefane, Ron, Pavel, Felix, Alexander, Carlos, Michael, DaveD, John DeStefano, Cristina, Ulf, Luca, Gonzalo, Kyle, Gareth, Jhen Wei

Action list review

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.11 (SL5); SRM 2.10-x (SL4); xrootd: 2.1.11-1; FTS 3.2.1 on SL4 (i.e. old); EOS-0.1.0/xrootd-3.0.4 | All CASTOR instances have been upgraded (after the SL5 migration). SSL auth in xrootd has been obsoleted (in favour of GSI). New CASTOR scheduling (transfermanager) in production for ATLAS and CERNT3. Load-balanced DNS aliases for EOSCMS & EOSATLAS deployed. | SRM upgrade to 2.11 (includes SL5 migration). SL5 FTS 3.2.1-2 preparations well underway. 6 new channels on the T2 service will be added on a new SL5 box shortly. Further steps in the migration will be announced in due course. |
| ASGC | CASTOR 2.1.10-0; SRM 2.10-2; DPM 1.8.0-1 | None | None |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Migration from PNFS to Chimera in August 2011 |
| CNAF | StoRM 1.5.6-3 SL4 (CMS, LHCb, ALICE); StoRM 1.6 SL5 (ATLAS) | | |
| FNAL | dCache 1.9.5-23 (PNFS) httpd=1.9.5.-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | None |
| IN2P3 | dCache 1.9.5-26 (Chimera) on core servers. Mix of 1.9.5-24 and 1.9.5-26 on pool nodes | None | None |
| KIT | dCache (admin nodes): 1.9.5-25 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera), 1.9.5-26 (LHCb, PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-26 | | |
| NDGF | dCache 1.9.12 | | |
| NL-T1 | dCache 1.9.12-10 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.12-8 (since Aug 9th); PNFS, Postgres 9 | | |
| RAL | CASTOR 2.1.10-1; 2.1.10-0 (tape servers); SRM 2.10-0 | Upgraded to 2.1.10-1 on 5/7/11 | Start repacking to T10KC during July/Aug |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | Expanded tape system to 5.5 PB | |

CASTOR news

CERN operations

The CERNT3 and ATLAS deployments use the new internal scheduling. In testing prior to deployment we validated that the system handles 5 times more requests per second than the original LSF-based scheduler. With ATLAS we validated the system under conditions at the level of (or slightly exceeding) the peak rates experienced in operations. ATLAS and CASTOR operations agreed to continue operating the system for the remainder of the LHC 2011 run.

Development

EOS news

  • CMS imported nearly all CAF data into EOSCMS (0.5 M files, 0.6 PB). CMS switched CAF to EOS with fallback to CASTOR.
  • ATLAS import is at 4.5 M files, 0.9 PB (~300,000 files still to import).

xrootd news

dCache news

StoRM news

FTS news

DPM news

  • DPM 1.8.1 in EGI UMD Staged Rollout: changes
  • A wlcg-preview repository has been set up. It contains new middleware versions which have been extensively tested but have not been formally certified for either gLite or EMI.
  • Testing DPM version "1.8.2" (wlcg-preview release) with the help of ASGC. The main features coming with this version are:
    • fast dpm-drain
    • better filesystem selection algorithm (designed by Maarten)
    • central banning (Argus)

LFC news

  • LFC 1.8.1 in EGI UMD Verification
  • A wlcg-preview repository has been set up. It contains new middleware versions which have been extensively tested but have not been formally certified for either gLite or EMI.
  • LFC version "1.8.2" (wlcg-preview release) is currently running in production at CERN for LHCb. It is also the version installed for the ATLAS Central LFC, which will enter production at CERN soon. Compared to 1.8.0, this LFC version fixes a problem reported by IN2P3 when running the replica instance of LFC for LHCb. This version also brings support for central banning (Argus).

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Upgrade to SLC5 64-bit pending |
| CNAF | 1.7.4-7 (ATLAS, to be dismissed); 1.8.0-1 (LHCb, recently updated) | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |

Experiment issues

WLCG Baseline Versions

GGUS issues

Review of recent / open SIRs and other open service issues

Conditions data access and related services

  • A new LCGCMT_60c has been prepared for ATLAS. The motivation for this release is the upgrade to newer versions of all Persistency Framework packages (CORAL, COOL, POOL) and of the ROOT and Oracle external dependencies, with multiple fixes and improvements in all of the above. The largest changes are in the new CORAL (2.3.16), including in particular several fixes for the most serious problems observed in its handling of network and database connection glitches. The new ROOT (5.28.00e) includes a more robust fix for the xrootd/Kerberos issue responsible for the KDC flood in the last few weeks. The new Oracle client configuration (11.2.0.1.0p3) includes a workaround for another Kerberos-related bug in the Oracle client, which redefines symbols conflicting with those in the system libraries. The full release notes are available at https://twiki.cern.ch/twiki/bin/view/Persistency/PersistencyReleaseNotes.
  • A new frontier client version 2.8.2 has been released, including some performance optimizations that split the data received via HTTP into chunks. The new client is now used in the CMS HLT system. It was produced too late for inclusion in the LCG60c release for ATLAS, but it is now being tested in the CORAL nightlies and will be included in the next CORAL release prepared for ATLAS offline.

Database services

  • Experiment reports:
    • ALICE:
    • ATLAS:
      • Instance 2 of the ATLAS Offline (ADCR) database rebooted on Friday (01.07) around 00:50 due to an ORA-00600 error. The database instance was back in operation after a few minutes. The root cause of this problem was an Oracle bug.
      • A new schema was added to the ATLAS AMI Streams replication setup from IN2P3 to CERN on Tuesday 05.07. Due to some problems at IN2P3, the replication downtime was extended to ~2h.
      • An intervention by an ATLAS developer on one of the ATLAS conditions data schemas caused replication to abort on Wednesday (06/07) at midnight.
      • Following an ATLAS request, replication of ATLAS Conditions to PIC will be discontinued on Tuesday 12.07 at 14:00.
    • CMS:
      • Node 2 of the CMS Offline database rebooted on Monday 27th of June around 7:50 PM due to excessive memory consumption caused by one of the CMS applications (PhEDEx). The database instance was back in operation after a few minutes. The root cause of this problem is under investigation.
      • The CMS integration database (INT2R) was successfully upgraded to Oracle 11.2.0.2 on the afternoon of Tuesday 5th of July.
    • LHCb:
      • A misconfiguration of Advanced Queuing in the LHCb replication to SARA was fixed on Monday 27th (10 AM). All LHCb streaming replication to the T1s had to be stopped for 1 hour.

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | Follow-up of SR 3-3535183751 with Oracle, which is in reviewing state. | Migration of VOMS database replica to newer disk storage. |
| CNAF | | 2011-07-12 LHCb cluster (CONDDB & LFC) maintenance for patch installation and parameter changes |
| KIT | Nothing to report | Concerning patch 9232517 ("PROPAGATION MISSING MESSAGES AFTER DESTINATION QUEUE OWNERSHIP SHIFT ON RAC"), we'll either migrate our "Streams RAC"s (LHCb 3D/LFC, ATLAS 3D) to new hardware and then apply this 64bit patch, or apply a 32bit version to them before migration. We'll request a 32bit patch version through metalink. |
| IN2P3 | | |
| NDGF | | |
| PIC | Nothing to report | Following an ATLAS request, replication of ATLAS Conditions to PIC will be discontinued on Tuesday 12.07 at 14:00. |
| RAL | Nothing to report | None |
| SARA | On 6th of July around 3:30 AM one of the nodes of our cluster lost network connectivity for a while and the listener went down. After that, LHCb propagation stopped. | None |
| TRIUMF | Nothing to report | None |

AOB

-- AndreaSciaba - 06-Jul-2011

Topic revision: r23 - 2011-09-29 - OnnoZweersExCern
 