WLCG Tier1 Service Coordination Minutes - 20th January 2011

Attendance

local(); remote ();

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (CMS, ATLAS and ALICE); CASTOR 2.1.9-9 (LHCb); SRM 2.9-4 (all); xrootd 2.1.9-7 | Oracle changes (all instances): upgrade on Jan the 6th to 10.2.0.5 | LHCb will be upgraded to 2.1.10 |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | 14/1: 4h "at risk" intervention on tape system due to construction of electrical power system in data center | None |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | 1.9.4 upgraded to 1.9.5-23; PG 8.3 to PG 9 | None |
| CNAF | StoRM 1.5.6-3 (ATLAS, CMS, LHCb, ALICE) | | upgrade OS to SL5 within February |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4 | None | Putting Lustre into Production Service for Merging Pools |
| IN2P3 | dCache 1.9.5-22 (Chimera) | | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-23 (PNFS) | | |
| RAL | CASTOR 2.1.9-6 (stagers); 2.1.9-1 (tape servers); SRM 2.8-6 | Upgraded ATLAS disk servers to SL5 64bit and enabled checksum support | Will next upgrade CMS disk servers and enable checksum support. Will upgrade Oracle to 10.2.0.5 on 1 Feb 2011 |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | None | None |

CASTOR news

CERN operations

Development

No news.

xrootd news

dCache news

StoRM news

  • StoRM 1.6.0 released for Early Adopters for SL5 X86_64 on Jan 13. Changelog is here.

FTS news

No news.

DPM news

  • DPM 1.8.0 released for gLite 3.2 / SL5 on Jan 18, including these highlights:
    • new VOMS library fixing memory leaks in SRM daemons
    • facility to ban users and groups (VOMS attributes)
  • DPM 1.8.0 for gLite 3.1 / SL4 still in Staged Rollout

LFC news

  • LFC 1.8.0 for gLite 3.1 / SL4 and 3.2 / SL5 still in Staged Rollout

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.4-7 | SLC5 64-bit | Oracle | None (upgrade done on 4/1) |
| BNL | 1.8.0-1 | SL5, 64-bit | Oracle | |
| CERN | 1.7.3 | 64-bit SLC4 | Oracle | Will upgrade to SLC5 64-bit by the end of Jan or beginning of Feb. |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle | Upgraded to LFC 1.8.0 on January 4th |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |

Experiment issues

Status of open GGUS tickets

GGUS - Service Now interface

Review of open SIRs

Conditions Data Access and related services

COOL, CORAL and POOL

  • New software releases are being built for ATLAS (LCG 59b with CORAL 2.3.14, COOL 2.8.8 and POOL 2.9.11, based on ROOT 5.26.00e) and LHCb (LCG 60 with CORAL 2.3.14a, COOL 2.8.8a and POOL 2.9.11a, based on ROOT 5.28.00). CMS is also building a new software release using the CORAL 2.3.12 tag, upgrading from a previous software version. These are the releases that will be used for the 2011 data taking.
  • A workshop will be held in LHCb next Monday to discuss the future strategy for conditions database deployment and distribution on the Grid. Several options will be discussed, including Oracle Streams, SQLite files (optionally via CVMFS/Squid) and Frontier/Squid (agenda).

Frontier/Squid

Database services

  • Experiment reports:
    • Generic:
      • LHCb and ATLAS downstream capture databases have been patched to 10.2.0.5 on Wednesday 5th of Jan.
    • ALICE:
      • Online DB upgraded to 10.2.0.5 on 12th of Jan.
      • Online DB not available between 19th of Jan 16h and 20th of Jan 19h due to power tests in the pit.
    • ATLAS:
      • The ATLAS offline database accounts for ADC applications (PANDA, DQ2 and prodsys) were moved to a dedicated database on Monday 17th of Jan 2011. The operation involved a scheduled downtime for the affected accounts from 9am till 6pm and included a full backup to tape of the new DB accounts. The rest of the ATLAS offline DB, in particular conditions and PVSS, was left untouched.
      • Atlas online DB has been upgraded to 10.2.0.5 on Wednesday afternoon (19th of Jan).
      • After the CC power cut on 18th of December, the ATLARC database failed to restart. The investigation showed that one of the online log files was corrupted. This file had already been archived by the database, but the archived copy turned out to be corrupted as well. This looks like a database bug, since a corrupted online log file should not be archived successfully. As a consequence, the database had to be restored from backup. Recovery was possible only to a point in time a few hours before the power cut (7:30 a.m.), because the TSM service lost some backup data after the power cut and the existing archivelog was corrupted. The users agreed to restore the DB to that point in time, several hours before the power cut, without waiting for TSM to be able to recover more data.
    • CMS:
      • CMS online DB was not available between 8:25 and 19:10 on the 4th of January, due to a power cut in P5. Manual interventions were required to start the DB after power was back (~17h).
      • Maintenance activities over the weekend on schemas replicated by CMS PVSS streaming caused aborts and high replication latency. To avoid further problems, other changes were made manually on the target database (17th of Jan).
      • CMS online production database was stopped twice last week, on Monday and Tuesday morning (10-11 Jan). The first downtime was necessary due to power tests at P5. At the same time, several disks critical for database operation were replaced with new ones. The second downtime was for the upgrade to 10.2.0.5. The upgrade was completed successfully.
      • CMS offline database has been upgraded to version 10.2.0.5 on Wednesday afternoon (19th of Jan).
    • LHCb:
      • LHCb online DB was not available between 0:10 and 19h on the 20th of Dec due to scheduled power tests in the LHCb pit.

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | LFC DB upgraded to 10.2.0.5 | None |
| BNL | Conditions DB: underlying storage firmware patches applied. Updates of OS RHEL 5 and Oracle to 10.2.0.5: LFC BNL, LFC Tier 3 and FTS database service successfully migrated to new hardware (head nodes/storage); VOMS / dCache Priority Stager database successfully migrated to new hardware (head nodes/storage) | Upgrade Conditions DB to 10.2.0.5 |
| CNAF | | Still no exact plan for the 10.2.0.5 upgrade, but will define it soon. |
| KIT | A new DBA: Stefan Waldecker | Upgrade of 3D RACs (ATLAS, LHCb) to Oracle 10.2.0.5 on Jan 26 (during full GridKa/DE-KIT downtime 7:00-18:00 UTC). |
| IN2P3 | Nothing to report | DBLHCB, DBATL and DBAMI: on 8th Feb [9:00 - 18:00 CET], we are going to upgrade our storage system and network switch. All 3D databases will be shut down during this intervention. DBAMI: on 20th Jan, we will add a new schema into the AMI stream. |
| NDGF | Nothing to report | None |
| PIC | | We are planning a downtime in early February to upgrade our Oracle databases and carry out other tasks, but the date is not fixed yet. |
| RAL | We have installed the new CASTOR and ATLAS LFC hardware, which is now being tested. Next step is to install Data Guard in High Availability mode and test it before going into production. | Planning to upgrade Castor DBs on the 31st if we get the final approval from the experiment. Waiting for CERN to finish their upgrade before we proceed to upgrade our 3D and LFC/FTS DBs. |
| SARA | Successfully moved the database back to the original cluster hardware on 18th of Jan | No date for 10.2.0.5 upgrade yet |
| TRIUMF | Nothing to report | Planning to upgrade to Oracle 10.2.0.5 sometime in February. |

Action List

| Action number | Description | Announced | Due | Last Update | Status |
| 20101028_01 | RAL out of production due to Atlas upgrade. Being an announcement and not an action it will not be further followed up | 20101028 | 20101206-08 | 20101119 | Open |
| 20101028_02 | Configure new ASGC T2 channels. Done at: ASGC, CERN, IN2P3. KIT: tomorrow | 20101028 | 20101104 | 20101111 | Open |
| 20101028_03 | CMS to decide on redirector fix of GGUS:62696 | 20101028 | a.s.a.p. | 20101028 | Open |
| 20101028_05 | Invite Dave Dijkstra to discuss FroNTier/squid sharing by Atlas and CMS sites | 20101028 | 20101111 | 20101028 | Open |
| 20101216_01 | Write SIR on Atlas db server reboot | 20101216 | a.s.a.p. | 20110120 | Done |
| 20101216_01 | Write SIR on Atlas db server reboot | 20101216 | a.s.a.p. | 20110120 | Open |

AOB

Topic revision: r26 - 2011-01-28 - AndreaValassi
 