WLCG Tier1 Service Coordination Minutes - 28/10/2010
Attendance
Release Update
VO questionnaire on EMI middleware transition and features
- The EMI project would like to move from gLite 3.1 to gLite 3.2 and gradually replace all the 3.1 components with their respective 3.2 versions. The transition of gLite grid sites from gLite 3.1 to gLite 3.2 implies an update from Scientific Linux 4 to Scientific Linux 5 on those sites that are still using version 4 of the operating system. Because of the operating system update, grid applications that use dynamically linked shared libraries may need to be recompiled (see the illustrative check after this list). Do you see any issue with the recompilation of any of the grid applications of your VO? A: no
- gLite 3.2 includes the CREAM CE, but not the LCG-CE service, for job execution. Although the two services should be identical for users' jobs, given that the jobs are submitted through the Workload Management Service (WMS), do you see the replacement of LCG-CEs with CREAM CEs as an issue? A: no, but not all jobs go via the WMS
- The European Middleware Initiative project (EMI) is collecting requirements to define the development plans for the ARC, gLite and UNICORE middleware stacks for the first year of EMI (planned release in April 2011). EMI year 1 developments focus on service enhancements (security, consolidation of services and libraries, usability), not on adding completely new functionality. The EGI user community is expected to provide input to this plan. Please list and explain what enhancements your VO would like to see in the new middleware release. A: schedule not compatible with LHC operation in 2011. More details via direct WLCG<->EMI channels.
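Purely as an illustration of the recompilation question above (not part of the questionnaire), a VO could scan its software area for binaries whose shared libraries no longer resolve after the SL4 to SL5 move. The sketch below simply wraps the standard ldd tool; the directory path is a placeholder, and the Python 2 syntax matches the interpreters shipped with SL4/SL5.

```
#!/usr/bin/env python
# Illustrative sketch only: flag dynamically linked executables whose shared
# libraries can no longer be resolved after the OS upgrade.
# The scanned directory is a hypothetical VO software area, not a real path.
import os
import subprocess

def missing_libraries(binary):
    """Return the shared libraries that ldd reports as 'not found'."""
    proc = subprocess.Popen(["ldd", binary],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out = proc.communicate()[0]
    return [line.strip() for line in out.splitlines() if "not found" in line]

if __name__ == "__main__":
    software_area = "/opt/exp_software/myvo/bin"  # placeholder
    for name in sorted(os.listdir(software_area)):
        path = os.path.join(software_area, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            missing = missing_libraries(path)
            if missing:
                print "%s may need recompilation:" % path
                for lib in missing:
                    print "    " + lib
```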
WLCG Baseline Versions
Data Management & Other Tier1 Service Issues
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-8 (all), SRM 2.9-4 (all), xrootd 2.1.9-7 | | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver), CASTOR 2.1.8-14 (tapeserver), SRM 2.8-2 | none | 3/11: network maintenance also affecting storage, in particular connections to European Tier-1s (but not links to US Tier-1s) |
| BNL | dCache 1.9.4-3 (PNFS) | | |
| CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | Final assignment of disk to ATLAS and ALICE this week |
| FNAL | dCache 1.9.5-10 (admin nodes) (PNFS), dCache 1.9.5-12 (pool nodes) | | Upgrading to 1.9.5-22 (or -23) on Nov 8 |
| IN2P3 | dCache 1.9.5-22 (Chimera) | | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera), dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera), dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-19 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-21 (PNFS) | | Upgrade to dCache 1.9.5-23 planned for the 2-Nov SD |
| RAL | CASTOR 2.1.7-27 and 2.1.9-6 (stagers), CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers), SRM 2.8-2 | LHCb and ALICE upgraded to 2.1.9 | CMS and ATLAS to be upgraded to 2.1.9 during November |
| TRIUMF | dCache 1.9.5-21 (Chimera) | | |
CASTOR news
xrootd news
dCache news
StoRM news
FTS news
DPM news
Certification of 1.8.0 is on hold waiting for a new version of the VOMS library without the memory leak.
LFC news
Certification of 1.8.0 is on hold waiting for a new version of the VOMS library without the memory leak.
LFC deployment
| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | None |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in November |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in November |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | | | | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | |
Experiment issues
Following the discussion in ASGC, ATLAS started testing the use of DPM at TW-FTT as the only T0D1 storage element in ASGC (removing the split between T1 and T2 storage). ATLAS asks all T1s and CERN to configure FTS so that the site TW-FTT is treated like another T1 (setting up the proper T1-T1 channels).
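As a non-authoritative sketch of what such an FTS change could look like at a T1, the snippet below drives the FTS 2.x glite-transfer-channel-add CLI from Python. The endpoint, site names, channel name and exact CLI arguments are assumptions to be checked against the local FTS admin documentation.

```
#!/usr/bin/env python
# Hypothetical sketch only: add an FTS channel from the local T1 towards
# TW-FTT by calling the FTS 2.x glite-transfer-channel-add CLI.
# Endpoint, site names and channel name are placeholders, not agreed values.
import subprocess
import sys

FTS_ENDPOINT = ("https://fts.example.cern.ch:8443/"
                "glite-data-transfer-fts/services/ChannelManagement")  # placeholder
SOURCE_SITE = "MYT1SITE"    # local T1 site name as registered in FTS (placeholder)
DEST_SITE = "TW-FTT"        # the ASGC DPM instance to be treated as a T1
CHANNEL = "MYT1SITE-TWFTT"  # illustrative channel naming only

# Assumed CLI usage: glite-transfer-channel-add [-s endpoint] <channel> <source> <dest>
cmd = ["glite-transfer-channel-add", "-s", FTS_ENDPOINT,
       CHANNEL, SOURCE_SITE, DEST_SITE]
rc = subprocess.call(cmd)
if rc != 0:
    sys.exit("Channel creation failed; check the FTS admin guide for the exact syntax.")
```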
GGUS Issues
Outstanding SIRs
Since the last MB report no SIRs remain outstanding, and two new SIRs have been received:
- Severely degraded response from CERN Batch Service
- CMS storage down due to GPFS bug
Full reports are available here.
Conditions data access and related services
COOL, CORAL and POOL
Frontier
Database services
- Topics of general discussion
- 10.2.0.5 and PSU testing
- We are validating Oracle 10.2.0.5 and 10.2.0.4 with the October PSU on our integration services
- We are still waiting for an answer from Oracle Support about the April PSU non-rolling installation issue
- October PSU patching of the production DBs has been postponed because the technical stop came earlier than expected, leaving not enough time to validate the patch on the integration DBs
- Distributed Database Operations Workshop - please register and send comments on the agenda:
http://indico.cern.ch/conferenceDisplay.py?confId=111194
- Experiment reports:
- ALICE:
- ATLAS:
- Problems with conditions replication to SARA: some tables were not instantiated after the last re-synchronization with RAL. This was traced to a documentation bug; the procedures will need to be updated.
- ATLAS has initially decided to drop the Streams replication to ASGC and the conditions database there; Frontier will be used to access conditions data instead.
- CMS:
- During the last few weeks an unpublished Oracle bug was encountered in the Streams environment for the CMS PVSS replication. All transactions executed on the destination database during the recent spontaneous reboots of the CMS nodes were marked as still to be applied even though they had already been applied. Fortunately this issue did not affect the running of the PVSS replication. Oracle confirmed that the problem needs to be fixed manually by cleaning the apply queues; this was done this week.
- The second node of the CMSR database rebooted on Thursday evening (21st Oct) due to excessive memory utilization by a database session of the PhEDEx application. The problematic queries have been reported to the PhEDEx developers, who are addressing the issue.
- LHCb:
- On Monday afternoon (11th Oct) the replication of LHCb conditions to SARA was broken for 4 hours because the apply process aborted. The problem was caused by a missing metadata dictionary on the destination site, which defines the structure of the replicated tables. Resending the dictionary and restarting the process solved the issue.
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | SRM: problems with backup deletion scripts, under investigation; expired backup sets cleaned manually. New TSM set-up is now in place. Data Guard implementation is under investigation; a test bed is to be set up. | None |
| BNL | Nothing to report | None |
| CNAF | Nothing to report | None |
| KIT | Nothing to report | None |
| IN2P3 | Nothing to report | None |
| NDGF | Firmware and Linux kernel upgrade on the NDGF ATLAS 3D database servers completed today, 28th October 2010, 08:00-10:00 UTC | None |
| PIC | Nothing to report | None |
| RAL | New hardware is coming for CASTOR; it will be set up in Data Guard mode and a long testing plan is in place for it. CASTOR Gen and LHCb upgraded to 2.1.9. | Apply the October PSU on the 3D, LFC and FTS DBs at the next CERN technical stop. Apply the January PSU 10.2.0.4.3 on CASTOR ASAP to fix bug 8437213 (bitmap block corruption). |
| SARA | Corruption of the LHCb and ATLAS conditions databases, caused by the missing instantiation of tables created between SARA's last restore and the re-synchronization with RAL. A partial re-synchronization from CERN using export/import is now running. | |
| TRIUMF | Nothing to report | None |
AOB
-- AndreaSciaba - 27-Oct-2010