WLCG Tier1 Service Coordination Minutes - 7 July 2011
Attendance
- Local: Elisa, MariaA, MariaD, MariaG, Steve, Massimo, AndreaV, AndreaS, Cedric, Joel, Maarten, Jamie, Simone, Nicolo, Zsolt, Alessandro, Daniele
- Remote: Stefane, Ron, Pavel, Felix, Alexander, Carlos, Michael, DaveD, John DeStefano, Cristina, Ulf, Luca, Gonzalo, Kyle, Gareth, Jhen Wei
Action list review
Release update
Data Management & Other Tier1 Service Issues
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.11 (SL5); SRM 2.10-x (SL4); xrootd 2.1.11-1; FTS 3.2.1 (SL4, i.e. old); EOS 0.1.0 / xrootd 3.0.4 | All CASTOR instances have been upgraded (after the SL5 migration). SSL auth in xrootd has been obsoleted (in favour of GSI). New CASTOR scheduling (transfermanager) in production for ATLAS and CERNT3. Load-balanced DNS aliases for EOSCMS and EOSATLAS deployed. | SRM upgrade to 2.11 (includes SL5 migration). Preparations for SL5 FTS 3.2.1-2 are well underway: 6 new channels on the T2 service will be added on a new SL5 box shortly. Further steps in the migration will be announced in due course. |
| ASGC | CASTOR 2.1.10-0; SRM 2.10-2; DPM 1.8.0-1 | None | None |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Migration from PNFS to Chimera in August 2011 |
| CNAF | StoRM 1.5.6-3 SL4 (CMS, LHCb, ALICE); StoRM 1.6 SL5 (ATLAS) | | |
| FNAL | dCache 1.9.5-23 (PNFS); httpd=1.9.5-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | None |
| IN2P3 | dCache 1.9.5-26 (Chimera) on core servers; mix of 1.9.5-24 and 1.9.5-26 on pool nodes | None | None |
| KIT | dCache (admin nodes): 1.9.5-25 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera), 1.9.5-26 (LHCb, PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-26 | | |
| NDGF | dCache 1.9.12 | | |
| NL-T1 | dCache 1.9.12-10 (Chimera) (SARA); DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.12-8 (since Aug 9th); PNFS, Postgres 9 | | |
| RAL | CASTOR 2.1.10-1, 2.1.10-0 (tape servers); SRM 2.10-0 | Upgraded to 2.1.10-1 on 5/7/11 | Start repacking to T10KC during July/Aug |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | Expanded tape system to 5.5 PB | |
CASTOR news
CERN operations
The CERNT3 and ATLAS deployments use the new internal scheduling. In testing prior to deployment we validated the system to handle 5 times more requests per second than the original LSF-based scheduler. With ATLAS we validated the system under conditions at, or slightly exceeding, the peak rates experienced in operations. ATLAS and CASTOR operations agreed to continue operating the system for the remainder of the LHC 2011 run.
Development
EOS news
- CMS imported essentially all CAF data into EOSCMS (0.5 M files, 0.6 PB). CMS switched the CAF to EOS, with fallback to CASTOR.
- The ATLAS import is at 4.5 M files and 0.9 PB (~300,000 files still to import).
xrootd news
dCache news
StoRM news
FTS news
DPM news
- DPM 1.8.1 in EGI UMD Staged Rollout: changes
- A wlcg-preview repository has been set up containing new middleware versions that have been extensively tested but have not been formally certified for either gLite or EMI
- Testing DPM version "1.8.2" (wlcg-preview release) with the help of ASGC. The main features coming with this version are:
- fast dpm-drain
- better filesystem selection algorithm (designed by Maarten); a rough illustrative sketch is given after this list
- central banning (Argus)
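The minutes only name the improved filesystem selection at a high level. As a rough illustration of the kind of weighted, free-space-based policy such a selector could use (a hypothetical Python sketch, not the actual DPM 1.8.2 code; the FileSystem class and the example pool are invented for illustration):

```python
import random
from dataclasses import dataclass


@dataclass
class FileSystem:
    """Hypothetical stand-in for a DPM disk-pool filesystem entry."""
    server: str
    mount: str
    free_bytes: int
    read_only: bool = False


def select_filesystem(filesystems, rng=random):
    """Pick a writable filesystem with probability proportional to free space.

    Illustrative weighted-random policy only; not the algorithm shipped
    in DPM 1.8.2.
    """
    candidates = [fs for fs in filesystems if not fs.read_only and fs.free_bytes > 0]
    if not candidates:
        raise RuntimeError("no writable filesystem with free space")
    total = sum(fs.free_bytes for fs in candidates)
    pick = rng.uniform(0, total)
    for fs in candidates:
        pick -= fs.free_bytes
        if pick <= 0:
            return fs
    return candidates[-1]


# Example: most writes should land on the filesystem with more free space.
pool = [
    FileSystem("dpmdisk01.example.org", "/data01", free_bytes=8 * 10**12),
    FileSystem("dpmdisk02.example.org", "/data02", free_bytes=2 * 10**12),
]
chosen = select_filesystem(pool)
print(chosen.server, chosen.mount)
```

Weighting by free space tends to fill filesystems evenly over time without always directing every write to the single emptiest one.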
LFC news
- LFC 1.8.1 in EGI UMD Verification
- A wlcg-preview repository has been set up containing new middleware versions that have been extensively tested but have not been formally certified for either gLite or EMI
- LFC version "1.8.2" (wlcg-preview release) is currently running in production at CERN for LHCb, and it is also the version installed for the ATLAS central LFC entering production soon at CERN. Compared to 1.8.0, this LFC version fixes a problem reported by IN2P3 when running the replica instance of the LFC for LHCb. It also brings support for central banning (Argus).
LFC deployment
| Site | Version | OS, n-bit | Backend | Upgrade plans |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |
| FNAL | N/A | | | Not deployed at Fermilab |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Upgrade to SLC5 64-bit pending |
| CNAF | 1.7.4-7 (ATLAS, to be dismissed), 1.8.0-1 (LHCb, recently updated) | SL5 64-bit | Oracle | |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
Experiment issues
WLCG Baseline Versions
GGUS issues
Review of recent / open SIRs and other open service issues
Conditions data access and related services
- A new LCGCMT_60c release has been prepared for ATLAS. The motivation for this release is the upgrade to newer versions of all Persistency Framework packages (CORAL, COOL, POOL) and of the ROOT and Oracle external dependencies, with multiple fixes and improvements in all of the above. The largest changes are in the new CORAL (2.3.16), including in particular several fixes for the most serious problems observed in its handling of network and database connection glitches. The new ROOT (5.28.00e) includes a more robust fix for the xrootd/Kerberos issue responsible for the KDC flood of the last few weeks, while the new Oracle client configuration (11.2.0.1.0p3) includes a workaround for another Kerberos-related bug in the Oracle client, which redefines symbols conflicting with those in the system libraries. The full release notes are available at https://twiki.cern.ch/twiki/bin/view/Persistency/PersistencyReleaseNotes.
- A new frontier client version 2.8.2 has been released, including performance optimizations that split the data received via HTTP into chunks. The new client is now used in the CMS HLT system. It was produced too late for inclusion in the LCGCMT_60c release for ATLAS, but it is now being tested in the CORAL nightlies and will be included in the next CORAL release prepared for ATLAS offline.
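The chunking change is only summarised above. As a rough illustration of what processing an HTTP payload in fixed-size chunks looks like (a Python sketch under assumed names, not the actual frontier client, which is a C library; the URL is a placeholder):

```python
import urllib.request


def fetch_in_chunks(url, chunk_size=64 * 1024):
    """Yield the HTTP response body in fixed-size chunks instead of reading
    it into memory in one piece. Illustrative only; the real frontier client
    has its own buffering and decoding logic."""
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Example usage: process the payload incrementally. In practice the URL
# would point at a frontier server; here a generic placeholder is used.
total = 0
for chunk in fetch_in_chunks("http://example.org/"):
    total += len(chunk)
print("received", total, "bytes")
```

Processing the payload as it arrives keeps memory use bounded and lets decoding start before the full response has been received.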
Database services
- Experiment reports:
- ALICE:
- ATLAS:
- Instance 2 of the ATLAS offline (ADCR) database rebooted on Friday (01.07) around 00:50 due to an ORA-00600 error. The database instance was back in operation after a few minutes. The root cause of this problem was an Oracle bug.
- A new schema was added to the ATLAS AMI Streams replication setup from IN2P3 to CERN on Tuesday 05.07. Due to some problems at IN2P3, the replication downtime was extended to ~2 hours.
- An intervention by an ATLAS developer on one of the ATLAS conditions data schemas caused replication aborts on Wednesday (06.07) at midnight.
- Following an ATLAS request, replication of the ATLAS conditions data to PIC will be discontinued on Tuesday 12.07 at 14:00.
- CMS:
- Node 2 of the CMS offline database rebooted on Monday 27 June around 7:50 PM due to excessive memory consumption caused by one of the CMS applications (PhEDEx). The database instance was back in operation after a few minutes. The root cause of this problem is under investigation.
- The CMS integration database (INT2R) was successfully upgraded to Oracle 11.2.0.2 on Tuesday 5 July in the afternoon.
- LHCb:
- A misconfiguration of Advanced Queuing in the LHCb replication to SARA was fixed on Monday 27th (10 AM). The whole LHCb streams replication to the Tier-1s had to be stopped for 1 hour.
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | Following up on SR 3-3535183751 with Oracle; it is currently in review state. | Migration of the VOMS database replica to newer disk storage. |
| CNAF | | 2011-07-12: LHCb cluster (CONDDB & LFC) maintenance for patch installation and parameter changes |
| KIT | Nothing to report | Concerning patch 9232517 ("PROPAGATION MISSING MESSAGES AFTER DESTINATION QUEUE OWNERSHIP SHIFT ON RAC"), we will either migrate our "Streams RACs" (LHCb 3D/LFC, ATLAS 3D) to new hardware and then apply this 64-bit patch, or apply a 32-bit version to them before migration. We will request a 32-bit patch version through Metalink. |
| IN2P3 | | |
| NDGF | | |
| PIC | Nothing to report | Following an ATLAS request, replication of the ATLAS conditions data to PIC will be discontinued on Tuesday 12.07 at 14:00. |
| RAL | Nothing to report | None |
| SARA | On 6 July around 3:30 AM one of the nodes of our cluster lost network connectivity for a while and the listener went down. After that, LHCb propagation stopped. | None |
| TRIUMF | Nothing to report | None |
AOB
--
AndreaSciaba - 06-Jul-2011