---+ WLCG Tier1 Service Coordination Minutes - 29 September 2011

%TOC%

---++ Attendance

---++ Action list review

---++ Release update

---+++ Data Management & Other Tier1 Service Issues

| *Site* | *Status* | *Recent changes* | *Planned changes* |
| !CERN | CASTOR 2.1.11-5 (SL5) for CMS and PUBLIC, others are on 2.1.11-2; SRM 2.10-x (SL4); xrootd: 2.1.11-1<br /> FTS: 5 nodes in SLC5 3.7.0-3; 7 nodes in SLC4 3.2.1<br /> EOS -0.1.0/xrootd-3.0.4 | | CASTOR 2.1.11-6 has been officially released (maintenance release addressing some issues with the transfermanager and the tapegateway components). It is scheduled for CASTORCMS on Monday (Oct 3rd) and will soon be deployed at least on PUBLIC. |
| ASGC | CASTOR 2.1.11-5<br />SRM 2.11-0<br />DPM 1.8.0-1 | 22/09: CASTOR upgrade, no issues encountered | None |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Transition from PNFS to Chimera during next LHC TS |
| CNAF | !StoRM 1.7.0 (ATLAS) %BR% !StoRM 1.5.0 (other endpoints) | The present version contains various patches which will be included in the new !StoRM release, currently under certification | |
| FNAL | dCache 1.9.5-23 (PNFS) httpd=1.9.5-25<br />Scalla xrootd 2.9.1/1.4.2-4<br />Oracle Lustre 1.8.3 | | |
| !IN2P3 | dCache 1.9.5-26 (Chimera) on core servers. Mix of 1.9.5-24 to 1.9.5-28 on pool nodes | | Increase of RAM on Chimera node next week (site downtime on 2011-10-04) |
| KIT | dCache (admin nodes): 1.9.5-27 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera), 1.9.5-26 (LHCb, PNFS)<br />dCache (pool nodes): 1.9.5-6 through 1.9.5-27 | | |
| NDGF | dCache 1.9.14 (Chimera) on core servers. Mix of 1.9.13 and 2.0.0 on pool nodes. | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.12-10 (last upgrade to patch release on 13-Sep); PNFS on Postgres 9.0 | | |
| !RAL | CASTOR 2.1.10-2 <br />2.1.10-0 (tape servers)<br />SRM 2.10-0 | None | None |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | None | Upgrade dCache to 1.9.5-28 and FTS to SL5 3.7.0-3 next Wednesday |

---++++ Other site news

---++++ CASTOR news

---+++++ CERN operations and development

   * CASTOR 2.1.11-6 has been officially released ([[http://castorold.web.cern.ch/castorold/DIST/CERN/savannah/CASTOR.pkg/2.1.11-*/2.1.11-6/ReleaseNotes][Release notes]]). This is a maintenance release addressing some issues with the transfermanager and the tapegateway components. It is scheduled for CASTORCMS on Monday (Oct 3rd) and will soon be deployed at least on PUBLIC.

---++++ EOS news

---++++ xrootd news

---++++ dCache news

---++++ !StoRM news

---++++ FTS news

   * FTS 2.2.5 in gLite Staged Rollout: http://glite.cern.ch/staged_rollout
   * FTS 2.2.6 released in EMI-1 Update 6 on Sep 1
      * restart/partial resume of failed transfers
   * FTS 2.2.7 being prepared for certification: [[https://savannah.cern.ch/patch/?4862][FTS 2.2.7 patch]] (see list of bugs at the end)
      * includes new overwrite logic
      * to be released for gLite and EMI

---++++ DPM news

   * DPM 1.8.2-2 - a problem was found in certification, fixed, and the release rebuilt
   * DPM 1.8.2-3 ready for final certification (code already validated extensively and in use at some sites)
      * fast dpm-drain
      * filesystem selection algorithm configurable by admin
      * support for central banning (Argus)
      * allow definition of the number of threads for DPM and SRM servers at startup time
      * https://savannah.cern.ch/patch/?5005
      * https://savannah.cern.ch/patch/?5006
   * Monthly releases of new unstable components can be followed on the blog: https://svnweb.cern.ch/trac/lcgdm/blog
      * This covers NFSv4.1, !WebDAV, Nagios, Catalogue synchronisation & 'perfsuite'.

---++++ LFC news

   * LFC 1.8.2-2 has been certified
      * fix for read-only replica operation (LHCb)
      * support for central banning (Argus)
      * https://savannah.cern.ch/patch/?5003
      * https://savannah.cern.ch/patch/?5004

---++++ LFC deployment

| *Site* | *Version* | *OS, n-bit* | *Backend* | *Upgrade plans* |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| [[https://twiki.cern.ch/twiki/bin/view/PESgroup/PesGridServicesSfwLevels][CERN]] | 1.8.2-0 | SLC5 64-bit | Oracle | Upgrade to SLC5 64-bit only pending for lfcshared1/2 |
| CNAF | 1.7.4-7 (ATLAS, to be decommissioned)<br />1.8.0-1 (LHCb, recently updated) | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| !IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Plan to migrate to 1.8.0-2 asap |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NDGF | 1.7.4.7-1 | Ubuntu 10.04 64-bit | !MySQL | None |
| NL-T1 | 1.7.4-7 | !CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| !RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | !MySQL | None |

---++++ Experiment issues

---+++ WLCG Baseline Versions

   * Release report: [[https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status][deployment status wiki page]]
   * WLCG Baseline versions: [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][table]]

---++ Status of open GGUS tickets

---++ Review of recent / open SIRs and other open service issues

---++ Conditions data access and related services

---++ Database services

   * Experiment reports:
      * ATLAS:
         * On Monday (19.09) at 10:00 the ATLAS Streams replication to the Tier-1s got stuck. The cause was Oracle internal queuing processes that were blocking access to the queues. All blocking processes had to be killed and the affected database was restarted. The replication service was available again at 14:00. A service request has been opened with Oracle, as the same problem was observed 3 weeks ago after applying the July CPU on the downstream capture database.
         * Gancho updated the TWiki (https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DatabaseVolumes) with the latest projections to 2014 for ATLAS Conditions DB volumes to Tier-1s. This was prompted by a request from Carlos Gamboa, who was doing hardware purchase planning. (Elizabeth)
      * CMS:
         * On Thursday (15.09) the second node of CMSR was rebooted, and on Wednesday (21.09) all nodes but the first were rebooted by Clusterware. The only indication of the cause is a high load that grows very fast starting about 2 minutes before the reboot. Unfortunately the existing logs and trace files do not allow the root cause to be determined. Oracle OS Watcher software will be deployed today to gather additional diagnostic information in case the problem reappears.
         * On Friday 23rd September around 5:30 in the morning, 5 out of 6 nodes of the CMS online production database went down due to a failure of the cluster interconnect switch. The switch was fixed by CMS sysadmins around 9:00 and by 9:30 the database was fully available again. To limit the impact of similar issues in the future, CMS deployed a secondary switch dedicated to the cluster interconnect.
         * On Tuesday 27th September at 14:00 the CMS offline database (CMSR) hung completely following a vendor mistake during the replacement of a broken disk in one of the disk arrays used by the database. Although such a problem should normally be handled transparently by the Oracle ASM software, this time, for a reason that is still not understood, it made the whole system unavailable. We suspect issues with the disk array's controller and plan to drain the disk array and examine it.
         * As a side effect of the CMSR hang on 27th September, one of the tablespaces used by the CMS Dataset Bookkeeping application was put offline by Oracle, making the data stored in it unavailable. The problem was reported by CMS at 20:30 and was fixed within half an hour. Additional monitoring is being deployed to detect such problems more quickly.
   * General:
      * A new procedure has been developed to cross-check the content of the Streams dictionary between source and replica databases. It has been deployed as a weekly database job in each LCG replication environment. This will provide low-level validation of the replication configuration and detect potential problems with data consistency (which we have observed a few times in the past).
   * Site reports:

| *Site* | *Status, recent changes, incidents, ...* | *Planned interventions* |
| BNL | -- Applied RHEL 5 Operating System (OS) kernel security patches.%BR%-- Applied quarterly Oracle Critical Patch Update (CPU) 2011.%BR%-- Updated Oracle Automatic Storage Management (ASM) file system libraries. | -- To apply proposed patch from Oracle (P6011045) in Conditions Database |
| CNAF | Applied latest RHEL 5 OS kernel/updates %BR% Applied July PSU on FTS and LHCb | |
| KIT | | Plans to apply the July 2011 CPU to the ATLAS Conditions DB and LFC/FTS DB around 18-19 October. The intervention will not be transparent for ATLAS, as a short shutdown will be required to fix some issues with the spfile (after the last migration with DG). |
| IN2P3 | | |
| PIC | July CPU patch was applied on 20th and 21st of September on all databases.%BR%The ATLAS database was permanently stopped last week. | None |
| RAL | Incident on CASTOR on the 27th, resolved and root cause under investigation. | None |
| SARA | Nothing to report | No interventions |
| TRIUMF | Nothing to report (not attending today) | Plan to apply July 2011 CPU on Oct 5. |

---++ AOB

-- Main.AndreaSciaba - 28-Sep-2011

Topic revision: r16 - 2011-09-29 - AlessandroCavalli