---+ WLCG Tier1 Service Coordination Minutes - 2nd December 2010

%TOC%

---++ Attendance

---++ LHC machine - shutdown and 2011 startup plans

Talk postponed.

---++ Security updates

Romain reported on two incidents (details deliberately not given).

---++ GGUS news

Thorsten reported on two problems. On November 16 a SOAP component caused the web services to block. One guess was that it could be due to the simultaneous submission of several tickets to Spain, but an attempt to reproduce this did not cause any problem. The logs did not contain any useful information and the problem is not yet understood. On November 26 the GGUS Oracle database was unavailable for 1.5 hours, due to the move to a new high-availability database running a newer version of Oracle (11). A misconfiguration of this HA cluster was fixed this morning during a short "at risk" downtime.

---++ CERNVM-FS

Apart from what is said in the slides, it was clarified that the stress test foreseen at !RAL will involve all the affected parties, including the central repository at CERN. In general this is still an experimental service because support from CERN is not yet complete (it may become so after January). !RAL will set up a mirror web repository, but the release of software will still happen at CERN. LHCb is starting tests at NIKHEF, where the site will change the environment variable pointing to the software area (NFS or CVMFS) as requested (a minimal sketch of such a switch is given below, after the Release Update section). It should be possible to do the same at !RAL. Joel asked about the plans of other Tier-1 sites: Pierre said that !IN2P3 is interested, but they must give priority to solving their AFS problems. KIT was not available to answer during the meeting. Ian F. said that CMS plans to start using it at Tier-3 sites and possibly to extend it to other sites later. Stephane said that the first ATLAS tests are encouraging (results will be shown tomorrow at an ATLAS meeting). Ron said that SARA has no tests planned but will discuss with NIKHEF.

---++ Release Update

The main point was the new version of the WMS, which allows the use of !VOMS from gLite 3.2. Sites are urged to upgrade ASAP so that all !VOMS servers can be upgraded to 3.2. There are many patches in staged rollout (DPM, LFC, glexec, etc.). A CREAM patch had to be rejected.
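The following is a minimal sketch of the software-area switch described in the CERNVM-FS section above: the site exports an environment variable that points jobs either at the NFS-served software area or at the CVMFS mount. The variable name (VO_LHCB_SW_DIR) and the paths are assumptions used only for illustration; the actual NIKHEF configuration is not recorded in these minutes.

<verbatim>
#!/usr/bin/env python
# Minimal sketch (not the actual NIKHEF setup): a job resolves its software
# area from an environment variable that the site can repoint at either the
# NFS-served area or the CVMFS mount. Variable name and paths are illustrative.
import os

# Hypothetical locations; a real site would export the variable in the job
# environment, e.g. VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch for CVMFS tests.
NFS_AREA = "/nfs/lhcb/software"       # assumed NFS software area
CVMFS_AREA = "/cvmfs/lhcb.cern.ch"    # assumed CVMFS repository mount


def software_area():
    """Return the software area the site has pointed this job at."""
    area = os.environ.get("VO_LHCB_SW_DIR")
    if area and os.path.isdir(area):
        return area
    # Fall back to whichever area is actually mounted on this worker node.
    for candidate in (CVMFS_AREA, NFS_AREA):
        if os.path.isdir(candidate):
            return candidate
    raise RuntimeError("no software area available on this worker node")


if __name__ == "__main__":
    print("Using software area: %s" % software_area())
</verbatim>

With this kind of indirection the site only needs to change the exported variable to move jobs between the NFS and CVMFS software areas.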
---+++ WLCG Baseline Versions

   * Release report: [[https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status][deployment status wiki page]]
   * WLCG Baseline versions: [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][table]]

---++ Data Management & Other Tier1 Service Issues

| *Site* | *Status* | *Recent changes* | *Planned changes* |
| !CERN | CASTOR 2.1.9-8 (ATLAS)<br />CASTOR 2.1.9-9 (ALICE, CMS and LHCb)<br />SRM 2.9-4 (all)<br />xrootd 2.1.9-7 | | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver)<br />CASTOR 2.1.8-14 (tapeserver)<br />SRM 2.8-2 | 29/11: network [[https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=64334][maintenance]], storage services stopped | None |
| BNL | dCache 1.9.4-3 (PNFS) | None | None |
| CNAF | !StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-23 (PNFS)<br />Scalla xrootd 2.9.1/1.4.2-4 | None | None |
| !IN2P3 | dCache 1.9.5-22 (Chimera) | | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera)<br />dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera)<br />dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-23 (PNFS) | | |
| !RAL | CASTOR 2.1.7-27 and 2.1.9-6 (stagers)<br />2.1.9-1 (tape servers)<br />SRM 2.8-2 and SRM 2.8-6 | Added 2 new SRM backends for ATLAS | ATLAS upgrade to 2.1.9-6 on 6-8/12/10 |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |

---+++ Other site news

The FTS channels to TW-FTT were created at all relevant sites.

---+++ CASTOR news

---++++ CERN operations

There will be a deployment campaign in January. The team is currently busy with closing the new release and with testing and planning. %RED%[ACTION]%ENDCOLOR% It would be good to have information from the experiments about the low and high points of activity foreseen for January.

---++++ Development

No significant news.

---+++ xrootd news

---+++ dCache news

No significant news.

---+++ !StoRM news

---+++ FTS news

FTS 2.2.5 is still in certification.

---+++ DPM news

No significant news.

---+++ LFC news

No significant news.

---++++ LFC deployment

| *Site* | *Version* | *OS, n-bit* | *Backend* | *Upgrade plans* |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | Testing ongoing, upgrade by the end of the year |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 postponed to January |
| [[https://twiki.cern.ch/twiki/bin/view/PESgroup/PesGridServicesSfwLevels][CERN]] | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in November |
| FNAL | N/A | | | Not deployed at Fermilab |
| !IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| !RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | !MySQL | |

%RED%[NOTE]:%ENDCOLOR% BNL and CNAF should upgrade to 1.8.0 because of the !VOMS library memory leaks in 1.7.4.

---+++ Experiment issues

Simone reviewed the issues ATLAS has experienced with dCache at !IN2P3. Pierre explained that the suggestion from dCache that the problems could be related to using Solaris was actually wrong (it mistakenly referred to another problem). There is no real evidence that the problems are a consequence of the dCache upgrade and they still need to be understood.
Jon reported, as something potentially interesting for all dCache sites, that FNAL had major process scheduling problems with the kernel shipped with SL5 and solved them by using the latest available kernel. The dCache developers were not involved and it would be useful to make them aware of FNAL's findings.

---++ BDII deployment plan

Some points were discussed during the talk; the highlights follow. The !MoU prescriptions for "other services" (like the BDII) require 98% availability at prime hours and 97% otherwise. Published data should be no more than 15 minutes old (1 hour was considered too old). It was clarified that the quality of service of the top-level BDII at CERN should be no lower than at Tier-1 sites and that best-effort support does not imply a lower quality of service. Finally it was stressed that best practices and requirements should be clearly separated in the document (the requirements must be associated with specific metrics).

| *Site* | *Plan* |
| NL-T1 | There are in total more than 5 top-level BDIIs at NL-T1. LCG_GFAL_INFOSYS at both SARA and NIKHEF lists three top-level BDIIs: at NIKHEF, two NIKHEF BDIIs and one SARA BDII; at SARA, two SARA BDIIs and one NIKHEF BDII. |
| US ATLAS-T1 | Working with OSG on the deployment of a resilient and performant top-level BDII infrastructure in the US |

---++ Status of open GGUS tickets

---++ Review of recent / open SIRs and other open service issues

---++ Conditions Data Access and related services

Dave reported an overload of an ATLAS Frontier server (and its database); the database server had to be rebooted. Alessandro offered a possible explanation: the software used in the reprocessing campaign had a [[https://savannah.cern.ch/bugs/index.php?75508][bug]] and jobs were repeatedly connecting to the database directly instead of going through Frontier.

---++ Experiment Database Service Issues

   * Experiment reports:
      * ALICE:
         * Nothing to report
      * ATLAS:
         * The ATLAS offline database suffered four instance reboots this week. Instance 4 rebooted on 28.11, 30.11 and 02.12 (morning, around 4 AM) and instance 3 rebooted on 02.12 around 11:30 AM. Initially a high load caused by the COOL application was suspected as the root cause, but corresponding I/O errors and spikes of physical writes were observed on 02.12, which points to disk or hardware related problems. The DBAs are currently working to understand the root cause and provide a fix as soon as possible.
      * CMS:
         * On Wednesday morning (1st Dec) CMS PVSS streaming aborted once again for 30 minutes while executing modifications (adding new table partitions for 2011) on one of the replicated tables. In fact all the changes were already there, applied manually by a user job; that caused a dictionary inconsistency and an abort of the apply process. The colliding changes were marked to be skipped and the apply process was restarted.
         * On Thursday (2nd Dec) CMS PVSS aborted several times due to missing tablespaces on the offline database - they had not been created together with the corresponding tablespaces on the online database. All related Streams errors were solved manually by creating the proper tablespaces on the offline database.
      * LHCb:
         * Nothing to report
   * Site reports:

| *Site* | *Status, recent changes, incidents, ...* | *Planned interventions* |
| ASGC | Nothing to report | None |
| BNL | Validations for new hardware%BR%Working on improvements for weekly reports | None |
| CNAF | Nothing to report | None |
| KIT | Nothing to report | None |
| IN2P3 | Nothing to report | None |
| NDGF | Nothing to report | None |
| PIC | Nothing to report | None |
| RAL | Nothing to report | None |
| SARA | Nothing to report | Next Tuesday: migration to the cluster |
| TRIUMF | The database was not accessible during the last weekend: the number of sessions was exceeded because the resource_limit parameter was set to FALSE, so profile limits were not enforced | None |

---++ Dates & topics for future meetings

---++ AOB

-- Main.JamieShiers - 23-Nov-2010