SCDailyMeeting051128
(2005-12-02,
HarryRenshall
)
---+ Week of 051128

Open Actions from last week:
   * New LFC sensor to detect current thread usage, and external service availability via CLI tools (James)
   * Work out how to do log expiry with log4j (Gavin)
   * New version of LCG_MON_GRIDFTP (Maarten)
   * Check with ZS what is needed and why for gridview (James)
   * PS DB team will do a reboot of the DB for all LFC+FTS on Monday 9.30 AM =DONE=
   * Test + Deploy new QF for FTS (Gavin) =TESTED=
   * Test + Deploy new version of LFC (Sophie) =IN TESTING=

Chair: Harry

On Call: Sophie + Andrea

---++ Monday

Log: Castor2 outage Saturday evening through Sunday. Another possible outage (seen from monitoring) on Monday morning. DMA alarm on FTS node.

New Actions:
   * James - rewrite procedure so "standard" sysadmin alarms are handled by sysadmins

Discussion:
   * Olof said the problem on Saturday was due to the main LSF batch daemon not being able to communicate with the scheduler. Approximately 80K jobs backed up (mostly stages from LHCb, ~75K). Jobs were reinjected into the system after it came back online.
   * Eric said they had some corruption on the stager DB this morning which might be linked to the outage. They have a manual process to recover from this corruption, which was successful.
   * PS DB outage at 9.30 on all LFC and FTS production services.

---++ Tuesday

Log: Nothing to report.

New Actions:
   * FTS upgrade to QF tomorrow (Wed)

Discussion:
   * Meeting upstairs tomorrow morning
   * lcg-mon-gridftp deployed on dpm - waiting for update of alarm before putting it on wan nodes

---++ Wednesday

Log: Nothing.

Actions: T.Kleinwort is moving the lxserv function to a new machine, so the FTS QF upgrade will wait for that to complete.

Discussion: GRIDVIEW statistics showed no traffic due to the temporary stoppage of R-GMA waiting for a security fix. This has now been done.

---++ Thursday

Log: lxshare030d root file system full with logs in /opt.

Actions:
   * Following a successful FTS QF upgrade, P.Badino will stop the periodic reboots.
   * E.Grancher will move the castor2 stager backup to 08.00.
   * O.Barring will warn the service-challenge-tech list of a Monday stoppage for hardware migration of the castor2 stager.
   * L.Field will look at the discrepancy between GRIDVIEW reports of CMS traffic and what CMS (and lemon) see.
   * There will be an immediate meeting to look at installing a new lcg-mon-gridftp.
   * A longer term action is to decide how to avoid logs in /opt/'lcg-application'/var filling the root file system.

Discussion: Another corruption of the castor stager database redo logs was discovered, triggered by the early morning backup around 03.30. This was recovered without data loss by 06.55 (many thanks to DB team). Hardware is suspected and a new server has been prepared. Initial planning is to move on Monday morning with a 1 hour downtime. To help immediately, the backup will be started later. To profit from the downtime, SRM upgrades will be made at the same time.

CMS observe that GRIDVIEW reports file transfer traffic that is too low. James thought R-GMA instabilities were the cause and LF will investigate.

The problem of logs filling the lxshare030d root file system (which should not have happened) needs a general solution.

---++ Friday

Log: A replacement disk in the LXFS6051 Elonex oracle server used by fts pilot and voms was not seen by the system. A service stop will be needed.

Actions:
   * Schedule service stop on LXFS6051. Proposal is 09.30 Monday for 90 minutes.
   * Move castor2 stager backup to 08.00, to be done today.

Discussion: Testing of the new lcg-mon-gridftp sensor revealed a bug in the lemon monitoring framework on the IA64 architecture, and the sensor has been rolled back. The bug should be fixed today, in which case we will reschedule the sensor upgrade.

Tim Whibley announced that from Monday, for 1 to 2 weeks, the only UPS backup will be in the critical power area.
Topic revision: r6 - 2005-12-02
-
HarryRenshall