Week of 051128
Open Actions from last week:
- New LFC sensor to detect current thread usage, and external service availability via CLI tools (James)
- Work out how to do log expiry with log4j (Gavin)
- New version of LCG_MON_GRIDFTP (Maarten)
- Check with ZS what is needed and why for gridview (James)
- PS DB team will do a reboot of the DB for all LFC+FTS on Monday 9.30 AM
DONE
- Test + Deploy new QF for FTS (Gavin)
TESTED
- Test + Deploy new version of LFC (Sophie)
IN TESTING
Chair: Harry
On Call: Sophie + Andrea
Monday:
Log: Castor2 outage from Saturday evening through Sunday. Another possible outage (seen from monitoring) on Monday morning. DMA alarm on FTS node.
New Actions:
- James - rewrite procedure so "standard" sysadmin alarms are handled by sysadmins
Discussion:
- Olof said the problem on Saturday was due to the main LSF batch daemon not being able to communicate with the scheduler. Approx. 80K jobs backed up (mostly stages from LHCb, ~75K). The jobs were reinjected once the system came back online.
- Eric said they had some corruption on the stager DB this morning which might be linked to the outage. They have a manual process to recover from this corruption, which was successful.
- PS DB outage at 9.30 on all LFC and FTS production services.
Tuesday:
Log: Nothing to report
New Actions:
- FTS upgrade to QF tomorrow (Wed)
Discussion:
- Meeting upstairs tomorrow morning
- lcg-mon-gridftp deployed on dpm - waiting for update of alarm before putting it on wan nodes
Wednesday
Log: Nothing
Actions: T.Kleinwort is moving the lxserv function to a new machine so the FTS QF upgrade will wait for that to complete.
Discussion: GRIDVIEW statistics showed no traffic because R-GMA was temporarily stopped pending a security fix. This has now been done.
Thursday
Log: lxshare030d root file system full with logs in /opt.
Actions:
- Following a successful FTS QF upgrade, P.Badino will stop the periodic reboots.
- E.Grancher will move the castor2 stager backup to 08.00.
- O.Barring will warn the service-challenge-tech list of a Monday stoppage for hardware migration of the castor2 stager.
- L.Field will look at the discrepancy between GRIDVIEW reports of CMS traffic and what CMS (and lemon) see.
- There will be an immediate meeting to look at installing a new lcg-mon-gridftp.
- A longer term action is to decide how to avoid logs in /opt/'lcg-application'/var filling the root file system.
Discussion: Another castor stager database redo-log corruption was discovered, triggered by the early morning backup around 03.30. This was recovered without data loss by 06.55 (many thanks to DB team). Hardware is suspected and a new server has been prepared. Initial planning is to move on Monday morning with a 1 hour downtime. As an immediate measure, the backup will be started later. To profit from the downtime, SRM upgrades will be made at the same time.
CMS observe that GRIDVIEW reports file transfer traffic that is too low. James thought R-GMA instabilities were the cause and L.Field will investigate.
The problem of logs filling the lxshare030d root file system (which should not have happened) needs a general solution.
Friday
Log: A replacement disk in the LXFS6051 Elonex oracle server used by fts pilot and voms was not seen by the system. A service stop will be needed.
Actions: Schedule service stop on LXFS6051. Proposal is 09.30 Monday for 90 minutes. The move of the castor2 stager backup to 08.00 is to be done today.
Discussion: Testing of the new lcg-mon-gridftp sensor revealed a bug in the lemon monitoring framework on the IA64 architecture, and the sensor has been rolled back. The bug should be fixed today, in which case we will reschedule the sensor upgrade. Tim Whibley announced that from Monday, for 1 to 2 weeks, the only UPS backup will be in the critical power area.