Week of 051128

Open Actions from last week:
  • New LFC sensor to detect current thread usage, and external service availability via CLI tools (James)
  • Work out how to do log expiry with log4j (Gavin)
  • New version of LCG_MON_GRIDFTP (Maarten)
  • Check with ZS what is needed and why for gridview (James)
  • PS DB team will do a reboot of the DB for all LFC+FTS on Monday 9.30 AM
  • Test + Deploy new QF for FTS (Gavin)
  • Test + Deploy new version of LFC (Sophie)

Chair: Harry

On Call: Sophie + Andrea

Monday:

Log: Castor2 outage Saturday evening through Sunday. Another possible outage (seen from monitoring) on Monday morning. DMA alarm on FTS node.

New Actions:

  • James - rewrite procedure so "standard" sysadmin alarms are handled by sysadmins

Discussion:

  • Olof said the problem on Saturday was due to the main LSF batch daemon not being able to communicate with the scheduler. Approximately 80K jobs backed up (mostly stages from LHCb, ~75K). The jobs were reinjected into the system once it came back online.
  • Eric said they had some corruption on the stager DB this morning which might be linked to the outage. They have a manual process to recover from this corruption, which was successful.
  • PS DB outage at 9.30 on all LFC and FTS production services.

Tuesday:

Log:

New Actions:

Discussion:

Wednesday:

Log:

Actions:

Discussion:

Thursday:

Log:

Actions:

Discussion:

Friday:

Log:

Actions:

Discussion:

Topic revision: r2 - 2005-11-28 - unknown
 