LHCb Streaming to PIC hung, June 24th - 25th 2010

Description

The Streams propagation process responsible for transferring LHCb experiment transactions to PIC hung unexpectedly on the evening of 24th June. On the morning of 25th June the Streams capture process failed as a side effect, and replication to all Tier1 sites and to LHCb online stopped. Both processes had to be recovered manually.

Impact

  • Replication of LHCb data to PIC was not working from approximately 21:00 on 24th June until 11:00 on 25th June. Replication to the other Tier1 sites and to LHCb online was not working between 10:20 and 11:00 on 25th June.

Time line of the incident

  • 24.06.2010 21:00 - Streams propagation process responsible for transferring data to PIC got stuck.
  • 25.06.2010 09:00 - Investigation started.
  • 25.06.2010 10:20 - Streams capture process failed as a side effect of the propagation hang.
  • 25.06.2010 11:00 - Both the capture and propagation processes were recovered and streaming started processing the backlog.
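
The outage windows quoted under Impact can be cross-checked against the time line above. The following is a minimal Python sketch, not part of the original report; the variable names are illustrative, and the hang time of 21:00 is treated as exact even though the report gives it only approximately.

```python
from datetime import datetime

# Timestamp format used in the time line (DD.MM.YYYY HH:MM).
FMT = "%d.%m.%Y %H:%M"

# Times taken from the time line; 21:00 is approximate in the report.
propagation_hung = datetime.strptime("24.06.2010 21:00", FMT)
capture_failed = datetime.strptime("25.06.2010 10:20", FMT)
recovered = datetime.strptime("25.06.2010 11:00", FMT)

# PIC was affected from the propagation hang until recovery.
pic_outage = recovered - propagation_hung

# The other destinations were affected only once capture failed.
other_outage = recovered - capture_failed

print(f"PIC replication outage: ~{pic_outage}")
print(f"Other Tier1 sites / LHCb online outage: {other_outage}")
```

Under these assumptions, replication to PIC was interrupted for roughly 14 hours and replication to the other destinations for 40 minutes.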

Analysis

  • This is the first time we have encountered this kind of issue.
  • The logs contain nothing that directly indicates the root cause of the hang.
  • Around the time the process hung there was heavy DDL activity in the LHCB_COND schema. It is possible that this DDL activity triggered the issue, but this can be neither proved nor disproved. There are no known issues with similar symptoms.

Follow up

  • If the problem recurs, detailed hang analysis and tracing will be performed.

-- JacekWojcieszuk - 07-Jun-2010

Topic revision: r1 - 2010-06-29 - unknown
 