-- StuartPaterson - 28-Apr-2010

Log File Analysis (to be released in DIRAC v5r4)

Overview

The motivation for rewriting the original AnalyseLogFile module was to overcome issues with maintainability, ease of use and clarity. With the introduction of FSRs in production the code required a major reorganisation to support this new feature.

The functionality of the analyse log file module includes:

  • checking for core dumps from an application step
  • determining success or failure of an application
  • uploading the intermediate outputs of a failed job to a debugging SE
  • propagating several counters for successful jobs to be used in bookkeeping reports

Originally the workflow module attempted to do all of the above itself, which made it hard to maintain. In this latest version, determining the success or failure of LHCb applications is delegated to an LHCbDIRAC Core utility, described below. Decoupling the LHCb conventions from the workflow module has the added advantage that log file analysis can be repeated from the command line, using any log file as input.
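As an illustration of the first check in the list above, the sketch below looks for core dump files left behind by an application step. It is only a sketch under stated assumptions: the function name find_core_dumps and the file name pattern "core.*" are hypothetical and are not taken from the AnalyseLogFile module.

```python
import glob
import os


def find_core_dumps(step_dir, pattern="core.*"):
    """Return the sorted paths of regular files in step_dir matching pattern.

    Both the function name and the default "core.*" pattern are
    illustrative assumptions, not the convention used by the real module.
    """
    candidates = glob.glob(os.path.join(step_dir, pattern))
    return sorted(path for path in candidates if os.path.isfile(path))
```

A non-empty result would then be treated as a sign that the application step crashed.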

ProductionLogAnalysis Module

The ProductionLogAnalysis module is a utility to simplify the maintenance of log file analysis. Its primary client is AnalyseLogFile, but the aim was also to provide a standalone script for checking the sanity of log files outside of workflows. As described below, the application-specific analysis of production log files can now be performed on the command line.

Debugging Logs Outside of Jobs ( dirac-lhcb-analyse-log-file )

To repeat the analysis of a specific log file, use the dirac-lhcb-analyse-log-file script. Its help output shows the possible arguments:

dirac-lhcb-analyse-log-file --help

2010-04-28 09:17:04 UTC Framework  INFO: Usage:
2010-04-28 09:17:04 UTC Framework  INFO: scripts/dirac-lhcb-analyse-log-file.py (<options>|<cfgFile>)*
2010-04-28 09:17:04 UTC Framework  INFO: Options:
2010-04-28 09:17:04 UTC Framework  INFO: -o:  --option=  :  Option=value to add
2010-04-28 09:17:04 UTC Framework  INFO: -s:  --section=  :  Set base section for relative parsed options
2010-04-28 09:17:04 UTC Framework  INFO: -c:  --cert=  :  Use server certificate to connect to Core Services
2010-04-28 09:17:04 UTC Framework  INFO: -h  --help  :  Shows this help
2010-04-28 09:17:04 UTC Framework  INFO: -f:  --LogFile=  :  Path to log file you wish to analyse
2010-04-28 09:17:04 UTC Framework  INFO: -p:  --Project=  :  Optional: project name (will be guessed if not specified)

That is, the path to a log file and, optionally, a project name can be given. If the project name is not specified it is 'guessed'. Because the guess depends on what the applications print to their logs, which can change, it can be overridden with the -p option.
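The interplay between the two options can be sketched as follows. This is a minimal illustration, not the script's actual code: it uses the standard argparse module rather than the DIRAC option parser, the helper names are hypothetical, the list of project names is an assumption (only Brunel appears on this page), and counting name occurrences is just one possible way of guessing.

```python
import argparse

# Assumed list of project names; only Brunel appears in this page.
KNOWN_PROJECTS = ("Brunel", "DaVinci", "Gauss", "Boole", "Moore")


def guess_project(log_text, known=KNOWN_PROJECTS):
    """Return the known project name mentioned most often, or None."""
    counts = {name: log_text.count(name) for name in known}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def parse_args(argv):
    """Parse the two script-specific options shown in the help output."""
    parser = argparse.ArgumentParser(description="Log file analysis (sketch)")
    parser.add_argument("-f", "--LogFile", required=True,
                        help="Path to log file you wish to analyse")
    parser.add_argument("-p", "--Project", default=None,
                        help="Optional: project name (guessed if not specified)")
    return parser.parse_args(argv)


def resolve_project(args, log_text):
    """An explicit -p value always wins over the guess."""
    return args.Project if args.Project else guess_project(log_text)
```

Giving the explicit option priority means a change in the application's log printing can never silently override what the user asked for.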

Real Operational Example

During the course of normal production operations a new type of crash was observed and reported to the ELOG. For this example the log file is attached below and linked here.

Shifters can now reproduce the problem from the log file alone, without rerunning the job:

dirac-lhcb-analyse-log-file.py -f Brunel_00006234_00000166_1.log

2010-04-28 09:32:07 UTC dirac-lhcb-analyse-log-file.py/ProductionLogAnalysis  INFO: Attempting to open log file: Brunel_00006234_00000166_1.log
2010-04-28 09:32:07 UTC dirac-lhcb-analyse-log-file.py/ProductionLogAnalysis  INFO: Check application ended successfully e.g. searching for "Application Manager Finalized successfully"
2010-04-28 09:32:07 UTC dirac-lhcb-analyse-log-file.py/ProductionLogAnalysis  INFO: Checking for "Terminating event processing loop due to errors" meaning job would fail with "Event Loop Not Terminated"
2010-04-28 09:32:08 UTC dirac-lhcb-analyse-log-file.py/ProductionLogAnalysis ERROR: Found error in log file => "Terminating event processing loop due to errors"
2010-04-28 09:32:08 UTC dirac-lhcb-analyse-log-file.py/ProductionLogAnalysis  INFO: Determined last file before crash to be: /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/69669/069669_0000000011.raw => ApplicationCrash
2010-04-28 09:32:08 UTC dirac-lhcb-analyse-log-file.py ERROR: Problem found with log file Brunel_00006234_00000166_1.log: "Event Loop Not Terminated"
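The string checks reported in the output above can be sketched as below. The two search strings and the "Event Loop Not Terminated" message are taken from that output. Everything else is an assumption for illustration: the function name, the order of the checks, the (ok, message) return value and the "Finalization Error" message are hypothetical and do not describe the real ProductionLogAnalysis code.

```python
# Search string quoted in the log output above.
SUCCESS_STRING = "Application Manager Finalized successfully"

# Known failure string mapped to the reported reason, both quoted above.
FAILURE_STRINGS = {
    "Terminating event processing loop due to errors":
        "Event Loop Not Terminated",
}


def analyse_log(log_text):
    """Return (ok, message) for the contents of one application log."""
    # Assumed order: known failure strings are checked first.
    for needle, reason in FAILURE_STRINGS.items():
        if needle in log_text:
            return False, reason
    if SUCCESS_STRING not in log_text:
        # Hypothetical message for a log that never reports finalisation.
        return False, "Finalization Error"
    return True, "OK"
```

Keeping the strings in module-level tables means a new type of crash, such as the one in this example, can be supported by adding one entry rather than by changing the workflow module.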

Future Developments

The inclusion of another utility to interpret FSRs and make further decisions on the success / failure of jobs is imminent. With the new structure of the workflow module it will be far easier to plug this in. Whether or not this will lead to the retirement of the ProductionLogAnalysis module remains to be seen but there is no harm in both approaches coexisting.

Topic revision: r4 - 2010-11-24 - FedericoStagni
 