SWGuideTroubleShootingMore

This page collects and prepares further error messages for SWGuideTroubleShooting. Consider it to be permanently in flux. It is meant for developers rather than users, but you might still find some useful information here.

How to determine which line number your CMSSW code crashed at

If your CMSSW job crashes, the error traceback will usually only tell you which module it crashed in. When developing code, it is therefore a good idea to add the line <Flags CXXFLAGS="-g"/> to your BuildFile (see https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideScram#CmsswSCRAMBuildFlags). This compiles your code with the compiler option "-g". Although this may marginally slow down your code, it ensures that if your job crashes with an error traceback, the traceback will tell you the exact line number where the error occurred.

Note that CMSSW catches some errors and handles them itself. Because this prevents the job from crashing outright, you will not get an error traceback and will only be told which module the error occurred in. To determine the line number in this case, run the job in the debugger:

gdb cmsRun

catch throw

run MyAnalysis_cfg.py

where

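This sequence tells the debugger to stop the program whenever a C++ exception is thrown ("catch throw"), starts the job ("run") and, once execution has stopped, prints the stack trace showing "where" in the code the exception was thrown.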
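The same session can also be run non-interactively, which is handy when the traceback should end up in a log file. The following is only a minimal sketch: the helper names build_gdb_invocation and run_under_gdb are made up for this page, and it assumes that gdb and cmsRun are on your PATH. It relies on the standard gdb options -batch, -x and --args. Note that "catch throw" stops at the first exception thrown, which may be one that the framework would have handled harmlessly.

```python
import os
import subprocess
import tempfile

# The three gdb commands from the interactive session described above.
GDB_COMMANDS = ["catch throw", "run", "where"]


def build_gdb_invocation(config_file, command_file):
    """Return the argv list for a non-interactive gdb run of cmsRun.

    Hypothetical helper for illustration only.
    -batch : exit once the command file has been processed
    -x     : read gdb commands from the given file
    --args : the remaining words are the program and its arguments
    """
    return ["gdb", "-batch", "-x", command_file, "--args", "cmsRun", config_file]


def run_under_gdb(config_file):
    """Write the command file, run gdb and return its combined output."""
    with tempfile.NamedTemporaryFile("w", suffix=".gdb", delete=False) as handle:
        handle.write("\n".join(GDB_COMMANDS) + "\n")
    try:
        result = subprocess.run(
            build_gdb_invocation(config_file, handle.name),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return result.stdout
    finally:
        # Remove the temporary command file whether or not gdb succeeded.
        os.remove(handle.name)
```

Calling run_under_gdb with the name of your configuration file returns everything gdb printed, including the stack trace produced by "where".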

Python Errors in gdb?

When trying to run gdb cmsRun, if you get errors like

Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
'import site' failed; use -v for traceback
----- Begin Fatal Exception 16-Jan-2012 16:48:28 CET-----------------------
An exception of category 'ConfigFileReadError' occurred while
  [0] Processing the python configuration file named cmsRun3.py
Exception Message:
python encountered the error: <type 'exceptions.ImportError'>
No module named os
----- End Fatal Exception -------------------------------------------------

You can get around this by either resetting your shell:

env SHELL=/bin/sh gdb cmsRun

or by looking up the Python installation that scram uses:

scram tool info python

then set $PYTHONHOME to wherever scram thinks it is.

Memory leaks (St9bad_alloc)

A 'bad_alloc' exception means the program has run out of memory. Usually this means your job has a memory leak. The symptom of the problem is an exception thrown with the somewhat obscure string "St9bad_alloc" in the body of the message. There is no cure except "fix your memory leak". The suggested course of action is:

  • Add the following to your cfg:

SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",ignoreTotal = cms.untracked.int32(1) )

  • This will result in a large number of lines like:
           ++++-w MemoryIncrease:  CSCRecHit2DProducer:csc2DRecHits 
               28-May-2007 08:52:48 CEST Run: 1 Event: 1
          Memory increased from VSIZE=763.04MB and RSS=615.809MB to 
                 VSIZE=763.04MB and RSS=615.813MB
    in your log file.
  • You can look at the numbers there to see two things:
    1. if there are particular modules that consistently cause the virtual memory size to be increased
    2. to see if it is a steady increase over the course of the job or if there is a very large increase on a particular event (e.g. the last event before it crashes with the std::bad_alloc error you saw)
  • If it is a steady increase over the course of the job, something is probably leaking memory and you can try to find that with valgrind (see WorkBookOptimizeYourCode). We have not systematically run valgrind in releases before 1_5_0_preX to find leaks that matter if you run very large numbers of events in a job. (And even there we are focusing on leaks that matter if we run the reco, not analysis.)
  • If a very large memory jump happens on the particular event where it crashes, there could be something particular to that event that causes some algorithm to behave badly. In this case, reporting the event number and the cfg file you are using would help an expert to track it down.

For reference, there is an offline page about the "SimpleMemoryCheck" service: SWGuideEDMTimingAndMemory. Useful information can also be found in SWGuideFrameWork#Coding_tools_and_instructions under the guidelines for using pointers.
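Reading through many MemoryIncrease messages by eye is tedious, so the two checks described in the list above can be scripted. The following is a rough sketch only: the function names are invented for this page, and the regular expression assumes that the messages look exactly like the sample shown above, which may not hold for every release.

```python
import re
from collections import defaultdict

# One MemoryIncrease message, as in the sample log excerpt above.
# Assumption: module label, run/event and both memory readings appear
# in this order; whitespace and line breaks between them are tolerated.
RECORD = re.compile(
    r"MemoryIncrease:\s+(?P<module>\S+)"
    r".*?Run:\s*(?P<run>\d+)\s+Event:\s*(?P<event>\d+)"
    r".*?from\s+VSIZE=(?P<v0>[\d.]+)MB\s+and\s+RSS=(?P<r0>[\d.]+)MB\s+to"
    r"\s+VSIZE=(?P<v1>[\d.]+)MB\s+and\s+RSS=(?P<r1>[\d.]+)MB",
    re.DOTALL,
)


def parse_records(log_text):
    """Yield one dict per MemoryIncrease message found in log_text."""
    for match in RECORD.finditer(log_text):
        yield {
            "module": match.group("module"),
            "run": int(match.group("run")),
            "event": int(match.group("event")),
            "vsize_delta": float(match.group("v1")) - float(match.group("v0")),
            "rss_delta": float(match.group("r1")) - float(match.group("r0")),
        }


def summarize(log_text):
    """Return {module: [messages, VSIZE growth in MB, RSS growth in MB]}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for record in parse_records(log_text):
        entry = totals[record["module"]]
        entry[0] += 1
        entry[1] += record["vsize_delta"]
        entry[2] += record["rss_delta"]
    return dict(totals)


def largest_jump(log_text):
    """Return the record with the biggest single VSIZE increase, or None."""
    records = list(parse_records(log_text))
    if not records:
        return None
    return max(records, key=lambda record: record["vsize_delta"])
```

Modules with many messages and a large total in summarize() point to a steady leak, while largest_jump() gives the run and event number to report if a single event is responsible.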

Getting more information from the traceback

There are many options available to obtain better debug output to help track down problems.

Some of these options are briefly described below.

Tracer

The service Tracer helps by identifying what module is called and when. The usage is explained elsewhere in the WorkBook in: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookWriteFrameworkModule#WatcH

Message Logger

The service MessageLogger does logging and provides messages, including warnings and errors. For basic usage, import the MessageLogger service configuration into your configuration file:

from FWCore.MessageService.MessageLogger_cfi import *

In addition, it is strongly recommended (for consistency with the way all services are used) that the configuration file contain at least the line

MessageLogger = cms.Service("MessageLogger")

Configuration options and more usage instructions for the MessageLogger service are documented in:

Memory Checker

The service SimpleMemoryCheck does very basic memory checking (it can sometimes show memory leaks). Some notes on usage can be found in: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideEDMTimingAndMemory

EventContentAnalyzer

The module EventContentAnalyzer dumps all products stored in an event to the screen. Usage of this module is explained elsewhere in the WorkBook in: https://twiki.cern.ch/twiki/bin/view/CMS/WorkBookWriteFrameworkModule#SeE

Include it as a module in your configuration-file:

dump = cms.EDAnalyzer("EventContentAnalyzer")

Then include dump in your path.
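Put together, a configuration that dumps the event content could look roughly like the sketch below. The process name "DUMP" and the input file name are placeholders chosen for this example, and the fragment only runs inside a CMSSW environment (with cmsRun).

```python
import FWCore.ParameterSet.Config as cms

# "DUMP" is an arbitrary process name chosen for this example.
process = cms.Process("DUMP")

# Load the MessageLogger service with its default settings.
process.load("FWCore.MessageService.MessageLogger_cfi")

# Placeholder input file; replace it with a file you can actually read.
process.source = cms.Source(
    "PoolSource",
    fileNames=cms.untracked.vstring("file:input.root"),
)

# One event is usually enough to inspect the list of products.
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1))

# The module that prints all products stored in the event.
process.dump = cms.EDAnalyzer("EventContentAnalyzer")

# Include the module in a path so that it is actually executed.
process.p = cms.Path(process.dump)
```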

Other help

Benedikt Hegner has a script for helping with errors in BuildFiles. The path and filename on lxplus for this script are:

 ~hegner/public/cmsfilt.py 

Problems when reading files from CASTOR

If you suspect that you have trouble accessing data with CASTOR, you can try the following:

  • to check that the file exists: nsls -l /castor/cern.ch/... (complete with the file name)
  • to check the staging status: stager_qry -M /castor/cern.ch/... (it may be that your file is being staged and it may take a while)
  • to check that the file is really available, you can try to copy it locally: rfcp /castor/cern.ch/... /tmp

If the problem persists, you can contact cms.support@cern.ch, specifying the file and the output of the above commands.
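To collect the output of all three checks in one go, for example to paste it into a support request, something like the following sketch could be used. The function names are invented for this page, and it assumes that the CASTOR client tools (nsls, stager_qry, rfcp) are installed on the machine where it runs.

```python
import subprocess


def castor_commands(castor_path, local_dir="/tmp"):
    """Return the three checks from the list above as argv lists."""
    return [
        ["nsls", "-l", castor_path],        # does the file exist?
        ["stager_qry", "-M", castor_path],  # what is its staging status?
        ["rfcp", castor_path, local_dir],   # can it be copied locally?
    ]


def run_command(argv):
    """Run one command and return (exit code, combined stdout and stderr)."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return done.returncode, done.stdout
    except OSError as error:
        # Raised, for example, when the CASTOR client tools are missing.
        return None, str(error)


def castor_report(castor_path, runner=run_command):
    """Run all checks and return their output as a single text report."""
    lines = ["File: " + castor_path]
    for argv in castor_commands(castor_path):
        code, output = runner(argv)
        lines.append("$ " + " ".join(argv))
        lines.append("exit code: {}".format(code))
        lines.append(output.rstrip())
    return "\n".join(lines)
```

Printing castor_report() for the full /castor/cern.ch/... path of your file gives a block of text containing each command, its exit code and its output. The runner argument exists only so that the report logic can be tried out without access to CASTOR.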

Information Sources

Review status

Reviewer/Editor and Date (copy from screen) Comments
SudhirMalik - 13 Oct 2017 update some broken links
BenediktHegner - 16 Mar 2007 more info on memory/etc checking
IanTomalin - 11 Nov 2010 how to get line number of CMSSW crash

Responsible: SudhirMalik
Last reviewed by: SudhirMalik - 20 Jan 2010

-- RogerWolf - 15-Sep-2010

Topic revision: r8 - 2017-10-13 - SudhirMalik
 