Policy for handling non-fatal errors

This document describes how non-fatal errors in cmsRun jobs should be handled by the code that encounters them. The first part describes the goal and the final policy; the second part describes the steps and actions needed to enforce this policy and to deploy the required components in the CMSSW 3XX release series.

Non-fatal errors

Unexpected conditions in CMSSW code are not necessarily fatal for the whole processing. An error should be fatal only if the processing of the rest of the event, or of the following events, is most likely completely compromised.

If part of the event processing, or at least the processing of the following events, can continue, the error must not be fatal (i.e. no exceptions, no abort, no assert). The code that encounters the error should act as follows:

  • Issue a LogError or LogWarning (see below for when to use which)
  • Try to process whatever is possible (e.g. in FED buffer unpacking, if one buffer is corrupted, still try to unpack the others)
  • Return whatever is possible to the calling code (an incomplete collection, an empty collection, or an "invalid" object if the returned object has such a flag)
  • (If the code is an EDProducer) put the collection in the event even if it is empty (this is important in order to keep the output file readable in later jobs)
  • Check that no leaks are introduced by early termination of the failing routine, e.g. avoid things like:
  MyObj* a = new MyObj(123);
  bool ok = checkAll();
  if (!ok) return false;  // leak: a is never deleted on this path
  bool result = a->something();
  delete a;
  return result;
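
The checklist above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not CMSSW code: `logWarning` is a hypothetical stand-in for `edm::LogWarning` so that the example compiles outside the framework, and `RawBuffer`, `Digi`, `Unpacker`, `unpackAll` and the category name `CorruptedFEDBuffer` are invented for the example. The helper object is held in a `std::unique_ptr`, so it is released on every return path, unlike the bare `new`/`delete` pair shown above.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for edm::LogWarning, so the sketch builds outside CMSSW.
void logWarning(const std::string& category, const std::string& message) {
  std::cerr << "Warning [" << category << "] " << message << '\n';
}

// Hypothetical payload types, not taken from CMSSW.
struct RawBuffer {
  int fedId;
  bool corrupted;
};
struct Digi {
  int fedId;
};

// Hypothetical helper that would otherwise be allocated with a bare `new`.
struct Unpacker {
  // Appends one Digi for a readable buffer; returns false for a corrupted one.
  bool unpack(const RawBuffer& buffer, std::vector<Digi>& out) const {
    if (buffer.corrupted) return false;
    out.push_back(Digi{buffer.fedId});
    return true;
  }
};

// Unpacks every readable buffer and always returns a collection,
// possibly incomplete or empty, instead of throwing or aborting.
std::vector<Digi> unpackAll(const std::vector<RawBuffer>& buffers) {
  std::vector<Digi> digis;
  // unique_ptr frees the Unpacker on every return path, early or not.
  auto unpacker = std::make_unique<Unpacker>();
  for (const RawBuffer& buffer : buffers) {
    if (!unpacker->unpack(buffer, digis)) {
      // Report the problem, then carry on with the remaining buffers.
      logWarning("CorruptedFEDBuffer",
                 "skipping FED " + std::to_string(buffer.fedId));
    }
  }
  return digis;
}
```

In an EDProducer the returned collection would then be put in the event even when it is empty. Here a plain stack object would avoid the allocation altogether; the smart pointer is kept only to mirror the heap allocation in the snippet above.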

LogWarning and LogError

Both LogWarnings and LogErrors will be stored persistently in the event. The actual text of the message is not stored, but the following information is made persistent:
  • Category (currently the string passed to the constructor, as in LogError("myCategory"))
  • Module name/instance producing the error (so the Category should not repeat this information)
  • Severity, i.e. Error vs Warning
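
As a rough picture of what survives in the event, the three items above could be held in a record like the one below. This is a sketch only: `PersistentLogRecord`, `Severity` and `summarize` are hypothetical names chosen for the example and do not describe the actual CMSSW data format. The point it illustrates is that the message text is dropped while category, module and severity are kept.

```cpp
#include <string>

// Severity levels used in this sketch (hypothetical, not the CMSSW enum).
enum class Severity { Info, Warning, Error };

// Hypothetical per-message record; the message text is deliberately absent.
struct PersistentLogRecord {
  std::string category;  // e.g. "TooManyHitsNoSeedsProduced"
  std::string module;    // module name/instance that issued the message
  Severity severity;     // Error vs Warning
};

// Builds the persistent record; the free-form text is not stored.
PersistentLogRecord summarize(const std::string& category,
                              const std::string& module,
                              Severity severity,
                              const std::string& messageText) {
  (void)messageText;  // intentionally discarded
  return PersistentLogRecord{category, module, severity};
}
```

Because the module is stored separately, repeating it inside the category string would only duplicate information.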

Severity
The severity determines how the problem is followed up if it appears during data taking:
  • LogError: always followed up. If it turns out that we have to "live with it" (e.g. the problem originates from a hardware failure that is going to stay), it should be downgraded in the code to LogWarning; otherwise the run/lumi will be marked as bad in DataCertification
  • LogWarning: monitored via DQM with thresholds on the rate. If the rate exceeds the threshold, the warnings are followed up like errors. The follow-up may end in a revised threshold or in the lumi/run being marked as bad in DataCertification
  • Lower severities: the other "LogXXX" functions can be used to report conditions that are not a symptom of a problem and do not spoil the processing of the event. These LogXXX messages are neither monitored nor persistently stored.
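
The follow-up rules above amount to a small decision function. The sketch below is one possible reading of them, with assumptions: `FollowUp`, `decideFollowUp` and the idea of expressing the rate per lumi section are hypothetical, and the real DQM monitoring is not modelled.

```cpp
// Severity levels used in this sketch (hypothetical, not the CMSSW enum).
enum class Severity { Info, Warning, Error };

// Outcome of the monitoring step (hypothetical).
enum class FollowUp { None, Investigate };

// Errors are always followed up; warnings only when their rate exceeds the
// threshold; lower severities are not monitored at all.
FollowUp decideFollowUp(Severity severity, double ratePerLumi,
                        double warningThreshold) {
  switch (severity) {
    case Severity::Error:
      return FollowUp::Investigate;
    case Severity::Warning:
      return ratePerLumi > warningThreshold ? FollowUp::Investigate
                                            : FollowUp::None;
    default:
      return FollowUp::None;
  }
}
```

An investigation can then end with a code change (downgrade to LogWarning), a revised threshold, or a bad flag in DataCertification; none of these outcomes is decided by the function itself.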

Category
The Category should contain enough information to determine the consequences downstream (i.e. in dependent code/producers or in physics analysis).

Only a well-defined set of categories is allowed. Each error category should be documented in a central place, with a detailed explanation of the possible consequences for physics analysis. To further enforce this, it makes sense to change the category from a string to an enum-like type, with the enumerators defined only in a central location.
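
A centrally defined enum-like category could look like the sketch below. It is an illustration only: `ErrorCategory` and `consequence` are hypothetical names, `CorruptedFEDBuffer` is an invented category, and the consequence strings are placeholders rather than agreed documentation. Only `TooManyHitsNoSeedsProduced` is a name that appears in this document.

```cpp
#include <string>

// Hypothetical central list of allowed categories. With an enum, a category
// that is not listed here cannot be used at all.
enum class ErrorCategory {
  TooManyHitsNoSeedsProduced,
  CorruptedFEDBuffer,
};

// Central documentation of each category's consequence for physics analysis.
// The texts are placeholders for illustration.
std::string consequence(ErrorCategory category) {
  switch (category) {
    case ErrorCategory::TooManyHitsNoSeedsProduced:
      return "no seeds produced for this event";
    case ErrorCategory::CorruptedFEDBuffer:
      return "digis from the affected FED are missing";
  }
  return "undocumented category";
}
```

Keeping the enumerators and their documentation in the same place means that adding a category requires touching the central file, which is where the review of new categories would naturally happen.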

Usage in Analysis

The analysis will access this information in two ways:
  • DataCertification may have flagged a lumi/run as bad
  • A single event can carry some Categories of problems even if its run/lumi was marked as GOOD overall

For analysis the relevant point is the Category of the problem rather than its severity (the severity serves as an alarm that triggers investigation when the problem occurs). The same Category can have different severities in different reprocessings (e.g. if a problem is known and accepted as unavoidable).
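
On the analysis side, a per-event selection based on categories might be sketched as follows. The names `EventLogRecord` and `eventUsable` are hypothetical, and how the records are actually read from the event is not shown. The severity is carried in the record but deliberately ignored by the selection, since it may change between reprocessings.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical view of one persistent record attached to an event.
struct EventLogRecord {
  std::string category;
  std::string module;
  bool isError;  // severity: present, but not used by the selection below
};

// An event is rejected if any of its records belongs to a category that the
// analysis has chosen to veto, whatever the severity of that record.
bool eventUsable(const std::vector<EventLogRecord>& records,
                 const std::set<std::string>& vetoCategories) {
  for (const EventLogRecord& record : records) {
    if (vetoCategories.count(record.category) > 0) return false;
  }
  return true;
}
```

Each analysis would choose its own veto set from the central category documentation, depending on which consequences matter for its physics objects.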

Deployment plans

Phase I (can be deployed in 33X, even in "minor versions")
  • Review all existing LogError/LogWarning calls
  • Reduce the number of categories
  • Remove the module/instance/class name from the category where used, and replace it with a name that summarizes the problem or its consequences (e.g. "TooManyHitsNoSeedsProduced")
  • Move all known/expected detector problems to "Warning"
  • Collect from each subsystem a list of its Warnings/Errors and call for their documentation

Phase II (should be scheduled for a major release, perhaps 340?)
  • Move the category "strings" to an enum
  • Complete the documentation of all categories
  • Review the need for additional severity levels besides Error/Warning

-- AndreaRizzi - 2009-08-24

Topic revision: r2 - 2009-09-04 - KatiLassilaPerini