Motivation

The number of attempts defined for a task is not suitable for all types of errors, and ADC would like a way to treat particular error codes specially. For example:
  • some errors may be considered final and retrying the job will not help
  • other errors may be considered worth retrying only a limited number of times

In other cases, we might want to define a post-failure action, e.g. increasing the memory limit of a failed job.

This is why we want to implement a central table where Ops and GDP can record known problematic error codes and define the corresponding actions.

Database structure

Actions table

This table defines the existing post-error actions:

  • no_retry
  • limit_retry
  • increase_memory

This table is managed by the PanDA developers, since adding a new action requires a code update in the PanDA server. The initial state could be:

| ID | Name | Description | Active |
| 1 | no_retry | This action will prevent PanDA server from retrying the job again. It is considered a final error. | Y |
| 2 | increase_memory | Job ran out of memory. Increase memory setting for next retry. | Y |
| 3 | limit_retry | Set the number of max retries. | Y |

The Active column can be set to 'Y' or 'N'; in the latter case the action will not be effective and will only log a debug message.
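
The effect of the Active column can be sketched as follows. This is only an illustration, not the PanDA server code: the `ACTIONS` dictionary, the `should_apply` function and the logger name are hypothetical, and action 3 is switched off here purely to show the log-only behaviour.

```python
import logging

logger = logging.getLogger("retrial_sketch")  # hypothetical logger name

# Hypothetical in-memory copy of the Actions table (ID -> name and Active flag).
ACTIONS = {
    1: {"name": "no_retry", "active": "Y"},
    2: {"name": "increase_memory", "active": "Y"},
    3: {"name": "limit_retry", "active": "N"},  # switched off for illustration
}


def should_apply(action_id, actions=ACTIONS):
    """Return True if the action is active; otherwise only log a debug message."""
    action = actions[action_id]
    if action["active"] == "Y":
        return True
    logger.debug("Action %s is inactive, it is only logged", action["name"])
    return False
```

With this table, `should_apply(1)` lets the action run, while `should_apply(3)` only writes the debug message.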

Warning: the logging mechanism still needs to be defined. It was requested to send a message to TaskLogger/JEDILogger/PanDAmon/JIRA.

Error codes

The error table defines particular error codes and links them to the actions table. This table should be managed by Ops/GDP through the ProdSys2 interface. An error rule can apply to all jobs, or you can narrow its scope by specifying cmtconfig and/or release and/or work queue ID. Some actions require parameters; in these cases the specific parameters have to be added in the Parameters column following an agreed notation.

| RetryError_ID | ErrorSource | ErrorCode | RetryAction | Parameters | Architecture | Release | WorkQueue_ID | Description | Expiration_Date | Active | ErrorDiag |
| 1 | pilotErrorCode | 345 | 1 | | x86_64-slc5-gcc43-opt | Atlas-17.2.5 | 1 | Description, link to JIRA,... | 1 Jun 2016 | Y | .*aaa.* |
| 2 | pilotErrorCode | 212 | 2 | maxAttempts=5 | | | | | | Y | |
| 3 | pilotErrorCode | 2000 | 3 | | | Atlas-17.2.5 | 1 | Description, link to JIRA,... | 1 Jun 2016 | Y | |

RetryError_ID: Incremental number taken from a sequence.
ErrorSource and ErrorCode: Must match an error source and error code known to PanDA.
RetryAction: Link to the action table.
Parameters: Must follow the key1=value1&key2=value2&key3=value3 pattern. Keys are specific to the action, e.g. the limit_retry action needs the parameter maxAttempts=5.
Architecture: Optional field to narrow down the scope of a rule. String has to match exactly the value from column cmtconfig from ATLAS_PANDA.jobsactive4.
Release: Optional field to narrow down the scope of a rule. String has to match exactly the value from column ATLASRelease from ATLAS_PANDA.jobsactive4.
WorkQueue_ID: Optional field to narrow down the scope of a rule. ID has to match the number from column WorkQueue_ID, which links to table ATLAS_PANDA.JEDI_Work_Queue.
Description: Optional free text with description or link to e.g. JIRA.
Expiration_Date: If set, error rule will not be active afterwards.
Active: Can be Y or N. If set to N, the rule will only log. This is a more fine-grained handle than the same column at the Action level.
ErrorDiag: Regexp to be used in case the ErrorSource+ErrorCode are too generic and we want to specify a particular error message.
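
The field descriptions above can be sketched as a parameter parser and a rule check. This is a minimal illustration under stated assumptions, not the PanDA implementation: the function names, the dictionary layout of rules and jobs, and the job keys (`error_source`, `error_diag`, ...) are hypothetical. It also assumes that a rule is still valid on its expiration date and that ErrorDiag is applied with `re.search`.

```python
import re
from datetime import date


def parse_parameters(raw):
    """Split 'key1=value1&key2=value2' into a dict; None or '' gives {}.

    A malformed entry without '=' raises ValueError.
    """
    if not raw:
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("&"))


def rule_applies(rule, job, today):
    """Check whether an error rule (dict keyed by column name) applies to a failed job."""
    if rule["Active"] != "Y":
        return False  # an inactive rule would only be logged
    if rule["Expiration_Date"] is not None and today > rule["Expiration_Date"]:
        return False  # assumption: the rule is still valid on the expiration date itself
    if (rule["ErrorSource"], rule["ErrorCode"]) != (job["error_source"], job["error_code"]):
        return False
    # Optional scope columns: an unset (None) column matches any job.
    for column, job_key in (("Architecture", "cmtconfig"),
                            ("Release", "release"),
                            ("WorkQueue_ID", "workqueue_id")):
        if rule[column] is not None and rule[column] != job[job_key]:
            return False
    # Optional regexp on the error message (use of re.search is an assumption).
    if rule["ErrorDiag"] and not re.search(rule["ErrorDiag"], job["error_diag"]):
        return False
    return True


# First row of the example table above, and a hypothetical failed job.
EXAMPLE_RULE = {
    "ErrorSource": "pilotErrorCode", "ErrorCode": 345, "RetryAction": 1,
    "Parameters": None, "Architecture": "x86_64-slc5-gcc43-opt",
    "Release": "Atlas-17.2.5", "WorkQueue_ID": 1,
    "Expiration_Date": date(2016, 6, 1), "Active": "Y", "ErrorDiag": ".*aaa.*",
}
EXAMPLE_JOB = {
    "error_source": "pilotErrorCode", "error_code": 345, "error_diag": "xxx aaa yyy",
    "cmtconfig": "x86_64-slc5-gcc43-opt", "release": "Atlas-17.2.5", "workqueue_id": 1,
}
```

For instance, `rule_applies(EXAMPLE_RULE, EXAMPLE_JOB, date(2016, 1, 1))` holds, while the same call with a date after 1 Jun 2016 or with a different release does not.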

Example of entering a new error rule in the DB:

-- Add an inactive (log-only) rule for exeErrorCode 65, linked to retry action 3.
INSERT INTO ATLAS_PANDA.RETRYERRORS (RETRYERROR_ID, ERRORSOURCE, ERRORCODE, RETRYACTION, PARAMETERS, ARCHITECTURE, RELEASE, WORKQUEUE_ID, DESCRIPTION, EXPIRATION_DATE, ACTIVE, ERRORDIAG)
VALUES (ATLAS_PANDA.RETRYERRORS_ID_SEQ.nextval, 'exeErrorCode', 65, 3, 'maxAttempt=2', NULL, NULL, NULL, NULL, NULL, 'N', '.*IncludeError: include file .* can not be found.*');
-- Check the inserted row, then uncomment and run exactly one of the following:
--rollback;
--commit;

Retrial module engine

Hooks

Hooks have to be inserted in the PanDA server to call the retrial module. At the moment, the PanDA server will have a hook in the updateJob function, which is the function the pilot calls to update the status of a job. If the pilot reports a failed job, the first step is to call the retrial module. The actions in the error tables take precedence over settings defined elsewhere: for example, if a task defines a maximum of 20 attempts but a matching error rule defines a maximum of 5, the PanDA server will apply the limit of 5.
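
The precedence described above can be sketched as follows. The function and argument names are hypothetical, and the comparison `attempt_nr < limit` is an assumption about how attempts are counted, not taken from the PanDA server.

```python
def effective_max_attempts(task_max_attempts, rule_max_attempts=None):
    """A limit from a matching error rule, when present, overrides the task-level one."""
    if rule_max_attempts is not None:
        return rule_max_attempts
    return task_max_attempts


def may_retry(attempt_nr, task_max_attempts, rule_max_attempts=None, no_retry=False):
    """Decide whether a failed job may be retried (hypothetical helper)."""
    if no_retry:  # a matching no_retry rule marks the error as final
        return False
    return attempt_nr < effective_max_attempts(task_max_attempts, rule_max_attempts)
```

With a task-level limit of 20 and a rule-level limit of 5, `effective_max_attempts(20, 5)` gives 5, so a job that has already used 5 attempts is not retried.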

In the future we might have to add other hooks to handle errors that occur for example in PanDA server while interacting with DDM.

Caching

The PanDA server caches the DB tables in memory, since the tables should be small and rather static:
  • the cache is refreshed every hour or at restart, so a change may take up to an hour to become effective
  • this only works as long as the tables do not grow too much, so keep them tidy
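
A time-based cache of this kind could look like the sketch below. The class name, the `loader` callable and the injectable `clock` are hypothetical and only illustrate the hourly refresh; a restart corresponds to creating a new, empty cache.

```python
import time


class TableCache:
    """Keep a loaded copy of the tables and reload it once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds=3600, clock=time.monotonic):
        self._loader = loader      # callable that reads the tables from the DB
        self._ttl = ttl_seconds    # one hour by default
        self._clock = clock        # injectable so the refresh can be tested
        self._data = None
        self._loaded_at = None

    def get(self):
        """Return the cached tables, reloading them if the copy is missing or stale."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data
```

A caller would wrap its DB query in `loader` and call `get()` on every job update; the DB is only contacted on the first call and then at most once per hour.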

Handling inconsistent rules

The retrial module will use the narrowest rule, i.e. the one with the most of the columns cmtconfig, release and work queue ID defined. In case of a tie, it will take the one with the lowest limit.
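
The selection above can be sketched as a sort key. This is an illustration only: the names are hypothetical, "lowest limit" is read here as the lowest retry limit (stored under a hypothetical `max_attempts` key), and rules without a limit are assumed to lose a tie.

```python
SCOPE_COLUMNS = ("Architecture", "Release", "WorkQueue_ID")


def specificity(rule):
    """Number of scope columns (cmtconfig, release, work queue ID) the rule defines."""
    return sum(rule.get(column) is not None for column in SCOPE_COLUMNS)


def pick_rule(matching_rules):
    """Among rules that already match a job, pick the narrowest one.

    Ties are broken by the lowest limit. Raises ValueError if the list is empty.
    """
    def sort_key(rule):
        limit = rule.get("max_attempts")
        return (-specificity(rule), limit if limit is not None else float("inf"))

    return min(matching_rules, key=sort_key)
```

Given two matching rules that each define two scope columns, with limits 8 and 5, `pick_rule` returns the one with limit 5; a third rule defining only one scope column is never preferred over them.
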
Topic revision: r4 - 2015-10-01 - FernandoHaraldBarreiroMegino