Motivation
The number of attempts defined for a task is not suitable for all types of errors, and ADC would like to have a way of treating some particular error codes in a special way. For example:
- some errors may be considered final and retrying the job will not help
- other errors may be considered worth retrying only a limited number of times
In other cases, we might be interested in defining a certain post-failure action, e.g. increasing the memory limit of a failed job.
This is why we want to implement a central table, where Ops and GDP can start to populate infamous error codes and build up the corresponding actions.
Database structure
Actions table
This table defines the existing post-error actions:
- no_retry
- limit_retry
- increase_memory
This table is managed by the PanDA developers, since adding a new action requires a code update in the PanDA server. The initial state could be:
ID | Name            | Description                                                                                        | Active
1  | no_retry        | This action will prevent PanDA server from retrying the job again. It is considered a final error. | Y
2  | increase_memory | Job ran out of memory. Increase the memory setting for the next retry.                             | Y
3  | limit_retry     | Set the maximum number of retries.                                                                 | Y
The Active column can be set to 'Y' or 'N'; in the latter case the action will not be effective and will only log a debug message.
The logging mechanism still needs to be defined. It was requested to send a message to TaskLogger/JEDILogger/PanDAmon/JIRA.
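The behaviour of the Active flag at the action level could be sketched as follows (the logger name, field names and dispatch details are assumptions, not the actual PanDA server code):

```python
import logging

logger = logging.getLogger("retrial_module")

def apply_action(action, job):
    """Apply a post-error action, or only log when the action is inactive."""
    if action["Active"] != "Y":
        # inactive actions are not effective: log a debug message and skip
        logger.debug("Action %s matched job %s but is inactive, skipping",
                     action["Name"], job["PandaID"])
        return False
    # ... dispatch to the no_retry / limit_retry / increase_memory handlers ...
    return True
```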
Error codes
The error table defines particular error codes and links them to the actions table. This table should be managed by Ops/GDP through the ProdSys2 interface. An error code can be applicable to all jobs, or you can narrow the scope by specifying cmtconfig and/or release and/or work queue ID. Some actions require parameters; in these cases the specific parameters have to be added to the Parameters column following an agreed notation.
RetryError_ID: Incremental number taken from a sequence.
ErrorSource and ErrorCode: Have to match an existing error from here
RetryAction: Link to the action table.
Parameters: Has to follow the key1=value1&key2=value2&key3=value3 pattern. Keys are specific to the action, e.g. the limit_retry action needs the parameter maxAttempt=5.
Architecture: Optional field to narrow down the scope of a rule. The string has to match exactly the value in column cmtconfig of ATLAS_PANDA.jobsactive4.
Release: Optional field to narrow down the scope of a rule. The string has to match exactly the value in column ATLASRelease of ATLAS_PANDA.jobsactive4.
WorkQueue_ID: Optional field to narrow down the scope of a rule. The ID has to match the number in column WorkQueue_ID, which links to the table ATLAS_PANDA.JEDI_Work_Queue.
Description: Optional free text with description or link to e.g. JIRA.
Expiration_Date: If set, the error rule will not be active after this date.
Active: Can be Y or N. If set to N, the rule will only log. This is a more fine-grained handle than the same column at the Action level.
ErrorDiag: Regexp to be used in case ErrorSource+ErrorCode are too generic and we want to target a particular error message.
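The two machine-readable fields can be handled with standard tooling. A minimal sketch of parameter parsing and ErrorDiag matching (function names are illustrative; the rule values are taken from the SQL example, and the sample error message is made up for the test):

```python
import re
from urllib.parse import parse_qs

def parse_action_parameters(parameters):
    """Split a key1=value1&key2=value2 parameter string into a dict."""
    if not parameters:
        return {}
    # parse_qs returns a list of values per key; keep the first one
    return {key: values[0] for key, values in parse_qs(parameters).items()}

def rule_matches(rule, error_source, error_code, error_diag):
    """Check whether a retry rule applies to a reported error."""
    if rule["ErrorSource"] != error_source or rule["ErrorCode"] != error_code:
        return False
    # the optional ErrorDiag regexp narrows down a generic source+code pair
    if rule.get("ErrorDiag"):
        return re.match(rule["ErrorDiag"], error_diag) is not None
    return True

rule = {"ErrorSource": "exeErrorCode", "ErrorCode": 65,
        "ErrorDiag": ".*IncludeError: include file .* can not be found.*",
        "Parameters": "maxAttempt=2"}
print(parse_action_parameters(rule["Parameters"]))  # {'maxAttempt': '2'}
print(rule_matches(rule, "exeErrorCode", 65,
                   "IncludeError: include file MyJobOptions.py can not be found"))  # True
```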
Example to enter a new error in the DB:
Insert into ATLAS_PANDA.RETRYERRORS (RETRYERROR_ID,ERRORSOURCE,ERRORCODE,RETRYACTION,PARAMETERS,ARCHITECTURE,RELEASE,WORKQUEUE_ID,DESCRIPTION,EXPIRATION_DATE,ACTIVE,ERRORDIAG)
values (ATLAS_PANDA.RETRYERRORS_ID_SEQ.nextval,'exeErrorCode',65,3,'maxAttempt=2',null,null,null,null,null,'N','.*IncludeError: include file .* can not be found.*');
--rollback;
--commit;
Retrial module engine
Hooks
Hooks have to be inserted in the PanDA server to call the retrial module. At the moment, the PanDA server will have a hook in the updateJob function, which the pilot calls to update the status of a job. If the pilot reports a failed job, the first step will be to call the retrial module, and the actions in the error tables will take precedence over settings defined elsewhere (e.g. if you defined maxAttempts=20 at task level, but maxAttempt=5 at error level, the PanDA server will consider maxAttempt=5).
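The precedence rule can be sketched as a simple override (function and parameter names are illustrative, not the PanDA server API):

```python
def effective_max_attempts(task_max_attempts, error_rule_params):
    """Error-level settings take precedence over the task-level value."""
    if "maxAttempt" in error_rule_params:
        return int(error_rule_params["maxAttempt"])
    return task_max_attempts

print(effective_max_attempts(20, {"maxAttempt": "5"}))  # -> 5
print(effective_max_attempts(20, {}))                   # -> 20
```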
In the future we might have to add other hooks to handle errors that occur, for example, in the PanDA server while interacting with DDM.
Caching
PanDA server caches the DB tables in memory, since the tables should be small and rather static.
- refreshed every hour or at restart: be patient
- unless the tables grow too much: keep them tidy
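A time-based cache along these lines would give the described behaviour (the class name, the one-hour TTL default and the loader callable are assumptions for this sketch):

```python
import time

class RuleCache:
    """Keep the retry-rule tables in memory, refreshing at most hourly."""

    def __init__(self, loader, ttl_seconds=3600):
        self._loader = loader      # callable that reads the tables from the DB
        self._ttl = ttl_seconds
        self._rules = None
        self._loaded_at = 0.0

    def get_rules(self):
        now = time.time()
        # reload on first use, or once the cached copy is older than the TTL
        if self._rules is None or now - self._loaded_at > self._ttl:
            self._rules = self._loader()
            self._loaded_at = now
        return self._rules
```

A server restart naturally empties the cache, which matches the "refreshed every hour or at restart" behaviour above.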
Handling inconsistent rules
The retrial module will use the narrowest rule, i.e. the one with the most of the columns cmtconfig, release and work queue ID defined. In case of a tie, it will take the one with the lowest limit.
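The selection above can be sketched as follows: count how many scope columns a rule defines, and break ties by the lower maxAttempt limit (field names and the helper are illustrative; rules without a limit sort last on ties):

```python
def pick_rule(rules):
    """Pick the narrowest matching rule; break ties by the lowest limit."""

    def specificity(rule):
        # number of scope columns that are set
        return sum(rule.get(col) is not None
                   for col in ("Architecture", "Release", "WorkQueue_ID"))

    def limit(rule):
        # rules without a maxAttempt limit lose ties to rules that have one
        return int(rule["Parameters"].get("maxAttempt", 10**9))

    # most specific first; among equally specific rules, lowest limit wins
    return max(rules, key=lambda r: (specificity(r), -limit(r)))

rules = [
    {"Architecture": None, "Release": None, "WorkQueue_ID": None,
     "Parameters": {"maxAttempt": "3"}},
    {"Architecture": "x86_64-slc6-gcc48-opt", "Release": None,
     "WorkQueue_ID": None, "Parameters": {"maxAttempt": "5"}},
]
print(pick_rule(rules)["Parameters"]["maxAttempt"])  # -> 5 (narrower rule wins)
```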