PandaErrorCodes

Introduction

This page describes the error codes and diagnostics of the Panda jobs.

transExitCode

transExitCode diagnostics
1 Athena release is not installed in the CE, or trf failed due to "Unknown Problem" (see checklog.txt)
2 Athena core dump
6 TRF_SEGVIO - Segmentation violation
10 ATH_FAILURE - Athena non-zero exit
26 TRF_ATHENACRASH - Athena crash
30 TRF_PYT - transformation python error
31 TRF_ARG - transformation argument error
32 TRF_DEF - transformation definition error
33 TRF_ENV - transformation environment error
34 TRF_EXC - transformation exception
40 Athena crash - consult log file
41 TRF_OUTFILE - output file error
42 TRF_CONFIG - transform config file error
50 Athena crash - consult log file (can be "VKalVrtPrim ERROR Primary vertex not found")
51 TRF_DBREL_TARFILE - Problems with the DBRelease tarfile
60 TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
79 Copying input file failed (Can't open source file : Invalid file name)
80 file in trf definition not found, using the expandable syntax
81 file in trf definition not found, using the expandable syntax -- pileup case
85 analysis output merge crash - consult log file
98 Oracle error - session limit reached
99 Unknown transform error (69999, TRF_UNKNOWN) -- consult log file
102 One of the output files did not get produced by the job
104 Copying the output file from the worker node to the local SE failed (md5sum mismatch, or size mismatch, or LFNnonunique)
126 trf is not executable - consult log file
127 trf is not installed in the CE
134 Athena core dump, or Athena time out, or ConditionsDB exception caught: MySQL error (database load problem), or Error ORA-03114: not connected to ORACLE
141 No input file is available - input dataset is broken or doesn't exist at WN's site
200 no Athena log file produced
220 Proot: An exception occurred in the user analysis code
221 Proot: Framework decided to abort the job due to an internal problem
222 Proot: Job completed without reading all input files
223 Proot: Input files cannot be opened
2100 MyProxyError: server name not specified (not really trf error)
2101 MyProxyError: voms attributes not specified (not really trf error)
2102 MyProxyError: user DN not specified (not really trf error)
2103 MyProxyError: pilot owner DN not specified (not really trf error)
2104 MyProxyError: invalid path for the delegated proxy (not really trf error)
2105 MyProxyError: invalid pilot proxy path (not really trf error)
2106 MyProxyError: no path to delegated proxy specified (not really trf error)
2200 MyProxyError: myproxy-init not available in PATH (not really trf error)
2201 MyProxyError: myproxy-logon not available in PATH (not really trf error)
2202 MyProxyError: myproxy-init version not valid (not really trf error)
2203 MyProxyError: myproxy-logon version not valid (not really trf error)
2300 MyProxyError: proxy delegation failed (not really trf error)
2301 MyProxyError: proxy retrieval failed (not really trf error)
2999 Unknown transExitCode error code (most likely a pilot script error, consult batch log)
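
Every MyProxyError entry in the table above is marked "not really trf error", so a tool that counts transformation failures may want to set those codes aside. The following is a minimal sketch of such a check; the names MYPROXY_CODES and is_myproxy_error are hypothetical and only the code values come from the table.

```python
# transExitCode values listed above as MyProxyError ("not really trf error").
# The set name and the helper below are illustrative, not part of PanDA.
MYPROXY_CODES = frozenset({
    2100, 2101, 2102, 2103, 2104, 2105, 2106,  # missing or invalid parameters
    2200, 2201, 2202, 2203,                    # myproxy-init / myproxy-logon problems
    2300, 2301,                                # delegation / retrieval failures
})


def is_myproxy_error(trans_exit_code):
    """Return True if the code is one of the MyProxyError entries in the table."""
    return trans_exit_code in MYPROXY_CODES


if __name__ == "__main__":
    for code in (2101, 134):
        print(code, is_myproxy_error(code))
```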

pilotErrorCode

Recoverable error codes: 1101, 1114, 1122, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1140, 1141, 1142, 1152, 1154, 1155, 1157, 1181, 1185 (shown in green below; stranded jobs and output files are recovered by a later pilot on sites with schedconfig.retry = true)

Resubmission error codes: 1008, 1098, 1099, 1110, 1113, 1114, 1115, 1116, 1117, 1137, 1139, 1151, 1152, 1171, 1172, 1177, 1179, 1180, 1181, 1182, 1188, 1189 (the pilot instructs the server to retry the job)
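
The two lists overlap (for example 1114 and 1137 appear in both), so a pilotErrorCode can carry both properties at once. A minimal sketch of a classifier built from the two lists follows; the function and constant names are hypothetical, and the sketch says nothing about how the pilot or server implements this internally.

```python
# Code lists copied from the two paragraphs above; names are illustrative.
RECOVERABLE = frozenset({
    1101, 1114, 1122, 1131, 1132, 1133, 1134, 1135, 1136, 1137,
    1138, 1140, 1141, 1142, 1152, 1154, 1155, 1157, 1181, 1185,
})
RESUBMISSION = frozenset({
    1008, 1098, 1099, 1110, 1113, 1114, 1115, 1116, 1117, 1137, 1139,
    1151, 1152, 1171, 1172, 1177, 1179, 1180, 1181, 1182, 1188, 1189,
})


def classify_pilot_error(code):
    """Return the set of labels ('recoverable', 'resubmission') that apply to a code."""
    labels = set()
    if code in RECOVERABLE:
        labels.add("recoverable")
    if code in RESUBMISSION:
        labels.add("resubmission")
    return labels


if __name__ == "__main__":
    for code in (1114, 1008, 1150):
        print(code, sorted(classify_pilot_error(code)))
```

An empty result means the code is in neither list, as for 1150 (looping job killed by pilot).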

pilotErrorCode diagnostics
1008 General pilot error, consult batch log
1097 Get function can't be called for staging input file
1098 No space left on local disk
1099 Get error: Staging input file failed
1100 Get error: Replica not found
1101 LRC registration error: Connection refused
1103 Get error: No such file or directory
1104 User work directory too large
1105 Put error: Failed to add file size and checksum to LFC
1106 Payload stdout file too big
1107 Get error: Missing DBRelease file
1108 Put error: LCG registration failed
1109 Required CMTCONFIG incompatible with WN
1110 Failed during setup
1111 Exception caught by runJob
1112 Exception caught by pilot
1113 Get error: Failed to import LFC python module
1114 Put error: Failed to import LFC python module
1115 NFS SQLite locking problems
1116 Pilot could not download queuedata
1117 Pilot found non-valid queuedata
1118 Pilot could not curl space report
1119 Pilot aborted due to DDM space shortage
1122 Bad replica entry returned by lfc_getreplicas(): SFN not set in LFC for this guid
1123 Missing guid in output file list
1124 Output file too large
1130 Get error: Failed to get POOL file catalog
1131 Put function cannot be called for staging out
1132 LRC registration error (consult log file)
1133 Put error: Fetching default storage URL failed
1134 Put error: Error in mkdir on localSE, not allowed or no available space
1135 Could not get file size in job workdir
1136 Error running md5sum on the file in job workdir
1137 Put error: Error in copying the file from job workdir to localSE
1138 Put error: could not get the file size on localSE
1139 Put error: Problem with copying from job workdir to local SE: size mismatch
1140 Put error: Error running md5sum on the file on localSE
1141 Put error: Problem with copying from job workdir to local SE: md5sum mismatch
1143 Failed to chmod trf
1144 This job was killed by panda server
1145 Get error: md5sum mismatch on input file
1146 Trf installation dir does not exist and could not be installed
1148 Put error: Failed to remove readOnly file in dCache
1149 wget command failed to download trf
1150 Looping job killed by pilot
1151 Get error: Input file staging timed out
1152 Put error: File copy timed out
1153 Lost job was not finished
1154 Failed to register log file
1155 Failed to move output files for lost job
1156 Pilot could not recover job
1158 Reached maximum number of recovery trials
1159 Job recovery could not read PoolFileCatalog.xml file (guids lost)
1160 LRC registration error: file name string size exceeded limit of 250
1161 Job recovery could not generate xml for remaining output files
1162 LRC registration error: Non-unique LFN
1163 Grid proxy not valid
1164 Get error: Local input file missing
1165 Put error: Local output file missing
1166 Put error: File copy broken by SIGPIPE
1167 Get error: Input file missing in PoolFileCatalog.xml
1168 Get error: Total file size too large
1169 Put error: LFC registration failed
1170 Error running adler32 on the file in job workdir
1171 Get error: adler32 mismatch on input file
1172 Put error: adler32 mismatch on output file
1173 PandaMover staging error: File is not cached
1174 PandaMover transfer failure
1175 Get error: Problem with copying from local SE to job workdir: size mismatch
1176 Pilot has no child processes (job wrapper has either crashed or did not send final status)
1177 Voms proxy not valid
1178 Get error: No input files are staged
1179 Get error: Failed to get LFC replicas
1180 Get error: Globus system error
1181 Put error: Globus system error
1182 Get error: Failed to get LFC replica
1183 LRC registration error: Guid-metadata entry already exists
1184 Put error: PoolFileCatalog could not be found in workdir
1186 Software directory does not exist
1187 Athena metadata is not available
1188 lcg-getturls failed
1189 lcg-getturls timed out
1190 LFN too long (exceeding limit of 150 characters)
1199 Could not create directory
1200 Job terminated by unknown kill signal
1201 Job killed by signal: SIGTERM
1202 Job killed by signal: SIGQUIT
1203 Job killed by signal: SIGSEGV
1204 Job killed by signal: SIGXCPU
1206 Job killed by signal: SIGBUS
1207 Job killed by signal: SIGUSR1
1210 No athena output
1211 Missing installation
1212 Payload ran out of memory
1213 Reached batch system time limit
1214 Site does not allow requested direct access or file stager
1215 Failed to open TCP connection to localhost (worker node network problem)
1216 Pilot TCP server has died
1217 Mismatch between core count in job and queue definition
1218 Exception caught by RunJobEvent
1219 uuidgen failed to produce a guid
1220 Job failed due to unknown reason (consult log file)
1221 File already exists
1222 Failed to get security key pair
1223 TRF failed due to bad_alloc
1224 Recoverable Event Service Merge error
1225 Recoverable Event Service error
1226 gLExec related error
1227 AthenaMP ended Event Service job prematurely
1228 Fatal Event Service error
1229 Fatal Token Extractor error
1230 Token Extractor error: Host name could not be resolved
1231 Token Extractor error: Bad URL
1232 Token Extractor error: Invalid GUID length
1233 Token Extractor error: No tokens for this GUID
1234 Already executed clone job
1235 Payload exceeded maximum allowed memory
1236 Failed by server
1237 Event Service job killed by server
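
Codes 1200 to 1207 in the table above report the signal that killed the job. A minimal sketch of that mapping as a lookup is shown here; the names are hypothetical, and falling back to 1200 for any signal not listed is an assumption based on its description ("unknown kill signal"), not a documented pilot behaviour.

```python
# Signal names and codes taken from the pilotErrorCode table above.
SIGNAL_TO_PILOT_CODE = {
    "SIGTERM": 1201,
    "SIGQUIT": 1202,
    "SIGSEGV": 1203,
    "SIGXCPU": 1204,
    "SIGBUS": 1206,
    "SIGUSR1": 1207,
}
UNKNOWN_KILL_SIGNAL = 1200  # "Job terminated by unknown kill signal"


def pilot_code_for_signal(signal_name):
    """Map a signal name to its pilotErrorCode (assumed fallback: 1200)."""
    return SIGNAL_TO_PILOT_CODE.get(signal_name.upper(), UNKNOWN_KILL_SIGNAL)


if __name__ == "__main__":
    for name in ("SIGTERM", "sigbus", "SIGHUP"):
        print(name, pilot_code_for_signal(name))
```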

brokerageErrorCode

brokerageErrorCode diagnostics
100 release is missing in the cloud

ddmErrorCode

ddmErrorCode diagnostics
100 DQ2 server error
200 Adder could not add files to the output datasets

jobDispatcherErrorCode

jobDispatcherErrorCode diagnostics
100 lost heartbeat
101 job recovery failed for three days

taskBufferErrorCode

taskBufferErrorCode diagnostics
100 Job expired and killed three days after submission (or killed by user)
101 Transfer timeout (2 weeks)
102 Expired three days after submission
103 Aborted by ExtIF
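
The same numeric value means different things in different fields: 100 is "release is missing in the cloud" as a brokerageErrorCode but "lost heartbeat" as a jobDispatcherErrorCode. A lookup therefore needs both the field name and the code. The sketch below illustrates this with the four server-side tables above; the dictionary and function names are hypothetical.

```python
# Descriptions copied from the four tables above, keyed by field name, then code.
SERVER_ERROR_CODES = {
    "brokerageErrorCode": {
        100: "release is missing in the cloud",
    },
    "ddmErrorCode": {
        100: "DQ2 server error",
        200: "Adder could not add files to the output datasets",
    },
    "jobDispatcherErrorCode": {
        100: "lost heartbeat",
        101: "job recovery failed for three days",
    },
    "taskBufferErrorCode": {
        100: "Job expired and killed three days after submission (or killed by user)",
        101: "Transfer timeout (2 weeks)",
        102: "Expired three days after submission",
        103: "Aborted by ExtIF",
    },
}


def describe_error(field, code):
    """Return the description for (field, code), or None if it is not in the tables."""
    return SERVER_ERROR_CODES.get(field, {}).get(code)


if __name__ == "__main__":
    for field in SERVER_ERROR_CODES:
        print(field, 100, "->", describe_error(field, 100))
```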

See also AtlasProdSysErrorCodes.


Major updates:
-- NurcanOzturk - 06 Dec 2005 -- NurcanOzturk - 29 Dec 2005 -- NurcanOzturk - 27 Feb 2006 -- NurcanOzturk - 28 Apr 2006 -- YuriSmirnov - 19 May 2006 -- NurcanOzturk - 10 Jul 2006 -- PaulNilsson - 01 Oct 2006



Responsible: NurcanOzturk

Topic revision: r77 - 2016-04-27 - PaulNilsson
 