Debugging failed WMAgent jobs with Global Monitor and job wrapper stdout and stderr

The example below is from the CRAB3 test server, which uses the new WMStats Global Monitor and does not have the old Global Monitor.

Using only WMStats

We have a workflow called pkonst_crab_crab3test-PK-001_121003_101259 and after clicking on its line in the Global Monitor we get this table:

status    site          exit code  jobs  error message
success   T2_CH_CSCS    0          11
success   T2_ES_CIEMAT  0          105
success   T2_RU_JINR    60307      38    LogArchiveFailure

So we have 38 jobs at T2_RU_JINR that failed staging out the log archive but are otherwise successful. It is interesting to find out why and how this failure occurred while the stage out of the output file succeeded. We click on the T2_RU_JINR row of the table and get three almost identical boxes that contain:

  • workflow: pkonst_crab_crab3test-PK-001_121003_101259
  • task: /pkonst_crab_crab3test-PK-001_121003_101259/Analysis
  • state: success
  • site: T2_RU_JINR
  • exit code: 60307
  • retry count: 2
  • lumis: 36
  • output:
    • lfn:/store/temp/user/pkonst/RelValProdTTbar/crab3test-PK-001/v1/00000/4822AD76-580D-E211-896A-002590189538.root
    • location: T2_RU_JINR
    • checksums: adler32: 1585953b cksum: 1970636162
    • type: output
    • size: 366099584

This doesn't tell us much more about the failure than what we already know. And we can't go further if we want to stay within WMStats.

Going beyond WMStats

From the WMStats output we take the name of the output file and use it to find the wmbs id of the job that produced it. We log in to the WMAgent host and execute:
[vocms120] /data/srv/wmagent/current > wget 'http://vocms120.cern.ch:5984/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/jobsByOutputLFN' -O - 2>/dev/null | grep 4822AD76-580D-E211-896A-002590189538.root
and we get
{"id":"1-2","key":["pkonst_crab_crab3test-PK-001_121003_101259","/store/temp/user/pkonst/RelValProdTTbar/crab3test-PK-001/v1/00000/4822AD76-580D-E211-896A-002590189538.root"],"value":1},
So the wmbs id is 1 ("value":1).
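As a side note, the same view can also be queried for the exact key instead of grepping the whole dump. Judging from the row above, the view appears to be keyed by [workflow, output LFN]; this key structure is inferred from that output rather than from documentation, so take the following only as a sketch:

# query the view by key (URL-encoded JSON); returns just the matching row
curl -sG 'http://vocms120.cern.ch:5984/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/jobsByOutputLFN' \
     --data-urlencode 'key=["pkonst_crab_crab3test-PK-001_121003_101259","/store/temp/user/pkonst/RelValProdTTbar/crab3test-PK-001/v1/00000/4822AD76-580D-E211-896A-002590189538.root"]'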

Looking at the job summary page

Now we can construct the URL of the job summary page (http://vocms120.cern.ch:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_show/jobSummary/1) and open it using a tunnel. In the old Global Monitor it was possible to do this with a mouse click. A snapshot of the page is attached.
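For reference, one simple way to set up such a tunnel is SSH port forwarding (a sketch; the local port 15984 is an arbitrary choice):

ssh -N -L 15984:vocms120.cern.ch:5984 vocms120.cern.ch
# then browse http://localhost:15984/wmagent_jobdump%2Fjobs/_design/JobDump/_show/jobSummary/1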

On that page, under 'Retry 2', we see a lot of code from the step where the error occurred, but it is not obvious what really happened. We can see that the command used for the stage out was

srmcp -2 -report=$REPORT_FILE -retry_num=0 -debug=true  file:////scr/u/19072899.lxbsrv01.jinr.ru/glide_Y30635/execute/dir_2525/job/WMTaskSpace/logArch1/logArchive.tar.gz  srm://lcgsedc01.jinr.ru:8443/srm/managerv2?SFN=/pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz 2>&1 | tee srm.output.$$ 
The file size on the SE was then obtained with:
SRM_SIZE=`srmls -recursion_depth=0 -retry_num=0 srm://lcgsedc01.jinr.ru:8443/srm/managerv2?SFN=/pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz 2>/dev/null | grep '/pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz' | grep -v 'lcgsedc01.jinr.ru' | awk '{print $1;}'`

Looking at the job wrapper stdout, stderr and the condor log

Now we want to look at the stdout and stderr of the job wrapper. We go to the WMAgent working directory, search for the archived files of the job with wmbs id 1 and unpack them:
[vocms120] /data/srv/wmagent/current > tar xf $(find . -name Job_1.tar.bz2)
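Listing the archive first shows what it contains. Besides the stdout there are also the stderr and the condor log; the .err and .log names below follow the usual condor.<cluster>.<proc> pattern and are an assumption, not taken from this particular job:

tar tjf $(find . -name Job_1.tar.bz2)
# Job_1/condor.1674.0.out   job wrapper stdout
# Job_1/condor.1674.0.err   job wrapper stderr  (name assumed)
# Job_1/condor.1674.0.log   condor log          (name assumed)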
In the resulting directory we find the stdout as Job_1/condor.1674.0.out. The entire file is attached. There we can see:
==>Working on file: /store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz
===> Attempting Local Stage Out.
LFN to PFN match made:
LFN: /store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz
PFN: srm://lcgsedc01.jinr.ru:8443/srm/managerv2?SFN=/pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs///pkonst_crab_crab3test-PK-001_121003_101259/Analysis/0000/2/eb1a797a-0d47-11e2-893f-003048f1c5ce-0-2-logArchive.tar.gz

...

srmPutDone status code=SRM_SUCCESS
copy_jobs is empty
stopping copier
srmcp exit status: 0

Hmm, it turns out that the stage out of the log archive was in fact successful! Why, then, is a failure reported? Immediately below the successful copy we see:

Local File Size is: 17767
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
SRM Size is 
Cleaning up failed file:
Command exited with status: 151
ERROR: Exception During Stage Out:

So the command to get the file size on the SE returned nothing. The stage out was considered to have failed and the file was deleted.
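The ten identical 'SRM Size is' lines suggest that the wrapper repeats the srmls check several times before giving up. The logic is roughly of the following shape; this is a minimal sketch with made-up variable names (LOCAL_FILE, SURL, REMOTE_PATH, SE_HOST), not the actual WMCore code:

LOCAL_SIZE=$(stat -c%s "$LOCAL_FILE")
echo "Local File Size is: $LOCAL_SIZE"
for attempt in $(seq 1 10); do
    SRM_SIZE=$(srmls -recursion_depth=0 -retry_num=0 "$SURL" 2>/dev/null \
               | grep "$REMOTE_PATH" | grep -v "$SE_HOST" | awk '{print $1;}')
    echo "SRM Size is $SRM_SIZE"
    [ "$SRM_SIZE" = "$LOCAL_SIZE" ] && exit 0    # sizes agree: stage out considered OK
    sleep 60                                     # otherwise wait and try again (interval illustrative)
done
echo "Cleaning up failed file:"
srmrm "$SURL"                                    # delete the (in fact good) remote copy
exit 151                                         # surfaces as the stage-out failure seen above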

Then we notice that the PFN has a triple '/' in the path. The command to get the file size runs 'srmls' on the PFN and pipes its output to 'grep', searching for the local part of the path, which also contains the triple '/'. But 'srmls' prints the path with just a single '/', so the 'grep' never matches and we get nothing.
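The mismatch is easy to reproduce by hand. The paths below are shortened, made-up stand-ins for the real ones; the last line shows that collapsing the repeated slashes before grepping is one way to make the comparison robust (an illustration only, not the actual fix that was applied):

SRMLS_LINE='17767 /pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs/wf/logArchive.tar.gz'
PATTERN='/pnfs/jinr.ru/data/cms/store/temp/user/pkonst/logs///wf/logArchive.tar.gz'

echo "$SRMLS_LINE" | grep "$PATTERN" | awk '{print $1;}'                               # prints nothing: SRM_SIZE ends up empty
echo "$SRMLS_LINE" | grep "$(echo "$PATTERN" | sed 's|//*|/|g')" | awk '{print $1;}'   # prints 17767 once '///' is collapsed to '/'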

Problem found. We open ticket #4131 requesting a fix to the SRMv2 stage-out implementation in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/Backends/SRMV2Impl.py

Conclusion

  • For successful debugging of failing WMAgent jobs we need the information from the job summary page of the job in the agent CouchDB; WMStats shows only part of it.
  • More often than not even this is not enough, and we need to look at the job wrapper stdout, stderr and the condor log, which are archived in a tarball under the WMAgent working directory. They turn out to be more helpful for understanding what happened and finding the problem. We need the wmbs id of the failed job in order to get the archive.

Response

  • It seems this is a very particular problem which requires going through the log archive. The log archive location is in WMStats, but not the log archive itself. The error message is propagated only for the last retry.
  • We need to provide some standard way to propagate error details by picking out the relevant part of the error message, but I am not sure which part of the error message needs to be added.
  • There will still be cases that require going through the condor logs and archives, which won't be propagated to the central monitor.
  • I will provide the error message from the last retry as well as the wmbs id.

-- PreslavKonstantinov - 26-Oct-2012

Topic attachments
Attachment         Size    Date        Who                  Comment
condor.1674.0.out  44.2 K  2012-10-29  PreslavKonstantinov  job wrapper stdout
jobsummary-1.html  36.0 K  2012-10-29  PreslavKonstantinov  Job summary page