Debugging RelVal workflows with Global Monitor

The relval agent is currently connected to the "old" Global Monitor, so the following debugging instructions may change once the new GM is in use. Let's use the request "amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009" to illustrate these instructions.

  1. When we see failing jobs in the Global Monitor (failure or cool off jobs), the first thing to do is to click on the cool off link, which opens a list of tasks/job IDs for the failed jobs.

  2. The cool off (a.k.a. failed jobs) page shows, for a given workflow, a list of tasks and job IDs; see the attached "Workflow.png" picture.

  3. Clicking on a job ID brings us to a page with the details of the job and why it failed. At the top of this page we can cross-check the requestName and the task in which the failure happened - in this case a DQM merge job - see the "ClickingJobID_1.png" picture. At the bottom of this page we can see where the job ran (T1_US_FNAL in this case) as well as a timestamp. Further down we can see why the job failed - look at the Fatal Exception section - and in this example we were lucky to get a proper error message (which can now be reported back to the requestor).

See also the attached "ClickingJobID_2.png".

  4. If the job dump we looked at above is still not enough to figure out the reason for the failure, we can get the condor logs from the agent. To do that, take the job ID of the failed job (166480 in this example) and look for it on the agent machine (cmssrv113 in this case). Type the following on the agent machine:
    find . -name \*ob_166480\*
    If the workflow has already completed, you'll get a ".tar.bz2" file with the condor.out, condor.err and condor.log for this job (including all retries). Otherwise it will be a directory containing these logs. In this case we got the following path:
    ./install/wmagent/JobArchiver/logDir/a/amaltaro_RVCMSSW_6_1_0_pre4RunTau2012A_v2_121016_155622_1165/JobCluster_166/Job_166480.tar.bz2
    Open it and look at condor.out and condor.err; your answer is probably in these files.
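    As a minimal sketch of this step, assuming the completed-workflow case (the path and job ID are the ones from this example; the layout inside the tarball may differ):
      cd ./install/wmagent/JobArchiver/logDir/a/amaltaro_RVCMSSW_6_1_0_pre4RunTau2012A_v2_121016_155622_1165/JobCluster_166/
      tar -xjf Job_166480.tar.bz2                  # unpacks condor.out, condor.err and condor.log (all retries)
      grep -ri "exception\|error" . | less         # scan the condor logs for the failure reason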

  5. An additional step to report back to the requestors - once the workflow has completed - is to provide the PSet.py. Here we want to get the logCollect tarball from castor, take a single job tarball out of it, and provide it to the requestors in case they want to reproduce or debug the issue. The step-by-step is:
    1. Get the workflow + task name that failed from the summary page of the workflow. It corresponds to this string: "/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009/RECOSKIM/RECOSKIMMergeDQMoutput".
    2. Access the workload summary in couchDB (on cmsweb). Note the workflow name at the end of the URL: https://cmsweb.cern.ch/couchdb/_utils/document.html?workloadsummary/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009
    3. Now look for the workflow + task you got two steps above (keeping the quotes at the beginning/end). You'll find a few occurrences; you're looking for the one with "logArch1" below it. There we can also see the PFN of the logCollect tarball files (those tarballs contain all the logs for every job belonging to this workflow + task). Now we just need the LFN of this logCollect (in this case /store/logs/prod/2012/10/WMAgent/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009-RECOSKIMDQMoutputMergeLogCollect-1-logs.tar).
    4. On the vobox (vocms23 here), type the following command to stage this file from castor (it may take a few hours):
      cmsStage /store/logs/prod/2012/10/WMAgent/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009/amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009-RECOSKIMDQMoutputMergeLogCollect-1-logs.tar . 
    5. Once you have this logCollect tarball, just untar it. The output is a bunch of per-job tarballs; take one of those and make it available on AFS for the requestors (see the sketch below).
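    As a minimal sketch of this last step, assuming the logCollect tarball has already been staged with the cmsStage command above (the per-job tarball name and the AFS destination are placeholders, not values from this example):
      tar -xf amaltaro_RVCMSSW_6_1_0_pre4ZElSkim2012B_v2_121016_155631_3009-RECOSKIMDQMoutputMergeLogCollect-1-logs.tar
      ls                                           # lists the per-job tarballs contained in the logCollect
      cp <one_job_tarball> /afs/cern.ch/user/<your_area>/public/   # make a single job tarball available to the requestors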

Response

  • The logArchive location should be available directly from wmstats instead of going through the workload summary. (implemented)
  • I will provide the error message (last error) as well as the wmbs id.
  • Error details can get very big, so it is hard to propagate that information to the central monitor. If there is a generic rule to filter it, we can implement that.

-- AlanMalta - 30-Oct-2012

Topic attachments
Attachment           Size     Date             Who
ClickingJobID_1.png  110.5 K  2012-10-30 16:08 AlanMalta
ClickingJobID_2.png  181.0 K  2012-10-30 16:08 AlanMalta
Workflow.png          35.5 K  2012-10-30 16:07 AlanMalta