How to Find Log Files for CMS Production Jobs

Introduction

This page is aimed at CMS sysadmins who would like to retrieve log files for failed production jobs. It assumes you have:

  • Basic knowledge of navigating the CMS Dashboard.
  • A valid CMS VOMS proxy.
  • A login host with xrdcp installed.
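
Before starting, it can help to confirm that the command-line tools are actually on the login host. The following is a minimal sketch, not an official check: the helper names have_tool and check_prereqs are hypothetical, and it only tests whether a command is on the PATH.

```shell
#!/bin/sh
# Return success if the named command is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print "ok" if every listed tool is present; otherwise list the missing ones.
check_prereqs() {
    missing=""
    for tool in "$@"; do
        have_tool "$tool" || missing="$missing $tool"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Example (not run here):
#   check_prereqs xrdcp voms-proxy-info tar
```

If voms-proxy-info is present, running voms-proxy-info --timeleft should print the remaining proxy lifetime in seconds; a value of 0 means the proxy needs to be renewed.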

Find a Workflow

Skip this step if you already know the name of the workflow with failing jobs at your site. To examine the success rates of all production workflows, visit the following URL (adjusting T2_US_Nebraska for your site name):

http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#site=T2_US_Nebraska&submissiontool=wmagent&sortby=task
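
If you look after several sites, the URL can be generated rather than edited by hand. This is a small sketch; the function name dashboard_url is hypothetical, and the URL pattern is simply the one shown above with the site name substituted.

```shell
#!/bin/sh
# Print the Dashboard task-summary URL for the given site name.
dashboard_url() {
    site="$1"
    printf 'http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#site=%s&submissiontool=wmagent&sortby=task\n' "$site"
}

# Example: dashboard_url T2_US_Nebraska
```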

Each row is a different task, sorted in descending order by the number of jobs within the last 24 hours. Click on the Jobs tab, then the Expand task name button. This will print the full Dashboard task name, such as wmagent_pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358.

The CMS request name will be the portion after wmagent_; in our example, this is pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358.
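
Stripping the prefix can be done with shell parameter expansion. The sketch below assumes the task name always begins with wmagent_, as in the example; the function name request_name is hypothetical.

```shell
#!/bin/sh
# Print the CMS request name for a Dashboard task name by removing
# the leading "wmagent_"; fail if the prefix is absent.
request_name() {
    task="$1"
    case "$task" in
        wmagent_*)
            printf '%s\n' "${task#wmagent_}"
            ;;
        *)
            echo "not a wmagent task name: $task" >&2
            return 1
            ;;
    esac
}

# Example:
#   request_name wmagent_pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358
```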

Find the Log File LFN

In the previous section, we found a workflow and site name to examine (such as T2_US_Nebraska and pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358). Now, we must find a log LFN associated with a failed job.

To do this, start by loading WMStats: https://cmsweb.cern.ch/wmstats/index.html . This will take between 10 seconds and 2 minutes to load, depending on your connection speed.

Type the workflow name into the workflow text box (second from left). This will filter the table to a single campaign. Click on the gray box in the "L" column.

This will display the request view and should have a single row (your desired request). Again, click on the gray box in the "L" column.

This will display the job view. The second table lists the job failures for this request. Filter the table on your site name using the associated search box (found at the upper right of the table). This should narrow the table to approximately 10 failed jobs. Click the gray box in the "L" column to select a job.

This will pull up a job record on the page. Toward the bottom is the output list; find the LFN associated with the logArchive output. It will have the following structure:

/store/unmerged/logs/prod/2016/1/20/pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358/StepOneProc/0001/1/0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz

Note: these log archives are cleaned up and archived after 500 jobs from this workflow have run at your site (typically around 24 hours). You may have to try several failed jobs before you find a log that is still available.
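
As a sanity check, you can confirm that an LFN copied from WMStats belongs to the workflow you meant to inspect. The sketch below assumes the layout seen in the example above, where the request name is the path component following /store/unmerged/logs/prod/ and the year, month, and day; that position is inferred from this single example, and the function name lfn_request_name is hypothetical.

```shell
#!/bin/sh
# Print the request-name component of a logArchive LFN.
# Splitting on "/" yields an empty first field (the LFN starts with "/"),
# then store, unmerged, logs, prod, year, month, day, and the request name
# as field 9.
lfn_request_name() {
    printf '%s\n' "$1" | cut -d/ -f9
}

# Example:
#   lfn_request_name "$LFN"   # compare with the request name found earlier
```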

Download Log File

Open a terminal and SSH into the server where your VOMS proxy is located; we will download the log file with xrdcp. To form a PFN, prepend root://cms-xrd-global.cern.ch/ to the log file LFN. For example, the LFN

/store/unmerged/logs/prod/2016/1/20/pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358/StepOneProc/0001/1/0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz

becomes the PFN

root://cms-xrd-global.cern.ch//store/unmerged/logs/prod/2016/1/20/pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358/StepOneProc/0001/1/0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz
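
The conversion is a plain string prepend, which a small helper can do for you. This is a sketch only: the function name lfn_to_pfn is hypothetical, and the check that the LFN starts with /store/ is an added assumption intended to catch copy-and-paste mistakes.

```shell
#!/bin/sh
# Global redirector prefix, as given in the text above.
XRD_REDIRECTOR="root://cms-xrd-global.cern.ch/"

# Print the PFN for an LFN by prepending the redirector.
# The LFN keeps its own leading "/", which produces the "//" in the PFN.
lfn_to_pfn() {
    case "$1" in
        /store/*)
            printf '%s%s\n' "$XRD_REDIRECTOR" "$1"
            ;;
        *)
            echo "LFN should start with /store/: $1" >&2
            return 1
            ;;
    esac
}

# Example:
#   PFN=$(lfn_to_pfn "$LFN")
```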

Now, download this file using xrdcp; the syntax is approximately:

xrdcp $PFN $DEST_DIR
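
Because older log archives may already have been cleaned up, it can be convenient to collect the LFNs of several failed jobs and try them in order. The following is a rough sketch under that assumption; the function name fetch_first_log and the XRDCP override variable are hypothetical conveniences, not part of xrdcp itself.

```shell
#!/bin/sh
# Try each LFN in turn until one downloads successfully, then print the
# local path of the downloaded file. The first argument is the destination
# directory; the remaining arguments are LFNs.
# XRDCP may be set to substitute another copy command; it defaults to xrdcp.
fetch_first_log() {
    dest_dir="$1"
    shift
    for lfn in "$@"; do
        pfn="root://cms-xrd-global.cern.ch/$lfn"
        # Send the copy tool's own output to stderr so stdout holds only the path.
        if "${XRDCP:-xrdcp}" "$pfn" "$dest_dir/" >&2; then
            printf '%s/%s\n' "$dest_dir" "${lfn##*/}"
            return 0
        fi
        echo "could not fetch $lfn, trying the next one" >&2
    done
    return 1
}

# Example (not run here):
#   fetch_first_log /tmp "$LFN1" "$LFN2" "$LFN3"
```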

Within the log tarball, the file cmsRun1/cmsRun1-stdout.log will hold the output log you are interested in.

For our example:

$ xrdcp root://cms-xrd-global.cern.ch//store/unmerged/logs/prod/2016/1/20/pdmvserv_SUS-RunIISpring15DR74-00122_00554_v0__160109_193720_9358/StepOneProc/0001/1/0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz /tmp/
[1.338MB/1.338MB][100%][==================================================][456.6kB/s]  
$ cd /tmp
$ tar zxf 0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz
$ ls -lh cmsRun1/cmsRun1-stdout.log 
-rw-r--r--. 1 bbockelm cse496 705K Jan 20 02:32 cmsRun1/cmsRun1-stdout.log
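
If you are unsure which files a tarball contains, you can list it before extracting. The sketch below prints every member whose name ends in -stdout.log; the function name list_stdout_logs is hypothetical, and the assumption that other steps follow the same naming pattern as cmsRun1 is not confirmed by this page.

```shell
#!/bin/sh
# List the members of a gzipped log tarball whose names end in "-stdout.log".
list_stdout_logs() {
    tar tzf "$1" | grep -- '-stdout\.log$'
}

# Example:
#   list_stdout_logs /tmp/0f4cbc3c-be15-11e5-861a-a0369f23d01e-75-1-logArchive.tar.gz
```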

None of this works!

There are a few reasons why this procedure may fail:

  • The log archive stageout failed. This can occur if the runtime site is missing gfal-copy.
  • The stageout site is not part of the AAA infrastructure.
  • cmsRun failed very early in startup.
  • The underlying failure was recorded in the wrapper stdout in HTCondor, not as part of cmsRun.
  • The log file has been cleaned out (can happen within 24 hours).
  • The pilot failed suddenly, leaving neither stdout nor log files.

In any of these cases, you will need to file a GGUS ticket to receive central help.

Topic revision: r1 - 2016-01-21 - BrianBockelman
 