5.8 Grid Analysis Job Diagnosis Template

Complete: 5
Detailed Review status

Newsbox
Under review. Analysis Operations will simplify this page significantly.

Goals of this page:

This page guides you through steps for identifying the problems you experience with your Grid analysis jobs.

Introduction

Detailed instructions on how to run analysis jobs on the Grid are given in WorkBookCRAB2Tutorial. You can monitor the progress of your jobs by following the instructions in WorkBookMonitoringTutorial. If you encounter problems with your jobs, follow the steps below.

Diagnosis steps

To check first

  1. Have you checked whether the problem is already covered in the CRAB FAQ (see SWGuideCrabFaq) or reported on the CRAB Feedback list or in the Grid Announcements CMS.HyperNews forum? (All CMS members using the Grid must subscribe to this list.)
  2. Do you have a valid grid certificate?
  3. Have you tested your code locally? (See Test your code locally)
  4. Have you validated your python CMSSW config file? (See How to validate a CMSSW config file to run on CRAB)
  5. Are you using the latest CRAB version? (See How to get CRAB )
  6. Are you submitting to a Tier 2 or Tier 3 site?
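
For step 4, a quick syntax check can catch the most basic mistakes before submission. The following is a minimal sketch, not the validation procedure described on the linked page: it only verifies that the config file parses as Python, without executing it, so it cannot detect a wrong module name or parameter. The helper name `config_syntax_error` is hypothetical.

```python
def config_syntax_error(path):
    """Return None if the file at `path` parses as Python,
    otherwise a short description of the first syntax error."""
    with open(path, "r") as handle:
        source = handle.read()
    try:
        # compile() only parses the source; nothing is imported or run,
        # so no CMSSW environment is needed for this check.
        compile(source, path, "exec")
    except SyntaxError as err:
        return "%s:%s: %s" % (path, err.lineno, err.msg)
    return None
```

A return value of `None` means only that the file is syntactically valid; the config still has to be tested locally as described in step 3.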

Problems with the computing infrastructure

  1. Is your CMSSW version available at target site(s)?
    • Typical symptoms:
      CRAB does not submit to a site unless the CMSSW version requested by the user is advertised as installed at that site.
      Error message during CRAB (standalone) submission:
      crab. Checking available resources...
           crab. No compatible site found, will not submit jobs ..
           crab. The whole task doesn't found compatible site
      
      Error message from CRABServer:
      job status "NotSubmitted"
    • Check whether the version is available at the site, following this page
    • If the version is not available at the site where your jobs are running, report it to the CRAB feedback HyperNews list
      • report all relevant information:
        • on which site you are running
        • which CMSSW version you are using
  2. Is your job failing because it cannot access the CMSSW area at a site?
    • Typical symptoms:
      CRAB Exit code: 10034
      Error message from stdout:
      Unable to find SCRAM version VX_Y_Z for slc4_ia32_gcc345 architecture.
         ERROR ==> CMSSW CMSSW_X_Y_Z not found on <sitenode>
      
      or:
        CMSSW_X_Y_Z Error...no release area! SCRAM fatal: No release area found
         ERROR ==> CMSSW CMSSW_X_Y_Z not found on <sitenode>
    • As above, report to CRAB feedback hypernews list
      • report all relevant information:
        • on which site you are running
        • which CMSSW version you are using
  3. Are the remote sites operational?
    • Typical symptoms:
      CRAB doesn't submit to a site if it is not in BDII.
      Error message during CRAB (standalone) submission:
      crab. Checking available resources...
           crab. No compatible site found, will not submit jobs ..
           crab. The whole task doesn't found compatible site
      
      Error message from CRABServer:
      job status "NotSubmitted"
    • How to check: see the site availability page
    • If the site was marked red at the time the job failed, site unavailability is the likely cause of the failure. The red status comes from the results of the standard tests, so the failure has already been reported to the site. Try again later.
    • If the problem persists:
      • check the CRAB log: it should give the name of the SE from which you were trying to access data
      • check whether you are, e.g., whitelisting only that SE: if so, use all SEs that hold a replica of the data
      • look in CMS.SiteDB::Reports and identify the site name from the SE name
      • look up the site name in the CMS.SiteStatusBoard, click on its name, and see the status of the BDII check
      • click on "visible" to get the history: the SE may be out of the BDII for a reason
      • if you get to this point, open a ticket on the CRAB feedback HyperNews list
  4. Is your job failing because it cannot access the data (exit code 8020)?
    • Typical symptoms:
      CRAB Exit code: 50115/7 or 30001 for CMSSW16x, 8001 for CMSSW>=17x
      Error message from stderr (from CMSSW code): 
        ---- Configuration BEGIN
        Error occured while creating source PoolSource
        ---- DCacheFile BEGIN
          DCacheFile::open()
          dc_open failed: filename = /path/to/file
          open flags = 0
          permissions = 438
          dcache error code = 0
        ---- DCacheFile END
        ---- Configuration END
      
      
    • How many jobs are failing in your task?
      • A few
        • retry
      • A large percentage (or persistently the same jobs in the task after retrying)
        • proceed as for "All" below
      • All
        • report to the CRAB feedback HyperNews list, including the dashboard link from crab -status AND at least one example of the exact error message you get from the log file CMSSW*.stdout/err.
        • while the problem with the site is being investigated, you can try to resubmit with the offending site blacklisted

  5. Are your jobs aborting at a site?
    • Use the -postMortem option of CRAB
      •  crab -postMortem
      • This option creates a file containing the Grid information about the aborted job (the loggingInfo information). If the abort reason is not clear, you can write to the CRAB feedback HyperNews list, including your crab.cfg and the postMortem file.
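
The symptom listings in this section can be turned into a rough first-pass classifier. The sketch below matches a job log against message fragments quoted on this page and, for data-access failures, suggests an action from the fraction of failing jobs. The function names `diagnose` and `triage` and the `few_fraction` cut-off are hypothetical; the page itself only distinguishes "a few", "a large percentage" and "all".

```python
import re

# Message fragments copied from the symptom listings in this section,
# each mapped to the diagnosis step it belongs to.
SYMPTOMS = [
    (re.compile(r"No compatible site found"),
     "CMSSW version not advertised at the site, or site not in BDII"),
    (re.compile(r"ERROR ==> CMSSW \S+ not found on"),
     "CMSSW area not accessible at the site (exit code 10034)"),
    (re.compile(r"dc_open failed"),
     "data access failure"),
]


def diagnose(log_text):
    """Return the hints whose pattern occurs in log_text, in page order."""
    return [hint for pattern, hint in SYMPTOMS if pattern.search(log_text)]


def triage(n_failed, n_total, few_fraction=0.1):
    """Suggest an action for data-access failures.

    few_fraction is an assumed threshold for "a few" failing jobs;
    it is not defined anywhere on this page.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_failed == 0:
        return "nothing to do"
    if n_failed / n_total <= few_fraction:
        return "retry"
    return "report to CRAB feedback; consider blacklisting the site"
```

For example, `diagnose(open("crab.log").read())` would list the matching hints for a log file in the current directory, and `triage(3, 200)` would suggest a retry. An empty result from `diagnose` means only that none of the quoted fragments were found.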

Problems with the output

  1. Have you planned your output handling carefully? Your job output has to be smaller than 50MB, otherwise you will not be able to retrieve it. CRAB limits the OutputSandbox size because of disk space problems in the WMSs. If the output of your job will be bigger than 50MB, you have to copy it directly from the WorkerNode to a StorageElement by setting copy_data = 1 in crab.cfg. (See Output files size)
  2. Are you having problems with remote stageout of your output files?
    • Typical symptoms:
      CRAB Exit code: 60307
      Error message: 
       Copy output files from WN = <sitenode> to SE = <SEname> :
      
          Trying to copy output file </path/to/file> to <SEname>
          path_out_file = </path/to/file>
          destination = <fileSURL>
          .....
          StageOutExitStatus = 60307
          StageOutExitStatusReason = .....
      
    • if the site is CERN, check which SRM endpoints are currently in use at CERN; the one you are using may be wrong
    • if the site is not CERN, open a ticket on the CRAB feedback HyperNews list
      • report all relevant information:
        • the full name of the remote SE
        • the complete SURL from the error message (destination = ...)
        • the full failing command
  3. Did your jobs run and return stdout and stderr, but not produce the expected output? This is an application problem; send it to the application support list.
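
The 50MB rule from step 1 can be checked before submission. The sketch below reads the text of a crab.cfg, looks up copy_data, and compares an expected output size against the limit quoted on this page. It rests on assumptions: the function name `output_advice` is hypothetical, the size estimate must come from you (for example, from a local test run), and the `USER` section name is a guess about where copy_data lives in crab.cfg, so adjust it to match your file.

```python
import configparser

SANDBOX_LIMIT_MB = 50  # OutputSandbox limit quoted on this page


def output_advice(cfg_text, expected_output_mb, section="USER"):
    """Compare the expected output size with the sandbox limit and
    report whether copy_data = 1 is needed in the given crab.cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # A missing section or option is treated as copy_data = 0.
    copy_data = parser.get(section, "copy_data", fallback="0").strip() == "1"
    if expected_output_mb < SANDBOX_LIMIT_MB:
        return "ok: output fits in the sandbox"
    if copy_data:
        return "ok: output is copied to a StorageElement"
    return "set copy_data = 1: output exceeds the sandbox limit"
```

For example, `output_advice(open("crab.cfg").read(), 200)` returns the "set copy_data = 1" message unless the option is already enabled.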

Nothing above helped

  1. If you have gone through all the steps above and the problem persists, put crab.cfg, crab.log, and the job stderr and stdout in web space or in your AFS public area, and send your question to the CRAB feedback HyperNews list.
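
Collecting the requested files into a single archive makes them easier to post. The following is a minimal sketch using only the Python standard library. The function name `bundle_report` is hypothetical, and so is the assumption that all the files sit in one directory and that the job logs end in `.stdout` and `.stderr`; adjust the patterns to your task layout.

```python
import glob
import os
import tarfile


def bundle_report(task_dir, archive_path,
                  patterns=("crab.cfg", "crab.log", "*.stdout", "*.stderr")):
    """Pack the files in task_dir that match `patterns` into a gzipped
    tar archive at archive_path and return the names that were added."""
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.join(task_dir, pattern))):
                name = os.path.basename(path)
                # Store files flat, without the local directory structure.
                archive.add(path, arcname=name)
                added.append(name)
    return added
```

The returned list shows what actually went into the archive, so a missing crab.log or job log is noticed before the question is sent.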

Error messages/codes

Error Messages/Codes page

Review status

Reviewer/Editor and Date (copy from screen) Comments
CMSUserSupport - 13 Jul 2007 created the template page

Responsible: MarcoCalloni
Last reviewed by: FedericaFanzago - 28 Feb 2008

Topic revision: r67 - 2015-04-15 - FreyaBlekman