Debugging the reason for Jobs being rescheduled.

Jobs are automatically rescheduled by the DIRAC Wrapper if there is a problem in the preparation of the environment. This is normally due to either a problem with the software installation or a problem with access to input data.

The simplest way to find out is to look at the LoggingInfo of the Job. Go to the Job Monitoring page, insert the given JobID (or select the affected user), then left-click in the row of the problematic job and select "Logging Info". This displays the list of States through which the Job has gone, and you should see "Rescheduled" several times. Check the associated MinorStatus and ApplicationStatus. These should give a clear message about the nature of the problem, for example:

Rescheduled                   Input Data Resolution         Failed Input Data Resolution  2010-11-16 11:31:21
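If the Logging Info rows are copied into a text file, the repeated "Rescheduled" entries can be summarised with a few lines of Python. This is only a sketch: it assumes that each row has four columns (Status, MinorStatus, ApplicationStatus, timestamp) separated by two or more spaces, as in the sample row above. The function names are illustrative and not part of DIRAC.

```python
import re
from collections import Counter


def parse_logging_line(line):
    """Split one Logging Info row into its four columns.

    Assumption: columns are separated by two or more spaces,
    as in the sample row shown on this page.
    """
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) != 4:
        raise ValueError(f"unexpected row layout: {line!r}")
    status, minor_status, application_status, timestamp = fields
    return {
        "Status": status,
        "MinorStatus": minor_status,
        "ApplicationStatus": application_status,
        "Timestamp": timestamp,
    }


def reschedule_reasons(lines):
    """Count (MinorStatus, ApplicationStatus) pairs over the Rescheduled rows."""
    reasons = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_logging_line(line)
        if row["Status"] == "Rescheduled":
            reasons[(row["MinorStatus"], row["ApplicationStatus"])] += 1
    return reasons


# The sample row from this page.
sample = (
    "Rescheduled                   Input Data Resolution         "
    "Failed Input Data Resolution  2010-11-16 11:31:21"
)
```

Calling `reschedule_reasons` on all rows of a job shows at a glance whether every reschedule cycle failed for the same reason.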

To further diagnose the problem we need to access the output of the pilots that tried to execute the Job. If the Job has reached a final "Failed" state after a certain number of reschedule cycles, you can try "Pilot -> Get Stdout" from the Job Monitoring page. If the Job is still being rescheduled, you can get the list of pilots using the command line tools (requires an lhcb_prod proxy):

dirac-admin-get-job-pilots [JobID]

From the output of the command, you are interested in several things:

* Status of the pilots: Stdout can only be retrieved for Done pilots.

* GridSite: are they all being scheduled to the same Site?

* PilotJobReference: this is needed to retrieve the pilot output in the next step.
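The three checks above can be sketched in Python once the pilot information has been collected. The sketch assumes the pilots are available as a list of dictionaries with the keys Status, GridSite and PilotJobReference. It does not parse the actual output of dirac-admin-get-job-pilots, whose exact layout is not shown here, and the function name and the example records are hypothetical.

```python
from collections import Counter


def summarise_pilots(pilots):
    """Summarise pilot records for a rescheduled job.

    Assumption: each record is a dict with the keys 'Status',
    'GridSite' and 'PilotJobReference'.
    """
    sites = Counter(p["GridSite"] for p in pilots)
    # Stdout can only be retrieved for pilots in the Done status.
    retrievable = [p["PilotJobReference"] for p in pilots if p["Status"] == "Done"]
    return {
        "sites": sites,
        "single_site": len(sites) == 1,  # True if every pilot ran at one Site
        "retrievable": retrievable,
    }


# Hypothetical records, for illustration only.
example_pilots = [
    {"Status": "Done", "GridSite": "SITE.A", "PilotJobReference": "pilot-ref-1"},
    {"Status": "Running", "GridSite": "SITE.A", "PilotJobReference": "pilot-ref-2"},
    {"Status": "Done", "GridSite": "SITE.B", "PilotJobReference": "pilot-ref-3"},
]
summary = summarise_pilots(example_pilots)
```

If `single_site` is true, the problem is likely specific to that Site. The references in `retrievable` are the ones whose Stdout can be fetched.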

To get the Stdout of the pilots use either the Pilot Monitoring page or the command line tool (requires lhcb_prod group/proxy):

dirac-admin-get-pilot-output [PilotJobReference]...

A simple way to get to the relevant information is to grep for the JobID (to get only the messages associated with the JobWrapper) and then search for ERROR messages like:

pilot_syTBEbnNPZ7Kp32Sn-zQwA/std.out:2010-11-16 12:18:29 UTC Wrapper_12800362 ERROR: SRM2Storage.__gfal_exec: Failed to perform gfal_turlsfromsurls. [SE][GetSpaceTokens][] httpg://srm.grid.sara.nl:8443/srm/managerv2: CGSI-gSOAP running on wn-smrt-076.farm.nikhef.nl reports could not open connection to srm.grid.sara.nl
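The same filtering can be sketched in Python for cases where the pilot output has already been saved to a file. The sketch assumes that JobWrapper messages carry a `Wrapper_<JobID>` tag and the word ERROR, as in the line above. The function name is illustrative and not part of DIRAC.

```python
import re


def wrapper_errors(lines, job_id):
    """Return the ERROR lines written by the JobWrapper of the given job.

    Assumption: JobWrapper messages are tagged 'Wrapper_<JobID>',
    as in the sample line shown on this page.
    """
    # Word boundaries stop JobID 1280036 from matching Wrapper_12800362.
    tag = re.compile(rf"\bWrapper_{re.escape(str(job_id))}\b")
    return [line for line in lines if tag.search(line) and "ERROR" in line]


# The sample line from this page, followed by an unrelated line for contrast.
sample_lines = [
    "pilot_syTBEbnNPZ7Kp32Sn-zQwA/std.out:2010-11-16 12:18:29 UTC "
    "Wrapper_12800362 ERROR: SRM2Storage.__gfal_exec: Failed to perform "
    "gfal_turlsfromsurls. [SE][GetSpaceTokens][] "
    "httpg://srm.grid.sara.nl:8443/srm/managerv2: CGSI-gSOAP running on "
    "wn-smrt-076.farm.nikhef.nl reports could not open connection to "
    "srm.grid.sara.nl",
    "2010-11-16 12:18:30 UTC some other component INFO: unrelated message",
]
errors = wrapper_errors(sample_lines, 12800362)
```

To apply it to a saved file, pass the open file object as `lines`, since iterating over a file yields its lines one by one.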

-- RicardoGraciani - 16-Nov-2010

Topic revision: r1 - 2010-11-16 - Graciani
 