If many jobs start to fail at a site, the failures should be investigated immediately. Common causes include:

  • Site problem with networking.
  • Shared software area is not working.
  • Jobs are using up too much memory/CPU/storage (perhaps a memory leak) and being killed by the local batch system.
  • Site grid storage is down and data access is no longer possible.

dirac-lhcb-job-logging-check

At the moment it is difficult to spot correlations among error messages using the DIRAC job monitoring page. This is for a couple of reasons.

  • The ApplicationStatus often tells you which error caused the job to fail, but it is currently being overwritten by a generic error message.
  • The information you really want to look at is buried away inside the Parameters and Logging menu options and not visible on the main page (so not available for sorting in the table).

The script dirac-lhcb-job-logging-check pulls together the Logging and Parameters information and presents a summary to the user. This is particularly useful when trying to debug Failed jobs. For example:

$ dirac-lhcb-job-logging-check --Status=Failed --Site=LCG.NIPNE-07.ro --JobGroup=00003145 --Verbose=True
1026388 wn48.nipne.ro 2008-12-16 00:18:20 Application not Found 0.62 897092kB
1026449 wn48.nipne.ro 2008-12-16 00:31:29 Application not Found 0.68 1069816kB
1026476 wn48.nipne.ro 2008-12-16 00:31:33 Application not Found 0.57 1070308kB
...
...
Error summary information

Application not Found occurred on 111 nodes

Node summary information

wn47.nipne.ro had 30 errors, of which 2 were unique
wn48.nipne.ro had 81 errors, of which 2 were unique

Average CPU efficiency is 0.64
Standard Deviation of CPU efficiency is 0.13 

The schema is as follows:

DIRACid | node_job_ran_on | timestamp_of_first_error | first_error_message_in_logging | cpu_efficiency | memory_consumed

The above example shows many "Application not Found" messages on two nodes at NIPNE-07. This indicates a problem with the shared software area at this site (e.g., broken NFS mounts). The site should be contacted about this via GGUS.
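
If the per-job lines need further processing (for instance, to group errors by node across several runs), they can be parsed according to the schema above. The following is a minimal Python sketch, assuming the whitespace-separated layout shown in the example output, with memory always given in kB. The names parse_line and summarise are hypothetical helpers for illustration and are not part of DIRAC.

```python
import re
import statistics
from collections import Counter, defaultdict

# One per-job line, following the schema:
# DIRACid node timestamp first_error_message cpu_efficiency memory_consumed
# Assumption: fields are whitespace-separated and memory ends in "kB".
LINE_RE = re.compile(
    r"^(?P<job_id>\d+)\s+"
    r"(?P<node>\S+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<error>.+?)\s+"
    r"(?P<cpu_eff>\d+(?:\.\d+)?)\s+"
    r"(?P<memory_kb>\d+)kB\s*$"
)


def parse_line(line):
    """Return a dict for a per-job line, or None for any other line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["cpu_eff"] = float(record["cpu_eff"])
    record["memory_kb"] = int(record["memory_kb"])
    return record


def summarise(lines):
    """Count each error message per node and average the CPU efficiency."""
    records = [r for r in map(parse_line, lines) if r is not None]
    per_node = defaultdict(Counter)
    for record in records:
        per_node[record["node"]][record["error"]] += 1
    efficiencies = [r["cpu_eff"] for r in records]
    return {
        "errors_per_node": {n: dict(c) for n, c in per_node.items()},
        "mean_cpu_eff": statistics.mean(efficiencies) if efficiencies else None,
    }


# Lines copied from the example output above; the "..." line is skipped.
SAMPLE = [
    "1026388 wn48.nipne.ro 2008-12-16 00:18:20 Application not Found 0.62 897092kB",
    "1026449 wn48.nipne.ro 2008-12-16 00:31:29 Application not Found 0.68 1069816kB",
    "...",
    "1026476 wn48.nipne.ro 2008-12-16 00:31:33 Application not Found 0.57 1070308kB",
]

print(summarise(SAMPLE))
```

Lines that do not match the schema, such as the summary sections printed by the script, are ignored rather than raising an error, so the full output can be passed in unchanged.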

-- GreigCowan - 16 Dec 2008


Topic revision: r1 - 2008-12-16 - GreigCowan
 