Questionnaire for job processing monitoring

  • 1). What is currently used in your VO to monitor production jobs and analysis jobs?
  • 2). Is a web interface provided for monitoring production jobs and analysis jobs?
  • 3). How are user support and general troubleshooting help organized for production and for analysis?
  • 4). Do you think that the current set of failure reasons and codes returned by the Grid or the application is sufficient to understand the underlying problem?
  • 5). For application-related failures, is it possible to understand the reason for the failure (data access, software distribution, user error...) from the exit code (reason) returned by the application? If not, is any work planned or ongoing to improve the situation?
  • 6). Is any parsing of stdout, stderr, or the application log file currently implemented by the job wrapper to create a meaningful error report? If yes, how is this information propagated back to the job submission UI or to a monitoring system?
  • 7). Would you consider it useful to define a generic way of communicating application-specific summary information from the job to the job submission UI or to a monitoring system?
  • 8). Do you have SAM tests for identifying site problems related to job processing? Are they useful? If not, why? If yes, would you consider increasing their number or making them more specific?
  • 9). Are the task concept and the dataset concept used by the experiment?

-- Main.julia - 12 Feb 2007

