DeReTo (working title) plug-in - 14/09/2006

Developed by Derek Groen

Attached code snippets

Description:

Derek's Reliability Tool (DeReTo) seeks to improve user job success rate by analyzing the logs and output data of previous job submissions.
  • Using user-specified, application-specific pattern definitions, the tool parses the job output and detects the precise error that has occurred. In addition, the tool parses log files to determine the final status of a job, the time the job spent in a queue, and the time it spent executing on the worker nodes.
  • This information is stored in a file in the gangadir. The contents of this file can then be used to calculate a reliability score for individual sites in the Grid.
  • The reliability scores give the user insight into past site behavior in terms of reliability, and allow the user (or an automated algorithm) to select a more reliable site.
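
As an illustration of the last two points, here is a minimal sketch of how a per-site reliability score could be computed and used to pick a site. It is not the scoring scheme used in the plug-in: the record format (site, final status), the status string "completed", the site names, and the function names reliability_scores and select_site are all assumptions made for this example, and the score is simply the fraction of successful jobs.

```python
from collections import defaultdict

# Hypothetical record format: (site, final_status). The real layout of the
# statistics file in the gangadir is not described in this document.
def reliability_scores(records):
    """Return {site: fraction of jobs that finished successfully}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, status in records:
        totals[site] += 1
        if status == "completed":  # assumed name of the success status
            successes[site] += 1
    return {site: successes[site] / totals[site] for site in totals}


def select_site(scores):
    """Pick the site with the highest score; ties go to the first name alphabetically."""
    return max(sorted(scores), key=lambda site: scores[site])


# Made-up job history for two placeholder sites.
records = [
    ("ce.site-a.example", "completed"),
    ("ce.site-a.example", "failed"),
    ("ce.site-b.example", "completed"),
    ("ce.site-b.example", "completed"),
]
scores = reliability_scores(records)
best = select_site(scores)
```

A real scheme would probably also weight recent jobs more heavily and take queue and execution times into account; this sketch leaves that out.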

Benefits:

  • In the validation experiments for this tool, even basic reliability scoring and site selection schemes greatly reduced the number of job failures without submitting excess jobs.
  • Sites selected by this tool using job statistics less than a month old have, on average, behaved more reliably than sites selected by the RB or at random.

How to set it up:

The current setup is somewhat crude. Adjustments have been made in the following places:
  • The LCG backend handler (LCG.py): this file contains many changes, mainly for parsing job output and job log files and for adding optional automated site selection.
  • Executable object: a custom executable has been developed to support the specification of devious flow patterns (dfps). Currently this object resides in an "App/TinyJob" subdirectory of the python dir.
  • .gangarc: The following properties have been defined for the object TinyJob in this file:
#Properties for example Tiny Job with Devious Flows.
[TinyJob_Properties]
exe = File('/home/djgroen/tinyjob.sh')
args = ['5']
dfp = [["stdout","contains","Segmentation violation",7.20],["stdout","contains","source got corrupted",7.21],["stdout","has length",0,8.1]]

  • Saved job files. Two job types have been written to file from Ganga. See Appendix for their contents.
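
To make the dfp property more concrete, here is a minimal sketch of how such a pattern list could be evaluated against job output. It is not the code from TinyJob.py or LCG.py. It assumes that each entry has the form [stream, operator, operand, code], that "contains" is a substring test, that "has length" compares the number of characters in the stream, and that the fourth element is a numeric code identifying the detected problem (its exact meaning is not stated here). The function name match_dfp is hypothetical.

```python
def match_dfp(output, dfp):
    """Return the codes of all patterns that match the job output.

    output: dict mapping a stream name ('stdout', 'stderr') to its text.
    dfp:    list of [stream, operator, operand, code] entries.
    """
    matched = []
    for stream, operator, operand, code in dfp:
        text = output.get(stream, "")
        if operator == "contains" and operand in text:
            matched.append(code)
        elif operator == "has length" and len(text) == operand:
            # Assumption: "length" means number of characters in the stream.
            matched.append(code)
    return matched


# The pattern list from the .gangarc example.
dfp = [
    ["stdout", "contains", "Segmentation violation", 7.20],
    ["stdout", "contains", "source got corrupted", 7.21],
    ["stdout", "has length", 0, 8.1],
]

crashed = match_dfp({"stdout": "step 3\nSegmentation violation\n"}, dfp)
empty = match_dfp({"stdout": ""}, dfp)
```

Under these assumptions, crashed holds the code of the first pattern and empty holds the code of the "has length" pattern. Entries with an unknown operator are silently ignored in this sketch.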

Status:

  • Working code compatible with version 4.1.4 is available. Testing has been performed with this version during validation runs.
  • The code still needs to be packaged, partially debugged, cleaned up, and updated.
  • The code is untested with subjobs; it may or may not work.

Many thanks go out to:

  • Dick van Albada, David Groep, Jeff Templon, Alfredo Tirado-Ramos ... all for supervising my project
  • Willem van Leeuwen ... for providing me with sample production jobs and job statistics data
  • Dennis Kaarsemaker ... for helping with some of the initial job analysis
  • The GANGA development team ... for providing ample support during development and testing of the plug-in

Any suggestions for a more suitable title are highly welcome. Feel free to mail them to djgroen AT science DOT uva DOT nl

Appendix:

Production-type job specification used:
#Ganga# File created by Ganga - Thu Jul 20 16:25:40 2006
#Ganga#
#Ganga# Object properties may be freely edited before reloading into Ganga
#Ganga#
#Ganga# Lines beginning #Ganga# are used to divide object definitions,
#Ganga# and must not be deleted

#Ganga# Job object (category: jobs)
Job (
 name = '' ,
 outputsandbox = [] ,
 splitter = None ,
 inputsandbox = ["/home/djgroen/bigjob/run_mcc.sh","/home/djgroen/bigjob/minbi","/home/djgroen/bigjob/Request-1982-06208134123.tar.gz"] ,
 application = TinyJob (
    exe = File(name='/home/djgroen/bigjob/run_mcc.sh',subdir='.') ,
    env = {} ,
    args = ['Request-1982-06208134123'] ,
    dfp = [
        ['stderr','contains','cat: minbi: No such file',4.10],
        ['stderr','contains','globus_ftp_client: the server responded with an error',4.10],
        ['stdout','contains','egmentation violation',7.20],
        ['stderr','contains','lcg_cp: Communication error on send',7.21],
        ['stdout','has length',0,5.1],
        ['stdout','contains','evpack-F-Zlib_Error',7.20],
        ['stdout','contains','evpack-F-Bad_Magic',7.20],
        ['stdout','contains','MCpythia.x: command not found',7.21]]
    ) ,
 inputdata = None ,
 backend = LCG (
    CE = None ,
    requirements = LCGRequirements (
       other = [] ,
       memory = None ,
       software = [] ,
       ipconnectivity = None ,
       cputime = None ,
       walltime = None
       )
    )
 )
Simple job specification used:
#Ganga# File created by Ganga - Thu Jul 20 16:25:40 2006
#Ganga#
#Ganga# Object properties may be freely edited before reloading into Ganga
#Ganga#
#Ganga# Lines beginning #Ganga# are used to divide object definitions,
#Ganga# and must not be deleted

#Ganga# Job object (category: jobs)
Job (
 name = '' ,
 outputsandbox = [] ,
 splitter = None ,
 inputsandbox = [ ] ,
 application = TinyJob (
    exe = File(name='/home/djgroen/tinyjob.sh',subdir='.') ,
    env = {} ,
    args = ['5'] ,
    dfp = [['stdout','contains','Segmentation Violation',7.20],['stdout','contains','source got corrupted',7.21],['stdout','has length',0,5.1]]
    ) ,
 inputdata = None ,
 backend = LCG (
    CE = None ,
    requirements = LCGRequirements (
       other = [] ,
       memory = None ,
       software = [] ,
       ipconnectivity = None ,
       cputime = None ,
       walltime = None
       )
    )
 )
-- JakubMoscicki - 04 Oct 2006
Topic attachments:
  • LCG.py.txt (r1, 29.9 K, 2006-10-04 15:35, JakubMoscicki): LCG handler
  • README (r1, 1.0 K, 2006-10-04 15:33, JakubMoscicki): README for the code snippets
  • TinyJob.py.txt (r1, 7.9 K, 2006-10-04 15:34, JakubMoscicki): TinyJob.py