DeReTo (working title) plug-in - 14/09/2006
Developed by Derek Groen
Attached code snippets
Description:
Derek's Reliability Tool (DeReTo) seeks to improve user job success rate
by analyzing the logs and output data of previous job submissions.
- Using user-specified, application-specific pattern definitions, the tool parses the job output and detects the precise error that has occurred. In addition, the tool parses log files to determine the final status of a job, the time the job spent in a queue, and the time it spent executing on the worker nodes.
- This information is stored in a file in the gangadir. The contents of this file can then be used to calculate a reliability score for individual sites in the Grid.
- The reliability scores give the user insight into past site behavior in terms of reliability, and allow the user (or an automated algorithm) to select a more reliable site.
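As a rough illustration of the last two points, the Python sketch below computes a per-site reliability score as the fraction of jobs that finished successfully, and then picks the best-scoring site. The record layout (the `site` and `status` keys), the status string `completed`, the site names, and the function names are all assumptions made for this example. The plug-in's actual file format and scoring scheme are not described on this page and may differ.

```python
from collections import defaultdict


def reliability_scores(records, min_jobs=1):
    """Return {site: fraction of successful jobs} for each site.

    `records` is a list of dicts with hypothetical keys 'site' and
    'status'. Sites with fewer than `min_jobs` records are left out.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for rec in records:
        totals[rec["site"]] += 1
        if rec["status"] == "completed":
            successes[rec["site"]] += 1
    return dict(
        (site, successes[site] / float(n))
        for site, n in totals.items()
        if n >= min_jobs
    )


def select_site(scores):
    """Return the highest-scoring site (ties broken by name), or None."""
    if not scores:
        return None
    return min(scores, key=lambda site: (-scores[site], site))


# Made-up history for three hypothetical sites.
history = [
    {"site": "site-a", "status": "completed"},
    {"site": "site-a", "status": "failed"},
    {"site": "site-b", "status": "completed"},
    {"site": "site-b", "status": "completed"},
    {"site": "site-b", "status": "failed"},
    {"site": "site-c", "status": "failed"},
]
scores = reliability_scores(history)
best = select_site(scores)
```

A success fraction is only one possible score. Weighting recent jobs more heavily, or requiring a minimum number of jobs per site through `min_jobs`, would be natural refinements.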
Benefits:
- In the validation experiments of this tool, even basic reliability score and site selection schemes greatly reduced the number of job failures without submitting excess jobs.
- Sites selected by this tool using job statistics data less than a month old have been shown, on average, to behave more reliably than sites selected by the RB (Resource Broker) or at random.
How to set it up:
The current setup is somewhat crude, but adjustments have been made in the following files:
- The LCG backend handler (LCG.py): this file contains a lot of changes, mainly in parsing job output, job log files and adding in optional automated site selection.
- Executable Object: A custom executable has been developed to support the specification of devious flow patterns (DFPs). Currently this object resides in an "App/TinyJob" subdirectory in the python dir.
- .gangarc: The following properties have been defined for the object TinyJob in this file:
#Properties for example Tiny Job with Devious Flows.
[TinyJob_Properties]
exe = File('/home/djgroen/tinyjob.sh')
args = ['5']
dfp = [["stdout","contains","Segmentation violation",7.20],["stdout","contains","source got corrupted",7.21],["stdout","has length",0,8.1]]
- Saved job files. Two job types have been written to file from Ganga. See Appendix for their contents.
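To make the dfp property shown above more concrete, the following Python sketch shows one way such patterns could be evaluated against job output. It assumes that each pattern is a list of the form [stream, operator, value, code], that "contains" is a substring test, that "has length" compares the length of the stream with the given value, and that the fourth element is an error code reported on a match. These semantics and the function names are guesses based on the examples on this page, not the plug-in's actual implementation.

```python
def match_dfp(pattern, outputs):
    """Return True if a single pattern matches the job output.

    `pattern` is [stream, operator, value, code]; `outputs` maps a
    stream name ('stdout', 'stderr') to its captured text.
    """
    stream, operator, value, _code = pattern
    text = outputs.get(stream, "")
    if operator == "contains":
        return value in text
    if operator == "has length":
        return len(text) == value
    raise ValueError("unknown DFP operator: %r" % (operator,))


def detect_errors(dfp, outputs):
    """Return the codes of all patterns that match, in definition order."""
    return [pattern[3] for pattern in dfp if match_dfp(pattern, outputs)]


# Patterns copied from the TinyJob properties above.
EXAMPLE_DFP = [
    ["stdout", "contains", "Segmentation violation", 7.20],
    ["stdout", "contains", "source got corrupted", 7.21],
    ["stdout", "has length", 0, 8.1],
]

crashed = detect_errors(EXAMPLE_DFP, {"stdout": "step 3\nSegmentation violation\n"})
silent = detect_errors(EXAMPLE_DFP, {"stdout": ""})
clean = detect_errors(EXAMPLE_DFP, {"stdout": "all done\n"})
```

Under these assumptions, a job whose stdout is empty matches only the "has length" pattern, and a job with ordinary output matches none.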
Status:
- Working code compatible with version 4.1.4 is available. Testing has been performed with this version during validation runs.
- The code still needs to be packaged, and parts of it need debugging, cleanup, and updating.
- The code is untested with subjobs (it may or may not work).
Many thanks go out to:
Dick van Albada
David Groep
Jeff Templon
Alfredo Tirado-Ramos ... all for supervising my project
Willem van Leeuwen ... for providing me with sample production jobs and
job statistics data
Dennis Kaarsemaker ... for helping with some of the initial job analysis
The GANGA development team ... for providing ample support during
development and testing of the plug-in
Any suggestions for a more suitable title are highly welcome. Feel free to
mail them to djgroen AT science DOT uva DOT nl
Appendix:
Production-type job specification used:
#Ganga# File created by Ganga - Thu Jul 20 16:25:40 2006
#Ganga#
#Ganga# Object properties may be freely edited before reloading into Ganga
#Ganga#
#Ganga# Lines beginning #Ganga# are used to divide object definitions,
#Ganga# and must not be deleted
#Ganga# Job object (category: jobs)
Job (
name = '' ,
outputsandbox = [] ,
splitter = None ,
inputsandbox = ["/home/djgroen/bigjob/run_mcc.sh","/home/djgroen/bigjob/minbi","/home/djgroen/bigjob/Request-1982-06208134123.tar.gz"] ,
application = TinyJob (
exe = File(name='/home/djgroen/bigjob/run_mcc.sh',subdir='.') ,
env = {} ,
args = ['Request-1982-06208134123'] ,
dfp = [['stderr','contains','cat: minbi: No such file',4.10],
['stderr','contains','globus_ftp_client: the server responded with an error',4.10],
['stdout','contains','egmentation violation',7.20],
['stderr','contains','lcg_cp: Communication error on send',7.21],
['stdout','has length',0,5.1],
['stdout','contains','evpack-F-Zlib_Error',7.20],
['stdout','contains','evpack-F-Bad_Magic',7.20],
['stdout','contains','MCpythia.x: command not found',7.21]]
) ,
inputdata = None ,
backend = LCG (
CE = None ,
requirements = LCGRequirements (
other = [] ,
memory = None ,
software = [] ,
ipconnectivity = None ,
cputime = None ,
walltime = None
)
)
)
Simple job specification used:
#Ganga# File created by Ganga - Thu Jul 20 16:25:40 2006
#Ganga#
#Ganga# Object properties may be freely edited before reloading into Ganga
#Ganga#
#Ganga# Lines beginning #Ganga# are used to divide object definitions,
#Ganga# and must not be deleted
#Ganga# Job object (category: jobs)
Job (
name = '' ,
outputsandbox = [] ,
splitter = None ,
inputsandbox = [ ] ,
application = TinyJob (
exe = File(name='/home/djgroen/tinyjob.sh',subdir='.') ,
env = {} ,
args = ['5'] ,
dfp = [['stdout','contains','Segmentation Violation',7.20],['stdout','contains','source got corrupted',7.21],['stdout','has length',0,5.1]]
) ,
inputdata = None ,
backend = LCG (
CE = None ,
requirements = LCGRequirements (
other = [] ,
memory = None ,
software = [] ,
ipconnectivity = None ,
cputime = None ,
walltime = None
)
)
)
--
JakubMoscicki - 04 Oct 2006