Design of an Expert System based on found Association Rules in Grid Job Monitoring Data

Abstract

Given the complexity of a Grid infrastructure, it is impracticable to provide an absolutely fail-safe system. Therefore, strong error reporting and handling are needed. Grid monitoring systems are in place and can deliver the error codes of failed Grid jobs. However, the error codes do not always denote the actual source of the error, so a more sophisticated methodology is required to locate problematic Grid elements.

As part of my PhD, I am investigating whether Grid monitoring data can be mined using association rule mining. This approach produces additional knowledge about the behavior of Grid elements by taking into account correlations and dependencies between the characteristics of failed Grid jobs. Besides detecting errors, interpreting and understanding them is necessary to solve the problems that occur. This crucial task is carried out by the experienced users and administrators involved in everyday Grid operations. In my research I am designing an expert system that combines the error patterns found by mining the monitoring data with the experts' knowledge about the underlying problems and their solutions.

The design combines machine-generated knowledge with human knowledge and provides a way to automatically detect and react to problematic Grid elements.

Approach

The design of a system called QAOES, short for Quick Analysis Of Error Sources, consists of two main building blocks. The first is the data mining step, in which association rule mining is applied to Grid job monitoring data after some preprocessing. The second is the collection of human knowledge, which is transformed into a machine-understandable form and used to build an expert system. Each of these two blocks comprises several phases, which are illustrated in two very simple diagrams.

Data Mining Step
QAOES - Web Interface
A web interface (QAOES) is currently available that visualizes the output of association rule mining applied to a given data set. A data set in this context is, for example, "monitoring data of CMS analysis jobs from the last 12 hours". This monitoring data consists of a number of job characteristics, such as the user who submitted the job, the site where it was executed, the dataset used, and the exit code. A data mining algorithm called Apriori is applied to the data set and returns rules in the following format: {I1, I2, ..., In-1} => {In}. The first part of the rule is called the antecedent and the second the consequent. Each of them consists of job characteristics, which are key-value pairs such as USERNAME=xxx. Two measures are involved in running the algorithm:
  • the support of a rule, which is the percentage of jobs in the data set that contain all the job characteristics of the rule (from both the antecedent and the consequent).
  • the confidence of a rule, which states how significant the rule is: it is the percentage of the jobs containing the job characteristics of the antecedent that also contain the job characteristics of the consequent.
The Apriori algorithm takes minimum values for these two measures as input. Reasonable values for the minimum support and the minimum confidence would be 1% and 80%, respectively.
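As a minimal sketch of the two measures defined above, the following Python snippet computes the support and confidence of a single rule over a small, made-up set of jobs and checks them against the minimum values. The job data (user names, site names, exit codes) and the helper names `support`, `confidence`, and `passes_thresholds` are assumptions chosen for illustration; they are not taken from QAOES.

```python
# Hypothetical toy data: each job is a set of KEY=value job characteristics.
jobs = [
    {"USERNAME=alice", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=alice", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=alice", "SITE=T2_B", "EXITCODE=0"},
    {"USERNAME=bob", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=bob", "SITE=T2_B", "EXITCODE=0"},
]


def count(items, jobs):
    """Number of jobs that contain every characteristic in `items`."""
    items = frozenset(items)
    return sum(1 for job in jobs if items <= job)


def support(items, jobs):
    """Fraction of all jobs that contain every characteristic in `items`."""
    return count(items, jobs) / len(jobs)


def confidence(antecedent, consequent, jobs):
    """Fraction of jobs matching the antecedent that also match the consequent."""
    n_antecedent = count(antecedent, jobs)
    if n_antecedent == 0:
        return 0.0
    return count(set(antecedent) | set(consequent), jobs) / n_antecedent


def passes_thresholds(antecedent, consequent, jobs,
                      min_support=0.01, min_confidence=0.80):
    """Check a rule against the minimum support and minimum confidence."""
    rule_items = set(antecedent) | set(consequent)
    return (support(rule_items, jobs) >= min_support
            and confidence(antecedent, consequent, jobs) >= min_confidence)


if __name__ == "__main__":
    rule = ({"SITE=T2_A"}, {"EXITCODE=8001"})
    print("support:   ", support(rule[0] | rule[1], jobs))
    print("confidence:", confidence(rule[0], rule[1], jobs))
    print("kept:      ", passes_thresholds(rule[0], rule[1], jobs))
```

In this toy data, the rule {SITE=T2_A} => {EXITCODE=8001} is kept because every job at that site failed with that exit code, whereas {USERNAME=alice} => {EXITCODE=8001} is dropped because only two of the three jobs submitted by that user failed.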

In the web interface, all rules that satisfy the minimum support and the minimum confidence are listed together with their exact support, their confidence, and a third value called lift. The lift represents the interestingness of a rule: the higher the value, the more interesting the rule. For example, a rule with low support but 100% confidence is interesting.
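The listing described above can be sketched as follows. This is an illustration under stated assumptions, not the QAOES implementation: it enumerates candidate rules by brute force instead of using Apriori's pruned search, it assumes the common definition of lift (the confidence of the rule divided by the support of its consequent), and the toy jobs and the function name `mine_rules` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each job is a set of KEY=value job characteristics.
jobs = [
    {"USERNAME=alice", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=alice", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=alice", "SITE=T2_B", "EXITCODE=0"},
    {"USERNAME=bob", "SITE=T2_A", "EXITCODE=8001"},
    {"USERNAME=bob", "SITE=T2_B", "EXITCODE=0"},
]


def count(items, jobs):
    """Number of jobs that contain every characteristic in `items`."""
    items = frozenset(items)
    return sum(1 for job in jobs if items <= job)


def mine_rules(jobs, min_support=0.01, min_confidence=0.80, max_size=3):
    """Brute-force search for rules {I1, ..., In-1} => {In}.

    Returns tuples (antecedent, consequent, support, confidence, lift),
    sorted by decreasing lift. Lift is assumed to be confidence divided
    by the support of the consequent.
    """
    n_jobs = len(jobs)
    all_items = sorted(set().union(*jobs))
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(all_items, size):
            n_both = count(itemset, jobs)
            # Skip item sets that never occur or are too rare.
            if n_both == 0 or n_both / n_jobs < min_support:
                continue
            # Try each characteristic of the item set as the consequent.
            for consequent in itemset:
                antecedent = tuple(i for i in itemset if i != consequent)
                conf = n_both / count(antecedent, jobs)
                if conf < min_confidence:
                    continue
                lift = conf / (count([consequent], jobs) / n_jobs)
                rules.append(
                    (antecedent, consequent, n_both / n_jobs, conf, lift)
                )
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


if __name__ == "__main__":
    for antecedent, consequent, supp, conf, lift in mine_rules(jobs):
        left = ", ".join(antecedent)
        print(f"{{{left}}} => {{{consequent}}}  "
              f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

Under this definition, a lift above 1 means the consequent occurs more often among jobs matching the antecedent than in the data set as a whole. In the toy data, the rule {SITE=T2_A} => {EXITCODE=8001} has a support of 0.6, a confidence of 1.0, and a lift of about 1.67.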

Building the Expert System

Contact and Further Information

mail: Gerhild . Maier @ cern . ch

-- GerhildMaier - 09 Feb 2009

Topic revision: r9 - 2009-09-16 - unknown