FTS monitoring group

This page tracks the work of the FTS monitoring group, which is looking at how to monitor the service in order to improve overall operations.

The group is fully integrated with the WLCG reliability and monitoring working group.

Error classification

Details about the error classification (category, scope and phase) can be found at FTSErrorClassification.

Main information about the monitoring system

The system consists of two parts: the DB part and the web interface.

The DB part can be found at http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/org.glite.data.transfer-monitor-spider/SQL/

fts_mon.sql - DB schema; clear.sql - cleaning scripts

The old version of the web interface, which uses PHP+XHTML, can also be found at http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/org.glite.data.transfer-monitor-spider/

The new version is implemented in the dashboard and can be found at http://dashboard.cvs.cern.ch/cgi-bin/dashboard.cgi/arda.dashboard.spider/

Database:

This section describes the DB part.

Tables

FTS_MON_AGENT_ERRORS - stores information about agent errors (if "ERROR_SCOPE"='AGENT')

AGR_ID - index

CHANNEL_NAME

REASON_CLASS

VO_NAME

REASON

HIDE - flag used to show/hide an error in the web interface

CDATE

CTIME

FTS_MON_ALARM - stores information about alarm triggers

REC_ID - index

ID - id of the object for which the alarm trigger is set

O_TYPE - type of the object (1-channel, 2-site, 3-host, 4-VO)

A_TYPE - type of the alarm trigger (1 - the number of errors exceeds some level; 2 - amount(t)-amount(t-1) exceeds some level, i.e. the number of errors has grown by more than that value since the last check; 3 - the failure percentage exceeds some level (allowed only for channels or VOs))

V_ID - id of the VO for which the alarm trigger is set (if 0, then for all VOs)

M_ID - id of the error for which the alarm trigger is set (if 0, then for all errors)

LEVEL

FTS_MON_CUR_ALARMS - stores information about the alarm triggers that are active as of the last script run.

ID - see above

O_TYPE - see above

A_TYPE - see above

V_ID - see above

M_ID - see above

LEVEL - see above

CTIME

CURRENT_VAL - current level
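The three A_TYPE rules described for FTS_MON_ALARM can be illustrated with a small Python sketch. This is not the code of the ALARMS procedure (which is PL/SQL and is not shown on this page); the class name, function name and parameters are hypothetical, and the assumption is that a trigger fires when the computed value is strictly greater than LEVEL, with that value being what would end up in CURRENT_VAL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmTrigger:
    """Hypothetical in-memory view of one FTS_MON_ALARM row."""
    a_type: int    # 1, 2 or 3, as described for A_TYPE
    level: float   # threshold, as stored in LEVEL


def evaluate_alarm(trigger: AlarmTrigger, errors_now: int,
                   errors_prev: int = 0, transfers_total: int = 0) -> Optional[float]:
    """Return the current value if the trigger fires, otherwise None.

    Assumption: "fires" means the value is strictly greater than LEVEL.
    """
    if trigger.a_type == 1:
        # Absolute number of errors.
        value = float(errors_now)
    elif trigger.a_type == 2:
        # Growth in the number of errors since the previous check.
        value = float(errors_now - errors_prev)
    elif trigger.a_type == 3:
        # Failure percentage; undefined when nothing was transferred.
        if transfers_total == 0:
            return None
        value = 100.0 * errors_now / transfers_total
    else:
        raise ValueError(f"unknown A_TYPE: {trigger.a_type}")
    return value if value > trigger.level else None
```

For example, a type 2 trigger with LEVEL 5 fires when the error count goes from 12 to 20, because the increase of 8 exceeds 5.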

FTS_MON_MISTAKE - stores information about error samples and patterns, and about the FTS error categories

M_ID - index

SAMPLE - sample of the error or name of the category

T1 - pattern 1

T2 - pattern 2

T3 - pattern 3

TYPE - type of the error (0 - source or destination error; 1 - transfer error)

CATEGORY - 0 means the entry is an error; 1 means it is a category

FTS_MON_CHANNEL - stores information about the channels

C_ID - index

NAME - name of the channel

SOURCE_ID - id of the source site

DEST_ID - id of the destination site

FTS_MON_SD - stores information about monitored sites and hosts

SD_ID - index

NAME - name of the site or host

PARENT_ID - identifies whether the entry is a site or a host (if PARENT_ID=0 it is a site; otherwise it is a host belonging to the site whose SD_ID=PARENT_ID).

FTS_MON_VO - stores information about monitored VOs
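The PARENT_ID convention of FTS_MON_SD can be unpacked as in the following Python sketch. It assumes the rows have already been fetched as (SD_ID, NAME, PARENT_ID) tuples; the function name and the sample names in the usage note are hypothetical.

```python
def split_sites_and_hosts(rows):
    """Group FTS_MON_SD-style rows into sites and the hosts under each site.

    rows: iterable of (sd_id, name, parent_id) tuples.
    Returns (sites, hosts_by_site), where sites maps SD_ID -> site name and
    hosts_by_site maps a site's SD_ID -> list of host names.
    """
    rows = list(rows)  # the rows are walked twice
    # PARENT_ID = 0 marks a site.
    sites = {sd_id: name for sd_id, name, parent_id in rows if parent_id == 0}
    hosts_by_site = {sd_id: [] for sd_id in sites}
    # Any other PARENT_ID is the SD_ID of the site the host belongs to.
    for sd_id, name, parent_id in rows:
        if parent_id != 0:
            hosts_by_site.setdefault(parent_id, []).append(name)
    return sites, hosts_by_site
```

With the made-up rows (1, "SITE-A", 0), (2, "host1.site-a.example", 1) and (3, "SITE-B", 0), SITE-A gets one host and SITE-B gets an empty host list.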

V_ID - index

NAME - name of the VO

FTS_MON_SETTING - stores information about system settings

NAME - name of the setting

ON - identifies whether the setting is on or off

TYPE - type of setting

FTS_MON_TIME - stores information about the scripts' last run time

LTIME - last script run time

TMPTIME - previous (second-to-last) script run time

N_ROWS - some statistic

FTS_MON_AGGREGATION - stores raw data about failed transfers (from the T_TRANSFER table) for processing

REQUEST_ID

CHANNEL_NAME

SOURCE_SITE

DEST_SITE

SOURCE_HOST

DEST_HOST

M_ID - error ID (0 if the error is unknown).

REASON_CLASS

ERROR_SCOPE

ERROR_PHASE

DURATION

VO_NAME

CTIME

REASON

FTS_MON_AGGREGATION_COMPLETE - stores raw data about completed transfers (from the T_TRANSFER table) for processing

REQUEST_ID

CHANNEL_NAME

SOURCE_SITE

DEST_SITE

SOURCE_HOST

DEST_HOST

DURATION

VO_NAME

CTIME

FILESIZE

FTS_MON_SD_MST - stores information about errors on the sites and hosts

SD_ID - site/host id

M_ID - error/category id

V_ID - virtual organization ID

MISTAKE_NUMBER - number of errors

SOURCE - identifies where the error occurred (1 - on the source side, 0 - on the destination side)

CATEGORY - 1 - category, 0 - simple error

CDATE

CTIME

FTS_MON_CHANNEL_MST - stores information about errors on the channels

C_ID - channel id

M_ID - error/category id

V_ID - virtual organization ID

MISTAKE_NUMBER - number of errors

CATEGORY - 1 - category, 0 - simple error

TYPE - identifies where the errors occurred (0 - source or destination; 1 - transfer)

CDATE

CTIME

FTS_MON_CHANNEL_FULL - stores general information about the channels

C_ID - channel ID

V_ID - virtual organization ID

N_SUCCED - number of successful transfers

N_FAILED - number of failed transfers

N_SOURCE - number of errors on the source

N_DEST - number of errors on the destination

N_TRANSFER - number of errors during transfer

CDATE

CTIME

NF_FAILED - number of failed jobs

NF_FINISHED - number of finished jobs

NF_FINISHEDDIRTY - number of finished dirty jobs

NF_CANCELED - number of canceled jobs

NF_SUBMITTED - number of submitted jobs

NF_READY - number of ready jobs

NF_ACTIVE - number of active jobs

Functions

FUNCTION GET_ID - returns the error ID

REASON - (in) reason field from the T_TRANSFER table

NUMBER - (out) error ID
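How GET_ID matches a reason against the T1, T2 and T3 patterns of FTS_MON_MISTAKE is not described on this page. The Python sketch below shows one plausible reading, under the loudly stated assumption that a reason matches an entry when every non-empty pattern occurs in it as a substring; the function signature and the sample patterns in the usage note are hypothetical. Returning 0 for an unknown error follows the M_ID description of FTS_MON_AGGREGATION.

```python
def get_id(reason, patterns):
    """Return the M_ID of the first entry whose patterns all occur in reason.

    patterns: iterable of (m_id, t1, t2, t3) tuples; empty or None patterns
    are ignored. Assumption: matching is plain substring containment, which
    may differ from what the real GET_ID function does.
    Returns 0 when no entry matches (unknown error).
    """
    for m_id, *candidates in patterns:
        wanted = [t for t in candidates if t]
        # An entry with no patterns at all never matches.
        if wanted and all(t in reason for t in wanted):
            return m_id
    return 0
```

With the made-up entries (1, "SRM", "timeout", None) and (2, "gridftp", "", ""), the reason "SOURCE SRM request timeout" maps to 1 and an unrelated reason maps to 0.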

Procedure

ALARMS - checks the active alarm triggers (used in FTS_MON_P.mon_main) and fills the FTS_MON_CUR_ALARMS table.

Triggers

AGGREGATION_TRIGGER - fills the FTS_MON_AGGREGATION table; uses the GET_ID function

AGGREGATION_COMPLETE_TRIGGER - fills the FTS_MON_AGGREGATION_COMPLETED table

Packages

FTS_MON_P.mon_main - summarizes the information from FTS_MON_AGGREGATION and fills the FTS_MON_CHANNEL_MST, FTS_MON_CHANNEL_FULL and FTS_MON_SD_MST tables.

FTS_MON_CLEAR.mon_clear_aggreagation - deletes information from FTS_MON_AGGREGATION and FTS_MON_AGGREGATION_COMPLETED where date<(sysdate-1), and summarizes the information in the FTS_MON_CHANNEL_MST, FTS_MON_CHANNEL_FULL and FTS_MON_SD_MST tables where date < to_date((sysdate-7),'DD-MON-YY'). Information is stored per hour for 7 days; after that it is summarized and stored per day.
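The retention scheme (hourly rows for 7 days, daily totals afterwards) can be sketched in Python as follows. This is only an illustration of the idea, not the PL/SQL of FTS_MON_CLEAR; the function name, the row layout (key, ctime, count) and the sample channel name in the usage note are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def roll_up(rows, now, keep_days=7):
    """Split hourly rows into recent rows and daily totals for older rows.

    rows: list of (key, ctime, count), where ctime is a datetime.
    Returns (recent_rows, daily_totals); daily_totals maps
    (key, calendar date) -> summed count for rows older than keep_days.
    """
    cutoff = now - timedelta(days=keep_days)
    # Rows inside the retention window keep their hourly granularity.
    recent_rows = [row for row in rows if row[1] >= cutoff]
    # Older rows are collapsed to one total per key and calendar day.
    daily_totals = defaultdict(int)
    for key, ctime, count in rows:
        if ctime < cutoff:
            daily_totals[(key, ctime.date())] += count
    return recent_rows, dict(daily_totals)
```

For example, two hourly rows for a made-up channel "CHAN-A" dated more than a week before `now`, with counts 2 and 5 on the same day, collapse into a single daily total of 7, while a row from a few hours ago is returned unchanged.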


Web interface

This section describes the web interface implemented in dashboard notification.

The user interface is unified, so every system module has the same main parts:

Object list - where the user can select one or several objects

Time options - the latest information, information for the last 24 hours, or any period in days

Different filters - general information, or information about errors or error categories, sorted by VO, channel, site, etc.

Main module - shows information about transfers and jobs for the whole FTS, information about successful and failed transfers on the channels, and general reports on the situation for VOs and channels.

"FTS settings" module - shows information about channel and agent settings and VO shares.

"Alarms" module - contains a list of the active alarm triggers. It is one of the notification mechanism tools.

"Channels" module - provides information about the general situation and errors on the channels.

"VO" module - provides information about the general situation and errors for the VO. It allows a user to rank the channels/sites with the largest number of errors.

"Sites" module - provides information about the errors on the sites.

"Hosts" module - currently empty.

"Errors" module - allows a user to rank the most frequent errors, or the errors with the largest total count, for the whole service, for channels or for VOs.

Almost every module in the system has cross-module links. A user can start from the information about the errors on a channel and then, with a single click, go to the information about errors on the destination site. In the end the user may find out, for example, that the errors were caused by a single bad storage element.

"Admin panel" - allows an admin to change the monitoring system settings and to manage (add/edit/delete/find) system objects (channels, sites, hosts, VOs), error patterns and alarm triggers. It uses https (SSL_CLIENT_S_DN) to identify the admin. Admin credentials should be stored in "/opt/dashboard/etc/dashboard-web/dashboard-web.cfg" under the variable name "fts.admin"; the separator is ":".

Some information about the PHP+XHTML version can be found in article_uzhinskiy.doc
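For the "Admin panel" identification described above, the check of the client DN against the "fts.admin" value could look like the following Python sketch. It is an assumption about the logic, not the dashboard's actual code: the function name is hypothetical, reading the value from dashboard-web.cfg is left out, and the comparison is assumed to be an exact string match on the DN.

```python
def is_admin(client_dn, admin_setting):
    """Check a client DN against a ':'-separated list of admin DNs.

    client_dn: the value the web server exposes as SSL_CLIENT_S_DN.
    admin_setting: the raw value of the "fts.admin" variable.
    Assumption: DNs are compared as exact strings after trimming whitespace.
    """
    admins = [dn.strip() for dn in admin_setting.split(":") if dn.strip()]
    return client_dn.strip() in admins
```

For example, with the made-up setting "/DC=ch/DC=example/CN=Alice:/DC=ch/DC=example/CN=Bob", the DN "/DC=ch/DC=example/CN=Bob" is accepted and any other DN is rejected. A DN that itself contains ":" would not survive this split, which is a limitation of the separator choice.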

Last edit: AlexanderUzhinskiy on 2008-12-18 - 12:49

Number of topics: 1

Maintainer: GavinMcCance
