FTS monitoring group
This page tracks the work of the FTS monitoring group, which is looking at how to monitor the service in order to improve overall operations.
The group is fully integrated with the
WLCG reliability and monitoring working group.
Error classification
Details about the error classification (category, scope and phase) can be found at
FTSErrorClassification.
Main information about the monitoring system
The system consists of two parts: the DB part and the web interface.
The DB part can be found at
http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/org.glite.data.transfer-monitor-spider/SQL/
fts_mon.sql is the DB schema and clear.sql contains the cleaning scripts.
The old version of the web interface, which uses PHP+XHTML, can be found in the same CVS module at
http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/org.glite.data.transfer-monitor-spider/
The new version is implemented in the dashboard and can be found at
http://dashboard.cvs.cern.ch/cgi-bin/dashboard.cgi/arda.dashboard.spider/
Database:
This section describes the DB part.
Tables
FTS_MON_AGENT_ERRORS - stores information about agent errors (rows where "ERROR_SCOPE"='AGENT')
AGR_ID - index
CHANNEL_NAME
REASON_CLASS
VO_NAME
REASON
HIDE - flag used to show/hide an error in the web interface
CDATE
CTIME
FTS_MON_ALARM - stores information about alarm triggers
REC_ID - index
ID - id of the object for which the alarm trigger is set
O_TYPE - type of the object (1-channel, 2-site, 3-host, 4-VO)
A_TYPE - type of the alarm trigger (1 - the number of errors exceeds the level; 2 - amount(t)-amount(t-1) exceeds the level, i.e. the number of errors has grown by more than the given value since the last check; 3 - the percentage of failures exceeds the level (allowed only for channels or VOs))
V_ID - id of the VO for which the alarm trigger is set (0 means all VOs)
M_ID - id of the error for which the alarm trigger is set (0 means all errors)
LEVEL
FTS_MON_CUR_ALARMS - stores information about the alarm triggers that were active at the last script run
ID - see above
O_TYPE - see above
A_TYPE - see above
V_ID - see above
M_ID - see above
LEVEL - see above
CTIME
CURRENT_VAL - current level
FTS_MON_MISTAKE - stores error samples and patterns, and the FTS error categories
M_ID - index
SAMPLE - sample of the error or name of the category
T1 - pattern 1
T2 - pattern 2
T3 - pattern 3
TYPE - type of the error (0 - source or destination error, 1 - transfer error)
CATEGORY - 0 means the row is an error (mistake), 1 means it is a category
FTS_MON_CHANNEL - stores information about the channels
C_ID - index
NAME - name of the channel
SOURCE_ID - id of the source site
DEST_ID - id of the destination site
FTS_MON_SD - stores information about monitored sites and hosts
SD_ID - index
NAME - name of the site or host
PARENT_ID - identifies whether the row is a site or a host (if PARENT_ID=0 it is a site; otherwise it is a host belonging to the site whose SD_ID=PARENT_ID)
FTS_MON_VO - stores information about monitored VOs
V_ID - index
NAME - name of the VO
FTS_MON_SETTING - stores the system settings
NAME - name of the setting
ON - identifies whether the setting is on or off
TYPE - type of setting
FTS_MON_TIME - stores the last run times of the scripts
LTIME - last script run time
TMPTIME - previous (second-to-last) script run time
N_ROWS - some statistic
FTS_MON_AGGREGATION - stores raw data about failed transfers (from the T_TRANSFER table) for processing
REQUEST_ID
CHANNEL_NAME
SOURCE_SITE
DEST_SITE
SOURCE_HOST
DEST_HOST
M_ID - error ID (0 if the error is unknown)
REASON_CLASS
ERROR_SCOPE
ERROR_PHASE
DURATION
VO_NAME
CTIME
REASON
FTS_MON_AGGREGATION_COMPLETE - stores raw data about completed transfers (from the T_TRANSFER table) for processing
REQUEST_ID
CHANNEL_NAME
SOURCE_SITE
DEST_SITE
SOURCE_HOST
DEST_HOST
DURATION
VO_NAME
CTIME
FILESIZE
FTS_MON_SD_MST - stores information about errors on the sites and hosts
SD_ID - site/host id
M_ID - error/category id
V_ID - virtual organization ID
MISTAKE_NUMBER - number of errors
SOURCE - identifies where the error occurred (1 - on the source side, 0 - on the destination side)
CATEGORY - 1-category, 0 - simple error
CDATE
CTIME
FTS_MON_CHANNEL_MST - stores information about errors on the channels
C_ID - channel id
M_ID - error/category id
V_ID - virtual organization ID
MISTAKE_NUMBER - number of errors
CATEGORY - 1-category, 0 - mistakes
TYPE - identifies where the errors occurred (0 - source or destination; 1 - transfer)
CDATE
CTIME
FTS_MON_CHANNEL_FULL - stores general information about the channels
C_ID - channel ID
V_ID - virtual organization ID
N_SUCCED - number of succeeded transfers
N_FAILED - number of failed transfers
N_SOURCE - number of the errors on source
N_DEST - number of the errors on destination
N_TRANSFER - number of the errors on transfer
CDATE
CTIME
NF_FAILED - number of the failed jobs
NF_FINISHED - number of the finished jobs
NF_FINISHEDDIRTY - number of the finished dirty jobs
NF_CANCELED - number of the canceled jobs
NF_SUBMITTED - number of the submitted jobs
NF_READY - number of the ready jobs
NF_ACTIVE - number of the active jobs
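The site/host relationship encoded by PARENT_ID in FTS_MON_SD can be illustrated with a short Python sketch. It assumes the rows have already been fetched as (SD_ID, NAME, PARENT_ID) tuples; the function name and the sample site and host names used with it are hypothetical and not part of the monitoring system.

```python
from typing import Dict, Iterable, List, Tuple

# One FTS_MON_SD row as (SD_ID, NAME, PARENT_ID).
SdRow = Tuple[int, str, int]


def hosts_by_site(rows: Iterable[SdRow]) -> Dict[str, List[str]]:
    """Group host names under the name of the site they belong to."""
    rows = list(rows)
    # PARENT_ID = 0 marks a site.
    sites = {sd_id: name for sd_id, name, parent_id in rows if parent_id == 0}
    grouped: Dict[str, List[str]] = {name: [] for name in sites.values()}
    for _sd_id, name, parent_id in rows:
        # Any other PARENT_ID points at the SD_ID of the owning site.
        if parent_id != 0 and parent_id in sites:
            grouped[sites[parent_id]].append(name)
    return grouped
```

Hosts whose PARENT_ID does not match any site are silently skipped in this sketch; a real report might want to list them separately.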
Functions
FUNCTION GET_ID - returns the error ID
REASON - (in) reason field from the T_TRANSFER table
NUMBER - (out) error id
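The idea behind GET_ID can be sketched in Python. The real function is PL/SQL and its matching rule is not described on this page, so the rule used here (every non-empty pattern T1..T3 must occur as a substring of REASON, and the first matching row wins) is an assumption. The class and function names are hypothetical; only the 0 result for an unknown error comes from the table description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ErrorPattern:
    """One FTS_MON_MISTAKE row, reduced to the columns used for matching."""
    m_id: int
    sample: str
    t1: Optional[str] = None
    t2: Optional[str] = None
    t3: Optional[str] = None


def get_id(reason: str, patterns: Iterable[ErrorPattern]) -> int:
    """Return the M_ID of the first matching pattern row, or 0 if none match."""
    for row in patterns:
        fragments = [t for t in (row.t1, row.t2, row.t3) if t]
        # Assumed rule: all non-empty fragments must appear in the reason text.
        if fragments and all(t in reason for t in fragments):
            return row.m_id
    return 0  # unknown error, as stored in FTS_MON_AGGREGATION.M_ID
```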
Procedure
ALARMS - checks the active alarm triggers (used in FTS_MON_P.mon_main) and fills the FTS_MON_CUR_ALARMS table.
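The three A_TYPE rules from FTS_MON_ALARM that this procedure evaluates can be written down as a small Python sketch. It is not the PL/SQL implementation. The function and parameter names are hypothetical, and computing the failure percentage as failed / (succeeded + failed) is an assumption, since the page does not give the formula.

```python
def alarm_fires(a_type: int, level: float, errors_now: int = 0,
                errors_prev: int = 0, succeeded: int = 0,
                failed: int = 0) -> bool:
    """Decide whether an alarm trigger of the given A_TYPE is active."""
    if a_type == 1:
        # Absolute number of errors above the level.
        return errors_now > level
    if a_type == 2:
        # Growth since the previous check above the level.
        return errors_now - errors_prev > level
    if a_type == 3:
        # Percentage of failed transfers above the level (assumed formula).
        total = succeeded + failed
        if total == 0:
            return False  # nothing transferred, nothing to report
        return 100.0 * failed / total > level
    raise ValueError(f"unknown A_TYPE: {a_type}")
```

A trigger that fires would then be recorded with its current value, which corresponds to the CURRENT_VAL column of FTS_MON_CUR_ALARMS.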
Triggers
AGGREGATION_TRIGGER - fills the FTS_MON_AGGREGATION table; uses the GET_ID function
AGGREGATION_COMPLETE_TRIGGER - fills the FTS_MON_AGGREGATION_COMPLETED table
Packages
FTS_MON_P.mon_main - summarizes information from FTS_MON_AGGREGATION and fills the FTS_MON_CHANNEL_MST, FTS_MON_CHANNEL_FULL and FTS_MON_SD_MST tables.
FTS_MON_CLEAR.mon_clear_aggreagation - deletes rows from FTS_MON_AGGREGATION and FTS_MON_AGGREGATION_COMPLETED where date<(sysdate-1), and summarizes the rows in the FTS_MON_CHANNEL_MST, FTS_MON_CHANNEL_FULL and FTS_MON_SD_MST tables where date < to_date((sysdate-7),'DD-MON-YY'). Information is stored per hour for 7 days; after that it is summarized and stored per day.
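The hourly-to-daily summarization can be illustrated with a Python sketch. It is a simplification, not the PL/SQL package: rows are assumed to be (timestamp, key, error count) tuples, where the key stands for whatever identifies a series (for example a channel id), and the cutoff is a plain timestamp comparison. The function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import List, Tuple

Row = Tuple[datetime, int, int]  # (timestamp, key, number of errors)


def roll_up(rows: List[Row], now: datetime, keep_days: int = 7) -> List[Row]:
    """Keep recent hourly rows; merge older rows into one row per day and key."""
    cutoff = now - timedelta(days=keep_days)
    recent = [row for row in rows if row[0] >= cutoff]
    daily = defaultdict(int)
    for ts, key, count in rows:
        if ts < cutoff:
            # Truncate the timestamp to midnight so the whole day shares a slot.
            day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
            daily[(day, key)] += count
    summarized = [(day, key, n) for (day, key), n in sorted(daily.items())]
    return summarized + recent
```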
WEB-interface
This section describes the web interface implemented in the dashboard.
The user interface is unified, so every module of the system has the same main parts:
Object list - where a user can select one or several objects;
Time options - the latest information, information for the last 24h, or any period in days;
Filters - general information, information about errors or error categories, sorted by VO, channel, site etc.
In the main module a user can get information about transfers and jobs for the whole FTS, about succeeded and failed transfers on channels, and general reports about the situation for VOs and channels.
"FTS settings" module - provides information about channel and agent settings and VO shares.
"Alarms" module - contains a list of the active alarm triggers. It is one of the notification mechanisms.
"Channels" module - provides information about the general situation and errors on the channels.
"VO" module - provides information about the general situation and errors for the VOs. It also allows a user to create rankings of the channels/sites with the largest number of errors.
"Sites" module - provides information about the errors on the sites.
"Hosts" module - currently empty.
"Errors" module - allows a user to create rankings of the most frequent errors, or of the errors with the largest total count, for the whole service, channels or VOs.
Almost every module in the system has cross-module links. For example, a user can start from the information about errors on a channel, go with a single click to the errors on the destination site, and finally find out that the errors were caused by a single bad storage element.
"Admin panel" - allows changing the monitoring system settings and managing (add/edit/delete/find) system objects (channels, sites, hosts, VOs), error patterns and alarm triggers. It uses HTTPS (SSL_CLIENT_S_DN) to identify the admin. Admin credentials should be stored in "/opt/dashboard/etc/dashboard-web/dashboard-web.cfg" under the variable name "fts.admin"; the separator is ":".
Some information about the PHP+XHTML version can be found in
article_uzhinskiy.doc
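Returning to the "Admin panel" description: the admin check it mentions (the client certificate DN from SSL_CLIENT_S_DN compared against the ":"-separated "fts.admin" value) might look roughly like the Python sketch below. The function names and the example DNs are hypothetical, and reading the value out of dashboard-web.cfg is left out because the file format is not described here.

```python
from typing import Mapping, Set


def parse_admins(value: str) -> Set[str]:
    """Split the raw "fts.admin" value on ":" into a set of admin DNs."""
    return {dn.strip() for dn in value.split(":") if dn.strip()}


def is_admin(environ: Mapping[str, str], admins: Set[str]) -> bool:
    """True if the DN the web server put in SSL_CLIENT_S_DN is a known admin."""
    return environ.get("SSL_CLIENT_S_DN", "") in admins
```

A request without a client certificate has no SSL_CLIENT_S_DN and is therefore never treated as an admin in this sketch.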
Last edit:
AlexanderUzhinskiy on 2008-12-18 - 12:49
Maintainer:
GavinMcCance