End to end monitoring

Introduction

The number of software layers distributed across the machines involved in data transfers makes the log information hard to reach. In addition, every layer logs information in its own way. As a result, tracing a data transfer through the whole system by reading the log files manually is difficult.

The aim of this project is to trace transfers by looking at the log files of every machine in the layers involved (currently the SRM, GRIDFTP and FTS machines). To do this, the design is divided into two parts, the user application and a remote application, which communicate on demand. The user application offers a friendly frontend where users can interact with the system and get the important information about a transfer by giving a reference to it (file name, job id or request id). The remote application looks for the information on the machine where it is installed, parses it and returns it to the user application.

The user application is a web frontend based on the Django framework. It can be found at Logserv and is only accessible from the FIO group.

Architecture

Because this tool has to work in a distributed system, it has two well-defined parts: the user application (interface) and the remote daemons spread over the SRM, GRIDFTP and FTS machines.

These parts need to communicate, and the communication is based on the message-oriented middleware ActiveMQ (an Apache project). ActiveMQ is responsible for the plumbing and for handling messages reliably and easily between the components of this project. Messages can be delivered in two ways, publish/subscribe broadcast and point to point:

broadcast_pointtopoint.png

Figure 1.- (a) Publish/subscribe broadcast. (b) Point to point.

The library used to handle the messages from Python is Stomp (version 2.0.2).

In the first approach, it was decided to use selectors so that messages could be sent point to point (figure 1.b) to the GRIDFTP and FTS machines, because the specific machine involved in a transfer is known (this is optimal: one message to the machine involved). For the SRM machines, however, messages had to be sent using the broadcast approach (figure 1.a), because the information could be spread over more than one machine. This broadcast did not go to all SRM machines, only to the machines that belong to the VO that handled the transfer (which is also known).

Because the new Stomp version (2.0.2) does not include the selector function, the approach was modified. In the current version of the web application, messages are sent using the broadcast method (figure 1.a). This method is not optimal because it spreads useless messages around the system, so to minimize the impact of the messaging, the message headers are used to limit the machines that process each message. When a remote daemon receives a message, it checks the headers to decide whether the message is addressed to it and whether it needs to start working.

If 'fts' appears in the header as the target, the message arrives at all FTS machines and they all start working. SRM machines are handled differently: an 'srm' header would make every SRM machine work, but the VO is known in advance, so the VO name is used in the header instead. Only the SRM machines that belong to that VO's cluster start working, not the others.
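The header check described above can be sketched in a few lines of Python. This is only an illustration, not the daemon's actual code: the header key "target", the function name should_process and the 'gridftp' target value are assumptions made for the example.

```python
def should_process(headers, machine_type, vo_name=None):
    """Decide whether a broadcast message is addressed to this daemon.

    Hypothetical helper: the header key "target" and the value used for
    GRIDFTP machines are assumptions, not taken from the logserve code.
    """
    target = headers.get("target", "")
    if machine_type == "srm":
        # SRM daemons react only to the name of their own VO,
        # so a message for one VO does not wake up the other clusters.
        return vo_name is not None and target == vo_name
    # FTS (and, by assumption, GRIDFTP) daemons react to their machine type.
    return target == machine_type


# An FTS daemon accepts a message targeted at 'fts'.
fts_accepts = should_process({"target": "fts"}, "fts")
# An SRM daemon configured for 'atlas' ignores a message for 'cms'.
srm_accepts = should_process({"target": "cms"}, "srm", vo_name="atlas")
```

With this kind of filter every daemon still receives every broadcast, but only the addressed machines spend time parsing log files.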

The workflow of the components is described in the picture below (figure 2). The user interacts with the web frontend, which gets the information from the FTS database and then sends the messages to the remote daemons. When the remote daemons have finished their work, they send the important information back to the web frontend. Finally, the frontend shows the results to the user in a friendly format.

sequence_diagram.png

Figure 2.- Workflow

Modules

As described in the architecture section, this project has two parts, which can be easily seen in the next picture (figure 3). It shows the modules and components of the web application on the left side and the remote application components on the right side. Between them is the plumbing service (managed by ActiveMQ), which is transparent at this level.

component_diagram.png

Figure 3.- Web application and Remote daemon components

Let's start with the web application. As can be seen (figure 3), it is divided into three blocks of components to clarify their function. The middle box contains the views class, and the other three components are extensions of the views class. These extensions separate code from the views class so that similar functions live in a different file with a descriptive name. The FTS_methods file contains all the methods the views need to query the FTS database, html_methods has all the methods that build HTML code, and Format_methods contains all the methods used to extract the information from the messages and store it in a common format (useful if the message structure changes). The components at the bottom of the web application (Diagram, Machines and Table) are auxiliary classes used to store the information from the messages and to keep the different parts of the web frontend separate. The listener is another auxiliary class, used to receive the messages.

The remote daemons (figure 3) are shown as two boxes, one for each rpm. The logserve package contains the universal communication rpm for this project, and all remote machines must have it installed. It is very simple and has only one class, called ConsumerCore, which handles all the communication and uses Stomp, Loggable and SchedulerUtils for auxiliary functions. The parser is independent from logserve because it depends on the type of machine it runs on, but every machine must have one in order to work properly. The parser class handles all the parsing of the log files and returns the result to logserve.

Internally, the workflow can be described by the following activity diagrams. Figure 4 outlines what the web application does, and figure 5 shows the behavior of the remote daemons.

activity_diagram.png

Figure 4.-Activity diagram of the web frontend.

activity_diagram_1.png

Figure 5.- Activity diagram of the remote applications (daemons)

Installation

The following packages are available:
  • For the web server:
      Log? ---- Web application v.1.0
  • For the remote machines:
      logserve ---- Remote daemon v.1.2-2
      srm_parser_v2 ---- SRM Parser v.1.2-4
      gridftp_parser ---- GRIDFTP Parser v.1.2-5
      fts_parser ---- FTS Parser v.1.2-3
  • Quattor component (for the remote machines):
      ncm-logserve ---- Remote machine quattor component v.0.0.0-4

For a manual installation on a remote machine, start with the logserve package (run as root):

rpm -ivh logserve-1.2-1.noarch.rpm

Then install the parser that is needed on the remote machine. For example, on an SRM machine (again as root):

rpm -ivh srm_parser-1.2-4.noarch.rpm

The next step is to configure both applications. The files that need to be configured are:

For the daemon: /etc/logservice-monitor/logservice-monitor

[parser]
parser_module = NAME OF THE PARSER

[stomp]
source = /topic/srm_monitor
destination = /queue/logserve
broker_host = gridmsg001
broker_port = 6163

[include]
library_dir = /usr/libexec/logserve/library
include_dir = /etc/logserve/modules.d
include_pattern = *.conf

[site]
this_site = CERN-PROD

This is the configuration file for the daemon (the ConsumerCore class). It sets the source topic from which messages are read and the destination queue to which the response is sent. The destination field is always the same, but the source field must be changed depending on the parser installed on that machine: the source is XXX_monitor, where XXX is the type of the parser (srm, gridftp or fts).

The broker host and broker port refer to the ActiveMQ account used for the plumbing.

The this_site field sets the word that is searched for in the FTS data to tell whether a transfer is a read or a write, depending on where that word appears.

The last and most important field is the first one in the configuration file, parser_module. It tells ConsumerCore which parser to call when a message is received. It can be SRM_parser, GRIDFTP_parser or FTS_parser.
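As a rough illustration of how the parser_module and source fields relate, the sketch below reads the daemon configuration and checks that the source topic matches the parser type. It is not the ConsumerCore code: load_daemon_settings is a hypothetical helper, the /topic/ prefix is assumed to apply to all three parser types, and the example uses Python 3's configparser (on Python 2.5, which the tool targets, the module is named ConfigParser).

```python
import configparser

# Sample content copied from the daemon configuration file shown above.
SAMPLE_CONFIG = """\
[parser]
parser_module = SRM_parser

[stomp]
source = /topic/srm_monitor
destination = /queue/logserve
broker_host = gridmsg001
broker_port = 6163
"""


def load_daemon_settings(text):
    """Parse the daemon configuration and check the source topic.

    Hypothetical helper, not part of the logserve package.
    """
    config = configparser.ConfigParser()
    config.read_string(text)

    parser_module = config.get("parser", "parser_module")
    # "SRM_parser" -> "srm", "GRIDFTP_parser" -> "gridftp", "FTS_parser" -> "fts"
    parser_type = parser_module.rsplit("_parser", 1)[0].lower()

    source = config.get("stomp", "source")
    # Assumption: every source topic follows the /topic/XXX_monitor pattern.
    expected = "/topic/%s_monitor" % parser_type
    if source != expected:
        raise ValueError(
            "source %r does not match parser %r" % (source, parser_module)
        )

    return {
        "parser_module": parser_module,
        "source": source,
        "destination": config.get("stomp", "destination"),
        "broker": (
            config.get("stomp", "broker_host"),
            config.getint("stomp", "broker_port"),
        ),
    }


settings = load_daemon_settings(SAMPLE_CONFIG)
```

A check like this catches the most likely manual configuration mistake, installing one parser while leaving the source topic of another.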

For the parser: /etc/logservice-monitor/modules.d/SRM_parser.conf

[task]
module_name = SRM_parser
timeout = 0
tag = SRM_parser
description = This is the configuration file for the SRM parser.
frequency = 0

[configuration]
directory = /var/javitest/srm501/srm/
timezone = 2
voname = atlas

Every parser has a configuration file with information about the task it performs (which does not need to be modified) and three fields for specific configuration:

  • directory: the full path where the log files can be found.
  • timezone: an integer that defines the difference in hours between UTC and the log file timestamps.
  • voname (only useful for SRM): the VO to which the machine belongs.
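To make the timezone field concrete, here is a minimal sketch of how an hour offset could be used to normalise log timestamps to UTC. The function name, the timestamp format and the sign convention (log time = UTC + timezone) are all assumptions for the example; the actual parsers may use the opposite sign or a different format.

```python
from datetime import datetime, timedelta


def log_time_to_utc(stamp, timezone_hours, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a log file timestamp to UTC.

    Hypothetical helper. Assumes log time = UTC + timezone_hours;
    flip the sign if the parser uses the opposite convention.
    """
    local = datetime.strptime(stamp, fmt)
    return local - timedelta(hours=timezone_hours)


# With timezone = 2, as in the SRM parser configuration shown above.
utc_time = log_time_to_utc("2009-09-28 14:30:00", 2)
```

Normalising every machine's timestamps to the same reference is what allows log lines from different hosts to be merged into a single ordered trace.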

With the Quattor component, the configuration is automatic (if it is configured in the profile).

Uninstall

To uninstall these packages, run rpm -e followed by the name of the package:

rpm -e NAME_OF_THE_PACKAGE

Then clean up the log file by deleting /var/log/logserve.log and, if necessary, its compressed copies (created by logrotate, if configured).

Use

On the web page, fill in the boxes with the file name, the job id (or both), the request id, or the Remedy ticket (copied and pasted). Click on send and the application will start to query the remote machines.

first.png

Results

The results page is divided into the following blocks of information:

  • FTS database info.
  • Table of messages.
  • Blocks diagram.
  • Machines list.
  • SRM frontend and backend.

Results screenshot:

results.png

The FTS database info is at the top center of the page, followed by the table of messages. For every log line, this table shows the content of the message field, the timestamp of the line, the type of message and the machine the line comes from. The table is sorted by timestamp to make it easier to follow the trace of the transfer. Note that lines containing the words "failed" or "error" are highlighted in red.
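The sorting and highlighting of the table can be sketched as follows. This is an illustration only, not the code of the Table class: the dictionary keys, the build_table name and the case-insensitive keyword match are assumptions, and the timestamps are assumed to be strings in a format that sorts chronologically.

```python
KEYWORDS = ("failed", "error")


def build_table(rows):
    """Sort log rows by timestamp and flag the ones to highlight in red.

    Hypothetical helper: each row is a dict with "timestamp", "type",
    "machine" and "message" keys (names chosen for this example).
    """
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    table = []
    for row in ordered:
        text = row["message"].lower()  # assumption: match is case-insensitive
        flagged = any(word in text for word in KEYWORDS)
        table.append(dict(row, highlight=flagged))
    return table


rows = [
    {"timestamp": "2009-09-28 10:00:05", "type": "fts",
     "machine": "fts-host", "message": "Transfer FAILED: timeout"},
    {"timestamp": "2009-09-28 10:00:01", "type": "srm",
     "machine": "srm-host", "message": "Request accepted"},
]
table = build_table(rows)
```

The machine names and messages in the sample rows are invented for the example.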

The blocks diagram, which is an outline of the trace of the transfer, is on the right side of the results page. It uses the following colour code:

  • Background: No message received for that block.
  • Green: Message received and does not contain "failed" or "error".
  • Red: Message received and contains "failed" or "error".

This diagram is sorted by the natural trace of the transfers. The machines list appears above the diagram and shows all the machines that responded. The machines that sent back useful information are marked with a green tick and the others with a red cross.
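The colour code of the blocks diagram amounts to a small decision rule, sketched below. The function name and the idea of passing a block's messages as a list of strings are assumptions for the example, as is the case-insensitive match; the Diagram class may be organised differently.

```python
def block_colour(messages):
    """Return the colour of one block in the blocks diagram.

    Hypothetical helper: `messages` is the list of log messages
    received for that block (an empty list if none arrived).
    """
    if not messages:
        return "background"  # no message received for that block
    for message in messages:
        text = message.lower()  # assumption: match is case-insensitive
        if "failed" in text or "error" in text:
            return "red"
    return "green"


colours = [
    block_colour([]),
    block_colour(["Request accepted"]),
    block_colour(["Request accepted", "Transfer failed"]),
]
```

A single failing message is enough to turn the whole block red, so a red block points to the stage of the transfer that deserves a closer look.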

Note that the blocks diagram and the machines list contain links for the received blocks and for the highlighted machines. Following one of these links opens a new tab with the details of that block or machine.

The SRM frontend and backend info can be found in the text above the table of events. A special window appears when the mouse hovers over those links.

Notes

  • It currently works with the SRM and FTS parsers running.

  • The main application (running on a web server) needs Python 2.5 and YAML installed.

  • Stomp is also needed (Version 2.0.2)

  • When an update of the SRM, GRIDFTP or FTS software changes the way information is written to the log files, the parsers of the logservice-monitor must be adapted to the changes or, in the worst case, a new parser must be written.

  • If the structure of the messages that the parsers send back to the main application changes, the main application must be modified to work with the new messages; otherwise the logservice-monitor application will fail. If new information is added to the messages, the way the main application processes and shows it must also be changed.

Future Work

  • Improve the SRM parser so that it always finds the backend information, which is not always inside the time window.

  • Deploy the GRIDFTP parsers on the disk servers in order to get the information from their logs.

  • Create an NCM component for the web application.

Related links

https://twiki.cern.ch/twiki/bin/view/FIOgroup/FsPresentations

http://indico.cern.ch/contributionDisplay.py?contribId=158&sessionId=63&confId=35523

-- FranciscoJavierConejeroBanon - 2009-09-28
