Welcome to the User Guide for the Siteview GridMap monitoring tool

Introduction

Motivation of this activity

The CCRC08 experience was a very valuable benchmark for testing all Grid activities related to the LHC experiments. In particular, it gave the opportunity to test the monitoring infrastructure and to evaluate its functionality from both the experiments' and the sites' point of view. Many monitoring tools for Grid services are already in place and have proven to work well, as have the experiment-specific monitoring systems. Even so, in many situations site administrators were not able to say whether their site was serving the VO well, since none of these tools provides an overall view of the performance of a site. According to the feedback of site administrators, the information published by the existing monitoring systems should be integrated in a general high-level tool which extracts information from the experiment-specific monitoring systems and offers a generic view of the computing activities of the LHC experiments at the site.

Objective

This activity aims to set up a tool which enables the administrator of a site to monitor all the activities going on at the site in an easy and immediate way, even without being an expert in any of them.

The metrics common to all LHC VOs are stored in a central repository. If needed, more metrics provided by monitoring tools specific to other VOs served by the site, not necessarily from the HEP community, can be added to the repository.

The site performance relative to all the activities and all the VOs supported by the site is displayed through a GridMap based on the shared database. Links to the underlying monitoring systems are provided in order to redirect the user to the information source in a fast and efficient way. This document is meant to be a quick guide to help users get familiar with the GridMap interface.

Object of the monitoring

Activities

The objects of the monitoring are the main activities of the VOs at the sites.

First of all, an evaluation of the site status from the VO perspective is given. This does not refer to any activity in particular: it is a general evaluation of the site status, usually based on VO-specific tests such as SAM tests. The site can be judged to be in good shape even if the VO is not carrying out any activity at the site at that moment, so the evaluation is independent of the current activity of the VO.

In addition, there are 2 main activities: job processing and data transfers. Each of them has some sub activities which depend on the particular VO. For example, the VO ATLAS distinguishes jobs on the basis of their origin: they can be MC production jobs or private analysis jobs of a user, so the overall job processing activity splits into two sub activities, MC production and user analysis. Other VOs can have different sub activities, depending on their computing model.

Metrics

For every activity there is a set of metrics which characterize the activity. For example, for job processing we have identified the following list: number of running jobs, number of jobs completed in the last hour, number of jobs successfully completed in the last hour, CPU time used by jobs completed in the last hour, wall time. By default, metrics refer to the last hour, unless explicitly stated otherwise. This is the complete list of metrics and activities.

The main GridMap

This is the link to the GridMap interface. The structure of the GridMap is the following: there is one GridMap for every site, which can be reached by selecting the site with the drop-down menus on the right. First you have to select the Tier, and then the site name.

Structure of the main GridMap

The main map is divided into 4 main regions (from top to bottom): general site status, job processing activity, incoming data transfers, outgoing data transfers. Every region displays maps (rectangles) for the VOs supported at the site. See figure below.

See picture: mainGridmap.png

The size of a rectangle depends on a metric which characterizes that activity. For job processing, the size is set on the basis of the number of running jobs (the average value over the last hour). For data transfer, the size is proportional to the average transfer rate (again over the last hour). Finally, for the general site status, the size is the same for all the VOs.
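The sizing rule above can be illustrated with a short sketch. It is not the tool's actual code: the function name, the activity labels and the choice of making job processing sizes strictly proportional to the running jobs are assumptions made for the example (the guide only says the size is set "on the basis of" that number).

```python
def rectangle_sizes(activity, metric_by_vo):
    """Return the fraction of a region given to each VO.

    activity: hypothetical label such as "site_status" or "job_processing".
    metric_by_vo: sizing metric per VO, averaged over the last hour
                  (running jobs, or transfer rate for data transfers).
    """
    if activity == "site_status":
        # General site status: same size for every VO, whatever the metric.
        return {vo: 1.0 / len(metric_by_vo) for vo in metric_by_vo}
    total = sum(metric_by_vo.values())
    if total == 0:
        # No activity at all in the last hour: nothing to size.
        return {vo: 0.0 for vo in metric_by_vo}
    # Assumed here: size strictly proportional to the metric.
    return {vo: value / total for vo, value in metric_by_vo.items()}


# Invented numbers of running jobs, for illustration only.
sizes = rectangle_sizes("job_processing", {"ATLAS": 600, "CMS": 300, "LHCb": 100})
print(sizes)
```

With these invented numbers, ATLAS would occupy the largest rectangle of the job processing region and LHCb the smallest.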

The color of a rectangle represents the status of the corresponding activity for that VO at the site. The meaning of the colors is explained in the legend at the bottom of the map (green means good, yellow warning, red error and so on). So, for example, in the GridMap of FZK displayed above, a green site status rectangle for LHCb means that for this VO the site FZK is in good shape, even if there is some problem with job processing (the job processing activity for LHCb is yellow).

About the general site status: this does not refer to any particular activity, but is a general evaluation of the site status from the VO perspective. This evaluation is based on VO-specific SAM tests for ATLAS and LHCb, whereas for CMS it is taken from the CMS Site Status Board (a comprehensive tool which takes into account SAM tests and also whether or not the site is visible in the BDII). Finally, for ALICE it is retrieved from the MonALISA monitoring system. In any case, the definition of the site status is reported in the popup window which appears when the cursor is moved over the corresponding rectangle.

The status for job processing and for data transfer is an evaluation of the status of that activity for the given VO. It is important to remark that if an activity status is red for a VO at a given site, this does not mean that the site is responsible. If the site status (upper rectangle) is green for that VO, then the site is OK, and the problems observed in job processing or data transfer are most probably due to the VO (a buggy version of the software, an unscheduled interruption of data transfer) rather than to the site infrastructure.

In this way, the site evaluation is decoupled from the particular VO activity as much as possible. Even if there can sometimes be borderline situations where it is not easy to identify the origin of the issue, in principle it should be clear that the site status is defined by the upper region ('general site status') and not by the maps relative to job processing and data transfer.
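As a reading aid, the rule for interpreting the colors can be written as a small function. This is only a sketch of the reasoning described above: the function name and the status labels ("good", "warning", "error", standing for green, yellow and red) are hypothetical, and the tool itself does not perform this diagnosis.

```python
def likely_origin(site_status, activity_status):
    """Suggest where to look first when an activity rectangle is not green."""
    if activity_status not in ("warning", "error"):
        # Green or unknown activity: no problem is being reported.
        return "nothing to investigate"
    if site_status == "good":
        # The upper rectangle is green, so the site itself is judged OK.
        return "most probably the VO"
    # The site status is not green: the site is the first thing to check.
    return "check the site"


# The FZK example: LHCb site status green, job processing yellow.
print(likely_origin("good", "warning"))
```

Borderline situations, as noted above, still need a human look at the underlying monitoring pages.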

Context help in the main GridMap

Moving the cursor over the header of a region (general site status, job processing or data transfer) displays a context help listing the VOs supported in that region, together with the size of the rectangle, the status of the activity and the date of the last update for each VO. See figure below.

See picture: job processing header2.png

Then, moving the cursor over the rectangle of one VO shows another context help, relative to the combination of that activity and VO, as shown in the figure below.

See picture: job processing contextHelp.png

This window shows all the available information about the activity of this VO: a list of all the available metrics and, for each metric, the value, the expected value (if available), the status (if available), a link to the source of the information and the date of the last update. A user interested in retrieving more information about a metric can lock the context help window by clicking the right mouse button and then click on the desired URL, which opens in another tab. This URL is a link to the VO-specific monitoring tool where the information about the metric comes from.

The submaps

How to access the sub maps

Main activities can split into one or more sub activities. The information relative to sub activities is given in a sub map, which is displayed by left-clicking on the rectangle relative to that VO and activity. The 'Site Status' has no sub activity, so clicking on one of the 'Site Status' rectangles does not open any sub map.

Structure of the sub map

The sub map can be divided into several rectangles, according to the number of sub activities. For example, for the job processing activity of the VO ATLAS the sub map shows 2 areas: MC production and user analysis. For LHCb it can show up to 4: MC production, user analysis, data reconstruction and test activities (depending on the activity going on at that moment at the site).

Moving the cursor over the header of the sub map displays a summary with the name of each sub activity, its size, its status and the last update. In the figure below, we have clicked on the rectangle relative to data_transfer_out/CMS and a sub map has been displayed, divided into 3 regions which correspond to 3 sub activities: data transfer t1-t0, data transfer t1-t2, data transfer t1-t1. The sizes of the sub activity maps shown in the summary are proportional to the respective average transfer rates.

See picture: data transfer submaps.png
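The relation between main rectangles and sub maps can be pictured as a lookup from an (activity, VO) pair to its sub activities. The sketch below only restates the examples given in this section; the dictionary, the function name and the key "job_processing" are hypothetical, and the real list depends on what each VO publishes.

```python
# Sub activities named in this guide, keyed by (activity, VO).
SUB_ACTIVITIES = {
    ("job_processing", "ATLAS"): ["MC production", "user analysis"],
    ("job_processing", "LHCb"): [
        "MC production", "user analysis", "data reconstruction", "test activities",
    ],
    ("data_transfer_out", "CMS"): [
        "data transfer t1-t0", "data transfer t1-t2", "data transfer t1-t1",
    ],
}


def sub_map_regions(activity, vo):
    """Return the regions of the sub map opened by clicking a rectangle.

    An empty list means no sub map appears, as for the site status.
    """
    return SUB_ACTIVITIES.get((activity, vo), [])


print(sub_map_regions("data_transfer_out", "CMS"))
print(sub_map_regions("site_status", "CMS"))
```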

The context help in the submaps

Moving the cursor over the rectangle corresponding to one of the sub activities shows the context help, with exactly the same items as in the main GridMap: the last update, the names of the metrics and, for each of them, the measured value, the expected value, the status (if available) and a link to the URL where the metric comes from.

How the information is retrieved

Information work flow

Even if this goes beyond the scope of a user guide, it can be interesting for the user to know how the information is retrieved from the source.

See picture: information work flow.png

The picture above shows a schematic view of the information work flow from the origin to the publication in the GridMap. The information about activities and metrics is extracted from the VO-specific monitoring tools (Dashboard for CMS and ATLAS, MonALISA for ALICE and DIRAC for LHCb) and then dumped to some public URLs. A collector runs periodically (once every hour) and reads the content of these URLs. It also performs some operations on the values read, in order to have the information in the desired format. Then it inserts the new values in a database schema common to the 4 VOs. The GridMap server accesses the data stored in the common database and performs the necessary queries to fill the maps, which are displayed through a web browser.
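The collector step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual collector: the JSON feed format, the table layout, the example URLs and the function name collect are all invented for the example, and SQLite stands in for the common database. The download function is passed in as a parameter so that the sketch runs without network access.

```python
import json
import sqlite3
import time

# Hypothetical common table shared by all VOs.
SCHEMA = """CREATE TABLE IF NOT EXISTS metric_values (
    vo TEXT, site TEXT, activity TEXT, metric TEXT, value REAL, updated INTEGER
)"""


def collect(sources, fetch, db, now=None):
    """Read every VO feed once and store its values in the common table.

    sources: {vo_name: url}; fetch: callable(url) -> text; db: sqlite3 connection.
    Returns the number of rows inserted.
    """
    now = int(time.time()) if now is None else now
    db.execute(SCHEMA)
    inserted = 0
    for vo, url in sources.items():
        for rec in json.loads(fetch(url)):
            # Bring each value to a common format before storing it.
            row = (vo, rec["site"], rec["activity"], rec["metric"],
                   float(rec["value"]), now)
            db.execute("INSERT INTO metric_values VALUES (?, ?, ?, ?, ?, ?)", row)
            inserted += 1
    db.commit()
    return inserted


# Stand-in for the public URLs: invented addresses and invented values.
FAKE_FEEDS = {
    "https://example.org/atlas": '[{"site": "PIC", "activity": "job_processing",'
                                 ' "metric": "running_jobs", "value": "120"}]',
    "https://example.org/lhcb": '[{"site": "PIC", "activity": "job_processing",'
                                ' "metric": "running_jobs", "value": "35"}]',
}

db = sqlite3.connect(":memory:")
n = collect({"ATLAS": "https://example.org/atlas",
             "LHCb": "https://example.org/lhcb"}, FAKE_FEEDS.get, db)
rows = db.execute("SELECT vo, value FROM metric_values ORDER BY vo").fetchall()
print(n, rows)
```

In a real deployment the fetch argument would download the content of each public URL, and the function would be scheduled to run once every hour.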

The status of the activities and of the site

Meaning of the status

How it is computed

The status is computed directly on the VO side. This tool only displays the value retrieved from the VO-specific monitoring tool. Of course, different VOs can adopt different criteria to judge the status of their activity. To make this clear to the user, the rule is published in the context help.

In general, the status is computed on the basis of the success rate, provided there is non-negligible activity going on (otherwise it is unknown). Over a period of one hour it happens quite often, especially for small sites, that there are not enough statistics to set a status. It can also happen that there is activity going on, but a temporary problem occurs which causes the status to turn red. In order to have a more reliable definition of the status, it has been decided to compute it on the basis of the success rate over a longer period (4 hours for data transfers and 24 hours for job processing). For example, in the picture below, taken from the GridMap of the PIC site, the context help for outgoing data transfer for ATLAS shows very little activity in the last hour, and consequently a status cannot be assigned for that hour (it is unknown), whereas the transfers over the last 4 hours are enough to set a status, which is good (the success rate is over 80 %). So the status finally assigned to the overall activity is good, and the rectangle is filled in green.

See picture: data transfer status.png
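The rule described above can be sketched in a few lines. The window lengths (4 and 24 hours) and the 80 % figure come from this section; everything else is an assumption for illustration: the minimum number of attempts that counts as non-negligible activity, the single "problem" label (the guide does not say where warning ends and error begins) and the function names. In the real system each VO computes its own status with its own criteria.

```python
WINDOW_HOURS = {"data_transfer": 4, "job_processing": 24}  # from the text above
GOOD_RATE = 0.80    # the example calls a success rate over 80 % good
MIN_ATTEMPTS = 10   # hypothetical cut for "non-negligible activity"


def status(successes, attempts):
    """Status from a success rate, or "unknown" when activity is negligible."""
    if attempts < MIN_ATTEMPTS:
        return "unknown"
    return "good" if successes / attempts > GOOD_RATE else "problem"


def overall_status(activity, hourly):
    """Status over the longer window for this kind of activity.

    hourly: list of (successes, attempts) per hour, most recent last.
    """
    window = hourly[-WINDOW_HOURS[activity]:]
    successes = sum(s for s, _ in window)
    attempts = sum(a for _, a in window)
    return status(successes, attempts)


# Invented transfer counts: very little activity in the most recent hour.
hourly = [(20, 22), (18, 20), (15, 16), (2, 2)]
print(status(*hourly[-1]))                      # unknown
print(overall_status("data_transfer", hourly))  # good
```

As in the PIC example, the last hour alone gives an unknown status, while the 4-hour window contains enough transfers to assign a good one.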

Feedback

If you find a bug or have a suggestion to improve the Siteview GridMap, you are welcome to open a ticket through Savannah in the 'Dashboard' project, category Site View.

-- ElisaLanciotti - 13 Jan 2009

Topic revision: r9 - 2020-08-30 - TWikiAdminUser
 