Siteview GridMap: A Monitoring Application for Grid VO Activities at the Sites

Dashboard Siteview application: the status of VO activity at a given site, summarized in one display

Motivation of this activity

The CCRC08 experience was a very valuable benchmark for testing all Grid activities related to the LHC experiments (link to CCRC08Workshop). In particular, it gave the opportunity to test the monitoring infrastructure and to evaluate its functionality from both the experiments' and the sites' points of view. Monitoring was the key service for measuring whether the performance of a service was good or bad, and for detecting problems and fixing them efficiently. The outcome of the exercise is the following:

Site administrators would like to be able to:

  • Compare the experiments' view of the site's contribution with the information they get from their own monitoring systems
  • Understand whether their site is contributing to the VO activity as expected

Some problems they found:

  • The main monitoring tools during this exercise were experiment-specific. They proved to work well, but they are not straightforward for a person outside the experiment to use, and they differ from one experiment to another
  • An additional requirement from sites is to have a definition of the targets from the experiments; otherwise it is not possible to understand whether the activity going on at their site meets the VO's expectations or not

The objective of this project is a new tool which should:

  • Provide an overall view of all the activities going on at the site from one single console. The tool should be easy to use, also for people outside the VO, and should not require particular knowledge of each experiment
  • Provide an overall view of the status of the activities, as evaluated by the VO, and allow fast and efficient detection of problems
  • For every activity and VO, provide links to the source of the information (the VO-specific monitoring system), so that problems can be investigated efficiently

Proposal for a new monitoring tool

Siteview GridMap is a high-level monitoring tool which, from a single console, offers an overall view of the computing activities of the LHC experiments at a site. It extracts data from the VO-specific monitoring tools (Dashboard, PhEDEx, MonALISA, DIRAC) and displays them in a uniform and simplified way in a common web interface using GridMap technology.
The objects to monitor are the main VO activities at the site, such as job processing and data transfers, as well as an overall evaluation of the site status from the VO perspective.

New Siteview implementation

The Siteview application was completely restructured by Marco Devesa during 2009-2010 (after Elisa's departure in July 2009).

The collector

Information about the collector (its structure, where it runs, the location of its logs, how to start and stop it, and its configuration files) is available here.

The DB schema

GridMap display

Information about the GridMap server and its configuration is available here.

Current activity

The ongoing activity is documented here.

Old implementation of Siteview

We keep here the documentation relating to the old implementation of Siteview.

Information flow and architecture

The information sources are the monitoring tools used by each experiment: DIRAC for LHCb; MonALISA and Dashboard for ALICE; Dashboard for ATLAS; and, for CMS, Dashboard for job processing and PhEDEx for data transfers.

Once the metrics have been extracted from the sources, they are published at a set of URLs.
A Dashboard collector periodically reads them from those URLs and stores them in a common database.
The values are then displayed in a GridMap.

Because the metrics of all four experiments are stored in the same schema, results coming from different experiments can be displayed in the same plot. No new data are generated: the same data existing in the VO-specific monitoring tools are presented in a different format, in parallel for every VO.

Database schema

The database schema: the same schema contains the data of all four experiments.
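
The schema itself is not reproduced on this page. Purely as an illustration of the idea, a single generic table keyed by site and VO is enough to hold the metrics of all four experiments side by side; in the sketch below the table and column names are assumptions, and sqlite3 stands in for the actual shared database:

# Illustrative only: table and column names are assumptions, and sqlite3
# stands in for the real shared database.
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS site_metrics (
    site        TEXT NOT NULL,   -- site name, e.g. 'CERN-PROD'
    vo          TEXT NOT NULL,   -- 'alice', 'atlas', 'cms' or 'lhcb'
    activity    TEXT NOT NULL,   -- e.g. 'job_processing', 'data_transfer'
    metric      TEXT NOT NULL,   -- e.g. 'running_jobs', 'transfer_rate'
    value       REAL,            -- numeric value of the metric
    status      TEXT,            -- the VO's evaluation, e.g. 'ok' or 'down'
    timestamp   TEXT NOT NULL    -- collection time (UTC)
);
"""

conn = sqlite3.connect("siteview_sketch.db")
conn.executescript(DDL)
conn.close()

Because the VO is an ordinary column, a single query can return the metrics of all four experiments for one site, which is what allows them to be displayed in the same plot.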

The collector

The collector reads the metrics published at the URLs, processes the information, and writes it to the shared database.
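
The actual collector code is not shown on this page. Below is a minimal sketch of the logic, assuming hypothetical feed URLs and a JSON record layout, and reusing the sqlite3 stand-in table from the previous section:

# A sketch of a single collection pass, not the actual Dashboard collector.
# The feed URLs and the JSON layout are assumptions; site_metrics is the
# table created in the schema sketch above.
import json
import sqlite3
import urllib.request

FEEDS = {
    "atlas": "https://example.cern.ch/atlas/site-metrics.json",  # hypothetical
    "cms":   "https://example.cern.ch/cms/site-metrics.json",    # hypothetical
}

def collect_once(db_path):
    conn = sqlite3.connect(db_path)
    for vo, url in FEEDS.items():
        with urllib.request.urlopen(url) as response:
            records = json.load(response)   # assumed: a list of metric dicts
        for rec in records:
            conn.execute(
                "INSERT INTO site_metrics "
                "(site, vo, activity, metric, value, status, timestamp) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (rec["site"], vo, rec["activity"], rec["metric"],
                 rec["value"], rec.get("status"), rec["timestamp"]),
            )
    conn.commit()
    conn.close()

In the real service the collector runs periodically as a Dashboard agent; the sketch shows a single pass only.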

GridMap display

The GridMap: the information stored in the database is displayed using GridMap technology. A GridMap is under development and can already display the data stored in the schema.

Activities and metrics to monitor

The first proposal for the list of metrics to monitor was drawn up on the basis of the feedback given by sites after CCRC08. The list will be updated according to further requirements.

This is the format that should be used to provide the data.
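
The agreed format is defined in the linked document and is not reproduced here. Purely as a hypothetical illustration, consistent with the field names invented in the sketches above, a single metric record might look like:

# Hypothetical example record; the real format is defined in the linked document.
example_record = {
    "site":      "CERN-PROD",
    "activity":  "job_processing",
    "metric":    "running_jobs",
    "value":     1250.0,
    "status":    "ok",
    "timestamp": "2009-06-04T12:00:00Z",
}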

A summary of the metrics currently available.

Current status of the activity

This is a page which summarizes the current status of the activity.

Open questions

  • How to define the targets for each activity?

  • About data transfers: is it worth storing all the data, channel by channel, in our shared database? (This can be a lot of data.) Some considerations here.

  • What data can the experiments provide? This strictly depends on the VO-specific monitoring systems. See a review here

  • Metrics for ALICE activities: job processing metrics are provided, together with the pledged values. The overall site status is also provided. Data transfer metrics are still missing.

  • Metrics for ATLAS activities: job processing and data transfer metrics are obtained through an API. The script calling the API is imported into the collector. The site status is also computed.

  • Metrics for CMS activities: job processing metrics are provided, together with data transfer metrics.

  • Metrics for LHCb activities: metrics about data transfer and job processing activities are provided. The status of data transfers is still missing.

Feedback from users

Feedback from users is collected here.

Meetings

Presentations

Documentation

Maintenance of the Application

Siteview runs on dashb-virtual06 and consists of a single collector, SiteView.AllVos. To check for any anomalous situation, the log file is available at /opt/dashboard/var/log/dashb-AllVOsiteview.log.

There is a cron job that checks whether the service is up and, should something have gone wrong, takes care of restarting the collector. It runs every hour, and the check treats the collector as down after 21 minutes without activity. The script is stored in /opt/dashboard/cron/dashbCollectorsSSB.sh.

The collector itself is capable of catching exceptions (notably database-related ones) and immediately warns the admins (michal, julia and pablo); in these situations the collector does not restart, but proceeds as normal. Should there be any problems of this sort, please check that there are no hanging database connections here: https://oraweb.cern.ch/pls/int11r/webinstance.sessions.show_sessions (user: LCG_SITEMONITORING_INT_W).

If the problem persists, it may be a good idea to do a full restart of the collector:

su - dboard                                           # switch to the dboard service account
export PYTHONPATH=/opt/dashboard/lib                  # make the Dashboard libraries available
/opt/dashboard/bin/dashb-agent-stop AllVOsiteview     # stop the collector
rm -f /opt/dashboard/var/log/dashb-AllVOsiteview.log  # remove the old log so the new run starts clean
/opt/dashboard/bin/dashb-agent-start AllVOsiteview    # start the collector again

Then, you can monitor the behaviour of the collector using:

tail -f /opt/dashboard/var/log/dashb-AllVOsiteview.log

Finally, it is possible for some old instances of the collector to be left running in a zombie state. In that case, checking the processes with ps and grep (for example, ps aux | grep AllVOsiteview) will return more than two PIDs; should that happen, kill all of them and restart the collector.
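
For completeness, the warn-but-keep-running behaviour described above can be pictured with a minimal sketch; the function names, addresses and mail mechanism below are assumptions, not taken from the actual Dashboard code:

# Sketch of catching an exception, warning the admins and carrying on.
# The addresses, SMTP host and collect_once() (from the sketch earlier on
# this page) are placeholders, not the real Dashboard implementation.
import logging
import smtplib
from email.message import EmailMessage

ADMINS = ["admin@example.cern.ch"]   # placeholder address

def notify_admins(error):
    msg = EmailMessage()
    msg["Subject"] = "SiteView collector: caught exception"
    msg["From"] = "siteview@example.cern.ch"
    msg["To"] = ", ".join(ADMINS)
    msg.set_content(repr(error))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def run_pass():
    try:
        collect_once("siteview_sketch.db")
    except Exception as exc:             # notably database errors
        logging.exception("collection pass failed")
        notify_admins(exc)               # warn the admins...
        # ...but do not restart: the next pass proceeds as normal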

Related links

About Gridmap

Links to the existing CCRC08 servicemap showing experiment-specific SLS and SAM data for critical services (only the map on the right!).

And a link to the original CERN GridMap, showing the SAM and GridView data, for reference.

Gridmap for the view of all the sites for a given VO (here CMS). Under development.

SAM visualization

for LHCb

and for Atlas

-- ElisaLanciotti - 04 Jun 2009
