Siteview GridMap: A New Monitoring System for Grid Services at the Sites
Motivation of this activity
The CCRC08 experience was a very valuable benchmark for testing all Grid activities related to the LHC experiments (link to CCRC08Workshop). In particular, it gave the opportunity to test the monitoring infrastructure and to evaluate its functionality from both the experiments' and the sites' points of view. Monitoring was the key service for measuring whether the service was performing well, and for detecting problems and fixing them efficiently.
The outcome of the test is the following:
Site administrators would like to be able to:
- Compare the experiment's view of the site contribution to the information they get from their own monitoring systems
- Understand if their site is contributing to the VO activity as expected
Some problems they found:
- The main monitoring tools during this exercise were experiment specific tools. They proved to work well, but they are not straightforward to use for a person external to the experiment. Furthermore, they are different for every experiment
- An additional requirement from sites is to have a definition of the targets from the experiments, otherwise it's not possible to understand whether the activity going on at their site meets the VO expectations or not
The objective of this project is a new tool which should:
- Provide an overall view of all the activities going on at the site, from one unique console. This should be a tool easy to use, also for persons external to the VO, and which does not require a particular knowledge of each experiment
- Provide an overall view of the status of the activities, as it is evaluated by the VO, and allow fast and efficient detection of problems
- For every activity and VO provide links to the source of information (VO specific monitoring system), so the problem can be investigated in an efficient way
Proposal for a new monitoring tool
Siteview GridMap is a high level monitoring tool which, from one unique console, offers an overall view of the computing activities of the LHC experiments at the site. It extracts data from the VO specific monitoring tools (Dashboard, Phedex, Monalisa, Dirac) and displays them in a uniform and simplified way in a common web interface using Gridmap technology.
The objects to monitor are the main VO activities at the site, such as job processing and data transfers, as well as a general evaluation of the site status from the VO perspective.
Information flow and architecture
The information sources are the different monitoring tools used by each experiment:
- DIRAC for LHCb
- Monalisa and Dashboard for ALICE
- Dashboard for ATLAS
- Dashboard for job processing and Phedex for data transfer for CMS
Once the metrics have been extracted from the sources, they are published at a set of URLs.
A Dashboard collector periodically reads them from the URLs and stores them in a common database.
The values are then displayed in a Gridmap.
Because the metrics of all 4 experiments are stored in the same schema, results coming from different experiments can be displayed in the same plot.
No new data are generated! The same data existing in the VO specific monitoring tools are presented in a different format, in parallel for every VO.
The database schema: the same schema should contain the data of all 4 experiments.
The gridmap: the information stored in the database will be displayed using the gridmap technology. A gridmap is being developed and can already display the data stored in the schema.
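The flow described above (metrics published at URLs, a collector reading them periodically, one common schema for all VOs) can be sketched as follows. This is only an illustrative sketch, not the actual Dashboard collector: the tab-separated line format, the table and column names, the site name `SITE-A` and the sample values are all assumptions made for the example, and the feeds are given inline instead of being fetched from real URLs so that the code runs offline.

```python
import sqlite3

# ASSUMPTION: each VO publishes one metric per line, tab-separated, as
# site, VO, activity, metric name, value, timestamp.
# The real publication format is the one linked from this page.
SAMPLE_FEEDS = {
    "LHCb": "SITE-A\tLHCb\tjob_processing\trunning_jobs\t120\t2008-07-11T10:00:00\n",
    "CMS": (
        "SITE-A\tCMS\tjob_processing\trunning_jobs\t340\t2008-07-11T10:00:00\n"
        "SITE-A\tCMS\tdata_transfer\tthroughput_MBps\t85.5\t2008-07-11T10:00:00\n"
    ),
}


def parse_feed(text):
    """Turn the published text into rows, skipping comments and malformed lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            continue
        site, vo, activity, metric, value, timestamp = fields
        try:
            rows.append((site, vo, activity, metric, float(value), timestamp))
        except ValueError:
            continue  # value is not numeric: ignore this line
    return rows


def create_schema(conn):
    """Hypothetical common schema: one table shared by all VOs."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metrics (
               site TEXT NOT NULL,
               vo TEXT NOT NULL,
               activity TEXT NOT NULL,
               metric TEXT NOT NULL,
               value REAL NOT NULL,
               timestamp TEXT NOT NULL,
               PRIMARY KEY (site, vo, activity, metric, timestamp)
           )"""
    )


def collect(conn, feeds):
    """One collector pass: parse every feed and store it in the common table."""
    stored = 0
    for text in feeds.values():
        rows = parse_feed(text)
        # INSERT OR REPLACE keeps repeated passes over the same data idempotent.
        conn.executemany(
            "INSERT OR REPLACE INTO metrics VALUES (?, ?, ?, ?, ?, ?)", rows
        )
        stored += len(rows)
    conn.commit()
    return stored


def site_view(conn, site):
    """All metrics of all VOs for one site, as a single query."""
    cur = conn.execute(
        "SELECT vo, activity, metric, value FROM metrics "
        "WHERE site = ? ORDER BY vo, activity, metric",
        (site,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    collect(conn, SAMPLE_FEEDS)
    for row in site_view(conn, "SITE-A"):
        print(row)
```

Since every VO ends up in the same table, a single query is enough to build the per-site view across experiments. In a real deployment the feed text would be downloaded from the published URLs at regular intervals rather than taken from `SAMPLE_FEEDS`.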
Activities and metrics to monitor
The first proposal for the list of metrics to monitor was drawn up on the basis of the feedback given by sites after the CCRC08.
The list will be updated according to further requirements.
This is the format that should be used to provide the data.
A summary of the metrics currently available.
Current status of the activity
This is a page which summarizes the current status of the activity.
Open questions
- How to define the targets for each activity?
- About the data transfer: is it worth storing all the data in our shared database channel by channel? This can be a lot of data. Some considerations here.
- Can the experiments provide information about the different sub-activities (MC production, user analysis, etc.), or do they just publish information about the main activities (job processing)? See more.
- Metrics for ALICE activities: job processing metrics are provided, together with the pledged values. The overall site status is also provided. Data about data transfers are still missing.
- Metrics for ATLAS activities: job processing and data transfer metrics are obtained through an API. The script calling the API is imported inside my collector. The status is also computed.
- Metrics for CMS activities: job processing metrics are provided, together with the pledged values for the number of jobs.
- Metrics for LHCb activities: metrics about data transfer and job processing activities are provided. The status of data transfers is still missing.
Feedback from users
Feedback from users here
Meetings
Presentations
Documentation
User Guide
Related links
About Gridmap
Links to the existing CCRC08 servicemap showing experiment specific SLS and SAM data for critical services (only the map on the right!).
And link to the original CERN GridMap, showing the SAM and GridView data for reference.
Gridmap for the view of all the sites for a given VO (here CMS). Under development.
SAM visualization for LHCb and for ATLAS
-- ElisaLanciotti - 11 Jul 2008