Activities and metrics that we want to monitor

There are two main activities: job processing and data transfer. Each main activity is divided into a group of sub-activities.

Job Processing:

(the sub-activity names are shown exactly as they appear in the database)

  •  mc_production
    job processing for MC production
  •  data_reconstruction
    job processing for real data reconstruction
  •  user_analysis
    job processing for data analysis carried out by users
  •  test
    jobs run for any type of test, including SAM tests
  •  unknown
    jobs of unknown origin
  •  JobRobot
    test jobs for CMS
  •  private_production
    MC production run by private users

For each activity, the metrics to monitor are:

  •  parallel_jobs
    The number of jobs running at the site, averaged over the last hour. If the number of parallel jobs is sampled more than once per hour (for CMS, for example, it is sampled every 5 minutes), this is the average of those samples over the last hour; otherwise, it is the number of parallel jobs at the end of the hour. Important remark: this is the number of running jobs in the grid sense of 'running', not from the point of view of the specific application, so a job has to be considered running as long as it is active on the worker node. In many cases this number will have to be computed by adding up jobs in several application states, for example (for ALICE): running, assigned, started.
  •  completed_jobs 
    Number of completed jobs in the last hour
  •  completed_jobs_24h 
    Number of completed jobs in the last 24 hours
  •  successfully_completed_jobs
    Number of successfully completed jobs in the last hour
  •  successfully_completed_jobs_24h
    Number of successfully completed jobs in the last 24 hours
  •  CPU_time, CPU_time_KSI2K
    CPU time, if possible normalized to KSI2K, consumed by the jobs completed in the last hour.
  •  wall_time, wall_time_KSI2K
    Wall clock time consumed by the jobs completed in the last hour. These two measurements are used to compute the CPU efficiency: cpu_time / wall_time (see the sketch following this list, which also illustrates the hourly averaging of parallel_jobs).
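
As a minimal sketch of how these metrics could be derived from raw data, assuming 5-minute samples of the running-job count and per-job CPU/wall times (the function names and sample values below are hypothetical, not part of any agreed interface):

    # Minimal sketch (hypothetical names): aggregate raw data into the
    # hourly metrics described above.

    def parallel_jobs_hourly(samples):
        """Average of the running-job counts sampled over the last hour
        (e.g. twelve 5-minute samples, as for CMS)."""
        return sum(samples) / len(samples) if samples else 0.0

    def cpu_efficiency(cpu_time, wall_time):
        """CPU efficiency of the jobs completed in the last hour:
        cpu_time / wall_time (both in seconds, or both KSI2K-normalized)."""
        return cpu_time / wall_time if wall_time > 0 else 0.0

    # Example: twelve running-job counts, one taken every 5 minutes.
    samples = [410, 415, 430, 428, 431, 425, 420, 433, 437, 441, 438, 436]
    print(parallel_jobs_hourly(samples))   # -> 428.67 (parallel_jobs)
    print(cpu_efficiency(3.1e5, 3.9e5))    # -> ~0.79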

For each measurement, the following additional information has to be provided:

  • Site (in GOCDB naming convention)
  • Start time and end time of the measurement; the interval from start to end is one hour.

Every measurement should be given per VO and per site.
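
For concreteness, one such measurement could be represented as a record like the one below; the field names and the flat key/value layout are illustrative assumptions, since this page does not fix a schema:

    # Hypothetical layout of a single measurement record (field names assumed).
    measurement = {
        "vo": "lhcb",
        "site": "CERN-PROD",                   # GOCDB naming convention
        "activity": "user_analysis",
        "metric": "parallel_jobs",
        "value": 428.67,
        "start_time": "2008-08-14T10:00:00Z",  # start of the measurement
        "end_time": "2008-08-14T11:00:00Z",    # end = start + 1 hour
    }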

Data Transfer:

  •  data_transfer_t0_t1
    data transfer from T0 to T1
  •  data_transfer_production
    data transfer related to the MC production activity; the newly produced files are copied to other sites on a regular basis
  •  data_transfer_analysis
    transfer of the files needed to run user analysis, so that they are available at a given site. This kind of data transfer is not relevant for all VOs; for LHCb, for example, there is no data transfer before running a user analysis, because the files to be opened should already be present at all Tier-1s, where the data analysis is done.
  •  data_transfer_t1_t1
    redistribution of processed data from T1 to T1

and all possible combinations of transfers between Tiers (with the obvious meaning).

For each of them, the metrics of interest are:

  • average_transfer_rate
    Transfer rate in MB/s (average in the last hour)
  • average_transfer_rate_4h
    Transfer rate in MB/s (average in the last 4 hours)
  • success_rate
    Success rate (average in the last hour)
  • success_rate_4h
    Success rate (average in the last 4 hours)
  • throughput
    Total throughput integrated over the last hour. This measurement is optional: if it is not provided, it can be computed from the average transfer rate in the last hour (see the sketch after this list).
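
Since the hourly average transfer rate already determines the integrated throughput, the derivation is immediate; here is a minimal sketch (hypothetical names; rate in MB/s in, megabytes out) of what we would compute when throughput is not reported:

    SECONDS_PER_HOUR = 3600

    def throughput_last_hour(average_transfer_rate_mb_s):
        """Total data moved in the last hour (MB), integrated from the
        hourly average transfer rate (MB/s)."""
        return average_transfer_rate_mb_s * SECONDS_PER_HOUR

    print(throughput_last_hour(120.0))  # 120 MB/s sustained -> 432000 MB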

The following additional information should also be provided for each measurement:

  • Source (in GOCDB naming convention)
  • Destination (in GOCDB naming convention)
  • Start time and end time of the measurement; the interval from start to end is one hour.

General

In addition to specific activities, the gridmap will display the general status of a site, as seen by the VO.

Every VO will provide its own estimation of this status, based on different tests. For ALICE, for example, this status is already defined in MonALISA, where it is displayed under Services/Site Services/Site Overview; it is based on tests of the storage elements, computing elements, proxy server, job processing, etc. If the status is green, the site is OK, even if no job is running at that moment.

In the case of CMS, the status includes not only the SAM tests but also the production success rate, taking into account only those failures whose exit codes point with very high probability to a site problem. For the other experiments we can, for the moment, base this status on the VO-specific SAM tests alone.

Important remark: if the status is red, it does NOT necessarily mean that the site is responsible; the problem could be the responsibility of the VO. We will try to define the status with tests that are as site-related as possible, but it will most probably still happen that a site sometimes turns red due to problems on the VO side. Even when the VO is responsible for fixing the issue, it is useful to notify the site about the problem.

In the database it will appear as:
Activity: general
Metric: all
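
Following the same hypothetical record layout as the measurement sketch above, a general-status entry could look like this (the green/red encoding of the value is an assumption):

    # Hypothetical general-status record (layout and value encoding assumed).
    general_status = {
        "vo": "alice",
        "site": "CERN-PROD",                   # GOCDB naming convention
        "activity": "general",
        "metric": "all",
        "value": "green",                      # assumed status encoding
        "start_time": "2008-08-14T10:00:00Z",
        "end_time": "2008-08-14T11:00:00Z",
    }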

-- ElisaLanciotti - 14 Aug 2008
