Metrics for LHCb

  • The following data transfer activities are available: data_transfer_t0_t1 and data_transfer_t1_t1. For these activities the metrics provided are success_rate and average_transfer_rate. OK. Important remark: LHCb does not distinguish data transfers by the activity that caused them (MC production, reconstruction, or anything else). Only the transfers and their rates are tracked, so transfers can be classified only by source and destination (e.g. T0 to T1).

  • Are there any other types of data transfer activity (for example, upload of MC production from the T2 to the T1)? Answer: No, data transfers are distinguished only by source and destination. OK

  • The pledged values are not given, and the status is given for job processing but not for data transfer. Isn't a pledged value defined somewhere? Doesn't the LHCb computing model define the expected transfer rate during normal operations? Answer: Yes, the values are defined elsewhere. For example, the detailed transfer rates are given in this pictorial view, but the DIRAC accounting system is not aware of them, so it cannot associate an expected value with the measured values. As Adrian explained: "It's a bit difficult for us because we don't keep the transfer type (mc, data taking, reprocessing, ...). We can provide total t0-t1 and t1-t1 throughput data but calculating the pledged value would require knowing what activities are going on and if they are using those channels. Unfortunately we don't have that info in the data accounting. I can put there fixed values but I think that's worse than unknown."
    After a meeting, it was concluded that the status cannot be computed by comparing the actual values with the expected values, but it can be computed on the basis of the success rate. For job processing, Andrei has provided the status since 20 Dec. 2008. The same should be done for data transfer, but it had not yet been done as of 5 Jan. 2009.

  • About the job processing activity: metrics were provided on 13 Nov. for the following activities: MC production, user analysis, and the total job processing activity. For each activity the following metrics are provided: parallel_jobs, completed_jobs, successfully_completed_jobs, wall_time, CPU_time. OK
    What about pledged values? For the CPU time, for example, they should be available in some document. Answer (Andrei, 13 Nov.): I think it is feasible but needs quite a lot of manual work. I do not know if these numbers are available programmatically anywhere, so they would have to be added to our configuration database.

  • Are the CPU time and wall time normalized in KSI2K? Answer (Andrei, 13 Nov.): We do not record a normalized CPU time per job depending on which machine it has run on. CPU normalization is quite tricky. We are discussing it now and hope to come up with a solution soon. For the moment one can make a rough estimate based on an "average" CPU power on the grid.
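The rough estimate mentioned in the answer above could be sketched as follows. This is only an illustration under stated assumptions: the function name, the choice of KSI2K-hours as the output unit, and the example power value are hypothetical and do not come from DIRAC or from the LHCb computing model.

```python
def estimate_normalized_cpu(cpu_time_seconds, average_power_ksi2k):
    """Rough normalized CPU work in KSI2K-hours.

    cpu_time_seconds    : raw (unnormalized) CPU time of the jobs
    average_power_ksi2k : assumed "average" CPU power on the grid, in KSI2K
                          (a hypothetical input; no real value is given here)
    """
    if cpu_time_seconds < 0 or average_power_ksi2k <= 0:
        raise ValueError("CPU time must be >= 0 and average power must be > 0")
    # Convert seconds to hours, then scale by the assumed average power.
    return cpu_time_seconds / 3600.0 * average_power_ksi2k


# Example: 2 hours of raw CPU time with an assumed average power of 1.5 KSI2K.
print(estimate_normalized_cpu(7200, 1.5))
```

Because a single average power is applied to every job, the result only makes sense as an aggregate over many jobs, not as a per-job figure.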

  • Is an overall site status estimation possible for LHCb, and on which tests would it be based? Answer: Yes, it is already provided by William and is based on SAM tests. OK

  • How is the status computed for job processing? (Summary of a mail thread with Andrei on 18 Dec. 2008.) The status is computed on the basis of the success rate over the last 24 hours. In more detail, Andrei provides the metrics completed_jobs, successfully_completed_jobs, etc. for both periods: the last hour and the last 24 hours. He computes the status only for the last 24 hours, since a status based on the last hour of jobs is not considered reliable. Everything that is provided is stored in the database, and the context help in the gridmap shows every metric together with its status. In the gridmap configuration, however, the color of the map is set only on the basis of the status for the last 24 hours. OK, done on 5 Jan. 2009: the job activity status is set to the status of the metric 'completed_jobs_24h'.

  • Important remark on the site status evaluation for LHCb: the status provided by Andrei refers to the job processing activity; it is not a general evaluation of the site status. Nevertheless, when the status is 'banned' for job processing, the general site status should also be set to 'down', regardless of the SAM test results. It can happen that the general site status computed from SAM tests is green while the site is banned for job processing. This situation reveals a gap in the SAM tests, and from the experiment's point of view their result is not reliable in this case. This has to be implemented ad hoc for LHCb in the gridmap.
    From the mail thread of 18 Dec.: Andrei: "The status is evaluated from the success rate of a given job processing activity as you call it. So, it is related to this one in the first place. However, it would be strange that a site is fully green according to the SAM data while it is failing all the jobs. On the other hand, the jobs can be failing because of LHCb faults. These are all the complications that we have discussed already. I would suggest the following. The site status is taken from the SAM test unless it is 'banned'. In the latter case it should be red."
    Elisa: "Then I will keep on taking the general site status from SAM tests. I think that we should always publish in the gridmap the result of the sam tests, even in the case that you mention (green SAM test result but the site is banned for job processing). It can be useful to spot some problem in SAM tests, do you agree?"
    Andrei: "This is in theory that SAM tests should spot all the problems. The real life is always more complicated with this respect. If a site is banned, it is definitely not usable for LHCb. This is not always based only on the SAM tests but also other possible problems that production managers experience with a site in question. Keeping it green in this case may please the site managers but does not help to resolve problems."
    Note: discuss this with Julia before implementing. It conflicts with what has been said so far, namely that the site status and the activity status are two distinct concepts.
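Andrei's suggestion in the thread above (take the site status from SAM unless the site is banned) could be sketched as below. The function name and the SAM status labels are hypothetical, 'down' follows the wording of the remark above, and the rule itself was still under discussion, so this shows the proposal rather than an agreed implementation.

```python
def general_site_status(sam_status, job_processing_status):
    """Combine the SAM-based site status with the job processing status.

    sam_status            : status derived from SAM tests (labels are assumed)
    job_processing_status : status of the job processing activity, where
                            'banned' means banned in the DIRAC WMS mask
    """
    if job_processing_status == 'banned':
        # A banned site is not usable for LHCb, whatever SAM reports.
        return 'down'
    # Otherwise the SAM result is published unchanged.
    return sam_status


print(general_site_status('ok', 'banned'))  # banned overrides a green SAM result
print(general_site_status('ok', 'good'))    # SAM result is kept
```

Keeping the raw SAM result available next to the combined value would still allow spotting cases where SAM is green but the site is banned, which is the concern Elisa raises.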

  • The rule to assign a status to the job processing activity:
    if banned:
        # the site is banned in the DIRAC WMS mask
        site_status = 'banned'
    elif completed_jobs_24h < 10:
        site_status = 'idle'
    elif success_rate > 90.0:
        site_status = 'good'
    elif success_rate > 80.0:
        site_status = 'fair'  # normal
    elif success_rate > 50.0:
        site_status = 'warning'
    else:
        site_status = 'bad'
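The rule above can be wrapped in a small function for experimentation. This is a minimal sketch: the function name and signature are hypothetical, and computing success_rate as the percentage of successfully completed jobs among completed jobs over the last 24 hours is an assumption, since the page does not give the formula.

```python
def job_processing_status(banned, completed_jobs_24h, successfully_completed_jobs_24h):
    """Return the job processing status for one site, following the rule above."""
    if banned:
        # The site is banned in the DIRAC WMS mask.
        return 'banned'
    if completed_jobs_24h < 10:
        # Too few jobs to judge; this also avoids dividing by zero below.
        return 'idle'
    # Assumed definition of the success rate, in percent.
    success_rate = 100.0 * successfully_completed_jobs_24h / completed_jobs_24h
    if success_rate > 90.0:
        return 'good'
    if success_rate > 80.0:
        return 'fair'  # normal
    if success_rate > 50.0:
        return 'warning'
    return 'bad'


print(job_processing_status(False, 200, 190))  # 95% success
print(job_processing_status(False, 5, 5))      # fewer than 10 completed jobs
```

Note that the comparisons are strict, so a success rate of exactly 90% is classified as 'fair' rather than 'good'.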

-- ElisaLanciotti - 10 Nov 2008

Topic revision: r9 - 2020-08-19 - TWikiAdminUser
 