Dirac monitoring tool
The Dirac monitoring tool provides information about job status at different sites.
The minimum time granularity is 1 hour.
The user chooses a site and a time period, and a histogram of the number of jobs versus time is displayed. All jobs are plotted in the same histogram, with a different colour for each status. The possible statuses are: Done, Failed, Running, Stalled.
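As an illustration of the kind of binning behind such a histogram, here is a minimal sketch that counts jobs per hour and per status for one site. It is not the Dirac implementation: the record layout (`site`, `time`, `status`), the function name `hourly_status_counts`, and the site names `SiteA` and `SiteB` are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_status_counts(jobs, site, start, end):
    """Count jobs per (hour, status) for one site within [start, end)."""
    bins = defaultdict(Counter)
    for job in jobs:
        if job["site"] != site:
            continue
        if not (start <= job["time"] < end):
            continue
        # Truncate the timestamp to the hour: the 1-hour granularity.
        hour = job["time"].replace(minute=0, second=0, microsecond=0)
        bins[hour][job["status"]] += 1
    return {hour: dict(counts) for hour, counts in bins.items()}


# Hypothetical job records, invented for this sketch.
jobs = [
    {"site": "SiteA", "time": datetime(2008, 8, 12, 10, 5), "status": "Done"},
    {"site": "SiteA", "time": datetime(2008, 8, 12, 10, 40), "status": "Failed"},
    {"site": "SiteA", "time": datetime(2008, 8, 12, 10, 55), "status": "Done"},
    {"site": "SiteA", "time": datetime(2008, 8, 12, 11, 15), "status": "Running"},
    {"site": "SiteB", "time": datetime(2008, 8, 12, 10, 20), "status": "Stalled"},
]

histogram = hourly_status_counts(
    jobs, "SiteA", datetime(2008, 8, 12), datetime(2008, 8, 13)
)
for hour in sorted(histogram):
    print(hour.isoformat(), histogram[hour])
```

Each hour maps to one stacked bar in the plot, with one colour per status key.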
Open question: is the job status reported from the application point of view or from the Grid point of view?
The jobs are not separated by job type (MC, user analysis, and so on).
I asked about this on the dirac-developers mailing list on 12 Aug 2008. Summary of the mail thread:
Answer from Andrei Tsaregorodtsev:
"As far as your question is concerned - what is exactly the granularity of
different kinds of activities you want to have ? In the new system user and production jobs are put
together. We can not classify user activities (analysis or custom MC or whatever). However, each production
job has associated production ID which details should be available on the Production Summary
pages. Have a look and let's then discuss further."
To do: find out which Production Summary pages he means, and ask for more information.
Answer from Philippe Charpentier:
"I think this has always been a requirement that we should be able to classify jobs in broad categories. Even in DIRAC2
there are "Production" (should be "Simulation") jobs and "Processing" jobs. This is provided by the "JobType". We should consider whether we want to
extend this to other activities such as splitting "Processing" into "Reconstruction" / "Stripping" / "Merging" etc....
In DIRAC3, the JobType of the currently running reconstruction productions is... "test" which is somewhat misleading...
Can we make a proposal to be discussed on the usage of the JobType, including the valid request from Elisa. I think it would be very important as well for
the accounting to be able to group our CPU usage by activity if we want to be able to answer simple questions like "how much CPU is used for stripping?""
My answer:
About the granularity of the classification: we do not need a very fine granularity. From the site perspective it would be interesting to have a first broad classification into:
1- user analysis jobs
2- jobs of an official production of the experiment.
In this way the sites can set priorities. For example, failures of jobs of an official production of the VO will be investigated with higher priority
than failures of jobs of a particular user analysis.
Then, among the category 2, it would be useful to distinguish:
2.a- MC simulation
2.b- the subsequent steps of digitization and reconstruction
so that the sites can check that jobs requiring very high CPU usage are MC simulation jobs.
Of course, if a finer granularity is possible, it will be even better.
Do you think it is feasible to publish this information through the Dirac monitoring? As Philippe says, the JobType should provide this
information. For the first broad classification between user jobs and LHCb jobs, the owner of the certificate used to submit the job might
also be useful.
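The two-level classification proposed in this thread, together with the per-activity CPU accounting Philippe mentions, could be sketched as follows. This is only an illustration under stated assumptions. The JobType values "Simulation", "Reconstruction", "Stripping" and "Merging" are the ones suggested in the mails, but "User" as a JobType, the functions `classify` and `cpu_by_activity`, and the sample CPU figures are hypothetical and not part of DIRAC.

```python
from collections import defaultdict

# Assumed mapping from JobType to production activity. The keys are the
# values suggested in the mail thread; the mapping itself is hypothetical.
PRODUCTION_ACTIVITIES = {
    "Simulation": "MC simulation",
    "Reconstruction": "digitization and reconstruction",
    "Stripping": "stripping",
    "Merging": "merging",
}

USER_JOB_TYPE = "User"  # hypothetical JobType for user analysis jobs


def classify(job_type):
    """Return (category, activity) for a JobType string."""
    if job_type in PRODUCTION_ACTIVITIES:
        return ("production", PRODUCTION_ACTIVITIES[job_type])
    if job_type == USER_JOB_TYPE:
        return ("user analysis", None)
    # Anything else (for example the misleading "test") stays unclassified.
    return ("unknown", None)


def cpu_by_activity(jobs):
    """Sum CPU hours per (category, activity) from (job_type, cpu_hours) pairs."""
    totals = defaultdict(float)
    for job_type, cpu_hours in jobs:
        totals[classify(job_type)] += cpu_hours
    return dict(totals)


# Invented sample data: (JobType, CPU hours).
sample_jobs = [
    ("Simulation", 10.0),
    ("Simulation", 5.0),
    ("Stripping", 2.0),
    ("User", 1.5),
    ("test", 4.0),
]

totals = cpu_by_activity(sample_jobs)
for key in sorted(totals, key=str):
    print(key, totals[key])
```

Keeping unrecognised JobTypes in an explicit "unknown" bucket makes mislabelled productions visible in the accounting instead of silently counting them as user or production jobs.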
Monalisa
Questions:
- Can Monalisa distinguish between user analysis jobs and production jobs?
We only store centrally either site-wide or per-user statistics, but not /site and /user at the same time; this information was not considered relevant for ALICE.
- For each metric: what is the unit, and what is the time granularity? The time granularity is usually 1 minute.
- data transfer: transfer rate?
- Is Monalisa used at the sites? Any feedback from site admins? Yes, they have a test suite that sets the overall status of a site. It is on the Monalisa web page, under Services/Site Services/ site overview. The tests are run every 15 minutes. If the status is green, the site is OK, even if no job is running.
--
ElisaLanciotti - 12 Aug 2008