Feedback from users
Feedback from French sites
Contact persons: LEROY Christine <c.leroy at cea.fr>, Frederique Chollet <Frederique.Chollet at lapp.in2p3.fr>
Link to the Gridmap has been provided to Frederique and Christine on Jan 14th 09.
They have set up a French EGEE-LCG working group, coordinated by Christine Leroy. They volunteered as pilot users for the French region. They will make a first try and report to Elisa and to the French sites. Hopefully this will lead to early adoption in France.
- It is not working with IE but is OK with Opera and Firefox. Asked Lukasz. He is aware of the problem and will fix it.
Fixed on 29 Jan 09
- Some problems encountered with the navigation: it is not easy to control the pop-ups or to click on the URLs (sometimes it works, sometimes not), and some URLs are not reachable. Ask for a more detailed description. This is not enough to debug the problem.
OK
- Site status for the VO lhcb does not seem correct for either IN2P3-CPPM or GRIF (it is missing for IN2P3-CPPM, and missing for the sub-site LAL of GRIF). True! The site status in the gridmap is green, as provided in the source URL. But then the link redirects to a page where some tests are red...
Asked Pablo on 28 Jan 09, waiting for an answer...
- Very important! The GRIF site is split into IPNO and GRIF_DAPNIA. For ALICE, MonALISA publishes metrics for GRIF (which is actually IPNO) and for GRIF_DAPNIA. The information is NOT merged. SERIOUS PROBLEM, to be fixed.
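One possible way to show a single GRIF entry would be to sum the values of the sub-sites before display. The following is only a minimal sketch, assuming the metrics are additive job counts and that a hand-maintained alias table maps each published name to its parent site. The alias table, the function name and all numbers are hypothetical and are not taken from the Gridmap code.

```python
from collections import defaultdict

# Hypothetical alias table: published site name -> parent site name.
# For ALICE, the name "GRIF" as published actually refers to IPNO.
SITE_ALIASES = {
    "GRIF": "GRIF",
    "GRIF_DAPNIA": "GRIF",
}


def merge_site_metrics(published):
    """Sum additive metrics of all sub-sites under their parent site."""
    merged = defaultdict(lambda: defaultdict(int))
    for site, metrics in published.items():
        # Sites without an alias entry are kept under their own name.
        parent = SITE_ALIASES.get(site, site)
        for name, value in metrics.items():
            merged[parent][name] += value
    return {site: dict(metrics) for site, metrics in merged.items()}


# Made-up numbers, for illustration only.
published = {
    "GRIF": {"running_jobs": 40, "finished_jobs": 120},
    "GRIF_DAPNIA": {"running_jobs": 25, "finished_jobs": 80},
    "IN2P3-CPPM": {"running_jobs": 10, "finished_jobs": 30},
}
merged = merge_site_metrics(published)
print(merged["GRIF"])
```

Summing only makes sense for additive metrics such as job counts. A status or a transfer rate would need a different merging rule.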
Feedback from Spanish Tier1
Contact persons: Josep Flix (jflix at pic.es), Xavier Espinal (espinal at pic.es) and Gonzalo Merino (merino at pic.es).
Friday 23rd January 2009: first feedback from Gonzalo Merino.
- Problem due to too many simultaneous connections: already planned an action to fix it.
Very high priority!
Migration to integration database done on Tuesday 27th Jan
- About the site status: suggested to employ something similar to the Site Status Board for other VOs as well. Actually this has already been proposed to LHCb. No answer yet.
Eventually Andrei gave a positive answer (27th Jan 09), and Pablo promptly answered that he will provide a prototype implementation for LHCb.
Waiting ...
- About the targets: it would be interesting to have them published in the Gridmap. The problem is where to find them.
Gonzalo provided the pledged values for CPU slots for jobs, VO by VO, and Elisa introduced them in the DB schema. Now they are displayed as 'targets' in the gridmap (27th Jan 09). OK
Important remark: don't call them 'pledged values'! They are not pledged values. The only pledged values are the kSI2K for each VO, averaged over a time period. See the feedback from Nikhef, where Jeff gives a clear explanation of this.
- Some metrics are missing: yes, not all VOs provide all metrics. Ask again
- About metrics: add queued jobs.
Julia answered (26th Jan 09): Most of the job processing information (at least for ATLAS and CMS) is consumed from the ATLAS and CMS Dashboards. The problem with pending jobs is that this data is unfortunately not very reliable, due to incomplete and unreliable input from Grid related sources. While info about running and accomplished jobs is mainly obtained either from the jobs themselves or from the UIs of the job submission tools (and it is pretty consistent with what we see in the local batch systems monitoring at the sites), the number of pending jobs is highly dependent on information from Grid related sources, and it is normally not reliable at all. We are certainly looking for ways to improve and are working in this direction, but for the moment I am afraid that, if we publish this data, it can be quite confusing.
OK
- About CMS/data transfer: the values published are inconsistent with the content of the URL associated with that metric. Moreover, it would be better to have a link redirecting to a Phedex page with a graph showing the evolution of the rate into PIC from all sources, for instance over the last 24h.
Check again with Pablo
Julia adds (26th Jan 09): The limitation we currently have is due to the fact that Phedex does not for the moment provide machine readable information about metrics which we would like to publish. It is also not clear how to generate links to Phedex graphs with concrete set of attributes. But this limitation is temporary. We are working with Phedex developers and hopefully soon will get all necessary API from Phedex
Waiting for new release of Phedex...
- About LHCb/data transfer: no value available. This seems to be because of a lack of activity. Asked Adrian many times.
Check from time to time whether some data are published. They are published, but often the values are zero.
Feedback from German Tier1
Contact persons: Natalia Ratnikova (natasha at fnal.gov) and Artem Trunov (artem.trunov at cern.ch). No feedback from German users.
Feedback from NIKHEF
Contact people are Jeff Templon and Ronald Starink.
- Ronald 22 Jan 2009: Nevertheless, as site administrator, I would like to be notified of problems instead of checking various sources (web sites, portals) for problems. We have good experience with Nagios in this respect. Not only do we use it to monitor our farm, but also to periodically retrieve the results of critical SAM tests. If there is a failure, we'll be notified instantly. Note that we use the work of the EGEE-2 Monitoring Working Group for this, as developed at CERN by James Casey and others. Perhaps a similar approach can be taken for the experiment's view: on the site's Nagios server a probe runs, once per hour, to retrieve the results by some kind of remote database query. Shouldn't be difficult because a working example exists! In case of a failure, Nagios could even show a link to the URL that provides details (e.g. the gridmap page).
Answer: I have discussed this with Julia (in cc). In fact, from the technical point of view it should not be difficult to implement a Nagios probe which retrieves results from our database and sends an alarm in case of error. The problem is that, before that, we should set some rules which clearly define the failures from the point of view of the VOs, and this is not done yet. In our tool we have tried to get from the VOs the rules to define the status, which determines the color of the maps. But these are still tentative definitions. Most probably these rules will have to be tuned before they become in some way official for the VO. For this reason we consider that sending the site an automatic alarm based on these rules could be confusing.
I certainly agree that there should be rules or criteria for failures before notifying sites of problems. We are already working with the HEP VOs to improve their SAM tests. My point was that as site admins, we'd like to get an overview of the status from one source, and ideally problems should be "pushed" to us, instead of us "pulling" potential problems from various sources. But I'm already happy with your tool because it combines the results from the various dashboards!
ok
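The probe discussed above could look roughly like the following sketch. It relies only on the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, with a one-line status message on standard output). Everything else is an assumption made for illustration: the colour-to-state mapping, the severity ordering, the `fetch_site_status` placeholder and its values, and the site name. As noted in the answer, the real rules would first have to be agreed with the VOs.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical mapping from map colours to Nagios states (tentative rules).
COLOUR_TO_STATE = {"green": OK, "yellow": WARNING, "red": CRITICAL}

# Assumed ranking used to pick the worst state, from least to most severe.
SEVERITY = [OK, UNKNOWN, WARNING, CRITICAL]


def fetch_site_status(site):
    """Placeholder for the remote query; returns a dict VO -> colour.

    A real probe would query the dashboard database or a web service here.
    The values below are made up.
    """
    return {"atlas": "green", "lhcb": "red", "alice": "grey"}


def evaluate(site, status_by_vo, details_url=None):
    """Return (exit_code, one-line message) for a Nagios check."""
    if not status_by_vo:
        return UNKNOWN, f"UNKNOWN - no status available for {site}"
    # Colours without a known mapping are reported as UNKNOWN.
    states = {vo: COLOUR_TO_STATE.get(colour, UNKNOWN)
              for vo, colour in status_by_vo.items()}
    worst = max(states.values(), key=SEVERITY.index)
    bad = sorted(vo for vo, state in states.items() if state != OK)
    if not bad:
        return OK, f"OK - {site}: all VOs green"
    message = f"{STATE_NAMES[worst]} - {site}: " + ", ".join(
        f"{vo}={status_by_vo[vo]}" for vo in bad)
    if details_url:
        # Optional link to the page with details, e.g. the gridmap page.
        message += f" (details: {details_url})"
    return worst, message


code, message = evaluate("EXAMPLE-SITE", fetch_site_status("EXAMPLE-SITE"))
print(message)
# A real plugin script would finish with: sys.exit(code)
```

Nagios would run such a script on a schedule (once per hour in Ronald's proposal) and notify the site administrators according to the returned exit code, which gives the "push" behaviour requested.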
- Ronald 22 Jan 2009: I can confirm that the right-click solution works, although in my opinion it is not an intuitive action. I was tempted to click with the left mouse button, expecting to see details. Unless the left-click is reserved for future enhancements, could you please consider this option?
Answer: actually, if you click with the left button on a map, a sub map should appear. This is explained also with some screenshots in the User Guide: https://twiki.cern.ch/twiki/bin/view/Sandbox/UserGuide. I hope it helps (though it cannot be as clear as in a real time demo..)
- Ronald: Concerning the pop-up: could it be changed to behave as a tooltip, meaning that it is shown only if the mouse was not moved for more than ~0.5 seconds? That would make the user interface more quiet.
Answer: I will take this suggestion into account. I'll talk about it with Max Boehm, who is the developer of gridmap. TALK WITH MAX
- About target values to associate with the number of running jobs: I asked NIKHEF about the pledged values for job slots and Jeff pointed out that there are no pledged values for that! His mail: We don't have "pledged values" for job slots. We have fair share here. This means that averaged over a period of time, the VO is guaranteed to get a certain share of the entire farm. This is measured not by job slots, but by the total amount of processor time used by all jobs. The calculation of the fair share is based on:
a) kSI2K pledged to the VO (this is a real number pledged to the VO and not directly related to job slots)
b) total kSI2K in the farm
and comes out to a percent value. You could indeed use this percent and multiply by the total number of processors, however there are situations in which a VO could not get this many job slots, and it would be perfectly acceptable behavior and in line with what we promised to the VO.
Hence I would not like to provide any value for "pledged job slots" since we have never pledged a certain number. I would not like to encourage the VOs to believe in any number that does not really exist.
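The arithmetic Jeff describes can be illustrated with a short sketch. All numbers below are made up, and the function names are hypothetical. The resulting slot count is only an indicative figure, not a pledge: as explained in the mail, a VO may legitimately get fewer slots at a given moment.

```python
def fair_share_fraction(pledged_ksi2k, total_farm_ksi2k):
    """Fraction of the farm guaranteed to a VO, averaged over time."""
    if total_farm_ksi2k <= 0:
        raise ValueError("total farm capacity must be positive")
    return pledged_ksi2k / total_farm_ksi2k


def indicative_slots(pledged_ksi2k, total_farm_ksi2k, total_processors):
    """Fair-share fraction multiplied by the number of processors.

    This is NOT a pledged value: it is only a rough indication.
    """
    return fair_share_fraction(pledged_ksi2k, total_farm_ksi2k) * total_processors


# Made-up example: 500 kSI2K pledged to a VO, 2000 kSI2K in the farm,
# 800 processors in total.
share = fair_share_fraction(500, 2000)
slots = indicative_slots(500, 2000, 800)
print(f"fair share: {share:.0%}, indicative job slots: {slots:.0f}")
```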
Feedback from UK sites
General feedback
- The URL should remember the site, so that one can bookmark the Gridmap for a given site (currently you are always redirected to the default site, CERN). This has been done, but not yet deployed on the production instance. Why?
- The site name lists are NOT complete. This is because the list is read from the database and is not dynamically updated. A better solution has been implemented: the site list is downloaded directly from the gridops portal. This has been done, but not yet deployed on the production instance. Why?
--
ElisaLanciotti - 03 Jun 2009