The migration of SAM to the MONIT infrastructure was discussed at the WLCG Operations Coordination Meeting on the 7th of May.
To follow up on the discussion after the presentation, the experiments and sites/federations are asked to answer the set of questions below. Note that we will have to see what can feasibly be implemented beyond what we already have in MONIT today.
- Retention policy. The proposed compromise is the following:
- 1 year detailed history, with the possibility to transparently navigate to a log file of a particular test. Is it good enough?
- ALICE
- ATLAS
- CMS
- yes, assuming the data in HDFS is similarly available, or there is a dashboard that lets a non-expert query the data easily
- LHCb
- 1 year of detailed history should be OK, as this is mostly used to solve current issues.
- Sites/Federations
- As CBPF site and NGI (Latin America) manager, I believe 1 year of detailed data should be enough. As you know, it's important to justify investments (and ask for more ;)
- For how long should daily summaries be available?
- ALICE
- ATLAS
- CMS
- indefinitely, except if daily site/element status/availability/reliability can be easily obtained from HDFS by a non-expert
- LHCb
- We think, like CMS, that it should be stored indefinitely, especially if the data itself is small.
- Sites/Federations
- As CBPF site and NGI (Latin America) manager, I believe it's important to have it for "several" years, depending on the data size.
- For analysis of long-term statistics, data should be available on HDFS. For how long?
- ALICE
- ATLAS
- CMS
- LHCb
- Sites/Federations
- Anything else?
- ALICE
- ATLAS
- CMS
- simplicity, fewer retention categories ==> less confusion
- LHCb
- As simple as possible for retrieving site data
- Sites/Federations
- Are the HTML A/R reports needed?
- Their images take up a lot of inodes and disk space. Are they important for something or can we stop providing them?
- ALICE
- ATLAS
- We guess no. But this is a site/federation question/issue.
- CMS
- we believe only the PDF version is used
- LHCb
- Sites/Federations
- As CBPF site and NGI (Latin America) manager, I believe HTML is not necessary. I would keep the others.
- A/R calculation algorithm
- In the old implementation, if the test results for a given site were unknown, the site was considered to be UP over that period unless it was in downtime. This was done to avoid counting test submission problems against sites. In the new implementation, the suggestion is not to treat 'unknown' status as 'OK'. So if a site has OK tests for 50% of the time and the other 50% is unknown, its availability will be 0.5. If the cause is not a site fault, a recalculation request should be created. Recalculation will be performed by the MONIT team or experiment representative(s). A procedure for recalculating multiple sites without creating one request per site will be provided. Is it fine?
- ALICE
- ATLAS
- Recalculation by the MONIT team is acceptable.
- CMS
- fine, more appropriate from our point of view
- LHCb
- If recalculation is to be done by the experiment team, then we need to decide how to do it, for instance by cross-checking with the data in LHCb-DIRAC, etc.
- Sites/Federations
- Federation availability calculation. In the old implementation, if one or several sites were completely missing test data, this was not counted against the federation. The availability of the federation was calculated from the sites for which test data existed. In the new implementation, the suggestion is to consider sites without data as unavailable if they have a production flag in the VO feed. Is it fine?
- ALICE
- ATLAS
- In the old implementation, we can keep a site in AGIS without including it in the federation A/R calculation by disabling all of its services for SAM tests. With the proposed new algorithm, this will not be possible. We need to check whether this is a problem.
- CMS
- no comment, this is a site/federation question/issue
- LHCb
- Sites/Federations
- As Latin American Federation manager, I believe that "sites without data" can still be productive. There could be just a "service issue", etc. In my opinion, keeping their data matters more to the sites than to the Federation...
- Comment: maybe these questions should also be addressed to the site and federation representatives, since they are certainly concerned. They should be included in the GDB presentation.
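The difference between the old and the proposed treatment of 'unknown' periods and of sites without data can be illustrated with a small sketch. It is purely illustrative and is not the MONIT implementation: the function names, the status labels ("OK", "UNKNOWN"), the unweighted mean over sites, and the choice to keep every site that has data in the federation average are all assumptions made for the example, and downtime handling is left out.

```python
from typing import Dict, List, Optional, Tuple


def site_availability(hours_by_status: Dict[str, float], unknown_is_ok: bool) -> float:
    """Fraction of the period during which the site counts as available.

    hours_by_status maps a status label (hypothetical: "OK", "UNKNOWN", ...)
    to the number of hours spent in that status.
    unknown_is_ok=True mimics the old behaviour, False the proposed one.
    """
    total = sum(hours_by_status.values())
    if total <= 0:
        raise ValueError("no time recorded for this site")
    up = hours_by_status.get("OK", 0.0)
    if unknown_is_ok:
        up += hours_by_status.get("UNKNOWN", 0.0)
    return up / total


def federation_availability(
    sites: List[Tuple[Optional[float], bool]],
    missing_is_unavailable: bool,
) -> Optional[float]:
    """Unweighted mean availability over the sites of a federation (assumption).

    Each site is (availability or None if it has no test data, production flag).
    missing_is_unavailable=False mimics the old behaviour (sites without data
    are skipped); True counts production sites without data as 0.
    """
    counted = []
    for availability, in_production in sites:
        if availability is not None:
            counted.append(availability)
        elif missing_is_unavailable and in_production:
            counted.append(0.0)
    if not counted:
        return None  # nothing to average
    return sum(counted) / len(counted)


# The example from the question: OK for 50% of the time, unknown for the rest.
day = {"OK": 12.0, "UNKNOWN": 12.0}
print(site_availability(day, unknown_is_ok=True))    # old: 1.0
print(site_availability(day, unknown_is_ok=False))   # proposed: 0.5

# One site with data, one production site without data, one non-production site without data.
federation = [(1.0, True), (None, True), (None, False)]
print(federation_availability(federation, missing_is_unavailable=False))  # old: 1.0
print(federation_availability(federation, missing_is_unavailable=True))   # proposed: 0.5
```

In both cases the proposed rule can only lower the reported value relative to the old one, which is why the recalculation procedure matters when the missing or unknown results are not the site's fault.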
- A/R profiles
- In the old implementation, it was possible to create profiles via the UI. This is no longer possible in the new implementation. To create a new profile, a SNOW ticket has to be submitted and the profile will be created by the MONIT team. The same applies to changes to existing profiles. Is it fine?
- ALICE
- ATLAS
- The profile is rather static, so it is fine to request changes by a ticket. We also need to think about metrics, which associate tests with a service availability. The definition of metrics could change frequently during the development of probes.
- CMS
- acceptable as profiles are quite static
- LHCb
- Machine readable format
- In the old version, the data included in the A/R reports is available in JSON and CSV formats. It looks like the CSV format is not always consistent and sometimes contains JSON parts. Since nobody has complained so far, it looks like nobody uses CSV. If someone is using CSV, please indicate it below; otherwise it will be dropped.
- ALICE
- ATLAS
- The CSV format is not used centrally in ATLAS. But this is also a question/issue for sites/federations.
- TRIUMF uses the interface at: link. Will it be kept or replaced/retired?
- CMS
- LHCb
- Sites/Federations
- Feedback for the UI
- Please provide any feedback on the new SAM UI.
- ALICE
- Several things do not work properly yet, e.g. various views do not show all the selected sites
- We probably need to follow up e.g. through tickets
- ATLAS
- Do we have a UI for preprod? It is quite important for development.
- WLCG Site Monitoring Latest:
- Some sites show empty tables (e.g. IN2P3-CC)
- In the old monitor, status icons (OK, W, C..) for each metric are linked to detailed logs. This is a useful feature to track problems.
- BNNLAKE, GRIF-IRFU, GRIF-LAL, GRIF-LPNHE, NIKHEF: No data to show (the old monitor shows data)
- CMS
- WLCG Site Monitoring Historical :
- the Recomputations Start/End is very prominent on the dashboard yet it is difficult to understand what this is
- it would be nice to have the default be no VO or all VOs, and 12/24 hours
- I think the availability/reliability Mode selection doesn't work properly: if I select availability, it shows the downtime fraction, and if I select reliability, it shows the downtime fraction as critical
- one could list the fraction of each state and both availability and reliability after a site/element name and remove the Mode selection at the top, just a thought
- the last N hours view will always miss some not-yet-processed entries, i.e. never show 100%; that will confuse sites and cause lots of questions like "why is my site only 95%"
- clicking on a site bar (due to old SAM3 behaviour) people will expect to get to service/element availability/reliability but instead it triggers a time selection change that is not easily recognized, this will be confusing
- similarly clicking on an endpoint bar people will expect to get to the tests and from test to details
- a little space between hosts in the test subwindow would help navigation, for instance site, host, and test on different lines and only printing in case it's different than the previous entry
- the Details subwindow could be omitted if boundaries between tests were visible in the Tests subwindow and were clickable (the Details section repeats the test information apart from the detail; clicking on the summary to get the detail is actually quite slow compared to SAM3)
- there are white dashed vertical bars every 6 hours that suggest a switch of the evaluation/new value, but they are really quarter-day indicators that ignore daylight saving time at CERN; a day indicator in the current CERN timezone might be more helpful
- to see all sites/elements, one needs to click in the site (or element) subwindow (but somewhere on the site so as not to trigger a time interval change) before one can scroll inside that subwindow, otherwise the page scrolls (at least for me); I suspect all but the first seven sites will find this confusing
- the subwindows frequently lag in updating, so that they scroll to white space; it would be useful to have at least an indicator/spinning wheel or something to know the page is still loading
- thanks for setting this up!
- WLCG Site Monitoring Latest :
- only four sites (three Tier-3 and one "unknown") are visible, all with UNKNOWN country/federation
- I thought the same, but actually there is a way of accessing the other sites, even if it took some time to find out. Still, it's no good to present such a view by default
- I could not find a way to visualise all latest test results for all sites in a single page, like in the old SAM3 UI. This kind of view is essential and it must be preserved in the MONIT implementation
- I could not access the test outputs from MONIT; like for the previous point, this is an essential feature
- LHCb
- Sites/Federations
--
JuliaAndreeva - 2020-05-20