The migration of SAM to the MONIT infrastructure was discussed at the WLCG Operations Coordination Meeting on the 7th of May.

To follow up on the discussion after the presentation, here is a set of questions to be answered by the experiments and sites/federations. Note: we will have to see what can feasibly be implemented beyond what we already have in MONIT today.

  • Retention policy. The proposed compromise is the following:
    • 1 year of detailed history, with the possibility to transparently navigate to the log file of a particular test. Is it good enough?
      • ALICE
      • ATLAS
      • CMS
        • yes, assuming the data in HDFS is similarly available, or there is a dashboard that lets a non-expert query the data easily
      • LHCb
      • Sites/Federations
    • For how long should daily summaries be available?
      • ALICE
      • ATLAS
      • CMS
        • indefinitely, unless daily site/element status/availability/reliability can be easily obtained from HDFS by a non-expert
      • LHCb
      • Sites/Federations
    • For analysis of long-term statistics, data should be available on HDFS. For how long?
      • ALICE
      • ATLAS
      • CMS
        • indefinitely
      • LHCb
      • Sites/Federations
    • Anything else?
      • ALICE
      • ATLAS
      • CMS
        • simplicity, fewer retention categories ==> less confusion
      • LHCb
      • Sites/Federations

  • Are the HTML A/R reports needed?
    • Their images take up a lot of inodes and disk space. Are they important for something or can we stop providing them?
      • ALICE
      • ATLAS
      • CMS
        • we believe only the PDF version is used
      • LHCb
      • Sites/Federations

  • A/R calculation algorithm
    • In the old implementation, if test results for a given site were unknown, the site was considered to be UP over this period of time unless it was in downtime. This was done to avoid counting problems related to test submission against sites. In the new implementation, the suggestion is not to treat 'unknown' status as 'OK'. So if a site has OK tests for 50% of the time and 50% is unknown, availability will be 0.5. If the reason is not a site fault, a recalculation request should be created. Recalculation will be performed by the MONIT team or experiment representative(s). A procedure for recalculating multiple sites without creating one request per site is foreseen. Is it fine?
      • ALICE
      • ATLAS
      • CMS
        • fine, more appropriate from our point of view
      • LHCb
      • Sites/Federations
    • Federation availability calculation. In the old implementation, if one or several sites were completely missing test data, this was not counted against the federation. Availability of the federation was calculated based on the sites for which test data existed. In the new implementation, the suggestion is to consider sites without data as unavailable if they have a production flag in the VO feed. Is it fine?
      • ALICE
      • ATLAS
      • CMS
        • no comment, this is a site/federation question/issue
      • LHCb
      • Sites/Federations
    • Comment: maybe these questions should also be addressed to the site and federation representatives, since they are certainly concerned. They should be included in the GDB presentation.
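
The difference between the old and the proposed treatment of 'unknown' results and of sites without data can be illustrated with a minimal Python sketch. This is not the MONIT implementation: the function names, the status labels, the equal-length time bins and the unweighted mean over sites are all assumptions made for illustration only.

```python
from collections import Counter


def site_availability(statuses, unknown_counts_as_ok):
    """Fraction of equal-length time bins in which a site counts as available.

    `statuses` is a list of per-bin labels such as "OK", "CRITICAL",
    "DOWNTIME" or "UNKNOWN" (hypothetical labels, not the MONIT schema).
    Bins that are neither OK nor UNKNOWN never count as available.
    """
    if not statuses:
        raise ValueError("at least one time bin is required")
    counts = Counter(statuses)
    up = counts["OK"]
    if unknown_counts_as_ok:
        # Old behaviour: unknown periods are treated as UP.
        up += counts["UNKNOWN"]
    return up / len(statuses)


def federation_availability(site_values, production_sites, penalise_missing):
    """Unweighted mean availability over the sites of a federation.

    `site_values` maps site name to availability, or to None when the
    site has no test data at all. The unweighted mean is an assumption.
    """
    values = []
    for site, value in site_values.items():
        if value is not None:
            values.append(value)
        elif penalise_missing and site in production_sites:
            # Proposed behaviour: a production site without data is unavailable.
            values.append(0.0)
        # Otherwise the site without data is left out of the mean.
    if not values:
        return None
    return sum(values) / len(values)


# The example from the text: OK for 50% of the time, unknown for 50%.
bins = ["OK"] * 12 + ["UNKNOWN"] * 12
print(site_availability(bins, unknown_counts_as_ok=True))   # 1.0 (old)
print(site_availability(bins, unknown_counts_as_ok=False))  # 0.5 (proposed)

# A hypothetical federation of three production sites, one without data.
federation = {"SITE_A": 1.0, "SITE_B": 0.5, "SITE_C": None}
production = {"SITE_A", "SITE_B", "SITE_C"}
print(federation_availability(federation, production, penalise_missing=False))  # 0.75 (old)
print(federation_availability(federation, production, penalise_missing=True))   # 0.5 (proposed)
```

In this sketch a recalculation would amount to correcting the per-bin labels and running the same functions again.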

  • A/R profiles
    • In the old implementation, it was possible to create profiles via the UI. This is no longer possible in the new implementation. To create a new profile, a SNOW ticket has to be submitted and the new profile will be created by the MONIT team. The same applies to changes to existing profiles. Is it fine?
      • ALICE
      • ATLAS
      • CMS
        • acceptable as profiles are quite static
      • LHCb

  • Machine readable format
    • In the old version, the data included in the A/R reports is available in JSON and CSV formats. The CSV format appears not to be always consistent and sometimes contains JSON parts. Since nobody has complained so far, it looks like nobody uses CSV. If someone is using CSV, please indicate it below; otherwise it will be dropped.
      • ALICE
      • ATLAS
      • CMS
        • we know of no csv use
      • LHCb
      • Sites/Federations

  • Feedback for the UI
    • Please provide any feedback on the new SAM UI.
      • ALICE
      • ATLAS
      • CMS
        • WLCG Site Monitoring Historical :
        • the Recomputations Start/End is very prominent on the dashboard yet it is difficult to understand what this is
        • it would be nice to have the default be no VO or all VOs, and 12/24 hours
        • I think the availability/reliability Mode does not work properly: if I select availability, it shows the downtime fraction, and if I select reliability it shows the downtime fraction as critical
          • one could list the fraction of each state and both availability and reliability after a site/element name and remove the Mode selection at the top, just a thought
        • the last N hours view will always miss some not-yet-processed entries, i.e. never show 100%; that will confuse sites and cause lots of questions like "why is my site only 95%"
        • when clicking on a site bar, people will expect (due to the old SAM3 behaviour) to get to service/element availability/reliability, but instead it triggers a time selection change that is not easily recognized; this will be confusing
        • similarly, when clicking on an endpoint bar, people will expect to get to the tests and from a test to the details
        • a little space between hosts in the test subwindow would help navigation, for instance site, host, and test on different lines and only printing in case it's different than the previous entry
        • the Details subwindow could be omitted if the boundaries between tests were visible in the Tests subwindow and were clickable (the Details section repeats the test information, apart from the detail; clicking on the summary to get the detail is actually quite slow compared to SAM3)
        • there are white dashed vertical bars every 6 hours that suggest a switch of the evaluation or a new value, but they are really quarter-day indicators in CERN time without daylight saving; a day indicator in the current CERN timezone might be more helpful
        • to see all sites/elements, one needs to click in the site (or element) subwindow (but somewhere on the site so as not to trigger a time interval change) before one can scroll inside that subwindow, otherwise the page scrolls (at least for me); I suspect all but the first seven sites will find this confusing
        • the subwindows frequently lag in updating, so that they scroll to white space; it would be useful to have at least an indicator such as a spinning wheel to know the page is still loading
        • thanks for setting this up!
        • WLCG Site Monitoring Latest :
        • only four sites (three Tier-3 and one "unknown") are visible, all with UNKNOWN country/federation
          • I thought the same, but actually there is a way of accessing the other sites, even if it took some time to find out. Still, it's no good to present such a view by default
        • I could not find a way to visualise all latest test results for all sites in a single page, like in the old SAM3 UI. This kind of view is essential and it must be preserved in the MONIT implementation
        • I could not access the test outputs from MONIT; like for the previous point, this is an essential feature
      • LHCb
      • Sites/Federations

-- JuliaAndreeva - 2020-05-20

Topic revision: r6 - 2020-05-25 - AndreaSciaba
 