Comparison of SAM and Nagios Availabilities for T1 Sites

From February 1st to February 7th

  • Differences:
    • The availabilities for T1 sites differed only on the 3rd of February, and only for a few hours. The cause was a bug in the WN-RepFree metric (we had started raising CRITICAL when a storage area published an empty free-space value), which was corrected by Wednesday midday.
Sites Explanation
IN2P3-CC CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: CRITICAL: bad GlueSAFreeOnlineSize\n----------\nSA ops :\n GlueSAFreeOnlineSize = GB\n GlueSAStateAvailableSpace = 1048576000 KByte\n ACBRs for VO ops = VO:ops,ops\nERROR: bad GlueSAFreeOnlineSize: \n\n
INFN-T1 CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: CRITICAL: bad GlueSAFreeOnlineSize\n----------\nSA ops :\n GlueSAFreeOnlineSize = GB\n GlueSAStateAvailableSpace = 116455098023 KByte\n ACBRs for VO ops = VO:ops,ops\nERROR: bad GlueSAFreeOnlineSize: \n----------\nSA ops:replica:online :\n GlueSAFreeOnlineSize = 116455 GB\n GlueSAStateAvailableSpace = 116455098023 KByte\n ACBRs for VO ops = VO:ops,ops\n\n
SARA-MATRIX am91-46.gina.sara.nl: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: CRITICAL: bad GlueSAFreeOnlineSize
Taiwan-LCG2 w-wn0989: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: CRITICAL: bad GlueSAFreeOnlineSize
TRIUMF-LCG2 wn205.triumf.lcg: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: CRITICAL: bad GlueSAFreeOnlineSize
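
The excerpts above show the pattern behind the failures: a storage area that publishes GlueSAFreeOnlineSize with a unit but no number, while GlueSAStateAvailableSpace is populated. As an illustration only, the sketch below parses such an excerpt and lists the storage areas affected. It is not the WN-RepFree probe itself; the function names parse_storage_areas and empty_free_size are hypothetical, and the parsing assumes the "SA <name> :" layout seen in the excerpts.

```python
import re

def parse_storage_areas(output: str) -> dict:
    """Parse 'SA <name> :' blocks into {name: {attribute: value}}.

    Assumes the layout seen in the excerpts above (hypothetical helper,
    not part of the real probe).
    """
    # The excerpts carry literal backslash-n sequences; turn them into newlines.
    text = output.replace("\\n", "\n")
    areas = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"^SA (\S+) :$", line)
        if header:
            current = areas.setdefault(header.group(1), {})
        elif current is not None and line.startswith("Glue") and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return areas

def empty_free_size(areas: dict) -> list:
    """Return the storage areas whose GlueSAFreeOnlineSize has no numeric value."""
    flagged = []
    for name, attrs in areas.items():
        value = attrs.get("GlueSAFreeOnlineSize", "")
        if not re.match(r"^\d+", value):
            flagged.append(name)
    return flagged

# Excerpt taken from the INFN-T1 row above (raw string keeps the literal \n).
sample = (
    r"CRITICAL: bad GlueSAFreeOnlineSize\n----------\nSA ops :\n"
    r" GlueSAFreeOnlineSize = GB\n GlueSAStateAvailableSpace = 116455098023 KByte\n"
    r" ACBRs for VO ops = VO:ops,ops\nERROR: bad GlueSAFreeOnlineSize: \n----------\n"
    r"SA ops:replica:online :\n GlueSAFreeOnlineSize = 116455 GB\n"
    r" GlueSAStateAvailableSpace = 116455098023 KByte\n ACBRs for VO ops = VO:ops,ops\n\n"
)

print(empty_free_size(parse_storage_areas(sample)))
```

For the INFN-T1 excerpt this flags only the "ops" storage area; "ops:replica:online" publishes a numeric value and is left alone.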

From January 25th to January 31st

  • Differences:
Sites Explanation
FZK-LCG2 & pic The sBDII sanity check used in Nagios is stricter than the one in SAM. On Wednesday midday we asked Laurence F. to reduce the level at which this test raises an error status, to make it more similar to the one in SAM. He did so, and since then the check shows green in Nagios.
NDGF-T1 There were two bugs in NCG: 1) the certificate and key had to be provided to the check (this was missing in the original version of Hash.pm); 2) the LDAP discovery part of NCG was extracting the ports for SRMv2 endpoints incorrectly, so the wrong port was contacted and the check issued CRITICAL. Both issues have been fixed in Nagios.
INFN-T1 Nagios had a very large number of services configured (>3000) and was skipping some of them. We changed the profile to contain only the critical metrics, which solved the problem.
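
The second NCG bug listed for NDGF-T1 concerned reading the port of an SRMv2 endpoint. NCG itself is written in Perl and its actual fix is not shown here; the sketch below only illustrates, in Python, one way to take host and port from an endpoint URL without assuming a fixed port. The function name endpoint_host_port, the example host and the httpg://host:port/path endpoint form are assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

def endpoint_host_port(endpoint: str) -> Tuple[str, Optional[int]]:
    """Return (host, port) for an endpoint URL.

    Hypothetical helper: assumes the endpoint is published as
    scheme://host[:port]/path. The port is None when none is published,
    so the caller has to decide explicitly what to contact.
    """
    parts = urlsplit(endpoint)
    if parts.hostname is None:
        raise ValueError(f"no host found in endpoint: {endpoint!r}")
    return parts.hostname, parts.port

# Example with a made-up host name and port.
host, port = endpoint_host_port("httpg://srm.example.org:8446/srm/managerv2")
print(host, port)
```

Returning None instead of silently substituting a default keeps a missing port visible, which is the kind of mismatch that led to the wrong port being contacted.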
Topic revision: r5 - 2010-02-08 - DavidCollados
 