Evaluation of the ROC Nagios instances at CERN

  • org.sam.WN-* in PENDING due to problems connecting to MB see Issue 5.

Thursday, 18th of February 2010

CENTRAL EUROPE
Service Host Problem Description
Top-BDII Comments: ALL OK!
CREAM-CE Comments: All nodes failing the Broker Info test. This is a known issue. bug #61322
FTS Comments: Does not apply. No FTS services
WMS Comments: ALL OK!
LFC_C Comments: ALL OK!
LFC_L atlas.uibk.ac.at ch.cern.LFC-Read-ops - CRITICAL: Trying to statg(/grid/ops) : No such file or directory
RGMA mon1.farm.particle.cz CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0). This is a known issue. Bug: https://savannah.cern.ch/bugs/?62482
VOBOX Comments: In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested.
VOMS Comments: ALL OK!
FRANCE
Service Host Problem Description
LFC_C/L Comments: Currently there are problems with properly distinguishing between the service flavours. Fix in NCG requested.
VOBOX Comments: In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested.
CREAM-CE,FTS,MON,MyProxy,Top-BDII,VOMS,WMS Comments: CREAM-CE Broker Info test failing: bug #61322
UKI
Service Host Problem Description
LFC_C/L Comments: Currently there are problems with properly distinguishing between the service flavours. Fix in NCG requested.
VOBOX Comments: In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested.
MON mon.glite.ecdf.ed.ac.uk Now ERROR in SAM (proto failed), OK in Nagios. It was also failing in Nagios ~12h ago, but the details data in Nagios says little (SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0))... Better error output is needed in the hr.srce.RGMA-CertLifetime metric. Updated bug #62482.
CREAM-CE,FTS,MyProxy,Top-BDII,VOMS,WMS Comments: OK.
ITALY
Service Host Problem Description
CREAM-CE test7200a.cnaf.infn.it bug #62482 hr.srce.CREAMCE-CertLifetime: CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0).
LFC_C grid-eo-engine04.esrin.esa.int ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist.
LFC_C lfcserver.cnaf.infn.it ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist.
LFC_L cert-39.pd.infn.it ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist.
MON grid002.ca.infn.it bug #62482 hr.srce.RGMA-CertLifetime: CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0)
VO-box vobox.ca.infn.it org.nagios.gsissh-Check: Current Status: CRITICAL (for 1d 0h 16m 24s), Status Information:CRITICAL - Socket timeout after 60 seconds
MyProxy, Top-BDII,VOMS,WMS Comments: ALL OK!
NORTHERN EUROPE
Service Host Problem Description
Top-BDII Comments: ALL OK!
CREAM-CE Comments: All nodes failing the Broker Info test. This is a known issue. bug #61322
FTS Comments: ALL OK!
WMS Comments: ALL OK!
LFC_C Comments: ALL OK!
LFC_L Comments: ALL OK!
MON Comments: OK - The same as in SAM, 17 ok, 1 warning
MyProxy Comments: ALL OK!
VOBOX Comments: In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested.
VOMS Comments: ALL OK!
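
Several CertLifetime entries above note that the bare "SSL ERROR" text says little and that better error output is needed (bug #62482). The following is only a rough sketch of what a more informative message could carry; it is not the hr.srce probe. The function names, the 30/7-day thresholds and the message wording are all assumptions made for illustration.

```python
import datetime as dt

# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def lifetime_status(not_after, now, warn_days=30, crit_days=7):
    """Map a certificate expiry date to a (code, message) pair.

    The thresholds are hypothetical defaults, not those of the real probe.
    """
    if not_after <= now:
        return CRITICAL, f"CERT LIFETIME CRITICAL - certificate expired on {not_after:%Y-%m-%d}"
    days_left = (not_after - now).days
    if days_left < crit_days:
        return CRITICAL, f"CERT LIFETIME CRITICAL - {days_left} days left"
    if days_left < warn_days:
        return WARNING, f"CERT LIFETIME WARNING - {days_left} days left"
    return OK, f"CERT LIFETIME OK - {days_left} days left"


def fetch_failure_message(host, port, exc):
    """Describe why the certificate could not be fetched at all.

    Naming the endpoint and the exception type separates "the certificate
    is bad" from "the check itself could not run".
    """
    code = UNKNOWN
    text = (f"CERT LIFETIME UNKNOWN - cannot fetch certificate from "
            f"{host}:{port}: {type(exc).__name__}: {exc}")
    return code, text
```

Reporting a failed fetch as UNKNOWN rather than CRITICAL is a design choice of this sketch; the entries above show the current probe reporting CRITICAL in that case.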

Thursday, 11th of February 2010

ASIA PACIFIC
Service Host Problem Description
CANADA
Service Host Problem Description
CENTRAL EUROPE
Service Host Problem Description
CERN
Service Host Problem Description
FRANCE
Service Host Problem Description
GERMANY SWITZERLAND
Service Host Problem Description
ITALY
Service Host Problem Description
NORTHERN EUROPE
Service Host Problem Description
SOUTH EASTERN EUROPE
Service Host Problem Description
SOUTH WESTERN EUROPE
Service Host Problem Description
UKI
Service Host Problem Description

Wednesday, 10th of February 2010

ASIA PACIFIC
Service Host Problem Description
CANADA
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
CENTRAL EUROPE
Service Host Problem Description
CERN
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
FRANCE
Service Host Problem Description
GERMANY SWITZERLAND
Service Host Problem Description
CE,sBDII,SRMv2 Comments: small discrepancies due to timed out connections or info not found in IS.
ITALY
Service Host Problem Description
CE,sBDII,SRMv2 Comments: small discrepancies due to timed out connections or info not found in IS.
NORTHERN EUROPE
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
SOUTH EASTERN EUROPE
Service Host Problem Description
SOUTH WESTERN EUROPE
Service Host Problem Description
UKI
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM

Tuesday, 9th of February 2010

ASIA PACIFIC
Service Host Problem Description
CANADA
Service Host Problem Description
CE ce01.eela.if.ufrj.br too many timeouts for CE-org.sam.WN-Rep: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepDel-ops]: CRITICAL: Replicas for [lfn:/grid/ops/SAM/sam-lcg-rm-cr-lnx147.eela.if.ufrj.br.100210083847.2441279] were NOT deleted.\n[BDII] lcg-bdii.cern.ch:2170: Connection Timeout\n
CE gantt.cefet-rj.br too many timeouts for CE-org.sam.WN-Rep: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170\n
CENTRAL EUROPE
Service Host Problem Description
CERN
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
FRANCE
Service Host Problem Description
GERMANY SWITZERLAND
Service Host Problem Description
CE grid-ce.physik.uni-bonn.de, ce-enmr.chemie.uni-frankfurt.de, diana.switch.ch, ce-goegrid.gwdg.de, ce1.bfg.uni-freiburg.de Common error: "CRITICAL: Job aborted." "Reason(s): Globus error 10: data transfer to the server failed. Status Reason: hit job retry count (0)". The WMS couldn't transfer the job to the CE on the first attempt, and in the JDL we don't allow resubmission. The abort rate is relatively low: 2-3 times in 24h
CE grid13.gsi.de CRITICAL. OK in SAM. "lfc-mkdir: error while loading shared libraries: libuuid.so.1: cannot open shared object file". lfc-mkdir wasn't used in SAM (it was introduced in Nagios), but a WN mustn't fail on loading libs. This is a problem at the site.
ITALY
Service Host Problem Description
CE pbs-enmr.cerm.unifi.it, ce2.egee.unisalento.it Common error: "CRITICAL: Job aborted." "Reason(s): Globus error 10: data transfer to the server failed. Status Reason: hit job retry count (0)". The WMS couldn't transfer the job to the CE on the first attempt, and in the JDL we don't allow resubmission. The abort rate is relatively low: 2-3 times in 24h
sBDII cmsrm-ce01.roma1.infn.it, ce.scope.unina.it, prod-bdii-02.pd.infn.it org.gstat.SanityCheck
NORTHERN EUROPE
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
SOUTH EASTERN EUROPE
Service Host Problem Description
CE Comments: All OK
SRMv2 Comments: All OK
sBDII Comments: All OK
SOUTH WESTERN EUROPE
Service Host Problem Description
CE axon-g01.ieeta.pt org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting.
CE grid001.fc.up.pt org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting.
CE grid001.fe.up.pt org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting.
UKI
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM

Monday, 8th of February 2010

ASIA PACIFIC
Service Host Problem Description
CE Comments Better in Nagios. 2 services fail in Nagios in SAM. Beyond that Nagios has 1 in ERROR and WARNING
CE ce.hut.vngrid.vinaren.vn ERROR: "org.sam.CE-JobSubmit-ops" - UNKNOWN: [Ready->Cancelled [timeout/dropped]]
CE grid01.phy.ncu.edu.tw WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]].
CE Comments 5 other services failing (the same as in SAM)
CE Comments SAM - 11 endpoints failing in total
SRMv2 Comments: Nagios: All tests passing.
sBDII Comments: Nagios & SAM: 2 tests failing. In addition in SAM 3 endpoints in WARNING state
CANADA
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
CENTRAL EUROPE
Service Host Problem Description
CE Comments: 1 WARNING, all others in sync with SAM
CE dwarf.wcss.wroc.pl org.sam.CE-JobSubmit-ops "WARNING: [Running->Cancelled [timeout/dropped]] "
SRMv2 Comments: All is in sync with SAM
sBDII Comments: All tests passing
CERN
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
FRANCE
Service Host Problem Description
CE Comments: Nagios & SAM: tests for 2 endpoints failing.
CE clrlcgce03.in2p3.fr org.sam.CE-JobSubmit-ops "WARNING: [Scheduled->Cancelled [timeout/dropped]]"
SRMv2 ccsrmt2.in2p3.fr All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated.
sBDII Comments: Nagios & SAM: all tests passing
GERMANY SWITZERLAND
Service Host Problem Description
ITALY
Service Host Problem Description
NORTHERN EUROPE
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
SOUTH EASTERN EUROPE
Service Host Problem Description
CE Comments: All OK
SRMv2 torik1.ulakbim.gov.tr CRITICAL: File was NOT copied to SRM. srm://torik1.ulakbim.gov.tr:8446/srm/managerv2?SFN=/dpm/ulakbim.gov.tr/home/ops/testfile-put-1265663366-361a2ebe176d.txt: Invald argument SURL: srm://torik1.ulakbim.gov.tr:8446/srm/managerv2?SFN=/dpm/ulakbim.gov.tr/home/ops/testfile-put-1265663366-361a2ebe176d.txt
sBDII Comments: All OK
SOUTH WESTERN EUROPE
Service Host Problem Description
CE axon-g01.ieeta.pt JS: No brokers found in top BDII. Issue being investigated by Konstantin.
CE ce3.egee.cesga.es WN-Rep-ops: LDAP search timed out after 20 sec. In SAM it's passing the test but just because the job has more time to finish.
CE grid001.fc.up.pt JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. Issue being investigated by Konstantin
SRMv2 Comments: All OK
sBDII Comments: All OK
UKI
Service Host Problem Description
CE gridgate.cp.dias.ie Results are not coming from the WN because of Exit Code !=0. UNKNOWN: job submission OK - problem on WN [Done (Exit Code !=0)]
SRMv2,sBDII Comments: All is in sync with SAM

Friday, 5th of February 2010

  • ITALY: There is a problem with connection to MB on WNs at GRISU-ENEA-GRID on the CEs: egce-cresco.portici.enea.it, egce.frascati.enea.it, egce1-cresco.portici.enea.it (GGUS #55378)

  • GERMANY SWITZERLAND: dcache-se-desy.desy.de fails because the Nagios cert has /ops/ as its primary VOMS attribute. WN-Rep on the CEs grid-ce4.desy.de and grid-ce5.desy.de is also affected. /ops/Role=lcgadmin is needed as the primary VOMS attribute. The latest NCG has a solution for that, which is being tested on samnag017. Now (07-02-2010) OK after manually building and deploying grid-monitoring-config-gen-0.46.1-1.el5.noarch.rpm. It is not yet in any of the egee-SA1 repos.
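
The DESY item above depends on which VOMS attribute comes first in the proxy. A minimal sketch of such a check follows, assuming the FQAN list is already available in proxy order (for example as printed by `voms-proxy-info --fqan`). The helper names are hypothetical, and the normalization of the `/Role=NULL` and `/Capability=NULL` suffixes is an assumption about how the attributes are spelled.

```python
def normalize_fqan(fqan):
    """Strip the NULL capability/role suffixes and any trailing slash."""
    for suffix in ("/Capability=NULL", "/Role=NULL"):
        if fqan.endswith(suffix):
            fqan = fqan[: -len(suffix)]
    return fqan.rstrip("/")


def primary_fqan_ok(fqans, required="/ops/Role=lcgadmin"):
    """Return True if the first (primary) FQAN equals the required one.

    `fqans` is the attribute list in the order the proxy carries it;
    only the first entry counts as primary.
    """
    if not fqans:
        return False
    return normalize_fqan(fqans[0]) == normalize_fqan(required)
```

A proxy listing `/ops/` first and `/ops/Role=lcgadmin` second would fail this check even though it carries the role, which is the situation described for dcache-se-desy.desy.de.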

ASIA PACIFIC
Service Host Problem Description
CE Comments Better in Nagios. However Nagios has 3 endpoints in WARNING when they are OK in SAM
CE grid01.phy.ncu.edu.tw WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]].
CE quanta.grid.sinica.edu.tw WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]].
CE w-ce03.grid.sinica.edu.tw WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]].
CE Comments 5 other services failing (the same as in SAM)
CE Comments SAM - 11 endpoints failing in total
SRMv2 Comments: The same in Nagios & SAM: 1 test failing.
sBDII Comments: Nagios & SAM: 2 tests failing. In addition in SAM 3 endpoints in WARNING state
CANADA
Service Host Problem Description
sBDII lcg-ce.rcf.uvic.ca sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature
CE,SRMv2,sBDII Comments: All is in sync with SAM
CENTRAL EUROPE
Service Host Problem Description
CE dgt01.ui.savba.sk ERROR: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE dgt02.ui.savba.sk and registered in LFC prod-lfc-shared-central.cern.ch. "
CE gn0.hpcc.sztaki.hu ERROR: org.sam.CE-JobSubmit-ops and org.sam.CE-JobState-ops failed "CRITICAL: Job was aborted." In addition all "WN" tests are in PENDING state. "
CE ce.grid.bntu.by UNKNOWN: hr.srce.GRAM-CertLifetime "CERTLIFETIME-PROBE UNKNOWN - Timeout occured."
CE Comments: In addition 1 test failing in SAM & Nagios
SRMv2 Comments: All tests passing in Nagios, 1 test failing in SAM
sBDII Comments: All tests passing
CERN
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
FRANCE
Service Host Problem Description
CE Comments: Nagios & SAM: tests for 4 endpoints failing.
SRMv2 ccsrmt2.in2p3.fr All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated.
sBDII Comments: Nagios & SAM: all tests passing
GERMANY SWITZERLAND
Service Host Problem Description
CE grid-ce4.desy.de,grid-ce5.desy.de Yes / Done Solved (07-03-2010) with manually built NCG 0.46.1-1... WN-Rep CRITICAL (OK in SAM). Due to permission problems on SRM dcache-se-desy.desy.de
CE grid13.gsi.de WN-RepCr CRITICAL (OK in SAM). "lfc-mkdir: error while loading shared libraries: libuuid.so.1" This is a problem with the CE. lfc-mkdir was added in the Nagios probe and doesn't exist in SAM, but it shouldn't fail in this way anyway. Yes / Done
SRMv2 dcache-se-desy.desy.de Yes / Done Solved (07-03-2010) with manually built NCG 0.46.1-1... Permission denied. Problem with Judit's proxy. This SRM seems to be requiring /ops/Role=lcgadmin as primary attribute. Testing new NCG feature on samnag017.
sBDII Comments: In sync with SAM.
ITALY
Service Host Problem Description
CE ce-b1-1.mi.infn.it Today JS UNKNOWN: [Ready->Cancelled [timeout/dropped]] ... and still WN-* in PENDING - thus, problems with MB connection/discovery on WNs
CE egce.frascati.enea.it, egce1-cresco.portici.enea.it Canceled from Running/Scheduled after timeout. As WN-* are in PENDING there are apparently problems with MB connection/discovery (same as egce-cresco.portici.enea.it)
CE egce-cresco.portici.enea.it WARNING: job submission OK - problem on WN [Done (Exit Code =0)]. Can't connect to the given MB, then fails connecting to the top BDII to discover available MBs.
CE egceaix.frascati.enea.it org.sam.CE-JobSubmit-ops OK: success. org.sam.WN-* are in PENDING Consistent with SAM - js OK but NA for WN tests. This is AIX CE - problems are "perl: Setting locale failed." & "Badly formed number" in head (needs explicitly '-n' (in SVN & testing)). Why they always return 0 from jobs even though they fail?
CE grid001.ts.infn.it [hr.srce.GRAM-CertLifetime CRITICAL: CERT LIFETIME CRITICAL - SSL ERROR:] But OK from CLI and in SAM. Ticket #62482 Yes / Done Proposed fix manually applied on samnag006 in hr.srce/CertLifetime-probe
SRMv2 storm02.cr.cnaf.infn.it need a better error output for hr.srce.SRM2-CertLifetime. Followed up in Ticket #62482
SRMv2 t2cmcondor.mi.infn.it Intermittent problems. Sometime fails Put or Get with Globus error. In SAM OK with CLI. In Nagios we use Python API. Needs debugging.
sBDII ce.scope.unina.it,cmsrm-ce01.roma1.infn.it,prod-bdii-02.pd.infn.it org.gstat.SanityCheck CRITICAL (OK in SAM)
NORTHERN EUROPE
Service Host Problem Description
CE ingrid.cism.ucl.ac.be CE-org.sam.WN-Rep is failing, but current SAM tests are OK. LOG: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch.\nlcg-cr --vo ops -d ingrid-se02.cism.ucl.ac.be -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-wn05.cism.ucl.ac.be.100205221222.805333416 /scratch/condor/execute/dir_896/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fggQ9NA-ZIrYeWB7m-NVO_5fg/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt\n[SE][Mkdir][SRM_FAILURE] httpg://ingrid-se02.cism.ucl.ac.be:8444/srm/managerv2: srm://ingrid-se02.cism.ucl.ac.be/storage/data/ops/generated/2010-02-05/file3902b376-c05e-4d0a-b4cc-ae5e4689b27b: \nlcg_cr: Invalid argument\n
SRMv2 ingrid-se02.cism.ucl.ac.be SRMv2-org.sam.SRM-Put test is failing, but current SAM tests are OK. LOG: CRITICAL: File was NOT copied to SRM
sBDII ce01.grid.etf.rtu.lv sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature
sBDII bdii-no-t2.ndgf.org sBDII-org.bdii.Entries test is failing, but current SAM is OK - new feature
sBDII Comments: All is in sync with SAM
SOUTH EASTERN EUROPE
Service Host Problem Description
CE ce-grid.grid.uaic.ro org.sam.WN-RepCr-ops fails with this error message: gs-65: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE se-grid.uaic.ro and registered in LFC prod-lfc-shared-central.cern.ch. CLI CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE se-grid.uaic.ro and registered in LFC prod-lfc-shared-central.cern.ch. lcg-cr --vo ops -d se-grid.uaic.ro -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-gs-65.100205220316.3612479 /home/ops028/globus-tmp.gs-65.19977.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fspXPBKhfdBUb8_5f2C347VAw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://se-grid.uaic.ro/dpm/uaic.ro/home/ops/generated/2010-02-06/filef6fad16d-164b-4c7a-a2dd-d92a82525827: Invalid argument lcg_cr: Invalid argument It says 'lcg_cr' instead of 'lcg-cr', but it looks like the problem is with srm://se-grid.uaic.ro/dpm/uaic.ro/home/ops/generated/2010-02-06/filef6fad16d-164b-4c7a-a2dd-d92a82525827 In SAM it's OK: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=ce-grid.grid.uaic.ro&vo=ops&testname=CE-sft-lcg-rm-cr&testtimestamp=1265409142 Comment from Konstantin on this issue: Looking at the history for this particular CE, this problem happens from time to time, 3-4 times a week. However, it usually works OK the next time. All our OPS jobs for this CE land on one WN, gs-65. This is an intermittent problem and could well be due to a bug in lcg_util that appears when there is a communication problem with the remote SE. lcg_cr is a library function for which lcg-cr is the CLI equivalent and in fact a wrapper. I think we can ignore it for now.
CE ce01.mosigrid.utcluj.ro Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours), but the WMSes used in Nagios are different from the ones used by SAM, so the difference may lie there. Unfortunately we cannot read the whole test output due to a size limitation, but we are in the process of increasing this size, so we'll be able to debug this issue better in the coming days.
CE ce02.grid.acad.bg Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours), but the WMSes used in Nagios are different from the ones used by SAM, so the difference may lie there. Unfortunately we cannot read the whole test output due to a size limitation, but we are in the process of increasing this size, so we'll be able to debug this issue better in the coming days.
CE cox01.grid.metu.edu.tr CRITICAL: Getting job output: Failed. Connecting to the service https://wmssamtest02.cern.ch:7443/glite_wms_wmproxy_server Error - Operation failed HTTP Error 500 Internal Server Error. The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, [no address given] and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Error code: SOAP-ENV:Server. Test passes in SAM.
SOUTH WESTERN EUROPE
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
UKI
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM

Thursday, 4th of February 2010

ASIA PACIFIC
Service Host Problem Description
CE grid01.phy.ncu.edu.tw Last result for "org.sam.CE-JobSubmit-ops" on "01-02-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios.
CE quanta.grid.sinica.edu.tw Last result for "org.sam.CE-JobSubmit-ops" on "31-01-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios.
CE ce.indiacms.res.in org.sam.CE-JobState-ops & org.sam.CE-JobSubmit-ops "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]" ERROR in SAM.
CE 3 other services failing (the same as in SAM)
SRMv2 Comments: The same in Nagios & SAM: 1 test failing.
sBDII Comments: Nagios & SAM: 1 test failing. In addition in SAM 3 endpoints in WARNING state
CANADA
Service Host Problem Description
sBDII lcg-ce.rcf.uvic.ca sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature
CE,SRMv2 Comments: All is in sync with SAM
CENTRAL EUROPE
Service Host Problem Description
CE ce2.egee.cesnet.cz WARNING: "skurut2-2.egee.cesnet.cz: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: No information for [attribute(s): ['GlueSALocalID', 'GlueSAAccessControlBaseRule', 'GlueSAFreeOnlineSize', 'GlueSAStateAvailableSpace']] in ldap://bdii.cyf-kr.edu.pl:2170."
CE dgt01.ui.savba.sk WARNING: on org.sam.WN-*-ops: "dgt04.ui.savba.sk: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. glite-rb.ct.infn.it:2170"
CE Comments: All tests passing in SAM
SRMv2 Comments: All tests passing in Nagios, 1 test failing in SAM
sBDII Comments: All tests passing
CERN
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM
FRANCE
Service Host Problem Description
CE Comments: Nagios & SAM: tests for 2 endpoints (cemauvergridce01.univ-bpclermont.fr, iut15auvergridce01.univ-bpclermont.fr) failing.
SRMv2 ccsrmt2.in2p3.fr All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated.
sBDII Comments: Nagios: all tests passing, SAM: 1 test failing
GERMANY SWITZERLAND
Service Host Problem Description
CE grid13.gsi.de No: JobState gets updated, but JobSubmit does not; JobSubmit must always be updated (if not a problem, then UNKNOWN just for information).
CE grid-ce4.desy.de,grid-ce5.desy.de same as yesterday (WN-Rep problem due to permission problem on dcache-se-desy.desy.de)
CE ce-goegrid.gwdg.de JS: SAM OK - Nagios CRITICAL; but in SAM jobs take long time to come from CE.
SRMv2 dcache-se-desy.desy.de No: same problem as yesterday (permissions). Either submit a ticket to the site or change to a new proxy with lcgadmin as primary role.
sBDII Comments: All is in sync with SAM
ITALY
Service Host Problem Description
CE atlasce1.lnf.infn.it JS: SAM OK - Nagios CRITICAL; but in SAM jobs take long time to come from CE.
CE ce-b1-1.mi.infn.it same problem as yesterday.
CE egce-cresco.portici.enea.it same problem as yesterday.
CE egce.frascati.enea.it same problem as yesterday.
CE egce1-cresco.portici.enea.it same problem as yesterday.
CE egceaix.frascati.enea.it same problem as yesterday.
CE grid012.ct.infn.it JS: SAM OK - Nagios CRITICAL: Job was aborted.; but in SAM jobs take long time to come from CE.
CE unime-ce-01.me.pi2s2.it Yes / Done: JobState gets updated, but JobSubmit does not; JobSubmit must always be updated (if not a problem, then UNKNOWN just for information).
SRMv2 Comments: All is in sync with SAM
sBDII ce.scope.unina.it,cmsrm-ce01.roma1.infn.it SAM OK - Nagios CRITICAL; "gluesiteuniqueid - value is not in the correct format"
NORTHERN EUROPE
Service Host Problem Description
CE ingrid.cism.ucl.ac.be CE-org.sam.WN-Rep is failing, but current SAM tests are OK. LOG: wn05.cism.ucl.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. CLI
CE ce01.lcg.cscs.ch CE-org.sam.WN-Rep test has too many ldap time-outs while current SAM tests are OK. LOG: wn24.lcg.cscs.ch: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170
CE cream01.iihe.ac.be CE-org.sam.WN-Rep test is failing, but current SAM is OK. LOG: node12-17.wn.iihe.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE maite.iihe.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] CLI
CE gridce.iihe.ac.be CE-org.sam.WN-Rep test is failing, but current SAM is OK. LOG: node17-1.wn.iihe.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE maite.iihe.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] CLI
SRMv2 ingrid-se02.cism.ucl.ac.be SRMv2-org.sam.SRM-Put test is failing, but current SAM tests are OK. LOG: CRITICAL: File was NOT copied to SRM
SRMv2 maite.iihe.ac.be SRMv2-org.sam.SRM-Put is failing, but current SAM is OK. LOG: CRITICAL: File was NOT copied to SRM
sBDII ce01.grid.etf.rtu.lv sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature
sBDII bdii-no-t2.ndgf.org sBDII-org.bdii.Entries test is failing, but current SAM is OK - new feature
sBDII Comments: All is in sync with SAM
SOUTH EASTERN EUROPE
Service Host Problem Description
CE ce01.mosigrid.utcluj.ro JS Aborted due to: - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. This works in SAM. Same problem as for ce.egee.di.uminho.pt.
CE ce02.grid.acad.bg JS Aborted due to: - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. This works in SAM. Same problem as for ce.egee.di.uminho.pt and ce01.mosigrid.utcluj.ro
CE grid-lab-ce.ii.edu.mk JS Aborted due to: - Got a job held event, reason: Globus error 10: data transfer to the server failed - Job got an error while in the CondorG queue. In SAM this works, but using different WMS and less TTL. Cannot read the full details output due to limitation in buffer size.
SOUTH WESTERN EUROPE
Service Host Problem Description
CE axon-g01.ieeta.pt JS: UNKNOWN: job submission OK - problem on WN [Done (Exit Code =0)] ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. This is probably a firewall issue in the site. The WN cannot contact the default message broker, then it tries to do the discovery from their top BDII and this also fails because even if they are configured in their BDII, the information is not returned to the WN. This works in SAM because the submission is not done via message bus.
CE ce.egee.di.uminho.pt Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours), but the WMSes used in Nagios are different from the ones used by SAM, so the difference may lie there. Unfortunately we cannot read the whole test output due to a size limitation, but we are in the process of increasing this size, so we'll be able to debug this issue better in the coming days.
CE grid001.fc.up.pt UNKNOWN: job submission OK - problem on WN [Done (Exit Code =0)] Trying to obtain it from IS. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting. This is the same problem as for axon-g01.ieeta.pt. In SAM it works.
CE grid001.fe.up.pt Trying to obtain it from IS. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. This is the same as for axon-g01.ieeta.pt and grid001.fc.up.pt. It works in SAM.
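
The axon-g01.ieeta.pt entry above describes a two-step lookup on the WN: try the default message broker, then fall back to brokers discovered through the top BDII. That order can be sketched as below. This is not the code of the find_all_brokers script; `pick_broker`, `broker_reachable` and the `discover` callable are hypothetical names, and the plain TCP connect is an assumed reachability test.

```python
import socket
from urllib.parse import urlparse


def broker_reachable(uri, timeout=5.0):
    """Return True if a TCP connection to the broker endpoint can be opened."""
    parsed = urlparse(uri)
    try:
        conn = socket.create_connection((parsed.hostname, parsed.port), timeout)
    except OSError:
        return False
    conn.close()
    return True


def pick_broker(default_uri, discover, is_reachable=broker_reachable):
    """Choose a message broker URI, or None if none can be used.

    `discover` is a callable returning candidate broker URIs, standing in
    for the lookup in the information system (top BDII).
    """
    # Step 1: the broker handed to the job.
    if is_reachable(default_uri):
        return default_uri
    # Step 2: brokers published in the information system.
    for uri in discover():
        if is_reachable(uri):
            return uri
    # Both steps failed: the "No brokers found" case reported above.
    return None
```

With a firewall blocking the default broker and a BDII that returns no broker entries, `pick_broker` returns None, which corresponds to the failure seen at the three Portuguese CEs.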
UKI
Service Host Problem Description
CE,SRMv2,sBDII Comments: All is in sync with SAM

Wednesday, 3rd of February 2010

ASIA PACIFIC
Service Host Problem Description
CE grid01.phy.ncu.edu.tw Last result for "org.sam.CE-JobSubmit-ops" on "01-02-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios.
CE quanta.grid.sinica.edu.tw Last result for "org.sam.CE-JobSubmit-ops" on "31-01-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios.
CE ce.indiacms.res.in org.sam.CE-JobState-ops & org.sam.CE-JobSubmit-ops "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]" ERROR in SAM.
CE Comments: Better in Nagios. 9 different endpoints failing in SAM. However Nagios has 2 endpoints failing when they are OK in SAM
sBDII Comments: 2 endpoints failing in SAM and 3 in warning state, but not in Nagios
CANADA
Service Host Problem Description
Comments: Services OK
CENTRAL EUROPE
Service Host Problem Description
CE ce1.egee.cesnet.cz Error on org.sam.CE-JobSubmit-ops failing with "CRITICAL: Job was aborted."
CE gn0.hpcc.sztaki.hu Error on org.sam.CE-JobSubmit-ops and org.sam.CE-JobState-ops failed "CRITICAL: Job was aborted." In addition all "WN" tests are in PENDING state.
CE dgt01.ui.savba.sk Warning on org.sam.WN-*-ops: "dgt04.ui.savba.sk: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. glite-rb.ct.infn.it:2170"
CE Comments: All tests passing in SAM
CERN
Service Host Problem Description
CE ce130.cern.ch WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170. This happens and is normal.
CE ce131.cern.ch WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170. This happens and is normal.
FRANCE
Service Host Problem Description
CE iut15auvergridce01.univ-bpclermont.fr In Nagios 2 endpoints failing: cemauvergridce01.univ-bpclermont.fr, iut15auvergridce01.univ-bpclermont.fr. In SAM 3 endpoints failing. However "iut03auvergridce01.univ-bpclermont.fr" in Nagios has 1 Warning: "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]"
SRMv2 ccsrmt2.in2p3.fr All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated.
sBDII Comments: 2 tests failing in SAM but not in Nagios
GERMANY SWITZERLAND
Service Host Problem Description
CE ce-goegrid.gwdg.de Same as unime-ce-01.me.pi2s2.it. In SAM - js OK, but with >2h gaps between WN and js results. There seem to be problems with job state update or output sandbox delivery. Needs debugging.
CE grid-ce4.desy.de org.sam.WN-RepCr-ops CRITICAL: File was NOT copied to SE dcache-se-desy.desy.de and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] lcg-cr --vo ops -d dcache-se-desy.desy.de -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-grid-wn0786.desy.de.100203223931.30933334 /home/opsusr010/globus-tmp.grid-wn0786.1186.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fQSI8Q1M8g7qS9E6mS11mpw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://dcache-se-desy.desy.de/pnfs/desy.de/ops/generated/2010-02-03/filec24dd465-59a6-48e2-838a-b1206f6a6d92: Permission denied. This is due to SRMv2 failure on dcache-se-desy.desy.de
CE grid-ce5.desy.de org.sam.WN-RepCr-ops CRITICAL: File was NOT copied to SE dcache-se-desy.desy.de and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] lcg-cr --vo ops -d dcache-se-desy.desy.de -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-grid-wn0786.desy.de.100203223931.30933334 /home/opsusr010/globus-tmp.grid-wn0786.1186.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fQSI8Q1M8g7qS9E6mS11mpw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://dcache-se-desy.desy.de/pnfs/desy.de/ops/generated/2010-02-03/filec24dd465-59a6-48e2-838a-b1206f6a6d92: Permission denied. This is due to SRMv2 failure on dcache-se-desy.desy.de
CE udo-ce03.grid.tu-dortmund.de
SRMv2 dcache-se-desy.desy.de org.sam.SRM-Put-ops CRITICAL: File was NOT copied to SRM. srm://dcache-se-desy.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/ops/testfile-put-1265236357-114af6f20d81.txt: Permission denied. Problem with Judit's proxy. This SRM seems to be requiring /ops/Role=lcgadmin as primary attribute.
ITALY
Service Host Problem Description
CE atlasce01.na.infn.it org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]]. CE is in Maint. In SAM the job gets aborted on WMS after 6 hours and job submission is set to ERROR. In Nagios we don't wait that long and cancel the job from Ready state after 45-55 min. As the cause of this state is not known for certain at this stage, the metric issues UNKNOWN. Improvement: parse the job's logging info to find e.g. Event: Pending ... Reason = BrokerHelper: no compatible resources. This could be an indication for CRITICAL.
CE ce-b1-1.mi.infn.it org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]] org.sam.WN-* are in PENDING. In SAM it's OK. Tests are delivered to CE/WNs in minutes and after execution (<2min) results are published to SAM DB. But after that it takes >1h for the job notification/output sandbox to get back to WMS (wms206/8/9). In Nagios with WMS wmssamtest01/2 jobs don't even reach the CE in 45 min. Due to limitations on the metrics' detailed output (Savannah ticket http://savannah.cern.ch/bugs/?62300) it is currently not possible to see the full logging info output, which could shed some light on the problem.
CE egce-cresco.portici.enea.it org.sam.CE-JobSubmit-ops WARNING: job submission OK - problem on WN [Done (Exit Code =0)] org.sam.WN-* are missing since 11-20-2009 09:51:35 Problem with connecting to gridmsg002.cern.ch:6163. Needs investigation. ... Check if provided MB is accessible [stomp://gridmsg002.cern.ch:6163/]. WARNING: Provided MB isn't accessible [stomp://gridmsg002.cern.ch:6163/]. Trying to obtain it from IS. ERROR: Failed to obtain Message Broker URI [BDII egee-bdii.cnaf.infn.it:2170]. Could not connect to BDII at egee-bdii.cnaf.infn.it:2170 at /gpor_proj/spagogrid/egee/home/crescoops004/globus-tmp.cresco1x029.11915.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2ff9lHsBed4jb7ZpZdXN1rfw/.nagios/bin/find_all_brokers line 107, line 225. Exiting. ...
CE egce.frascati.enea.it org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]] org.sam.WN-* are in PENDING. Same problem as with ce-b1-1.mi.infn.it
CE egce1-cresco.portici.enea.it org.sam.CE-JobSubmit-ops CRITICAL: Job was aborted. org.sam.WN-* are in PENDING. Last reason: - Got a job held event, reason: Globus error 10: data transfer to the server failed - Job got an error while in the CondorG queue. Now in SAM jobs are aborted every 5-6 hours with - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. So, there seem to be problems on the CE.
CE egceaix.frascati.enea.it org.sam.CE-JobSubmit-ops OK: success. org.sam.WN-* are in PENDING Consistent with SAM - js OK but NA for WN tests. Problem with publication of results.
CE t2-ce-02.to.infn.it Yes / Done org.sam.WN-Csh-ops CRITICAL. rm: cannot remove `env-csh.txt': Operation not permitted. Will fix this. The test works in /tmp, which does not seem to be the best place in this case.
CE unime-ce-01.me.pi2s2.it Yes / Done org.sam.CE-JobState-ops OK: [Running] org.sam.CE-JobSubmit-ops CRITICAL: [Waiting->Cancelled ... since 02-01-2010 17:16:03 WN tests are OK. There are problems with submission to the CE. But when a job is cancelled there seems to be a problem with updating passive org.sam.CE-JobSubmit-ops. In SAM it's OK, but there are definitely problems with job submission to the CE - big gaps (up to 9h) between OKs. Needs debugging.
SRMv2 storm02.cr.cnaf.infn.it org.sam.SRM-Put-ops UNKNOWN: File was NOT copied to SRM. UI error: 'NoneType' object has no attribute 'kill' exceptions.AttributeError In SAM there are timeouts on SRMv2-gt and the node is in ERROR. In Nagios this looks like a bug in the probe code, which handles the timeout exception badly. Needs debugging.
sBDII ce.scope.unina.it CRITICAL - errors 2, warnings 0, info 0 ERROR: gluesiteuniqueid=unina-egee,mds-vo-name=unina-egee,o=grid, A value is not in the correct format., GlueSiteLongitude must be a number. ERROR: gluesiteuniqueid=unina-egee,mds-vo-name=unina-egee,o=grid, A value is not in the correct format., GlueSiteLatitude must be a number. OK in SAM
sBDII cmsrm-ce01.roma1.infn.it Same as ce.scope.unina.it
sBDII prod-bdii-02.pd.infn.it CRITICAL - errors 1, warnings 0, info 0 ERROR: glueceuniqueid=prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cms,mds-vo-name=infn-padova,o=grid, A value is not in the correct format., GlueCEAccessControlBaseRule does not match VOMS:.+
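The improvement proposed for atlasce01.na.infn.it above (parse the job's logging info so that a job cancelled from Ready can be reported as something more specific than UNKNOWN) could be sketched as follows. This is a minimal illustration, not the probe's actual code: the function name, the rule table and the exact layout of the logging info text are assumptions. Only the "BrokerHelper: no compatible resources" reason is taken from this report.

```python
import re

# Hypothetical rule table: reason pattern -> Nagios state.
# Only the BrokerHelper pattern comes from the report above.
REASON_RULES = [
    (re.compile(r"BrokerHelper: no compatible resources"), "CRITICAL"),
]

# Matches a "Reason = ..." field anywhere in a logging-info line (assumed layout).
REASON_FIELD = re.compile(r"Reason\s*=\s*(.+)")


def classify_cancelled_job(logging_info, default="UNKNOWN"):
    """Return a Nagios state for a job cancelled after the Ready timeout.

    Falls back to `default` when no known reason is found, which keeps
    the current behaviour (UNKNOWN) for unexplained timeouts.
    """
    for line in logging_info.splitlines():
        match = REASON_FIELD.search(line)
        if match is None:
            continue
        reason = match.group(1).strip()
        for pattern, state in REASON_RULES:
            if pattern.search(reason):
                return state
    return default
```

With such a table, new reasons could be mapped to states without touching the control flow; anything unrecognised still ends up as UNKNOWN.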
NORTHERN EUROPE
Service Host Problem Description
CE Nagios WN test (CE-org.sam.WN-Rep) org.sam.WN-RepCr is failing, but old one works fine: BelGrid-UCL, ingrid.cism.ucl.ac.be CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. CLI, \nlcg_cr: Invalid argument\n
SRMv2 Nagios SRMv2 all tests are failing, but old ones are passing: BEgrid-ULB-VUB, maite.iihe.ac.be
SOUTH EASTERN EUROPE
Service Host Problem Description
CE ce02.grid.acad.bg JS: Job was aborted. File not available.Cannot read JobWrapper output, both from Condor and from Maradona. In SAM it's passing the test. We have seen this error also in SAM in the last 24h, so there seem to be problems on the CE.
CE ce1.inrne.bas.bg WN-Rep-ops: LDAP search timed out after 20 sec. In SAM it's passing the test. This happens & it's normal.
CE grid-lab-ce.ii.edu.mk JS: Job was aborted. Got a job held event, reason: Globus error 10: data transfer to the server failed. Job got an error while in the CondorG queue. In SAM it's passing the test. To be discussed with Konstantin.
CE ituce.grid.itu.edu.tr JS: Job was aborted. Got a job held event, reason: Globus error 10: data transfer to the server failed. Job got an error while in the CondorG queue. In SAM it's passing the test. To be discussed with Konstantin.
SOUTH WESTERN EUROPE
Service Host Problem Description
CE axon-g01.ieeta.pt JS: No brokers found in top BDII for the last 1.5 days. In SAM it's passing the test. To be investigated.
CE ce3.egee.cesga.es WN-Rep-ops: LDAP search timed out after 20 sec. In SAM it's passing the test. This happens & it's normal.
CE grid001.fc.up.pt JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. In SAM it's passing the test. To be discussed with Konstantin
CE grid001.fe.up.pt JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. In SAM it's passing the test. To be discussed with Konstantin
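Several of the WN-side failures in this report (egce-cresco.portici.enea.it, grid001.fc.up.pt, grid001.fe.up.pt) come from the message broker lookup: the log output shows the probe first checking the provided MB and, if it is not accessible, trying to obtain one from the information system (BDII). A minimal sketch of that fallback order, assuming a plain TCP reachability check, is given below. The find_all_brokers script itself is not shown in this report, so all function names and the `lookup_from_is` callable are hypothetical.

```python
import socket
from urllib.parse import urlparse


def broker_reachable(uri, timeout=5.0):
    """Return True if a TCP connection to the broker's host:port succeeds."""
    try:
        parsed = urlparse(uri)
        host, port = parsed.hostname, parsed.port
    except ValueError:
        # Malformed port in the URI
        return False
    if not host or not port:
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_broker(provided, lookup_from_is, reachable=broker_reachable):
    """Pick a message broker URI, or None if none can be used.

    provided        -- broker URI handed to the job (may be None)
    lookup_from_is  -- hypothetical callable returning candidate URIs
                       published in the information system (BDII)
    reachable       -- reachability check, injectable for testing
    """
    # Step 1: use the provided broker if it is accessible.
    if provided and reachable(provided):
        return provided
    # Step 2: otherwise fall back to the brokers published in the IS.
    for candidate in lookup_from_is():
        if reachable(candidate):
            return candidate
    # Corresponds to the "No brokers found" case in the logs above.
    return None
```

In this sketch an unreachable provided MB combined with an empty or unreachable BDII result yields None, which matches the situation where results cannot be published and the org.sam.WN-* checks stay without updates.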
UKI
Service Host Problem Description
Comments: All services OK
Topic revision: r34 - 2010-02-19 - unknown
 