CENTRAL EUROPE |
Service |
Host |
Problem Description |
Top-BDII |
Comments: |
ALL OK! |
CREAM-CE |
Comments: |
All nodes failing the Broker Info test. This is a known issue. bug #61322 |
FTS |
Comments: |
Does not apply. No FTS services |
WMS |
Comments: |
ALL OK! |
LFC_C |
Comments: |
ALL OK! |
LFC_L |
atlas.uibk.ac.at |
ch.cern.LFC-Read-ops - CRITICAL: Trying to statg(/grid/ops) : No such file or directory |
RGMA |
mon1.farm.particle.cz |
CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0). This is a known issue. Bug: https://savannah.cern.ch/bugs/?62482 |
VOBOX |
Comments: |
In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested. |
VOMS |
Comments: |
ALL OK! |
|
|
FRANCE |
Service |
Host |
Problem Description |
LFC_C/L |
Comments: |
Currently there are problems distinguishing between the service flavours. Fix in NCG requested. |
VOBOX |
Comments: |
In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested. |
CREAM-CE,FTS,MON,MyProxy,Top-BDII,VOMS,WMS |
Comments: |
CREAM-CE: Broker Info test failing, bug #61322. |
UKI |
Service |
Host |
Problem Description |
LFC_C/L |
Comments: |
Currently there are problems distinguishing between the service flavours. Fix in NCG requested. |
VOBOX |
Comments: |
In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested. |
MON |
mon.glite.ecdf.ed.ac.uk |
Now ERROR in SAM (proto failed), OK in Nagios. It was also failing in Nagios ~12h ago, but the details data in Nagios does not say much (SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0)). Better error output is needed in the hr.srce.RGMA-CertLifetime metric. Updated bug #62482. |
CREAM-CE,FTS,MyProxy,Top-BDII,VOMS,WMS |
Comments: |
OK. |
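The MON entry above asks for better error output from hr.srce.RGMA-CertLifetime. As a minimal sketch of one way a lifetime probe could separate "could not fetch the certificate" from "certificate is about to expire", the Python below maps its inputs to Nagios plugin exit codes. The function name, the 30/7-day thresholds, and the choice to return UNKNOWN on a connection error are assumptions for illustration, not the behaviour of the actual probe.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cert_lifetime_status(not_after, now, warn_days=30, crit_days=7, error=None):
    """Return (exit_code, message) for a certificate lifetime check.

    `error` carries the text of any connection or handshake failure.
    The thresholds and the UNKNOWN-on-error policy are hypothetical.
    """
    if error is not None:
        # Failing to fetch the certificate says nothing about its lifetime,
        # so report the underlying cause instead of a lifetime verdict.
        return UNKNOWN, f"CERT LIFETIME UNKNOWN - could not retrieve certificate: {error}"

    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, f"CERT LIFETIME CRITICAL - expired on {not_after:%Y-%m-%d}"
    if remaining.days < crit_days:
        return CRITICAL, f"CERT LIFETIME CRITICAL - expires in {remaining.days} days"
    if remaining.days < warn_days:
        return WARNING, f"CERT LIFETIME WARNING - expires in {remaining.days} days"
    return OK, f"CERT LIFETIME OK - expires in {remaining.days} days"
```

With this split, a handshake failure such as the one quoted above would surface with its own cause in the message instead of being reported as a lifetime problem.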
|
ITALY |
Service |
Host |
Problem Description |
CREAM-CE |
test7200a.cnaf.infn.it |
bug #62482 hr.srce.CREAMCE-CertLifetime: CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0). |
LFC_C |
grid-eo-engine04.esrin.esa.int |
ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist. |
LFC_C |
lfcserver.cnaf.infn.it |
ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist. |
LFC_L |
cert-39.pd.infn.it |
ch.cern.LFC-Ping-ops: UNKNOWN: Metric ch.cern.LFC-Ping does not exist. |
MON |
grid002.ca.infn.it |
bug #62482 hr.srce.RGMA-CertLifetime: CERT LIFETIME CRITICAL - SSL ERROR: IO::Socket::INET configuration failederror:00000000:lib(0):func(0):reason(0) |
VO-box |
vobox.ca.infn.it |
org.nagios.gsissh-Check: Current Status: CRITICAL (for 1d 0h 16m 24s), Status Information:CRITICAL - Socket timeout after 60 seconds |
MyProxy, Top-BDII,VOMS,WMS |
Comments: |
ALL OK! |
|
NORTHERN EUROPE |
Service |
Host |
Problem Description |
Top-BDII |
Comments: |
ALL OK! |
CREAM-CE |
Comments: |
All nodes failing the Broker Info test. This is a known issue. bug #61322 |
FTS |
Comments: |
ALL OK! |
WMS |
Comments: |
ALL OK! |
LFC_C |
Comments: |
ALL OK! |
LFC_L |
Comments: |
ALL OK! |
MON |
Comments: |
OK - The same as in SAM, 17 ok, 1 warning |
MyProxy |
Comments: |
ALL OK! |
VOBOX |
Comments: |
In Nagios the service is not VO-dependent, so more service instances are tested than in SAM. Fix in NCG requested. |
VOMS |
Comments: |
ALL OK! |
|
|
ASIA PACIFIC |
Service |
Host |
Problem Description |
|
|
CANADA |
Service |
Host |
Problem Description |
CE |
ce01.eela.if.ufrj.br |
too many timeouts for CE-org.sam.WN-Rep: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepDel-ops]: CRITICAL: Replicas for [lfn:/grid/ops/SAM/sam-lcg-rm-cr-lnx147.eela.if.ufrj.br.100210083847.2441279] were NOT deleted.\n[BDII] lcg-bdii.cern.ch:2170: Connection Timeout\n |
CE |
gantt.cefet-rj.br |
too many timeouts for CE-org.sam.WN-Rep: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170\n |
CENTRAL EUROPE |
Service |
Host |
Problem Description |
|
|
CERN |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
FRANCE |
Service |
Host |
Problem Description |
|
|
GERMANY SWITZERLAND |
Service |
Host |
Problem Description |
CE |
grid-ce.physik.uni-bonn.de, ce-enmr.chemie.uni-frankfurt.de, diana.switch.ch, ce-goegrid.gwdg.de, ce1.bfg.uni-freiburg.de |
Common error: "CRITICAL: Job aborted." "Reason(s): Globus error 10: data transfer to the server failed. Status Reason: hit job retry count (0)". The WMS could not transfer the job to the CE on the first attempt, and our JDL does not allow resubmission. The abort rate is relatively low: 2-3 times in 24h. |
CE |
grid13.gsi.de |
CRITICAL. OK in SAM. "lfc-mkdir: error while loading shared libraries: libuuid.so.1: cannot open shared object file". lfc-mkdir was not used in SAM (it was introduced in Nagios), but a WN must not fail on loading libraries. This is a site problem. |
|
ITALY |
Service |
Host |
Problem Description |
CE |
pbs-enmr.cerm.unifi.it, ce2.egee.unisalento.it |
Common error: "CRITICAL: Job aborted." "Reason(s): Globus error 10: data transfer to the server failed. Status Reason: hit job retry count (0)". The WMS could not transfer the job to the CE on the first attempt, and our JDL does not allow resubmission. The abort rate is relatively low: 2-3 times in 24h. |
sBDII |
cmsrm-ce01.roma1.infn.it, ce.scope.unina.it, prod-bdii-02.pd.infn.it |
org.gstat.SanityCheck |
|
NORTHERN EUROPE |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
SOUTH EASTERN EUROPE |
Service |
Host |
Problem Description |
CE |
Comments: |
All OK |
SRMv2 |
Comments: |
All OK |
sBDII |
Comments: |
All OK |
|
SOUTH WESTERN EUROPE |
Service |
Host |
Problem Description |
CE |
axon-g01.ieeta.pt |
org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting. |
CE |
grid001.fc.up.pt |
org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting. |
CE |
grid001.fe.up.pt |
org.sam.CE-JobSubmit-ops - ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting. |
|
UKI |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
ASIA PACIFIC |
Service |
Host |
Problem Description |
CE |
Comments: |
Better in Nagios. 2 services fail in Nagios and in SAM. Beyond that, Nagios has 1 in ERROR and 1 in WARNING. |
CE |
ce.hut.vngrid.vinaren.vn |
ERROR: "org.sam.CE-JobSubmit-ops" - UNKNOWN: [Ready->Cancelled [timeout/dropped]] |
CE |
grid01.phy.ncu.edu.tw |
WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]]. |
CE |
Comments: |
5 other services failing (the same as in SAM) |
CE |
Comments: |
SAM - 11 endpoints failing in total |
SRMv2 |
Comments: |
Nagios: All tests passing. |
sBDII |
Comments: |
Nagios & SAM: 2 tests failing. In addition in SAM 3 endpoints in WARNING state |
|
|
CANADA |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
CENTRAL EUROPE |
Service |
Host |
Problem Description |
CE |
Comments: |
1 WARNING, all others in sync with SAM |
CE |
dwarf.wcss.wroc.pl |
org.sam.CE-JobSubmit-ops "WARNING: [Running->Cancelled [timeout/dropped]] " |
SRMv2 |
Comments: |
All is in sync with SAM |
sBDII |
Comments: |
All tests passing |
|
|
CERN |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
FRANCE |
Service |
Host |
Problem Description |
CE |
Comments: |
Nagios & SAM: tests for 2 endpoints failing. |
CE |
clrlcgce03.in2p3.fr |
org.sam.CE-JobSubmit-ops "WARNING: [Scheduled->Cancelled [timeout/dropped]]" |
SRMv2 |
ccsrmt2.in2p3.fr |
All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated. |
sBDII |
Comments: |
Nagios & SAM: all tests passing |
|
|
GERMANY SWITZERLAND |
Service |
Host |
Problem Description |
|
|
ITALY |
Service |
Host |
Problem Description |
|
|
NORTHERN EUROPE |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
SOUTH EASTERN EUROPE |
Service |
Host |
Problem Description |
CE |
Comments: |
All OK |
SRMv2 |
torik1.ulakbim.gov.tr |
CRITICAL: File was NOT copied to SRM. srm://torik1.ulakbim.gov.tr:8446/srm/managerv2?SFN=/dpm/ulakbim.gov.tr/home/ops/testfile-put-1265663366-361a2ebe176d.txt: Invald argument SURL: srm://torik1.ulakbim.gov.tr:8446/srm/managerv2?SFN=/dpm/ulakbim.gov.tr/home/ops/testfile-put-1265663366-361a2ebe176d.txt |
sBDII |
Comments: |
All OK |
|
|
SOUTH WESTERN EUROPE |
Service |
Host |
Problem Description |
CE |
axon-g01.ieeta.pt |
JS: No brokers found in top BDII. Issue being investigated by Konstantin. |
CE |
ce3.egee.cesga.es |
WN-Rep-ops: LDAP search timed out after 20 sec. The test passes in SAM, but only because the job has more time to finish there. |
CE |
grid001.fc.up.pt |
JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. Issue being investigated by Konstantin |
SRMv2 |
Comments: |
All OK |
sBDII |
Comments: |
All OK |
|
UKI |
Service |
Host |
Problem Description |
CE |
gridgate.cp.dias.ie |
Results are not coming from the WN because of Exit Code !=0. UNKNOWN: job submission OK - problem on WN [Done (Exit Code !=0)] |
SRMv2,sBDII |
Comments: |
All is in sync with SAM |
ASIA PACIFIC |
Service |
Host |
Problem Description |
CE |
Comments: |
Better in Nagios. However, Nagios has 3 endpoints in WARNING when they are OK in SAM. |
CE |
grid01.phy.ncu.edu.tw |
WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]]. |
CE |
quanta.grid.sinica.edu.tw |
WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]]. |
CE |
w-ce03.grid.sinica.edu.tw |
WARNING: "org.sam.CE-JobSubmit-ops" - WARNING: [Running->Cancelled [timeout/dropped]]. |
CE |
Comments: |
5 other services failing (the same as in SAM) |
CE |
Comments: |
SAM - 11 endpoints failing in total |
SRMv2 |
Comments: |
The same in Nagios & SAM: 1 test failing. |
sBDII |
Comments: |
Nagios & SAM: 2 tests failing. In addition in SAM 3 endpoints in WARNING state |
|
|
CANADA |
Service |
Host |
Problem Description |
sBDII |
lcg-ce.rcf.uvic.ca |
sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
CENTRAL EUROPE |
Service |
Host |
Problem Description |
CE |
dgt01.ui.savba.sk |
ERROR: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE dgt02.ui.savba.sk and registered in LFC prod-lfc-shared-central.cern.ch. |
CE |
gn0.hpcc.sztaki.hu |
ERROR: org.sam.CE-JobSubmit-ops and org.sam.CE-JobState-ops failed "CRITICAL: Job was aborted." In addition all "WN" tests are in PENDING state. |
CE |
ce.grid.bntu.by |
UNKNOWN: hr.srce.GRAM-CertLifetime "CERTLIFETIME-PROBE UNKNOWN - Timeout occured." |
CE |
Comments: |
In addition 1 test failing in SAM & Nagios |
SRMv2 |
Comments: |
All tests passing in Nagios, 1 test failing in SAM |
sBDII |
Comments: |
All tests passing |
|
|
CERN |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
FRANCE |
Service |
Host |
Problem Description |
CE |
Comments: |
Nagios & SAM: tests for 4 endpoints failing. |
SRMv2 |
ccsrmt2.in2p3.fr |
All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated. |
sBDII |
Comments: |
Nagios & SAM: all tests passing |
|
|
GERMANY SWITZERLAND |
Service |
Host |
Problem Description |
CE |
grid-ce4.desy.de ,grid-ce5.desy.de |
Solved (07-03-2010) with manually built NCG 0.46.1-1... WN-Rep CRITICAL (OK in SAM). Due to permission problems on SRM dcache-se-desy.desy.de |
CE |
grid13.gsi.de |
WN-RepCr CRITICAL (OK in SAM). "lfc-mkdir: error while loading shared libraries: libuuid.so.1" This is a problem of the CE. The Nagios probe adds an lfc-mkdir call that does not exist in SAM, but in any case it should not fail in this way. |
SRMv2 |
dcache-se-desy.desy.de |
Solved (07-03-2010) with manually built NCG 0.46.1-1... Permission denied. Problem with Judit's proxy. This SRM seems to be requiring /ops/Role=lcgadmin as primary attribute. Testing new NCG feature on samnag017. |
sBDII |
Comments: |
In sync with SAM. |
|
ITALY |
Service |
Host |
Problem Description |
CE |
ce-b1-1.mi.infn.it |
Today JS UNKNOWN: [Ready->Cancelled [timeout/dropped]] ... and still WN-* in PENDING - thus, problems with MB connection/discovery on WNs |
CE |
egce.frascati.enea.it, egce1-cresco.portici.enea.it |
Canceled from Running/Scheduled after timeout. As WN-* are in PENDING there are apparently problems with MB connection/discovery (same as egce-cresco.portici.enea.it) |
CE |
egce-cresco.portici.enea.it |
WARNING: job submission OK - problem on WN [Done (Exit Code =0)]. Cannot connect to the given MB, then fails to connect to the top BDII to discover an available MB. |
CE |
egceaix.frascati.enea.it |
org.sam.CE-JobSubmit-ops OK: success. org.sam.WN-* are in PENDING. Consistent with SAM: JS OK but NA for WN tests. This is an AIX CE; the problems are "perl: Setting locale failed." and "Badly formed number" in head (it needs an explicit '-n' (in SVN & testing)). Why do the jobs always return 0 even though they fail? |
CE |
grid001.ts.infn.it |
[hr.srce.GRAM-CertLifetime CRITICAL: CERT LIFETIME CRITICAL - SSL ERROR:] But OK from CLI and in SAM. Ticket #62482 Proposed fix manually applied on samnag006 in hr.srce/CertLifetime-probe |
SRMv2 |
storm02.cr.cnaf.infn.it |
Need better error output for hr.srce.SRM2-CertLifetime. Followed up in Ticket #62482 |
SRMv2 |
t2cmcondor.mi.infn.it |
Intermittent problems. Sometimes Put or Get fails with a Globus error. OK in SAM with the CLI; in Nagios we use the Python API. Needs debugging. |
sBDII |
ce.scope.unina.it,cmsrm-ce01.roma1.infn.it,prod-bdii-02.pd.infn.it |
org.gstat.SanityCheck CRITICAL (OK in SAM) |
|
NORTHERN EUROPE |
Service |
Host |
Problem Description |
CE |
ingrid.cism.ucl.ac.be |
CE-org.sam.WN-Rep is failing, but current SAM tests are OK. LOG: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch.\nlcg-cr --vo ops -d ingrid-se02.cism.ucl.ac.be -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-wn05.cism.ucl.ac.be.100205221222.805333416 /scratch/condor/execute/dir_896/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fggQ9NA-ZIrYeWB7m-NVO_5fg/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt\n[SE][Mkdir][SRM_FAILURE] httpg://ingrid-se02.cism.ucl.ac.be:8444/srm/managerv2: srm://ingrid-se02.cism.ucl.ac.be/storage/data/ops/generated/2010-02-05/file3902b376-c05e-4d0a-b4cc-ae5e4689b27b: \nlcg_cr: Invalid argument\n |
SRMv2 |
ingrid-se02.cism.ucl.ac.be |
SRMv2-org.sam.SRM-Put test is failing, but current SAM tests are OK. LOG: CRITICAL: File was NOT copied to SRM |
sBDII |
ce01.grid.etf.rtu.lv |
sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature |
sBDII |
bdii-no-t2.ndgf.org |
sBDII-org.bdii.Entries test is failing, but current SAM is OK - new feature |
sBDII |
Comments: |
All is in sync with SAM |
|
SOUTH EASTERN EUROPE |
Service |
Host |
Problem Description |
CE |
ce-grid.grid.uaic.ro |
org.sam.WN-RepCr-ops fails with this error message: gs-65: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE se-grid.uaic.ro and registered in LFC prod-lfc-shared-central.cern.ch. CLI CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE se-grid.uaic.ro and registered in LFC prod-lfc-shared-central.cern.ch. lcg-cr --vo ops -d se-grid.uaic.ro -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-gs-65.100205220316.3612479 /home/ops028/globus-tmp.gs-65.19977.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fspXPBKhfdBUb8_5f2C347VAw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://se-grid.uaic.ro/dpm/uaic.ro/home/ops/generated/2010-02-06/filef6fad16d-164b-4c7a-a2dd-d92a82525827: Invalid argument lcg_cr: Invalid argument. It says 'lcg_cr' instead of 'lcg-cr', but it looks like the problem is with srm://se-grid.uaic.ro/dpm/uaic.ro/home/ops/generated/2010-02-06/filef6fad16d-164b-4c7a-a2dd-d92a82525827 . In SAM it's OK: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=ce-grid.grid.uaic.ro&vo=ops&testname=CE-sft-lcg-rm-cr&testtimestamp=1265409142 Comment from Konstantin on this issue: Looking at the history for this particular CE, this problem happens from time to time, 3-4 times a week. However, it usually works OK the next time. All our OPS jobs for this CE land on one WN, gs-65. This is an intermittent problem; it could well be due to a bug in lcg_util that shows up when there is a communication problem with the remote SE. lcg_cr is the library function for which lcg-cr is the CLI equivalent, in fact a wrapper. I think we can ignore it for now. |
CE |
ce01.mosigrid.utcluj.ro |
Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours) but the WMS used in Nagios are different to the ones used by SAM, so the difference can be there. Unfortunately we cannot read the whole test output due to a limitation in the size, but we are in the process of increasing this size so we'll be able to debug better this issue in the coming days. |
CE |
ce02.grid.acad.bg |
Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours) but the WMS used in Nagios are different to the ones used by SAM, so the difference can be there. Unfortunately we cannot read the whole test output due to a limitation in the size, but we are in the process of increasing this size so we'll be able to debug better this issue in the coming days. |
CE |
cox01.grid.metu.edu.tr |
CRITICAL: Getting job output: Failed. Connecting to the service https://wmssamtest02.cern.ch:7443/glite_wms_wmproxy_server Error - Operation failed HTTP Error 500 Internal Server ErrorInternal Server ErrorThe server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, [no address given] and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Error code: SOAP-ENV:Server. Test passes in SAM. |
|
SOUTH WESTERN EUROPE |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
UKI |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
ASIA PACIFIC |
Service |
Host |
Problem Description |
CE |
grid01.phy.ncu.edu.tw |
Last result for "org.sam.CE-JobSubmit-ops" on "01-02-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios. |
CE |
quanta.grid.sinica.edu.tw |
Last result for "org.sam.CE-JobSubmit-ops" on "31-01-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios. |
CE |
ce.indiacms.res.in |
org.sam.CE-JobState-ops & org.sam.CE-JobSubmit-ops "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]" ERROR in SAM. |
CE |
Comments: |
3 other services failing (the same as in SAM) |
SRMv2 |
Comments: |
The same in Nagios & SAM: 1 test failing. |
sBDII |
Comments: |
Nagios & SAM: 1 test failing. In addition in SAM 3 endpoints in WARNING state |
|
CANADA |
Service |
Host |
Problem Description |
sBDII |
lcg-ce.rcf.uvic.ca |
sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature |
CE,SRMv2 |
Comments: |
All is in sync with SAM |
|
CENTRAL EUROPE |
Service |
Host |
Problem Description |
CE |
ce2.egee.cesnet.cz |
WARNING: "skurut2-2.egee.cesnet.cz: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: No information for [attribute(s): ['GlueSALocalID', 'GlueSAAccessControlBaseRule', 'GlueSAFreeOnlineSize', 'GlueSAStateAvailableSpace']] in ldap://bdii.cyf-kr.edu.pl:2170." |
CE |
dgt01.ui.savba.sk |
WARNING: on org.sam.WN-*-ops: "dgt04.ui.savba.sk: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. glite-rb.ct.infn.it:2170" |
CE |
Comments: |
All tests passing in SAM |
SRMv2 |
Comments: |
All tests passing in Nagios, 1 test failing in SAM |
sBDII |
Comments: |
All tests passing |
|
CERN |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
FRANCE |
Service |
Host |
Problem Description |
CE |
Comments: |
Nagios & SAM: tests for 2 endpoints (cemauvergridce01.univ-bpclermont.fr, iut15auvergridce01.univ-bpclermont.fr) failing. |
SRMv2 |
ccsrmt2.in2p3.fr |
All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated. |
sBDII |
Comments: |
Nagios: all tests passing, SAM: 1 test failing |
|
GERMANY SWITZERLAND |
Service |
Host |
Problem Description |
CE |
grid13.gsi.de |
JobState gets updated, but JobSubmit does not; JobSubmit must always be updated (if not a problem, then UNKNOWN just for information). |
|
grid-ce4.desy.de,grid-ce5.desy.de |
same as yesterday (WN-Rep problem due to permission problem on dcache-se-desy.desy.de) |
CE |
ce-goegrid.gwdg.de |
JS: SAM OK - Nagios CRITICAL; but in SAM jobs take a long time to come back from the CE. |
SRMv2 |
dcache-se-desy.desy.de |
Same problem as yesterday (permissions). Either submit a ticket to the site or change to a new proxy with lcgadmin as primary role. |
sBDII |
Comments: |
All is in sync with SAM |
|
ITALY |
Service |
Host |
Problem Description |
CE |
atlasce1.lnf.infn.it |
JS: SAM OK - Nagios CRITICAL; but in SAM jobs take a long time to come back from the CE. |
CE |
ce-b1-1.mi.infn.it |
same problem as yesterday. |
CE |
egce-cresco.portici.enea.it |
same problem as yesterday. |
CE |
egce.frascati.enea.it |
same problem as yesterday. |
CE |
egce1-cresco.portici.enea.it |
same problem as yesterday. |
CE |
egceaix.frascati.enea.it |
same problem as yesterday. |
CE |
grid012.ct.infn.it |
JS: SAM OK - Nagios CRITICAL: Job was aborted.; but in SAM jobs take a long time to come back from the CE. |
CE |
unime-ce-01.me.pi2s2.it |
JobState gets updated, but JobSubmit does not; JobSubmit must always be updated (if not a problem, then UNKNOWN just for information). |
SRMv2 |
Comments: |
All is in sync with SAM |
sBDII |
ce.scope.unina.it,cmsrm-ce01.roma1.infn.it |
SAM OK - Nagios CRITICAL; "gluesiteuniqueid - value is not in the correct format" |
|
NORTHERN EUROPE |
Service |
Host |
Problem Description |
CE |
ingrid.cism.ucl.ac.be |
CE-org.sam.WN-Rep is failing, but current SAM tests are OK. LOG: wn05.cism.ucl.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. CLI |
CE |
ce01.lcg.cscs.ch |
CE-org.sam.WN-Rep test has too many ldap time-outs while current SAM tests are OK. LOG: wn24.lcg.cscs.ch: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170 |
CE |
cream01.iihe.ac.be |
CE-org.sam.WN-Rep test is failing, but current SAM is OK. LOG: node12-17.wn.iihe.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE maite.iihe.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] CLI |
CE |
gridce.iihe.ac.be |
CE-org.sam.WN-Rep test is failing, but current SAM is OK. LOG: node17-1.wn.iihe.ac.be: CRITICAL: CRITICAL METRIC FAILED [org.sam.WN-RepCr-ops]: CRITICAL: File was NOT copied to SE maite.iihe.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] CLI |
SRMv2 |
ingrid-se02.cism.ucl.ac.be |
SRMv2-org.sam.SRM-Put test is failing, but current SAM tests are OK. LOG: CRITICAL: File was NOT copied to SRM |
SRMv2 |
maite.iihe.ac.be |
SRMv2-org.sam.SRM-Put is failing, but current SAM is OK. LOG: CRITICAL: File was NOT copied to SRM |
sBDII |
ce01.grid.etf.rtu.lv |
sBDII-org.gstat.SanityCheck is failing, but current SAM is OK - new feature |
sBDII |
bdii-no-t2.ndgf.org |
sBDII-org.bdii.Entries test is failing, but current SAM is OK - new feature |
sBDII |
Comments: |
All is in sync with SAM |
SOUTH EASTERN EUROPE |
Service |
Host |
Problem Description |
CE |
ce01.mosigrid.utcluj.ro |
JS Aborted due to: - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. This works in SAM. Same problem as for ce.egee.di.uminho.pt. |
CE |
ce02.grid.acad.bg |
JS Aborted due to: - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. This works in SAM. Same problem as for ce.egee.di.uminho.pt and ce01.mosigrid.utcluj.ro |
CE |
grid-lab-ce.ii.edu.mk |
JS Aborted due to: - Got a job held event, reason: Globus error 10: data transfer to the server failed - Job got an error while in the CondorG queue. In SAM this works, but using different WMS and less TTL. Cannot read the full details output due to limitation in buffer size. |
|
SOUTH WESTERN EUROPE |
Service |
Host |
Problem Description |
CE |
axon-g01.ieeta.pt |
JS: UNKNOWN: job submission OK - problem on WN [Done (Exit Code =0)] ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. This is probably a firewall issue at the site. The WN cannot contact the default message broker; it then tries to discover brokers from the site's top BDII, and this also fails: even though the brokers are configured in that BDII, the information is not returned to the WN. This works in SAM because the submission is not done via the message bus. |
CE |
ce.egee.di.uminho.pt |
Current Status: Aborted. Logged Reason(s): - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. Status Reason: hit job shallow retry count (1). This works in SAM (every two hours) but the WMS used in Nagios are different to the ones used by SAM, so the difference can be there. Unfortunately we cannot read the whole test output due to a limitation in the size, but we are in the process of increasing this size so we'll be able to debug better this issue in the coming days. |
CE |
grid001.fc.up.pt |
UNKNOWN: job submission OK - problem on WN [Done (Exit Code =0)] Trying to obtain it from IS. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. Exiting. This is the same problem as for axon-g01.ieeta.pt. In SAM it works. |
CE |
grid001.fe.up.pt |
Trying to obtain it from IS. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. This is the same as for axon-g01.ieeta.pt and grid001.fc.up.pt. It works in SAM. |
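The broker selection described for axon-g01.ieeta.pt (try the default message broker, then fall back to discovery from the top BDII) can be sketched as below. This is an illustrative Python outline only: `choose_broker`, `discover`, and `is_reachable` are hypothetical names, and the real WN-side script may differ. The second broker URI in the usage example is a placeholder.

```python
def choose_broker(default_uri, discover, is_reachable):
    """Pick a message broker URI, returning (uri_or_None, explanation).

    `is_reachable(uri)` and `discover()` are injected callables (hypothetical),
    so the decision logic can be exercised without any network access.
    """
    if is_reachable(default_uri):
        return default_uri, "default broker reachable"

    # The default broker cannot be contacted: fall back to discovery.
    try:
        candidates = discover()
    except Exception as exc:
        return None, f"default broker unreachable; discovery failed: {exc}"

    if not candidates:
        # Corresponds to the 'No brokers found' case reported above.
        return None, "default broker unreachable; no brokers found in top BDII"

    for uri in candidates:
        if is_reachable(uri):
            return uri, "using broker discovered from top BDII"
    return None, "default broker unreachable; discovered brokers unreachable"


# Usage: the default broker is blocked and discovery returns nothing.
uri, why = choose_broker(
    "stomp://gridmsg002.cern.ch:6163/",
    discover=lambda: [],
    is_reachable=lambda u: False,
)
print(uri, "-", why)
```

Returning the explanation alongside the result keeps the two failure modes (broker blocked versus nothing published in the BDII) distinguishable in the probe output.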
|
UKI |
Service |
Host |
Problem Description |
CE,SRMv2,sBDII |
Comments: |
All is in sync with SAM |
|
ASIA PACIFIC |
Service |
Host |
Problem Description |
CE |
grid01.phy.ncu.edu.tw |
Last result for "org.sam.CE-JobSubmit-ops" on "01-02-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios. |
CE |
quanta.grid.sinica.edu.tw |
Last result for "org.sam.CE-JobSubmit-ops" on "31-01-2010" with status "CRITICAL: Job was aborted." To be investigated, as they fail only in Nagios. |
CE |
ce.indiacms.res.in |
org.sam.CE-JobState-ops & org.sam.CE-JobSubmit-ops "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]" ERROR in SAM. |
CE |
Comments: |
Better in Nagios. 9 different endpoints failing in SAM. However Nagios has 2 endpoints failing when they are OK in SAM |
sBDII |
Comments: |
2 endpoints failing in SAM and 3 in warning state, but not in Nagios |
|
CANADA |
Service |
Host |
Problem Description |
|
Comments: |
Services OK |
|
CENTRAL EUROPE |
Service |
Host |
Problem Description |
CE |
ce1.egee.cesnet.cz |
Error on org.sam.CE-JobSubmit-ops failing with "CRITICAL: Job was aborted." |
CE |
gn0.hpcc.sztaki.hu |
Error on org.sam.CE-JobSubmit-ops and org.sam.CE-JobState-ops failed "CRITICAL: Job was aborted." In addition all "WN" tests are in PENDING state. |
CE |
dgt01.ui.savba.sk |
Warning on org.sam.WN-*-ops: "dgt04.ui.savba.sk: WARNING: CRITICAL METRIC FAILED [org.sam.WN-RepFree-ops]: WARNING: LDAP search timed out after 20 sec. glite-rb.ct.infn.it:2170" |
CE |
Comments: |
All tests passing in SAM |
|
CERN |
Service |
Host |
Problem Description |
CE |
ce130.cern.ch |
WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170. This happens and is normal. |
CE |
ce131.cern.ch |
WARNING: LDAP search timed out after 20 sec. lcg-bdii.cern.ch:2170. This happens and is normal. |
|
FRANCE |
Service |
Host |
Problem Description |
CE |
iut15auvergridce01.univ-bpclermont.fr |
In Nagios 2 endpoints failing: cemauvergridce01.univ-bpclermont.fr, iut15auvergridce01.univ-bpclermont.fr. In SAM 3 endpoints failing. However "iut03auvergridce01.univ-bpclermont.fr" in Nagios has 1 Warning: "WARNING: job submission OK - problem on WN [Done (Exit Code =0)]" |
SRMv2 |
ccsrmt2.in2p3.fr |
All endpoints in OK status. 1 issue: ccsrmt2.in2p3.fr has only 1 service which is being tested. All other endpoints have 9 services. To be investigated. |
sBDII |
Comments: |
2 tests failing in SAM but not in Nagios |
|
GERMANY SWITZERLAND |
Service |
Host |
Problem Description |
CE |
ce-goegrid.gwdg.de |
Same as unime-ce-01.me.pi2s2.it. In SAM - js OK, but with >2h gaps between WN and js results. There seem to be problems with job state update/output sandbox delivery. Needs debugging. |
CE |
grid-ce4.desy.de |
org.sam.WN-RepCr-ops CRITICAL: File was NOT copied to SE dcache-se-desy.desy.de and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] lcg-cr --vo ops -d dcache-se-desy.desy.de -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-grid-wn0786.desy.de.100203223931.30933334 /home/opsusr010/globus-tmp.grid-wn0786.1186.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fQSI8Q1M8g7qS9E6mS11mpw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://dcache-se-desy.desy.de/pnfs/desy.de/ops/generated/2010-02-03/filec24dd465-59a6-48e2-838a-b1206f6a6d92: Permission denied. This is due to SRMv2 failure on dcache-se-desy.desy.de |
CE |
grid-ce5.desy.de |
org.sam.WN-RepCr-ops CRITICAL: File was NOT copied to SE dcache-se-desy.desy.de and registered in LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]] lcg-cr --vo ops -d dcache-se-desy.desy.de -l lfn:/grid/ops/SAM/sam-lcg-rm-cr-grid-wn0786.desy.de.100203223931.30933334 /home/opsusr010/globus-tmp.grid-wn0786.1186.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fQSI8Q1M8g7qS9E6mS11mpw/.gridprobes/ops/org.sam/WN/localhost.localdomain/testFile.txt srm://dcache-se-desy.desy.de/pnfs/desy.de/ops/generated/2010-02-03/filec24dd465-59a6-48e2-838a-b1206f6a6d92: Permission denied. This is due to SRMv2 failure on dcache-se-desy.desy.de |
CE |
udo-ce03.grid.tu-dortmund.de |
SRMv2 |
dcache-se-desy.desy.de |
org.sam.SRM-Put-ops CRITICAL: File was NOT copied to SRM. srm://dcache-se-desy.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/ops/testfile-put-1265236357-114af6f20d81.txt: Permission denied. Problem with Judit's proxy. This SRM seems to be requiring /ops/Role=lcgadmin as primary attribute. |
|
ITALY |
Service |
Host |
Problem Description |
CE |
atlasce01.na.infn.it |
org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]]. CE is in Maint. In SAM the job gets aborted on the WMS after 6 hours and job submission is set to ERROR. In Nagios we don't wait that long and cancel the job from the Ready state after 45-55 min. Since at this stage the cause of the state is not known with certainty, the metric issues UNKNOWN. Improvement: parse the job's logging info to find e.g. Event: Pending ... Reason = BrokerHelper: no compatible resources. This could be an indication for CRITICAL. |
CE |
ce-b1-1.mi.infn.it |
org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]] org.sam.WN-* are in PENDING. In SAM it's OK. Tests are delivered to CE/WNs in minutes and after execution (<2min) results are published to SAM DB. But after that it takes >1h for the job notification/output sandbox to get back to the WMS (wms206/8/9). In Nagios with WMS wmssamtest01/2 jobs don't even reach the CE in 45 min. Due to limitations on the metrics' detailed output (Savannah ticket http://savannah.cern.ch/bugs/?62300 ) it's currently not possible to see the full logging info output, which could shed some light on the problem. |
CE |
egce-cresco.portici.enea.it |
org.sam.CE-JobSubmit-ops WARNING: job submission OK - problem on WN [Done (Exit Code =0)] org.sam.WN-* are missing since 11-20-2009 09:51:35. Problem with connecting to gridmsg002.cern.ch:6163. Needs investigation. ... Check if provided MB is accessible [stomp://gridmsg002.cern.ch:6163/]. WARNING: Provided MB isn't accessible [stomp://gridmsg002.cern.ch:6163/]. Trying to obtain it from IS. ERROR: Failed to obtain Message Broker URI [BDII egee-bdii.cnaf.infn.it:2170]. Could not connect to BDII at egee-bdii.cnaf.infn.it:2170 at /gpor_proj/spagogrid/egee/home/crescoops004/globus-tmp.cresco1x029.11915.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2ff9lHsBed4jb7ZpZdXN1rfw/.nagios/bin/find_all_brokers line 107, line 225. Exiting. ... |
CE |
egce.frascati.enea.it |
org.sam.CE-JobSubmit-ops UNKNOWN: [Ready->Cancelled [timeout/dropped]] org.sam.WN-* are in PENDING. Same problem as with ce-b1-1.mi.infn.it |
CE |
egce1-cresco.portici.enea.it |
org.sam.CE-JobSubmit-ops CRITICAL: Job was aborted. org.sam.WN-* are in PENDING. Last reason: - Got a job held event, reason: Globus error 10: data transfer to the server failed - Job got an error while in the CondorG queue. In SAM jobs are now aborted every 5-6 hours with - File not available.Cannot read JobWrapper output, both from Condor and from Maradona. So there seem to be problems on the CE. |
CE |
egceaix.frascati.enea.it |
org.sam.CE-JobSubmit-ops OK: success. org.sam.WN-* are in PENDING. Consistent with SAM - job submission OK but NA for WN tests. Problem with publication of results. |
CE |
t2-ce-02.to.infn.it |
org.sam.WN-Csh-ops CRITICAL. rm: cannot remove `env-csh.txt': Operation not permitted. Will fix this. The test works in /tmp, which does not seem to be the best place in this case. |
CE |
unime-ce-01.me.pi2s2.it |
org.sam.CE-JobState-ops OK: [Running] org.sam.CE-JobSubmit-ops CRITICAL: [Waiting->Cancelled ... since 02-01-2010 17:16:03 WN tests are OK. There are problems with submission to the CE. But when the job is cancelled there seems to be a problem with updating the passive org.sam.CE-JobSubmit-ops. In SAM it's OK, but there are definitely problems with job submission to the CE - big gaps (up to 9h) between OKs. Needs debugging. |
SRMv2 |
storm02.cr.cnaf.infn.it |
org.sam.SRM-Put-ops UNKNOWN: File was NOT copied to SRM. UI error: 'NoneType' object has no attribute 'kill' exceptions.AttributeError. In SAM there are timeouts on SRMv2-gt and the node is in ERROR. In Nagios this looks like a bug in the code, which mishandles the timeout exception. Needs debugging. |
sBDII |
ce.scope.unina.it |
CRITICAL - errors 2, warnings 0, info 0 ERROR: gluesiteuniqueid=unina-egee,mds-vo-name=unina-egee,o=grid, A value is not in the correct format., GlueSiteLongitude must be a number. ERROR: gluesiteuniqueid=unina-egee,mds-vo-name=unina-egee,o=grid, A value is not in the correct format., GlueSiteLatitude must be a number. OK in SAM |
sBDII |
cmsrm-ce01.roma1.infn.it |
Same as ce.scope.unina.it |
sBDII |
prod-bdii-02.pd.infn.it |
CRITICAL - errors 1, warnings 0, info 0 ERROR: glueceuniqueid=prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cms,mds-vo-name=infn-padova,o=grid, A value is not in the correct format., GlueCEAccessControlBaseRule does not match VOMS:.+ |
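The sBDII failures above come down to simple format checks on Glue attributes. The following is a minimal sketch of such checks, assuming an LDAP entry is available as a dictionary of attribute names to lists of string values. The function name `check_entry` and both patterns are hypothetical and only mirror the wording of the error messages quoted above; the actual probe may apply different or additional rules.

```python
import re

# Assumption: a plain decimal number is what "must be a number" means here.
NUMBER = re.compile(r"^[+-]?\d+(\.\d+)?$")
# Pattern copied from the error message "does not match VOMS:.+".
VOMS_RULE = re.compile(r"^VOMS:.+")


def check_entry(attrs):
    """Return a list of error strings for one entry.

    attrs maps attribute names to lists of string values
    (a hypothetical representation of an LDAP entry).
    """
    errors = []
    for name in ("GlueSiteLatitude", "GlueSiteLongitude"):
        for value in attrs.get(name, []):
            if not NUMBER.match(value.strip()):
                errors.append("%s must be a number." % name)
    for value in attrs.get("GlueCEAccessControlBaseRule", []):
        # Only VOMS-style rules are checked; other rule forms are
        # deliberately left unchecked in this sketch.
        if value.startswith("VOMS") and not VOMS_RULE.match(value):
            errors.append("GlueCEAccessControlBaseRule does not match VOMS:.+")
    return errors


if __name__ == "__main__":
    # Illustrative values, not taken from the sites listed above.
    sample = {"GlueSiteLatitude": ["N/A"], "GlueSiteLongitude": ["14.18"]}
    for error in check_entry(sample):
        print("ERROR: " + error)
```

Running a check like this against a site's own published values before they reach the top BDII would show the same class of errors that Nagios reports and SAM currently does not.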
|
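The improvement proposed for atlasce01.na.infn.it (parsing the job's logging info for a reason such as "BrokerHelper: no compatible resources" and mapping it to CRITICAL instead of UNKNOWN) could be sketched roughly as follows. This is only an illustration: the function name `classify_logging_info`, the list of reason strings and the assumed "Reason = ..." line layout are hypothetical and are not taken from the probe's code.

```python
import re

# Hypothetical list of reasons that would justify CRITICAL instead of UNKNOWN.
CRITICAL_REASONS = (
    "BrokerHelper: no compatible resources",
)

# Assumption: each reason appears on one line as "Reason = <text>".
REASON_RE = re.compile(r"Reason\s*=\s*(.*)")


def classify_logging_info(text, default="UNKNOWN"):
    """Return "CRITICAL" if a known reason is found, else the default state."""
    for line in text.splitlines():
        match = REASON_RE.search(line)
        if match and any(r in match.group(1) for r in CRITICAL_REASONS):
            return "CRITICAL"
    return default


if __name__ == "__main__":
    # Made-up logging info, modelled on the wording quoted in the table.
    sample = "Event: Pending\n- Reason = BrokerHelper: no compatible resources\n"
    print(classify_logging_info(sample))
```

Keeping UNKNOWN as the default preserves the current behaviour for every reason that is not explicitly listed.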
NORTHERN EUROPE |
Service |
Host |
Problem Description |
CE |
Nagios WN test (CE-org.sam.WN-Rep) org.sam.WN-RepCr is failing, but the old one works fine: BelGrid-UCL, ingrid.cism.ucl.ac.be CRITICAL: File was NOT copied to SE ingrid-se02.cism.ucl.ac.be and registered in LFC prod-lfc-shared-central.cern.ch. CLI, \nlcg_cr: Invalid argument\n |
SRMv2 |
All Nagios SRMv2 tests are failing, but the old ones are passing: BEgrid-ULB-VUB, maite.iihe.ac.be |
|
SOUTH EASTERN EUROPE |
Service |
Host |
Problem Description |
CE |
ce02.grid.acad.bg |
JS: Job was aborted. File not available.Cannot read JobWrapper output, both from Condor and from Maradona. In SAM it's passing the test. We have seen this error also in SAM in the last 24h, so there seem to be problems on the CE. |
CE |
ce1.inrne.bas.bg |
WN-Rep-ops: LDAP search timed out after 20 sec. In SAM it's passing the test. This happens & it's normal. |
CE |
grid-lab-ce.ii.edu.mk |
JS: Job was aborted. Got a job held event, reason: Globus error 10: data transfer to the server failed. Job got an error while in the CondorG queue. In SAM it's passing the test. To be discussed with Konstantin. |
CE |
ituce.grid.itu.edu.tr |
JS: Job was aborted. Got a job held event, reason: Globus error 10: data transfer to the server failed. Job got an error while in the CondorG queue. In SAM it's passing the test. To be discussed with Konstantin. |
|
SOUTH WESTERN EUROPE |
Service |
Host |
Problem Description |
CE |
axon-g01.ieeta.pt |
JS: No brokers found in top BDII for 1.5 days. In SAM it's passing the test. To be investigated. |
CE |
ce3.egee.cesga.es |
WN-Rep-ops: LDAP search timed out after 20 sec. In SAM it's passing the test. This happens & it's normal. |
CE |
grid001.fc.up.pt |
JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. In SAM it's passing the test. To be discussed with Konstantin |
CE |
grid001.fe.up.pt |
JS: Problem on WN. ERROR: No brokers found [BDII topbdii01.ncg.ingrid.pt:2170]. In SAM it's passing the test. To be discussed with Konstantin |
|
UKI |
Service |
Host |
Problem Description |
|
Comments: |
All services OK |
|