Worklog for the ATLAS debugging exercise

2007-03-23

Site view
  • ce101
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 11 "hit job shallow retry count".

  • ce102
    • 9 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "hit job shallow retry count".

  • ce106
    • 67 GLITE_WMS errors, 0 RunTransform.log files.
      • 65 "hit job shallow retry count".
      • 2 "request expired".

  • ce107
    • 41 GLITE_WMS errors, 0 RunTransform.log files.
      • 38 "hit job shallow retry count".
      • 3 "request expired".

  • gridka
    • 97 GLITE_WMS errors, 0 RunTransform.log files.
      • 97 "request expired".
      • For all these jobs, the lifetime of the proxy used seems to have been too short (< 5 days), which produced the error message "Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)" while the job was on the CE.

  • nikhef
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".
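
The per-site breakdowns in the Site views can be cross-checked mechanically. The following is a minimal sketch, assuming the bullet layout used in this worklog (site name at two spaces, "N GLITE_WMS errors" below it, quoted error messages below that). The function name check_site_view and the regular expressions are my own illustration, not part of any production tool.

```python
import re

# Patterns follow the layout used in this worklog's Site view blocks.
SITE_RE = re.compile(r"^ {2}•\s+(\S+)\s*$")               # "  • ce106"
TOTAL_RE = re.compile(r"^\s*•\s+(\d+) GLITE_WMS errors")   # "    • 67 GLITE_WMS errors, ..."
ITEM_RE = re.compile(r'^\s*•\s+(\d+) "[^"]+"')             # '      • 65 "hit job ..."'


def check_site_view(text):
    """Map each site to (declared GLITE_WMS total, sum of itemized counts)."""
    totals = {}
    site = None
    for line in text.splitlines():
        m = SITE_RE.match(line)
        if m:
            site = m.group(1)
            continue
        m = TOTAL_RE.match(line)
        if m and site is not None:
            totals[site] = [int(m.group(1)), 0]
            continue
        m = ITEM_RE.match(line)
        if m and site in totals:
            totals[site][1] += int(m.group(1))
    return {s: (declared, itemized) for s, (declared, itemized) in totals.items()}


# Two entries copied from the 2007-03-23 Site view.
sample = """\
  • ce106
    • 67 GLITE_WMS errors, 0 RunTransform.log files.
      • 65 "hit job shallow retry count".
      • 2 "request expired".
  • nikhef
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".
"""

for site, (declared, itemized) in check_site_view(sample).items():
    status = "ok" if declared == itemized else "MISMATCH"
    print(f"{site}: declared={declared} itemized={itemized} {status}")
```

A MISMATCH only means that the itemized messages do not add up to the declared total; it does not say which of the two numbers is wrong.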

2007-03-22

Site view
  • ce101
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".
      • 1 "hit job retry count".

  • ce102
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job shallow retry count".

  • ce105
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce106
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 11 "hit job shallow retry count".

  • ce107
    • 9 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "hit job shallow retry count".

  • gridka
    • 28 GLITE_WMS errors, 0 RunTransform.log files.
      • 28 "request expired".

2007-03-21

Cloud view for STAGEOUT errors
  • CA
    • No errors

  • DE
    • 4 "FAILEDSRMSOUT".

  • ES
    • 1 "FAILEDSRMSOUT".

  • FR
    • 135 "FAILEDSRMSOUT".
    • 3 "lcg_cr: File exists".

  • IT
    • 1 "FAILEDSRMSOUT".

  • NL
    • 121 "FAILEDSRMSOUT".
    • 5 "lcg_cr: Permission denied".
    • 3 "lcg_cr: File exists".
    • 2 "copyReplicate: ERROR - GUID already exists".
    • 1 "lcg_cr: Transport endpoint is not connected".
    • 1 "lcg_cr: No space left on device".
    • 1 "CGSI-gSOAP: Could not open connection".

  • T0
    • 805 "FAILEDSRMSOUT".

  • TW
    • 281 "FAILEDSRMSOUT".

  • UK
    • 46 "FAILEDSRMSOUT".
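
To compare clouds or to see which STAGEOUT message dominates on a given day, the Cloud view counts can be summed per cloud and per message. This is a rough sketch under the assumption that the Cloud view keeps the layout used here (cloud name at two spaces, quoted messages at four); tally_cloud_view is a hypothetical helper name.

```python
import re
from collections import Counter

CLOUD_RE = re.compile(r"^ {2}•\s+(\S+)\s*$")        # "  • FR"
COUNT_RE = re.compile(r'^ {4}•\s+(\d+) "([^"]+)"')  # '    • 135 "FAILEDSRMSOUT".'


def tally_cloud_view(text):
    """Return (errors per cloud, errors per message) for one Cloud view block."""
    per_cloud = {}
    per_error = Counter()
    cloud = None
    for line in text.splitlines():
        m = CLOUD_RE.match(line)
        if m:
            cloud = m.group(1)
            per_cloud[cloud] = 0  # stays 0 for "No errors" entries
            continue
        m = COUNT_RE.match(line)
        if m and cloud is not None:
            count = int(m.group(1))
            per_cloud[cloud] += count
            per_error[m.group(2)] += count
    return per_cloud, per_error


# Three entries copied from the 2007-03-21 Cloud view.
sample = """\
  • CA
    • No errors
  • FR
    • 135 "FAILEDSRMSOUT".
    • 3 "lcg_cr: File exists".
  • IT
    • 1 "FAILEDSRMSOUT".
"""

per_cloud, per_error = tally_cloud_view(sample)
print(per_cloud)
print(per_error.most_common())
```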

Site view
  • ce106
    • 9 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "hit job shallow retry count".

  • ce107
    • 8 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "hit job shallow retry count".

  • gridka
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "request expired".

2007-03-20

Cloud view for STAGEOUT errors
  • CA
    • 80 "FAILEDSRMSOUT".

  • DE
    • 45 "FAILEDSRMSOUT".

  • ES
    • 2 "FAILEDSRMSOUT".

  • FR
    • 180 "FAILEDSRMSOUT".

  • IT
    • 1 "FAILEDSRMSOUT".

  • NL
    • 5 "FAILEDSRMSOUT".

  • T0
    • 332 "FAILEDSRMSOUT".
    • 2 "CastorStagerInterface.c:2457 Device or resource busy".
    • 1 "Copy Failed: Unregistering alias from catalog".

  • TW
    • 109 "FAILEDSRMSOUT".
    • 1 "lcg_cr: File exists".

  • UK
    • 104 "FAILEDSRMSOUT".

Site view
  • ce101
    • 32 GLITE_WMS errors, 0 RunTransform.log files.
      • 32 "job was stuck forever".

  • ce102
    • 35 GLITE_WMS errors, 0 RunTransform.log files.
      • 35 "job was stuck forever".

  • ce106
    • 33 GLITE_WMS errors, 1 RunTransform.log files.
      • 31 "job was stuck forever".
      • 1 "request expired".
      • 1 "hit job shallow retry count".
      • lcg-cp error:
        • Transport endpoint is not connected
          • srm.cern.ch (19/03 between 20h55 and 23h15). Castor has been down at CERN since the afternoon of 19/03.

  • ce107
    • 30 GLITE_WMS errors, 3 RunTransform.log files.
      • 27 "job was stuck forever".
      • 3 "hit job shallow retry count".
      • lcg-cp error:
        • Transport endpoint is not connected
          • srm.cern.ch (19/03 between 20h55 and 23h15). Castor has been down at CERN since the afternoon of 19/03.

  • gridka
    • 183 GLITE_WMS errors, 0 RunTransform.log files.
      • 179 "request expired".
      • 4 "job was stuck forever".

  • nikhef
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 4 "job was stuck forever".
      • 2 "request expired".

  • ral
    • 9 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "job was stuck forever".

  • sara
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "job was stuck forever".

  • sinica
    • 12 GLITE_WMS errors, 0 RunTransform.log files.
      • 12 "job was stuck forever".

2007-03-19

Cloud view for STAGEOUT errors
  • CA
    • No errors

  • DE
    • 31 "FAILEDSRMSOUT".
      • FAILEDSRMSOUT srm://gridka-dcache.fzk.de/pnfs/gridka.de/atlas
      • According to SAM, gridka-dcache.fzk.de has had intermittent problems during the last few days.
      • SAM errors:
        • BDII ERROR: sam-bdii.cern.ch:2170 Success
        • Timeout when executing test SRM-put after 600 seconds!
      • In the DE logs, I found lcg-cr failures with "Command was timed out after 420 seconds" (using castorsrm.cern.ch).
      • In SAM, castorsrm.cern.ch is failing intermittently for the atlas VO with "Timeout when executing test SRM-put after 600 seconds!".

  • ES
    • No errors

  • FR
    • No errors

  • IT
    • No errors

  • NL
    • 10 "FAILEDSRMSOUT".
    • These errors are probably due to:
      • castorsrm.cern.ch timeouts
      • GUID already exists

  • T0
    • No errors

  • TW
    • 16 "FAILEDSRMSOUT".

  • UK
    • No errors

Site view
  • ce106
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • gridka
    • 14 GLITE_WMS errors, 0 RunTransform.log files.
      • 14 "request expired".

  • nikhef
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

Summary week 12/03 to 17/03

  • EXELEXOR_GLITE_WMS error:
    • GGUS tickets:
      • #18462: temporary problem with srm.cern.ch (no further details given), but according to SAM the service is still not stable. Status: solved.
      • #19635: many "request expired" errors with CE mu6.matrix.sara.nl (SARA). Status: solved ("Most likely it was due to the rather long queues that we are experiencing lately which caused proxies to be expired before the jobs started").
      • #19673: lcg-cp and lcg-cr fail on files hosted on nodes f-dpm001.grid.sinica.edu.tw and castorsc.grid.sinica.edu.tw. These errors could come from the central catalogue: lcg-utils fails to find the catalogue for the files concerned. Some tests need to be done, but I have no ATLAS certificate, so I asked Simone to take a look at this ticket. Note also that, according to the error message, there seems to be no dq2 subdirectory on f-dpm001.grid.sinica.edu.tw; it should be created by the DDM (Distributed Data Management) system. Fixed by the sysadmins. Status: solved.
      • #19675: error "Could not get virtual id" with LFC node lfc.cr.cnaf.infn.it, which has been failing the SAM tests for more than 5 days. Status: solved ("This error was caused by a temporary unavailability of our authorization ldap server, used by lfc to retrive id/gid of users").
      • #19676: CE ce-fzk.gridka.de needs to be checked because of the error "File not available.Cannot read JobWrapper output, both from Condor and from Maradona" found in the LB information. This suggests an error at the GRIDKA site itself. Status: unsolved, because "atlasprd has to be replaced by a pool of prd-users". I received an email from GRIDKA with the following explanation:
        I think the problem is the following: the user is mapped to atlasprd with other users at the same time,
        this leads to:
          - multiple users are mapped to atlasprd at the same time
          - the proxies on the ce will be changed when a new user is mapped to atlasprd
          - old user uses its proxy on the wn, this will not be valid on the ce as this is meanwhile changed
            resulting in Maradonna errors.
    • Mails sent:
      • The proxy used to submit a job had too short a remaining lifetime (2 days) because the submitter forgot to generate a new one. As a result, the job was aborted with the error "request expired". The reported error should instead mention the proxy expiration (see, for example, the GRIDKA site on 2007-03-17).
      • I found a job that was reported to have failed at the TRIUMF site with the error "hit job shallow retry count". However, this job was submitted 5 times to the CE ce01.ific.uv.es (IFIC-LCG2 site), where it failed because of an error on the site itself (Unspecified gridmanager error). It was then submitted to the CE lcgce01.triumf.ca and finished successfully. Something is therefore wrong, since the failure was attributed to the TRIUMF site rather than to the IFIC-LCG2 site. Simone will investigate this problem.

  • SWMISS error: no error found.
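
The explanation from GRIDKA quoted under ticket #19676 can be illustrated with a toy model. This is only a sketch of the mechanism as the email describes it: the class ComputingElement, its methods, and the account names atlasprd001/atlasprd002 are hypothetical and do not correspond to real gLite or CE code. It assumes the CE keeps one proxy per local account and replaces it whenever a user is mapped to that account.

```python
class ComputingElement:
    """Toy CE that stores one proxy per local account (hypothetical model)."""

    def __init__(self, accounts):
        self.accounts = list(accounts)
        self.proxy_by_account = {}
        self.account_by_user = {}

    def map_user(self, user, proxy):
        """Map a user to a local account and store the proxy for that account."""
        if user not in self.account_by_user:
            # Hand out accounts round-robin; with one account everybody shares it.
            index = len(self.account_by_user) % len(self.accounts)
            self.account_by_user[user] = self.accounts[index]
        account = self.account_by_user[user]
        self.proxy_by_account[account] = proxy  # replaces any earlier proxy
        return account

    def output_retrievable(self, user, proxy):
        """True if the proxy the job holds on the WN still matches the CE copy."""
        account = self.account_by_user[user]
        return self.proxy_by_account[account] == proxy


# Single shared account: the second mapping replaces the first user's proxy.
shared = ComputingElement(["atlasprd"])
shared.map_user("user_a", "proxy_a")
shared.map_user("user_b", "proxy_b")
print(shared.output_retrievable("user_a", "proxy_a"))  # False

# Pool of accounts: each user keeps a separate proxy on the CE.
pool = ComputingElement(["atlasprd001", "atlasprd002"])
pool.map_user("user_a", "proxy_a")
pool.map_user("user_b", "proxy_b")
print(pool.output_retrievable("user_a", "proxy_a"))  # True
```

In this model the first user's job can no longer be matched on the shared account, which is the situation the email associates with the Maradona errors; a pool of accounts avoids the overwrite.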

2007-03-17

Site view

  • ce105
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "job was stuck forever".

  • ce106
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "hit job retry count":
        • Cannot read JobWrapper output, both from Condor and from Maradona (rb101 -> ce106 -> running on WN -> error when done). Only one resubmission try.
      • 1 "request expired": the proxy expired (lifetime too short - 2 days).

  • ce107
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired": the proxy expired (lifetime too short - 2 days).
      • 2 "hit job retry count":
        • Cannot read JobWrapper output, both from Condor and from Maradona (rb101 -> ce107 -> running on WN -> error when done). Only one resubmission try.

  • gridka
    • 129 GLITE_WMS errors, 1 RunTransform.log files.
      • 129 "request expired": the lifetime of the proxy used is really too short...
        • 10 jobs were submitted with a 2-day proxy lifetime.
        • 118 jobs were submitted with a 3-day proxy lifetime.
        • 1 job was submitted with a 3-day proxy lifetime.

  • nikhef
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired": the proxy expired (lifetime too short - 2 days).
      • 1 "hit job retry count".

  • ral
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "job was stuck forever".

  • sara
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

2007-03-16

Site view
  • ce101
    • 37 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "request expired".
      • 13 "aborted by user".
      • 6 "hit job shallow retry count".
      • 3 "X509 proxy expired".

  • ce102
    • 36 GLITE_WMS errors, 0 RunTransform.log files.
      • 12 "request expired".
      • 12 "aborted by user".
      • 6 "X509 proxy expired".
      • 6 "hit job shallow retry count".

  • ce107
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • gridka
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • in2p3
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "request expired".

  • nikhef
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".

  • ral
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • sara
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

2007-03-16

Cloud view for STAGEOUT errors
  • CA
    • 1 "FAILEDSRMSOUT".

  • DE
    • 239 "FAILEDSRMSOUT".

  • ES
    • 12 "FAILEDSRMSOUT".

  • FR
    • No errors

  • IT
    • 30 "FAILEDSRMSOUT".

  • NL
    • 51 "FAILEDSRMSOUT".

  • T0
    • No errors

  • TW
    • 247 "FAILEDSRMSOUT".
    • 6 "lcg_cr: Permission denied".
    • 1 "Copy Failed: Unregistering alias from catalog".

  • UK
    • 6 "FAILEDSRMSOUT".

2007-03-15

Cloud view for STAGEOUT errors
  • CA
    • 1 "FAILEDSRMSOUT".

  • DE
    • 239 "FAILEDSRMSOUT".

  • ES
    • 12 "FAILEDSRMSOUT".

  • FR
    • No errors

  • IT
    • 30 "FAILEDSRMSOUT".

  • NL
    • 51 "FAILEDSRMSOUT".

  • T0
    • No errors

  • TW
    • 247 "FAILEDSRMSOUT".
    • 6 "lcg_cr: Permission denied".
    • 1 "Copy Failed: Unregistering alias from catalog".

  • UK
    • 6 "FAILEDSRMSOUT".

Site view
  • ce101
    • 43 GLITE_WMS errors, 0 RunTransform.log files.
      • 21 "request expired".
      • 15 "X509 proxy expired".
      • 7 "aborted by user".

  • ce102
    • 44 GLITE_WMS errors, 0 RunTransform.log files.
      • 21 "X509 proxy expired".
      • 17 "request expired".
      • 6 "aborted by user".

  • ce105
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • ce106
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired". For one of these jobs, the user proxy lifetime was too short: the proxy was generated on 10 March with a lifetime of 4 days, the job was submitted on 12 March, and the proxy expired on 14 March while the job was on the CE.

  • ce107
    • 10 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "request expired".
      • 4 "X509 proxy expired".
      • 1 "aborted by user".

  • gridka
    • 32 GLITE_WMS errors, 1 RunTransform.log files.
      • 32 "request expired".
      • proxy expired (Valid time x hours less than required y hours)

  • in2p3
    • 18 GLITE_WMS errors, 0 RunTransform.log files.
      • 17 "request expired".
      • 1 "X509 proxy expired".

  • nikhef
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • ral
    • 16 GLITE_WMS errors, 0 RunTransform.log files.
      • 12 "request expired".
      • 2 "X509 proxy expired".
      • 2 "aborted by user".

  • sara
    • 375 GLITE_WMS errors, 0 RunTransform.log files.
      • 337 "X509 proxy expired".
      • 23 "aborted by user".
      • 10 "removal retries exceeded".
      • 3 "hit job shallow retry count".
      • 1 "request expired".

  • sinica
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • triumf
    • 2 GLITE_WMS errors, 1 RunTransform.log files.
      • 1 "request expired": mail sent to the support-exp-glite-rb mailing list because the LB information is weird.
      • 1 "hit job shallow retry count": mail sent to Simone because this job was submitted 5 times to a CE in Spain, failed and then went to TRIUMF.
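
The proxy arithmetic noted for ce106 above (proxy generated on 10 March with a 4-day lifetime, job submitted on 12 March) can be written down as a small check to run before submission. This is a minimal sketch: the times of day are unknown, so midnight is assumed, and the 3-day figure for queueing plus running is a made-up illustration, not a measured value.

```python
from datetime import datetime, timedelta


def remaining_at(created, lifetime, when):
    """Proxy lifetime left at `when` (negative once the proxy has expired)."""
    return created + lifetime - when


def covers(created, lifetime, submitted, expected_duration):
    """True if the proxy outlives the expected queue + run time of the job."""
    return remaining_at(created, lifetime, submitted) >= expected_duration


# Dates from the ce106 entry; times of day are assumed to be midnight.
created = datetime(2007, 3, 10)
lifetime = timedelta(days=4)
submitted = datetime(2007, 3, 12)

expiry = created + lifetime
left = remaining_at(created, lifetime, submitted)
print(f"expires {expiry:%Y-%m-%d}, {left.days} days left at submission")

# Hypothetical job needing 3 days between submission and completion.
print(covers(created, lifetime, submitted, timedelta(days=3)))
```

With these inputs only 2 days of validity remain at submission, so any job that waits and runs for longer than that ends with an expired proxy.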

2007-03-14

  • Broadcast received: problem with the SRM at the SARA site (many pnfs timeouts). Solved in the morning.

Cloud view for STAGEOUT errors
  • CA
    • No errors

  • DE
    • 69 "FAILEDSRMSOUT".
    • 1 "lcg_cr: File exists".

  • ES
    • 30 "FAILEDSRMSOUT".
    • 19 "lcg_cr: File exists".
    • 1 "send2nsd: NS000 - name server not available on xxxx".
    • 1 "lcg_cr: Transport endpoint is not connected".

  • FR
    • No errors

  • IT
    • 23 "FAILEDSRMSOUT".

  • NL
    • 21 "FAILEDSRMSOUT".

  • T0
    • No errors

  • TW
    • 63 "FAILEDSRMSOUT".
    • 3 "lcg_cr: Permission denied".

  • UK
    • 17 "FAILEDSRMSOUT".

Site view
  • ce105
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • ce106
    • 18 GLITE_WMS errors, 7 RunTransform.log files.
      • 13 "job was stuck forever".
      • 5 "request expired".
      • RunTransform.py error
      • lcg-cp
        • connection timed out
          • castorsc.grid.sinica.edu.tw (13/03 between 04:16 and 05:03). GGUS ticket #19673.
        • transport endpoint is not connected
          • castorsc.grid.sinica.edu.tw (14/03 @ 09:35). GGUS ticket #19673.
        • no such file or directory
          • se2.itep.ru (14/03@09:05).
      • lcg-cr
        • file exists
          • castorsc.grid.sinica.edu.tw. GGUS ticket #19673.
          • koala.unimelb.edu.au (13/03@17:21). SAM tests have been failing since 13 March. According to GOCDB, this site (Australia-UNIMELB-LCG2) is under maintenance from 14 March to 16 March because of the deployment of new storage hardware.
        • GUID already exists
          • f-dpm001.grid.sinica.edu.tw (13/03@17:22). SAM tests have been failing since 10 March (at least). GGUS ticket #19673.
        • Command was timed out after xxx seconds
          • srm.cern.ch (13/03 @ 17:57)
      • proxy expired (Valid time xx hours less than required yy hours)

  • ce107
    • 26 GLITE_WMS errors, 10 RunTransform.log files.
      • 26 "job was stuck forever".
      • RunTransform.py error.
      • lcg-cr
        • Best pool too high : 2.0E8
          • srm-disk.pic.es (14/03 between 09:52 and 10:21). SAM tests have been failing since 10 March (at least). The host does not respond to ping. The PIC site is in maintenance today because the central network router is being changed.

  • gridka
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired": it seems to be a problem with CE ce-fzk.gridka.de. See GGUS ticket #19676.

  • ral
    • 23 GLITE_WMS errors, 0 RunTransform.log files.
      • 18 "request expired".
      • 5 "aborted by user".

  • sara
    • 16 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "X509 proxy expired" (jobs submitted on 09/03, update on 13/03).
      • 1 "request expired".

2007-03-13

Cloud view for STAGEOUT errors
  • CA
    • 8 "FAILEDSRMSOUT".
    • 1 "lcg_cr: File exists".

  • DE
    • 158 "FAILEDSRMSOUT".
    • 6 "lcg_cr: File exists".
    • 2 "lcg_cr: Permission denied".

  • ES
    • 595 "lcg_cr: File exists".
    • 179 "FAILEDSRMSOUT".
    • 7 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".
    • 1 "send2nsd: NS000 - name server not available on xxxx".
    • 1 "No information found for SE xxx, lcg_cr: Invalid argument, Failed to get PFN for stored file: xxx".

  • FR
    • 105 "FAILEDSRMSOUT".

  • IT
    • 31 "FAILEDSRMSOUT".

  • NL
    • 33 "FAILEDSRMSOUT".
    • 1 "lcg_cr: Transport endpoint is not connected".

  • T0
    • 39 "Copy Failed: Unregistering alias from catalog".
    • 12 "CastorStagerInterface.c:2457 Device or resource busy".
    • 4 "SRM Get request failed, but no errorMessage supplied".
    • 4 "lcg_cr: Transport endpoint is not connected".

  • TW
    • 78 "FAILEDSRMSOUT".
    • 4 "lcg_cr: File exists".
    • 3 "lcg_cr: Permission denied".

  • UK
    • 21 "FAILEDSRMSOUT".

Site view
  • ce106
    • 64 GLITE_WMS errors, 0 RunTransform.log files.
      • 30 "request expired".
      • 14 "job was stuck forever".
      • 9 "removal retries exceeded".
      • 1 "hit job shallow retry count".

  • ce107
    • 43 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "removal retries exceeded".
      • 13 "request expired".
      • 8 "job was stuck forever".

  • cnaf
    • 5 GLITE_WMS errors, 5 RunTransform.log files.
      • 5 "job was stuck forever".
      • No replica found. Mail sent to the Atlas LCG production mailing list and GGUS ticket #19675 created.
      • lcg-cr error:
        • Could not get virtual id
          • sc.cr.cnaf.infn.it
        • File exists
          • grid-cert-03.roma1.infn.it

  • in2p3
    • 7 GLITE_WMS errors, 0 RunTransform.log files.
      • 7 "request expired".

  • nikhef
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".
      • 1 "hit job retry count".

  • ral
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "request expired".
      • 2 "X509 proxy expired".
      • 2 "aborted by user".
      • 1 "job was stuck forever".

  • sara
    • 334 GLITE_WMS errors, 0 RunTransform.log files.
      • 227 "request expired". GGUS ticket #19635 submitted.
      • 76 "X509 proxy expired".
      • 30 "hit job retry count".
      • 1 "hit job shallow retry count".

  • sinica
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "job was stuck forever".

2007-03-12

Cloud view for STAGEOUT errors
  • CA
    • 3 "FAILEDSRMSOUT".
    • 1 "lcg_cr: File exists".

  • DE
    • 29 "FAILEDSRMSOUT".
    • 3 "lcg_cr: File exists".

  • ES
    • 149 "lcg_cr: File exists".
    • 142 "FAILEDSRMSOUT".
    • 3 "lcg_cr: Transport endpoint is not connected".

  • FR
    • 100 "FAILEDSRMSOUT".
    • 1 "lcg_cr: File exists".

  • IT
    • 29 "FAILEDSRMSOUT".

  • NL
    • 11 "FAILEDSRMSOUT".

  • T0
    • 46 "lcg_cr: Transport endpoint is not connected".
    • 14 "CastorStagerInterface.c:2457 Device or resource busy".
    • 3 "connection fails or timeout".
    • 2 "FAILEDSRMSOUT".

  • TW
    • 48 "FAILEDSRMSOUT".
    • 4 "lcg_cr: File exists".
    • 1 "lcg_cr: Transport endpoint is not connected".

  • UK
    • 11 "FAILEDSRMSOUT".

Site view
  • ce105
    • 30 GLITE_WMS errors, 0 RunTransform.log files.
      • 30 "request expired".

  • ce106
    • 93 GLITE_WMS errors, 1 RunTransform.log files.
      • 84 "removal retries exceeded".
      • 8 "request expired".
      • 1 "hit job retry count".
      • 1 RunTransform.py error

  • ce107
    • 176 GLITE_WMS errors, 0 RunTransform.log files.
      • 166 "removal retries exceeded".
      • 7 "request expired".
      • 1 "X509 proxy expired".
      • 1 "hit job shallow retry count".
      • 1 "aborted by user".

  • gridka
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • ral
    • 17 GLITE_WMS errors, 0 RunTransform.log files.
      • 13 "request expired".
      • 2 "removal retries exceeded".
      • 1 "X509 proxy expired".
      • 1 "hit job shallow retry count".

  • sara
    • 118 GLITE_WMS errors, 0 RunTransform.log files.
      • 118 "request expired".

2007-03-09

Site view
  • ce101
    • 52 GLITE_WMS errors, 4 RunTransform.log files.
      • 52 "removed by executor (drain mode)".

  • ce102
    • 47 GLITE_WMS errors, 4 RunTransform.log files.
      • 47 "removed by executor (drain mode)".

  • ce105
    • 140 GLITE_WMS errors, 48 RunTransform.log files.
      • 63 "removed by executor (drain mode)".
      • 35 "removal retries exceeded".
      • 24 "request expired".
      • 11 "X509 proxy expired".
      • 6 "aborted by user".
      • 1 "hit job shallow retry count".

  • ce106
    • 257 GLITE_WMS errors, 96 RunTransform.log files.
      • 137 "removed by executor (drain mode)".
      • 120 "removal retries exceeded".

  • ce107
    • 153 GLITE_WMS errors, 7 RunTransform.log files.
      • 71 "removal retries exceeded".
      • 40 "removed by executor (drain mode)".
      • 14 "X509 proxy expired".
      • 14 "request expired".
      • 13 "aborted by user".
      • 1 "hit job shallow retry count".

  • cnaf
    • 21 GLITE_WMS errors, 21 RunTransform.log files.
      • 21 "removed by executor (drain mode)".

  • gridka
    • 156 GLITE_WMS errors, 16 RunTransform.log files.
      • 156 "removed by executor (drain mode)".

  • in2p3
    • 409 GLITE_WMS errors, 35 RunTransform.log files.
      • 320 "removal retries exceeded".
      • 87 "removed by executor (drain mode)".
      • 2 "aborted by user".

  • nikhef
    • 8 GLITE_WMS errors, 1 RunTransform.log files.
      • 7 "removed by executor (drain mode)".
      • 1 "request expired".

  • pic
    • 20 GLITE_WMS errors, 5 RunTransform.log files.
      • 20 "removed by executor (drain mode)".

  • ral
    • 126 GLITE_WMS errors, 1 RunTransform.log files.
      • 95 "removed by executor (drain mode)".
      • 27 "request expired".
      • 3 "X509 proxy expired".
      • 1 "hit job shallow retry count".

  • sara
    • 10 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "request expired".
      • 1 "removal retries exceeded".

  • sinica
    • 135 GLITE_WMS errors, 21 RunTransform.log files.
      • 128 "removed by executor (drain mode)".
      • 5 "removal retries exceeded".
      • 2 "request expired".

  • triumf
    • 17 GLITE_WMS errors, 14 RunTransform.log files.
      • 17 "removed by executor (drain mode)".

2007-03-08

Site view
  • ce105
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "hit job retry count".

  • ce106
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • ce107
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 4 "hit job retry count".
      • 1 "removal retries exceeded".

  • ral
    • 15 GLITE_WMS errors, 0 RunTransform.log files.
      • 13 "request expired".
      • 1 "removal retries exceeded".
      • 1 "hit job shallow retry count".

  • sara
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

2007-03-07

Site view
  • ce107
    • 7 GLITE_WMS errors, 1 RunTransform.log files.
      • 7 "hit job retry count".
      • lcg-cp (from lxb6180.cern.ch):
      • lcg-cr (from lxb6180.cern.ch):
        • transport endpoint not connected
          • sc.cr.cnaf.infn.it
      • CGSI-gSOAP: Error reading token data: Connection reset by peer
          • grid006.mi.infn.it
      • RunTransform.py

  • in2p3
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • sara
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • sinica
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-03-06

Site view
  • ce101
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "hit job shallow retry count".

  • ce102
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".
      • 2 "hit job shallow retry count".

  • ce105
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".
      • 2 "hit job shallow retry count".
      • 1 "hit job retry count".

  • ce107
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job retry count".
      • 1 "request expired".

  • sara
    • 10 GLITE_WMS errors, 1 RunTransform.log files.
      • 10 "hit job retry count".
      • lcg-cp error with srm.cern.ch on 2007-03-05 between 02:10 and 10:52 (SAM test failure since yesterday).

2007-03-05

Cloud view for STAGEOUT errors

  • CERN (at 11:32): 205 out of 205 log files retrieved
    • 70 "CastorStagerInterface.c:2457 Device or resource busy".
    • 66 "lcg_cr: Transport endpoint is not connected".
    • 26 "FAILEDSRMSOUT".
    • 4 "connection fails or timeout".

  • CERN - STAGEOUT (earlier than 11:32)
    • 52 "lcg_cr: Transport endpoint is not connected".
    • 35 "CastorStagerInterface.c:2457 Device or resource busy".
    • 26 "FAILEDSRMSOUT".

  • DE (11:32): 2 out of 7 log files retrieved
    • 2 "lcg_cr: File exists".
    • 1 "lcg_cr: Permission denied".

  • Taiwan: 8 out of 8 log files retrieved
    • 4 "lcg_cr: Transport endpoint is not connected".
    • 1 "lcg_cr: File exists".

  • IT: 1 out of 1 log file retrieved
    • 1 "send2nsd: NS002 - send error : No valid credential found. Bad magic number".

Site view
  • ce101
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".
      • 1 "removal retries exceeded".

  • ce102
    • 8 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "removal retries exceeded".
      • 2 "hit job shallow retry count".

  • ce105
    • 31 GLITE_WMS errors, 7 RunTransform.log files.
      • 18 "removal retries exceeded".
      • 11 "hit job retry count".
      • 1 "request expired".
      • 1 "hit job shallow retry count".
      • lcg-cp errors
        • Get request failed, but no errorMessage supplied
          • srm.cern.ch (2007-03-04@18hxx). Confirmed by SAM, which shows problems with this node during the weekend and this morning.
          • srm-durable-atlas.cern.ch. Confirmed by SAM, which shows problems with this node during the weekend and this morning.
      • RunTransform.py error.

  • ce106
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "removal retries exceeded".

  • ce107
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • sara
    • 8 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "hit job retry count".

2007-03-02

Site view
  • ce105
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • triumf
    • 50 GLITE_WMS errors, 4 RunTransform.log files.
      • 32 "hit job shallow retry count".
      • 18 "hit job retry count".
      • Connections refused to lcg-bdii on 2007-03-01 (08:05 -> 08:13).
      • No replicas found in any LFC: timeouts from wn020.triumf.lcg to the LFC hosts on 2007-03-01 (07:57 -> 08:18). Possibly a network problem at TRIUMF? Nothing found in SAM.

2007-03-01

Site view
  • ce101
    • 48 GLITE_WMS errors, 0 RunTransform.log files.
      • 48 "parent DAG was aborted".

  • ce102
    • 32 GLITE_WMS errors, 0 RunTransform.log files.
      • 32 "parent DAG was aborted".

  • ce105
    • 4 GLITE_WMS errors, 1 RunTransform.log files.
      • 3 "parent DAG was aborted".
      • 1 "hit job retry count".
    • lcg-cp error:
      • end-of-file was reached
        • ccsrm.in2p3.fr
    • All the errors found in the RunTransform.log files occurred on 2007-02-07.

  • ce106
    • 17 GLITE_WMS errors, 10 RunTransform.log files.
      • 12 "parent DAG was aborted".
      • 5 "hit job retry count".
    • Setting error acronym from jobInfo.xml: TRF_OUTFILE_TOOFEW (2007-02-07).
    • lcg-cp:
      • remote file size mismatch (200) (2007-02-07)
        • dcache.gridpp.rl.ac.uk
        • lxfs07.jinr.ru
        • tbn18.nikhef.nl
        • srm-disk.pic.es
        • gridka-dcache.fzk.de
        • ccsrm.in2p3.fr
    • lcg-bdii time-out (2007-02-07).
    • All the errors found in the RunTransform.log files occurred on 2007-02-07.

  • ce107
    • 4 GLITE_WMS errors, 2 RunTransform.log files.
      • 4 "parent DAG was aborted".
    • All the errors found in the RunTransform.log files occurred on 2007-02-08.

  • gridka
    • 19 GLITE_WMS errors, 0 RunTransform.log files.
      • 19 "parent DAG was aborted".

  • in2p3
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "parent DAG was aborted".

  • nikhef
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • pic
    • 10 GLITE_WMS errors, 0 RunTransform.log files.
      • 10 "parent DAG was aborted".

  • ral
    • 36 GLITE_WMS errors, 0 RunTransform.log files.
      • 28 "parent DAG was aborted".
      • 6 "X509 proxy expired".
      • 2 "removal retries exceeded".

  • sinica
    • 2 GLITE_WMS errors, 1 RunTransform.log files.
      • 2 "parent DAG was aborted".
    • All the errors found in the RunTransform.log files occurred on 2007-02-18.

  • triumf
    • 182 GLITE_WMS errors, 0 RunTransform.log files.
      • 179 "hit job shallow retry count".
      • 3 "aborted by user".

2007-02-28

Site view
  • ce101
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "X509 proxy expired".
      • 1 "hit job shallow retry count".

  • ce102
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "X509 proxy expired".
      • 5 "hit job shallow retry count".

  • ce105
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "X509 proxy expired".

  • ce107
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired".

  • gridka
    • 16 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "X509 proxy expired".
      • 1 "hit job shallow retry count".

  • in2p3
    • 14 GLITE_WMS errors, 0 RunTransform.log files.
      • 14 "X509 proxy expired".
    • 1 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • nikhef
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 4 "X509 proxy expired".

  • ral
    • 14 GLITE_WMS errors, 0 RunTransform.log files.
      • 13 "X509 proxy expired".
      • 1 "hit job retry count".

  • sinica
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "X509 proxy expired".

  • triumf
    • 61 GLITE_WMS errors, 0 RunTransform.log files.
      • 56 "hit job shallow retry count".
      • 5 "request expired".

2007-02-27

Site view

  • ce101
    • 27 GLITE_WMS errors, 0 RunTransform.log files.
      • 19 "request expired".
      • 7 "hit job shallow retry count".
      • 1 "X509 proxy expired" (job submitted on 21/02, updated on 26/02).

  • ce102
    • 17 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "request expired".
      • 7 "hit job shallow retry count".
      • 1 "X509 proxy expired" (job submitted on 24/02, updated on 26/02).
      • 1 "aborted by user".

  • ce105
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • gridka
    • 79 GLITE_WMS errors, 1 RunTransform.log files.
      • 54 "removal retries exceeded".
      • 19 "X509 proxy expired" (jobs submitted on 22/02, updated on 26/02).
      • 6 "hit job retry count".
      • job proxy expired.
    • 21 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • in2p3
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "hit job retry count".
      • 1 "X509 proxy expired" (job submitted on 23/02, updated on 27/02).
    • 308 GETOUT_EMPTYOUT_ errors, 12 RunTransform.log files.

  • nikhef
    • 2 GLITE_WMS errors, 2 RunTransform.log files.

  • ral
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".
      • 1 "X509 proxy expired" (job submitted on 22/02, updated on 26/02).
      • 1 "hit job shallow retry count".
      • 1 "hit job retry count".

  • sinica
    • 7 GLITE_WMS errors, 0 RunTransform.log files.
      • 7 "hit job retry count".

  • triumf
    • 193 GLITE_WMS errors, 0 RunTransform.log files.
      • 188 "hit job shallow retry count".
      • 3 "aborted by user".
      • 2 "X509 proxy expired" (jobs submitted on 21/02, updated on 26/02).

Cloud view for STAGEOUT errors

  • IT
    • 1104 "lcg_cr: No space left on device".
    • 4 "lcg_cr: Transport endpoint is not connected".

  • CERN
    • 656 "lcg_cr: Transport endpoint is not connected".
    • 411 "connection fails or timeout".
    • 372 "CastorStagerInterface.c:2457 Device or resource busy".
    • 8 "lcg_cr: File exists".
    • 1 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".
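
To see how the CERN STAGEOUT errors moved between the 2007-02-26 and 2007-02-27 cloud views, the counts can be subtracted per message. The following is a minimal Python sketch and not part of the original tooling: the function name `delta` and the two dictionaries are hypothetical, and the numbers are copied by hand from the CERN entries of those two days in this log.

```python
def delta(prev, curr):
    """Return curr - prev for every message seen on either day."""
    keys = sorted(set(prev) | set(curr))
    return {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}

# Counts copied from the CERN cloud view of each day (hypothetical variable names).
cern_0226 = {
    "CastorStagerInterface.c:2457 Device or resource busy": 165,
    "connection fails or timeout": 51,
    "lcg_cr: Transport endpoint is not connected": 9,
    "lcg_cr: File exists": 5,
    "BDII Connection Timeout": 1,
}
cern_0227 = {
    "lcg_cr: Transport endpoint is not connected": 656,
    "connection fails or timeout": 411,
    "CastorStagerInterface.c:2457 Device or resource busy": 372,
    "lcg_cr: File exists": 8,
    "BDII Connection Timeout": 1,
}

# Print the largest increases first.
for message, change in sorted(delta(cern_0226, cern_0227).items(),
                              key=lambda item: item[1], reverse=True):
    print(f"{change:+5d}  {message}")
```

With these inputs the largest increase is for "lcg_cr: Transport endpoint is not connected" (9 to 656).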

2007-02-26

Site view

  • ce101
    • 74 GLITE_WMS errors, 0 RunTransform.log files.
      • 30 "aborted by user".
      • 27 "X509 proxy expired" (suspicious: all these jobs were submitted on 25 February).
      • 13 "request expired".
      • 4 "hit job shallow retry count".
    • 562 GETOUT_EMPTYOUT_ errors, 231 RunTransform.log files.

  • ce102
    • 75 GLITE_WMS errors, 0 RunTransform.log files.
      • 29 "X509 proxy expired" (suspicious: all these jobs were submitted on 25 February).
      • 26 "request expired".
      • 15 "aborted by user".
      • 5 "hit job shallow retry count".
    • 415 GETOUT_EMPTYOUT_ errors, 3 RunTransform.log files.

  • ce105
    • 87 GLITE_WMS errors, 59 RunTransform.log files.
      • 87 "removal retries exceeded".
      • lcg-cr errors:
        • timeout
          • srm.cern.ch
        • failed with Internal error/stage_putDone
          • srm.cern.ch
        • connection fails or timeout
          • srm.cern.ch
        • CastorStagerInterface.c:2457 Device or resource busy
          • srm.cern.ch
      • lcg-cp errors:
        • SRM Get request failed, but no errorMessage supplied
          • srm.cern.ch
        • transport endpoint is not connected
          • srm.cern.ch
    • 489 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • ce106
    • 66 GLITE_WMS errors, 30 RunTransform.log files.
      • 66 "removal retries exceeded".
      • lcg-cp errors:
        • timeout
          • srm.cern.ch
      • lcg-cr errors:
        • SRM Get request failed, but no errorMessage supplied
          • srm.cern.ch
    • 307 GETOUT_EMPTYOUT_ errors, 264 RunTransform.log files.

  • ce107
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 49 GETOUT_EMPTYOUT_ errors, 1 RunTransform.log files.

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • gridka
    • 52 GLITE_WMS errors, 0 RunTransform.log files.
      • 46 "X509 proxy expired" (jobs submitted on 21-22/02, updated on 26/02).
      • 3 "request expired".
      • 1 "hit job shallow retry count".
      • 1 "hit job retry count".
      • 1 "aborted by user".
    • 35 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • in2p3
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "request expired".
      • 1 "hit job retry count".
    • 59 GETOUT_EMPTYOUT_ errors, 6 RunTransform.log files.

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • pic
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 4 "request expired".
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • ral
    • 21 GLITE_WMS errors, 0 RunTransform.log files.
      • 11 "request expired".
      • 8 "hit job retry count".
      • 2 "X509 proxy expired" (suspicious: jobs submitted on 25/02).
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • sinica
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 6 "request expired".
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 GETOUT_EMPTYOUT_ errors, 0 RunTransform.log files.
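
The "suspicious" notes in the site view above rest on the interval between job submission and the proxy-expired update: most expiries in this log show up 4 to 5 days after submission, while the flagged jobs were submitted only the day before. A minimal Python sketch of that check follows. It is an illustration, not the procedure actually used; the function names and the `min_expected_days=4` threshold are assumptions.

```python
from datetime import date

def proxy_age_days(submitted: date, updated: date) -> int:
    """Days between job submission and the proxy-expired status update."""
    return (updated - submitted).days

def is_suspicious(submitted: date, updated: date, min_expected_days: int = 4) -> bool:
    """Flag an expiry that came sooner than the assumed usual proxy lifetime."""
    return proxy_age_days(submitted, updated) < min_expected_days

# ce101: jobs submitted on 25 February, reported on 26 February.
print(is_suspicious(date(2007, 2, 25), date(2007, 2, 26)))
# gridka: jobs submitted on 21/02, updated on 26/02.
print(is_suspicious(date(2007, 2, 21), date(2007, 2, 26)))
```

The threshold is only a guess taken from the intervals recorded in this log; it should be set to the real proxy lifetime if that is known.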

Cloud view for STAGEOUT errors

  • IT
    • 1028 "lcg_cr: No space left on device".
    • 3 "lcg_cr: Transport endpoint is not connected".

  • CERN
    • 165 "CastorStagerInterface.c:2457 Device or resource busy".
    • 51 "connection fails or timeout".
    • 9 "lcg_cr: Transport endpoint is not connected".
    • 5 "lcg_cr: File exists".
    • 1 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".

2007-02-25

  • ce101
    • 166 GLITE_WMS errors, 0 RunTransform.log files.
      • 77 "aborted by user".
      • 41 "X509 proxy expired".
      • 29 "hit job shallow retry count".
      • 19 "request expired".
    • SAM test failures on 23-24 February.

  • ce102
    • 162 GLITE_WMS errors, 0 RunTransform.log files.
      • 62 "aborted by user".
      • 43 "X509 proxy expired".
      • 37 "hit job shallow retry count".
      • 20 "request expired".

  • ce105
    • 207 GLITE_WMS errors, 3 RunTransform.log files.
      • 206 "removal retries exceeded".
      • 1 "request expired".
    • lcg-cr errors:
      • failed with Internal error/stage_putDone: No responses received nbresps:0 (errno=0, serrno=1015)
        • srm.cern.ch
    • SAM test failures since 24 February.

  • ce106
    • 470 GLITE_WMS errors, 167 RunTransform.log files.
      • 437 "removal retries exceeded".
      • 22 "X509 proxy expired".
      • 10 "aborted by user".
      • 1 "hit job shallow retry count".
    • SAM test failures since 24 February.
    • lcg-cp errors:
      • timeout (900 seconds)
        • srm.cern.ch
      • transport endpoint is not connected
        • srm.cern.ch
    • lcg-cr errors:
      • CSI_setFileStatus() stage_putDone failed with 1004/rm_enterjob failure
        • srm.cern.ch (25/02).
      • timeout
        • srm.cern.ch (25/02).
    • runPyJT: TRF returns non zero: 31 (TRFERROR).

  • ce107
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".
      • 1 "hit job shallow retry count".

  • in2p3
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • ral
    • 28 GLITE_WMS errors, 0 RunTransform.log files.
      • 17 "hit job retry count".
      • 8 "request expired".
      • 2 "aborted by user".
      • 1 "X509 proxy expired".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job retry count".

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-23

  • ce101
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce102
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • ce105
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce106
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce107
    • 29 GLITE_WMS errors, 0 RunTransform.log files.
      • 29 "removal retries exceeded".

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 4 "hit job shallow retry count".
      • 2 "hit job retry count".

  • in2p3
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job retry count".

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 13 GLITE_WMS errors, 0 RunTransform.log files.
      • 7 "hit job shallow retry count".
      • 4 "hit job retry count".
      • 2 "aborted by user".

  • ral
    • 9 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "hit job retry count".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job shallow retry count".

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-22

  • ce101
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • ce102
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • ce105
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce106
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce107
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "removal retries exceeded".

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • in2p3
    • 13 GLITE_WMS errors, 1 RunTransform.log files.
      • 13 "hit job retry count".

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ral
    • 38 GLITE_WMS errors, 0 RunTransform.log files.
      • 35 "hit job retry count".
      • 3 "hit job shallow retry count".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-21

  • ce101
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce102
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce105
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce106
    • 1 GLITE_WMS errors, 1 RunTransform.log files.
      • 1 "job was stuck forever".

  • ce107
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired" (jobs submitted on 16 February, updated on 20 February).

  • in2p3
    • 29 GLITE_WMS errors, 0 RunTransform.log files.
      • 29 "hit job retry count".

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job shallow retry count".
      • 1 "aborted by user".

  • ral
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "hit job retry count".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-20

  • ce101
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • ce102
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "request expired".

  • ce105
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".
      • 1 "hit job retry count".

  • ce106
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce107
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 181 GLITE_WMS errors, 2 RunTransform.log files.
      • 162 "X509 proxy expired" (jobs submitted on 16-17 February, updated on 19 February).
      • 16 "aborted by user".
      • 2 "request expired".
      • 1 "hit job retry count".
      • proxy expired error.

  • in2p3
    • 20 GLITE_WMS errors, 0 RunTransform.log files.
      • 12 "hit job retry count".
      • 5 "request expired".
      • 3 "X509 proxy expired".

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • ral
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "X509 proxy expired".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired".

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-19

  • An unalarmed condition caused the CERN CASTOR SRM v11 endpoint srm.cern.ch to be down for ~24 hours starting Saturday morning at 6am. During this period the CERN prod site failed the SAM test.

  • ce101
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce102
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce105
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ce106
    • 163 GLITE_WMS errors, 136 RunTransform.log files.
      • 163 "removal retries exceeded".
      • lcg-cp errors:
        • connection fails or timeout
          • srm.cern.ch
        • no such file or directory
          • se2.itep.ru
        • remote file size mismatch (value = 200)
          • srm.cern.ch
          • koala.unimelb.edu.au
          • castorsc.grid.sinica.edu.tw
          • srm.grid.sara.nl
          • tbn18.nikhef.nl
          • dcache.gridpp.rl.ac.uk
          • atlasse.phys.sinica.edu.tw
          • sc.cr.cnaf.infn.it
          • lxfs07.jinr.ru
        • lifetime expired
          • dcache.gridpp.rl.ac.uk
        • transport endpoint not connected
          • srm.cern.ch
        • CGSI-gSOAP: Error reading token data: Success
          • srm.cern.ch
        • Error writing data to request repository:Timed out
          • srm.cern.ch
      • lcg-cr errors:
        • connection fails or timeout
          • srm.cern.ch
        • a system call failed (connection reset by peer)
          • dcache.gridpp.rl.ac.uk
        • device or resource busy
          • srm.cern.ch
        • the server sent an error response: 425 425 Cannot open port [....] Best pool too high: 2.0E8
          • lcg60.sinp.msu.ru
        • permission denied
          • se2.itep.ru
          • se01.esc.qmul.ac.uk
          • t2se01.physics.ox.ac.uk
        • invalid argument
          • lcgse1.shef.ac.uk
        • CGSI-gSOAP: Error reading token data: Success
          • gw-3.ccc.ucl.ac.uk
          • srm-disk.pic.es
        • Can't get req uniqueid
          • dpm0001.m45.ihep.su
      • proxy expired errors
      • RunTransform.py errors
      • No replicas found in any LFC

  • ce107
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • cnaf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • in2p3
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • nikhef
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • ral
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".

  • sara
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".

  • triumf
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-18

  • ce105:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".

  • ce106:
    • 163 GLITE_WMS errors, 29 RunTransform.log files.
      • 163 "removal retries exceeded" (all jobs submitted between 16 and 18 January).

  • gridka:
    • 19 GLITE_WMS errors, 2 RunTransform.log files.
      • 15 "X509 proxy expired" (all jobs submitted 15 January, and status updated on 17 January).
      • 3 "request expired".
      • 1 "hit job shallow retry count".

  • pic:
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "X509 proxy expired" (all jobs submitted 15 January, and status updated on 17 January).

  • ral:
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job shallow retry count".
      • 1 "X509 proxy expired" (job submitted 15 January, and status updated on 17 January).
      • 1 "request expired".

  • sinica:
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired" (all jobs submitted 15 January, and status updated on 17 January).
      • 1 "hit job retry count".

2007-02-16

  • ce101:
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "X509 proxy expired".

  • ce102:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired".

  • ce105:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".
      • 1 "hit job retry count".

  • ce106:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired".

  • ce107:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.

  • gridka:
    • 56 GLITE_WMS errors, 5 RunTransform.log files.
      • 53 "X509 proxy expired".
      • 3 "removal retries exceeded".

  • in2p3:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "request expired".
      • 1 "hit job retry count".

  • nikhef:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic:
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "X509 proxy expired".

  • ral:
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "hit job retry count".
      • 1 "X509 proxy expired".

  • sara:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • sinica:
    • 301 GLITE_WMS errors, 0 RunTransform.log files.
      • 244 "X509 proxy expired".
      • 57 "request expired".

  • triumf:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

2007-02-15

  • ce101:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 11 WRAPLCG STAGEOUT LCGCR errors, 11 RunTransform.log files.
      • 12 "lcg_cr: Transport endpoint is not connected".
      • 2 "connection fails or timeout".
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce102:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 12 WRAPLCG STAGEOUT LCGCR errors, 12 RunTransform.log files.
      • 13 "lcg_cr: Transport endpoint is not connected".
      • 2 "connection fails or timeout".
      • 1 "FAILEDSRMSOUT".

  • ce105:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.

  • ce106:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 10/02, updated 15/02).
      • 1 "aborted by user".
    • 1 WRAPLCG STAGEOUT LCGCR errors, 1 RunTransform.log files.
      • 1 "FAILEDSRMSOUT".

  • ce107:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.

  • gridka:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 2 WRAPLCG STAGEOUT LCGCR errors, 2 RunTransform.log files.
      • getErrorFromTRFjobInfo: INFO Found error: TRF_ARG csc_digi does not have an argument named mblist

  • in2p3:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 10/02, updated 15/02).
    • 33 WRAPLCG STAGEOUT LCGCR errors, 33 RunTransform.log files.
      • 11 "send2nsd: NS000 - connect error: Connection timed out".
      • 6 "lcg_cr: Transport endpoint is not connected".
      • 4 "Timed out, lcg_cr: Communication error on send".
      • 4 "CastorStagerInterface.c:2457 Device or resource busy".
      • 3 "FAILEDSRMSOUT".
      • 3 "connection fails or timeout".
      • 1 "lcg_cr: File exists".

  • nikhef:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 14 WRAPLCG STAGEOUT LCGCR errors, 14 RunTransform.log files.
      • 25 "FAILEDSRMSOUT".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • pic:
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 9 "X509 proxy expired" (job submitted 10-11/02, updated 15/02).
      • 2 "hit job shallow retry count".
    • 0 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.

  • ral:
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "X509 proxy expired" (job submitted 11/02, updated 15/02).
    • 2 WRAPLCG STAGEOUT LCGCR errors, 2 RunTransform.log files.
      • 1 "FAILEDSRMSOUT".

  • sara:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 0 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.

  • sinica:
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "X509 proxy expired" (job submitted 10/02, updated 15/02).
    • 15 WRAPLCG STAGEOUT LCGCR errors, 15 RunTransform.log files.
      • 6 "send2nsd: NS000 - connect error: Connection timed out".
      • 3 "Timed out, lcg_cr: Communication error on send".
      • 2 "FAILEDSRMSOUT".
      • 2 "connection fails or timeout".
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".
      • 1 "lcg_cr: Transport endpoint is not connected".

  • triumf:
    • 24 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "hit job shallow retry count".
      • 9 "X509 proxy expired" (job submitted 10-11/02, updated 15/02).
    • 12 WRAPLCG STAGEOUT LCGCR errors, 12 RunTransform.log files.
      • 4 "connection fails or timeout".
      • 2 "lcg_cr: File exists".
      • 1 "FAILEDSRMSOUT".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".
      • 1 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".
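
In several entries above the quoted-message counts do not add up to the declared total (for ce101, 11 WRAPLCG STAGEOUT LCGCR errors are declared but 12 + 2 + 2 = 16 messages are listed). The sketch below cross-checks a site block written in this log's bullet format. It is a hypothetical helper, not the script that produced these numbers, and the regular expressions assume exactly the line shapes used here.

```python
import re

# "• N <CATEGORY> errors, M RunTransform.log files."
HEADER = re.compile(r"^\s*•\s+(\d+)\s+(.+?)\s+errors,")
# '• N "message".'
DETAIL = re.compile(r'^\s*•\s+(\d+)\s+"')

def find_mismatches(lines):
    """Return (category, declared, summed) where the breakdown differs from the total."""
    blocks = []
    for line in lines:
        detail = DETAIL.match(line)
        if detail:
            if blocks:
                blocks[-1]["summed"] += int(detail.group(1))
                blocks[-1]["details"] += 1
            continue
        header = HEADER.match(line)
        if header:
            blocks.append({"category": header.group(2),
                           "declared": int(header.group(1)),
                           "summed": 0, "details": 0})
    # Headers without any quoted breakdown are not reported.
    return [(b["category"], b["declared"], b["summed"])
            for b in blocks if b["details"] and b["declared"] != b["summed"]]

ce101 = [
    "  • ce101:",
    "    • 0 GLITE_WMS errors, 0 RunTransform.log files.",
    "    • 11 WRAPLCG STAGEOUT LCGCR errors, 11 RunTransform.log files.",
    '      • 12 "lcg_cr: Transport endpoint is not connected".',
    '      • 2 "connection fails or timeout".',
    '      • 2 "CastorStagerInterface.c:2457 Device or resource busy".',
]
print(find_mismatches(ce101))
```

A mismatch is not necessarily a counting mistake: one possible explanation, not confirmed by this log, is that a single job can record more than one message in its RunTransform.log file.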

2007-02-14

  • SRM server ccsrm.in2p3.fr will be unavailable from 18:00 on Wednesday, February 14th to 10:00 on Thursday, February 15th.

  • ce101:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 16 WRAPLCG STAGEOUT LCGCR errors, 16 RunTransform.log files.
      • 12 "lcg_cr: Transport endpoint is not connected".
      • 5 "FAILEDSRMSOUT".
      • 2 "connection fails or timeout".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce102:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 12 WRAPLCG STAGEOUT LCGCR errors, 12 RunTransform.log files.
      • 13 "lcg_cr: Transport endpoint is not connected".
      • 2 "FAILEDSRMSOUT".
      • 1 "connection fails or timeout".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce105:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 21 WRAPLCG STAGEOUT LCGCR errors, 21 RunTransform.log files.
      • 8 "lcg_cr: Transport endpoint is not connected".
      • 6 "send2nsd: NS000 - connect error: Connection timed out".
      • 2 "Timed out, lcg_cr: Communication error on send".
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".
      • 1 "No information found for SE xxx, lcg_cr: Invalid argument, Failed to get PFN for stored file: xxx".
      • 1 "lcg_cr: File exists".
      • 1 "FAILEDSRMSOUT".
      • 1 "connection fails or timeout".
      • 1 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".

  • ce106:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 10/02, updated 14/02).
      • 1 "aborted by user".
    • 18 WRAPLCG STAGEOUT LCGCR errors, 9 RunTransform.log files.
      • 13 "FAILEDSRMSOUT".
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".
      • 1 "Timed out, lcg_cr: Communication error on send".
      • 1 "send2nsd: NS000 - connect error: Connection timed out".
      • 1 "lcg_cr: File exists".
      • 1 "BDII Connection Timeout: lcg-bdii.cern.ch:2170. lcg_cr: Connection timed out".

  • ce107:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job retry count".
    • 8 WRAPLCG STAGEOUT LCGCR errors, 8 RunTransform.log files.
      • 3 "lcg_cr: File exists".
      • 2 "lcg_cr: Transport endpoint is not connected".
      • 1 "Too many events in file".
      • 1 "Timed out, lcg_cr: Communication error on send".
      • 1 "send2nsd: NS000 - connect error: Connection timed out".

  • gridka:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 09/02, updated 13/02).
    • 21 WRAPLCG STAGEOUT LCGCR errors, 29 RunTransform.log files.
      • 26 "lcg_cr: Transport endpoint is not connected".
      • 7 "CastorStagerInterface.c:2457 Device or resource busy".
      • 3 "connection fails or timeout".
      • 1 "send2nsd: NS000 - connect error: Connection timed out".

  • in2p3:
    • 1 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 10/02, updated 14/02).
    • 95 WRAPLCG STAGEOUT LCGCR errors, 95 RunTransform.log files.
      • 37 "send2nsd: NS000 - connect error: Connection timed out".
      • 21 "lcg_cr: Transport endpoint is not connected".
      • 13 "Timed out, lcg_cr: Communication error on send".
      • 7 "connection fails or timeout".
      • 7 "CastorStagerInterface.c:2457 Device or resource busy".
      • 2 "lcg_cr: File exists".

  • nikhef:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 18 WRAPLCG STAGEOUT LCGCR errors, 18 RunTransform.log files.
      • 15 "FAILEDSRMSOUT".
      • 3 "send2nsd: NS000 - connect error: Connection timed out".
      • 1 "lcg_cr: File exists".

  • pic:
    • 5 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired" (job submitted 10/02, updated 14/02).
      • 2 "hit job shallow retry count".
      • 1 "aborted by user".
    • 15 WRAPLCG STAGEOUT LCGCR errors, 14 RunTransform.log files.
      • 6 "lcg_cr: Transport endpoint is not connected".
      • 3 "connection fails or timeout".
      • 2 "send2nsd: NS000 - connect error: Connection timed out".
      • 1 "Timed out, lcg_cr: Communication error on send".
      • 1 "send2nsd: NS000 - name server not available on xxxx".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ral:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "hit job shallow retry count".
      • 1 "aborted by user".

  • sara:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 1 WRAPLCG STAGEOUT LCGCR errors, 1 RunTransform.log file.
      • 1 "lcg_cr: Permission denied".

  • sinica:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "X509 proxy expired" (job submitted 10/02, updated 14/02).
    • 26 WRAPLCG STAGEOUT LCGCR errors, 26 RunTransform.log files.
      • 28 "lcg_cr: Transport endpoint is not connected".
      • 6 "FAILEDSRMSOUT".
      • 4 "send2nsd: NS000 - connect error: Connection timed out".
      • 1 "Timed out, lcg_cr: Communication error on send".
      • 1 "connection fails or timeout".

  • triumf:
    • 18 GLITE_WMS errors, 0 RunTransform.log files.
      • 15 "hit job shallow retry count".
      • 3 "X509 proxy expired" (job submitted 09-10/02, updated 13-14/02).
    • 3 WRAPLCG STAGEOUT LCGCR errors, 3 RunTransform.log files.
      • 2 "lcg_cr: Transport endpoint is not connected".
      • 1 "FAILEDSRMSOUT".

2007-02-13

  • ce101:
    • 25 GLITE_WMS errors, 0 RunTransform.log files.
      • 24 "parent DAG was aborted".
      • 1 "X509 proxy expired" (job submitted 09/02, updated 13/02).
    • 23 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 17 "FAILEDSRMSOUT".
      • 6 "lcg_cr: Transport endpoint is not connected".
      • 1 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce102:
    • 26 GLITE_WMS errors, 0 RunTransform.log files.
      • 26 "parent DAG was aborted".
    • 18 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 16 "FAILEDSRMSOUT".
      • 1 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce105:
    • 2 GLITE_WMS errors, 0 RunTransform.log files.
      • 1 "X509 proxy expired" (job submitted 10/02, updated 13/02).
      • 1 "parent DAG was aborted".
    • 21 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 16 "FAILEDSRMSOUT".
      • 6 "lcg_cr: Transport endpoint is not connected".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce106:
    • 48 GLITE_WMS errors, 9 RunTransform.log files.
      • 47 "removal retries exceeded".
      • 1 "hit job retry count".
      • 9 lcg-cp SRM Get request failed, but no errorMessage supplied: srm.cern.ch (today from 07h13 to 12h10). SAM tests have been failing only for VO ATLAS since this morning. GGUS ticket #18462, in progress.
      • 1 lcg-cp No such file or directory: tbn18.nikhef.nl (today 12h20). SAM tests OK for VO ATLAS.
      • 3 lcg-cr Protocol not supported: srm-disk.pic.es (today 10h01). SAM tests failed all morning for VO ATLAS and passed afterwards.
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce107:
    • 3 GLITE_WMS errors, 0 RunTransform.log files.
      • 2 "parent DAG was aborted".
      • 1 "hit job retry count".
    • 5 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 3 "CastorStagerInterface.c:2457 Device or resource busy".
      • 2 "lcg_cr: Transport endpoint is not connected".

  • gridka:
    • 181 GLITE_WMS errors, 1 RunTransform.log files.
      • 156 "X509 proxy expired": all the jobs were submitted between 09/02 and 10/02, and updated on 13/02.
      • 20 "hit job shallow retry count".
      • 3 "parent DAG was aborted".
      • 2 "request expired".
      • 1 proxy expired during the execution of the job on the WN.
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 1 "lcg_cr: Protocol not supported".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • in2p3:
    • 4 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "hit job retry count".
      • 1 "X509 proxy expired": submitted on 09/02, updated on 13/02.
    • 33 WRAPLCG STAGEOUT LCGCR errors, 1 RunTransform.log file.
      • 9 "CastorStagerInterface.c:2457 Device or resource busy".
      • 8 "connection fails or timeout".
      • 6 "Timed out, lcg_cr: Communication error on send".
      • 6 "lcg_cr: Transport endpoint is not connected".
      • 5 "lcg_cr: File exists".
      • 4 "lcg_cr: Permission denied".

  • nikhef:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.

  • pic:
    • 21 GLITE_WMS errors, 0 RunTransform.log files.
      • 7 "parent DAG was aborted".
      • 5 "hit job shallow retry count".
      • 4 "request expired".
      • 3 "aborted by user".
      • 1 "X509 proxy expired": submitted on 09/02, updated on 13/02.
      • 1 "hit job retry count".
    • 14 WRAPLCG STAGEOUT LCGCR errors, 1 RunTransform.log file.
      • 6 "CastorStagerInterface.c:2457 Device or resource busy".
      • 3 "FAILEDSRMSOUT".
      • 2 "lcg_cr: Transport endpoint is not connected".

  • ral:
    • 16 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "parent DAG was aborted".
      • 4 "hit job shallow retry count".
      • 1 "X509 proxy expired": submitted on 09/02, updated on 13/02.
      • 1 "request expired".
      • 1 "hit job retry count".
      • 1 "aborted by user".

  • sinica:
    • 0 GLITE_WMS errors, 0 RunTransform.log files.
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 2 "lcg_cr: Transport endpoint is not connected".
      • 1 "connection fails or timeout".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • triumf:
    • 6 GLITE_WMS errors, 0 RunTransform.log files.
      • 3 "hit job shallow retry count".
      • 2 "aborted by user".
      • 1 "hit job retry count".
    • 6 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 3 "FAILEDSRMSOUT".
      • 2 "connection fails or timeout".
      • 1 "lcg_cr: Transport endpoint is not connected".

2007-02-12

  • GRIDKA was down all day (maintenance).

  • ce101:
    • 49 GLITE_WMS errors, 0 RunTransform.log files.
      • 24 "X509 proxy expired".
      • 16 "aborted by user".
      • 8 "removal retries exceeded".
      • 1 "hit job shallow retry count".
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 1 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce102:
    • 16 GLITE_WMS errors, 0 RunTransform.log files.
      • 16 "removal retries exceeded".
    • 4 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 2 "lcg_cr: Communication error on send".
      • 2 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce105:
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 2 "lcg_cr: Communication error on send".

  • ce106:
    • 3 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 2 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • ce107:
    • 2 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 1 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

  • gridka:
    • 20 GLITE_WMS errors, 0 RunTransform.log files.
      • 19 "X509 proxy expired".
      • 1 "hit job retry count".
    • 19 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 19 "lcg_cr: Communication error on send".

  • in2p3:
    • 11 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "hit job retry count".
      • 3 "X509 proxy expired".
      • 3 "request expired".
    • 307 WRAPLCG STAGEOUT LCGCR errors, 2 RunTransform.log files.
      • 300 "lcg_cr: Communication error on send".
      • 4 "CastorStagerInterface.c:2457 Device or resource busy".
      • 3 "(Could not get virtual id: Internal error !, No user mapping, lcg_lr: Communication error on send) or (copyReplicate: ERROR Failed to get PFN for stored file)".

  • nikhef:
    • 8 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 8 "lcg_cr: Communication error on send".

  • pic:
    • 13 GLITE_WMS errors, 0 RunTransform.log files.
      • 5 "hit job shallow retry count".
      • 3 "X509 proxy expired".
      • 3 "aborted by user".
      • 1 "request expired".
      • 1 "removal retries exceeded".
    • 14 WRAPLCG STAGEOUT LCGCR errors, 1 RunTransform.log file.
      • 8 "lcg_cr: Communication error on send".
      • 4 "CastorStagerInterface.c:2457 Device or resource busy".
      • 2 "(Could not get virtual id: Internal error !, No user mapping, lcg_lr: Communication error on send) or (copyReplicate: ERROR Failed to get PFN for stored file)".

  • ral:
    • 36 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "removal retries exceeded".
      • 8 "hit job shallow retry count".
      • 7 "request expired".
      • 7 "hit job retry count".
      • 6 "X509 proxy expired".

  • sinica:
    • 1 WRAPLCG STAGEOUT LCGCR error, 0 RunTransform.log files.
      • 1 "lcg_cr: Communication error on send".

  • triumf:
    • 13 GLITE_WMS errors, 0 RunTransform.log files.
      • 8 "hit job shallow retry count".
      • 2 "removal retries exceeded".
      • 2 "aborted by user".
      • 1 "request expired".
    • 4 WRAPLCG STAGEOUT LCGCR errors, 0 RunTransform.log files.
      • 3 "lcg_cr: Communication error on send".
      • 1 "CastorStagerInterface.c:2457 Device or resource busy".

2007-02-10

  • gridka: 102 GLITE_WMS errors, 0 RunTransform.log files.
    • 88 "X509 proxy expired".
    • 14 "hit job retry count".

  • ce107: 30 GLITE_WMS errors, 3 RunTransform.log files.
    • 22 "removal retries exceeded".
    • 6 "hit job retry count".
    • 2 "X509 proxy expired".

  • ce106: 25 GLITE_WMS errors, 0 RunTransform.log files.
    • 18 "X509 proxy expired".
    • 7 "hit job retry count".

  • ral: 19 GLITE_WMS errors, 0 RunTransform.log files.
    • 7 "hit job shallow retry count".
    • 6 "X509 proxy expired".
    • 3 "request expired".
    • 2 "hit job retry count".
    • 1 "aborted by user".

  • ce105: 11 GLITE_WMS errors, 0 RunTransform.log files.
    • 6 "hit job retry count".
    • 5 "X509 proxy expired".

2007-02-09

  • High load on the BDIIs during the evening.

2007-02-08

  • Task 4904 has about 2000 jobs failing due to the infamous file-size mismatch error -> 15% efficiency (from the Atlas mailing list).
  • Alarm no_contact for ce105 (2007-02-07 from 08h30 to 09h30).
  • A "hole" in the CPU utilization of the three BDIIs bdii105, bdii108 and bdii112 between 17h00 and 18h00.

  • RAL: 71 GLITE_WMS errors, 21 RunTransform.log files.
    • Jobs submitted between 2007-02-05 and 2007-02-08.
    • 46 "job was stuck forever".
    • 13 "request expired".
    • 7 "hit job shallow retry count".
    • 5 "hit job retry count".

  • PIC: 69 GLITE_WMS errors, 43 RunTransform.log files.
    • 49 "job was stuck forever".
    • 12 "request expired".
    • 5 "hit job shallow retry count".
    • 3 "Aborted by user".

  • GRIDKA: 44 GLITE_WMS errors, 0 RunTransform.log files.
    • 39 "request expired".
    • 5 "hit job retry count".

  • ce101: 37 GLITE_WMS errors, 1 RunTransform.log file.
    • 32 "request expired".
    • 3 "hit job shallow retry count".
    • 1 "Aborted by user".
    • 1 "job was stuck forever".

  • ce102: 27 GLITE_WMS errors, 0 RunTransform.log files.
    • 26 "request expired".
    • 1 "hit job shallow retry count".

  • IN2P3: 24 GLITE_WMS errors, 1 RunTransform.log file.
    • 14 "job was stuck forever".
    • 10 "hit job retry count".

  • TRIUMF: 13 GLITE_WMS errors, 12 RunTransform.log files.
    • 12 "job was stuck forever".
    • 1 "hit job retry count".

2007-02-07

  • According to a mail sent by Rod Walker to the Atlas mailing list, all files of size 200 will be deleted from all the LFCs.
  • AFS problem at site FZK-LCG2 (see ggus ticket #18237). This triggered a SWMISS error on this site. Solved on 2007-02-07.

  • CNAF (ce01): 73 GLITE_WMS errors, 71 RunTransform.log files.
    • 51 "job was stuck forever": this indicates a gLite WMS problem. These jobs were submitted between 2007-01-24 and 2007-01-25.
    • 22 "Parent DAG was aborted": David Rebatto killed some jobs...
    • Some lcg-bdii timeouts (because the BDIIs were overloaded last week).
    • Some lcg-bdii connection refused (from WN wn-03-02-03-a.cr.cnaf.infn.it).
    • lcg-cr error (transport endpoint not connected) to dcache.gridpp.rl.ac.uk, gallows.dur.scotgrid.ac.uk, srm.cern.ch.
    • lcg-cr error (permission denied) to pc55.hep.ucl.ac.uk, epgse1.ph.bham.ac.uk, se1.pp.rhul.ac.uk, gw-3.ccc.ucl.ac.uk, dgc-grid-34.brunel.ac.uk.
    • lcg-cr error (CGSI-gSOAP: Error reading token data) to svr018.gla.scotgrid.ac.uk.
    • lcg-cr error (lifetime expired) to hepgrid5.ph.liv.ac.uk.
    • lcg-cr error (CGSI-gSOAP: Could not open connection ! lcg_cr: Connection refused) to sc4.triumf.ca.
    • lcg-cr error (No information found for SE) to se01.esc.qmul.ac.uk, fal-pygrid-20.lancs.ac.uk, t2se01.physics.ox.ac.uk.
    • lcg-cr error (No space left on device) to se01.esc.qmul.ac.uk.
    • lcg-cr error (SRM Get request failed, but no errorMessage supplied - Communication error on send) to srm.cern.ch.
    • lcg-cr error (CastorStagerInterface.c:2457 Device or resource busy (errno=0, serrno=0)) to srm.cern.ch.
    • lcg-cp error (transport endpoint not connected) to wormhole.westgrid.ca, castorsc.grid.sinica.edu.tw, atlasse.phys.sinica.edu.tw.
    • lcg-cp error (file not found : can't get pnfsId (not a pnfsfile)) to dcache01.tier2.hep.manchester.ac.uk.
    • lcg-cp error (Connection timed out) to strm.cern.ch.
    • lcg-cp error (No such file or directory) to grid-cert-03.roma1.infn.it.
    • Node atlas.web.cern.ch was not accessible when the WN tried to download the file http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py (25 January 2007, after midnight). This was due to the power cut at CERN during the night.
    • proxy expired (not enough time).
    • CONCLUSION: it seems that there were some problems with the SEs in the UK, mainly from 24 to 26 January. Almost all of these SEs are still failing the SAM tests:
      • pc55.hep.ucl.ac.uk
      • epgse1.ph.bham.ac.uk
      • se1.pp.rhul.ac.uk
      • gw-3.ccc.ucl.ac.uk
      • dgc-grid-34.brunel.ac.uk
      • sc4.triumf.ca
      • se01.esc.qmul.ac.uk
    • I contacted some admins at the UK sites and asked them to investigate what is wrong with their SEs.
  • FZK-LCG2 (gridka): 65 GLITE_WMS errors, 1 RunTransform.log file.
    • 39 "hit job retry count".
    • 24 "job was stuck forever".
    • 2 "request expired".
    • error KeyboardInterrupt.
  • ce105: 47 GLITE_WMS errors, 4 RunTransform.log files.
    • 20 "hit job retry count".
    • 19 "Removal retries exceeded".
    • 4 "job was stuck forever".
    • lcg-cp error (could not open connection) to srm.cern.ch. Seems to have been solved later this morning according to SAM.
    • lcg-cp error (SRM Get request failed, but no errorMessage supplied) to srm.cern.ch. Seems to have been solved later this morning according to SAM.
    • Some lcg-bdii timeouts this morning.
  • PIC: 41 GLITE_WMS errors, 6 RunTransform.log files.
    • 18 "job was stuck forever".
    • 11 "request expired".
    • lcg-cr error (connection refused yesterday evening) to srm-disk.pic.es and tbn18.nikhef.nl. There are some instabilities at PIC and NIKHEF according to SAM.
    • lcg-cp error (file size mismatch).
    • 1 "Setup file not found" (setup.sh).
  • IN2P3: 40 GLITE_WMS errors, 11 RunTransform.log files.
    • 21 "job was stuck forever".
    • 15 "hit job retry count".
    • 4 "Parent DAG was aborted".
    • lcg-cp error (could not open connection yesterday evening) to ccsrm.in2p3.fr.
    • lcg-cp error (connection refused yesterday evening) to srm-disk.pic.es.
    • lcg-cp error (file size mismatch).
    • lcg-cr error (connection reset by peer) to dcache.gridpp.rl.ac.uk.
    • lcg-cr error (transport endpoint is not connected) to dgc-grid-34.brunel.ac.uk.
    • lcg-cr error (permission denied) to epgse1.ph.bham.ac.uk, t2se01.physics.ox.ac.uk.
    • lcg-cr error (could not get virtual id: internal error yesterday evening) to sc.cr.cnaf.infn.it and t2-dpm-01.na.infn.it.
    • lcg-cr error (File exists) to grid-cert-03.roma1.infn.it.
  • IN2P3: 120 WRAPLCG STAGEOUT LCGCR errors this afternoon.
    • 86 'Could not get virtual id: Internal error !\n', 'No user mapping\n', 'lcg_lr: Communication error on send' or 'copyReplicate: ERROR Failed to get PFN for stored file'
    • 08 'Error file does not exist, cannot delete\n', '\n', 'lcg_cr: Transport endpoint is not connected'
    • 05 'Internal error\n', 'lcg_cr: Communication error on send'
    • 05 'Error file does not exist, cannot delete\n', '\n', 'lcg_cr: Transport endpoint is not connected' or 'the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Best pool too high'
    • 04 'lcg_cr: No space left on device'
    • 03 'BDII Connection Timeout: lcg-bdii.cern.ch:2170\n', 'lcg_cr: Connection timed out'
    • 03 'CastorStagerInterface.c:2457 Device or resource busy'
    • 01 'Too many events in file'

2007-02-06

  • CNAF (ce01): 1276 GLITE_WMS errors this morning, 1024 RunTransform.log files.
    • 1273 "job was stuck forever": this indicates a gLite WMS problem. These jobs were submitted between 2007-01-22 and 2007-01-26.
    • 3 "X509 proxy expired": these jobs were submitted on 2007-01-26. Seems to be ok.
    • Many failures during the execution of lcg-cr (permission denied, no space left on device) related to the SEs sc.cr.cnaf.infn.it, t2-dpm-01.na.infn.it and atlasse.lnf.infn.it. For the first two SEs, SAM shows that these machines are fine (at least since 02 February), but node atlasse.lnf.infn.it is still failing the SAM critical tests.
  • RAL-LCG2: 205 GLITE_WMS errors this morning, 17 RunTransform.log files.
    • 80 "request expired": all the jobs were submitted on 2007-02-05. Need to check the WMS log files.
    • 46 "X509 proxy expired": all the jobs were submitted on 2007-02-01. Perhaps we should look at the WMS log files at CNAF to see why it took so long. Seems to be ok anyway.
    • 58 "job was stuck forever": this indicates a gLite WMS problem, which is cured by manually aborting the job at the LB level. Simone asked me to ignore these jobs.
    • 10 "remote file size mismatch": discussion on the Atlas mailing list for this problem.
  • IN2P3: 87 GLITE_WMS errors this morning, 18 RunTransform.log files.
    • 40 "job proxy expired": all the jobs have been submitted from 2007-01-27 to 2007-02-01. Seems to be ok.
    • 24 "job was stuck forever": ignore these jobs.
    • 10 "request expired": need to take a look in the WMS logs.
    • 4 proxy expired during the execution of the node on the WN.
    • 1 lcg-cr error to srm-disk.pic.es (the server sent an error 500 500). Transient problem during the night. It has been fixed this morning according to SAM.
    • 8 "remote file size mismatch": discussion on the Atlas mailing list concerning this problem.
    • 1 timeout with lcg-bdii around 17h53. Nothing found in Lemon which could explain this problem (no BDII node overloaded).
  • IN2P3: 69 WRAPLCG STAGEOUT LCGCR errors this afternoon.
    • 38 'Could not get virtual id: Internal error !\n', 'No user mapping\n', 'lcg_lr: Communication error on send' or 'copyReplicate: ERROR Failed to get PFN for stored file'
    • 07 'Internal error\n', 'lcg_cr: Communication error on send'
    • 07 'lcg_cr: Permission denied'
    • 02 FAILEDSRMSOUT
    • 02 'lcg_cr: No space left on device'
    • 02 'BDII Connection Timeout: lcg-bdii.cern.ch:2170\n', 'lcg_cr: Connection timed out'
    • 02 'Error file does not exist, cannot delete\n', '\n', 'lcg_cr: Transport endpoint is not connected' or 'the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Best pool too high'
    • 01 'No information found for SE xxxx\n', 'lcg_cr: Invalid argument\n'
    • 01 'CastorStagerInterface.c:2457 Device or resource busy'
    • 01 'lcg_cr: Transport endpoint is not connected'
    • 01 'user has no permission to write into path'
    • 01 'Too many events in file'
    • 01 'lcg_cr: File exists'
  • PIC: 75 GLITE_WMS errors this morning, 20 RunTransform.log files.
    • 46 "job was stuck forever": ignore these jobs.
    • 15 "request expired": need to take a look in the WMS logs.
    • 11 "job proxy expired": all the jobs have been submitted on 2007-02-01. Seems to be ok.
    • 5 "remote file size mismatch": discussion on the Atlas mailing list concerning this problem.
    • 1 lcg-cp failure to dcache.gridpp.rl.ac.uk: transient problem during the night with this host (cf SAM).

2007-02-05

  • ce105 (CERN-PROD):
    • Problem trying to contact LFC node mu11.matrix.sara.nl, causing some timeout errors. There was indeed a problem with this machine according to SAM for this LFC node (VO Atlas). This should be fixed on 2007-02-05 in the morning according to SAM.
    • Some lfc-mkdir error messages ("...connect error : Timed out...") due to the connection timeout with mu11.matrix.sara.nl. Indeed, looking into the tarball associated with these jobs, we can see that LFC_HOST is set to mu11.matrix.sara.nl.
    • "Protocol not supported" when trying to lcg-cp a file to srm://dcache.gridpp.rl.ac.uk/pnfs. There was indeed a problem with this machine according to SAM for this dCache node (VO Atlas). This should be fixed on 2007-02-05 in the morning according to SAM.
    • Several jobs have an error "remote file size mismatch". I have sent an email to the Atlas mailing list.
    • "KeyboardInterrupt" message. According to the Atlas mailing list, this could be a kill from the batch system which, depending on the signal, can look like a keyboard interrupt. Status of these jobs:
  • FZK-LCG2 (GRIDKA):
    • Almost all the submitted jobs have their proxy expired. It is "normal" since these jobs were submitted on 28 or 29 January (> 5 days).
  • IN2P3-CC:
    • The majority of jobs have had their proxy expired. It is "normal" since these jobs were submitted between 28 and 31 January (> 5 days).
    • One job with error "remote file size mismatch".
  • RAL-LCG2:
    • One job with error "remote file size mismatch".
  • All the other sites have no log files available.
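
The "normal" proxy expirations above follow from simple date arithmetic: a job that has been in the system longer than the proxy lifetime is expected to fail this way. A minimal sketch, assuming the 5-day lifetime implied by the "> 5 days" remarks; `proxy_expiry_expected` and `PROXY_LIFETIME` are illustrative names, not part of any grid middleware.

```python
from datetime import date, timedelta

# Assumption: proxy lifetime of 5 days, taken from the "> 5 days" remarks.
PROXY_LIFETIME = timedelta(days=5)


def proxy_expiry_expected(submitted, checked, lifetime=PROXY_LIFETIME):
    """Return True if the job is older than the proxy lifetime."""
    return checked - submitted > lifetime


# A job submitted on 28 January and examined on 5 February is 8 days old.
print(proxy_expiry_expected(date(2007, 1, 28), date(2007, 2, 5)))
# A job submitted on 1 February is only 4 days old on 5 February.
print(proxy_expiry_expected(date(2007, 2, 1), date(2007, 2, 5)))
```

With whole dates only, a job that is exactly 5 days old is not flagged; full timestamps would be needed to decide that boundary case.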

2007-02-02

  • [INFO] Simone told me to ignore jobs with status "job was stuck forever" for all the sites. These jobs are aborted by him because there was a problem in the WMS.
  • AFS problem at site CERN-PROD (see ggus ticket #18077). This triggered some SWMISS errors on this site. In progress.
  • Some jobs failed at CERN-PROD:
    • "valid proxy expired" error. Oddly, the web page shows "GLITEWMS - job stuck forever" while the log file reads "WNCHECK_PROXY - valid time xx hours less than required yy hours". The two sources of information are therefore inconsistent.
    • Problem with the RunTransform.py script. Simone does not know what it means...
                2007-02-01 12:07:21,952 runPyJT     : INFO     Running TRF, log is: misal1_csc11.005100.JimmyWenu.digit.log.v12003107_tid004654._08291.job.log.3
                Traceback (most recent call last):
                  File "RunTransform.py", line 1699, in ?
                    rt.runPyJT()
                  File "RunTransform.py", line 1408, in runPyJT
                    (exitcode,output)=commands.getstatusoutput('./'+sname)
                  File "/usr/lib/python2.2/commands.py", line 54, in getstatusoutput
                    text = pipe.read()
                KeyboardInterrupt
  • A lot of jobs failed at FZK-LCG2 (GRIDKA) but almost all the log files are missing...
    • lcg-bdii connection timeouts when executing lcg-cp between yesterday and this morning.
    • Not sure whether some jobs are stuck forever because of missing md5sum information. It is not clear from the log files.
    • Error "checkLocalFile: ERROR Local:remote file size mismatch: 49356529:200" after a lcg-cp.
  • Some jobs failed at NIKHEF-ELPROD due to:
    • No space left on device for host tbn18.nikhef.nl. (see ggus #18106). Solved on 2007-02-06.
    • lcg-bdii connection timeouts when executing lcg-cp between yesterday and this morning.
    • copyReplicate: INFO 256 : the server sent an error response: 550 550 rfio write failure: Success. lcg_cr: Transport endpoint is not connected. Error related to partition full on tbn18.nikhef.nl ?
  • Some jobs failed at PIC due to:
    • Local:remote file size mismatch: 149990561:200 after a lcg-cp.
  • Some jobs failed at Taiwan-LCG2:
    • Not sure whether some jobs are stuck forever because of missing md5sum information. It is not clear from the log files.
    • lcg-bdii connection timeouts when executing lcg-cp between yesterday and this morning.
    • 256 an end-of-file was reached (destination: ccsrm.in2p3.fr/pnfs).
    • partition full on DPM node dpm0001.m45.ihep.su at RU-Protvino-IHEP. Mail sent to the site managers of this site (no ggus ticket opened).
  • No log files for the other sites.
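
The "Local:remote file size mismatch" messages quoted above carry both sizes, so they can be extracted and compared automatically, for instance to count how many failures involve a remote size of 200. A minimal sketch, assuming the message format shown in these entries; `parse_size_mismatch` is an illustrative helper, not a function of RunTransform.py or lcg-utils.

```python
import re

# Message format as quoted in this worklog: "<local size>:<remote size>".
MISMATCH = re.compile(r"Local:remote file size mismatch:\s*(\d+):(\d+)")


def parse_size_mismatch(message):
    """Return (local_size, remote_size), or None if the message differs."""
    match = MISMATCH.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


messages = [
    "checkLocalFile: ERROR Local:remote file size mismatch: 49356529:200",
    "Local:remote file size mismatch: 149990561:200",
]
sizes = [parse_size_mismatch(m) for m in messages]
# Number of failures where the remote copy has the suspicious size 200.
remote_200 = sum(1 for pair in sizes if pair is not None and pair[1] == 200)
print(sizes, remote_200)
```

The sizes are reported in whatever unit the tool uses; the sketch does not convert them.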

2007-02-01

  • Problem with NFS partition full at PIC (ggus ticket #18057). Solved on 2007-02-02.
  • Request to PIC site to register ifaesrm.pic.es node in GOC (ggus ticket #18057). Solved on 2007-02-02.
  • Several jobs submitted to the Taiwan-LCG2 site failed due to timeouts with lcg-bdii.cern.ch. Perhaps the host contacted behind this alias was bdii102, which has been heavily overloaded for the last 2 days. There was a high_load alarm on this machine yesterday (load was ~30...), so I restarted the service and the load was fine immediately afterwards.
  • 2 new top-level BDIIs were added behind the lcg-bdii alias. However, there was a problem with the load-balancing mechanism, which was configured to support only 6 BDIIs. Veronique updated CDB this morning and we now have 8 BDIIs behind the lcg-bdii alias. We should see fewer BDII timeout error messages now.

2007-01-29

  • AFS problem at site IN2P3 (see ggus ticket #17967). This triggered some SWMISS errors on this site. Solved on 2007-01-31.
  • AFS problem at site TRIUMF (see ggus ticket #17963). This triggered some SWMISS errors on this site. In progress.

Beginning of the debugging exercise

(David) Errors in WRAPLCG_STAGEIN_NOREPLICAS during the last 24H

Checked at 12:00h on Jan 24th 2007

  • IN2P3-CC
    • cclcgceli02.in2p3.fr (109 errors)
  • TRIUMF-LCG2
    • lcgce01.triumf.ca (73 errors)
  • CERN-PROD
    • ce105.cern.ch (34 errors)
    • ce101.cern.ch (10 errors)
    • ce106.cern.ch (10 errors)
    • ce107.cern.ch (5 errors)
  • PIC
    • ce04.pic.es (38 errors)
  • TAIWAN-LCG2
    • lcg00125.grid.sinica.edu.tw (17 errors)
  • FZK-LCG2
    • ce-fzk.gridka.de (13 errors)
  • INFN-T1
    • ce01-lcg.cr.cnaf.infn.it (11 errors)

Errors – last 24 h – checked 23/01/07 at ~17.00

EXELEXOR_GLITE_WMS:

  • 10 in TRIUMF (lcgce01.triumf.ca)
    • Removal retries exceeded
    • X509 proxy expired
    • LOOKS OK IN SAM
  • 118 at ce101.cern.ch:
    • X509 proxy expired
    • Aborted by user
    • LOOKS OK IN SAM
  • 90 at ce102.cern.ch:
    • X509 proxy expired
    • Aborted by user
    • LOOKS OK IN SAM
  • 55 at ce-fzk.gridka.de
    • X509 proxy expired
    • Aborted by user
    • LOOKS OK IN SAM
  • 52 at ce04.pic.es
    • Mostly X509 proxy expired
    • LOOKS OK IN SAM
  • 287 at ce01-lcg.cr.cnaf.infn.it
    • Many Removal retries exceeded.
    • Failing in SAM; from infn gridice monitoring, it looks as if the queues were drained last night
  • 117 at lcgce01.gridpp.rl.ac.uk
    • X509 proxy expired
    • Aborted by user
    • LOOKS OK IN SAM

WRAPLCG_WNCHECK_SWMISS:

  • 8 in TRIUMF (lcgce01.triumf.ca)
  • 5 in mu6.matrix.sara.nl

WRAPLCG_STAGEIN_LCGCP:

  • 59 in TRIUMF (lcgce01.triumf.ca)
  • All cern CEs have some failures
    • ce101.cern.ch: 16
    • ce102.cern.ch: 17
    • ce105.cern.ch: 21
    • ce106.cern.ch: 15
    • ce107.cern.ch: 7
  • 82 at ce04.pic.es
  • All CEs in in2p3 have some relevant failures
  • 380 at tbn20.nikhef.nl
  • 89 at heplnx206.pp.rl.ac.uk

WRAPLCG_STAGEOUT_LCGCR:

  • 170 in TRIUMF (lcgce01.triumf.ca)
  • All cern CEs have some failures, the highest is ce107.cern.ch with 63
  • 155 at ce04.pic.es
  • All in2p3 ces have failures
  • 490 at tbn20.nikhef.nl

WRAPLCG_STAGEIN_NOREPLICAS:

  • 81 in TRIUMF (lcgce01.triumf.ca)
    • All errors are similar and worth investigating:
    • ting file status from lfc returned: -1\n', '2007-01-22 20:08:27,693 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:29,218 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:29,325 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:31,106 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:32,896 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:34,503 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:36,073 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:37,937 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n', '2007-01-22 20:08:37,937 getLfcFileMetadata: ERROR No replicas found in any LFC\n', '2007-01-22 20:08:37,938 lcgCopy : ERROR No replicas found\n']
  • 77 at ce04.pic.es
    • Same errors
  • 95 at tbn20.nikhef.nl
    • Same errors
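
Since the same warning repeats many times before the final error, it can help to collapse the excerpt into counts per (function, level, message). A minimal sketch, assuming the "timestamp function: LEVEL message" layout visible in the excerpt above; `summarise` is an illustrative name and the three sample entries are copied from that excerpt.

```python
import re
from collections import Counter

# Assumed entry layout: "YYYY-MM-DD HH:MM:SS,mmm function: LEVEL message".
ENTRY = re.compile(
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s*: (\w+) (.*)"
)


def summarise(entries):
    """Count entries per (function, level, message), ignoring timestamps."""
    counts = Counter()
    for entry in entries:
        match = ENTRY.match(entry.strip())
        if match:
            counts[match.groups()] += 1
    return counts


entries = [
    "2007-01-22 20:08:36,073 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n",
    "2007-01-22 20:08:37,937 getMetadataFromLfc: WARNING Getting file status from lfc returned: -1\n",
    "2007-01-22 20:08:37,938 lcgCopy : ERROR No replicas found\n",
]
summary = summarise(entries)
for (function, level, message), count in summary.items():
    print(count, function, level, message)
```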

EXECG_GETOUT_EMPTYOUT:

  • 322 at ce01-lcg.cr.cnaf.infn.it
    • Failing in SAM; from infn gridice monitoring, it looks as if the queues were drained last night
  • 231 at lcgce01.gridpp.rl.ac.uk
    • Looks OK in SAM

SUMMARY: I have marked in bold the T1 with the highest error rate per type of test. Two T1s also appear in most of the error cases, Triumf and PIC; it might be worth starting with them.

-- Main.diana - 23 Jan 2007

Topic revision: r89 - 2007-04-16 - YvanCalas
 