Problems and mysteries

  • 32/64-bit confusion:
    • The same job works with Local submission, but LSF jobs fail. From the same account I can read the "missing" file!
    • According to Miguel, it is a 32/64-bit related problem. As a matter of fact, sourcing the LCG environment helps (apparently because LSF tries to send you to a 32-bit machine).
    • This is the dump of the problem. Obviously the file _omnipymodule.so exists...
...
Traceback (most recent call last):
  File "/afs/cern.ch/sw/arda/diane/install/2.0-beta5/bin/diane-worker-start", line 50, in ?
    import diane.WorkerAgent
  File "./python/diane/WorkerAgent.py", line 14, in ?
  File "/afs/cern.ch/sw/arda/diane/external/omniORB/4.1.0/slc4_ia32_gcc34/lib/python2.3/site-packages/omniORB/__init__.py", line 236, in ?
    import _omnipy
ImportError: /afs/cern.ch/sw/arda/diane/external/omniORB/4.1.0/slc4_ia32_gcc34/lib/python2.3/site-packages/_omnipymodule.so: cannot open shared object file: No such file or directory
--- GANGA APPLICATION ERROR END ---
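
One way to check the 32/64-bit hypothesis on a worker node is to compare the word size of the shared object with the word size of the Python interpreter that tries to import it. The sketch below is only an illustration and assumes a current Python 3 (the traceback above comes from Python 2.3); the helper names elf_bits, interpreter_bits and check_module are made up for this example. It reads the EI_CLASS byte of the ELF header (1 means 32-bit, 2 means 64-bit). Note that a plain "No such file or directory" from the loader can also come from a missing dependency of the module, which this check does not cover.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: 32, 2: 64}  # EI_CLASS byte: 1 = ELFCLASS32, 2 = ELFCLASS64


def elf_bits(path):
    """Return 32 or 64 for an ELF file, or None if the file is not recognisable ELF."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4])


def interpreter_bits():
    """Pointer size of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


def check_module(path):
    """Describe whether a shared object matches the interpreter word size."""
    bits = elf_bits(path)
    if bits is None:
        return "%s: not an ELF file" % path
    if bits == interpreter_bits():
        return "%s: %d-bit, matches interpreter" % (path, bits)
    return "%s: %d-bit, but interpreter is %d-bit" % (path, bits, interpreter_bits())
```

Running check_module on the full path of _omnipymodule.so from inside an LSF job and from a local shell should show whether the two environments disagree.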

  • File server instabilities
    • Clear signs of congestion:
2008-05-11 14:45:17,064 INFO downloading the file: hmc_su3_Nf2+1
2008-05-11 14:58:41,826 INFO downloading the file: hmc_su3_Nf2+1
2008-05-11 14:59:11,798 INFO downloading the file: input_Nf2+1.TEMPLATE
2008-05-11 15:01:49,431 INFO transferring /storage/lqcd/apps/output/./tmp/snap_0015_5.1885_1421_1297337599.4
2008-05-11 15:03:27,032 INFO downloading the file: dat/snap_0031_5.1905_1897
2008-05-11 15:04:27,451 INFO downloading the file: hmc_su3_Nf2+1
2008-05-11 15:05:54,080 INFO transferring /storage/lqcd/apps/output/./tmp/snap_0076_5.1860_1498_1287784870.68
2008-05-11 15:07:13,715 INFO downloading the file: hmc_su3_Nf2+1
2008-05-11 15:09:38,955 INFO downloading the file: hmc_su3_Nf2+1
*** CRASH around here: transfers (downloads) seem stuck (note the few TEMPLATE files being downloaded)
*** RESTART
2008-05-11 15:23:40,489 INFO Starting output server: lxb1420.cern.ch 22201 /storage/lqcd/apps/output
2008-05-11 15:25:41,927 INFO downloading the file: hmc_su3_Nf2+1
2008-05-11 15:25:50,334 INFO downloading the file: input_Nf2+1.TEMPLATE
2008-05-11 15:25:50,676 INFO downloading the file: dat/snap_0006_5.1880_742
**** Here (after restart) it looks OK, 10' to download the hmc file, less than 1' for the TEMPLATE etc...
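
To make the congestion easier to spot than by eye, the gaps between consecutive log entries can be computed automatically. This is a minimal sketch, assuming the log format shown above (timestamp with milliseconds after a comma, at the start of the line); the function names parse_timestamp and gaps are made up for this example. Since entries from several workers are interleaved, a gap measures how quiet the file server log was, not the duration of one particular transfer.

```python
from datetime import datetime

TIMESTAMP_LENGTH = 23  # e.g. "2008-05-11 14:45:17,064"


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None for other lines."""
    try:
        return datetime.strptime(line[:TIMESTAMP_LENGTH], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None


def gaps(lines):
    """Yield (seconds_since_previous_entry, line) for each timestamped line after the first."""
    previous = None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue  # skip annotations such as "*** RESTART"
        if previous is not None:
            yield (stamp - previous).total_seconds(), line
        previous = stamp
```

Feeding the log file line by line to gaps and printing only the entries above some threshold (a few minutes, say) would single out the slow periods before the crash.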


==> I would suggest, as a temporary measure, downloading the executable and the TEMPLATE with wget. It looks like avoiding multiple uploads of fort.2 is becoming urgent.
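
As an illustration of the temporary measure, the sketch below fetches a file over HTTP only when no local copy exists yet, so that a worker running several tasks on the same node does not ask the server for the executable every time. It uses Python's urllib instead of wget, and it is only a sketch: the function name fetch_once is made up, and the URL where the executable and the TEMPLATE would be published is an assumption, not something that exists today.

```python
import os
import shutil
import urllib.request


def fetch_once(url, destination, timeout=60):
    """Download url to destination unless a non-empty copy is already there.

    Returns True if a download happened, False if the existing copy was reused.
    """
    if os.path.exists(destination) and os.path.getsize(destination) > 0:
        return False
    partial = destination + ".part"
    # Write to a temporary name first so an interrupted transfer
    # never leaves a truncated file under the final name.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
    os.replace(partial, destination)
    return True
```

A worker would call fetch_once once for hmc_su3_Nf2+1 and once for input_Nf2+1.TEMPLATE before starting its first task, with whatever web server URL ends up hosting the two files.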

  • Worker instabilities
    • Sometimes all workers die one after another (time-outs). I suspect the workers are blocked in the file server (either waiting for input or trying to save their output).
  • Master failures
    • Only one so far. I suspect it is a side effect of the problems above.

-- MassimoLamanna - 11 May 2008

Topic revision: r1 - 2008-05-11 - MassimoLamanna