
SU3 QCD application

Second round (Diane 2.0 - April 2008)

Analysis of logs

This is information from 2010 (LQCD paper preparation). It is incomplete.

[lxarda29] /data/lqcd2008/backup_factory_logs/lostman.gangadir

Unzip output files:

find agent_factory -name stdout.gz -exec gunzip {} \;
find agent_factory -name stderr.gz -exec gunzip {} \;
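The two commands above can also be run as a single pass over the directory tree. This is a minimal sketch, assuming a POSIX `find` and `gunzip`; the function name `unzip_logs` is hypothetical and not part of the original procedure.

```shell
# unzip_logs DIR: decompress every stdout.gz and stderr.gz found below DIR.
# Hypothetical helper; it has the same effect as the two find commands above.
unzip_logs() {
    find "$1" \( -name stdout.gz -o -name stderr.gz \) -exec gunzip {} +
}

# Usage (from the gangadir that contains agent_factory):
# unzip_logs agent_factory
```

Using `-exec ... {} +` passes many files to each `gunzip` call instead of starting one process per file.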

Number of workers: 249

find agent_factory -name stderr | wc -l

Show all logs

find agent_factory -name stderr -exec cat {} \; | less

Types of problems:

Type 1: 126 times DIANE_CORBA.XFileTransferError(message="[Errno 2] No such file or directory: '/data/lqcd/apps/output/hmc_su3_newmuI.amd_opteron'")

find agent_factory -name stderr -exec grep -l amd {} \; | wc -l

Type 2: 51 times This program was not built to run on the processor in your system

find agent_factory -name stderr -exec grep -l processor {} \; | wc -l

Type 3: 23 times error executing... sh: line 1: 14351 Killed                  ./hmc_su3_newmuI <parameters

find agent_factory -name stderr -exec grep -l 'error executing' {} \; > processed3

Type 4: 12 times communication error

    return _omnipy.invoke(self, "download", _0_DIANE_CORBA.FileTransferServer._d_download, args)

find agent_factory -name stderr -exec grep -l 'omniORB.UNKNOWN_PythonException' {} \; > processed4

Type 5: 15 times python version(?) problem: ImportError: No module named logging

find agent_factory -name stderr -exec grep -l 'ImportError: No module named logging' {} \; > processed5

Type 6: 6 times ERROR: could not download the file : omniORB-4.1.2-slc3_gcc323-diane.tar.gz

find agent_factory -name stderr -exec grep -l 'GANGA_VERSION' {} \; > processed6

Type 7: 11 times OSError: [Errno 13] Permission denied: '/home/atlas157/diane/submitters'

find agent_factory -name stderr -exec grep -l 'Permission denied' {} \; > processed7

Remaining errors: 5

  1. error with file system permissions
  2. http download failure
  3. SOCK problem with http download
  4. MAINTENANCE file download problem
  5. OS type problems (Suse, Debian)
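The per-type counts above can be reproduced with one helper, and the logs that match none of the patterns (the remaining errors) can be listed with another. This is a rough sketch: the patterns are the grep strings used above, while the function names `count_matching` and `list_unclassified` are hypothetical. It uses `grep -F` (fixed strings) rather than the plain `grep` of the original commands. A log that matches several patterns is counted once per pattern, so the counts add up to the number of workers only if the categories do not overlap.

```shell
# count_matching DIR PATTERN: number of stderr files below DIR that contain PATTERN.
count_matching() {
    find "$1" -name stderr -exec grep -l -F -e "$2" {} + | wc -l | tr -d ' '
}

# list_unclassified DIR PATTERN...: print the stderr files below DIR
# that contain none of the given patterns.
list_unclassified() {
    dir=$1; shift
    find "$dir" -name stderr | while IFS= read -r f; do
        hit=no
        for p in "$@"; do
            if grep -q -F -e "$p" "$f"; then hit=yes; break; fi
        done
        if [ "$hit" = no ]; then printf '%s\n' "$f"; fi
    done
}

# Usage:
# for p in 'amd' 'processor' 'error executing' 'omniORB.UNKNOWN_PythonException' \
#          'ImportError: No module named logging' 'GANGA_VERSION' 'Permission denied'; do
#     printf '%s\t%s\n' "$(count_matching agent_factory "$p")" "$p"
# done
# list_unclassified agent_factory 'amd' 'processor' 'error executing' \
#     'omniORB.UNKNOWN_PythonException' 'ImportError: No module named logging' \
#     'GANGA_VERSION' 'Permission denied'
```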

Production 2008

Monitoring crontabs

An acron job on lxplus publishes the live plots on the wiki page, and a cron job on each master server (lxarda28) collects the data for the plots and performs cleanups.

Check currently defined acrons and crons:

  • acrontab -l
  • crontab -l

and compare them with:

  • /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • /afs/

Install missing crontabs:

  • master server: crontab /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • lxplus (acrontab for web publishing): acrontab < /afs/
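The comparison between an installed table and its reference file can be done with `diff`. This is a minimal sketch; the function name `crontab_matches` is hypothetical. It reads the installed table from standard input, so the same helper works with the output of both `crontab -l` and `acrontab -l`.

```shell
# crontab_matches REFERENCE: compare the table read from stdin with REFERENCE.
# Prints the differences, if any, and returns 0 only when the two are identical.
crontab_matches() {
    diff - "$1"
}

# Usage on the master server:
# crontab -l | crontab_matches /storage/lqcd/apps/output/monitoring-and-cleanup.crontab \
#     || echo "crontab differs from the reference"
```

Some cron implementations add header comments to the `crontab -l` output; if so, those lines show up as differences and can be ignored.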

How to start a master (and a file server). No set up

NOTE: rev 40 is the last revision of this doc that worked for old lxb1420 machine (new versions should be compatible but...)

  • Log in to lxarda28 (old servers: lxb7232 lxb1420)
  • Main sequence of commands:


# setup diane
source /storage/lqcd/

# go to the output area
cd /storage/lqcd/apps/output

# run the file server

# run master using an open port


# install cron to monitor the server processes (connections, cpu, memory etc)
# if you restart the servers then remove the old cron
crontab -e
#* * * * * /storage/lqcd/apps/output/servermon PID_MASTER PID_FILESERVER

# you may get the PIDs using this command
netstat -lp | grep python

  • diane-master-ping verbosity [INFO|DEBUG] to switch the debugging level
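The PIDs needed for the servermon cron line can be pulled out of the `netstat -lp | grep python` output automatically. This is a sketch under the assumption that netstat prints a `PID/Program name` column (as the Linux net-tools version does); the function name `python_listener_pids` is hypothetical. It does not tell the master from the file server, so the two PIDs still have to be assigned by hand.

```shell
# python_listener_pids: read `netstat -lp` output on stdin and print the PID of
# every listening python process, one per line, sorted and without duplicates.
python_listener_pids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\/python/) { split($i, a, "/"); print a[1] }
    }' | sort -un
}

# Usage:
# netstat -lp 2>/dev/null | python_listener_pids
```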

How to submit more...

NOTE: rev 41 is the last revision of this doc before the submit_more command was introduced

cd /afs/
./submit_more config-kuba.gear lxb1420 NUMBER      [opts: --delay 10, --CE ce]

The name of the configuration file must be config-PROXYNAME and must match the proxy file p/PROXYNAME may be substituted by any submitter script ( etc. If the submitter script is not specified then ganga is run interactively.

Submitting using AgentFactory

The procedure below uses $HOME/proxy to store the long-lived proxy and the script; if you prefer a different location, change the scripts accordingly.

(i) create the long-lived proxy (this will create a proxy valid for a month) and myproxy

mkdir $HOME/proxy
cd $HOME/proxy

(ii) add the script to acrontab:

acrontab -e

add the following line:

30 10,22 * * * lxarda28 /afs/

(iii) run Agent Factory (preferably in a screen session):

cd /afs/

where CONFIG_FILE is your configuration file and WORKER_NUMBER is the number of workers you want to have.


  • !!! diane-master-ping kill should ask for confirmation
  • Is this line (master log) correct?
    2008-05-02 09:03:34,311 INFO: config.main.cache = ~/dianedir/cache>

Fake snapshots for testing

cd /storage/lqcd/apps/output
cp /storage/lqcd/apps/LatticeQCD2/test/* .
python /storage/lqcd/apps/LatticeQCD2/

General talks on the subject

Old links

-- JakubMoscicki - 26 Jun 2007

Topic revision: r48 - 2010-04-12 - JakubMoscicki