SU3 QCD application

Second round (Diane 2.0 - April 2008)

Analysis of logs

This is information from 2010 (LQCD paper preparation). It is incomplete.

We analyze the reasons for failures using the failure_log of the agent factory. This information is incomplete: we have only a small snapshot.

Here is the procedure: we extract the ids of each type of failure into the processedNN files, where NN is the failure type number. Then we create a processed directory into which the logs of each type are moved (to filter these logs out of the main failure_log directory on each pass). To filter them out we run: python processedNN
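The filtering step above can be sketched as follows. This is a minimal illustration, not the actual CVS helper (which lives in LatticeQCD2/analysis_tools); the function names `read_ids` and `move_processed` are assumptions.

```python
import os
import shutil

def read_ids(processed_file):
    """Return the log ids (one per line) listed in a processedNN file."""
    with open(processed_file) as f:
        return [line.strip() for line in f if line.strip()]

def move_processed(processed_file, log_dir="failure_log", dest="processed"):
    """Move the logs of an already-classified failure type out of the
    main failure_log directory so that later passes skip them."""
    os.makedirs(dest, exist_ok=True)
    for log_id in read_ids(processed_file):
        src = os.path.join(log_dir, log_id)
        if os.path.exists(src):
            shutil.move(src, os.path.join(dest, log_id))
```

Each processedNN file is consumed once; after the move, a fresh scan of failure_log only sees the not-yet-classified logs.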

More details on CEs and SITES are provided by this script: python processedNN

The results are committed to CVS files (for the Atlas sample): RUN_41_AF_fail_logs_*

We store all ids of processed logs in all_processed file: ls processed > all_processed

A cross-type analysis is a special case where we run: python all_processed. It gives the counts of all types of problems (as described below, according to their ids stored in the processedNN files)
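The cross-type counting can be sketched like this (an illustration only; the actual logic is in the CVS helper, and `cross_type_summary` is a hypothetical name). It also flags ids that appear in more than one processedNN list, which is useful for sanity-checking the classification.

```python
def cross_type_summary(ids_by_type):
    """ids_by_type: dict mapping type number -> iterable of log ids
    (as read from the processedNN files).
    Returns (counts per type, ids appearing in more than one type)."""
    counts = {t: len(set(ids)) for t, ids in ids_by_type.items()}
    seen = {}
    for t, ids in ids_by_type.items():
        for log_id in set(ids):
            seen.setdefault(log_id, []).append(t)
    overlaps = {i: ts for i, ts in seen.items() if len(ts) > 1}
    return counts, overlaps
```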

The py helper scripts are in CVS: LatticeQCD2/analysis_tools

Small backup sample

[lxarda29] /data/lqcd2008/backup_factory_logs/lostman.gangadir

Unzip output files:

find agent_factory -name stdout.gz -exec gunzip {} \;
find agent_factory -name stderr.gz -exec gunzip {} \;

Number of workers: 249

find agent_factory -name stderr -exec echo {} \; | wc

Show all logs

find agent_factory -name stderr -exec cat {} \; | less

Types of problems:

Type 1:

126 times DIANE_CORBA.XFileTransferError(message="[Errno 2] No such file or directory: '/data/lqcd/apps/output/hmc_su3_newmuI.amd_opteron'")

find agent_factory -name stderr -exec grep -l amd {} \; | wc

Type 2: 51 times This program was not built to run on the processor in your system

find agent_factory -name stderr -exec grep -l processor {} \; | wc

Type 3: 23 times error executing... sh: line 1: 14351 Killed                  ./hmc_su3_newmuI <parameters

find agent_factory -name stderr -exec grep -l 'error executing' {} \; > processed3

Type 4: 12 times communication error

    return _omnipy.invoke(self, "download", _0_DIANE_CORBA.FileTransferServer._d_download, args)

find agent_factory -name stderr -exec grep -l 'omniORB.UNKNOWN_PythonException' {} \; > processed4

Type 5: 15 times python version(?) problem: ImportError: No module named logging

find agent_factory -name stderr -exec grep -l 'ImportError: No module named logging' {} \; > processed5

Type 6: 6 times ERROR: could not download the file : omniORB-4.1.2-slc3_gcc323-diane.tar.gz

find agent_factory -name stderr -exec grep -l 'GANGA_VERSION' {} \; > processed6

Type 7: 11 times OSError: [Errno 13] Permission denied: '/home/atlas157/diane/submitters'

find agent_factory -name stderr -exec grep -l 'Permission denied' {} \; > processed7
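The per-type greps above can be folded into a single pass over the stderr files. This is a sketch, not one of the CVS helpers; the substrings are exactly those used in the find/grep commands for types 1-7.

```python
# Marker substrings taken from the grep commands above, one per failure type.
PATTERNS = {
    1: "amd",
    2: "processor",
    3: "error executing",
    4: "omniORB.UNKNOWN_PythonException",
    5: "ImportError: No module named logging",
    6: "GANGA_VERSION",
    7: "Permission denied",
}

def classify(stderr_text):
    """Return the sorted list of failure types whose marker appears in the text."""
    return sorted(t for t, p in PATTERNS.items() if p in stderr_text)
```

A log matching more than one pattern would be reported under several types, which is exactly the kind of overlap the cross-type analysis is meant to catch.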

Remaining errors: 5

  1. error with file system permissions
  2. http download failure
  3. SOCK problem with http download
  4. MAINTENANCE file download problem
  5. OS type problems (Suse, Debian)

Large Atlas sample

[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir

It corresponds to a subset of run 41 (first part of RUN4 in LQCD paper), from 27.08.2008 to 18.09.2008. The logs are available in: /data/lqcd2008/apps/output/logs-run4-20081028.tgz

Worker agent numbers run from 1 to 12069. Total number of invalid worker agents reported from the AF: 4163-801(success)=*3354*. It coincides quite well with the number 3704 reported in the master log in that period (output available by running the script on RUN_41_vcard_summary_INVALID_WORKERS.dat)

Number of invalid workers: 1379 with stderr available.

Type1: 186

Type2: 41

Type3: 410

Type4: 0

Type5: 87

Type6: 8

Type7: 2

Type8: 398

Type9: 180

Type8 + Type9: 578 times a CORBA problem with worker registration

find agent_factory -name stderr -exec grep -l 'return _omnipy.invoke(self, "registerWorker", _0_DIANE_CORBA.RunMaster._d_registerWorker, args)' {} \; |wc 

Type 8: 398 communication problem

find agent_factory -name stderr -exec grep -l 'TRANSIENT: CORBA.TRANSIENT(omniORB.TRANSIENT_ConnectFailed, CORBA.COMPLETED_NO)' {} \; > processed8

Type 9: 180 communication problem (timeout)

find agent_factory -name stderr -exec grep -l 'TRANSIENT: CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_NO)' {} \; > processed9

Type 10: 59 times http socket error (IOError: [Errno socket error] (110, 'Connection timed out'))

find agent_factory -name stderr -exec grep -l 'socket error' {} \; | wc

Other types, 8 times: including the Debian platform and strange socket errors

The remaining 2784 invalid workers do not have stderr available:

[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print | grep success | wc
    801    5607   69670
[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print| wc
   2782   22736  253479
[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print | grep 'Retry' | wc
    588    4704   48165
[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print | grep 'cannot' | wc
    282    2824   41041
[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print | grep 'app. exit' | wc
    715    6441   62664
[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir > grep reason agent_factory/failure_log/*/full_print | grep 'proxy' | wc
    398    3184   32155

Production 2008

Monitoring crontabs

There is an acron job on lxplus for publishing the live plots on the wiki page and a cron job on each master server (lxarda28) for collecting the data for the plots and doing cleanups.

Check currently defined acrons and crons:

  • acrontab -l
  • crontab -l

and compare them with:

  • /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • /afs/

Install missing crontabs.

  • master server: crontab /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • lxplus (acrontab for web publishing): acrontab < /afs/

How to start a master (and a file server). No setup required.

NOTE: rev 40 is the last revision of this doc that worked for the old lxb1420 machine (new versions should be compatible but...)

  • Log in to lxarda28 (old servers: lxb7232 lxb1420)
  • Main sequence of commands:


# setup diane
source /storage/lqcd/

# go to the output area
cd /storage/lqcd/apps/output

# run the file server

# run master using an open port


# install cron to monitor the server processes (connections, cpu, memory etc)
# if you restart the servers then remove the old cron
crontab -e
#* * * * * /storage/lqcd/apps/output/servermon PID_MASTER PID_FILESERVER

# you may get the PIDs using this command
netstat -lp | grep python
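The servermon script installed by the crontab line above presumably checks that the master and file-server processes are still alive. A minimal sketch of such a liveness check (an assumption, not the actual servermon script), using signal 0, which probes a PID without sending a real signal:

```python
import os

def is_alive(pid):
    """Return True if a process with this pid exists (even if owned by
    another user), False otherwise."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, no signal delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

The PIDs to monitor are the two arguments passed to servermon in the crontab line (obtainable via the netstat command above).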

  • diane-master-ping verbosity [INFO|DEBUG] to switch the debugging level

How to submit more...

NOTE: rev 41 is the last revision of this doc before the submit_more command was introduced

cd /afs/
./submit_more config-kuba.gear lxb1420 NUMBER      [opts: --delay 10, --CE ce]

The name of the configuration file must be config-PROXYNAME and must match the proxy file p/PROXYNAME. It may be substituted by any submitter script (etc.). If the submitter script is not specified then ganga is run interactively.

Submitting using AgentFactory

The procedure below uses $HOME/proxy to store the long proxy and script; if you prefer a different location, please make the appropriate changes to the scripts!

(i) create the long-lived proxy (this will create a proxy valid for a month) and myproxy

mkdir $HOME/proxy
cd $HOME/proxy

(ii) add the script acrontab:

acrontab -e

add the following line:

30 10,22 * * * lxarda28 /afs/

(iii) run Agent Factory (preferably on screen):

cd /afs/

where CONFIG_FILE is your configuration file and WORKER_NUMBER is the number of workers you want to have.


  • !!! diane-master-ping kill should ask for confirmation
  • Is this line (master log) correct?
    <verbatim>2008-05-02 09:03:34,311 INFO: config.main.cache = ~/dianedir/cache</verbatim>

Fake snapshots for testing

cd /storage/lqcd/apps/output
cp /storage/lqcd/apps/LatticeQCD2/test/* .
python /storage/lqcd/apps/LatticeQCD2/

General talks on the subject

Old links

-- JakubMoscicki - 26 Jun 2007

Topic revision: r52 - 2010-04-14 - JakubMoscicki