
SU3 QCD application

Second round (Diane 2.0 - April 2008)

Analysis of logs

This is information from 2010 (LQCD paper preparation). It is incomplete.

Small backup sample

[lxarda29] /data/lqcd2008/backup_factory_logs/lostman.gangadir

Unzip output files:

find agent_factory -name stdout.gz -exec gunzip {} \;
find agent_factory -name stderr.gz -exec gunzip {} \;

Number of workers: 249

find agent_factory -name stderr -exec echo {} \; | wc

Show all logs

find agent_factory -name stderr -exec cat {} \; | less

Types of problems:

Type 1:

126 times DIANE_CORBA.XFileTransferError(message="[Errno 2] No such file or directory: '/data/lqcd/apps/output/hmc_su3_newmuI.amd_opteron'")

find agent_factory -name stderr -exec grep -l amd {} \; | wc

Type 2: 51 times This program was not built to run on the processor in your system

find agent_factory -name stderr -exec grep -l processor {} \; | wc

Type 3: 23 times error executing... sh: line 1: 14351 Killed                  ./hmc_su3_newmuI <parameters

find agent_factory -name stderr -exec grep -l 'error executing' {} \; > processed3

Type 4: 12 times communication error

    return _omnipy.invoke(self, "download", _0_DIANE_CORBA.FileTransferServer._d_download, args)
UNKNOWN: CORBA.UNKNOWN(omniORB.UNKNOWN_PythonException, CORBA.COMPLETED_MAYBE)

find agent_factory -name stderr -exec grep -l 'omniORB.UNKNOWN_PythonException' {} \; > processed4

Type 5: 15 times python version(?) problem: ImportError: No module named logging

find agent_factory -name stderr -exec grep -l 'ImportError: No module named logging' {} \; > processed5

Type 6: 6 times ERROR: could not download the file : omniORB-4.1.2-slc3_gcc323-diane.tar.gz

find agent_factory -name stderr -exec grep -l 'GANGA_VERSION' {} \; > processed6

Type 7: 11 times OSError: [Errno 13] Permission denied: '/home/atlas157/diane/submitters'

find agent_factory -name stderr -exec grep -l 'Permission denied' {} \; > processed7
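
The per-type counts above can also be collected in one pass. The following is a minimal sketch, not the procedure used for the original analysis: the function names count_logs_matching and tally_patterns are hypothetical, and the patterns are matched as fixed strings (grep -F) rather than as regular expressions.

```shell
# Hypothetical helper: count stderr files under DIR that contain PATTERN.
# PATTERN is matched as a fixed string (grep -F), not as a regular expression.
count_logs_matching() {
    dir=$1
    pattern=$2
    find "$dir" -name stderr -exec grep -lF -e "$pattern" {} + | wc -l | tr -d ' '
}

# Read one pattern per line from standard input and print
# "count<TAB>pattern" for each of them.
tally_patterns() {
    dir=$1
    while IFS= read -r pattern; do
        [ -n "$pattern" ] || continue
        printf '%s\t%s\n' "$(count_logs_matching "$dir" "$pattern")" "$pattern"
    done
}
```

For example, feeding the lines amd, processor and error executing on standard input to tally_patterns agent_factory prints one count per pattern. A log that matches several patterns is counted once for each of them.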

Remaining errors: 5

  1. error with file system permissions
  2. http download failure
  3. SOCK problem with http download
  4. MAINTENANCE file download problem
  5. OS type problems (Suse, Debian)
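
The commands above write the matching log paths to files named processed3 to processed7. Assuming those lists use the same relative paths that find prints, the logs not yet assigned to any type can be listed as the difference between all stderr files and the lists. This is only a sketch; the function name list_unprocessed is hypothetical, and since Types 1 and 2 are only counted above, lists for them would have to be produced in the same way first.

```shell
# Hypothetical helper: print the stderr files under DIR that appear in none of
# the list files given as further arguments (e.g. processed3 ... processed7).
list_unprocessed() {
    dir=$1
    shift
    all=$(mktemp) || return 1
    seen=$(mktemp) || { rm -f "$all"; return 1; }
    find "$dir" -name stderr | sort > "$all"
    # Merge every list into one sorted, duplicate-free file.
    # /dev/null keeps cat from reading standard input when no list is given.
    cat "$@" /dev/null | sort -u > "$seen"
    # comm -23 keeps the lines that occur only in the first file.
    comm -23 "$all" "$seen"
    rm -f "$all" "$seen"
}
```

Run from the directory that contains agent_factory, for example as list_unprocessed agent_factory processed3 processed4 processed5 processed6 processed7.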

Large Atlas sample

[lxarda29] /data/lqcd2008/apps/output/lostman_atlas.gangadir

Number of workers: 1379

Type1: 186

Type2: 41

Type3: 410

Type4: 0

Type5: 87

Type6: 8

Type7: 2

Type8: 398

Type9: 180

Type8 + Type9: 578 times a CORBA problem with worker registration

find agent_factory -name stderr -exec grep -l 'return _omnipy.invoke(self, "registerWorker", _0_DIANE_CORBA.RunMaster._d_registerWorker, args)' {} \; |wc 

Type 8: 398 communication problem

find agent_factory -name stderr -exec grep -l 'TRANSIENT: CORBA.TRANSIENT(omniORB.TRANSIENT_ConnectFailed, CORBA.COMPLETED_NO)' {} \; > processed8

Type 9: 180 communication problem (timeout)

find agent_factory -name stderr -exec grep -l 'TRANSIENT: CORBA.TRANSIENT(omniORB.TRANSIENT_CallTimedout, CORBA.COMPLETED_NO)' {} \; > processed9

Type 10: 59 times http socket error (IOError: [Errno socket error] (110, 'Connection timed out'))

find agent_factory -name stderr -exec grep -l 'socket error' {} \; | wc

Other types: 8 times, including the Debian platform and strange socket errors
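
As a consistency check, the counts listed for this sample add up to the number of workers, which suggests that each log was assigned to exactly one type. The arithmetic below uses only the numbers quoted above:

```shell
# Counts of the large Atlas sample in the order listed above:
# Types 1 to 9, then the http socket errors (59) and the other types (8).
total=0
for n in 186 41 410 0 87 8 2 398 180 59 8; do
    total=$((total + n))
done
echo "$total"   # prints 1379, the number of workers
```

The same holds for the small backup sample: 126 + 51 + 23 + 12 + 15 + 6 + 11 + 5 = 249.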

Production 2008

Monitoring crontabs

An acron job on lxplus publishes the live plots on the wiki page, and a cron job on each master server (lxarda28) collects the data for the plots and performs cleanups.

Check currently defined acrons and crons:

  • acrontab -l
  • crontab -l

and compare them with:

  • /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • /afs/cern.ch/sw/arda/install/su3/2009/plots/web-monitoring.acrontab

Install missing crontabs:

  • master server: crontab /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
  • lxplus (acrontab for web publishing): acrontab < /afs/cern.ch/sw/arda/install/su3/2009/plots/web-monitoring.acrontab
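
The comparison can be done with diff. A minimal sketch, assuming that crontab -l and acrontab -l print exactly the installed table (some cron implementations add header comments, which would show up as differences); the function name check_installed_table is hypothetical:

```shell
# Hypothetical helper: read the currently installed table from standard input
# and compare it with the reference file REF. Returns 0 when both are
# identical; otherwise prints the differences and returns non-zero.
check_installed_table() {
    ref=$1
    diff - "$ref"
}

# Intended use, with the reference files named above:
#   crontab -l  | check_installed_table /storage/lqcd/apps/output/monitoring-and-cleanup.crontab
#   acrontab -l | check_installed_table /afs/cern.ch/sw/arda/install/su3/2009/plots/web-monitoring.acrontab
```

No output means the installed table already matches the reference file.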

How to start a master (and a file server). No set up

NOTE: rev 40 is the last revision of this doc that worked for old lxb1420 machine (new versions should be compatible but...)

  • Log in to lxarda28 (old servers: lxb7232 lxb1420)
  • Main sequence of commands:

bash

# setup diane
source /storage/lqcd/env.sh

# go to the output area
cd /storage/lqcd/apps/output

# run the file server
./start_file_server

# run master using an open port
./start_master_server

# OPTIONAL:

# install cron to monitor the server processes (connections, cpu, memory etc)
# if you restart the servers then remove the old cron
crontab -e
#* * * * * /storage/lqcd/apps/output/servermon PID_MASTER PID_FILESERVER

# you may get the PIDs using this command
netstat -lp | grep python

  • diane-master-ping verbosity [INFO|DEBUG] to switch the debugging level
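
The optional servermon cron entry needs the two PIDs filled in by hand. The sketch below formats the entry from two PIDs, following the commented template in the command sequence; the function name servermon_cron_line is hypothetical, and the PIDs still have to be looked up first, for example with the netstat command shown there.

```shell
# Hypothetical helper: print the every-minute cron entry for servermon, given
# the PID of the master and the PID of the file server.
servermon_cron_line() {
    for pid in "$1" "$2"; do
        case $pid in
            # Reject empty values and anything that is not purely numeric.
            ''|*[!0-9]*) echo "not a PID: '$pid'" >&2; return 1 ;;
        esac
    done
    printf '* * * * * /storage/lqcd/apps/output/servermon %s %s\n' "$1" "$2"
}
```

The printed line can then be pasted into the editor opened for the crontab, replacing the entry of any previous server processes.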

How to submit more...

NOTE: rev 41 is the last revision of this doc before the submit_more command was introduced

bash 
cd /afs/cern.ch/sw/arda/install/su3/2009
./submit_more config-kuba.gear lxb1420 LCG.py NUMBER      [opts: --delay 10, --CE ce]

The name of the configuration file must be config-PROXYNAME and must match the proxy file p/PROXYNAME

LCG.py may be replaced by any other submitter script (e.g. LSF.py). If no submitter script is specified, ganga is run interactively.
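
The naming rule for the configuration file can be checked before submitting. A minimal sketch, assuming that the proxy directory p/ sits in the same directory as the configuration file (the text does not say where it is); the function name proxy_for_config is hypothetical:

```shell
# Hypothetical helper: derive PROXYNAME from a file named config-PROXYNAME and
# print the path of the matching proxy file p/PROXYNAME if it exists.
# ASSUMPTION: p/ is located next to the configuration file.
proxy_for_config() {
    config=$1
    base=$(basename "$config")
    case $base in
        config-?*) name=${base#config-} ;;
        *) echo "config file must be named config-PROXYNAME: $base" >&2; return 1 ;;
    esac
    proxy=$(dirname "$config")/p/$name
    if [ -f "$proxy" ]; then
        printf '%s\n' "$proxy"
    else
        echo "no proxy file $proxy" >&2
        return 1
    fi
}
```

Read literally, the rule means that config-kuba.gear is paired with p/kuba.gear, which is what the sketch looks for.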

Submitting using AgentFactory

The procedure below uses $HOME/proxy to store the long-lived proxy and the extend.sh script. If you prefer a different location, change the scripts accordingly.

(i) create the long-lived proxy (this will create a proxy valid for a month) and myproxy

mkdir $HOME/proxy
cd $HOME/proxy
/afs/cern.ch/sw/arda/install/su3/2009/agent_factory/mkproxy.sh

(ii) add the extend.sh script to your acrontab:

acrontab -e

add the following line:

30 10,22 * * * lxarda28 /afs/cern.ch/sw/arda/install/su3/2009/agent_factory/extend.sh
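
This entry runs extend.sh on lxarda28 at 10:30 and 22:30 every day. To avoid adding it twice, the current table can be filtered before it is reinstalled. A sketch, assuming that acrontab accepts a new table on standard input (the acrontab < FILE form); the function name with_entry is hypothetical:

```shell
# Hypothetical helper: copy the table on standard input to standard output and
# append ENTRY unless an identical line is already present.
with_entry() {
    entry=$1
    table=$(cat)
    if [ -n "$table" ]; then
        printf '%s\n' "$table"
    fi
    # -x matches whole lines only, -F treats the entry as a fixed string.
    if ! printf '%s\n' "$table" | grep -qxF -e "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Intended use: write the new table to a file, inspect it, then install it.
#   acrontab -l | with_entry '30 10,22 * * * lxarda28 /afs/cern.ch/sw/arda/install/su3/2009/agent_factory/extend.sh' > new.acrontab
#   acrontab < new.acrontab
```

Going through a file rather than piping straight back into acrontab leaves a chance to notice an empty or truncated listing before it replaces the installed table.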

(iii) run Agent Factory (preferably inside a screen session):

cd /afs/cern.ch/sw/arda/install/su3/2009
screen
./agent_factory/run_agentfactory.sh CONFIG_FILE HOST WORKER_NUMBER

where CONFIG_FILE is your configuration file and WORKER_NUMBER is the number of workers you want to have.

Questions

  • !!! diane-master-ping kill should ask for confirmation
  • Is this line (master log) correct?
    2008-05-02 09:03:34,311 INFO: config.main.cache = ~/dianedir/cache>

Fake snapshots for testing


# TESTING ONLY
cd /storage/lqcd/apps/output
cp /storage/lqcd/apps/LatticeQCD2/test/* .
python /storage/lqcd/apps/LatticeQCD2/make_test_snap.py



General talks on the subject

Old links

-- JakubMoscicki - 26 Jun 2007

Topic revision: r49 - 2010-04-13 - JakubMoscicki