CMS Job Monitoring Collectors Troubleshooting

The MonALISA log file collector and feeder are running on dashboard12, dashb-ai-581 and dashb-ai-584.

In case of a problem with the collector on one or all machines, you should be warned by mail. If all collectors are stuck, most probably there is a lock on the DB. If only one of them indicates trouble, there can be many reasons: the machine ran out of disk space, a user job sent some unexpected value, etc. The configuration of all services is exactly the same. These instructions describe dashboard07, which was an old host, but they apply unchanged to the current hosts mentioned above.

# ON LXPLUS etc. (with AFS token) log in to dashboard07 as dboard
ssh dboard@dashboard07
...
# ON DASHBOARD07
# Check that all scripts of the collector are running:
ps -elf | grep '\.sh'
# You should see something like:
0 S root       377     1  0  78   0 -  1341 wait   Jun12 ?        00:00:00 /bin/bash ./logfile-collect.sh
0 S root     20942     1  0  76   0 -  1341 wait   Jun18 ?        00:04:50 /bin/bash ./monalisa-collect.sh
0 S root     20944     1  0  77   0 -  1342 wait   Jun18 ?        00:00:35 /bin/bash ./cms-prod-feeder.sh MonalisaFeeder17
0 S root     26031 22802  0  76   0 -   922 pipe_w 12:08 pts/2    00:00:00 grep \.sh

# You can also check the Python scripts:
ps -elf | grep '\.py'
# Note that not all Python scripts run all the time; they are invoked from the shell scripts from time to time.
If any of the three shell scripts is missing, you need to restart it. First change to the correct directory:
cd /data/dboard_cms
To start the logfile collector (it reads the MonALISA log file):
./logfile-collect.sh > /tmp/LogFileCollector.log 2>&1 &
To start the MonALISA collector (it reads the files written by the logfile collector and reformats them for feeding the DB):
./monalisa-collect.sh > /tmp/MonalisaCollector.log 2>&1 &
For the MonALISA feeders the procedure has recently changed to allow running multiple feeding processes in parallel, which is now often required since the load on the Dashboard server keeps increasing. The bash script that runs the Python feeder can run up to 5 feeders in parallel, depending on the number of unprocessed input files on disk. There is currently one inconvenience: if something goes wrong and the script has to be restarted, the time stamp of the first file to start with has to be edited in the bash script (this needs to be changed). The following line in the bash script /data/dboard_cms/dbfeeder.sh has to be edited:
latestupdt=1277139744

The time stamp (in seconds) should be replaced; it is the time stamp included in the name of the input ml_dict* file.
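A hedged sketch of that edit (the time stamp value here is taken from the example listing further below; pick the ml_dict_* file you actually want to restart from):

ts=1213957981   # epoch taken from the name of the ml_dict_<epoch> file to restart from
sed -i "s/^latestupdt=.*/latestupdt=${ts}/" /data/dboard_cms/dbfeeder.sh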

The feeder should be started the following way:

./dbfeeder.sh MonalisaFeeder07 >/tmp/Feeder.log 2>&1 &

Be careful to use the correct extension of the Feeder name; it is important!
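After starting it, a quick sanity check (the log path is the one used in the command above):

ps -elf | grep dbfeeder | grep -v grep   # the feeder script should be running
tail -f /tmp/Feeder.log                  # and its log should keep advancing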

Do not forget to exit from the dashboard07 or lxarda19 nodes! Otherwise, as soon as your AFS token expires, the collector will stop:

exit
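Related: if you are unsure whether your AFS token is still valid, you can check and renew it (assuming standard Kerberos/OpenAFS client tools, as on lxplus):

tokens            # show AFS tokens and their expiry times
kinit && aklog    # renew the Kerberos ticket and re-acquire the AFS token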

What else you can check:

Log files in the data directory of the first two collector steps

The directory where the collector writes its files:

cd /data/dboard_cms/ml-data
ls -ltr

You should see something like:

-rw-r--r--  1 root root 34287772 Jun 20 12:02 ml_orig_20080620-100059_1213955160_1213956059_1213956130
-rw-r--r--  1 root root 21039474 Jun 20 12:02 ml_list_1213956130
-rw-r--r--  1 root root 10713915 Jun 20 12:02 ml_dict_1213956130
-rw-r--r--  1 root root 32122001 Jun 20 12:17 ml_orig_20080620-101559_1213956060_1213956959_1213957056
-rw-r--r--  1 root root 19282208 Jun 20 12:17 ml_list_1213957056
-rw-r--r--  1 root root  9564788 Jun 20 12:18 ml_dict_1213957056
-rw-r--r--  1 root root 29786766 Jun 20 12:33 ml_orig_20080620-103059_1213956960_1213957859_1213957981
-rw-r--r--  1 root root 14047514 Jun 20 12:33 ml_list_1213957981
-rw-r--r--  1 root root  7177416 Jun 20 12:33 ml_latest
-rw-r--r--  1 root root  7177416 Jun 20 12:33 ml_dict_1213957981
The time stamps of the files should be fresh.
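A quick staleness check (the 30-minute window is an arbitrary choice):

# list ml_* files modified within the last 30 minutes;
# an empty result means the collector has stopped writing
find /data/dboard_cms/ml-data -maxdepth 1 -name 'ml_*' -mmin -30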

For the log files of the last collector step, look into the directory /data/dboard_cms/tmp.

You should see files with fresh time stamps.
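For example:

cd /data/dboard_cms/tmp
ls -ltr | tail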

The state of the services in the CMS database:

SELECT * from SERVICES;
The relevant ML services are MonalisaFeeder07 (dashboard07) and MonalisaFeeder19 (lxarda19).
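To narrow the output to the feeders (a sketch assuming the service name is stored in a column called NAME; check the actual schema):

SELECT * FROM SERVICES WHERE NAME LIKE 'MonalisaFeeder%';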

WHAT CAN GO WRONG: the machine runs out of disk space, though this should not happen since the disk cleaning procedures are set up.

Check the disk space:

df -h
To clean up space, there is a crontab job that cleans /data/dashboard/data and some other directories. You can free some more disk space by cleaning the MonALISA archives:
cd /data/arda_cms_ml/MonaLisa/Service/myFarm/result_logs/
rm -rf *log.zip

But be careful between midnight and 2 o'clock in the morning: at that time the archiving procedure is running. Do not remove the latest zip file, or you can screw up the collector.
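A slightly safer variant (a sketch, not the established procedure): keep the newest archive and remove the rest:

cd /data/arda_cms_ml/MonaLisa/Service/myFarm/result_logs/
ls -t *log.zip | tail -n +2 | xargs -r rm -f   # spares the most recent zip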

Something can go wrong with feeding the database, for example a lock on the DB. So far, the locks on the DB were created by us, either by running some schema upgrade or by forgetting to commit changes from sqlplus or sqldeveloper sessions.

How to detect such a problem:

The log file of the actual DB feeding is created in: /data/dashboard/tmp/data/dboard_cms/MonalisaFeeder*

tail -f /data/dashboard/tmp/data/dboard_cms/MonalisaFeeder<latest_log_file>
If it does not move for more than 5 minutes, you have a problem. If there is no error and it is just hanging on some update or select statement, most probably you have a lock. You cannot check what causes the lock yourself (only the DBAs can), but the first thing I would try is to kill all sqlplus and sqldeveloper sessions you have started. I normally run sqlplus from lxplus under the dboard account. So I do:
ps -elf | grep ssh
and kill all ssh processes started as dboard.
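For example, a minimal sketch (assuming the dboard sessions are visible in the ps output; verify each PID before killing anything):

ps -elf | grep '[s]sh' | grep dboard   # list the ssh sessions started as dboard
kill <PID>                             # kill each offending session by its PID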

Another problem with DB feeding: you may see an error in the log file. In that case, look at the Python scripts in the dboard AFS directory:

/afs/cern.ch/user/d/dboard/cms-prod/dashboard-2006-07-03/python

You have to log in as dboard; I'll send you the password by mail.

I hope it won't happen. Normally it happens when something changes in the job reporting: for example, jobs send too long a value and the update fails, since the column definition does not allow such a long value. Unfortunately, we do not have all possible protections in the collector for such cases. Give me a call.

* If you do not know what is happening and why the DB update is stuck, submit a SNOW ticket.

What you can do to check MonALISA itself: normally MonALISA runs without any trouble. Raw ML files are located at /data/arda_cms_ml/MonaLisa/Service/myFarm/result_logs. If you DO SEE a problem (the log file is not updated, or it does not contain relevant CMS data), contact Bucoveanu-Ramiro Voicu <Ramiro.Voicu@cern.ch>. Ramiro will be around during my vacations. If he is not around, contact Iosif Legrand.
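A quick way to check that the raw ML log is alive (the JStore_<date>.log naming is listed further below):

cd /data/arda_cms_ml/MonaLisa/Service/myFarm/result_logs
ls -ltr | tail -3                           # the newest log should have a fresh time stamp
tail -f "$(ls -t JStore_*.log | head -1)"   # watch whether CMS records keep arriving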

You can restart the ML services yourself; it is rather easy. The user for dashboard07 is dboard, for lxarda19 it is dboarduser:

su dboard
$ cd /data/arda_cms_ml/MonaLisa/Service/CMD/
$ ./ML_SER start/stop/restart
But it is better to first get in touch with the ML experts. For unzipping old archives, use /data/dboard_cms/bin/unzip.
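For example (the archive name is illustrative, following the JStore_<date>.log naming used in result_logs):

/data/dboard_cms/bin/unzip JStore_<date>.log.zip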

MonALISA Service. Directory: /data/arda_cms_ml/MonaLisa/

Important files:

* ML log files - logs for the ML service itself

/data/arda_cms_ml/MonaLisa/Service/myFarm/ML0.log
/data/arda_cms_ml/MonaLisa/Service/myFarm/ML.log
* ML log result files - logs containing the received data
/data/arda_cms_ml/MonaLisa/Service/myFarm/result_logs/JStore_<date>.log
* Configuring the modules - most notably ApMon
/data/arda_cms_ml/MonaLisa/Service/myFarm/myFarm.conf
* Configuring various other properties
/data/arda_cms_ml/MonaLisa/Service/myFarm/ml.properties

The ML Service runs as user 'dboard'

The crontab is set up to restart the ML server automatically in case of problems. To see what it is doing:

crontab -u dboarduser -l

Setup of dashboard25 (as a replacement for lxarda19)

-- JuliaAndreeva - 15 June 2009
