Dashboard Earth: Grid Activity Monitoring with Google Earth
Documentation
Please check the User Documentation or the Admin Documentation.
Maintenance of the Application
Dashboard Earth runs in
dashb-earth
and has six collectors: one for the coordinates (
service.monitor.gearth.coords
), one for the aggregate of all VOs (
service.monitor.gearth.all
) and one per VO (
service.monitor.gearth.alice
,
service.monitor.gearth.atlas
,
service.monitor.gearth.cms
,
service.monitor.gearth.lhcb
). The log files are
/opt/dashboard/var/log/gearth-{coords,all,alice,atlas,cms,lhcb}.log
.
There is a cron job (
/root/cron/dashCollectors.sh
) set to run every hour. It checks that all the collectors are running, that the coordinates file has the correct permissions, and that the generated kmz files are up to date. Besides restarting any collectors that are down, it also corrects the permissions of the coordinates file if they are wrong. Please use this script to start the collectors.
If the data sources hit any problems during the break, the collector may not be able to run. In that case the kmz will not be updated, and the cron job should detect that it has become outdated; please check the log files and warn the service responsible for the data. In extreme cases, it will probably be best to stop the affected collector(s) and remove them from the dashbCollectors.sh script (lines 71 and 73), together with the corresponding kmz (line 80).
Debugging
If something is wrong and there is no data for some experiment, or the data is outdated, perform the following steps to fix the problem.
1. Restart collectors.
Where are the collectors? The collectors and the server hosting their output kmz files are located on the same machine. This machine should have the alias
dashb-earth
. You can log in to the machine using this alias. Sometimes, however, the alias points to a proxy rather than the machine itself. In that case, look in the
/etc/http/conf.d
directory of the proxy machine to find the actual address. Once you have found the proper address and logged in, remember that all collectors are executed by the
dboard
user, so use the
su - dboard
command to log in as
dboard
. Now you can use the
PYTHONPATH=/opt/dashboard/lib/ /opt/dashboard/bin/dashb-agent-list
command to list all collectors and
PYTHONPATH=/opt/dashboard/lib /opt/dashboard/bin/dashb-agent-restart service.monitor.gearth.COLLECTORNAME
to restart the collector with name
COLLECTORNAME
.
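Restarting every collector in one go can be sketched as a small loop around the restart command quoted above. This is only an illustration: the function name `restart_gearth_collectors` and the `DASHB_RESTART` override variable are hypothetical, and the collector suffixes are assumed to match the log file names listed in the maintenance section (including the `atlas` spelling).

```shell
#!/bin/sh
# Collector suffixes, assumed to match the gearth-*.log file names.
COLLECTORS="coords all alice atlas cms lhcb"

# Restart command from the text above. DASHB_RESTART is a hypothetical
# override, e.g. DASHB_RESTART=echo for a dry run that only prints names.
DASHB_RESTART="${DASHB_RESTART:-/opt/dashboard/bin/dashb-agent-restart}"

# Restart each gearth collector in turn; report failures on stderr
# and carry on with the remaining collectors.
restart_gearth_collectors() {
    for name in $COLLECTORS; do
        PYTHONPATH=/opt/dashboard/lib "$DASHB_RESTART" "service.monitor.gearth.$name" \
            || echo "restart failed for service.monitor.gearth.$name" >&2
    done
}
```

Run it as the `dboard` user by calling `restart_gearth_collectors` after sourcing the file. Setting `DASHB_RESTART=echo` first shows which collectors would be restarted without touching them.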
2. Check whether this helped. If restarting the collectors helped, all kmz files should be up to date. You can check this with the
ls -l /opt/dashboard/www/
command. If none of the files is older than 20 minutes at the moment you check, everything is fine. If not, find out which kmz file is old and which experiment it belongs to. Then open the corresponding log file with the
tail -f /opt/dashboard/var/log/gearth-EXPERIMENTNAME.log
command, look for any helpful error messages, and fix the problem.
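Instead of reading timestamps from `ls -l` by eye, the 20-minute check can be sketched with `find`. The function name `stale_kmz` is hypothetical, and the sketch assumes the kmz files sit directly in `/opt/dashboard/www` with a `.kmz` extension. Note that `-maxdepth` and `-mmin` are extensions offered by GNU and BSD `find`, not by strict POSIX.

```shell
#!/bin/sh
# Print every *.kmz file directly inside a directory that was last
# modified more than a given number of minutes ago.
#   $1: directory to scan (default: /opt/dashboard/www)
#   $2: maximum acceptable age in minutes (default: 20)
stale_kmz() {
    dir="${1:-/opt/dashboard/www}"
    max_age_min="${2:-20}"
    find "$dir" -maxdepth 1 -type f -name '*.kmz' -mmin +"$max_age_min"
}
```

Calling `stale_kmz` with no arguments prints nothing when all files are fresh. Any path it does print points to the experiment whose log file is worth inspecting.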
3. If there is no error, the problem may be with fetching data. In this case you have to find every data source in the code and check that each one is working.
If some data service is down, there are two possibilities. If the service is external and does not belong to the dashboard framework, contact the person responsible for that service. If the problem is within the dashboard framework, you have to fix it yourself.
4. Some of the dashboard services run on virtual machines, which can cause problems. At the moment we have 2 physical hosts for virtual machines:
dashboard12 is hosting dashb-virtual01-06
dashboard13 is hosting dashb-virtual07-12
Try to log in as the
dboard
user to the machine hosting the problematic virtual machine and check its status using the
./vmbox.sh
command.
If the machine is down you may want to restart it. Use the
VBoxManage startvm --type headless MACHINENAME
command. Murphy's law applies, of course, so you may end up with an error message like:
VBoxManage: error: The machine 'dashb-virtual06' is already locked by a session (or being locked or unlocked)
or some other strange error message. If this happens, use the
ps -ef | grep virtual
command to see which processes are using your virtual machine. Note their PIDs and kill them all using the
kill -9 PIDS
command. Check the list of processes once again and make sure you have killed everything that was using the problematic virtual machine. Then try to restart it again; hopefully it will work.
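Picking the PIDs out of the `ps -ef` listing by hand is error-prone, so the lookup can be sketched as a small filter. The function name `pids_for_vm` is hypothetical. It assumes the usual `ps -ef` layout, where the PID is the second column and the first line is a header, and it assumes the VM name appears somewhere in the process command line.

```shell
#!/bin/sh
# Read `ps -ef` style lines on stdin and print the PID (column 2) of
# every line that mentions the given VM name. The header line is skipped.
#   $1: VM name to look for, e.g. dashb-virtual06
pids_for_vm() {
    awk -v vm="$1" 'NR > 1 && index($0, vm) > 0 { print $2 }'
}

# Example (commented out so that sourcing this file has no side effects).
# The snapshot is taken first so that the filter itself cannot show up in it:
#   snapshot=$(ps -ef)
#   printf '%s\n' "$snapshot" | pids_for_vm dashb-virtual06
```

Review the printed PIDs before passing them to `kill -9`, since a plain substring match can also catch unrelated processes such as an editor that has a file with the VM name open.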
5. If you are really unlucky, the virtual machine will fail immediately because of a virtual disk crash. In this case you need VirtualBox. This is a graphical application, so you need to forward X in your session. Log in to the host as
dboard
and execute the
echo YOURLOGIN@CERN.CH > .k5login
command, then log off. Log in once again using the
ssh dboard@HOST -X
command. Run VirtualBox using the
VirtualBox
command. Make sure you have killed all processes using this VM, then look for disk snapshots. If you find one, you can recover the VM from it. If the snapshot was made a long time ago, you may have to install and configure anything that is not included in it.
6. If you are really, really unlucky, you won't find any snapshot at all. Then you have to delete this VM and recreate it from scratch. Good night and good luck!