Dashboard Earth: Grid Activity Monitoring with Google Earth

Documentation

Please check the User Documentation or the Admin Documentation.

Maintenance of the Application

Dashboard Earth runs on dashb-earth and has 6 collectors: one for the coordinates (service.monitor.gearth.coords), one for the aggregate of the VOs (service.monitor.gearth.all), and one per VO (service.monitor.gearth.alice, service.monitor.gearth.atlas, service.monitor.gearth.cms, service.monitor.gearth.lhcb). The log files are /opt/dashboard/var/log/gearth-{coords,all,alice,atlas,cms,lhcb}.log.
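
As a quick reference, the six collectors and their log files can be enumerated with a small shell sketch. The variable and function names here are illustrative, and the sketch assumes that each service name ends in the same short name as its log file (so the ATLAS collector is taken to be service.monitor.gearth.atlas).

```shell
# Short names as they appear in the log file names; "coords" is the
# coordinates collector and "all" is the aggregate over the VOs.
GEARTH_COLLECTORS="coords all alice atlas cms lhcb"

# Print one "service-name log-path" pair per collector.
# Assumption: service names and log names share the same short name.
gearth_collectors() {
    for name in $GEARTH_COLLECTORS; do
        printf '%s %s\n' "service.monitor.gearth.$name" \
            "/opt/dashboard/var/log/gearth-$name.log"
    done
}
```

Running gearth_collectors prints six lines, one per collector, which is handy as input for loops that restart collectors or inspect logs.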

A cron job (/root/cron/dashCollectors.sh) runs every hour and checks that all the collectors are running, that the coordinates file has the correct permissions, and that the generated kmz files are up to date. It restarts any collectors that are down and corrects the permissions of the coordinates file if they are wrong. Please use this script to start the collectors.
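
The contents of the cron script are not reproduced on this page. As a rough illustration of the permission check it is described as doing, a minimal sketch might look like the following. The function name ensure_mode, the crontab minute and the example mode are all assumptions, not taken from the real script, and stat -c requires GNU coreutils.

```shell
# Hypothetical crontab entry for an hourly run (the minute is an assumption):
#   0 * * * * /root/cron/dashCollectors.sh

# Compare a file's octal mode with the wanted one and fix it if it differs.
# Prints a message only when a change was made.
ensure_mode() {
    file=$1
    want=$2
    have=$(stat -c '%a' "$file") || return 1
    if [ "$have" != "$want" ]; then
        chmod "$want" "$file"
        echo "fixed $file: $have -> $want"
    fi
}
```

A call such as ensure_mode /path/to/coordinates-file 644 would then be a no-op when the mode is already right; both the path and the mode 644 are placeholders here.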

If the data sources hit any problems during the break, the collector may not be able to run. In that case the kmz will not be updated, and the cron job should detect that the kmz has become outdated. Please check the log files and warn the service responsible for the data. In extreme cases, it will probably be best to stop the affected collector(s) and remove them from the dashbCollectors.sh script (lines 71 and 73), along with the corresponding kmz (line 80).

Debugging

If something is wrong and there is no data for some experiment, or the data is outdated, perform the following steps in order to fix the problem.

1. Restart the collectors. The collectors and the server hosting their output kmz files are located on the same machine, which should have the alias dashb-earth, so you can log in using this alias. Sometimes, however, the alias points to a proxy rather than the machine itself. In that case, look in the /etc/http/conf.d directory of the proxy machine to find the actual address. Once you are logged in to the right machine, remember that all collectors are run by the dboard user, so use su - dboard to switch to that user. You can then list all collectors with PYTHONPATH=/opt/dashboard/lib/ /opt/dashboard/bin/dashb-agent-list and restart the collector named COLLECTORNAME with PYTHONPATH=/opt/dashboard/lib /opt/dashboard/bin/dashb-agent-restart service.monitor.gearth.COLLECTORNAME.

2. Check whether this helped. If restarting the collectors worked, all kmz files should be up to date. You can check this with ls -l /opt/dashboard/www/. If all the files were modified within the last 20 minutes, everything is fine. If not, find out which kmz file is old and which experiment it belongs to. Then follow the log file with tail -f /opt/dashboard/var/log/gearth-EXPERIMENTNAME.log, look for any helpful error messages, and fix the problem.

3. If there is no error, the problem may be with fetching the data. In this case you have to find every data source in the code (sic!) and check that each one is working. If a data service is down, there are two possibilities: either the service is external and does not belong to the dashboard framework, in which case you have to contact the person responsible for it, or the problem is within the dashboard framework, in which case you have to fix it yourself.

4. Some dashboard services run on virtual machines, which can cause problems. At the moment we have 2 physical hosts for virtual machines:
    • dashboard12 hosts dashb-virtual01-06
    • dashboard13 hosts dashb-virtual07-12

Log in as the dboard user to the machine hosting the problematic virtual machine and check its status with the ./vmbox.sh command. If the machine is down, you may want to restart it with VBoxManage startvm --type headless MACHINENAME. Murphy's law applies, of course, so you may end up with an error message such as: VBoxManage: error: The machine 'dashb-virtual06' is already locked by a session (or being locked or unlocked), or some other strange error. In that case, use ps -ef | grep virtual to see which processes are using your virtual machine, then kill them all with kill -9 PIDS. Check the list of processes again to make sure that nothing is still using the problematic virtual machine. Then try to restart it once more; hopefully it will work.

5. If you are really unlucky, the virtual machine will fail immediately because of a virtual disk crash. In this case you need VirtualBox, which is a graphical application, so you have to forward X in your session. Log in to the host as dboard and execute echo YOURLOGIN@CERN.CH > .k5login, then log off. Log in again with ssh dboard@HOST -X and start the application with the VirtualBox command. Make sure you have killed all processes using this VM, then look for disk snapshots. If you find one, you can recover the VM from it. If the snapshot was made a long time ago, you may have to install and configure anything that is not included in it.

6. If you are really, really unlucky, you won't find any snapshot at all. Then you have to delete this VM and recreate it from scratch. Good night and good luck!
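
The shell parts of steps 1, 2, 4 and 5 above can be wrapped in a few helper functions. This is only a sketch under stated assumptions: the function names and the DRY_RUN switch are hypothetical, the generated files are assumed to end in .kmz, and the Kerberos principal is whatever you pass in. Note that pids_for_vm matches only lines containing the VM name, which is narrower than grep virtual, so VirtualBox service processes that do not mention the VM will not be listed.

```shell
DASHB_LIB=/opt/dashboard/lib
DASHB_BIN=/opt/dashboard/bin

# Step 1: restart the named collectors (run this as the dboard user).
# Set DRY_RUN=1 to print the commands instead of executing them.
restart_collectors() {
    for name in "$@"; do
        ${DRY_RUN:+echo} env PYTHONPATH="$DASHB_LIB" \
            "$DASHB_BIN/dashb-agent-restart" "service.monitor.gearth.$name"
    done
}

# Step 2: print kmz files in a directory that were last modified more than
# N minutes ago (defaults: /opt/dashboard/www and 20 minutes).
# Assumes a *.kmz extension; -maxdepth and -mmin need GNU or BSD find.
stale_kmz() {
    find "${1:-/opt/dashboard/www}" -maxdepth 1 -name '*.kmz' -mmin +"${2:-20}"
}

# Step 4: read `ps -ef` output on stdin and print the PIDs of processes
# whose line mentions the given VM name, skipping this awk filter itself.
pids_for_vm() {
    awk -v vm="$1" 'index($0, vm) && $8 !~ /awk$/ { print $2 }'
}

# Step 5: add a Kerberos principal to a .k5login file only if it is not
# already listed, so existing entries survive (unlike `echo ... > .k5login`).
add_k5login() {
    k5file=${2:-$HOME/.k5login}
    touch "$k5file"
    grep -qxF -- "$1" "$k5file" || printf '%s\n' "$1" >> "$k5file"
}
```

For example, DRY_RUN=1 restart_collectors alice cms prints the two restart commands without running them, stale_kmz lists any outdated files under /opt/dashboard/www, and ps -ef | pids_for_vm dashb-virtual06 shows candidate PIDs to review before killing anything.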

Topic revision: r2 - 2011-08-05 - unknown
 