Central Harvester Instances
Technical documentation
For a technical description of Harvester components, please visit the Harvester GitHub wiki.
Monitoring
- Worker monitoring
- Host and service monitoring
- Harvester node monitoring on Grafana, to see CPU, memory, and disk usage of selected hosts
- Harvester service monitoring: monitors the Harvester and Condor schedd services, integrated with other ADC components. The service monitoring package and its documentation can be found here. In addition to publishing the SLS information, the service monitoring also sends emails to the service managers.
- Site monitoring
Harvester machines
You can connect with the usual atlpan user, for example:
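A minimal sketch, assuming the standard cern.ch DNS names for the nodes listed below:
[user@lxplus]$ ssh atlpan@aipanda171.cern.ch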
Name | HarvesterID | Description
aipanda169 | CERN_central_k8s | Harvester instance to run K8S queues.
aipanda170 | cern_cloud | Pre-production node.
aipanda171, aipanda172 | CERN_central_A | Grid PQs: CERN T0, CA, FR, NL, RU, US. Submits via remote schedds.
aipanda173, aipanda174 | CERN_central_B | Grid PQs: CERN, DE, ES, IT, TW, UK. Submits via remote schedds.
aipanda175 | CERN_central_0 | Submits to P1. Contains a local MySQL database and a local schedd.
aipanda177, aipanda178 | CERN_central_1 | Submits to special resources like CloudSchedulers. Submits to a local schedd.
(K8S pods) | CERN_central_boxed | Helm chart production instances. Submits for Grid PQs (shares part of the load with CERN_central_A and B). The DB and sharedFS server are on aipanda028. It submits to schedd pods in the same Helm chart. See here for more information.
Important paths and files
Python 3 is now the standard for central Harvester.
Some Harvester servers have been migrated to run with Python 3 in a virtual environment (venv), while other nodes still run with Python 2 installed in the system Python (no venv); these will be migrated as well.
Please note that the file paths below differ between the two Python versions.
Files under /cephfs are shared across harvester nodes and schedd nodes; they are used when harvester submits via remote schedds.
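To check which setup a node runs, a minimal sketch, assuming the standard venv layout under /opt/harvester (the exact layout may vary per node):
[root@<machine>]# ls /opt/harvester/bin/python*   # present only on Python 3 (venv) nodes
[root@<machine>]# /opt/harvester/bin/python --version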
Filename | Description | Managed by Puppet?
/var/log/harvester | All the harvester logs of the various agents |
/opt/harvester | Main directory of the Harvester Python virtual environment (Python 3 only) |
/opt/harvester/etc/panda/panda_harvester.cfg (Python 2: /usr/etc/panda/panda_harvester.cfg) | General configuration of subcomponents, DB connection, etc. |
/opt/harvester/etc/panda/panda_queueconfig.json (Python 2: /usr/etc/panda/panda_queueconfig.json) | Queue configuration |
/data/atlpan/harvester_common/ , /cephfs/atlpan/harvester/harvester_common/ | Condor sdf templates and other files needed by Harvester |
/data/atlpan/harvester_wdirs/${harvesterID}/XX/YY/${workerID} , /cephfs/atlpan/harvester/harvester_wdirs/${harvesterID}/XX/YY/${workerID}/ | Worker directories: the sdf file submitted to Condor for each job and other files. XXYY are the last four digits of the workerID. |
/data/atlpan/harvester_worker_dir/ , /cephfs/atlpan/harvester/harvester_worker_dir/ | Deprecated and removed (worker directory) |
/data/atlpan/condor_logs/ | Local condor and pilot logs for each job |
/data2/atlpan/condor_logs/ | On schedd nodes: condor and pilot logs for each job; on harvester nodes: dummy folders/files, but they must exist |
/etc/cron.d/harvester_with_remote_schedd.cron | Cron running on the production harvester instances CERN_central_A and CERN_central_B, which submit through external condor nodes |
/opt/harvester/etc/panda/condor_host_config.json | Configuration file specifying which external condor nodes harvester should submit through |
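For day-to-day debugging, the log directory and the worker directories are the usual starting points. A minimal sketch; the per-agent log file name (panda-submitter.log) and the workerID (12345678) are hypothetical:
[root@<machine>]# ls /var/log/harvester/
[root@<machine>]# tail -f /var/log/harvester/panda-submitter.log
[root@<machine>]# ls /data/atlpan/harvester_wdirs/CERN_central_A/56/78/12345678/   # XX/YY = 56/78, the last 4 digits of workerID 12345678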
Harvester Service
To start, stop, or reload the service:
Python 3 (with venv):
[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi start
[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi stop
[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi reload
Python 2 (without venv):
[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi start
[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi stop
[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi reload
The most important tables in the DB structure can be found here, although the schema is under constant evolution. The DB configuration can always be found in the panda_harvester.cfg file, but currently you can connect to the databases as follows:
Read-only account atlas-ro for debugging:
- On aipanda171,172:
# mysql -h dbod-ha-proxy.cern.ch -P 7106 -u atlas-ro -p HARVESTER
- On aipanda173,174:
# mysql -h dbod-ha-proxy.cern.ch -P 7107 -u atlas-ro -p HARVESTER
- On aipanda175:
# mysql -u atlas-ro -p harvester
- On aipanda177,178:
# mysql -h dbod-harv-c1.cern.ch -P 5500 -u atlas-ro -p harvester
- For CERN_central_boxed, DB is on aipanda028:
# mysql -u atlas-ro -p HARVESTER
The password is in /cephfs/atlpan/harvester/mysql-passwd .
Note that the database name, DB hostname, and port differ between nodes; check the [db] section of panda_harvester.cfg.
You will need the password to connect. Run only queries that you understand and where you know what you are doing.
If write permission is really needed, use the harvester service account; check panda_harvester.cfg for the user and password.
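As an example of a safe read-only query, the following counts workers per status. A minimal sketch, assuming the standard Harvester work_table schema (table and column names may differ between versions):
# mysql -h dbod-ha-proxy.cern.ch -P 7106 -u atlas-ro -p HARVESTER
mysql> SELECT status, COUNT(*) FROM work_table GROUP BY status;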
Production Condor Schedd machines
These are the external condor schedd nodes through which the production harvester instances CERN_central_A and CERN_central_B submit.
You can connect with the usual atlpan user.
Name | Submits for
aipanda023 | CERN_central_A
aipanda024 | CERN_central_B
aipanda156 | CERN_central_A
aipanda157 | CERN_central_B
aipanda159 | CERN_central_A
aipanda183 | CERN_central_A
aipanda184 | CERN_central_B
Condor nodes are interchangeable.
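To see what a given schedd node currently holds, standard HTCondor commands can be run on the node itself (a sketch, assuming the stock HTCondor CLI):
[atlpan@<schedd machine>]$ condor_q -totals            # job totals in the local schedd queue
[atlpan@<schedd machine>]$ condor_q -allusers -totals  # include jobs from all users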
Important paths and files
Filename | Description | Managed by Puppet?
/etc/condor/ | Condor configuration folder; custom configuration usually goes under config.d |
/etc/condor/config.d/condor_config.local | Main condor configuration for ATLAS GRID | Yes
/var/log/condor/ | Condor logs, one per agent |
/etc/cron.d/harvester_condor.cron | Cron file on the condor nodes | Yes
Restarting condor service
[root@<schedd machine>]# systemctl restart condor
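After a restart it is worth verifying that the daemons came back up (standard systemd command); the condor_q checks shown above can then confirm the queue is intact:
[root@<schedd machine>]# systemctl status condor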
HGCS
HGCS is a sidekick service that runs on the remote HTCondor schedd nodes and works with Harvester for the Grid.
HGCS is installed on every production condor schedd node and is required to be running.
To restart HGCS:
[root@<schedd machine>]# systemctl restart hgcs
Configuration and log files:
Filename | Description | Managed by Puppet?
/opt/hgcs.cfg | HGCS config |
/var/log/hgcs/hgcs.log | HGCS log |
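To confirm HGCS is alive after a restart, a minimal check using standard systemd commands and the log file listed above:
[root@<schedd machine>]# systemctl status hgcs
[root@<schedd machine>]# tail -n 50 /var/log/hgcs/hgcs.log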
For details about HGCS, see the HGCS GitHub wiki.
Troubleshooting
SLS errors
- The harvester host has many missed workers: check the Harvester monitoring (e.g. the main Grafana dashboard), select the problematic harvester instance and the "missed" worker status. In the worker table you will see the reason for the missed workers. Missed workers can be caused by an issue on the instance itself, but also by a site issue or misconfiguration. When the error message and the source of the error are not straightforward to understand, the service manager should be contacted. A DB query for recent missed workers is sketched below.
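If the dashboard is not at hand, the same information can be pulled from the DB with the read-only account. A minimal sketch, assuming the standard Harvester work_table schema (column names such as diagMessage and submitTime may differ between versions):
mysql> SELECT workerID, computingSite, diagMessage FROM work_table WHERE status = 'missed' ORDER BY submitTime DESC LIMIT 10;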
FahuiLin - 2021-07-08