Central Harvester Instances

Technical documentation

For a technical description of the Harvester components, please visit the Harvester GitHub wiki.

Monitoring

Harvester machines

You can connect with the usual atlpan user (see the login example after the table).
| Name | HarvesterID | Description |
| aipanda169 | CERN_central_k8s | Harvester instance to run K8S queues. |
| aipanda170 | cern_cloud | Pre-production node. |
| aipanda171, aipanda172 | CERN_central_A | Grid PQs: CERN T0, CA, FR, NL, RU, US. Submits via remote schedds. |
| aipanda173, aipanda174 | CERN_central_B | Grid PQs: CERN, DE, ES, IT, TW, UK. Submits via remote schedds. |
| aipanda175 | CERN_central_0 | Submits to P1. Contains a local MySQL database and a local schedd. |
| aipanda177, aipanda178 | CERN_central_1 | Submits to special resources like CloudSchedulers. Submits to a local schedd. |
| (K8S pods) | CERN_central_boxed | Helm chart production instances. Submit for Grid PQs (share part of the load with CERN_central_A and B). DB and sharedFS server are on aipanda028. Submits to schedd pods in the same Helm chart. See here for more information. |
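For example, to log in to one of the machines above (the full hostname below assumes the standard .cern.ch domain):

$ ssh atlpan@aipanda171.cern.ch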

Important paths and files

Python 3 is now the standard for central Harvester.

Some Harvester servers have been migrated to run with Python 3 in a virtual environment (venv), while some nodes still run with Python 2 installed in the system python (no venv); these will be migrated.

Please note that the paths of the files below differ between the Python versions.

Files under /cephfs are shared across the harvester nodes and schedd nodes; they are used when Harvester submits via remote schedds.

| Filename | Description | Managed by Puppet? |
| /var/log/harvester | All the harvester logs of the various agents | |
| /opt/harvester | Main directory of the Harvester python virtual environment (python 3 only) | |
| /opt/harvester/etc/panda/panda_harvester.cfg (python 2: /usr/etc/panda/panda_harvester.cfg) | General configuration of subcomponents, DB connection, etc. | |
| /opt/harvester/etc/panda/panda_queueconfig.json (python 2: /usr/etc/panda/panda_queueconfig.json) | Queue configuration | |
| /data/atlpan/harvester_common/ , /cephfs/atlpan/harvester/harvester_common/ | Condor sdf templates and other files needed by Harvester | |
| /data/atlpan/harvester_wdirs/${harvesterID}/XX/YY/${workerID} , /cephfs/atlpan/harvester/harvester_wdirs/${harvesterID}/XX/YY/${workerID}/ | Worker directories: the sdf file submitted to Condor for each job and other files. XXYY are the last 4 digits of the workerID (see the example after this table). | |
| /data/atlpan/harvester_worker_dir/ , /cephfs/atlpan/harvester/harvester_worker_dir/ | Deprecated and removed (worker directory) | |
| /data/atlpan/condor_logs/ | Local condor and pilot logs for each job | |
| /data2/atlpan/condor_logs/ | On schedd nodes: condor and pilot logs for each job; on harvester nodes: dummy folders/files, but must exist | |
| /etc/cron.d/harvester_with_remote_schedd.cron | Cron running on the production harvester instances CERN_central_A and CERN_central_B, which submit through external condor nodes | |
| /opt/harvester/etc/panda/condor_host_config.json | Configuration file specifying which external condor nodes Harvester should submit through | |
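As an illustration of the worker directory layout above, the following sketch (with a hypothetical workerID and harvesterID, chosen only for the example) shows how the XX/YY subdirectories are derived from the last four digits of the workerID:

# Hypothetical example: locate the worker directory for workerID 12345678
# submitted by CERN_central_A (values for illustration only).
WORKER_ID=12345678
HARVESTER_ID=CERN_central_A
XX=${WORKER_ID: -4:2}   # first two of the last four digits -> 56
YY=${WORKER_ID: -2}     # last two digits -> 78
ls -l /cephfs/atlpan/harvester/harvester_wdirs/${HARVESTER_ID}/${XX}/${YY}/${WORKER_ID}/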

Harvester Service

To start, stop, or reload the service:

Python 3 (with venv):

[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi start
[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi stop
[root@<machine>]# /opt/harvester/etc/rc.d/init.d/panda_harvester-uwsgi reload

Python 2 (without venv):

[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi start
[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi stop
[root@<machine>]# /usr/etc/rc.d/init.d/panda_harvester-uwsgi reload
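After a restart, a quick generic check (not part of the init script itself) is to confirm that the Harvester uwsgi processes are running, e.g.:

[root@<machine>]# pgrep -af uwsgi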

MySQL

The most important tables of the DB structure can be found here, although the schema is under constant evolution. The DB configuration can always be found in the panda_harvester.cfg file; currently you can connect as follows:

Read-only account atlas-ro for debugging:

  • On aipanda171,172: # mysql -h dbod-ha-proxy.cern.ch -P 7106 -u atlas-ro -p HARVESTER

  • On aipanda173,174: # mysql -h dbod-ha-proxy.cern.ch -P 7107 -u atlas-ro -p HARVESTER

  • On aipanda175: # mysql -u atlas-ro -p harvester

  • On aipanda177,178: # mysql -h dbod-harv-c1.cern.ch -P 5500 -u atlas-ro -p harvester

  • For CERN_central_boxed, DB is on aipanda028: # mysql -u atlas-ro -p HARVESTER

The password is in /cephfs/atlpan/harvester/mysql-passwd.

Note that the database name, DB hostname, and port differ between nodes; check the [db] section in panda_harvester.cfg.

You will need to know the password to connect. Run only queries that you understand and whose effect you know.

If write permission is really needed, use the Harvester service account; check panda_harvester.cfg for the user and password.
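For example, a harmless read-only query with the atlas-ro account to count workers per status (a sketch assuming the standard Harvester schema, where the worker table is typically called work_table; verify the table name on the instance, and adapt host, port, and database name to the node as listed above):

[root@<machine>]# mysql -u atlas-ro -p harvester -e "SELECT status, COUNT(*) FROM work_table GROUP BY status;"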

Production Condor Schedd machines

These are the external condor schedd nodes that the production harvester instances CERN_central_A and CERN_central_B submit through. You can connect with the usual atlpan user.
| Name | Submits for |
| aipanda023 | CERN_central_A |
| aipanda024 | CERN_central_B |
| aipanda156 | CERN_central_A |
| aipanda157 | CERN_central_B |
| aipanda159 | CERN_central_A |
| aipanda183 | CERN_central_A |
| aipanda184 | CERN_central_B |

Condor nodes are interchangeable.

Important paths and files

| Filename | Description | Managed by Puppet? |
| /etc/condor/ | Condor configuration folder; custom configuration usually goes under config.d | |
| /etc/condor/config.d/condor_config.local | Main condor configuration for ATLAS GRID | Yes |
| /var/log/condor/ | Condor logs, one per agent | |
| /etc/cron.d/harvester_condor.cron | Cron file on the condor nodes | Yes |

Restarting condor service

[root@<schedd machine>]# systemctl restart condor
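After restarting, you can verify that the schedd is back and responding with standard HTCondor commands, e.g.:

[root@<schedd machine>]# systemctl status condor
[root@<schedd machine>]# condor_q -totals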

HGCS

HGCS is a sidekick service that runs on the remote HTCondor schedd nodes and works with Harvester for Grid submission. It is installed on every production condor schedd node and must be running.

To restart HGCS:

[root@<schedd machine>]# systemctl restart hgcs

Configuration and log files:

| Filename | Description | Managed by Puppet? |
| /opt/hgcs.cfg | HGCS config | |
| /var/log/hgcs/hgcs.log | HGCS log | |
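To verify that HGCS is running and healthy, check the service status and tail its log, e.g.:

[root@<schedd machine>]# systemctl status hgcs
[root@<schedd machine>]# tail -n 50 /var/log/hgcs/hgcs.log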

For details about HGCS, see the HGCS GitHub wiki.

Troubleshooting

SLS errors

  • The harvester host has many missed workers: Check the Harvester monitoring (e.g. the main Grafana dashboard), select the problematic harvester instance and the "missed" worker status. In the worker table you will see the reason for the missed workers. Missed workers can be caused by an issue on the instance itself, but also by a site issue or misconfiguration. When the error message and the source of the error are not straightforward to understand, the service manager should be contacted.

(Screenshot: missed workers shown in the Harvester monitoring dashboard)
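If the monitoring does not make the cause clear, the agent logs under /var/log/harvester on the instance usually contain the submission error. A rough sketch (the log file names vary per instance) is to grep them for the workerID or PanDA queue in question:

[root@<machine>]# grep -r <workerID> /var/log/harvester/ | tail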

FahuiLin - 2021-07-08
