HTCondor batch system

General information

Information can be found at:

Spool Option

The -spool option can be used at the condor_submit level, e.g.
condor_submit -spool htcondor.sub
In this case, the output files (transfer_output_files) and the error, log and output files are not written back when the job finishes; they are retrieved only when requested by the user, after the job is over. Retrieval can be done via the following command:
condor_transfer_data $LOGNAME -const 'JobStatus == 4'
In the above example, the files of all completed jobs (JobStatus == 4) in every cluster are retrieved.

With -spool, the jobs may not disappear automatically from condor_q. Jobs are removed 10 days after they have finished.

Queue Flavours

The following table lists the queue types of HTCondor (link) and of LSF (link):

HTCondor name   max duration   LSF name
espresso        20min          8nm
microcentury    1h             1nh
longlunch       2h             8nh
workday         8h             1nd
tomorrow        1d             2nd
testmatch       3d             1nw
nextweek        1w             2nw
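
As a rough illustration of how the table can be used, the shell sketch below picks the shortest HTCondor flavour that covers an estimated run time. The function names flavour_seconds and pick_flavour are hypothetical helpers and not part of HTCondor; the durations are simply those of the table converted to seconds.

```shell
# Map an HTCondor flavour name to its maximum duration in seconds
# (values taken from the table above).
flavour_seconds() {
    case "$1" in
        espresso)     echo 1200 ;;
        microcentury) echo 3600 ;;
        longlunch)    echo 7200 ;;
        workday)      echo 28800 ;;
        tomorrow)     echo 86400 ;;
        testmatch)    echo 259200 ;;
        nextweek)     echo 604800 ;;
        *) echo "unknown flavour: $1" >&2; return 1 ;;
    esac
}

# Print the shortest flavour whose limit covers the requested run time,
# given in seconds as first argument.
pick_flavour() {
    for f in espresso microcentury longlunch workday tomorrow testmatch nextweek; do
        if [ "$1" -le "$(flavour_seconds "$f")" ]; then
            echo "$f"
            return 0
        fi
    done
    echo "no flavour is long enough for $1 s" >&2
    return 1
}

pick_flavour 5400   # 1.5 h of estimated run time
```

It is usually wise to leave some margin on the estimate, since a job reaching the maximum duration of its flavour does not get extra time.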

Queue GUI

A GUI for showing the running jobs is available:

FAQs

Scheduler Not Replying

From time to time the scheduler does not reply. In general, this is a temporary problem; if it persists, open an IT ticket. In the meantime, you may try changing the scheduler assigned to you by default. This can be accomplished by setting the following two variables: _condor_SCHEDD_HOST and _condor_CREDD_HOST. E.g.:
  • tcsh
setenv _condor_SCHEDD_HOST bigbird02.cern.ch
setenv _condor_CREDD_HOST bigbird02.cern.ch
  • bash
export _condor_SCHEDD_HOST="bigbird02.cern.ch"
export _condor_CREDD_HOST="bigbird02.cern.ch"

The scheduler name is shown in the output of a simple call to condor_q. If you have not set these variables, the reported scheduler is the one assigned to you by default; otherwise, it should be the one you have set via the variables above.

Please keep in mind that these statements, if typed in a terminal, apply only to that session. For instance, if you log out or the lxplus session expires, you have to set the two variables again in the new session. So, please remember the scheduler that you have requested, otherwise you won't be able to retrieve the results from HTCondor.
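
To reduce the risk of setting only one of the two variables, or of forgetting which scheduler was requested, the two exports can be wrapped in a small bash function. This is only a sketch: use_schedd is a hypothetical helper name, and bigbird02.cern.ch is the example host used above.

```shell
# Hypothetical helper: point both variables at the same scheduler host.
use_schedd() {
    if [ -z "$1" ]; then
        echo "usage: use_schedd <scheduler host>" >&2
        return 1
    fi
    export _condor_SCHEDD_HOST="$1"
    export _condor_CREDD_HOST="$1"
    echo "scheduler set to $1 for this session"
}

use_schedd bigbird02.cern.ch
```

If the function definition is placed in ~/.bashrc, it is available in every new session, although the call itself still has to be repeated (or added to the same file) to select the scheduler again.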

Jobs Being Taken Very Slowly

(Main.AlessioMereghetti and Main.NikolaosKarastathis - 2017-10-23) It may happen that you see your jobs queueing for too long. This might simply be due to an overload of the batch system (please check the batch GUI); more rarely, it can also be a problem with priorities. Indeed, your jobs may be assigned (by mistake) an accounting group with very low priority. The high-priority accounting groups of ABP people are listed below, together with their managers, whom you can contact if you are not in the group of interest:

group                 associated e-group   description                                   managed by
group_u_BE.ABP.SLAP   htcondor-u-SLAP      dedicated to SixTrack DA studies              M. Giovannozzi
group_u_BE.ABP.COLL   lhc-coll-lsf-users   dedicated to Collimation Team                 A. Mereghetti
group_u_BE.ABP.ICE    htcondor-u-ICE       dedicated to Collective Effects simulations   G. Rumolo

The groups with low priority are in general wider and refer to the BE department:

group                  associated group   description
group_u_BE.UNIX.u_si   si                 unix group
group_u_BE.UNIX.u_pz   pz                 unix group

The shares among the groups can be seen with:

haggis group list | grep group_u_BE
| group_u_BE                       |  15000 | true    |
| group_u_BE.UNIX                  |    300 | true    |
| group_u_BE.UNIX.u_pz             |    150 | true    |
| group_u_BE.UNIX.u_si             |    150 | true    |
| group_u_BE.ABP                   |  14700 | true    |
| group_u_BE.ABP.ICE               |   4900 | true    |
| group_u_BE.ABP.COLL              |   4900 | true    |
| group_u_BE.ABP.SLAP              |   4900 | true    |

Hence, you can check whether your jobs are assigned the wrong accounting group via (an example output is shown):

$ condor_q $LOGNAME -long | grep '^AccountingGroup' | sort | uniq -c
9 AccountingGroup = "group_u_ATLAS.u_zp.nkarast"
1496 AccountingGroup = "group_u_BE.UNIX.u_pz.nkarast"

You can force the use of a high-priority accounting group by adding a line like the following to your .sub file:

+AccountingGroup = "group_u_BE.ABP.SLAP"
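
For context, the sketch below writes a minimal submit file containing this line and then prints it back. All file names (example.sub, job.sh, results.dat) are placeholders chosen for illustration, and the group shown is only appropriate if you are a member of the corresponding e-group.

```shell
# Write a minimal submit file; every file name here is a placeholder.
# The quoted 'EOF' keeps the shell from expanding $(ClusterId) and $(ProcId),
# which are HTCondor macros and must reach the file unchanged.
cat > example.sub <<'EOF'
executable            = job.sh
output                = job.$(ClusterId).$(ProcId).out
error                 = job.$(ClusterId).$(ProcId).err
log                   = job.$(ClusterId).log
transfer_output_files = results.dat
+AccountingGroup      = "group_u_BE.ABP.SLAP"
queue
EOF

# Show the accounting-group line that would be submitted.
grep '^+AccountingGroup' example.sub
```

After submission, the check on the accounting group described above can be repeated to confirm that the new group has been picked up.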

Submitting Jobs to HTCondor from a local Machine -- set up for Ubuntu

These notes (Main.AlessioMereghetti and Main.RiccardoDeMaria - 2017-10-10) refer to Ubuntu 16.04 LTS (xenial) and 18.04 LTS. On a different Linux distribution the steps might be the same, but the syntax may change. Sudo rights are needed.

As a prerequisite, you need to install a Kerberos client on your desktop; afterwards, you can proceed with the installation of HTCondor.

Kerberos set up

Configure Kerberos

  1. install the Kerberos user and developer packages and add the lxplus credential components (when asked, the default realm is CERN.CH):
    sudo apt install krb5-user libkrb5-dev libauthen-krb5-perl
    scp $USERNAME@lxplus.cern.ch:/usr/bin/batch_krb5_credential .
    chmod +x batch_krb5_credential 
    sudo mv batch_krb5_credential /usr/bin/
    scp $USERNAME@lxplus.cern.ch:/etc/ngauth_batch_crypt_pub.pem .
    sudo mv ngauth_batch_crypt_pub.pem /etc/
    scp $USERNAME@lxplus.cern.ch:/etc/krb5.conf.no_rdns .
    sudo mv krb5.conf.no_rdns /etc/krb5.conf.no_rdns
    scp $USERNAME@lxplus.cern.ch:/etc/sysconfig/ngbauth-submit .
    sudo mkdir -p /etc/sysconfig/
    sudo mv ngbauth-submit /etc/sysconfig/
     
  2. check that the Kerberos components are properly installed and set up (the script will tell you which perl packages are missing):
    /usr/bin/batch_krb5_credential
     
  3. if this does not work, you may need to change, in the script, the line my $principalName = "ngauth/SOMESERVER"; into my $principalName = "ngauth/ngauth.cern.ch";
  4. to install missing perl components, run commands like:
    perl -MCPAN -e 'install Authen::Krb5'
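
Before moving on, it can help to verify that the files copied in step 1 ended up in the expected places. The sketch below lists any that are missing; missing_krb_files is a hypothetical helper name, and the optional argument (a directory prefix) exists only so that the check can be tried on a scratch directory.

```shell
# Print every expected file that is missing under the given root.
# With no argument, the real locations under / are checked.
missing_krb_files() {
    root="${1:-}"
    for f in /usr/bin/batch_krb5_credential \
             /etc/ngauth_batch_crypt_pub.pem \
             /etc/krb5.conf.no_rdns \
             /etc/sysconfig/ngbauth-submit; do
        [ -e "$root$f" ] || echo "$f"
    done
}

# An empty output means that all four files are in place.
missing_krb_files
```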
     

HTCondor set up

Install HTCondor

On Ubuntu 18.04 it is enough to install the packaged version of condor:

sudo apt-get update
sudo apt-get install condor
 

For Ubuntu 16.04, the up-to-date instructions can be found on the HTCondor web page. In short:

  1. add the repository of the latest HTCondor (stable) release to the list of sources:
    echo "deb http://research.cs.wisc.edu/htcondor/ubuntu/stable/ trusty contrib" | sudo tee -a /etc/apt/sources.list > /dev/null
     
  2. install the HTCondor repository key:
    wget -qO - http://research.cs.wisc.edu/htcondor/ubuntu/HTCondor-Release.gpg.key | sudo apt-key add -
     
  3. Install the HTCondor package:
    sudo apt-get update
    sudo apt-get install condor
     

It may happen that sudo apt-get update gives the message:

N: Skipping acquire of configured file 'contrib/binary-i386/Packages' as repository 'http://research.cs.wisc.edu/htcondor/ubuntu/stable trusty InRelease' doesn't support architecture 'i386'
If your system is actually 64-bit, a common solution is to restrict the package search to the 64-bit architecture by adding [arch=amd64] to the corresponding entry in the list of sources (/etc/apt/sources.list), e.g.
deb [arch=amd64] http://research.cs.wisc.edu/htcondor/ubuntu/stable/ trusty contrib
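
The change can also be applied with sed instead of a text editor. The following is a sketch under the assumption that the entry was added exactly as shown in the installation steps; restrict_to_amd64 is a hypothetical helper name, and the demonstration works on a scratch file rather than on /etc/apt/sources.list.

```shell
# Add [arch=amd64] to the HTCondor "deb" entries of the file given as $1.
# A backup of the original file is kept next to it with the suffix .bak.
restrict_to_amd64() {
    sed -i.bak \
        's|^deb http://research.cs.wisc.edu/htcondor/|deb [arch=amd64] http://research.cs.wisc.edu/htcondor/|' \
        "$1"
}

# Demonstration on a scratch file.
scratch="$(mktemp)"
echo "deb http://research.cs.wisc.edu/htcondor/ubuntu/stable/ trusty contrib" > "$scratch"
restrict_to_amd64 "$scratch"
cat "$scratch"
```

Entries that already carry [arch=amd64] no longer match the pattern, so running the function twice does not add the option twice. On the real file, the sed command has to be run with sudo.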

Configure HTCondor

  1. create the config file /etc/condor/config.d/10-local.config. Please set as scheduler the default one you get on lxplus (e.g. bigbird05.cern.ch); you can find it out by running /usr/bin/cernbatchsubmit on lxplus. An example file is provided here:
    CONDOR_HOST = tweetybird03.cern.ch, tweetybird04.cern.ch
    COLLECTOR_HOST = tweetybird03.cern.ch, tweetybird04.cern.ch
    SCHEDD_HOST = bigbird05.cern.ch
    SCHEDD_NAME = $(SCHEDD_HOST)
    SEC_CLIENT_AUTHENTICATION_METHODS = KERBEROS
    SEC_CREDENTIAL_PRODUCER = /usr/bin/batch_krb5_credential
    CREDD_HOST = $(SCHEDD_HOST)
    FILESYSTEM_DOMAIN = cern.ch
    UID_DOMAIN = cern.ch
     
  2. restart HTCondor:
    sudo /etc/init.d/condor restart
     

Pay attention to the pair COLLECTOR_HOST and SCHEDD_HOST: depending on the collector, you may be able to reach only a subset of the schedulers. To get the full lists, please log in to lxplus.cern.ch and type:

  • schedulers: condor_status -schedd
  • collectors: condor_status -collector

-- Main.GiovanniIadarola - 2017-05-03

Topic revision: r17 - 2019-06-26 - RiccardoDeMaria
 