Linux Batch Service

Short description

ABP users can launch computing jobs on the CERN Linux batch service. To do so, request the "LXPLUS and Linux" service for your CERN account, which allows you to connect to the LXPLUS interactive logon service.

Please note that, as of May 2018, the 16-core nodes under HTCondor are operational.

You just need to add:

RequestCpus = 16

in your submission script.
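For context, a minimal submit description file using such a node might look like the following sketch (the executable name myjob.sh and the output file names are placeholders, not part of the service documentation):

```
# job.sub - minimal HTCondor submit description (illustrative; file names are placeholders)
executable = myjob.sh
output     = myjob.$(ClusterId).$(ProcId).out
error      = myjob.$(ClusterId).$(ProcId).err
log        = myjob.$(ClusterId).log

# request a full 16-core node
RequestCpus = 16

queue 1
```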

As of 21 November 2018, 32-core HTCondor nodes are also available. To use them, set instead:

RequestCpus = 32

Please refer to http://cern.ch/batchdocs/ for other HTCondor submission information.

A tutorial is available at: http://batchdocs.web.cern.ch/batchdocs/tutorial/introduction.html
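Submission and monitoring then follow the standard HTCondor command-line workflow; a typical session might look like this (job.sub is a placeholder submit file name, and these commands require access to an HTCondor scheduler such as the one on LXPLUS):

```shell
# submit the job described in job.sub (placeholder name); prints the cluster id
condor_submit job.sub

# check the status of your jobs in the queue
condor_q

# list any jobs that were put on hold, together with the hold reason
condor_q -held
```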

Observed HTCondor Issues

The following issues are not reliably reproducible but recur frequently (reported by Riccardo, Dario, and Nikos):

  • The scheduler not answering or taking a long time to answer; submission fails with "ERROR: Can't find address of local schedd"
  • Jobs disappearing from the queue shortly after the expected completion time without being explicitly removed
  • Jobs put on hold due to node errors while copying files
  • Submissions taking a long time, in particular in combination with data written to EOS (worse) or AFS (better, but still visible)
  • Jobs showing a large variation in completion time when involving I/O with EOS (worse) or AFS (better, but still visible)
  • Job failure rates on the order of 10% when involving I/O with EOS (worse) or AFS (better, but still visible)

The typical answers from the service managers were:

  • The scheduler is being upgraded (ongoing since May 2017)
  • The user is assigned to another scheduler
  • Add request_memory = 2000 (the default value anyway) to the submission script
  • Sometimes no solution is provided
  • EOS issues will not be further addressed until the new EOS FUSE software is available (not before September 2017)

Most of the issues were mitigated by (Riccardo):

  • using the option -spool in condor_submit in combination with condor_transfer_data,
  • installing HTCondor on the local desktop (not supported by the LXBATCH managers, but encouraged by the EOS managers), and
  • for scheduler problems, logging into a different LXPLUS machine.
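The -spool workaround above can be sketched as follows: input and output files transit via the scheduler instead of the shared file system, and the outputs stay spooled there until explicitly retrieved (the cluster id 1234 is a placeholder):

```shell
# submit with files spooled via the scheduler rather than read from AFS/EOS
condor_submit -spool job.sub

# ... wait for the job to complete, e.g. by polling condor_q ...

# retrieve the spooled output files of cluster 1234 (placeholder id)
condor_transfer_data 1234
```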

However, the -spool option may overload the HTCondor server if the amount of data in transit is too large (>20 MB).


Observed reasons for jobs being held (check with condor_q -held)

  • Error from slot1_5@b6d5247564.cern.ch: STARTER at 188.184.97.100 failed to send file(s) to <128.142.196.38:9618>; SHADOW at 128.142.196.38 ...
  • Failed to initialize user log to /afs/cern.ch/ ...
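When jobs end up held for transient reasons like the above, inspecting and releasing them from the command line is often enough (1234 is a placeholder cluster id; these commands assume access to the scheduler that owns the jobs):

```shell
# list held jobs together with a short hold reason
condor_q -held

# print the full hold reason of a specific job
condor_q 1234.0 -af HoldReason

# release the held jobs of cluster 1234 so they are rescheduled
condor_release 1234
```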


Transferring output files from the compute node to AFS and EOS

  • HTCondor features a mechanism for transferring output files generated on the compute node to AFS, but not to EOS (by specifying transfer_output_files = "NAMEOFYOURFILE" in the job description). This transfer mechanism is limited to 1 GB per file. To overcome these limitations, the current recommendation from IT is to use cp to an AFS location or xrdcp to an EOS location at the end of the execution script.
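The recommended copy at the end of the execution script could look like the following sketch (all paths, the payload name run_analysis, and the output file result.root are placeholders; xrdcp is the XRootD copy client used for EOS):

```shell
#!/bin/bash
# myjob.sh - run the payload, then copy results off the compute node

./run_analysis                # placeholder for the actual payload

# copy to AFS with plain cp (AFS is mounted on the compute nodes)
cp result.root /afs/cern.ch/work/u/user/results/

# copy to EOS with xrdcp over the XRootD protocol (placeholder instance and path)
xrdcp result.root root://eosuser.cern.ch//eos/user/u/user/results/result.root
```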
Topic revision: r10 - 2020-02-07 - XavierBuffat