RunningBrandeisCodeOnBNLBatch

Introduction

This page collects information on how to run Brandeis analysis code on the BNL batch cluster.

Useful reading

Getting set up

First steps

Read the BNL cluster TWiki and follow the instructions to set up your BNL and associated computing account. Important: You should select "usatlastier3" as your group when prompted during the BNL account creation.

Set up your local config

After getting the account and setting up the login to the cluster via public key (see their TWiki!), set up your $HOME/.ssh/config as described on the TWiki:


host atlasgw
   Compression yes
   HostName atlasgw01.usatlas.bnl.gov
   User <your user name>
   AddressFamily inet

host atlasnx*
   Compression yes
   User <your user name>
   ProxyCommand ssh atlasgw nc %h %p
   AddressFamily inet

host spar*
   Compression yes
   User <your user name>
   ProxyCommand ssh atlasgw nc %h %p
   AddressFamily inet

Here, please do remember to replace the placeholders with (you guessed it) your BNL user name.

This will allow you to connect to the interactive batch nodes with one command, without needing to manually log into their gateway first.

It may also be useful to set a few aliases in your .bashrc:

# log into one of the spar interactive nodes, e.g. "spar 1" for spar0101
function spar(){
    ssh -X spar010$1
}
# log into one of the graphical atlasnx nodes, e.g. "atlasnx 1"
function atlasnx(){
    ssh -X -C atlasnx0$1
}

With these, you can ssh into the interactive batch nodes by typing for example

spar 1 #(or 2/.../8)

Any number from 1 to 8 will get you on the node with the corresponding number. Pick one that is not too busy.

If you type 'top' and see a user using 100% CPU on 'madevent', better find another node!

There are also graphical nodes with more programs (editors etc.) installed; with the alias above you can connect to those via

atlasnx 1 # (or 2)

Whenever you reboot your machine, you need to add your private key (the one belonging to the public key you used for the BNL account) to your ssh agent:

ssh-add .ssh/<Private key file>

You will need to enter the password you used to encrypt the key.

You can also create a small macro that starts an SSH agent if needed and picks up an existing one if possible - contact Max if interested.
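For illustration, a minimal sketch of such a macro could look like the following (the cache file name and the exact logic are my assumptions, not the actual script):

```shell
# start_agent: reuse a running ssh-agent if one is reachable, otherwise
# start a new one and cache its environment for future shells.
# Hypothetical sketch; the cache file location is an arbitrary choice.
start_agent() {
    local envfile="${HOME}/.ssh-agent-env"
    local agent_status=0
    # pick up a previously started agent, if any
    if [ -f "$envfile" ]; then
        . "$envfile" > /dev/null
    fi
    # ssh-add -l exits with 2 when no agent is reachable at all
    ssh-add -l > /dev/null 2>&1 || agent_status=$?
    if [ "$agent_status" -eq 2 ]; then
        ssh-agent -s > "$envfile"
        . "$envfile" > /dev/null
    fi
}
```

With this in your .bashrc, you can call start_agent once per reboot and then ssh-add your key.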

Logging into the cluster

Now, we are ready to connect. Log into a node with the

spar 1 # or other number

command.

If you did everything correctly, you should get a huge, scary disclaimer message, followed by a password prompt. Enter your password and you should be connected to the node.

Working on the cluster - local running

Setting up athena / ATLAS

To get the ATLAS environment, we use the same AtlasLocalRootBase setup as encountered in other cases:

    export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
    source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh 

Getting your code to BNL and building it

Editing directly on BNL is usually a pain, and hence not recommended. Instead, I suggest writing and testing your code locally (for example on your laptop) and syncing it to BNL using either

  • rsync, for example via
                   rsync -au --progress --exclude=InstallArea --exclude=.svn --exclude=RootCoreBin --exclude='*.so' --exclude=x86_64-slc6-gcc49-opt -e "ssh -i <location_of_my_usatlas_ssh_key>" <MyLocalDir> <TheDirAtBNL>

    - note that we need to provide our ssh key for the connection, done via the "-e" parameter. Use the 'exclude' parameters to avoid syncing things that will be different at BNL, such as binaries you built or output files.
  • git (push your changes, pull from BNL - note that this can lead to a LOT of commits!)
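For convenience, the rsync variant can be wrapped in a small shell function, so you only type the source and destination. This is my own sketch; the function name and ssh key path are placeholders you need to adapt:

```shell
# sync_to_bnl: push a local code directory to BNL, skipping build
# products and other files that differ between the two sides.
# Hypothetical helper; adjust the ssh key path and excludes to your setup.
sync_to_bnl() {
    local src=$1 dest=$2
    rsync -au --progress \
        --exclude=InstallArea --exclude=.svn --exclude=RootCoreBin \
        --exclude='*.so' --exclude=x86_64-slc6-gcc49-opt \
        -e "ssh -i $HOME/.ssh/usatlas_key" \
        "$src" "$dest"
}
```

You would then call e.g. `sync_to_bnl ~/Code/MyAnalysis <user>@atlasgw:MyAnalysis`.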

Once you have your code in place, you can build as usual. Before submitting batch jobs, make sure to test run the code locally on the interactive node at least once.

Where do I put my stuff?

Your code can go into your home directory. There are a few other useful locations:

  • /atlasgpfs01/usatlas/data/: Seems to be an additional area with a generous quota; I use it for my batch outputs and log files
  • /pnfs/usatlas.bnl.gov/users//: dcache area with ~5TB quota. You can for example copy input files here; this is particularly interesting for files not registered in rucio.
  • BNL-OSG2_LOCALGROUPDISK is accessible from BNL, so you can use rucio (https://rucio-ui-prod-01.cern.ch/r2d2/request) to request your favourite data or MC samples here. You have a 15TB quota, which can be extended on request.

One setup could be:

  • Code: In your home
  • Inputs (non-rucio): dcache area
  • Inputs (rucio): LOCALGROUPDISK
  • Outputs/Logs: atlasgpfs01

How do I access my datasets on the LOCALGROUPDISK?

You can use

rucio list-file-replicas --rse BNL-OSG2_LOCALGROUPDISK <MyDataSetName> 

and then replace

/pnfs/usatlas.bnl.gov/

by

root://dcgftp.usatlas.bnl.gov:1096//pnfs/usatlas.bnl.gov/

to get the path with the correct prefix for xrootd access.
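This string replacement is easy to script; here is a minimal sketch (the helper name is my own, the prefix is the one quoted above):

```shell
# pnfs_to_xrootd: rewrite plain PNFS paths (one per line on stdin) into
# xrootd URLs with the dcgftp prefix quoted above.
pnfs_to_xrootd() {
    sed 's|^/pnfs/usatlas.bnl.gov/|root://dcgftp.usatlas.bnl.gov:1096//pnfs/usatlas.bnl.gov/|'
}

# typical use, assuming a working rucio setup (hypothetical pipeline):
#   rucio list-file-replicas --rse BNL-OSG2_LOCALGROUPDISK <MyDataSetName> \
#     | grep -o '/pnfs/usatlas.bnl.gov/[^ ]*' | pnfs_to_xrootd > filelist.txt
```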

You can look at https://gitlab.cern.ch/atlas-mpp-xampp/XAMPPsm4lExample/blob/InclusiveLooseSelection/XAMPPsm4lExample/run/MakeFileLists.py for an example script providing this functionality (feel free to use if it works for you!).

How do I submit to the batch cluster?

In addition to the code you have set up and tested(!), you will typically need two things:

  • A job description file that tells the cluster what to run and which input / output files to transfer back and forth
  • A job script (typically, an executable shell script) that is executed by the cluster nodes and runs your code, performing the actual work.

The job description file

The job description files in condor allow you to perform quite a bit of magic, such as automatically submitting jobs for every line of a text file (which may for example contain a list of inputs), file copying and renaming, and more. Please see http://www.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorSubmitFile or http://research.cs.wisc.edu/htcondor/manual/v7.8/2_5Submitting_Job.html for more detail.

You can set variables within the file or pass them in using the "-a" parameter of condor_submit - the latter allows you to submit, for example, different files / options using the same job description file!

To access a variable, use $(variable_name) - of course, this is different from bash; it would be too easy otherwise.

An example which runs several jobs per file on a list of files provided by the user can be found at https://gitlab.cern.ch/atlas-mpp-xampp/XAMPPsm4lExample/blob/InclusiveLooseSelection/XAMPPsm4lExample/run/runAthena.job
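As a purely illustrative sketch (this is not the linked runAthena.job; the file names and variables below are invented), a job description that submits one job per line of a file list could look like:

```
# minimal.job - hypothetical example, not the linked runAthena.job
universe    = vanilla
executable  = runMyAnalysis.sh
# $(file) is filled in per job by the queue statement below;
# $(outdir) can be passed at submit time: condor_submit -a "outdir=/some/dir" minimal.job
arguments   = $(file) $(outdir)
output      = $(outdir)/job_$(Process).out
error       = $(outdir)/job_$(Process).err
log         = $(outdir)/jobs.log
# one job per line of filelist.txt
queue file from filelist.txt
```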

The job script

This is not strictly necessary - you could also call your executable directly in the job description. But usually, it is helpful to have a small executable .sh macro which can do additional things needed during the job: for example, set up kerberos / AFS tokens if needed, print debug info on the environment, validate the output at the end of the job, or determine some configuration parameters based on the arguments used to call the script.

An example matching the job description example above is at https://gitlab.cern.ch/atlas-mpp-xampp/XAMPPsm4lExample/blob/InclusiveLooseSelection/XAMPPsm4lExample/run/runAthena.sh

You can see that it is 80% debug printout, 15% deducing the parameters for running the analysis based on the configuration, and 5% the actual analysis call.
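The output validation step mentioned above can be as small as this sketch (my own example, not code from the linked script):

```shell
# validate_output: fail the job if the expected output file is missing or
# empty, so condor can mark it as failed instead of silently succeeding.
validate_output() {
    if [ ! -s "$1" ]; then
        echo "ERROR: missing or empty output file: $1" >&2
        return 1
    fi
    echo "output file $1 looks OK"
}
```

At the end of the job script you would call something like `validate_output "$OUTDIR/output.root" || exit 1`.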

Accessing files on PNFS in batch jobs

There are currently some issues with xrootd access in batch jobs. To give your jobs access to the files, you need to

  • obtain and copy a VOMS proxy:
voms-proxy-init -valid 96:0 -voms atlas:/atlas/usatlas --old
  • tell your batch job how to find the proxy by adding
x509userproxy = $ENV(X509_USER_PROXY)

in your job description file.

For an example, please see https://gitlab.cern.ch/atlas-mpp-xampp/XAMPPsm4lExample/blob/InclusiveLooseSelection/XAMPPsm4lExample/run/setupProxy.sh

Submitting - useful condor commands

  • condor_submit [-a "VARIABLE=VALUE"] [-a "ANOTHERVARIABLE=ANOTHERVALUE"] will submit your jobs!
  • condor_q will show information on your running jobs
    • If your jobs encounter problems, they will get held by the system. You can diagnose the reason via condor_q -held
  • condor_rm (followed by a job ID or your user name) can be used to kill your jobs if something went wrong
  • condor_release (followed by a job ID or your user name) can be used to put jobs in the "Held" state (meaning something went wrong) back to running once you fixed the problem.
  • condor_submit_dag can be used to submit a DAG, which is a beautiful way of coding up jobs which depend on other jobs. See below for more!

Chaining jobs together - the DAG mechanism

Imagine you would like to run N processing jobs which each run on one file of your input dataset, and one postprocessing step which puts all the outputs together. Condor can do this sort of thing automatically for you! The recipe is the DAG (directed acyclic graph) - basically a very simple syntax used to define interdependencies between jobs.

Using this, we can tell the cluster which of our jobs depend on which, and it will automatically execute them in the appropriate order.

To define such a set of jobs, you need

  • job description files and job scripts for each individual stage of the workflow
  • a .dag file to chain them together

The latter has to list the jobs that should be executed and their interdependency. It can also define variables that should be passed to the job scripts - this can for example be used to define a certain folder as output to the first job and input to the second one.

A simple example for the use case described above, using the macros I linked in earlier sections:

# first we define the jobs 

# job 1: run our athena analysis on one input file 
JOB  runAthena  runAthena.job
# job 2: Merge the outputs with hadd and run a small postprocessing (normalisation etc).
JOB  MergeAndAugment  MergeAndAugment.job

# Now, we define their relationship: The merge job should happen after ALL of the runAthena jobs are done! 
# Note: The number of such jobs is defined in the runAthena job description file (see above). 
PARENT runAthena CHILD MergeAndAugment

# finally, define a few variables to configure the individual jobs 
# This is just for demonstration and will look different for each job
VARS runAthena FILELIST="/direct/usatlas+u/goblirsc/Code/SM4l/XAMPPsm4lExample/run/FileLists/2019-03-30-DAOD_HIGG2D1atBNL//Filelist_mc16_13TeV.364250.Sherpa_222_NNPDF30NNLO_llll.deriv.DAOD_HIGG2D1.e5894_s3126_r9364_p3654.txt"
VARS runAthena stepsPerFile="1"
VARS runAthena LogDir="/atlasgpfs01/usatlas/data/goblirsc/SM4l/BatchProduction/logs/Nominal/2019-04-01-1620"
VARS runAthena TmpDir="/atlasgpfs01/usatlas/data/goblirsc/SM4l/BatchProduction/Tmp/Nominal/2019-04-01-1620"
VARS runAthena xamppArgs="--noSys"
VARS MergeAndAugment OutDir="/atlasgpfs01/usatlas/data/goblirsc/SM4l/BatchProduction/Out/Nominal/2019-04-01-1620"
VARS MergeAndAugment Sample="mc16_13TeV.364250.Sherpa_222_NNPDF30NNLO_llll.deriv.DAOD_HIGG2D1.e5894_s3126_r9364_p3654"
VARS MergeAndAugment LogDir="/atlasgpfs01/usatlas/data/goblirsc/SM4l/BatchProduction/logs/Nominal/2019-04-01-1620"
VARS MergeAndAugment InDir="/atlasgpfs01/usatlas/data/goblirsc/SM4l/BatchProduction/Tmp/Nominal/2019-04-01-1620"
VARS MergeAndAugment nFilesToMerge="218"

These DAGs can then be submitted to condor using 'condor_submit_dag --append "accounting_group=group_atlas.brandeis" MyDag.dag' (not sure if the accounting_group is still needed!)

What condor_q will show you is a dagman job that manages the DAG. It will start spawning the first type of job, using the respective job description file, monitor the status, and submit the follow-up when all jobs of the first stage are done. It will also write output files such as MyDag.dag.dagman.out, which can be very useful for debugging crashing jobs.

Since DAGs are so simple to construct, it can be useful to have a small python macro that builds an appropriate DAG for the run you have in mind. See https://gitlab.cern.ch/atlas-mpp-xampp/XAMPPsm4lExample/blob/InclusiveLooseSelection/XAMPPsm4lExample/run/Submit.py for an example.
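The linked macro is in python, but the same idea fits in a few lines of shell. This is a hypothetical sketch with placeholder paths, just to show the pattern:

```shell
# make_dag.sh: generate a two-stage DAG for a given run tag.
# Hypothetical sketch; the base directory below is a placeholder.
TAG=${1:-$(date +%Y-%m-%d-%H%M)}
BASE="/atlasgpfs01/usatlas/data/<user>/BatchProduction"
cat > MyDag.dag <<EOF
JOB runAthena runAthena.job
JOB MergeAndAugment MergeAndAugment.job
PARENT runAthena CHILD MergeAndAugment
VARS runAthena LogDir="${BASE}/logs/${TAG}"
VARS runAthena TmpDir="${BASE}/Tmp/${TAG}"
VARS MergeAndAugment InDir="${BASE}/Tmp/${TAG}"
VARS MergeAndAugment OutDir="${BASE}/Out/${TAG}"
VARS MergeAndAugment LogDir="${BASE}/logs/${TAG}"
EOF
echo "wrote MyDag.dag for tag ${TAG}"
```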

-- MaximilianGoblirschKolb - 19 May 2016

Topic revision: r3 - 2019-08-23 - unknown
 