ATLAS Northeast Tier 3

The NET3 is a set of interactive nodes used for analysis by the UMass ATLAS group. Their names and purposes are:

  • um1.net3.mghpcc.org : analysis
  • um2.net3.mghpcc.org : analysis
  • um3.net3.mghpcc.org : analysis
  • um4.net3.mghpcc.org : analysis
  • um5.net3.mghpcc.org : engineering

um5 is reserved for engineering (detector development) work. Please refrain from using it for other purposes.

New Users

New users should request an account through their advisor and join the Slack workspace https://northeasttier3.slack.com/. The following information will be needed:

  • Full name
  • username (if you have a CERN/lxplus login, please use the same one)
  • Phone number
  • ssh public key

To generate an ssh key pair, you can use the command ssh-keygen. You will be asked to choose the location of the keys; the standard place is the .ssh folder in your home directory. Two files are created: id_rsa and id_rsa.pub. The first is your private key, the second is your public key. You should never share your private key, but you may copy id_rsa to other machines from which you want to connect (lxplus and the NET3 machines, for instance). The content of id_rsa.pub is what is needed to create an account.
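
A minimal sketch of the key generation (the comment string is just a label; the file names assume the default RSA key type):

    ssh-keygen -t rsa -b 4096 -C "net3-access"
    # keys are written to ~/.ssh/id_rsa (private) and ~/.ssh/id_rsa.pub (public)
    cat ~/.ssh/id_rsa.pub    # send this output with your account request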

Setting Up Your Environment

Unlike lxplus, the NET3 machines do not have the setupATLAS command alias defined. You can define it with:

    export ATLAS_LOCAL_ROOT_BASE="/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase"
    alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"

You can add these lines to your ~/.bash_profile so that they are automatically executed when a new interactive login shell is opened.

Using ROOT

ROOT can be set up with lsetup root. To see all the available ROOT versions, do lsetup root -h, choose one, and then do lsetup "root [ver]".
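
For example (a sketch; the version string in brackets is a placeholder for one of the versions listed by the help command):

    lsetup root           # default ROOT version
    lsetup root -h        # list the available versions
    lsetup "root [ver]"   # set up a specific version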

Using python

To use python, you can do lsetup python. Only the default python modules will be available. New modules can be installed using the pip command with the following syntax:

    pip install [module] --user

You can also create python virtual environments where different sets of modules are installed. To use a python virtual environment (the full sequence is also sketched after this list):

  • Make sure virtualenv is installed: pip install virtualenv --user
  • Create the virtual environment inside your project directory (or wherever you like, but choose wisely): python -m virtualenv myenv, where myenv is your chosen name for the environment.
  • Activate the environment: source ./myenv/bin/activate
  • Now a special python environment exists within myenv and is activated. Try which python and which pip to see it. You can now install python libraries with pip and import them.
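
A minimal sketch of the whole sequence (myenv and the numpy install are just examples):

    lsetup python
    pip install virtualenv --user
    python -m virtualenv myenv     # create the environment
    source ./myenv/bin/activate    # activate it
    which python && which pip      # both should now point inside myenv
    pip install numpy              # packages now install into the environment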

Using ML tools

The NET3 machines do not have GPUs, but they are very powerful machines that can be used to train ML algorithms. To use the usual ML tools (keras/tensorflow/pytorch/...), you need to install them. That can be done with pip (see above) or by loading the environments provided by ATLAS and CERN.
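
For the pip route, a minimal sketch (the package list is only an example; a virtual environment, as described above, keeps the installation contained):

    pip install --user tensorflow torch scikit-learn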

The ATLAS Machine Learning group distributes Docker images that can be run on the umX machines. It is convenient to use the unpacked images available under /cvmfs by doing:

    singularity exec '/cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlasml/ml-base:latest' bash
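
You can also run a script non-interactively inside the container (train.py is a hypothetical script name):

    singularity exec '/cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlasml/ml-base:latest' python train.py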

Alternatively, CERN also distributes views of the usual ML tools that can be set up by doing:

    source /cvmfs/sft.cern.ch/lcg/views/[release]/[platform]/setup.sh
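
For example, a sketch assuming the LCG_102 release and the x86_64-centos7-gcc11-opt platform are available; check the /cvmfs directory for the releases and platforms that actually exist:

    ls /cvmfs/sft.cern.ch/lcg/views/    # list the available LCG releases
    source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh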

Using gitlab

In order to use git, a simple lsetup git is enough. However, CERN GitLab relies heavily on Kerberos authentication, and you will need to get a Kerberos ticket before performing operations with the remote server. If you encounter the error HTTP Basic: Access denied, it means that your ticket expired and you need to get a new one. The following commands should set up git and retrieve a ticket for you:

    lsetup git
    kinit -f [username]@CERN.CH
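
You can then check that your ticket is valid and interact with the remote as usual. A sketch (the repository path is hypothetical; CERN GitLab shows a dedicated KRB5 clone URL on each project page, which is the form used below):

    klist                                                       # list current Kerberos tickets
    git clone https://:@gitlab.cern.ch:8443/[group]/[repo].git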

Connecting to NET3

Using ssh

In order to connect to a machine, you can use ssh by doing

    ssh -X -l [username] umX.net3.mghpcc.org

If you are using Windows, you will need to install an X11 server such as Xming, but this method is really not recommended. If you need a graphical interface, it is much better to use x2go (below).

Once you log in, if you want your session to be persistent, you can use the command screen. Your commands will continue to run even if the connection drops (or when you detach with Ctrl+A, D). Note that each time you run screen a new shell is created, so use it wisely. You can list all your open screen sessions with screen -ls and return to a previous session with screen -r. More details are in the GNU manual. Another option is tmux, which is more powerful and allows sessions to be shared between users (for cooperative coding, for instance), but serves essentially the same purpose.
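
A typical screen workflow looks like the sketch below (the session name is just an example):

    screen -S analysis    # start a named session
    # ... run your commands, then detach with Ctrl+A, D ...
    screen -ls            # list your open sessions
    screen -r analysis    # reattach to the named session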

Using x2go

x2go is a remote desktop system with clients available for macOS, Windows, and Linux. You can download it from the x2go webpage. Once installed, create a new session (Ctrl+N) and fill in the following fields:

  • Session name: enter something meaningful here
  • Host: umX.net3.mghpcc.org (where X is any number from 1 to 5)
  • Login: your username
  • SSH port: 22
  • Use RSA/DSA key for ssh connection: enter the path of your ssh key
  • Check the box "Try auto login"

The image below shows a typical configuration for um1:

x2gosetup.jpg
Example x2go configuration

File storage

/home area

Your home area is mounted on an NFS drive and is available on all umX machines. Usage of the home space should be kept to a minimum: the whole NET3 cluster will stop working if the /home area gets full. It is strongly recommended that you keep your /home area below 30 GB. If you need to check how much space you are using, just do du -sh . from your home folder.

You should not use your home area as a source of files for grid jobs (see below) either, since this can unnecessarily stress the network.

/scratch area

The scratch area is a local disk, only visible to the local machine. It should be the storage of choice for the small ntuples you are currently analyzing. The space is limited, but should be enough for everyone if you keep it clean. um1/2/3 each have 2 TB of disk, while um4/5 each have 1.5 TB. The simplest way to copy files to and from the local /scratch disks is via scp. The scratch disks can also be used to store the results of your batch jobs (see below).

IMPORTANT: To keep the disks organized, create a folder with mkdir /scratch/[username] and put all your files inside. You should try to limit your usage to 300 GB (across every umX machine!). If your ntuples become too large and you need to store them in a large-scale storage, please discuss the issue in the Slack channel. Under reasonable circumstances, the files can be moved to permanent storage.
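
A minimal sketch of the recommended setup and a quick usage check (replace [username] with your own username):

    mkdir -p /scratch/[username]    # create your personal scratch folder
    du -sh /scratch/[username]      # check how much of the ~300 GB budget you are using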

If the /scratch of a machine gets full, the machine will cease to work. Usually, the first symptom is that /cvmfs (which relies on /scratch for buffering) is no longer accessible.

Grid space

If you want to analyze the output of your grid jobs (your ntuples, for instance) on NET3, you will have to transfer it to the MGHPCC disks. If you are producing the ntuples yourself, the easiest way is to instruct PanDA to do the transfer for you. When running your pathena command, just add:

    lsetup panda
    pathena --destSE NET2_LOCALGROUPDISK ...

If the ntuples were created by someone else and are stored at another site, you can request a transfer using rucio. To do that, the first thing you need is to install your grid certificate in your browser. You only need to do that once (per certificate), and instructions are provided on the US-ATLAS webpages. Once your certificate is installed, access the webpage https://rucio-ui.cern.ch/, log in using your X.509 certificate, and select Request new rule in the Data Transfers (R2D2) menu.

You need to fill out four forms. On the first, you enter your dataset identifier. On the second, you enter the RSE, which should be NET2_LOCALGROUPDISK. On the third, you can specify the lifetime of the copy. Each user should have 15 TB of space in NET2_LOCALGROUPDISK (see below).
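
If you prefer the command line, the same rule can be created with the rucio client. A sketch, assuming you have a valid grid proxy (the lifetime is given in seconds; the values are examples):

    lsetup rucio
    voms-proxy-init --voms atlas
    rucio add-rule [dataset] 1 NET2_LOCALGROUPDISK --lifetime 2592000   # one copy, ~30 days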

r2d2setup.jpg
Example R2D2 (Rucio Data Transfer) configuration

Accessing files through xrootd

Once you know that a file is on a NET2 storage element (NET2_LOCALGROUPDISK, NET2_DATADISK, soon NESE...), you can access it via xrootd as if it were local. To discover the local paths of the files in a dataset, use the command:

    lsetup rucio
    rucio list-file-replicas [dataset] --rse NET2_LOCALGROUPDISK

where NET2_LOCALGROUPDISK can be replaced by other storage elements. The local paths will begin with davs://atlas-dtn4.bu.edu:1094/gpfs1/. These paths can be opened directly inside ROOT:

    auto file = TFile::Open("davs://atlas-dtn4.bu.edu:1094/gpfs1/[filepath]");

They can also be given as inputs to any ATLAS software. Note that you need a grid certificate to access files this way. To create a proxy from your certificate, use the command voms-proxy-init --voms atlas.
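
Putting the pieces together, a typical interactive session might look like the sketch below (the dataset name and file path are placeholders):

    setupATLAS
    lsetup rucio root
    voms-proxy-init --voms atlas                                    # create the grid proxy
    rucio list-file-replicas [dataset] --rse NET2_LOCALGROUPDISK    # find the davs:// paths
    root -l 'davs://atlas-dtn4.bu.edu:1094/gpfs1/[filepath]'        # open one file directly in ROOT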

Accessing CERN EOS

Files in CERN EOS can also be accessed via xrootd. It is not particularly fast, but it may help if you don't want to move large amounts of files. There are basically two locations where files are stored in CERN EOS: /eos/user and /eos/atlas (note that some users may have a /eos/atlas/user area, which falls in the latter category).

To access files in /eos/user, you can do:

    auto file = TFile::Open("root://eosuser.cern.ch//eos/[filepath]");

and to access files in /eos/atlas, you can do:

    auto file = TFile::Open("root://eosatlas.cern.ch//eos/[filepath]");

Note that, in both cases, you will need a grid proxy (see voms-proxy-init above).

Jupyter notebooks

Installation

Jupyter notebooks come with most of the ML setups described above, but you can also install them yourself by doing:

    pip install ipython jupyter --user

You just need to do it once. You may choose to install it in a virtual environment, as discussed above.

Connect using ssh tunnel

You can access the notebook in your own browser by creating an ssh tunnel. To do that, open an ssh connection to one of the umX machines, start a screen session, and do:

    export ssh_port=`python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()'`
    ssh -f -o ExitOnForwardFailure=yes ${HOSTNAME} -R ${ssh_port}:localhost:${ssh_port} sleep 30
    echo ${ssh_port}
    jupyter notebook --no-browser --port=${ssh_port} --ip 127.0.0.1

The echo command tells you on which port the tunnel was created. The jupyter command will print an http address for the notebook. You can then go back to your own computer and start your side of the tunnel:

    ssh -N -f -L [port]:localhost:[port] -l [username] umX.net3.mghpcc.org

and then open the http address in your browser.

Connect using x2go

If you are using x2go, you can use the browser on the umX machine directly. In this case, just start the notebook with jupyter notebook.

Fast analysis with RDataFrames

The NET3 machines are servers with many CPUs (um1/2/3 have 80 each, um4/5 have 64 each). To fully benefit from the computing power of these machines, it is strongly recommended to use RDataFrame to read your ntuples.

    import ROOT
    
    # This activates implicit multi-threading
    ROOT.EnableImplicitMT()
    
    # The analysis below runs in parallel
    rdf = ROOT.RDataFrame("mytree", "myfile.root")
    hx = rdf.Filter("x > 0").Histo1D("x")
    hy = rdf.Filter("x > 0").Histo1D("y")
    hz = rdf.Filter("x > 0").Histo1D("z")
    
    c = ROOT.TCanvas("canvas", "", 1200, 800)
    c.Divide(2,2)
    c.cd(1)
    hx.Draw()
    c.cd(2)
    hy.Draw()
    c.cd(3)
    hz.Draw()
    c.Draw()

Note that the Draw commands are only issued after all histograms are defined (lazy evaluation). When the histograms are filled, the filling is done in parallel, using all the machine's CPUs. More information is in the RDataFrame manual. This example also works in Jupyter notebooks (using JupyROOT), where c.Draw() will render the canvas in the notebook.

Batch system

The NET3 has a small batch system for analysis. For large jobs, you should use the grid, but for analysis of the final ntuples (making histograms, setting limits, ...) the batch system is the ideal tool. It is also very useful for producing very large gridpacks. The batch system uses PBS for job scheduling.

Submitting jobs manually

In order to submit jobs, you should make a bash script and use the command qsub -q tier3 [script.sh].

A typical script begins with:

    #!/bin/bash
    
    cd ${TMPDIR}
    shopt -s expand_aliases   # aliases are not expanded in non-interactive shells by default
    export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
    alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
    setupATLAS

Checking your jobs

You can check the status of your jobs by doing qstat -q tier3 -u [username].

Deleting jobs

Sometimes jobs get stuck. IMPORTANT: Please check frequently whether you have stuck jobs with the qstat command described above. If you do, delete them with qdel [jobid], where jobid is the number in the first column of the qstat output.

Job configuration

Usually, you will want a different configuration for each worker job. This can be accomplished by submitting the jobs with the options qsub -t [start]-[end] -q tier3 ..., where [start] is the number of the first job and [end] is the number of the last job. Each job will have an environment variable called $SGE_TASK_ID set to its job number.

For instance, if you would like to define one input file per job, you could do:

    inputFiles=("file1.root" "file2.root" "file3.root")
    
    # SGE task IDs start at 1, while bash arrays are indexed from 0
    python analysis.py ${inputFiles[$((SGE_TASK_ID - 1))]}

A similar configuration can be achieved by defining more arrays.

Sending your environment

Often, you will need to send your environment to the batch jobs. This is the case if you set up special variables for ML tools or another similar ATLAS setup. To copy all of your environment variables to the worker nodes, submit your jobs with the option:

    qsub -V -q tier3 ...

Submitting multithread jobs

Submitting a batch job with parallel processing/multithreading may overwhelm the tier3 nodes if the number of threads is greater than the number of processors per node. Instead, such jobs must be submitted with a specified "parallel environment", which will distribute the threads over several nodes. This is done with the -pe smp [n_slots] option, for example:

    qsub -pe smp 40 -q tier3 ...

Setting a parallel environment will likely lead you to create several jobs, so use discretion and do not submit an excessive number of total jobs.

Submit jobs with EventLoop

If part of your work uses EventLoop, you can use the pre-defined EL::GEDriver class to submit batch jobs with no extra effort. See this code for example usage.

    job.options()->setString(EL::Job::optSubmitFlags, "-V -q tier3");

Running Madgraph jobs

The umX machines are quite powerful and can be an asset for producing challenging Madgraph gridpacks. For processes that require a long integration time per helicity combination, your best bet is to use the following Cards/me5_configuration.txt option:

    run_mode = 2

For processes that may have many thousands of helicity combinations (multileg processes), where each one does not require much processing, you can use the cluster. The settings in Cards/me5_configuration.txt in this case are:

    run_mode = 1
    cluster_type=pbs
    cluster_queue=tier3
    cluster_local_path=/cvmfs/atlas.cern.ch/repo/sw/Generators/

Grid certificate

You may need your grid certificate in your batch jobs (to access xrootd, for instance). The easiest way to transfer a proxy to the batch jobs is to create one on the umX machine before submitting (with the usual voms-proxy-init command) and then add the following line to the beginning of your script:

    scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null umX:/tmp/x509up_u[uid] /tmp

where uid is your numeric user ID on NET3 and umX is the machine where you created the proxy. If you don't know your user ID, you can find it with the command id.

This method also assumes you have copied your private SSH key to your ~/.ssh/ folder on NET3 (with the correct permissions: 600 for id_rsa and 644 for id_rsa.pub).

Transferring files

The same method described above can be used to transfer files from your script back to the /scratch directory of one of the umX machines when your script is done. The last line of a script is usually:

    scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null outfile.root umX:/scratch/[username]/outfile_${SGE_TASK_ID}.root

Example of a practical batch system script

This example puts together all the items above and can serve as a skeleton for your analysis jobs. It is just a made-up script that uses Rafael's username as an example; you should change it accordingly.

    #!/bin/bash
    
    cd ${TMPDIR}
    shopt -s expand_aliases   # aliases are not expanded in non-interactive shells by default
    export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
    alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
    setupATLAS
    asetup AnalysisBase 21.2.201
    
    # fetch the grid proxy created on um1 before submission
    scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null um1:/tmp/x509up_u1000 /tmp/x509up_u1000
    
    inputFiles=("davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000070.pool.root.1" \
    "davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000071.pool.root.1" \
    "davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000072.pool.root.1" \
    "davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000073.pool.root.1")
    
    # SGE task IDs start at 1, while bash arrays are indexed from 0
    python analyze.py ${inputFiles[$((SGE_TASK_ID - 1))]}
    scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null outfile.root um1:/scratch/rcoelhol/analysis/outfile_${SGE_TASK_ID}.root