ATLAS Northeast Tier 3
The NET3 is a set of interactive nodes used for analysis by the UMass ATLAS group. Their names and purposes are:
- um1.net3.mghpcc.org : analysis
- um2.net3.mghpcc.org : analysis
- um3.net3.mghpcc.org : analysis
- um4.net3.mghpcc.org : analysis
- um5.net3.mghpcc.org : engineering
um5 is reserved for engineering (detector development) work. Please refrain from using it for other purposes.
New Users
New users should request an account from their advisor and join the Slack workspace https://northeasttier3.slack.com/. The following information will be needed:
- Full name
- username (if you have a CERN/lxplus login, please use the same)
- Phone number
- ssh public key
To generate an ssh key pair, you can use the command ssh-keygen. You will be asked to choose the location of the key; the standard place is a folder called .ssh in your home directory. Two files are created, id_rsa and id_rsa.pub. The first is your private key, the second is your public key. You should never share your private key, but you can copy the file id_rsa to other machines where you may want to use it (lxplus and the NET3 machines, for instance). The content of id_rsa.pub is what is needed to create an account.
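For example, a minimal sketch (the RSA key type matches the file names above; accept the default location when prompted and, ideally, set a passphrase):
ssh-keygen -t rsa -b 4096
cat ~/.ssh/id_rsa.pub    # this is the public key to send with your account request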
Setting Up Your Environment
The NET3 machines do not have the command alias setupATLAS defined as on lxplus. You can define it with:
export ATLAS_LOCAL_ROOT_BASE="/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase"
alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
You can add these lines to your ~/.bash_profile so that they are automatically executed when a new interactive login shell is opened.
ROOT can be set up with lsetup root. If you want to see all the versions of ROOT available, you can do lsetup root -h, choose one, and do lsetup "root [ver]".
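As a sketch of a typical session (the ROOT version string below is only an illustration; pick one from the list printed by lsetup root -h):
setupATLAS
lsetup root -h                                # list the available ROOT versions
lsetup "root 6.28.12-x86_64-el9-gcc13-opt"    # example version string, replace with one from the list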
Using python
To use python, you can do lsetup python. Only the default python modules will be available. New modules can be installed using the pip command with the following syntax:
pip install [module] --user
You can also create python virtual environments where different modules are installed. To use a python virtual environment, you can do the following (a complete sketch is shown after this list):
- Make sure virtualenv is installed: pip install virtualenv --user
- Create the virtual environment inside your project directory (or wherever you like, but choose wisely): python -m virtualenv myenv, where myenv is your chosen name for the environment.
- Activate the environment: source ./myenv/bin/activate
- Now a special python environment exists within myenv and is activated. Try which python and which pip to see. Now you may install python libraries with pip and import them.
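Putting the steps together, a minimal sketch (myenv is just an example name, and numpy stands in for whatever modules you need):
lsetup python
pip install virtualenv --user
python -m virtualenv myenv
source ./myenv/bin/activate
which python && which pip    # both should now point inside myenv
pip install numpy            # install the modules you need
deactivate                   # leave the environment when done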
Using ML tools
The NET3 machines do not have GPUs, but they are very powerful machines that can be used to train ML algorithms. In order to use the usual ML tools (keras/tensorflow/pytorch/...) you need to install them. That can be done with pip (see above) or by loading the environments provided by ATLAS and CERN.
The ATLAS Machine Learning group distributes Docker images that can be loaded inside the umX machines. It is convenient to use the unpacked files in /cvmfs by doing:
singularity exec '/cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlasml/ml-base:latest' bash
Alternatively, CERN also distributes views of the usual ML tools that can be set up by doing:
source /cvmfs/sft.cern.ch/lcg/views/[release]/[platform]/setup.sh
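As an illustration (the release and platform below are assumptions; check the directories under /cvmfs/sft.cern.ch/lcg/views for combinations that actually exist):
source /cvmfs/sft.cern.ch/lcg/views/LCG_104/x86_64-el9-gcc13-opt/setup.sh
python -c "import tensorflow as tf; print(tf.__version__)"    # quick check that the ML tools are available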
Using gitlab
In order to use git, a simple lsetup git is enough. However, the CERN GitLab relies heavily on Kerberos authentication and you will need to get a Kerberos ticket before being able to perform operations with the remote server. If you encounter the error HTTP Basic: Access denied, it means that your ticket expired and you need to get a new one. The following commands should set up git and retrieve a ticket for you:
lsetup git
kinit -f [username]@CERN.CH
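As a sketch, cloning a repository over Kerberos would then look something like this (the group and project names are placeholders; the URL form assumes CERN GitLab's Kerberos endpoint on port 8443):
lsetup git
kinit -f [username]@CERN.CH
git clone https://:@gitlab.cern.ch:8443/[group]/[project].git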
Connecting to NET3
Using ssh
In order to connect to a machine, you can use ssh by doing
ssh -X -l [username] umX.net3.mghpcc.org
If you are using Windows, you will need to download an X11 client. Xming is a good option, but this method is really not recommended. If you need a graphical interface, it is much better to use x2go (below).
Once you log in, if you want your session to be persistent, you can use the command screen. Your commands will continue to run even if the connection drops (or when you detach with Ctrl+A, D). Note that each time you use screen a new shell is created, so use it wisely. You can always see all your open screens by doing screen -ls. To return to a previous screen session, use the command screen -r. More details are in the GNU manual. Another option is to use tmux. tmux is more powerful and allows sessions to be shared between users (for cooperative coding, for instance), but is essentially the same thing.
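A typical screen workflow, as a short sketch ("analysis" is just an example session name):
screen -S analysis    # start a named session
# ... run your long commands, then detach with Ctrl+A, D ...
screen -ls            # list your open sessions
screen -r analysis    # reattach to the named session later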
Using x2go
x2go is a remote desktop system with clients available for macOS, Windows, and Linux. You can download it from the x2go webpage. Once installed, create a new session (Ctrl+N). The following fields should be filled in:
- Session name: enter something meaningful here
- Host: umX.net3.mghpcc.org (where X is any number from 1 to 5)
- Login: your username
- SSH port: 22
- Use RSA/DSA key for ssh connection: enter the path of your ssh key
- Check the box "Try auto login"
The image below shows a typical configuration for um1:
Example x2go configuration
File storage
/home area
Your home area is mounted on an NFS drive and is available on all umX machines. The usage of the home space should be kept to a minimum. The whole NET3 cluster will stop working if the /home area gets full. It is strongly recommended that you keep your /home area below 30 GB. If you need, at any moment, to check how much space you are using, just run du -sh . from your home folder.
You should not use your home area as a source of files for grid jobs (see below) either, since this can unnecessarily stress the network.
/scratch area
The scratch area is a local disk, seen only by the local machine. It should be the storage of choice for the small ntuples you are currently analyzing. The space is limited, but should be enough for everyone if you keep it clean. um1/2/3 each have 2 TB of disk, while um4/5 each have 1.5 TB. The simplest way to copy files to and from the local /scratch disks is via scp. They can also be used to copy the results of your batch jobs (see below).
IMPORTANT: In order to keep the disks organized, create a folder with mkdir /scratch/[username] and put all your files inside. You should try to limit your usage to 300 GB (in total, across all umX machines!). If your ntuples become too large and you need to store them in large-scale storage, please discuss the issue in the Slack channel. Under reasonable circumstances, the files can be moved to permanent storage.
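For example, copying an ntuple from your own computer into your /scratch folder on um1 might look like this (the username and file name are placeholders):
ssh [username]@um1.net3.mghpcc.org "mkdir -p /scratch/[username]"
scp myntuple.root [username]@um1.net3.mghpcc.org:/scratch/[username]/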
If the /scratch of a machine gets full, the machine will cease to work. Usually, the first symptom is that /cvmfs (which relies on /scratch for buffering) will no longer be accessible.
Grid space
If you want to analyze the output of your grid jobs (your ntuples, for instance) in NET3, you will have to transfer them to the MGHPCC disks. If you are producing the ntuples yourself, the easiest way to do this is to instruct PanDA to do that for you. When running your pathena command, just add:
lsetup panda
pathena --destSE NET2_LOCALGROUPDISK ...
If the ntuples were created by someone else and are stored at another site, you can request a transfer using rucio. To do that, the first thing you need is to install your grid certificate in your browser. You only need to do that once (per certificate) and instructions are provided in the US-ATLAS webpages. Once your certificate is installed, access the webpage https://rucio-ui.cern.ch/, log in using your X.509 certificate, and select Request new rule in the Data Transfers (R2D2) menu.
You need to fill out 4 forms. On the first, you enter your dataset identifier. On the second, you need to enter the RSE, which should be NET2_LOCALGROUPDISK. On the third, you can specify the lifetime of the copy. Each user should have 15TB of space in NET2_LOCALGROUPDISK (see below).
Example R2D2 (Rucio Data Transfer) configuration
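If you prefer the command line, the same request can be made with the rucio client; a sketch, where the dataset name is a placeholder and the lifetime (in seconds, 30 days here) is just an example:
lsetup rucio
voms-proxy-init --voms atlas
rucio add-rule [scope]:[dataset] 1 NET2_LOCALGROUPDISK --lifetime 2592000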
Accessing files through xrootd
Once you know that a file is in a NET2 storage element (NET2_LOCALGROUPDISK, NET2_DATADISK, soon NESE...), you can access the files via xrootd as if they were local. To discover the paths of the files in a dataset, you can use the commands:
lsetup rucio
rucio list-file-replicas [dataset] --rse NET2_LOCALGROUPDISK
where NET2_LOCALGROUPDISK can be replaced by other storage elements. The paths will begin with davs://atlas-dtn4.bu.edu:1094/gpfs1/. These paths can be opened directly inside ROOT:
auto file = TFile::Open("davs://atlas-dtn4.bu.edu:1094/gpfs1/[filepath]");
They can also be given as inputs to any ATLAS software. Note that you need a grid certificate to access files via xrootd; to create a proxy from it, use the command voms-proxy-init --voms atlas.
Accessing CERN EOS
Files in CERN EOS can also be accessed via xrootd. It is not particularly fast, but may help if you don't want to move large amounts of files. There are basically two locations where files are stored in CERN EOS: /eos/user and /eos/atlas (note that some users may have a /eos/atlas/user area, and those fall in the latter category).
To access files in /eos/user, you can do:
auto file = TFile::Open("root://eosuser.cern.ch//eos/[filepath]");
and to access files in /eos/atlas, you can do:
auto file = TFile::Open("root://eosatlas.cern.ch//eos/[filepath]");
Note that, in both cases, you will need a valid grid proxy or Kerberos ticket.
Jupyter notebooks
Installation
Jupyter notebooks come with most of the ML setups described above, but you can also install Jupyter yourself by doing:
pip install ipython jupyter --user
You just need to do it once. You may choose to install it in a virtual environment, as discussed above.
Connect using ssh tunnel
You can access the notebook in your own browser by creating an ssh tunnel. In order to do that, open an ssh connection to one of the umX machines, start a screen, and do:
export ssh_port=`python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()'`
ssh -f -o ExitOnForwardFailure=yes ${HOSTNAME} -R ${ssh_port}:localhost:${ssh_port} sleep 30
echo ${ssh_port}
jupyter notebook --no-browser --port=${ssh_port} --ip 127.0.0.1
The echo command will tell you on which port the tunnel was created. The jupyter command will return an http address for the notebook. You can then go to your own computer and start your side of the tunnel:
ssh -N -f -L [port]:localhost:[port] -l [username] umX.net3.mghpcc.org
Then open the http address in your browser.
Connect using x2go
If you are using x2go, you can use the browser from umX. In this case, just start a jupyter notebook by doing jupyter notebook.
Using RDataFrame
The NET3 machines are servers with many CPUs (um1/2/3 have 80 each, um4/5 have 64 each). In order to fully benefit from the computing power of these machines, it is strongly recommended to use RDataFrame to read ntuples.
import ROOT

# This activates implicit multi-threading
ROOT.EnableImplicitMT()

# The analysis below runs in parallel
rdf = ROOT.RDataFrame("mytree", "myfile.root")
hx = rdf.Filter("x > 0").Histo1D("x")
hy = rdf.Filter("x > 0").Histo1D("y")
hz = rdf.Filter("x > 0").Histo1D("z")

c = ROOT.TCanvas("canvas", "", 1200, 800)
c.Divide(2,2)
c.cd(1)
hx.Draw()
c.cd(2)
hy.Draw()
c.cd(3)
hz.Draw()
c.Draw()
Note that the Draw commands are only issued after all histograms are defined (lazy evaluation). When the histograms are filled, they are filled in parallel, using all the machine's CPUs. More information can be found in the RDataFrame manual. This example should work in Jupyter notebooks (using JupyROOT), and when you do c.Draw() the canvas will be drawn in the notebook.
Batch system
The NET3 has a small batch system for analysis. For large jobs, you should use the grid. But for analysis of the final ntuples (making histograms, setting limits, ...), the batch system is the ideal tool. It is also very useful for producing very large gridpacks. The batch system uses PBS for job scheduling.
Submitting jobs manually
In order to submit jobs, you should make a bash script and use the command qsub -q tier3 [script.sh]. A typical script begins with:
#!/bin/bash

cd ${TMPDIR}
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
setupATLAS
Checking your jobs
You can check the status of your jobs by doing qstat -q tier3 -u [username].
Deleting jobs
Sometimes jobs get stuck.
IMPORTANT Please, check frequently if you have stuck jobs with the
qstat
command described above. If you have stuck jobs, delete them by using
qdel [jobid]
, where the
jobid
is the number on the first column of the
qstat
output.
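As a short sketch of this check-and-clean routine (the job ID is a placeholder taken from the qstat output):
qstat -q tier3 -u [username]    # the job ID is the number in the first column
qdel [jobid]                    # delete a stuck job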
Job configuration
Usually, you will want to have a different configuration for each worker job. This can be accomplished by submitting the jobs with the options qsub -t [start]-[end] -q tier3 ..., where [start] is the number of the first job and [end] is the number of the last job. Each job will have an environment variable called $SGE_TASK_ID containing its job number.
So, for instance, if you would like to define an input file per job, you could define:
# note: task IDs from "-t 1-3" start at 1, while bash arrays are 0-indexed
inputFiles=("file1.root" "file2.root" "file3.root")

analysis.py ${inputFiles[$((SGE_TASK_ID - 1))]}
Similar configurations can be achieved by defining more arrays.
Sending your environment
Often, you will need to send your environment to the batch jobs. This is the case if you set up special variables for ML tools or another similar ATLAS setup. To copy all of your environment to the worker nodes, submit your jobs with the option:
qsub -V -q tier3 ...
Submitting multithread jobs
Submitting a batch job with parallel processing/multithreading may overwhelm the tier3 nodes if the number of threads specified is greater than the number of processors per node. Instead, such jobs must be submitted with a specified "parallel environment", which will distribute the threads over several nodes. This is done with the -pe smp [n_slots] option, for example:
qsub -pe smp 40 -q tier3 ...
Setting a parallel environment will likely cause the user to create several jobs, so discretion should be used so as not to submit an excessive number of jobs in total.
If part of your work uses EventLoop, you can use the pre-defined EL::GEDriver class to submit batch jobs with no extra effort. See this code for example usage.
job.options()->setString(EL::Job::optSubmitFlags, "-V -q tier3");
Running Madgraph jobs
The umX machines are quite powerful and can be an asset for producing challenging Madgraph gridpacks. For processes that require a large integration time per helicity combination, your best bet is to use the following Cards/me5_configuration.txt option:
run_mode = 2
For processes that may have many thousands of helicity combinations (multileg processes), but where each one does not require much processing, you can use the cluster. The settings in Cards/me5_configuration.txt in this case are:
run_mode = 1
cluster_type=pbs
cluster_queue=tier3
cluster_local_path=/cvmfs/atlas.cern.ch/repo/sw/Generators/
Grid certificate
You may need your grid certificate in your batch jobs (to access xrootd, for instance). The easiest way to transfer a proxy to the batch jobs is to create one on the umX machine before submitting (with the usual voms-proxy-init command) and then add the following line to the beginning of your script:
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null umX:/tmp/x509up_u[uid] /tmp
where uid is your user ID on NET3, and umX is the machine where you created the proxy. If you don't know your user ID, you can discover it with the command id.
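For example, a quick sketch of locating the proxy file for your own account after running voms-proxy-init:
id -u                        # prints your numeric user ID
ls /tmp/x509up_u$(id -u)     # the proxy file created by voms-proxy-init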
This method also assumes you copied your private SSH key to your ~/.ssh/ folder in NET3 (with correct permissions). The correct permissions are 600 for id_rsa and 644 for id_rsa.pub.
Transferring files
The same method described above can be used to transfer files from your script back to the /scratch directory of one of the umX machines when your script is done. The last line of a script is usually:
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null outfile.root umX:/scratch/[username]/outfile_${SGE_TASK_ID}.root
Example of a practical batch system script
This example puts together all the items above and can serve as a skeleton for your analysis jobs. It is just a made-up script that uses Rafael's username as an example; you should change it accordingly.
#!/bin/bash

cd ${TMPDIR}
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS="source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
setupATLAS
asetup AnalysisBase 21.2.201

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null um1:/tmp/x509up_u1000 /tmp/x509up_u1000

inputFiles=("davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000070.pool.root.1" \
"davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000071.pool.root.1" \
"davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000072.pool.root.1" \
"davs://atlas-dtn4.bu.edu:1094/gpfs1/atlasdatadisk/rucio/mc20_13TeV/2c/9e/DAOD_PHYS.28687139._000073.pool.root.1")

# task IDs from "-t 1-4" start at 1, while bash arrays are 0-indexed
python analyze.py ${inputFiles[$((SGE_TASK_ID - 1))]}
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null outfile.root um1:/scratch/rcoelhol/analysis/outfile_${SGE_TASK_ID}.root
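Such a script could then be submitted as a job array covering the four input files, exporting your environment and targeting the tier3 queue (the script name is a placeholder):
qsub -t 1-4 -V -q tier3 run_analysis.sh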