HEP Distributed Analysis Exercise: Analysing "data" from the ATLAS experiment

This exercise is prepared for the NGIn Grid School at the NorduGrid 2007 conference.

1 - Introduction

High energy physics (HEP) experiments produce massive amounts of data. The ATLAS experiment at the LHC, which starts running in Summer 2008, will write out about 1 Gb/s, which then has to be analysed by physicists. The only effective way to do this is via grid computing, and a massive effort has been put into developing the tools that the HEP community needs, and will need even more when the LHC starts.

This exercise demonstrates how an ATLAS physicist can use NorduGrid resources to perform his or her personal analysis. We will work with real ATLAS "data", which at present means highly detailed simulations, perform a simple analysis using grid resources, and then use the grid to produce a physics plot from the output file. The tool for all of this is the GANGA package, a Python-based job submission tool that can use several grid and batch-queue flavours as backends.

1.1 - What is distributed analysis?

When the LHC starts taking data, the grids will be used to do basic reconstruction (finding tracks, identifying electrons etc.), and then to distribute and store the data at various sites.

What then?

This is when physicists want access to the data, using their own, home-grown analysis code. It is NOT possible for each physicist to copy all the data to his or her home institution, hence the physicist's analysis jobs need to go to the data. This is distributed analysis:

  • I, as a physicist, can write my own analysis code that I want to apply to the data.
  • If my code works on a local test data file, I should be able to confidently send it to "the grid" and ask "the grid" to apply it to the data for me.
  • Somewhere on "the grid", my job should meet the data and some available computing resource should be found.
  • I should not have to care (much) about the details of "the grid".
  • The answer, in the form of a file I can read, should come back to me.

In ATLAS, where we use three flavours of grid (LCG, OSG, NorduGrid), this list of demands can now be met. I still have to care a little bit about which grid I am submitting to, but not much. The goal of complete grid transparency is within reach by the time the LHC turns on.

In this exercise, we will send a distributed analysis job to NorduGrid, and do some processing of the result - also on the grid.

2 - Setting up: Ganga and NorduGrid

To do distributed analysis on NorduGrid, we will use a Python-based job submission tool called GANGA. It comes bundled with ARC 0.6, so all you need in addition is a grid certificate. In this section we download and set up GANGA, and look at some NorduGrid tools for monitoring ATLAS jobs.

2.1 - GANGA

2.1.1 - Downloading GANGA and ARC

To download and install GANGA and ARC 0.6, do the following:

(For the regular, updated GANGA installation instructions, see http://ganga.web.cern.ch/ganga/download/)

  1. Get the script ganga-install from the GANGA webpage.
  2. Make it executable: > chmod u+x ./ganga-install
  3. Run the following command:
./ganga-install --prefix=/where/you/want/it/installed --extern=GangaAtlas,GangaNG 4.4.1

That's it - this gives you both GANGA with the NorduGrid backend and the nordugrid-arc-standalone middleware. You can also add 'GangaGUI' to the --extern list if you want a GUI interface to GANGA, but we will not use this for today's exercise.

2.1.2 - If there is no AFS, apply a patch

This section only applies if there is no AFS on your machine. We recently found a problem with proxy handling on non-AFS, non-CERN machines, which then needs to be fixed by hand. GANGA 4.4.2 will have the fix included, but for now we need to patch 4.4.1. Instructions:

2.1.3 - Starting and setting up GANGA

Three more things are needed to get GANGA working properly:

  • On first startup, you will be prompted to have GANGA auto-create a .gangarc file. Let the nice program do this.
  • Edit the file ${HOME}/.gangarc, and under [Configuration] set
RUNTIME_PATH = GangaAtlas:GangaNG
  • To set up ARC, go to /where/you/installed/ganga/external/nordugrid-arc-standalone/0.6.0/slc3_gcc323/ and say
source ./setup.sh
grid-update-crls

Now restart ganga, and all should be well. Currently source ./setup.sh has to be run first in every shell where you want to start GANGA. This is a bug and will be fixed...
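
For reference, after these steps the [Configuration] section of ${HOME}/.gangarc should contain the line set above (everything else can stay at its defaults):
[Configuration]
RUNTIME_PATH = GangaAtlas:GangaNG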

2.1.4 - GANGA 101

For a full GANGA/ATLAS tutorial, geared towards CERN users, see GangaTutorial44 by Johannes Elmsheuser. From his tutorial (just refer back to this list once you get started with the jobs in section 3):

A few hints on how to work with the Ganga Command Line Interface in Python (CLIP), after you have started Ganga with: ganga

  • The repository for input/output files for every job is located by default at: $HOME/gangadir/workspace/Local/
  • All commands typed on the GANGA command line can also be saved in script files like mygangajob1.py, etc., and executed with: execfile('/home/Joe.User/mygangajob1.py') (see the sketch after this list)
  • The job repository can be viewed with: jobs
  • Subjobs of a specific job can be viewed with: subjobs jobid (e.g. jobid=42)
  • A running or queued job can be killed with: jobs(jobid).kill()
  • A completed job can be removed from the job repository with: jobs(jobid).remove()
  • The output directory of a finished job, as retrieved back to the job repository, can be viewed with: jobs(jobid).peek()
  • The stdout log file of a finished job can be viewed with: jobs(jobid).peek('stdout', 'cat')
  • Export a job configuration to a file with: export(jobs(jobid), '~/jobconf.py')
  • Load a job configuration from a file with: load('~/jobconf.py')
  • Change the logging level of Ganga to get more information during job submission etc. with: config["Logging"]['GangaAtlas'] = 'INFO' or config["Logging"]['GangaAtlas'] = 'DEBUG'
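
Putting a few of these together: below is a minimal sketch of such a script file (the file name and argument string are just examples), which you could run from the CLIP with execfile():
# mygangajob1.py - a minimal GANGA script (illustrative example)
j = Job()                      # new job, registered in the repository
j.application = Executable()   # generic 'run a program' application
j.application.exe = 'echo'     # the program to run
j.application.args = 'Hello from a script'
j.backend = Local()            # run locally; swap in NG() for NorduGrid
j.submit()                     # submit; follow progress with: jobs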

2.2 - NorduGrid and the NG monitor

Once you get jobs running in the next step, it may be interesting to follow the progress of the jobs as they move around NorduGrid. While GANGA gives you the status of each job, more detailed information can be found through the NorduGrid ATLAS monitor. Refer to this webpage once you have submitted a job.

3 - A simple job

We will start with a simple job, just to get a feel for what GANGA can do. It will run first on the local machine, and then on the grid.

3.1 - ...on the local machine

Type the following into GANGA:

# Create a trivial job: run 'echo' on the local machine
j = Job()
j.application=Executable()   # generic 'run a program' application
j.application.exe='echo'     # the program to run
j.application.args='Dear grid user. I, your humble servant node, have dutifully executed your job. Regards, ${HOSTNAME}'
j.backend=Local()            # execute on the local machine, not on the grid

Now list the contents of the job object by saying just j. Note that the job has containers for input and output data, splitting and merging, even though we don't use them for this example.
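
As an aside on the splitting container: GANGA's standard ArgSplitter turns one job into several subjobs, one per argument list. We will not use splitters in this exercise, so if you want to try this sketch, do it on a copy so the exercise job stays simple:
jtest = j.copy()               # keep the exercise job untouched
jtest.splitter = ArgSplitter(args=[['first'], ['second'], ['third']])
# on submission, jtest would expand into three subjobs, one per argument list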

You can also list all your jobs by saying jobs. Currently this will just contain your one job. You can also access it by saying jobs(0).

Now submit it by saying j.submit(). Hit enter again, and you will see that the job goes from status submitted to running to completed very quickly. As it should be, since this is just an echo process on your local machine... You can look at the output in one of two ways:

  1. Say j.peek(). This lists the contents of the job output directory. Then list the actual standard output of the job by saying j.peek('stdout','cat'), i.e. view the contents of the file 'stdout' using the program 'cat'.
  2. In another shell, go to the job output directory and look at the files directly. The output is under ${HOME}/gangadir/workspace/Local/<jobid>/output, where <jobid> is the job's number in the repository. (This can be changed in .gangarc)

3.2 - ...on the grid

If the local job worked, let's try to send it to a grid node instead. First copy the job object:

jg = j.copy()

Change the backend to the NorduGrid handler:

jg.backend=NG()

List the object again by saying jg, and have a look at the properties of the NG backend. Then send it by saying jg.submit(). After submission is completed, list the job object again and look at the backend. You will see that the job now has a jobid (e.g. gsiftp://ce01.titan.uio.no:2811/jobs/201041190713179823984758) which tells you which cluster it went to. If you're fast (or the grid is slow...) you can go to the NorduGrid ATLAS monitor, find the cluster and monitor the progress of your job there. Once the job has completed, look at it again with jg.peek(). Note that this time you have an output file called 'stdout.txt', and a whole directory called 'gmlog' with grid job information.
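
While a job is out on the grid, a few handy probes from the CLIP (backend.peek() is described further in section 5):
jg.status           # submitted / running / completed, as GANGA sees it
jg.backend          # list the backend fields, including the grid jobid
jg.backend.peek()   # live stdout of a running NG job (uses the ARC 'ngcat' command)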

4 - Grid job inputs - analysis code and a dataset

OK, now we can get ready to run a physics job. The two things required are analysis code to be executed, and a dataset to run the code on.

4.1 - Analysis code

In the ATLAS experiment, there is a standard software package called athena which is installed on all grid sites. Users (physicists) can configure simulation, reconstruction or analysis jobs using the modules contained in athena, and they can also write their own code in addition so long as it integrates with athena.
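
To give a flavour of what such a configuration looks like: athena jobs are steered by Python "job options" files. The fragment below is purely illustrative (it is not the SV_Production.py file used later, and the input file name is made up):
# illustrative athena job-options fragment
EventSelector.InputCollections = [ 'myAOD.pool.root' ]   # input data file(s)
theApp.EvtMax = 1000                                     # stop after 1000 events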

For this example, you can download an athena setup file and a tarball with some user code here:

Don't unpack the tar file; it will be sent to the grid as-is.

This code does the following:

  • reads a fully simulated ATLAS data file.
  • converts the default data structure, which is not very user-friendly, into a ROOT tree (sketched below). (ROOT is the standard analysis toolkit for high energy physics. We will use it in the last example.)
  • for each collision, calculates some variables used to search for supersymmetry. We will plot some of these in the last example.
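
To make the conversion step concrete, here is a conceptual PyROOT sketch of that kind of AOD-to-tree conversion. This is not the actual UserAnalysis code: the tree name, branch name and placeholder values are all invented for illustration.
import random
from array import array
import ROOT

f = ROOT.TFile('ntuple.root', 'RECREATE')
tree = ROOT.TTree('SusyVars', 'per-event SUSY search variables')
met = array('d', [0.0])                       # buffer holding one double per event
tree.Branch('MissingET', met, 'MissingET/D')  # hypothetical branch name

for event in range(1000):                     # the real code loops over AOD events
    met[0] = random.gauss(50.0, 20.0)         # placeholder; the real code computes this
    tree.Fill()

f.Write()
f.Close()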

4.2 - Datasets

The next part is to find a dataset to run on. DQ2 can tell us what datasets are available and where, but for NorduGrid there is a nice webpage that keeps track of what files we can get at:

NorduGrid AOD browser

From this list we must select a set that corresponds to what we want to look at. In a real data analysis, we would select e.g. the data from a given run number. Here, we select a simulation of a special kind of physics - a supersymmetry model.

For the next example, we will use trig1_misal1_csc11.005403.SU3_jimmy_susy.merge.AOD.v12000605, which, for the curious among you, is a simulation of the mSUGRA supersymmetry model with parameters m0=100, m1/2=300, A0=0, tan(beta)=10, sgn(mu)=+.
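
Once you have picked a dataset name from the browser, you can already inspect it from the GANGA prompt, using the same DQ2Dataset class that the job in section 5 uses. A minimal sketch:
d = DQ2Dataset()
d.dataset = 'trig1_misal1_csc11.005403.SU3_jimmy_susy.merge.AOD.v12000605'
d.list_contents()   # prints the files in the dataset (see also section 5)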

5 - A physics grid job

Unfortunately, access to ATLAS data is restricted to ATLAS users. For this tutorial, we provide you with access to a single file, and specify that one in the job below. If, for some reason, this doesn't work, there is another possibility described in appendix A below.

Time to send the job. Copy the following into GANGA; just remember to change the paths to SV_Production.py and UserAnalysis-00003.tar.gz to the right ones:

# Make the job object, name it
j2 = Job()
j2.name = 'NG07_ex2'

# Set the input data set, requesting only one file
j2.inputdata = DQ2Dataset()
j2.inputdata.dataset = 'trig1_misal1_csc11.005403.SU3_jimmy_susy.recon.AOD.v12000601'
j2.inputdata.names=['trig1_misal1_csc11.005403.SU3_jimmy_susy.recon.AOD.v12000601_tid006978._00462.pool.root.1']
j2.inputdata.number_of_files=1

# Set the application and some properties
j2.application = Athena()
j2.application.atlas_release = '12.0.7.1'
j2.application.max_events = 1000
j2.application.option_file = '/full/path/to/SV_Production.py'
j2.application.user_area = '/full/path/to/UserAnalysis-00003.tar.gz'

# Output data
j2.outputdata = ATLASOutputDataset()
j2.outputdata.outputdata = ['MuidTau1p3p_SUSYView.AAN.root']

# Backend parameters
j2.backend = NG()
j2.backend.check_availability = True
j2.backend.requirements.walltime=30
j2.backend.requirements.cputime=30
j2.backend.requirements.memory=500

Look at the job object and see how we have specified

  • the application, which in this case is a special class created for athena
  • the input data, which is a dataset registered in DQ2
  • output data, which is a named file
  • some backend parameters relating to the amount of resources we believe the job will require

Before sending the job, you can list the contents of the dataset by saying

j2.inputdata.list_contents()

Note that we have set j2.inputdata.number_of_files=1. This makes GANGA select just one (random) file out of all the ones in the set. If this were set to 0, the job would download all the files in the dataset and run over them sequentially.

OK, send the job with

j2.submit()

and monitor it using the NorduGrid monitor page. While it runs, feel free to experiment with GANGA by sending other kinds of jobs. Once the job goes into state RUNNING, it should take approx. 10 minutes to finish.

When the job is running, you can look at the live standard output by saying j2.backend.peek(). This uses the 'ngcat' command from the ARC middleware.

Once it finishes, look at the output directory with j2.peek(). You should see a file called MuidTau1p3p_SUSYView.AAN.root, which contains the output of the job. We will use this file in the next example, so copy it to the directory where you run GANGA. It is located under ${HOME}/gangadir/workspace/Local/<jobid>/output.
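
The copy itself is an ordinary shell command; something like the following, with <jobid> replaced by the number of j2 in the repository (the destination is just an example):
cp ${HOME}/gangadir/workspace/Local/<jobid>/output/MuidTau1p3p_SUSYView.AAN.root .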

6 - A grid-made physics plot

OK, you have a ROOT file with some output, but what you want is a physics plot that you can publish in a paper. We will use the grid for this as well, performing a simplistic final analysis using the ROOT package on a grid node. For this, you need to download the following ROOT macro:

GANGA has a special class for ROOT as well, so set up the following job (again, remember to change the /full/path/to for the input files):

# Make the job object, name it
j3 = Job()
j3.name = 'NG07_ex3'

# Set the application and some properties
j3.application = Root()
j3.application.script='/full/path/to/plotATLASPhysics.C'

# Input data
j3.inputdata = ATLASLocalDataset()
j3.inputdata.names = ['/full/path/to/MuidTau1p3p_SUSYView.AAN.root']

# Output data
j3.outputsandbox=['atlasphysics.ps']

# Backend parameters
j3.backend = NG()

Run the job, and wait for it to finish. Then look at the output by saying

j3.peek('atlasphysics.ps','gv')

(This assumes you have gv on your system; if not, substitute another postscript viewer.) Look at the nice plots and colors, and congratulate yourself on having performed a distributed physics analysis and gotten a final result.
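
If you are curious what such a final-analysis step looks like in code, here is an illustrative PyROOT sketch in the same spirit as the macro. This is not plotATLASPhysics.C: the tree and branch names are assumptions, so inspect the file (e.g. with a TBrowser) to find the real ones.
import ROOT

f = ROOT.TFile.Open('MuidTau1p3p_SUSYView.AAN.root')
tree = f.Get('CollectionTree')   # assumed tree name
c = ROOT.TCanvas('c', 'SUSY search variables')
tree.Draw('MissingET')           # assumed branch name; draws a 1D histogram
c.Print('atlasphysics.ps')       # write the plot, like the grid job does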

Appendix A - Hack to get access to an ATLAS file

Since the ATLAS data files are now only accessible to those with membership in the ATLAS VO, we need to hack a bit for a tutorial such as this one. Rather than using the grid analysis job in part 5 above, do the following. First, copy this data file to your computer (yes, this is a very non-grid thing to do...), and also an alternative athena setup file:

Then type the following into GANGA, changing the /full/path/to as before:

# Make the job object, name it
j2 = Job()
j2.name = 'NG07_ex2_hack'

# Set the application and some properties
j2.application = Executable()
j2.application.exe='athena.py'
j2.application.args='SV_Production2.py'

# Input data
j2.inputsandbox=['/full/path/to/SV_Production2.py','/full/path/to/trig1_misal1_csc11.005403.SU3_jimmy_susy.recon.AOD.v12000601_tid006978._00462.pool.root.1']  

# Output data
j2.outputsandbox=['MuidTau1p3p_SUSYView.AAN.root']

# Backend parameters
j2.backend = NG()
j2.backend.requirements.runtimeenvironment=['APPS/HEP/ATLAS-12.0.7.1']

Then run the job with j2.submit() and pretend you've run the job in example 5 above. Go there and follow the suggestions for monitoring and looking at the output.

Appendix B - Acknowledgement

Many thanks to the GANGA team and the Oslo group for help in putting all this together. Special thanks to David and Jon for the emergency test session needed to get the file access through...

-- BjornSamset - 26 Sep 2007
