Generating Monte Carlo using the Pleiades Cluster

While it is far faster to simply copy datasets to Pleiades if they can be found elsewhere, many needed samples either do not exist, or have not been reconstructed in the desired version. In these cases, the datasets must be generated locally. The Pleiades cluster can produce approximately 35k full simulation events per week.

This tutorial will go over the basics of how to run the various programs involved in generating Monte Carlo on the Pleiades cluster. Before actually generating Monte Carlo, there are two additional topics you should understand that are not covered in this tutorial: what generation parameters to use, and how to use scripts to submit large jobs to the batch system. If you can't figure these out, you should e-mail or talk with me and Verena.

A really useful companion to this guide is the ATLAS computing WorkBook, which describes many useful things, from how to set up your account to how to run the full chain simulation.

Overview of Data Structures in ATLAS

Before generating, it is worth understanding the structure of ATLAS data.

ATLAS data is broken into five basic types:

Type    Output Of Stage    Description                                                                            Approximate Time for One Event
GEN     Generation         Contains a list of the four vectors of all particles in the event                      ~ 0.05 sec
SIM     Simulation         Contains a record of the energy depositions of particles in the detector               ~ 12 min
RDO     Digitization       Turns these energy depositions into the output of the detector (what we really get)    ~ 15 sec
AOD     Reconstruction     Contains reconstructed objects ready for analysis                                      ~ 45 sec
NTUPL   Analysis           Output of your analysis, to be read in ROOT                                            ~ 0.5 sec
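Summing the per-event times in the table gives a feel for where the time goes in the full chain. A quick back-of-the-envelope sketch, using the approximate numbers above:

```python
# Per-event times from the table above, converted to seconds.
times = {"GEN": 0.05, "SIM": 12 * 60, "RDO": 15, "AOD": 45, "NTUPL": 0.5}

total = sum(times.values())          # seconds to fully process one event
sim_fraction = times["SIM"] / total  # fraction of time spent in simulation

print(total, sim_fraction)
```

Simulation dominates: over 90% of the time for a fully processed event is spent in the SIM step, which is why that stage gets special treatment later in this tutorial.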

A pretty picture of this is drawn at WorkBookFullChain

Generators

As mentioned before, Generators create GEN files, which are essentially a list of the four vectors of all the particles in the event. These programs do not describe the interaction of the particles with the detector, and so you need only supply the center of mass energy and the process you wish to generate. It is worth noting that generation is generally very fast compared to the other steps in full simulation, and it is possible to generate many thousands of events in very little time.

There are quite a few generators out there, each of which has its own strengths and weaknesses. We'll go over two common ones here: Pythia and Alpgen+Herwig

Pythia

Pythia is one of, if not the, most used generators for ATLAS Monte Carlo. It is relatively easy to use, fast, and does a good job on many types of processes. A general rule is that if you don't have a specific reason not to use Pythia, it's probably a good choice. Some problems with Pythia include the lack of NLO calculations (which is important for SM ttbar cross section calculations) and a poor implementation of hard gluon emission (which makes it unsuitable for W+jets events).

Running Pythia

To run Pythia, start by logging into Pleiades with

ssh <username>@login.pleiades.physics.harvard.edu

Get to an interactive shell by typing

qrsh -q short.q

Setup your environment by typing

source setup13.sh

(Note: if you do not have a setup script, you can copy mine with cp /clusterhome/bcsmith/setup13.sh . and change all instances of bcsmith to your username in the file.)

We like to keep all generation on the mass storage drive, so move to the following directory:

cd /pleiades/datasets/tutorial

and make a directory for yourself to work in:

mkdir <username>
cd <username>

As with all athena software, you will need a job options file to run pythia. There's an example one (that generates semileptonic Zprime to ttbar events) that you can copy:

cp ../generateZprime.py .

Take a look at the file with:

less generateZprime.py

and we'll go through it block by block

from AthenaCommon.AppMgr import ServiceMgr
ServiceMgr.MessageSvc.OutputLevel = INFO 

This is a fancy way of saying, only output stuff that is interesting. It asks the message service to ignore any message below the INFO level (this includes mostly debugging information).

from AthenaCommon.AlgSequence import AlgSequence
topAlg = AlgSequence("TopAlg")

from Pythia_i.Pythia_iConf import Pythia
topAlg += Pythia()

When you run athena, algorithms in the TopAlg sequence are run sequentially. This is just a fancy way of saying, add Pythia to the sequence of algorithms that will be run when I run athena.

Pythia = topAlg.Pythia

Pythia.PythiaCommand = [
                        "pysubs msel 0",
                        "pysubs msub 141 1",
                        "pydat2 pmas 32 1 1000.",
                        "pydat2 pmas 6 1 170.9",
                        "pydat2 pmas 24 1 80.425",
                        "pypars mstp 61 1",
                        "pypars mstp 71 1",
                        "pypars mstp 111 1",
                        "pypars mstp 81 1",
                        "pypars mstp 128 0",
                        "pypars mstp 44 3",
                        "pydat1 parj 90 20000", # Turn off FSR.
                        "pydat3 mdcy 15 1 0",
                        "pydat3 mdme 289 1 0",
                        "pydat3 mdme 290 1 0",
                        "pydat3 mdme 291 1 0",
                        "pydat3 mdme 292 1 0",
                        "pydat3 mdme 293 1 0",
                        "pydat3 mdme 294 1 1",
                        "pydat3 mdme 295 1 0",
                        "pydat3 mdme 296 1 0",
                        "pydat3 mdme 297 1 0",
                        "pydat3 mdme 298 1 0",
                        "pydat3 mdme 299 1 0",
                        "pydat3 mdme 300 1 0",
                        "pydat3 mdme 301 1 0",
                        "pydat3 mdme 302 1 0",
                        "pydat3 mdme 303 1 0",
                        "pydat3 mdme 304 1 0",
                        "pydat3 mdme 305 1 0",
                        "pydat3 mdme 306 1 0",
                        "pydat3 mdme 307 1 0",
                        "pydat3 mdme 308 1 0",
                        "pydat3 mdme 309 1 0",
                        "pydat3 mdme 310 1 0",
                        "pydat3 mdme 190 1 3",
                        "pydat3 mdme 191 1 3",
                        "pydat3 mdme 192 1 3",
                        "pydat3 mdme 193 1 0",
                        "pydat3 mdme 194 1 3",
                        "pydat3 mdme 195 1 3",
                        "pydat3 mdme 196 1 3",
                        "pydat3 mdme 197 1 0",
                        "pydat3 mdme 198 1 3",
                        "pydat3 mdme 199 1 3",
                        "pydat3 mdme 200 1 3",
                        "pydat3 mdme 201 1 0",
                        "pydat3 mdme 202 1 0",
                        "pydat3 mdme 203 1 0",
                        "pydat3 mdme 204 1 0",
                        "pydat3 mdme 205 1 0",
                        "pydat3 mdme 206 1 2",
                        "pydat3 mdme 207 1 2",
                        "pydat3 mdme 208 1 0",
                        "pydat3 mdme 209 1 0",
                        "pyinit pylisti 12",
                        "pyinit pylistf 1",
                        "pystat 1 3 4 5",
                        "pyinit dumpr 1 5" ]

This section configures Pythia so that it generates the events you are interested in. The first line says: grab the object named Pythia that I just added to the list of algorithms to run. The second part says: adjust its internal properties in the following way. The options to Pythia are well beyond the scope of this tutorial, but you can find them in the Pythia manual at http://home.thep.lu.se/~torbjorn/Pythia.html.

A good source of Pythia job options files is the set of official ATLAS production files, which can be found in the CVS repository:

http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Generators/DC3_joboptions/share/

and

http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Generators/EvgenJobOptions/share/

Continuing on with the file, we see:

from EvgenJobOptions.PythiaEvgenConfig import evgenConfig
evgenConfig.generators += [ "Pythia" ]

This block tells athena that you will be generating events with Pythia.

from TruthExamples.TruthExamplesConf import DumpMC
topAlg += DumpMC()

This adds the algorithm DumpMC from the package TruthExamples to the list of algorithms to run. This algorithm basically prints all the information from each generated event to the screen. It's useful when you're first starting, but you would comment it out for a large production run.

## NUMBER OF EVENTS
theApp.EvtMax = 5

This sets the number of events to generate

## RANDOM NUMBER SEEDS
from AthenaServices.AthenaServicesConf import AtRndmGenSvc
ServiceMgr += AtRndmGenSvc()
ServiceMgr.AtRndmGenSvc.Seeds = ["PYTHIA 4801483 989252096", "PYTHIA_INIT 831605 2359116"]

This sets the seeds to use for the random number generator. You should set this to something new for each file you generate, or you'll get duplicate events.
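When you move to batch production, a script can derive a fresh, reproducible seed pair for each job from its task index. The helper below is hypothetical (not part of Athena); the base numbers are simply the example seeds above:

```python
# Hypothetical helper: derive a unique, reproducible pair of seed strings
# for each batch job from its task index (e.g. the SGE array task ID),
# so that no two generated files share PYTHIA seeds.
def seeds_for_job(task_id, base1=4801483, base2=989252096):
    return ["PYTHIA %d %d" % (base1 + task_id, base2 + task_id),
            "PYTHIA_INIT %d %d" % (831605 + task_id, 2359116 + task_id)]

# In a job options file you might then write something like:
#   import os
#   ServiceMgr.AtRndmGenSvc.Seeds = seeds_for_job(int(os.environ["SGE_TASK_ID"]))
print(seeds_for_job(0))
```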

## OUTPUT FILE
from AthenaPoolCnvSvc.WriteAthenaPool import AthenaPoolOutputStream
Stream1 = AthenaPoolOutputStream( "Stream1" )
Stream1.OutputFile = "Zprime1000_semilep_tt_130030.GEN.pool.root"
Stream1.ItemList += [ 'EventInfo#*', 'McEventCollection#*' ]

This sets what to write to the file. To change the name of the file, you should change the line

Stream1.OutputFile = "Zprime1000_semilep_tt_130030.GEN.pool.root"

You probably shouldn't change the other lines in that block unless you know what you're doing.

To run it, simply type

athena generateZprime.py

and you should see all the events output to the screen as well as written to the output file!

Alpgen

Alpgen is a program for generating the emission of hard partons. While it is very specialized, and somewhat difficult to work with, it far outperforms Pythia for generating these types of events.

It is worth noting what Alpgen does, and what it does not do. Alpgen does the matrix element calculation for hard parton emission. A very simplified way to think about this is to imagine generating W+3 jet samples: alpgen will generate the W, and the three jets when they are sufficiently far from each other. It will not generate parton showers, perform hadronization, or generate jets that are close together. For that, there exists another program, Herwig, which will be introduced shortly. It is also important to realize that Alpgen creates text files, not GEN files. Again, you must use Herwig for that.

Setting up Alpgen

First, it's always good to start with a clean shell when doing these things, so log out with exit

and log back into blade01 with

ssh -X blade01

Because of how Alpgen works, it is best if you install a copy in your home directory. Do the following:

cd

mkdir alpgen

cd alpgen

cp /clusterhome/bcsmith/tutorial/v213.tgz .

tar zxvf v213.tgz

This will install alpgen in your ~/alpgen directory. To compile it, simply type:

make

You must also compile each process you will be working with. Today, we'll work with Wbb + 1 jet, so we should compile the wqqwork directory. To do this, type:

cd wqqwork

make gen

Now, we must get the current PDFs (parton distribution functions). Type:

./pdflnk

And now we are ready to go!

Running Alpgen

There are two steps to running alpgen. First, you must generate the weighted events. During this process, Alpgen generates a grid in phase space, evaluates the probability of those points, and then makes a list of all the events it picked, along with their weights. Next, you must turn these into unweighted events, by allowing Alpgen to sample from the weighted events according to their relative probabilities. The output of this process is a text file containing the four vectors of the particles in the hard scattering.
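The unweighting step can be illustrated with a toy sketch. This is not Alpgen's actual code, just the idea: each weighted event is kept with probability weight over maximum weight, so the surviving events follow the weight distribution and all carry unit weight.

```python
import random

# Toy illustration of unweighting (not Alpgen's actual algorithm):
# keep each weighted event with probability w / w_max, so the kept
# events are distributed according to the weights.
def unweight(weighted_events, rng):
    w_max = max(w for _, w in weighted_events)
    return [event for event, w in weighted_events if rng.random() < w / w_max]

rng = random.Random(42)
events = [("e0", 0.1), ("e1", 0.5), ("e2", 1.0), ("e3", 0.2), ("e4", 0.9)]
kept = unweight(events, rng)
```

Note that increasing the number of weighted input events does not proportionally increase the unweighted output: events in low-weight regions are mostly rejected, which is exactly the behavior mentioned below for the fifth input line.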

To generate weighted events, you need an input file. A sample input file comes with Alpgen, so let's use that. Take a look at it with:

less input

The contents of the file are:

1     ! imode
wbbj       ! label for files
0 ! start with: 0=new grid, 1=previous warmup grid, 2=previous generation grid
10000 2  ! Nevents/iteration,  N(warm-up iterations)
100000      ! Nevents generated after warm-up
*** The above 5 lines provide mandatory inputs for all processes
*** (Comment lines are introduced by the three asteriscs)
*** The lines below modify existing defaults for the hard process under study
*** For a complete list of accessible parameters and their values,
*** input 'print 1' (to display on the screen) or 'print 2' to write to file
ihvy 5
njets 1
ickkw 0
ptjmin  20
ptbmin  20
etajmax 1
etabmax 1
drjmin  0.7
drbmin  0.7

The first line specifies the mode to run in (1 indicates generate weighted events), the second the name of the files, and the third whether to use an old grid. The fourth line indicates how much warmup to do - in essence, the more warmup, the better your sampling grid, and the better your results. I typically do 10 iterations of 100k events, but there is no hard and fast number. The fifth line indicates how many points to sample. Because of how the unweighting process works, increasing this number won't necessarily increase the number of events you generate.

The next block sets the values of various parameters. In this case, it tells Alpgen to generate Wbb (ihvy = 5), + 1 jet (njets = 1), with various eta, delta R, and pt cuts. The general format is

 <variable name> value 

To find out what all the variables are and what their values do, take a look at prc.list, in your ~/alpgen directory.
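If you end up scanning over cuts or producing several related samples, it can be handy to write these input files from a script. A minimal sketch (the function name and layout are hypothetical; it just reproduces the five mandatory lines followed by <variable name> value overrides, in the format described above):

```python
# Hypothetical helper: build the text of an Alpgen input file.
# The first five lines are the mandatory inputs; everything after
# is a "<variable name> value" parameter override.
def alpgen_input(label, nev_per_iter, n_warmup, nev_final, params,
                 imode=1, grid_mode=0):
    lines = [
        "%d ! imode" % imode,
        "%s ! label for files" % label,
        "%d ! 0=new grid, 1=previous warmup grid, 2=previous generation grid" % grid_mode,
        "%d %d ! Nevents/iteration, N(warm-up iterations)" % (nev_per_iter, n_warmup),
        "%d ! Nevents generated after warm-up" % nev_final,
    ]
    lines += ["%s %s" % (name, value) for name, value in params]
    return "\n".join(lines) + "\n"

text = alpgen_input("wbbj", 10000, 2, 100000,
                    [("ihvy", 5), ("njets", 1), ("ptjmin", 20), ("drjmin", 0.7)])
```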

To run Alpgen, you simply feed the input file to wqqgen in the following way:

./wqqgen < input

This will generate a whole bunch of files, and make a lot of output which you can read later. To unweight the events, you need to create a very simple file that says:

2
wbbj

This tells alpgen to run in mode 2 (unweight events) and to unweight the files labeled wbbj. Call this file whatever you want - I've called it unweight. Feed it to alpgen with

./wqqgen < unweight

(If you've called it something else, replace unweight with whatever you've called it)

This will unweight the events, and display some information about the run

Of all the files you have generated, there are two that are most important: wbbj.unw, which is a list of the four vectors of the partons, and wbbj_unw.par, which is a list of the parameters used to generate the events. You will feed these two files to Herwig in order to generate a GEN file. But first, there is one last correction you must make. Open the file wbbj_unw.par, and right above the line ************** end parameters, add the following:

501 20
502 .7
503 0 

These are used during MLM matching of the parton showers. The first line specifies the pt cut, the second the delta R cut, and the third whether the sample is exclusive (0) or inclusive (1).
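If you produce many samples, this edit is easy to script. A hypothetical sketch that inserts the three parameter lines just above the end-of-parameters marker:

```python
# Hypothetical helper: insert the MLM-matching parameters (pt cut,
# delta R cut, exclusive/inclusive flag) just above the
# "end parameters" marker of an Alpgen *_unw.par file.
def add_mlm_params(par_text, pt_cut=20, dr_cut=0.7, inclusive=0):
    out = []
    for line in par_text.splitlines():
        if "end parameters" in line:
            out += ["501 %g" % pt_cut, "502 %g" % dr_cut, "503 %d" % inclusive]
        out.append(line)
    return "\n".join(out) + "\n"

sample = "some params\n************** end parameters\n"
patched = add_mlm_params(sample)
```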

Herwig

Once you have generated 4 vector files from Alpgen, Herwig is used to perform parton showering and hadronization. Essentially, whereas Alpgen specializes in hard parton emission, Herwig is good at soft parton emission. Running Herwig will produce a GEN file that you can run the full chain on.

Setting up your environment for Herwig

Unfortunately, Herwig doesn't run correctly in version 13, so we have to generate in version 12. To do this, go to your home directory and source the version 12 setup file with:

cd

source cmthome/setup.sh -tag=12.0.6,gcc323,opt

If you do not have a cmthome directory, you need to do the following:

mkdir cmthome

cd cmthome

cp /clusterhome/bcsmith/cmthome/requirement .

source /pleiades/software/athena-12.0.6/CMT/v1r19/mgr/setup.sh

cmt config

and then repeat the above (cd to your home directory and source the setup file)

Linking Alpgen Files

The first thing you must do to run Herwig is to link the alpgen files you just created. Go to your directory on the mass storage unit:

cd /pleiades/datasets/tutorial/<username> 

(If this doesn't work, you need to create it first, by typing

mkdir /pleiades/datasets/tutorial/<username> 
and then cd'ing as above)

Now, you need to create a link to the four vector file as well as the parameter file. Note that these two files must be named in the following manner, as their names are hardcoded in Herwig.

Link the four vector file:

 ln -sf ~/alpgen/wqqwork/wbbj.unw alpgen.unw_events 

Link the parameter file:

 ln -sf ~/alpgen/wqqwork/wbbj_unw.par inparmAlpGen.dat 

Running Herwig

Like with Pythia, all we need to run Herwig is a job options file. You can copy mine:

cp ../generateWbbj.py .

You can take a look at the file, but it's very similar to the Pythia job options file; the main difference is that it's written in the version 12 syntax instead of version 13.

To run it, simply type:

athena generateWbbj.py

and you'll get output similar to what we got from pythia.

Generating AODs

GEN files on their own are not very useful. They contain no information about how the particles interact with the detector, nor about reconstruction. They are just the "truth" of the event. To generate a file that you can do analysis on, you must run the full chain, which includes simulation, digitization, and reconstruction. This is a very time-consuming process.

Running the Full Chain

Running each step of the chain will be very similar to generating events. You will need a job options file that specifies exactly what you want to do. Luckily, though, you typically want to simulate, digitize, and reconstruct all your samples in the same way, so many of the problems in generation do not appear here. In particular, one job options file per step (simulation, digitization, and reconstruction), is probably enough for all your needs.

Simulation

Simulation is the process by which the particles generated in the generation stage interact with the detector. Put very briefly, each particle is stepped across the detector from the IP. At each step, energy is deposited into the detector according to the dE/dx of the particle in that material. The particle may also lose energy in other ways, such as bremsstrahlung. It should be noted that this is by far the slowest part of running the full chain, with a single event taking nearly 15 minutes.

Running Simulation

It is best to start the simulation process in a clean shell, so logout with

exit

and then get a new interactive shell with

qrsh -q short.q

You will want to simulate in version 13, so source the setup file

source setup13.sh

If you don't have a setup file, copy mine with cp /clusterhome/bcsmith/setup13.sh . and replace references to bcsmith with your username.

Change to the directory we've been working in with

cd /pleiades/datasets/tutorial/<username> 

and copy the simulation job options file with

cp ../simulateZprime.py .

If you didn't run pythia earlier on, you'll need a copy of the GEN file as well. You can take mine with

cp ../bcsmith/Zprime1000_semilep_tt_130030.GEN.pool.root .

Take a look at the simulation job options file with less simulateZprime.py

#--- Detector flags -------------------------------------------
from AthenaCommon.DetFlags import DetFlags
# - Select detectors 
DetFlags.ID_setOn()
DetFlags.Calo_setOn()
DetFlags.Muon_setOn()
DetFlags.simulate.Truth_setOn()

This turns on simulation of the various sub-detectors, as well as turns on truth propagation.

#--- Simulation flags -----------------------------------------
from G4AtlasApps.SimFlags import SimFlags
# Look into SimFlags.SimLayout for other possible values 
SimFlags.SimLayout='ATLAS-CSC-02-00-00' # specific value 

This sets the geometry database to use. Various geometries have various features, such as different types of distortions and misalignments. Unless you're doing something very specialized, it probably won't make a big difference. What does matter, however, is that you note this down, as you will be prompted for the geometry database for the later stages, and if they do not match, you will have big problems.

SimFlags.KinematicsMode='ReadGeneratedEvents'

This tells athena to read from your GEN file, rather than generate new events.

MessageSvc = Service( "MessageSvc" )
MessageSvc.OutputLevel = 4

As before, this sets the output level - i.e., what messages are printed to the screen.

## NUMBER OF EVENTS
from AthenaCommon.AthenaCommonFlags import athenaCommonFlags
athenaCommonFlags.EvtMax = 1

This sets the number of events. Note that we will only run one event, as it's incredibly slow!

## INPUT
athenaCommonFlags.PoolEvgenInput=['Zprime1000_semilep_tt_130030.GEN.pool.root']

This sets the input file (your GEN file)

 ## OUTPUT
athenaCommonFlags.PoolHitsOutput='Zprime1000_semilep_tt_130030.SIM.pool.root'

This sets the output file. The last lines call the actual simulation, and should not be touched.

To run simulation, simply type

athena simulateZprime.py

You will see a great deal of output related to the simulation, and after about 15 minutes, you'll have your one event!

Digitization

Digitization is the process of turning the energy depositions generated during simulation into the output of the various detector components. The output of the digitization stage is essentially the same as what we will read out from the real detector. Digitization is quite fast compared to simulation.

Running Digitization

Running digitization is quite simple. You need a digitization job options file, an AtlasDigitization file, and a SIM file. If you have not already done so, get your shell setup for version 13, and cd to our working directory with

cd
source setup13.sh
cd /pleiades/datasets/tutorial/<username>

You can get the two job options files with

cp ../digitizeZprime.py .

cp ../AtlasDigitization.py .

Take a look at digitizeZprime.py with less digitizeZprime.py and you'll see it's very simple! The first few lines set the geometry to use (which better be what was used during simulation) and the number of events to run on. Note that we only simulated one event, so it will stop after that. The next lines set the input and the output.

Most of the meat of the process is hidden away in AtlasDigitization.py, which you'll notice is very complicated. Luckily, unless you really know what you're doing, you don't have to open this file. Think of it this way: including this file runs digitization.

To run digitization, simply type:

athena digitizeZprime.py AtlasDigitization.py

and after a few seconds, you should have one digitized event!

Reconstruction

Reconstruction is the process of taking the output of the detector and turning it into physics objects, like jets and electrons. This will run on simulated data in exactly the same way as it will run on real data. You can expect it to take upwards of one minute to reconstruct a very complicated event.

Running Reconstruction

In order to perform reconstruction, you need a reconstruction job options file, a RecExCommon file, and a RDO file (the output of digitization).

As before, the first thing you need to do in order to perform reconstruction is to setup your environment. If you have not already done so, get your shell setup for version 13, and cd to our working directory with

cd
source setup13.sh
cd /pleiades/datasets/tutorial/<username>

Now, copy the needed job options files with

cp ../reconstructZprime.py .

cp ../RecExCommon_topOptions.py .

Take a look at reconstructZprime.py. Most everything should be familiar by now, except for the flags

doWriteESD = False
doWriteAOD = True

and

doCBNT = False
doHist = False

These flags allow you to output your data in other formats than AOD (like ESD and CBNT), or to output histograms related to reconstruction. For now we just want to make AODs.

You can also look at RecExCommon_topOptions.py. Like before, this is a complicated file that you shouldn't need to change if you just want to reconstruct events. All you need to do is include it when you run athena, and everything will be taken care of.

To reconstruct your event, type:

athena reconstructZprime.py RecExCommon_topOptions.py

After a minute or two, you should have an AOD file that you can do analysis on!

Producing many events

Once you feel comfortable with how Monte Carlo generation works, the next step is to prepare for larger-scale production. Doing this by hand would be a nightmare, so your first goal should be to either find scripts that do what you want, or to write scripts yourself. If you're generating things with Pythia or Alpgen+Herwig, I probably have scripts for you, so just ask. The next step is to figure out the batch system. We use SGE on Pleiades, which you can read about here: http://gridengine.sunsource.net Be sure to understand how array jobs work before trying to generate large datasets.

Once you understand the technical aspects, you need to pick how many events you want to produce, and how you want to break up the jobs. A good rule of thumb is that each core can produce 100 events a day, and you can submit to 70 cores at a time (only submit generation jobs to the long.q queue, or your jobs will be killed), assuming no one else is generating Monte Carlo. This gives a theoretical maximum of slightly under 50k events a week. However, you will never get this many out, because jobs will crash, etc. A more realistic number is 35k events per week.

Once you've picked the number of events, you need to decide how to split them up. You should always have at least as many jobs as there are cores, so if no one else is using Pleiades, you should break the task into at least 70 jobs. Beyond that, the choice is really yours. Using many jobs is nice because it allows for more efficient job scheduling, and less work is lost when a job crashes. However, there is overhead in initializing your athena run, so the more jobs you have, the more time you spend initializing things. Also, it's more of a pain to keep track of hundreds of files, check whether they crashed, etc. I have found a nice medium is to have 250 events per file.
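The arithmetic above can be sketched as a small planning helper. The function name is hypothetical; the defaults are just the rules of thumb from this section (250 events per file, 70 cores, 100 events per core per day):

```python
# Rule-of-thumb production planner for Pleiades, using the numbers above.
# Returns (number of jobs, events per job, days of wall time needed).
def plan_production(total_events, events_per_job=250, cores=70,
                    events_per_core_per_day=100):
    # At least one job per core, and enough jobs for the preferred chunk size.
    n_jobs = max(cores, -(-total_events // events_per_job))  # ceiling division
    per_job = -(-total_events // n_jobs)
    days = float(total_events) / (cores * events_per_core_per_day)
    return n_jobs, per_job, days

print(plan_production(35000))  # one realistic week of production
```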

-- VerenaMartinez - 01 May 2008
