How to submit Athena jobs to Panda
Introduction
This page describes how to submit user Athena jobs from LCG/OSG/NG to the ATLAS production system,
PanDA.
Getting started
1. Setup
First, make sure you have a grid certificate. See
Starting on the Grid. You should have usercert.pem and userkey.pem under ~/.globus.
$ ls ~/.globus/*
usercert.pem userkey.pem

All you need here is to have usercert.pem and userkey.pem under the ~/.globus directory.
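As a quick sanity check that the certificate and key are usable, you can try generating a VOMS proxy (a sketch, assuming the grid middleware, e.g. via lsetup, is available in your environment):
$ voms-proxy-init -voms atlas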
Then, set up Athena, because pathena works in the Athena runtime environment. You can use any ATLAS release or dev/bugfix nightly.
The workbook or
this page
may help. Here is an example for
AnalysisBase, 21.2.40;
$ export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
$ alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
$ setupATLAS
$ asetup AnalysisBase,21.2.40
Note that you need to set up the panda-client after sourcing Athena's setup.sh, since the latter changes the PATH environment variable.
example
Here is an example sequence for fresh users who use ~/myWork/test for the Athena working directory. If you use CERN AFS see
Setup for CERN AFS users instead.
# ssh myhost.somedomain.xyz
$ ls -l .globus
total 7
-rw------- 1 tmaeno zp 1442 Nov 10 14:17 usercert.pem
-rw------- 1 tmaeno zp 1207 Nov 10 14:18 userkey.pem
$ setupATLAS
$ asetup AnalysisBase,21.2.40,here
Using AnalysisBase/21.2.40 [cmake] with platform x86_64-slc6-gcc62-opt
at /cvmfs/atlas.cern.ch/repo/sw/software/21.2
Test area /afs/cern.ch/user/t/tmaeno/myWork/test
$ lsetup panda
************************************************************************
Requested: panda ...
Setting up panda 0.6.9 ...
>>>>>>>>>>>>>>>>>>>>>>>>> Information for user <<<<<<<<<<<<<<<<<<<<<<<<<
************************************************************************
$ pathena --version
Version: X.Y.Z
2. Submit
When you run Athena with
$ athena jobO_1.py jobO_2.py jobO_3.py
all you need is
$ pathena jobO_1.py jobO_2.py jobO_3.py [--inDS inputDataset] --outDS outputDataset
where
inputDataset is a dataset which contains input files, and
outputDataset is a dataset which will contain output files. For details about options, see
pathena.
example.1 : analysis job running on an official dataset
You may use
AMI
to find datasets which you are interested in.
Check out and customize sources and job options in any packages as you like. Make sure your job runs properly on a local PC before submitting it. If it succeeds, then go ahead. As a concrete example, we will use
AnalysisSkeleton_topOptions.py
(note that from 12.0.3 onward this is the name of the job options file).
get_files AnalysisSkeleton_topOptions.py
Edit your job options file and make sure that the input AOD name and path are well defined before testing. You need some AOD to test the above job options in your local area before submitting to the grid. If you do not have an AOD available, you may run the reconstruction for a few events using the following job options - create a file called
myTopOptions.py
and add these lines to it:
####################################################
DetDescrVersion = "Detector description version tag"
doHist = False
doCBNT = False
EvtMax = 15
PoolRDOInput =[ "Input RDO file path and name" ]
include ("Atlas.RecExCommon/Atlas.RecExCommon_topOptions.py")
######################################################
Make sure you have enough memory to run this job interactively. Then, run it:
athena.py -b myTopOptions.py
This should produce some ESD and AOD for 15 events. Once you have the AOD, you test your analysis code, for example:
athena -b AnalysisSkeleton_topOptions.py
Running the
AnalysisSkeleton_topOptions.py
should produce an
AthenaAwareNTuple file called
AnalysisSkeleton.aan.root
: you may open this file in ROOT and examine it. Now you have tested that your job options, in this particular case
AnalysisSkeleton_topOptions.py
, runs successfully on your local machine. You should also check that the output data, in this case an
AthenaAwareNTuple file, contains sensible results before submitting to the grid for much larger statistics.
If you have set up release 12, note that release 11 AOD is not automatically readable in 12. You should use an AOD dataset produced in 12, or do this exercise with the indicated dataset in 11.0.5 - change the username appropriately in the output dataset name:
$ pathena AnalysisSkeleton_topOptions.py --inDS trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601 --outDS user.KeteviA.Assamagan.trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AAN.v12000601
example.2 : re-submit the same analysis job on another dataset
pbook shows
libDS (library dataset) for each job. Normally, one job is composed of one buildJob and many runAthenas (see
details). buildJob produces libraries and libDS contains them. buildJob can be skipped by reusing the libraries, which reduces the total time of the job execution. If --libDS is given, runAthenas run with the libraries in libDS.
$ pathena TagSelection_AODanalysis_topOptions.py --inDS csc11.005406.SU8_jimmy_susy1.recon.AOD.v11000501 --outDS user... --libDS pandatest.368b45f5-b6dd-4046-a368-6cb50cd9ee5b
You could write a simple script, e.g.,
import os

# datasets to run over and the common output dataset
inDSs = ['dataset1','dataset2','dataset3']
outDS = "user.AhoBaka.test0"
# the first submission builds the libraries; later ones reuse them via --libDS LAST
comFirst = "pathena --outDS %s --inDS %s myJob.py"
comLater = "pathena --outDS %s --inDS %s --libDS LAST myJob.py"
for i, inDS in enumerate(inDSs):
    if i == 0:
        os.system(comFirst % (outDS, inDS))
    else:
        os.system(comLater % (outDS, inDS))
Note that "--libDS LAST" gives the last libDS which you have produced.
example.3 : How to run production transformations
pathena allows users to run official transformations with customized sources/packages. First, set up AtlasXYZRuntime, e.g.,
$ asetup AtlasProduction,17.0.4.5,here,setup
Next, if you locally run a trf like
$ Reco_trf.py inputAODFile=AOD.493610._000001.pool.root.1 outputNTUP_SUSYFile=my.NTUP.root
replace some parameters with
%XYZ
- Input
- → %IN
- Cavern Input
- → %CAVIN
- Minimumbias Input
- → %MININ
- Low pT Minimumbias Input
- → %LOMBIN
- High pT Minimumbias Input
- → %HIMBIN
- BeamHalo Input
- → %BHIN
- BeamGas Input
- → %BGIN
- Output
- → %OUT + suffix (e.g., %OUT.ESD.pool.root)
- MaxEvents
- → %MAXEVENTS
- SkipEvents
- → %SKIPEVENTS
- FirstEvent
- → %FIRSTEVENT
- DBRelease or CDRelease
- → %DB:DatasetName:FileName (e.g., %DB:ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz. %DB:LATEST if you use the latest DBR). Note that if your trf uses named parameters (e.g., DBRelease=DBRelease-5.1.1.tar.gz) you will need DBRelease=%DB:DatasetName:FileName (e.g., DBRelease=%DB:ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz)
- Random seed
- → %RNDM:basenumber (e.g., %RNDM:100, this will be incremented per sub-job)
and then submit jobs using the
--trf
option;
$ pathena --trf "Reco_trf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root" --inDS ... --outDS ...
When your job doesn't take an input (e.g., evgen), use the
--split
option to instantiate multiple sub-jobs if needed.
%SKIPEVENTS
may be needed if you use the
--nEventsPerJob
or
--nEventsPerFile
options of pathena.
Note that you need to explicitly specify maxEvents=XYZ or something
in
--trf
to set the number of events processed in each subjob, since
the value of
--nEventsPerJob
or
--nEventsPerFile
is used only for
job splitting but it is not used to configure subjobs on WN.
Also, pathena doesn't interpret the argument for the
--trf
option although it replaces
%XYZ
.
It is the user's responsibility to consistently specify pathena's options and the
--trf
argument.
If you want to add parameters to the transformation that are not listed above, just add them inside the quotes as you normally would on the command line. Pathena doesn't replace anything for you; it simply passes these parameters along to the transformation.
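For instance, a hedged sketch adding a couple of extra arguments inside the quotes (the extra parameter names here, maxEvents and autoConfiguration, are illustrative and must match what your transformation actually accepts):
$ pathena --trf "Reco_trf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root maxEvents=500 autoConfiguration=everything" --inDS ... --outDS ...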
example.4 : How to run multiple transformations in a single job
One can run multiple transformations in a single job by using the
--trf
option like
$ pathena --trf "trf1.py ...; trf2.py ...; trf3.py ..." ...
Here is an example to run simul+digi;
$ pathena --trf "AtlasG4_trf.py inputEvgenFile=%IN outputHitsFile=tmp.HITS.pool.root maxEvents=10 skipEvents=0 randomSeed=%RNDM geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00; Digi_trf.py inputHitsFile=tmp.HITS.pool.root outputRDOFile=%OUT.RDO.pool.root maxEvents=-1 skipEvents=0 geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00" --inDS ...
where AtlasG4_trf.py produces a HITS file (tmp.HITS.pool.root) which is used as an input by Digi_trf.py to produce RDO. In this case, only RDO is added to the output dataset since only RDO has the
%OUT
prefix (i.e.
%OUT.RDO.pool.root
).
If you want to have HITS and RDO in the output dataset the above will be
$ pathena --trf "AtlasG4_trf.py inputEvgenFile=%IN outputHitsFile=%OUT.HITS.pool.root maxEvents=10 skipEvents=0 randomSeed=%RNDM geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00; Digi_trf.py inputHitsFile=%OUT.HITS.pool.root outputRDOFile=%OUT.RDO.pool.root maxEvents=-1 skipEvents=0 geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00" --inDS ...
Note that both AtlasG4_trf.py and Digi_trf.py take
%OUT.RDO.pool.root
as a parameter. AtlasG4_trf.py uses it as an output filename while Digi_trf.py uses it as an input filename.
example.5 : How to run on a good run list
Before you submit jobs to the grid, it is recommended to read
GoodRunsListsTutorial.
First, you need to get an XML file from
Good Run List Generator
. The XML contains a list of runs and LBs of interest. Next, configure your jobO to use the XML for good run list selection. Now you can submit the job using the
--goodRunListXML
option. e.g.,
$ pathena myJobO.py --goodRunListXML MyLBCollection.xml --outDS user...
where
MyLBCollection.xml
is the file name of the XML. The list of runs/LBs is internally converted to a list of datasets by AMI, so you don't need to specify
--inDS
. The XML is sent to remote WNs and jobs will run on the datasets with the XML. There are additional options related to good run list:
- --goodRunListDataType
- the XML is converted to AOD datasets by default. If you require another type of datasets, you can specify the type using this option
- --goodRunListProdStep
- the XML is converted to merge datasets by default. If you require datasets with another production step, use this option
- --goodRunListDS
- a comma-separated list of pattern strings. Datasets, which are converted from the XML, will be used when their names match with one of the pattern strings. Either \ or "" is required when a wild-card is used. If this option is omitted jobs will run over all datasets
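For example, these options might be combined as follows (the data type, production step, and pattern are illustrative):
$ pathena myJobO.py --goodRunListXML MyLBCollection.xml --goodRunListDataType ESD --goodRunListProdStep recon --goodRunListDS "data10_7TeV*physics_Muons*" --outDS user...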
example.6 : How to read particular events for real data
pathena can take a run/event list as an input since 0.2.37.
First you need to prepare a list of runs/events of interest. You may get a list by analysing D3PD, browsing event display,
using ELSSI, and so on. A list looks like
$ cat rrr.txt
154514 21179
154514 29736
154558 448080
where each line contains a run number and an event number. Then, e.g.,
$ pathena AnalysisSkeleton_topOptions.py --eventPickEvtList rrr.txt --eventPickDataType AOD --eventPickStreamName physics_CosmicCaloEM --outDS user...
where events in the input file are internally converted to AOD
(eventPickDataType)
files with the physics_CosmicCaloEM
(eventPickStreamName)
stream.
For MC data,
--eventPickStreamName
needs to be '' or the option itself needs to be omitted since stream names are not defined.
Also, your jobO is dynamically re-configured to use event selection
on remote WNs, so generally you don't need to change your jobO.
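For MC, for example, the command might look like the following sketch (note that --eventPickStreamName is simply omitted):
$ pathena AnalysisSkeleton_topOptions.py --eventPickEvtList rrr.txt --eventPickDataType AOD --outDS user...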
In principle, you can run any arbitrary jobO. For example, if you run
esdtoesd.py (with
--eventPickDataType=ESD
) you will get a skimmed ESD. FYI, you can also skim ESD/AOD/RAW using acmd.py via prun (
see this page).
[Edit: with recent versions of esdtoesd.py (such as release 16.6.6 and beyond), you may need to use the flag --supStream GLOBAL in your pathena command line. See
this FAQ for more details]
3. Monitoring
One can monitor task status in
PandaMonitor
.
Task parameters are defined in
this page.
Job status changes as follows:
- defined : recorded in job database
- assigned : DDM is setting up datasets
- activated : waiting for request from worker node
- running : worker node running
- holding : waiting to add files to Rucio dataset
- finished/failed
4. Get results
Output datasets which were specified with --outDS contain output files. They are registered in DDM so that you can access them using the Rucio client.
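For example, a minimal sketch with the Rucio client (dataset names are placeholders):
$ lsetup rucio
$ rucio list-dids user.yourname:user.yourname.*
$ rucio download user.yourname:user.yourname.yourOutputDataset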
If you want to get the output dataset at your favorite destination automatically, see
this Tip.
5. Bookkeeping & Retry

A new bookkeeping tool is available on
this page. Job records can be retrieved with pbook. In a pbook session, autocomplete is bound to the TAB key, and the up-arrow key recalls previous commands.
$ pbook
>>> show()
===================
JobID : 1
time : 2006-04-25 22:48:04
inDS :
outDS : user.TadashiMaeno.123456.aho.evgen.pool.v11356
libDS : pandatest.60bfeef5-3998-44ed-8802-3e79c568ba8b
build : 163801
run : 163802
jobO : jobOptions.pythia.py
===================
JobID : 2
time : 2006-04-28 17:32:50
...
See status of JobID=3.
>>> show(3)
===================
JobID : 3
time : 2006-04-28 18:32:53
inDS : csc11.005056.PythiaPhotonJet2.recon.AOD.v11004107
outDS : user.TadashiMaeno.123456.aho.test2
libDS : pandatest.368b45f5-b6dd-4046-a368-6cb50cd9ee5b
build : 166095
run : 166096-166107
jobO : ../share/AnalysisSkeleton_jobOptions.py
----------------------
buildJob : succeeded
----------------------
runAthena :
total : 10
succeeded : 8
failed : 1
running : 1
unknown : 0
----------------------
Retry failed subjobs in JobID=5.
>>> retry(5)
Kill JobID=3.
>>> kill(3)
See help
>>> help()
Press Ctrl-D to exit
6. How to debug jobs when they fail
The error description shows why your job failed.
When transExitCode!=0, the job failed with an Athena problem and you may want to browse the log files.
You will find a directory PandaJob_* which contains various log files; e.g., there should be pilot_child.stdout for stdout and pilot_child.stderr for stderr. You may investigate the Athena output in pilot_child.stdout.
If you still have a problem, send a message to the distributed analysis help hypernews
.
Details
What's going to happen when a user submits a job
A user job is composed of sub-jobs, i.e., one buildJob and many runAthena's. buildJob receives source files from the user, compiles them and produces libraries. The libraries are stored in the storage element. The completion of buildJob triggers the runAthena's. Each runAthena retrieves the libraries and runs Athena. Output files are added to an output dataset. Then DDM moves the dataset to your local area.
- input dataset
- contains input files, such as pool, bytestream, and tag collection. Input datasets must be registered in DDM before submitting jobs. Official datasets have been registered already. When you want to use a private dataset, use rucio_add
.
- output dataset
- contains output files
pathena
pathena is a client tool to submit user-defined jobs to distributed analysis systems. It provides a consistent user interface to Athena users. It does the following:
- archive the user's working directory
- send the archive to Panda
- extract the job configuration from jobOs
- define job specifications automatically
- submit jobs
options
Try
--help
to see all options.
$ pathena --help
Some (not all) options are explained here.
- -v
- Verbose. useful for debugging
- --inDS
- name of input dataset
- --inDsTxt
- a text file which contains the list of datasets to run over. newlines are replaced by commas and the result is set to --inDS. lines starting with # are ignored
- --inOutDsJson
- a json file to specify input and output datasets for bulk submission. It contains a json dump of [{'inDS': comma-concatenated input dataset names, 'outDS': output dataset name}, ...]
- --mcData
- Create a symlink with linkName to .dat which is contained in input file
- --pfnList
- Name of file which contains a list of input PFNs. Those files can be un-registered in DDM
- --outDS
- name of output dataset. should be unique and follow the naming convention
- --outputPath
- Physical path of output directory relative to a root path
- --destSE
- Destination storage element
- --libDS
- name of a library dataset
- --athenaTag
- use different version of Athena on remote WN. By default the same version which you are locally using is set up on WN. e.g., --athenaTag=AtlasProduction,14.2.24.3
- --cmtConfig
- CMTCONFIG=i686-slc5-gcc43-opt is used on remote worker-node by default even if you use another CMTCONFIG locally. This option allows you to use another CMTCONFIG remotely. e.g., --cmtConfig x86_64-slc5-gcc43-opt. If you use --libDS together with this option, make sure that the libDS was compiled with the same CMTCONFIG, in order to avoid failures due to inconsistency in binary files
- --provenanceID
- provenanceID
- --useSiteGroup
- Use only site groups which have group numbers not higher than --siteGroup. Group 0: T1 or undefined, 1,2,3,4: alpha,bravo,charlie,delta which are defined based on site reliability
- --split
- the number of sub-jobs to which an analysis job is split
- --nFilesPerJob
- the number of files on which each sub-job runs
- --nEventsPerJob
- the number of events per job
- --nEventsPerFile
- the number of events per file
- --nFiles
- use a limited number of files in the input dataset
- --nGBPerJob
- instantiate one sub job per NGBPERJOB GB of input files. --nGBPerJob=MAX sets the size to the default maximum value
- --nGBPerMergeJob
- instantiate one merge job per NGBPERMERGEJOB GB of pre-merged files
- --nSkipFiles
- the number of files in the input dataset one wants to skip counting from the first file of the dataset
- --skipFilesUsedBy
- A comma-separated list of TaskIDs. Files used by those tasks are skipped when running a new task
- --useAMIAutoConf
- evaluates the inDS autoconf information from AMI. Boolean -- just set the option without arguments
- --noRandom
- Enter random seeds manually
- --memory
- Required memory size in MB. e.g., for 1GB --memory 1024
- --nCore
- The number of CPU cores. Note that the system distinguishes only nCore=1 and nCore>1. This means that even if you set nCore=2 jobs can go to sites with nCore=8 and your application must use the 8 cores there. The number of available cores is defined in an environment variable, $ATHENA_PROC_NUMBER, on WNs. Your application must check the env variable when starting up to dynamically change the number of cores
- --nThreads
- The number of threads for AthenaMT. If this option is set to larger than 1, Athena is executed with --threads=$ATHENA_PROC_NUMBER at sites which have nCore>1. This means that even if you set nThreads=2 jobs can go to sites with nCore=8 and your application will use the 8 cores there
- --forceStaged
- Force files from primary DS to be staged to local disk, even if direct-access is possible
- --useShortLivedReplicas
- Use replicas even if they have a very short lifetime
- --useDirectIOSites
- Use only sites which use directIO to read input files
- --forceDirectIO
- Use directIO if directIO is available at the site
- --respectSplitRule
- force scout jobs to follow split rules like nGBPerJob
- --maxCpuCount
- Required CPU count in seconds. Mainly to extend time limit for looping job detection
- --official
- Produce official dataset
- --unlimitNumOutputs
- Remove the limit on the number of outputs. Note that having too many outputs per job causes a severe load on the system. You may be banned if you carelessly use this option
- --descriptionInLFN
- LFN is user.nickname.jobsetID.something (e.g. user.harumaki.12345.AOD._00001.pool) by default. This option allows users to put a description string into LFN. i.e., user.nickname.jobsetID.description.something
- --noEmail
- Suppress email notification
- --update
- Update panda-client to the latest version
- --noBuild
- skip buildJob
- --noCompile
- Just upload a tarball in the build step to avoid the tighter size limit imposed by --noBuild. The tarball contains binaries compiled on your local computer, so that compilation is skipped in the build step on remote WN
- --disableRebrokerage
- disable auto-rebrokerage
- --maxAttempt
- Maximum number of reattempts for each job (3 by default and not larger than 50)
- --extFile
- pathena exports files with some special extensions (.C, .dat, .py, .xml) in the current directory. If you want to add other files, specify their names, e.g.,
--extFile data1.root,data2.tre
- --excludeFile
- specify a comma-separated string to exclude files and/or directories when gathering files in local working area. Either \ or "" is required when a wildcard is used. e.g., doc,\*.C
- --noSubmit
- don't submit jobs
- --noOutput
- Send job even if there is no output file
- --allowTaskDuplication
- As a general rule each task has a unique outDS and history of file usage is recorded per task. This option allows multiple tasks to contribute to the same outDS. Typically useful to submit a new task with the outDS which was used by another broken task. Use this option very carefully at your own risk, since file duplication happens when the second task runs on the same input which the first task successfully processed
- --bulkSubmission
- Bulk submit tasks. When this option is used, --inOutDsJson is required while --inDS and --outDS are ignored. It is possible to use %DATASET_IN and %DATASET_OUT in --trf which are replaced with actual dataset names when tasks are submitted, and %BULKSEQNUMBER which is replaced with a sequential number of tasks in the bulk submission
- --tmpDir
- temporary directory in which an archive file is created
- --site
- send jobs to a specified site; if it is AUTO, jobs will go to the site which holds the largest number of input files. See the list of sites for analysis jobs
and choose a site from the list of Queue names.
- --excludedSite
- list of sites which are not used for site selection. See the list of sites for analysis jobs
and choose a site from the list of Queue names.
- --generalInput
- Read input files with general format except POOL,ROOT,ByteStream
- -c --command
- one-liner, runs before any jobOs
- --extOutFile
- define extra output files, e.g., output1.txt,output2.dat
- --fileList
- List of files in the input dataset to be run
- --addPoolFC
- file names to be inserted into PoolFileCatalog.xml except input files. e.g., MyCalib1.root,MyGeom2.root
- --skipScan
- Skip LRC/LFC lookup
- --containerImage
- Name of a container image
- -s
- show printout of included files
- --inputFileList
- name of file which contains a list of files to be run in the input dataset
- --removeFileList
- name of file which contains a list of files to be removed from the input dataset
- --removedDS
- don't use datasets in the input dataset container
- --corCheck
- Enable a checker to skip corrupted files
- --inputType
- File type in input dataset which contains multiple file types
- --shipInput
- Ship input files to remote WNs
- --noLock
- Don't create a lock for local database access
- --disableAutoRetry
- disable automatic job retry on the server side
- -p
- location of bootstrap file
- --myproxy
- Name of the myproxy server
- --voms
- generate proxy with particular roles. e.g., atlas:/atlas/ca/Role=production,atlas:/atlas/fr/Role=pilot
- --spaceToken
- spacetoken for outputs. e.g., ATLASLOCALGROUPDISK
- --outTarBall
- Save a copy of local files which are used as input to the build
- --inTarBall
- Use a saved copy of local files as input to build
- --outRunConfig
- Save extracted config information to a local file
- --inRunConfig
- Use a saved copy of config information to skip config extraction
- --useNewCode
- When tasks are resubmitted with the same outDS, the original source code is used to re-run on failed/unprocessed files. This option uploads new source code so that jobs will run with new binaries
- --useRucio
- Use Rucio as DDM backend
- --loadJson
- Read command-line parameters from a json file which contains a dict of {parameter: value}. Arguments for Athena can be specified as {'athena_args': [...]}
- --dumpJson
- Dump all command-line parameters and submission result such as returnCode, returnOut, jediTaskID, and bulkSeqNumber if --bulkSubmission is used, to a json file
- --minDS
- Dataset name for minimum bias stream
- --nMin
- Number of minimum bias files per one signal file
- --lowMinDS
- Dataset name for low pT minimum bias stream
- --nLowMin
- Number of low pT minimum bias files per job
- --highMinDS
- Dataset name for high pT minimum bias stream
- --nHighMin
- Number of high pT minimum bias files per job
- --randomMin
- randomize files in minimum bias dataset
- --cavDS
- Dataset name for cavern stream
- --nCav
- Number of cavern files per one signal file
- --randomCav
- randomize files in cavern dataset
- --goodRunListXML
- Good Run List XML which will be converted to datasets by AMI
- --goodRunListDataType
- specify data type when converting Good Run List XML to datasets, e.g., AOD (default)
- --goodRunListProdStep
- specify production step when converting Good Run List to datasets, e.g., merge (default)
- --goodRunListDS
- a comma-separated list of pattern strings. Datasets which are converted from Good Run List XML will be used when they match with one of the pattern strings. Either \ or "" is required when a wild-card is used. If this option is omitted all datasets will be used
- --eventPickEvtList
- a file name which contains a list of runs/events for event picking
- --eventPickDataType
- type of data for event picking. one of AOD,ESD,RAW
- --ei_api
- flag to signal MC in event picking
- --eventPickStreamName
- stream name for event picking. e.g., physics_CosmicCaloEM
- --eventPickDS
- a comma-separated list of pattern strings. Datasets which are converted from the run/event list will be used when they match with one of the pattern strings. Either \ or "" is required when a wild-card is used. e.g., data\*
- --eventPickStagedDS
- --eventPick options create a temporary dataset to stage-in interesting files when those files are available only on TAPE, and then a stage-in request is automatically sent to DaTRI. Once DaTRI transfers the dataset to DISK you can use the dataset as an input using this option
- --eventPickAmiTag
- AMI tag used to match TAG collections names. This option is required when you are interested in older data than the latest one. Either \ or "" is required when a wild-card is used. e.g., f2\*
- --eventPickNumSites
- The event picking service makes a temporary dataset container to stage files to DISK. The consistent datasets are distributed to N sites (N=1 by default)
- --eventPickSkipDaTRI
- Skip sending a staging request to DaTRI for event picking
- --eventPickWithGUID
- Using GUIDs together with run and event numbers in eventPickEvtList to skip event lookup
- --trf
- run transformation, e.g. --trf "csc_atlfast_trf.py %IN %OUT.AOD.root %OUT.ntuple.root -1 0"
- --useNewTRF
- Use the original filename with the attempt number for input in --trf when there is only one input, which follows the globbing scheme of new transformation framework
- --useOldTRF
- Remove the attempt number from the original filename for input in --trf when there is only one input
- --useTagInTRF
- Set this option if you use TAG in --trf. If you run normal jobO this option is not required
- --sameSecRetry
- Use the same secondary input files when jobs are retried
- --tagStreamRef
- specify StreamRef of parent files when you use TAG in --trf. It must be one of StreamRAW,StreamESD,StreamAOD. E.g., if you want to read RAW files via TAGs, use --tagStreamRef=StreamRAW. If you run normal jobO, this option is ignored and RefName in your jobO is used
- --tagQuery
- specify Query for TAG preselection when you use TAG in --trf. If you run normal jobO, this option is ignored and EventSelector.Query in your jobO is used
- --express
- Send the job using express quota to have higher priority. The number of express subjobs in the queue and the total execution time used by express subjobs are limited (a few subjobs and several hours per day, respectively). This option is intended to be used for quick tests before bulk submission. Note that buildXYZ is not included in quota calculation. If this option is used when quota has already exceeded, the panda server will ignore the option so that subjobs have normal priorities. Also, if you submit 1 buildXYZ and N runXYZ subjobs when you only have quota of M (M < N), only the first M runXYZ subjobs will have higher priorities
- --debugMode
- Send the job with the debug mode on. If this option is specified the subjob will send stdout to the panda monitor every 5 min. The number of debug subjobs per user is limited. When this option is used and the quota has already exceeded, the panda server suppresses the option so that subjobs will run without the debug mode. If you submit multiple subjobs in a single job, only the first subjob will set the debug mode on. Note that you can turn the debug mode on/off by using pbook after jobs are submitted
- --useContElementBoundary
- Split job in such a way that sub jobs do not mix files of different datasets in the input container. See --useNthFieldForLFN too
- --addNthFieldOfInDSToLFN
- A middle name is added to LFNs of output files when they are produced from one dataset in the input container or input dataset list. The middle name is extracted from the dataset name. E.g., if --addNthFieldOfInDSToLFN=2 and the dataset name is data10_7TeV.00160387.physics_Muon..., 00160387 is extracted and LFN is something like user.hoge.TASKID.00160387.blah. Concatenate multiple field numbers with commas if necessary, e.g., --addNthFieldOfInDSToLFN=2,6.
- --addNthFieldOfInFileToLFN
- A middle name is added to LFNs of output files similarly as --addNthFieldOfInDSToLFN, but strings are extracted from input file names
- --buildInLastChunk
- Produce lib.tgz in the last chunk when jobs are split to multiple chunks due to the limit on the number of files in each chunk or due to --useContElementBoundary/--loadXML
- --useAMIEventLevelSplit
- retrieve the number of events per file from AMI to split the job using --nEventsPerJob
- --appendStrToExtStream
- append the first part of filenames to extra stream names for --individualOutDS. E.g., if this option is used together with --individualOutDS, %OUT.AOD.pool.root will be contained in an EXT0_AOD dataset instead of an EXT0 dataset
- --mergeOutput
- merge output files
- --mergeScript
- Specify user-defined script or execution string for output merging
- --useCommonHalo
- use an integrated DS for beamHalo
- --beamHaloDS
- Dataset name for beam halo
- --beamHaloADS
- Dataset name for beam halo A-side
- --beamHaloCDS
- Dataset name for beam halo C-side
- --nBeamHalo
- Number of beam halo files per sub job
- --nBeamHaloA
- Number of beam halo files for A-side per sub job
- --nBeamHaloC
- Number of beam halo files for C-side per sub job
- --useCommonGas
- use an integrated DS for beamGas
- --beamGasDS
- Dataset name for beam gas
- --beamGasHDS
- Dataset name for beam gas Hydrogen
- --beamGasCDS
- Dataset name for beam gas Carbon
- --beamGasODS
- Dataset name for beam gas Oxygen
- --nBeamGas
- Number of beam gas files per sub job
- --nBeamGasH
- Number of beam gas files for Hydrogen per sub job
- --nBeamGasC
- Number of beam gas files for Carbon per sub job
- --nBeamGasO
- Number of beam gas files for Oxygen per sub job
- --useNextEvent
- Use this option if your jobO uses theApp.nextEvent() e.g. for G4 simulation jobs. Note that this option is not required when you run transformations using --trf.
- --notSkipMissing
- If input files are not read from SE, they will be skipped by default. This option disables the functionality
- --pfnList
- Name of file which contains a list of input PFNs. Those files can be un-registered in DDM
- --individualOutDS
- Create individual output dataset for each data-type. By default, all output files are added to one output dataset
- --transferredDS
- Specify a comma-separated list of patterns so that only datasets which match the given patterns are transferred when --destSE is set. Either \ or "" is required when a wildcard is used. If omitted, all datasets are transferred
- --dbRelease
- use non-default DBRelease or CDRelease (DatasetName:FileName). e.g., ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz. To run with no dbRelease (e.g. for event generation) provide an empty string ('') as the release.
- --dbRunNumber
- RunNumber for DBRelease or CDRelease. If this option is used some redundant files are removed to save disk usage when unpacking DBRelease tarball. e.g., 0091890. This option is deprecated and to be used with 2008 data only.
- --supStream
- suppress some output streams. e.g., ESD,TAG
- --gluePackages
- list of glue packages which pathena cannot find due to empty i686-slc4-gcc34-opt. e.g., External/AtlasHepMC,External/Lhapdf
- --allowNoOutput
- A comma-separated list of regexp patterns. Output files are allowed not to be produced if their filenames match with one of regexp patterns. Jobs go to finished even if they are not produced on WN
- --ara
- obsolete. Please use prun
- --ares
- obsolete. Please use prun
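To put a few of the options above together, here is a hedged sketch of a typical submission (dataset names and option values are placeholders, not recommendations):
$ pathena AnalysisSkeleton_topOptions.py --inDS <input dataset or container> --outDS user.yourname.mytest.v1 --nFilesPerJob 10 --nFiles 20 --express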
pathena analysis queues
Information about pathena analysis queues is provided on the
PathenaAnalysisQueues page.
Job priority
Job priorities are calculated for each user by using the following formula. When a user submits a job which is composed of M subJobs,
Priority(n) = 1000 - floor((T + n) / 5)
where
- Priority(n) … Priority for the n-th subJob (0≤n<M)
- T … The total number of the user's subJobs already existing in the whole queue (existing = job status is one of defined, assigned, activated, sent, starting, running)
For example, if a fresh user submits a job composed of 100 subJobs, the first 5 subJobs have Priority=1000 while the last 5 subJobs have Priority=981. The idea of this gradual decrease is to prevent large jobs (large = composed of many subJobs) from occupying the whole CPU slots. When another fresh user submits a job with 10 subJobs, these subJobs have Priority=1000,999 so that they will be executed as soon as CPU becomes available even if other users have already queued many subJobs. Priorities for waiting jobs in the queue are recalculated every 20 minutes. Even if some subjobs have very low priorities at the submission time their priorities are increased periodically so that they are executed before they expire.
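A minimal Python sketch of this gradual decrease, assuming the formula above (the constants 1000 and 5 are taken from the examples in this section):
def priority(n, T):
    # priority of the n-th subJob when the user already has T subJobs in the queue
    return 1000 - (T + n) // 5

print(priority(0, 0), priority(99, 0))  # fresh user, 100 subJobs -> 1000 981
print(priority(0, 0), priority(9, 0))   # fresh user, 10 subJobs  -> 1000 999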
If the user submits jobs with the
--voms
option to produce group datasets (see
this section), those jobs are regarded as group jobs.
Priorities are calculated per group, so group jobs don't reduce priorities of normal jobs which are submitted by the same user
without the
--voms
option.
There are a few kinds of jobs which have higher priorities, such as merge jobs (5000) and ganga-robot jobs (4000), since they have to be processed quickly.
How Panda decides on a submission site
The brokerage chooses one of sites using
- input dataset locations
- the number of jobs in activated/defined/running state (site occupancy rate)
- the average number of CPUs per worker node at each site
- the number of active or available worker nodes
- pilot rate for last 3 hours. If no pilots, the site is skipped
- available disk space in SE
- Atlas release/cache matching
- site status
The weight is defined using the following quantities:
- G … The number of available worker nodes which have sent getJob requests for last 3 hours
- U … The number of active worker nodes which have sent updateJob requests for last 3 hours
- R … The maximum number of running jobs in last 24 hours
- D … The number of defined jobs
- A … The number of activated or starting jobs
- T … The number of assigned jobs which are transferring input files to the site
- X … Weight factor based on data availability. When input file transfer is disabled, X=1 if input data is locally available, otherwise X=0. When input file transfer is enabled, X=1+(total size of input files on DISK)/10GB if files are available on DISK, X=1+(total size of input files on TAPE)/1000GB if files are available on TAPE, X=1 otherwise
- P … Preferential weight to use beyond-pledge (
schedconfig.availableCPU/pledgeCPU
) if countryGroup is matched between the user and the site
You can see in the
PandaLogger
how the brokerage works.
In general, the brokerage automatically sends jobs to a site where the input dataset is available and CPU/SW resources are optimal. However, the brokerage is skipped when the
--site
option is used.
It is better to avoid this option unless you really need it, since jobs might be sent to a busy site.
If you want to exclude some sites explicitly, you can specify them as a comma separated list using
--excludedSite
. E.g.,
--excludedSite=ABC,XYZ
. In this case, the brokerage ignores all sites which contain ABC or XYZ in their siteIDs. See
the list of siteIDs (Queue names)
Rebrokerage
Jobs are internally reassigned to another site at most 3 times, when
- they have been waiting for 24 hours, or
- HammerCloud set the site to test or offline mode 3 hours ago
The algorithm for site selection is the same as normal brokerage described in the above section. Old jobs are closed.
When a new site is not found, jobs will stay at the original site.
The latest DBRelease
The latest DBRelease is defined as follows:
- The dataset name starts with
ddo.
and has a 4-digit version number
- It contains only one file and the filename starts with
DBRelease
, which excludes reprocessing DBR
- More than 40 sites, and more than 90% of online sites, have the replica
- All online T1s have the replica
- It has the highest version number in all candidates which meet the above requirements
FAQ
Contact
We have one egroup and one JIRA. Please submit all your help requests to
hn-atlas-dist-analysis-help@cern.ch
which is maintained by
AtlasDAST.
Panda JIRA
is for bug reporting. Please choose the appropriate forum according to your purpose, which allows us to provide better responses. For general info see
Panda info and help
. For TAG and Event Picking issues
atlas-event-metadata@cern.ch
will help.
Why did my jobs crash with "sh: line 1: XYZ Killed"?
sh: line 1: 13955 Killed athena.py -s ...
If you see something like the above message in the log file, perhaps your jobs were killed by the batch system due to huge memory consumption. pathena instantiates one sub-job per 20 AODs or per 10 EVNT/HITS/RDO/ESDs by default. You can reduce it by using --nFilesPerJob.
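For example, to have each sub-job process at most 5 input files (an illustrative value):
$ pathena myJobO.py --inDS ... --outDS ... --nFilesPerJob 5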
pathena failed with "No files available at ['XYZ']"
The input files don't exist at the remote site. Submit a
DDM subscription request
for the files to be replicated. You need to register first; click on "for registration go here" to get to the registration page.
How can I send jobs to the site which holds the largest number of input files?
Don't use
--site
, so that jobs are automatically sent to proper sites.
Why is my job pending in activated/defined state?
The Panda server manages all job information, and worker nodes actually run the jobs. When a worker node is free,
- the worker sends a request to the server
- the server sends a job back to the worker
- the worker runs the job
- the job status is changed to 'running'
The Panda system uses a pull mechanism here: each worker initiates all sessions between the server and itself. If your job is pending, this means all workers are busy or the back-end batch system is down. If your job is stuck for 1 day or more, ask
Savannah
.
How do I kill jobs?
$ pbook
>>> kill(JobID)
Use proper JobID. e.g.,
kill(5)
. See
Bookkeeping.
My job got split into N sub-jobs, and only M sub-jobs failed. Is it possible to retry only the M sub-jobs?
$ pbook
>>> retry(JobID)
which retries failed sub-jobs in the job with JobID. Use proper JobID. e.g.,
retry(5)
. See
Bookkeeping.
Why were my jobs killed by the Panda server? : what does 'upstream job failed' mean?
An analysis job is composed of one 'build' job and many 'run Athena' jobs (see
What happens in Panda). If the build job fails, downstream runAthena jobs will get killed.
Are there any small samples to test my job before running it on the whole dataset?
If '--nFiles N' is given, your job will run on only N files in the dataset.
$ pathena ... --inDS csc11.005056.PythiaPhotonJet2.recon.AOD.v11004107 --nFiles 2
What is the meaning of the 'lost heartbeat' error?
A worker node running a job sends heartbeat messages to the panda server every 30 min. That indicates the worker node (pilot process) is alive. If the panda server doesn't receive any heartbeats for 6 hours, the job gets killed.
jobDispatcherErrorCode 100
jobDispatcherErrorDiag lost heartbeat : 2006-05-30 22:32:00
This means the last heartbeat was received at 22:32:00 and then the job was killed after 6 hours. The error happens mainly when the pilot died due to temporary troubles in the backend batch system or network. Simply retrying would usually succeed.
What is the meaning of the 'Looping job killed by pilot' error?
If a job doesn't update output files for 2 hours, it will be killed. This protection is intended to kill dead-locked jobs or infinite-looping jobs.
If your job doesn't update output files very frequently (e.g., some heavy-ion job takes several hours to process one event) you can relax
this limit by using the
--maxCpuCount
option.
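For example, a sketch that requests a longer CPU limit (the value is illustrative; --maxCpuCount is given in seconds):
$ pathena myJobO.py --inDS ... --outDS ... --maxCpuCount 86400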
However, sometimes even normal jobs get killed due to this protection. When the storage element has a problem, jobs cannot copy input files to run Athena and of course cannot update output files. When you think that your job was killed due to an SE problem, you may report to
hn-atlas-dist-analysis-help@cern.ch
. Then shift people and the SE admin will take care of it.
I want to process the whole dataset with N events per job. (integer N>0)
Use the following command:
$ pathena --split -1 --nEventsPerJob N .....(other parameters)
or equally acceptable form:
$ pathena --nEventsPerJob N .....
In the case above, the number of submitted jobs will depend on the number of files in the given dataset, the number of events per file in the dataset, and the requested number N. Each job will have one file as input.
For example, if you choose N to be larger than the number of events in a file, then:
$ pathena --nEventsPerJob N .....
will submit a number of jobs equal to the number of files in the dataset, with one file per job.
I want to run a transformation like Reco_tf.py with N events per job.
When running a transformation you should be aware that pathena doesn't append or subtract anything to/from the argument string
which you specify in the --trf option because option names can depend on releases and/or transformations and pathena doesn't know
all of them (for example, maxEvents was --maxEvents in some old releases). All pathena does is to replace some placeholders, which are described in
this section,
to actual values. Here is an example to run Reco_tf.py with 100 events per job.
$ pathena --trf "Reco_tf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root skipEvents=%SKIPEVENTS maxEvents=100 ..." --inDS ... --outDS ... --nEventsPerFile=1000 --nEventsPerJob=100 ...
Note that you should check if the argument string is correct before trying the above example. Generally you can see all options of a transformation by giving an invalid option like
$ Reco_tf.py -x
usage: Reco_tf.py [-h] [--verbose]
[--loglevel {INFO,CRITICAL,WARNING,VERBOSE,ERROR,DEBUG,FATAL,CATASTROPHE}]
[--argJSON FILE] [--dumpargs] [--showGraph] [--showPath]
[--showSteps] [--dumpPickle FILE] [--dumpJSON FILE]
...
Common pitfalls are as follows:
- When you don't specify maxEvents, each job processes all events in the input file
- When you don't specify skipEvents, all jobs with the same input file process the same events.
In particular, note that maxEvents is not automatically appended even if you set --nEventsPerJob.
I want to launch M jobs, each with N events per job
You can use the following command:
$ pathena --split M --nEventsPerJob N .....
Note: The options --nFiles (i.e. "split by files") and --nEventsPerJob (i.e. "split by events") cannot be defined simultaneously; pathena
will exit with an error at startup. Please define only one or the other.
How do I merge the results of my pathena job?
See
Merge ROOT files using prun.
pathena failed with "ImportError: No module named Client"
Some releases (e.g., 13.0.X) contain obsolete pathena in the HiggsPhys package or something. If pathena points to the obsolete one it will fail with
Traceback (most recent call last):
File "/.../prod/releases/13.0.40/AtlasAnalysis/13.0.40/InstallArea/share/bin/pathena", line 15, in ?
import UserAnalysis.Client as Client
ImportError: No module named Client
Make sure pathena points to the one which you installed locally; i.e.,
$ which pathena
$ ls -l `which pathena`
should give something like
~/myWork/InstallArea/share/bin/pathena
lrwxr-xr-x 1 tmaeno zp 102 Jun 26 16:21 ~/myWork/InstallArea/share/bin/pathena -> ~/myWork/PhysicsAnalysis/DistributedAnalysis/PandaTools/share/pathena
If this is wrong, check PATH
$ echo $PATH
If you are using csh/zsh, you need to execute 'rehash' to update the hash table in the shell after installing PandaTools.
$ rehash
Expected output file does not exist
Perhaps the output stream is defined somewhere in your jobOs, but nothing uses it. In this case, Athena doesn't produce the file. The solution could be to modify your jobO or to use the
--supStream
option. E.g.,
--supStream hist1
will disable user.AhoBaka.TestDataSet1.hist1._00001.root.
How to make a group defined outDS
$ pathena --official --voms atlas:/atlas/groupName/Role=production --outDS group.groupName.[otherFields].dataType.Version ...
where groupName for SUSY is phys-susy, for instance. See the document
ATL-GEN-INT-2007-001
for dataset naming convention. The group name needs to be officially approved and registered (see
GroupsOnGrid). Note that you need to have the production role for the group to produce group-defined datasets. If not, please request it in the ATLAS VO registration page. For example, to produce datasets for Higgs working group,
$ pathena --official --voms atlas:/atlas/phys-higgs/Role=production --outDS group.phys-higgs...
If you submit jobs with the
--voms
option, those jobs are regarded as group jobs.
How to ignore Athena Fatal errors
See
this hypernews message
as an example. Note that this may be required only when you run official transformations.
Job state definitions in Panda (defined/assigned/waiting/holding etc.)
See this info under Monitoring above, also a detailed info from
this link.
How can I delete my user datasets
Please refer to
rucio client howto
.
How can I ask for datasets to be replicated
You need to make a request on
DaTRI
for the dataset to be replicated. You need to register first; click on "for more information click here" to get to the registration page. To check the status of your request, please refer to
DaTRI
.
Why do my jobs fail with "Files is on tape" or "Transform input file not found" error
The pilot doesn't copy input files to WNs when they are on tape, to avoid occupying the CPU slot inactively. Normal jobs will fail with the "Files is on tape" error if all files are skipped, or transformation jobs will fail with "Transform input file not found" if some files are skipped, since transformations require all input files. ATLAS policy states that ESD/RDO/RAW datasets should be on tape. Users are encouraged to file a
DDM subscription request
for the dataset to be moved to disk. See more info about this request in
the above FAQ item. When files are on tape the pilot skips them after sending prestage-requests, and copies only files which have already been cached on disk. Jobs run on available files on WNs, i.e., they ignore missing files. The user may retry later if the files become available on disk. See
additional info about skipped files.
Is it possible to run AthenaRootAccess with pathena
It is recommended to use
prun for this purpose. You can find an example in
this page.
pbook generates the error "ERROR : SQL error: unsupported file format"
If this happens sometimes, the filesystem of user's home dir might be unstable. For example, this kind of SQL error tends to happen when AFS has a problem. Or if the error happens permanently, then user's local database might be corrupted. In this case, the solution is to delete the local database. It will be recreated when pathena runs next time.
$ rm $PANDA_CONFIG_ROOT/pandajob.db
As for the reasons why the user's local database gets corrupted: the local database is manipulated by sqlite3. As mentioned in
this link
, OS/HW problems may corrupt database files. Or there might be rare bugs in sqlite3 itself.
What is the meaning of 'Partial' in an email notification?
You may find 'Partial' in email notification, e.g.,
Summary of JobID : 603
Created : 2008-10-27 14:03:29 (UTC)
Ended : 2008-10-27 14:20:26 (UTC)
Site : ANALY_BNL_ATLAS_1
Total Number of Jobs : 3
Succeeded : 2
Partial : 1
Failed : 0
When the pilot skips some files on tape and jobs successfully run on partial files, the final status of those jobs is categorized as 'partially succeeded'. 'Partial' is the number of partially succeeded jobs. If all files are skipped, the job fails. The above notification means that two sub-jobs ran on all associated input files while one sub-job ran on part of the input files. Skipped files have status='skipped' in the panda monitor. pathena will use skipped files if the user submits jobs with the same input/output datasets again. This capability allows users to process on-tape files incrementally and allows the storage system to use the disk cache efficiently.
What does "COOL exception caught: The database does not exist" or IOVDbSvc ERROR mean?
This error mostly means that your jobs require the latest conditions data which is not available in the default database release installed by the release kit. For example, normally RecExCommission jobs require the latest conditions data, but many CERN lxplus users don't notice that because AFS builds implicitly use the latest conditions data buffered on AFS disks. However, all grid jobs run with release kits using the default database even if they run at CERN, and thus users sometimes see this kind of problem. Solutions could be to specify the
latest version of the DB release
by the
--dbRelease
option (see above in options for its usage) or to use
Oracle RAC . Make sure if you are allowed to use Oracle RAC when using the second solution. Find more info about DB Releases at
AtlasDBReleases.
What does the "OFLCOND-SIM-00-00-06 does NOT exist" error mean?
As an example, the cause of the problem is the absence of the Conditions DB tag OFLCOND-SIM-00-00-06 (requested in your jobOptions) in the default DBRelease version 6.2.1 used in your s/w release 14.5.1. Generally, such a problem is corrected by switching to the very
latest DB Release version
, find more info about DB Releases at
AtlasDBReleases. Solution is to use --dbRelease option, for instance --dbRelease='ddo.000001.Atlas.Ideal.DBRelease.v060402:DBRelease-6.4.2.tar.gz'.
What does the "Exception caught: Connection on "ATLASDD" cannot be established" error mean?
Possible reasons:
- The content of your dblookup.xml is incompatible with the grid environment. In this case this file should be removed from the run directory before submitting the job. See a discussion on this in this egroups link
. Note: The user can have his/her local dblookup.xml in order to override default database connection settings, provided this user has some knowledge about dblookup.xml structure and contents.
- The job is supposed to access sqlite replica, which is not accessible due to wrong permissions, nfs problems or something else.
- The job is supposed to access sqlite replica, which is missing.
- The job is supposed to access Oracle and there are some technical problems with Oracle connections.
- There can be other reasons as well.
For all database access problems
Please refer to
AthenaDBAccess,
CoolTroubles
Usage of ATLASUSERDISK vs ATLASLOCALGROUPDISK
pathena writes the outDS to the space token ATLASUSERDISK at the execution site; one can write the outDS to the space token ATLASLOCALGROUPDISK with the option "--spaceToken ATLASLOCALGROUPDISK". Please note that this option will not be used for the US sites. There, user data will stay in USERDISK and the deletion policy will be different from that of other ATLAS sites, as explained on the page
StorageSetUp. Users do not need to worry about the 30-day deletion limit for US sites for now.
User work directory exceeds the size limit when DBRelease is used
When a large DBRelease is used, the pilot sometimes fails with
!!FAILED!!1999!! User work directory
(/tmp/Panda_Pilot_671_1233518570/PandaJob_25004808_1233518571/workDir) too
large: 2285164 kB (must be < 2 GB)
The solution is to use
--dbRunNumber
in addition to
--dbRelease=
. When the option is used redundant files (typically ~1.5GB) are removed to save disk usage when unpacking DBRelease tarball on WN. E.g.,
$ pathena --dbRunNumber 0091890 --dbRelease ddo.000001.Atlas.Ideal.DBRelease.v06030101:DBRelease-6.3.1.1.tar.gz ...
The
--dbRunNumber
option is available in 0.1.10 or higher. This option is deprecated and to be used with 2008 data only.
How do I blacklist sites against receiving my pathena submission
If your input dataset has some files corrupted or missing at some sites, for instance, you may want to exclude these sites in your submission with
--excludedSite=SITE1,SITE2
. See
the list of Queue names.
How do I get the output dataset at my favorite destination automatically
When
--destSE
option is used, output files are automatically aggregated to a DDM endpoint. e.g.,
$ pathena --destSE LIP-LISBON_LOCALGROUPDISK ...
The first successful subjob makes a replication rule to the DDM endpoint in Rucio with the user's permission. The user
needs to have write permission at the DDM endpoint. Otherwise, output files are not
aggregated to the destination although subjobs are successful. This means that output files are left where they are produced
and the user is still able to manually replicate and/or download those files later.
The name of the DDM endpoint can be found in
AGIS
.
Generally LOCALGROUPDISK (long term storage) or SCRATCHDISK (short term storage) can be used.
You can check permission in each DDM endpoint page.
For example, if you go to
LIP-LISBON_LOCALGROUPDISK
you can see that only
/atlas/pt
users are
allowed to write to the endpoint, so if you don't belong to the pt group the above
example will fail and you will have to choose a proper endpoint.
How can I check if a release/production cache is installed at a site and how can I request an installation
https://atlas-install.roma1.infn.it/atlas_install/
(For US sites the site names are sitename_Install, for instance BNL_ATLAS_Install). Status of Panda installation jobs can be monitored from this
Panda monitoring link
. If you would like to request a release installation (e.g. for T0 releases), you should
create a GGUS ticket
for OSG and EGEE sites. Alternatively you can use the "
Request an installation
" link for only EGEE sites.
Issues with SL5 and SL4
While the migration from SL4 to SL5 should be transparent to users a few errors have appeared.
For users running SL5: If you have installed an SL5 kit, your jobs must be sent to an SL5 site. To ensure compatibility, SL4 kits should be installed with suitable compatibility libraries.
For users running SL4: Certain jobs will terminate with a strange segmentation fault; please see
this hypernews message
and try to run on an SL4 site.
Note: For lcg sites the operating system can be queried using lcg-infosites
pathena failed due to "ERROR : Could not parse jobOptions"
As an example the error message would be something like:
BTagging/BTagging_LoadTools.py", line 65, in <module>
input_items = pf.extract_items(pool_file=
svcMgr.EventSelector.InputCollections[0])
IndexError: list index out of range
ERROR : Could not parse jobOptions
or
RecExCommon_topOptions.py", line 1053, in <module>
from RecExCommon.InputFilePeeker import inputFileSummary
File ".../RecExCommon/InputFilePeeker.py", line 29, in <module>
RunTimeError," no existing input file detected. Stop here."
NameError: name 'RunTimeError' is not defined
ERROR : Could not parse jobOptions
InputFilePeeker tries to extract metadata from the input file to configure run parameters dynamically, so such jobOs don't work locally unless you define a valid input file in your jobO; i.e., Athena (not pathena) fails with the same error.
$ athena -i yourJobO.py
...
RecExCommon/RecExCommon_topOptions.py", line 1053, in <module>
from RecExCommon.InputFilePeeker import inputFileSummary
File "/afs/cern.ch/atlas/software/builds/AtlasReconstruction/15.2.0/InstallArea/python/RecExCommon/InputFilePeeker.py", line 29, in <module>
RunTimeError," no existing input file detected. Stop here."
NameError: name 'RunTimeError' is not defined
athena>
Basically pathena doesn't work if Athena locally fails with the jobO.
The solution is to have something like
svcMgr.EventSelector.InputCollections=["/somedir/mc08.108160.AlpgenJimmyZtautauNp0VBFCut.recon.ESD.e414_s495_r635_tid070252/ESD.070252._000001.pool.root.1"]
or
PoolESDInput=["/somedir/mc08.108160.AlpgenJimmyZtautauNp0VBFCut.recon.ESD.e414_s495_r635_tid070252/ESD.070252._000001.pool.root.1"]
in your jobO, where the input file must be valid (i.e. can be accessed from your local computer). Note that those parameters (essentially
EventSelector.InputCollections
and
AthenaCommonFlags.FilesInput
) will be overwritten to lists of input files on remote WNs automatically.
Question: Does the local file have to be from the very same inDS that will be run on the grid? The answer is no. In many cases, only the data type, such as AOD,ESD,RAW..., is important.
What does "ImportError: ...lfc.so: wrong ELF class" error mean?
This refers to a problem with your panda-client package setup. Please refer to
PandaTools#Setup.
Where can I find a list of clouds and their sites and CEs
From Panda monitor
cloud link
or from DDM
dashboard link
or from
ATLAS Grid Information System
.
Why do I get a different number of subjobs at different sites while running on the same dataset?
Files were still being added to the dataset when the job was submitted. The content of the dataset can change before the dataset gets frozen.
Here is more detailed info: When a dataset is being made by Tier 0 the sequence is as follows. (1) the empty dataset is registered in Rucio.
(2) files are produced and added to the dataset. The dataset is visible in Rucio during this time but it is OPEN. The number of files will be changing.
The analysis tools do not check to see if the dataset the user requests is frozen or closed. If you were to run an analysis job during this period
you might get the situation where different numbers of jobs were declared because the number of files available is different.
(3) When the dataset is complete (no more files to add) it will be frozen, and at this point and only at this point, Tier 0 raises an "AMI Ready" flag,
and the dataset is read into AMI. (Normally within 5 minutes maximum of the raising of the flag.) Therefore, there is a period of a few hours when
a dataset may be visible in Rucio but it is NOT in AMI. But during this period it is probably not complete, and you CAN use it but you use it at your own risk.
How to retry failed subjobs at different sites
pbook.retry()
retries failed subjobs at the same site. Also, it doesn't retry subjobs if buildJob failed.
You may want to resubmit them to different sites, e.g., to avoid a site specific
problem. In order to send new subjobs to other sites, their job specification
needs to be re-created with new site parameters. The easiest way is
to run pathena/prun with the same
input/output datasets (or dataset containers). In this case, that will run only on failed files
instead
of all files in the input dataset and output files will be appended to the output dataset container.
If you are using output container
(the default since 0.2.73) and specify
--excludedSite
, new subjobs
will go to other sites.
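For example, a sketch of such a resubmission (the site name is a placeholder):
$ pathena myJobO.py --inDS <same input container> --outDS <same output container> --excludedSite=ANALY_PROBLEM_SITE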
How to use external packages with pathena
Use the gluePackages option of pathena. See some user examples in the DAST list
here
.
What does "ignore replicas of DATASET_NAME at ANALY_XYZ due to archived=ToBeDeleted or short lifetime < 7days" mean?
Once the archived flag of a dataset replica is set to ToBeDeleted or it has a short lifetime the replica is deleted by the DDM deletion service very soon.
In this case the brokerage skips the site since the replica may be deleted while the job is running.
Duplicated files when old output dataset (container) is reused
When you submit jobs twice with the same input dataset and output dataset, the second job runs only on files which were unused or failed in the first job.
However, if the first job is older than 30 days this machinery doesn't work properly and there will be duplicated files in the output dataset, as explained in
this ticket
.
This problem is going to be addressed in the new panda system, but for now you should avoid reusing very old output dataset (container).
Contact Email Address:
hn-atlas-dist-analysis-help@cern.ch