How to submit Athena jobs to Panda

Introduction

This page describes how to submit user Athena jobs from LCG/OSG/NG to the ATLAS production system, PanDA.


Getting started

1. Setup

First, make sure you have a grid certificate. See Starting on the Grid. You should have usercert.pem and userkey.pem under ~/.globus.

$ ls ~/.globus/*
usercert.pem userkey.pem

Warning, important All you need here is to put usercert.pem and userkey.pem under the globus directory.

Then, set up Athena, because pathena works in the Athena runtime environment. You can use any ATLAS release or dev/bugfix nightlies. The workbook or this page may help. Here is an example for AnalysisBase, 21.2.40:

$ setupATLAS
$ asetup AnalysisBase,21.2.40

Warning, important Note that you need to set up the panda-client after sourcing Athena's setup.sh, since the latter changes the PATH environment variable.

example

Here is an example sequence for fresh users who use ~/myWork/test for the Athena working directory. If you use CERN AFS see Setup for CERN AFS users instead.

# ssh myhost.somedomain.xyz

$ ls -l .globus
total 7
-rw-------  1 tmaeno zp 1442 Nov 10 14:17 usercert.pem
-rw-------  1 tmaeno zp 1207 Nov 10 14:18 userkey.pem


$ setupATLAS
$ asetup AnalysisBase,21.2.40,here

Using AnalysisBase/21.2.40 [cmake] with platform x86_64-slc6-gcc62-opt
        at /cvmfs/atlas.cern.ch/repo/sw/software/21.2
Test area /afs/cern.ch/user/t/tmaeno/myWork/test

$ lsetup panda
************************************************************************
Requested:  panda ...
 Setting up panda 0.6.9 ...
>>>>>>>>>>>>>>>>>>>>>>>>> Information for user <<<<<<<<<<<<<<<<<<<<<<<<<
************************************************************************

$ pathena --version
Version: X.Y.Z

2. Submit

When you run Athena with

$ athena jobO_1.py jobO_2.py jobO_3.py

all you need is

$ pathena jobO_1.py jobO_2.py jobO_3.py [--inDS inputDataset]
       --outDS outputDataset

where inputDataset is a dataset which contains input files, and outputDataset is a dataset which will contain output files. For details about options, see pathena.

example.1 : evgen

$ wget http://cern.ch/tmaeno/jobOptions.pythia16.py
$ pathena jobOptions.pythia16.py --outDS user.tmaeno.123456.aho.evgen.pool.v1

extracting run configuration
 ConfigExtractor > No Input
 ConfigExtractor > Output=STREAM1
 ConfigExtractor > RndmStream PYTHIA
 ConfigExtractor > RndmStream PYTHIA_INIT
archive sources
post sources
submit
===================
 JobID  : 8
 Status : 0
  > build
    PandaID=167957
  > run
    PandaID=167958

When you try the above example, replace user.tmaeno.123456.aho.evgen.pool.v1 with another dataset name. User-defined datasets should have the prefix user.nickname. This page explains how to register your nickname in the ATLAS VO. Concerning the naming of datasets, see the ATLAS dataset nomenclature document.
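
For illustration, here is a minimal sketch of composing such a name; the nickname and the other fields are placeholders, and the actual rules are those of the nomenclature document:

# Compose an output dataset name with the required "user.<nickname>" prefix.
# All values below are placeholders; follow the ATLAS dataset nomenclature.
nickname = "tmaeno"
fields = ["123456", "aho", "evgen", "pool", "v1"]
outDS = ".".join(["user", nickname] + fields)
print(outDS)  # user.tmaeno.123456.aho.evgen.pool.v1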

Warning, important This jobOptions.pythia16.py is NOT a special job options file; you can also run it on a local computer.

$ athena jobOptions.pythia16.py

In the Panda system, each job has a unique PandaID. The above job consists of two sub-jobs, 167957-167958: 167957 builds the Athena environment on the remote site, and 167958 runs Athena and produces the outputs. The detailed job sequence is described below. Panda provides a web-based monitor (see Monitoring) so that users can easily check job status.

When you set '--split N', e.g.,

$ pathena jobOptions.pythia16.py --outDS user.tmaeno.456.ccaho.evgen.pool.v5 --split 3

extracting run configuration
...
PYTHIA : 100 200
PYTHIA_INIT : 300 400
...
submit
===================
 JobID  : 178
 Status : 0
  > build
    PandaID=347362
  > run
    PandaID=347363-347365

N sub-jobs will run, so that you will get N×EvtMax events. Random seeds are incremented for each sub-job. The job status will change as described in Job state definitions in Panda. Once the job has finished, you should receive an e-mail and can retrieve the outputs (see Get results).
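
As a rough illustration of the bookkeeping described above, here is a minimal sketch with assumed values (EvtMax and the base seed are placeholders; pathena handles all of this for you on the server side):

# Sketch only: EvtMax and the base seed are assumed example values.
N = 3                 # --split 3, as in the example above
evt_max = 100         # EvtMax set in the job options (assumed)
base_seed = 100       # base random seed of one stream (assumed)
total_events = N * evt_max                 # you get N x EvtMax events in total
seeds = [base_seed + i for i in range(N)]  # seed incremented per sub-job
print(total_events, seeds)                 # 300 [100, 101, 102]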


example.2 : g4sim

$ get_files -jo jobOptions.G4Atlas_Sim.py

Modify the job options according to the WorkBook instructions.

$ pathena -c "EvtMax=3" jobOptions.G4Atlas_Sim.py --inDS user.tmaeno.123456.aho.evgen.pool.v1 --outDS user.TadashiMaeno.123456.baka.simul.pool.v1 --useNextEvent

The input of this job is the dataset which example.1 produced. You can use the '-c' option as with athena. Basically all jobs except evgen (or single-particle simulation) jobs should have both --inDS and --outDS. Note the --useNextEvent switch, which should be used for all jobs whose job options use theApp.nextEvent() (e.g. G4 sim), except transformations, i.e., you don't need --useNextEvent when running transformations using --trf. See this section for the --trf option.

Warning, important G4 simulation is one of the trickiest job types in Athena. Unless you have enough expertise on the G4Atlas framework, you don't need to try this example. The point of this example is that the output dataset produced by a grid job can be used as an input for another grid job.


example.3 : g4sim + customized package

When you want to change the default algorithm, check out the corresponding package and modify the sources as you like. Here is an example.

Identify the tag to use for the release that you are using.

$ cd ../../../../
$ get_tag Simulation/G4Atlas/G4AtlasAlg

Check out that tag, e.g., for 12.0.2, it is G4AtlasAlg-00-00-09

$ cmt co -r G4AtlasAlg-00-00-09 Simulation/G4Atlas/G4AtlasAlg
$ cd Simulation/G4Atlas/G4AtlasAlg/*/src

Edit G4AtlasAlg.cxx, for example. Then:

$ cd ../cmt
$ cmt br make

Note that you need to compile the package, because pathena scans InstallArea in your work directory.

$ cd ../../../../PhysicsAnalysis/AnalysisCommon/UserAnalysis/*/run
$ pathena -c "EvtMax=3" jobOptions.G4Atlas_Sim.py --inDS user.tmaeno.123456.aho.evgen.pool.v1 --outDS user.TadashiMaeno.123456.baka.simul.pool.v2 --useNextEvent

Warning, important In principle, you can check out multiple packages and/or create new packages. You can also customize all production steps including analysis. Note the --useNextEvent switch, which should be used for all jobs whose job options use theApp.nextEvent() (e.g. G4 sim), except transformations, i.e., you don't need --useNextEvent when running transformations using --trf. See this section for the --trf option.


example.4 : analysis job running on an official dataset

You may use AMI to find datasets which you are interested in.

Check out and customize sources and job options in any packages as you like. Make sure your job runs properly on a local PC before submitting it. If it succeeds, then go ahead. As a concrete example, we will use AnalysisSkeleton_topOptions.py (note, however, that before release 12.0.3 it was called AnalysisSkeleton_jobOptions.py).

get_files AnalysisSkeleton_topOptions.py

Edit your job options file and make sure that the input AOD name and path are well defined before testing. You need some AOD to test the above job options in your local area before submitting to the grid; if you do not have any AOD available, you may run the reconstruction yourself for a few events. Create a file called myTopOptions.py and add these lines to it:

####################################################
DetDescrVersion=" Detector descriptor Version Tag"
doHist = False
doCBNT = False
EvtMax = 15

PoolRDOInput =[ "Input RDO file path and name" ]
include ("Atlas.RecExCommon/Atlas.RecExCommon_topOptions.py")
######################################################

Make sure you have enough memory to run this job interactively. Then, run it:

athena.py -b myTopOptions.py

This should produce some ESD and AOD for 15 events. Once you have the AOD, you test your analysis code, for example:

athena -b AnalysisSkeleton_topOptions.py

Running AnalysisSkeleton_topOptions.py should produce an AthenaAwareNTuple file called AnalysisSkeleton.aan.root; you may open this file in ROOT and examine it. You have now verified that your job options, in this particular case AnalysisSkeleton_topOptions.py, run successfully on your local machine. You should also check that the output data, in this case an AthenaAwareNTuple file, contain sensible results before submitting to the grid for much larger statistics.

Warning, important If you have set up release 12, note that release 11 AOD is not automatically readable in 12. You should use an AOD dataset produced in 12, or do this exercise with the indicated dataset in 11.0.5 - change the username appropriately in the output dataset name:

$ pathena AnalysisSkeleton_topOptions.py      --inDS trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601      --outDS user.KeteviA.Assamagan.trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AAN.v12000601

example.5 : TAG Selection in Distributed Analysis

Generally when you use TAG in Athena, your jobO has something like

EventSelector.InputCollections = [ "XYZ" ]
EventSelector.CollectionType = "ExplicitROOT"
EventSelector.Query="NLooseMuon>0 || NLooseElectron>0" # your query for selecting the events

or

from RecExConfig.RecFlags import rec
rec.readTAG.set_Value_and_Lock(True)
from AthenaCommon.AthenaCommonFlags import athenaCommonFlags
athenaCommonFlags.PoolInputQuery.set_Value_and_Lock("CellMET>500000.")

pathena checks whether the inputs are TAGs or not by parsing the jobOs. If the inputs are TAGs, POOL references are extracted from the TAG files on the remote worker node to construct PoolFileCatalog.xml. Parent files (such as AOD and ESD) are read directly from the storage element when back-navigation is triggered. Everything happens automatically, so users don't need to worry about these details.

To submit TAG-based analysis jobs to Panda, all you need to do is just to use TAG datasets as input datasets, e.g.

$ pathena TagSelection_AODanalysis_topOptions.py --inDS data09_cos.00121238.physics_L1CaloEM.merge.TAG_COMM.r733_p39/ --outDS user.TadashiMaeno.123456.ahoTAG.test7

The job will be sent to a site where the TAG dataset is available. Sometimes you may use back-navigation to ESD, RDO, etc., but they are not guaranteed to exist at the site where TAG is available. If you want to send the job to a site where parent datasets are available, use the --parentDS option. e.g.,

$ pathena TagSelection_AODanalysis_topOptions.py --inDS data09_cos.00121238.physics_L1CaloEM.merge.TAG_COMM.r733_p39/ --parentDS data09_cos.00121238.physics_L1CaloEM.recon.ESD.r733/ --outDS user...

See Using File-based Tag Datasets with Panda and Using Relational Tags with Panda for details.


example.6 : re-submit the same analysis job on another dataset

pbook shows the libDS (library dataset) for each job. Normally, one job is composed of one buildJob and many runAthena jobs (see details). buildJob produces the libraries and the libDS contains them. buildJob can be skipped by reusing the libraries, which reduces the total job execution time. If --libDS is given, the runAthena jobs run with the libraries in that libDS.

$ pathena  TagSelection_AODanalysis_topOptions.py      --inDS csc11.005406.SU8_jimmy_susy1.recon.AOD.v11000501      --outDS user...      --libDS pandatest.368b45f5-b6dd-4046-a368-6cb50cd9ee5b

You could write a simple script, e.g.,

import os

# Loop over several input datasets; the first submission builds the libraries,
# later ones reuse them via "--libDS LAST".
inDSs = ['dataset1','dataset2','dataset3']
outDS = "user.AhoBaka.test0"
comFirst = "pathena --outDS %s --inDS %s myJob.py"
comLater = "pathena --outDS %s --inDS %s --libDS LAST myJob.py"
for i,inDS in enumerate(inDSs):
    if i==0:
        os.system(comFirst % (outDS,inDS))
    else:
        os.system(comLater % (outDS,inDS))

Note that "--libDS LAST" gives the last libDS which you have produced.


example.7 : submit a long running job to a long queue

If sub-jobs exceed the walltime limit, they get killed. When you want to submit long-running jobs (e.g., customized G4 simulation), submit them to sites where a longer walltime limit is available by specifying the expected execution time (in seconds) with the --maxCpuCount option.

$ pathena G4Sim.py --inDS csc11.005115.JimmyZmumuM150.simul.HITS.v11004205  --outDS user.TadashiMaeno.123456.baka.simul.v2 --maxCpuCount 172800

Note that the maximum value for --maxCpuCount is 259200 (3 days).
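
As a quick sanity check of those numbers, the conversion from days to seconds:

# --maxCpuCount takes seconds:
print(2 * 24 * 3600)   # 172800 -> 2 days, the value used in the example above
print(3 * 24 * 3600)   # 259200 -> 3 days, the maximum allowed value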

example.8 : How to run production transformations

pathena allows users to run official transformations with customized sources/packages. First, set up AtlasXYZRuntime, e.g.,

$ asetup AtlasProduction,17.0.4.5,here,setup

Next, if you locally run a trf like

$ Reco_trf.py inputAODFile=AOD.493610._000001.pool.root.1 outputNTUP_SUSYFile=my.NTUP.root

replace some parameters with %XYZ

Input
→ %IN
Cavern Input
→ %CAVIN
Minimumbias Input
→ %MININ
Low pT Minimumbias Input
→ %LOMBIN
High pT Minimumbias Input
→ %HIMBIN
BeamHalo Input
→ %BHIN
BeamGas Input
→ %BGIN
Output
→ %OUT + suffix (e.g., %OUT.ESD.pool.root)
MaxEvents
→ %MAXEVENTS
SkipEvents
→ %SKIPEVENTS
FirstEvent
→ %FIRSTEVENT
DBRelease or CDRelease
→ %DB:DatasetName:FileName (e.g., %DB:ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz, or %DB:LATEST if you use the latest DBR). Note that if your trf uses named parameters (e.g., DBRelease=DBRelease-5.1.1.tar.gz) you will need DBRelease=%DB:DatasetName:FileName (e.g., DBRelease=%DB:ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz)
Random seed
→ %RNDM:basenumber (e.g., %RNDM:100, this will be incremented per sub-job)

and then submit jobs using the --trf option;

$ pathena --trf "Reco_trf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root" --inDS ... --outDS ...

When your job doesn't take an input (e.g., evgen), use the --split option to instantiate multiple sub-jobs if needed. %SKIPEVENTS may be needed if you use the --nEventsPerJob or --nEventsPerFile options of pathena. Note that you need to explicitly specify maxEvents=XYZ or similar in --trf to set the number of events processed in each subjob, since the value of --nEventsPerJob or --nEventsPerFile is used only for job splitting and not to configure the subjobs on the WN.

If you want to add parameters to the transformation that are not listed above, just add them inside the quotes as you normally would on the command line. pathena doesn't replace anything in them; it passes these parameters along to the transformation unchanged.
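
To make the placeholder convention concrete, here is an illustrative sketch of the substitution. It is not pathena's actual implementation, and the input file name and the %OUT prefix are assumed example values for a single subjob:

# Illustration only: NOT pathena's actual implementation.
trf = ("Reco_trf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root "
       "maxEvents=%MAXEVENTS skipEvents=%SKIPEVENTS")
substitutions = {
    "%IN": "AOD.493610._000001.pool.root.1",   # input file(s) of this subjob
    "%OUT": "user.nickname.mytask._00001",     # per-subjob output prefix (assumed format)
    "%MAXEVENTS": "100",
    "%SKIPEVENTS": "0",
}
for placeholder, value in substitutions.items():
    trf = trf.replace(placeholder, value)
print(trf)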


example.9 : How to run multiple transformations in a single job

One can run multiple transformations in a single job by using the --trf option like

$ pathena --trf "trf1.py ...; trf2.py ...; trf3.py ..." ...

Here is an example to run simul+digi;

$ pathena --trf "AtlasG4_trf.py inputEvgenFile=%IN outputHitsFile=tmp.HITS.pool.root maxEvents=10 skipEvents=0 randomSeed=%RNDM geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00; Digi_trf.py inputHitsFile=tmp.HITS.pool.root outputRDOFile=%OUT.RDO.pool.root maxEvents=-1 skipEvents=0 geometryVersion=ATLAS-GEO-16-00-00  conditionsTag=OFLCOND-SDR-BS7T-04-00" --inDS ...

where AtlasG4_trf.py produces a HITS file (tmp.HITS.pool.root) which is used as an input by Digi_trf.py to produce RDO. In this case, only RDO is added to the output dataset since only RDO has the %OUT prefix (i.e. %OUT.RDO.pool.root).

If you want to have HITS and RDO in the output dataset the above will be

$ pathena --trf "AtlasG4_trf.py inputEvgenFile=%IN outputHitsFile=%OUT.HITS.pool.root maxEvents=10 skipEvents=0 randomSeed=%RNDM geometryVersion=ATLAS-GEO-16-00-00 conditionsTag=OFLCOND-SDR-BS7T-04-00; Digi_trf.py inputHitsFile=%OUT.HITS.pool.root outputRDOFile=%OUT.RDO.pool.root maxEvents=-1 skipEvents=0 geometryVersion=ATLAS-GEO-16-00-00  conditionsTag=OFLCOND-SDR-BS7T-04-00" --inDS ...

Note that both AtlasG4_trf.py and Digi_trf.py take %OUT.HITS.pool.root as a parameter. AtlasG4_trf.py uses it as an output filename while Digi_trf.py uses it as an input filename.


example.10 : How to run on a good run list

Before you submit jobs to the grid, it is recommended to read the GoodRunsListsTutorial.

First, you need to get an XML file from the Good Run List Generator. The XML contains a list of runs and LBs of interest. Next, configure your jobO to use the XML for good run list selection. Now you can submit the job using the --goodRunListXML option, e.g.,

$ pathena myJobO.py --goodRunListXML MyLBCollection.xml --outDS user...

where MyLBCollection.xml is the file name of the XML. The list of runs/LBs is internally converted to a list of datasets by AMI, so you don't need to specify --inDS. The XML is sent to the remote WNs and jobs run on those datasets with the XML. There are additional options related to the good run list:

--goodRunListDataType
the XML is converted to AOD datasets by default. If you require another type of datasets, you can specify the type using this option
--goodRunListProdStep
the XML is converted to merge datasets by default. If you require datasets with another production step, use this option
--goodRunListDS
a comma-separated list of pattern strings. Datasets converted from the XML are used when their names match one of the pattern strings. The wild-card character must be escaped with a backslash or the pattern enclosed in quotes (see the sketch below). If this option is omitted, jobs will run over all datasets
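
Here is a sketch of the quoting mentioned above, written in the same style as the loop script earlier on this page; the XML name, dataset pattern and output dataset name are placeholders:

import os

# Sketch only: XML name, dataset pattern and output name are placeholders.
# The quotes keep the wild-card from being expanded by the shell.
cmd = ('pathena myJobO.py --goodRunListXML MyLBCollection.xml '
       '--goodRunListDS "data09_cos*AOD*" '
       '--outDS user.nickname.grltest.v1')
os.system(cmd)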

example.11 : How to read particular events for real data

pathena can take a run/event list as an input since 0.2.37. First you need to prepare a list of runs/events of interest. You may get such a list by analysing D3PDs, browsing an event display, using ELSSI, and so on. A list looks like

$ cat rrr.txt
154514 21179
154514 29736
154558 448080

where each line contains a run number and an event number. Then, e.g.,

$ pathena AnalysisSkeleton_topOptions.py --eventPickEvtList rrr.txt --eventPickDataType AOD --eventPickStreamName physics_CosmicCaloEM --outDS user...

where the events in the input file are internally converted to AOD (eventPickDataType) files of the physics_CosmicCaloEM (eventPickStreamName) stream. For MC data, --eventPickStreamName needs to be '' or the option itself needs to be omitted, since stream names are not defined. Your jobO is dynamically re-configured to use event selection on the remote WNs, so generally you don't need to change your jobO. In principle, you can run any arbitrary jobO. For example, if you run esdtoesd.py (with --eventPickDataType=ESD) you will get a skimmed ESD. FYI, you can also skim ESD/AOD/RAW using acmd.py via prun (see this page).
[Edit: with recent versions of esdtoesd.py (such as release 16.6.6 and beyond), you may need to use the flag --supStream GLOBAL in your pathena command line. See this FAQ for more details]
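
If your run/event pairs come from another analysis step, here is a trivial sketch of writing the list in the two-column format shown above (the pairs are just examples):

# One "run event" pair per line, as in rrr.txt above.
pairs = [(154514, 21179), (154514, 29736), (154558, 448080)]
with open("rrr.txt", "w") as f:
    for run, event in pairs:
        f.write("%s %s\n" % (run, event))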

example.12 : How to use AtlCopyBSEvent.exe with TAG files

If your non-Athena job uses TAG files to read RAW,ESD,AOD, it is easier to execute the job by using pathena rather than prun since the TAG stuff heavily depends on the Athena infrastructure. In principle, you can run arbitrary scripts using --trf which essentially corresponds to --exec of prun. e.g.,

$ pathena --trf "AtlCopyBSEvent.exe --src PFN:%IN RootCollection -q 'EFPassedTrigMask31 == 512' -o %OUT.event.data" --inDS data11_7TeV.00186533.physics_JetTauEtmiss.merge.TAG.f393_m929_m928 --useTagInTRF --tagStreamRef StreamRAW --tagQuery "EFPassedTrigMask31 == 512" ...

In order for --trf to use TAGs you need to set --useTagInTRF and --tagStreamRef (and optionally --tagQuery). If you run athena with normal job option files, pathena automatically detects that the job uses TAGs, so you don't need to specify those options. If --useTagInTRF is set, input files are treated as TAG files so that pathena sends the job to a site where both the TAG and the parent datasets are available. --tagStreamRef is the stream name of the parent files which you want to read through TAGs; it must be one of StreamRAW, StreamESD, StreamAOD. --tagQuery is the query string to select events; jobs will run only on files which contain selected events. Note that you need to set --tagQuery even if you have the same string in --trf, because pathena doesn't parse the string in --trf except for parameters which start with % such as %IN.


3. Monitoring

One can monitor task status in PandaMonitor.

Task parameters are defined in this page.

Job status changes as follows:

  1. defined : recorded in job database
  2. assigned : DDM is setting up datasets
  3. activated : waiting for request from worker node
  4. running : worker node running
  5. holding : waiting to add files to DQ2 dataset
  6. finished/failed

4. Get results

Output datasets specified with --outDS contain the output files. They are registered in DDM, so you can access them using the rucio client. If you want the output dataset to arrive at your favorite destination automatically, see this Tip.
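
For example, here is a minimal sketch of downloading an output dataset with the rucio client; it assumes the client is already set up (e.g. via lsetup rucio), and the dataset name is a placeholder to be replaced with your own --outDS value:

import os

# Placeholder dataset name; replace it with your own --outDS value.
outDS = "user.nickname.123456.aho.evgen.pool.v1"
os.system("rucio download %s" % outDS)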

5. Bookkeeping & Retry

Warning, important A new bookkeeping tool is available on this page. Job records can be retrieved with pbook. In a pbook session, autocomplete is bound to the TAB key, and the up-arrow key brings back previous commands.

$ pbook

>>> show()
===================
JobID : 1
 time  : 2006-04-25 22:48:04
 inDS  :
outDS : user.TadashiMaeno.123456.aho.evgen.pool.v11356
 libDS : pandatest.60bfeef5-3998-44ed-8802-3e79c568ba8b
 build : 163801
 run   : 163802
 jobO  : jobOptions.pythia.py
===================
JobID : 2
 time  : 2006-04-28 17:32:50
...

See status of JobID=3.

>>> show(3)
===================
JobID : 3
 time  : 2006-04-28 18:32:53
 inDS  : csc11.005056.PythiaPhotonJet2.recon.AOD.v11004107
 outDS : user.TadashiMaeno.123456.aho.test2
 libDS : pandatest.368b45f5-b6dd-4046-a368-6cb50cd9ee5b
 build : 166095
 run   : 166096-166107
 jobO  : ../share/AnalysisSkeleton_jobOptions.py
----------------------
buildJob   : succeeded
----------------------
runAthena  :
          total : 10
      succeeded : 8
         failed : 1
        running : 1
        unknown : 0
----------------------

Retry failed subjobs in JobID=5.

>>> retry(5)

Kill JobID=3.

>>> kill(3)

See help

>>> help()

Press Ctrl-D to exit

6. How to debug jobs when they failed

The error code shows why your job failed. Error codes are defined in the Error Code List.

When transExitCode!=0, the job failed with an Athena problem. In that case you may want to browse the log files.

You will find a directory PandaJob_* which contains various log files; e.g., there should be pilot_child.stdout for stdout and pilot_child.stderr for stderr. You can investigate the Athena output in pilot_child.stdout.

Warning, important If you still have a problem, send a message to the distributed analysis help hypernews.


Details

What's going to happen when a user submits a job

A user job is composed of sub-jobs, i.e., one buildJob and many runAthena jobs. buildJob receives the source files from the user, compiles them and produces libraries. The libraries are stored on the storage element. The completion of buildJob triggers the runAthena jobs. Each runAthena job retrieves the libraries and runs Athena. Output files are added to an output dataset. Then DDM moves the dataset to your local area.

build2run.png

input dataset
contains input files, such as POOL, bytestream, and TAG collection files. Input datasets must be registered in DDM before submitting jobs. Official datasets have already been registered. When you want to use a private dataset, use dq2_put.

output dataset
contains output files

pathena

pathena is a client tool to submit user-defined jobs to distributed analysis systems. It provides a consistent user interface to Athena users. It does the following:

  1. archive user's working directory
  2. send the archive to Panda
  3. extract job configuration from jobOs
  4. define job specifications automatically
  5. submit jobs

options

Try --help to see all options.

$ pathena --help

Some (not all) options are explained here.

-v
Verbose output; useful for debugging

--inDS
name of input dataset

--outDS
name of output dataset. should be unique and follow the naming convention

--libDS
name of a library dataset

--split
the number of sub-jobs to which an analysis job is split

--nFilesPerJob
the number of files on which each sub-job runs

--nEventsPerJob
the number of events per job

--nFiles
use a limited number of files in the input dataset

--nSkipFiles
the number of files in the input dataset one wants to skip counting from the first file of the dataset

--useAMIAutoConf
evaluates the inDS autoconf information from AMI. Boolean -- just set the option without arguments

--long
send job to a long queue

--blong
send build job to a long queue

--noBuild
skip buildJob

--extFile
pathena exports files with certain special extensions (.C, .dat, .py, .xml) in the current directory. If you want to add other files, specify their names, e.g., --extFile data1.root,data2.tre

--noSubmit
don't submit jobs

--tmpDir
temporary directory in which an archive file is created

--site
send the job to a specified site; if it is AUTO, jobs go to the site which holds the largest number of input files

-c --command
one-liner, runs before any jobOs

--extOutFile
define extra output files, e.g., output1.txt,output2.dat

--fileList
List of files in the input dataset to be run

--addPoolFC
file names to be inserted into PoolFileCatalog.xml except input files. e.g., MyCalib1.root,MyGeom2.root

--skipScan
Skip LRC/LFC lookup

--inputFileList
name of file which contains a list of files to be run in the input dataset

--removeFileList
name of file which contains a list of files to be removed from the input dataset

--corCheck
Enable a checker to skip corrupted files

--inputType
File type in input dataset which contains multiple file types

--outTarBall
Save a copy of local files which are used as input to the build

--inTarBall
Use a saved copy of local files as input to build

--outRunConfig
Save extracted config information to a local file

--inRunConfig
Use a saved copy of config information to skip config extraction

--minDS
Dataset name for minimum bias stream

--nMin
Number of minimum bias files per one signal file

--cavDS
Dataset name for cavern stream

--nCav
Number of cavern files per one signal file

--trf
run transformation, e.g. --trf "csc_atlfast_trf.py %IN %OUT.AOD.root %OUT.ntuple.root -1 0"

--useNextEvent
Use this option if your jobO uses theApp.nextEvent() e.g. for G4 simulation jobs. Note that this option is not required when you run transformations using --trf.

--notSkipMissing
If input files cannot be read from the SE, they are skipped by default. This option disables that behaviour

--pfnList
Name of file which contains a list of input PFNs. Those files can be un-registered in DDM

--individualOutDS
Create individual output dataset for each data-type. By default, all output files are added to one output dataset

--dbRelease
use non-default DBRelease or CDRelease (DatasetName:FileName). e.g., ddo.000001.Atlas.Ideal.DBRelease.v050101:DBRelease-5.1.1.tar.gz. To run with no dbRelease (e.g. for event generation) provide an empty string ('') as the release.

--dbRunNumber
RunNumber for DBRelease or CDRelease. If this option is used some redundant files are removed to save disk usage when unpacking DBRelease tarball. e.g., 0091890. This option is deprecated and to be used with 2008 data only.

--supStream
suppress some output streams. e.g., ESD,TAG

--ara
obsolete. Please use prun

--ares
obsolete. Please use prun

pathena analysis queues

Information about the pathena analysis queues is provided on the PathenaAnalysisQueues page.

Job priority

Job priorities are calculated for each user. When a user submits a job composed of M subJobs, the priority of the n-th subJob is computed from the following quantities:

  • Priority(n) … Priority for n-th subJob (0≤n<M)

  • T … The total number of the user's subJobs existing in the whole queue. (existing = job status is one of defined,assigned,activated,sent,starting,running)

  • W … Weight

  • U … CPU usage for last 24 hours (in kSI2kday)

  • Q … CPU quota

  • H(x) … Heaviside step function (0: x≤0, 1: x>0)

For example, if a fresh user submits a job composed of 100 subJobs, the first 5 subJobs have Priority=1000 while the last 5 subJobs have Priority=981. The idea of this gradual decrease is to prevent large jobs (large = composed of many subJobs) from occupying all CPU slots. When another fresh user submits a job with 10 subJobs, those subJobs have Priority=1000 or 999, so they will be executed as soon as CPUs become available, even if other users have already queued many subJobs. Priorities of waiting jobs in the queue are recalculated every 20 minutes. Even if some subjobs have very low priorities at submission time, their priorities are increased periodically so that they are executed before they expire.
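
As a sketch of the gradual decrease described above, reconstructed only from the fresh-user example (the penalty built from W, U, Q and H(x) is deliberately omitted):

def base_priority(n, T=0):
    # Base priority of the n-th subJob, reconstructed from the example above:
    # it drops by one for every 5 subJobs already queued (T) plus those ahead
    # of it in the current submission (n). The usage/quota penalty involving
    # W, U, Q and H(x) is omitted from this sketch.
    return 1000 - (T + n) // 5

# Fresh user (T=0) submitting 100 subJobs: first 5 get 1000, last 5 get 981.
print(base_priority(0), base_priority(4))    # 1000 1000
print(base_priority(95), base_priority(99))  # 981 981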

The CPU usage and quota can be found in PandaMonitor. The assigned quota is 500 for all users and groups based on the request from physics coordination.

Currently Weight=100 for all users and groups. It is possible to dynamically change the value per user or group if physics coordination requests it.

Each site can define priority offsets to give special privileges to particular VO groups, where VO groups may be country groups like /atlas/ca, physics groups like /atlas/susy, and so on. If the user belongs to the groups and she/he submits jobs to the site, their priorities will be automatically increased.

If the user submits jobs with the --voms option to produce group datasets (see this section), those jobs are regarded as group jobs. Priorities are calculated per group, so group jobs don't reduce priorities of normal jobs which are submitted by the same user without the --voms option.

There are a few kinds of jobs which have higher priorities, such as merge jobs (5000) and ganga-robot jobs (4000), since they have to be processed quickly.

How Panda decides on a submission site

The brokerage chooses one of the candidate sites using

  • input dataset locations
  • the number of jobs in activated/defined/running state (site occupancy rate)
  • the average number of CPUs per worker node at each site
  • the number of active or available worker nodes
  • pilot rate for last 3 hours. If no pilots, the site is skipped
  • available disk space in SE
  • Atlas release/cache matching
  • site status

The weight for each site is computed from the following quantities:

  • W … Weight at the site

  • G … The number of available worker nodes which have sent getJob requests for last 3 hours

  • U … The number of active worker nodes which have sent updateJob requests for last 3 hours

  • R … The maximum number of running jobs in last 24 hours

  • D … The number of defined jobs with higher priorities than that of the submitter

  • A … The number of activated jobs with higher priorities than that of the submitter

  • P … Preferential weight to use beyond-pledge (schedconfig.availableCPU/pledgeCPU) if countryGroup is matched between the user and the site

You can see how the brokerage works in the PandaLogger. In general, the brokerage automatically sends jobs to a site where the input dataset is available and CPU/SW resources are optimal. However, the brokerage is skipped in the following cases:

  • the --site option is used
    • e.g. --site ANALY_LANCS
  • the --libDS option is used
    • libDS contains compiled InstallArea which has site dependence. So jobs are sent to the site where libDS exists. Note that --libDS saves compilation time but the brokerage is skipped so this option does not always improve performance
  • the output dataset is reused
    • Jobs are sent to the site where the output dataset exists in order to avoid splitting output files among multiple sites

Warning, important It is better to avoid those options unless you really need to, since jobs might be sent to a busy site.

Only BNL has short and long queues. If jobs in the short queue exceed the walltime limit, they are automatically sent to the long queue. If you want to exclude some sites explicitly, you can specify them as a comma separated list using --excludedSite. E.g., --excludedSite=ABC,XYZ . In this case, the brokerage ignores all sites which contain ABC or XYZ in their siteIDs.

This link displays the latest computations used to make the brokerage decisions.

Rebrokerage

Jobs are internally reassigned to another site at most 5 times, when

  • they are waiting for 24 hours. When HammerCloud sets sites to the test or offline mode jobs are reassigned 3 hours later
  • no sub-jobs with the same jobID (jobDefinitionID) ran in last 3 hours. Jobs are reassigned per jobID
  • the output is a dataset container instead of a dataset
  • the user didn't specify --site or --disableRebrokerage or --libDS or --workingGroup

The algorithm for site selection is the same as the normal brokerage described in the section above, except that in addition the size of the scratch space is checked. If a new site is found, the panda server generates a new build job and new run jobs with a new jobsetID and jobID, sends them to the new site, and then kills the old run jobs, provided the corresponding build job has already finished. When the build job has not run yet, the original jobs are sent to the new site with the original jobsetID, jobID and PandaID. When no new site is found, the jobs stay at the original site.

This link shows rebrokered jobs in the last 12 hours.

The latest DBRelease

The latest DBRelease is defined as follows:

  • The dataset name starts with ddo. and has a 4-digit version number
  • It contains only one file and the filename starts with DBRelease, which excludes reprocessing DBRs
  • More than 40 sites and more than 90% of online sites have the replica
  • All online T1s have the replica
  • It has the highest version number among all candidates which meet the above requirements


FAQ

Contact

We have one e-group and one JIRA. Please submit all your help requests to hn-atlas-dist-analysis-help@cern.ch, which is maintained by AtlasDAST. The Panda JIRA is for bug reporting. Please choose the appropriate forum according to your purpose, which allows us to provide better responses. For general info see Panda info and help. For TAG and Event Picking issues atlas-event-metadata@cern.ch will help.

Why did my jobs crash with "sh: line 1: XYZ Killed"?

sh: line 1: 13955 Killed                  athena.py -s ...

If you see something like the above message in the log file, your jobs were probably killed by the batch system due to huge memory consumption. pathena instantiates one sub-job per 20 AODs or per 10 EVNT/HITS/RDO/ESDs by default. You can reduce this by using --nFilesPerJob.
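
For example, here is a hedged sketch of resubmitting with fewer input files per subjob; the dataset and output names are placeholders:

import os

# Placeholders: replace the dataset names with your own.
cmd = ("pathena AnalysisSkeleton_topOptions.py "
       "--inDS mc.SomeSample.recon.AOD.v1 "
       "--outDS user.nickname.SomeSample.test.v2 "
       "--nFilesPerJob 5")   # default is 20 AOD files per subjob
os.system(cmd)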

pathena failed with "No files available at ['XYZ']"

The input files don't exist at the remote site. Submit a DDM subscription request for the files to be replicated. You need to register first, click on the "for registration go here" to get to the registration page.

How can I send jobs to the site which holds the largest number of input files?

Don't use --site, so that jobs are automatically sent to proper sites.

Why is my job pending in activated/defined state?

The Panda server manages all job information, and the worker nodes actually run the jobs. When a worker node is free,

  1. the worker sends a request to the server
  2. the server sends a job back to the worker
  3. the worker runs the job
  4. the job status is changed to 'running'

The Panda system uses a pull mechanism here: all sessions between the server and a worker are initiated by the worker. If your job is pending, this means all workers are busy or the back-end batch system is down. If your job is stuck for 1 day or more, ask Savannah.

How do I kill jobs?

$ pbook
>>> kill(JobID)

Use proper JobID. e.g., kill(5). See Bookkeeping.

My job got split into N sub-jobs, and only M sub-jobs failed. Is it possible to retry only the M sub-jobs?

$ pbook
>>> retry(JobID)

which retries failed sub-jobs in the job with JobID. Use proper JobID. e.g., retry(5). See Bookkeeping.

Why were my jobs killed by the Panda server? : what does 'upstream job failed' mean?

An analysis job is composed of one 'build' job and many 'run Athena' jobs (see What happens in Panda). If the build job fails, the downstream runAthena jobs get killed.

Are there any small samples to test my job on before running it on the whole dataset?

If '--nfiles N' is given, your job will run on only N files in the dataset.

$ pathena ... --inDS csc11.005056.PythiaPhotonJet2.recon.AOD.v11004107      --nfiles 2

What is the meaning of the 'lost heartbeat' error?

A worker node running a job sends heartbeat messages to the panda server every 30 min. That indicates the worker node (pilot process) is alive. If the panda server doesn't receive any heartbeats for 6 hours, the job gets killed.

jobDispatcherErrorCode   100
 jobDispatcherErrorDiag   lost heartbeat : 2006-05-30 22:32:00

This means the last heartbeat was received at 22:32:00 and the job was killed 6 hours later. The error happens mainly when the pilot died due to temporary trouble in the backend batch system or network. Simply retrying should succeed.

What is the meaning of the 'Looping job killed by pilot' error?

If a job doesn't update its output files for 2 hours, it gets killed. This protection is intended to kill dead-locked or infinite-looping jobs. If your job doesn't update output files very frequently (e.g., some heavy-ion jobs take several hours to process one event), you can relax this limit by using the --maxCpuCount option. However, sometimes even normal jobs get killed by this protection: when the storage element has a problem, jobs cannot copy input files to run Athena and of course cannot update output files. When you think that your job was killed due to an SE problem, you may report it to hn-atlas-dist-analysis-help@cern.ch. Shift people and the SE admin will then take care of it.

I want to process the whole dataset with N events per job. (integer N>0)

Use the following command:

$ pathena --split -1 --nEventsPerJob N  .....(other parameters)

or equally acceptable form:

$ pathena --nEventsPerJob N .....

In the case above, the number of submitted jobs will depend on the number of files in the given dataset, the number of events per file in the dataset, and the requested number N. Each job will have one file as input.
For example, if you choose N to be larger than the number of events in a file, then:

$ pathena --nEventsPerJob N .....

will submit a number of jobs equal to the number of files in the dataset, with one file per job.
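
Here is a rough illustration of that splitting arithmetic, assuming every input file holds the same number of events (the numbers are made up):

import math

def n_subjobs(n_files, events_per_file, n_events_per_job):
    # Each input file is covered by ceil(events_per_file / N) subjobs,
    # and each subjob reads at most one file.
    return n_files * max(1, math.ceil(events_per_file / n_events_per_job))

print(n_subjobs(10, 1000, 100))    # 100 subjobs: each file split into 10
print(n_subjobs(10, 1000, 5000))   # 10 subjobs: N > events per file, one job per file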

I want to run a transformation like Reco_tf.py with N events per job.

When running a transformation you should be aware that pathena doesn't append or subtract anything to/from the argument string which you specify in the --trf option, because option names can depend on releases and/or transformations and pathena doesn't know all of them (for example, maxEvents was --maxEvents in some old releases). All pathena does is replace the placeholders described in this section with actual values. Here is an example to run Reco_tf.py with 100 events per job.

$ pathena --trf "Reco_tf.py inputAODFile=%IN outputNTUP_SUSYFile=%OUT.NTUP.root skipEvents=%SKIPEVENTS maxEvents=100 ..." --inDS ... --outDS ... --nEventsPerFile=1000 --nEventsPerJob=100 ...
Note that you should check if the argument string is correct before trying the above example. Generally you can see all options of a transformation by giving an invalid option like
$ Reco_tf.py -x
usage: Reco_tf.py [-h] [--verbose]
                  [--loglevel {INFO,CRITICAL,WARNING,VERBOSE,ERROR,DEBUG,FATAL,CATASTROPHE}]
                  [--argJSON FILE] [--dumpargs] [--showGraph] [--showPath]
                  [--showSteps] [--dumpPickle FILE] [--dumpJSON FILE]
...
Common pitfalls are as follows:

  • When you don't specify maxEvents, each job processes all events in the input file
  • When you don't specify skipEvents, all jobs with the same input file process the same events.

In particular, note that maxEvents is not automatically appended even if you set --nEventsPerJob.
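
For the example above (--nEventsPerFile=1000 --nEventsPerJob=100), the placeholders are expected to expand roughly as sketched below; this is only an illustration of the per-subjob values, not pathena internals:

n_events_per_file = 1000
n_events_per_job = 100
# Each input file is covered by 10 subjobs; maxEvents stays fixed at 100
# while %SKIPEVENTS provides an increasing offset within the file.
for i in range(n_events_per_file // n_events_per_job):
    print("subjob %d: maxEvents=%d skipEvents=%d"
          % (i, n_events_per_job, i * n_events_per_job))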

I want to launch M jobs, each with N events per job

You can use the following command:

$ pathena --split M --nEventsPerJob N .....

Note: the options --nFiles ("split by files") and --nEventsPerJob ("split by events") cannot be defined simultaneously; pathena will exit with an error at startup. Please define only one or the other.

How do I merge the results of my pathena job?

See Merge ROOT files using prun.

pathena failed with "ImportError: No module named Client"

Some releases (e.g., 13.0.X) contain an obsolete pathena in the HiggsPhys package or similar. If pathena points to the obsolete one, it will fail with

Traceback (most recent call last):
  File "/.../prod/releases/13.0.40/AtlasAnalysis/13.0.40/InstallArea/share/bin/pathena", line 15, in ?
    import UserAnalysis.Client as Client
ImportError: No module named Client

Make sure pathena points to one which you installed locally; i.e.,

$ which pathena
$ ls -l `which pathena`

should give something like

~/myWork/InstallArea/share/bin/pathena
lrwxr-xr-x  1 tmaeno zp 102 Jun 26 16:21 ~/myWork/InstallArea/share/bin/pathena -> ~/myWork/PhysicsAnalysis/DistributedAnalysis/PandaTools/share/pathena

If this is wrong, check PATH

$ echo $PATH

If you are using csh/zsh, you need to execute 'rehash' to update the hash table in the shell after installing PandaTools.

$ rehash

Expected output file does not exist

Perhaps the output stream is defined somewhere in your jobOs, but nothing uses it. In this case, Athena doesn't produce the file. The solution is to modify your jobO or to use the --supStream option; e.g., --supStream hist1 will disable user.AhoBaka.TestDataSet1.hist1._00001.root.

How to make a group defined outDS

$ pathena --official --voms atlas:/atlas/groupName/Role=production --outDS group.groupName.[otherFields].dataType.Version ...

where groupName for SUSY is phys-susy, for instance. See the document ATL-GEN-INT-2007-001 for dataset naming convention. The group name needs to be officially approved and registered (see GroupsOnGrid). Note that you need to have the production role for the group to produce group-defined datasets. If not, please request it in the ATLAS VO registration page. For example, to produce datasets for Higgs working group,

$ pathena --official --voms atlas:/atlas/phys-higgs/Role=production --outDS group.phys-higgs...

If you submit jobs with the --voms option, those jobs are regarded as group jobs.

How to ignore Athena Fatal errors

See this hypernews message as an example. Note that this may be required only when you run official transformations.

Job state definitions in Panda (defined/assigned/waiting/holding etc.)

See the info under Monitoring above, and more detailed info from this link.

How can I delete my user datasets

Please refer to DQ2ClientsHowTo.

How can I ask for datasets to be replicated

You need to make a request in DaTRI for the dataset to be replicated. You need to register first; click on "for more information click here" to get to the registration page. To check the status of your request, please refer to DaTRI.

Why do my jobs fail with "Files is on tape" or "Transform input file not found" error

The pilot doesn't copy input files to the WNs when they are on tape, to avoid occupying a CPU slot while doing nothing. Normal jobs will fail with the "Files is on tape" error if all files are skipped, and transformation jobs will fail with "Transform input file not found" if some files are skipped, since transformations require all input files. ATLAS policy states that ESD/RDO/RAW datasets should be on tape. Users are encouraged to file a DDM subscription request for the dataset to be moved to disk; see more info about this request in the FAQ item above. When files are on tape the pilot skips them after sending prestage requests, and copies only files which have already been cached on disk. Jobs run on the files available on the WNs, i.e., they ignore missing files. The user may retry later when the files become available on disk. See additional info about skipped files.

Is it possible to run AthenaRootAccess with pathena

It is recommended to use prun for this purpose. You can find an example in this page.

pbook generates the error "ERROR : SQL error: unsupported file format"

If this happens only occasionally, the filesystem of the user's home directory might be unstable; for example, this kind of SQL error tends to happen when AFS has a problem. If the error happens permanently, the user's local database might be corrupted. In this case, the solution is to delete the local database. It will be recreated the next time pathena runs.

$ rm $PANDA_CONFIG_ROOT/pandajob.db

Why might the local database get corrupted? The local database is managed by sqlite3. As mentioned in this link, OS/HW problems may corrupt database files, or there might be rare bugs in sqlite3 itself.

What is the meaning of 'Partial' in an email notification?

You may find 'Partial' in email notification, e.g.,

Summary of JobID : 603

Created : 2008-10-27 14:03:29 (UTC)
Ended   : 2008-10-27 14:20:26 (UTC)

Site    : ANALY_BNL_ATLAS_1

Total Number of Jobs : 3
           Succeeded : 2
           Partial   : 1
           Failed    : 0

When the pilot skips some files on tape and the job successfully runs on the remaining files, its final status is categorized as 'partially succeeded'. 'Partial' is the number of partially succeeded jobs. If all files are skipped, the job fails. The above notification means that two sub-jobs ran on all associated input files while one sub-job ran on only part of its input files. Skipped files have status='skipped' in the panda monitor. pathena will use the skipped files if the user submits jobs with the same input/output datasets again. This capability allows users to process on-tape files incrementally and allows the storage system to use the disk cache efficiently.

What does "COOL exception caught: The database does not exist" or IOVDbSvc ERROR mean?

This error mostly means that your jobs require the latest conditions data, which is not available in the default database release installed by the release kit. For example, RecExCommission jobs normally require the latest conditions data, but many CERN lxplus users don't notice that because AFS builds implicitly use the latest conditions data buffered on AFS disks. However, all grid jobs run with release kits using the default database even if they run at CERN, and thus users sometimes see this kind of problem. Solutions could be to specify the latest version of the DB release with the --dbRelease option (see its usage in the options above) or to use Oracle RAC. Make sure you are allowed to use Oracle RAC when using the second solution. Find more info about DB Releases at AtlasDBReleases.

What does the "OFLCOND-SIM-00-00-06 does NOT exist" error mean?

As an example, the cause of the problem is the absence of the Conditions DB tag OFLCOND-SIM-00-00-06 (requested in your jobOptions) in the default DBRelease version 6.2.1 used in your s/w release 14.5.1. Generally, such problem is corrected by switching to the very latest DB Release version, find more info about DB Releases at AtlasDBReleases. Solution is to use --dbRelease option, for instance --dbRelease='ddo.000001.Atlas.Ideal.DBRelease.v060402:DBRelease-6.4.2.tar.gz'.

What does the "Exception caught: Connection on "ATLASDD" cannot be established" error mean?

Possible reasons:

  • The content of your dblookup.xml is incompatible with the grid environment. In this case the file should be removed from the run directory before submitting the job; see a discussion of this in this egroups link. Note: a user can have a local dblookup.xml in order to override the default database connection settings, provided he/she has some knowledge about the dblookup.xml structure and contents.
  • The job is supposed to access sqlite replica, which is not accessible due to wrong permissions, nfs problems or something else.
  • The job is supposed to access sqlite replica, which is missing.
  • The job is supposed to access Oracle and there are some technical problems with Oracle connections.
  • There can be other reasons as well.

For all database access problems

Please refer to AthenaDBAccess, CoolTroubles

Usage of ATLASUSERDISK vs ATLASLOCALGROUPDISK

pathena writes the outDS to the space token ATLASUSERDISK at the execution site; one can write the outDS to the space token ATLASLOCALGROUPDISK with the option "--spaceToken ATLASLOCALGROUPDISK". Please note that this option is not used for the US sites: there, user data stay in USERDISK and the deletion policy is different from other ATLAS sites, as explained on the StorageSetUp page. Users do not need to worry about the 30-day deletion limit for US sites for now.

User work directory exceeds the size limit when DBRelease is used

When a large DBRelease is used, the pilot sometimes fails with

!!FAILED!!1999!! User work directory
(/tmp/Panda_Pilot_671_1233518570/PandaJob_25004808_1233518571/workDir) too
large: 2285164 kB (must be < 2 GB)

The solution is to use --dbRunNumber in addition to --dbRelease. When this option is used, redundant files (typically ~1.5GB) are removed to save disk space when unpacking the DBRelease tarball on the WN. E.g.,

$ pathena --dbRunNumber 0091890 --dbRelease ddo.000001.Atlas.Ideal.DBRelease.v06030101:DBRelease-6.3.1.1.tar.gz ...

The --dbRunNumber option is available in 0.1.10 or higher. This option is deprecated and to be used with 2008 data only.

How do I blacklist sites against receiving my pathena submission

If your input dataset has some files corrupted or missing at certain sites, for instance, you may want to exclude these sites from your submission with --excludedSite=SITE1,SITE2.

"failed to access LFC" error in my pathena submission

If you see this error in your pathena submission, you may want to check if that site's LFC is accessible. For instance for ANALY_LYON site's LFC:

source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.sh
voms-proxy-init -voms atlas
dq2-ls -f mc08.105300.PythiaH130zz4l.digit.RDO.e352_s462_d150_tid040027
source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
export LFC_HOST=lfc-prod.in2p3.fr
lcg-lr -V atlas guid:8A8B2593-A3F4-DD11-9C1F-0019B9E7C925

This is the guid of the last file listed in the dataset. The last command returns:

srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/atlas/atlasmcdisk/mc08/RDO/mc08.105300.PythiaH130zz4l.digit
.RDO.e352_s462_d150_tid040027/RDO.040027._00969.pool.root.1

which shows no problem accessing the LFC.

How do I get the output dataset at my favorite destination automatically

When --destSE option is used, output files are automatically aggregated to a DDM endpoint. e.g.,

$ pathena --destSE LIP-LISBON_LOCALGROUPDISK ...

The first successful subjob creates a replication rule to the DDM endpoint in rucio with the user's permission. The user needs to have write permission at the DDM endpoint; otherwise, output files are not aggregated at the destination even though the subjobs are successful. In that case output files are left where they were produced, and the user is still able to manually replicate and/or download them later.

The name of the DDM endpoint can be found in AGIS. Generally LOCALGROUPDISK (long term storage) or SCRATCHDISK (short term storage) can be used. You can check permission in each DDM endpoint page. For example, if you go to LIP-LISBON_LOCALGROUPDISK you can see that only /atlas/pt users are allowed to write to the endpoint, so if you don't belong to the pt group the above example will fail and you will have to choose a proper endpoint.

How can I check if a release/production cache is installed at a site and how can I request an installation

https://atlas-install.roma1.infn.it/atlas_install/ (for US sites the site names are sitename_Install, for instance BNL_ATLAS_Install). The status of Panda installation jobs can be monitored from this Panda monitoring link. If you would like to request a release installation (e.g. for T0 releases), you should create a GGUS ticket for OSG and EGEE sites. Alternatively you can use the "Request an installation" link for EGEE sites only.

Issues with SL5 and SL4

While the migration from SL4 to SL5 should be transparent to users a few errors have appeared.

For users running SL5: If you have installed an SL5 kit, your jobs must be sent to an SL5 site. To ensure compatibility, SL4 kits should be installed with suitable compatibility libraries.

For users running SL4: Certain jobs will terminate with a strange segmentation fault; please see this hypernews message and try to run on an SL4 site.

Note: For lcg sites the operating system can be queried using lcg-infosites

pathena failed due to "ERROR : Could not parse jobOptions"

As an example the error message would be something like:

BTagging/BTagging_LoadTools.py", line 65, in <module>
    input_items = pf.extract_items(pool_file=
svcMgr.EventSelector.InputCollections[0])
IndexError: list index out of range
ERROR : Could not parse jobOptions

or

RecExCommon_topOptions.py", line 1053, in <module>
    from RecExCommon.InputFilePeeker import inputFileSummary
  File ".../RecExCommon/InputFilePeeker.py", line 29, in <module>
    RunTimeError," no existing input file detected. Stop here."
NameError: name 'RunTimeError' is not defined
ERROR : Could not parse jobOptions

inputFilePeeker tries to extract metadata from the input file in order to configure run parameters dynamically. Such job options don't work locally unless you define a valid input file in your jobO, i.e., Athena (not pathena) fails with the same error.

$ athena -i yourJobO.py
...
RecExCommon/RecExCommon_topOptions.py", line 1053, in <module>
    from RecExCommon.InputFilePeeker import inputFileSummary
  File "/afs/cern.ch/atlas/software/builds/AtlasReconstruction/15.2.0/InstallArea/python/RecExCommon/InputFilePeeker.py", line 29, in <module>
    RunTimeError," no existing input file detected. Stop here."
NameError: name 'RunTimeError' is not defined
athena>

Basically pathena doesn't work if Athena locally fails with the jobO.

The solution is to have something like

svcMgr.EventSelector.InputCollections=["/somedir/mc08.108160.AlpgenJimmyZtautauNp0VBFCut.recon.ESD.e414_s495_r635_tid070252/ESD.070252._000001.pool.root.1"]

or

PoolESDInput=["/somedir/mc08.108160.AlpgenJimmyZtautauNp0VBFCut.recon.ESD.e414_s495_r635_tid070252/ESD.070252._000001.pool.root.1"]

in your jobO, where the input file must be valid (i.e. accessible from your local computer). Note that those parameters (essentially EventSelector.InputCollections and AthenaCommonFlags.FilesInput) will automatically be overwritten with lists of input files on the remote WNs.

Question: Does the local file have to be from the very same inDS that will be run on the grid? The answer is no. In many cases, only the data type, such as AOD,ESD,RAW..., is important.

What does "ImportError: ...lfc.so: wrong ELF class" error mean?

This refers to a problem with your panda-client package setup. Please refer to PandaTools#Setup.

Where can I find list of clouds and their sites and CE

From Panda monitor cloud link or from DDM dashboard link or from ToA cache.

Why do I get a different number of subjobs at different sites while running on the same dataset?

Files were still being added to the dataset when the job was submitted. The content of a dataset can change before the dataset gets frozen. In more detail, when a dataset is being made by Tier 0 the sequence is as follows:

  1. The empty dataset is registered in DQ2.
  2. Files are produced and added to the dataset. The dataset is visible in DQ2 during this time but it is OPEN, and the number of files keeps changing. The analysis tools do not check whether the dataset the user requests is frozen or closed, so if you run an analysis job during this period you might get different numbers of jobs declared, because the number of available files is different.
  3. When the dataset is complete (no more files to add) it is frozen; at this point, and only at this point, Tier 0 raises an "AMI Ready" flag and the dataset is read into AMI (normally within at most 5 minutes of the flag being raised).

Therefore, there is a period of a few hours when a dataset may be visible in DQ2 but is NOT in AMI. During this period it is probably not complete; you CAN use it, but at your own risk.

How to retry failed subjobs at different sites

pbook.retry() retries failed subjobs at the same site, and it doesn't retry subjobs if the buildJob failed. You may want to resubmit them to different sites, e.g., to avoid a site-specific problem. In order to send new subjobs to other sites, their job specification needs to be re-created with new site parameters. The easiest way is to run pathena/prun again with the same input/output datasets (or dataset containers). In this case, the new job will run only on the failed files instead of all files in the input dataset, and the output files will be appended to the output dataset container. If you are using an output container (the default since 0.2.73) and specify --excludedSite, the new subjobs will go to other sites.

How to use external packages with pathena

Use the gluePackages option of pathena. See some user examples in the DAST list here.

What to do if the "voms-proxy-init -voms atlas" command is stuck

See this page from DQ2ClientsTroubleshooting.

The voms-proxy-init command accesses one of the 3 VOMS servers located at BNL or CERN, chosen randomly. If the BNL server is chosen and is down, the command gets stuck; upon retry it works as soon as CERN is reached. The DDM team has reported this to the gLite developers, since the command should fail over and try another VOMS server. In the meantime you can force it to use the VOMS servers at CERN. The choice of VOMS server is controlled by a (site-wide) 'vomses' file, but you can force a particular server to be used by creating your own 'vomses' file and passing it with the '-vomses' flag to voms-proxy-init. Create a file called e.g. '~/myvomses' containing the following two lines:

 "atlas" "voms.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch" "atlas"
 "atlas" "lcg-voms.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch" "atlas"
and then:
voms-proxy-init  -vomses ~/myvomses -voms atlas
will force use of the voms servers at CERN.

What does "ignore replicas of DATASET_NAME at ANALY_XYZ due to archived=ToBeDeleted or short lifetime < 7days" mean?

Once the archived flag of a dataset replica is set to ToBeDeleted, or the replica has a short lifetime, it will soon be deleted by the DDM deletion service. In this case the brokerage skips the site, since the replica may be deleted while the job is running.

Duplicated files when old output dataset (container) is reused

When you submit jobs twice with the same input dataset and output dataset, the second job runs only on files which were unused or failed in the first job. However, if the first job is older than 30 days this machinery doesn't work properly and there will be duplicated files in the output dataset, as explained in this ticket. This problem is going to be addressed in the new panda system, but for now you should avoid reusing very old output dataset (container).


Contact Email Address: hn-atlas-dist-analysis-help@cern.ch


