PandaShiftGuide

Introduction

This page describes the instructions for Panda shifters. Please also refer to the ADCoS wiki page.

Send an e-mail to Panda shift team

Panda shift team

Shift calendar

The usual ATLAS password is used for modification of the shift calendar.

Shift instructions

Obsolete information from the previous version, r82, has been removed from this page; if you need to see that information, please refer to r82:

  • We are no longer running pilot jobs from host atlas002.uta.edu or from the local submitter at BNL; all pilots have been switched to autopilots that run from BNL machines.
  • We are also no longer running Eowyn/ExtIF. Bamboo now loads the Panda database with jobs from the production database at CERN.

Below we list the other items from the r82 version of the wiki page that are still relevant.

  • Check from the production dashboard whether the Panda server is still running. If not, look at the log files for a possible reason, let Tadashi and the shift team know, and restart the Panda server following the instructions given below.
  • Check whether jobs are piling up in the assigned, waiting, or transferring states. If they pile up in the assigned state, check the status of the Panda subscriptions for that site. If they pile up in the waiting state, check that the input files for those jobs are available at BNL. If they pile up in the transferring state, check the status of DQ2 transfers to BNL. See the job-state definitions in Panda below for more details about the job flow in Panda.
  • Check whether we have jobs in the defined table. If not, check the status of Eowyn.
  • Check the queues monitor/MonALISA to see whether jobs are running and queued normally at the sites (in case of problems, let the system admins know). The queues monitor/MonALISA for each site is linked from the Panda monitoring page.

Looking at job failures

  • Look at the Panda monitoring page for the error summaries of the last 1, 3, 7, and 30 days.
  • The /atlas-grid/log-file-debug area on host atlas002.uta.edu can be used to retrieve the tarball of the work directory of jobs in the failed table (a shell sketch follows this list).
  • Two wiki pages are already available: PandaFindLogFiles, on retrieving the tarball of the work directory of failed jobs, and PandaErrorCodes, on the error codes and descriptions of the errors seen in Panda so far.
  • Look at John Kennedy's monitoring page for statistics for the last 24 hours. The error codes and descriptions of the ATLAS production system can be found in AtlasProdSysErrorCodes. For the Detailed Executor info, type bamboo@lxmrrb5310.cern.ch in the Executor ID field, then click on submit.
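
A minimal shell sketch of the tarball lookup mentioned above, to be run on atlas002.uta.edu. The assumption that the tarball name contains the PandaID is hypothetical; see PandaFindLogFiles for the authoritative instructions.

PANDAID=12345                      # example PandaID, replace with a real one
LOGDIR=/atlas-grid/log-file-debug  # log area on atlas002.uta.edu mentioned above
find "$LOGDIR" -name "*${PANDAID}*"   # locate candidate tarballs (naming scheme is an assumption)
# once located, list the contents with e.g.:  tar -tzf <path-to-tarball> | head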

To restart Panda monitor and logger

The Panda monitor currently (June 2008) runs on three servers at BNL: gridui05, 6, and 7. If the monitor is down or acting strangely on any of these machines, log in as sm and do

./restart-monitor

If the logger shows problems, restart the monitor on gridui05. The monitor's Apache instance there also hosts the logger.

If the code requires updating before restart, do

cd monitor/panda
cvs update -d
before restarting. (On gridui01 and gridui03, be sure you have an AFS token when you do this.)
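
A minimal sketch of the full update-and-restart sequence, run as sm on the affected machine; the location of the monitor checkout and of restart-monitor under the sm home directory is assumed from the commands above.

cd ~/monitor/panda   # monitor code checkout (assumed to live under the sm home directory)
cvs update -d        # pull the latest code; needs an AFS token on gridui01 and gridui03
cd ~
./restart-monitor    # restart the monitor as described above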

To restart Panda server

Log in as sm on gridui05.usatlas.bnl.gov, and on gridui06 and gridui07 as well (instances of the production server run on all three).

Then, to restart the production server:

/home/sm/prod/httpd/bin/apachectl stop
/home/sm/prod/httpd/bin/apachectl start

You can find log files under /home/sm/prod/httpd/logs.
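
A minimal sketch for restarting the production server on all three machines in one go; it assumes the sm account can ssh to each host non-interactively, which may not be the case, in which case log in to each host and run the two apachectl commands above by hand.

for host in gridui05 gridui06 gridui07; do
  ssh sm@${host}.usatlas.bnl.gov \
    '/home/sm/prod/httpd/bin/apachectl stop && /home/sm/prod/httpd/bin/apachectl start'
done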

Using voms-proxy-init

Instead of the grid-proxy-init command you can use voms-proxy-init. This allows using roles, which in turn gives higher priority to production jobs. The command is:

voms-proxy-init -voms atlas:/atlas/usatlas/Role=production -hours 300

To check the status of the proxy and roles, use the following command:

voms-proxy-info -all

You should see something like this:

issuer    : /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551
identity  : /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551
type      : proxy
strength  : 512 bits
path      : /tmp/x509up_u635
timeleft  : 278:45:46
=== VO atlas extension information ===
VO        : atlas
subject   : /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551
issuer    : /DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov
attribute : /atlas/usatlas/Role=production/Capability=NULL
attribute : /atlas/Role=NULL/Capability=NULL
attribute : /atlas/usatlas/Role=NULL/Capability=NULL
attribute : /atlas/lcg1/Role=NULL/Capability=NULL
timeleft  : 2:45:46

On sites which support VOMS/GUMS services (BNL and UC at the moment) you can check whether Role=production works properly. In that case you should be mapped to the "usatlas1" account, for example:

globus-job-run tier2-osg.uchicago.edu /bin/pwd 
/home/usatlas1

Otherwise (with grid-proxy-init, for instance) you will be mapped to "usatlas3" as an ordinary ATLAS user (or maybe to some other local account).

Please note that a healthy proxy is required on all six machines: gridui03, 7, 9 (which run the autopilots) and gridui05, 6, 7 (which run the Panda server and monitor). gridui01 is now just an alias; the former gridui01 is now a development machine. The new servers (gridui05 and up) require that each person needing access to sm be specifically authorized. Please check with John/Dantong whether you have access to these machines.
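
A minimal sketch for checking the remaining proxy lifetime on the machines listed above; it assumes you can ssh to each of them, and uses only the standard voms-proxy-info command.

for host in gridui03 gridui05 gridui06 gridui07 gridui09; do
  echo "== ${host} =="
  ssh ${host}.usatlas.bnl.gov 'voms-proxy-info -timeleft' 2>/dev/null \
    || echo "no valid proxy or no access on ${host}"
done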

Controlling Pilot Submission (from Torre)

The Autopilot pilot submitter and pilots themselves are sensitive to the online/offline setting of a queue in the schedconfig database. Pilots are not sent to and do not become active on sites that are offline. The online/offline status of queues is controlled via URL, so use curl as follows:

curl -sS 'http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?tpmes=CMD&queue=NAME'

where CMD can be:

  • setonline
  • settest
  • setoffline
  • setmanual
  • setauto

and NAME can be either the AutoPilot queue name or the Panda site name.

Only queues which are configured for 'manual' control can be set online/test/offline via the URL; the rest are auto-set by the AutoPilot DB loader. The setmanual command will set any queue to manual mode so it can subsequently be controlled by URL (and the AutoPilot loader program pilotController.py will leave its status alone). setauto will remove manual control.

e.g.

curl -sS  'http://...?tpmes=setmanual&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'

curl -sS  'http://...?tpmes=setoffline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'

curl -sS  'http://...?tpmes=settest&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'

curl -sS 'http://...?tpmes=setonline&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'

curl -sS  'http://...?tpmes=setauto&queue=Taiwan-LCG2-lcg00125-atlas-lcgpbs'

The 'Clouds' link at the top of the monitor now gives a page that can be used to track all this. It shows the queues by cloud and site, with their AutoPilot and Panda names, their status (OK/NOTOK will be phased out as we move to AutoPilot, in favour of a uniform online/offline), and the manual/auto setting.

If a queue is set for manual control, the queue depth ('nqueue') can now be controlled with a curl command as well. For example:

 curl 'http://...?tpmes=setnqueue&nqueue=20&queue=NoSuchQueue'

will set the queue depth to 20 for the AutoPilot queue named "NoSuchQueue".

Any Panda server will work with the curl commands above, i.e. you can use pandamon as well as gridui05, 6, 7.
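
A small convenience wrapper around the queue-control URL shown above; the function names and the PANDAMON variable are our own (not part of any Panda tool), and the server can be pandamon or any of gridui05, 6, 7 as noted.

PANDAMON=http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query

set_queue() {   # usage: set_queue <setonline|settest|setoffline|setmanual|setauto> <queue-name>
  curl -sS "${PANDAMON}?tpmes=$1&queue=$2"
}

set_nqueue() {  # usage: set_nqueue <depth> <queue-name>  (queue must be under manual control)
  curl -sS "${PANDAMON}?tpmes=setnqueue&nqueue=$1&queue=$2"
}

# example: put a queue under manual control and take it offline
set_queue setmanual  Taiwan-LCG2-lcg00125-atlas-lcgpbs
set_queue setoffline Taiwan-LCG2-lcg00125-atlas-lcgpbs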

Site administrators are free to set their site offline or to test mode to run test jobs. ADC shifters should be notified. ADC shifters should be the only ones deciding when to set a site online.

Procedure to "set a site online"

If the site or queue was set "offline", then before bringing this site/queue back "online" a shifter should do the following (a command-line sketch of the sequence is given after the list):

  • set the site/queue to "test" mode with the corresponding "settest" curl command.
  • submit a number of test jobs (testEvgen14.py and testSimulReco14.py for instance) as described in PandaTestJob
  • check whether these jobs completed OK (if not, submit the corresponding RT or GGUS ticket), e.g. look at http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?job=*&type=test
  • if the jobs finished OK, set the site/queue "online" with the corresponding "setonline" curl command.
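
A minimal shell sketch of this sequence, using the curl queue-control commands from the previous section; the queue name is an example only, and steps 2 and 3 are done by hand as described in PandaTestJob.

QUEUE=Taiwan-LCG2-lcg00125-atlas-lcgpbs   # example queue name
PANDAMON=http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query

curl -sS "${PANDAMON}?tpmes=settest&queue=${QUEUE}"     # step 1: put the site/queue in test mode
# step 2: submit test jobs (testEvgen14.py, testSimulReco14.py) as described in PandaTestJob
# step 3: check the test jobs in the monitor, e.g.
#   http://gridui07.usatlas.bnl.gov:25880/server/pandamon/query?job=*&type=test
curl -sS "${PANDAMON}?tpmes=setonline&queue=${QUEUE}"   # step 4: only if the test jobs finished OK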

Panda plots and other tools

Look here for production plots, here for site statistics, here for statistics tables, or here for analysis information.

All this can be had from the PanDA main page.

Error reporting

How to send test pilots to a new site

  • Once test jobs are sent as explained in PandaTestJob, pilots can pick the test jobs up through the production server, because the production and development servers use the same DB. The pilots running on the submit hosts will therefore pick up the test jobs; there is no need to start a separate pusher pointing to the development server. The test jobs won't be counted as production in the monitor.

How to kill production jobs

Be careful: with the production role you can kill any jobs. How to kill jobs is described below.

First, log on to lxplus. 'git clone' is required only the first time.

$ git clone git://github.com/PanDAWMS/panda-server.git
Make a grid proxy with the production role.
$ voms-proxy-init -voms atlas:/atlas/Role=production
Set PYTHONPATH.
$ cd panda-server/pandaserver
$ export PYTHONPATH=$PWD:$PYTHONPATH
$ cd test
Then
$ python killJob.py -9 12345
to kill PandaID=12345
$ python killJob.py -9 12346 12350
to kill all jobs with PandaIDs between 12346 and 12350.

How to use debug mode

You can use debug mode to peek at the standard output of a running job if you are the owner of the job or have the production role for ATLAS and/or for the working group. Set up the panda-server package as described in the section above. Then
$ python setDebugMode.py (--on|--off) [PandaID]
You need to generate a proxy with the group production role if you want to use debug mode for a group job.
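
For example, following the syntax above (PandaID 12345 is only an illustration):

$ python setDebugMode.py --on 12345    # enable debug mode (peek at stdout) for PandaID 12345
$ python setDebugMode.py --off 12345   # disable debug mode for the same job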

How to kill user jobs with the ATLAS production role

If you have the ATLAS production role, you can kill any user jobs. Set up the panda-server package as described in the section above, and generate a proxy with the production role.

$ voms-proxy-init -voms atlas:/atlas/Role=production
Then
$ python killJob.py --killUserJobs 12345
to kill PandaID=12345
$ python killJob.py --killUserJobs 12346 12350
to kill jobs with PandaIDs between 12346 and 12350.

How to send test jobs, how to kill and reassign jobs

See PandaTestJob.

Job-state definitions in Panda

There are 16 values in Panda describing the different possible states of jobs, corresponding to the jobStatus column in PandaDB (jobsDefined4, jobsWaiting4, jobsActive4, and jobsArchived4 tables):

  • pending : job-record is injected to PandaDB by JEDI
  • defined : kicked to start by bamboo or JEDI
  • assigned : dispatchDBlock is subscribed to site to transfer input files to T2 or to prestage input files from T1 TAPE
  • waiting : input files or software are not ready
  • activated : waiting for pilot requests
  • sent : sent to a pilot
  • starting : the pilot is starting the job on a worker node. For push mode - the job was retrieved by the pilot submitter, but did not start on the worker node yet.
  • running : running on a worker node
  • holding : adding output files to datasets, or waiting for job recovery
  • transferring : output files are moving to the final destination
  • finished : completed successfully
  • failed : failed due to errors
  • cancelled : manually killed
  • merging : output files are being merged by merge jobs
  • throttled : throttled to regulate WAN data access
  • closed : terminated by the system before completing the allocated workload. E.g., killed to be reassigned to another site

Note: The pilot is internally using an intermediary "starting" state after the job download, prior to job execution when it sets the "running" state. It can happen that the pilot fails to send the "running" state to the server, due to network problems, which leads to a seemingly hanging "starting" state. The job should finish as normal unless there was a real problem at the beginning of the job or the network problems are not temporary, which in turn leads to a lost heartbeat error.

Normal sequence of job-states:

 pending -> defined -> assigned -> activated -> sent -> running -> holding -> transferring -> finished/failed

If input files are not available:

 defined -> waiting
then, when files are ready
  -> assigned -> activated

  • pending -> defined : triggered by JEDI
  • defined -> assigned/waiting : automatic
  • assigned -> activated : received a callback for the dispatchDBlock. If jobs don't have input files or all input files are already available, the jobs get activated without a callback.
  • activated -> sent : sent the job to a pilot
  • sent -> running : the pilot received the job
  • waiting -> assigned : received a callback for the destinationDBlock of upstream jobs
  • running -> holding : received the final status report from the pilot
  • holding -> transferring : added the output files to destinationDBlocks
  • transferring -> finished/failed : received callbacks for the destinationDBlocks

For example,

2006-01-27 16:17:25,514 panda.log.broker: DEBUG
UC_ATLAS_MWT2 assigned:22 activated:2 running:18 nPilots:41.0
UTA-DPCC assigned:167 activated:0 running:1 nPilots:55.0
BU_ATLAS_Tier2 assigned:0 activated:100 running:15 nPilots:59.0
BNL_ATLAS_2 assigned:222 activated:40 running:19 nPilots:158.0
BNL_ATLAS_1 assigned:263 activated:63 running:7 nPilots:157.0

If there are few activated jobs while many jobs are assigned, it is an indication that DDM is not working properly.
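
A minimal sketch, assuming broker log lines in the format shown above ("SITE assigned:N activated:M running:K nPilots:P"), that flags sites with many assigned but few activated jobs (the DDM symptom just described); the log file name and the thresholds are placeholders.

awk '/assigned:[0-9]+ activated:[0-9]+/ {
       site = $1
       split($2, a, ":"); assigned  = a[2] + 0
       split($3, b, ":"); activated = b[2] + 0
       if (assigned > 100 && activated < 10)    # example thresholds
         printf "%s: assigned=%d activated=%d -> check DDM/subscriptions\n", site, assigned, activated
     }' broker.log   # placeholder file name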

Nagios Panda page

View FTS (Grid File Transfer Service) transfer logs from FTS monitor

BNL RACF Operators Information

At BNL there are two RACF operators, Kevin Casella and Enrique Garcia. Usually one of them works until midnight. If the person on shift discovers an emergency issue with, for example, a BNL critical server (such as the gridui0x machines), the operators are available to attempt a simple fix (such as restarting or rebooting a machine) or to place a call to an expert.

RACF operators' contact information:

RACF CELL phone: 631-487-6780.

E-mail: kac@bnl.gov (Kevin), garcia@bnl.gov (Enrique)


Major updates:
-- NurcanOzturk - 28 Dec 2005 -- NurcanOzturk - 20 Jan 2006 -- YuriSmirnov - 23 Jan 2006 -- NurcanOzturk - 06 Feb 2006 -- YuriSmirnov - 06 Feb 2006 -- NurcanOzturk - 09 Feb 2006 -- TomaszWlodek - 09 Feb 2006 -- YuriSmirnov - 10 Feb 2006 -- TomaszWlodek - 31 Mar 2006 -- YuriSmirnov - 15 Apr 2006 -- TomaszWlodek - 24 Apr 2006 -- NurcanOzturk - 28 Apr 2006 -- NurcanOzturk - 21 Jun 2006 -- NurcanOzturk - 25 Jul 2006 -- NurcanOzturk - 05 Feb 2007 -- MarcoMambelli - 01 Mar 2007 -- NurcanOzturk - 12 Apr 2007 -- NurcanOzturk - 28 Jun 2007 -- WenshengDeng - 05 Feb 2008 -- YuriSmirnov - 14 Feb 2008 -- YuriSmirnov - 27 Jun 2008 -- NurcanOzturk - 01 Jul 2008 -- YuriSmirnov - 21 Nov 2008 -- MarcoMambelli - 25 Nov 2008



Responsible: -- YuriSmirnov

-- NurcanOzturk - 01 Jul 2008
