-- JuliaAndreeva - 03-Jun-2010

We need to collect information for the four job monitoring data categories.
Below I have tried to find the correspondence between the Dashboard DB and Panda DB structures, taking into account only the data needed for Dashboard job monitoring purposes and, correspondingly, for building the SQL queries run by the Dashboard collector.
Task Table and other tables to which Task table points via foreign keys
| DashboardField | DashboardTable | Panda field | Panda table | Comments/questions |
| TaskMonitorId | TASK | taskname+taskid for production, produsername+jobdefinitionid for analysis | cloudtasks for production, jobtable for analysis | For production the task description is kept in the production table and taskid comes from there. The name allocated to the task in Dashboard would have the structure panda_ |
| UserId (foreign key to USERS table) | TASK | produsername | jobtable | We take produsername and produserid (which contains the certificate subject) from Panda. They define a unique record in the Dashboard USERS table; the task table contains a foreign key to it |
| TaskCreatedTimeStamp | TASK | - | - | Internal to Dashboard: set when the record is created in the Dashboard DB |
| TaskTypeId (foreign key to TASK_TYPE table) | TASK | processingtype | jobtable | In the case of pathena or Ganga the TaskType in Dashboard is 'analysis'; in other cases it is just the value recorded in processingtype. The value itself is recorded in the TASK_TYPE table; the task keeps a foreign key to it |
| NeventsPerJob | TASK | - | - | Not known in Panda |
| ApplicationId (foreign key to APPLICATION table) | TASK | atlasrelease | jobtable | |
| ApplExecId (foreign key to APPL_EXEC table) | TASK | transformation | jobtable | |
| InputCollectionId (foreign key to InputCollection table) | TASK | prodDBlock | jobtable | |
| OutputCollectionId (foreign key to OutputCollection table) | TASK | destinationblock | jobtable | |
| DefaultSchedulerId (foreign key to SCHEDULER table) | TASK | - | - | For panda jobs always 'panda' |
| SubmissionToolId (foreign key to SUBMISSION_TOOL table) | TASK | processingtype (pathena or Ganga) in the case of analysis | ? | Have to clarify whether there are different possibilities for production |
| SubToolVerId (foreign key to SUBMISSION_TOOL_VER table) | TASK | ? | ? | I think I forgot to ask about this; have to clarify |
| SubmissionUIId (foreign key to SUBMISSION_UI table) | TASK | creationhost | jobtable | Where the task submission was initiated from |
| JobProcessingTypeId | TASK | - | - | Not really used in Dashboard for the time being; the idea was that for different types of jobs there could be an additional set of parameters to follow |
| TagetCE | TASK | destinationse | jobtable | Does not make much sense for PANDA, since the site is resolved by PANDA before submission; in Dashboard it was used to resolve the site as early as possible, even before the job is scheduled at the site |
| SeId (foreign key to SE table) | TASK | destinationse | jobtable | |
| SubmissionType (foreign key to SUBMISSION_TYPE table) | TASK | - | - | For panda jobs always 'panda_server' |
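The TaskMonitorId row above can be sketched as a small helper. This is only an illustration: the `panda_` prefix comes from the comment in the table, but the `:` separator, the dict-based row representation, and the use of processingtype to tell analysis from production are assumptions, not the final naming scheme.

```python
def task_monitor_id(row):
    """Build a Dashboard TaskMonitorId from a Panda job row (a dict).

    For analysis the id combines produsername and jobdefinitionid;
    for production it combines taskname and taskid (coming from the
    production DB / cloudtasks).  Prefix and separator are assumptions.
    """
    if row.get('processingtype') in ('pathena', 'ganga'):  # analysis
        return 'panda_%s:%s' % (row['produsername'], row['jobdefinitionid'])
    # production
    return 'panda_%s:%s' % (row['taskname'], row['taskid'])
```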
Job Table and other tables to which Job table points via foreign keys
| DashboardField | DashboardTable | Panda field | Panda table | Comments/questions |
| SchedulerJobId | JOB | pandaid | jobtable | Panda developers told us that pandaid is unique and not recyclable, so panda_ should be good enough for identifying a panda GRID job |
| TaskJobId | JOB | ? | ? | See the comments below; it is a bit complex for the moment and not yet clear how to solve |
| TaskId (foreign key to the TASK table) | JOB | | | |
| LocalBatchJobId | JOB | ? | ? | Did not clarify with Tadashi whether it is available in PANDA or not; probably not |
| VOJobId | JOB | pandaid | jobtable | |
| NextJobId | JOB | ? | ? | See the comments below; it is currently unclear how to resolve the job resubmission history |
| RBId (foreign key to the RB table) | JOB | - | - | Always 'panda' for panda jobs |
| EventRange | JOB | ? | ? | See the comments below; again it relates to the job resubmission history |
| SubNodeIp | JOB | schedulerid | jobtable | Location of the panda server |
| LongCEId | JOB | ? | ? | Looks like we only have SHORT_CE and SITE in Panda |
| ShortCEId (foreign key to SHORT_CE table) | JOB | computingelement | jobtable | |
| SiteId (foreign key to the SITE table) | JOB | computingsite | jobtable | |
| DboardStatusId | JOB | calculated from jobstatus in the panda jobtable | jobtable | |
| DboardJobEndId | JOB | calculated from panda jobstatus and transexitcode | jobtable | Means the application exit code; only the transformation code should be taken into account to understand whether the application ran properly |
| DboardGridEndId | JOB | calculated from jobstatus and all possible exit codes in panda | jobtable | Means the GRID exit code; for panda it would mean job processing success, not related to application success as such, that is, success of the panda workload management system |
| DboardStatusEnterTimeStamp | JOB | modificationtime, only when it triggered a status change | jobtable | |
| DboardFirtsInfoTimeStamp | JOB | creationtime | jobtable | |
| DboardLatestInfoTimeStamp | JOB | modificationtime | jobtable | |
| GridStatusId | JOB | jobstatus | jobtable | Redundant in the case of panda, unless we instrument pilot jobs |
| GridStatusReasonId | JOB | - | - | Might make no sense here |
| GridEndStatusTimeStamp | JOB | modificationtime when the job is turned into finished | jobtable | Redundant unless we instrument pilots |
| JobExecExitCode | JOB | transexitcode | jobtable | Exit code of the user executable |
| JobExecExitReasonId | JOB | transexitdiag | jobtable | Reason of the user executable failure |
| JobExecExitTimeStamp | JOB | endtime | jobtable | |
| CreatedTimeStamp | JOB | - | - | Internal to Dashboard: set when the record is created in the Dashboard DB |
| SubmittedTimeStamp | JOB | creationtime | jobtable | |
| ScheduledTimeStamp | JOB | - | - | Does not make sense for PANDA; the job is not pending at the site |
| StartedRunningTimeStamp | JOB | starttime | jobtable | |
| FinishedTimeStamp | JOB | modificationtime when the job is turned into finished | jobtable | |
| SchedulerId (foreign key to the SCHEDULER table) | JOB | - | - | Always 'panda' for panda jobs |
| TimeOutFlag | JOB | ? | ? | Flag which marks jobs that never reported back after they started running and are therefore timed out by Dashboard; unclear whether it makes sense for panda jobs |
| ExeTime | JOB | ? | ? | Wall-clock time of the user executable; not clear whether it is known in panda, have to clarify |
| ExeCpu | JOB | cpuconsumptiontime? | jobtable | Need to clarify whether it relates to the executable only or includes time spent on saving the output of the job |
| WrapWC | JOB | endtime - starttime | jobtable | Wall-clock time of the user job |
| WrapCPU | JOB | cpuconsumptiontime? | jobtable | Need to clarify whether it relates to the executable only or includes time spent on saving the output of the job |
| StOutWC | JOB | ? | ? | Whether the time spent on saving output is known in Panda needs to be clarified |
| JobType | JOB | - | - | Does not make sense for Panda |
| StageOutSE | JOB | destinationse | jobtable | |
| PilotFlag | JOB | - | - | Always the same for all jobs processed by panda pilots |
| InputSE | JOB | ? | ? | In panda it looks like the input SE and the stage-out SE are always the same; to clarify |
| ParentPilotId | JOB | pilotid | jobtable | Currently in panda this is the name of the parent pilot. We might need to create a pilot table and keep a foreign key to it in the JOB table |
| WNHostName | JOB | modificationhost | jobtable | |
| NEvProc | JOB | nevents | jobtable | Number of processed events |
| NEvReq | JOB | - | - | Number of events that had to be processed by the job; does not exist in panda |
| NFilesProc | JOB | ninputfiles | jobtable | |
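The DboardStatusId row above says the value is calculated from the Panda jobstatus. A minimal sketch of such a mapping, assuming the usual Panda state names; the Dashboard category names on the right are placeholders, not the real Dashboard status ids:

```python
# Hypothetical mapping from Panda jobstatus values to coarse Dashboard
# status categories.  The left-hand state names are an assumption about
# the Panda state machine; the right-hand names are placeholders.
PANDA_TO_DASHBOARD = {
    'defined':   'PENDING',
    'assigned':  'PENDING',
    'waiting':   'PENDING',
    'activated': 'PENDING',
    'sent':      'PENDING',
    'running':   'RUNNING',
    'holding':   'RUNNING',
    'finished':  'DONE',
    'failed':    'DONE',
    'cancelled': 'DONE',
}

def dashboard_status(jobstatus):
    """Collapse a Panda jobstatus into a Dashboard status category."""
    return PANDA_TO_DASHBOARD.get(jobstatus, 'UNKNOWN')
```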
Open questions
Resubmissions
Dashboard keeps track of the resubmission history of every particular job. This feature is considered useful by analysis users and the analysis support team, since it allows various problems to be debugged. We would like to provide this functionality for panda jobs as well, if possible.
As far as we understand from Tadashi, Panda takes a slightly different approach to resubmissions compared to Dashboard and the WLM systems in CMS.
In the Dashboard scenario, in case of resubmission the task is always kept the same and just a new job is created. Dashboard keeps a reference to
the unique portion of work (whether in terms of input files, input events, random seeds, or a combination of those) which has to be processed by a given job of the task. If the same input is processed by non-unique jobs of the same task, it means that all those jobs are just resubmission attempts of some primary job.
As we understood from Tadashi, for jobs submitted via PANDA in ATLAS there are two possible scenarios.
1) The resubmitted jobs go to the same site as their predecessors. In this case PANDA creates a new task but keeps a reference to the previous task in the executionid field of the jobtable, and increases the attemptnumber of all jobs of this task by one.
The problem for Dashboard in recreating the resubmission history is that panda does not keep a reference to the predecessor of every single job in the job record. To find out that job B is the next attempt of job A, one needs to make sure that A and B have exactly the same set of input files and,
in some cases, in addition exactly the same values in the jobparams table, which is a CLOB and pretty expensive to compare.
It is possible to retrieve the info needed for recreating the resubmission history, but it would require pretty heavy queries on the DB and looks like a big overhead.
Related to this we have a request. If we understand how panda works correctly, it might just clone the jobs which need to be resubmitted, combining them in a single task.
At that point PANDA should know the pandaid of the previous attempts of all jobs in the set. Would it be possible to record the pandaid of the previous attempt
in the jobtable for analysis jobs? It would require a small change of the panda schema, adding one more field to the job table. We found out that the jobparameters field in the job table is always 'null' for analysis and suggested that this one could be used for the pandaid of the job's predecessor, but Tadashi thinks that it is messy and that it is better to create a new field.
2) In the second scenario the jobs are resubmitted to a different site, and then the resubmission as such happens outside the panda scope, at the UI level (Ganga or Pathena). In this case, as far as we understood, there is no way to recreate the resubmission history using just the info in the panda DB.
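For the first scenario, the "same set of input files" comparison could be sketched as follows. This is only an illustration of the grouping logic, not the heavy DB query itself; the job representation (dicts with an `inputfiles` frozenset) is an assumption.

```python
def group_resubmissions(jobs):
    """Group the jobs of one task by their input-file set.

    jobs: list of dicts with 'pandaid', 'creationtime' and 'inputfiles'
    (a frozenset of input LFNs).  Jobs sharing the same input set are
    assumed to be successive attempts of the same unit of work; within
    a group they are ordered by creation time, so the first element is
    the primary job and the rest are resubmissions.
    """
    groups = {}
    for job in jobs:
        groups.setdefault(job['inputfiles'], []).append(job)
    for attempts in groups.values():
        attempts.sort(key=lambda j: j['creationtime'])
    return groups
```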
Can the update be delayed?
Is the modification time stamp the time when the data is changed in the PANDA DB?
If we decide to query the PANDA DB every 5 minutes, can it happen that we would miss some status changes?
Should we ask every 5 minutes for the last 5-minute interval, or rather every 5 minutes for the last 10-minute interval?
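One possible answer to the last question, sketched in Python: query a 10-minute window every 5 minutes and deduplicate the overlap on the collector side. The window sizes and the row representation are assumptions to be tuned, not a decided design.

```python
def poll_window(now, interval=300, overlap=300):
    """Return the (start, end) query window, in epoch seconds.

    Querying interval + overlap seconds back (here a 10-minute window
    every 5 minutes) guards against updates committed with a slightly
    older MODIFICATIONTIME than the collector's clock.
    """
    return now - interval - overlap, now

def deduplicate(rows, seen):
    """Drop rows already handled in a previous overlapping window.

    'seen' maps pandaid -> last MODIFICATIONTIME processed; a row is
    considered new only if its modification time has advanced.
    """
    fresh = []
    for row in rows:
        if seen.get(row['pandaid']) != row['modificationtime']:
            seen[row['pandaid']] = row['modificationtime']
            fresh.append(row)
    return fresh
```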
Query details
Below is an example of the query we might need to run every few minutes. For a 5-minute interval on the integration instance it takes
~1.5 seconds, returning 8-12K jobs.
select PANDAID, GRID, PRODSOURCELABEL,
       to_char(MODIFICATIONTIME, 'YYYY-MM-DD HH24:MI:SS') MODIFICATIONTIME,
       JOBDEFINITIONID, SCHEDULERID,
       to_char(CREATIONTIME, 'YYYY-MM-DD HH24:MI:SS') CREATIONTIME,
       CREATIONHOST, MODIFICATIONHOST, ATLASRELEASE, TRANSFORMATION,
       PRODUSERID, ATTEMPTNR, JOBSTATUS,
       to_char(STARTTIME, 'YYYY-MM-DD HH24:MI:SS') STARTTIME,
       to_char(ENDTIME, 'YYYY-MM-DD HH24:MI:SS') ENDTIME,
       CPUCONSUMPTIONTIME, CPUCONSUMPTIONUNIT, TRANSEXITCODE,
       PILOTERRORCODE, EXEERRORCODE, EXEERRORDIAG, SUPERRORCODE, SUPERRORDIAG,
       DDMERRORCODE, DDMERRORDIAG, BROKERAGEERRORCODE, BROKERAGEERRORDIAG,
       JOBDISPATCHERERRORCODE, JOBDISPATCHERERRORDIAG,
       COMPUTINGSITE, COMPUTINGELEMENT, PRODDBLOCK, DESTINATIONDBLOCK,
       DESTINATIONSE, NEVENTS, CPUCONVERSION, JOBEXECUTIONID,
       PROCESSINGTYPE, PRODUSERNAME, NINPUTFILES
  from atlas_panda.JOBSACTIVE4
 where MODIFICATIONTIME >= to_date(:start_time, 'YYYY-MM-DD HH24:MI:SS')
   and MODIFICATIONTIME <  to_date(:end_time, 'YYYY-MM-DD HH24:MI:SS')
union ALL
select PANDAID, GRID, PRODSOURCELABEL,
       to_char(MODIFICATIONTIME, 'YYYY-MM-DD HH24:MI:SS') MODIFICATIONTIME,
       JOBDEFINITIONID, SCHEDULERID,
       to_char(CREATIONTIME, 'YYYY-MM-DD HH24:MI:SS') CREATIONTIME,
       CREATIONHOST, MODIFICATIONHOST, ATLASRELEASE, TRANSFORMATION,
       PRODUSERID, ATTEMPTNR, JOBSTATUS,
       to_char(STARTTIME, 'YYYY-MM-DD HH24:MI:SS') STARTTIME,
       to_char(ENDTIME, 'YYYY-MM-DD HH24:MI:SS') ENDTIME,
       CPUCONSUMPTIONTIME, CPUCONSUMPTIONUNIT, TRANSEXITCODE,
       PILOTERRORCODE, EXEERRORCODE, EXEERRORDIAG, SUPERRORCODE, SUPERRORDIAG,
       DDMERRORCODE, DDMERRORDIAG, BROKERAGEERRORCODE, BROKERAGEERRORDIAG,
       JOBDISPATCHERERRORCODE, JOBDISPATCHERERRORDIAG,
       COMPUTINGSITE, COMPUTINGELEMENT, PRODDBLOCK, DESTINATIONDBLOCK,
       DESTINATIONSE, NEVENTS, CPUCONVERSION, JOBEXECUTIONID,
       PROCESSINGTYPE, PRODUSERNAME, NINPUTFILES
  from atlas_panda.JOBSARCHIVED4
 where MODIFICATIONTIME >= to_date(:start_time, 'YYYY-MM-DD HH24:MI:SS')
   and MODIFICATIONTIME <  to_date(:end_time, 'YYYY-MM-DD HH24:MI:SS')
union ALL
select PANDAID, GRID, PRODSOURCELABEL,
       to_char(MODIFICATIONTIME, 'YYYY-MM-DD HH24:MI:SS') MODIFICATIONTIME,
       JOBDEFINITIONID, SCHEDULERID,
       to_char(CREATIONTIME, 'YYYY-MM-DD HH24:MI:SS') CREATIONTIME,
       CREATIONHOST, MODIFICATIONHOST, ATLASRELEASE, TRANSFORMATION,
       PRODUSERID, ATTEMPTNR, JOBSTATUS,
       to_char(STARTTIME, 'YYYY-MM-DD HH24:MI:SS') STARTTIME,
       to_char(ENDTIME, 'YYYY-MM-DD HH24:MI:SS') ENDTIME,
       CPUCONSUMPTIONTIME, CPUCONSUMPTIONUNIT, TRANSEXITCODE,
       PILOTERRORCODE, EXEERRORCODE, EXEERRORDIAG, SUPERRORCODE, SUPERRORDIAG,
       DDMERRORCODE, DDMERRORDIAG, BROKERAGEERRORCODE, BROKERAGEERRORDIAG,
       JOBDISPATCHERERRORCODE, JOBDISPATCHERERRORDIAG,
       COMPUTINGSITE, COMPUTINGELEMENT, PRODDBLOCK, DESTINATIONDBLOCK,
       DESTINATIONSE, NEVENTS, CPUCONVERSION, JOBEXECUTIONID,
       PROCESSINGTYPE, PRODUSERNAME, NINPUTFILES
  from atlas_panda.JOBSDEFINED4
 where MODIFICATIONTIME >= to_date(:start_time, 'YYYY-MM-DD HH24:MI:SS')
   and MODIFICATIONTIME <  to_date(:end_time, 'YYYY-MM-DD HH24:MI:SS')
union ALL
select PANDAID, GRID, PRODSOURCELABEL,
       to_char(MODIFICATIONTIME, 'YYYY-MM-DD HH24:MI:SS') MODIFICATIONTIME,
       JOBDEFINITIONID, SCHEDULERID,
       to_char(CREATIONTIME, 'YYYY-MM-DD HH24:MI:SS') CREATIONTIME,
       CREATIONHOST, MODIFICATIONHOST, ATLASRELEASE, TRANSFORMATION,
       PRODUSERID, ATTEMPTNR, JOBSTATUS,
       to_char(STARTTIME, 'YYYY-MM-DD HH24:MI:SS') STARTTIME,
       to_char(ENDTIME, 'YYYY-MM-DD HH24:MI:SS') ENDTIME,
       CPUCONSUMPTIONTIME, CPUCONSUMPTIONUNIT, TRANSEXITCODE,
       PILOTERRORCODE, EXEERRORCODE, EXEERRORDIAG, SUPERRORCODE, SUPERRORDIAG,
       DDMERRORCODE, DDMERRORDIAG, BROKERAGEERRORCODE, BROKERAGEERRORDIAG,
       JOBDISPATCHERERRORCODE, JOBDISPATCHERERRORDIAG,
       COMPUTINGSITE, COMPUTINGELEMENT, PRODDBLOCK, DESTINATIONDBLOCK,
       DESTINATIONSE, NEVENTS, CPUCONVERSION, JOBEXECUTIONID,
       PROCESSINGTYPE, PRODUSERNAME, NINPUTFILES
  from atlas_panda.JOBSWAITING4
 where MODIFICATIONTIME >= to_date(:start_time, 'YYYY-MM-DD HH24:MI:SS')
   and MODIFICATIONTIME <  to_date(:end_time, 'YYYY-MM-DD HH24:MI:SS')
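Since the four selects differ only in the table name, the collector could build the union query programmatically instead of maintaining the column list four times. A sketch, with the column list abbreviated for brevity:

```python
# Abbreviated column list; the real collector would use the full list
# from the query above.
COLUMNS = ("PANDAID, GRID, PRODSOURCELABEL, JOBSTATUS, "
           "to_char(MODIFICATIONTIME, 'YYYY-MM-DD HH24:MI:SS') MODIFICATIONTIME")
TABLES = ('JOBSACTIVE4', 'JOBSARCHIVED4', 'JOBSDEFINED4', 'JOBSWAITING4')

def build_query():
    """Assemble the 4-way UNION ALL query with :start_time/:end_time binds."""
    part = ("select {cols} from atlas_panda.{table}"
            " where MODIFICATIONTIME >= to_date(:start_time, 'YYYY-MM-DD HH24:MI:SS')"
            " and MODIFICATIONTIME < to_date(:end_time, 'YYYY-MM-DD HH24:MI:SS')")
    return "\nunion ALL\n".join(part.format(cols=COLUMNS, table=t)
                                for t in TABLES)
```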
Some more useful info for the record
The task for production is defined in the production DB; for analysis there is no description of a task as such, but the task is defined by the combination of userid and jobdefinitionid.
When a user submits a new job, a record is created in JOBSDEFINED. For production, the record can be moved from the JOBSDEFINED table either to JOBSACTIVE if the input already exists, or to JOBSWAITING if the input does not exist yet. For analysis, PATHENA has already resolved the name of the site where the job should go, so the job is moved directly to JOBSACTIVE. For historical reasons, for ~5 minutes the job record can exist both in JOBSDEFINED and in the other table where the job was moved to. The job record stays in the JOBSACTIVE table until the job is finished, whatever the result of the processing is. When it is finished, it is moved to the JOBSARCHIVED table. Every new resubmission results in creating a new record, with a new pandaid, in the JOBSDEFINED table.
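The lifecycle described above can be sketched as a small transition map. The table names are the ones used in the query; this is a simplification (the ~5 minute overlap between JOBSDEFINED and the next table is not modelled):

```python
# Which Panda table a job record can move to next, per the lifecycle
# described above.  A simplified sketch, not the authoritative state machine.
NEXT_TABLES = {
    'JOBSDEFINED4':  ['JOBSACTIVE4',     # input exists / analysis (pathena)
                      'JOBSWAITING4'],   # production input not yet available
    'JOBSWAITING4':  ['JOBSACTIVE4'],    # once the input appears
    'JOBSACTIVE4':   ['JOBSARCHIVED4'],  # job finished, whatever the result
    'JOBSARCHIVED4': [],                 # terminal
}

def reachable(table, target):
    """True if a record in 'table' can eventually end up in 'target'."""
    stack, seen = [table], set()
    while stack:
        t = stack.pop()
        if t == target:
            return True
        if t in seen:
            continue
        seen.add(t)
        stack.extend(NEXT_TABLES.get(t, []))
    return False
```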
Are there some standard APIs available to retrieve data from the PANDA DB?
There is an API to PANDA which we need to look into:
https://svnweb.cern.ch/trac/panda/browser/panda-server/current/pandaserver/userinterface/Client.py
Regarding error codes: according to Tadashi, a single job normally has one single error code, so we can keep it in the two fields we currently have in Dashboard, one for GRID and another for the application.
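Following the remark that a job normally carries a single non-zero error code, collapsing the many Panda error columns into the two Dashboard fields could look like this. Which columns count as GRID-side (as opposed to application-side) is an assumption to be confirmed:

```python
# Assumed GRID-side error columns from the Panda jobtable; whether
# exeerrorcode belongs here or on the application side is to be clarified.
GRID_ERROR_FIELDS = ('piloterrorcode', 'superrorcode', 'ddmerrorcode',
                     'brokerageerrorcode', 'jobdispatchererrorcode')

def split_error_codes(row):
    """Return (grid_error, application_error) for a Panda job row (a dict).

    The application error is transexitcode; the grid error is the first
    non-zero code among the grid-side columns, or 0 if none is set.
    """
    app = int(row.get('transexitcode') or 0)
    grid = 0
    for field in GRID_ERROR_FIELDS:
        code = int(row.get(field) or 0)
        if code:
            grid = code
            break
    return grid, app
```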
Dashboard PanDA collector
The collector is intended to retrieve job/task related information from the PANDA DB and store it in the ATLAS Dashboard job monitoring DB.
http://svnweb.cern.ch/world/wsvn/dashboard/trunk/arda.dashboard.service-collector-panda/#path_trunk_arda.dashboard.service-collector-panda_stout