Submitting forever issue

Validation of last patched version

Test: 5 task with 4411 jobs each, as in the SubSuccess test

Results

Task1

CRAB client

 >>>>>>>>> 3963 Jobs with Wrapper Exit Code : 0
         List of jobs: 1-95,97-192,194-411,413-580,582-673,675-680,682-689,691-704,706-715,717-735,737-738,740-805,807-813,815-820,822-838,840-1744,1746-1861,1863-1971,1973-2361,2363-2626,2628-2658,2660-2724,2726-2786,2788-3018,3020-3100,3102-3167,3169-3188,3190-3213,3215-3512,3514-3640,3642,3644-3680,3682-3689,3691-3703,3705-3722,3724-3729,3731-3787,3789-3955,3957-4001
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 1 Jobs with Wrapper Exit Code : 137
         List of jobs: 3101
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 19 Jobs with Wrapper Exit Code : 50117
         List of jobs: 96,193,412,581,674,681,690,705,716,736,806,814,821,839,1972,2362,2787,3168,3730
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   4411 Total Jobs
 >>>>>>>>> 411 Jobs Submitted
        List of jobs Submitted: 3513,4002-4411
 >>>>>>>>> 17 Jobs Aborted
        You can resubmit them specifying JOB numbers: crab -resubmit <List of jobs>
        List of jobs: 739,1745,1862,2627,2659,2725,3019,3189,3214,3641,3643,3681,3690,3704,3723,3788,3956
 >>>>>>>>> 3983 Jobs Done
        Jobs terminated: retrieve them with: crab -getoutput <List of jobs>
        List of jobs: 1-738,740-1744,1746-1861,1863-2626,2628-2658,2660-2724,2726-3018,3020-3188,3190-3213,3215-3512,3514-3640,3642,3644-3680,3682-3689,3691-3703,3705-3722,3724-3787,3789-3955,3957-4001

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_lj975z
        Server page: http://vocms21.cern.ch:8888/logginfo
        Your task name is: leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_lj975z

DB

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=114 group by state,process_status,status,wrapper_return_code,status_reason;
+------------+--------+----------------+--------+---------------------+-----------------------------+-----------------------+
| state      | closed | process_status | status | wrapper_return_code | status_reason               | count(process_status) |
+------------+--------+----------------+--------+---------------------+-----------------------------+-----------------------+
| Aborted    | Y      | processed      | A      |                NULL | hit job retry count (0)     |                    34 |
| SubSuccess | N      | created        | S      |                NULL |                             |                   394 |
| SubSuccess | N      | created        | S      |                NULL | hit job retry count (0)     |                     1 |
| SubSuccess | N      | created        | S      |                NULL | unavailable                 |                    16 |
| SubSuccess | Y      | processed      | A      |                NULL | hit job retry count (0)     |                     4 |
| Terminated | Y      | processed      | E      |                   0 | Job terminated successfully |                  3963 |
| Terminated | Y      | processed      | E      |                 137 | Warning: job exit code != 0 |                     1 |
| Terminated | Y      | processed      | E      |               50117 | Job terminated successfully |                    19 |
+------------+--------+----------------+--------+---------------------+-----------------------------+-----------------------+

Task2

CRAB client

crab:  ExitCodes Summary
 >>>>>>>>> 1977 Jobs with Wrapper Exit Code : 0
         List of jobs: 1-200,202-237,239-241,243-278,280-286,288-290,292-294,296-318,320-326,328-330,332-334,336-367,369-375,377-383,385-460,462-493,495-744,746-776,778-954,956-1065,1067-1235,1237-1253,1255-1819,1821-2000
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 7 Jobs with Wrapper Exit Code : 50117
         List of jobs: 295,745,955,1066,1236,1254,1820
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   4411 Total Jobs
 >>>>>>>>> 2411 Jobs Submitted
        List of jobs Submitted: 2001-4411
 >>>>>>>>> 13 Jobs Scheduled
        List of jobs Scheduled: 201,238,242,287,291,319,327,331,335,368,384,461,494
 >>>>>>>>> 3 Jobs Running
        List of jobs Running: 279,376,777
 >>>>>>>>> 1984 Jobs Done
        Jobs terminated: retrieve them with: crab -getoutput <List of jobs>
        List of jobs: 1-200,202-237,239-241,243-278,280-286,288-290,292-318,320-326,328-330,332-334,336-367,369-375,377-383,385-460,462-493,495-776,778-2000

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-9_0n4ts9
        Server page: http://vocms21.cern.ch:8888/logginfo
        Your task name is: leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-9_0n4ts9

DB

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=115 group by state,process_status,status,wrapper_return_code,status_reason;
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| state      | closed | process_status | status | wrapper_return_code | status_reason                        | count(process_status) |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| Aborted    | Y      | processed      | A      |                NULL | hit job retry count (0)              |                    20 |
| SubSuccess | N      | created        | S      |                NULL |                                      |                   935 |
| SubSuccess | N      | created        | S      |                NULL | Job successfully submitted to Globus |                  1051 |
| SubSuccess | N      | created        | S      |                NULL | unavailable                          |                   425 |
| SubSuccess | N      | handled        | R      |                NULL | Job successfully submitted to Globus |                     3 |
| SubSuccess | N      | handled        | SS     |                NULL | Job successfully submitted to Globus |                    13 |
| SubSuccess | Y      | processed      | A      |                NULL | hit job retry count (0)              |                     1 |
| Terminated | Y      | processed      | E      |                   0 | Job terminated successfully          |                  1977 |
| Terminated | Y      | processed      | E      |               50117 | Job terminated successfully          |                     7 |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+

Task3

CRAB client

crab:  ExitCodes Summary
 >>>>>>>>> 1793 Jobs with Wrapper Exit Code : 0
         List of jobs: 2005,2007-2061,2063-2071,2073-2698,2700-2919,2921-3000,3201-3300,3302-4003
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 6 Jobs with Wrapper Exit Code : 50117
         List of jobs: 2006,2062,2072,2699,2920,3301
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   4411 Total Jobs
 >>>>>>>>> 2612 Jobs Submitted
        List of jobs Submitted: 1-2004,3001-3200,4004-4411
 >>>>>>>>> 1799 Jobs Done
        Jobs terminated: retrieve them with: crab -getoutput <List of jobs>
        List of jobs: 2005-3000,3201-4003

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-10_90ij6k
        Server page: http://vocms21.cern.ch:8888/logginfo
        Your task name is: leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-10_90ij6k

DB

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=116 group by state,process_status,status,wrapper_return_code,status_reason;
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| state      | closed | process_status | status | wrapper_return_code | status_reason                        | count(process_status) |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| Aborted    | Y      | processed      | A      |                NULL | hit job retry count (0)              |                    67 |
| SubSuccess | N      | created        | S      |                NULL | NULL                                 |                   200 |
| SubSuccess | N      | created        | S      |                NULL |                                      |                  1155 |
| SubSuccess | N      | created        | S      |                NULL | Job successfully submitted to Globus |                   974 |
| SubSuccess | N      | created        | S      |                NULL | unavailable                          |                    83 |
| SubSuccess | N      | handled        | SU     |                NULL |                                      |                   200 |
| Terminated | Y      | processed      | E      |                   0 | Job terminated successfully          |                  1793 |
| Terminated | Y      | processed      | E      |               50117 | Job terminated successfully          |                     6 |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+

Task4

CRAB client

crab:  ExitCodes Summary
 >>>>>>>>> 1986 Jobs with Wrapper Exit Code : 0
         List of jobs: 1-278,280-615,618-806,808-827,829-848,850-935,937-946,948-986,988-1002,1004-1221,1223-1290,1292-1403,1405-1978,1980-2000
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 14 Jobs with Wrapper Exit Code : 50117
         List of jobs: 279,616-617,807,828,849,936,947,987,1003,1222,1291,1404,1979
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   4411 Total Jobs
 >>>>>>>>> 2411 Jobs Submitted
        List of jobs Submitted: 2001-4411
 >>>>>>>>> 2000 Jobs Done
        Jobs terminated: retrieve them with: crab -getoutput <List of jobs>
        List of jobs: 1-2000

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-11_ob0p32
        Server page: http://vocms21.cern.ch:8888/logginfo
        Your task name is: leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-11_ob0p32

DB

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=117 group by state,process_status,status,wrapper_return_code,status_reason;
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| state      | closed | process_status | status | wrapper_return_code | status_reason                        | count(process_status) |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| Aborted    | Y      | processed      | A      |                NULL | hit job retry count (0)              |                    75 |
| SubSuccess | N      | created        | S      |                NULL | NULL                                 |                    11 |
| SubSuccess | N      | created        | S      |                NULL |                                      |                  1772 |
| SubSuccess | N      | created        | S      |                NULL | Job successfully submitted to Globus |                   210 |
| SubSuccess | N      | created        | S      |                NULL | unavailable                          |                   418 |
| Terminated | Y      | processed      | E      |                   0 | Job terminated successfully          |                  1986 |
| Terminated | Y      | processed      | E      |               50117 | Job terminated successfully          |                    14 |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+

Task5

CRAB client

crab:  ExitCodes Summary
 >>>>>>>>> 1922 Jobs with Wrapper Exit Code : 0
         List of jobs: 2479-2561,2563-2952,2954-2983,2985-3062,3064-3595,3597-3604,3606-4172,4174-4188,4190-4193,4196-4208,4210-4411
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 11 Jobs with Wrapper Exit Code : 50117
         List of jobs: 2562,2953,2984,3063,3596,3605,4173,4189,4194-4195,4209
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   4411 Total Jobs
 >>>>>>>>> 2478 Jobs Submitted
        List of jobs Submitted: 1-2478
 >>>>>>>>> 3 Jobs Done
        Jobs not completely processed: cannot retrieve them, yet (few minutes of delay can occur when using the server)
        List of jobs: 2585,3427,4338
 >>>>>>>>> 1930 Jobs Done
        Jobs terminated: retrieve them with: crab -getoutput <List of jobs>
        List of jobs: 2479-2584,2586-3426,3428-4337,4339-4411

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-12_j85uw4
        Server page: http://vocms21.cern.ch:8888/logginfo
        Your task name is: leo_CERN-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-12_j85uw4

DB

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=118 group by state,process_status,status,wrapper_return_code,status_reason;
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| state      | closed | process_status | status | wrapper_return_code | status_reason                        | count(process_status) |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+
| Aborted    | Y      | processed      | A      |                NULL | hit job retry count (0)              |                   508 |
| SubSuccess | N      | created        | S      |                NULL | NULL                                 |                   200 |
| SubSuccess | N      | created        | S      |                NULL |                                      |                  2056 |
| SubSuccess | N      | created        | S      |                NULL | hit job retry count (0)              |                     5 |
| SubSuccess | N      | created        | S      |                NULL | Job successfully submitted to Globus |                   133 |
| SubSuccess | N      | created        | S      |                NULL | unavailable                          |                    84 |
| SubSuccess | Y      | processed      | SD     |                   0 | Job terminated successfully          |                     3 |
| Terminated | Y      | processed      | E      |                   0 | Job terminated successfully          |                  1918 |
| Terminated | Y      | processed      | E      |                   0 | Warning: job exit code != 0          |                     1 |
| Terminated | Y      | processed      | E      |               50117 | Job terminated successfully          |                    11 |
+------------+--------+----------------+--------+---------------------+--------------------------------------+-----------------------+


SubSuccess issue

Test1: 4411 jobs

Run 1,2, taskid 2,3

Dataset: /MinimumBias/Commissioning10-SD_JetMETTauMonitor-Jun14thSkim_v1/RECO

TaskName: leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10_7h94ob

UI: t3ui02.psi.ch

Notes: JPE.py used, no output file given to be retrieved

Submission of all jobs not successfull (too big), submitting in more bunches (1-2000, 2001-4411)

Result: 4411 Submitted

2 runs, in the second lot of SubSuccess, but with a restart only one remained... what got stuck?

The bad remained:

mysql> select *  from bl_runningjob where closed!='Y' and task_id=3;

id job_id task_id submission state scheduler service sched_attr scheduler_id scheduler_parent_id status_scheduler status status_reason destination creation_timestamp lb_timestamp submission_time scheduled_at_site start_time stop_time stageout_time getoutput_time output_request_time output_enqueue_time getoutput_retry output_dir storage lfn application_return_code wrapper_return_code process_status closed
8718 4277 3 1 SubSuccess SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1935.76 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 2010-07-30 08:32:32 0000-00-00 00:00:00 2010-07-29 22:50:42 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:48:27 2010-07-29 23:28:22 2010-07-29 23:43:55 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_4277/Submission_1/ NULL NULL 0 0 in_progress N

Ten good ones:

mysql> select *  from bl_runningjob where closed='Y' and task_id=3 limit 10;

id job_id task_id submission state scheduler service sched_attr scheduler_id scheduler_parent_id status_scheduler status status_reason destination creation_timestamp lb_timestamp submission_time scheduled_at_site start_time stop_time stageout_time getoutput_time output_request_time output_enqueue_time getoutput_retry output_dir storage lfn application_return_code wrapper_return_code process_status closed
4442 1 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.0 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:36 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_1/Submission_1/ NULL NULL 0 0 processed Y
4443 2 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.1 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:32 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_2/Submission_1/ NULL NULL 0 0 processed Y
4444 3 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.2 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:30 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_3/Submission_1/ NULL NULL 0 0 processed Y
4445 4 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.3 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:30 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_4/Submission_1/ NULL NULL 0 0 processed Y
4446 5 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.4 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:27 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_5/Submission_1/ NULL NULL 0 0 processed Y
4447 6 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.5 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:26 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_6/Submission_1/ NULL NULL 0 0 processed Y
4448 7 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.6 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:22 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_7/Submission_1/ NULL NULL 0 0 processed Y
4449 8 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.7 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:19 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_8/Submission_1/ NULL NULL 0 0 processed Y
4450 9 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.8 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:19 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_9/Submission_1/ NULL NULL 0 0 processed Y
4451 10 3 1 Cleared SchedulerGlidein   NULL submit-2.t2.ucsd.edu//1914.9 leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v Cleared E   osg-gw-5.t2.ucsd.edu/jobmanager-condor 2010-07-30 01:37:02 0000-00-00 00:00:00 2010-07-29 22:37:46 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 2010-07-29 23:01:16 2010-07-29 22:55:05 2010-07-29 22:55:07 1 /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_spec/BossJob_3_10/Submission_1/ NULL NULL 0 0 processed Y

mysql> select *  from bl_job where task_id=3 and job_id=4277 limit 10;

id task_id job_id wmbsJob_id name executable events arguments stdin stdout stderr input_files output_files dls_destination submission_number closed
8718 3 4277 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job4277 CMSSW.sh NULL 4277 1 NULL CMSSW_4277.stdout CMSSW_4277.stderr NULL ['out_files_4277.tgz', 'crab_fjr_4277.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
mysql> select *  from bl_job where task_id=3 limit 10;

id task_id job_id wmbsJob_id name executable events arguments stdin stdout stderr input_files output_files dls_destination submission_number closed
4442 3 1 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job1 CMSSW.sh NULL 1 1 NULL CMSSW_1.stdout CMSSW_1.stderr NULL ['out_files_1.tgz', 'crab_fjr_1.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4443 3 2 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job2 CMSSW.sh NULL 2 1 NULL CMSSW_2.stdout CMSSW_2.stderr NULL ['out_files_2.tgz', 'crab_fjr_2.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4444 3 3 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job3 CMSSW.sh NULL 3 1 NULL CMSSW_3.stdout CMSSW_3.stderr NULL ['out_files_3.tgz', 'crab_fjr_3.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4445 3 4 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job4 CMSSW.sh NULL 4 1 NULL CMSSW_4.stdout CMSSW_4.stderr NULL ['out_files_4.tgz', 'crab_fjr_4.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4446 3 5 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job5 CMSSW.sh NULL 5 1 NULL CMSSW_5.stdout CMSSW_5.stderr NULL ['out_files_5.tgz', 'crab_fjr_5.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4447 3 6 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job6 CMSSW.sh NULL 6 1 NULL CMSSW_6.stdout CMSSW_6.stderr NULL ['out_files_6.tgz', 'crab_fjr_6.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4448 3 7 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job7 CMSSW.sh NULL 7 1 NULL CMSSW_7.stdout CMSSW_7.stderr NULL ['out_files_7.tgz', 'crab_fjr_7.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4449 3 8 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job8 CMSSW.sh NULL 8 1 NULL CMSSW_8.stdout CMSSW_8.stderr NULL ['out_files_8.tgz', 'crab_fjr_8.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4450 3 9 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job9 CMSSW.sh NULL 9 1 NULL CMSSW_9.stdout CMSSW_9.stderr NULL ['out_files_9.tgz', 'crab_fjr_9.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N
4451 3 10 NULL leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-2_13mb6v_job10 CMSSW.sh NULL 10 1 NULL CMSSW_10.stdout CMSSW_10.stderr NULL ['out_files_10.tgz', 'crab_fjr_10.xml', '.BrokerInfo'] ['srm.unl.edu', 'cmsrm-se01.roma1.infn.it', 'bsrm-1.t2.ucsd.edu'] 1 N

Questions:

  • why creation_timestamp is later than the submission time?
  • why in bj_job nothing is closed?

Notes to myself:

TrackingComponent retrieves the job status from WMS using JobStatus.py (both in cms/prodagent/PRODAGENT_0_12_18_pre3-cmp3/lib/JobTracking/), which calls:

command = \
                        'python ' + \
                        '$PRODAGENT_ROOT/lib/JobTracking/QueryStatus.py ' + \
                        str(taskId) + ' ' + jobRange + ' ' + scheduler + \
                        ' ' + task['user_proxy']

If the job is succesfull, the status become SD. This is the case for SubSuccess jobs, so the issue probably is not here.

First impression: it seems that the GetOutput is not triggered for some reason. Added debug printouts to prodagent/PRODAGENT_0_12_18_pre3-cmp3/lib/GetOutput/JobOutput.py. After server restart, some jobs are retrieved (TODO: next time restart only GetOutput).

How is seems to work:

  • JobTracking puts the status "SD"
  • JobOutput changes the processStatus to "output_requested", then "in_progress"

Run3

Try: submit 2000 jobs, wait to be submitted all, then the other ones. taskid=4.

Not all the jobs submitted, only 1-4411 on 4431 total. No SubProcess seen before getoutput performed on the client side.

mysql> select state,count(state) from bl_runningjob where task_id=4 group by state;
+--------------+--------------+
| state        | count(state) |
+--------------+--------------+
| Cleared      |         3811 |
| Created      |           20 |
| SubRequested |          600 |
+--------------+--------------+

So, submitting bug hit, but not subrequest one.

Run 4

Same as Run 3, task_id=5.

mysql> select state,count(state) from bl_runningjob where task_id=5 group by state;
+--------------+--------------+
| state        | count(state) |
+--------------+--------------+
| Created      |           20 |
| SubRequested |          670 |
| SubSuccess   |           20 |
| Terminated   |         3722 |
+--------------+--------------+

20 subsuccess:

mysql> select id,job_id,task_id,submission,state,scheduler,service,sched_attr,status_scheduler,status,status_reason,destination,lb_timestamp,scheduled_at_site,getoutput_retry,storage,lfn,application_return_code,wrapper_return_code,process_status,closed from bl_runningjob where state='SubSuccess' and task_id=5;

id job_id task_id submission state scheduler service sched_attr status_scheduler status status_reason destination lb_timestamp scheduled_at_site getoutput_retry storage lfn application_return_code wrapper_return_code process_status closed
16439 3156 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16539 3256 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16656 3373 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16657 3374 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16658 3375 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16675 3392 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16869 3586 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
16900 3617 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17153 3870 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17154 3871 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17314 4031 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17346 4063 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17347 4064 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17396 4113 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17397 4114 5 1 SubSuccess SchedulerGlidein   NULL Done SD   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17413 4130 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17445 4162 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17589 4306 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17590 4307 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N
17624 4341 5 1 SubSuccess SchedulerGlidein   NULL Done SD   osg-gw-5.t2.ucsd.edu/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress N

mysql> select id,job_id,task_id,submission,state,scheduler,service,sched_attr,status_scheduler,status,status_reason,destination,lb_timestamp,scheduled_at_site,getoutput_retry,storage,lfn,application_return_code,wrapper_return_code,process_status,closed from bl_runningjob where state!='SubSuccess' and task_id=5 limit 1;

id job_id task_id submission state scheduler service sched_attr status_scheduler status status_reason destination lb_timestamp scheduled_at_site getoutput_retry storage lfn application_return_code wrapper_return_code process_status closed
13284 1 5 1 Terminated SchedulerGlidein   NULL Cleared E   cms-sleepgw.fnal.gov/jobmanager-condor 0000-00-00 00:00:00 0000-00-00 00:00:00 1 NULL NULL 0 0 in_progress Y

Trying to restarting the server: fixed all the jobs.

Run 5

TaskId: 9

mysql> select state,count(state) from bl_runningjob where task_id=9 group by state;
+--------------+--------------+
| state        | count(state) |
+--------------+--------------+
| SubRequested |          453 |
| SubSuccess   |          192 |
| Terminated   |         3786 |
+--------------+--------------+

Part of logfiles rotated, info lost...

Strange states in DB:

mysql> select state,count(state),process_status,count(process_status),status, count(status),wrapper_return_code, count(wrapper_return_code) from bl_runningjob where task_id=9 group by state,process_status,status,wrapper_return_code;
+--------------+--------------+----------------+-----------------------+--------+---------------+---------------------+----------------------------+
| state        | count(state) | process_status | count(process_status) | status | count(status) | wrapper_return_code | count(wrapper_return_code) |
+--------------+--------------+----------------+-----------------------+--------+---------------+---------------------+----------------------------+
| SubRequested |           11 | in_progress    |                    11 | E      |            11 |                   0 |                         11 |
| SubRequested |            2 | in_progress    |                     2 | SD     |             2 |                   0 |                          2 |
| SubRequested |          440 | processed      |                   440 | E      |           440 |                   0 |                        440 |
| SubSuccess   |          192 | in_progress    |                   192 | SD     |           192 |                   0 |                        192 |
| Terminated   |          680 | in_progress    |                   680 | E      |           680 |                   0 |                        680 |
| Terminated   |         3101 | processed      |                  3101 | E      |          3101 |                   0 |                       3101 |
| Terminated   |            5 | processed      |                     5 | E      |             5 |               50117 |                          5 |
+--------------+--------------+----------------+-----------------------+--------+---------------+---------------------+----------------------------+

Not so much info from bl_job:

mysql> select submission_number, count(submission_number), closed, count(closed)from bl_job  where task_id=9 group by submission_number,closed;
+-------------------+--------------------------+--------+---------------+
| submission_number | count(submission_number) | closed | count(closed) |
+-------------------+--------------------------+--------+---------------+
|                 1 |                     4431 | N      |          4431 |
+-------------------+--------------------------+--------+---------------+

Question: when the job is closed?

Added more debug.

Modified logrotation for PA (PRODAGENT_0_12_18_pre3-cmp3/lib/ProdAgentCore/LoggingUtils.py): from 3 files of 1MB, 10 files of 10 MB.

Rebooting only GetOutput: crabServerd --component=GetOutput {--sh,--start}

After some minutes:

mysql> select state,count(state),process_status,count(process_status),status, count(status),wrapper_return_code, count(wrapper_return_code) from bl_runningjob where task_id=9 group by state,process_status,status,wrapper_return_code;
+--------------+--------------+------------------+-----------------------+--------+---------------+---------------------+----------------------------+
| state        | count(state) | process_status   | count(process_status) | status | count(status) | wrapper_return_code | count(wrapper_return_code) |
+--------------+--------------+------------------+-----------------------+--------+---------------+---------------------+----------------------------+
| SubRequested |           11 | in_progress      |                    11 | E      |            11 |                   0 |                         11 |
| SubRequested |            2 | in_progress      |                     2 | SD     |             2 |                   0 |                          2 |
| SubRequested |          440 | processed        |                   440 | E      |           440 |                   0 |                        440 |
| SubSuccess   |          184 | in_progress      |                   184 | SD     |           184 |                   0 |                        184 |
| Terminated   |          534 | in_progress      |                   534 | E      |           534 |                   0 |                        534 |
| Terminated   |           23 | output_retrieved |                    23 | E      |            23 |                   0 |                         23 |
| Terminated   |         3232 | processed        |                  3232 | E      |          3232 |                   0 |                       3232 |
| Terminated   |            5 | processed        |                     5 | E      |             5 |               50117 |                          5 |
+--------------+--------------+------------------+-----------------------+--------+---------------+---------------------+----------------------------+

At the end:

mysql> select state,process_status,status,wrapper_return_code, count(wrapper_return_code) from bl_runningjob where task_id=9 group by state,process_status,status,wrapper_return_code;
+--------------+------------------+--------+---------------------+----------------------------+
| state        | process_status   | status | wrapper_return_code | count(wrapper_return_code) |
+--------------+------------------+--------+---------------------+----------------------------+
| SubRequested | output_retrieved | E      |                   0 |                          3 |
| SubRequested | processed        | E      |                   0 |                        450 |
| SubSuccess   | in_progress      | SD     |                   0 |                         18 |
| Terminated   | in_progress      | E      |                   0 |                         64 |
| Terminated   | output_retrieved | E      |                   0 |                        102 |
| Terminated   | processed        | E      |                   0 |                       3789 |
| Terminated   | processed        | E      |               50117 |                          5 |
+--------------+------------------+--------+---------------------+----------------------------+

After one night:

+--------------+------------------+--------+---------------------+----------------------------+
| state        | process_status   | status | wrapper_return_code | count(wrapper_return_code) |
+--------------+------------------+--------+---------------------+----------------------------+
| SubRequested | output_retrieved | E      |                   0 |                          3 |
| SubRequested | processed        | E      |                   0 |                        450 |
| Terminated   | in_progress      | E      |                   0 |                          7 |
| Terminated   | output_retrieved | E      |                   0 |                        112 |
| Terminated   | processed        | E      |                   0 |                       3854 |
| Terminated   | processed        | E      |               50117 |                          5 |
+--------------+------------------+--------+---------------------+----------------------------+

Somebody restarted? or just randomly decided to pick up those jobs? Nobody restarted...

After a new restart, everything is fine:

mysql> select state,process_status,status,wrapper_return_code, count(wrapper_return_code) from bl_runningjob where task_id=9 group by state,process_status,status,wrapper_return_code;
+--------------+----------------+--------+---------------------+----------------------------+
| state        | process_status | status | wrapper_return_code | count(wrapper_return_code) |
+--------------+----------------+--------+---------------------+----------------------------+
| SubRequested | processed      | E      |                   0 |                        453 |
| Terminated   | processed      | E      |                   0 |                       3973 |
| Terminated   | processed      | E      |               50117 |                          5 |
+--------------+----------------+--------+---------------------+----------------------------+

My understanding:

  • recoverStatus in JobOutput.py checks the job status, if !failed it changes it to output_request
  • JobOutput.py then queries jobs in output_requested status throught doWork, and actually retrieves the job.
  • recoverStatus is called in doWork after output_requested jobs have been retrieved, then DB is updated

Run 6

Same as above, cleaned up logfiles. Server shut down and restarted.

Taskid: 14

Killed, issues with the scheduler solved by Sanjay

Run 7

Same as above. Task id: 15

Status after the night:

+--------------+--------+----------------+--------+---------------------+---------------+-----------------------+
| state        | closed | process_status | status | wrapper_return_code | status_reason | count(process_status) |
+--------------+--------+----------------+--------+---------------------+---------------+-----------------------+
| SubRequested | Y      | in_progress    | E      |                   0 |               |                    10 |
| SubRequested | N      | in_progress    | SD     |                   0 |               |                     6 |
| SubRequested | Y      | processed      | E      |                   0 |               |                   767 |
| SubSuccess   | N      | in_progress    | SD     |                   0 |               |                   195 |
| Terminated   | Y      | in_progress    | E      |                   0 |               |                   650 |
| Terminated   | Y      | processed      | E      |                   0 |               |                  2803 |
+--------------+--------+----------------+--------+---------------------+---------------+-----------------------+

GetOutput log for a normal job:

ComponentLog.2:2010-08-04 10:17:54,741:[MainThread] Job 15.101.1: Finished
ComponentLog.2:2010-08-04 10:21:11,809:[Thread-6] Job 15.1.1: Getting output
ComponentLog.2:2010-08-04 10:21:11,809:leo: [GetOutput.doWork.1] job [Thread-6] Job 15.1.1 is in status in_progress
ComponentLog.2:2010-08-04 10:21:11,810:[Thread-6] Job 15.1.1: Processing output
ComponentLog.2:2010-08-04 10:21:11,810:leo: starting retrieval for [Thread-6] Job 15.1.1 in status in_progress
ComponentLog.2:2010-08-04 10:21:11,812:[Thread-6] Job 15.1.1: Creating directory /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_1/Submission_1/
ComponentLog.2:2010-08-04 10:21:11,862:Evaluate job [Thread-6] Job 15.1.1 and publish the job results
ComponentLog.2:2010-08-04 10:21:11,862:Job [Thread-6] Job 15.1.1 : report rebounce
ComponentLog.2:2010-08-04 10:21:11,896:Job [Thread-6] Job 15.1.1 REBOUNCE DBG : /data/CSstoragePath/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8/crab_fjr_1.xml, /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_1/Submission_1/crab_fjr_1.xml
ComponentLog.2:2010-08-04 10:21:21,923:Job [Thread-6] Job 15.1.1 Report rebounce completed
ComponentLog.2:2010-08-04 10:21:21,924:Job [Thread-6] Job 15.1.1 : report file name /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_1/Submission_1//crab_fjr_1.xml exists: True
ComponentLog.2:2010-08-04 10:21:22,004:Job [Thread-6] Job 15.1.1 check Job Success: True
ComponentLog.2:2010-08-04 10:21:22,004:leo: [GetOutput.doWork.4] job [Thread-6] Job 15.1.1 got ret=(<ProdCommon.BossLite.DbObjects.Job.Job object at 0x2b00255083d0>, True, '/data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_1/Submission_1//crab_fjr_1.xml'), in status in_progress
ComponentLog.2:2010-08-04 10:21:22,005:[Thread-6] Job 15.1.1 : Processing output finished
ComponentLog.2:2010-08-04 10:21:22,005:leo: [GetOutput.doWork.7] job [Thread-6] Job 15.1.1: updating DB, status in_progress
ComponentLog.2:2010-08-04 10:21:22,013:Job [MainThread] Job 15.1.1 finished: JobState DB Notified
ComponentLog.2:2010-08-04 10:21:22,013:[MainThread] Job 15.1.1: Finished

For a SubSuccess one:

ComponentLog.2:2010-08-04 10:17:54,147:[Thread-9] Job 15.97.1: Getting output
ComponentLog.2:2010-08-04 10:17:54,148:leo: [GetOutput.doWork.1] job [Thread-9] Job 15.97.1 is in status in_progress
ComponentLog.2:2010-08-04 10:17:54,148:[Thread-9] Job 15.97.1: Processing output
ComponentLog.2:2010-08-04 10:17:54,148:leo: starting retrieval for [Thread-9] Job 15.97.1 in status in_progress
ComponentLog.2:2010-08-04 10:17:54,161:[Thread-9] Job 15.97.1: Creating directory /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_97/Submission_1/
ComponentLog.2:2010-08-04 10:17:54,214:Evaluate job [Thread-9] Job 15.97.1 and publish the job results
ComponentLog.2:2010-08-04 10:17:54,214:Job [Thread-9] Job 15.97.1 : report rebounce
ComponentLog.2:2010-08-04 10:17:54,254:Job [Thread-9] Job 15.97.1 REBOUNCE DBG : /data/CSstoragePath/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8/crab_fjr_97.xml, /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_97/Submission_1/crab_fjr_97.xml
ComponentLog.2:2010-08-04 10:18:04,229:Job [Thread-9] Job 15.97.1 Report rebounce completed
ComponentLog.2:2010-08-04 10:18:04,229:Job [Thread-9] Job 15.97.1 : report file name /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_97/Submission_1//crab_fjr_97.xml exists: True
ComponentLog.2:2010-08-04 10:18:04,303:Job [Thread-9] Job 15.97.1 check Job Success: True
ComponentLog.2:2010-08-04 10:18:04,304:leo: [GetOutput.doWork.4] job [Thread-9] Job 15.97.1 got ret=(<ProdCommon.BossLite.DbObjects.Job.Job object at 0x2b0025480350>, True, '/data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_spec/BossJob_15_97/Submission_1//crab_fjr_97.xml'), in status in_progress
ComponentLog.2:2010-08-04 10:18:04,304:[Thread-9] Job 15.97.1 : Processing output finished
ComponentLog.2:2010-08-04 10:18:04,304:leo: [GetOutput.doWork.7] job [Thread-9] Job 15.97.1: updating DB, status in_progress
ComponentLog.2:2010-08-04 10:18:04,313:Job [MainThread] Job 15.97.1 finished: JobState DB Notified
ComponentLog.2:2010-08-04 10:18:04,313:[MainThread] Job 15.97.1: Finished

They are the same, but in the log the "State" info is missing...

Interesting: JobOutput.py, L125:

            # inconsistent status
            if status == 'in_progress' and job.runningJob['closed'] == 'Y':
                logging.warning(
                    "%s in status %s: Enqueuing previous output" % \
                    (cls.fullId( job ), status) )
                logging.info("leo: [GetOutput.doWork.3] job %s moved to output_requested, to be retrieved next, in status %s" % (cls.fullId( job ), status))

                job.runningJob['processStatus'] = 'output_retrieved'
                bossLiteSession.updateDB( job )
                skipRetrieval = True

but actually never called, and 195 jobs in that state. Actually, this check can be done also in BossLiteAPI.py, L749 :

        for job in jobList :
            if job['id'] != retJob['id'] and job.runningJob['closed'] == 'N' \
                   and job.runningJob['processStatus'] == 'processed' :
                logging.warning(
                    "WARNING: previous job %s.%s.%s not closed. Forcing closed='Y'" \
                    % (job['taskId'], job['jobId'], job['submissionNumber'])
                    )
                job.runningJob['closed'] = 'Y'
                self.updateDB(job.runningJob)

Update to Terminated status actually called by TaskTracking, but no signs of it in logfiles:

crab@submit-2 ~/work/TaskTracking$ grep "Updating at" ComponentLog* | grep " 97L" | grep LPJ_10-7_6i3gp8

crab@submit-2 ~/work/TaskTracking$ grep "Updating at" ComponentLog* | grep " 98L" | grep LPJ_10-7_6i3gp8
ComponentLog.3:2010-08-04 10:18:49,067:Updating at [Terminated] the db for jobs [82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 98L, 99L, 100L, 101L, 102L] -> [82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 98L, 99L, 100L, 101L, 102L] {leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8}

The action is triggered by computeJobStatus, TaskTrackingComponent.py, L940:

            if action in ["SubSuccess", "KillFailed"]:
                if stato in ["E"]:  #["SD","E","DA"]:
                    updateStateTerminated.append(job)
                    vect[10] = "Terminated"

processed and closed are set up in JobOutput.py, L159 :

            # update status
            if ret is not None:
                job.runningJob['processStatus'] = 'processed'
                job.runningJob['closed'] = 'Y'

After adding some tons of debug and restarting the server, 4 jobs still in SubSuccess, curious behaviour in TaskTracking:

$ grep 4327 TaskTracking/ComponentLog*
[...]
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,053:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,053:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,054:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,055:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,237:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,238:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,239:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.9:2010-08-05 06:59:17,240:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0

Note: logrotation for this is going nuts maybe... the grepped logfile changes almost every second...e g. after writing these lines is grepped from TaskTracking/ComponentLog.8, then TaskTracking/ComponentLog.8... need to enlarge the logifile dimension. Anyway, it bounced between "SD, Done" and "E, Cleared":

crab@submit-2 ~/work$ grep 4327 TaskTracking/ComponentLog*
TaskTracking/ComponentLog.8:2010-08-05 07:02:01,976:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:01,977:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:01,978:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:01,979:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,093:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,094:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,094:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,095:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,160:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,162:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,163:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.8:2010-08-05 07:02:02,164:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
crab@submit-2 ~/work$ grep 4327 TaskTracking/ComponentLog*
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,113:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,114:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,115:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,116:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,141:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,142:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,143:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,144:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,254:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,255:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,256:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 4327 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
TaskTracking/ComponentLog.6:2010-08-05 07:02:59,257:leo: [TaskTrackingComponent.computeJobStatus.2] job 4327 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0

DB is not updated in the meanwhile.

But ehi, how many tables are used? Not only the bl_* ones... NB id=4327 is SubSuccess, id=1 is ok

mysql> select id,status,job_type workflow_id from we_Job where id="leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_job4327";
+--------------------------------------------------------------------------------------------------+----------+-------------+
| id                                                                                               | status   | workflow_id |
+--------------------------------------------------------------------------------------------------+----------+-------------+
| leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_job4327 | finished | Processing  |
+--------------------------------------------------------------------------------------------------+----------+-------------+
1 row in set (0.00 sec)

mysql> select id,status,job_type workflow_id from we_Job where id="leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_job1";
+-----------------------------------------------------------------------------------------------+----------+-------------+
| id                                                                                            | status   | workflow_id |
+-----------------------------------------------------------------------------------------------+----------+-------------+
| leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-7_6i3gp8_job1 | finished | Processing  |
+-----------------------------------------------------------------------------------------------+----------+-------------+

So, it seems that bl_runningjob is not actually updated by GetOutput.

Curious: in JobOutput.purgeService:

            statusSched = job.runningJob['status']

            oldTimeout = socket.getdefaulttimeout()
            socket.setdefaulttimeout( cls.params['timeout'] )
            schedSession.purgeService( task )
            socket.setdefaulttimeout(oldTimeout)

            if statusSched == 'UE' :
                job.runningJob['status'] = 'UE'

does not really make sense.... and what is status UE?

Run 8

Added more debug, in particular a check after BL DB update in GetOutput, L189.

Task id: 16

After one night:

mysql> select state,closed,process_status,status,wrapper_return_code,status_reason, count(process_status) from bl_runningjob where task_id=16 group by state,process_status,status,wrapper_return_code,status_reason;
+--------------+--------+------------------+--------+---------------------+---------------+-----------------------+
| state        | closed | process_status   | status | wrapper_return_code | status_reason | count(process_status) |
+--------------+--------+------------------+--------+---------------------+---------------+-----------------------+
| SubRequested | Y      | in_progress      | E      |                   0 |               |                    10 |
| SubRequested | N      | in_progress      | SD     |                   0 |               |                     2 |
| SubRequested | Y      | processed        | E      |                   0 |               |                   388 |
| SubSuccess   | N      | in_progress      | SD     |                   0 |               |                    69 |
| Terminated   | N      | in_progress      | E      |                   0 |               |                   622 |
| Terminated   | Y      | output_retrieved | E      |                   0 |               |                     5 |
| Terminated   | Y      | processed        | E      |                   0 |               |                  3315 |
+--------------+--------+------------------+--------+---------------------+---------------+-----------------------+

An admixture of strange states frown :

  • Job Terminated but not with the correct process_status or not closed (also triggered by GetOutput)
  • Usual SubSuccess
  • SubRequested (this should be the SubmittingForever issue)

Some SubSuccess jobs:

mysql> select job_id,state,closed,process_status,status from bl_runningjob where task_id=16 and state="SubSuccess" limit 10;
+--------+------------+--------+----------------+--------+
| job_id | state      | closed | process_status | status |
+--------+------------+--------+----------------+--------+
|    441 | SubSuccess | N      | in_progress    | SD     |
|    676 | SubSuccess | N      | in_progress    | SD     |
|   1116 | SubSuccess | N      | in_progress    | SD     |
|   2548 | SubSuccess | N      | in_progress    | SD     |
|   2549 | SubSuccess | N      | in_progress    | SD     |
|   3004 | SubSuccess | N      | in_progress    | SD     |
|   3193 | SubSuccess | N      | in_progress    | SD     |
|   3221 | SubSuccess | N      | in_progress    | SD     |
|   3222 | SubSuccess | N      | in_progress    | SD     |
|   3606 | SubSuccess | N      | in_progress    | SD     |
+--------+------------+--------+----------------+--------+

Let's see what happened to job 441:

crab@submit-2 ~/work$ grep 16.441.1 GetOutput/ComponentLog*
GetOutput/ComponentLog.2:2010-08-05 10:06:08,902:[Thread-3] Job 16.441.1: Getting output
GetOutput/ComponentLog.2:2010-08-05 10:06:08,903:leo: [GetOutput.doWork.1] job [Thread-3] Job 16.441.1 is in process_status in_progress closed N status SD
GetOutput/ComponentLog.2:2010-08-05 10:06:08,904:[Thread-3] Job 16.441.1: Processing output
GetOutput/ComponentLog.2:2010-08-05 10:06:08,904:leo [GetOutput.doWork.1b]: starting retrieval for [Thread-3] Job 16.441.1 in process_status in_progress closed N status SD
GetOutput/ComponentLog.2:2010-08-05 10:06:08,907:[Thread-3] Job 16.441.1: Creating directory /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1/
GetOutput/ComponentLog.2:2010-08-05 10:06:08,959:Evaluate job [Thread-3] Job 16.441.1 and publish the job results
GetOutput/ComponentLog.2:2010-08-05 10:06:08,959:Job [Thread-3] Job 16.441.1 : report rebounce
GetOutput/ComponentLog.2:2010-08-05 10:06:08,995:Job [Thread-3] Job 16.441.1 REBOUNCE DBG : /data/CSstoragePath/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864/crab_fjr_441.xml, /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1/crab_fjr_441.xml
GetOutput/ComponentLog.2:2010-08-05 10:06:18,196:Job [Thread-3] Job 16.441.1 Report rebounce completed
GetOutput/ComponentLog.2:2010-08-05 10:06:18,197:Job [Thread-3] Job 16.441.1 : report file name /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1//crab_fjr_441.xml exists: True
GetOutput/ComponentLog.2:2010-08-05 10:06:18,271:Job [Thread-3] Job 16.441.1 check Job Success: True
GetOutput/ComponentLog.2:2010-08-05 10:06:18,272:leo: [GetOutput.doWork.4] job [Thread-3] Job 16.441.1 got ret=(<ProdCommon.BossLite.DbObjects.Job.Job object at 0x2aaab0163a10>, True, '/data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1//crab_fjr_441.xml'), in process_status in_progress closed Y status E
GetOutput/ComponentLog.2:2010-08-05 10:06:18,272:[Thread-3] Job 16.441.1 : Processing output finished
GetOutput/ComponentLog.2:2010-08-05 10:06:18,272:leo: [GetOutput.doWork.7] job [Thread-3] Job 16.441.1: updating DB, process_status processed closed Y status E
GetOutput/ComponentLog.2:2010-08-05 10:06:18,283:leo: [GetOutput.doWork.8] job [Thread-3] Job 16.441.1 task 16: checking  DB after update , leoJobs has length 1
GetOutput/ComponentLog.2:2010-08-05 10:06:18,284:leo:                     job [Thread-3] Job 16.441.1 has process_status processed closed Y status E
GetOutput/ComponentLog.2:2010-08-05 10:06:18,301:Job [MainThread] Job 16.441.1 finished: JobState DB Notified
GetOutput/ComponentLog.2:2010-08-05 10:06:18,301:[MainThread] Job 16.441.1: Finished

For the log, the job is processed,closed, E state, but not true in the DB. As before, the report from TaskTracking oscillates:

crab@submit-2 ~/work$ grep " 441 " TaskTracking/ComponentLog.1
2010-08-06 01:19:59,458:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:19:59,459:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:19:59,461:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:19:59,463:leo: [TaskTrackingComponent.computeJobStatus.2] job 441 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
2010-08-06 01:20:10,809:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:20:10,811:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:20:10,812:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 441 is in state Cleared status E and schedulerStatus Cleared, WexitCode 0 AexitCode 0
2010-08-06 01:20:10,812:leo: [TaskTrackingComponent.computeJobStatus.2] job 441 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0
2010-08-06 01:20:35,728:leo: [TaskTrackingComponent.computeJobStatus.1] DB job 441 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
2010-08-06 01:20:35,731:leo: [TaskTrackingComponent.computeJobStatus.1] XML job 441 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
2010-08-06 01:20:35,733:leo: [TaskTrackingComponent.computeJobStatus.2] XML job 441 is in state SubSuccess status SD and schedulerStatus Done, WexitCode 0 AexitCode 0
2010-08-06 01:20:35,734:leo: [TaskTrackingComponent.computeJobStatus.2] job 441 updateStateTerminated has lenght 0 , updateStateAborted has lenght 0

Anyway, the DB does not oscillate! It seems something stuck in DB writing from GetOutput. Threading?

After a GetOutput reboot:

crab@submit-2 ~/work$ tail -fn1000 GetOutput/ComponentLog* |  grep 16.441.1


2010-08-06 01:30:58,611:[Thread-3] Job 16.441.1: Getting output
2010-08-06 01:30:58,612:leo: [GetOutput.doWork.1] job [Thread-3] Job 16.441.1 is in process_status in_progress closed N status SD
2010-08-06 01:30:58,612:[Thread-3] Job 16.441.1: Processing output
2010-08-06 01:30:58,612:leo [GetOutput.doWork.1b]: starting retrieval for [Thread-3] Job 16.441.1 in process_status in_progress closed N status SD
2010-08-06 01:30:58,615:[Thread-3] Job 16.441.1: Creating directory /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1/
2010-08-06 01:30:58,669:Evaluate job [Thread-3] Job 16.441.1 and publish the job results
2010-08-06 01:30:58,669:Job [Thread-3] Job 16.441.1 : report rebounce
2010-08-06 01:30:58,707:Job [Thread-3] Job 16.441.1 REBOUNCE DBG : /data/CSstoragePath/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864/crab_fjr_441.xml, /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1/crab_fjr_441.xml
2010-08-06 01:31:08,757:Job [Thread-3] Job 16.441.1 Report rebounce completed
2010-08-06 01:31:08,757:Job [Thread-3] Job 16.441.1 : report file name /data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1//crab_fjr_441.xml exists: True
2010-08-06 01:31:08,840:Job [Thread-3] Job 16.441.1 check Job Success: True
2010-08-06 01:31:08,840:leo: [GetOutput.doWork.4] job [Thread-3] Job 16.441.1 got ret=(<ProdCommon.BossLite.DbObjects.Job.Job object at 0x2b46bc551fd0>, True, '/data/CSstoragePath/logs/leo_leo-DS_MinimumBias_Comissioning10_SD_JetMETTauMonitor_Jun14thSkim_v1-LPJ_10-8_cjb864_spec/BossJob_16_441/Submission_1//crab_fjr_441.xml'), in process_status in_progress closed Y status E
2010-08-06 01:31:08,841:[Thread-3] Job 16.441.1 : Processing output finished
2010-08-06 01:31:08,841:leo: [GetOutput.doWork.7] job [Thread-3] Job 16.441.1: updating DB, process_status processed closed Y status E
2010-08-06 01:31:08,845:leo: [GetOutput.doWork.8] job [Thread-3] Job 16.441.1 task 16: checking  DB after update , leoJobs has length 1
2010-08-06 01:31:08,845:leo:                     job [Thread-3] Job 16.441.1 has process_status processed closed Y status E
2010-08-06 01:31:08,852:Job [MainThread] Job 16.441.1 finished: JobState DB Notified
2010-08-06 01:31:08,853:[MainThread] Job 16.441.1: Finished

It's the same as before, but now:

mysql> select job_id, task_id, state, status_reason,process_status,closed from bl_runningjob where job_id=441 and task_id=16;
+--------+---------+------------+---------------+----------------+--------+
| job_id | task_id | state      | status_reason | process_status | closed |
+--------+---------+------------+---------------+----------------+--------+
|    441 |      16 | Terminated |               | processed      | Y      |
+--------+---------+------------+---------------+----------------+--------+

-- LeonardoSala - 29-Jul-2010

Edit | Attach | Watch | Print version | History: r18 < r17 < r16 < r15 < r14 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r18 - 2010-08-07 - unknown
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    Main All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback