vocms0310

Agent tweaks

  • AgentStatusWatcher disabled ==> OK
  • Set maxRetries to 0 ==> OK
  • Restart the necessary components ==> OK
    $manage execute-agent wmcoreD --restart --components=RetryManager,ErrorHandler,AgentStatusWatcher
  • Enable ALL sites and set high slot thresholds for them ==> OK
    $manage db-prompt wmagent
    UPDATE wmbs_location SET state=(SELECT id from wmbs_location_state where name='Normal') WHERE state!=(SELECT id from wmbs_location_state where name='Normal');
    UPDATE wmbs_location SET running_slots=2000, pending_slots=1000;
    UPDATE rc_threshold SET max_slots=2000, pending_slots=1000;
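
To see what the first two UPDATE statements above do before running them against the real agent database, here is a minimal sketch on a toy in-memory SQLite database. Only the table and column names that appear in the statements come from this page; the `site_name` column, the site names, the extra state names and all values are made up for illustration, and the real agent database is not SQLite.

```python
import sqlite3

# Toy in-memory schema. Only the names used in the UPDATE statements above
# come from the source; site_name, the 'Down'/'Draining' states and all
# values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_location_state (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_location (site_name TEXT, state INTEGER,
                            running_slots INTEGER, pending_slots INTEGER);
INSERT INTO wmbs_location_state VALUES (1, 'Normal');
INSERT INTO wmbs_location_state VALUES (2, 'Down');
INSERT INTO wmbs_location_state VALUES (3, 'Draining');
INSERT INTO wmbs_location VALUES ('T1_Example_A', 1, 100, 50);
INSERT INTO wmbs_location VALUES ('T2_Example_B', 2, 0, 0);
INSERT INTO wmbs_location VALUES ('T2_Example_C', 3, 10, 5);
""")

# Same statements as above: move every site that is not 'Normal' back to
# 'Normal', then raise the slot limits for all sites.
conn.execute("""
UPDATE wmbs_location
   SET state = (SELECT id FROM wmbs_location_state WHERE name = 'Normal')
 WHERE state != (SELECT id FROM wmbs_location_state WHERE name = 'Normal')
""")
conn.execute("UPDATE wmbs_location SET running_slots = 2000, pending_slots = 1000")
conn.commit()

rows = conn.execute(
    "SELECT site_name, state, running_slots, pending_slots "
    "FROM wmbs_location ORDER BY site_name"
).fetchall()
for row in rows:
    print(row)
```

After the updates every toy site is in the 'Normal' state with the same slot limits, which is the intended effect on the agent: no site is excluded while the remaining work drains.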

Draining logs

cmst1@vocms0310:/data/srv/wmagent/current $ python drainAgent.py 

*** Amount of jobs in condor per workflow, sorted by condor job status:
{}

*** WORKFLOWS: found 13 distinct workflows in this agent.
prozober_ACDC0_SUS-RunIISpring16FSPremix-00083_00047_v0__170130_100804_7952                                                  	completed
pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740                                                              	completed
pdmvserv_task_BPH-PhaseIIFall16GS82-00011__v1_T_170124_203656_2417                                                           	announced
prozober_ACDC0_SUS-RunIISpring16FSPremix-00086_00047_v0__170130_104559_455                                                   	completed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01911__v1_T_170116_133411_9741                                                     	completed
pdmvserv_task_SUS-RunIISummer16DR80Premix-00135__v1_T_161124_210645_3749                                                     	announced
vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414                                                            	aborted-archived
pdmvserv_task_TOP-RunIISummer16DR80Premix-00083__v1_T_161219_123711_1390                                                     	running-closed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01880__v1_T_170113_162606_6082                                                     	running-closed
pdmvserv_SUS-RunIISpring16FSPremix-00083_00047_v0__170116_081703_7893                                                        	completed
wmagent_dmason_TEST_GOOGLE_161110_204157_8813                                                                                	aborted-archived
pdmvserv_task_EXO-RunIISummer15wmLHEGS-04239__v1_T_170114_004935_4431                                                        	announced
wmagent_dmason_TEST_GOOGLE_161111_134234_5412                                                                                	aborted-archived

*** WORKFLOWS: there are 1 distinct workflows not completed.
wmagent_dmason_TEST_GOOGLE_161110_204157_8813                                                                                	aborted-archived

*** WORKFLOWS: found 0 workflows not fully injected.

*** WMBS: amount of wmbs jobs in each status:
[{'count': 105384, 'name': 'cleanout'}, {'count': 11, 'name': 'executing'}]

*** WMBS: 2 workflows with executing jobs in wmbs:
wmagent_dmason_TEST_GOOGLE_161110_204157_8813                                                                                	aborted-archived
wmagent_dmason_TEST_GOOGLE_161111_134234_5412                                                                                	aborted-archived

*** SUBSCRIPTIONS: subscriptions not finished: 3
vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414                                                            	aborted-archived
wmagent_dmason_TEST_GOOGLE_161110_204157_8813                                                                                	aborted-archived
wmagent_dmason_TEST_GOOGLE_161111_134234_5412                                                                                	aborted-archived

*** SUBSCRIPTIONS: found 4 files available in WMBS (waiting for job creation):
[{'count(*)': 39, 'subscription': 1491}, {'count(*)': 83, 'subscription': 2181}, {'count(*)': 5, 'subscription': 20733}, {'count(*)': 5, 'subscription': 1490}]

*** SUBSCRIPTIONS: found 0 files acquired in WMBS (waiting for jobs to finish):
[]

*** DBS: found 0 blocks open in DBS. Printing the first 20 blocks only:
[]

*** DBS: found 27 files not uploaded to DBS.

==> Which maps to 1 unique datasets:
set(['/Muplus_Pt10-gun/RunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/GEN-SIM-RAW'])
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.


*** PHEDEX: found 0 files not injected in PhEDEx, with valid block id (recoverable).

*** PHEDEX: found 481745 files not injected in PhEDEx, with valid block id (unrecoverable).
==> Which maps to 3229 unique datasets:
...tons of datasets...
     '/ttZJets_13TeV_madgraphMLM/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM'])
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.


I'm done!

Cleaning up the workflows that are, in theory, long gone from the agent, i.e. the 3 workflows still showing up with subscriptions not finished:

 
cmst1@vocms0310:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414
Executing kill-workflow-in-agent vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414 ...
Canceling work for workflow: set(['vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414'])
Aborted worqueue elements:
[]
cmst1@vocms0310:/data/srv/wmagent/current $ 
cmst1@vocms0310:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent wmagent_dmason_TEST_GOOGLE_161110_204157_8813
Executing kill-workflow-in-agent wmagent_dmason_TEST_GOOGLE_161110_204157_8813 ...
Canceling work for workflow: set(['wmagent_dmason_TEST_GOOGLE_161110_204157_8813'])
Aborted worqueue elements:
[]
cmst1@vocms0310:/data/srv/wmagent/current $ 
cmst1@vocms0310:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent wmagent_dmason_TEST_GOOGLE_161111_134234_5412
Executing kill-workflow-in-agent wmagent_dmason_TEST_GOOGLE_161111_134234_5412 ...
Canceling work for workflow: set(['wmagent_dmason_TEST_GOOGLE_161111_134234_5412'])
Aborted worqueue elements:
[]
cmst1@vocms0310:/data/srv/wmagent/current $ 
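
The three kill commands above differ only in the workflow name, so they could be generated from a list. The sketch below only builds and prints the command lines (a dry run). It assumes the path of the manage script is exported in the `manage` environment variable, as the `$manage` in the transcript suggests; `kill_commands` is a hypothetical helper, not part of WMAgent.

```python
import os

# Workflow names copied from the "subscriptions not finished" list above.
UNFINISHED = [
    "vlimant_task_JME-RunIISummer16DR80-00010__v1_T_161128_025525_6414",
    "wmagent_dmason_TEST_GOOGLE_161110_204157_8813",
    "wmagent_dmason_TEST_GOOGLE_161111_134234_5412",
]


def kill_commands(workflows, manage=None):
    """Return one argv list per workflow for kill-workflow-in-agent."""
    # Assumption: the manage script path is exported as $manage; fall back
    # to the literal string "$manage" so the dry run still prints something.
    manage = manage or os.environ.get("manage", "$manage")
    return [[manage, "execute-agent", "kill-workflow-in-agent", wf]
            for wf in workflows]


if __name__ == "__main__":
    for argv in kill_commands(UNFINISHED):
        # Dry run: print only. To execute for real, pass argv to
        # subprocess.check_call instead of printing it.
        print(" ".join(argv))
```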

These workflows should be gone from the output of drainAgent.py the next time we run it. CHECK

Draining logs - take 2

cmst1@vocms0310:/data/srv/wmagent/current $ python drainAgent.py 

*** Amount of jobs in condor per workflow, sorted by condor job status:
{}

*** WORKFLOWS: found 10 distinct workflows in this agent.
prozober_ACDC0_SUS-RunIISpring16FSPremix-00083_00047_v0__170130_100804_7952                                                  	completed
pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740                                                              	completed
pdmvserv_task_BPH-PhaseIIFall16GS82-00011__v1_T_170124_203656_2417                                                           	announced
prozober_ACDC0_SUS-RunIISpring16FSPremix-00086_00047_v0__170130_104559_455                                                   	completed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01911__v1_T_170116_133411_9741                                                     	completed
pdmvserv_task_SUS-RunIISummer16DR80Premix-00135__v1_T_161124_210645_3749                                                     	announced
pdmvserv_task_TOP-RunIISummer16DR80Premix-00083__v1_T_161219_123711_1390                                                     	running-closed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01880__v1_T_170113_162606_6082                                                     	running-closed
pdmvserv_SUS-RunIISpring16FSPremix-00083_00047_v0__170116_081703_7893                                                        	completed
pdmvserv_task_EXO-RunIISummer15wmLHEGS-04239__v1_T_170114_004935_4431                                                        	announced

*** WORKFLOWS: there are 0 distinct workflows not completed.

*** WORKFLOWS: found 0 workflows not fully injected.

*** WMBS: amount of wmbs jobs in each status:
[{'count': 91920, 'name': 'cleanout'}]

*** SUBSCRIPTIONS: subscriptions not finished: 0

*** SUBSCRIPTIONS: found 0 files available in WMBS (waiting for job creation):
[]

*** SUBSCRIPTIONS: found 0 files acquired in WMBS (waiting for jobs to finish):
[]

*** DBS: found 0 blocks open in DBS. Printing the first 20 blocks only:
[]

*** DBS: found 27 files not uploaded to DBS.

==> Which maps to 1 unique datasets:
set(['/Muplus_Pt10-gun/RunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/GEN-SIM-RAW'])
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.


*** PHEDEX: found 0 files not injected in PhEDEx, with valid block id (recoverable).

*** PHEDEX: found 481745 files not injected in PhEDEx, with valid block id (unrecoverable).
==> Which maps to 3229 unique datasets:
...
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.


I'm done!

The agent is now READY for redeployment.
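
The comparison between the two drainAgent.py runs can be automated by scanning the report for the counters that changed. The sketch below is a hypothetical helper, not part of drainAgent.py. The patterns mirror the summary lines printed above, but the choice of which counters must be zero is an assumption inferred from this log: the "files not uploaded to DBS" and unrecoverable PhEDEx counters were still non-zero in the second run, so they are deliberately not gated on.

```python
import re

# Counters that dropped to zero between the two runs on this page.
# Treating exactly these as the readiness criteria is an assumption,
# not an official checklist.
ZERO_PATTERNS = {
    "workflows not completed": r"there are (\d+) distinct workflows not completed",
    "workflows not fully injected": r"found (\d+) workflows not fully injected",
    "subscriptions not finished": r"subscriptions not finished: (\d+)",
    "files available in WMBS": r"found (\d+) files available in WMBS",
    "files acquired in WMBS": r"found (\d+) files acquired in WMBS",
    "blocks open in DBS": r"found (\d+) blocks open in DBS",
}


def drain_blockers(report):
    """Return {label: count} for each gated counter that is not zero.

    A counter whose summary line is missing from the report is reported
    with a count of None, so an incomplete report never looks clean.
    """
    blockers = {}
    for label, pattern in ZERO_PATTERNS.items():
        match = re.search(pattern, report)
        count = int(match.group(1)) if match else None
        if count != 0:
            blockers[label] = count
    return blockers
```

With the output saved to a file, for example via `python drainAgent.py > drain.log`, calling `drain_blockers(open("drain.log").read())` returns an empty dict when none of the gated counters blocks redeployment.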

Topic revision: r38 - 2017-02-24 - AlanMalta
 