vocms0308
Agent tweaks
- AgentStatusWatcher disabled ==> OK
- Set maxRetries to 0 ==> OK
- Restart the necessary components ==> OK
$manage execute-agent wmcoreD --restart --components=RetryManager,ErrorHandler,AgentStatusWatcher
- Enable ALL sites and set high slot thresholds for them ==> OK
$manage db-prompt wmagent
UPDATE wmbs_location SET state=(SELECT id from wmbs_location_state where name='Normal') WHERE state!=(SELECT id from wmbs_location_state where name='Normal');
UPDATE wmbs_location SET running_slots=2000, pending_slots=1000;
UPDATE rc_threshold SET max_slots=2000, pending_slots=1000;
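The effect of the first two UPDATE statements can be tried out safely on a throwaway database before touching the agent. The sketch below is only an illustration: it uses an in-memory SQLite database instead of the agent's real backend, a toy schema holding just the columns named above, and made-up site names and non-Normal state names (all assumptions, not taken from the agent).

```python
import sqlite3

# In-memory SQLite stand-in for the agent database (assumption: the real
# agent uses a different backend and a richer schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy schema: only the columns referenced by the UPDATE statements.
cur.execute("CREATE TABLE wmbs_location_state (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE wmbs_location ("
    "site_name TEXT, state INTEGER, running_slots INTEGER, pending_slots INTEGER)"
)

# Hypothetical states and sites, for illustration only.
cur.executemany(
    "INSERT INTO wmbs_location_state VALUES (?, ?)",
    [(1, "Normal"), (2, "Down"), (3, "Draining")],
)
cur.executemany(
    "INSERT INTO wmbs_location VALUES (?, ?, ?, ?)",
    [("SiteA", 1, 100, 50), ("SiteB", 2, 0, 0), ("SiteC", 3, 10, 5)],
)

# Same statements as in the log: put every site back to 'Normal' ...
cur.execute(
    "UPDATE wmbs_location "
    "SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') "
    "WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal')"
)
# ... and raise the slot limits for all of them.
cur.execute("UPDATE wmbs_location SET running_slots=2000, pending_slots=1000")
conn.commit()

rows = cur.execute(
    "SELECT site_name, state, running_slots, pending_slots "
    "FROM wmbs_location ORDER BY site_name"
).fetchall()
for row in rows:
    print(row)
```

After both statements every toy site is in the Normal state with 2000 running and 1000 pending slots, which is the situation the drain relies on so that no leftover job is held back by a site state or threshold.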
Draining logs
cmst1@vocms0308:/data/srv/wmagent/current $ python drainAgent.py
*** Amount of jobs in condor per workflow, sorted by condor job status:
{}
*** WORKFLOWS: found 7 distinct workflows in this agent.
pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740 completed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01881__v1_T_170113_162606_4964 running-closed
pdmvserv_HIG-RunIISummer15wmLHEGS-00691_00241_v0__161118_162113_5732 aborted-archived
pdmvserv_task_HIG-RunIISummer16DR80Premix-01911__v1_T_170116_133411_9741 completed
pdmvserv_task_TOP-RunIISummer16DR80Premix-00083__v1_T_161219_123711_1390 running-closed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01880__v1_T_170113_162606_6082 running-closed
vlimant_HIG-RunIISummer15wmLHEGS-00688_00246_v0__161123_135709_4103 aborted-archived
*** WORKFLOWS: there are 2 distinct workflows not completed.
pdmvserv_HIG-RunIISummer15wmLHEGS-00691_00241_v0__161118_162113_5732 aborted-archived
vlimant_HIG-RunIISummer15wmLHEGS-00688_00246_v0__161123_135709_4103 aborted-archived
*** WORKFLOWS: found 0 workflows not fully injected.
*** WMBS: amount of wmbs jobs in each status:
[{'count': 181825, 'name': 'cleanout'}, {'count': 1005, 'name': 'executing'}]
*** WMBS: 2 workflows with executing jobs in wmbs:
pdmvserv_HIG-RunIISummer15wmLHEGS-00691_00241_v0__161118_162113_5732 aborted-archived
vlimant_HIG-RunIISummer15wmLHEGS-00688_00246_v0__161123_135709_4103 aborted-archived
*** SUBSCRIPTIONS: subscriptions not finished: 2
pdmvserv_HIG-RunIISummer15wmLHEGS-00691_00241_v0__161118_162113_5732 aborted-archived
vlimant_HIG-RunIISummer15wmLHEGS-00688_00246_v0__161123_135709_4103 aborted-archived
*** SUBSCRIPTIONS: found 10 files available in WMBS (waiting for job creation):
[{'count(*)': 3239, 'subscription': 6892}, {'count(*)': 17960, 'subscription': 66}, {'count(*)': 322, 'subscription': 6888}, {'count(*)': 33, 'subscription': 6890}, {'count(*)': 1741, 'subscription': 61}, {'count(*)': 322, 'subscription': 6891}, {'count(*)': 102, 'subscription': 64}, {'count(*)': 225, 'subscription': 6887}, {'count(*)': 1126, 'subscription': 62}, {'count(*)': 1126, 'subscription': 65}]
*** SUBSCRIPTIONS: found 0 files acquired in WMBS (waiting for jobs to finish):
[]
*** DBS: found 0 blocks open in DBS. Printing the first 20 blocks only:
[]
*** DBS: found 0 files not uploaded to DBS.
*** PHEDEX: found 0 files not injected in PhEDEx, with valid block id (recoverable).
*** PHEDEX: found 589056 files not injected in PhEDEx, with valid block id (unrecoverable).
==> Which maps to 3416 unique datasets:
set(['/ADDGravToGG_MS-3000_NED-2_KK-1_M-1000To2000_13TeV-sherpa/RunIISummer15GS-MCRUN2_71_V1-v2/GEN-SIM',
'/ADDGravToGG_MS-3000_NED-2_KK-4_M-2000To3000_13TeV-sherpa/RunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/AODSIM',
...
'/ttZJets_13TeV_madgraphMLM/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM'])
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.
I'm done!
cmst1@vocms0308:/data/srv/wmagent/current $
The two aborted-archived requests still have executing jobs and unfinished subscriptions in WMBS, which means we have to force-kill them locally to get them out of the system and make the agent ready for a new release.
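The numbers from this first run can be condensed into a quick check. This is a minimal sketch, not part of drainAgent.py: the helper names are hypothetical, and treating 'cleanout' as the only final job state is an assumption based on the second run, where only 'cleanout' jobs remain.

```python
# Values copied from the first drainAgent.py run above.
wmbs_jobs = [{'count': 181825, 'name': 'cleanout'},
             {'count': 1005, 'name': 'executing'}]
available = [{'count(*)': 3239, 'subscription': 6892},
             {'count(*)': 17960, 'subscription': 66},
             {'count(*)': 322, 'subscription': 6888},
             {'count(*)': 33, 'subscription': 6890},
             {'count(*)': 1741, 'subscription': 61},
             {'count(*)': 322, 'subscription': 6891},
             {'count(*)': 102, 'subscription': 64},
             {'count(*)': 225, 'subscription': 6887},
             {'count(*)': 1126, 'subscription': 62},
             {'count(*)': 1126, 'subscription': 65}]


def pending_jobs(job_status, final_states=("cleanout",)):
    """Return {state: count} for jobs that are not in a final state."""
    return {row['name']: row['count']
            for row in job_status
            if row['name'] not in final_states}


def total_available_files(rows):
    """Sum the per-subscription file counts."""
    return sum(row['count(*)'] for row in rows)


blocking = pending_jobs(wmbs_jobs)
print(blocking)                                          # {'executing': 1005}
print(len(available), total_available_files(available))  # 10 26196
```

The "10" in the script output seems to count the subscription rows in that list. Summing the per-subscription counts gives 26196 files still waiting for job creation.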
Draining logs - take 2
cmst1@vocms0308:/data/srv/wmagent/current $ python drainAgent.py
*** Amount of jobs in condor per workflow, sorted by condor job status:
{}
*** WORKFLOWS: found 6 distinct workflows in this agent.
pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740 completed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01881__v1_T_170113_162606_4964 running-closed
pdmvserv_HIG-RunIISummer15wmLHEGS-00691_00241_v0__161118_162113_5732 aborted-archived
pdmvserv_task_HIG-RunIISummer16DR80Premix-01911__v1_T_170116_133411_9741 completed
pdmvserv_task_TOP-RunIISummer16DR80Premix-00083__v1_T_161219_123711_1390 completed
pdmvserv_task_HIG-RunIISummer16DR80Premix-01880__v1_T_170113_162606_6082 running-closed
*** WORKFLOWS: there are 0 distinct workflows not completed.
*** WORKFLOWS: found 0 workflows not fully injected.
*** WMBS: amount of wmbs jobs in each status:
[{'count': 128831, 'name': 'cleanout'}]
*** SUBSCRIPTIONS: subscriptions not finished: 0
*** SUBSCRIPTIONS: found 0 files available in WMBS (waiting for job creation):
[]
*** SUBSCRIPTIONS: found 0 files acquired in WMBS (waiting for jobs to finish):
[]
*** DBS: found 0 blocks open in DBS. Printing the first 20 blocks only:
[]
*** DBS: found 0 files not uploaded to DBS.
*** PHEDEX: found 0 files not injected in PhEDEx, with valid block id (recoverable).
*** PHEDEX: found 589056 files not injected in PhEDEx, with valid block id (unrecoverable).
==> Which maps to 3416 unique datasets:
set(['/ADDGravToGG_MS-3000_NED-2_KK-1_M-1000To2000_13TeV-sherpa/RunIISummer15GS-MCRUN2_71_V1-v2/GEN-SIM',
...
'/ttZJets_13TeV_madgraphMLM/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM'])
... that were NOT produced by any agent-known workflow. OR, the wfs are gone already.
I'm done!
This looks much better now. Just double-checking the status of the local workqueue(_inbox) databases:
cmst1@vocms0308:/data/srv/wmagent/current $ python localWorkQueueStatus.py
INFO:root:************* LOCAL workqueue elements summary ************
INFO:root:Found a total of 471 elements in the 'workqueue_inbox' db
INFO:root:{u'Done': 471}
INFO:root:Found a total of 471 elements in the 'workqueue' db
INFO:root:{u'Done': 471}
INFO:root:
With that, this node is ready to receive a new WMAgent release.
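The workqueue condition used above (every element in both local databases is Done) can be written down as a small check. This is a sketch only; the function name is hypothetical and the summaries are typed in by hand from the localWorkQueueStatus.py output rather than read from the databases.

```python
def queue_is_done(total, by_status):
    """True if all `total` elements are accounted for and in status 'Done'."""
    return by_status.get('Done', 0) == total and sum(by_status.values()) == total


# (total elements, status summary) as printed by localWorkQueueStatus.py
inbox = (471, {u'Done': 471})   # 'workqueue_inbox' db
queue = (471, {u'Done': 471})   # 'workqueue' db

ready = all(queue_is_done(total, summary) for total, summary in (inbox, queue))
print(ready)  # True
```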