Tier-0 Operations - Useful commands

Contents:

  • WMAgent
  • Condor

Get information about the jobs sent by the agent running at a site (for example T0_CH_CERN)

$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN -p

Example output

T0_CH_CERN - 0 running, 28 pending, 10000 running slots total, 5000 pending slots total, Site is Normal:
  Cleanup - 0 running, 0 pending, 160 max running, 80 max pending, priority 5
  Merge - 0 running, 4 pending, 1000 max running, 400 max pending, priority 5
  Harvesting - 0 running, 0 pending, 80 max running, 40 max pending, priority 3
  Skim - 0 running, 0 pending, 1 max running, 1 max pending, priority 3
  LogCollect - 0 running, 1 pending, 80 max running, 40 max pending, priority 3
  Processing - 0 running, 23 pending, 9000 max running, 5000 max pending, priority 0
  Repack - 0 running, 0 pending, 2500 max running, 500 max pending, priority 0
  Production - 0 running, 0 pending, 1 max running, 1 max pending, priority 0
  Express - 0 running, 0 pending, 9000 max running, 500 max pending, priority 0

Get status of everything running in the pool (from the schedd)

This command should be executed from one of the schedds in the pool.
condor_status -schedd

Example output

Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs 

cmsgwms-submit1.fnal cmsgwms-su            19276         14202              9
cmsgwms-submit2.fnal cmsgwms-su             9322          9032              0
cmssrv113.fnal.gov   cmssrv113.                1             2              0
cmssrv218.fnal.gov   cmssrv218.              110          1329              0
cmssrv219.fnal.gov   cmssrv219.                0             0              0
vocms001.cern.ch     vocms001.c                8            40              0
vocms015.cern.ch     vocms015.c                0             0              0
vocms0230.cern.ch    vocms0230.                0             0              0
vocms0303.cern.ch    vocms0303.                0             0              0
vocms0308.cern.ch    vocms0308.            12413          3173              2
vocms0309.cern.ch    vocms0309.            24639         15850              8
vocms0310.cern.ch    vocms0310.            21581         18317            526
vocms0313.cern.ch    vocms0313.               27             1              0
vocms0314.cern.ch    vocms0314.               17             0              0
vocms039.cern.ch     vocms039.c                0             0              0
vocms047.cern.ch     vocms047.c                0             0              0
vocms053.cern.ch     vocms053.c                6             0              0
vocms074.cern.ch     vocms074.c             1113         19830              0
                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

               Total             88513              81776                545
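
If you only need the totals for a single schedd, the same query can be restricted with a constraint on the schedd name (the host below is only an example; replace it with the schedd you care about):

condor_status -schedd -const 'Name == "vocms0314.cern.ch"'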

Get information about the VMs at a site (for example T0_CH_CERN) from a particular pool (-pool vocms007 refers to the collector of the pool)

condor_status -const 'ParentSlotId is undefined && GLIDEIN_CMSSite=?="T0_CH_CERN"' -totals -pool vocms007

WARNING: You are querying the Central Collector of the pool; running this command from a cronjob may harm the Collector's performance.

Example output:

        X86_64/LINUX     1684     0       0        70       0          0

               Total     1684     0       0        70       0          0

This is a filtered version in case you need only the numeric value (used in the monitoring scripts)

condor_status -const 'GLIDEIN_CMSSite=?="T0_CH_CERN" && ParentSlotId is undefined' -totals | grep "X86_64/LINUX" | awk '{print $2}'

Example output:

1684

Turn off condor service

  • Stop condor.
  • If you want your data to still be available, copy your spool directory to disk:
    cp -r /mnt/ramdisk/spool /data/
  • When you reboot the machine, copy it back.
  • Otherwise, just create an empty spool directory.
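
A minimal sketch of that sequence, assuming condor is managed as a standard system service on the node and the spool lives on the ramdisk path above (adjust both to your machine):

  service condor stop                      # stop condor before touching the spool
  cp -r /mnt/ramdisk/spool /data/          # keep a copy of the spool on disk
  # ... reboot the machine ...
  cp -r /data/spool /mnt/ramdisk/          # restore the spool onto the ramdisk
  service condor start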

Change the thresholds for jobs in the queue

These thresholds are used by the agent to decide how many jobs to send to the queue. The number of jobs per task type has to be consistent with the global number of jobs (a quick way to verify the result is shown after the list).

  • $manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --task-type=Repack --pending-slots=600
  • $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --apply-to-all-tasks
  • $manage execute-agent wmagent-unregister-wmstats `hostname -f`:9999
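
After changing the thresholds, you can verify the new values with the same print command shown at the top of this page:

$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN -p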

Number of Jobs per number of cores running in the pool

 condor_q -const 'JobStatus=?=2' -name vocms0314.cern.ch -af:h RequestCpus | sort | uniq -c

Details of the command

  • Basic condor command to check the queue
          condor_q 
          
    • Adding constraints to the query of the queue: -const
    • Check only for jobs running: 'JobStatus=?=2'
    • Check only the jobs sent from the vocms0314 schedd: -name vocms0314.cern.ch
    • Choose the ClassAd to show: -af:h RequestCpus
    • Sort and count the output: sort | uniq -c
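
If you want the total number of cores in use rather than the per-job-size breakdown, a small variation (my assumption, built from the same flags) is to sum the same ClassAd with awk:

 condor_q -const 'JobStatus=?=2' -name vocms0314.cern.ch -af RequestCpus | awk '{sum += $1} END {print sum}'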


How to choose a good run to get tarballs from?

  • Automate the selection of runs for replays
  • Confirm whether the whole run is cleaned up, and with what periodicity
  • New plot for used cores
  • New plot for HighIO slot usage

Check the jobs requesting HighIO slots (RequestIoslots > 0) that are running at T0_CH_CERN. Example output from the vocms0314 schedd:

-- Schedd: vocms0314.cern.ch : <128.142.209.169:4080?...

977 jobs; 0 completed, 0 removed, 0 idle, 977 running, 0 held, 0 suspended

Excluding Merge jobs:

<verbatim>
condor_q -w -const 'MATCH_GLIDEIN_CMSSite=?="T0_CH_CERN" && JobStatus =?= 2 && RequestIoslots > 0 && CMS_JobType != "Merge"' -totals
</verbatim>

887 jobs; 0 completed, 0 removed, 0 idle, 887 running, 0 held, 0 suspended

Only Merge jobs:

<verbatim>
condor_q -w -const 'MATCH_GLIDEIN_CMSSite=?="T0_CH_CERN" && JobStatus =?= 2 && RequestIoslots > 0 && CMS_JobType == "Merge"' -totals
</verbatim>
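
A possible one-shot alternative (an assumption on my part, combining flags already used elsewhere on this page) is to group the running HighIO jobs by job type instead of issuing one query per type:

<verbatim>
condor_q -const 'MATCH_GLIDEIN_CMSSite=?="T0_CH_CERN" && JobStatus =?= 2 && RequestIoslots > 0' -af CMS_JobType | sort | uniq -c
</verbatim>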

Node      | Use                                                          | Type
vocms001  | Replays (normally used by the developer), Transfer System    | Virtual machine
vocms015  | Replays (normally used by the operators), Tier0 Monitoring   | Virtual machine
vocms047  | Replays (normally used by the operators)                     | Virtual machine
vocms0313 | Production node                                              | Physical machine
vocms0314 | Production node                                              | Physical machine
To restart one of these nodes you need to check the following (a short command sketch follows the list):
  • Production node:
    • The agent is not running and the couch processes were stopped correctly.
    • These nodes use a RAMDISK. Mounting it is puppetized, so make sure that puppet has run before starting the agent again.
  • TransferSystem:
  • Replays:
    • The agent should not be running; check the Tier0 Elog to make sure you are not interfering with an ongoing test.
  • Tier0 Monitoring:
    • The monitoring is executed via a cronjob, so the only consequence of the restart should be that no reports are produced during the downtime. However, you can check that everything is working by looking at the logs in:
      /data/tier0/sls/scripts/Logs
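
A minimal sketch of these pre-restart checks, assuming the standard layout on the production nodes (process names, mount point, and log path are assumptions; adjust them to the node at hand):

  ps aux | egrep -i 'wmcore|couch' | grep -v grep   # no agent or couch processes should remain
  mount | grep ramdisk                              # after the reboot: puppet has remounted the RAMDISK
  ls -ltr /data/tier0/sls/scripts/Logs | tail       # monitoring node: check that recent cron reports exist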

To restart a machine you need to:

  • Login and become root
  • Do the particular checks (listed above) for the machine that you are restarting
  • Run the restart command
    shutdown -r now

After restarting the machine, it is convenient to run puppet. You can either wait for the periodic run or execute it manually:

  • puppet agent -tv