WMAgent Preparation of Production Machines

Objective

The objective of this twiki is to explain the steps and configuration changes that must be applied to the Production WMAgents after they are installed and before they start to receive work. The steps must be performed in order. Any special tweaking that depends on whether the agent is used for MC or Reprocessing is mentioned in the relevant step.

Applying patches to WMAgents

You'll need either the cmst1 password or to talk to Alan. Connect to aiadm.cern.ch as the cmst1 user and run the following command. NOTICE: update the list of patches (pull request numbers) you want to apply; you may also need to change the directory.
for h in vocms0{250,251,252,254,255,256,257,258}; do echo ""; ssh cmst1@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n   ********** Patching `hostname` ************";
for pr in {7168,7187,7190,7198}; do
wget -nv https://github.com/dmwm/WMCore/pull/$pr.patch -O - | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
done;
$manage stop-agent;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage start-agent'; done
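
Before running the loop on real machines, it can help to print the commands it would run. The following is a minimal sketch of such a dry run. The function names are hypothetical and not part of WMCore. The URL pattern, target directory and strip level are copied from the command above.

```python
def expand_hosts(prefix, suffixes):
    """Mimic the shell brace expansion, e.g. vocms0{250,251}."""
    return [prefix + suffix for suffix in suffixes]


def build_patch_command(pr_numbers,
                        site_packages="apps/wmagent/lib/python2*/site-packages/",
                        strip=3):
    """Return the shell snippet that downloads and applies each pull request patch."""
    steps = []
    for pr in pr_numbers:
        url = "https://github.com/dmwm/WMCore/pull/%d.patch" % pr
        steps.append("wget -nv %s -O - | patch -d %s -p %d" % (url, site_packages, strip))
    return "; ".join(steps)


if __name__ == "__main__":
    # Dry run only: nothing is executed, the commands are just printed.
    for host in expand_hosts("vocms0", ["250", "251"]):
        print("%s: %s" % (host, build_patch_command([7168, 7187])))
```

Reviewing the printed commands first makes it easier to catch a wrong pull request number or directory before any agent is touched.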

For FNAL agents, we need to connect to one of the agents instead of using cmslpc, since it uses a different shell. Example:

for h in cmsgwms-submit{3,4}; do echo ""; ssh cmsdataops@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n   ********** Patching `hostname` ************";
for pr in 5574; do
wget -nv https://github.com/dmwm/WMCore/pull/$pr.patch -O - | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
done;
$manage execute-agent wmcoreD --shutdown --components=AgentStatusWatcher;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components=AgentStatusWatcher'; done

Condor upgrade

When the agent is drained, you should contact Krista Majewski (klarson1@fnalNOSPAMPLEASE.gov) to coordinate the condor version upgrade. The machine needs to be restarted after doing this.

Patching LHEStepZero agent

For the dedicated agent that runs LHEStepZero workflows at CERN with .lhe input files, there is an additional patch that is not applied to other agents: 352ff5. Currently this only applies to vocms237.

Adding the sites to the WMAgent resource DB

/data/srv/wmagent/current > ./config/wmagent/manage execute-agent wmagent-resource-control --add-all-sites  --plugin=CondorPlugin --no-disk

Setting Site Thresholds and Status

  • Copy the script for removing old jobs.
cp /afs/cern.ch/user/j/jbadillo/public/rmOldJobs.sh ./
  • Download the thresholds and status scripts
#download status and thresholds scripts
wget https://raw.github.com/CMSCompOps/WmAgentScripts/master/updateSiteStatus.py --no-check-certificate
wget https://raw.github.com/CMSCompOps/WmAgentScripts/master/thresholdsFromSSB.py --no-check-certificate
  • Create script cronjobs with:
    • Type:
      crontab -e
    • Add the following lines:
 
#Update site status
1,21,41 * * * * (source /data/admin/wmagent/env.sh ; source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh ; python /data/srv/wmagent/current/updateSiteStatus.py ) &> /tmp/updateSiteStatus.log
#Update site thresholds
1,21,41 * * * * (source /data/admin/wmagent/env.sh ; source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh ; python /data/srv/wmagent/current/thresholdsFromSSB.py ) &> /tmp/thresholdsFromSSB.log
#remove old jobs script
10 */4 * * * source /data/srv/wmagent/current/rmOldJobs.sh &> /tmp/rmJobs.log

If the cronjobs are not in the crontab list, add them with crontab -e.
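
The check for missing cronjobs can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes the crontab text (for example the output of crontab -l) has already been read into a string. An entry counts as present only if an active, non-commented line mentions the script.

```python
EXPECTED_SCRIPTS = ("updateSiteStatus.py", "thresholdsFromSSB.py", "rmOldJobs.sh")


def missing_cron_entries(crontab_text, expected=EXPECTED_SCRIPTS):
    """Return the expected script names that no active crontab line refers to."""
    active = [line for line in crontab_text.splitlines()
              if line.strip() and not line.lstrip().startswith("#")]
    return [name for name in expected
            if not any(name in line for line in active)]


if __name__ == "__main__":
    sample = (
        "#Update site status\n"
        "#1,21,41 * * * * python /data/srv/wmagent/current/updateSiteStatus.py\n"
        "10 */4 * * * source /data/srv/wmagent/current/rmOldJobs.sh &> /tmp/rmJobs.log\n"
    )
    print(missing_cron_entries(sample))
```

An empty result means all three jobs are in place; otherwise the returned names are the ones to add.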

Set up the Removal of Old Jobs Script

This step is needed for all agents.

Verify that rmOldJobs.sh file is here:

 /data/srv/wmagent/current/rmOldJobs.sh 

Otherwise get it from another MC WMAgent.

Verify the following cronjob is in place:

 10 */4 * * * source /data/srv/wmagent/current/rmOldJobs.sh &> /tmp/rmJobs.log 

Add it if it's not there.

Configuration changes

In the configuration file located here:

 /data/srv/wmagent/current/config/wmagent/config.py 

For Monte Carlo:

config.JobStatusLite.stateTimeouts = {'Running': 172800, 'Pending': 259200, 'Error': 1800} 

For Reprocessing:

config.JobStatusLite.stateTimeouts = {'Running': 604800, 'Pending': 259200, 'Error': 1800}
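
The timeout values are easier to review when written as durations. The sketch below assumes the values are in seconds (the page does not state the unit) and uses a hypothetical helper to rebuild the two dictionaries: 2 days running for Monte Carlo, 7 days running for Reprocessing, 3 days pending and 30 minutes in error for both.

```python
from datetime import timedelta


def seconds(**duration):
    """Convert a duration such as days=2 or minutes=30 to whole seconds."""
    return int(timedelta(**duration).total_seconds())


# Assumption: stateTimeouts values are expressed in seconds.
MONTE_CARLO_TIMEOUTS = {
    'Running': seconds(days=2),
    'Pending': seconds(days=3),
    'Error': seconds(minutes=30),
}

REPROCESSING_TIMEOUTS = {
    'Running': seconds(days=7),
    'Pending': seconds(days=3),
    'Error': seconds(minutes=30),
}

if __name__ == "__main__":
    print(MONTE_CARLO_TIMEOUTS)
    print(REPROCESSING_TIMEOUTS)
```

Under that assumption, the only difference between the two setups is how long a job may stay in the Running state.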

Change the default retry policy to the following:

  • Monte Carlo:
    config.ErrorHandler.maxRetries = {'default' : 3, 'Harvesting' : 2, 'Merge' : 4, 'LogCollect' : 1, 'Cleanup' : 2}
  • Reprocessing:
    config.ErrorHandler.maxRetries = {'default' : 3, 'Merge' : 4, 'LogCollect' : 2, 'Cleanup' : 2}

Add these lines:
config.BossAir.submitWMSMode = True
config.PhEDExInjector.diskSites = ["storm-fe-cms.cr.cnaf.infn.it","srm-cms-disk.gridpp.rl.ac.uk","cmssrm-fzk.gridka.de", "srmcms.pic.es"]

  • Check the ACDC couch URL; on a production agent it should point to the central couch.
config.ACDC.couchurl = 'https://cmsweb.cern.ch/couchdb'
  • Check the Request Manager Service URL (this setting is new in versions > 0.9.82).
config.TaskArchiver.ReqMgrServiceURL = 'https://cmsweb.cern.ch/reqmgr/rest'
  • Check the TaskArchiver couch URL (the value that comes with the installation is usually wrong).
config.TaskArchiver.workloadSummaryCouchURL = 'https://cmsweb.cern.ch/couchdb'

For all agents.

Add the following lines to config.py, in order to make the DBS3Upload component available in agents >= 0.9.58 (make sure to ALWAYS use the production cmsweb):

config.component_('DBS3Upload')
config.DBS3Upload.workerThreads = 1
config.DBS3Upload.componentDir = '/data/srv/wmagent/v0.9.59/install/wmagent/DBS3Upload'
config.DBS3Upload.logLevel = 'INFO'
config.DBS3Upload.namespace = 'WMComponent.DBS3Buffer.DBS3Upload'
config.DBS3Upload.dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSWriter'
config.DBS3Upload.pollInterval = 100
config.DBS3Upload.dbs3UploadOnly = True
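
Since several settings must point to the production cmsweb, a quick check of the URLs can catch a leftover test value. This is a minimal sketch, not part of WMAgent: the function name is hypothetical, and it assumes the relevant settings have been copied by hand into a plain dictionary rather than read from config.py. The non-production URL in the example is purely illustrative.

```python
PRODUCTION_PREFIX = "https://cmsweb.cern.ch/"


def non_production_urls(settings, prefix=PRODUCTION_PREFIX):
    """Return the names of settings whose URL does not start with the production prefix."""
    return sorted(name for name, url in settings.items()
                  if not url.startswith(prefix))


if __name__ == "__main__":
    settings = {
        "ACDC.couchurl": "https://cmsweb.cern.ch/couchdb",
        "TaskArchiver.ReqMgrServiceURL": "https://cmsweb.cern.ch/reqmgr/rest",
        # Illustrative wrong value, to show what gets flagged.
        "TaskArchiver.workloadSummaryCouchURL": "https://cmsweb-testbed.cern.ch/couchdb",
        "DBS3Upload.dbsUrl": "https://cmsweb.cern.ch/dbs/prod/global/DBSWriter",
    }
    print(non_production_urls(settings))
```

Any name in the returned list should be corrected in config.py before the agent is started.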

Installation of Couch watchdog

This needs to be done for all agents:

  1. Stop wmagent
  2. Make sure no wmcoreD processes are running.
  3. Bring down couch.
  4. Modify:
     /data/srv/wmagent/current/sw/slc5_amd64_gcc461/external/couchdb/1.1.0-comp10/bin/couchdb 
    Change line 28, RESPAWN_TIMEOUT from 0 to 5.
  5. Bring up couch
  6. Bring up wmagent
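
The edit in step 4 can also be sketched as a text substitution. The snippet below assumes the setting appears in the couchdb script as a line of the form RESPAWN_TIMEOUT=0; check the actual file before relying on it. The function name is hypothetical, and the sketch only transforms a string; it does not touch any file.

```python
import re


def set_respawn_timeout(script_text, new_value=5):
    """Replace the RESPAWN_TIMEOUT value; return (new_text, number_of_lines_changed)."""
    pattern = re.compile(r"^(\s*RESPAWN_TIMEOUT=)\d+", re.MULTILINE)
    return pattern.subn(r"\g<1>%d" % new_value, script_text)


if __name__ == "__main__":
    sample = "BACKGROUND=false\nRESPAWN_TIMEOUT=0\nSHUTDOWN=false\n"
    new_text, changed = set_respawn_timeout(sample)
    print(changed)
    print(new_text)
```

A change count other than 1 suggests the script does not look as assumed and should be edited by hand instead.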
-- EdgarFajardo - 02-Oct-2012
Topic revision: r66 - 2018-02-06 - AlanMalta
 