
Deployment of CRAB TaskWorker


All questions about the TaskWorker and about deploying it on a private machine should go to hn-cms-crabDevelopment@cern.ch

For production and pre-production: the local service account used to deploy, run and operate the service is crab3.

Introduction

This twiki explains how to deploy the CRAB server backend (a.k.a. the CRAB TaskWorker or TaskManager). It will guide you through the steps required to:

  1. Get a virtual machine (VM) with the right architecture from the CERN OpenStack Cloud Infrastructure.
  2. Install the required software on the machine.
  3. Configure the machine.

Note: in the examples below, the following kinds of content are distinguished (by color on the original twiki):

Commands to execute
Output sample of the executed commands
Configuration files
Other files

Get and install a virtual machine

See Deployment of CRAB REST Interface / Get and install a virtual machine. Make sure to go through the Machine preparation steps if you do not plan to install the REST interface on this machine.

Prepare directories and host cert to be used by the TaskWorker running as the service user

#sudo mkdir /data/certs  # This should have been done already by the Deploy script when installing the VM.
sudo mkdir /data/srv/TaskManager /data/certs/creds /data/srv/tmp
sudo chmod 700 /data/certs /data/certs/creds
sudo touch /data/srv/condor_config
#sudo cp -p /etc/grid-security/host{cert,key}.pem /data/certs # This should have been done already by the Deploy script when installing the VM.

For production and pre-production installations:

sudo chown crab3:zh /data/srv/TaskManager /data/certs /data/certs/creds /data/srv/tmp
sudo chown crab3:zh /data/certs/host{cert,key}.pem

For private installations:

sudo chown `whoami`:zh /data/srv/TaskManager /data/certs /data/certs/creds /data/srv/tmp
sudo chown `whoami`:zh /data/certs/host{cert,key}.pem

Set up a service certificate for interacting with CMSWEB

A service certificate is needed, with DN=/DC=ch/DC=cern/OU=computers/CN=tw/vocms052.cern.ch (or whatever the correct host name is). As of 2017, it is provided yearly by James Letts, who takes care of this for all Submission Infrastructure related machines; if needed, anybody can request one using the procedure indicated here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsGlideinWMSCerts

The service certificate must be registered in VO CMS and in SiteDB. The certificate and private key should be in /data/certs/servicecert.pem and /data/certs/servicekey.pem respectively, and should be readable only by the service user.
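To verify the installed credentials, one can check the DN, the expiry date, and that the certificate and private key actually belong together. The helper below is a sketch (the function name is ours, not part of CRAB; the paths in the usage line are the ones used in this guide):

```shell
# Sketch: print the certificate's DN and expiry, and verify that cert and key
# match by comparing their public-key digests.
check_service_cert() {  # check_service_cert <cert file> <key file>
    local cert=$1 key=$2 cpub kpub
    openssl x509 -noout -subject -enddate -in "$cert" || return 1
    cpub=$(openssl x509 -noout -pubkey -in "$cert" | openssl sha256) || return 1
    kpub=$(openssl pkey -pubout -in "$key" 2>/dev/null | openssl sha256) || return 1
    # the digests are equal only if the key pair belongs to the certificate
    [ "$cpub" = "$kpub" ]
}
# Usage: check_service_cert /data/certs/servicecert.pem /data/certs/servicekey.pem
```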

Alternatively, and in particular for private installations, you can install your own short lived proxy (starting from your own account):

voms-proxy-init --voms cms --valid 192:00
sudo cp /tmp/x509up_u$UID /data/certs/servicecert.pem
sudo cp /tmp/x509up_u$UID /data/certs/servicekey.pem
sudo chmod 600 /data/certs/servicecert.pem
sudo chmod 400 /data/certs/servicekey.pem

For production and pre-production also do:

sudo chown crab3:zh /data/certs/service{cert,key}.pem

For private installations:

sudo chown `whoami`:zh /data/certs/service{cert,key}.pem

The proxy is created for 8 days (192 hours), because this is the maximum allowed duration of the VO CMS extension. The proxy therefore has to be renewed at least every 7 days. You can do this manually (re-executing the commands above) or set up an automatic renewal procedure like the one used in production and pre-production.
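As a sketch of such an automatic renewal, the manual steps above can be wrapped in a small function and driven by cron. The function name, script path and schedule below are our choices, not the actual production setup:

```shell
# Sketch of a renewal helper: re-run the manual proxy steps from this guide.
renew_proxy() {
    voms-proxy-init --voms cms --valid 192:00 || return 1
    cp /tmp/x509up_u$(id -u) /data/certs/servicecert.pem
    cp /tmp/x509up_u$(id -u) /data/certs/servicekey.pem
    chmod 600 /data/certs/servicecert.pem
    chmod 400 /data/certs/servicekey.pem
}
# Put the body above in a script and call it from cron, e.g. every 6 days
# (one day before the proxy would expire). Example crontab entry:
# 0 5 */6 * * /data/srv/TaskManager/renew_proxy.sh >> /data/srv/TaskManager/renew_proxy.log 2>&1
```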

Contact Analysis Ops for new instances

In addition to GSI authentication, CRAB3 operators whitelist the hostnames allowed to submit jobs for CRAB3. If this is a brand-new hostname, you will need to contact CRAB operator(s) to add this TaskWorker to the whitelist.

TaskWorker installation and configuration

Installation

For production and pre-production, switch to the service user crab3:

sudo -u crab3 -i bash

Create a script (/data/srv/TaskManager/taskworker_install.sh) to install the TaskWorker. Here we use release 3.3.1607.rc7 from the comp.erupeika repository; the list of possible releases is on github, and recent ones are available in comp.erupeika:

cd /data/srv/TaskManager

pkill -f TaskWorker

set -x
export RELEASE=3.3.1607.rc7
export REPO=comp.erupeika
export MYTESTAREA=/data/srv/TaskManager/$RELEASE
#For CC7 machines use slc7_amd64_gcc630
export SCRAM_ARCH=slc6_amd64_gcc493
export verbose=true
mkdir -p $MYTESTAREA
wget -O $MYTESTAREA/bootstrap.sh http://cmsrep.cern.ch/cmssw/repos/bootstrap.sh
sh $MYTESTAREA/bootstrap.sh -architecture $SCRAM_ARCH -path $MYTESTAREA -repository $REPO setup
#This should not be done on CC7
source $MYTESTAREA/$SCRAM_ARCH/external/apt/*/etc/profile.d/init.sh
$RELEASE/common/cmspkg -a $SCRAM_ARCH upgrade
$RELEASE/common/cmspkg -a $SCRAM_ARCH update
$RELEASE/common/cmspkg -a $SCRAM_ARCH install cms+crabtaskworker+$RELEASE
set +x

rm -f current
ln -s $RELEASE current
cd $MYTESTAREA

  • Install the TaskWorker:

source /data/srv/TaskManager/taskworker_install.sh

Configuration

Create a TaskWorker configuration file /data/srv/TaskManager/current/TaskWorkerConfig.py with the following content:

from multiprocessing import cpu_count
from WMCore.Configuration import ConfigurationEx

config = ConfigurationEx()

## External services URLs
config.section_("Services")
config.Services.PhEDExurl = 'https://phedex.cern.ch'
config.Services.DBSUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
config.Services.MyProxy = 'myproxy.cern.ch'

config.section_("TaskWorker")
config.TaskWorker.polling = 20 #seconds
# We can add one worker per core, plus some spare ones, since most of the actions wait for I/O
config.TaskWorker.nslaves = 5
config.TaskWorker.name = 'CHANGE_ME'
config.TaskWorker.recurringActions = ['RenewRemoteProxies']
config.TaskWorker.scratchDir = '/data/srv/tmp' # Make sure this directory exists.

# Possible values for mode are:
#   - cmsweb-dev
#   - cmsweb-preprod
#   - cmsweb-prod
#   - private
config.TaskWorker.mode = 'private'
# If mode is 'private', then the REST server host name is needed
config.TaskWorker.resturl = 'CHANGE_ME'

# The two parameters below are used to contact cmsweb services for the REST-DB interactions
config.TaskWorker.cmscert = '/data/certs/servicecert.pem'
config.TaskWorker.cmskey = '/data/certs/servicekey.pem'

config.TaskWorker.backend = 'glidein'

# Retry policy for any problem submitting task to HTC
config.TaskWorker.max_retry = 5
config.TaskWorker.retry_interval = [20, 40, 60, 80, 0]

# Config for SLS alarms. If this patch is included and service is for pre/production (https://github.com/dmwm/CRABServer/pull/4652)
# Name should start with CMS_Anything.
#config.TaskWorker.XML_Report_ID = "CMS_CRAB3_TaskWorker"

# Default is False. If True, dagman will not retry the job on ASO failures.
config.TaskWorker.retryOnASOFailures = True

# Default is 0. If -1 no ASO timeout, if transfer is stuck in ASO we'll retry the postjob FOREVER (well, eventually a dagman timeout for the node will be hit).
# If 0 default timeout of 4 to 6 hours will be used. If specified the timeout set will be used (minutes).
config.TaskWorker.ASOTimeout = 86400

# Control the ordering of stageout attempts.
# - remote means a copy from the worker node to the final destination SE directly.
# - local means a copy from the worker node to the worker node site's SE.
# One can include any combination of the above, or leaving one of the methods out.
# For example, CRAB2 is effectively:
# config.TaskWorker.stageoutPolicy = ["remote"]
# This is the CRAB3 default:
config.TaskWorker.stageoutPolicy = ["local", "remote"]
config.TaskWorker.dashboardTaskType = 'analysis'

# How many times a job can be retried automatically under a recoverable error.
config.TaskWorker.numAutomJobRetries = 2

# 0 - number of post jobs = max( (# jobs)*.1, 20)
# -1 - no limit
# This is needed for Site Metrics
config.TaskWorker.maxPost = 20

# Activities that should not be blocked at any site, e.g. Site Metrics and, if needed, other activities
#config.TaskWorker.ActivitiesToRunEverywhere = ['hctest']

config.section_("Sites")
config.Sites.banned = [ ]
config.Sites.available = [ ]


config.section_("MyProxy")
config.MyProxy.serverhostcert = '/data/certs/hostcert.pem'
config.MyProxy.serverhostkey = '/data/certs/hostkey.pem'

# Since version 3.3.9.rc2, uisource is optional. It was previously required
# to get rid of cms_ui_env, which is enabled by default on SLC6.
#config.MyProxy.uisource = '/afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh'

config.MyProxy.cleanEnvironment = True
config.MyProxy.credpath = '/data/certs/creds' #make sure this directory exists
config.MyProxy.serverdn = 'CHANGE_ME'

The parameters that you have to change are marked with 'CHANGE_ME'. They are:

Parameter          Type    Explanation
TaskWorker.name    string  A name that identifies this TaskWorker, e.g. the host name.
TaskWorker.resturl string  Host name of the REST frontend (e.g. cmsweb.cern.ch, cmsweb-testbed.cern.ch, myprivateRESThost.cern.ch). Despite its name, this parameter takes a host name, not a URL; see https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/MasterWorker.py#L126
MyProxy.serverdn   string  Grid host certificate DN, as given by: openssl x509 -noout -subject -in /data/certs/hostcert.pem | cut -d ' ' -f2

Before running the TaskWorker, double-check all the parameters. If you don't know what parameters to use, please contact hn-cms-crabDevelopment@cern.ch
The latest versions of all the configuration files for the TaskWorker instances in Prod, PreProd and Dev are kept in the same git repository as the CRAB3 REST config: https://gitlab.cern.ch/crab3/CRAB3ServerConfig/tree/master/Taskworker/config
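A cheap way to catch typos after editing TaskWorkerConfig.py is to check that the file at least executes as Python. The helper below is ours, not part of CRAB; run it inside the TaskWorker environment so that the WMCore import at the top of the config can be resolved (PYTHON defaults to the environment's python):

```shell
# Sketch: execute a config file in a throwaway interpreter; a non-zero exit
# status means it does not even parse/run as Python.
config_parses() {  # config_parses <config file>
    "${PYTHON:-python}" -c "exec(open('$1').read())"
}
# Usage: config_parses /data/srv/TaskManager/current/TaskWorkerConfig.py
```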

Using your own CRABServer (and WMCore) repository

If you are a developer, most probably you want to use your own repositories.

1) Clone the repositories.

Fork the dmwm/CRABServer and dmwm/WMCore repositories on github to have them under your github username (you need to have a github account) and clone them.

Note: The instructions below put the cloned repositories into the /data/user/ directory of your host. If you will install the TaskWorker and the REST Interface on different hosts, it is better to use a location accessible from both hosts, e.g. AFS.

cd /data/user/
git clone https://github.com/<your-github-username>/CRABServer
cd CRABServer
git remote -v
origin   https://github.com/<your-github-username>/CRABServer (fetch)
origin   https://github.com/<your-github-username>/CRABServer (push)
git remote add upstream https://github.com/dmwm/CRABServer
git remote -v
origin   https://github.com/<your-github-username>/CRABServer (fetch)
origin   https://github.com/<your-github-username>/CRABServer (push)
upstream   https://github.com/dmwm/CRABServer (fetch)
upstream   https://github.com/dmwm/CRABServer (push)

cd /data/user/
git clone https://github.com/<your-github-username>/WMCore
cd WMCore
git remote -v
origin   https://github.com/<your-github-username>/WMCore (fetch)
origin   https://github.com/<your-github-username>/WMCore (push)
git remote add upstream https://github.com/dmwm/WMCore
git remote -v
origin   https://github.com/<your-github-username>/WMCore (fetch)
origin   https://github.com/<your-github-username>/WMCore (push)
upstream   https://github.com/dmwm/WMCore (fetch)
upstream   https://github.com/dmwm/WMCore (push)

Each crabserver release uses a specific version of WMCore, as specified in https://github.com/cms-sw/cmsdist/blob/comp_gcc493/crabtaskworker.spec. Unless you are using an old tag of CRABServer, check out the corresponding WMCore tag:

git checkout <tag> # e.g. 1.0.5.pre5

2) Configure the TaskWorker to use your repositories.

In the TaskWorker init script, fix the setting of the PYTHONPATH to point to your cloned CRABServer and WMCore repositories.

Edit the file /data/srv/TaskManager/current/slc6_amd64_gcc493/cms/crabtaskworker/*/etc/profile.d/init.sh, comment out the lines that change the PYTHONPATH (they should be the last two lines) and append a new line that adds your repositories to the PYTHONPATH (here we assume the repositories are under /data/user/):

#[ ! -d /data/srv/TaskManager/3.3.16.rc2/slc6_amd64_gcc493/cms/crabtaskworker/3.3.16.rc2/${PYTHON_LIB_SITE_PACKAGES} ] || export PYTHONPATH="/data/srv/TaskManager/3.3.16.rc2/slc6_amd64_gcc493/cms/crabtaskworker/3.3.16.rc2/${PYTHON_LIB_SITE_PACKAGES}${PYTHONPATH:+:$PYTHONPATH}";
#[ ! -d /data/srv/TaskManager/3.3.16.rc2/slc6_amd64_gcc493/cms/crabtaskworker/3.3.16.rc2/x${PYTHON_LIB_SITE_PACKAGES} ] || export PYTHONPATH="/data/srv/TaskManager/3.3.16.rc2/slc6_amd64_gcc493/cms/crabtaskworker/3.3.16.rc2/x${PYTHON_LIB_SITE_PACKAGES}${PYTHONPATH:+:$PYTHONPATH}";
export PYTHONPATH=/data/user/CRABServer/src/python:/data/user/WMCore/src/python:$PYTHONPATH
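To confirm that Python now picks up the cloned repositories rather than the installed release, one can check where a module resolves from. The helper below is ours (not part of CRAB); the usage line assumes the repositories are under /data/user as above:

```shell
# Sketch: succeed only if <module> is imported from under <directory>.
resolves_from() {  # resolves_from <module> <directory>
    local mod=$1 dir=$2 path
    path=$("${PYTHON:-python}" -c "import $mod, os; print(os.path.dirname(os.path.abspath($mod.__file__)))") || return 1
    case $path in
        "$dir"*) return 0 ;;
        *) echo "$mod resolves from $path, not $dir"; return 1 ;;
    esac
}
# Usage: resolves_from TaskWorker /data/user/CRABServer/src/python
```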

Update the TaskWorker archives

ONLY for PRIVATE installations

Create a script (named e.g. updateTMRuntime.sh) with the following content, changing TW_RELEASE, TW_ARCH, CRAB_OVERRIDE_SOURCE and the hostnames/usernames as appropriate:

# Run the script from your github CRABServer directory
# this is a hardcoded name in htcondor_make_runtime.sh with no relation
# to the actual CRAB version
CRAB3_DUMMY_VERSION=3.3.0-pre1
TW_ARCH=slc6_amd64_gcc493
# this must match the TW release to update
TW_RELEASE=3.3.1706.rc2
MYTESTAREA=/data/srv/TaskManager/current
CRABTASKWORKER_ROOT=${MYTESTAREA}/${TW_ARCH}/cms/crabtaskworker/${TW_RELEASE}
# CRAB_OVERRIDE_SOURCE tells htcondor_make_runtime.sh where to find the CRABServer repository
export CRAB_OVERRIDE_SOURCE=/data/user

pushd $CRAB_OVERRIDE_SOURCE/CRABServer
sh bin/htcondor_make_runtime.sh
mv TaskManagerRun-$CRAB3_DUMMY_VERSION.tar.gz TaskManagerRun.tar.gz
mv CMSRunAnalysis-$CRAB3_DUMMY_VERSION.tar.gz CMSRunAnalysis.tar.gz
if [ "$(hostname)" != "osmatanasi2" ]; then
    cmd=scp
    targethost='atanasi@osmatanasi2.cern.ch:'
else
    cmd=cp
    targethost=
fi
$cmd -v CMSRunAnalysis.tar.gz CRAB3-externals.zip TaskManagerRun.tar.gz $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/AdjustSites.py $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/dag_bootstrap_startup.sh $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/dag_bootstrap.sh $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/gWMS-CMSRunAnalysis.sh $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/CMSRunAnalysis.sh $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/cmscp.py $CRAB_OVERRIDE_SOURCE/CRABServer/scripts/DashboardFailure.sh $targethost$CRABTASKWORKER_ROOT/data

popd

Run the script

  • before starting the TaskWorker for the first time;
  • every time you restart the TaskWorker after changing the TaskWorker code.

Missing gsissh

If gsissh is missing in the host machine, install it:

# 'yum provides /usr/bin/gsissh' shows that the package providing it is gsi-openssh-clients-5.3p1-11.el6.x86_64
sudo yum install gsi-openssh-clients-5.3p1-11.el6.x86_64

Missing gfal-copy

For a private installation, the gfal-copy command is likely missing from the VM, in which case the TaskWorker cannot check the user's write permission at the destination site: if a site where the user is not authorized is selected, the TaskWorker will submit the task as usual, and it will fail later during stageout. To install gfal-copy, run:

sudo yum install gfal2-util

Also, since we never know in advance which transfer protocols a site supports, it is good to install all gfal protocol plugins:

sudo yum install gfal2-all

Start/stop the TaskWorker

Create a file (/data/srv/TaskManager/taskworker_env.sh) that you will source to set the TaskWorker environment:

export MYTESTAREA=/data/srv/TaskManager/current
source $MYTESTAREA/slc6_amd64_gcc493/cms/crabtaskworker/*/etc/profile.d/init.sh
export CRABTASKWORKER_ROOT
export CONDOR_CONFIG=/data/srv/condor_config

Create a file (/data/srv/TaskManager/taskworker_start.sh) that you will source to start the TaskWorker (this runs the MasterWorker from the installed TaskWorker, but you can change it to run instead the MasterWorker from your local CRABServer repository):

For production and pre-production instances:

MYUID=$(/usr/bin/id -u)  # note: do not clobber $USER with a uid

if [[ "$MYUID" = '100001' ]]; then
  echo starting
  nohup python $MYTESTAREA/slc6_amd64_gcc493/cms/crabtaskworker/*/lib/python2.7/site-packages/TaskWorker/MasterWorker.py --config $MYTESTAREA/TaskWorkerConfig.py --debug &
else
  echo $MYUID
  echo 'Please switch to the crab3 user; the TW cannot run as root'
fi

For private installations:

wd=$PWD
cd $MYTESTAREA
nohup python $MYTESTAREA/slc6_amd64_gcc493/cms/crabtaskworker/*/lib/python2.7/site-packages/TaskWorker/MasterWorker.py --config $MYTESTAREA/TaskWorkerConfig.py --debug &
cd $wd

  • Environment setup for starting the service:

source /data/srv/TaskManager/taskworker_env.sh

  • Starting the service (need to have the TaskWorker environment):

sh /data/srv/TaskManager/taskworker_start.sh

which will write $MYTESTAREA/nohup.out: empty if the start was successful, containing error messages otherwise.
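The two checks (master process alive, nohup.out empty) can be combined in a small helper; the function below is our sketch, not part of the TaskWorker distribution:

```shell
# Sketch: report whether a process matching the given pattern is running and
# whether the given nohup.out file is free of startup errors.
tw_started() {  # tw_started <process pattern> <nohup.out file>
    pgrep -f "$1" >/dev/null || { echo 'no matching TaskWorker process found'; return 1; }
    if [ -s "$2" ]; then
        echo "startup errors in $2:"; cat "$2"; return 1
    fi
    echo 'TaskWorker running, no startup errors'
}
# Usage: tw_started MasterWorker.py $MYTESTAREA/nohup.out
```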

  • Stopping the service on non production TW:

pkill -f TaskWorker

  • Stopping the service on production TW:

First of all, identify the PID of the master process:

ps f -fu crab3
crab3    29690  5181  0 11:06 pts/1    R+     0:00  \_ ps f -fu crab3
crab3    13459     1  4 Jul28 ?        Rl    46:03 python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManager/c
crab3    13466 13459  0 Jul28 ?        Sl     0:08  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13467 13459  0 Jul28 ?        Sl     0:20  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13469 13459  0 Jul28 ?        Sl     0:51  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13471 13459  0 Jul28 ?        Sl     0:15  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13473 13459  0 Jul28 ?        Sl     0:19  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag

Then kill only the master process:

kill 13459

and wait until all the slave processes finish.
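The stop sequence above can be sketched as a helper that signals only the oldest matching MasterWorker process (the master, i.e. the one whose parent is init in the ps output above) and then waits for the slaves. The function name is ours:

```shell
# Sketch: kill only the master (oldest matching process) and wait until all
# remaining (slave) processes have exited on their own.
stop_taskworker() {  # stop_taskworker [process pattern]
    local pat=${1:-MasterWorker.py} master
    master=$(pgrep -of "$pat") || { echo 'TaskWorker is not running'; return 0; }
    kill "$master"
    # the slave processes finish their current work and then exit
    while pgrep -f "$pat" >/dev/null; do sleep 5; done
    echo 'TaskWorker stopped'
}
```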

TaskWorker log files

The TaskWorker log files are in the logs subdirectory of the current directory from where the service is started. This subdirectory is created by the TaskWorker process if needed. If you started the service following the instructions above, the logs subdirectory should be in $MYTESTAREA/logs/. The main log file is twlog.log. There are also other log entries in the subdirectories logs/processes and logs/tasks. The twlog.log is automatically rotated every day by the service.

TaskWorker Banned sites

For site issues, check the latest validation results on the Integration twiki page.

Topic revision: r1 - 2018-03-20 - TitasKulikauskas
 