
TW and Publisher deployment on Docker

This page describes how to deploy the TaskWorker and the Publisher.

For production and pre-production: the local service account used to deploy, run, and operate the services is crab3.

Introduction

This twiki explains how to deploy the TaskWorker and Publisher on Docker.

Note: Legend of colors for the examples:

Commands to execute
Output sample of the executed commands
Configuration files
Other files

Prerequisites

Get and install a virtual machine

See Deployment of CRAB REST Interface / Get and install a virtual machine. Make sure to go through the Machine preparation steps if you do not plan to install the REST interface on this machine.

Prepare directories and host cert to be used by the TW running as the service user

#sudo mkdir /data/certs  # This should have been done already by the Deploy script when installing the VM.
sudo mkdir -p /data/certs/creds      /data/container/TaskWorker/logs/      /data/container/TaskWorker/cfg      /data/container/Publisher/logs/      /data/container/Publisher/cfg 
sudo chmod 700 /data/certs/creds      /data/container/TaskWorker/logs/      /data/container/TaskWorker/cfg      /data/container/Publisher/logs/      /data/container/Publisher/cfg 
#sudo cp -p /etc/grid-security/host{cert,key}.pem /data/certs # This should have been done already by the Deploy script when installing the VM.

For production and pre-production installations:

sudo chown crab3:zh /data/certs/creds      /data/container/TaskWorker/logs/      /data/container/TaskWorker/cfg      /data/container/Publisher/logs/      /data/container/Publisher/cfg 
sudo chown crab3:zh /data/certs/host{cert,key}.pem

For private installations:

sudo chown `whoami`:zh /data/certs /data/certs/creds      /data/container/TaskWorker/logs/      /data/container/TaskWorker/cfg      /data/container/Publisher/logs/      /data/container/Publisher/cfg 
sudo chown `whoami`:zh /data/certs/host{cert,key}.pem
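The directory preparation above can be collected in one small idempotent helper; this is only a sketch of the steps shown above (not an official script), with the base path and owner as parameters so the same function covers production (owner crab3:zh) and private installations:

```shell
# setup_crab_dirs BASE [OWNER]: create the TW/Publisher directory layout under
# BASE, restrict permissions, and optionally chown (needs root for crab3:zh).
setup_crab_dirs() {
  local base="$1" owner="$2"
  local dirs=(
    "$base/certs/creds"
    "$base/container/TaskWorker/logs"
    "$base/container/TaskWorker/cfg"
    "$base/container/Publisher/logs"
    "$base/container/Publisher/cfg"
  )
  mkdir -p "${dirs[@]}"        # create the full layout (idempotent)
  chmod 700 "${dirs[@]}"       # readable only by the service user
  if [ -n "$owner" ]; then     # chown only when an owner is given
    chown "$owner" "${dirs[@]}"
  fi
}
# Example (as root, for production): setup_crab_dirs /data crab3:zh
```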

Setup a service certificate for interacting with CMSWEB

A service certificate is needed with DN=/DC=ch/DC=cern/OU=computers/CN=tw/vocms052.cern.ch (or whatever the correct host name is). As of 2017, this is provided yearly by James Letts, who takes care of this for all Submission Infrastructure related machines, but if needed, anybody can do it using the procedure indicated here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsGlideinWMSCerts

The service certificate must be registered in VO CMS and in SiteDB. The certificate and private key should be in /data/certs/servicecert.pem and /data/certs/servicekey.pem respectively, and should be readable only by the service user.

For production and pre-production also do:

sudo chown crab3:zh /data/certs/service{cert,key}.pem

For private installations:

sudo chown `whoami`:zh /data/certs/service{cert,key}.pem

The proxy is created for 8 days (192 hours), because this is the maximum allowed duration of the VO CMS extension. Thus, the proxy has to be renewed at least every 7 days. You can do this manually, or you can set up an automatic renewal procedure as is done in production and pre-production.
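As an illustration of such an automatic renewal, a hypothetical acrontab entry could look like the following; the renewal script name, host name, and log path here are assumptions for the example, not the actual production setup:

```
# Renew the 8-day service proxy every Monday at 06:00 (well within the 7-day limit).
# acron format: min hour day month weekday host command
0 6 * * 1 vocms052.cern.ch /data/srv/renew_proxy.sh >> /data/container/TaskWorker/logs/proxy_renewal.log 2>&1
```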

Install Docker daemon

Install the Docker daemon with the following commands

sudo yum install -y yum-utils \
  device-mapper-persistent-data \
  lvm2
sudo yum-config-manager \
  --add-repo \
  https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce

Then start the daemon with

sudo systemctl start docker

You can check if everything worked correctly with:

sudo docker ps 
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS 

Optionally, you can add your user to the docker group; this allows you to skip the sudo command when using the Docker CLI.

sudo usermod -aG docker your-user 

Testing the service certificate

The certificate can be tested via command line curl

check VOMS:

curl --cert /data/certs/servicecert.pem --key /data/certs/servicekey.pem https://voms2.cern.ch:8443/voms/cms/info/index.action
The output is a longish HTML page, ugly but readable. If the certificate works, the page will contain something like this close to the end:
      Your certificate:
      <ul class="certificate-info">
          <li>
            is currently linked to the following membership:
            <a href="/voms/cms/user/load.action;jsessionid=em8vmjdl2krp1k8o14htoe05z?userId=166">
              JAMES LETTS (166)
            </a>
          </li>

check CMSWEB:

curl --cert /data/certs/servicecert.pem --key /data/certs/servicekey.pem https://cmsweb.cern.ch/auth/trouble
The output will be self-explanatory.
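Before the curl tests, it can also help to verify that the certificate on disk is not about to expire. A small sketch using openssl (the 7-day threshold mirrors the renewal policy above; nothing here is CRAB-specific):

```shell
# check_cert_days CERT [DAYS]: print the expiry date of CERT and succeed only
# if it is still valid for at least DAYS more days (default 7).
check_cert_days() {
  local cert="$1" days="${2:-7}"
  openssl x509 -noout -enddate -in "$cert"                       # show expiry
  openssl x509 -noout -checkend $(( days * 86400 )) -in "$cert"  # exit 1 if expiring
}
# Example: check_cert_days /data/certs/servicecert.pem 7 || echo "renew the certificate!"
```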

Alternatively, and in particular for private installations, you can install your own short lived proxy (starting from your own account):

voms-proxy-init --voms cms --valid 192:00
sudo cp /tmp/x509up_u$UID /data/certs/servicecert.pem
sudo cp /tmp/x509up_u$UID /data/certs/servicekey.pem
sudo chmod 600 /data/certs/servicecert.pem
sudo chmod 400 /data/certs/servicekey.pem

Contact Analysis Ops for new instances

In addition to GSI authentication, CRAB3 operators whitelist the hostnames allowed to submit jobs for CRAB3. If this is a brand-new hostname, you will need to contact CRAB operator(s) to add this TaskWorker to the list of ALLOW_WRITE hosts in 82_cms_schedd_crab_generic.config

Configuration files

TW Configuration

Create a TaskWorker configuration file and store it in /data/container/TaskWorker/cfg, with the content of this template: https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/blob/master/code/templates/crab/crabtaskworker/TaskWorkerConfig.py.erb

The parameters that you have to change are:

Parameter Type Explanation
TaskWorker.name string A name that identifies this TaskWorker. For example the host name.
TaskWorker.instance string REST endpoint to use, e.g. TaskWorker.instance='preprod'. Documentation on how to use this parameter (available nicknames, how to set the restHost/dbInstance combination) can be found here: https://github.com/dmwm/CRABServer/wiki/INSTANCES:--versions,-REST-hosts,-clusters,-DataBases . When 'instance' is set to 'other', the configuration must have one pair of strings indicating the REST host FQDN and the DB instance [prod/preprod/dev].
TaskWorker.restHost string REST host name (only if TaskWorker.instance='other')
TaskWorker.dbInstance string Database name (only if TaskWorker.instance='other')
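As an illustration only, a minimal sketch of what the configuration file could contain, limited to the parameters in the table above. The linked template is authoritative; the import below follows the WMCore configuration style used there, and the host name is a placeholder:

```python
# Minimal illustration of /data/container/TaskWorker/cfg/TaskWorkerConfig.py.
# The real file has many more parameters: always start from the linked template.
from WMCore.Configuration import ConfigurationEx

config = ConfigurationEx()
config.section_("TaskWorker")
config.TaskWorker.name = 'vocms052.cern.ch'   # a name identifying this TaskWorker
config.TaskWorker.instance = 'preprod'        # REST endpoint nickname
# Only needed when TaskWorker.instance = 'other':
# config.TaskWorker.restHost = 'cmsweb-testbed.cern.ch'   # REST host FQDN
# config.TaskWorker.dbInstance = 'preprod'                # DB instance [prod/preprod/dev]
```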

Publisher Configuration

Create a Publisher configuration file and store it in /data/container/Publisher/cfg, with the content of this template: https://github.com/dmwm/CRABServer/blob/master/src/script/Deployment/Publisher/PublisherConfig.py

Before running the TaskWorker or Publisher, double check all the parameters. If you don't know what parameters to use, please contact hn-cms-crabDevelopment@cern.ch

Build new container image

The image built in this section can be used to deploy two services: TaskWorker and Publisher. Which service is deployed is decided at container start time. You can build images on your private VM or use cmsdocker01.

1. Clone CRABServer github repository

git clone https://github.com/dmwm/CRABServer.git

2. cd into the directory where the Dockerfile is located: cd CRABServer/Docker

3. Set release version

TW_VERSION=3.3.2004

4. Build image

docker build . -t ${DHUSER}/crabtaskworker:$TW_VERSION --build-arg TW_VERSION=${TW_VERSION}
  • Be aware that ${DHUSER} is a Docker Hub username. Production TW images are stored in the 'cmssw' account.
  • Also, before building the image, check which RPM repository is used in the install.sh script. If you want to use your own repo, don't forget to change this file.
  • Finally, if you later push this image to your own Docker Hub account, you will need to create a repository there called 'crabtaskworker'. A different name can be used, but then don't forget to change it in the other places where it appears.

5. Check that the image was built

docker images
REPOSITORY            TAG                 IMAGE ID            CREATED             SIZE
${DHUSER}/crabtaskworker           3.3.2004            c312fb512323        5 days ago          5.57GB

6. Push image to Docker Hub

docker push  ${DHUSER}/crabtaskworker:$TW_VERSION

Every time a new image needs to be deployed, steps 2-6 have to be repeated.
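Steps 4 and 6 can be wrapped in a small helper for repeated deployments. This is a sketch, not an official script: DHUSER and TW_VERSION are the same assumptions as above, and setting DRY_RUN=1 prints the docker commands instead of executing them, which is handy for a quick check before a real build:

```shell
# build_and_push DHUSER TW_VERSION: build and push the crabtaskworker image.
# Run it from CRABServer/Docker (step 2). DRY_RUN=1 prints instead of executing.
build_and_push() {
  local user="$1" version="$2"
  if [ -z "$user" ] || [ -z "$version" ]; then
    echo "usage: build_and_push DHUSER TW_VERSION" >&2
    return 1
  fi
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }
  run docker build . -t "$user/crabtaskworker:$version" --build-arg TW_VERSION="$version"
  run docker push "$user/crabtaskworker:$version"
}
# Example: DRY_RUN=1 build_and_push myuser 3.3.2004
```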

Run new container

Mounted points

Below is a list of mounted points used to start the container. Since one image can be used to start two different services (TaskWorker and Publisher), be aware that when mounting the directory for logs at container start time, you have to specify which one to mount: TaskWorker or Publisher.

Source (host machine)           Target (container)
/cvmfs/cms.cern.ch/SITECONF     /cvmfs/cms.cern.ch/SITECONF
/etc/grid-security/             /etc/grid-security/
/data/srv/tmp/                  /data/srv/tmp/
/data/certs/                    /data/certs/
/etc/vomses/                    /etc/vomses/
/data/container/                /data/hostdisk/

Start new container

A new TaskWorker container can be started by running:
TW_VERSION={Set CRABServer release tag}
docker run --name TaskWorker -d -ti --net host --privileged -e SERVICE=TaskWorker -w /data/srv/TaskManager -v /cvmfs/cms.cern.ch/SITECONF:/cvmfs/cms.cern.ch/SITECONF -v /data/srv/tmp/:/data/srv/tmp/ -v /etc/grid-security/:/etc/grid-security/ -v /data/certs/:/data/certs/ -v /etc/vomses/:/etc/vomses/ -v /data/container/:/data/hostdisk/ cmssw/crabtaskworker:$TW_VERSION

A new Publisher container can be started by running:

TW_VERSION={Set CRABServer release tag}
docker run --name Publisher -d -ti --net host --privileged -e SERVICE=Publisher -w /data/srv/Publisher/ -v /cvmfs/cms.cern.ch/SITECONF:/cvmfs/cms.cern.ch/SITECONF -v /data/srv/tmp/:/data/srv/tmp/ -v /etc/grid-security/:/etc/grid-security/ -v /data/certs/:/data/certs/ -v /etc/vomses/:/etc/vomses/  -v /data/container/:/data/hostdisk/  cmssw/crabtaskworker:$TW_VERSION

Option Explanation
--name Container name. Containers are named "TaskWorker" or "Publisher"
-d Start a container in detached mode
-ti Allocate a pseudo-tty and keep STDIN open even if not attached
--net host Makes the programs inside the Docker container look like they are running on the host itself, from the perspective of the network
--privileged Give extended privileges to this container
-v Bind mount a volume

Enter container

#enter TaskWorker container
docker exec -it TaskWorker bash
#enter Publisher container
docker exec -it Publisher bash

Stop Service

To stop a service (TaskWorker or Publisher), 3 steps need to be done:

  1. Stop running processes inside the container;
  2. Stop running container;
  3. Delete the old container - it is important to delete the old stopped container so that the same name can be re-used for the next container.

Important NOTE: You should always ensure that you have stopped the TaskWorker and/or Publisher processes inside the container (step 1) and only then stop the container itself (step 2)!

This sequence of actions can be achieved in two ways (if you want to stop Publisher container, then in the command replace 'TaskWorker' with 'Publisher'):

  • manually log in to the container, use the pre-configured stop.sh script to stop the service, and then exit and stop the container
    docker exec -it TaskWorker bash #log in to the container
    ./stop.sh  #run stop script
    ps uxf #check if all slaves were killed
    exit  #exit the container
    docker stop TaskWorker  #stop the container
    docker rm TaskWorker #remove the container
    
  • run a command in a running container using docker exec [OPTIONS] CONTAINER COMMAND [ARG...]
    docker exec TaskWorker bash -c "./stop.sh" #run stop.sh script inside the container
    docker exec TaskWorker bash  -c "ps uxf" #check if TW was stopped properly and all slaves were killed
    docker stop TaskWorker  #stop the container 
    docker rm TaskWorker #remove the container
    

Both steps 1 and 2 are explained in more detail below:

Stop running processes

To stop the running processes you should use the preconfigured script:

sh stop.sh

Which executes the following steps:

first identify the pid of the master process:
ps f -fu crab3
crab3    29690  5181  0 11:06 pts/1    R+     0:00  \_ ps f -fu crab3
crab3    13459     1  4 Jul28 ?        Rl    46:03 python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManager/c
crab3    13466 13459  0 Jul28 ?        Sl     0:08  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13467 13459  0 Jul28 ?        Sl     0:20  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13469 13459  0 Jul28 ?        Sl     0:51  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13471 13459  0 Jul28 ?        Sl     0:15  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag
crab3    13473 13459  0 Jul28 ?        Sl     0:19  \_ python /data/srv/TaskManager/current/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1507.rc7/lib/python2.6/site-packages/TaskWorker/MasterWorker.py --config /data/srv/TaskManag

and then kill only the master process

kill 13459

You should wait until all slaves finish.

Important NOTE: the above procedure will make each TW slave stop once it has completed the work it pulled into its queue (2 tasks at most), so that no tasks are left in QUEUED. Tasks in status QUEUED when the TW restarts would be automatically failed. At times it may happen that one slave is instead executing one of the recurring actions (see TaskWorkerConfig.py); in particular, the proxy renewal action can last a long time. In this case it is better not to wait (hours?), since the proxy renewer script can be safely killed at any time. You can monitor the TW winding down using ps fux and/or tailing the most recent logs/processes/proc.c3id_*.txt files. Generally speaking, it is usually safe to kill any lingering TW slave after about 5 minutes. If it did not complete in that time, something is wrong, and you should put in the elog the name and the last lines of the log of the stuck process, for investigation.
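The "wait until all slaves finish" step can be automated with a small polling helper; this is only a sketch, where the MasterWorker.py pattern matches the process listing above and the 300-second default reflects the ~5 minute rule of thumb:

```shell
# wait_for_exit PATTERN [TIMEOUT]: poll once per second until no process
# matches PATTERN (via pgrep -f); return 0 when all are gone, 1 on timeout.
wait_for_exit() {
  local pattern="$1" timeout="${2:-300}" waited=0
  while pgrep -f "$pattern" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1                 # still running: inspect the logs, then kill by hand
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
# Example: wait_for_exit "MasterWorker.py" 300 || echo "slaves stuck, check proc.c3id_*.txt logs"
```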

To restart the TaskWorker|Publisher, use the preconfigured script:

# Start the TaskWorker
sh start.sh

Stop running TaskWorker|Publisher container

Stop the running TaskWorker|Publisher container by running the docker stop command with the container name/id:

docker stop TaskWorker 

Restart stopped TaskWorker|Publisher container

A stopped container can be started by running:

docker start TaskWorker

Find TaskWorker|Publisher logs

#TW logs are stored on the host machine in /data/container/TaskWorker/logs/. The main log file is twlog.txt
tail -f /data/container/TaskWorker/logs/twlog.txt
#Publisher logs are stored on the host machine in /data/container/Publisher/logs/. The main log file is log.txt
tail -f /data/container/Publisher/logs/log.txt

The twlog.txt and log.txt files are automatically rotated every day by the service.

Naming convention for images and containers

It is suggested to use the TW release version as the tag for the Docker image, and to use "TaskWorker" and "Publisher" for the corresponding container names; see the example below.

Repository name             Image                               Tag                                 Container name
${DHUSER}/crabtaskworker    ${DHUSER}/crabtaskworker:3.3.2004   TW release version, e.g. 3.3.2004   TaskWorker
${DHUSER}/crabtaskworker    ${DHUSER}/crabtaskworker:3.3.2004   TW release version, e.g. 3.3.2004   Publisher

Script to pull and run TW/Publisher image

  • The runContainer.sh script can be used to pull an image from Docker Hub and run it. Important NOTE: the runContainer.sh script is placed in every CRAB dev|preprod|prod machine's home directory (/home/crab3), so you can run it directly from there.
    [crab3@crab-prod-tw02 ~]$ ./runContainer.sh -h
    
    Usage: runContainer.sh -v v3.201118 -s TaskWorker
       -v TW/Publisher version
       -s which service should be started: Publisher, Publisher_scheddy, Publisher_asoless, Publisher_rucio or TaskWorker
       -r docker hub repo, if not provided, default is 'cmssw'
    

Deploying new TW release

List of steps which need to be done every time a new TW release has to be deployed:

  1. Build a new image and upload it to the Docker Hub repository. Note that every time a new release is created in the CRABServer repository, a Docker image is automatically built and pushed to the cmssw/crabtaskworker Docker Hub account; read more at: https://twiki.cern.ch/twiki/bin/view/CMSPublic/NotesAboutReleaseManagement#Automatic_Docker_image_build_for
  2. If needed, stop the container running the old version of the TW/Publisher and remove it so that the name can be reused.
  3. Start the new container manually by pasting the commands listed in section 'Run new container' or by calling the runContainer.sh script (see section 'Script to pull and run TW/Publisher image').
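The routine redeployment sequence (stop processes, stop and remove the container, start the new one) can be sketched as one helper. This is an illustration, not an official script: the mount list is abbreviated, so in real use take the full docker run command from section 'Run new container'; DRY_RUN=1 prints the commands instead of executing them:

```shell
# redeploy NAME VERSION: stop the service inside the container, remove it,
# and start a new one from cmssw/crabtaskworker:VERSION.
redeploy() {
  local name="$1" version="$2"
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }
  run docker exec "$name" bash -c ./stop.sh   # stop processes inside the container
  run docker stop "$name"                     # stop the container
  run docker rm "$name"                       # free the name for reuse
  # Abbreviated restart; use the full mount list from 'Run new container'
  run docker run --name "$name" -d -ti --net host --privileged \
      -e SERVICE="$name" -v /data/certs/:/data/certs/ \
      -v /data/container/:/data/hostdisk/ \
      "cmssw/crabtaskworker:$version"
}
# Example: DRY_RUN=1 redeploy TaskWorker 3.3.2004
```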

Publisher_schedd (on crab-prod-tw02)

This works as of Nov 15, 2020; it may change as things are restructured and updated:
sudo su crab3
# stop the container
docker stop Publisher_schedd
# remove old container so that can reuse name
docker rm Publisher_schedd
# start a container with the new version
TW_VERSION=<newtag>
docker run --name Publisher_schedd -d -ti --net host --privileged -e SERVICE=Publisher -w /data/srv/Publisher/ -v /cvmfs/cms.cern.ch/SITECONF:/cvmfs/cms.cern.ch/SITECONF -v /data/srv/tmp/:/data/srv/tmp/ -v /etc/grid-security/:/etc/grid-security/ -v /data/certs/:/data/certs/ -v /etc/vomses/:/etc/vomses/ -v /data/container/Publisher_schedd/logs/:/data/srv/Publisher/logs/  -v /data/container/Publisher_schedd/cfg/:/data/srv/cfg/ -v /data/container/Publisher_schedd/PublisherFiles/:/data/srv/Publisher_files cmssw/crabtaskworker:$TW_VERSION
  

Useful Docker commands

If the container crashed for some reason, you can use the 'docker logs' command to inspect what happened:

docker logs TaskWorker 

-- DainaDirmaite - 2020-04-09

Topic revision: r50 - 2020-12-15 - StefanoBelforte