Recipes for Tier-0 troubleshooting. Most of them are written so that you can copy and paste them, replace the placeholders with your own values, and obtain the expected results.

BEWARE: The writers are not responsible for the side effects of these recipes. Always understand the commands before executing them.

Corrupted merged file

This includes files that are on tape, already registered on DBS/TMDB. The procedure to recover them is basically to run all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).

Print .pkl files, Change job.pkl

  • Print job.pkl or Report.pkl in a tier0 WMAgent vm:
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', 'rb')  # pickles are binary files
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
reportHandle = open('Report.3.pkl', 'rb')
loadedReport = cPickle.load(reportHandle)
reportHandle.close()
print loadedReport

  • In addition, to change the job.pkl:
import cPickle
jobHandle = open('job.pkl', 'rb')
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
# Do the changes on the loadedJob
output = open('job.pkl', 'wb')  # binary mode, needed for HIGHEST_PROTOCOL
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.close()  # make sure the new content is flushed to disk
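Since the recipe above overwrites job.pkl in place, a slightly more defensive variant keeps a copy of the original first. The following is only a minimal sketch, not part of the agent tooling: it uses Python 3's pickle module (the console above is Python 2 with cPickle), and the helper name edit_pickle, the ".bak" suffix and the example field are assumptions made for illustration.

```python
import pickle
import shutil


def edit_pickle(path, modify):
    """Load a pickle, apply modify(obj) in place, and write it back.

    A copy of the untouched file is kept as <path>.bak
    (hypothetical naming convention, not something WMAgent does).
    """
    backup = path + ".bak"
    shutil.copy2(path, backup)           # keep the original before touching it
    with open(path, "rb") as handle:     # pickles are binary: read with 'rb'
        obj = pickle.load(handle)
    modify(obj)                          # caller-supplied change
    with open(path, "wb") as handle:     # and write with 'wb'
        pickle.dump(obj, handle, pickle.HIGHEST_PROTOCOL)
    return backup
```

For example, with a pickled dictionary, edit_pickle('job.pkl', lambda job: job.update(numberOfCores=8)) would rewrite the file and leave the previous content in job.pkl.bak.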

  • Print PSet.pkl in a workernode:
Set up the same environment used to run a job interactively, go to the PSet.pkl location, open a python console and do:
import FWCore.ParameterSet.Config as cms
import pickle
handle = open('PSet.pkl', 'rb')
process = pickle.load(handle)
handle.close()
print process.dumpConfig()

Delete entries in the database when input files are corrupted (Repack jobs)


# This will actually show the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query


# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)


Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

All the T0 WMAgent instances are able to access the CMSWEB Tier0 Data Service instances, so when changing the passwords you need to be aware of which instances are running.

Instances currently in use (03/03/2015)

Instance Name TNS

  1. Review running instances.
  2. Stop each of them using:
  3. Verify that everything is stopped using:
     ps aux | egrep 'couch|wmcore' 
  4. Make sure you have the new password ready (either generate it or get it in a safe way from whoever is creating it).
  5. From lxplus or any of the T0 machines, log in to each instance whose password you want to change using:
     sqlplus <instanceName>/<password>@<tns> 
    Replace the bracketed placeholders with the proper values for each instance.
  6. In sqlplus run the command password. You will be prompted to enter the old password, then the new password, and then to confirm the new one. After that you can exit sqlplus:
          SQL> password
          Changing password for <user>
          Old password: 
          New password: 
          Retype new password: 
          Password changed
          SQL> exit
  7. Then try logging in to the same instance again. If you cannot, you are in trouble!
  8. Communicate the new password to the CMSWEB contact in a safe way. After their confirmation you can continue with the following steps.
  9. If everything went well, you can now access all the instances with the new passwords. Now it is necessary to update the secrets files on all the machines. These files are located in:
    They are normally named as follows (not all the instances will have all the files):

  10. If an instance was running, you may also need to change the password in:
    There you must look for the entry:
    and do the update.
  11. You can now restart the instances that were running before the change. Be careful: some components may fail when you start the instance, so you should be clear about the trade-off of starting it.

Modifying a workflow sandbox

If you need to change a file in a workflow sandbox, i.e. in the WMCore zip, this is the procedure:

# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there and then untar the sandbox and unzip WMCore:

cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 
unzip -q

Now replace/modify the files in WMCore. Then you have to package everything again. You should remove the old sandbox and the old WMCore zip too:

# Remove the former sandbox, then create the new WMCore zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2
zip -rq WMCore

# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*

# Clean workarea
rm -rf PSetTweaks/ WMSandbox/

Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong.
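The advice about keeping a copy of the original sandbox can be sketched as a small helper. This is an illustrative sketch only: the function name install_sandbox and the ".orig" suffix are assumptions, not an existing Tier0 tool, and the paths passed to it are whatever your Specs area and work area actually are.

```python
import os
import shutil


def install_sandbox(new_sandbox, specs_sandbox, backup_suffix=".orig"):
    """Copy new_sandbox over specs_sandbox, keeping one backup of the original.

    An existing backup is never overwritten, so the very first original
    sandbox stays recoverable even if the replacement is repeated.
    """
    backup = specs_sandbox + backup_suffix
    if os.path.exists(specs_sandbox) and not os.path.exists(backup):
        shutil.copy2(specs_sandbox, backup)   # save the original once
    shutil.copy2(new_sandbox, specs_sandbox)  # put the modified sandbox in place
    return backup
```

To roll back, copy the backup file over the sandbox in the Specs area again.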

Force Releasing PromptReco

Normally PromptReco workflows have a predefined release delay (currently: 48h). We may need to release them manually at a particular moment. To do so:

  • Check which runs you want to release.
  • Remember: if some runs are in Active status, the workflows will be created, but you still have to solve the bookkeeping (or similar) problems.
  • The following query pre-releases the non-released runs whose ID is lower than or equal to a particular value. Depending on which runs you want to release, you should "play" with this condition. Before doing the update, you can run only the SELECT to be sure you are releasing only the runs you want.
         UPDATE (
         SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM  reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
               AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
         SET t.released = 1,
             t.delay = 10,
             t.delay_offset = 5;
  • Check the Tier0Feeder logs. You should see log lines for all the runs you released.
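The "run the SELECT first, then the update" habit can be illustrated with a throwaway in-memory database. This is a sketch under stated assumptions: SQLite stands in for T0AST, the plain condition released = 0 stands in for checkForZeroOneState(...) = 0, and the run numbers and initial delay values are made up.

```python
import sqlite3

# In-memory stand-in for the reco_release_config table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reco_release_config "
    "(run_id INTEGER, released INTEGER, delay INTEGER, delay_offset INTEGER)"
)
conn.executemany(
    "INSERT INTO reco_release_config VALUES (?, ?, ?, ?)",
    [(100, 1, 172800, 600), (101, 0, 172800, 600),
     (102, 0, 172800, 600), (103, 0, 172800, 600)],
)


def preview(conn, max_run):
    """Return the runs that the update would touch, without changing anything."""
    rows = conn.execute(
        "SELECT run_id FROM reco_release_config "
        "WHERE released = 0 AND run_id <= ? ORDER BY run_id",
        (max_run,),
    )
    return [row[0] for row in rows]


def release(conn, max_run):
    """Apply the same condition as preview() and return the number of rows updated."""
    cursor = conn.execute(
        "UPDATE reco_release_config "
        "SET released = 1, delay = 10, delay_offset = 5 "
        "WHERE released = 0 AND run_id <= ?",
        (max_run,),
    )
    return cursor.rowcount
```

Comparing the length of preview(conn, max_run) with the row count returned by release(conn, max_run) is a cheap way to confirm that the update touched exactly the runs you expected.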

Running a replay on a headnode

  • To run a replay in an instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
    • Download the latest from the Github repository. Check the processing to use based on the elog history.
    • Do not use the production script. Use the replay script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/' ./config/tier0/ 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/ 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
        echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=PyCondorPlugin
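The per-task-type lines above differ only in the task type and the two slot numbers, so they can be generated from a small table. The sketch below only builds the command strings for review; it does not run anything. The names REPLAY_THRESHOLDS and task_commands are illustrative, while the numbers and the command layout are copied from the list above.

```python
# (pending slots, running slots) per task type, as used for replays above.
REPLAY_THRESHOLDS = {
    "Processing": (800, 1600),
    "Merge": (80, 160),
    "Cleanup": (80, 160),
    "LogCollect": (40, 80),
    "Skim": (0, 0),
    "Production": (0, 0),
    "Harvesting": (40, 80),
    "Express": (800, 1600),
    "Repack": (160, 320),
}


def task_commands(site="T0_CH_CERN", manage="./config/tier0/manage"):
    """Build one wmagent-resource-control command line per task type."""
    commands = []
    for task, (pending, running) in REPLAY_THRESHOLDS.items():
        commands.append(
            f"{manage} execute-agent wmagent-resource-control "
            f"--site-name={site} --task-type={task} "
            f"--pending-slots={pending} --running-slots={running}"
        )
    return commands


if __name__ == "__main__":
    for command in task_commands():
        print(command)
```

Keeping the thresholds in one table makes it easier to see at a glance how the replay values differ from the production ones.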

Changing Tier0 Headnode

#. Instruction (Responsible Role)

0. Run a replay in the new headnode. Some changes have to be made to run it safely in a Prod instance; please check the "Running a replay on a headnode" section. (Tier0)
1. Deploy the new prod instance in vocms0314, check that we use: (Tier0)
1.5. Check the ProdOfflineConfiguration that is being used. (Tier0)
2. Start the Tier0 instance in vocms0314. (Tier0)
3. Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries. Before this, we need to check that all the runs currently in the Tier0 are OK with bookkeeping, which means no runs in Active status. (SMOps)
4. Check that all transfers are stopped. (Tier0)
4.1. Check
4.2. Check /data/Logs/General.log
5. Change the config file of the transfer system to point to T0AST1. This means going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and changing the following settings to match the new head node T0AST:
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
6. Make a backup of the General.log.* files. This backup is only needed if using t0_control restart in the next step; if using t0_control stop + t0_control start, the logs won't be affected. (Tier0)
7. Restart the transfer system using either:

t0_control restart (will erase the logs)

or:

t0_control stop
t0_control start (will keep the logs)

8. Kill the replay processes (if any). (Tier0)
9. Start notification logs to the SM in vocms0314. (Tier0)
10. Change the configuration for Kibana monitoring to point to the proper T0AST instance. (Tier0)
11. Restart transfers. (SMOps)

Changing CMSSW Version

If you need to upgrade the CMSSW version the normal procedure is:

  • Change the defaultCMSSWVersion field to the desired CMSSW version, for example:
      defaultCMSSWVersion = "CMSSW_7_4_7"
  • Update the repack and express mappings, for example:
      repackVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
          }
      expressVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
          }
  • Save the changes

  • Find either the last run using the previous version or the first run using the new version for Express and PromptReco. You can use the following query in T0AST to find runs with specific CMSSW version:

  • Report the change including the information of the first runs using the new version (or last runs using the old one).
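Because every old version maps to the same new one, the two override dictionaries can be generated rather than typed by hand, which avoids leaving one entry pointing at a stale release. This is just a sketch: build_override is a hypothetical helper, not part of the Tier0 configuration code, and the version list mirrors the example above.

```python
def build_override(old_versions, new_version):
    """Map every old CMSSW version to new_version (the new one is never mapped to itself)."""
    return {old: new_version for old in old_versions if old != new_version}


# Versions taken from the example in this section.
old_versions = ["CMSSW_7_4_%d" % n for n in range(2, 7)]

repackVersionOverride = build_override(old_versions, "CMSSW_7_4_7")
expressVersionOverride = dict(repackVersionOverride)  # independent copy

if __name__ == "__main__":
    for old, new in sorted(repackVersionOverride.items()):
        print('"%s" : "%s",' % (old, new))
```

The printed lines can be pasted into the configuration file in place of the hand-written entries.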

Backup T0AST (Database)

If you want to do a backup of a database (for example, after retiring a production node, you want to keep the information of the old T0AST) you should:

  • Request a target database: normally these databases are owned by, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket for requesting the backup. For this you should send an email to An example of a message can be found in this Elog .
  • When the backup is done you will get a reply to your ticket confirming it.

Repacking gets stuck but the bookkeeping is consistent

  • P5 sent over data with lumi holes and consistent accounting.
  • T0 started the Repacking .
  • P5 sent data for the previous lumis where the bookkeeping said there wasn't any data.
  • T0 jobsplitter went to an inconsistent state because it runs in order, so when it found unrepacked data that was supposed to go before data that was already repacked, it got "confused".
  • T0 runs got stuck: some of them never start and others never finish, even though the bookkeeping is OK.
  Example of the sequence of events:
  1. Bookkeeping shows that lumis 52,55,56,57 will be transferred.
  2. Lumis 52,55,56,57 are transferred.
  3. Lumis 52,55,56,57 are repacked.
  4. Lumis 41, 60,91,121,141,145 are transferred.
  5. JobSplitter gets confused because of lumi 41 (lumis 60 and above are all higher than the last repacked lumi, so no problem with them) -> JobSplitter gets stuck
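The condition that trips the JobSplitter in the numbered example above can be written down in a few lines. This is a sketch of the reasoning only, not the JobSplitter's actual code; the function name out_of_order_lumis is made up for illustration.

```python
def out_of_order_lumis(repacked, newly_transferred):
    """Return the newly transferred lumis that fall before the last repacked lumi.

    Lumis above the last repacked one are harmless because the splitter
    processes data in order; only late arrivals below it are a problem.
    """
    if not repacked:
        return []
    last_repacked = max(repacked)
    return sorted(lumi for lumi in newly_transferred if lumi < last_repacked)


if __name__ == "__main__":
    # Numbers from the example: only lumi 41 arrives "in the past".
    print(out_of_order_lumis([52, 55, 56, 57], [41, 60, 91, 121, 141, 145]))
```

With the numbers from the example, only lumi 41 is reported, which matches step 5.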

Procedure to fix

Please note: these are not copy/paste instructions. This is rather a description of the procedure that was followed in the past to deal with the problem, and it can be used as a guide.

  • Update the lumi_section_closed records to have filecount=0 for lumis without data and filecount=1 for lumis with data.
           # find lumis to update
           update lumi_section_closed set filecount = 0 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
           update lumi_section_closed set filecount = 1 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;      
  • Delete the problematic data (these are the files that belong to the lumis that were not originally in the bookkeeping but were transferred).
           delete from wmbs_sub_files_available where fileid in ( ... );
           delete from wmbs_fileset_files where fileid in ( ... );
           delete from wmbs_file_location where fileid in ( ... );
           delete from wmbs_file_runlumi_map where fileid in ( ... );
           delete from streamer where id in ( ... );
           delete from wmbs_file_details where id in ( ... );     

Updating the wall time the jobs are using in the condor ClassAd

This time can be modified using the following command. Remember that it should be executed as the owner of the jobs.

condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440

Updating T0AST when a lumisection can not be transferred.

update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;


update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634
where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
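The statement can be rendered from its parts, which helps avoid typos in the timestamp. The sketch below assumes that CLOSE_TIME holds Unix epoch seconds in UTC, which is what the example value 1436179634 looks like; verify this against T0AST before relying on it. The helper names are illustrative, and the code only builds the text of the statement, it does not connect to any database.

```python
from datetime import datetime, timezone


def epoch_seconds(moment):
    """Convert a timezone-aware datetime to integer Unix epoch seconds."""
    return int(moment.timestamp())


def lumi_close_statement(run_id, stream_id, lumi_ids, close_time):
    """Render the update statement shown above for the given identifiers."""
    lumis = ", ".join(str(int(lumi)) for lumi in lumi_ids)
    return (
        "update lumi_section_closed set filecount = 0, CLOSE_TIME = %d\n"
        "where lumi_id in ( %s ) and run_id = %d and stream_id = %d;"
        % (close_time, lumis, run_id, stream_id)
    )


if __name__ == "__main__":
    when = datetime(2015, 7, 6, 10, 47, 14, tzinfo=timezone.utc)
    print(lumi_close_statement(250938, 14, [11], epoch_seconds(when)))
```

Under that assumption, the timestamp in the example above corresponds to 2015-07-06 10:47:14 UTC.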

Restarting head node machine

  1. Stop Tier0 agent
  2. Stop condor
    service condor stop 
    If you want your data to remain available, copy your spool directory to disk:
    cp -r /mnt/ramdisk/spool /data/
  3. Restart the machine (or request its restart)
  4. Mount the RAM Disk (Condor spool won't work otherwise).
  5. If necessary, copy back the data to the spool.
  6. When restarted, start the sminject component
    t0_control start 
  7. Start the agent
    In particular, check the PhEDExInjector component. If you see errors there, try restarting it after sourcing the environment:
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/
    $manage execute-agent wmcoreD --restart --component PhEDExInjector

Updating TransferSystem for StorageManager change of alias

Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:

  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
  5. Restart the TransferSystem on the T0 headnode. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
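Watching the logs by eye works, but "has this log stopped moving?" can also be checked from file modification times. The following is a rough sketch, not an existing monitoring script: the function name stale_logs and the ten-minute threshold in the usage note are assumptions, and a quiet log is not always a sign of a problem (for example when no data is being taken).

```python
import os
import time


def stale_logs(paths, max_age_seconds, now=None):
    """Return the log files not modified within the last max_age_seconds.

    A missing or unreadable file is reported as stale too, since it
    certainly is not being written to.
    """
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            stale.append(path)
            continue
        if age > max_age_seconds:
            stale.append(path)
    return stale
```

For instance, stale_logs(['/data/TransferSystem/Logs/General.log'], 600) would return that path if the file had not been written to for more than ten minutes.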

Restart component in case of deadlock

If a component crashes due to a deadlock, in most cases restarting it is enough to resolve the situation. In that case the procedure is:

  • Log in to the Tier0 headnode (cmst1 user is required)
  • Source the environment
    source /data/tier0/admin/  
  • Execute the following command to restart the component, replacing <componentName> with the specific component name (DBS3Upload, PhEDExInjector, etc.)
    $manage execute-agent wmcoreD --restart --components=<componentName>
    $manage execute-agent wmcoreD --restart --components=DBS3Upload

Changing Tier0 certificates

  • Check that the new certificates grant privileges to all the needed resources:


  • Copy the servicecert*.pem, servicekey*.pem and serviceproxy*.pem files to
  • Update the following files to point to the new certificates


  • Change the certificates in the monitoring scripts where they are used. To see where the certificates are being used and which is the current monitoring head node, please check the Tier0 Monitoring Twiki.


  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart the component.

Getting Job Statistics

This is the base script to compile the information of jobs that are already done:


For the analysis we need to define certain things:

  • main_dir: Folder where the input log archives are, e.g. '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/P', in main()
  • temp: Folder where output json files are going to be generated, in main().
  • runList: Runs to be analyzed, in main()
  • Job type in two places:
    • getStats()
    • stats[dataset] in main()

The script is run without any parameter. This generates a json file with information about cpu, memory, storage and start and stop times. Task is also included. An example of output file is:


With a separate script in R, I was reading and summarizing the data:


There, task type should be defined and also output file. With this script I was just summarizing cpu data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you like to do it :P

That script calculates efficiency of each job:

TotalLoopCPU / (TotalJobTime * numberOfCores)

and an averaged efficiency per dataset:

sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores) 

numberOfCores was obtained from job.pkl, TotalLoopCPU and TotalJobTime were obtained from report.pkl
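The two efficiency definitions above can be sketched directly in Python. This is not the original statistics script: the function names and the dictionary layout are assumptions made for illustration, and the numbers in the demo are invented.

```python
def job_efficiency(total_loop_cpu, total_job_time, number_of_cores):
    """CPU efficiency of one job: CPU time over wall time times cores."""
    return total_loop_cpu / (total_job_time * number_of_cores)


def dataset_efficiency(jobs):
    """Averaged efficiency of a dataset: ratio of the sums, not mean of the ratios."""
    cpu = sum(job["TotalLoopCPU"] for job in jobs)
    wall = sum(job["TotalJobTime"] * job["numberOfCores"] for job in jobs)
    return cpu / wall


if __name__ == "__main__":
    # Made-up jobs, times in seconds.
    jobs = [
        {"TotalLoopCPU": 3000.0, "TotalJobTime": 1000.0, "numberOfCores": 4},
        {"TotalLoopCPU": 1000.0, "TotalJobTime": 1000.0, "numberOfCores": 4},
    ]
    for job in jobs:
        print(job_efficiency(job["TotalLoopCPU"], job["TotalJobTime"], job["numberOfCores"]))
    print(dataset_efficiency(jobs))
```

Using the ratio of the sums means that long jobs weigh more than short ones in the dataset figure, which is usually what you want when estimating how well the allocated cores were used.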

Job type can be Processing, Merge or Harvesting. For the Processing type, the task can be Reco or AlcaSkim; for the Merge type, it can be ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.

Unpickling the PSet.pkl file (job configuration file)

To modify the configuration of a job, you can modify the content of the PSet.pkl file. In order to do this you have to dump the pkl file into a python file and make the necessary changes there. To do this you will normally need FWCore.ParameterSet.Config. If it is not present in your Python path, you can modify the path:


export PYTHONPATH=/cvmfs/

In the previous example we assume the job is using CMSSW_7_5_8_patch1 for running, and that's why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use.

Now you can use the following snippet to dump the file:


import FWCore.ParameterSet.Config
import pickle
pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)
pickleHandle.close()

#This line only will print the python version of the pkl file on the screen
print process.dumpPython()

#The actual writing of the file
outputFile = open('', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:



cmsRun -e 2>err.txt 1>out.txt &

Checking transfer status at agent shutdown

Before shutting down an agent, you should check that all subscriptions and transfers were performed. This is equivalent to checking that all blocks were deleted from T0_CH_CERN_Disk and resolving any related issues. Issues can be open blocks, DDM conflicts, etc.

Check which blocks have not been deleted yet.

select blockname
from dbsbuffer_block
where deleted = 0

Some datasets could be marked as subscribed in the database without really being subscribed in PhEDEx. You can check this with the Transfer Team and, if that is the case, retry the subscription by setting subscribed to 0. You can narrow the query to blocks with a given name pattern or blocks at a specific site.

update dbsbuffer_dataset_subscription
set subscribed = 0
where dataset_id in (
  select dataset_id
  from dbsbuffer_block
  where deleted = 0
  <and blockname like...>
)
<and site like ...>

Some blocks can be marked as closed but still be open in PhEDEx. If this is the case, you can set status to 'InDBS' to try closing them again. For example, if you want to close MiniAOD blocks, you can provide a name pattern like '%/MINIAOD#%'.

The status attribute can have 3 values: 'Open', 'InDBS' and 'Closed'. 'Open' is the first value assigned to all blocks. When they are closed and injected into DBS, status is changed to 'InDBS', and when they are closed in PhEDEx, status is changed to 'Closed'. Setting status to 'InDBS' makes the agent retry closing the blocks in PhEDEx.

update dbsbuffer_block
set status = 'InDBS'
where deleted = 0
and status = 'Closed'
and blockname like ... 

If some subscriptions shouldn't be checked anymore, remove these subscriptions from the database. For instance, if you want to remove RAW subscriptions to disk of all T1s, you can give a path pattern like '/%/%/RAW' and a site like 'T1_%_Disk'.

delete dbsbuffer_dataset_subscription
where dataset_id in (
  select id
  from dbsbuffer_dataset
  where path like ...
)
and site like ...
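Before running an update or delete with a LIKE pattern, it can be useful to check offline which names the pattern would match. The sketch below translates a SQL LIKE pattern into a regular expression: '%' matches any sequence of characters and '_' matches exactly one character, so a pattern like 'T1_%_Disk' is slightly looser than it looks. The helper names are illustrative, escape clauses are not handled, and the dataset, block and site names in the usage note are hypothetical examples.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern ('%' and '_' wildcards) into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")      # any sequence of characters, possibly empty
        elif char == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def like_match(pattern, value):
    """True if value matches the whole LIKE pattern (case-sensitive)."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For example, like_match('/%/%/RAW', '/Cosmics/Run2015A-v1/RAW') is true while the same pattern does not match a path ending in 'RAW-RECO', and like_match('T1_%_Disk', 'T1_US_FNAL_Disk') is true while 'T0_CH_CERN_Disk' is not matched.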

Disabling flocking to Tier0 Pool

If you need to prevent new Central Production jobs from being executed in the Tier0 pool, you have to disable flocking. To do so, follow these steps. Be careful: you will make changes in the GlideInWMS Collector and Negotiator, and you can cause a big mess if you don't proceed with caution.

NOTE: The root access to the GlideInWMS Collector is guaranteed for the members of the e-group.

  • Log in to vocms007 (GlideInWMS Collector-Negotiator)
  • Log in as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized, which means any change will be overridden when puppet runs. There is one non-puppetized file called 99_local_tweaks.config, which is the one used to make the changes we want.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
     -rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
     -rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
     -rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
     -rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
     -rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
     -rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
     -rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
     -rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
     -rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
     -rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config

Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.

  • To actually disable flocking you should:
    • Uncomment this line:
       # <----- Uncomment here ------->
       # CERTIFICATE_MAPFILE= /data/srv/glidecondor/condor_mapfile 
    • Comment out all the Central Production Schedds in the whitelist:
       # <---- Comment out all the schedds below this to disable flocking ---->
       # Adding global pool CERN production schedds for flocking
       # Adding global pool FNAL production schedds for flocking
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/ 
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:

  • Now you can check the Schedds whitelisted to run in the Tier0 pool. The Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  

Remember that this change won't remove/evict the jobs that are already running, but it will prevent new jobs from being sent.
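The "comment out the schedds below the marker" edit can be tried on a copy of the file content before touching the real file. The sketch below works purely on text and never writes to /etc/condor. It assumes that the whitelist block ends at the first blank line after the marker, which may not hold for the real file, so always review the result by hand. The function name and the distinguished names in the usage note are hypothetical.

```python
MARKER = "# <---- Comment out all the schedds below this to disable flocking ---->"


def disable_flocking(config_text):
    """Comment out GSI_DAEMON_NAME lines between the marker and the next blank line."""
    out = []
    in_block = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if MARKER in line:
            in_block = True                  # whitelist block starts here
        elif in_block and stripped == "":
            in_block = False                 # assumed end of the block
        elif in_block and stripped.startswith("GSI_DAEMON_NAME"):
            line = "# " + line               # disable this schedd entry
        out.append(line)
    return "\n".join(out) + "\n"
```

Given text where a line such as GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=example/CN=schedd-a follows the marker, the function returns the same text with that line prefixed by "# ", while entries above the marker or after the blank line are left alone.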

Enabling pre-emption in the Tier0 pool

BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you have agreed on it with the Workflow Team.

  • Log in to vocms007 (GlideInWMS Collector-Negotiator)
  • Log in as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
  • Open 99_local_tweaks.config
  • Locate this section:
     # How to drain the slots
     # graceful: let the jobs finish, accept no more jobs
     # quick: allow job to checkpoint (if supported) and evict it
     # fast: hard kill the jobs
     DEFRAG_SCHEDULE = graceful 
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes; after that, the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change:
     DEFRAG_SCHEDULE = graceful 

Changing the status of T0_CH_CERN site in SSB

  • You should go to the Prodstatus Metric Manual Override site.
  • There, you will be able to change the status of T0_CH_CERN. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of these reasons can be checked here). Then click "Apply" and the procedure will be complete. Only users in the cms-tier0-operations e-group are able to make this change.
  • The status in the SSB site is updated every 15 minutes, so you should be able to see the change there after at most that amount of time.
  • The documentation can be checked here.

Changing priority of jobs that are in the condor queue

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to adapt it to your particular needs (changing the new priority, filtering the jobs to be modified, etc.):
    for job in $(condor_q -w | awk '{print $1}')
    do
        condor_qedit $job JobPrio "508200001"
    done
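Depending on the condor_q version and options, the first column of its output also contains header and summary tokens, so it can help to keep only the values that look like job IDs (cluster.proc) before building the condor_qedit commands. This is a sketch that only prepares command strings for review. The helper names are made up, and the sample output used in the demo is a hypothetical approximation of the condor_q layout, not a real capture.

```python
import re

JOB_ID = re.compile(r"^\d+\.\d+$")  # cluster.proc, e.g. 1234.0


def job_ids(condor_q_output):
    """Extract the first column of every line that starts with a job ID."""
    ids = []
    for line in condor_q_output.splitlines():
        fields = line.split()
        if fields and JOB_ID.match(fields[0]):
            ids.append(fields[0])
    return ids


def qedit_commands(ids, attribute, value):
    """Build one condor_qedit command string per job ID."""
    return ['condor_qedit %s %s "%s"' % (job, attribute, value) for job in ids]


if __name__ == "__main__":
    sample = """-- Schedd: example.host : <0.0.0.0:0>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
1234.0   cmst1   7/6  10:47   0+00:10:00 R  0   0.0  submit.sh
1234.1   cmst1   7/6  10:47   0+00:00:00 I  0   0.0  submit.sh
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
"""
    for command in qedit_commands(job_ids(sample), "JobPrio", 508200001):
        print(command)
```

The same helpers work for other attributes, for example qedit_commands(ids, "Requestioslots", 0).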

Changing highIO flag of jobs that are in the condor queue

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to adapt it to your particular needs (changing the value, filtering the jobs to be modified, etc.):
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
     do
         condor_qedit $job Requestioslots "0"
     done

Updating workflow from completed to normal-archived in WMStats

  • To move workflows in completed state to archived state in WMStats, the following code should be executed in one of the agents (prod or test): 

  • The script should be copied to bin folder of wmagent code. For instance, in replay instances:

  • The script should be modified, assigning a run number in the following statement:
     if info['Run'] < <RunNumber>
    Note that the given run number will be the oldest run shown in WMStats.

  • After that, the code can be executed with:
     $manage execute-agent 

Adding runs to the skip list in the t0Streamer cleanup script

The script is running as a cronjob under the cmsprod acrontab. It is located in the cmsprod area on lxplus.

# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmsprod has writing permission on eos -
0 5 * * * lxplus /afs/ >> /afs/ 2>&1

To add a run to the skip list:

  • Login as cmsprod on lxplus.
  • Go to the script location and open it with an editor:
  • The skip list is on line 83:
      # run number in this list will be skipped in the iteration below
        runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357]  
  • Add the desired run at the end of the list. Be careful not to remove the existing runs.
  • Save the changes.
  • It is done! Don't forget to add it to the Good Runs Twiki.

NOTE: It was decided to keep the skip list within the script instead of an external file to avoid the risk of deleting the runs in case of an error reading such an external file.
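The "append without removing anything" rule for the skip list can be expressed as a tiny helper, useful if you want to double-check the new list before pasting it into the script. It is only a sketch: add_to_skip_list is a hypothetical name and is not part of the cleanup script, and the new run number in the demo is made up.

```python
def add_to_skip_list(run_skip, run):
    """Return a new list with run appended, keeping every existing entry and avoiding duplicates."""
    if run in run_skip:
        return list(run_skip)
    return list(run_skip) + [run]


if __name__ == "__main__":
    # Current list as shown above; 276400 is a made-up run number.
    runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502,
               263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968,
               276357]
    print(add_to_skip_list(runSkip, 276400))
```

The original list is never modified in place, so the old and new versions can be compared before saving the script.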

Restarting Tier-0 voboxes

Node    Use    Type
  • Replays: Normally used by the developer
  • Transfer System
Virtual machine
  • Replays: Normally used by the operators
  • Tier0 Monitoring
Virtual machine
  • Production node
Physical Machine
  • Production node
Physical Machine
  • Replays: Normally used by the operators
Virtual machine

To restart one of these nodes you need to check the following:
  • Production node:
    • The agent is not running and the couch processes were stopped correctly.
    • These nodes use a RAMDISK. Mounting it is puppetized, so you need to make sure that puppet has run before starting the agent again.
  • TransferSystem:
  • Replays:
    • The agent should not be running, check the Tier0 Elog to make sure you are not interfering with a particular test.
  • Tier0 Monitoring:
    • The monitoring is executed via a cronjob. The only consequence of the restart should be that no reports are produced during the down time. However you can check that everything is working going to:

To restart a machine you need to:

  • Log in and become root
  • Do the particular checks (listed above) for the machine that you are restarting
  • Run the restart command
    shutdown -r now

After restarting the machine, it is advisable to run puppet. You can either wait for the periodic execution or run it manually:

  • puppet agent -tv

Modifying jobs to resume them with other features (like memory, disk, etc.)

Some scripts are already available to do this, provided with:

  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending on the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending on the case. Some basic examples of how to do this are shown below.

Experience has shown that you need to modify the Workflow Sandbox to change the following variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

Modifying the job description has proven useful to change the following variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)

At /afs/ there are two directories named "job" and "workflow". Enter the one that applies to your case. Follow the instructions below on the agent machine in charge of the jobs to modify.

Modifying the Workflow Sandbox

Go to the following folder: /afs/

In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given the workflow configuration is different for different Streams.

Use the provided script to generate a human-readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Then use the generator script to create a script that modifies that feature. You should provide the name of the feature and the value to be assigned, for instance:


Executing it creates a script named after the feature. The latter modifies the selected feature in the Workflow Sandbox.

Once it is generated, you need to add a call to that script in the script that calls all the required scripts, creates the tarball and places it where required (Specs folder).

Finally, execute the script that applies these changes to all the desired workflows.

The previous procedure has been followed for several jobs, so for some features the required customization of the scripts has already been done, and you would just need to comment or uncomment the required lines. In summary, you would need to proceed as detailed below:

vim list

Modifying the Job Description

Go to the following folder: /afs/

As in the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You should read the job.pkl of one particular job to find the name of the feature to modify. After that, use an existing script as a base to create another file that modifies the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the newly created script in the corresponding wrapper script. Finally, execute the main script, which calls the other two scripts. Note that files already exist for the features mentioned at the beginning of the section.

vim list
cp modify_pset_<feature>.py
vim modify_pset_<feature>.py
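The kind of change such a script applies to job.pkl can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3 and the standard pickle module (the agent examples on this page use cPickle), it assumes the job description unpickles to a dict-like object, and modify_job_feature is a hypothetical helper, not one of the existing scripts.

```python
import os
import pickle
import tempfile


def modify_job_feature(pkl_path, feature, value):
    """Load a pickled job description, set one feature, and write it back.

    Returns the previous value of the feature (None if it was not set).
    Assumes the pickled object behaves like a dict.
    """
    with open(pkl_path, "rb") as handle:
        job = pickle.load(handle)
    previous = job.get(feature)
    job[feature] = value
    with open(pkl_path, "wb") as handle:
        pickle.dump(job, handle, pickle.HIGHEST_PROTOCOL)
    return previous


# Usage on a throwaway file, so that no real job is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "job.pkl")
with open(demo_path, "wb") as handle:
    pickle.dump({"numberOfCores": 4}, handle)
old_value = modify_job_feature(demo_path, "numberOfCores", 8)
print("numberOfCores changed from", old_value, "to 8")
```

Keeping a copy of the original job.pkl before running anything like this on a real job is a sensible precaution, since the file is overwritten in place.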

PromptReconstruction at T1s

There are 3 basic requirements to perform PromptReconstruction at T1s (and possibly T2s):

  • Each desired site should be configured in the T0 Agent Resource Control. For this, the /data/tier0/ file should be modified, specifying pending and running slot thresholds for each type of processing task. For instance:
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20

  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ For instance:
datasets = [ "DisplacedJet" ]

for dataset in datasets:
    addDataset(tier0Config, dataset,
               do_reco = True,
               raw_to_disk = True,
               tape_node = "T1_IT_CNAF_MSS",
               disk_node = "T1_IT_CNAF_Disk",
               siteWhitelist = [ "T1_IT_CNAF" ],
               dqm_sequences = [ "@common" ],
               physics_skims = [ "LogError", "LogErrorMonitor" ],
               scenario = ppScenario)

  • Jobs should be able to write to the T1 storage systems. For this, a proxy with the production VOMS role should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/ should point to the proxy location. A proxy with the required role cannot be generated for a time span longer than 8 days, so a cron job should be responsible for the renewal. For jobs to stage out at T1s, there is no need to map the Distinguished Name (DN) shown in the certificate to specific users at the T1 sites; the mapping is made with the role of the certificate. Such a mapping could be needed to stage out at T2 sites. The information of a valid proxy is shown below:
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/
issuer    : /DC=ch/DC=cern/OU=computers/
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       :
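Since the proxy lifetime is limited, it can be useful to check how much time is left before relying on the renewal cron job. The sketch below parses proxy information in the format shown above and reports whether renewal is due. The function names and the 48-hour threshold are assumptions made for illustration; they are not part of the existing renewal setup.

```python
import re

RENEWAL_THRESHOLD_HOURS = 48  # assumed safety margin, not from the original setup

TIMELEFT_PATTERN = re.compile(
    r"^timeleft\s*:\s*(\d+):(\d{2}):(\d{2})\s*$", re.MULTILINE
)


def seconds_left(proxy_info_text):
    """Return the smallest 'timeleft' value, in seconds, found in the text."""
    values = []
    for match in TIMELEFT_PATTERN.finditer(proxy_info_text):
        hours, minutes, seconds = (int(part) for part in match.groups())
        values.append(hours * 3600 + minutes * 60 + seconds)
    if not values:
        raise ValueError("no timeleft line found in proxy information")
    return min(values)


def needs_renewal(proxy_info_text, threshold_hours=RENEWAL_THRESHOLD_HOURS):
    """True if the proxy or its VO extension expires within the threshold."""
    return seconds_left(proxy_info_text) < threshold_hours * 3600


SAMPLE = """\
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
=== VO cms extension information ===
attribute : /cms/Role=production/Capability=NULL
timeleft  : 157:02:58
"""
print(needs_renewal(SAMPLE))
```

The smallest of the two timeleft values is used because the VOMS extension can expire before the proxy itself, and jobs need both to be valid.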

Manually modify the First Conditions Safe Run (fcsr)

The current fcsr can be checked in the Tier0 Data Service:

In the CMS_T0DATASVC_PROD database, check the table below: the first run with locked = 0 is the fcsr.

 reco_locked table 

If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:

 update reco_locked set locked = 0 where run >= <desired_run> 
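The "lowest run with locked = 0" rule can be illustrated with a small sketch. It uses an in-memory SQLite table as a stand-in for the Oracle reco_locked table, with only the run and locked columns and made-up run numbers, so it is an assumption-laden illustration and not something to run against the production database.

```python
import sqlite3


def current_fcsr(conn):
    """Return the lowest run with locked = 0, or None if every run is locked."""
    row = conn.execute(
        "SELECT MIN(run) FROM reco_locked WHERE locked = 0"
    ).fetchone()
    return row[0]


def set_fcsr(conn, desired_run):
    """Unlock desired_run and all later runs, mirroring the update above."""
    conn.execute(
        "UPDATE reco_locked SET locked = 0 WHERE run >= ?", (desired_run,)
    )
    conn.commit()


# Stand-in table with hypothetical run numbers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reco_locked (run INTEGER PRIMARY KEY, locked INTEGER)")
conn.executemany(
    "INSERT INTO reco_locked VALUES (?, ?)",
    [(100, 1), (101, 1), (102, 1), (103, 0)],
)
fcsr_before = current_fcsr(conn)
set_fcsr(conn, 101)
fcsr_after = current_fcsr(conn)
print(fcsr_before, fcsr_after)
```

Note that the update only unlocks runs; if an earlier run were already unlocked, it would remain the fcsr, which is why the check on the lowest unlocked run matters.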

Modify the thresholds in the resource control of the Agent

  • Log in to the desired agent and become cmst1
  • Source the environment
     source /data/tier0/admin/ 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 

  • To change the general values:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 

  • To see the current thresholds and usage:
     $manage execute-agent wmagent-resource-control -p 
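When several task types need new thresholds, the command lines can be generated rather than typed one by one. The sketch below only builds the command strings, using the flags shown above; it does not execute anything, and resource_control_commands is a hypothetical helper, not part of the WMAgent tools.

```python
def resource_control_commands(site, thresholds):
    """Build one wmagent-resource-control command line per task type.

    `thresholds` maps a task type to a (pending_slots, running_slots) pair.
    """
    commands = []
    for task_type, (pending, running) in thresholds.items():
        commands.append(
            "$manage execute-agent wmagent-resource-control "
            f"--site-name={site} --task-type={task_type} "
            f"--pending-slots={pending} --running-slots={running}"
        )
    return commands


example = resource_control_commands(
    "T0_CH_CERN",
    {"Processing": (3000, 9000), "Merge": (50, 50)},
)
for line in example:
    print(line)
```

The printed lines can be reviewed and then pasted into a shell where the environment has been sourced, which keeps the actual change a deliberate manual step.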

Overriding the limit of Maximum Running jobs by the Condor Schedd

  • Log in as root on the Schedd machine
  • Go to:
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:

Unregistering an agent from WMStats

  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f`:9999  
  • You will be prompted for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.

Checking what is locking a database / Cern Session Manager

  • Go to this link 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.

Commissioning of a new node


Folder structure and permissions

  • These folders should be placed at /data/:
# Permissions Owner Group Folder Name
1. (775) drwxrwxr-x. root zh admin
2. (775) drwxrwxr-x. root zh certs
3. (755) drwxr-xr-x. cmsprod zh cmsprod
4. (700) drwx------. root root lost+found
5. (775) drwxrwxr-x. root zh srv
6. (755) drwxr-xr-x. cmst1 zh tier0


  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 
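The permission table above can be checked programmatically on a new node. The following sketch compares the mode of each folder against the expected values; it covers only the permission bits, not the owner or group, leaves out lost+found, and check_modes is a hypothetical helper written for this illustration.

```python
import os
import stat
import tempfile

# Expected modes taken from the table above (lost+found left out).
EXPECTED_MODES = {
    "admin": 0o775,
    "certs": 0o775,
    "cmsprod": 0o755,
    "srv": 0o775,
    "tier0": 0o755,
}


def check_modes(base_dir, expected=EXPECTED_MODES):
    """Return {folder: (actual_mode, expected_mode)} for each problem found.

    A missing folder is reported with actual_mode set to None.
    """
    problems = {}
    for name, wanted in expected.items():
        path = os.path.join(base_dir, name)
        try:
            actual = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            problems[name] = (None, wanted)
            continue
        if actual != wanted:
            problems[name] = (actual, wanted)
    return problems


# Demonstration in a temporary directory instead of /data/.
demo_base = tempfile.mkdtemp()
for folder, mode in EXPECTED_MODES.items():
    if folder == "tier0":
        continue  # left out on purpose to show a missing folder
    os.mkdir(os.path.join(demo_base, folder))
    os.chmod(os.path.join(demo_base, folder), mode)
os.chmod(os.path.join(demo_base, "certs"), 0o700)  # deliberately wrong
demo_problems = check_modes(demo_base)
for folder, (actual, wanted) in demo_problems.items():
    shown = "missing" if actual is None else oct(actual)
    print(folder, shown, "expected", oct(wanted))
```

On a real node the call would be check_modes("/data"), and an empty result means all listed folders have the expected permission bits.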

2. certs

  • Certificates are placed in this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem

NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section

5. srv

  • There you will find the
    folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.

6. tier0

  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.

File descriptions:
  • Script to deploy the WMAgent for production (*)
  • Script to deploy the WMAgent for a replay (*)
  • 00_readme.txt: Some documentation about the scripts
  • Gets the source code to use from GitHub for WMCore and the Tier0. Applies the described patches, if any.
  • Starts the agent after it is deployed.
  • Used during the deployment to start services such as CouchDB.
  • Stops the components of the agent. It doesn't delete any information from the file system or the T0AST; it just kills the processes of the services and the WMAgent components.
  • Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!

(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying.
(**) This script is not static. It might change when new patches are required and when the release versions of the WMCore and the Tier0 change. Check it before deploying.

Folder Description


Topic revision: r49 - 2017-03-16 - JohnHarveyCasallasLeon