Cookbook


Recipes for tier-0 troubleshooting. Most of them are written so that you can copy-paste, replace the placeholders with your own values, and obtain the expected results.

BEWARE: The writers are not responsible for side effects of these recipes; always understand the commands before executing them.

Corrupted merged file

This covers files that are already on tape and registered in DBS/TMDB. The procedure to recover them is basically to re-run all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).

Print .pkl files, Change job.pkl

  • Print job.pkl or Report.pkl in a tier0 WMAgent vm:
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', 'rb')
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
reportHandle = open('Report.3.pkl', 'rb')
loadedReport = cPickle.load(reportHandle)
reportHandle.close()
print loadedReport

  • In addition, to change the job.pkl:
import cPickle, os
jobHandle = open('job.pkl', 'rb')
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
# Do the changes on the loadedJob
output = open('job.pkl', 'wb')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()
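A complete end-to-end sketch of such an edit is shown below. The field name used here (estimatedMemoryUsage) is only an illustrative assumption; print the loaded job first and use the actual key you need to change:

import cPickle, os

jobHandle = open('job.pkl', 'rb')
loadedJob = cPickle.load(jobHandle)
jobHandle.close()

# Example change only: 'estimatedMemoryUsage' is a hypothetical key,
# inspect your own job.pkl (print loadedJob) to find the real one.
loadedJob['estimatedMemoryUsage'] = 4000

# Write the modified job back and force it to disk
output = open('job.pkl', 'wb')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()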

  • Print PSet.pkl in a workernode:
Set up the same environment used to run a job interactively, go to the PSet.pkl location, open a python console and do:

import FWCore.ParameterSet.Config as cms
import pickle
handle = open('PSet.pkl', 'r')
process = pickle.load(handle)
handle.close()
print process.dumpConfig()

Deleting entries in the database when input files are corrupted (Repack jobs)

# Show the workflow, subscription, run and lumi for each pending active lumi section
SELECT WMBS_WORKFLOW.NAME AS NAME,
       WMBS_WORKFLOW.TASK AS TASK,
       LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION,
       LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID,
       LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID
FROM LUMI_SECTION_SPLIT_ACTIVE
  INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID
  INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;

# This shows the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query

SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;

# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)

DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;

Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

All the T0 WMAgent instances can access the Cmsweb Tier0 Data Service instances. So, when changing the passwords it is necessary to be aware of which instances are running.

Instances currently in use (03/03/2015):

Instance Name           TNS
CMS_T0DATASVC_REPLAY_1  INT2R
CMS_T0DATASVC_REPLAY_2  INT2R
CMS_T0DATASVC_PROD      CMSR

  1. Review running instances.
  2. Stop each of them using:
     /data/tier0/00_stop_agent.sh 
  3. Verify that everything is stopped using:
     ps aux | egrep 'couch|wmcore' 
  4. Make sure you have the new password ready (generate it or get it in a safe way from whoever is creating it).
  5. From lxplus or any of the T0 machines, log in to the instance whose password you want to change using:
     sqlplus <instanceName>/<password>@<tns> 
    replacing the placeholders in angle brackets with the proper values for each instance.
  6. In sqlplus run the command password. You will be prompted for the old password, the new password, and a confirmation of the new password. Then you can exit sqlplus:
          SQL> password
          Changing password for <user>
          Old password: 
          New password: 
          Retype new password: 
          Password changed
          SQL> exit
          
  7. Then retry logging in to the same instance; if you cannot, you are in trouble!
  8. Communicate the password to the CMSWEB contact in a safe way. After their confirmation you can continue with the following steps.
  9. If everything went well you can now access all the instances with the new passwords. It is then necessary to update the secrets files on all the machines. These files are located in:
          /data/tier0/admin/
          
    They are normally named as follows (not all the instances will have all the files):
          WMAgent.secrets
          WMAgent.secrets.replay
          WMAgent.secrets.prod
          WMAgent.secrets.localcouch
          WMAgent.secrets.remotecouch
          
  10. If there was an instance running you may also need to change the password in:
         /data/tier0/srv/wmagent/current/config/tier0/config.py
         
    There you must look for the entry (see the example after this list):
          config.T0DataSvcDatabase.connectUrl
         
    and do the update.
  11. You can now restart the instances that were running before the change. Be careful: some components may fail when you start the instance, so be clear about the trade-off of starting it.
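For reference (step 10), the connectUrl entry in config.py follows the usual WMAgent Oracle URL convention. The line below is only an illustration: the attribute name is taken from the entry above, and the user, password and TNS are placeholders, not real values.

# Hypothetical example of the line to update in
# /data/tier0/srv/wmagent/current/config/tier0/config.py
# (user, password and TNS are placeholders -- use your own values):
config.T0DataSvcDatabase.connectUrl = 'oracle://CMS_T0DATASVC_PROD:<newPassword>@cmsr'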

Modifying a workflow sandbox

If you need to change a file in a workflow sandbox, i.e. in the WMCore zip, this is the procedure:

# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there and then untar the sandbox and unzip WMCore:

cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 
unzip -q WMCore.zip

Now replace/modify the files in WMCore. Then you have to pack everything again. You should remove the old sandbox and WMCore.zip first:

# Remove former sandbox and WMCore.zip, then create the new WMCore.zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip
zip -rq WMCore.zip WMCore

# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*

# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/

Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong.
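The same repacking can also be scripted. Below is a rough Python sketch of the shell steps above, assuming the same example work area and sandbox name (adjust both, and the backup handling, to your case):

import os
import shutil
import tarfile
import zipfile

# Example work area and sandbox name -- adjust to your workflow
work = '/data/tier0/lcontrer/temp'
sandbox = 'PromptReco_Run245436_Cosmics-Sandbox.tar.bz2'

os.chdir(work)
shutil.copy(sandbox, sandbox + '.orig')   # keep a backup of the original sandbox

# Unpack the sandbox and WMCore.zip
tar = tarfile.open(sandbox, 'r:bz2')
tar.extractall('.')
tar.close()
wmcoreZip = zipfile.ZipFile('WMCore.zip')
wmcoreZip.extractall('.')
wmcoreZip.close()

# ... replace/modify files under WMCore/ here ...

# Rebuild WMCore.zip from the patched tree
os.remove('WMCore.zip')
newZip = zipfile.ZipFile('WMCore.zip', 'w', zipfile.ZIP_DEFLATED)
for root, dirs, files in os.walk('WMCore'):
    for name in files:
        newZip.write(os.path.join(root, name))
newZip.close()

# Rebuild the sandbox with the same content as before: the extracted WMCore/
# source tree is not part of the sandbox, only WMCore.zip is.
os.remove(sandbox)
newTar = tarfile.open(sandbox, 'w:bz2')
for entry in os.listdir('.'):
    if entry not in (sandbox, sandbox + '.orig', 'WMCore'):
        newTar.add(entry)
newTar.close()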

Force Releasing PromptReco

Normally PromptReco workflows have a predefined release delay (currently 48h). Sometimes we need to release them manually at a particular moment. To do it:

  • Check which runs you want to release.
  • Remember: if some runs are still in Active state the workflows will be created anyway, but you still have to solve the bookkeeping (or similar) problems.
  • The following query pre-releases the non-released runs whose run number is lower than or equal to a particular value. Depending on which runs you want to release, you should adapt this condition. You can run only the SELECT first, to be sure you are releasing just the runs you want, before doing the UPDATE.
UPDATE ( 
         SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM  reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
               AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
         SET t.released = 1,
             t.delay = 10,
             t.delay_offset = 5;
  • Check the Tier0Feeder logs. You should see log lines for all the runs you released.

Running a replay on a headnode

  • To run a replay on an instance used for production (for example before deploying it in production) you should check the following:
    • If production ran on this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the elog history.
    • Do not use the production 00_deploy.sh; use the replay 00_deploy.sh script instead.

Changing Tier0 Headnode

# Instruction (Responsible Role)
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance; please check the "Running a replay on a headnode" section. (Tier0)
1. Deploy the new prod instance in vocms0314, check that we use: (Tier0)
1.5. Check the ProdOfflineConfiguration that is being used. (Tier0)
2. Start the Tier0 instance in vocms0314. (Tier0)
3. Coordinate with the Storage Manager so we have a stop in data transfers, respecting run boundaries. (Before this, we need to check that all the runs currently in the Tier0 are OK with bookkeeping, i.e. no runs in Active status.) (SMOps)
4. Check that all transfers are stopped. (Tier0)
  4.1. Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
  4.2. Check /data/Logs/General.log
5. Change the config file of the transfer system to point to T0AST1, i.e. go to /data/TransferSystem/Config/TransferSystem_CERN.cfg and change the following settings to match the new head node T0AST: (Tier0)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
6. Make a backup of the General.log.* files. (This backup is only needed if using t0_control restart in the next step; if using t0_control stop + t0_control start the logs won't be affected.) (Tier0)
7. Restart the transfer system using either: (Tier0)
  A) t0_control restart (will erase the logs)
  B) t0_control stop, then t0_control start (will keep the logs)
8. Kill the replay processes (if any). (Tier0)
9. Start notification logs to the SM in vocms0314. (Tier0)
10. Change the configuration for Kibana monitoring to point to the proper T0AST instance. (Tier0)
11. Restart transfers. (SMOps)

Changing CMSSW Version

If you need to upgrade the CMSSW version the normal procedure is:

  • Open the ProdOfflineConfiguration currently in use:
      /data/tier0/admin/ProdOfflineConfiguration.py
  • Change the defaultCMSSWVersion field to the desired CMSSW version, for example:
      defaultCMSSWVersion = "CMSSW_7_4_7"
  • Update the repack and express mappings, for example:
      repackVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
      }
      expressVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
      }
  • Save the changes

  • Find either the last run using the previous version or the first run using the new version for Express and PromptReco. You can use the following queries in T0AST to find runs with a specific CMSSW version:
 
       select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       

  • Report the change including the information of the first runs using the new version (or last runs using the old one).

Backup T0AST (Database)

If you want to do a backup of a database (for example, after retiring a production node, you want to keep the information of the old T0AST) you should:

  • Request a target database: normally these databases are owned by dirk.hufnagel@cern.ch, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket requesting the backup. For this you should send an email to phydb.support@cern.ch. An example of such a message can be found in this Elog.
  • When the backup is done you will get a reply to your ticket confirming it.

Repacking gets stuck but the bookkeeping is consistent

Description
  • P5 sent over data with lumi holes and consistent accounting.
  • T0 started the repacking.
  • P5 sent data for previous lumis where the bookkeeping said there wasn't any data.
  • The T0 JobSplitter went into an inconsistent state because it runs in order: when it found unrepacked data that was supposed to go before data that was already repacked, it got "confused".
  • T0 runs got stuck: some of them never started, others never finished even though the bookkeeping was OK.
Example
  1. Bookkeeping shows that lumis 52,55,56,57 will be transferred.
  2. Lumis 52,55,56,57 are transferred.
  3. Lumis 52,55,56,57 are repacked.
  4. Lumis 41, 60,91,121,141,145 are transferred.
  5. JobSplitter gets confused because of lumi 41 (lumis 60 and above are all higher than the last repacked lumi, so no problem with them) -> JobSplitter gets stuck
Procedure to fix

Please note: these are not copy/paste instructions. This is more a description of the procedure that was followed in the past to deal with the problem, and it can be used as a guide.

  • Update the lumi_section_closed records to have filecount=0 for lumis without data and filecount=1 for lumis with data.
           # find lumis to update
           update lumi_section_closed set filecount = 0 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
           update lumi_section_closed set filecount = 1 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;      
  • Delete the problematic data (i.e. the files that belong to the lumis that were not originally in the bookkeeping but were transferred anyway).
           delete from wmbs_sub_files_available where fileid in ( ... );
           delete from wmbs_fileset_files where fileid in ( ... );
           delete from wmbs_file_location where fileid in ( ... );
           delete from wmbs_file_runlumi_map where fileid in ( ... );
           delete from streamer where id in ( ... );
           delete from wmbs_file_details where id in ( ... );     

Updating the wall time the jobs are using in the condor ClassAd

This time can be modified using the following command. Remember that it should be executed as the owner of the jobs.

condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440

Updating T0AST when a lumisection cannot be transferred

update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;

Example:

update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634
where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;

Restarting head node machine

  1. Stop Tier0 agent
    00_stop_agent.sh
  2. Stop condor
    service condor stop 
    If you want your data to still be available, copy your spool directory to disk:
    cp -r /mnt/ramdisk/spool /data/
  3. Restart the machine (or request its restart)
  4. Mount the RAM Disk (Condor spool won't work otherwise).
  5. If necessary, copy back the data to the spool.
  6. When restarted, start the sminject component
    t0_control start 
  7. Start the agent
    00_start_agent
    In particular, check the PhEDExInjector component; if you see errors there, try restarting it after sourcing init.sh:
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
    $manage execute-agent wmcoreD --restart --component PhEDExInjector

Updating TransferSystem for StorageManager change of alias

Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:

  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
              
  5. Restart the TransferSystem on the T0 headnode. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
              

Restart component in case of deadlock

If a component crashes due to a deadlock, in most cases restarting it is enough to control the situation. In that case the procedure is:

  • Log in to the Tier0 headnode (cmst1 user is required)
  • Source the environment
    source /data/tier0/admin/env.sh  
  • Execute the following command to restart the component, replacing <componentName> with the specific component name (DBS3Upload, PhEDExInjector, etc.):
    $manage execute-agent wmcoreD --restart --components=<componentName>
    Example
    $manage execute-agent wmcoreD --restart --components=DBS3Upload

Changing Tier0 certificates

  • Check that using the new certificates grants privileges to all the needed resources:

Voboxes

  • Copy the servicecert*.pem, servicekey*.pem and serviceproxy*.pem files to
/data/certs 
  • Update the following files to point to the new certificates
admin/env.sh
admin/env_unit.sh

Kibana

  • Change the certificates in the monitoring scripts where they are used. To see where the certificates are being used and the current monitoring head node, please check the Tier0 Monitoring Twiki.

TransferSystem

  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart the component:
/data/TransferSystem/t0_control.sh

Getting Job Statistics

This is the base script to compile the information of jobs that are already done:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py

For the analysis we need to define certain things (an illustrative example follows this list):

  • main_dir: Folder where the input log archives are, e.g. '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/', in main()
  • temp: Folder where output json files are going to be generated, in main().
  • runList: Runs to be analyzed, in main()
  • Job type in two places:
    • getStats()
    • stats[dataset] in main()
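As an illustration, those definitions inside main() might look like the following; the paths, user name and run numbers are made-up placeholders:

# Illustrative values only -- edit these inside main() of stats.py
main_dir = '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/'
temp = '/data/tier0/<user>/stats_output/'   # hypothetical folder for the output json files
runList = [262548, 262694]                  # made-up run numbers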

The script is run without any parameters. This generates a json file with information about cpu, memory, storage and start and stop times. The task is also included. An example of an output file is:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json

With a separate script in R, I was reading and summarizing the data:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R

There, the task type and the output file should be defined. With this script I was just summarizing CPU data, but we could modify it a little to get memory data as well. Maybe it is quicker to do it directly with the first python script, if you prefer.

That script calculates efficiency of each job:

TotalLoopCPU / (TotalJobTime * numberOfCores)

and an averaged efficiency per dataset:

sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores) 

numberOfCores was obtained from job.pkl; TotalLoopCPU and TotalJobTime were obtained from Report.pkl.
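As a cross-check, the same per-job and per-dataset efficiencies can be computed with a few lines of Python; the dataset name and numbers below are purely illustrative:

# Minimal sketch: per-job and averaged per-dataset CPU efficiency.
# The values are made up; in practice they come from job.pkl / Report.pkl.
jobs = [
    # (dataset, TotalLoopCPU [s], TotalJobTime [s], numberOfCores)
    ('/Cosmics/Run2015-PromptReco/RECO', 31000.0, 9000.0, 4),
    ('/Cosmics/Run2015-PromptReco/RECO', 28000.0, 8500.0, 4),
]

# Efficiency of each job: TotalLoopCPU / (TotalJobTime * numberOfCores)
for dataset, cpu, wall, cores in jobs:
    print('%s job efficiency: %.3f' % (dataset, cpu / (wall * cores)))

# Averaged efficiency per dataset: sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores)
totals = {}
for dataset, cpu, wall, cores in jobs:
    cpuSum, slotSum = totals.get(dataset, (0.0, 0.0))
    totals[dataset] = (cpuSum + cpu, slotSum + wall * cores)

for dataset, (cpuSum, slotSum) in totals.items():
    print('%s dataset efficiency: %.3f' % (dataset, cpuSum / slotSum))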

Job type could be Processing, Merge or Harvesting. For the Processing type, the task could be Reco or AlcaSkim; for the Merge type, ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.

Unpickling the PSet.pkl file (job configuration file)

To modify the configuration of a job, you can modify the content of the PSet.pkl file. In order to do this you have to dump the pkl file into a python file and make the necessary changes there. For this you will normally need FWCore.ParameterSet.Config; if it is not present in your python path you can add it:

//BASH

export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1 for running, and that's why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use.

Now you can use the following snippet to dump the file:

//PYTHON

import FWCore.ParameterSet.Config
import pickle
pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)

# This line will only print the python version of the pkl file on the screen
process.dumpPython()

#The actual writing of the file
outputFile = open('PSetPklAsPythonFile.py', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:

cmsRun PSetPklAsPythonFile.py

or

 
cmsRun -e PSet.py 2>err.txt 1>out.txt &

Checking transfer status at agent shutdown

Before shutting down an agent, you should check that all subscriptions and transfers were performed. This is equivalent to checking the deletion of all blocks at T0_CH_CERN_Disk and resolving any related issues, such as open blocks, DDM conflicts, etc.

Check which blocks have not been deleted yet.

select blockname
from dbsbuffer_block
where deleted = 0

Some datasets could be marked as subscribed in the database but not really be subscribed in PhEDEx. You can check this with the Transfer Team and, if that is the case, retry the subscription by setting subscribed to 0. You can narrow the query to blocks with a given name pattern or blocks at a specific site.

update dbsbuffer_dataset_subscription
set subscribed = 0
where dataset_id in (
  select dataset_id
  from dbsbuffer_block
  where deleted = 0
  <and blockname like...>
)
<and site like ...>

Some blocks can be marked as closed but still be open in PhEDEx. If this is the case, you can set the status to 'InDBS' to try closing them again. For example, if you want to close MiniAOD blocks, you can provide a name pattern like '%/MINIAOD#%'.

The status attribute can have 3 values: 'Open', 'InDBS' and 'Closed'. 'Open' is the first value assigned to all blocks; when they are closed and injected into DBS the status is changed to 'InDBS', and when they are closed in PhEDEx the status is changed to 'Closed'. Setting the status to 'InDBS' makes the agent retry closing the blocks in PhEDEx.

update dbsbuffer_block
set status = 'InDBS'
where deleted = 0
and status = 'Closed'
and blockname like ... 

If some subscriptions shouldn't be checked anymore, remove these subscriptions from the database. For instance, if you want to remove RAW subscriptions to disk for all T1s, you can give a path pattern like '/%/%/RAW' and a site like 'T1_%_Disk'.

delete dbsbuffer_dataset_subscription
where dataset_id in (
  select id
  from dbsbuffer_dataset
  where path like ...
)
and site like ...

Disabling flocking to Tier0 Pool

If you need to prevent new Central Production jobs from being executed in the Tier0 pool, it is necessary to disable flocking. To do so you should follow these steps. Be careful: you will make changes in the GlideInWMS Collector and Negotiator, and you can cause a big mess if you don't proceed with caution.

NOTE: Root access to the GlideInWMS Collector is granted to the members of the cms-tier0-operations@cern.ch e-group.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized, which means any change will be overridden when puppet runs. There is one non-puppetized file called 99_local_tweaks.config, which is the one used to make the changes we desire.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
     -rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
     -rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
     -rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
     -rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
     -rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
     -rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
     -rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
     -rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
     -rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
     -rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config

Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.

  • To actually disable flocking you should:
    • Uncomment this line:
       # <----- Uncomment here ------->
       # CERTIFICATE_MAPFILE= /data/srv/glidecondor/condor_mapfile 
    • Comment out all the Central Production Schedds from the whitelist:
       # <---- Comment out all the schedds below this to disable flocking ---->
       # Adding global pool CERN production schedds for flocking
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0230.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0304.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0308.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0309.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0310.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0311.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0303.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms026.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms053.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms005.cern.ch
       # Adding global pool FNAL production schedds for flocking
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmsgwms-submit2.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmsgwms-submit1.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv217.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv218.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv219.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv248.fnal.gov 
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 

  • Now you can check the Schedds whitelisted to run in the Tier0 pool; the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  

Remember that this change won't remove/evict the jobs that are already running, but it will prevent new jobs from being sent.

Enabling pre-emption in the Tier0 pool

BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you have agreed on it with the Workflow Team.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
  • Open 99_local_tweaks.config
  • Locate this section:
     # How to drain the slots
        # graceful: let the jobs finish, accept no more jobs
        # quick: allow job to checkpoint (if supported) and evict it
        # fast: hard kill the jobs
       DEFRAG_SCHEDULE = graceful 
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes. After this the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change
     DEFRAG_SCHEDULE = graceful 

Changing the status of T0_CH_CERN site in SSB

  • You should go to the Prodstatus Metric Manual Override site.
  • There you will be able to change the status of T0_CH_CERN. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of these reasons can be checked here). Then click "Apply" and the procedure is complete. Only the users in the cms-tier0-operations e-group are able to make this change.
  • The status in the SSB site is updated every 15 minutes, so you should see the change there after at most this amount of time.
  • The documentation can be checked here.

Changing priority of jobs that are in the condor queue

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs (changing the new priority, filtering the jobs to be modified, etc.):
    for job in $(condor_q -w | awk '{print $1}')
         do
               condor_qedit $job JobPrio "508200001"
         done  

Updating workflow from completed to normal-archived in WMStats

  • To move workflows in completed state to archived state in WMStats, the following script should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 

  • The script should be copied to the bin folder of the wmagent code. For instance, in replay instances:
     /data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/ 

  • The script should be modified, assigning a run number in the following statement:
     if info['Run'] < <RunNumber>
    As you should notice, the given run number will be the oldest run to be shown in WMStats.

  • After it, the code can be executed with:
     $manage execute-agent updateT0RequestStatus.py 

Adding runs to the skip list in the t0Streamer cleanup script

The script is running as a cronjob under the cmsprod acrontab. It is located in the cmsprod area on lxplus.

# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmsprod has writing permission on eos - cms-tier0-operations@cern.ch
0 5 * * * lxplus /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py >> /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1

To add a run to the skip list:

  • Login as cmsprod on lxplus.
  • Go to the script location and open it with an editor:
    /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on line 83:
      # run number in this list will be skipped in the iteration below
        runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357]  
  • Add the desired run at the end of the list. Be careful not to remove the existing runs.
  • Save the changes.
  • It is done! Don't forget to add the run to the Good Runs Twiki.

NOTE: It was decided to keep the skip list within the script instead of an external file, to avoid the risk of deleting runs in case of an error reading such an external file.

Restarting Tier-0 voboxes

  • vocms001 (virtual machine): Replays (normally used by the developer), Transfer System
  • vocms015 (virtual machine): Replays (normally used by the operators), Tier0 Monitoring
  • vocms047 (virtual machine): Replays (normally used by the operators)
  • vocms0313 (physical machine): Production node
  • vocms0314 (physical machine): Production node
Before restarting one of these nodes you need to check the following, depending on its use:
  • Production node:
    • The agent is not running and the couch processes were stopped correctly.
    • These nodes use a RAMDISK. Mounting it is puppetized, so you need to make sure that puppet ran before starting the agent again.
  • TransferSystem:
  • Replays:
    • The agent should not be running, check the Tier0 Elog to make sure you are not interfering with a particular test.
  • Tier0 Monitoring:
    • The monitoring is executed via a cronjob. The only consequence of the restart should be that no reports are produced during the downtime. However, you can check that everything is working by going to:
      /data/tier0/sls/scripts/Logs

To restart a machine you need to:

  • Login and become root
  • Do the particular checks (listed above) based on the machine that you are restarting
  • Run the restart command
    shutdown -r now

After restarting the machine, it is convenient to run puppet. You can either wait for the periodic execution or run it manually:

  • puppet agent -tv

Modifying jobs to resume them with other features (like memory, disk, etc.)

Some scripts are already available to do this; they need to be provided with:

  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending on the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending on the case. Below we show some basic examples of how to do this:

Some cases have proven that you need to modify the Workflow Sandbox when you want to modify the following variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

Modifying the job description has proven to be useful to change the following variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)

At /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs there are two directories named "job" and "workflow". You should enter the respective directory. Follow the instructions below on the agent machine in charge of the jobs to modify.

Modifying the Workflow Sandbox

Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/workflow

In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given that the workflow configuration is different for different streams.

Use the script print_workflow_config.sh to generate a human-readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Then use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:

feature=maxRSS
value=15360000

Executing generate_code.sh will create a script named after the feature, like modify_wmworkload_maxRSS.py. The latter will modify the selected feature in the Workflow Sandbox.

Once it has been generated, you need to add a call to that script in modify_one_workflow.sh. The latter will call all the required scripts, create the tarball and place it where required (the Specs folder).

Finally, execute modify_several_workflows.sh which will call modify_one_workflow.sh for all the desired workflows.

The previous procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:

vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh

Modifying the Job Description

Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job

Very similar to the procedure to modify the Workflow Sandbox: add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the name of the feature to modify. After that, use modify_pset.py as a base to create another file which will modify the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the just-created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the features mentioned at the beginning of the section.

vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh

