Cookbook

Recipes for Tier-0 troubleshooting. Most of them are written so that you can copy-paste the commands, replace the placeholders with your values, and obtain the expected results.

BEWARE: The authors are not responsible for any side effects of these recipes; always understand the commands/actions before executing them.


Tier-0 Configuration modifications

Replay instructions

NOTE1: An older, more theoretical version of the replay docs is available here.

NOTE2: The 00_XYZ scripts are located at /data/tier0/; absolute paths are provided in these instructions for clarity.

In order to start a new replay, first make sure that the instance is available: check the Tier0 project on Jira and look for the latest ticket about the vobox. If it is closed and/or it indicates that the previous replay is over, you can proceed.

A) You need to check that the previously deployed Tier0 WMAgent is not running on the machine anymore. To do so, use the following commands.

  • The queue of the condor jobs:
    condor_q
    If there are any jobs left, you can use condor_rm -all to remove everything.
  • The list of the Tier0 WMAgent related processes:
    runningagent (this is an alias included in the cmst1 config; actual command: ps aux | egrep 'couch|wmcore|mysql|beam')
  • If the list is not empty, you need to stop the agent:
    /data/tier0/00_stop_agent.sh

B) Setting up the new replay. How do you choose a run number for the replays:

    • Go to prodmon to check the processing status. There you can see that the run was 5.5h long and quite large (13TB) - I would look for something smaller (~1h), but let's assume it's ok.
    • Go to WBM and check the conditions:
      • Check the initial and ending lumi - it should be ~10000 or higher. Currently we run at ~6000-7000, so it's ok as well.
      • Check whether the physics flag was set for a reasonably high number of lumi sections (if not, we are looking at some non-physics/junk data) - looks good.
    • So, given that the run is too long, I would look for a different one following that logic. You should look for a recent collision run in prodmon.

  • Edit the replay configuration (change the run, CMSSW version or whatever you need):
    /data/tier0/admin/ReplayOfflineConfiguration.py
  • Then run the scripts to start the replay:
      ./00_software.sh # loads the newest version of the WMCore and T0 github repositories
      ./00_deploy.sh # deploys the new configuration, wipes the T0AST database, etc.
  • The 00_deploy.sh script wipes the T0AST database. This is fine on replay machines, but you do not want it to happen on a headnode, so be careful while running this script!
       ./00_start_agent.sh # starts the new agent - loads the job list etc.
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/Tier0Feeder/ComponentLog
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobCreator/ComponentLog 
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobSubmitter/ComponentLog
  • Finally, check again the condor queue and the runningagent lists. Now there should be a list of newly available jobs and their states:
      condor_q
      runningagent

Adding a new scenario to the configuration

  • Go to the scenarios section.
  • Declare a new variable for the new scenario. Give it a meaningful name ending with the "Scenario" suffix.
     <meaningfulName>Scenario = "<actualNameOfTheNewScenario>" 
  • Make a new Pull Request adding the scenario to the scenario creation code: https://github.com/dmwm/T0/blob/master/src/python/T0/WMBS/Oracle/Create.py#L864
  • NOTE: If the instance is already deployed, you can manually add the new scenario directly on the event_scenario table of the T0AST. The Tier0Feeder will pick the change up in the next polling cycle.
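
For the manual T0AST route, a minimal sketch with sqlplus is shown below. It assumes the event_scenario table exposes ID and NAME columns; verify the exact schema (and whether IDs come from a sequence) in the Create.py linked above before running anything.

# Hypothetical sketch: manually register a new scenario in a deployed T0AST.
# Column names and ID generation are assumptions; check Create.py first.
sqlplus <instanceName>/<password>@<tns> <<'EOF'
INSERT INTO event_scenario (id, name)
SELECT NVL(MAX(id), 0) + 1, '<actualNameOfTheNewScenario>' FROM event_scenario;
COMMIT;
EOF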

How to Delay Prompt Reco Release?

Delaying the PromptReco release is easy: you only have to change, in the config file (/data/tier0/admin/ProdOfflineConfiguration.py):

defaultRecoTimeout =  48 * 3600

to something higher, like 10 * 48 * 3600. The Tier0Feeder checks this timeout every polling cycle, so when you want to release it again you just need to go back to the 48h delay.

Changing CMSSW Version

If you need to upgrade the CMSSW version, the normal procedure is:

  • Edit the configuration file:
      /data/tier0/admin/ProdOfflineConfiguration.py
  • Change the defaultCMSSWVersion field to the desired CMSSW version, for example:
      defaultCMSSWVersion = "CMSSW_7_4_7"
  • Update the repack and express mappings, for example:
      repackVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
      }
     expressVersionOverride = {
        "CMSSW_7_4_2" : "CMSSW_7_4_7", 
        "CMSSW_7_4_3" : "CMSSW_7_4_7",
        "CMSSW_7_4_4" : "CMSSW_7_4_7",
        "CMSSW_7_4_5" : "CMSSW_7_4_7",
        "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }
  • Save the changes

  • Find either the last run using the previous version or the first run using the new version for Express and PromptReco. You can use the following queries in T0AST to find runs with a specific CMSSW version:
 
       select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       

  • Report the change including the information of the first runs using the new version (or last runs using the old one).

Change a file size limit on Tier0

As of October 2017, the file size limit was increased from 12GB to 16GB. If a further change is needed, the following values need to be modified:

  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py
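
Before editing, it can help to confirm where those parameters currently live and what their values are; a quick check (using the paths from this section) could look like:

# Show the current file size limits before changing them
grep -nE 'maxSizeSingleLumi|maxEdmSize' /data/tier0/admin/ProdOfflineConfiguration.py
grep -n 'maxAllowedRepackOutputSize' /data/tier0/srv/wmagent/current/config/tier0/config.py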

Force Releasing PromptReco

Normally PromptReco workflows have a predefined release delay (currently: 48h). Sometimes we need to release them manually at a particular moment. To do so:

  • Check which runs you want to release.
  • Remember: if some runs are still in Active status the workflows will be created, but you still have to solve the bookkeeping (or similar) problems.
  • The following query pre-releases the non-released runs whose ID is lower than or equal to a particular value. Depending on which runs you want to release, you should adjust this condition. You can run only the SELECT first to be sure you are releasing only the runs you want, before doing the update (see the sketch after this list).
UPDATE ( 
         SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM  reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
               AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
         SET t.released = 1,
             t.delay = 10,
             t.delay_offset = 5;
  • Check the Tier0Feeder logs. You should see log lines for all the runs you released.
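
A sketch of that preview SELECT, wrapped in sqlplus, is shown below; it just lists the not-yet-released runs that the UPDATE above would touch (columns taken from the UPDATE itself, run condition to be adjusted as usual):

# Preview which runs the force-release UPDATE would affect (adjust the run condition)
sqlplus <instanceName>/<password>@<tns> <<'EOF'
SELECT run_id, released, delay, delay_offset
FROM reco_release_config
WHERE checkForZeroOneState(released) = 0
  AND run_id <= <Replace By the desired Run Number>;
EOF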

PromptReconstruction at T1s/T2s

There are 3 basic requirements to perform PromptReconstruction at T1s (and T2s):

  • Each desired site should be configured in the T0 Agent Resource Control. For this, the /data/tier0/00_deploy.sh file should be modified, specifying pending and running slot thresholds for each type of processing task. For instance:
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20

A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):

$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p

  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
datasets = [ "DisplacedJet" ]

for dataset in datasets:
    addDataset(tier0Config, dataset,
               do_reco = True,
               raw_to_disk = True,
               tape_node = "T1_IT_CNAF_MSS",
               disk_node = "T1_IT_CNAF_Disk",
               siteWhitelist = [ "T1_IT_CNAF" ],
               dqm_sequences = [ "@common" ],
               physics_skims = [ "LogError", "LogErrorMonitor" ],
               scenario = ppScenario)

  • Jobs should be able to write to the T1 storage systems. For this, a proxy with the production VOMS role should be provided at /data/certs/. The variable X509_USER_PROXY defined in /data/tier0/admin/env.sh should point to the proxy location. A proxy with the required role can not be generated for a time span longer than 8 days, so a cron job should be responsible for the renewal. For jobs to stage out at T1s, there is no need to map the Distinguished Name (DN) shown in the certificate to specific users at the T1 sites; the mapping is made with the role of the certificate. This could be needed to stage out at T2 sites. The information of a valid proxy is shown below:
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       : voms2.cern.ch:15002
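
A listing like the one above can be obtained with voms-proxy-info against the proxy file on the vobox, for example:

# Inspect the service proxy, including the VO extension (path as used in this section)
voms-proxy-info -all -file /data/certs/serviceproxy-vocms001.pem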

Adding runs to the skip list in the t0Streamer cleanup script

The script runs as an acrontab job under the cmst0 account. It is located in the cmst0 area on lxplus.

# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmst0 has writing permission on eos - cms-tier0-operations@cern.ch
0 10,22 * * * lxplus.cern.ch /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.sh >> /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1

To add a run to the skip list:

  • Login as cmst0 on lxplus.
  • Go to the script location and open it:
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on line 117:
      # run number in this list will be skipped in the iteration below
        runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357,...]  
  • Add the desired run at the end of the list. Be careful and do not remove any existing runs.
  • Save the changes.
  • That's it! Don't forget to add the run to the Good Runs Twiki.

NOTE: It was decided to keep the skip list within the script instead of in an external file, to avoid the risk of deleting runs in case of an error reading such an external file.



Debugging/fixing operational issues (failing, paused jobs etc.)

How do I look for paused jobs?

  • WMStats:
Go to the production WMStats and click on the Request tab. Sort by the paused-jobs column; if there are any paused jobs you will see the workflows that have them. Click on the 'L' related to the workflow with paused jobs. You will go to the Jobs tab; now click on the 'L' of the jobs that are paused.

For documentation on WMStats, please read CompOpsTier0TeamWMStatsMonitoring.

  • T0AST: Log into the T0AST and run the following query; you will get the paused jobs' id, name and cache_dir.
SELECT id, name, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');

You can use this query to get the workflows that have paused jobs:

SELECT DISTINCT(wmbs_workflow.NAME) FROM wmbs_job 
inner join wmbs_jobgroup on wmbs_job.jobgroup = wmbs_jobgroup.ID
inner join wmbs_subscription on wmbs_subscription.ID = wmbs_jobgroup.subscription
inner join wmbs_workflow on wmbs_subscription.workflow = wmbs_workflow.ID
WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused')
and wmbs_job.cache_dir like '%Reco%';

Paused jobs can also be in state 'submitfailed'

  • Tier0 vm: Log into the tier0 vm and do:
cd /data/tier0/srv/wmagent/current/install/tier0
find ./JobCreator/JobCache -name Report.3.pkl

This will return the cache dir of the paused jobs (This may not work if the jobs were not actually submitted - submitfailed jobs do not create Report.*.pkl)

How do I get the job tarballs?

  • Go to the cache dir of the job
  • Look for the PFN of the output .tar.gz in the condor.*.out of the last retry
  • From an lxplus machine do:
xrdcp PFN .

How do I fail/resume paused jobs?

#Source environment 
source /data/tier0/admin/env.sh

# Fail paused-jobs
$manage execute-agent paused-jobs -f -j 10231

# Resume paused-jobs
$manage execute-agent paused-jobs -r -j 10231

You can use the following options:

-j job
-w workflow
-t taskType
-s site
-d do not commit changes, only show what it would do

To do mass fails / resumes for a single error code, the following commands are useful:

cp ListOfPausedJobsFromDB /data/tier0/jocasall/pausedJobsClean.txt
python /data/tier0/jocasall/checkPausedJobs.py
awk -F '_' '{print $6}' code_XXX > jobsToResume.txt
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt

Data is lost in /store/unmerged - input files for Merge jobs are lost (an intro to running a job interactively)

If some intermediate data in EOS (i.e. data in /store/unmerged) is lost/corrupted due to some problem (power outage, disk write buffer problems) and there is nothing site support can do about it, you can rerun the successful jobs (which the WMAgent thinks are already done) interactively:

  • First get the job tarball: LogCollect may have already run, so go to the oracle database and run this query (replace the LFN of the lost/corrupted file) to find out which .tar to look in:
select DISTINCT(tar_details.LFN) from wmbs_file_parent
inner join wmbs_file_details parentdetails on wmbs_file_parent.CHILD = parentdetails.ID
left outer join wmbs_file_parent parents on parents.PARENT = wmbs_file_parent.PARENT
left outer join wmbs_file_details childsdetails on parents.CHILD = childsdetails.ID
left outer join wmbs_file_parent childs on childsdetails.ID = childs.PARENT
left outer join wmbs_file_details tar_details on childs.CHILD = tar_details.ID
where childsdetails.LFN like '%tar.gz' and parentdetails.LFN in ('/store/unmerged/express/Commissioning2014/StreamExpressCosmics/ALCARECO/Express-v3/000/227/470/00000/A25ED7B5-5455-E411-AA08-02163E008F52.root',
'/store/unmerged/data/Commissioning2014/MinimumBias/RECO/PromptReco-v3/000/227/430/00000/EC5CF866-5855-E411-BC82-02163E008F75.root');

  • Get the .tar from castor, e.g.:
lcg-cp srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2014/10/WMAgent/PromptReco_Run227430_MinimumBias/PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar ./PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar

  • Untar the log collection and look for the file UUID among the tarballs
tar -xvf PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
zgrep <UUID> ./LogCollection/*.tar.gz 

  • Now you should know which job reported the UUID of the corrupted file. Untar that tarball and run the job interactively (to untar: tar -zxvf).
  • If you need to set a local input file, you can change the PSet.pkl file to point to a local file. However, you need to change the trivialcatalog_file and override the protocol to direct, e.g.:
S'trivialcatalog_file:/home/glidein_pilot/glide_aXehes/execute/dir_30664/job/WMTaskSpace/cmsRun2/CMSSW_7_1_10_patch2/override_catalog.xml?protocol=direct'

Changes to:

S'trivialcatalog_file:/afs/cern.ch/user/l/lcontrer/scram/plugins/override_catalog.xml?protocol=direct'
  • Copy the output file of the job you ran interactively to /store/unmerged/... You need to do this using a valid production proxy/cert (e.g. you can use the certs of the T0 production machine). The cmsprod user is the owner of these files.
eos cp <local file> </eos/cms/store/unmerged/...>

Run a job interactively

  • Log into lxplus
  • Get the tarball of the job you want to run and untar it, e.g.:
tar -zxvf 68d93c9c-db7e-11e3-a585-00221959e789-46-0-logArchive.tar.gz 

  • Create your proxy, then create the scram area:
# Create a valid proxy
voms-proxy-init -voms cms

# Source CMSSW environment 
source /cvmfs/cms.cern.ch/cmsset_default.sh

# Create the scram area (Replace the release for the one the job should use)
scramv1 project CMSSW CMSSW_7_4_0

  • Go to the src area in the CMSSW directory you created, then copy the PSet.pkl and PSet.py there (from the untarred job's cmsRun1/2 directory)
# Go to the src area
cd CMSSW_7_4_0/src/

  • Do eval and run the job
eval `scramv1 runtime -sh`

# Actually run the job (you can pass the parameter to create a fwjr too)
cmsRun PSet.py

  • Hacking CMSSW configuration
# If you need to modify the job for whatever reason (like drop some input to get at least some 
# statistics for a DQM harvesting job) you need to first to get a config dump in python format 
# instead of pickle. Keep in mind that the config file is very big.
# Modify PSet.py by adding "print process.dumpPython()" as a last command and run it using python
python PSet.py > cmssw_config.py

# Modify cmssw_config.py (For example find process.source and remove files that you don't want to run on). Save it and use it as input for cmsRun instead of PSet.py
cmsRun cmssw_config.py

Updating T0AST when a lumisection can not be transferred.

update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;

Example:

update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634
where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;

Unpickling the PSet.pkl file (job configuration file)

To modify the configuration of a job, you can modify the content of the PSet.pkl file. To do this you have to dump the pkl file into a python file and make the necessary changes there. You will normally need ParameterSet.Config for this; if it is not present in your python path, you can add it:

//BASH

export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1, and that's why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use.

Now you can use the following snippet to dump the file:

//PYTHON

import FWCore.ParameterSet.Config
import pickle
pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)

#This line only will print the python version of the pkl file on the screen
process.dumpPython()

#The actual writing of the file
outputFile = open('PSetPklAsPythonFile.py', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:

cmsRun PSetPklAsPythonFile.py

or

 
cmsRun -e PSet.py 2>err.txt 1>out.txt &

Transfers to T1 sites are taking longer than expected

Transfers can take a while, so this is somewhat normal. If it takes a very long time, one could ask in the PhEDEx ops HN forum if there is a problem. You can also ping the facility admins or open a GGUS ticket if the issue is backlogging the PromptReco processing at a given T1.

Diagnose bookkeeping problems

You can run the diagnose active runs script. It will show what is missing for the Tier 0 to process data for the given run. Post in the SMOps HN if there are missing logs or if the bookkeeping is inconsistent.

#Source environment 
source /data/tier0/admin/env.sh

# Run the diagnose script (change run number)
$manage execute-tier0 diagnoseActiveRuns 231087

Looking for jobs that were submitted in a given time frame

The best way is to look at the wmbs_job table while the workflow is still executing jobs. But if the workflow is already archived, no record about the job is kept in the T0AST. Anyway, there is a way to find out which jobs were submitted in a given time frame from the couch db:

Add this patch to the couch app (it actually adds a view); you may have to modify the path to patch according to the WMAgent/Tier0 tags you are using.

curl https://github.com/dmwm/WMCore/commit/8c5cca41a0ce5946d0a6fb9fb52ed62165594eb0.patch | patch -d /data/tier0/srv/wmagent/1.9.92/sw.pre.hufnagel/slc6_amd64_gcc481/cms/wmagent/1.0.7.pre6/data/ -p 2

Then init the couchapp; this will create the view. It may take some time if you have a big database to map.

$manage execute-agent wmagent-couchapp-init

Then curl the results for the given time frame (look for the timestamps you need, change user and password accordingly)

curl -g -X GET 'http://user:password@localhost:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_view/statusByTime?startkey=["executing",1432223400]&endkey=["executing",1432305900]'

Corrupted merged file

This includes files that are on tape and already registered in DBS/TMDB. The procedure to recover them is basically to rerun all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).

Print .pkl files, Change job.pkl

  • Print job.pkl or Report.pkl in a tier0 WMAgent vm:
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
jobHandle = open("Report.3.pkl", "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

  • In addition, to change the job.pkl
import cPickle, os
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
# Do the changes on the loadedJob
output = open('job.pkl', 'w')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()

  • Print PSet.pkl in a workernode:
Set up the same environment as for running a job interactively, go to the PSet.pkl location, open a python console and do:

import FWCore.ParameterSet.Config as cms
import pickle
handle = open('PSet.pkl', 'r')
process = pickle.load(handle)
handle.close()
print process.dumpConfig()

Change a variable in a running Sandbox configuration (wmworkload.pkl)

First of all you need to locate the working Sandbox, which can be found after logging into the appropriate machine and going to the following folder: /data/tier0/admin/Specs/[name of the process, run number and workflow name].

Make a copy of this Sandbox in a folder in your private area on an lxplus machine and keep a backup copy of the original Sandbox.

The sandbox file is a tar.bz compressed file, so it should be decompressed using the tar command like this: tar -xjvf [name of the compressed file]. You can create a separate folder for the resulting files and folders or not.

After it is decompressed a new folder appears, called WMSandbox, and within it there is a file called WMWorkload.pkl, which is a pickle file of all the specifications that run in the sandbox.

To modify WMWorkload.pkl a script was created named print_workload.py, available at /afs/cern.ch/work/c/cmst0/private/scripts/jobs/modifyConfigs/workflow, but it will be left in a public folder.

This program allows the wmworkload to be unpickled and a variable named HcalCalHO to be removed from the workload; this can be done because the skims parameters are passed as a list to the workload. If you want to modify another argument you will have to find it in the wmworkload.txt that comes out of the wmworkload.pkl. To obtain this readable file you can use the file called generate_code.sh located in the same public folder.

After this process a new WMWorkload.pkl is generated; you need to add the updated WMWorkload.pkl back to the original sandbox and compress it back to a tar file using the command tar -cvjf and the exact name of the original compressed file. You should also compress all the original files that appeared after the first decompression and remained unmodified (the folders PSetTweaks, Utils and the WMCore.zip archive).

Finally the new compressed archive should be copied back to its original location to replace the old one.
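
As a rough outline of the whole cycle described above (paths and the workflow name are illustrative; the helper scripts are the ones mentioned in this section):

# Sketch of the unpack / edit / repack cycle for a workflow sandbox
cp /data/tier0/admin/Specs/<workflow>/<workflow>-Sandbox.tar.bz2 ~/private/sandbox_work/
cd ~/private/sandbox_work/
cp <workflow>-Sandbox.tar.bz2 <workflow>-Sandbox.tar.bz2.orig   # keep a backup of the original

tar -xjvf <workflow>-Sandbox.tar.bz2      # unpacks WMSandbox/, PSetTweaks/, Utils/, WMCore.zip
# ... edit WMSandbox/WMWorkload.pkl here, e.g. with the print_workload.py helper ...

tar -cvjf <workflow>-Sandbox.tar.bz2 WMSandbox/ PSetTweaks/ Utils/ WMCore.zip
# finally copy the new tarball back over the original one in /data/tier0/admin/Specs/<workflow>/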

Modifying jobs to resume them with other features (like memory, disk, etc.)

Some scripts are already available to do this, provided with:

  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending on the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending on the case. Below we show some basic examples of how to do this:

Some cases have proven that you need to modify the Workflow Sandbox when you want to modify the following variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

We created some scripts to deal with the most usual issue - the maxRSS values get exceeded time after time, so you need to modify the workflow sandbox to change the maxRSS values.

  • On /data/tier0/tier0_monitoring/src/v3_modifyMaxRSS/ located in every node, there are scripts checkCurrentMaxRSS.sh and modifySandbox.sh which are usable in case you want to check/modify the maxRSS values.
  • You may see when you try it out that the modifySandbox.sh script does not override RSS limits for "Merge", "Cleanup" and "LogCollect" tasks in a workflow. This is desired behavior of WMCore (WMTask). In order to override the maxRSS for Merge tasks, one can bypass these limitations using the approach Alan Malta Rodriguez shared with T0 (see the draft). Of course, the script may need to be updated to make it usable for certain cases.

Modifying the job description has proven to be useful to change the following variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)

At /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs there are two directories named "job" and "workflow". You should enter the respective directory. Follow the instructions below on the agent machine in charge of the jobs to modify.

Modifying the Workflow Sandbox

Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/workflow

In a file named "list", list the jobs you need to modify. Follow the procedure separately for each type of job/task, given that the workflow configuration is different for different Streams.

Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:

feature=maxRSS
value=15360000

Executing generate_code.sh will create a script named after the feature, like modify_wmworkload_maxRSS.py. The latter will modify the selected feature in the Workflow Sandbox.

Once generated, you need to add a call to that script in modify_one_workflow.sh. The latter will call all the required scripts, create the tarball and place it where required (the Specs folder).

Finally, execute modify_several_workflows.sh which will call modify_one_workflow.sh for all the desired workflows.

The previous procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:

vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh

Modifying the Job Description

Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job

Very similar to the procedure to modify the Workflow Sandbox: add the jobs to modify to "list". You could and probably should read the job.pkl of one particular job and find the name of the feature to modify. After that, use modify_pset.py as a base to create another file which modifies the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the features mentioned at the beginning of the section.

vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh

Modifying a workflow sandbox

If you need to change a file in a workflow sandbox, i.e. in the WMCore zip, this is the procedure:

# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there and then untar the sandbox and unzip WMCore:

cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 
unzip -q WMCore.zip

Now replace/modify the files in WMCore. Then you have to put everything together again. You should remove the old sandbox and WMCore.zip too:

# Remove former sandbox and WMCore.zip, then create the new WMCore.zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip
zip -rq WMCore.zip WMCore

# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*

# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/

Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will pick it up. Also, it is good practice to save a copy of the original sandbox, just in case something goes wrong.

Repacking gets stuck but the bookkeeping is consistent

Description
  • P5 sent over data with lumi holes and consistent accounting.
  • T0 started the repacking.
  • P5 sent data for the previous lumis where the bookkeeping said there wasn't any data.
  • The T0 JobSplitter went into an inconsistent state because it runs in order, so when it found unrepacked data that was supposed to go before data that was already repacked, it got "confused".
  • T0 runs got stuck; some of them never start, some others never finish even though the bookkeeping is ok.
Example
  1. Bookkeeping shows that lumis 52,55,56,57 will be transferred.
  2. Lumis 52,55,56,57 are transferred.
  3. Lumis 52,55,56,57 are repacked.
  4. Lumis 41, 60,91,121,141,145 are transferred.
  5. JobSplitter gets confused because of lumi 41 (lumis 60 and above are all higher than the last repacked lumi, so no problem with them) -> JobSplitter gets stuck
Procedure to fix

Please note: these are not copy/paste instructions. This is more a description of the procedure that was followed in the past to deal with the problem, and it can be used as a guide.

  • Update the lumi_section_closed records to have filecount=0 for lumis without data and filecount=1 for lumis with data.
           # find lumis to update
           update lumi_section_closed set filecount = 0 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
           update lumi_section_closed set filecount = 1 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;      
  • Delete the problematic data (that is, the files that belong to the lumis that were not originally in the bookkeeping but were transferred).
           delete from wmbs_sub_files_available where fileid in ( ... );
           delete from wmbs_fileset_files where fileid in ( ... );
           delete from wmbs_file_location where fileid in ( ... );
           delete from wmbs_file_runlumi_map where fileid in ( ... );
           delete from streamer where id in ( ... );
           delete from wmbs_file_details where id in ( ... );     

Delete entries in the database when there are corrupted input files (Repack jobs)

SELECT WMBS_WORKFLOW.NAME AS NAME,
       WMBS_WORKFLOW.TASK AS TASK,
       LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION,
       LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID,
       LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID
FROM LUMI_SECTION_SPLIT_ACTIVE
INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID
INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;

# This will actually show the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query

SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;
# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)

DELETE FROM LUMI_SECTION_SPLIT_ACTIVE    WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;

Manually modify the First Conditions Safe Run (fcsr)

The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun

In the CMS_T0DATASVC_PROD database, check the reco_locked table: the first run with locked = 0 is the fcsr.

If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:

 update reco_locked set locked = 0 where run >= <desired_run> 
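
After the update, a quick sanity check is to confirm that the run you picked is indeed the lowest unlocked one, e.g.:

# The fcsr should be the lowest run with locked = 0 (CMS_T0DATASVC_PROD credentials)
sqlplus <instanceName>/<password>@<tns> <<'EOF'
SELECT MIN(run) FROM reco_locked WHERE locked = 0;
EOF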



Check a stream/dataset/run completion status (Tier0 Data Service (T0DATASVC) queries)

Some useful T0DATASVC queries to check a stream/dataset/run completion status:

https://cmsweb.cern.ch/t0wmadatasvc/prod/run_stream_done?run=305199&stream=ZeroBias
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET

run_dataset_done can be called without any primary_dataset parameter, in which case it reports back the overall PromptReco status. It aggregates over all known datasets for that run in the system (i.e. all datasets for all streams for which we have data for this run).
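
These endpoints can also be queried from the command line; a sketch with curl is below. Cmsweb may require certificate authentication, so the service cert/key paths listed elsewhere in this document are used here as an assumption:

# Query PromptReco completion for a run; drop --cert/--key if the endpoint is readable without auth
curl -s --cert /data/certs/servicecert-vocms001.pem --key /data/certs/servicekey-vocms001.pem 'https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462'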



WMAgent instructions

Restart component in case of deadlock

If a component crashes due to a deadlock, in most cases restarting it is enough to get the situation under control. In that case the procedure is:

  • Login in the Tier0 headnode (cmst1 user is required)
  • Source the environment
    source /data/tier0/admin/env.sh  
  • Execute the following command to restart the component, replacing <componentName> with the specific component name (DBS3Upload, PhEDExInjector, etc.)
    $manage execute-agent wmcoreD --restart --components=<componentName>
    Example
    $manage execute-agent wmcoreD --restart --components=DBS3Upload

How do I restart the Tier 0 WMAgent?

  • Restart the whole agent:
cd /data/tier0/
./00_stop_agent.sh
./00_start_agent.sh

  • Restart a single component (Replace ComponentName):
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --component ComponentName

Updating workflow from completed to normal-archived in WMStats

  • To move workflows in completed state to archived state in WMStats, the following code should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 

  • The script should be copied to the bin folder of the WMAgent code. For instance, in replay instances:
     /data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/ 

  • The script should be modified, assigning a run number in the following statement:
     if info['Run'] < <RunNumber>
    As you may notice, the given run number will be the oldest run shown in WMStats.

  • After it, the code can be executed with:
     $manage execute-agent updateT0RequestStatus.py 

How do I set job thresholds in the WMAgent?

  • AgentStatusWatcher and SSB:
Site thresholds are automatically updated by a WMAgent component: AgentStatusWatcher. This component takes information about site status and resources (CPU Bound and IO Bound) from the SiteStatusBoard Pledges view. There are some configurations in the WMAgent config that can be tuned; please have a look at the documentation.

  • Add sites to resource control / manually update the thresholds:
This is not worth doing unless AgentStatusWatcher is shut down. Some useful commands are:

#Source environment 
source /data/tier0/admin/env.sh

# Add a site to Resource Control - Change site, thresholds and plugin if needed
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --cms-name=T2_CH_CERN_T0 --se-name=srm-eoscms.cern.ch --ce-name=T2_CH_CERN_T0 --pending-slots=1000 --running-slots=1000 --plugin=CondorPlugin

# Change/init thresholds by task:
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --task-type=Processing --pending-slots=500 --running-slots=500

# Change site status (normal, drain, down)
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --down

Unregistering an agent from WMStats

First thing to know: an agent has to be stopped before you unregister it. Otherwise, AgentStatusWatcher will just keep pushing a new doc to WMStats.

  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
  • You will be prompted for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.

Modify the thresholds in the resource control of the Agent

  • Login into the desired agent and become cmst1
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 

  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 

  • To see the current thresholds and usage:
     $manage execute-agent wmagent-resource-control -p 

Checking transfer status at agent shutdown

Before shutting down an agent, you should check whether all subscriptions and transfers were performed. This is equivalent to checking the deletion of all blocks in T0_CH_CERN_Disk and resolving any related issue. Issues can be open blocks, DDM conflicts, etc.

Check which blocks have not been deleted yet.

select blockname
from dbsbuffer_block
where deleted = 0

Some datasets could be marked as subscribed in the database but not actually be subscribed in PhEDEx. You can check this with the Transfer Team and, if that is the case, retry the subscription by setting subscribed to 0. You can narrow the query to blocks with a given name pattern or blocks at a specific site.

update dbsbuffer_dataset_subscription
set subscribed = 0
where dataset_id in (
  select dataset_id
  from dbsbuffer_block
  where deleted = 0
  <and blockname like...>
)
<and site like ...>

Some blocks can be marked as closed but still be open in PhEDEx. If this is the case, you can set the status to "InDBS" to try closing them again. For example, if you want to close MiniAOD blocks, you can provide a name pattern like '%/MINIAOD#%'.

The attribute status can have 3 values: 'Open', 'InDBS' and 'Closed'. 'Open' is the first value assigned to all blocks; when they are closed and injected into DBS, the status is changed to 'InDBS', and when they are closed in PhEDEx, the status is changed to 'Closed'. Setting the status to 'InDBS' makes the agent retry closing the blocks in PhEDEx.

update dbsbuffer_block
set status = 'InDBS'
where deleted = 0
and status = 'Closed'
and blockname like ... 

If some subscriptions shouldn't be checked anymore, remove them from the database. For instance, if you want to remove RAW subscriptions to disk for all T1s, you can give a path pattern like '/%/%/RAW' and a site like 'T1_%_Disk'.

delete dbsbuffer_dataset_subscription
where dataset_id in (
  select id
  from dbsbuffer_dataset
  where path like ...
)
and site like ...



Condor instructions

Useful queries

To get condor attributes:

condor_q 52982.15 -l | less -i

To get a condor listing by regexp:

condor_q -const 'regexp("30199",WMAgent_RequestName)' -af

Changing priority of jobs that are in the condor queue

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to adapt it to your particular needs (changing the new priority, filtering the jobs to be modified, etc.)
    for job in $(condor_q -w | awk '{print $1}')
         do
               condor_qedit $job JobPrio "508200001"
         done  

Updating the wall time the jobs are using in the condor ClassAd

This time can be modified using the following command. Remember that it should be executed as the owner of the jobs.

condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440

Check the number of jobs and CPUs in condor

The following commands can be executed from any VM where a Tier0 schedd is present (recheck that the list of VMs corresponds to the current list of Tier0 schedds; the Tier0 production Central Manager is hosted on vocms007).

  • Get the number of tier0 jobs sorted by a number of CPUs they are using:

condor_status -pool vocms007  -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || 
ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")'  -af Cpus | sort | uniq -c

  • Get the number of total CPUs used for Tier0 jobs on Tier0 Pool:

condor_status -pool vocms007  -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || 
ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

  • The total number of CPUs used by NOT Tier0 jobs on Tier0 Pool:

condor_status -pool vocms007  -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && 
ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" &&
 ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

Overriding the limit of Maximum Running jobs by the Condor Schedd

  • Login as root in the Schedd machine
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 

Changing highIO flag of jobs that are in the condor queue

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to adapt it to your particular needs (filtering the jobs to be modified, etc.)
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
             do
                 condor_qedit $job Requestioslots "0"
             done 



GRID certificates

Changing the certificate mapping to access eos

  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • The current VOC, Daniel Valbuena, has the script writing the gridmap file versioned here (see the Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 

Changing Tier0 certificates

  • Check that using the new certificates guarantees privileges to all the needed resources:

Voboxes

  • Copy the servicecert*.pem, servicekey*.pem and serviceproxy*.pem files to
/data/certs 
  • Update the following files to point to the new certificates
admin/env.sh
admin/env_unit.sh

Kibana

  • Change the certificates in the monitoring scripts where they are used. To see where the certificates are being used and the current monitoring head node, please check the Tier0 Monitoring Twiki.

TransferSystem

TransferSystem is not used anymore

  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart component.
/data/TransferSystem/t0_control.sh



OracleDB (T0AST) instructions

Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

All the T0 WMAgent instances have access to the Cmsweb Tier0 Data Service instances, so when changing the passwords it is necessary to be aware of which instances are running.

Instances currently in use (03/03/2015):

Instance Name TNS
CMS_T0DATASVC_REPLAY_1 INT2R
CMS_T0DATASVC_REPLAY_2 INT2R
CMS_T0DATASVC_PROD CMSR

  1. Review running instances.
  2. Stop each of them using:
     /data/tier0/00_stop_agent.sh 
  3. Verify that everything is stopped using:
     ps aux | egrep 'couch|wmcore' 
  4. Make sure you have the new password ready (generating it or getting it in a safe way from the person who is creating it).
  5. From lxplus or any of the T0 machines, log in to each instance whose password you want to change using:
     sqlplus <instanceName>/<password>@<tns> 
    Replacing the brackets with the proper values for each instance.
  6. In sqlplus run the command password; you will be prompted to enter the Old password and the New password, and to retype the latter. Then you can exit from sqlplus.
          SQL> password
          Changing password for <user>
          Old password: 
          New password: 
          Retype new password: 
          Password changed
          SQL> exit
          
  7. Then, you should retry logging in to the same instance; if you can not, you are in trouble!
  8. Communicate the password to the CMSWEB contact in a safe way. After their confirmation you can continue with the following steps.
  9. If everything went well you can now access all the instances with the new passwords. Now it is necessary to update the secrets files on all the machines. These files are located in:
          /data/tier0/admin/
          
    And they are normally named as follows (not all the instances will have all the files):
          WMAgent.secrets
          WMAgent.secrets.replay
          WMAgent.secrets.prod
          WMAgent.secrets.localcouch
          WMAgent.secrets.remotecouch
          
  10. If there was an instance running you may also change the password in:
         /data/tier0/srv/wmagent/current/config/tier0/config.py
         
    There you must look for the entry:
          config.T0DataSvcDatabase.connectUrl
         
    and do the update.
  11. You can now restart the instances that were running before the change. Be careful: some components may fail when you start the instance, so you should be clear about the trade-off of starting it.

Backup T0AST (Database)

If you want to do a backup of a database (for example, after retiring a production node, you want to keep the information of the old T0AST) you should:

  • Request a target database: normally these databases are owned by dirk.hufnagel@cern.ch, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket requesting the backup. For this you should send an email to phydb.support@cern.ch. An example of a message can be found in this Elog.
  • When the backup is done you will get a reply to your ticket confirming it.

Checking what is locking a database / Cern Session Manager

  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.



T0 nodes, headnodes

Restarting Tier-0 voboxes

Node       Use                                        Type
vocms001   Replays (normally used by the developer);  Virtual machine
           Transfer System
vocms015   Replays (normally used by the operators);  Virtual machine
           Tier0 Monitoring
vocms047   Replays (normally used by the operators)   Virtual machine
vocms0313  Production node                            Physical machine
vocms0314  Production node                            Physical machine

To restart one of these nodes you need to check the following:
  • Production node:
    • The agent is not running and the couch processes were stopped correctly.
    • These nodes use a RAMDISK. Mounting it is puppetized, so you need to make sure that puppet has run before starting the agent again.
  • TransferSystem:
  • Replays:
    • The agent should not be running, check the Tier0 Elog to make sure you are not interfering with a particular test.
  • Tier0 Monitoring:
    • The monitoring is executed via a cronjob. The only consequence of the restart should be that no reports are produced during the downtime. However, you can check that everything is working by going to:
      /data/tier0/sls/scripts/Logs

To restart a machine you need to:

  • Login and become root
  • Do the particular checks (listed above) based on the machine that you are restarting
  • Run the restart command
    shutdown -r now

After restarting the machine, it is convenient to run puppet. You can either wait for the periodical execution or execute it manually:

  • puppet agent -tv

Commissioning of a new node

*INCOMPLETE INSTRUCTIONS: WORK IN PROGRESS 2017/03*

Folder's structure and permissions

  • These folders should be placed at /data/:
#  Permissions        Owner    Group  Folder Name
1. (775) drwxrwxr-x.  root     zh     admin
2. (775) drwxrwxr-x.  root     zh     certs
3. (755) drwxr-xr-x.  cmsprod  zh     cmsprod
4. (700) drwx------.  root     root   lost+found
5. (775) drwxrwxr-x.  root     zh     srv
6. (755) drwxr-xr-x.  cmst1    zh     tier0
TIPS:
  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 

2. certs

  • Certificates are placed on this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem

NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section
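
The renewal itself is not documented here; purely as an illustration (the actual script, schedule and options on the voboxes may differ), a crontab entry could look like:

# Illustrative only: renew the 8-day production proxy twice a day from the service cert/key
0 4,16 * * * voms-proxy-init -voms cms:/cms/Role=production -cert /data/certs/servicecert-vocms001.pem -key /data/certs/servicekey-vocms001.pem -out /data/certs/serviceproxy-vocms001.pem -valid 192:00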

5. srv

  • There you will find the glidecondor folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.

6. tier0

  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.

File Description
00_deploy.prod.sh Script to deploy the WMAgent for production(*)
00_deploy.replay.sh Script to deploy the WMAgent for a replay(*)
00_fix_t0_status.sh
00_patches.sh
00_readme.txt Some documentation about the scripts
00_software.sh (**) Gets the source code to use from Github for WMCore and the Tier0. Applies the described patches if any.
00_start_agent.sh Starts the agent after it is deployed.
00_start_services.sh Used during the deployment to start services such as CouchDB
00_stop_agent.sh Stops the components of the agent. It doesn't delete any information from the file system or the T0AST, just kills the processes of the services and the WMAgent components
00_wipe_t0ast.sh Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!
(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying.
(**) This script is not static. It might change when new patches are required and when the release versions of WMCore and the Tier0 change. Check it before deploying.

Folder Description

Cronjobs

Running a replay on a headnode

  • To run a replay in a instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
      • [20171012] There is a replay WMAgent.secrets file example on vocms0313.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the jira history.
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
                    echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
        
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin

Again, keep in mind that the 00_deploy.sh script wipes the T0AST database (the production instance in this case), so proceed carefully.
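After running the replay 00_deploy.sh on a production node, it is worth double-checking that the generated agent config really points to the replay settings before starting the agent (a minimal sketch; the exact path depends on the deployed WMAgent version):

    # All three replay-specific settings should show up in the generated agent config
    grep -nE "ReplayOfflineConfiguration|tier0replay|archiveDelayHours" /data/tier0/srv/wmagent/current/config/tier0/config.py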

Changing Tier0 Headnode

# | Instruction | Responsible Role
0. | If there are any exceptions when logging into a candidate headnode, restart it first. | Tier0
0. | Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section. | Tier0
1. | Deploy the new prod instance in the new vocmsXXX node. Obviously, you should use the production version of the 00_deploy.sh script. | Tier0
1.5. | Check the ProdOfflineConfiguration that is being used. | Tier0
2. | Start the Tier0 instance in vocmsXXX. | Tier0
3. | THIS IS OUTDATED ALREADY I THINK: Coordinate with the Storage Manager so we have a stop in data transfers, respecting run boundaries. (Before this, we need to check that all the runs currently in the Tier0 are OK with bookkeeping. This means no runs in Active status.) | SMOps
4. | THIS IS OUTDATED ALREADY I THINK: Check that all transfers are stopped. | Tier0
4.1. | THIS IS OUTDATED ALREADY I THINK: Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. | THIS IS OUTDATED ALREADY I THINK: Check /data/Logs/General.log
5. | THIS IS OUTDATED ALREADY I THINK: Change the config file of the transfer system to point to T0AST1. This means going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and changing the following settings to match the new head node T0AST: | Tier0
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
6. | THIS IS OUTDATED ALREADY I THINK: Make a backup of the General.log.* files. (This backup is only needed if using t0_control restart in the next step; if using t0_control stop + t0_control start, the logs won't be affected.) | Tier0
7. | THIS IS OUTDATED ALREADY I THINK: Restart the transfer system using either A) t0_control restart (will erase the logs) or B) t0_control stop followed by t0_control start (will keep the logs). | Tier0
8. | THIS IS OUTDATED ALREADY I THINK: Kill the replay processes (if any). | Tier0
9. | THIS IS OUTDATED ALREADY I THINK: Start notification logs to the SM in vocmsXXX. | Tier0
10. | Change the configuration for Kibana monitoring to point to the proper T0AST instance. | Tier0
11. | THIS IS OUTDATED ALREADY I THINK: Restart transfers. | SMOps
12. | RECHECK THE LIST OF CRONTAB JOBS: Point the acron jobs run as cmst1 on lxplus to the new headnode; these are the checkActiveRuns and checkPendingTransactions scripts (see the sketch below). | Tier0
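For step 12, a quick way to see which acron entries still point to the old headnode (a sketch, assuming the jobs are registered in the cmst1 acrontab on lxplus):

    # Run as cmst1 on lxplus: list the acron entries for the Tier0 check scripts and verify the target host
    acrontab -l | grep -E "checkActiveRuns|checkPendingTransactions"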

Restarting head node machine

  1. Stop Tier0 agent
    00_stop_agent.sh
  2. Stop condor
    service condor stop 
    If you want the spool data to still be available after the restart, copy the spool directory to disk:
    cp -r /mnt/ramdisk/spool /data/
  3. Restart the machine (or request its restart)
  4. Mount the RAM Disk (Condor spool won't work otherwise).
  5. If necessary, copy back the data to the spool.
  6. When restarted, start the sminject component
    t0_control start 
  7. Start the agent
    00_start_agent
    In particular, check the PhEDExInjector component; if you see errors there, try restarting it after sourcing init.sh:
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
    $manage execute-agent wmcoreD --restart --component PhEDExInjector
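After the machine is back, and before starting condor and the agent again, it is worth verifying that the ramdisk is mounted and, if needed, restoring the spool copy (a minimal sketch using the paths quoted above):

    # The condor spool won't work if the ramdisk is not mounted
    mount | grep ramdisk
    # Restore the spool content saved before the reboot (if it was copied to /data/spool)
    cp -r /data/spool /mnt/ramdisk/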

Configuring a newly created VM to be used as a T0 headnode/replay VM

This was started on 30/01/2018. To be continued.

  1. Whenever a new VM is created for T0, it is missing the mesa-libGLU package and, therefore, the deployment script will not work:
       Some required packages are missing:
       + for p in '$missingSeeds'
       + echo mesa-libGLU
       mesa-libGLU
       + exit 1
       
    One needs to install the package manually (with a superuser access):
    $ sudo yum install mesa-libGLU
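
    To confirm the package is actually in place before re-running the deployment script (a trivial check):
    # Should print the installed mesa-libGLU version; an error means the installation did not succeed
    $ rpm -q mesa-libGLU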



T0 Pool instructions

Disabling flocking to Tier0 Pool

If it is necessary to prevent new Central Production jobs from being executed in the Tier0 pool, flocking has to be disabled.

NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT THE SI OPERATORS FIRST, ASKING THEM TO MAKE THE CHANGES (as of September 2017, the operators are Diego at CERN <diego.davila@cern.ch> and Krista at FNAL <klarson1@fnal.gov>). The formal way to request changes is the GlideInWMS elog (just post a request there): https://cms-logbook.cern.ch/elog/GlideInWMS/

Only in case of an emergency outside of working hours should you consider executing the procedure below on your own. Posting an elog entry in that case is even more important, as the SI team needs to be aware of such meaningful changes.

GENERAL INFO:

  • Be aware of the difference between site whitelisting and enabling/disabling flocking. When flocking jobs with different core counts, defragmentation may have to be re-tuned.
  • The same applies when the core count is smaller than the defragmentation policy objective. E.g., the current defragmentation policy is focused on defragmenting slots with less than 4 cores. Having flocking enabled with only single-core or 2-core jobs in the mix will trigger unnecessary defragmentation. This is not a common case, but if the policy were focused on 8 cores and, for some reason, 4-core jobs were injected while flocking is enabled, the same would happen.

As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. Follow these steps:

NOTE2: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized, which means any change will be overridden when puppet is executed. There is one non-puppetized file, called 99_local_tweaks.config, which is the one to be used for the changes we desire.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
     -rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
     -rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
     -rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
     -rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
     -rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
     -rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
     -rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
     -rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
     -rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
     -rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config

Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.

  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
    # To enable, set this to True (defragmentation is auto enabled)
    # To disable, set this to False (defragmentation is auto disabled)
    ENABLE_PROD_FLOCKING = True
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 

  • The negotiator has a 12h cache, so the schedds don't need to re-authenticate during this period of time. Because of this, it is required to restart the negotiator.

  • Now you can check the schedds whitelisted to run in the Tier0 pool; the Central Production schedds should not appear there:
     condor_config_val -master gsi_daemon_name  

  • Now you need to restart the condor negotiator to make sure that the changes are applied right away:
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 

  • After killing the process, it should reappear after a couple of minutes.

  • It is done!

Remember that this change won't remove/evict the jobs that are already running, but it will prevent new jobs from being sent.
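A quick way to confirm that the collector picked up the change after the condor_reconfig and the negotiator restart above (a minimal sketch, run on vocms007; the knob name is the one defined in 99_local_tweaks.config):

    # Should print False after disabling flocking (True after re-enabling it)
    condor_config_val ENABLE_PROD_FLOCKING
    # Check that the negotiator came back after being killed
    ps aux | grep "[c]ondor_negotiator"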

Enabling pre-emption in the Tier0 pool

BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team. This literally kills all the Central Production jobs which are in Tier0 Pool (including ones which are being executed at that moment).

NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT THE SI OPERATORS FIRST, ASKING THEM TO MAKE THE CHANGES (as of September 2017, the operators are Diego at CERN <diego.davila@cern.ch> and Krista at FNAL <klarson1@fnal.gov>). The formal way to request changes is the GlideInWMS elog (just post a request there): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of an emergency outside of working hours should you consider executing the procedure below on your own; posting an elog entry in that case is even more important, as the SI team needs to be aware of such meaningful changes.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
  • Open 99_local_tweaks.config
  • Locate this section:
     # How to drain the slots
        # graceful: let the jobs finish, accept no more jobs
        # quick: allow job to checkpoint (if supported) and evict it
        # fast: hard kill the jobs
       DEFRAG_SCHEDULE = graceful 
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes; after this, the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change:
     DEFRAG_SCHEDULE = graceful 
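
Since forgetting to revert this change keeps killing Tier0 jobs as well, it may help to double-check the active value before and after the ~5 minute window (a minimal sketch, run on the collector):

    # Check which drain policy is currently active
    condor_config_val DEFRAG_SCHEDULE
    # After reverting to graceful in 99_local_tweaks.config, apply and re-check
    condor_reconfig
    condor_config_val DEFRAG_SCHEDULE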

Changing the status of _CH_CERN sites in SSB

To change T2_CH_CERN and T2_CH_CERN_HLT

*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*

  • Open a GGUS Ticket to the site before proceeding, asking them to change the status themselves.
  • If there is no response after 1 hour, reply to the same ticket reporting you are changing it and proceed with the steps in the next section.

To change T0_CH_CERN

  • You should go to the Prodstatus Metric Manual Override site.
  • There, you will be able to change the status of T0_CH_CERN/T2_CH_CERN/T2_CH_CERN_HLT. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of this reason can be checked here). Then click "Apply" and the procedure is complete. The users in the cms-tier0-operations e-group are able to make this change.
  • The status on the SSB site is updated every 15 minutes, so you should see the change there after at most this amount of time.
  • More extensive documentation can be checked here.



Other (did not fit into the categories above/outdated/in progress)

Updating TransferSystem for StorageManager change of alias (probably outdated)

Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:

  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (stop + start should preserve the logs; if it doesn't work, use ./t0_control restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
              
  5. Restart the TransferSystem on the T0 headnode. Do the following (stop + start should preserve the logs; if it doesn't work, use ./t0_control restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
              

Getting Job Statistics (needs to be reviewed)

This is the base script to compile the information of jobs that are already done:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py

For the analysis we need to define certain things:

  • main_dir: Folder where the input log archives are, e.g. '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/P', in main()
  • temp: Folder where output json files are going to be generated, in main().
  • runList: Runs to be analyzed, in main()
  • Job type in two places:
    • getStats()
    • stats[dataset] in main()

The script is run without any parameter. This generates a json file with information about cpu, memory, storage and start and stop times. Task is also included. An example of output file is:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json

With a separate script in R, I was reading and summarizing the data:

/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R

There, the task type should be defined, and also the output file. With this script I was just summarizing CPU data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you like to do it :P

That script calculates efficiency of each job:

TotalLoopCPU / (TotalJobTime * numberOfCores)

and an averaged efficiency per dataset:

sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores) 

numberOfCores was obtained from job.pkl, TotalLoopCPU and TotalJobTime were obtained from report.pkl

Job type can be Processing, Merge or Harvesting. For the Processing type, the task can be Reco or AlcaSkim; for the Merge type, it can be ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.
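As a worked example of the formulas above, with purely made-up numbers (an 8-core job, 1 hour of wall clock, roughly 6.9 hours of CPU):

    # Hypothetical values, only to illustrate TotalLoopCPU / (TotalJobTime * numberOfCores)
    TotalLoopCPU=25000    # CPU seconds summed over all cores (from report.pkl)
    TotalJobTime=3600     # wall-clock seconds (from report.pkl)
    numberOfCores=8       # from job.pkl
    awk -v cpu=$TotalLoopCPU -v wall=$TotalJobTime -v cores=$numberOfCores \
        'BEGIN { printf "efficiency = %.2f\n", cpu / (wall * cores) }'
    # -> efficiency = 0.87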

Report an incident or create a request to CERN IT via SNOW tickets

Some kinds of incidents and requests have to be handled directly by CERN IT, such as the ones related to the EOS and CASTOR storage systems. For example, if a file stored in EOS is not accessible, that should be reported to the EOS team via a SNOW ticket (take this ticket as an example).

To open a SNOW ticket follow these steps:

  • Enter to SNOW website.
  • Enter to the link Submit a request or Report an incident depending on the situation.
  • Set a meaningful title in the Short description field. Then describe the issue in the Description and symptoms field.
  • In some cases, such as opening a ticket to the EOS and/or CASTOR teams, it's useful to select a support team for the ticket. To do this, look for the team by typing in the field Optionally, select the support team (Functional Element) that corresponds to your problem. Once you start typing in this field, the available teams will be displayed.
  • When you create a ticket, ALWAYS add the Tier0 team email list (cms-tier0-operations@cern.ch) to the Watch list.
  • Use the default visibility for the ticket. It may automatically change depending on the support team.
  • Finally, check all the information and submit the ticket. You and everyone in the Watch list will be notified about any updates via email, and a ticket id will be assigned. These tickets are closed by the teams assigned to them; Tier0 can only cancel a ticket if needed (which is not usual).

Update code in the dmwm/T0 repository

This guide contains all the necessary steps; read it first. Execute these commands locally, in the area where you already have a clone of the repository.

  • Get the latest code from the repository
git checkout master
git fetch dmwm
git pull dmwm master
git push origin master
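
These commands assume the local clone has two remotes configured: origin pointing to your personal fork and dmwm pointing to the upstream repository. If the dmwm remote is not set up yet, add it first (a sketch):

    # Add the upstream repository as the "dmwm" remote and verify the configured remotes
    git remote add dmwm https://github.com/dmwm/T0.git
    git remote -v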

  • Create a branch to add the code changes. Use a meaningful name.
git checkout -b <branch-name> dmwm/master

  • Make the changes in the code

  • Add the modified files to the changes to be committed
git add <file-name>

  • Make commit of the changes
git commit

  • Push the changes from your local repository to the remote repository
git push origin <branch-name>

  • Make a pull request from the GitHub web page

NOTE: If you want to modify your last commit before it is merged into the dmwm code (even if you already made the pull request), use these steps:

  • Make the required modifications in the branch
  • Fix the previous commit
git commit --amend
  • Force update
git push -f origin <branch-name>
  • If a pull request was done before, it will update automatically.

After the branch is merged, it can be safely deleted:

git branch -d <branch-name>

Other useful commands

  • Show the branch you are working on and the status of the changes. Useful before committing or while working on a branch.
git branch
git status
  • Others
git reset
git diff
git log
git checkout . 

EOS Areas of interest

The Tier-0 WMAgent uses four areas on eos.

Path | Use | Who writes | Who reads | Who cleans
/eos/cms/store/t0streamer/ | Input streamer files transferred from P5 | Storage Manager | Tier-0 worker nodes | Tier-0 t0streamer area cleanup script
/eos/cms/store/unmerged/ | Store output files smaller than 2GB until the merge jobs put them together | Tier-0 worker nodes (Processing/Repack jobs) | Tier-0 worker nodes (Merge jobs) | ?
/eos/cms/tier0/ | Files ready to be transferred to Tape and Disk | Tier-0 worker nodes (Processing/Repack/Merge jobs) | PhEDEx Agent | Tier-0 WMAgent creates and auto-approves transfer/deletion requests; PhEDEx executes them
/eos/cms/store/express/ | Output from Express processing | Tier-0 worker nodes | Users | Tier-0 express area cleanup script

/eos/cms/store/t0streamer/
The SM writes raw streamer files there, and we delete them with the cleanup script, which runs as an acron job under the cmst0 account. The script keeps data that has not been repacked yet, and it also keeps data that is less than 7 days old. During repacking, the .dat streamer files are rewritten into PDs (raw .root files).
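
To take a quick look at what is currently sitting in the t0streamer area, the eos client can be used (a minimal sketch; run from lxplus or any node with the eos client available, pointing it at the cms instance):

    # List the top-level content of the streamer area
    eos root://eoscms.cern.ch ls -l /eos/cms/store/t0streamer/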

/eos/cms/store/unmerged/
Files that need to be merged into larger files go there (not all files do). The jobs manage the area themselves: after merging, the merge job deletes the unmerged files.

/eos/cms/store/express/
Express output after being merged. Jobs from the Tier0 write to it. Data deletions are managed by DDM.

Checking subscription requests on PhEDEx

It's possible to check data subscriptions in PhEDEx using the PhEDEx website or PhEDEx web services.

To check a subscription using the website, go to Data/Subscriptions. Once the page is loaded, select the tab Select Data and then use the form to set query restrictions. The usual restrictions are Nodes and Data Items. Notice that the system will only retrieve current subscriptions, not requests that are waiting to be approved.

If you don't want to check current subscriptions but requests, go to Requests/View/Manage Requests. In this form you can filter the requests by type (transfer or deletion) and by state (pending approval, approved or disapproved). If you need more options to filter the requests, you have to use the PhEDEx web services.

The available PhEDEx services are published on the documentation website. To query requests, use the Request list service. For example, to list the requests involving the dataset /*/Tier0_REPLAY_vocms015*/* on the node T2_CH_CERN, do:

https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN

The request above will return a JSON resultset as follows:

{
    "phedex": {
        "call_time": 9.78992,
        "instance": "prod",
        "request": [
            {
                "approval": "approved",
                "id": 1339469,
                "node": [
                    {
                        "decided_by": "Daniel Valbuena Sosa",
                        "decision": "approved",
                        "id": 1561,
                        "name": "T2_CH_CERN",
                        "se": "srm-eoscms.cern.ch",
                        "time_decided": 1526987916
                    }
                ],
                "requested_by": "Vytautas Jankauskas",
                "time_create": 1526644111.24301,
                "type": "delete"
            },
            {
            ...
            }
        ],
        "request_call": "requestlist",
        "request_date": "2018-07-17 20:53:23 UTC",
        "request_timestamp": 1531860803.37134,
        "request_url": "http://cmsweb.cern.ch:7001/phedex/datasvc/json/prod/requestlist",
        "request_version": "2.4.0pre1"
    }
}

The PhEDEx web services not only allow you to build more detailed queries, they are also faster than querying the information on the PhEDEx website.
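
The same request can also be fetched from the command line and pretty-printed, which is often handier than the browser (a minimal sketch, assuming curl and a Python interpreter are available on the node):

    # Query the PhEDEx data service for requests on the replay datasets at T2_CH_CERN
    curl -s "https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN" | python -m json.tool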

Re-processing from scratch (in case of unexpected data deletion, corruption, other disaster etc.)

Something unexpected may happen and we may lose RAW data (deletion, corruption). If this happens, you need to check whether the streamer files for the lost data are still available in the t0streamer area.

  • More detailed recovery plan was documented on JIRA.
  • An easy way to check the last cleaned up runs is checking the cleanup script output at
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log
    .

If related run streamer files are still available, then you want to fully re-process that data:

  • Prepare and set up a new headnode (specify runs to inject, acquisition era, processing versions etc.).
  • If not all previously processed files (RAW, PromptReco output) were deleted, then you need to retrieve a list of such files, because they will need to be invalidated later. You can collect the list of files by querying DBS:
# Firstly 
# source /data/tier0/admin/env.sh
# source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
# then you can use this simple script to retrieve a list of files from related runs.
# Keep in mind, that in the below snippet, we are ignoring all Express streams output.
# This is just a snippet, so it may not work out of box.

from dbs.apis.dbsClient import DbsApi
from pprint import pprint
import os


dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'

dbsApi = DbsApi(url = dbsUrl)


runList = [316569]

with open(os.path.join('/data/tier0/srv/wmagent/current/tmpRecovery/', "testRuns.txt"), 'a') as the_file:
    for a in runList:
        datasets = dbsApi.listDatasets(run_num=a)
        pprint(datasets)
        for singleDataset in datasets:
            pdName = singleDataset['dataset']
            if 'Express' not in pdName and 'HLTMonitor' not in pdName and 'Calibration' not in pdName and 'ALCALUMIPIXELSEXPRESS' not in pdName:
                datasetFiles = dbsApi.listFileArray(run_num=a, dataset=pdName)
                #print("For run %d the dataset %s" % (a, pdName))
                for singleFile in datasetFiles:
                    print(singleFile['logical_file_name'])
                    the_file.write(singleFile['logical_file_name']+"\n")

  • Once the list of files to be invalidated later is ready, you can configure a headnode for re-processing.
  • In the Prod configuration, you should inject every affected run individually (See the GH).
  • Since we want to skip all the Express processing, we have to skip the injection of all Express streams (only these streams existed at the moment of writing; re-check the config before doing this). For this, you need to modify the filters for new stream injection in GetNewData.py, simply adding filters to skip the Express streams:
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ALCALUMIPIXELSEXPRESS'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Express'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'HLTMonitor'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Calibration'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressAlignment'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressCosmics'

  • At this point, it is safe to start the WMAgent on the configured node. Keep in mind that the Tier0Feeder component took 3h+ to inject ~150 runs.
  • When all the runs are processed and cleaned up, it is time to invalidate the old files. Again, you need to be extra careful with RAW data. Invalidation and deletion should be done by the Transfers team; the usual way is to create a GGUS ticket with the request.

