Difference: CompOpsTier0TeamCookbook (1 vs. 91)

Revision 91 - 2019-10-02 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1732 to 1732
 
  • Once you have a tag on dmwm T0 repo, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
  • Finally, you need to ask the WMCore devs to force-merge this into the master branch (for the time being, it's comp_gcc630) of the cmsdist repo, because Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, be patient). The new release should be available on the official CMS comp CC7 repository.
Changed:
<
<
  • Whenever the new T0 release is available, make sure to update T0 release twiki adding the release notes.
>
>
  • Whenever the new T0 release is available, make sure to update T0 release twiki adding the release notes.
 
  • Also, whenever any patches for the release are created, make sure to add them to the release notes as well.

Voila! The new release is ready to use.

Revision 90 - 2019-08-21 - AndresFelipeQuinteroParra

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 909 to 909
 
</>
<!--/twistyPlugin-->


Added:
>
>

How to put streamer files to tape and get streamer files from tape

<!--/twistyPlugin twikiMakeVisibleInline-->

Whenever a streamer file needs to be put on tape or retrieved, the following application should be used: Gitlab. It is a tool to transfer files between two filesystems, EOS and CASTOR, using FTS3; the FTS3 command line tools should be installed on the machine, and some dependencies need to be installed in a virtual environment. The source and destination paths should then be determined in order to decide the type of process (retrieval or put). There is also a Runs JSON file that is used to submit transfer jobs to FTS3: it is a list of runs, each entry containing a job type (archive or retrieve) and a run id. After everything is set properly in the Runs JSON file and the paths, run.sh can be executed manually or via a cronjob.
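A minimal sketch of how this is typically driven is shown below; the JSON field names and the run.sh location are assumptions, so check the Gitlab repository for the exact schema and paths.

# Hypothetical Runs JSON (field names are an assumption - verify against the Gitlab tool):
cat > runs.json << 'EOF'
[
  { "job_type": "archive",  "run": 327237 },
  { "job_type": "retrieve", "run": 325310 }
]
EOF

# Once the source/destination paths are configured, trigger the transfers manually
# (the same command can also be scheduled via a cronjob):
./run.sh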


<!--/twistyPlugin-->


 

WMAgent instructions

Restart component in case of deadlock

Revision 89 - 2019-06-12 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1631 to 1631
 Firstly, please read the general documentation about Software packaging and distribution in CMS.

1. Before building a new Tier-0 release you need:

Changed:
<
<
  • to get access to vocms055 (a dedicated CC7 CMS RPM build VM). Ask CMSWEB operator for that.
  • to be added to cms-comp-devs e-group (in order to be able to push to your personal cmsrep). Ask CMSWEB operator for that.
>
>
  • to get access to vocms055 (a dedicated CC7 CMS RPM build VM). Ask the CMSWEB operator for that.
  • to be added to cms-comp-devs e-group (in order to be able to push to your personal cmsrep). Ask the CMSWEB operator for that.
  2. Once you can access vocms055, you need:
  • to create your personal directory at /build/[your-CERN-username].
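The checkout of the build tooling is trimmed in this diff view; as a rough sketch (the repository URLs and the pkgtools checkout are assumptions, while comp_gcc630 is the cmsdist branch mentioned later in these instructions):

mkdir -p /build/$USER && cd /build/$USER
# clone the packaging tools and the spec repository (branch names are assumptions):
git clone https://github.com/cms-sw/pkgtools.git
git clone https://github.com/cms-sw/cmsdist.git -b comp_gcc630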
Line: 1654 to 1654
  3. Now you should have the build environment properly configured. In order to build a new release:
Changed:
<
<
  • When there is a new GH release available, you need to specify it in t0.spec file. It is located at your cloned cmsdist repository directory.
  • Not getting into details, the release in the spec file is specified on the very first line:
    ### RPM cms t0 2.1.4
    Normally, you only need to increment it to the tag version created on GH before.
  • Finally, you need to build a new release and then to upload it (the important parts are specifying the package to build (t0) and uploading the newly build package to your personal repository (--upload-user part)):
>
>
  • Then you need to modify the t0.spec file, located in your cloned cmsdist repository directory:
    • When there is a new T0 GH release available, you need to specify it in the first line of the t0.spec file:
      ### RPM cms t0 2.1.5
      Normally, you only need to increment it to the tag version created on GH before.
    • Starting from the T0 2.1.5 release, you need to specify the WMCore release version as well. In the spec file it is defined as the wmcver variable:
      %define wmcver 1.2.3 
    • Since you want to test a new release created in your personal repository, you need to get the T0 source code from there as well (Source0 parameter in the spec file):
      Source0: git://github.com/<YOUR_GH_USERNAME>/T0.git?obj=master/%{realversion}&export=T0-%{realversion}&output=/T0-%{realversion}.tar.gz 
      If needed, the WMCore release can be adjusted as well (Source1 parameter in the spec file).
  • Finally, you need to build a new release and then upload it (the important parts are specifying the package to build (t0) and uploading the newly built package to your personal repository (--upload-user part)):
 
# build a new release
pkgtools/cmsBuild -c cmsdist --repository comp   -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w build t0 | tee logBuild

#upload it:
Changed:
<
<
pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w --upload-user=$USER upload t0 | tee logUpload
>
>
pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w --upload-user=$USER upload t0 | tee logUpload
  4. You should build a new release under your personal repository, run unit tests on it and test it in a replay:
  • You need to draft a new release on your personal forked T0 repository. The tag for the new release should be up to date with the dmwm T0 master branch.
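A minimal command-line sketch of drafting that tag (remote names are assumptions: 'upstream' pointing to dmwm/T0 and 'origin' to your fork; the tag can equally be created via the GitHub releases page):

# bring your fork's master up to date with dmwm/T0, then tag and push
git fetch upstream
git checkout master
git merge --ff-only upstream/master
git tag 2.1.5
git push origin master --tags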

5. For testing a new release, you need to modify the deployment scripts on the replay node (the same applies to production nodes accordingly):

Deleted:
<
<
  • 00_software.sh:
# set WMCore and T0 release versions accordingly:
WMCORE_VERSION=1.1.19.pre5-comp2
T0_VERSION=2.1.4
...
# If you want to deploy a personal build, then you should checkout your forked T0 repository:
git clone https://github.com/[your-GH-username]/T0.git
# Also, don't forget to remove/comment out older fix patches from the script, e.g.:
# request more memory for the ScoutingPF repacking
#git fetch https://github.com/hufnagel/T0 scouting-repack-memory && git cherry-pick FETCH_HEAD
 
  • 00_deploy_*.sh:
Changed:
<
<
# here also change T0 release tag:
>
>
# here change T0 release tag:
 TIER0_VERSION=2.1.4 ...
Changed:
<
<
# change the deployment source to your personal repo if it's a personal release
>
>
# change the deployment source to your personal repo if it's a release from your personal repository
#Vytas private repo deployment
./Deploy -s prep -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
./Deploy -s sw -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
Line: 1695 to 1686
#Usual deployment
#./Deploy -s prep -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
#./Deploy -s sw -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
Changed:
<
<
#./Deploy -s post -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
>
>
#./Deploy -s post -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
  That's it. Now you can deploy a new T0 release. Just make sure there are no GH conflicts or other errors during the deployment procedure.
Changed:
<
<
6. Then, once your personal release is working properly, you want to build it as the CMS package in cmssw-cmsdist repository. To prepare for that:
>
>
6. Then, once your personal release is tested and proven as working properly, you want to build it as the CMS package in cmssw-cmsdist repository. To prepare for that:
 
  • you need to generate the changelog of what's in the release:
cd T0 (local T0 repo work area)
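# The changelog command itself is trimmed in this diff view; a minimal sketch
# (assumes 2.1.4 was the previous tag) listing what goes into the new release:
git log --no-merges --oneline 2.1.4..HEAD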
Line: 1718 to 1710
git push upstream master
git push origin master
git push --tags upstream master
Changed:
<
<
git push --tags origin master
>
>
git push --tags origin master
 
  • Once you have a tag on dmwm T0 repo, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
Changed:
<
<
  • Finally, you need to ask WMCore devs to force-merge this into the master branch (for the time being, it's comp_gcc630) of the cmsdist repo, because Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
>
>
  • Finally, you need to ask the WMCore devs to force-merge this into the master branch (for the time being, it's comp_gcc630) of the cmsdist repo, because Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
 
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, be patient). The new release should be available on the official CMS comp CC7 repository.
  • Whenever the new T0 release is available, make sure to update T0 release twiki adding the release notes.
Changed:
<
<
  • Also, whenever any patches for the release are created, make sure to add the to the release notes as well.
>
>
  • Also, whenever any patches for the release are created, make sure to add them to the release notes as well.
 
Changed:
<
<
Voila! Now the new release is ready to use.
>
>
Voila! The new release is ready to use.
 
Changed:
<
<
VytautasJankauskas - 2019-02-28
</>
<!--/twistyPlugin-->
>
>
</>
<!--/twistyPlugin-->
 

Restarting Tier-0 voboxes

Revision 88 - 2019-06-07 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1894 to 1894
 If we need to patch a new T0 agent, then it is done during the deployment (in the 00_patches.sh script). Starting from the 2.1.5 T0 release, the 00_software.sh script becomes obsolete and the 00_patches.sh script is modified. Following the instructions for the WMCore, you need to take a PR/commit from GitHub and simply add a ".patch" to the end of its URL. Let's say there is a T0 PR: https://github.com/dmwm/T0/pull/4502 . Then one needs to add .patch to the end of the URL and the patch in raw format is displayed (the url then may change to something like: https://patch-diff.githubusercontent.com/raw/dmwm/T0/pull/4502.patch ). Once the patch URL is available, the T0 WMAgent can be patched.
Added:
>
>
Keep in mind that for patching a production agent, only stable and already tested PRs/commits should be used (basically the stuff which eventually gets merged into the DMWM T0 repository master branch).
 The following patch line has to be either added to the 00_patches.sh file if you want to apply patches during the deployment of a new agent or it has to be executed manually whenever desired (to have the patch in action, you need to restart the respective component or the whole agent after patching).
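For reference, the patch line follows the pattern below (this is the example PR 4500 used elsewhere in this cookbook; adjust the PR number and the python2.7 site-packages path to your release):

curl https://patch-diff.githubusercontent.com/raw/dmwm/T0/pull/4500.patch | patch -d /data/tier0/srv/wmagent/current/apps/t0/lib/python2.7/site-packages/ -p3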

Revision 87 - 2019-06-07 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 265 to 265
  NOTE: It was decided to keep the skip list within the script instead of an external file to avoid the risk of deleting the runs in case of an error reading such an external file.
Changed:
<
<

</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->

Configure a number of retries for failing jobs before they go to "paused" state

<!--/twistyPlugin twikiMakeVisibleInline-->

There is a number of different cases when Tier-0 jobs fail. By default, all T0 jobs are retried 3 times before they go to "paused" state (state=17 in the T0AST db). However, sometimes such an approach simply wastes T0 resources and prevents operators from spotting the failures ASAP. To deal with this, there were some changes made on the WMCore side and now we are able to configure the number of retries for failing jobs depending on their exit code. One should keep in mind that the exit codes are specified for every job type (Express, PromptReco Processing, Repack, etc.). The key-value pairs in the list of retryErrorCodes per job type are configured as

 <job_exit_code>: <number of retries> 
On every T0 vocms VM in the 00_deploy*.sh script, there is the following configuration for Express, Processing and Repack jobs written to the main T0 WMAgent config.py file:

...
#Configurable retry number for failing jobs before they go to paused.
#Also, here we need to initialize the job type section in the PauseAlgo first
echo "config.RetryManager.PauseAlgo.section_('Express')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Express.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.section_('Processing')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Processing.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.section_('Repack')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Repack.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
...

Obviously, the above piece of code gets executed only during the deployment of the T0 WMAgent. If there is a need to adjust this configuration in an already deployed and running agent, then one just needs to modify the main WMAgent configuration file config.py, stored at /data/tier0/srv/wmagent/current/config/tier0/config.py. After the modifications, the respective WMAgent component needs to be restarted (RetryManager in this case). There are instructions on how to restart a component in this twiki cookbook.
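For instance, after editing config.py, the RetryManager can be restarted with the standard component restart commands from the WMAgent section of this cookbook:

source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --component RetryManager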

<!--/twistyPlugin-->
 

Debugging/fixing operational issues (failing, paused jobs etc.)

Line: 1848 to 1881
 
7

Change the main Production configuration symlink used by the cmst1 lxplus acrontab job at /afs/cern.ch/user/c/cmst1/www/tier0/ :

ln -sfn ProdOfflineConfiguration_123.py ProdOfflineConfiguration.py
Tier0

</>
<!--/twistyPlugin-->
Added:
>
>

Patching T0 Production headnode

<!--/twistyPlugin twikiMakeVisibleInline-->

In case we need to patch the production headnode to fix some failures, etc., the procedure is more or less identical to the one described in the WMCore GH wiki. Please read that article first. If we need to patch a new T0 agent, then it is done during the deployment (in the 00_patches.sh script). Starting from the 2.1.5 T0 release, the 00_software.sh script becomes obsolete and the 00_patches.sh script is modified. Following the WMCore instructions, you need to take a PR/commit from GitHub and simply add a ".patch" to the end of its URL. Let's say there is a T0 PR: https://github.com/dmwm/T0/pull/4502 . Then one needs to add .patch to the end of the URL and the patch in raw format is displayed (the url then may change to something like: https://patch-diff.githubusercontent.com/raw/dmwm/T0/pull/4502.patch ). Once the patch URL is available, the T0 WMAgent can be patched. The following patch line has to be either added to the 00_patches.sh file if you want to apply patches during the deployment of a new agent or it has to be executed manually whenever desired (to have the patch in action, you need to restart the respective component or the whole agent after patching).

curl https://patch-diff.githubusercontent.com/raw/dmwm/T0/pull/4500.patch | patch -d /data/tier0/srv/wmagent/current/apps/t0/lib/python2.7/site-packages/ -p3

Do not forget to make sure that the destination lib directory exists and there were no git errors/conflicts when adding the patch.

<!--/twistyPlugin-->
 

Restarting head node machine

%TWISTY{

Revision 86 - 2019-04-01 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Changed:
<
<
Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
>
>
Recipes for Tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
  BEWARE: The authors are not responsible for any side effects of these recipes; one should always understand the commands/actions before executing them.
Changed:
<
<
Contents :
>
>
Contents :
 
Changed:
<
<


>
>


 

Tier-0 Configuration modifications

Deleted:
<
<
 

Replay instructions

%TWISTY{

Line: 23 to 20
 mode="div" }%
Changed:
<
<
NOTE1: An old and more theoretical version of replays docs is available here.
>
>
NOTE1: An old and more theoretical version of replays docs is available here.

NOTE2: The 00_XYZ scripts are located at

/data/tier0/
 
Changed:
<
<
NOTE2: The 00_XYZ scripts are located at
/data/tier0/
, absolute paths are provided in these instructions for clearness
>
>
, absolute paths are provided in these instructions for clearness
  In order to start a new replay, firstly you need to make sure that the instance is available. Check the Tier0 project on Jira and look for the latest ticket about the vobox. If it is closed and/or it is indicated that the previous replay is over, you can proceed.

A) You now need to check that the previously deployed Tier0 WMAgent is not running on the machine anymore. In order to do so, use the following commands.

  • The queue of the condor jobs:
    condor_q
    . If any, you can use
    condor_rm -all
    to remove everything.
  • The list of the Tier0 WMAgent related processes:
    runningagent (This is an alias included in the cmst1 config; the actual command is: ps aux | egrep 'couch|wmcore|mysql|beam')
Changed:
<
<
  • If the list is not empty, you need to stop the agent:
    /data/tier0/00_stop_agent.sh
>
>
  • If the list is not empty, you need to stop the agent:
    /data/tier0/00_stop_agent.sh
 
Changed:
<
<
B) Setting up the new replay: *How do you choose a run number for the replays:
>
>
B) Setting up the new replay:
  • How do you choose a run number for the replays:
 
    • Go to prodmon to check processing status. There you can see that the run was 5.5h long and quite large (13TB) - I would look for something smaller (~1h), but let's assume it's ok.
    • Go to WBM and check conditions:
      • Check initial and ending lumi - it should be ~10000 or higher. Currently we run at ~6000-7000, so it's ok as well.
Line: 47 to 46
 
  • Later on, run the scripts to start the replay:
      ./00_software.sh # loads the newest version of WMCore and T0 github repositories.
Changed:
<
<
./00_deploy.sh # deploys the new configuration, wipes the toast database etc.
  • 00_deploy.sh script wipes the t0ast db. So, as in replay machines it's fine, you don't want this to happen in a headnode machines. Therefore, be careful while running this script!
>
>
./00_deploy_replay.sh # deploys the new configuration, wipes the T0AST database etc.
  • <b>The 00_deploy_replay.sh script wipes the T0AST db. This is fine on replay machines, but you do not want it to happen on a headnode machine. Therefore, be careful while running this script!</b>
 
       ./00_start_agent.sh # starts the new agent - loads the job list etc.
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/Tier0Feeder/ComponentLog
Line: 59 to 57
 
      condor_q
      runningagent
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Adding a new scenario to the configuration

Line: 74 to 69
 }%

  • Go to the scenarios section.
Changed:
<
<
>
>
 
  • NOTE: If the instance is already deployed, you can manually add the new scenario directly on the event_scenario table of the T0AST. The Tier0Feeder will pick the change up in the next polling cycle.

</>

<!--/twistyPlugin-->
Line: 92 to 84
 mode="div" }%
Changed:
<
<
To delay the PromptReco release is really easy, you just only have to change in the config file (/data/tier0/admin/ProdOfflineConfiguration.py):
>
>
Delaying the PromptReco release is really easy: you only have to change, in the config file (/data/tier0/admin/ProdOfflineConfiguration.py):
 
Changed:
<
<
defaultRecoTimeout = 48 * 3600
>
>
defaultRecoTimeout = 48 * 3600
 
Changed:
<
<
to something higher like 10 * 48 * 3600. Tier0Feeder checks this timeout every polling cycle. So when you want to release it again, you just need to go back to the 48h delay.

</>

<!--/twistyPlugin-->
>
>
to something higher like 10 * 48 * 3600. Tier0Feeder checks this timeout every polling cycle. So when you want to release it again, you just need to go back to the 48h delay.
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Changing CMSSW Version

Line: 160 to 146
 mode="div" }%
Changed:
<
<
As for October 2017, the file size limit was increased from 12GB to 16GB. However, if a change is needed, then the following values need to be modified:
  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py
>
>
As of October 2017, the file size limit was increased from 12GB to 16GB. However, if a change is needed, then the following values need to be modified:
  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py
  </>
<!--/twistyPlugin-->
Line: 220 to 203
  A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):
Changed:
<
<
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
>
>
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
 
  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
Line: 257 to 239
timeleft : 157:02:58
uri      : voms2.cern.ch:15002
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Adding runs to the skip list in the t0Streamer cleanup script

Line: 276 to 256
  To add a run to the skip list:
  • Login as cmst0 on lxplus.
Changed:
<
<
  • Go to the script location and open it:
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 117:
      # run number in this list will be skipped in the iteration below
>
>
  • Go to the script location and open it:
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 117:
      # run number in this list will be skipped in the iteration below
  runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357,...]
  • Add the desired run at the end of the list. Be careful and do not remove any existing runs.
  • Save the changes.
Line: 287 to 265
  NOTE: It was decided to keep the skip list within the script instead of an external file to avoid the risk of deleting the runs in case of an error reading such an external file.
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>

</>
<!--/twistyPlugin-->


 

Debugging/fixing operational issues (failing, paused jobs etc.)

Line: 329 to 305
  This will return the cache dir of the paused jobs (This may not work if the jobs were not actually submitted - submitfailed jobs do not create Report.*.pkl)
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

How do I get the job tarballs?

Line: 347 to 321
 
xrdcp PFN .
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

How do I fail/resume paused jobs?

Line: 383 to 355
cp ListOfPausedJobsFromDB /data/tier0/jocasall/pausedJobsClean.txt
python /data/tier0/jocasall/checkPausedJobs.py
awk -F '_' '{print $6}' code_XXX > jobsToResume.txt
Changed:
<
<
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt

</>

<!--/twistyPlugin-->
>
>
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Data is lost in /store/unmerged - input files for Merge jobs are lost (an intro to run a job interactively)

Line: 419 to 389
tar -xvf PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
zgrep <UUID> ./LogCollection/*.tar.gz
Changed:
<
<
  • Now you should know what job reported to have the UUID for the corrupted file. Untar that tarball and run the job interactively (to untar: tar - zxvf ).
>
>
  • Now you should know what job reported to have the UUID for the corrupted file. Untar that tarball and run the job interactively (to untar: tar -zxvf <tarball>).
 
  • If you need to set a local input file, you can change the PSet.pkl file to point to a local file. However you need to change the trivialcatalog_file and override the protocol to direct. i.e.
S'trivialcatalog_file:/home/glidein_pilot/glide_aXehes/execute/dir_30664/job/WMTaskSpace/cmsRun2/CMSSW_7_1_10_patch2/override_catalog.xml?protocol=direct'
Line: 431 to 401
 
eos cp <local file> </eos/cms/store/unmerged/...>
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Run a job interactively

Line: 480 to 449
 python PSet.py > cmssw_config.py

# Modify cmssw_config.py (For example find process.source and remove files that you don't want to run on). Save it and use it as input for cmsRun instead of PSet.py

Changed:
<
<
cmsRun cmssw_config.py

</>

<!--/twistyPlugin-->
>
>
cmsRun cmssw_config.py
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Updating T0AST when a lumisection can not be transferred.

Deleted:
<
<
 %TWISTY{ showlink="Show..." hidelink="Hide" remember="on" mode="div"
Changed:
<
<
}%
>
>
}%
 
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;
Line: 507 to 470
 update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634 where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Unpickling the PSet.pkl file (job configuration file)

Line: 554 to 514
 
 
cmsRun -e PSet.py 2>err.txt 1>out.txt &
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Transfers to T1 sites are taking longer than expected

Line: 568 to 527
  Transfers can take a while, so this is somewhat normal. If it takes a very long time, one could ask in the phedex ops HN forum if there is a problem. You can also ping the facility admins or open a GGUS ticket if the issue is backlogging the PromptReco processing at a given T1.
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Diagnose bookkeeping problems

Line: 589 to 546
# Run the diagnose script (change run number)
$manage execute-tier0 diagnoseActiveRuns 231087
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Looking for jobs that were submitted in a given time frame

Line: 601 to 557
 mode="div" }%
Changed:
<
<
Best way is to look to the wmbs_jobs table while the workflow is still executing jobs. But if the workflow is already archived, no record in the T0AST about the job is kept. Anyway, there is a way to find out the jobs that were submitted in a given time frame from the couch db:
>
>
The best way is to look at the wmbs_jobs table while the workflow is still executing jobs. But if the workflow is already archived, no record about the job is kept in the T0AST. Anyway, there is a way to find out the jobs that were submitted in a given time frame from the couch db:
  Add this patch to the couch app (this actually adds a view); you may have to modify the path to patch according to the WMAgent/!Tier0 tags you are using.
Deleted:
<
<
 
curl https://github.com/dmwm/WMCore/commit/8c5cca41a0ce5946d0a6fb9fb52ed62165594eb0.patch | patch -d /data/tier0/srv/wmagent/1.9.92/sw.pre.hufnagel/slc6_amd64_gcc481/cms/wmagent/1.0.7.pre6/data/ -p 2
Line: 619 to 571
 
curl -g -X GET 'http://user:password@localhost:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_view/statusByTime?startkey=["executing",1432223400]&endkey=["executing",1432305900]'
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Corrupted merged file

Line: 633 to 584
  This includes files that are on tape, already registered on DBS/TMDB. The procedure to recover them is basically to run all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Print .pkl files, Change job.pkl

Line: 687 to 637
handle.close()
print process.dumpConfig()
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Change a variable in a running Sandbox configuration (wmworkload.pkl)

Line: 714 to 664
  Finally the new compressed archive should be copied back to its original location to replace the old one.
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Modifying jobs to resume them with other features (like memory, disk, etc.)

Line: 785 to 734
  Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job
Changed:
<
<
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
>
>
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_<feature>.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
 
vim list
cp modify_pset.py modify_pset_<feature>.py
Line: 793 to 742
 vim modify_one_job.sh ./modify_several_jobs.sh
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Modifying a workflow sandbox

Line: 832 to 777
# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/
Changed:
<
<
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong. </>
<!--/twistyPlugin-->
>
>
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong.
</>
<!--/twistyPlugin-->
 

Repacking gets stuck but the bookkeeping is consistent

Line: 872 to 815
  delete from streamer where id in ( ... ); delete from wmbs_file_details where id in ( ... );
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Delete entries in database when corrupted input files (Repack jobs)

Line: 883 to 825
 remember="on" mode="div" }%
Deleted:
<
<
 
Changed:
<
<
SELECT WMBS_WORKFLOW.NAME AS NAME, WMBS_WORKFLOW.TASK AS TASK, LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION, LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID, LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID FROM LUMI_SECTION_SPLIT_ACTIVE INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;
>
>
SELECT WMBS_WORKFLOW.NAME AS NAME, WMBS_WORKFLOW.TASK AS TASK, LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION, LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID, LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID FROM LUMI_SECTION_SPLIT_ACTIVE INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;
  # This will actually show the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query
Deleted:
<
<
 
Changed:
<
<
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE; # You HAVE to be completely sure about to delete an entry from the database (don't do this if you don't understand what this implies)
>
>
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;
 
Added:
>
>
# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what this implies)
 
Changed:
<
<
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;

</>

<!--/twistyPlugin-->
>
>
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Manually modify the First Conditions Safe Run (fcsr)

Line: 911 to 847
 mode="div" }%
Changed:
<
<
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
>
>
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
In the CMS_T0DATASVC_PROD database check the reco_locked table; the first run with locked = 0 is the fcsr
Deleted:
<
<
 
 reco_locked table 

If you want to manually set a run as the fcsr you have to make sure that it is the lowest run with locked = 0

Line: 918 to 853
 
 reco_locked table 

If you want to manually set a run as the fcsr you have to make sure that it is the lowest run with locked = 0

Deleted:
<
<
 
 update reco_locked set locked = 0 where run >= <desired_run> 
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>

</>
<!--/twistyPlugin-->


 

Check a stream/dataset/run completion status (Tier0 Data Service (T0DATASVC) queries)

Line: 937 to 870
 
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_stream_done?run=305199&stream=ZeroBias
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462
Changed:
<
<
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET

run_dataset_done can be called without any primary_dataset parameters, in which case it reports back overall PromptReco status. It aggregates over all known datasets for that run in the system (ie. all datasets for all streams for which we have data for this run).

</>

<!--/twistyPlugin-->


>
>
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET
 
Added:
>
>
run_dataset_done can be called without any primary_dataset parameters, in which case it reports back overall PromptReco status. It aggregates over all known datasets for that run in the system (ie. all datasets for all streams for which we have data for this run).
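These endpoints can also be queried from the command line. A minimal sketch (treat the authentication flags as an assumption; depending on the cmsweb configuration a grid certificate/proxy may or may not be required):

curl -sk --cert $X509_USER_PROXY --key $X509_USER_PROXY \
  "https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462"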
 
Added:
>
>

</>
<!--/twistyPlugin-->


 

WMAgent instructions

Line: 987 to 915
 source /data/tier0/admin/env.sh $manage execute-agent wmcoreD --restart --component ComponentName
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Updating workflow from completed to normal-archived in WMStats

Line: 999 to 926
 mode="div" }%
Changed:
<
<
  • To move workflows in completed state to archived state in WMStats, the next code should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 
>
>
  • To move workflows in completed state to archived state in WMStats, the next code should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 
 
  • The script should be copied to bin folder of wmagent code. For instance, in replay instances:
     /data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/ 
Line: 1021 to 946
 }%

  • AgentStatusWatcher and SSB:
Changed:
<
<
Site thresholds are automatically updated by a WMAgent component: AgentStatusWatcher. This components takes information about site status and resources (CPU Bound and IO Bound) from the SiteStatusBoard Pledges view There are some configurations in the WMAgent config that can be tuned, please have a look to the documentation
>
>
Site thresholds are automatically updated by a WMAgent component: AgentStatusWatcher. This component takes information about site status and resources (CPU Bound and IO Bound) from the SiteStatusBoard Pledges view. There are some configurations in the WMAgent config that can be tuned; please have a look at the documentation
 
  • Add sites to resource control/manual update of the thresholds
This doesn't worth unless AgentStatusWatcher is shutdown. Some useful commands are:
Line: 1040 to 963
# Change site status (normal, drain, down)
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --down
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Unregistering an agent from WMStats

Line: 1054 to 976
  First thing to know - an agent has to be stopped to unregister it. Otherwise, AgentStatusWatcher will just keep updating a new doc for wmstats.
  • Log into the agent
Changed:
<
<
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
>
>
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
 
  • You will be prompted for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.
Line: 1074 to 993
 }%

  • Login into the desired agent and become cmst1
Changed:
<
<
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
>
>
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
 
Changed:
<
<
  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 
>
>
  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 
 
Changed:
<
<
  • To see the current thresholds and use
     $manage execute-agent wmagent-resource-control -p 
>
>
  • To see the current thresholds and use
     $manage execute-agent wmagent-resource-control -p 
  </>
<!--/twistyPlugin-->
Line: 1139 to 1051
 ) and site like ...
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>

</>
<!--/twistyPlugin-->


 

Condor instructions

Line: 1153 to 1064
 mode="div" }%
Deleted:
<
<
 To get condor attributes:
Changed:
<
<
condor_q 52982.15 -l | less -i
>
>
condor_q 52982.15 -l | less -i
 To get condor list by regexp:
Changed:
<
<
condor_q -const 'regexp("30199",WMAgent_RequestName)' -af

</>

<!--/twistyPlugin-->
>
>
condor_q -const 'regexp("30199",WMAgent_RequestName)' -af
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Changing priority of jobs that are in the condor queue

Line: 1175 to 1083
 mode="div" }%
Changed:
<
<
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
    for job in $(condor_q -w | awk '{print $1}')
>
>
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
    for job in $(condor_q -w | awk '{print $1}')
  do
    condor_qedit $job JobPrio "508200001"
  done
Line: 1200 to 1104
 
condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Check the number of jobs and CPUs in condor

Line: 1239 to 1142
 
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Overriding the limit of Maximum Running jobs by the Condor Schedd

Line: 1252 to 1154
 }%

  • Login as root in the Schedd machine
Changed:
<
<
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 
>
>
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 
  </>
<!--/twistyPlugin-->
Line: 1273 to 1170
 mode="div" }%
Changed:
<
<
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
>
>
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
  do
    condor_qedit $job Requestioslots "0"
  done
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->


 

GRID certificates

Line: 1298 to 1192
 
  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • The current VOC, Daniel Valbuena, has the script writing the gridmap file versioned here: (See Gitlab repo).
Changed:
<
<
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
>
>
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
 39. dnmap[ p['dn'] ] = "cmst0"

</>

<!--/twistyPlugin-->
Line: 1332 to 1224
 
  • Change the certificates in the monitoring scripts where they are used, to see where the certificates are being used and the current monitoring head node please check the Tier0 Montoring Twiki.

TransferSystem

Changed:
<
<
TransferSystem is not used anymore
>
>
TransferSystem is not used anymore
 
  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart component.
/data/TransferSystem/t0_control.sh
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

OracleDB (T0AST) instructions

Line: 1352 to 1243
 

Software

Changed:
<
<
  1. Download and install Oracle Instant Client. Use the basic package. Follow general installation steps in web page. You need to have Java installed.
>
>
  1. Download and install the Oracle Instant Client. Use the basic package. Follow the general installation steps on the web page. You need to have Java installed.
 
Changed:
<
<
  1. Download and install the Oracle SQL Developer software.
>
>
  1. Download and install the Oracle SQL Developer software.
 
Changed:
<
<
  1. Configure SQL Developer to use Instant Client as Client Type. You may need to do several configurations before connecting to any database, depending on your system. These links can be useful:
>
>
  1. Configure SQL Developer to use Instant Client as Client Type. You may need to do several configurations before connecting to any database, depending on your system. These links can be useful:
 

Connections

Changed:
<
<
  1. Download the tnsnames.ora file from the dmwm/PHEDEX repository. This file contains the mapping between service names and connect descriptors. tnsnames.org is going to be used for the configuration of the database connections.
>
>
  1. Download the tnsnames.ora file from the dmwm/PHEDEX repository. This file contains the mapping between service names and connect descriptors. tnsnames.ora is going to be used for the configuration of the database connections.
 
Changed:
<
<
  1. Setup SQL Developer to use the downloaded tnsnames.ora and show available CMS net services
>
>
  1. Setup SQL Developer to use the downloaded tnsnames.ora and show available CMS net services
 

Connections to T0AST databases

For each production and replay instance (Check REPLACE twiki), create a connection:

Changed:
<
<
  1. User credentials are stored in the respective T0 machine. They can be found in the /data/tier0/admin/WMAgent.secrets file. Use values of ORACLE_USER and ORACLE_PASS fields.
  2. Use TNS as Connection Type, default as Role and the ORACLE_TNS value in WMAgent.secrets as Network Alias.
>
>
  1. User credentials are stored in the respective T0 machine. They can be found in the /data/tier0/admin/WMAgent.secrets file. Use values of ORACLE_USER and ORACLE_PASS fields.
  2. Use TNS as Connection Type, default as Role and the ORACLE_TNS value in WMAgent.secrets as Network Alias.
 

Connection to other Tier0 related databases

Create a connection for the next read only databases. User credentials and Network aliases can be found in any WMAgent.secret file of any T0 agent.

Changed:
<
<
  1. Create a connection to the Storage Manager database. Use values of SMDB_URL field.
  2. Create a connection to the HLT database. Use values of CONFDB_URL field.
  3. Create a connection to the Conditions database. Use values of POPCONLOGDB_URL field.
  4. Create a connection to the Tier0 data service database. Use values of T0DATASVCDB_URL field.
>
>
  1. Create a connection to the Storage Manager database. Use values of SMDB_URL field.
  2. Create a connection to the HLT database. Use values of CONFDB_URL field.
  3. Create a connection to the Conditions database. Use values of POPCONLOGDB_URL field.
  4. Create a connection to the Tier0 data service database. Use values of T0DATASVCDB_URL field.
 
Changed:
<
<
>
>
  </>
<!--/twistyPlugin-->
Line: 1458 to 1345
 
  • When the database is ready, you can open a SNOW ticket requesting the backup. The ticket is created automatically by sending an email to phydb.support@cern.ch.
  • When the backup is done you will get a reply to your ticket confirming it. Recheck that the backup is fine, consistent etc. and ask to close the ticket.
Changed:
<
<
Whenever creating a backup, please add a row to the Tier0 Archive Accounts table of records.

</>

<!--/twistyPlugin-->
>
>
Whenever creating a backup, please add a row to the Tier0 Archive Accounts table of records.
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

Checking what is locking a database / Cern Session Manager

Line: 1473 to 1358
 mode="div" }%
Changed:
<
<
  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
>
>
  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.
Line: 1492 to 1376
 

Run related

  • Configuration of a certain run
Deleted:
<
<
 
Changed:
<
<
select * from run where run_id = 293501;
>
>
select * from run where run_id = 293501;
 
  • Streams of a run
Deleted:
<
<
 
select * from streamer
join stream on streamer.STREAM_ID=stream.ID
Line: 1503 to 1384
 select * from streamer join stream on streamer.STREAM_ID=stream.ID where run_id= 293501;
Deleted:
<
<
 
Deleted:
<
<
 select count(*) from streamer join stream on streamer.STREAM_ID=stream.ID where run_id= 293501
Changed:
<
<
group by stream.ID;
>
>
group by stream.ID;
 

Job related

Line: 1519 to 1396
 
select wmbs_job_state.name, count(*) from wmbs_job 
join wmbs_job_state on wmbs_job.state = wmbs_job_state.id
Changed:
<
<
GROUP BY wmbs_job_state.name;
>
>
GROUP BY wmbs_job_state.name;
 
  • Get jobs in paused state
Deleted:
<
<
 
select id, cache_dir from wmbs_job where STATE =17;
Changed:
<
<
SELECT id, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpausedí);
>
>
SELECT id, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');
 
  • Get jobs in paused state per type of job
Deleted:
<
<
 
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir not like '%Repack%' order by cache_dir;
Line: 1538 to 1411
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' and cache_dir not like '%Repack%' order by cache_dir;
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir not like '%Merge%' order by cache_dir;
Changed:
<
<
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir like '%Merge%' order by cache_dir;
>
>
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir like '%Merge%' order by cache_dir;
 

CMSSW related

Line: 1551 to 1421
 
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Changed:
<
<
order by RECO_CONFIG.RUN_ID desc;
>
>
order by RECO_CONFIG.RUN_ID desc;
 
  • First run configured to use certain CMSSW version for PromptReco
Deleted:
<
<
 
select distinct min(RECO_CONFIG.RUN_ID) from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Line: 1560 to 1428
 select distinct min(RECO_CONFIG.RUN_ID) from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
Changed:
<
<
order by RECO_CONFIG.RUN_ID desc;
>
>
order by RECO_CONFIG.RUN_ID desc;
 
  • All runs configured to use certain CMSSW version for PromptReco
Deleted:
<
<
 
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Line: 1569 to 1435
 select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
Changed:
<
<
order by RECO_CONFIG.RUN_ID desc ;
>
>
order by RECO_CONFIG.RUN_ID desc ;
 
Express (Reconstruction Step)
Line: 1579 to 1443
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
Changed:
<
<
order by EXPRESS_CONFIG.RUN_ID desc ;
>
>
order by EXPRESS_CONFIG.RUN_ID desc ;
 
  • First run configured to use certain CMSSW version for Express
Deleted:
<
<
 
select distinct min(EXPRESS_CONFIG.RUN_ID) from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
Line: 1588 to 1450
 select distinct min(EXPRESS_CONFIG.RUN_ID) from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
Changed:
<
<
order by EXPRESS_CONFIG.RUN_ID desc;
>
>
order by EXPRESS_CONFIG.RUN_ID desc;
 
  • All runs configured to use certain CMSSW version for Express
Deleted:
<
<
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
Line: 1597 to 1457
 select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
Changed:
<
<
order by EXPRESS_CONFIG.RUN_ID desc;
>
>
order by EXPRESS_CONFIG.RUN_ID desc;
 
Specific run
Line: 1609 to 1467
 
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Changed:
<
<
where RECO_CONFIG.RUN_ID=299325 order by RECO_CONFIG.RUN_ID desc ;
>
>
where RECO_CONFIG.RUN_ID=299325 order by RECO_CONFIG.RUN_ID desc ;
 
  • Express (Reconstruction step)
Deleted:
<
<
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
Line: 1617 to 1473
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
Changed:
<
<
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;
>
>
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;
 
  • Express (Repack step)
Deleted:
<
<
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Line: 1625 to 1479
 
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
Changed:
<
<
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;
>
>
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;
  NOTE: Remember that the Express configuration includes two CMSSW releases; one for repacking: CMSSW_ID and another for reconstruction: RECO_CMSSW_ID
Line: 1638 to 1490
 select wmbs_file_details.* from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
Changed:
<
<
where wmbs_job.ID = 3463;
>
>
where wmbs_job.ID = 3463;
 
  • Lumis contained in a file
Deleted:
<
<
 
select * from wmbs_file_runlumi_map
Changed:
<
<
where fileid = 8356;
>
>
where fileid = 8356;
 
  • Input files with lumis of a specific job given its id
Deleted:
<
<
 
select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
Line: 1655 to 1503
 join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID
Changed:
<
<
where wmbs_job.ID = 3463 order by lumi;
>
>
where wmbs_job.ID = 3463 order by lumi;
 
  • Input file of a job with specific lumi
Deleted:
<
<
 
select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
Line: 1665 to 1511
 join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID
Changed:
<
<
where wmbs_job.ID = 3463 and lumi = 73;
>
>
where wmbs_job.ID = 3463 and lumi = 73;
 
  • Job that has a specific file as input given file id
Deleted:
<
<
 
select wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Line: 1673 to 1517
 
select wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Changed:
<
<
where wmbs_job_assoc.FILEID = 4400;
>
>
where wmbs_job_assoc.FILEID = 4400;
 
  • Job that has a specific file as input given file lfn
Deleted:
<
<
 
select wmbs_job.CACHE_DIR from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
Line: 1682 to 1524
 select wmbs_job.CACHE_DIR from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
Changed:
<
<
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
>
>
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
 
  • Job details of job that has a specific input file **MASK
Deleted:
<
<
 
select wmbs_job_mask.*, wmbs_job.CACHE_DIR from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Line: 1691 to 1531
 select wmbs_job_mask.*, wmbs_job.CACHE_DIR from wmbs_job_assoc join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB
Changed:
<
<
where wmbs_job_assoc.FILEID = 4400;
>
>
where wmbs_job_assoc.FILEID = 4400;
 
  • Parent file of a specified child file
Deleted:
<
<
 
Changed:
<
<
select * from wmbs_file_parent where child = 6708;
>
>
select * from wmbs_file_parent where child = 6708;
 
  • Details of parent file of a specified child file
Deleted:
<
<
 
select wmbs_file_details.* from wmbs_file_parent 
join wmbs_file_details on wmbs_file_parent.PARENT = wmbs_file_details.ID
Line: 1706 to 1541
 
select wmbs_file_details.* from wmbs_file_parent 
join wmbs_file_details on wmbs_file_parent.PARENT = wmbs_file_details.ID
Changed:
<
<
where child = 6708;
>
>
where child = 6708;
 
  • Jobs which input files are a parent file of a specific file ("Parent job of a file")
Deleted:
<
<
 
select distinct wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Line: 1714 to 1547
 
select distinct wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Changed:
<
<
where wmbs_job_assoc.FILEID in (select parent from wmbs_file_parent where child = 3756584);
>
>
where wmbs_job_assoc.FILEID in (select parent from wmbs_file_parent where child = 3756584);
 
  • Jobs which input files are a parent file of a specific file given lfn of child file. (In other words, get cache dir of father job and filesize of child file given lfn of child file)
Deleted:
<
<
 
select wmbs_job.CACHE_DIR, wmbs_file_details.FILESIZE
from wmbs_file_details
Line: 1725 to 1556
 join wmbs_file_parent on wmbs_file_parent.CHILD = wmbs_file_details.ID join wmbs_job_assoc on wmbs_file_parent.PARENT = wmbs_job_assoc.FILEID join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
Changed:
<
<
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
>
>
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
 
  • Same query as before, but excluding Cleanup jobs
Deleted:
<
<
 
select wmbs_job.CACHE_DIR, wmbs_file_details.*
from wmbs_file_details
Line: 1738 to 1567
 join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/SingleMuon/ALCARECO/DtCalib-PromptReco-v1/000/299/616/00000/86B8AC3B-5B71-E711-ABA7-02163E0118E2.root';
Changed:
<
<
and wmbs_job.CACHE_DIR not like '%Cleanup%'
>
>
and wmbs_job.CACHE_DIR not like '%Cleanup%'
 
  • File details of jobs in paused
Deleted:
<
<
 
select wmbs_file_details.* from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
Line: 1747 to 1574
 select wmbs_file_details.* from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
Changed:
<
<
where wmbs_job.STATE = 17;
>
>
where wmbs_job.STATE = 17;
 

Processing status related

Line: 1754 to 1579
 

Processing status related

  • File sets to be processed
Deleted:
<
<
 
Changed:
<
<
select name from wmbs_fileset;
>
>
select name from wmbs_fileset;
 
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

T0 nodes, headnodes

Line: 1790 to 1613
 git fetch origin V00-32-XX git checkout V00-32-XX git pull origin V00-32-XX
Added:
>
>
 
    • cmsdist:
Deleted:
<
<
 
#Clone cmsdist branch: comp_gcc630:
https://github.com/cms-sw/cmsdist.git
Line: 1863 to 1685
 git push upstream master git push origin master git push --tags upstream master
Changed:
<
<
git push --tags origin master
>
>
git push --tags origin master
 
  • Once you have a tag on dmwm T0 repo, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
  • Finally, you need to ask WMCore devs to force-merge this into the master branch (for the time being, it's comp_gcc630) of the cmsdist repo, because Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, be patient). The new release should be available on the official CMS comp CC7 repository.
Changed:
<
<
  • Whenever the new T0 release is available, make sure to update T0 release twiki adding the release notes.
>
>
  • Whenever the new T0 release is available, make sure to update T0 release twiki adding the release notes.
 
  • Also, whenever any patches for the release are created, make sure to add them to the release notes as well.

Voila! Now the new release is ready to use.

Changed:
<
<
VytautasJankauskas - 2019-02-28 </>
<!--/twistyPlugin-->
>
>
VytautasJankauskas - 2019-02-28
</>
<!--/twistyPlugin-->
 

Restarting Tier-0 voboxes

Line: 1886 to 1705
 mode="div" }%
Changed:
<
<
Node Use Type
vocms001
  • Replays: Normally used by the developer
  • Transfer System
Virtual machine
vocms015
  • Replays: Normally used by the operators
  • Tier0 Monitoring
Virtual machine
vocms047
  • Replays: Normally used by the operators
Virtual machine
vocms0313
  • Production node
Physical Machine
vocms0314
  • Production node
Physical Machine
>
>
An up-to-date list of T0 voboxes can be found in the T0 policies section.
 To restart this node you need to check the following:
  • Production node:
    • The agent is not running and the couch processes were stopped correctly.
Changed:
<
<
    • These nodes uses a RAMDISK. Mounting it is puppetized, so you need to make sure that puppet ran before starting the agent again.
  • TransferSystem:
>
>
    • Mounting the RAMDISK used by these nodes is puppetized, so you need to make sure that puppet ran before starting the agent again.
 
  • Replays:
    • The agent should not be running, check the Tier0 Elog to make sure you are not interfering with a particular test.
  • Tier0 Monitoring:
Changed:
<
<
    • The monitoring is executed via a cronjob. The only consequence of the restart should be that no reports are produced during the down time. However you can check that everything is working going to:
      /data/tier0/sls/scripts/Logs
>
>
    • The monitoring is executed via a cronjob. The only consequence of the restart should be that no reports are produced during the downtime. However, you can check that everything is working by looking at the logs in:
      /data/tier0/sls/scripts/Logs
  To restart a machine you need to:
  • Login and become root
Line: 1914 to 1726
  </>
<!--/twistyPlugin-->
Deleted:
<
<
 

Commissioning of a new node

Deleted:
<
<
*INCOMPLETE INSTRUCTIONS: WORK IN PROGRESS 2017/03*
 
<!--/twistyPlugin twikiMakeVisibleInline-->
Changed:
<
<

Folder's structure and permissions

>
>
CMS VOC should be contacted if T0 needs a new node/vobox.

Generally, one needs to check that a new node has the same directories and their contents as already existing nodes.

Only the WMAgent.secrets file needs to be adjusted for a new node (T0AST DB credentials).

Instructions below will guide you through the process.

1. Structure and permissions of directories
 
  • These folders should be placed at /data/:
# Permissions Owner Group Folder Name
1. (775) drwxrwxr-x. root zh admin
Line: 1935 to 1752
 
5. (775) drwxrwxr-x. root zh srv
6. (755) drwxr-xr-x. cmst1 zh tier0
TIPS:
Changed:
<
<
  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 
>
>
  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 
 
Changed:
<
<

2. certs

>
>
2. certs dir
 
  • Certificates are placed on this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
Line: 1953 to 1767
  NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section
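A quick way to confirm that the copied certificate and the renewed proxy are actually usable is sketched below. This is only a hedged example: the file names under /data/certs/ follow the list above and may differ on the new node.
    # remaining lifetime (in seconds) of the service proxy
    voms-proxy-info -file /data/certs/serviceproxy-vocms001.pem -timeleft
    # validity dates and subject of the service certificate
    openssl x509 -in /data/certs/servicecert-vocms001.pem -noout -dates -subject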
Changed:
<
<

5. srv

  • There you will find the
    glidecondor
    folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.

6. tier0

  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.
>
>
5. srv
  • Condor-related packages/directories are located there. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.
6. tier0
  • Main folder for the T0 WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.
 
File Description
Changed:
<
<
00_deploy.prod.sh Script to deploy the WMAgent for production(*)
00_deploy.replay.sh Script to deploy the WMAgent for a replay(*)
00_fix_t0_status.sh
00_patches.sh
>
>
00_deploy_prod.sh Script to deploy the WMAgent for production(*)
00_deploy_replay.sh Script to deploy the WMAgent for a replay(*)
00_patches.sh Script to apply patches from 00_software script.
 
00_readme.txt Some documentation about the scripts
Changed:
<
<
00_software.sh Gets the source code to use form Github for WMCore and the Tier0. Applies the described patches if any.
>
>
00_software.sh * Gets the source code to use from GitHub for WMCore and the Tier0. Applies the described patches.
 
00_start_agent.sh Starts the agent after it is deployed.
00_start_services.sh Used during the deployment to start services such as CouchDB
00_stop_agent.sh Stops the components of the agent. It doesn't delete any information from the file system or the T0AST, just kill the processes of the services and the WMAgent components
00_wipe_t0ast.sh Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!
Changed:
<
<
(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying. (**) This script is not static. It might change when new patches are required and when the release versions of the WMCore and the Tier0 change. Check it before deploying.
>
>
* This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying.
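Before running any of these scripts on a freshly commissioned node, it can help to compare them with the copies on a known-good headnode. The sketch below is hedged: vocms0314 and the /data/tier0/ location are assumptions taken from the existing setup.
    # compare the production deployment script with the one on an existing headnode
    diff /data/tier0/00_deploy_prod.sh <(ssh cmst1@vocms0314 'cat /data/tier0/00_deploy_prod.sh')
    # double-check which secrets file and Tier0 configuration the script points to
    grep -n 'WMAGENT_SECRETS_LOCATION\|TIER0_CONFIG_FILE' /data/tier0/00_deploy_prod.sh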
Cronjobs
 
Changed:
<
<
Folder Description

Cronjobs

>
>
Again, normally you should simply copy the crontab entries from an existing node to the new one (see the sketch right below).
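For example (a hedged sketch, with vocms0314 as a placeholder for the existing node):
    # dump the crontab of the cmst1 user on an existing node
    ssh cmst1@vocms0314 'crontab -l' > /tmp/t0_crontab.txt
    # review the entries, adjust hostnames/paths for the new node, then install and verify
    crontab /tmp/t0_crontab.txt
    crontab -l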
 
Changed:
<
<
  • Proxy cronjobs
>
>
  • Proxy renewal cronjobs
  • Also, you would need to add the new node to the Grafana monitoring: create the /data/tier0/sls/etc/config_<new node ID>.py configuration on vocms015 and add the node to the list of T0 schedds at /data/tier0/sls/scripts/cmst0_checkTotalJobs/tier0_schedds.
 
<!--/twistyPlugin-->
Line: 1993 to 1800
 remember="on" mode="div" }%
Changed:
<
<
  • To run a replay in a instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
      • [20171012] There is a replay WMAgent.secrets file example on vocms0313.
>
>
This includes wiping a T0AST DB. Therefore, first check that you are doing it on the right headnode.

  • To run a replay in the instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, then make sure that the T0AST db was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc. On all headnodes there should be a WMAgent.secrets.replay file with the adjusted replay credentials. Make sure you create a backup of production WMAgent.secrets file - it will be used for production later on.
 
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the jira history.
Changed:
<
<
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
>
>
    • Do not use production 00_deploy_prod.sh. Use the replays 00_deploy_replay.sh script instead. This is the list of changes (ONLY FOR REFERENCE - these changes are already applied on 00_deploy_replay.sh):
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
  echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
Changed:
<
<
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
>
>
    • Start a replay normally (00_software, 00_deploy_replay, 00_start_agent scripts; see the command sketch after this list).
    • Once the replay is working properly you can wait for it to finish or you can kill it.
    • When the WMAgent is shut down, unregister it from testbed WMStats:
      • $manage execute-agent wmagent-unregister-wmstats `hostname -f`
    • Now the headnode is tested and ready to be used for production. When configuring it for production, make sure WMAgent.secrets is rolled back to the production version, that the production configuration is the desired one, and that the agent is deployed using the 00_deploy_prod script.
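The whole sequence, as the cmst1 user on the headnode, roughly looks as follows. This is a hedged sketch: the /data/tier0/ location of the scripts and the raw GitHub path of the replay configuration are assumptions based on the existing nodes.
    cd /data/tier0/
    # fetch the latest replay configuration from the dmwm/T0 repository (path assumed)
    curl -o /data/tier0/admin/ReplayOfflineConfiguration.py \
         https://raw.githubusercontent.com/dmwm/T0/master/etc/ReplayOfflineConfiguration.py
    ./00_stop_agent.sh        # make sure nothing is left running from a previous deployment
    ./00_software.sh          # get the WMCore/T0 code and apply the patches
    ./00_deploy_replay.sh     # WARNING: wipes the T0AST instance configured in WMAgent.secrets
    ./00_start_agent.sh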
 
Changed:
<
<
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin
>
>
Again, keep in mind that the 00_deploy*.sh scripts wipe the T0AST DB (the production instance in this case), so make sure you know what you are doing.
 
Changed:
<
<
Again, keep in mind that 00_deploy.sh script wipes t0ast db - production instance in this case - so, carefully.

</>

<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Changing Tier0 Headnode

Line: 2036 to 1832
 mode="div" }%
Changed:
<
<
# Instruction Responsible Role
0. | If there are any exceptions when logging into a candidate headnode, then you should restart it at first. | Tier0 |
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
1. Deploy the new prod instance in a new vocmsXXX node, check that we use. Obviously, you should use a production version of 00_deploy.sh script. Tier0
1.5. Check the ProdOfflineconfiguration that is being used Tier0
2. Start the Tier0 instance in vocmsXXX Tier0
3. THIS IS OUTDATED ALREADY I THINK Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. THIS IS OUTDATED ALREADY I THINK Checking al transfer are stopped Tier0
4.1. THIS IS OUTDATED ALREADY I THINK Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. THIS IS OUTDATED ALREADY I THINK Check /data/Logs/General.log
5. THIS IS OUTDATED ALREADY I THINK Change the config file of the transfer system to point to T0AST1. It means, going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and change that the following settings to match the new head node T0AST)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
Tier0
6. THIS IS OUTDATED ALREADY I THINK Make a backup of the General.log.* files (This backup is only needed if using t0_control restart in the next step, if using t0_control_stop + t0_control start logs won't be affected) Tier0
7. THIS IS OUTDATED ALREADY I THINK

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. THIS IS OUTDATED ALREADY I THINK Kill the replay processes (if any) Tier0
9. THIS IS OUTDATED ALREADY I THINK Start notification logs to the SM in vocmsXXX Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. THIS IS OUTDATED ALREADY I THINK Restart transfers SMOps
12. RECHECK THE LIST OF CRONTAB JOBS Point acronjobs ran as cmst1 on lxplus to a new headnode. They are checkActiveRuns and checkPendingTransactions scripts. Tier0
</>
<!--/twistyPlugin-->
>
>
We normally change headnodes when a new acquisition era is required, or when some special runs need an unusual configuration (heavy-ion runs, etc.).

A new Tier0 headnode VM should have its T0AST DB backed up beforehand. It should also be cleaned up and turned off.

 
Added:
>
>
# Instruction Responsible Role
0 If there are any exceptions when logging into a new headnode, then you should restart it at first. Restarting a vobox section. Tier0
1 Run a replay on the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
2 When the replay is done, deploy the new T0 prod WMAgent instance on the new headnode. You should use a 00_deploy_prod.sh script. Tier0
3 Check the ProdOfflineconfiguration that is being used. It should have the desired new configuration (acq. era, GTs, processing versions, etc.) Tier0
4 Start the Tier0 WMAgent on the new headnode. Tier0
5 Change the configuration for Grafana monitoring pointing to the proper T0AST instance. (on vocms015 at /data/tier0/sls/etc/config.py) Tier0
6

Change the acrontab execution node on cmst1 lxplus to point to the new headnode. The affected jobs are the checkActiveRuns and checkPendingTransactions scripts:

*/10 * * * * lxplus ssh vocms0314 "/data/tier0/tier0_monitoring/src/cmst0_diagnoseActiveRuns/activeRuns.sh" &> /afs/cern.ch/user/c/cmst1/www/tier0/diagnoseActiveRuns.out
*/5 * * * * lxplus ssh vocms0314 "/data/tier0/tier0_monitoring/src/cmst0_checkPendingSubscriptions/checkPendingSubscriptions.sh" &> /afs/cern.ch/user/c/cmst1/www/tier0/checkPendingSubscriptions.out
Tier0
7

Change the main Production configuration symlink on cmst1 lxplus acrontab job at /afs/cern.ch/user/c/cmst1/www/tier0/ :

ln -sfn ProdOfflineConfiguration_123.py ProdOfflineConfiguration.py
Tier0

</>
<!--/twistyPlugin-->
 

Restarting head node machine

Line: 2080 to 1868
  </>
<!--/twistyPlugin-->
Changed:
<
<

Configuring a newly created VM to be used as a T0 headnode/replay VM

<!--/twistyPlugin twikiMakeVisibleInline-->

This was started on 30/01/2018. To be continued.

  1. Whenever a new VM is created for T0, it has a mesa-libGLU package missing and, therefore, the deployment script is not going to work:
       Some required packages are missing:
       + for p in '$missingSeeds'
       + echo mesa-libGLU
       mesa-libGLU
       + exit 1
       
    One needs to install the package manually (with a superuser access):
    $ sudo yum install mesa-libGLU

<!--/twistyPlugin-->


>
>


 

T0 Pool instructions

Line: 2116 to 1883
  If it is needed to prevent new Central Production jobs to be executed in the Tier0 pool it is necessary to disable flocking.
Changed:
<
<
NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/
>
>
NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/
  Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.
Line: 2127 to 1891
 
  • Difference between site whitelisting and enabling/disabling flocking. When flocking jobs of different core counts, defragmentation may have to be re-tuned.
  • Also when the core-count is smaller than the defragmentation policy objective. E.g., the current defragmentation policy is focused on defragmenting slots with less than 4 cores. Having flocking enabled and only single or 2-core jobs in the mix, will trigger unnecessary defragmentation. I know this is not a common case, but if the policy were focused on 8-cores and for some reason, they inject 4-core jobs, while flocking is enabled, the same would happen.
Changed:
<
<
As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. To do so you should follow these steps.
>
>
As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. To do so you should follow these steps.
  NOTE2: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.

  • Login to vocms007 (GlideInWMS Collector-Negociator)
Changed:
<
<
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized which means any change will be overridden when puppet be executed. There is one no-puppetized file called 99_local_tweaks.config that is the one that will be used to do the changes we desire.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
>
>
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized which means any change will be overridden when puppet be executed. There is one no-puppetized file called 99_local_tweaks.config that is the one that will be used to do the changes we desire.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
  -rw-r--r--. 1 condor condor 1511 Mar 19  2015 01_gwms_collectors.config
  -rw-r--r--  1 condor condor  678 May 27  2015 03_gwms_local.config
  -rw-r--r--  1 condor condor 2613 Nov 30 11:16 10_cms_htcondor.config
Line: 2153 to 1912
  Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.
Changed:
<
<
  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
>
>
  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
 # To enable, set this to True (defragmentation is auto enabled)
 # To disable, set this to False (defragmentation is auto disabled)
 ENABLE_PROD_FLOCKING = True
Changed:
<
<
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 
>
>
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 
 
  • The negotiator has a 12h cache, so the schedds don't need to authenticate during this period of time. Therefore, the negotiator has to be restarted.
Changed:
<
<
  • Now, you can check the whitelisted Schedds to run in Tier0 pool, the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  
>
>
  • Now you can check the schedds whitelisted to run in the Tier0 pool; the Central Production schedds should not appear there.
     condor_config_val -master gsi_daemon_name  
 
Changed:
<
<
  • Now you need to restart the condor negociator to make sure that the changes are applied right away.
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 
>
>
  • Now you need to restart the condor negotiator to make sure that the changes are applied right away.
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 
 
  • After killing the process, it should reappear after a couple of minutes.
Line: 2178 to 1932
  Remember that this change won't remove/evict the jobs that are already running, but it will prevent new jobs from being sent.
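A hedged sanity check on the collector after editing 99_local_tweaks.config (run as root on vocms007):
    condor_config_val ENABLE_PROD_FLOCKING    # should now print False
    condor_reconfig                           # make condor re-read the configuration
    condor_status -negotiator                 # confirm the negotiator daemon is back after the restart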
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Enabling pre-emption in the Tier0 pool

Line: 2191 to 1945
  BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team. This literally kills all the Central Production jobs which are in Tier0 Pool (including ones which are being executed at that moment).
Changed:
<
<
NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.
>
>
NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.
 
  • Login to vocms007 (GlideInWMS Collector-Negotiator)
Changed:
<
<
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
>
>
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
 
  • Open 99_local_tweaks.config
Changed:
<
<
  • Locate this section:
     # How to drain the slots
>
>
  • Locate this section:
     # How to drain the slots
  # graceful: let the jobs finish, accept no more jobs
  # quick: allow job to checkpoint (if supported) and evict it
  # fast: hard kill the jobs
  DEFRAG_SCHEDULE = graceful
Changed:
<
<
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes. After this the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change
     DEFRAG_SCHEDULE = graceful 
>
>
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes. After this the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change
     DEFRAG_SCHEDULE = graceful 
  </>
<!--/twistyPlugin-->
Line: 2227 to 1972
  To change T2_CH_CERN and T2_CH_CERN_HLT
Changed:
<
<
*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*
>
>
*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*
 
  • Open a GGUS Ticket to the site before proceeding, asking them to change the status themselves.
  • If there is no response after 1 hour, reply to the same ticket reporting you are changing it and proceed with the steps in the next section.
Line: 2238 to 1983
 
  • The status in the SSB site is updated every 15 minutes. So you should be able to see the change there maximum after this amount of time.
  • More extensive documentation can be checked here.
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->


 

Other (did not fit into the categories above/outdated/in progress)

Line: 2282 to 2026
  The job type can be Processing, Merge or Harvesting. For the Processing type, the task can be Reco or AlcaSkim; for the Merge type, it can be ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Report an incident or create a request to CERN IT via SNOW tickets

Line: 2315 to 2059
 mode="div" }%
Changed:
<
<
This guide contains all the necessary steps, read it at first. Execute this commands locally, where you already made a copy of the repository
>
>
This guide contains all the necessary steps; read it first. Execute these commands locally, in the directory where you already have a copy of the repository.
 
  • Get the latest code from the repository
git checkout master
git fetch dmwm
git pull dmwm master
Changed:
<
<
git push origin master
>
>
git push origin master
 
  • Create a branch to add the code changes. Use a meaningful name.
Changed:
<
<
git checkout -b <branch-name> dmwm/master
>
>
git checkout -b <branch-name> dmwm/master
 
  • Make the changes in the code

  • Add the modified files to the changes to be commit
Changed:
<
<
git add <file-name>
>
>
git add <file-name>
 
  • Make commit of the changes
Changed:
<
<
git commit
>
>
git commit
 
  • Push the changes from your local repository to the remote repository
Changed:
<
<
git push origin <branch-name>
>
>
git push origin <branch-name>
 
  • Make a pull request from the GitHub web page
Line: 2355 to 2093
 
  • Make the required modifications in the branch
  • Fix the previous commit
Changed:
<
<
git commit --amend
>
>
git commit --amend
 
  • Force update
Changed:
<
<
git push -f origin <branch-name>
>
>
git push -f origin <branch-name>
 
  • If a pull request was done before, it will update automatically.

After the branch is merged, it can be safely deleted:

Changed:
<
<
git branch -d <branch-name>
>
>
git branch -d <branch-name>
  Other useful commands
  • Show the branch in which you are working and the status of the changes. Useful before doing commit or while working on a branch.
git branch
Changed:
<
<
git status
>
>
git status
 
  • Others
git reset
git diff
git log
Changed:
<
<
git checkout .

</>

<!--/twistyPlugin-->
>
>
git checkout .
 
Added:
>
>

</>
<!--/twistyPlugin-->
 

EOS Areas of interest

Line: 2401 to 2132
 
/eos/cms/store/unmerged/ Store output files smaller than 2GB until the merge jobs put them together Tier-0 worker nodes (Processing/Repack jobs) Tier-0 worker nodes (Merge Jobs) ?
/eos/cms/tier0/ Files ready to be transferred to Tape and Disk Tier-0 worker nodes (Processing/Repack/Merge jobs) PhEDEx Agent Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them
/eos/cms/store/express/ Output from Express processing Tier-0 worker nodes Users Tier-0 express area cleanup script
Deleted:
<
<
 
Changed:
<
<
/eos/cms/store/t0streamer/ SM writes raw files there. And we delete the files with the script. the script is on the acronjob under the cmst0 acc. It keeps data which are not repacked yet. Also, keeps the data not older than 7 days. The data is repacked (rewritten) dat files > PDs (raw .root files).
>
>
/eos/cms/store/t0streamer/
 
Added:
>
>
The SM writes the raw streamer files there. We delete files with a cleanup script that runs as an acrontab job under the cmst0 account: it keeps data that has not been repacked yet, as well as data younger than 7 days. During repacking the streamer .dat files are rewritten into primary-dataset RAW .root files.
 
Changed:
<
<
/eos/cms/store/unmerged/ There go the files which need to be merged into larger files. Not all the files go there. The job itself manages it (after merging, the job deletes the unmerged files).
>
>
/eos/cms/store/unmerged/
 
Added:
>
>
Files that still need to be merged into larger files go there (not all files do). The workflow manages this area itself: after merging, the job deletes the unmerged files.
 
Changed:
<
<
/eos/cms/store/express/ Express output after being merged. Jobs from the tier0 are writing to it. Data deletions are managed by DDM.
>
>
/eos/cms/store/express/
 
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
Express output after merging. Tier0 jobs write to it; data deletions are managed by DDM.
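To get a quick look at what is currently sitting in these areas, something like the following can be used (a hedged sketch; the eoscms.cern.ch endpoint is an assumption):
    # list the top-level contents of the streamer, unmerged and express areas
    xrdfs root://eoscms.cern.ch ls -l /eos/cms/store/t0streamer/
    xrdfs root://eoscms.cern.ch ls -l /eos/cms/store/unmerged/
    xrdfs root://eoscms.cern.ch ls -l /eos/cms/store/express/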


</>

<!--/twistyPlugin-->
 

Checking subscription requests on PhEDEx

Line: 2437 to 2162
  If you don't want to check current subscriptions but requests, go to Requests/View/Manage Requests. In this form you can filter the requests by type (transfer or deletion) and by state (pending approval, approved or disapproved). If you need more options to filter the requests, you have to use the PhEDEx web services.
Changed:
<
<
The available PhEDEx services are published on the documentation website. To query requests, use the service Request list. For example, to query the dataset /*/Tier0_REPLAY_vocms015*/* with requests over node T2_CH_CERN do:
>
>
The available PhEDEx services are published on the documentation website. To query requests, use the service Request list. For example, to query the dataset /*/Tier0_REPLAY_vocms015*/* with requests over node T2_CH_CERN do:
 
Changed:
<
<
https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN
>
>
https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN
  The request above will return a JSON resultset as follows:
Deleted:
<
<
 
{
    "phedex": {
Line: 2478 to 2200
  "request_url": "http://cmsweb.cern.ch:7001/phedex/datasvc/json/prod/requestlist", "request_version": "2.4.0pre1" }
Changed:
<
<
}
>
>
}
  The PhEDEx data service not only allows you to build more detailed queries, it is also faster than looking the information up on the PhEDEx website.
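The same query can also be run from the command line and pretty-printed, for example (a hedged sketch; depending on the data service instance a grid proxy may be needed for authentication):
    curl -s --cert $X509_USER_PROXY --key $X509_USER_PROXY \
         'https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN' \
         | python -m json.tool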
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->
 

Re-processing from scratch (in case of unexpected data deletion, corruption, other disaster etc.)

Line: 2533 to 2253
  #print("For run %d the dataset %s", a, pdName) for singleFile in datasetFiles: print(singleFile['logical_file_name'])
Changed:
<
<
the_file.write(singleFile['logical_file_name']+"\n")
>
>
the_file.write(singleFile['logical_file_name']+"\n")
 
  • Once the list of files to be invalidated later is ready, you can configure a headnode for re-processing.
  • In the Prod configuration, you should inject every affected run individually (See the GH).
Line: 2545 to 2264
 and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM = 'HLTMonitor' and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM = 'Calibration' and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM = 'ExpressAlignment'
Changed:
<
<
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM = 'ExpressCosmics'
>
>
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM = 'ExpressCosmics'
 
  • At this point, it is safe to start the WMAgent on the configured node. Keep in mind that the Tier0Feeder component took 3h+ to inject ~150 runs.
  • When all the runs are processed and cleaned up, it is time to invalidate old files. Again, you need to be extra careful with RAW data. Invalidation and deletion should be done by the Transfers team - a usual way is to create a GGUS ticket with a request.
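As a hedged cross-check while assembling the list of files to invalidate, the files known to DBS for one affected run can also be listed with dasgoclient (the dataset name and run number below are placeholders):
    # list the RAW files of one affected run for a given primary dataset
    dasgoclient --query "file dataset=/SingleMuon/Run2017C-v1/RAW run=299616" | sort > files_299616.txt
    wc -l files_299616.txt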
Line: 2576 to 2293
  The list could be expanded depending on the special needs of some runs and the properties of cmsweb deployments.
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>

</>
<!--/twistyPlugin-->

Revision 852019-03-27 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1336 to 1336
 
  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart component.
/data/TransferSystem/t0_control.sh
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->
 

OracleDB (T0AST) instructions

Added:
>
>

Setting up the Oracle SQL Developer desktop client to connect to the Tier0 DBs

<!--/twistyPlugin twikiMakeVisibleInline-->

Software

  1. Download and install Oracle Instant Client. Use the basic package. Follow general installation steps in web page. You need to have Java installed.
  2. Download and install the Oracle SQL Developer software.
  3. Configure SQL Developer to use Instant Client as Client Type. You may need to do several configurations before connecting to any database, depending on your system. These links can be useful:

Connections

  1. Download the tnsnames.ora file from the dmwm/PHEDEX repository. This file contains the mapping between service names and connect descriptors. tnsnames.ora is going to be used for the configuration of the database connections.
  2. Setup SQL Developer to use the downloaded tnsnames.ora and show available CMS net services

Connections to T0AST databases

For each production and replay instance (Check REPLACE twiki), create a connection:

  1. User credentials are stored in the respective T0 machine. They can be found in the /data/tier0/admin/WMAgent.secrets file. Use values of ORACLE_USER and ORACLE_PASS fields.
  2. Use TNS as Connection Type, default as Role and the ORACLE_TNS value in WMAgent.secrets as Network Alias.
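If you prefer the command line to SQL Developer, the same credentials can be used directly with sqlplus on the headnode. This is a hedged sketch: it assumes sqlplus is available there and that WMAgent.secrets contains the ORACLE_USER, ORACLE_PASS and ORACLE_TNS fields described above.
    # pull the connection parameters out of the secrets file (plain KEY=value lines)
    source /data/tier0/admin/WMAgent.secrets
    # open an interactive session against the T0AST instance of this headnode
    sqlplus "${ORACLE_USER}/${ORACLE_PASS}@${ORACLE_TNS}"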

Connection to other Tier0 related databases

Create a connection for the next read only databases. User credentials and Network aliases can be found in any WMAgent.secret file of any T0 agent.

  1. Create a connection to the Storage Manager database. Use values of SMDB_URL field.
  2. Create a connection to the HLT database. Use values of CONFDB_URL field.
  3. Create a connection to the Conditions database. Use values of POPCONLOGDB_URL field.
  4. Create a connection to the Tier0 data service database. Use values of T0DATASVCDB_URL field.

<!--/twistyPlugin-->

 

Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

%TWISTY{

Line: 1429 to 1478
 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->

Useful T0AST queries

<!--/twistyPlugin twikiMakeVisibleInline-->

Run related

  • Configuration of a certain run

select * from run where run_id = 293501;

  • Streams of a run

select * from streamer
join stream on streamer.STREAM_ID=stream.ID
where run_id= 293501;

select count(*) from streamer
join stream on streamer.STREAM_ID=stream.ID
where run_id= 293501
group by stream.ID;

Job related

  • Count of jobs per state

select wmbs_job_state.name, count(*) from wmbs_job 
join wmbs_job_state on wmbs_job.state = wmbs_job_state.id
GROUP BY wmbs_job_state.name;

  • Get jobs in paused state

select id, cache_dir from wmbs_job where STATE =17;
SELECT id, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');

  • Get jobs in paused state per type of job

select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir not like '%Repack%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%PromptReco%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' and cache_dir not like '%Repack%' order by cache_dir;
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir not like '%Merge%' order by cache_dir;
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir like '%Merge%' order by cache_dir;

CMSSW related

PromptReco

  • CMSSW version configured for PromptReco for all runs

select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
order by RECO_CONFIG.RUN_ID desc;

  • First run configured to use certain CMSSW version for PromptReco

select distinct min(RECO_CONFIG.RUN_ID) from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
order by RECO_CONFIG.RUN_ID desc; 

  • All runs configured to use certain CMSSW version for PromptReco

select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
order by RECO_CONFIG.RUN_ID desc ;

Express (Reconstruction Step)

  • CMSSW version configured for Express for all runs

select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
order by EXPRESS_CONFIG.RUN_ID desc ;

  • First run configured to use certain CMSSW version for Express

select distinct min(EXPRESS_CONFIG.RUN_ID) from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
order by EXPRESS_CONFIG.RUN_ID desc; 

  • All runs configured to use certain CMSSW version for Express

select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
where CMSSW_VERSION.NAME = 'CMSSW_7_4_12'
order by EXPRESS_CONFIG.RUN_ID desc; 

Specific run

CMSSW versions used for a specific run.

select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
where RECO_CONFIG.RUN_ID=299325 order by RECO_CONFIG.RUN_ID desc ;

  • Express (Reconstruction step)

select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;

  • Express (Repack step)

select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;

NOTE: Remember that the Express configuration includes two CMSSW releases; one for repacking: CMSSW_ID and another for reconstruction: RECO_CMSSW_ID

File related

  • Input files of a specific job given its job id

select wmbs_file_details.* from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
where wmbs_job.ID = 3463;

  • Lumis contained in a file

select * from wmbs_file_runlumi_map
where fileid = 8356;

  • Input files with lumis of a specific job given its id

select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID
where wmbs_job.ID = 3463 order by lumi;

  • Input file of a job with specific lumi

select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID
where wmbs_job.ID = 3463 and lumi = 73;

  • Job that has a specific file as input given file id

select wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
where wmbs_job_assoc.FILEID = 4400;

  • Job that has a specific file as input given file lfn

select wmbs_job.CACHE_DIR from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';

  • Job details of job that has a specific input file **MASK

select wmbs_job_mask.*, wmbs_job.CACHE_DIR from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB
where wmbs_job_assoc.FILEID = 4400;

  • Parent file of a specified child file

select * from wmbs_file_parent where child = 6708;

  • Details of parent file of a specified child file

select wmbs_file_details.* from wmbs_file_parent 
join wmbs_file_details on wmbs_file_parent.PARENT = wmbs_file_details.ID
where child = 6708;

  • Jobs which input files are a parent file of a specific file ("Parent job of a file")

select distinct wmbs_job.* from wmbs_job_assoc
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
where wmbs_job_assoc.FILEID in (select parent from wmbs_file_parent where child = 3756584);
 
Added:
>
>
  • Jobs which input files are a parent file of a specific file given lfn of child file. (In other words, get cache dir of father job and filesize of child file given lfn of child file)

select wmbs_job.CACHE_DIR, wmbs_file_details.FILESIZE
from wmbs_file_details
join wmbs_file_parent on wmbs_file_parent.CHILD = wmbs_file_details.ID
join wmbs_job_assoc on wmbs_file_parent.PARENT = wmbs_job_assoc.FILEID
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';

  • Same query as before, but excluding Cleanup jobs

select wmbs_job.CACHE_DIR, wmbs_file_details.*
from wmbs_file_details
join wmbs_file_parent on wmbs_file_parent.CHILD = wmbs_file_details.ID
join wmbs_job_assoc on wmbs_file_parent.PARENT = wmbs_job_assoc.FILEID
join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID
join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB
where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/SingleMuon/ALCARECO/DtCalib-PromptReco-v1/000/299/616/00000/86B8AC3B-5B71-E711-ABA7-02163E0118E2.root'
and wmbs_job.CACHE_DIR not like '%Cleanup%';

  • File details of jobs in paused

select wmbs_file_details.* from wmbs_job
join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB
join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID
where wmbs_job.STATE = 17;

Processing status related

  • File sets to be processed

select name from wmbs_fileset;

<!--/twistyPlugin-->
 

T0 nodes, headnodes

Line: 1914 to 2243
 

Other (did not fit into the categories above/outdated/in progress)

Deleted:
<
<

Updating TransferSystem for StorageManager change of alias (probably outdated)

<!--/twistyPlugin twikiMakeVisibleInline-->

Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:

  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
              
  5. Restart the TransferSystem on the T0 headnode. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
              

<!--/twistyPlugin-->
 

Getting Job Statistics (needs to be reviewed)

%TWISTY{

Line: 2251 to 2552
 
  • When all the runs are processed and cleaned up, it is time to invalidate old files. Again, you need to be extra careful with RAW data. Invalidation and deletion should be done by the Transfers team - a usual way is to create a GGUS ticket with a request.

</>

<!--/twistyPlugin-->
\ No newline at end of file
Added:
>
>

Checklist for a new cmsweb deployment

<!--/twistyPlugin twikiMakeVisibleInline-->

The following items should be tested on a replay after a new deployment:

  • Correct upload of the information about a new workflow to wmstats
  • Proper state transition (Run/workflow/job) for one job.
  • Release of PromptReco only after the configured number of hours since the run finished
  • Run a DQM harvesting job that completes successfully and gets its output uploaded to the DQMGUI
  • Subscribe and delete a transfer request (This may change due to the Rucio integration)
  • Connect successfully to CouchDB (see the sketch below)
  • Run the jobSplitting algorithm on one job (could be Express)
  • Check a paused job's information on WMStats to see if all the necessary information is there.

The list could be expanded depending on the special needs of some runs and the properties of cmsweb deployments.
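For the CouchDB item, a minimal connectivity check from the headnode could look like this (a hedged sketch; COUCH_PORT is assumed to be defined in WMAgent.secrets, with 5984 used as a fallback):
    # read the couch settings from the agent secrets file and poke the server
    source /data/tier0/admin/WMAgent.secrets
    curl -s "http://localhost:${COUCH_PORT:-5984}/" ; echo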

<!--/twistyPlugin-->

Revision 842019-03-27 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1409 to 1409
 
  • When the database is ready, you can open a SNOW ticket requesting the backup. The ticket is created automatically by sending an email to phydb.support@cernNOSPAMPLEASE.ch.
  • When the backup is done you will get a reply to your ticket confirming it. Recheck that the backup is fine, consistent etc. and ask to close the ticket.
Changed:
<
<
As of March 21st 2019, the following backups contain these runs.

wmaprod id T0AST id minRun maxRun machine id
wmaprod_25 CMS_T0AST_2 315252 316995 vocms0313
wmaprod_26 CMS_T0AST_1 316998 319311 vocms0314
wmaprod_27 CMS_T0AST_3 319313 320393 vocms014
wmaprod_28 CMS_T0AST_2 322680 322800 vocms0313
wmaprod_29 CMS_T0AST_3 325112 325112 vocms014
wmaprod_30 CMS_T0AST_2 325799 327824 vocms0313
wmaprod_32 CMS_T0AST_4 320413 325746 vocms013
>
>
Whenever creating a backup, please add a row to the Tier0 Archive Accounts table of records.
 

</>

<!--/twistyPlugin-->

Revision 832019-03-21 - AndresFelipeQuinteroParra

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1409 to 1409
 
  • When the database is ready, you can open a SNOW ticket requesting the backup. The ticket is created automatically by sending an email to phydb.support@cernNOSPAMPLEASE.ch.
  • When the backup is done you will get a reply to your ticket confirming it. Recheck that the backup is fine, consistent etc. and ask to close the ticket.
Added:
>
>
As of March 21st 2019, the following backups contain these runs.

wmaprod id T0AST id minRun maxRun machine id
wmaprod_25 CMS_T0AST_2 315252 316995 vocms0313
wmaprod_26 CMS_T0AST_1 316998 319311 vocms0314
wmaprod_27 CMS_T0AST_3 319313 320393 vocms014
wmaprod_28 CMS_T0AST_2 322680 322800 vocms0313
wmaprod_29 CMS_T0AST_3 325112 325112 vocms014
wmaprod_30 CMS_T0AST_2 325799 327824 vocms0313
wmaprod_32 CMS_T0AST_4 320413 325746 vocms013

 </>
<!--/twistyPlugin-->

Revision 822019-02-28 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1476 to 1476
 #upload it: pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w --upload-user=$USER upload t0 | tee logUpload
Changed:
<
<
4. You should build a new release under your personal repository, run unit tests on it and test it in a replay. Then, once it is working properly, you want to build it as the CMS package:
  • You need to draft a new release on the main dmwm T0 repository (code version should be identical to what you just tested in your personal release - include the same PRs, etc.).
  • Once it is done, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
  • Finally, you need to ask the WMCore devs to force-merge this into the master branch of the cmsdist repo, because the Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, please be patient). The new release should be available on the official CMS comp CC7 repository.
>
>
4. You should build a new release under your personal repository, run unit tests on it and test it in a replay:
  • You need to draft a new release on your personal forked T0 repository. The tag for the new release should be up to date with the dmwm T0 master branch.
  5. For testing a new release, you need to modify the deployment scripts on the replay node (the same applies to production nodes accordingly):
  • 00_software.sh:
Line: 1514 to 1511
  That's it. Now you can deploy a new T0 release. Just make sure there are no GH conflicts and other errors during the deployment procedure.
Changed:
<
<
-- VytautasJankauskas - 2019-02-11
>
>
6. Then, once your personal release is working properly, you want to build it as the CMS package in cmssw-cmsdist repository. To prepare for that:
  • you need to generate the changelog of what's in the release:
cd T0 (local T0 repo work area)
# stg branch master -> need to be in master branch
# stg pull -> need to be in sync with updates
# don't build the release, just generate the changelog
. bin/buildrelease.sh --skip-build --wmcore-tag=1.1.20.patch4 2.1.4

# update wmcore-tag (not sure it matters)
# Now it will open an editor window where you can edit the CHANGES file
# usually not desired since it auto-populates with all the changes
# this creates a tag in the local area, DO NOT TAG MANUALLY

# tag needs to be copied to the github user and dmwm repos
# some of these might be redundant by now, I usually do all for safety
git push upstream master
git push origin master
git push --tags upstream master
git push --tags origin master
  • Once you have a tag on dmwm T0 repo, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
  • Finally, you need to ask the WMCore devs to force-merge this into the master branch (for the time being, it's comp_gcc630) of the cmsdist repo, because the Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, be patient). The new release should be available on the official CMS comp CC7 repository.
  • Whenever the new T0 release is available, make sure to update the T0 release twiki, adding the release notes.
  • Also, whenever any patches for the release are created, make sure to add them to the release notes as well.

Voila! Now the new release is ready to use.

VytautasJankauskas - 2019-02-28

 </>
<!--/twistyPlugin-->

Restarting Tier-0 voboxes

Revision 812019-02-11 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1491 to 1491
...
# If you want to deploy a personal build, then you should checkout your forked T0 repository:
git clone https://github.com/[your-GH-username]/T0.git
Changed:
<
<
# Also, don't forget to remove older fix patches from the script.
>
>
# Also, don't forget to remove/comment out older fix patches from the script, e.g.:
# request more memory for the ScoutingPF repacking
#git fetch https://github.com/hufnagel/T0 scouting-repack-memory && git cherry-pick FETCH_HEAD
 
  • 00_deploy_*.sh:

Revision 802019-02-11 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1431 to 1431
 

T0 nodes, headnodes

Added:
>
>

Building a new T0 release

<!--/twistyPlugin twikiMakeVisibleInline-->

Firstly, please read the general documentation about Software packaging and distribution in CMS.

1. Before building a new Tier-0 release you need:

  • to get access to vocms055 (a dedicated CC7 CMS RPM build VM). Ask the CMSWEB operator for that.
  • to be added to the cms-comp-devs e-group (in order to be able to push to your personal cmsrep). Ask the CMSWEB operator for that.

2. Once you can access vocms055, you need:

  • to create your personal directory at /build/[your-CERN-username].
  • Fetch pktools and cmsdist:
    • pktools:
git clone https://github.com/cms-sw/pkgtools.git
#points HEAD to V00-32-XX
cd pkgtools
git remote -v
git fetch origin V00-32-XX
git checkout V00-32-XX
git pull origin V00-32-XX
    • cmsdist:

#Clone the comp_gcc630 branch of cmsdist:
git clone -b comp_gcc630 https://github.com/cms-sw/cmsdist.git

3. Now you should have the build environment properly configured. In order to build a new release:

  • Firstly, you need to create a new GH release (on https://github.com/[your-GH-username]/T0/releases). It should include all preferred PRs which you want to have in the release.
  • When there is a new GH release available, you need to specify it in t0.spec file. It is located at your cloned cmsdist repository directory.
  • Not getting into details, the release in the spec file is specified on the very first line:
    ### RPM cms t0 2.1.4
    Normally, you only need to update it to the tag version created on GH earlier.
  • Finally, you need to build a new release and then upload it (the important parts are specifying the package to build (t0) and uploading the newly built package to your personal repository (--upload-user part)):
# build a new release
pkgtools/cmsBuild -c cmsdist --repository comp   -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w build t0 | tee logBuild

#upload it:
pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w --upload-user=$USER upload t0 | tee logUpload

4. You should build a new release under your personal repository, run unit tests on it and test it in a replay. Then, once it is working properly, you want to build it as the CMS package:

  • You need to draft a new release on the main dmwm T0 repository (code version should be identical to what you just tested in your personal release - include the same PRs, etc.).
  • Once it is done, you need to create a PR on the main cmsdist repository (changing the release version in the t0.spec file, nothing else).
  • Finally, you need to ask the WMCore devs to force-merge this into the master branch of the cmsdist repo, because the Tier0 package does not follow a regular/monthly-ish cmsweb release cycle.
  • The official CMS T0 release will be built in the next build cycle (it runs every few hours or so, therefore, please be patient). The new release should be available on the official CMS comp CC7 repository.

5. For testing a new release, you need to modify the deployment scripts on the replay node (the same applies to production nodes accordingly):

  • 00_software.sh:
# set WMCore and T0 release versions accordingly:
WMCORE_VERSION=1.1.19.pre5-comp2
T0_VERSION=2.1.4
...
# If you want to deploy a personal build, then you should checkout your forked T0 repository:
git clone https://github.com/[your-GH-username]/T0.git
# Also, don't forget to remove older fix patches from the script.

  • 00_deploy_*.sh:
# here also change T0 release tag:
TIER0_VERSION=2.1.4
...
# change the deployment source to your personal repo if it's a personal release
#Vytas private repo deployment
./Deploy -s prep -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
./Deploy -s sw -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
./Deploy -s post -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION


#Usual deployment
#./Deploy -s prep -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
#./Deploy -s sw -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
#./Deploy -s post -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION

That's it. Now you can deploy a new T0 release. Just make sure there are no GH conflicts and other errors during the deployment procedure.

-- VytautasJankauskas - 2019-02-11

<!--/twistyPlugin-->
 

Restarting Tier-0 voboxes

%TWISTY{

Revision 792018-11-12 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1404 to 1404
 }%

If you want to do a backup of a database (for example, after retiring a production node, you want to keep the information of the old T0AST) you should:

Changed:
<
<
  • Request a target Database: Normally these databases are owned by dirk.hufnagel@cernNOSPAMPLEASE.ch, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket for requesting the backup. For this you should send an email to phydb.support@cernNOSPAMPLEASE.ch. An example of a message can be found in this Elog .
  • When the backup is done you will get a reply to your ticket confirming it.
>
>
  • Request a target Database: Normally these databases are owned by the Tier0 L2 (currently dmytro.kovalskyi@NOSPAMPLEASEcernNOSPAMPLEASE.ch), so he should request a new backup database instance to be the target of the backup.
    • A backup DB username is "CMS_T0AST_WMAPROD_xx" (xx is an incrementing number). A sqlplus64 connection to the DB should be stored on vocms001 in the /data/tier0query directory.
  • When the database is ready, you can open a SNOW ticket requesting the backup. The ticket is created automatically by sending an email to phydb.support@cernNOSPAMPLEASE.ch.
  • When the backup is done you will get a reply to your ticket confirming it. Recheck that the backup is fine, consistent etc. and ask to close the ticket.
  </>
<!--/twistyPlugin-->

Revision 782018-11-06 - DmytroKovalskyi

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 471 to 471
# Actually run the job (you can pass the parameter to create a fwjr too)
cmsRun PSet.py
Added:
>
>
  • Hacking CMSSW configuration
# If you need to modify the job for whatever reason (like drop some input to get at least some 
# statistics for a DQM harvesting job) you first need to get a config dump in python format
# instead of pickle. Keep in mind that the config file is very big.
# Modify PSet.py by adding "print process.dumpPython()" as a last command and run it using python
python PSet.py > cmssw_config.py

# Modify cmssw_config.py (For example find process.source and remove files that you don't want to run on). Save it and use it as input for cmsRun instead of PSet.py
cmsRun cmssw_config.py
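# Alternative (just a sketch): instead of printing the dump to stdout, you can
# append something like this at the end of PSet.py to write the expanded
# configuration straight to a file (it relies on the "process" object already defined in PSet.py):
with open("cmssw_config.py", "w") as dump:
    dump.write(process.dumpPython())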
 </>
<!--/twistyPlugin-->

Revision 772018-09-26 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 731 to 731
 
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)
Added:
>
>
We created some scripts to deal with a recurring issue: the maxRSS limits get exceeded from time to time, so the workflow sandbox needs to be modified to raise the maxRSS values.
  • On /data/tier0/tier0_monitoring/src/v3_modifyMaxRSS/ located in every node, there are scripts checkCurrentMaxRSS.sh and modifySandbox.sh which can be used to check/modify the maxRSS values.
  • You may see when you try it out that the modifySandbox.sh script does not override RSS limits for "Merge", "Cleanup" and "LogCollect" tasks in a workflow. This is a desired behavior of the WMCore (WMTask). In order to override the maxRSS for Merge tasks, one can bypass these limitations using the approach Alan Malta Rodriguez shared with T0 (see the draft). Of course, the script may need to be updated to make it usable for certain cases.
 Modifying the job description has proven to be useful to change the following variables:

Revision 762018-08-16 - AndresFelipeQuinteroParra

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 676 to 676
  </>
<!--/twistyPlugin-->
Added:
>
>

Change a variable in a running Sandbox configuration (wmworkload.pkl)

<!--/twistyPlugin twikiMakeVisibleInline-->

First of all you need to locate the working Sandbox: log into the appropriate machine and go to the following folder: /data/tier0/admin/Specs/[name of the process, run number and workflow name].

Make a copy of this Sandbox in a folder located on the private folder of an lxplus machine and make a backup copy of this original Sandbox.

The sandbox is a tar.bz2 compressed file, so it should be decompressed using the tar command, like this: tar -xjvf [name of the compressed file]. You can optionally create a separate folder for the resulting files and folders.

After it is decompressed a new folder appears, called WMSandbox and within it there is a file called WMWorkload.pkl which is a pickle file of all the specifications that runs on the sandbox.

To modify WMWorkload.pkl a script was created named print_workload.py available at /afs/cern.ch/work/c/cmst0/private/scripts/jobs/modifyConfigs/workflow but it will be left on a public folder.

This program allows the wmworkload to be unpickled and a variable named HcalCalHO to be removed from the workload; this is possible because the skim parameters are passed as a list to the workload. If you want to modify another argument you will have to find it in the wmworkload.txt that comes out of the wmworkload.pkl. To obtain this readable file you can use the file called generate_code.sh located in the same public folder.
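In case print_workload.py is not at hand, the sketch below shows the general idea. It is only an illustration, not the original script: the backup step and the inspection lines are placeholders, and the WMCore libraries must be importable (e.g. source the agent environment first) for the unpickling to work.

import pickle
import shutil

PKL = "WMSandbox/WMWorkload.pkl"

# keep a backup before touching anything
shutil.copy(PKL, PKL + ".bak")

with open(PKL, "rb") as handle:
    workload = pickle.load(handle)

# inspect the object to find the attribute/list you need to change
print(type(workload))
print(dir(workload))

# ... modify the loaded workload object here (e.g. drop an entry from a skim list) ...

with open(PKL, "wb") as output:
    pickle.dump(workload, output, pickle.HIGHEST_PROTOCOL)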

After this process a new WMWorkload.pkl is generated and you need to add the updated WMWorkload.pkl back to the original sandbox and compress it back to a tar file using the command tar -cvjf and the exact name of the original compressed file. You should also compress all the original files that appeared after the first decompression and remained unmodified (the folders PSetTweaks, Utils and the WMCore.zip archive).

Finally the new compressed archive should be copied back to its original location to replace the old one.

<!--/twistyPlugin-->
 

Modifying jobs to resume them with other features (like memory, disk, etc.)

Revision 752018-08-09 - AndresFelipeQuinteroParra

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1837 to 1837
 
  • Go to the link Submit a request or Report an incident, depending on the situation.
  • Set a meaningful title in the Short description field. Then describe the issue in the Description and symptoms field.
  • In some cases, such as opening a ticket to the EOS and/or CASTOR teams, it's useful to select a support team for the ticket. To do this, start typing the team name in the field Optionally, select the support team (Functional Element) that corresponds to your problem. Once you start writing in this field, the available teams are going to be displayed.
Changed:
<
<
>
>
 
  • Use the default visibility for the ticket. It may automatically change depending on the support team.
  • Finally, check all the information and submit the ticket. You and everyone in the Watch list will be notified about any updates via email. Also, a ticket id will be assigned. These tickets are closed by the teams assigned to them, Tier0 can only cancel the ticket if it is the case (that's not something usual).

Revision 742018-07-17 - JuanPabloVinchiraSalazar

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1959 to 1959
  </>
<!--/twistyPlugin-->
Added:
>
>

Checking subscription requests on PhEDEx

<!--/twistyPlugin twikiMakeVisibleInline-->

It's possible to check data subscriptions in PhEDEx using the PhEDEx website or PhEDEx web services.

To check a subscription using the website, go to Data/Subscriptions. Once the page is loaded, select the tab Select Data and then use the form to set query restrictions. The usual restrictions are Nodes and Data Items. Notice that the system will only retrieve current subscriptions, not requests that are waiting to be approved.

If you don't want to check current subscriptions but requests, go to Requests/View/Manage Requests. In this form you can filter the requests by type (transfer or deletion) and by state (pending approval, approved or disapproved). If you need more options to filter the requests, you have to use the PhEDEx web services.

The available PhEDEx services are published on the documentation website. To query requests, use the service Request list. For example, to query the dataset /*/Tier0_REPLAY_vocms015*/* with requests over node T2_CH_CERN do:

https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN

The request above will return a JSON resultset as follows:

{
    "phedex": {
        "call_time": 9.78992,
        "instance": "prod",
        "request": [
            {
                "approval": "approved",
                "id": 1339469,
                "node": [
                    {
                        "decided_by": "Daniel Valbuena Sosa",
                        "decision": "approved",
                        "id": 1561,
                        "name": "T2_CH_CERN",
                        "se": "srm-eoscms.cern.ch",
                        "time_decided": 1526987916
                    }
                ],
                "requested_by": "Vytautas Jankauskas",
                "time_create": 1526644111.24301,
                "type": "delete"
            },
            {
            ...
            }
        ],
        "request_call": "requestlist",
        "request_date": "2018-07-17 20:53:23 UTC",
        "request_timestamp": 1531860803.37134,
        "request_url": "http://cmsweb.cern.ch:7001/phedex/datasvc/json/prod/requestlist",
        "request_version": "2.4.0pre1"
    }
}

The PhEDEx services not only allow you to create more detailed queries, but they are also faster than querying the information on the PhEDEx website.
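The same requestlist query can also be scripted. The snippet below is only a sketch (the certificate/key paths are placeholders; depending on the instance a valid grid certificate or proxy may be required):

import requests

url = "https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist"
params = {"dataset": "/*/Tier0_REPLAY_vocms015*/*", "node": "T2_CH_CERN"}

# pass a certificate/key pair if the service requires authentication
reply = requests.get(url, params=params, cert=("/path/to/usercert.pem", "/path/to/userkey.pem"))
reply.raise_for_status()

for request in reply.json()["phedex"]["request"]:
    print(request["id"], request["type"], request["approval"], request["requested_by"])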

<!--/twistyPlugin-->
 

Re-processing from scratch (in case of unexpected data deletion, corruption, other disaster etc.)

%TWISTY{

Revision 732018-07-17 - JuanPabloVinchiraSalazar

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1821 to 1821
  </>
<!--/twistyPlugin-->
Added:
>
>

Report an incident or create a request to CERN IT via SNOW tickets

<!--/twistyPlugin twikiMakeVisibleInline-->

Some incidents and requests need to be handled directly by CERN IT, such as the ones related to the EOS and CASTOR storage systems. For example, if a file stored in EOS is not accessible, that should be reported to the EOS team via a SNOW ticket (take this ticket as an example).

To open a SNOW ticket follow these steps:

  • Go to the SNOW website.
  • Go to the link Submit a request or Report an incident, depending on the situation.
  • Set a meaningful title in the Short description field. Then describe the issue in the Description and symptoms field.
  • In some cases, such as opening a ticket to the EOS and/or CASTOR teams, it's useful to select a support team for the ticket. To do this, start typing the team name in the field Optionally, select the support team (Functional Element) that corresponds to your problem. Once you start writing in this field, the available teams are going to be displayed.
  • When you create a ticket ALWAYS add Tier0 team email list (cms-tier0-operations@cernNOSPAMPLEASE.ch) to the Watch list.
  • Use the default visibility for the ticket. It may automatically change depending on the support team.
  • Finally, check all the information and submit the ticket. You and everyone in the Watch list will be notified about any updates via email. Also, a ticket id will be assigned. These tickets are closed by the teams assigned to them; Tier0 can only cancel a ticket if needed (which is not usual).

<!--/twistyPlugin-->
 

Update code in the dmwm/T0 repository

Revision 722018-06-11 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Deleted:
<
<
Contents :
 Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
Changed:
<
<
BEWARE: The writers are not responsible for side effects of these recipes, always understand the commands before executing them.
>
>
BEWARE: The authors are not responsible for any side effects of these recipes, one should always understand the commands/actions before executing them.

Contents :

 

Tier-0 Configuration modifications

Line: 1948 to 1949
 

Something unexpected may happen and we may lose RAW data (deletion, corruption). If this happens, you need to check if the streamer files for the lost data are still available in the t0streamer area.

Added:
>
>
  • More detailed recovery plan was documented on JIRA.
 
  • An easy way to check the last cleaned up runs is checking the cleanup script output at
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log
    .

If related run streamer files are still available, then you want to fully re-process that data:

  • Prepare and set up a new headnode (specify runs to inject, acquisition era, processing versions etc.).
Changed:
<
<
  • If not all previously processed files (RAW, PromptReco output) are deleted, then you need to retrieve a list of such files because they will need to be invalidated later. You can check this querying DBS:
>
>
  • If not all previously processed files (RAW, PromptReco output) are deleted, then you need to retrieve a list of such files because they will need to be invalidated later. You can collect the list of files querying DBS:
 
# Firstly 
# source /data/tier0/admin/env.sh
# source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
# then you can use this simple script to retrieve a list of files from related runs.
Changed:
<
<
# Keep in mind, that in the below snippet, we are ignoring the Express output
>
>
# Keep in mind that in the below snippet, we are ignoring all Express streams output.
# This is just a snippet, so it may not work out of the box.
from dbs.apis.dbsClient import DbsApi
from pprint import pprint
Line: 1986 to 1989
  the_file.write(singleFile['logical_file_name']+"\n")
Added:
>
>
  • Once the list of files to be invalidated later is ready, you can configure a headnode for re-processing.
  • In the Prod configuration, you should inject every affected run individually (See the GH).
  • Since we want to skip all the Express processing, we have to skip injection of all Express streams (there were only these streams at the moment of writing, re-check the config before). For this, you need to modify the filters of new stream injection on GetNewData.py, simply add filters to skip Express streams:
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ALCALUMIPIXELSEXPRESS'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Express'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'HLTMonitor'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Calibration'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressAlignment'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressCosmics'
 
Added:
>
>
  • At this point, it is safe to start the WMAgent on the configured node. Keep in mind that the Tier0Feeder component took 3h+ to inject ~150 runs.
  • When all the runs are processed and cleaned up, it is time to invalidate old files. Again, you need to be extra careful with RAW data. Invalidation and deletion should be done by the Transfers team - a usual way is to create a GGUS ticket with a request.
  </>
<!--/twistyPlugin-->
\ No newline at end of file

Revision 712018-06-08 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1937 to 1937
  </>
<!--/twistyPlugin-->
Deleted:
<
<
VytautasJankauskas - 2018-01-30
 \ No newline at end of file
Added:
>
>

Re-processing from scratch (in case of unexpected data deletion, corruption, other disaster etc.)

<!--/twistyPlugin twikiMakeVisibleInline-->

Something unexpected may happen and we may lose RAW data (deletion, corruption). If this happens, you need to check if the streamer files for the lost data are still available in the t0streamer area.

  • An easy way to check the last cleaned up runs is checking the cleanup script output at
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log
    .

If related run streamer files are still available, then you want to fully re-process that data:

  • Prepare and set up a new headnode (specify runs to inject, acquisition era, processing versions etc.).
  • If not all previously processed files (RAW, PromptReco output) are deleted, then you need to retrieve a list of such files because they will need to be invalidated later. You can check this querying DBS:
# Firstly 
# source /data/tier0/admin/env.sh
# source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
# then you can use this simple script to retrieve a list of files from related runs.
# Keep in mind, that in the below snippet, we are ignoring the Express output

from dbs.apis.dbsClient import DbsApi
from pprint import pprint
import os


dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'

dbsApi = DbsApi(url = dbsUrl)


runList = [316569]

with open(os.path.join('/data/tier0/srv/wmagent/current/tmpRecovery/', "testRuns.txt"), 'a') as the_file:
    for a in runList:
        datasets = dbsApi.listDatasets(run_num=a)
        pprint(datasets)
        for singleDataset in datasets:
            pdName = singleDataset['dataset']
            if 'Express' not in pdName and 'HLTMonitor' not in pdName and 'Calibration' not in pdName and 'ALCALUMIPIXELSEXPRESS' not in pdName:
                datasetFiles = dbsApi.listFileArray(run_num=a, dataset=pdName)
                #print("For run %d the dataset %s", a, pdName)
                for singleFile in datasetFiles:
                    print(singleFile['logical_file_name'])
                    the_file.write(singleFile['logical_file_name']+"\n")

<!--/twistyPlugin-->
 \ No newline at end of file

Revision 702018-05-31 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 268 to 268
 mode="div" }%
Changed:
<
<
The script is running as a cronjob under the cmsprod acrontab. It is located in the cmsprod area on lxplus.
>
>
The script is running as an acrontab job under the cmst0 acc. It is located in the cmst0 area on lxplus.
 
Changed:
<
<
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmsprod has writing permission on eos - cms-tier0-operations@cern.ch
0 5 * * * lxplus /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py >> /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1
>
>
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmst0 has writing permission on eos - cms-tier0-operations@cern.ch
0 10,22 * * * lxplus.cern.ch /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.sh >> /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1
  To add a run to the skip list:
Changed:
<
<
  • Login as cmsprod on lxplus.
  • Go to the script location and open it with an editor:
    /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 83:
>
>
  • Login as cmst0 on lxplus.
  • Go to the script location and open it:
    /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 117:
 
  # run number in this list will be skipped in the iteration below
Changed:
<
<
runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357]
  • Add the desired run in the end of the list. Be careful in not removing the existing runs.
>
>
runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357,...]
  • Add the desired run in the end of the list. Be careful and do not remove any existing runs.
 
  • Save the changes.
  • It is done!. Don't forget to add it to the Good Runs Twiki

Revision 692018-01-30 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1586 to 1586
 
  1. Start the agent
    00_start_agent
    Particularly, check the PhEDExInjector component, if there you see errors, try restarting it after sourcing init.sh
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
    $manage execute-agent wmcoreD --restart --component PhEDExInjector
Changed:
<
<
</>
<!--/twistyPlugin-->


>
>
</>
<!--/twistyPlugin-->

Configuring a newly created VM to be used as a T0 headnode/replay VM

<!--/twistyPlugin twikiMakeVisibleInline-->
 
Added:
>
>
This was started on 30/01/2018. To be continued.
  1. Whenever a new VM is created for T0, it has a mesa-libGLU package missing and, therefore, the deployment script is not going to work:
       Some required packages are missing:
       + for p in '$missingSeeds'
       + echo mesa-libGLU
       mesa-libGLU
       + exit 1
       
    One needs to install the package manually (with a superuser access):
    $ sudo yum install mesa-libGLU

<!--/twistyPlugin-->


 

T0 Pool instructions

Line: 1915 to 1937
  </>
<!--/twistyPlugin-->
Deleted:
<
<
VytautasJankauskas - 2018-01-29
 \ No newline at end of file
Added:
>
>
VytautasJankauskas - 2018-01-30

Revision 682018-01-29 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 895 to 895
 https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET
Added:
>
>
run_dataset_done can be called without any primary_dataset parameters, in which case it reports back the overall PromptReco status. It aggregates over all known datasets for that run in the system (i.e. all datasets for all streams for which we have data for this run).
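These checks can also be scripted; below is a minimal sketch using the example run/stream from above (a grid certificate/proxy may be needed depending on the deployment - pass it via the cert parameter if so):

import json
import requests

base = "https://cmsweb.cern.ch/t0wmadatasvc/prod"

reply = requests.get(base + "/run_stream_done", params={"run": 305199, "stream": "ZeroBias"})
reply.raise_for_status()
print(json.dumps(reply.json(), indent=2))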
 </>
<!--/twistyPlugin-->


Line: 1912 to 1915
  </>
<!--/twistyPlugin-->
Deleted:
<
<
VytautasJankauskas - 2018-01-11
 \ No newline at end of file
Added:
>
>
VytautasJankauskas - 2018-01-29
 \ No newline at end of file

Revision 672018-01-29 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 878 to 878
  </>
<!--/twistyPlugin-->


Added:
>
>

Check a stream/dataset/run completion status (Tier0 Data Service (T0DATASVC) queries)

<!--/twistyPlugin twikiMakeVisibleInline-->

Some useful T0DATASVC queries to check a stream/dataset/run completion status:

https://cmsweb.cern.ch/t0wmadatasvc/prod/run_stream_done?run=305199&stream=ZeroBias
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET

<!--/twistyPlugin-->


 

WMAgent instructions

Revision 662018-01-11 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 11 to 11
 

Tier-0 Configuration modifications

Added:
>
>

Replay instructions

<!--/twistyPlugin twikiMakeVisibleInline-->

NOTE1: An old and more theoretical version of replays docs is available here.

NOTE2: The 00_XYZ scripts are located at

/data/tier0/
, absolute paths are provided in these instructions for clearness

In order to start a new replay, firstly you need to make sure that the instance is available. Check the Tier0 project on Jira and look for the latest ticket about the vobox. If it is closed and/or it is indicated that the previous replay is over, you can proceed.

A) You now need to check that the previously deployed Tier0 WMAgent is no longer running on the machine. In order to do so, use the following commands.

  • The queue of the condor jobs:
    condor_q
    . If any, you can use
    condor_rm -all
    to remove everything.
  • The list of the Tier0 WMAgent related processes:
    runningagent (This is an alias included in the cmst1 config, actual command: ps aux | egrep 'couch|wmcore|mysql|beam')
  • If the list is not empty, you need to stop the agent:
    /data/tier0/00_stop_agent.sh

B) Setting up the new replay: How do you choose a run number for the replays?

    • Go to prodmon to check processing status. There you can see that the run was 5.5h long and quite large (13TB) - I would look for something smaller (~1h), but let's assume it's ok.
    • Go to WBM and check conditions:
      • Check initial and ending lumi - it should be ~10000 or higher. Currently we run at ~6000-7000, so it's ok as well.
      • Check if the physics flag was set for a reasonably high number of lumi sections (if not, we are looking at some non-physics/junk data) - looks good.
    • So, given that the run is too long, I would look for a different one following that logic. You should look for a recent collision run at prodmon.

  • Edit the Replay configuration (change the run, CMSSW version or whatever you need):
    /data/tier0/admin/ReplayOfflineConfiguration.py
  • Later on, run the scripts to start the replay:
      ./00_software.sh # loads the newest version of WMCore and T0 github repositories.
      ./00_deploy.sh # deploys the new configuration, wipes the toast database etc.  
  • The 00_deploy.sh script wipes the T0AST DB. While in replay machines that is fine, you don't want this to happen on a headnode machine. Therefore, be careful while running this script!
       ./00_start_agent.sh # starts the new agent - loads the job list etc.
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/Tier0Feeder/ComponentLog
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobCreator/ComponentLog 
       vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobSubmitter/ComponentLog
  • Finally, check again the condor queue and the runningagent lists. Now there should be a list of newly available jobs and their states:
      condor_q
      runningagent

<!--/twistyPlugin-->
 

Adding a new scenario to the configuration

%TWISTY{

Line: 1840 to 1891
 Data deletions are managed by DDM.

</>

<!--/twistyPlugin-->
Added:
>
>
VytautasJankauskas - 2018-01-11

Revision 652017-12-13 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 8 to 8
  BEWARE: The writers are not responsible for side effects of these recipes, always understand the commands before executing them.
Changed:
<
<

EOS Areas of interest

>
>


Tier-0 Configuration modifications

 
Deleted:
<
<
The Tier-0 WMAgent uses four areas on eos.
| *Path* | *Use* | *Who writes* | *Who reads* | *Who cleans* |
| /eos/cms/store/t0streamer/ | Input streamer files transferred from P5 | Storage Manager | Tier-0 worker nodes | Tier-0 t0streamer area cleanup script |
| /eos/cms/store/unmerged/ | Store output files smaller than 2GB until the merge jobs put them together | Tier-0 worker nodes (Processing/Repack jobs) | Tier-0 worker nodes (Merge jobs) | ? |
| /eos/cms/tier0/ | Files ready to be transferred to Tape and Disk | Tier-0 worker nodes (Processing/Repack/Merge jobs) | PhEDEx Agent | Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them |
| /eos/cms/store/express/ | Output from Express processing | Tier-0 worker nodes | Users | Tier-0 express area cleanup script |

Modifying the Tier-0 Configuration

 

Adding a new scenario to the configuration

%TWISTY{

Line: 37 to 30
  </>
<!--/twistyPlugin-->
Changed:
<
<

Corrupted merged file

>
>

How to Delay Prompt Reco Release?

  %TWISTY{
Added:
>
>
 showlink="Show..." hidelink="Hide" remember="on"
Line: 46 to 40
 mode="div" }%
Changed:
<
<
This includes files that are on tape, already registered on DBS/TMDB. The procedure to recover them is basically to run all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).
>
>
To delay the PromptReco release is really easy, you just only have to change in the config file (/data/tier0/admin/ProdOfflineConfiguration.py):

defaultRecoTimeout =  48 * 3600

to something higher like 10 * 48 * 3600. Tier0Feeder checks this timeout every polling cycle. So when you want to release it again, you just need to go back to the 48h delay.

  </>
<!--/twistyPlugin-->
Changed:
<
<

Print .pkl files, Change job.pkl

>
>

Changing CMSSW Version

  %TWISTY{ showlink="Show..."
Line: 59 to 62
 mode="div" }%
Changed:
<
<
  • Print job.pkl or Report.pkl in a tier0 WMAgent vm:
>
>
If you need to upgrade the CMSSW version the normal procedure is:

 
Changed:
<
<
# source environment source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh
>
>
/data/tier0/admin/ProdOfflineConfiguration.py
  • Change the defaultCMSSWVersion filed for the desired CMSSW version, for example:
      defaultCMSSWVersion = "CMSSW_7_4_7"
  • Update the repack and express mappings, For example:
      repackVersionOverride = {
          "CMSSW_7_4_2" : "CMSSW_7_4_7",
          "CMSSW_7_4_3" : "CMSSW_7_4_7",
          "CMSSW_7_4_4" : "CMSSW_7_4_7",
          "CMSSW_7_4_5" : "CMSSW_7_4_7",
          "CMSSW_7_4_6" : "CMSSW_7_4_7",
      }
     expressVersionOverride = {
        "CMSSW_7_4_2" : "CMSSW_7_4_7", 
        "CMSSW_7_4_3" : "CMSSW_7_4_7",
        "CMSSW_7_4_4" : "CMSSW_7_4_7",
        "CMSSW_7_4_5" : "CMSSW_7_4_7",
        "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }
  • Save the changes
 
Changed:
<
<
# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob
>
>
  • Find either the last run using the previous version or the first version using the new version for Express and PromptReco. You can use the following query in T0AST to find runs with specific CMSSW version:
 
       select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>'
       
 
Changed:
<
<
# for Report.*.pkl do: import cPickle jobHandle = open("Report.3.pkl", "r") loadedJob = cPickle.load(jobHandle) jobHandle.close() print loadedJob
>
>
  • Report the change including the information of the first runs using the new version (or last runs using the old one).
 
Changed:
<
<
  • In addition, to change the job.pkl
>
>
</>
<!--/twistyPlugin-->

Change a file size limit on Tier0

<!--/twistyPlugin twikiMakeVisibleInline-->

As for October 2017, the file size limit was increased from 12GB to 16GB. However, if a change is needed, then the following values need to be modified:

  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py

<!--/twistyPlugin-->

Force Releasing PromptReco

<!--/twistyPlugin twikiMakeVisibleInline-->

Normally PromptReco workflows has a predefined release delay (currently: 48h). We can require to manually release them in a particular moment. For doing it:

  • Check which runs do you want to release
  • Remember, if some runs are in active the workflows will be created but solve the bookkeeping (or similar) problems.
  • The followinq query makes the pre-release of the non released Runs which ID is lower or equal to a particular value. Depending on which Runs you want to release, you should "play" with this condition. You can run only the SELECT to be sure you are only releasing the runs you want to, before doing the update.
UPDATE ( 
         SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM  reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
               AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
         SET t.released = 1,
             t.delay = 10,
             t.delay_offset = 5;
  • Check the Tier0Feeder logs. You should see log lines for all the runs you released.

<!--/twistyPlugin-->

PromptReconstruction at T1s/T2s

<!--/twistyPlugin twikiMakeVisibleInline-->

There are 3 basic requirements to perform PromptReconstruction at T1s (and T2s):

  • Each desired site should be configured in the T0 Agent Resource Control. For this, /data/tier0/00_deploy.sh file should be modified specifying pending and running slot thresholds for each type of processing task. For instance:
 
Changed:
<
<
import cPickle, os jobHandle = open('job.pkl', "r") loadedJob = cPickle.load(jobHandle) jobHandle.close() # Do the changes on the loadedJob output = open('job.pkl', 'w') cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL) output.flush() os.fsync(output.fileno()) output.close()
>
>
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20
 
Changed:
<
<
  • Print PSet.pkl in a workernode:
Set the same environment for run a job interactively, go to the PSet.pkl location, open a python console and do:
>
>
A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
 
Added:
>
>
  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
 
Changed:
<
<
import ParameterSet.Config as cms import pickle handle = open('PSet.pkl', 'r') process = pickle.load(handle) handle.close() print process.dumpConfig()
>
>
datasets = [ "DisplacedJet" ]

for dataset in datasets: addDataset(tier0Config, dataset, do_reco = True, raw_to_disk = True, tape_node = "T1_IT_CNAF_MSS", disk_node = "T1_IT_CNAF_Disk", siteWhitelist = [ "T1_IT_CNAF" ], dqm_sequences = [ "@common" ], physics_skims = [ "LogError", "LogErrorMonitor" ], scenario = ppScenario)

  • Jobs should be able to write in the T1 storage systems, for this, a proxy with the production VOMS role should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/env.sh should point to the proxy location. A proxy with the required role can not be generated for a time span mayor than 8 days, then a cron job should be responsible of the renewal. For jobs to stage out at T1s, there is no need of mappings of the Distinguished Name (DN) shown in the certificate to specific users in the T1 sites, the mapping is made with the role of the certificate. This could be needed to stage out at T2 sites. Down below, the information of a valid proxy is shown:
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       : voms2.cern.ch:15002
 
<!--/twistyPlugin-->
Changed:
<
<

Delete entries in database when corrupted input files (Repack jobs)

>
>

Adding runs to the skip list in the t0Streamer cleanup script

  %TWISTY{ showlink="Show..."
Line: 113 to 217
 mode="div" }%
Changed:
<
<
SELECT WMBS_WORKFLOW.NAME AS NAME, WMBS_WORKFLOW.TASK AS TASK, LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION, LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID, LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID FROM LUMI_SECTION_SPLIT_ACTIVE INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;
>
>
The script is running as a cronjob under the cmsprod acrontab. It is located in the cmsprod area on lxplus.
 
Changed:
<
<
# This will actually show the pending active lumi sections for repack. One of this should be related to the corrupted file, compare this result with the first query
>
>
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmsprod has writing permission on eos - cms-tier0-operations@cern.ch
0 5 * * * lxplus /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py >> /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1
 
Changed:
<
<
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;
>
>
To add a run to the skip list:
  • Login as cmsprod on lxplus.
  • Go to the script location and open it with an editor:
    /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 83:
      # run number in this list will be skipped in the iteration below
        runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357]  
  • Add the desired run in the end of the list. Be careful in not removing the existing runs.
  • Save the changes.
  • It is done!. Don't forget to add it to the Good Runs Twiki
 
Changed:
<
<
# You HAVE to be completely sure about to delete an entry from the database (don't do this if you don't understand what this implies)
>
>
NOTE: It was decided to keep the skip list within the script instead of a external file to avoid the risk of deleting the runs in case of an error reading such external file.
 
Changed:
<
<
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;
>
>
</>
<!--/twistyPlugin-->


Debugging/fixing operational issues (failing, paused jobs etc.)

How do I look for paused jobs?

<!--/twistyPlugin twikiMakeVisibleInline-->

  • WMStats:
Go to the production WMStats and click on the Request tab. Organize jobs by the paused jobs column, if there is any paused job you will see the workflows that have them. Click on the 'L' related to the workflow with paused jobs. You will go the the Jobs tab, now click on the 'L' of the jobs that are paused.

For documentation on WMStats, please reas CompOpsTier0TeamWMStatsMonitoring

  • T0AST: Log into the T0AST and run the following query, you will get the paused jobs id, name and cache_dir.
SELECT id, name, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');

You can use this query to get the workflows that have paused jobs:

SELECT DISTINCT(wmbs_workflow.NAME) FROM wmbs_job 
inner join wmbs_jobgroup on wmbs_job.jobgroup = wmbs_jobgroup.ID
inner join wmbs_subscription on wmbs_subscription.ID = wmbs_jobgroup.subscription
inner join wmbs_workflow on wmbs_subscription.workflow = wmbs_workflow.ID
WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused')
and wmbs_job.cache_dir like '%Reco%';

Paused jobs can also be in state 'submitfailed'

  • Tier0 vm: Log into the tier0 vm and do:
cd /data/tier0/srv/wmagent/current/install/tier0
find ./JobCreator/JobCache -name Report.3.pkl

This will return the cache dir of the paused jobs (This may not work if the jobs were not actually submitted - submitfailed jobs do not create Report.*.pkl)

 
<!--/twistyPlugin-->
Changed:
<
<

Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

>
>

How do I get the job tarballs?

  %TWISTY{ showlink="Show..."
Line: 134 to 289
 mode="div" }%
Changed:
<
<
All the T0 WMAgent instances has the capability of access the Cmsweb Tier0 Data Service instances. So, when changing the passwords it is necessary to be aware of which instances are running.
>
>
  • Go to the cache dir of the job
  • Look for the output .tar.gz PFN from the last retry condor.*.out
  • from a lxplus machine do:
xrdcp PFN .
 
Changed:
<
<
Instances currently in use currently (03/03/2015)
>
>
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
Instance Name TNS
CMS_T0DATASVC_REPLAY_1 INT2R
CMS_T0DATASVC_REPLAY_2 INT2R
CMS_T0DATASVC_PROD CMSR
 
Changed:
<
<
  1. Review running instances.
  2. Stop each of them using:
     /data/tier0/00_stop_agent.sh 
  3. Verify that everything is stopped using:
     ps aux | egrep 'couch|wmcore' 
  4. Make sure of having the new password ready (generating it or getting it in a safe way from the one who is creating it).
  5. From lxplus or any of the T0 machines, log in to the instances you want to change the password to using:
     sqlplus <instanceName>/<password>@<tns> 
    Replacing the brackets with the proper values for each instance.
  6. In sqlplus run the command password, you will be prompt for entering the Old password, the*New Password* and confirming this last. Then you can exit from sqlplus
          SQL> password
          Changing password for <user>
          Old password: 
          New password: 
          Retype new password: 
          Password changed
          SQL> exit
          
  7. Then, you should retry logging in to the same instance, if you can not, you are in trouble!
  8. Communicate the password with the CMSWEB contact in a safe way. After his confirmation you can continue with the following steps.
  9. If everything went well now you can access all the instances with the new passwords. Now it is necessary to update the files secrets files within all the machines, These files are located in:
          /data/tier0/admin/
          
    And normally are named as following (not all the instances will have all the files):
          WMAgent.secrets
          WMAgent.secrets.replay
          WMAgent.secrets.prod
          WMAgent.secrets.localcouch
          WMAgent.secrets.remotecouch
>
>

How do I fail/resume paused jobs?

<!--/twistyPlugin twikiMakeVisibleInline-->

#Source environment 
source /data/tier0/admin/env.sh

# Fail paused-jobs
$manage execute-agent paused-jobs -f -j 10231

# Resume paused-jobs
$manage execute-agent paused-jobs -r -j 10231

You can use the following options:

-j job
-w workflow
-t taskType
-s site
-d do not commit changes, only show what will do

To do mass fails / resumes for a single error code, the follow commands are useful:

cp ListOfPausedJobsFromDB /data/tier0/jocasall/pausedJobsClean.txt
python /data/tier0/jocasall/checkPausedJobs.py
awk -F '_' '{print $6}' code_XXX > jobsToResume.txt
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt
 
Deleted:
<
<
  1. If there was an instance running you may also change the password in:
         /data/tier0/srv/wmagent/current/config/tier0/config.py
         
    There you must look for the entry:
          config.T0DAtaScvDatabase.connectUrl
         
    and do the update.
  2. You can now restart the instances that were running before the change. Be careful, some components may fail if you start the instance so you should have clear the trade off of starting it.
 
<!--/twistyPlugin-->
Changed:
<
<

Modifying a workflow sandbox

>
>

Data is lost in /store/unmerged - input files for Merge jobs are lost (an intro to run a job interactively)

  %TWISTY{ showlink="Show..."
Line: 186 to 346
 mode="div" }%
Changed:
<
<
If you need to change a file in a workflow sandbox, i.e. in the WMCore zip, this is the procedure:
>
>
If some intermediate data in EOS (i.e. data in /store/unmerged) is lost/corrupted due to some problem (power outage, disk write buffer problems) and there is nothing site support can do about it, you can run the successful jobs (that the wmagent thinks are already done) interactively:
  • First get the job tarball: logCollect may have already run, so go to the oracle database and run this query (replace the LFN of the lost/corrupted file) for knowing which .tar to look in:
select DISTINCT(tar_details.LFN) from wmbs_file_parent
inner join wmbs_file_details parentdetails on wmbs_file_parent.CHILD = parentdetails.ID
left outer join wmbs_file_parent parents on parents.PARENT = wmbs_file_parent.PARENT
left outer join wmbs_file_details childsdetails on parents.CHILD = childsdetails.ID
left outer join wmbs_file_parent childs on childsdetails.ID = childs.PARENT
left outer join wmbs_file_details tar_details on childs.CHILD = tar_details.ID
where childsdetails.LFN like '%tar.gz' and parentdetails.LFN in ('/store/unmerged/express/Commissioning2014/StreamExpressCosmics/ALCARECO/Express-v3/000/227/470/00000/A25ED7B5-5455-E411-AA08-02163E008F52.root',
'/store/unmerged/data/Commissioning2014/MinimumBias/RECO/PromptReco-v3/000/227/430/00000/EC5CF866-5855-E411-BC82-02163E008F75.root');
 
Added:
>
>
  • Get the .tar from castor (i.e.):
 
Changed:
<
<
# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp
>
>
lcg-cp srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2014/10/WMAgent/PromptReco_Run227430_MinimumBias/PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar ./PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
 
Changed:
<
<
The work area should only contain the workflow sandbox. Go there and then untar the sandbox and unzip WMCore:
>
>
  • Untar the log collection and look for the file UUID among the tarballs
tar -xvf PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
zgrep <UUID> ./LogCollection/*.tar.gz 
 
Added:
>
>
  • Now you should know what job reported to have the UUID for the corrupted file. Untar that tarball and run the job interactively (to untar: tar - zxvf ).
  • If you need to set a local input file, you can change the PSet.pkl file to point to a local file. However you need to change the trivialcatalog_file and override the protocol to direct. i.e.
 
Changed:
<
<
cd /data/tier0/lcontrer/temp tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 unzip -q WMCore.zip
>
>
S'trivialcatalog_file:/home/glidein_pilot/glide_aXehes/execute/dir_30664/job/WMTaskSpace/cmsRun2/CMSSW_7_1_10_patch2/override_catalog.xml?protocol=direct'

Changes to:

S'trivialcatalog_file:/afs/cern.ch/user/l/lcontrer/scram/plugins/override_catalog.xml?protocol=direct'
  • Copy the output file of the job you run interactively to /store/unmerged/... You need to do this using a valid production proxy/cert (i.e. you can use the certs of the t0 production machine). cmsprod user is the owner of these file.
eos cp <local file> </eos/cms/store/unmerged/...>

</>

<!--/twistyPlugin-->
 
Deleted:
<
<
Now replace/modify the files in WMCore. Then you have to merge all again. You should remove the old sandbox and WMCore.zip too:
 
Added:
>
>

Run a job interactively

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Log into lxplus
  • Get the tarball of the job you want to run, untar it (i.e.:)
 
Changed:
<
<
# Remove former sandbox and WMCore.zip, then create the new WMCore.zip rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip zip -rq WMCore.zip WMCore
>
>
tar -zxvf 68d93c9c-db7e-11e3-a585-00221959e789-46-0-logArchive.tar.gz
 
Changed:
<
<
# Now remove the WMCore folder and then create the new sandbox rm -rf WMCore/ tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*
>
>
  • Create your proxy, then create the scram area:
# Create a valid proxy
voms-proxy-init -voms cms
 
Changed:
<
<
# Clean workarea rm -rf PSetTweaks/ WMCore.zip WMSandbox/
>
>
# Source CMSSW environment source /cvmfs/cms.cern.ch/cmsset_default.sh
 
Changed:
<
<
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong.
<!--/twistyPlugin-->
>
>
# Create the scram area (Replace the release for the one the job should use) scramv1 project CMSSW CMSSW_7_4_0
 
Changed:
<
<

Force Releasing PromptReco

>
>
  • Go to the src area in the CMSSW directory you created, then copy the PSet.pkl and PSet.py there (from the untarred job's cmsRun1/2 directory)
# Go to the src area
cd CMSSW_7_4_0/src/

  • Do eval and run the job
eval `scramv1 runtime -sh`

# Actually run the job (you can pass the parameter to create a fwjr too)
cmsRun PSet.py

</>

<!--/twistyPlugin-->
 
Added:
>
>

Updating T0AST when a lumisection can not be transferred.

 %TWISTY{ showlink="Show..." hidelink="Hide" remember="on" mode="div"
Changed:
<
<
}%
>
>
}%
 
Changed:
<
<
Normally PromptReco workflows has a predefined release delay (currently: 48h). We can require to manually release them in a particular moment. For doing it:
  • Check which runs do you want to release
  • Remember, if some runs are in active the workflows will be created but solve the bookkeeping (or similar) problems.
  • The followinq query makes the pre-release of the non released Runs which ID is lower or equal to a particular value. Depending on which Runs you want to release, you should "play" with this condition. You can run only the SELECT to be sure you are only releasing the runs you want to, before doing the update.
UPDATE ( 
         SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM  reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
               AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
         SET t.released = 1,
             t.delay = 10,
             t.delay_offset = 5;
  • Check the Tier0Feeder logs. You should see log lines for all the runs you released.
>
>
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;

Example:

update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634
where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
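Before running the update, it is worth double-checking the entry you are about to modify. A minimal sketch with sqlplus, using the values from the example above (replace the connection placeholders with the T0AST credentials of the affected agent):
sqlplus <T0AST_user>/<password>@<tns> <<'EOF'
select run_id, stream_id, lumi_id, filecount, close_time
from lumi_section_closed
where run_id = 250938 and lumi_id = 11 and stream_id = 14;
EOF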
  </>
<!--/twistyPlugin-->
Changed:
<
<

Running a replay on a headnode

>
>

Unpickling the PSet.pkl file (job configuration file)

  %TWISTY{ showlink="Show..."
Line: 251 to 453
 remember="on" mode="div" }%
Deleted:
<
<
  • To run a replay in a instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
      • [20171012] There is a replay WMAgent.secrets file example on vocms0313.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the jira history.
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
                    echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
 
Changed:
<
<
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin
>
>
To modify the configuration of a job, you can modify the content of the PSet.pkl file. In order to do this, you have to dump the pkl file into a python file and make the necessary changes there. For this you normally need ParameterSet.Config. If it is not present in your python path, you can add it:
 
Changed:
<
<
Again, keep in mind that 00_deploy.sh script wipes t0ast db - production instance in this case - so, carefully.
>
>
//BASH
export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1 for running, and that is why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use.

Now you can use the following snippet to dump the file:

//PYTHON

import FWCore.ParameterSet.Config
import pickle
pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)

#This line only will print the python version of the pkl file on the screen
process.dumpPython()

#The actual writing of the file
outputFile = open('PSetPklAsPythonFile.py', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:

cmsRun PSetPklAsPythonFile.py

or

 
cmsRun -e PSet.py 2>err.txt 1>out.txt &
  </>
<!--/twistyPlugin-->
Changed:
<
<

Changing Tier0 Headnode

>
>

Transfers to T1 sites are taking longer than expected

  %TWISTY{ showlink="Show..."
Line: 294 to 501
 mode="div" }%
Changed:
<
<
# Instruction Responsible Role
0. | If there are any exceptions when logging into a candidate headnode, then you should restart it at first. | Tier0 |
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
1. Deploy the new prod instance in a new vocmsXXX node, check that we use. Obviously, you should use a production version of 00_deploy.sh script. Tier0
1.5. Check the ProdOfflineconfiguration that is being used Tier0
2. Start the Tier0 instance in vocmsXXX Tier0
3. THIS IS OUTDATED ALREADY I THINK Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. THIS IS OUTDATED ALREADY I THINK Checking al transfer are stopped Tier0
4.1. THIS IS OUTDATED ALREADY I THINK Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. THIS IS OUTDATED ALREADY I THINK Check /data/Logs/General.log
5. THIS IS OUTDATED ALREADY I THINK Change the config file of the transfer system to point to T0AST1. It means, going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and change that the following settings to match the new head node T0AST)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
Tier0
6. THIS IS OUTDATED ALREADY I THINK Make a backup of the General.log.* files (This backup is only needed if using t0_control restart in the next step, if using t0_control_stop + t0_control start logs won't be affected) Tier0
7. THIS IS OUTDATED ALREADY I THINK

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. THIS IS OUTDATED ALREADY I THINK Kill the replay processes (if any) Tier0
9. THIS IS OUTDATED ALREADY I THINK Start notification logs to the SM in vocmsXXX Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. THIS IS OUTDATED ALREADY I THINK Restart transfers SMOps
12. RECHECK THE LIST OF CRONTAB JOBS Point acronjobs ran as cmst1 on lxplus to a new headnode. They are checkActiveRuns and checkPendingTransactions scripts. Tier0
</>
<!--/twistyPlugin-->
>
>
Transfers can take a while, so this is somewhat normal. If it takes a very long time, one could ask in the PhEDEx ops HN forum if there is a problem. You can also ping the facility admins or open a GGUS ticket if the issue is backlogging the PromptReco processing at a given T1.
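A rough way to check how far along a dataset is at the destination is the PhEDEx data service, e.g. (the dataset and node names below are placeholders):
curl -s 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas?dataset=<dataset_name>&node=<T1_site>' | python -m json.tool | less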

</>

<!--/twistyPlugin-->
 
Changed:
<
<

Changing CMSSW Version

>
>

Diagnose bookkeeping problems

  %TWISTY{ showlink="Show..."
Line: 326 to 515
 mode="div" }%
Changed:
<
<
If you need to upgrade the CMSSW version the normal procedure is:
>
>
You can run the diagnoseActiveRuns script. It will show what is missing for the Tier 0 to process data for the given run. Post in the SMOps HN forum if there are missing logs or if the bookkeeping is inconsistent.
 
Deleted:
<
<
 
Changed:
<
<
/data/tier0/admin/ProdOfflineConfiguration.py
  • Change the defaultCMSSWVersion filed for the desired CMSSW version, for example:
>
>
#Source environment
source /data/tier0/admin/env.sh

# Run the diagnose script (change run number)
$manage execute-tier0 diagnoseActiveRuns 231087

</>

<!--/twistyPlugin-->

Looking for jobs that were submitted in a given time frame

<!--/twistyPlugin twikiMakeVisibleInline-->

The best way is to look at the wmbs_jobs table while the workflow is still executing jobs. But if the workflow is already archived, no record about the job is kept in the T0AST. Anyway, there is a way to find out which jobs were submitted in a given time frame from the couch db:

Add this patch to the couch app (it actually adds a view); you may have to modify the path to patch according to the WMAgent/!Tier0 tags you are using.

 
Changed:
<
<
defaultCMSSWVersion = "CMSSW_7_4_7"
  • Update the repack and express mappings, For example:
>
>
curl https://github.com/dmwm/WMCore/commit/8c5cca41a0ce5946d0a6fb9fb52ed62165594eb0.patch | patch -d /data/tier0/srv/wmagent/1.9.92/sw.pre.hufnagel/slc6_amd64_gcc481/cms/wmagent/1.0.7.pre6/data/ -p 2

Then init the couchapp; this will create the view. It may take some time if you have a big database to map.

 
Changed:
<
<
repackVersionOverride = { "CMSSW_7_4_2" : "CMSSW_7_4_7", "CMSSW_7_4_3" : "CMSSW_7_4_7", "CMSSW_7_4_4" : "CMSSW_7_4_7", "CMSSW_7_4_5" : "CMSSW_7_4_7", "CMSSW_7_4_6" : "CMSSW_7_4_7", } expressVersionOverride = { "CMSSW_7_4_2" : "CMSSW_7_4_7", "CMSSW_7_4_3" : "CMSSW_7_4_7", "CMSSW_7_4_4" : "CMSSW_7_4_7", "CMSSW_7_4_5" : "CMSSW_7_4_7", "CMSSW_7_4_6" : "CMSSW_7_4_7", }
  • Save the changes
>
>
$manage execute-agent wmagent-couchapp-init

Then curl the results for the given time frame (look for the timestamps you need, change user and password accordingly)

 
Deleted:
<
<
  • Find either the last run using the previous version or the first version using the new version for Express and PromptReco. You can use the following query in T0AST to find runs with specific CMSSW version:
 
 
Changed:
<
<
select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where name = '' select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where name = ''
>
>
curl -g -X GET 'http://user:password@localhost:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_view/statusByTime?startkey=["executing",1432223400]&endkey=["executing",1432305900]'
 
Changed:
<
<
  • Report the change including the information of the first runs using the new version (or last runs using the old one).
>
>
<!--/twistyPlugin-->

Corrupted merged file

<!--/twistyPlugin twikiMakeVisibleInline-->

This includes files that are on tape, already registered on DBS/TMDB. The procedure to recover them is basically to run all the jobs that lead up to this file, starting from the parent merged file, then replace the desired output and make the proper changes in the catalog systems (i.e. DBS/TMDB).
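As a starting point you usually need the parent merged file(s) of the corrupted one. A minimal sketch of looking them up in DBS, assuming a CMSSW environment that provides dasgoclient (the LFN is a placeholder):
dasgoclient --query "parent file=<LFN of the corrupted merged file>"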

 
<!--/twistyPlugin-->
Changed:
<
<

Backup T0AST (Database)

>
>

Print .pkl files, Change job.pkl

  %TWISTY{ showlink="Show..."
Line: 371 to 579
 mode="div" }%
Changed:
<
<
If you want to do a backup of a database (for example, after retiring a production node, you want to keep the information of the old T0AST) you should.
  • Request a target Database: Normally this databases are owned by dirk.hufnagel@cernNOSPAMPLEASE.ch, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket for requesting the backup. For this you should send an email to phydb.support@cernNOSPAMPLEASE.ch. An example of a message can be found in this Elog .
  • When the backup is done you will get a reply to your ticket confirming it.
>
>
  • Print job.pkl or Report.pkl in a tier0 WMAgent vm:
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
jobHandle = open("Report.3.pkl", "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

  • In addition, to change the job.pkl
import cPickle, os
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
# Do the changes on the loadedJob
output = open('job.pkl', 'w')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()

  • Print PSet.pkl in a workernode:
Set the same environment as for running a job interactively, go to the PSet.pkl location, open a python console and do:

import FWCore.ParameterSet.Config as cms
import pickle
handle = open('PSet.pkl', 'r')
process = pickle.load(handle)
handle.close()
print process.dumpConfig()
  </>
<!--/twistyPlugin-->
Changed:
<
<

Repacking gets stuck but the bookkeeping is consistent

>
>

Modifying jobs to resume them with other features (like memory, disk, etc.)

<!--/twistyPlugin twikiMakeVisibleInline-->

Some scripts are already available to do this, provided with:

  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending on the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending on the case. Below we show some basic examples of how to do this:

Experience has shown that you need to modify the Workflow Sandbox when you want to change the following variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

Modifying the job description has proven to be useful to change the following variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)

At /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs there are two directories named "job" and "workflow". You should enter the respective directory. Follow the instructions below on the agent machine in charge of the jobs to modify.

Modifying the Workflow Sandbox

Go to the following folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/workflow

In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given the workflow configuration is different for different Streams.

Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:

feature=maxRSS
value=15360000

Executing generate_code.sh would create a script named after the feature, like modify_wmworkload_maxRSS.py. The latter will modify the selected feature in the Workflow Sandbox.

Once it is generated, you need to add a call to that script in modify_one_workflow.sh. The latter will call all the required scripts, create the tarball and place it where required (the Specs folder).

Finally, execute modify_several_workflows.sh which will call modify_one_workflow.sh for all the desired workflows.

The previous procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:

vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh

Modifying the Job Description

Go to the following folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job

This is very similar to the procedure to modify the Workflow Sandbox: add the jobs to modify to "list". You could, and probably should, read the job.pkl of one particular job and find the name of the feature to modify. After that, use modify_pset.py as a base to create another file which would modify the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the newly created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the features mentioned at the beginning of the section.

vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh

<!--/twistyPlugin-->

Modifying a workflow sandbox

<!--/twistyPlugin twikiMakeVisibleInline-->

If you need to change a file in a workflow sandbox, i.e. in the WMCore zip, this is the procedure:

# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there and then untar the sandbox and unzip WMCore:

cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 
unzip -q WMCore.zip

Now replace/modify the files in WMCore. Then you have to pack everything again; you should remove the old sandbox and WMCore.zip too:

# Remove former sandbox and WMCore.zip, then create the new WMCore.zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip
zip -rq WMCore.zip WMCore

# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*

# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/

Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will pick it up. Also, it is good practice to save a copy of the original sandbox, just in case something goes wrong.
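A minimal sketch of this last step, reusing the example paths above (adjust the workflow name and work area to your case):
# Keep a backup of the original sandbox in the Specs area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2.orig

# Install the rebuilt sandbox
cp /data/tier0/lcontrer/temp/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/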

<!--/twistyPlugin-->

Repacking gets stuck but the bookkeeping is consistent

  %TWISTY{ showlink="Show..."
Line: 416 to 778
  </>
<!--/twistyPlugin-->
Changed:
<
<

Updating the wall time the jobs are using in the condor ClassAd

>
>

Delete database entries when input files are corrupted (Repack jobs)

  %TWISTY{ showlink="Show..."
Line: 425 to 788
 mode="div" }%
Deleted:
<
<
This time can be modified using the following command. Remember that it should be executed as the owner of the jobs.
 
Changed:
<
<
condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440

</>

<!--/twistyPlugin-->

Updating T0AST when a lumisection can not be transferred.

>
>
SELECT WMBS_WORKFLOW.NAME AS NAME,
       WMBS_WORKFLOW.TASK AS TASK,
       LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION,
       LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID,
       LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID
FROM LUMI_SECTION_SPLIT_ACTIVE
INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID
INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;
 
Changed:
<
<
<!--/twistyPlugin twikiMakeVisibleInline-->
>
>
# This will actually show the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query
 
Changed:
<
<
update lumi_section_closed set filecount = 0, CLOSE_TIME = where lumi_id in ( ) and run_id = and stream_id = ;

Example:

>
>
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;
# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)
 
Changed:
<
<
update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634 where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
>
>
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;
 
<!--/twistyPlugin-->
Changed:
<
<

Restarting head node machine

>
>

Manually modify the First Conditions Safe Run (fcsr)

  %TWISTY{ showlink="Show..."
Line: 463 to 815
 mode="div" }%
Changed:
<
<
  1. Stop Tier0 agent
    00_stop_agent.sh
  2. Stop condor
    service condor stop 
    If you want your data to be still available, then cp your spool directory to disk
    cp -r /mnt/ramdisk/spool /data/
  3. Restart the machine (or request its restart)
  4. Mount the RAM Disk (Condor spool won't work otherwise).
  5. If necessary, copy back the data to the spool.
  6. When restarted, start the sminject component
    t0_control start 
  7. Start the agent
    00_start_agent
    Particularly, check the PhEDExInjector component, if there you see errors, try restarting it after sourcing init.sh
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
    $manage execute-agent wmcoreD --restart --component PhEDExInjector
>
>
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
 
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
In the CMS_T0DATASVC_PROD database, check the table below; the first run with locked = 0 is the fcsr:
 
Changed:
<
<

Updating TransferSystem for StorageManager change of alias

>
>
 reco_locked table 
 
Changed:
<
<
<!--/twistyPlugin twikiMakeVisibleInline-->
>
>
If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:
 
Changed:
<
<
Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:
>
>
 update reco_locked set locked = 0 where run >= <desired_run> 
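You can cross-check the result against the Tier0 Data Service mentioned above:
# Check the fcsr reported by the data service (it may take a moment to reflect the change)
curl -s 'https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun'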
 
Changed:
<
<
  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
              
  5. Restart the TransferSystem on the T0 headnode. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
              
>
>
<!--/twistyPlugin-->


 
Deleted:
<
<
</>
<!--/twistyPlugin-->
 
Changed:
<
<

Restart component in case of deadlock

>
>

WMAgent instructions

Restart component in case of deadlock

  %TWISTY{ showlink="Show..."
Line: 518 to 847
  </>
<!--/twistyPlugin-->
Changed:
<
<

Changing Tier0 certificates

>
>

How do I restart the Tier 0 WMAgent?

  %TWISTY{ showlink="Show..."
Line: 527 to 857
 mode="div" }%
Changed:
<
<
  • Check that using the new certificates guarantees privileges to all the needed resources:

Voboxes

>
>
  • Restart the whole agent:
cd /data/tier0/
./00_stop_agent.sh
./00_start_agent.sh
 
Changed:
<
<
  • Copy the servicecert*.pem, servicekey*.pem and serviceproxy*.pem files to
/data/certs 
  • Update the following files to point to the new certificates
admin/env.sh
admin/env_unit.sh

Kibana

  • Change the certificates in the monitoring scripts where they are used, to see where the certificates are being used and the current monitoring head node please check the Tier0 Montoring Twiki.

TransferSystem

  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart component.
/data/TransferSystem/t0_control.sh
>
>
  • Restart a single component (Replace ComponentName):
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --component ComponentName
  </>
<!--/twistyPlugin-->
Changed:
<
<

Getting Job Statistics

>
>

Updating workflow from completed to normal-archived in WMStats

  %TWISTY{ showlink="Show..."
Line: 556 to 880
 mode="div" }%
Changed:
<
<
This is the base script to compile the information of jobs that are already done:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py
>
>
  • To move workflows in completed state to archived state in WMStats, the next code should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 
 
Changed:
<
<
For the analysis we need to define certain things:
>
>
  • The script should be copied to the bin folder of the WMAgent code. For instance, in replay instances:
     /data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/ 
 
Changed:
<
<
  • main_dir: Folder where input log archives are. e.g. '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/Pí, in main()
  • temp: Folder where output json files are going to be generated, in main().
  • runList: Runs to be analyzed, in main()
  • Job type in two places:
    • getStats()
    • stats[dataset] in main()
>
>
  • The script should be modified, assigning a run number in the next statement
     if info['Run'] < <RunNumber>
    As you should notice, the given run number would be the oldest run to be shown in WMStats.
 
Changed:
<
<
The script is run without any parameter. This generates a json file with information about cpu, memory, storage and start and stop times. Task is also included. An example of output file is:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json
>
>
  • After that, the code can be executed with:
     $manage execute-agent updateT0RequestStatus.py 
 
Changed:
<
<
With a separate script in R, I was reading and summarizing the data:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R
>
>
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
There, task type should be defined and also output file. With this script I was just summarizing cpu data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you like to do it :P
 
Changed:
<
<
That script calculates efficiency of each job:
TotalLoopCPU / TotalJobTime * numberOfCores 
>
>

How do I set job thresholds in the WMAgent?

 
Changed:
<
<
and an averaged efficiency per dataset:
sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores) 
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
Changed:
<
<
numberOfCores was obtained from job.pkl, TotalLoopCPU and TotalJobTime were obtained from report.pkl
>
>
  • AgentStatusWatcher and SSB:
Site thresholds are automatically updated by a WMAgent component: AgentStatusWatcher. This component takes information about site status and resources (CPU Bound and IO Bound) from the SiteStatusBoard Pledges view. There are some configurations in the WMAgent config that can be tuned; please have a look at the documentation
 
Changed:
<
<
Job type could be Processing, Merge and Harvesting. For Processing type, task could be Reco or AlcaSkim and for Merge type, ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite _AOD, RecoMergeWrite _DQMIO, RecoMergeWrite _MINIAOD and RecoMergeWrite _RECO.
>
>
  • Add sites to resource control/manual update of the thresholds
This is not worth doing unless AgentStatusWatcher is shut down. Some useful commands are:

#Source environment 
source /data/tier0/admin/env.sh

# Add a site to Resource Control - Change site, thresholds and plugin if needed
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --cms-name=T2_CH_CERN_T0 --se-name=srm-eoscms.cern.ch --ce-name=T2_CH_CERN_T0 --pending-slots=1000 --running-slots=1000 --plugin=CondorPlugin

# Change/init thresholds by task:
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --task-type=Processing --pending-slots=500 --running-slots=500

# Change site status (normal, drain, down)
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --down
 
<!--/twistyPlugin-->
Changed:
<
<

Unpickling the PSet.pkl file (job configuration file)

>
>

Unregistering an agent from WMStats

  %TWISTY{ showlink="Show..."
Line: 597 to 933
 mode="div" }%
Changed:
<
<
To modify the configuration of a job, you can modify the content of the PSet.pkl file. In order to to this you have to dump the pkl file into a python file and there make the necessary changes. To do this normally you'll need ParameterSet.Config. If it is not present in your python path you can modify it:

//BASH

export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1 for runningm and that's why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use.

Now you can use the following snippet to dump the file:

>
>
First thing to know - an agent has to be stopped to unregister it. Otherwise, AgentStatusWatcher will just keep updating a new doc for wmstats.
  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
  • You will be prompt for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.
 
Changed:
<
<
//PYTHON
import FWCore.ParameterSet.Config
import pickle
pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)
>
>
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
#This line only will print the python version of the pkl file on the screen process.dumpPython()
 
Changed:
<
<
#The actual writing of the file outputFile = open('PSetPklAsPythonFile.py', 'w') outputFile.write(process.dumpPython()) outputFile.close()
>
>

Modify the thresholds in the resource control of the Agent

 
Changed:
<
<
After dumping the file you can modify its contents. It is not necessary to pkl it again. you can use the cmsRun command normally
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
Changed:
<
<
cmsRun PSetPklAsPythonFile.py
>
>
  • Log in to the desired agent and become cmst1
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
 
Changed:
<
<
or
>
>
  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 
 
Changed:
<
<
 
cmsRun -e PSet.py 2>err.txt 1>out.txt &
>
>
  • To see the current thresholds and usage, use
     $manage execute-agent wmagent-resource-control -p 
 
<!--/twistyPlugin-->
Changed:
<
<

Checking transfer status at agent shutdown

>
>

Checking transfer status at agent shutdown

  %TWISTY{ showlink="Show..."
Line: 683 to 1020
 ) and site like ...
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
</>
<!--/twistyPlugin-->


 
Changed:
<
<

Disabling flocking to Tier0 Pool

>
>

Condor instructions

Useful queries

  %TWISTY{ showlink="Show..."
Line: 694 to 1034
 mode="div" }%
Deleted:
<
<
If it is needed to prevent new Central Production jobs to be executed in the Tier0 pool it is necessary to disable flocking.

NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/

Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.

 
Added:
>
>
To get condor attributes:
condor_q 52982.15 -l | less -i
To list condor jobs matching a regular expression:
condor_q -const 'regexp("30199",WMAgent_RequestName)' -af
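Another query that often comes in handy is counting running jobs per workflow (a sketch based on the WMAgent_RequestName ClassAd used above):
condor_q -const 'JobStatus == 2' -af WMAgent_RequestName | sort | uniq -c | sort -rn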
 
Changed:
<
<
GENERAL INFO:
  • Difference between site whitelisting and enabling/disabling flocking. When flocking jobs of different core counts, defragmentation may have to be re-tuned.
  • Also when the core-count is smaller than the defragmentation policy objective. E.g., the current defragmentation policy is focused on defragmenting slots with less than 4 cores. Having flocking enabled and only single or 2-core jobs in the mix, will trigger unnecessary defragmentation. I know this is not a common case, but if the policy were focused on 8-cores and for some reason, they inject 4-core jobs, while flocking is enabled, the same would happen.
>
>
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. To do so you should follow these steps.
 
Changed:
<
<
NOTE2: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.
>
>

Changing priority of jobs that are in the condor queue

 
Changed:
<
<
  • Login to vocms007 (GlideInWMS Collector-Negociator)
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized which means any change will be overridden when puppet be executed. There is one no-puppetized file called 99_local_tweaks.config that is the one that will be used to do the changes we desire.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
     -rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
     -rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
     -rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
     -rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
     -rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
     -rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
     -rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
     -rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
     -rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
     -rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config

Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.

>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
Changed:
<
<
  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
    # To enable, set this to True (defragmentation is auto enabled)
    # To disable, set this to False (defragmentation is auto disabled)
    ENABLE_PROD_FLOCKING = True
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 
>
>
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
    for job in $(condor_q -w | awk '{print $1}')
         do
               condor_qedit $job JobPrio "508200001"
         done  
 
Changed:
<
<
  • The negociator has a 12h cache, so the schedds don't need to authenticate during this period of time. It is required to restart the negotiator.
>
>
<!--/twistyPlugin-->
 
Deleted:
<
<
  • Now, you can check the whitelisted Schedds to run in Tier0 pool, the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  
 
Changed:
<
<
  • Now you need to restart the condor negociator to make sure that the changes are applied right away.
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 
>
>

Updating the wall time the jobs are using in the condor ClassAd

 
Changed:
<
<
  • After killing the process it should reappear again after a couple of minutes.
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
Changed:
<
<
  • It is done!
>
>
This time can be modified using the following command. Remember that it should be executed as the owner of the jobs.
 
Changed:
<
<
Remember that this change won't remove/evict the jobs that are actually running, but will prevent new jobs to be sent.
>
>
condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440
 
<!--/twistyPlugin-->
Changed:
<
<

Enabling pre-emption in the Tier0 pool

>
>

Check the number of jobs and CPUs in condor

  %TWISTY{ showlink="Show..."
Line: 769 to 1093
 mode="div" }%
Changed:
<
<
BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team. This literally kills all the Central Production jobs which are in Tier0 Pool (including ones which are being executed at that moment).
>
>
The following commands can be executed from any VM where there's a Tier0 schedd present (recheck if the list of VMs corresponds with the current list of Tier0 schedds. Tier0 production Central Manager is hosted on vocms007).
 
Changed:
<
<
NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.
>
>
  • Get the number of tier0 jobs grouped by the number of CPUs they are using:
 
Added:
>
>
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | sort | uniq -c
 
Changed:
<
<
  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
  • Open 99_local_tweaks.config
  • Locate this section:
     # How to drain the slots
        # graceful: let the jobs finish, accept no more jobs
        # quick: allow job to checkpoint (if supported) and evict it
        # fast: hard kill the jobs
       DEFRAG_SCHEDULE = graceful 
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes. After this the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change
     DEFRAG_SCHEDULE = graceful 
>
>

  • Get the number of total CPUs used for Tier0 jobs on Tier0 Pool:

condor_status -pool vocms007  -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || 
ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

  • Get the total number of CPUs used by non-Tier0 jobs on the Tier0 Pool:

condor_status -pool vocms007  -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && 
ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" &&
 ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

  </>
<!--/twistyPlugin-->
Changed:
<
<

Changing the status of _CH_CERN sites in SSB

>
>

Overriding the limit of Maximum Running jobs by the Condor Schedd

  %TWISTY{ showlink="Show..."
Line: 805 to 1132
 mode="div" }%
Changed:
<
<
To change T2_CH_CERN and T2_CH_CERN_HLT

*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*

  • Open a GGUS Ticket to the site before proceeding, asking them to change the status themselves.
  • If there is no response after 1 hour, reply to the same ticket reporting you are changing it and proceed with the steps in the next section.

To change T0_CH_CERN

  • You should go to the Prodstatus Metric Manual Override site.
  • There, you will be able to change the status of T0_CH_CERN/T2_CH_CERN/T2_CH_CERN_HLT. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of thisreason can be checked here). Then click "Apply" and the procedure will be complete. The users in the cms-tier0-operations e-group are able to do this change.
  • The status in the SSB site is updated every 15 minutes. So you should be able to see the change there maximum after this amount of time.
  • More extensive documentation can be checked here.
>
>
  • Login as root in the Schedd machine
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 
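To verify that the schedd picked up the new value, you can query it afterwards:
condor_config_val MAX_JOBS_RUNNING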
  </>
<!--/twistyPlugin-->
Changed:
<
<

Changing priority of jobs that are in the condor queue

>
>

Changing highIO flag of jobs that are in the condor queue

  %TWISTY{ showlink="Show..."
Line: 830 to 1155
 }%

  • The command for doing it is (it should be executed as cmst1):
Changed:
<
<
condor_qedit <job-id> JobPrio "<New Prio (numeric value)>" 
>
>
condor_qedit <job-id> Requestioslots "0" 
 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs (changing the flag value, filtering the jobs to be modified, etc.)
Changed:
<
<
for job in $(condor_q -w | awk '{print $1}')
>
>
 for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
  do
Changed:
<
<
condor_qedit $job JobPrio "508200001"
>
>
condor_qedit $job Requestioslots "0"
  done
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
</>
<!--/twistyPlugin-->


 
Changed:
<
<

Changing highIO flag of jobs that are in the condor queue

>
>

GRID certificates

Changing the certificate mapping to access eos

  %TWISTY{ showlink="Show..."
Line: 848 to 1176
 mode="div" }%
Changed:
<
<
  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to play with it to meet your particular needs, (changing the New Prio, filtering the jobs to be modified, etc.)
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
             do
                 condor_qedit $job Requestioslots "0"
             done 
>
>
  • The VOC is responsible for this change. This mapping is specified in a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • The current VOC, Daiel Valbuena, has the script writing the gridmap file versioned here: (See Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 
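Once the updated gridmap file is deployed, a quick sanity check is to confirm that the agent DNs are indeed mapped to cmst0:
grep cmst0 /afs/cern.ch/cms/caf/gridmap/gridmap.txt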
  </>
<!--/twistyPlugin-->
Changed:
<
<

Updating workflow from completed to normal-archived in WMStats

>
>

Changing Tier0 certificates

  %TWISTY{ showlink="Show..."
Line: 867 to 1197
 mode="div" }%
Changed:
<
<
  • To move workflows in completed state to archived state in WMStats, the next code should be executed in one of the agents (prod or test):
     https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py 
>
>
  • Check that using the new certificates guarantees privileges to all the needed resources:

Voboxes

 
Changed:
<
<
  • The script should be copied to bin folder of wmagent code. For instance, in replay instances:
     /data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/ 
>
>
  • Copy the servicecert*.pem, servicekey*.pem and serviceproxy*.pem files to
/data/certs 
  • Update the following files to point to the new certificates
admin/env.sh
admin/env_unit.sh
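Before switching, it can be useful to check the expiry date and subject of the copied files (a sketch using the vocms001 file names listed elsewhere in this cookbook):
openssl x509 -in /data/certs/servicecert-vocms001.pem -noout -subject -dates
voms-proxy-info -file /data/certs/serviceproxy-vocms001.pem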

Kibana

  • Change the certificates in the monitoring scripts where they are used; to see where the certificates are being used and the current monitoring head node, please check the Tier0 Monitoring Twiki.

TransferSystem

 
Changed:
<
<
  • The script should be modified, assigning a run number in the next statement
     if info['Run'] < <RunNumber>
    As you should notice, the given run number would be the oldest run to be shown in WMStats.
>
>
TransferSystem is not used anymore
  • In the TransferSystem (currently vocms001), update the following file to point to the new certificate and restart the component.
/data/TransferSystem/t0_control.sh
 
Changed:
<
<
  • After it, the code can be executed with:
     $manage execute-agent updateT0RequestStatus.py 
>
>
</>
<!--/twistyPlugin-->


 
Deleted:
<
<
</>
<!--/twistyPlugin-->
 
Changed:
<
<

Adding runs to the skip list in the t0Streamer cleanup script

>
>

OracleDB (T0AST) instructions

Change Cmsweb Tier0 Data Service Passwords (Oracle DB)

  %TWISTY{ showlink="Show..."
Line: 887 to 1231
 mode="div" }%
Changed:
<
<
The script is running as a cronjob under the cmsprod acrontab. It is located in the cmsprod area on lxplus.
>
>
All the T0 WMAgent instances have access to the Cmsweb Tier0 Data Service instances. So, when changing the passwords, it is necessary to be aware of which instances are running.
 
Changed:
<
<
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmsprod has writing permission on eos - cms-tier0-operations@cern.ch
0 5 * * * lxplus /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py >> /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1
>
>
Instances currently in use (03/03/2015)
 
Changed:
<
<
To add a run to the skip list:
  • Login as cmsprod on lxplus.
  • Go to the script location and open it with an editor:
    /afs/cern.ch/user/c/cmsprod/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py 
  • The skip list is on the line 83:
      # run number in this list will be skipped in the iteration below
        runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357]  
  • Add the desired run in the end of the list. Be careful in not removing the existing runs.
  • Save the changes.
  • It is done!. Don't forget to add it to the Good Runs Twiki
>
>
Instance Name TNS
CMS_T0DATASVC_REPLAY_1 INT2R
CMS_T0DATASVC_REPLAY_2 INT2R
CMS_T0DATASVC_PROD CMSR
 
Changed:
<
<
NOTE: It was decided to keep the skip list within the script instead of a external file to avoid the risk of deleting the runs in case of an error reading such external file.
>
>
  1. Review running instances.
  2. Stop each of them using:
     /data/tier0/00_stop_agent.sh 
  3. Verify that everything is stopped using:
     ps aux | egrep 'couch|wmcore' 
  4. Make sure you have the new password ready (generate it or get it in a safe way from the person who is creating it).
  5. From lxplus or any of the T0 machines, log in to the instances you want to change the password to using:
     sqlplus <instanceName>/<password>@<tns> 
    Replacing the brackets with the proper values for each instance.
  6. In sqlplus run the command password; you will be prompted to enter the Old password, the New password, and to retype it. Then you can exit from sqlplus
          SQL> password
          Changing password for <user>
          Old password: 
          New password: 
          Retype new password: 
          Password changed
          SQL> exit
          
  7. Then, you should retry logging in to the same instance; if you cannot, you are in trouble!
  8. Communicate the password to the CMSWEB contact in a safe way. After their confirmation you can continue with the following steps.
  9. If everything went well, you can now access all the instances with the new passwords. Now it is necessary to update the secrets files on all the machines. These files are located in:
          /data/tier0/admin/
          
    and are normally named as follows (not all the instances will have all the files):
          WMAgent.secrets
          WMAgent.secrets.replay
          WMAgent.secrets.prod
          WMAgent.secrets.localcouch
          WMAgent.secrets.remotecouch
          
  10. If there was an instance running you may also change the password in:
         /data/tier0/srv/wmagent/current/config/tier0/config.py
         
    There you must look for the entry:
          config.T0DAtaScvDatabase.connectUrl
         
    and do the update.
  11. You can now restart the instances that were running before the change. Be careful: some components may fail when you start the instance, so you should be clear about the trade-off of starting it.

</>

<!--/twistyPlugin-->

Backup T0AST (Database)

<!--/twistyPlugin twikiMakeVisibleInline-->

If you want to do a backup of a database (for example, after retiring a production node you may want to keep the information of the old T0AST), you should:

  • Request a target Database: Normally these databases are owned by dirk.hufnagel@cern.ch, so he should request a new database to be the target of the backup.
  • When the database is ready, you can open a ticket requesting the backup. For this you should send an email to phydb.support@cern.ch. An example of a message can be found in this Elog.
  • When the backup is done you will get a reply to your ticket confirming it.
 
<!--/twistyPlugin-->
Changed:
<
<

Restarting Tier-0 voboxes

>
>

Checking what is locking a database / Cern Session Manager

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.
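If you prefer the command line and the account has the required privileges (access to v$session is often restricted, so treat this as an optional sketch; the web Session Manager above is the supported way), you can look for blocking sessions directly:
sqlplus <user>/<password>@<tns> <<'EOF'
select sid, serial#, username, status, blocking_session, event
from v$session
where username is not null;
EOF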

<!--/twistyPlugin-->


T0 nodes, headnodes

Restarting Tier-0 voboxes

  %TWISTY{ showlink="Show..."
Line: 944 to 1348
  </>
<!--/twistyPlugin-->
Changed:
<
<

Modifying jobs to resume them with other features (like memory, disk, etc.)

>
>

Commissioning of a new node

*INCOMPLETE INSTRUCTIONS: WORK IN PROGRESS 2017/03*

  %TWISTY{ showlink="Show..."
Line: 952 to 1359
 remember="on" mode="div" }%
Added:
>
>

Folder's structure and permissions

  • These folders should be placed at /data/:
# Permissions Owner Group Folder Name
1. (775) drwxrwxr-x. root zh admin
2. (775) drwxrwxr-x. root zh certs
3. (755) drwxr-xr-x. cmsprod zh cmsprod
4. (700) drwx------. root root lost+found
5. (775) drwxrwxr-x. root zh srv
6. (755) drwxr-xr-x. cmst1 zh tier0
TIPS:
  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 
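Putting the table above together, a sketch of recreating the layout on a freshly installed node (run as root; adjust if the setup differs):
mkdir -p /data/admin /data/certs /data/cmsprod /data/srv /data/tier0
chown root:zh /data/admin /data/certs /data/srv
chown cmsprod:zh /data/cmsprod
chown cmst1:zh /data/tier0
chmod 775 /data/admin /data/certs /data/srv
chmod 755 /data/cmsprod /data/tier0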
 
Changed:
<
<
Some scripts are already available to do this, provided with:
  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending of the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending of the case. Down below we show some basic examples of how to do this:

Some cases have proven you need to modify the Workflow Sandbox when you want to modify next variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

Modifying the job description has proven to be useful to change next variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)
>
>

2. certs

  • Certificates are placed on this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem
 
Changed:
<
<
At /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs there are two directories named "job" and "workflow". Enter the one that applies to your case. Follow the instructions below on the agent machine in charge of the jobs you want to modify.
>
>
NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section
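A minimal sketch of copying these files from an existing vobox is shown below; vocms001 is only an example source host, adjust the host and file names to your node:
    # Hypothetical example: fetch the certificate files listed above from another vobox
    for f in servicecert-vocms001.pem servicekey-vocms001-enc.pem servicekey-vocms001.pem vocms001.p12 serviceproxy-vocms001.pem; do
        scp root@vocms001.cern.ch:/data/certs/$f /data/certs/
    done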
 
Changed:
<
<

Modifying the Workflow Sandbox

>
>

5. srv

  • There you will find the
    glidecondor
    folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.
 
Changed:
<
<
Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/workflow
>
>

6. tier0

  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.
 
Changed:
<
<
In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given the workflow configuration is different for different Streams.
>
>
File Description
00_deploy.prod.sh Script to deploy the WMAgent for production(*)
00_deploy.replay.sh Script to deploy the WMAgent for a replay(*)
00_fix_t0_status.sh
00_patches.sh
00_readme.txt Some documentation about the scripts
00_software.sh Gets the source code to use from GitHub for WMCore and the Tier0. Applies the described patches, if any.
00_start_agent.sh Starts the agent after it is deployed.
00_start_services.sh Used during the deployment to start services such as CouchDB
00_stop_agent.sh Stops the components of the agent. It doesn't delete any information from the file system or the T0AST, just kill the processes of the services and the WMAgent components
00_wipe_t0ast.sh Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!
(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying. (**) This script is not static. It might change when new patches are required and when the release versions of the WMCore and the Tier0 change. Check it before deploying.
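Putting the scripts above together, a typical redeployment sequence run from /data/tier0 looks roughly like the sketch below. This is only an illustration: the exact order and the choice between the replay and production deploy script depend on what you are doing, and the deploy step wipes the T0AST.
    cd /data/tier0
    ./00_stop_agent.sh        # stop the running components first
    ./00_software.sh          # fetch the WMCore/T0 code and apply the patches
    ./00_deploy.replay.sh     # or ./00_deploy.prod.sh  (WARNING: wipes the T0AST)
    ./00_start_agent.sh       # start the services and the agent components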
 
Changed:
<
<
Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:
feature=maxRSS
value=15360000
>
>
Folder Description
 
Changed:
<
<
Executing generate_code.sh will create a script named after the feature, e.g. modify_wmworkload_maxRSS.py. The latter will modify the selected feature in the Workflow Sandbox.
>
>

Cronjobs

 
Changed:
<
<
Once it is generated, you need to add a call to that script in modify_one_workflow.sh. The latter will call all the required scripts, create the tarball and place it where required (the Specs folder).
>
>
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
Finally, execute modify_several_workflows.sh which will call modify_one_workflow.sh for all the desired workflows.
 
Changed:
<
<
The previous procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:
vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh
>
>

Running a replay on a headnode

 
Changed:
<
<

Modifying the Job Description

>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
  • To run a replay in an instance used for production (for example, before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
      • [20171012] There is a replay WMAgent.secrets file example on vocms0313.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the jira history.
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
                    echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
 
Changed:
<
<
Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job
>
>
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin
 
Changed:
<
<
Very similar to the procedure to modify the Workflow Sandbox: add the jobs to modify to "list". You could and probably should read the job.pkl of one particular job to find the name of the feature to modify. After that, use modify_pset.py as a base to create another file which modifies the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the newly created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the features mentioned at the beginning of the section.
vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh
>
>
Again, keep in mind that the 00_deploy.sh script wipes the T0AST DB (the production instance in this case), so proceed carefully.
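After deploying a replay, a quick sanity check that the replay settings were actually applied could look like this (illustrative; the path follows the usual /data/tier0 deployment layout):
    grep -E 'ReplayOfflineConfiguration|tier0replay|archiveDelayHours' /data/tier0/srv/wmagent/current/config/tier0/config.py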
 
<!--/twistyPlugin-->
Changed:
<
<

PromptReconstruction at T1s/T2s

>
>

Changing Tier0 Headnode

  %TWISTY{ showlink="Show..."
Line: 1027 to 1470
 mode="div" }%
Changed:
<
<
There are 3 basic requirements to perform PromptReconstruction at T1s (and T2s):
>
>
# Instruction Responsible Role
0. | If there are any exceptions when logging into a candidate headnode, then you should restart it at first. | Tier0 |
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
1. Deploy the new prod instance in a new vocmsXXX node, check that we use. Obviously, you should use a production version of 00_deploy.sh script. Tier0
1.5. Check the ProdOfflineconfiguration that is being used Tier0
2. Start the Tier0 instance in vocmsXXX Tier0
3. THIS IS OUTDATED ALREADY I THINK Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. THIS IS OUTDATED ALREADY I THINK Check that all transfers are stopped Tier0
4.1. THIS IS OUTDATED ALREADY I THINK Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. THIS IS OUTDATED ALREADY I THINK Check /data/Logs/General.log
5. THIS IS OUTDATED ALREADY I THINK Change the config file of the transfer system to point to T0AST1. It means going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and changing the following settings to match the new head node T0AST)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
  "DatabasePassword" => 'superSafePassword123',
Tier0
6. THIS IS OUTDATED ALREADY I THINK Make a backup of the General.log.* files (This backup is only needed if using t0_control restart in the next step, if using t0_control_stop + t0_control start logs won't be affected) Tier0
7. THIS IS OUTDATED ALREADY I THINK

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. THIS IS OUTDATED ALREADY I THINK Kill the replay processes (if any) Tier0
9. THIS IS OUTDATED ALREADY I THINK Start notification logs to the SM in vocmsXXX Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. THIS IS OUTDATED ALREADY I THINK Restart transfers SMOps
12. RECHECK THE LIST OF CRONTAB JOBS Point the acron jobs run as cmst1 on lxplus to the new headnode. They are the checkActiveRuns and checkPendingTransactions scripts. Tier0
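For step 12, an illustrative way to review which acron entries still point at the old headnode before editing them (run as cmst1 on lxplus):
    # List the acron entries and highlight the Tier0 scripts and target hosts
    acrontab -l | grep -Ei 'checkActiveRuns|checkPendingTransactions|vocms'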
</>
<!--/twistyPlugin-->
 
Deleted:
<
<
  • Each desired site should be configured in the T0 Agent Resource Control. For this, /data/tier0/00_deploy.sh file should be modified specifying pending and running slot thresholds for each type of processing task. For instance:
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20
 
Changed:
<
<
A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
>
>

Restarting head node machine

 
Changed:
<
<
  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
datasets = [ "DisplacedJet" ]
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
Changed:
<
<
for dataset in datasets: addDataset(tier0Config, dataset, do_reco = True, raw_to_disk = True, tape_node = "T1_IT_CNAF_MSS", disk_node = "T1_IT_CNAF_Disk", siteWhitelist = [ "T1_IT_CNAF" ], dqm_sequences = [ "@common" ], physics_skims = [ "LogError", "LogErrorMonitor" ], scenario = ppScenario)
>
>
  1. Stop Tier0 agent
    00_stop_agent.sh
  2. Stop condor
    service condor stop 
    If you want your data to be still available, then cp your spool directory to disk
    cp -r /mnt/ramdisk/spool /data/
  3. Restart the machine (or request its restart)
  4. Mount the RAM Disk (Condor spool won't work otherwise).
  5. If necessary, copy back the data to the spool.
  6. When restarted, start the sminject component
    t0_control start 
  7. Start the agent
    00_start_agent
    Particularly, check the PhEDExInjector component, if there you see errors, try restarting it after sourcing init.sh
    source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
    $manage execute-agent wmcoreD --restart --component PhEDExInjector
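A small illustrative check for step 4, before starting condor again:
    # Verify the RAM disk used for the condor spool is mounted
    mount | grep /mnt/ramdisk || echo 'RAM disk not mounted - condor spool will not work'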
 
Changed:
<
<
  • Jobs should be able to write to the T1 storage systems. For this, a proxy with the production VOMS role should be provided at /data/certs/. The variable X509_USER_PROXY defined in /data/tier0/admin/env.sh should point to the proxy location. A proxy with the required role cannot be generated for a time span longer than 8 days, so a cron job should be responsible for the renewal. For jobs to stage out at T1s there is no need to map the Distinguished Name (DN) shown in the certificate to specific users at the T1 sites; the mapping is made with the role of the certificate. This could be needed to stage out at T2 sites. Down below, the information of a valid proxy is shown:
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       : voms2.cern.ch:15002
>
>
<!--/twistyPlugin-->


 
Deleted:
<
<
</>
<!--/twistyPlugin-->
 
Changed:
<
<

Manually modify the First Conditions Safe Run (fcsr)

>
>

T0 Pool instructions

Disabling flocking to Tier0 Pool

  %TWISTY{ showlink="Show..."
Line: 1090 to 1526
 mode="div" }%
Changed:
<
<
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
>
>
If you need to prevent new Central Production jobs from being executed in the Tier0 pool, it is necessary to disable flocking.
 
Changed:
<
<
In the CMS_T0DATASVC_PROD database check the table the first run with locked = 0 is fcsr
>
>
NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/
 
Changed:
<
<
 reco_locked table 
>
>
Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.
 
Deleted:
<
<
If you want to manually set a run as the fcsr you have to make sure that it is the lowest run with locked = 0
 
Changed:
<
<
 update reco_locked set locked = 0 where run >= <desired_run> 
>
>
GENERAL INFO:
  • Difference between site whitelisting and enabling/disabling flocking. When flocking jobs of different core counts, defragmentation may have to be re-tuned.
  • Also when the core-count is smaller than the defragmentation policy objective. E.g., the current defragmentation policy is focused on defragmenting slots with less than 4 cores. Having flocking enabled and only single or 2-core jobs in the mix, will trigger unnecessary defragmentation. I know this is not a common case, but if the policy were focused on 8-cores and for some reason, they inject 4-core jobs, while flocking is enabled, the same would happen.
 
Deleted:
<
<
</>
<!--/twistyPlugin-->
 
Changed:
<
<

Modify the thresholds in the resource control of the Agent

>
>
As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. To disable flocking, follow these steps.
 
Changed:
<
<
<!--/twistyPlugin twikiMakeVisibleInline-->
>
>
NOTE2: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.
 
Changed:
<
<
  • Login into the desired agent and become cmst1
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
>
>
  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su - 
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/
  • There you will find a list of files. Most of them are puppetized, which means any change will be overridden when Puppet runs. There is one non-puppetized file, 99_local_tweaks.config, which is the one used to make the changes we need.
     -rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
     -rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
     -rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
     -rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
     -rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
     -rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
     -rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
     -rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
     -rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
     -rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
     -rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config
 
Changed:
<
<
  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 
>
>
Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.
 
Changed:
<
<
  • To see the current thresholds and use
     $manage execute-agent wmagent-resource-control -p 
>
>
  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
    # To enable, set this to True (defragmentation is auto enabled)
    # To disable, set this to False (defragmentation is auto disabled)
    ENABLE_PROD_FLOCKING = True
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 

  • The negotiator has a 12h cache, so the schedds don't need to authenticate during this period of time. It is required to restart the negotiator.

  • Now, you can check the whitelisted Schedds to run in Tier0 pool, the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  

  • Now you need to restart the condor negotiator to make sure that the changes are applied right away.
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 

  • After killing the process it should reappear again after a couple of minutes.

  • It is done!

Remember that this change won't remove/evict the jobs that are actually running, but will prevent new jobs to be sent.
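A short illustrative verification after editing 99_local_tweaks.config, using the knob name shown above:
    condor_config_val ENABLE_PROD_FLOCKING   # value condor will use after the reconfig
    condor_reconfig                          # push the new configuration
    # the negotiator restart described above is still needed for an immediate effect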

 
<!--/twistyPlugin-->
Changed:
<
<

Overriding the limit of Maximum Running jobs by the Condor Schedd

>
>

Enabling pre-emption in the Tier0 pool

  %TWISTY{ showlink="Show..."
Line: 1136 to 1601
 mode="div" }%
Changed:
<
<
  • Login as root in the Schedd machine
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 
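To double-check that the new limit is in place, something along these lines (illustrative) can be run on the schedd machine:
    condor_config_val MAX_JOBS_RUNNING   # limit read from the configuration files
    condor_q -totals                     # current totals of running/idle jobs on this schedd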
>
>
BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team. This literally kills all the Central Production jobs which are in Tier0 Pool (including ones which are being executed at that moment).

NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/
     cd /etc/condor/config.d/ 
  • Open 99_local_tweaks.config
  • Locate this section:
     # How to drain the slots
        # graceful: let the jobs finish, accept no more jobs
        # quick: allow job to checkpoint (if supported) and evict it
        # fast: hard kill the jobs
       DEFRAG_SCHEDULE = graceful 
  • Change it to:
     DEFRAG_SCHEDULE = fast 
  • Leave it enabled only for ~5 minutes. After this the Tier0 jobs will start being killed as well. After the 5 minutes, revert the change
     DEFRAG_SCHEDULE = graceful 
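As with the other local tweaks, the edit only takes effect once the configuration is re-read; an illustrative sketch:
    condor_config_val DEFRAG_SCHEDULE   # check the value currently configured
    condor_reconfig                     # apply the edited 99_local_tweaks.config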
  </>
<!--/twistyPlugin-->
Changed:
<
<

Unregistering an agent from WMStats

>
>

Changing the status of _CH_CERN sites in SSB

  %TWISTY{ showlink="Show..."
Line: 1157 to 1637
 mode="div" }%
Changed:
<
<
First thing to know - an agent has to be stopped to unregister it. Otherwise, AgentStatusWatcher will just keep updating a new doc for wmstats.
  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
  • You will be prompt for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.
>
>
To change T2_CH_CERN and T2_CH_CERN_HLT

*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*

  • Open a GGUS Ticket to the site before proceeding, asking them to change the status themselves.
  • If there is no response after 1 hour, reply to the same ticket reporting you are changing it and proceed with the steps in the next section.

To change T0_CH_CERN

  • You should go to the Prodstatus Metric Manual Override site.
  • There, you will be able to change the status of T0_CH_CERN/T2_CH_CERN/T2_CH_CERN_HLT. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of thisreason can be checked here). Then click "Apply" and the procedure will be complete. The users in the cms-tier0-operations e-group are able to do this change.
  • The status in the SSB site is updated every 15 minutes. So you should be able to see the change there maximum after this amount of time.
  • More extensive documentation can be checked here.

</>

<!--/twistyPlugin-->


 
Deleted:
<
<
</>
<!--/twistyPlugin-->
 
Changed:
<
<

Checking what is locking a database / Cern Session Manager

>
>

Other (did not fit into the categories above/outdated/in progress)

Updating TransferSystem for StorageManager change of alias (probably outdated)

  %TWISTY{ showlink="Show..."
Line: 1177 to 1664
 mode="div" }%
Changed:
<
<
  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.
>
>
Ideally this process should be transparent to us. However, it might be that the TransferSystem doesn't update the IP address of the SM alias when the alias is changed to point to the new machine. In this case you will need to restart the TransferSystem in both the /data/tier0/sminject area on the T0 headnode and the /data/TransferSystem area on vocms001. Steps for this process are below:

  1. Watch the relevant logs on the headnode to see if streamers are being received by the Tier0Injector and if repack notices are being sent by the LoggerReceiver. A useful command for this is:
     watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog" 
  2. Also watch the TransferSystem on vocms001 to see if streamers / files are being received from the SM and if CopyCheck notices are being sent to the SM. A useful command for this is:
     watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log" 
  3. If any of these services stop sending and/or receiving, you will need to restart the TransferSystem.
  4. Restart the TransferSystem on vocms001. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/TransferSystem
    ./t0_control stop
    ./t0_control start
              
  5. Restart the TransferSystem on the T0 headnode. Do the following (this should save the logs. If it doesn't, use restart instead):
    cd /data/tier0/sminject
    ./t0_control stop
    ./t0_control start
              
  </>
<!--/twistyPlugin-->
Deleted:
<
<

Commissioning of a new node

 
Changed:
<
<
*INCOMPLETE INSTRUCTIONS: WORK IN PROGRESS 2017/03*
>
>

Getting Job Statistics (needs to be reviewed)

  %TWISTY{ showlink="Show..."
Line: 1194 to 1691
 remember="on" mode="div" }%
Deleted:
<
<

Folder's structure and permissions

  • These folders should be placed at /data/:
# Permissions Owner Group Folder Name
1. (775) drwxrwxr-x. root zh admin
2. (775) drwxrwxr-x. root zh certs
3. (755) drwxr-xr-x. cmsprod zh cmsprod
4. (700) drwx------. root root lost+found
5. (775) drwxrwxr-x. root zh srv
6. (755) drwxr-xr-x. cmst1 zh tier0
TIPS:
  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 
 
Changed:
<
<

2. certs

  • Certificates are placed on this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem

NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section

>
>
This is the base script to compile the information of jobs that are already done:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py
 
Changed:
<
<

5. srv

  • There you will find the
    glidecondor
    folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.
>
>
For the analysis we need to define certain things:
 
Changed:
<
<

6. tier0

  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.
>
>
  • main_dir: Folder where the input log archives are, e.g. '/data/tier0/srv/wmagent/current/install/tier0/JobArchiver/logDir/P', in main()
  • temp: Folder where output json files are going to be generated, in main().
  • runList: Runs to be analyzed, in main()
  • Job type in two places:
    • getStats()
    • stats[dataset] in main()
 
Changed:
<
<
File Description
00_deploy.prod.sh Script to deploy the WMAgent for production(*)
00_deploy.replay.sh Script to deploy the WMAgent for a replay(*)
00_fix_t0_status.sh
00_patches.sh
00_readme.txt Some documentation about the scripts
00_software.sh Gets the source code to use form Github for WMCore and the Tier0. Applies the described patches if any.
00_start_agent.sh Starts the agent after it is deployed.
00_start_services.sh Used during the deployment to start services such as CouchDB
00_stop_agent.sh Stops the components of the agent. It doesn't delete any information from the file system or the T0AST, just kill the processes of the services and the WMAgent components
00_wipe_t0ast.sh Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!
(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying. (**) This script is not static. It might change when new patches are required and when the release versions of the WMCore and the Tier0 change. Check it before deploying.
>
>
The script is run without any parameter. This generates a json file with information about cpu, memory, storage and start and stop times. Task is also included. An example of output file is:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json
 
Changed:
<
<
Folder Description
>
>
With a separate script in R, I was reading and summarizing the data:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R
 
Changed:
<
<

Cronjobs

>
>
There, task type should be defined and also output file. With this script I was just summarizing cpu data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you like to do it :P
 
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
That script calculates efficiency of each job:
TotalLoopCPU / (TotalJobTime * numberOfCores) 
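For a quick manual cross-check of a single job, the same ratio can be computed by hand; the numbers below are made up:
    # efficiency = TotalLoopCPU / (TotalJobTime * numberOfCores)
    awk 'BEGIN { printf("%.2f\n", 30000 / (10000 * 4)) }'   # -> 0.75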
 
Changed:
<
<

Changing the certificate mapping to access eos

>
>
and an averaged efficiency per dataset:
sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores) 
 
Changed:
<
<
<!--/twistyPlugin twikiMakeVisibleInline-->
>
>
numberOfCores was obtained from job.pkl, TotalLoopCPU and TotalJobTime were obtained from report.pkl
 
Changed:
<
<
  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • The current VOC, Daniel Valbuena, has the script writing the gridmap file versioned here: (See Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 
>
>
Job type could be Processing, Merge or Harvesting. For the Processing type, the task could be Reco or AlcaSkim; for the Merge type, ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD and RecoMergeWrite_RECO.
 
<!--/twistyPlugin-->
Changed:
<
<

Update code in the dmwm/T0 repository

>
>

Update code in the dmwm/T0 repository

  %TWISTY{ showlink="Show..."
Line: 1283 to 1734
 mode="div" }%
Added:
>
>
This guide contains all the necessary steps, read it at first.
 Execute this commands locally, where you already made a copy of the repository

  • Get the latest code from the repository
Line: 1352 to 1804
 </>
<!--/twistyPlugin-->
Changed:
<
<

Changing the certificate mapping to access eos

>
>

EOS Areas of interest

  %TWISTY{ showlink="Show..."
Line: 1361 to 1813
 mode="div" }%
Deleted:
<
<
  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • Current VOC, Daiel Valbuena, has the script writting the gridmap file versioned here: (See Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 

</>

<!--/twistyPlugin-->
 
Changed:
<
<

Check the number of jobs and CPUs in condor

<!--/twistyPlugin twikiMakeVisibleInline-->

The following commands can be executed from any VM where there's a Tier0 schedd present (recheck if the list of VMs corresponds with the current list of Tier0 schedds. Tier0 production Central Manager is hosted on vocms007).

>
>
The Tier-0 WMAgent uses four areas on eos.
Path | Use | Who writes | Who reads | Who cleans
/eos/cms/store/t0streamer/ | Input streamer files transferred from P5 | Storage Manager | Tier-0 worker nodes | Tier-0 t0streamer area cleanup script
/eos/cms/store/unmerged/ | Store output files smaller than 2GB until the merge jobs put them together | Tier-0 worker nodes (Processing/Repack jobs) | Tier-0 worker nodes (Merge jobs) | ?
/eos/cms/tier0/ | Files ready to be transferred to Tape and Disk | Tier-0 worker nodes (Processing/Repack/Merge jobs) | PhEDEx Agent | Tier-0 WMAgent creates and auto-approves transfer/deletion requests; PhEDEx executes them
/eos/cms/store/express/ | Output from Express processing | Tier-0 worker nodes | Users | Tier-0 express area cleanup script
 
Deleted:
<
<
  • Get the number of tier0 jobs sorted by a number of CPUs they are using:
 
Changed:
<
<
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | sort | uniq -c
>
>
/eos/cms/store/t0streamer/
 
Added:
>
>
The SM writes raw streamer files there, and we delete them with the cleanup script, which runs as an acron job under the cmst0 account. The script keeps data that has not been repacked yet, and it keeps data that is less than 7 days old. Repacking rewrites the streamer .dat files into Primary Datasets (raw .root files).
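To eyeball what is currently sitting in the streamer area (purely illustrative, any EOS client works):
    eos ls -l /eos/cms/store/t0streamer/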
 
Deleted:
<
<
  • Get the number of total CPUs used for Tier0 jobs on Tier0 Pool:
 
Changed:
<
<
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
>
>
/eos/cms/store/unmerged/
 
Added:
>
>
Files that still need to be merged into larger files go there (not all files do). The jobs manage this area themselves: after merging, the merge job deletes the unmerged files.
 
Deleted:
<
<
  • The total number of CPUs used by NOT Tier0 jobs on Tier0 Pool:
 
Changed:
<
<
condor_status -pool vocms007 -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" && ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
>
>
/eos/cms/store/express/
 
Changed:
<
<
<!--/twistyPlugin-->

Change a file size limit on Tier0

<!--/twistyPlugin twikiMakeVisibleInline-->

As of October 2017, the file size limit was increased from 12GB to 16GB. However, if a change is needed, the following values need to be modified:

  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py
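An illustrative way to check the values currently in place (the config.py path assumes the usual /data/tier0 deployment layout):
    grep -E 'maxSizeSingleLumi|maxEdmSize' /data/tier0/admin/ProdOfflineConfiguration.py
    grep maxAllowedRepackOutputSize /data/tier0/srv/wmagent/current/config/tier0/config.py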
>
>
Merged Express output. Tier0 jobs write to it; data deletions are managed by DDM.
 
<!--/twistyPlugin-->
\ No newline at end of file

Revision 642017-11-02 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1411 to 1411
 

</>

<!--/twistyPlugin-->
\ No newline at end of file
Added:
>
>

Change a file size limit on Tier0

<!--/twistyPlugin twikiMakeVisibleInline-->

As for October 2017, the file size limit was increased from 12GB to 16GB. However, if a change is needed, then the following values need to be modified:

  • maxSizeSingleLumi and maxEdmSize in ProdOfflineConfiguration.py
  • maxAllowedRepackOutputSize in srv/wmagent/current/config/tier0/config.py

<!--/twistyPlugin-->
 \ No newline at end of file

Revision 632017-10-27 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1018 to 1018
  </>
<!--/twistyPlugin-->
Changed:
<
<

PromptReconstruction at T1s

>
>

PromptReconstruction at T1s/T2s

  %TWISTY{ showlink="Show..."
Line: 1027 to 1027
 mode="div" }%
Changed:
<
<
There are 3 basic requirements to perform PromptReconstruction at T1s (and possibly T2s):
>
>
There are 3 basic requirements to perform PromptReconstruction at T1s (and T2s):
 
  • Each desired site should be configured in the T0 Agent Resource Control. For this, /data/tier0/00_deploy.sh file should be modified specifying pending and running slot thresholds for each type of processing task. For instance:
Line: 1040 to 1040
 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20
Added:
>
>
A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
 
  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
datasets = [ "DisplacedJet" ]

Revision 622017-10-12 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 251 to 251
 remember="on" mode="div" }%
Changed:
<
<
>
>
 
  • To run a replay in a instance used for production (for example before deploying it in production) you should check the following:
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
Changed:
<
<
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing to use based on the elog history.
>
>
      • [20171012] There is a replay WMAgent.secrets file example on vocms0313.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing version to use based on the jira history.
 
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
Line: 278 to 279
  ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600 ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
Changed:
<
<
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=PyCondorPlugin
>
>
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin

Again, keep in mind that 00_deploy.sh script wipes t0ast db - production instance in this case - so, carefully.

  </>
<!--/twistyPlugin-->
Line: 292 to 295
 }%

# Instruction Responsible Role
Added:
>
>
0. | If there are any exceptions when logging into a candidate headnode, then you should restart it at first. | Tier0 |
 
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
Changed:
<
<
1. Deploy the new prod instance in vocms0314, check that we use: Tier0
>
>
1. Deploy the new prod instance in a new vocmsXXX node, check that we use. Obviously, you should use a production version of 00_deploy.sh script. Tier0
 
1.5. Check the ProdOfflineconfiguration that is being used Tier0
Changed:
<
<
2. Start the Tier0 instance in vocms0314 Tier0
3. Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. Checking al transfer are stopped Tier0
4.1. Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. Check /data/Logs/General.log
| 5. | Change the config file of the transfer system to point to T0AST1. It means, going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and change that the following settings to match the new head node T0AST)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
>
>
2. Start the Tier0 instance in vocmsXXX Tier0
3. THIS IS OUTDATED ALREADY I THINK Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. THIS IS OUTDATED ALREADY I THINK Checking al transfer are stopped Tier0
4.1. THIS IS OUTDATED ALREADY I THINK Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
4.2. THIS IS OUTDATED ALREADY I THINK Check /data/Logs/General.log
| 5. | THIS IS OUTDATED ALREADY I THINK Change the config file of the transfer system to point to T0AST1. It means, going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and change that the following settings to match the new head node T0AST)
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser" => "CMS_T0AST_1", "DatabasePassword" => 'superSafePassword123', | Tier0 |
Changed:
<
<
6. Make a backup of the General.log.* files (This backup is only needed if using t0_control restart in the next step, if using t0_control_stop + t0_control start logs won't be affected) Tier0
7.

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. Kill the replay processes (if any) Tier0
9. Start notification logs to the SM in vocms0314 Tier0
>
>
6. THIS IS OUTDATED ALREADY I THINK Make a backup of the General.log.* files (This backup is only needed if using t0_control restart in the next step, if using t0_control_stop + t0_control start logs won't be affected) Tier0
7. THIS IS OUTDATED ALREADY I THINK

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. THIS IS OUTDATED ALREADY I THINK Kill the replay processes (if any) Tier0
9. THIS IS OUTDATED ALREADY I THINK Start notification logs to the SM in vocmsXXX Tier0
 
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
Changed:
<
<
11. Restart transfers SMOps
12. Point acronjobs ran as cmst1 on lxplus to a new headnode. They are checkActiveRuns and checkPendingTransactions scripts. Tier0
>
>
11. THIS IS OUTDATED ALREADY I THINK Restart transfers SMOps
12. RECHECK THE LIST OF CRONTAB JOBS Point acronjobs ran as cmst1 on lxplus to a new headnode. They are checkActiveRuns and checkPendingTransactions scripts. Tier0
  </>
<!--/twistyPlugin-->

Revision 612017-10-05 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1148 to 1148
 mode="div" }%
Added:
>
>
First thing to know - an agent has to be stopped to unregister it. Otherwise, AgentStatusWatcher will just keep updating a new doc for wmstats.
 
  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
Changed:
<
<
 $manage execute-agent wmagent-unregister-wmstats `hostname -f`:9999  
>
>
 $manage execute-agent wmagent-unregister-wmstats `hostname -f` 
 
  • You will be prompt for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.

Revision 602017-09-29 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1261 to 1261
 
38.        elif p[ 'dn' ] in namesToMapToTIER0:
39.           dnmap[ p['dn'] ] = "cmst0" 
Added:
>
>
</>
<!--/twistyPlugin-->
 

Update code in the dmwm/T0 repository

%TWISTY{

Line: 1337 to 1340
 

</>

<!--/twistyPlugin-->
Added:
>
>

Changing the certificate mapping to access eos

<!--/twistyPlugin twikiMakeVisibleInline-->

  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
  • Current VOC, Daiel Valbuena, has the script writting the gridmap file versioned here: (See Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 

<!--/twistyPlugin-->

Check the number of jobs and CPUs in condor

<!--/twistyPlugin twikiMakeVisibleInline-->

The following commands can be executed from any VM where there's a Tier0 schedd present (recheck if the list of VMs corresponds with the current list of Tier0 schedds. Tier0 production Central Manager is hosted on vocms007).

  • Get the number of tier0 jobs sorted by a number of CPUs they are using:

condor_status -pool vocms007  -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || 
ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")'  -af Cpus | sort | uniq -c

  • Get the number of total CPUs used for Tier0 jobs on Tier0 Pool:

condor_status -pool vocms007  -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || 
ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

  • The total number of CPUs used by NOT Tier0 jobs on Tier0 Pool:

condor_status -pool vocms007  -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && 
ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" &&
 ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")'  -af Cpus | awk '{sum+= $1} END {print(sum)}'

<!--/twistyPlugin-->

Revision 592017-09-22 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 690 to 690
 mode="div" }%
Changed:
<
<
If it is needed to prevent new Central Production jobs to be executed in the Tier0 pool it is necessary to disable flocking. To do so you should follow these steps. Be Careful, you will make changes in the GlideInWMS Collector and Negociator, you can cause a big mess if you don't proceed with caution.
>
>
If it is needed to prevent new Central Production jobs to be executed in the Tier0 pool it is necessary to disable flocking.
 
Changed:
<
<
NOTE: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.
>
>
NOTE1: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/

Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.

GENERAL INFO:

  • Difference between site whitelisting and enabling/disabling flocking. When flocking jobs of different core counts, defragmentation may have to be re-tuned.
  • Also when the core-count is smaller than the defragmentation policy objective. E.g., the current defragmentation policy is focused on defragmenting slots with less than 4 cores. Having flocking enabled and only single or 2-core jobs in the mix, will trigger unnecessary defragmentation. I know this is not a common case, but if the policy were focused on 8-cores and for some reason, they inject 4-core jobs, while flocking is enabled, the same would happen.

As the changes directly affect the GlideInWMS Collector and Negotiator, you can cause a big mess if you don't proceed with caution. To do so you should follow these steps.

NOTE2: The root access to the GlideInWMS Collector is guaranteed for the members of the cms-tier0-operations@cern.ch e-group.

 
  • Login to vocms007 (GlideInWMS Collector-Negociator)
  • Login as root
Line: 750 to 765
 mode="div" }%
Changed:
<
<
BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team.
>
>
BEWARE: Please DO NOT use this strategy unless you are sure it is necessary and you agree in doing it with the Workflow Team. This literally kills all the Central Production jobs which are in Tier0 Pool (including ones which are being executed at that moment).
 
Changed:
<
<
  • Login to vocms007 (GlideInWMS Collector-Negociator)
>
>
NOTE: BEFORE CHANGING THE CONFIGURATION OF FLOCKING / DEFRAGMENTATION, YOU NEED TO CONTACT SI OPERATORS AT FIRST ASKING THEM TO MAKE CHANGES (as for September 2017, the operators are Diego at CERN <diego.davila@cernNOSPAMPLEASE.ch> and Krista at FNAL <klarson1@fnalNOSPAMPLEASE.gov> ). Formal way to request changes is GlideInWMS elog (just post a request here): https://cms-logbook.cern.ch/elog/GlideInWMS/ Only in case of emergency out of working hours, consider executing the below procedure on your own. But posting elog entry in this case is even more important as SI team needs to be aware of such meaningful changes.

  • Login to vocms007 (GlideInWMS Collector-Negotiator)
 
  • Login as root
     sudo su -  
  • Go to /etc/condor/config.d/

Revision 582017-09-21 - ElianaAlejandraBohorquezPuentes

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1239 to 1239
 10. "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
38.        elif p[ 'dn' ] in namesToMapToTIER0:
39.           dnmap[ p['dn'] ] = "cmst0" 
\ No newline at end of file
Added:
>
>

Update code in the dmwm/T0 repository

<!--/twistyPlugin twikiMakeVisibleInline-->

Execute this commands locally, where you already made a copy of the repository

  • Get the latest code from the repository
git checkout master
git fetch dmwm
git pull dmwm master
git push origin master

  • Create a branch to add the code changes. Use a meaningful name.
git checkout -b <branch-name> dmwm/master

  • Make the changes in the code

  • Add the modified files to the changes to be commit
git add <file-name>

  • Make commit of the changes
git commit

  • Push the changes from your local repository to the remote repository
git push origin <branch-name>

  • Make a pull request from the GitHub web page

NOTE: If you want to modify your last commit, before it is merged into the code of dmwm (even if you already made the pull request), use these steps:

  • Make the required modifications in the branch
  • Fix the previous commit
git commit --amend
  • Force update
git push -f origin <branch-name>
  • If a pull request was done before, it will update automatically.

After the branch is merged, it can be safely deleted:

git branch -d <branch-name>

Other useful commands

  • Show the branch in which you are working and the status of the changes. Useful before doing commit or while working on a branch.
git branch
git status
  • Others
git reset
git diff
git log
git checkout . 

<!--/twistyPlugin-->

Revision 572017-09-18 - VytautasJankauskas

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 309 to 309
 
9. Start notification logs to the SM in vocms0314 Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. Restart transfers SMOps
Added:
>
>
12. Point acronjobs ran as cmst1 on lxplus to a new headnode. They are checkActiveRuns and checkPendingTransactions scripts. Tier0
  </>
<!--/twistyPlugin-->

Revision 562017-07-26 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Changed:
<
<
Contents :
>
>
Contents :
  Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
Line: 17 to 17
 
/eos/cms/tier0/ Files ready to be transferred to Tape and Disk Tier-0 worker nodes (Processing/Repack/Merge jobs) PhEDEx Agent Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them
/eos/cms/store/express/ Output from Express processing Tier-0 worker nodes Users Tier-0 express area clenaup script
Added:
>
>

Modifying the Tier-0 Configuration

Adding a new scenario to the configuration

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Go to the scenarios section.
  • Declare a new variable for the new scenario. Give it a meaningful name ending with the "Scenario" suffix.
     <meaningfulName>Scenario = "<actualNameOfTheNewScenario>" 
  • Make a new Pull Request, adding the scenario to the scenarios creation: https://github.com/dmwm/T0/blob/master/src/python/T0/WMBS/Oracle/Create.py#L864
  • NOTE: If the instance is already deployed, you can manually add the new scenario directly on the event_scenario table of the T0AST. The Tier0Feeder will pick the change up in the next polling cycle.

<!--/twistyPlugin-->
 

Corrupted merged file

%TWISTY{

Revision 552017-07-24 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 709 to 708
 
  • Now, you can check the whitelisted Schedds to run in Tier0 pool, the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  
Added:
>
>
  • Now you need to restart the condor negotiator to make sure that the changes are applied right away.
     ps aux | grep "condor_negotiator"   
    kill -9 <replace_by_condor_negotiator_process_id> 

  • After killing the process, it should reappear after a couple of minutes.

  • It is done!
 Remember that this change won't remove/evict the jobs that are already running, but it will prevent new jobs from being sent.

</>

<!--/twistyPlugin-->

Revision 542017-07-24 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 8 to 8
  BEWARE: The writers are not responsible for side effects of these recipes, always understand the commands before executing them.
Added:
>
>

EOS Areas of interest

The Tier-0 WMAgent uses four areas on eos.
Path Use Who writes Who reads Who cleans
/eos/cms/store/t0streamer/ Input streamer files transferred from P5 Storage Manager Tier-0 worker nodes Tier-0 t0streamer area cleanup script
/eos/cms/store/unmerged/ Store output files smaller than 2GB until the merge jobs put them together Tier-0 worker nodes (Processing/Repack jobs) Tier-0 worker nodes(Merge Jobs) ?
/eos/cms/tier0/ Files ready to be transferred to Tape and Disk Tier-0 worker nodes (Processing/Repack/Merge jobs) PhEDEx Agent Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them
/eos/cms/store/express/ Output from Express processing Tier-0 worker nodes Users Tier-0 express area cleanup script

 

Corrupted merged file

%TWISTY{

Revision 532017-06-06 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Changed:
<
<
Contents :
>
>
Contents :
  Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
Line: 180 to 184
 # Clean workarea rm -rf PSetTweaks/ WMCore.zip WMSandbox/
Changed:
<
<
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong. </>
<!--/twistyPlugin-->
>
>
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong. </>
<!--/twistyPlugin-->
 

Force Releasing PromptReco

Line: 276 to 280
 
9. Start notification logs to the SM in vocms0314 Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. Restart transfers SMOps
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
</>
<!--/twistyPlugin-->
 

Changing CMSSW Version

Line: 395 to 400
 

Updating T0AST when a lumisection can not be transferred.

Changed:
<
<
>
>
 %TWISTY{ showlink="Show..." hidelink="Hide" remember="on" mode="div"
Changed:
<
<
}%
>
>
}%
 
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;
Line: 948 to 945
  Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job
Changed:
<
<
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
>
>
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
 
vim list
cp modify_pset.py modify_pset_<feature>.py
Line: 1197 to 1195
 
  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt
Added:
>
>
  • The current VOC, Daniel Valbuena, has the script writing the gridmap file versioned here: (See Gitlab repo).
  • The following lines were added there to map the certificate used by our agents to the cmst0 service account.
    9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
    10.                 "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
    38.        elif p[ 'dn' ] in namesToMapToTIER0:
    39.           dnmap[ p['dn'] ] = "cmst0" 

Revision 522017-05-15 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1183 to 1185
 
  • Proxy cronjobs

</>

<!--/twistyPlugin-->
Added:
>
>

Changing the certificate mapping to access eos

<!--/twistyPlugin twikiMakeVisibleInline-->

  • The VOC is responsible for this change. This mapping is specified on a file deployed at:
    • /afs/cern.ch/cms/caf/gridmap/gridmap.txt

Revision 512017-04-21 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Changed:
<
<
Contents :
>
>
Contents :
  Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
Line: 184 to 184
 # Clean workarea rm -rf PSetTweaks/ WMCore.zip WMSandbox/
Changed:
<
<
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong.
<!--/twistyPlugin-->
>
>
Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will catch it. Also it is a good practice to save a copy of the original sandbox, just in case something goes wrong. </>
<!--/twistyPlugin-->
 

Force Releasing PromptReco

Line: 269 to 269
 
2. Start the Tier0 instance in vocms0314 Tier0
3. Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. Check that all transfers are stopped Tier0
Changed:
<
<
4.1. Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
>
>
4.1. Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
 
4.2. Check /data/Logs/General.log
| 5. | Change the config file of the transfer system to point to T0AST1. This means going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and changing the following settings to match the new head node T0AST:
  "DatabaseInstance" => "dbi:Oracle:CMS_T0AST",
  "DatabaseUser"     => "CMS_T0AST_1",
Line: 280 to 280
 
9. Start notification logs to the SM in vocms0314 Tier0
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. Restart transfers SMOps
Changed:
<
<
</>
<!--/twistyPlugin-->
>
>
</>
<!--/twistyPlugin-->
 

Changing CMSSW Version

Line: 399 to 400
 

Updating T0AST when a lumisection can not be transferred.

Added:
>
>
 %TWISTY{ showlink="Show..." hidelink="Hide" remember="on" mode="div"
Changed:
<
<
}%
>
>
}%
 
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp>
Line: 731 to 733
  </>
<!--/twistyPlugin-->
Changed:
<
<

Changing the status of T0_CH_CERN site in SSB

>
>

Changing the status of _CH_CERN sites in SSB

  %TWISTY{ showlink="Show..."
Line: 740 to 742
 mode="div" }%
Added:
>
>
To change T2_CH_CERN and T2_CH_CERN_HLT

*Please note that Tier0 Ops changing the status of T2_CH_CERN and T2_CH_CERN_HLT is an emergency procedure, not a standard one*

  • Open a GGUS Ticket to the site before proceeding, asking them to change the status themselves.
  • If there is no response after 1 hour, reply to the same ticket reporting you are changing it and proceed with the steps in the next section.

To change T0_CH_CERN

 
Changed:
<
<
  • There, you will be able to change the status of T0_CH_CERN. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of these reason can be checked here). Then click "Apply" and the procedure will be complete. Only the users in the cms-tier0-operations e-group are able to do this change.
>
>
  • There, you will be able to change the status of T0_CH_CERN/T2_CH_CERN/T2_CH_CERN_HLT. You can set Enabled, Disabled, Drain or No override. The Reason field is mandatory (the history of these reasons can be checked here). Then click "Apply" and the procedure is complete. Only users in the cms-tier0-operations e-group are able to make this change.
 
  • The status in the SSB site is updated every 15 minutes, so the change should be visible there within at most that amount of time.
Changed:
<
<
  • The documentation can be check here.
>
>
  • More extensive documentation can be checked here.
  </>
<!--/twistyPlugin-->
Line: 935 to 945
  Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job
Changed:
<
<
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
>
>
Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the feature name to modify. After that, use modify_pset.py as base to create another file which would modify the required feature, you can give it a name like modify_pset_.py. Add a call to the just created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the mentioned features at the beginning of the section.
 
vim list
cp modify_pset.py modify_pset_<feature>.py
Line: 1012 to 1022
 mode="div" }%
Changed:
<
<
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
>
>
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun
  In the CMS_T0DATASVC_PROD database check the table the first run with locked = 0 is fcsr

Revision 502017-03-21 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 685 to 682
  Within this file there is a special section for the Tier0 ops. The other sections of the file should not be modified.
Changed:
<
<
  • To actually disable flocking you should:
    • Uncomment this line:
       # <----- Uncomment here ------->
       # CERTIFICATE_MAPFILE= /data/srv/glidecondor/condor_mapfile 
    • Comment from the whitelist all the Central Production Schedds:
       # <---- Comment out all the schedds below this to disable flocking ---->
       # Adding global pool CERN production schedds for flocking
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0230.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0304.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0308.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0309.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0310.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0311.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms0303.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms026.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms053.cern.ch
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=ch/DC=cern/OU=computers/CN=vocms005.cern.ch
       # Adding global pool FNAL production schedds for flocking
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmsgwms-submit2.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmsgwms-submit1.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv217.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv218.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv219.fnal.gov
       GSI_DAEMON_NAME=$(GSI_DAEMON_NAME),/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=cmssrv248.fnal.gov 
>
>
  • To disable flocking you should locate the flocking config section:
    # Knob to enable or disable flocking
    # To enable, set this to True (defragmentation is auto enabled)
    # To disable, set this to False (defragmentation is auto disabled)
    ENABLE_PROD_FLOCKING = True
  • Change the value to False
    ENABLE_PROD_FLOCKING = False
 
  • Save the changes in the 99_local_tweaks.config file and execute the following command to apply the changes:
     condor_reconfig 
Added:
>
>
  • The negotiator has a 12h authentication cache, so the schedds do not need to re-authenticate during this period. Therefore the negotiator must be restarted (see the sketch below).
 
  • Now you can check the Schedds whitelisted to run in the Tier0 pool; the Central Production Schedds should not appear there.
     condor_config_val -master gsi_daemon_name  
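A minimal verification sketch after the reconfig (the knob and daemon names are the ones used above; condor_restart is a cleaner way to do the negotiator restart than killing the process by hand):

    condor_config_val ENABLE_PROD_FLOCKING        # should now print False
    condor_config_val -master gsi_daemon_name     # production schedds should no longer be listed
    condor_restart -negotiator                    # clears the negotiator's 12h authentication cache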

Revision 492017-03-16 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 223 to 223
 
    • If production ran in this instance before, be sure that the T0AST was backed up. Deploying a new instance will wipe it.
    • Modify the WMAgent.secrets file to point to the replay couch and t0datasvc.
    • Download the latest ReplayOfflineConfig.py from the Github repository. Check the processing to use based on the elog history.
Changed:
<
<
    • Do not use production 00_deploy.sh. Use the replays 00_deploy.sh script instead.
>
>
    • Do not use the production 00_deploy.sh. Use the replay 00_deploy.sh script instead. This is the list of changes:
      • Points to the replay secrets file instead of the production secrets file:
            WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets; 
      • Points to the ReplayOfflineConfiguration instead of the ProdOfflineConfiguration:
         sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py 
      • Uses the "tier0replay" team instead of the "tier0production" team (relevant for WMStats monitoring):
         sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py 
      • Changes the archive delay hours from 168 to 1:
        # Workflow archive delay
                    echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
      • Uses lower thresholds in the resource-control:
        ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
        
                   ./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=PyCondorPlugin
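To review the full set of differences between the two deployment scripts at any time, a plain diff is enough (file names as listed for the /data/tier0 folder elsewhere in this cookbook):

    cd /data/tier0
    diff 00_deploy.prod.sh 00_deploy.replay.sh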

  </>
<!--/twistyPlugin-->

Revision 482017-03-15 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1128 to 1130
 
EXAMPLE 2: chown -R cmst1:zh /data/certs/* 

2. certs

Added:
>
>
  • Certificates are placed on this folder. You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem
 
Added:
>
>
NOTE: serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the cronjobs section
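A quick sanity check that the renewal is actually happening (a sketch; the exact crontab entry may differ from node to node):

    crontab -l | grep -i proxy                                               # run as cmst1; look for the proxy renewal entry
    voms-proxy-info -file /data/certs/serviceproxy-vocms001.pem -timeleft    # remaining lifetime of the current proxy, in seconds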
 

5. srv

Added:
>
>
  • There you will find the glidecondor folder, used to....
  • Other condor-related folders could be found. Please check with the Submission Infrastructure operator/team what is needed and who is responsible for it.
 

6. tier0

Added:
>
>
  • Main folder for the WMAgent, containing the configuration, source code, deployment scripts, and deployed agent.
 
Changed:
<
<
$ :
>
>
File Description
00_deploy.prod.sh Script to deploy the WMAgent for production(*)
00_deploy.replay.sh Script to deploy the WMAgent for a replay(*)
00_fix_t0_status.sh
00_patches.sh
00_readme.txt Some documentation about the scripts
00_software.sh Gets the source code to use from GitHub for WMCore and the Tier0. Applies the described patches, if any.
00_start_agent.sh Starts the agent after it is deployed.
00_start_services.sh Used during the deployment to start services such as CouchDB
00_stop_agent.sh Stops the components of the agent. It doesn't delete any information from the file system or the T0AST; it just kills the processes of the services and the WMAgent components
00_wipe_t0ast.sh Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful!

(*) This script is not static. It might change depending on the version of the Tier0 used and the site where the jobs are running. Check its content before deploying. (**) This script is not static. It might change when new patches are required and when the release versions of the WMCore and the Tier0 change. Check it before deploying.

Folder Description

Cronjobs

 
Deleted:
<
<
---+++ Certificates
  • Certificates are placed on this folder.
    /data/certs/
  • You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem
 
Deleted:
<
<
NOTE
serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the next section
 
Deleted:
<
<
  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.
  </>
<!--/twistyPlugin-->

Revision 472017-03-15 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Changed:
<
<
Contents :
>
>
Contents :
  Recipes for tier-0 troubleshooting, most of them are written such that you can copy-paste and just replace with your values and obtain the expected results.
Line: 1097 to 1097
 
  • Check the sessions and see if you see any errors or something unusual.

</>

<!--/twistyPlugin-->
Added:
>
>

Commissioning of a new node

*INCOMPLETE INSTRUCTIONS: WORK IN PROGRESS 2017/03*

<!--/twistyPlugin twikiMakeVisibleInline-->

Folder's structure and permissions

  • These folders should be placed at /data/:
    # Permissions Owner Group Folder Name
    1. (775) drwxrwxr-x. root zh admin
    2. (775) drwxrwxr-x. root zh certs
    3. (755) drwxr-xr-x. cmsprod zh cmsprod
    4. (700) drwx------. root root lost+found
    5. (775) drwxrwxr-x. root zh srv
    6. (755) drwxr-xr-x. cmst1 zh tier0
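The layout above can be recreated on a fresh node with something like the following (a sketch run as root; it assumes the cmst1/cmsprod accounts and the zh group already exist, and skips lost+found since it is created by the filesystem):

    mkdir -p /data/admin /data/certs /data/cmsprod /data/srv /data/tier0
    chown root:zh    /data/admin /data/certs /data/srv   && chmod 775 /data/admin /data/certs /data/srv
    chown cmsprod:zh /data/cmsprod                       && chmod 755 /data/cmsprod
    chown cmst1:zh   /data/tier0                         && chmod 755 /data/tier0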

TIPS:

  • To get the folder permissions as a number:
    stat -c %a /path/to/file
  • To change permissions of a file/folder:
    EXAMPLE 1: chmod 775 /data/certs/
  • To change the user and/or group ownership of a file/directory:
    EXAMPLE 1: chown :zh /data/certs/
    EXAMPLE 2: chown -R cmst1:zh /data/certs/* 

2. certs

5. srv

6. tier0

$ :

---+++ Certificates

  • Certificates are placed on this folder.
    /data/certs/
  • You should copy them from another node:
    • servicecert-vocms001.pem
    • servicekey-vocms001-enc.pem
    • servicekey-vocms001.pem
    • vocms001.p12
    • serviceproxy-vocms001.pem

NOTE
serviceproxy-vocms001.pem is renewed periodically via a cronjob. Please check the next section

  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.

<!--/twistyPlugin-->

Revision 462017-02-13 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1081 to 1081
 
  • Check that the agent doesn't appear in WMStats.

</>

<!--/twistyPlugin-->
Added:
>
>

Checking what is locking a database / Cern Session Manager

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Go to this link
     https://session-manager.web.cern.ch/session-manager/ 
  • Login using the DB credentials.
  • Check the sessions and see if you see any errors or something unusual.

<!--/twistyPlugin-->

Revision 452016-11-14 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1062 to 1062
 
condor_reconfig 

</>

<!--/twistyPlugin-->
Added:
>
>

Unregistering an agent from WMStats

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Log into the agent
  • Source the environment:
     source /data/tier0/admin/env.sh  
  • Execute:
     $manage execute-agent wmagent-unregister-wmstats `hostname -f`:9999  
  • You will be prompted for confirmation. Type 'yes'.
  • Check that the agent doesn't appear in WMStats.

<!--/twistyPlugin-->

Revision 442016-11-10 - ElianaAlejandraBohorquezPuentes

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 975 to 975
  physics_skims = [ "LogError", "LogErrorMonitor" ], scenario = ppScenario)
Changed:
<
<
  • A proxy with mapping to the cms vo should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/env.sh should point to the proxy location. The proxy currently used by the T0 does not fulfill this requirement, so a proxy of one of the workflow team agents can be used temporarily. Down below, part of the information of a valid proxy is shown:
>
>
  • Jobs should be able to write to the T1 storage systems; for this, a proxy with the production VOMS role should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/env.sh should point to the proxy location. A proxy with the required role cannot be generated for a time span longer than 8 days, so a cron job should be responsible for the renewal (a renewal sketch follows the example output below). For jobs to stage out at T1s there is no need to map the Distinguished Name (DN) of the certificate to specific users at the T1 sites; the mapping is made via the role of the certificate. Such a mapping could be needed to stage out at T2 sites. Down below, the information of a valid proxy is shown:
 
Added:
>
>
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
 = VO cms extension information = VO : cms
Changed:
<
<
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues issuer : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
>
>
subject : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
 attribute : /cms/Role=production/Capability=NULL attribute : /cms/Role=NULL/Capability=NULL
Changed:
<
<
attribute : /cms/uscms/Role=NULL/Capability=NULL timeleft : 162:57:43 uri : lcg-voms2.cern.ch:15002
>
>
timeleft : 157:02:58 uri : voms2.cern.ch:15002
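A sketch of how such a proxy can be created or renewed from the host certificate (paths and requested validity are assumptions; in production this is done by the cron job mentioned above, and the /cms/Role=production attribute is what matters):

    voms-proxy-init -voms cms:/cms/Role=production -valid 192:00 \
        -cert /data/certs/servicecert-vocms001.pem -key /data/certs/servicekey-vocms001.pem \
        -out /data/certs/serviceproxy-vocms001.pem
    voms-proxy-info -all -file /data/certs/serviceproxy-vocms001.pem    # should show the /cms/Role=production attribute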
  </>
<!--/twistyPlugin-->

Revision 432016-11-08 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1036 to 1034
 
 $manage execute-agent wmagent-resource-control -p 

</>

<!--/twistyPlugin-->
\ No newline at end of file
Added:
>
>

Overriding the limit of Maximum Running jobs by the Condor Schedd

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Login as root in the Schedd machine
  • Go to:
     /etc/condor/config.d/99_local_tweaks.config  
  • There, override the limit adding/modifying this line:
     MAX_JOBS_RUNNING = <value>  
  • For example:
     MAX_JOBS_RUNNING = 12000  
  • Then, to apply the changes, run:
    condor_reconfig 
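To confirm the override took effect (knob name as above):

    condor_config_val MAX_JOBS_RUNNING    # should print the new value, e.g. 12000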

<!--/twistyPlugin-->

Revision 422016-11-01 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 1027 to 1029
  Example:
 $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
Added:
>
>
  • To change the general values
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin 

  • To see the current thresholds and use
     $manage execute-agent wmagent-resource-control -p 
 </>
<!--/twistyPlugin-->

Revision 412016-10-31 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 944 to 940
 

PromptReconstruction at T1s

Added:
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 There are 3 basic requirements to perform PromptReconstruction at T1s (and possibly T2s):

  • Each desired site should be configured in the T0 Agent Resource Control. For this, /data/tier0/00_deploy.sh file should be modified specifying pending and running slot thresholds for each type of processing task. For instance:
Line: 984 to 987
 timeleft : 162:57:43 uri : lcg-voms2.cern.ch:15002
Added:
>
>
<!--/twistyPlugin-->

Manually modify the First Conditions Safe Run (fcsr)

 %TWISTY{ showlink="Show..." hidelink="Hide"
Line: 991 to 998
 mode="div" }%
Added:
>
>
The current fcsr can be checked in the Tier0 Data Service: https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun

In the CMS_T0DATASVC_PROD database, check the reco_locked table: the lowest run with locked = 0 is the fcsr.

If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:

 update reco_locked set locked = 0 where run >= <desired_run> 
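The current fcsr can also be queried from the command line against the Tier0 Data Service URL given above (a sketch; depending on the cmsweb deployment the endpoint may require your grid proxy for authentication, and -k skips server certificate verification):

    curl -sk --cert $X509_USER_PROXY --key $X509_USER_PROXY \
        "https://cmsweb.cern.ch/t0wmadatasvc/prod/firstconditionsaferun"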

</>

<!--/twistyPlugin-->

Modify the thresholds in the resource control of the Agent

<!--/twistyPlugin twikiMakeVisibleInline-->

  • Login into the desired agent and become cmst1
  • Source the environment
     source /data/tier0/admin/env.sh 
  • Execute the following command with the desired values:
     $manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value> 
    Example:
     $manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000 
 
<!--/twistyPlugin-->
\ No newline at end of file

Revision 402016-10-28 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 760 to 761
  </>
<!--/twistyPlugin-->
Added:
>
>

Changing highIO flag of jobs that are in the condor queue

<!--/twistyPlugin twikiMakeVisibleInline-->

  • The command for doing it is (it should be executed as cmst1):
    condor_qedit <job-id> Requestioslots "0" 
  • A base snippet to do the change. Feel free to adapt it to your particular needs (changing the attribute, filtering the jobs to be modified, etc.):
     for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
             do
                 condor_qedit $job Requestioslots "0"
             done 
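If the jobs can be selected by a ClassAd expression instead of an explicit list of IDs, a single constrained condor_qedit may be simpler (a sketch; verify the constraint with condor_q first):

    condor_qedit -constraint 'JobStatus == 1' Requestioslots "0"    # e.g. all currently idle jobs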

<!--/twistyPlugin-->
 

Updating workflow from completed to normal-archived in WMStats

%TWISTY{

Line: 935 to 955
 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50 $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
Changed:
<
<
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20
>
>
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20
 
  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:
Deleted:
<
<
 
datasets = [ "DisplacedJet" ]
Line: 952 to 970
  siteWhitelist = [ "T1_IT_CNAF" ], dqm_sequences = [ "@common" ], physics_skims = [ "LogError", "LogErrorMonitor" ],
Changed:
<
<
scenario = ppScenario)
>
>
scenario = ppScenario)
 
  • A proxy with mapping to the cms vo should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/env.sh should point to the proxy location. The proxy currently used by the T0 does not fulfill this requirement, so a proxy of one of the workflow team agents can be used temporarily. Down below, part of the information of a valid proxy is shown:
Deleted:
<
<
 
=== VO cms extension information ===
VO        : cms
Line: 966 to 982
 attribute : /cms/Role=NULL/Capability=NULL attribute : /cms/uscms/Role=NULL/Capability=NULL timeleft : 162:57:43
Changed:
<
<
uri : lcg-voms2.cern.ch:15002
>
>
uri : lcg-voms2.cern.ch:15002
  %TWISTY{ showlink="Show..."

Revision 392016-10-07 - ElianaAlejandraBohorquezPuentes

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 919 to 919
 ./modify_several_jobs.sh

</>

<!--/twistyPlugin-->
\ No newline at end of file
Added:
>
>

PromptReconstruction at T1s

There are 3 basic requirements to perform PromptReconstruction at T1s (and possibly T2s):

  • Each desired site should be configured in the T0 Agent Resource Control. For this, /data/tier0/00_deploy.sh file should be modified specifying pending and running slot thresholds for each type of processing task. For instance:

$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20

  • A list of possible sites where the reconstruction is wanted should be provided under the parameter siteWhitelist. This is done per Primary Dataset in the configuration file /data/tier0/admin/ProdOfflineConfiguration.py. For instance:

datasets = [ "DisplacedJet" ]

for dataset in datasets:
    addDataset(tier0Config, dataset,
               do_reco = True,
               raw_to_disk = True,
               tape_node = "T1_IT_CNAF_MSS",
               disk_node = "T1_IT_CNAF_Disk",
               siteWhitelist = [ "T1_IT_CNAF" ],
               dqm_sequences = [ "@common" ],
               physics_skims = [ "LogError", "LogErrorMonitor" ],
               scenario = ppScenario)

  • A proxy with mapping to the cms vo should be provided at /data/certs/. The variable X509_USER_PROXY defined at /data/tier0/admin/env.sh should point to the proxy location. The proxy currently used by the T0 does not fulfill this requirement, so a proxy of one of the workflow team agents can be used temporarily. Down below, part of the information of a valid proxy is shown:

=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 162:57:43
uri       : lcg-voms2.cern.ch:15002

<!--/twistyPlugin twikiMakeVisibleInline-->

<!--/twistyPlugin-->

Revision 382016-10-03 - JohnHarveyCasallasLeon

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 239 to 243
 
# Instruction Responsible Role
0. Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section Tier0
1. Deploy the new prod instance in vocms0314, check that we use: Tier0
Added:
>
>
1.5. Check the ProdOfflineConfiguration that is being used Tier0
 
2. Start the Tier0 instance in vocms0314 Tier0
3. Coordinate with Storage Manager so we have a stop in data transfers, respecting run boundaries (Before this, we need to check that all the runs currently in the Tier0 are ok with bookkeeping. This means no runs in Active status.) SMOps
4. Check that all transfers are stopped Tier0
Line: 251 to 256
 
7.

Restart transfer system using:

A)

t0_control restart (will erase the logs)

B)

t0_control stop

t0_control start (will keep the logs)

Tier0
8. Kill the replay processes (if any) Tier0
9. Start notification logs to the SM in vocms0314 Tier0
Changed:
<
<
10. Restart transfers SMOps
11. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
>
>
10. Change the configuration for Kibana monitoring pointing to the proper T0AST instance. Tier0
11. Restart transfers SMOps
 </>
<!--/twistyPlugin-->

Changing CMSSW Version

Line: 886 to 887
 In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given the workflow configuration is different for different Streams.

Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:

Deleted:
<
<
 
feature=maxRSS
Changed:
<
<
value=15360000
>
>
value=15360000
  Executing generate_code.sh would create a script named after the feature, like modify_wmworkload_maxRSS.py. The later will modify the selected feature in the Workflow Sandbox.
Line: 906 to 904
 vim generate_code.sh ./generate_code.sh vim modify_one_workflow.sh
Changed:
<
<
./modify_several_workflows.sh
>
>
./modify_several_workflows.sh
 

Modifying the Job Description

Line: 920 to 916
 cp modify_pset.py modify_pset_.py vim modify_pset_.py vim modify_one_job.sh
Changed:
<
<
./modify_several_jobs.sh
>
>
./modify_several_jobs.sh
  </>
<!--/twistyPlugin-->

Revision 372016-09-22 - ElianaAlejandraBohorquezPuentes

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 854 to 854
 mode="div" }%
Changed:
<
<
Depending of the feature you want to modify, you would need to change the config of the job or the config of the workflow. Some scripts are already available to do this, provided the cache directory of the job, the feature to modify and its value.
>
>
Some scripts are already available to do this, provided with:
  • the cache directory of the job (or location of the job in JobCreator),
  • the feature to modify
  • and the value to be assigned to the feature.

Depending on the feature you want to modify, you would need to change:

  • the config of the single job (job.pkl),
  • the config of the whole workflow (WMWorkload.pkl),
  • or both.

We have learnt by trial and error which variables and files need to be modified to get the desired result, so you would need to do the same depending on the case. Below we show some basic examples of how to do this:

Some cases have shown that you need to modify the Workflow Sandbox when you want to change the following variables:

  • Memory thresholds (maxRSS, memoryRequirement)
  • Number of processing threads (numberOfCores)
  • CMSSW release (cmsswVersion)
  • SCRAM architecture (scramArch)

Modifying the job description has proven to be useful to change the following variables:

  • Condor ClassAd of RequestCpus (numberOfCores)
  • CMSSW release (swVersion)
  • SCRAM architecture (scramArch)
 
Changed:
<
<
At /afs/cern.ch/user/e/ebohorqu/public/scripts/modifyConfigs/, enter to the directory of the feature to modify, e.g. maxRSS if you desire to modify the memory threshold.
>
>
At /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs there are two directories named "job" and "workflow". You should enter the respective directory. Follow the instructions below on the agent machine in charge of the jobs to modify.
 
Changed:
<
<
To modify the Workflow Sandbox use next scripts:
>
>

Modifying the Workflow Sandbox

 
Changed:
<
<
modify_wmworkload.py
modify_one_workflow.sh
modify_several_workflows.sh
>
>
Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/workflow
  In a file named "list", list the jobs you need to modify. Follow the procedure for each type of job/task, given the workflow configuration is different for different Streams.
Changed:
<
<
Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. This should be done in the agent machine. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:
>
>
Use script print_workflow_config.sh to generate a human readable copy of WMWorkload.pkl. Look for the name of the variable of the feature to change, for instance maxRSS. Now use the script generate_code.sh to create a script to modify that feature. You should provide the name of the feature and the value to be assigned, for instance:
 
feature=maxRSS
value=15360000
Changed:
<
<
Executing generate_code.sh would create a script named as the feature, like modify_wmworkload_maxRSS.py. The later will modify the selected feature in the Workflow Sandbox.
>
>
Executing generate_code.sh would create a script named after the feature, like modify_wmworkload_maxRSS.py. The latter will modify the selected feature in the Workflow Sandbox.
Once generated, you need to add a call to that script in modify_one_workflow.sh. The latter will call all the required scripts, create the tarball and place it where required (the Specs folder).

Finally, execute modify_several_workflows.sh which will call modify_one_workflow.sh for all the desired workflows.

Added:
>
>
The previous procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:

vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh

Modifying the Job Description

Go to next folder: /afs/cern.ch/work/e/ebohorqu/public/scripts/modifyConfigs/job

Very similar to the procedure to modify the Workflow Sandbox, add the jobs to modify to "list". You could and probably should read the job.pkl for one particular job and find the name of the feature to modify. After that, use modify_pset.py as a base to create another file that modifies the required feature; you can give it a name like modify_pset_<feature>.py. Add a call to the newly created script in modify_one_job.sh. Finally, execute modify_several_jobs.sh, which calls the other two scripts. Notice that there are already files for the features mentioned at the beginning of the section.

vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh
  </>
<!--/twistyPlugin-->
\ No newline at end of file

Revision 362016-08-04 - ElianaAlejandraBohorquezPuentes

Line: 1 to 1
 
META TOPICPARENT name="CompOpsTier0Team"

Cookbook

Line: 844 to 844
 
  • puppet agent -tv

</>

<!--/twistyPlugin-->
\ No newline at end of file
Added:
>
>

Modifying jobs to resume them with other features (like memory, disk, etc.)

<!--/twistyPlugin twikiMakeVisibleInline-->

Depending on the feature you want to modify, you would need to change the config of the job or the config of the workflow. Some scripts are