Tier-0 Replays

Description

Tier-0 replays are how the Tier-0 team validates changes before they are integrated into the production infrastructure. An infrastructure similar to the production WMAgent is deployed, and injection messages are simulated using the transfer system software. Workloads for Express, Repack and PromptReco workflows are run to completion, which gives an opportunity to observe the effect of the changes and to ensure that they introduce no errors into the system.

Requirements

  • Tier-0 WMAgent deployed.
  • Transfer system not connected to Storage Manager.
  • Streamer files (On disk) and corresponding injection messages.

Infrastructure

  • All processing is EOS based. The test uses the infrastructure in the same way as the production Tier-0.
  • The data store location is configurable and can be /store/backfill/1 or /store/backfill/2.
  • Test instances are currently connected to the production pool.

Use cases

Tier-0 replays are used to test changes in the infrastructure and software. The following are the most common use cases:

Configuration checklist

To ensure that the changes are properly tested, make sure the following items are clear before starting the replay:

  • CMSSW release.
  • Global Tag(s) to use.
  • WMAgent Tier-0 release to use and patches.
  • Special requirements on the type of input data to use (e.g. High PU, VdM scan data).

Procedure

Before you start, check with other team members that the VM you are going to use is not running another test (i.e. check the Elogs). The Elogs will also tell you the last processing version used for that instance, so you can configure the ReplayOfflineConfiguration.py file accordingly.

Elog records

Create an Elog entry when you start the replay and another when you finish it. You do not need to report every stage the replay goes through, but you can post an update in the Elog about important issues or if you need help.

How to start a replay

  • Log in to a test instance (vocms001, vocms047 or vocms015) and switch to the cmst1 account:
     sudo -u cmst1 /bin/bash 
  • Keep in mind which CMS_T0AST you are going to use. Check WMAgentOracleAccounts and then configure the secrets file accordingly.
  • Go to the folder /data/tier0/. There you will find several scripts that allow you to manage the replay.
  • Check whether a former couch/wmagent process is still running:
 ps aux | egrep 'wmcore|couch' 
  • If so, you can kill them with the following script. Be careful: first confirm via Elog or directly with the team members that you can do it:
./00_stop_agent.sh 
  • Run the script to clone the WMCore and T0 repositories with the respective patches:
./00_software.sh 
  • Run the script to deploy the Tier-0 WMAgent:
./00_deploy.sh 
  • Start the services (CouchDB) and the WMAgent:
./00_start_agent.sh
  • Check the replay injection configurations:
vim /data/tier0/admin/ReplayOfflineConfiguration.py
cd /data/tier0/admin
curl -O https://raw.githubusercontent.com/dmwm/T0/master/etc/ReplayOfflineConfiguration.py

  • Start the 'transfer system'. Go to the replayinject area:
cd /data/tier0/replayinject/
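
The software, deploy and start steps in the list above can be sketched as a single wrapper that stops at the first failure. This is only an illustration: the function name bring_up_agent is hypothetical, and it assumes that the three scripts return a non-zero exit status when they fail, which the page does not state. Stopping a previous agent is deliberately left out, since that needs confirmation from the team first.

```shell
# Sketch: run the agent bring-up scripts in order, stopping at the first failure.
# Assumptions: the scripts live in the directory given as $1 (default /data/tier0)
# and exit with a non-zero status on failure.
bring_up_agent() {
    script_dir=${1:-/data/tier0}
    for step in 00_software.sh 00_deploy.sh 00_start_agent.sh; do
        echo "Running $step"
        # Run each script from inside its own directory, as in the manual steps
        if ! ( cd "$script_dir" && "./$step" ); then
            echo "Step $step failed; stopping here" >&2
            return 1
        fi
    done
    echo "Agent bring-up finished"
}

# Usage (commented out so that pasting this block does not start anything):
# bring_up_agent /data/tier0
```

Running the steps by hand remains the safer choice while a replay configuration is still being adjusted, because each script's output can be inspected before moving on.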

When the Tier0Injector workers are restarted, you will resend the file that resend.txt points to (use ls -l to find out which file it currently points to). You can change the target by doing:

ln -sfn Run<run_number>.txt resend.txt
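
As a small sketch of the same step, the function below refuses to repoint resend.txt when the run file is missing and prints the resulting link target. The name point_resend is hypothetical, and the run number in the usage line is a placeholder.

```shell
# Sketch: point resend.txt at the message file of a given run and show the result.
# Assumption: run files are named Run<run_number>.txt, as in the ln command above.
point_resend() {
    inject_dir=$1
    run_number=$2
    run_file="Run${run_number}.txt"
    if [ ! -f "$inject_dir/$run_file" ]; then
        echo "No $run_file in $inject_dir" >&2
        return 1
    fi
    # Create a relative link, the same as running ln inside the directory
    ( cd "$inject_dir" && ln -sfn "$run_file" resend.txt ) || return 1
    # Print the link target to confirm what will be resent
    readlink "$inject_dir/resend.txt"
}

# Usage (123456 is a placeholder run number):
# point_resend /data/tier0/replayinject 123456
```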

If you want to inject a run that is not in the /data/tier0/replayinject directory, you need to copy it from the TransferSystem on vocms001. Log in to vocms001 as cmst1, cd to /data/TransferSystem and execute the following command:

cat Logs/General.log* | grep "'Tier0Inject' => '1'" | grep "'RUNNUMBER' => '<run_number>'" > Run<run_number>.txt

and then copy this file to /data/tier0/replayinject on the replay machine via your afs home directory using scp. Then you can create the softlink to this new run file.
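
The extraction command can be wrapped so that an empty result is caught before the file is copied anywhere. This is a sketch under assumptions: the function name extract_run_messages is hypothetical, and it writes Run<run_number>.txt into the current directory, as the command above does.

```shell
# Sketch: collect the injection messages of one run from the General.log* files.
# Arguments: the log directory (e.g. Logs) and the run number.
extract_run_messages() {
    log_dir=$1
    run_number=$2
    out_file="Run${run_number}.txt"
    # Same filter as the manual command: injected messages for this run only
    cat "$log_dir"/General.log* \
        | grep "'Tier0Inject' => '1'" \
        | grep "'RUNNUMBER' => '${run_number}'" > "$out_file"
    # An empty file means the run was not found in these logs
    if [ ! -s "$out_file" ]; then
        echo "No injection messages found for run $run_number" >&2
        rm -f "$out_file"
        return 1
    fi
    echo "Wrote $out_file"
}

# Usage, from /data/TransferSystem (123456 is a placeholder run number):
# extract_run_messages Logs 123456
```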

Then, you can restart the transfer system. This will kill the former processes and create new ones. If the replay is already running and you want to inject a new Run, you should omit this step. If it is the first run you inject, it is mandatory.

./t0_control.sh restart

Reinject the logs (this will reinject resend.txt):

./t0_resend.sh start

You can check the logs at /data/tier0/replayinject/Logs/Tier0Injector to see how the injection goes.

You can also check the component logs. The components are located in:

 /data/tier0/srv/wmagent/current/install/tier0 

The following components are used for the replay:

  • AlertProcessor

How do I know if a replay is done?

There are 2 ways:

  • Check the component logs: nothing pending to be created in JobCreator, nothing pending for submission in JobSubmitter, and all workflows deleted by TaskArchiver. This is not 100% accurate, so use the following procedure instead.
  • Log in to the Oracle database (via sqlplus from lxplus; you can also use SQL Developer) and do:
select name from wmbs_fileset;

If the query returns no rows, the replay is done. Also check that condor_q is empty and that there are no paused jobs.
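
The two checks can be combined once their outputs are saved to files. The sketch below assumes that the query result was written to a file containing only fileset names (no headers or row-count feedback) and that the remaining condor jobs were listed one per line, for example with condor_q -format "%d\n" ClusterId. The function names are hypothetical, and paused jobs are not covered and still need to be checked separately.

```shell
# Sketch: decide whether a replay looks finished from two saved outputs.

# Count the lines of a file that contain something other than whitespace.
count_nonblank() {
    # grep -c prints the count but exits non-zero when it is 0, hence "|| true"
    grep -c '[^[:space:]]' "$1" || true
}

replay_looks_done() {
    filesets_file=$1   # rows returned by: select name from wmbs_fileset;
    jobs_file=$2       # one line per job still known to condor
    [ -f "$filesets_file" ] && [ -f "$jobs_file" ] || {
        echo "Missing input file" >&2
        return 2
    }
    filesets=$(count_nonblank "$filesets_file")
    jobs=$(count_nonblank "$jobs_file")
    echo "filesets left: $filesets, condor jobs left: $jobs"
    # Succeeds only when both counts are zero
    [ "$filesets" -eq 0 ] && [ "$jobs" -eq 0 ]
}

# Usage (file names are placeholders):
# replay_looks_done filesets.txt jobs.txt && echo "Replay looks done"
```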

How to stop a replay

If the replay is done or you need to kill it, you can stop the Tier-0 instance with this script:

./00_stop_agent.sh

If there are jobs in condor, they need to be removed. Warning: the following command kills ALL the jobs in the scheduler, so please be careful:

condor_rm -all
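
Because the removal cannot be undone, one possible safeguard is to list the jobs and ask for confirmation first. This is a sketch, not part of the Tier-0 scripts: the function name remove_all_jobs is hypothetical, and it only uses condor_q and condor_rm -all as they appear on this page.

```shell
# Sketch: show the jobs in the scheduler and remove them only after confirmation.
remove_all_jobs() {
    echo "Jobs currently in the scheduler:"
    condor_q || return 1
    printf 'Remove ALL of them? Type "yes" to confirm: '
    read -r answer
    if [ "$answer" = "yes" ]; then
        condor_rm -all
    else
        echo "Nothing removed"
    fi
}

# Usage (commented out so that pasting this block does not remove anything):
# remove_all_jobs
```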

FAQ

  • How do I make the streamers available for the replay?
    • Currently the streamer files for 209, 210, 211 and 212 are being transferred to EOS so that you can use them for large-scale replays. More recent commissioning runs are stored on tape under /store/t0streamer/, but you can use them for testing (not large scale) as they are in the disk pools. For Run2 the files will eventually be stored in EOS, so you will not need to worry about staging to disk. The list of old archived runs is in CompOpsTier0TeamArchives. To use these streamer files for a replay, they must first be staged to disk in the T0Export pool. The easiest way to do this is to get the list of files using nsls in the desired streamer directory and then run stager_get on these file lists.
Topic revision: r10 - 2015-07-14 - BrandonLeighAllen
 