Difference: WernerJank (1 vs. 16)

Revision 16: 2015-05-23 - IvanGlushkov

META TOPICPARENT name="Sandbox.CMST0CSA06OperationsGuide"

Tier0 Operations, Monitoring and Status displays Guide

Introduction

For installation, configuring, starting and stopping the system, consult the FAQ!
Revision 15: 2006-11-23 - unknown

Tier0 Operations, Monitoring and Status displays Guide

Introduction

For installation, configuring, starting and stopping the system, consult the FAQ!
For the CSA06 exercise, the Tier0 operation is based on a number of components:
  • Logger: central recording and steering of all activities
  • FileFeeder: finds available files in Castor and feeds them into the Prompt Reconstruction
  • PR Manager: Prompt Reconstruction Manager, managing the PR Workers
  • PR Worker: Prompt Reconstruction Worker, running the actual reconstruction code and configuration according to its role
  • Export Manager: handles the injections into PhEDEx
  • PhEDEx itself, which loads the export buffer with appropriate traffic according to what the Export Manager tells it to do

All components are independent of each other and can be started and stopped as the need arises (as of today, restarting the Logger will likely lose information; to be fixed by Tony?!).
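Because the components are independent, it is handy to check which ones are still alive before restarting anything. The sketch below is not part of the T0 distribution; it simply greps the process table for the component script names used in the start-up section.

```shell
# Illustrative sketch (not part of the T0 distribution): check whether a
# component's process is still alive before restarting it.
is_running() {
  # pgrep -f matches against the full command line
  pgrep -f "$1" > /dev/null
}

for comp in LoggerReceiver.pl FileFeeder.pl PromptReconstructionManager.pl ExportManager.pl; do
  if is_running "$comp"; then
    echo "$comp is running"
  else
    echo "$comp is NOT running"
  fi
done
```

Run it under the cmsprod account on lxgate39 to see at a glance which steering components need restarting.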

All CSA06 Tier0 operations are run with the loginID cmsprod (for the password, contact Tony, Nick or Werner). All required software is installed in ~cmsprod/public/T0, with the exception of the required Perl and ApMon modules installed in /afs/cern.ch/user/w/wildish/public/perl ($T0ROOT/env.sh will set up your PERL5LIB accordingly).

All steering and managing components run on lxgate39.

All worker components run on dedicated worker nodes (dual CPU 2.8GHz with SLC4 32-bit mode) as batch jobs, using the LSF-queue cmscsa06.

Installation of the system

All CSA06 computing tasks should use CMSSW_1_0_x. Currently, CMSSW_1_0_3 is installed, and this is the version the following instructions refer to! Other versions may be installed in parallel as needed.

The code has to be installed on shared disk space, visible from all worker nodes, with ~cmsprod/public/T0 as the BASE_DIR. The following tasks have to be performed to get the code to a runnable state:

  • Checkout the code from the CVS repository
  • Install the requisite Perl modules
  • Configure the system

Castor usage

Castor2 is used with several disk pools, configured both in size and functionality as required.
  • t0input 65TB on 13 servers, no tape, garbage collection disabled
  • t0export 80TB on 16 servers, with tape
  • cmsprod 22TB on 4 servers
The "RAW-data" input files are read from /castor/cern.ch/cms/T0Prototype/Input located in the t0input disk pool. The RECO output files are written to /castor/cern.ch/cms/store/CSA06/????? All options in the configuration file have to be set correctly in order to select the correct disk pool and path!

Checking out the code

Running in a /bin/bash shell, you can check out the T0 code as follows:
#CMSSW Version and installation dir
export CMSSW_VERS=CMSSW_1_0_3
export CMSSW_BASE_DIR=~cmsprod/public/CSA06
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_BASE_DIR

#Check out from cmscvs
project CMSSW
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_VERS/src
mkdir T0
cd T0
cvs co COMP/T0

# Set environment
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
scramv1 runtime -csh | tee $T0ROOT/runtime.csh
scramv1 runtime -sh | tee $T0ROOT/runtime.sh
scramv1 runtime -sh | tee $T0ROOT/runtime_pr.sh

#Prompt Reconstruction application
cd $CMSSW_BASE_DIR
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
cmscvsroot CMSSW
cvs co Configuration/Examples/data/RECO081.cfg
# ..following two patches needed for CMSSW_1_0_0 (fix should be in 1_0_1..)!
cvs co -r HEAD Configuration/CompatibilityFragments/data/RecoLocalEcal.cff
cvs co -r 1.15 Configuration/Examples/data/RECO.cff
cp Configuration/Examples/data/RECO081.cfg $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl

You need to edit $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl and set the input fileNames, the maxEvents, and the output fileName as follows, wherever they appear in the configuration file:

  • untracked vstring fileNames = {'T0_INPUT_FILE'}
  • untracked int32 maxEvents = T0_MAX_EVENTS
  • untracked string fileName = "file:T0_OUTPUT_FILE"
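The placeholders are presumably filled in by plain text substitution when a job is created. The sed sketch below illustrates the effect; the substituted values are made up for illustration, and the real worker does this internally rather than via sed.

```shell
# Illustrative only: show the effect of substituting the template
# placeholders. The real worker performs the equivalent internally.
cat > /tmp/Reco.cfg.tmpl <<'EOF'
untracked vstring fileNames = {'T0_INPUT_FILE'}
untracked int32 maxEvents = T0_MAX_EVENTS
untracked string fileName = "file:T0_OUTPUT_FILE"
EOF

# the input path and output name below are made-up examples
sed -e "s|T0_INPUT_FILE|rfio:/castor/cern.ch/cms/T0Prototype/Input/raw_example.root|" \
    -e "s|T0_MAX_EVENTS|-1|" \
    -e "s|T0_OUTPUT_FILE|reco_example.root|" \
    /tmp/Reco.cfg.tmpl > /tmp/Reco.cfg

cat /tmp/Reco.cfg
```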

Configuring the components

A single configuration file $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component, with the exception of the FileFeeder $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf, which we prefer to keep separate for more frequent changes.

After making a change, the syntax of the configuration file should be checked with perl -c $modified_config_file.

Changing the configuration file while the components are running is possible, as they will re-execute it and update themselves, but for obvious reasons this should be done with care! Note, however, that the Worker components only read their configuration at startup, and then only really need it to find out where their Loggers and Managers are. Once they have connected to the Manager, that Manager sends them the configuration they will use, and sends them any updates to it too. So only the configuration file(s) used by the Managers really matter: if you stop/start workers with a different configuration file, it won't make any difference to what they do (provided they connect to the same Managers!)

A detailed description of how to change the configuration file is in the July prototype configuration; here only an overview of the most common changes is given.

Change of configuration parameters

It is assumed that the configuration files are configured such that they can be used directly. Here only a few possible changes are shown, needed for special cases such as changes of input/output path names, feeding rates, export rates to Tier1s, etc.

The Logger

The Logger writes a logfile if you have one set in the Logfile parameter in the Logger::Receiver section. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
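The rotation described above can be sketched as follows. This is only an illustration of the behaviour, not the Logger's actual code, and the file path is made up:

```shell
# Illustrative sketch of the ".YYYYMMDD" rotation described above
# (not the Logger's actual implementation).
LOGFILE=/tmp/example_logger.log
echo "some log line" >> "$LOGFILE"

# at midnight the Logger effectively does:
mv "$LOGFILE" "$LOGFILE.$(date +%Y%m%d)"
: > "$LOGFILE"   # continue logging to a fresh file

ls "$LOGFILE" "$LOGFILE".*
```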

The FileFeeder

The configuration for the FileFeeder is maintained in a separate file.

The Prompt Reconstruction components

Prompt Reconstruction has three components, the PromptReco::Manager, the PromptReco::Worker, and the PromptReco::Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger.

For PromptReco::Worker, the only supported Mode at the moment is LocalPull. Actually, that's not strictly true: all modes are supported, but LocalPull is the only one tested. Classic should also work, but LocalPush won't until/unless someone writes the bits that push the files in the first place!

Leave TargetDirs as it is (a single entry consisting of '.') to write the RECO output in the job's local working directory. As elsewhere, the only TargetMode supported at the moment is RoundRobin.

Set MaxEvents to the maximum number of events to process; to process all events, set it to '-1' instead.

The DataDirs and LogDirs paths should be set to an RFIO-accessible directory that exists (maybe Tony will fix this one day?), preferably on Castor2 so the files are in a persistent store; SvcClass has to be set as appropriate.

Operating the Tier0

Starting the components

Some of the components are fussy about being started in the correct order, though that will change as the code improves. In all cases, the only argument you need to specify is the configuration file, with '--config $file'. Also, in all cases, if the Host is specified for that component, you must run it there or it will abort with an appropriate error message. For everything except the workers, just starting the task in a terminal window is good enough. I use screen to create persistent sessions that I can connect to from home or from the office; see http://wildish.home.cern.ch/wildish/UseScreen.html for a 1-minute tutorial on screen if you're interested.

To start the full system, the components should be started in this order:

#logon to machine where servers run..
ssh cmsprod@lxgate39
...give the password
# Set environment
export CMSSW_VERS=CMSSW_1_0_0
export CMSSW_BASE_DIR=~cmsprod/public/T0
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
#Working dir
export T0_WORK_DIR=/data/csa06
cd $T0_WORK_DIR
#Create dirs for log-files
mkdir -p ${T0_WORK_DIR}/Logs/Logger
mkdir -p ${T0_WORK_DIR}/Logs/FileFeeder
mkdir -p ${T0_WORK_DIR}/Logs/PromptRecoManager
mkdir -p ${T0_WORK_DIR}/Logs/ExportManager
# Start the components
#The Logger
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/Logger/LoggerReceiver.log 2>&1 &
#The FileFeeder
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder > ${T0_WORK_DIR}/Logs/FileFeeder/FileFeeder.log 2>&1 &
#The PR Manager
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/PromptRecoManager/PromptRecoManager.log 2>&1 &
#The PR Workers (e.g. start 20 jobs...)
for i in `seq 1 20`; do 
bsub -q cmscsa06  -R 'type=SLC4' $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh
sleep 5
done
#The Export Manager
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/ExportManager/ExportManager.log 2>&1 &

The script $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day).
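As a rough guide to what that editing amounts to, the sketch below builds a stand-in wrapper in a temporary directory: it sources the runtime environment captured during installation and then hands over to the worker. The file names and layout here are assumptions for illustration, not the contents of the distributed script.

```shell
# Hypothetical sketch of what a worker wrapper needs to do: source the
# runtime environment, then exec the worker. Stub files in a temporary
# directory stand in for runtime_pr.sh and the worker itself.
T0STUB=$(mktemp -d)

echo 'export EXAMPLE_RUNTIME=set' > "$T0STUB/runtime_pr.sh"
mkdir -p "$T0STUB/src/PromptReconstruction"
cat > "$T0STUB/src/PromptReconstruction/Worker.sh" <<'EOF'
#!/bin/sh
echo "worker started with EXAMPLE_RUNTIME=$EXAMPLE_RUNTIME config=$2"
EOF
chmod +x "$T0STUB/src/PromptReconstruction/Worker.sh"

# the wrapper: pick up the environment, then hand over to the worker
cat > "$T0STUB/wrapper.sh" <<EOF
#!/bin/sh
. "$T0STUB/runtime_pr.sh"
exec "$T0STUB/src/PromptReconstruction/Worker.sh" --config "\$1"
EOF
chmod +x "$T0STUB/wrapper.sh"

"$T0STUB/wrapper.sh" /path/to/CSA06.conf > /tmp/wrapper_out.txt
cat /tmp/wrapper_out.txt
```

The real script would source $T0ROOT/runtime_pr.sh and exec the real worker; the "editing" mentioned above is mostly adjusting those paths.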

 

Watching the running system

Global plots

Revision 14: 2006-10-23 - TonyWildish
Revision 10: 2006-10-06 - unknown
Reporting bugs/problems

Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local issues with our specific dedicated hardware or conceptual issues in the architecture of the system. Bugs or feature requests should be reported using Savannah, at https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.

Contacts

Name            Office  Mobile
Tony Wildish    77103   ?
Nick Sinanis    79881   160516
Zhechka Toteva  71604
Jens Rehn       71606
Dirk Hufnagel   71704
Lassi Tuura     71542
Werner Jank     71580   160512
 -- Main.jank - 06 Oct 2006

Revision 9: 2006-10-06 - unknown
 
Export plots (PhEDEx)
 
Logfile(s)
All components produce output with variable levels of verbosity; see the configuration file syntax for details. The most important information comes from the Logger, which collects and prints the log-info from all components.

The logfiles of all components are written on lxgate39.cern.ch:/data/CSA06/logs/component/version/channel, where component is e.g. PR or Alca, version is e.g. 102 or 103, and channel is e.g. EWKSoup, ExoticSoup, HLTSoup, Jets, minbias, SoftMuon, TTbar, Wenu or ZMuMu.

The log-files can either be retrieved with rfio (an lxplus account is required!) or listed directly by logging into lxgate39.cern.ch (a special registration is required!):
tail -f /data/CSA06/logs/Logger.log
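Besides tail -f, it can be handy to scan a logfile for trouble. The sketch below is illustrative: the sample log lines and the error keywords are made up, and a file in /tmp stands in for the real log path.

```shell
# Illustrative: scan a Logger-style logfile for likely trouble. The
# sample lines below stand in for /data/CSA06/logs/Logger.log.
LOG=/tmp/sample_logger.log
cat > "$LOG" <<'EOF'
2006-11-23 10:00:01 PromptReco::Worker file processed OK
2006-11-23 10:00:05 ERROR PromptReco::Worker rfcp failed
2006-11-23 10:00:09 Logger heartbeat
EOF

# case-insensitive match for common failure keywords, with line numbers
grep -in -e error -e fail -e abort "$LOG"
```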
Useful Twiki pages

Stopping the system

You can just kill everything if you like. You also need to clean the castor fileservers and flush the PhEDEx subscriptions to make a clean start.

If you want to stop the system cleanly, you can kill the Storage Manager and Repack Manager servers. The workers will talk to them again when they've finished their current tasks, and will then exit because they're no longer there. In principle, the components can be stopped and restarted without having to stop/restart the whole system, but that seems not to be working properly yet.

To clean castor, you can use $T0ROOT/src/Utilities/CleanupCastor.pl. Give it a '--directory $castor_path' argument, and it will wipe the files in that directory, recursively. There is no protection, so don't give it a directory containing real data! This script does a stager_rm on a group of files at a time, which is fast, but means that the garbage collector has to kick in to finally remove them. That can take a minute or two. If you only have a few files in the directory, you can add the '--hammer' argument, which really clobbers them by doing stager_rm followed by nsrm. That will take too long if you have more than a few dozen files, so run it the normal way first.
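The batched stager_rm behaviour can be mimicked in the shell with xargs. The sketch below only echoes the castor commands instead of executing them, and the file names are made up; it illustrates the batching idea, not the script's actual code.

```shell
# Illustrative only: the "group of files at a time" behaviour of
# CleanupCastor.pl resembles feeding a file list to stager_rm in
# batches. Here the castor commands are echoed rather than executed.
printf '/castor/example/f%d\n' 1 2 3 4 5 > /tmp/files_to_clean.txt

# stager_rm takes one -M option per file; batch 2 files (4 tokens)
# per echoed invocation
sed 's|^|-M |' /tmp/files_to_clean.txt \
  | xargs -n 4 echo stager_rm > /tmp/cleanup_commands.txt

cat /tmp/cleanup_commands.txt
```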

 

Troubleshooting the system

For checking progress or debugging of the applications, there are several more or less intrusive possibilities.
  • bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
  • lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
  • A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
  • To check which batch nodes are in production and which are in maintenance, on lxplus the command
    CDBHosts -cl lxbatch -q "clustersubname='cmscsa06'" -data "hostname,get_value(hostname,'/system/network/interfaces/eth0/switchmedium'),state"
    can be used.
 
-- Main.jank - 06 Oct 2006

Revision 7: 2006-10-05 - unknown
Revision 62006-09-22 - unknown

Line: 1 to 1
 
META TOPICPARENT name="Sandbox.CMST0CSA06OperationsGuide"

CSA06 Tier0 Operations Guide

Line: 91 to 91
  A detailed description of how to change the configuration file is in July prototype configuration, here only an overview of the most common changes is given.
Changed:
<
<

Mandatory changes to configuration parameters

You must set the T0::System{Name} to something globally unique, such as JulyProto$your_name. If you don't do this, and you run the servers on the same machine as someone else, you run the risk of sending indistinguishable data to mona lisa. This would not be good!

You must set the Port values for all components that have one. Pick a random number as a base, and work up sequentially from there. There are no preferred or required values, you simply need to avoid collisions with other users on the same machines. You will only need one or two dozen ports at most, use your unix user-id if you can't think of a random number!

You must set the Host values for all components that have one, to match the host that you are running them on. The components can all run on the same or different hosts. The servers are not heavyweight things, and can all coexist on lxgate39, for example, or you can use lxcmsa. Several instances from several users can run on a machine without unduly loading the system.

For a first run of the system, set the Quiet flag to zero and the Verbose flag to one for all components. Don't set the Debug flag to one unless you like hieroglyphics.

>
>

Change of configuration parameters

It is assumed that the configuration files are configured such that they can be used direclty. Here only a few possible changes are shown, needed for special cases such as changes of input/output path names, feeding rates, export rates to Tier1's, etc.
 

The Logger

Changed:
<
<
The Logger writes a logfile if you have one set in the Logfile parameter. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
>
>
The Logger writes a logfile if you have one set in the Logfile parameter in the Logger::Receiver section. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
 

The FileFeeder

Changed:
<
<
>
>
The configuration for the FileFeeder is maintained in a seperate file.
 

The Prompt Reconstruction components

Changed:
<
<
Prompt Reconstruction has three components, the Manager, the Workers, and the Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger. All three require a Port, the Manager and Receiver need a Host too.

Nota bene: Don't set Verbose to be true for the Worker unless you really know what you're doing. It will duplicate the STDOUT of the cmsRun application, which is not nice. This will be fixed at some point.

>
>
Prompt Reconstruction has three components, the PromptReco::Manager, the PromptReco::Worker, and the PromptReco::Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger.
  For PromptReco::Worker, the only supported Mode at the moment is LocalPull. Actually, that's not true, all modes are supported, but that's the only one tested. Classic should also work, but LocalPush won't until/unless someone writes the bits that push the files in the first place!

Leave TargetDirs as it is (a single entry consisting of '.') to write the RECO output in the jobs local working directory. As elsewhere, the only TargetMode supported at the moment is RoundRobin.

Changed:
<
<
Set MaxEvents to something small while testing. Note that setting '0' here means exactly that, zero events, if you want to process all events, set '-1' instead.

If you set LogDir to an RFIO-accessible directory, the logfiles of each individual reconstruction step will be copied to that directory by the prompt reconstruction worker as it progresses.

>
>
Set MaxEvents to the maximum number of events to process, if you want to process all events, set '-1' instead.
 
Changed:
<
<
Likewise, you can set RecoDir to something as a first measure to save the reco data. Something better will be put in place, this was hacked to allow CMSSW_0_9_0 data to be saved during the first tests. If it's a castor directory you're writing to, set SvcClass as appropriate.
>
>
The DataDirs and LogDirs paths should be set to RFIO-accessible directories which already exist (maybe Tony fixes this one day?), preferably in Castor2 so that the output is kept in a persistent store; SvcClass has to be set as appropriate.
 

Operating the Tier0

Starting the components

Line: 173 to 163
 
Deleted:
<
<
For debugging the applications or for checking progress, lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
 

Stopping the system

You can just kill everything if you like. You also need to clean the castor fileservers and flush the PhEDEx subscriptions to make a clean start.
Line: 187 to 175
 
  • bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
  • lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
  • A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
Added:
>
>
  • To check which batch nodes are in production and which are in maintenance, CDBHosts -cl lxbatch -q "clustersubname='cmscsa06'" -data "hostname,get_value(hostname,'/system/network/interfaces/eth0/switchmedium'),state" can be used.
 

Reporting bugs/problems

 Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local, for our specific dedicated hardware, or conceptual, in the architecture of the system. Bugs or feature-requests should be reported using savannah, at https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.

Revision 5 2006-09-22 - unknown

Line: 1 to 1
 
META TOPICPARENT name="Sandbox.CMST0CSA06OperationsGuide"
Changed:
<
<

Introduction

>
>

CSA06 Tier0 Operations Guide

Introduction

  For the CSA06 exercise, the Tier0 operation is based on a number of components:
  • Logger: central recording and steering of all activities
Line: 19 to 20
  All worker components run on dedicated worker nodes (dual CPU 2.8GHz with SLC4 32-bit mode) as batch jobs, using the LSF-queue cmscsa06.
Changed:
<
<

Installation of the system

>
>

Installation of the system

 All CSA06 computing tasks should use CMSSW_1_0_x. Currently, CMSSW_1_0_0 is installed, and this is the version the following instructions refer to! Other versions may be installed in parallel as needed.
Line: 29 to 30
 
  • Install the requisite Perl modules
  • Configure the system
Changed:
<
<

Castor usage

>
>

Castor usage

 Castor2 is used with several disk pools, configured both in size and functionality as required.
  • t0input 65TB on 13 servers, no tape, garbage collection disabled
  • t0export 80TB on 16 servers, with tape
Line: 38 to 39
 The RECO output files are written to /castor/cern.ch/cms/store/CSA06/????? All options in the configuration file have to be set correctly in order to select the correct disk pool and path!
Changed:
<
<

Checking out the code

>
>

Checking out the code

 Running in a /bin/bash shell, you can check out the T0 code as follows:
#CMSSW Version and installation dir

Line: 81 to 82
 
  • untracked int32 maxEvents = T0_MAX_EVENTS
  • untracked string fileName = "file:T0_OUTPUT_FILE"
Changed:
<
<

Configuring the components

>
>

Configuring the components

 A single configuration file $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component, with the exception of the FileFeeder $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf, which we prefer to keep separate for more frequent changes.

After making a change, the syntax of the configuration file should be checked with perl -c $modified_config_file.

Line: 90 to 91
  A detailed description of how to change the configuration file is in July prototype configuration; here only an overview of the most common changes is given.
Changed:
<
<

Mandatory changes to configuration parameters

>
>

Mandatory changes to configuration parameters

 You must set the T0::System{Name} to something globally unique, such as JulyProto$your_name. If you don't do this, and you run the servers on the same machine as someone else, you run the risk of sending indistinguishable data to mona lisa. This would not be good!

You must set the Port values for all components that have one. Pick a random number as a base, and work up sequentially from there. There are no preferred or required values, you simply need to avoid collisions with other users on the same machines. You will only need one or two dozen ports at most, use your unix user-id if you can't think of a random number!
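A hypothetical way to pick these values, following the advice above; none of these variable names are used by the T0 code itself, they only illustrate the scheme (unique name from the login, sequential ports seeded from the unix user-id):

```shell
# Derive a globally unique system name and a per-user port range (sketch only).
T0_NAME="JulyProto$(id -un)"              # unique name: fixed prefix + login
BASE_PORT=$(( 20000 + $(id -u) % 10000 )) # base port seeded from numeric UID
LOGGER_PORT=$BASE_PORT                    # then work up sequentially
MANAGER_PORT=$(( BASE_PORT + 1 ))
RECEIVER_PORT=$(( BASE_PORT + 2 ))
echo "$T0_NAME $LOGGER_PORT $MANAGER_PORT $RECEIVER_PORT"
```

Two users on the same machine then get disjoint port ranges and distinguishable names automatically.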

Line: 99 to 100
  For a first run of the system, set the Quiet flag to zero and the Verbose flag to one for all components. Don't set the Debug flag to one unless you like hieroglyphics.
Changed:
<
<

The Logger

>
>

The Logger

 The Logger writes a logfile if you have one set in the Logfile parameter. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
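The midnight rotation can be mimicked by hand; this is only a sketch of what the Logger does, with a made-up logfile path:

```shell
# Sketch of the Logger's midnight rotation: rename the current logfile with a
# ".YYYYMMDD" suffix and start a fresh, empty one.
LOGFILE=/tmp/t0-logger.log
echo "old entries" > "$LOGFILE"
SUFFIX=$(date +%Y%m%d)
mv "$LOGFILE" "$LOGFILE.$SUFFIX"   # old file keeps its contents under the dated name
: > "$LOGFILE"                     # new, empty logfile is appended from now on
ls "$LOGFILE" "$LOGFILE.$SUFFIX"
```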
Changed:
<
<

The FileFeeder

>
>

The FileFeeder

 
Changed:
<
<

The Prompt Reconstruction components

>
>

The Prompt Reconstruction components

 Prompt Reconstruction has three components, the Manager, the Workers, and the Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger. All three require a Port, the Manager and Receiver need a Host too.

Nota bene: Don't set Verbose to be true for the Worker unless you really know what you're doing. It will duplicate the STDOUT of the cmsRun application, which is not nice. This will be fixed at some point.

Line: 120 to 121
  Likewise, you can set RecoDir to something as a first measure to save the reco data. Something better will be put in place, this was hacked to allow CMSSW_0_9_0 data to be saved during the first tests. If it's a castor directory you're writing to, set SvcClass as appropriate.
Changed:
<
<

CSA06 Tier0 Operations Guide

Starting the components

>
>

Operating the Tier0

Starting the components

 Some of the components are fussy about being started in the correct order, though that will change as the code improves. In all cases, the only argument you need to specify is the configuration file, with '--config $file'. Also, in all cases, if the Host is specified for that component, you must run it there or it will abort with an appropriate error message. For everything except the workers, just starting the task in a terminal window is good enough. I use screen to create persistent sessions that I can connect to from home or from the office, see http://wildish.home.cern.ch/wildish/UseScreen.html for a 1-minute tutorial on screen if you're interested.

To start the full system, the components should be started in this order:

Line: 134 to 135
 export CMSSW_VERS=CMSSW_1_0_0
 export CMSSW_BASE_DIR=~cmsprod/public/T0
 export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
Deleted:
<
<
# Set environment
 cd $CMSSW_DIR/src/T0/COMP/T0
 . env.sh
 #Working dir
Changed:
<
<
cd /data/csa06
>
>
export T0_WORK_DIR=/data/csa06
cd $T0_WORK_DIR
#Create dirs for log-files
mkdir -p ${T0_WORK_DIR}/Logs/Logger
mkdir -p ${T0_WORK_DIR}/Logs/FileFeeder
mkdir -p ${T0_WORK_DIR}/Logs/PromptRecoManager
mkdir -p ${T0_WORK_DIR}/Logs/ExportManager
#Start the components
 #The Logger
Changed:
<
<
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
>
>
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/Logger/LoggerReceiver.log 2>&1 &
 #The FileFeeder
Changed:
<
<
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder
>
>
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder > ${T0_WORK_DIR}/Logs/FileFeeder/FileFeeder.log 2>&1 &
 #The PR Manager
Changed:
<
<
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
>
>
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/PromptRecoManager/PromptRecoManager.log 2>&1 &
 #The PR Workers (e.g. start 20 jobs...)
 for i in `seq 1 20`; do
   bsub -q cmscsa06 -R 'type=SLC4' $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh
   sleep 5
 done
 #The Export Manager
Changed:
<
<
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
>
>
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/ExportManager/ExportManager.log 2>&1 &
 

The script $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day).

Changed:
<
<

Watching the running system

>
>

Watching the running system

 The components all produce output with variable levels of verbosity, see the configuration file syntax for details.
You can also just check the logger, which gets most of the important information.
Alternatively, you can look at mona lisa, where there is a CMS/Tier0 group that plots most of the high-level metrics.
There's Lemon monitoring of
Line: 168 to 175
  For debugging the applications or for checking progress, lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
Changed:
<
<

Stopping the system

>
>

Stopping the system

 You can just kill everything if you like. You also need to clean the castor fileservers and flush the PhEDEx subscriptions to make a clean start.

If you want to stop the system cleanly, you can kill the Storage Manager and Repack Manager servers. The workers will talk to them again when they've finished their current tasks, and will then exit because they're no longer there. In principle, the components can be stopped and restarted without having to stop/restart the whole system, but that seems not to be working properly yet.

To clean castor, you can use $T0ROOT/src/Utilities/CleanupCastor.pl. Give it a '--directory $castor_path' argument, and it will wipe the files in that directory, recursively. There is no protection, so don't give it a directory containing real data! This script does a stager_rm on a group of files at a time, which is fast, but means that the garbage collector has to kick in to finally remove them. That can take a minute or two. If you only have a few files in the directory, you can add the '--hammer' argument, which really clobbers them by doing stager_rm followed by nsrm. That will take too long if you have more than a few dozen files, so run it the normal way first.

Changed:
<
<

Troubleshooting the system

>
>

Troubleshooting the system

 For checking progress or debugging of the applications, there are several more or less intrusive possibilities.
  • bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
  • lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
  • A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
Changed:
<
<

Reporting bugs/problems

>
>

Reporting bugs/problems

 Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local, for our specific dedicated hardware, or conceptual, in the architecture of the system. Bugs or feature-requests should be reported using savannah, at https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.
Changed:
<
<
-- Main.jank - 18 Sep 2006
>
>
-- Main.jank - 22 Sep 2006

Revision 4 2006-09-21 - unknown

Line: 1 to 1
 
META TOPICPARENT name="Sandbox.CMST0CSA06OperationsGuide"
Changed:
<
<

CSA06 Tier0 Operations Guide

>
>

Introduction

  For the CSA06 exercise, the Tier0 operation is based on a number of components:
  • Logger: central recording and steering of all activities
Line: 34 to 34
 
  • t0input 65TB on 13 servers, no tape, garbage collection disabled
  • t0export 80TB on 16 servers, with tape
  • cmsprod 22TB on 4 servers
Changed:
<
<
The "RAW-data" input files are read from /castor/cern.ch/cms/store/CSA06/.., e.g. /castor/cern.ch/cms/store/CSA06/2006/8/17/CSA06-082-os-TTbar. The RECO output files are written to /castor/cern.ch/cms/store/CSA06/?????
>
>
The "RAW-data" input files are read from /castor/cern.ch/cms/T0Prototype/Input located in the t0input disk pool. The RECO output files are written to /castor/cern.ch/cms/store/CSA06/?????
 All options in the configuration file have to be set correctly in order to select the correct disk pool and path!

Checking out the code

Line: 55 to 56
 cvs co COMP/T0

# Set environment

Changed:
<
<
cd ~cmsprod/public/T0/CMSSW_1_0_0/src/T0/COMP/T0
>
>
cd $CMSSW_DIR/src/T0/COMP/T0
 . env.sh scramv1 runtime -csh | tee $T0ROOT/runtime.csh scramv1 runtime -sh | tee $T0ROOT/runtime.sh
Line: 81 to 82
 
  • untracked string fileName = "file:T0_OUTPUT_FILE"

Configuring the components

Changed:
<
<
A single configuration file $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component.
>
>
A single configuration file $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component, with the exception of the FileFeeder $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf, which we prefer to keep separate for more frequent changes.

After making a change, the syntax of the configuration file should be checked with perl -c $modified_config_file.

 
Changed:
<
<
The configuration file consists of a series of Perl hashes containing configuration information for each component. The components are named appropriately, and when they start they read the config file ("do $file or die") and set their parameters from the hashes they are named after. If you change the configuration file while the components are running, they will re-execute it and update themselves, so don't mess it up while you're running! You can check it's OK with perl -c $your_config_file. Note, however, that the Worker components will only read their configuration at startup, and then they only really need it to find out where their Loggers and Managers are. Once they have connected to the Manager, that Manager will send them the configuration that they will use, and will send them any updates to it too. So it is only the configuration file(s) used by the Managers that really matters; if you stop/start workers with a different configuration file, it won't make any difference to what they do (providing they connect to the same Managers!)
>
>
Changing the configuration file while the components are running is possible, as they will re-execute it and update themselves, but for obvious reasons this should be done with care! Note, however, that the Worker components will only read their configuration at startup, and then they only really need it to find out where their Loggers and Managers are. Once they have connected to the Manager, that Manager will send them the configuration that they will use, and will send them any updates to it too. So it is only the configuration file(s) used by the Managers that really matters; if you stop/start workers with a different configuration file, it won't make any difference to what they do (providing they connect to the same Managers!)
 
Changed:
<
<
The detailed syntax of the configuration file is specified in the July prototype configuration syntax. Check there for details of anything you find in the configuration files in the repository. Many of the configuration parameters are not so important, others are critical. Here we describe only what you must set before running.
>
>
A detailed description of how to change the configuration file is in July prototype configuration; here only an overview of the most common changes is given.
 

Mandatory changes to configuration parameters

You must set the T0::System{Name} to something globally unique, such as JulyProto$your_name. If you don't do this, and you run the servers on the same machine as someone else, you run the risk of sending indistinguishable data to mona lisa. This would not be good!
Line: 117 to 120
  Likewise, you can set RecoDir to something as a first measure to save the reco data. Something better will be put in place, this was hacked to allow CMSSW_0_9_0 data to be saved during the first tests. If it's a castor directory you're writing to, set SvcClass as appropriate.
Changed:
<
<

Starting the components

>
>

CSA06 Tier0 Operations Guide

Starting the components

 Some of the components are fussy about being started in the correct order, though that will change as the code improves. In all cases, the only argument you need to specify is the configuration file, with '--config $file'. Also, in all cases, if the Host is specified for that component, you must run it there or it will abort with an appropriate error message. For everything except the workers, just starting the task in a terminal window is good enough. I use screen to create persistent sessions that I can connect to from home or from the office, see http://wildish.home.cern.ch/wildish/UseScreen.html for a 1-minute tutorial on screen if you're interested.

To start the full system, the components should be started in this order:

Line: 125 to 129
 
#logon to machine where servers run..
ssh cmsprod@lxgate39

Added:
>
>
...give the password
# Set environment
export CMSSW_VERS=CMSSW_1_0_0
export CMSSW_BASE_DIR=~cmsprod/public/T0
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
#Working dir
 cd /data/csa06
 #The Logger
 $T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
 #The FileFeeder
Changed:
<
<
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf --name CSA06Mixed::Feeder
>
>
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder
 #The PR Manager
 $T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
 #The PR Workers (e.g. start 20 jobs...)
Line: 141 to 154
 $T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
Changed:
<
<
The script $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day). They can then be started as simply as bsub -q $queue $script. Note that we have some dedicated machines available to us, and we submit to them as follows: bsub -q dedicated -R itdccms $script or, for SLC4, you can submit with bsub -q cmscsa06 -R "type=SLC4" $script. You can also start the workers on those machines interactively, using the lsrun command. For example, lsrun -P -R itdccms /bin/bash will give you a terminal shell on a worker node, from which you can start your processes. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
>
>
The script $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day).
 
Changed:
<
<

Watching the running system

>
>

Watching the running system

 The components all produce output with variable levels of verbosity, see the configuration file syntax for details.
You can also just check the logger, which gets most of the important information.
Alternatively, you can look at mona lisa, where there is a CMS/Tier0 group that plots most of the high-level metrics.
There's Lemon monitoring of
Changed:
<
<
>
>
 
Line: 153 to 166
 
Changed:
<
<
For debugging the applications or for checking progress, lsrun -P -R itdccms /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
>
>
For debugging the applications or for checking progress, lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
 
Changed:
<
<

Stopping the system

>
>

Stopping the system

 You can just kill everything if you like. You also need to clean the castor fileservers and flush the PhEDEx subscriptions to make a clean start.

If you want to stop the system cleanly, you can kill the Storage Manager and Repack Manager servers. The workers will talk to them again when they've finished their current tasks, and will then exit because they're no longer there. In principle, the components can be stopped and restarted without having to stop/restart the whole system, but that seems not to be working properly yet.

To clean castor, you can use $T0ROOT/src/Utilities/CleanupCastor.pl. Give it a '--directory $castor_path' argument, and it will wipe the files in that directory, recursively. There is no protection, so don't give it a directory containing real data! This script does a stager_rm on a group of files at a time, which is fast, but means that the garbage collector has to kick in to finally remove them. That can take a minute or two. If you only have a few files in the directory, you can add the '--hammer' argument, which really clobbers them by doing stager_rm followed by nsrm. That will take too long if you have more than a few dozen files, so run it the normal way first.

Changed:
<
<

Custom configurations

For testing purposes we will often want customised configurations, such as feeding one component of the system with files prepared in advance, at a specific rate. This is described on a separate page.
>
>

Troubleshooting the system

For checking progress or debugging of the applications, there are several more or less intrusive possibilities.
  • bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
  • lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
  • A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
 

Reporting bugs/problems

Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local, for our specific dedicated hardware, or conceptual, in the architecture of the system. Bugs or feature-requests should be reported using savannah, at https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.

Revision 3 2006-09-20 - unknown

Line: 1 to 1
 
META TOPICPARENT name="Sandbox.CMST0CSA06OperationsGuide"

CSA06 Tier0 Operations Guide

Line: 9 to 9
 
  • PR Manager: Prompt Reconstruction Manager, managing the PR Workers
  • PR Worker: Prompt Reconstruction Worker, running the actual reconstruction code and configuration according to role
  • Export Manager: handles the injections into PhEDEx
Changed:
<
<
  • PhEDEx itself, which loads the export buffer with appropriate traffic according to what the Export Emulator Manager tells it to do
>
>
  • PhEDEx itself, which loads the export buffer with appropriate traffic according to what the Export Emulator Manager tells it to do
  All components are independent from each other and can be started and stopped as need arises (as of today, restarting the Logger likely will lose information, to be fixed by Tony?!).
Changed:
<
<
All CSA06 Tier0 operations are run with the loginID cmsprod (for the password, contact Tony, Nick or Werner). All required software is installed in ~cmsprod/public/T0, with the exception of the required Perl and ApMon modules.
>
>
All CSA06 Tier0 operations are run with the loginID cmsprod (for the password, contact Tony, Nick or Werner). All required software is installed in ~cmsprod/public/T0, with the exception of the required Perl and ApMon modules installed in /afs/cern.ch/user/w/wildish/public/perl ($T0ROOT/env.sh will set up your PERL5LIB accordingly).
 All steering and managing components run on lxgate39.

All worker components run on dedicated worker nodes (dual CPU 2.8GHz with SLC4 32-bit mode) as batch jobs, using the LSF-queue cmscsa06.

Installation of the system

Changed:
<
<
The code has to be installed on shared disk space, visible to all worker nodes. There are a few tasks to perform to get the code to a runnable state:
>
>
All CSA06 computing tasks should use CMSSW_1_0_x. Currently, CMSSW_1_0_0 is installed, and this is the version the instructions following refer to! Other versions may be installed in parallel as needed.

The code has to be installed on shared disk space, visible from all worker nodes, with ~cmsprod/public/T0 as the BASE_DIR. The following tasks have to be performed to get the code to a runnable state:

 
  • Checkout the code from the CVS repository
  • Install the requisite Perl modules
Changed:
<
<
  • Configure the rest of the system.

Using Castor

If you want to run tests in castor, you have two things to consider: whether to use tape or not, and which disk pool to use. You must specify both, or your tests will be invalid, and probably will interfere with the rest of CMS or CERN.

Use, or not, of tape is specified by the tape class you associate with the directory you write to. See nsls --class and nslistclass if you really want to know more. We have a group-writeable castor directory that is not associated to a tapeclass, i.e. files written there won't be migrated to tape and won't be garbage-collected automatically. The directory is /castor/cern.ch/cms/T0Prototype. You can create a subdirectory there for your tests, to avoid interfering with other users. You should probably also create a separate directory for the storage manager output and the repacker output, so you can delete them separately if you want to.

>
>
  • Configure the system
 
Changed:
<
<
The castor pool you use is specified, in castor 2, by the service class. (Castor 1 used the STAGE_POOL environment variable; that is no longer supported.) We have two dedicated buffers for the T0 tests at the moment, the t0input and t0export pools. You can set the SvcClass parameter in the config file to whichever of these you want on a per-component basis.
>
>

Castor usage

Castor2 is used with several disk pools, configured both in size and functionality as required.
  • t0input 65TB on 13 servers, no tape, garbage collection disabled
  • t0export 80TB on 16 servers, with tape
  • cmsprod 22TB on 4 servers
The "RAW-data" input files are read from /castor/cern.ch/cms/store/CSA06/.., e.g. /castor/cern.ch/cms/store/CSA06/2006/8/17/CSA06-082-os-TTbar. The RECO output files are written to /castor/cern.ch/cms/store/CSA06/????? All options in the configuration file have to be set correctly in order to select the correct disk pool and path!
 

Checking out the code

Changed:
<
<
You can check out the T0 code as follows:
>
>
Running in a /bin/bash shell, you can check out the T0 code as follows:
 

Changed:
<
<
cd $SOMEWHERE
>
>
#CMSSW Version and installation dir
export CMSSW_VERS=CMSSW_1_0_0
export CMSSW_BASE_DIR=~cmsprod/public/T0
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_BASE_DIR

#Check out from cmscvs

 project CMSSW
Changed:
<
<
scramv1 project CMSSW CMSSW_0_9_0
cd CMSSW_0_9_0/src
>
>
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_VERS/src
 mkdir T0 cd T0 cvs co COMP/T0
Deleted:
<
<

I will refer to the $SOMEWHERE/CMSSW_0_9_0/src/T0/COMP/T0 directory as $T0ROOT for the rest of these instructions. There is a shell script, $T0ROOT/env.sh, which must be sourced from its directory (i.e. not from just anywhere in the filesystem) which will set T0ROOT and other variables for your environment. Note that there is no corresponding env.csh, you should be using a bash-like shell for this setup. Complain if you really need csh-like shells.

N.B. The named version here, CMSSW_0_9_0, is probably not going to be valid for long! Make sure you pick an appropriate version. I'm not sure I can write a recipe for that, you'll have to figure it out for yourself when you need to.

 
Changed:
<
<
Once you have built the repacker, you must save the runtime environment in a file, for the jobs to use when they run:

>
>
# Set environment
cd ~cmsprod/public/T0/CMSSW_1_0_0/src/T0/COMP/T0
. env.sh
scramv1 runtime -csh | tee $T0ROOT/runtime.csh
 scramv1 runtime -sh | tee $T0ROOT/runtime.sh
Changed:
<
<
>
>
scramv1 runtime -sh | tee $T0ROOT/runtime_pr.sh
 
Changed:
<
<

Preparing the Prompt Reconstruction application

This is up to date for 0_9_2
export CMSSW_BASE_DIR=/afs/cern.ch/path/to/where/you/want/to/install           
export CMSSW_DIR=$CMSSW_BASE_DIR/CMSSW_0_9_2

>
>
#Prompt Reconstruction application
 cd $CMSSW_BASE_DIR
Changed:
<
<
scramv1 project CMSSW CMSSW_0_9_2
>
>
scramv1 project CMSSW $CMSSW_VERS
 cd $CMSSW_DIR/src eval `scramv1 runtime -sh` cmscvsroot CMSSW cvs co Configuration/Examples/data/RECO081.cfg
Added:
>
>
# ..following two patches needed for CMSSW_1_0_0 (fix should be in 1_0_1..)!
cvs co -r HEAD Configuration/CompatibilityFragments/data/RecoLocalEcal.cff
cvs co -r 1.15 Configuration/Examples/data/RECO.cff
cp Configuration/Examples/data/RECO081.cfg $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl
 
Changed:
<
<
Copy the Configuration/Examples/data/RECO081.cfg file to $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl. Edit Reco.cfg.tmpl and set the input fileNames, the maxEvents, and the output fileName as follows, wherever they appear in the configuration file:
>
>
You need to edit $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl and set the input fileNames, the maxEvents, and the output fileName as follows, wherever they appear in the configuration file:
 
  • untracked vstring fileNames = {'T0_INPUT_FILE'}
  • untracked int32 maxEvents = T0_MAX_EVENTS
  • untracked string fileName = "file:T0_OUTPUT_FILE"
Changed:
<
<
The Prompt Reconstruction Worker depends on $T0ROOT/runtime_pr.sh for its runtime environment, saved from the output of scramv1 runtime -sh. You can save the repacker and the Prompt Reconstruction environments from different installations if you wish, or if they're the same then you can just copy or soft-link one environment to the other. That's all there is to configure.

Prerequisite Perl modules

The prototype makes heavy use of the Perl Object Environment ( POE) modules. POE has a home page (http://poe.perl.org/) and detailed documentation (http://search.cpan.org/~rcaputo/POE/) on the web. The POE modules should be installed somewhere in your PERL5LIB, and are all available from CPAN. The $T0ROOT/env.sh script will set up your PERL5LIB to point to the installations that already exist at CERN, so you do not need to install them for yourself. If you do wish to install separately, for yourself, you will need these modules:
  • POE
  • POE::Component::TCP::Server
  • POE::Component::TCP::Client
  • POE::Queue::Array
  • POE::Filter::Reference
  • POE::Wheel::Run

You also need the monalisa ApMon module, available from http://monalisa.cern.ch/monalisa.html. At CERN, the POE and ApMon modules are installed in /afs/cern.ch/user/w/wildish/public/perl. $T0ROOT/env.sh will set up your PERL5LIB accordingly.

For completeness, I note that you need the Carp, Cwd, Data::Dumper, File::Basename and Getopt::Long modules, though these are normally installed on any sensible system. Complain to your sysadmin if they aren't! Time::HiRes is another good one to have, though not essential.

These Perl modules must be available on all hosts that you use for your T0, all the components need them.

Configuring the components

The prototype is configured with one or more configuration files. Each component accepts a '--config' argument to tell it where its configuration file is, and although they can be separate files for each component, there is no problem if they are all configured from the same file, and that's what I recommend you do. Some components use configuration information about other components, for example many of the components use the Logger::Receiver parameters to find out where to send their own log messages. If you split your configuration over many files, such 'shared' information must be consistent across all of them. The default configuration file is in $T0ROOT/src/Config/JulyPrototype.conf, but a better starting point is $T0ROOT/src/Config/DevPrototype.conf, which will run a smaller, leaner first version of the prototype.
>
>

Configuring the components

A single configuration file $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component.
The configuration file consists of a series of Perl hashes containing configuration information for each component. The components are named appropriately, and when they start they read the config file ("do $file or die") and set their parameters from the hashes they are named after. If you change the configuration file while the components are running, they will re-execute it and update themselves, so don't mess it up while you're running! You can check it's OK with perl -c $your_config_file. Note, however, that the Worker components will only read their configuration at startup, and then they only really need it to find out where their Loggers and Managers are. Once they have connected to the Manager, that Manager will send them the configuration that they will use, and will send them any updates to it too. So it is only the configuration file(s) used by the Managers that really matters; if you stop/start workers with a different configuration file, it won't make any difference to what they do (providing they connect to the same Managers!)

The detailed syntax of the configuration file is specified in the July prototype configuration syntax. Check there for details of anything you find in the configuration files in the repository. Many of the configuration parameters are not so important, others are critical. Here we describe only what you must set before running.

Changed:
<
<

Mandatory changes to configuration parameters

>
>

Mandatory changes to configuration parameters

 You must set the T0::System{Name} to something globally unique, such as JulyProto$your_name. If you don't do this, and you run the servers on the same machine as someone else, you run the risk of sending indistinguishable data to mona lisa. This would not be good!

You must set the Port values for all components that have one. Pick a random number as a base, and work up sequentially from there. There are no preferred or required values, you simply need to avoid collisions with other users on the same machines. You will only need one or two dozen ports at most, use your unix user-id if you can't think of a random number!

Line: 106 to 96
  For a first run of the system, set the Quiet flag to zero and the Verbose flag to one for all components. Don't set the Debug flag to one unless you like hieroglyphics.
Deleted:
<
<

The Storage Manager components.

For the StorageManager::Manager, you need to look at the TargetDirs and TargetRate parameters, as a minimum. Set the TargetDirs to some location that you can write to, don't just use the value there because someone else is probably using it already. The target directories should exist before you attempt to use them.

The TargetRate should also be set to something sensible, such as 100 MB/sec to start with. Don't set it too low or your system will take ages to do anything useful.

You should set the SvcClass of the StorageManager::Worker if you are writing to castor.

 

The Logger

The Logger writes a logfile if you have one set in the Logfile parameter. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
Line: 139 to 122
  To start the full system, the components should be started in this order:
Changed:
<
<
  • The Logger, $T0ROOT/src/Logger/LoggerReceiver.pl.
  • The FileFeeder.
  • The PR Manager.
  • The PR Workers.
  • The Export Manager, $T0ROOT/src/ExportManager/ExportManager.pl.
>
>
#logon to machine where servers run..
ssh cmsprod@lxgate39
cd /data/csa06
#The Logger
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
#The FileFeeder
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf --name CSA06Mixed::Feeder
#The PR Manager
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
#The PR Workers (e.g. start 20 jobs...)
for i in `seq 1 20`; do 
bsub -q cmscsa06  -R 'type=SLC4' $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh
sleep 5
done
#The Export Manager
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf
 
Changed:
<
<
There are two scripts, $T0ROOT/src/RepackManager/run_RepackWorker.sh and $T0ROOT/src/StorageManager/run_StorageManagerWorker.sh, which can be used to start the Repack or Storage Manager workers either interactively or as batch tasks. Both will need editing to pick up the environment from the correct place in your system, unless I improve them one day. They can then be started as simply as bsub -q $queue $script. Note that we have some dedicated machines available to us, and we submit to them as follows: bsub -q dedicated -R itdccms $script or, for SLC4, you can submit with bsub -q cmscsa06 -R "type=SLC4" $script. You can also start the workers on those machines interactively, using the lsrun command. For example, lsrun -P -R itdccms /bin/bash will give you a terminal shell on a worker node, from which you can start your processes. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
>
>
The script $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day). It can then be started as simply as bsub -q $queue $script. Note that we have some dedicated machines available to us, and we submit to them as follows: bsub -q dedicated -R itdccms $script or, for SLC4, you can submit with bsub -q cmscsa06 -R "type=SLC4" $script. You can also start the workers on those machines interactively, using the lsrun command. For example, lsrun -P -R itdccms /bin/bash will give you a terminal shell on a worker node, from which you can start your processes. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
 

Watching the running system

The components all produce output with variable levels of verbosity, see the configuration file syntax for details.
You can also just check the logger, which gets most of the important information.
Alternatively, you can look at mona lisa, where there is a CMS/Tier0 group that plots most of the high-level metrics.
There's Lemon monitoring of
Line: 157 to 153
 
Added:
>
>
For debugging the applications or for checking progress, lsrun -P -R itdccms /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
 

Stopping the system

You can just kill everything if you like. You also need to clean the castor fileservers and flush the PhEDEx subscriptions to make a clean start.
