How to Test Configuration Files with DQMF for Online Monitoring of the L1Calo in Run 2
Introduction
DQMF/DQMD are the main tools used to add and display L1Calo histograms:
- DQMF: The Data Quality Monitoring Framework (DQMF) is the online framework for data quality assessment
- The DQMF analyses various monitoring data through user-defined algorithms and relays summaries of the analysis results during a run
- This twiki provides instructions on how to build a test partition for L1Calo and how to test the OKS configuration files with that test partition
Specifically, the DQMF has the following features:
- Shows histograms with various statistical checks
- Automatically flags the data quality from DQMF results
- Selects histograms and their checks based on xml configuration files which are part of the OKS database
- Visualizes results with the Data Quality Monitoring Display (DQMD)
Instructions to Build a Test Partition for L1Calo
The instructions below will generate a test partition that runs on the TestBed (tbed)
and has just a single L1Calo child process.
1) apply for a TestBed account
- To begin with, you need a valid TestBed account, e.g. on pc-tbed-pub-01.cern.ch. The TestBed account has the same username and home directory as your lxplus account; however, you do not get such an account on the cluster automatically. Please refer to the TestBed Twiki on how to apply for one.
- Then, make sure you can log in to the TestBed (ssh -Y) from lxplus without a password. Otherwise, your test partitions will not work.
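A quick way to verify the passwordless login before going further (a sketch; BatchMode makes ssh fail immediately instead of prompting for a password):
$ ssh -o BatchMode=yes pc-tbed-pub-01.cern.ch true && echo "passwordless login OK"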
2) login to the TestBed
- Log in to the host from which you will subsequently log in to the TestBed, e.g.:
$ ssh -Y <your-user-name>@lxplus.cern.ch
- Then, login to the TestBed where you want to run the test partition, e.g.:
$ ssh -Y pc-tbed-pub-01.cern.ch
- After that, create a working directory in your public area (e.g. l1calo_test_partition), i.e. in your home: /afs/cern.ch/user/y/yhao/public/l1calo_test_partition. Do not create it in any non-shared or temporary directory, because parts of the tdaq system run on other machines even for single-host partitions:
$ cd ~yhao/public
$ mkdir l1calo_test_partition
$ cd l1calo_test_partition
$ fs sa . system:anyuser all
# make sure it's public
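You can confirm the ACL took effect with the AFS listacl command (system:anyuser should be listed with rlidwka, i.e. all):
$ fs la .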
3) get all the software and prepare the test area
- Note that some of the copied files will overwrite existing ones; this is expected:
$ cp -r ~yhao/public/l1calo_test_partition/* .
$ mkdir -p combined/segments
$ cp ~yhao/public/l1calo_test_partition/combined/segments/DQM_segment.data.xml combined/segments
$ mkdir -p daq/hw
$ ln -s /tbed/oks/tdaq-05-05-00/daq/hw/hosts.data.xml daq/hw/hosts-mon.data.xml
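A quick sanity check of the resulting layout (optional):
$ ls -ld combined/segments daq/hw
$ ls -l daq/hw/hosts-mon.data.xml   # should point at the tbed hosts file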
4) set up the tdaq environment
- Simply type the following command:
$ source ~yhao/public/l1calo_test_partition/setup.sh
- and additionally prepend your working directory to the database path:
$ export TDAQ_DB_PATH=`pwd`:$TDAQ_DB_PATH
5) get the trunk of PartitionMaker
Note: You can skip this step if you already have the whole directory from my public area.
- If you do not have it yet, do:
$ cd l1calo_test_partition
$ source setup.sh
$ svn co $SVNROOT/DAQ/DataFlow/PartitionMaker/trunk PartitionMaker
$ pushd PartitionMaker/cmt
$ cat >version.cmt
HEAD
^D
$ cmt config
$ make
$ make inst
$ popd
$ which pm_part_hlt.py
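Note: the cat >version.cmt ... ^D sequence above writes the single line HEAD into version.cmt (^D is Ctrl-D, i.e. end of input). An equivalent one-liner:
$ echo HEAD > version.cmt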
- Or, in case you already have it, just go to your PartitionMaker directory, run svn update, and rebuild it:
$ pushd PartitionMaker
$ svn update
$ cd cmt
$ cmt config
$ make && make inst
$ popd
6) check out the DQMF code on the Point 1 machine
- Set up:
- Log in to the Point 1 machine:
$ ssh -Y <your-user-name>@atlasgw.cern.ch
- When asked for login rights, answer the prompts, e.g.:
Do you want to ask for remote access ([L1C:remote] role)? ([y]n): y
Please state the reason for your request and any possible disruptions (5 chars min): check database setup, no disruption for ATLAS
Please enter the duration (in hours 1..5) [3]: 5
Do you want to wait for request approval? ([y]n): y
- When asked for a host name, type:
pc-atlas-pub
- Set up the tdaq current release:
$ source ~yhao/setup.sh
- Also, set up the ROOT environment:
$ export ROOTSYS=/sw/atlas/sw/lcg/external/root/5.18.00a/slc4_ia32_gcc34/root/
$ export PATH=$PATH:$ROOTSYS/bin
- Make a directory for the OKS configuration files:
$ mkdir -p /atlas-home/1/yhao/oksConfig/tdaq-05-05-00
- Set up the oks_repository:
$ export TDAQ_DB_USER_REPOSITORY=/atlas-home/1/yhao/oksConfig/tdaq-05-05-00
- You can rename your directory if you wish, but tdaq-05-05-00 must remain the final directory.
- Check out OKS code with:
$ oks-checkout.sh [-t] [-h] file
- Arguments or parameters that can be used:
-t | --trace : trace this script execution
-h | --help : print this message
- On the P1 machines, $TDAQ_DB_REPOSITORY is /atlas/oks/tdaq-05-05-00 (this is also the source for the OKS database), so set it first:
$ export TDAQ_DB_REPOSITORY=/atlas/oks/tdaq-05-05-00
- Then check out the L1Calo code as well as the skeleton code:
$ oks-checkout.sh $TDAQ_DB_REPOSITORY/l1calo
$ oks-checkout.sh $TDAQ_DB_REPOSITORY/daq/segments/DQM/
7) copy the DQMF code from Point 1 machine to TestBed machine
- Copy the above configuration from P1 into your local daq/segments, making sure that the entire DQM tree and all the included xml files are copied:
$ scp -r <your-user-name>@atlasgw.cern.ch:/atlas-home/1/yhao/oksConfig/tdaq-05-05-00/daq/segments/DQM daq/segments
- Make the following changes to the main DQM configuration file: in daq/segments/DQM/DQM.HLT.xml, substitute the HLT DQM configuration with l1calo/segments/DQM/l1calo_dqmf.data.xml, then rename the file to daq/segments/DQM/DQM.L1Calo.xml. Also change the machine specified under the RunsOn relation of the main configuration file to whatever tbed machine you are using. In this case the corresponding line would look like:
<rel name="RunsOn">"Computer" "pc-tbed-pub-01.cern.ch"</rel>
- Correct the reference directory in daq/segments/DQM/setup_dqm_env.xml, from:
<attr name="Value" type="string">"/atlas/moncfg/tdaq-05-05-00/trigger/dqm/Ref_Histo"</attr>
- to:
<attr name="Value" type="string">"/afs/cern.ch/user/y/yhao/public/refernce"</attr>
- Open the main configuration file with the OKS data editor and make sure there are no errors in it:
$ oks_data_editor daq/segments/DQM/DQM.L1Calo.xml
Since PartitionMaker will throw an error and cannot generate a partition if something is wrong, always verify the OKS correctness of your DQM segment.
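If you prefer a non-interactive check, the OKS command-line tools can also parse the file; for example (using oks_dump this way is our suggestion, not part of the original recipe):
$ oks_dump daq/segments/DQM/DQM.L1Calo.xml > /dev/null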
8) make changes to the python script used to generate the partition
Customize (edit) the file test_20.2.3.7.py, which is used to generate the partition. Note that some of the following settings should already be in place, but please check them again. A sketch of the complete option block is shown after this list.
- Make sure that the following line is not commented out:
option['data'] = ['/afs/cern.ch/user/h/haimo/wpub/data/data15_13TeV.00284484.physics_Main.daq.RAW._lb0500._SFO-1._0001.data']
- Locate the partition name and change it to something you like (unique enough please) or keep it as part_dqmf_yourusername:
option['partition-name'] = 'part_dqmf_' + os.environ['USER'] + '_20.2.3.7'
- The directory for the tdaq log files is set in option['log-root']:
option['log-root'] = '/logs/' + os.environ['USER']
You should probably leave it that way: /logs is the file system for log files on the tbed, and having your username as a subdirectory there makes it easier to find the logfiles.
- Tell the partition generation to include the DQM segment:
- Locate the hlt-dqm option and change it to something like:
option['hlt-dqm'] = 'daq/segments/DQM/DQM.L1Calo.xml'
- Add add_DQM_seg.py to the post-processing option:
option['post-processor'] = ['localhost_specific','add_DQM_seg','myCoralServer']
This new post-processing module attaches the DQM segment to the partition. You should already have it in your main folder.
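Putting the settings from this step together, the relevant fragment of test_20.2.3.7.py would look roughly like this (a sketch only; the option dictionary and the rest of the script come from the PartitionMaker template):
import os

# input data file for the test partition (must not be commented out)
option['data'] = ['/afs/cern.ch/user/h/haimo/wpub/data/data15_13TeV.00284484.physics_Main.daq.RAW._lb0500._SFO-1._0001.data']
# unique partition name based on your username
option['partition-name'] = 'part_dqmf_' + os.environ['USER'] + '_20.2.3.7'
# tdaq log files go to /logs/<username> on the tbed
option['log-root'] = '/logs/' + os.environ['USER']
# include the L1Calo DQM segment prepared in step 7
option['hlt-dqm'] = 'daq/segments/DQM/DQM.L1Calo.xml'
# add_DQM_seg attaches the DQM segment to the generated partition
option['post-processor'] = ['localhost_specific', 'add_DQM_seg', 'myCoralServer']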
9) set up the server
Some preliminaries that are needed for now. If the partition later fails early in the configuration phase, check that you have done this step properly: the myCoralServer.out logfile may indicate that it didn't find the authentication.xml file and so didn't get access to Coral.
- Create a directory /tmp/.coral/CoralServer (on the host where your partition shall run) and copy the authentication file there. Right now only Werner, Martin and Haimo can copy that file:
$ mkdir -p /tmp/.coral/CoralServer
- Haimo is now keeping the file under public/CoralServer:
$ cp -v ~haimo/public/CoralServer/authentication.xml /tmp/.coral/CoralServer/
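A quick check that the file is in place before launching the partition:
$ ls -l /tmp/.coral/CoralServer/authentication.xml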
10) generate the partition for L1Calo
- Just generate the partition:
$ pm_part_hlt.py -F test_20.2.3.7.py
- This command will print out something like:
workarea
patcharea
/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/x86_64-slc6-gcc48-opt/data
release
/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/x86_64-slc6-gcc48-opt/data
generated partition/segment will use new TDAQ_DB_PATH:
/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/x86_64-slc6-gcc48-opt/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/x86_64-slc6-gcc48-opt/data:/afs/cern.ch/user/y/yhao/public/l1calo_test_partition:/tbed/oks/tdaq-05-05-00:/afs/cern.ch/atlas/project/tdaq/inst/tdaq/tdaq-05-05-00/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/dqm-common/dqm-common-00-37-00/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/tdaq-common/tdaq-common-01-31-00/installed/share/data:/afs/cern.ch/atlas/offline/external/LCGCMT/LCGCMT_71/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/tdaq/tdaq-05-05-00/installed/databases
WARNING:root:The tag with name x86_64-slc6-gcc47-opt is not available in the TDAQ sw repository. Not using it.
WARNING:root:The tag with name x86_64-slc6-gcc47-dbg is not available in the TDAQ sw repository. Not using it.
previous version of part_test_yhao_20.2.3.7.data.xml removed
- copy the long line printed after "generated partition/segment will use new TDAQ_DB_PATH:" in the printout above and set the TDAQ_DB_PATH environment variable to it:
$ export TDAQ_DB_PATH="/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasP1HLT/20.2.3.7/InstallArea/x86_64-slc6-gcc48-opt/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/share/data:/afs/cern.ch/atlas/software/releases/20.2.3/AtlasHLT/20.2.3/InstallArea/x86_64-slc6-gcc48-opt/data:/afs/cern.ch/user/y/yhao/public/l1calo_test_partition:/tbed/oks/tdaq-05-05-00:/afs/cern.ch/atlas/project/tdaq/inst/tdaq/tdaq-05-05-00/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/dqm-common/dqm-common-00-37-00/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/tdaq-common/tdaq-common-01-31-00/installed/share/data:/afs/cern.ch/atlas/offline/external/LCGCMT/LCGCMT_71/installed/share/data:/afs/cern.ch/atlas/project/tdaq/inst/tdaq/tdaq-05-05-00/installed/databases"
The reason for doing this is that up to here, only the tdaq environment is needed. Any session with a tdaq-05-05-00 environment and the above value of TDAQ_DB_PATH can then run the partition.
- At this point it is convenient to have two sessions: one where only the tdaq environment is set up (source setup.sh) and where you (re-)generate the partition, and another where you have the tdaq environment plus the extended TDAQ_DB_PATH, which is where you run the partition. This way you avoid accumulating multiple copies of the HLT path components in TDAQ_DB_PATH. A sketch of this two-session workflow follows.
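A minimal sketch of the two-session workflow (<partition-name> stands for whatever you set in option['partition-name'] in step 8):
# session A: (re-)generate the partition
$ source setup.sh
$ pm_part_hlt.py -F test_20.2.3.7.py
# session B: run the partition with the extended database path
$ source setup.sh
$ export TDAQ_DB_PATH="<long value printed by pm_part_hlt.py>"
$ setup_daq <partition-name>.data.xml <partition-name>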
11) run the partition and the DQM display
- Run the partition:
$ setup_daq part_test_yhao_20.2.3.7.data.xml part_test_yhao_20.2.3.7
This will check the consistency of the partition. If something goes wrong here, check the previous step, in particular the TDAQ_DB_PATH.
- After a few seconds you should see a printout similar to this:
TDAQ SW Release: tdaq-05-05-00 patch level: 0
Database file: /afs/cern.ch/user/y/yhao/public/l1calo_test_partition/part_test_yhao_20.2.3.7.data.xml
Checking the consistency of the database ... OK
15:09:42
TDAQ_PARTITION: part_test_yhao_20.2.3.7
15:09:42
Checking initial partition... OK!
15:09:44
TDAQ_DB_DATA: /afs/cern.ch/user/y/yhao/public/l1calo_test_partition/part_test_yhao_20.2.3.7.data.xml
Getting the part_test_yhao_20.2.3.7 Partition environment from the Database ... OK
TDAQ_LOGS_PATH: /logs/yhao/part_test_yhao_20.2.3.7
15:09:44
---> Starting all the PMGs for partition part_test_yhao_20.2.3.7 (60 seconds timeout...)
15:09:44
15:09:51
Done!
Now starting RC setup via PMG on pc-tbed-pub-01.cern.ch
15:10:46
Now starting RootController for part_test_yhao_20.2.3.7 via PMG on pc-tbed-pub-01.cern.ch
15:10:48
OK!
Starting IGUI for partition part_test_yhao_20.2.3.7 (log file /logs/yhao/part_test_yhao_20.2.3.7/igui_1450015848.out).
Please wait for a window to appear on your screen...
setup_daq script exiting
- The igui will come up and there are 3 commands going forward: INITIALIZE, CONFIG and START
- click INITIALIZE; after a few seconds you will see a few CHIP errors regarding SFO-1 and DummyRack, but they can be ignored
- click CONFIG; this will take about 4 minutes, after which the Run Control tree in the igui should show CONNECTED
- click START; the HLT still does some extensive configuring before the run actually starts, which will take about 3 minutes, then the run will start with an event rate of 500 mHz (one event every two seconds)
- Start the DQM Display:
-
$ dqm_display -p part_test_yhao_20.2.3.7
- Note that most likely the histograms you added in your xml configuration files will not be visible in the DQMD ("Histogram not found"), due to the way this test setup works. However, if you manage to run the test partition and see your configuration in the DQMD, the histograms should be visible at P1.
- Finishing up:
- click STOP to stop the run
- when the run control state is CONNECTED, click on UNCONFIG
- once run control state is INITIAL, click on SHUTDOWN
- once the state is NONE, click on the File tab in the upper left corner and select exit
- answer yes to the question "Do you also want to shut down the Infrastructure?"
12) some useful commands
If there are problems, it may help to clean up what's left from the previous run.
- First check whether your partition is still known to the TDAQ infrastructure:
$ ipc_ls -P -l
# will list ALL partitions running from AFS
- if your partition is among those listed, you can kill it:
$ pmg_kill_partition -p <partition-name>
- Sometimes it is necessary to free some resources that the partition left busy (even if the partition is no longer in the list above). Look at the resource manager:
$ rm_get_partitions_list
- if your partition is among the ones listed (the format of this list is different from that of ipc_ls), you can free its resources by
$ rm_free_all_resources -p <partition-name>
- Finally, there may still be some odd processes of your partition hanging around (normally pmg_kill_partition does get rid of them). You can see them by
$ ps auxww | grep <your-userid>
- if they are things like is_server or ipcserver, you can kill them with a judicious
$ kill -9 <pid>
# pid = process id from the ps output
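The whole cleanup, collected into one sketch (the grep filtering here is our convenience assumption; inspect the listings yourself before killing anything):
$ PART=<partition-name>
$ ipc_ls -P -l | grep $PART && pmg_kill_partition -p $PART
$ rm_get_partitions_list | grep $PART && rm_free_all_resources -p $PART
$ ps auxww | grep $USER   # look for leftover is_server or ipcserver processes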
13) log files
- Another resource when you get into trouble is the log files in /logs/$USER/<partition-name>. Any application that the DQM segment itself starts will also have its file.err and file.out files there.
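For example, to find the most recent logs and scan them for errors (a sketch; the exact file names depend on the applications in your partition):
$ cd /logs/$USER/<partition-name>
$ ls -lt | head                # most recently written log files first
$ grep -li error *.err *.out   # list files mentioning errors, case-insensitive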
14) updating the DQM configuration files
- Simple skeleton configuration xml files are provided for each sub-group. You will find them in the appropriate folders when you check out the DQMF code as explained above.
- When updating your configuration files be sure to use the new DQTemplate classes described here: Data Quality Monitoring Framework twiki
More Information
- General information about the OKS database
- General information about the DQMF
Major updates:
-- YongliangHao - 2015-12-13