-- ParasNaik - 2017-05-16
-- ClaireProuve - 2015-07-18
Running the Online RICH mirror alignment
How to run the Alignment from the PVSS panel
Before running
XML-file to start with
If you want to start from a specific xml-file (i.e. a specific mirror alignment), you need to copy it to the right location. Just follow the instructions
here.
[Paras will remove later, as this is no longer needed, but first a backup may need to be made] The alignments that were put into the database should also be kept at this location:
/group/rich/AlignmentFiles/databaseAlignments
, where you can pick them up whenever you want to.
Check Configuration file
There is only one configuration file,
Configuration.py
. If you want to use the same configuration as used in a previous alignment, since 2017 you can just go into that alignment's directory and retrieve the version of
Configuration.py
used. Of course, if you wish to repeat the alignment, you need to make sure that new Configurables have not been added since!
Because the Configuration file in the
master
branch of Panoptes is always
the Configuration file that is actually used, the directory containing
Configuration.py
also has a subdirectory called
SavedConfigurations
, which contains
.txt
versions of various configurations that you can copy over
Configuration.py
before you start your alignments.
Please read README.txt
. Always create new configurations in the
SavedConfigurations
directory, update
SavedConfigurations/README.txt
, then overwrite
Configuration.py
with your new configuration.
It is necessary to check if all parameters (especially starting-iteration, though we almost always start from Iteration 0) are set correctly. Make sure all your variables are set consistently.
You can read more about the
Configuration file.
NOTE: after changing the Configuration file locally, you have to
recompile the code (including a git fetch Panoptes
and a git lb-checkout
your branch of Rich/RichMirrorAlignmentOnline
first) for the changes to be implemented in the next mirror alignment.
Magnification factors
You can choose between calculating the magnification factors on-the-fly for each iteration or using predetermined ones. To choose, modify the value of
magnifCoeffMode in the
Configuration file.
Find out more about using
predetermined Magnification factors.
Magnification factors on-the-fly
Set the value of
magnifCoeffMode in the
Configuration file to 2.
Predetermined magnification factors
Set the value of
magnifCoeffMode in the
Configuration file to 0.
Then you have to provide the files containing the predetermined magnification factors in a directory and
point towards it in magnifDir in the
Configuration file.
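The two settings described above can be sanity-checked together before compiling. The sketch below is only an illustration: the function name is hypothetical, the settings are passed in as plain arguments rather than read from Configuration.py (whose exact layout is not shown on this page), and only the two values of magnifCoeffMode documented here (0 and 2) are interpreted.

```python
import os


def check_magnif_settings(magnif_coeff_mode, magnif_dir=None):
    """Hypothetical helper: describe and sanity-check the magnification setup."""
    if magnif_coeff_mode == 2:
        # Factors are calculated on-the-fly; magnifDir is not checked here.
        return "on-the-fly"
    if magnif_coeff_mode == 0:
        # Predetermined factors must be provided in an existing directory.
        if not magnif_dir or not os.path.isdir(magnif_dir):
            raise ValueError(
                "magnifCoeffMode is 0 but magnifDir is not an existing "
                "directory: %r" % (magnif_dir,)
            )
        return "predetermined"
    # Any other value is outside what this page documents.
    return "undocumented"
```

For example, check_magnif_settings(0, "/group/rich/AlignmentFiles/MagnifFactors/Rich1/20170612_004007") raises a ValueError if that directory has not been created yet.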
Creating a new directory for predetermined magnification factors
Choices for fixed magnification factors should always be stored in
/group/rich/AlignmentFiles/MagnifFactors/Rich1/
and
/group/rich/AlignmentFiles/MagnifFactors/Rich2/
and their subdirectories.
If you want to use a new set of magnification factors, make sure that the files have the same names as those that can currently be found within the substructure of those directories.
Say you determine new magnification factors for RICH1 by allowing them to float in a test alignment. To install them, use the following procedure on a
plus
machine (substituting your alignment name for
20170612_004007
, and the appropriate nameString instead of "_Mp6Wi8.0Fm5Mm2Sm0_online_Collision17"):
-
cd /group/rich/AlignmentFiles/MagnifFactors/Rich1/
-
ls /group/online/AligWork/MirrorAlignments/Rich1/20170612_004007
- Look at the last iteration number and remember it.
-
mkdir 20170612_004007
-
cd 20170612_004007
-
cp /group/online/AligWork/MirrorAlignments/Rich1/20170612_004007/summary.txt .
- (substituting the last iteration number you noted above for the 4 in
_i4
)
-
cp /group/online/AligWork/MirrorAlignments/Rich1/20170612_004007/Rich1MirrMagnFactors*_i4.txt .
-
for file in Rich1*; do mv "$file" "${file//_i4/_predefined}"; done
-
for file in Rich1*; do mv "$file" "${file//_Mp6Wi8.0Fm5Mm2Sm0_online_Collision17/}"; done
-
rm Rich1MirrMagnFactors_predefined.txt
- Change magnifDir in the Configuration file so that it points to the new directory.
For RICH2 of course you want to substitute "Rich2" for "Rich1" in the above procedure.
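The effect of the two renaming loops in the procedure above can be previewed without touching any files. The following is a minimal sketch, assuming the file names follow the pattern used elsewhere on this page; the function name is hypothetical and the example file name is put together from the nameString and tilt label quoted on this page purely for illustration.

```python
def predefined_name(filename, name_string, iteration):
    """Return the name a magnification-factor file should have once installed.

    Mirrors the two shell loops: the iteration suffix (e.g. "_i4") becomes
    "_predefined" and the alignment's nameString is removed.
    """
    renamed = filename.replace("_i%d" % iteration, "_predefined")
    return renamed.replace(name_string, "")


# Illustrative values only; substitute your own nameString and iteration.
name_string = "_Mp6Wi8.0Fm5Mm2Sm0_online_Collision17"
example = "Rich1MirrMagnFactors" + name_string + "_pri_negYzerZ_i4.txt"
print(predefined_name(example, name_string, 4))
```

A file without a tilt label would end up as Rich1MirrMagnFactors_predefined.txt, which is the file removed in the last rm step.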
Errorloggers
You will need to look at two different errorloggers:
1. For the analyzers: errorLog LHCbA
OR in a separate terminal window
less +F --follow-name /clusterlogs/partitions/LHCbA/daq/LHCbA.log
If using the errorLog, you can change the output level in the little settings window: navigate (with the arrow keys) up to "Severity for messages" and cycle through the options with the ">" key-combination. When you have found the one you want, simply hit "enter". This can be done at any time while running (the message window might need a while to catch up with the command, though).
2. For the iterator: ssh -Y hlt02
source /group/online/dataflow/scripts/shell_macros.sh
errlog -m hlt02
NOTE: here you want the errlog
command and NOT the errorLog
command!!!
This panel is a bit moody and will sometimes not output the messages at the right time. It might take a while, or wait for the next iteration, or even wait until the program has finished.
Alternatively, in a separate terminal window
less +F --follow-name /group/rich/AlignmentFiles/Logging/Rich1_hlt02.log
should track the RICH1 mirror alignment when it is running, and pause otherwise. Change 'Rich1' to 'Rich2' for RICH2.
Run the mirror alignment
0. Create an FSM launch script on ui
. You only need to do this once (unless there is a new version of WinCC)
ssh -Y ui
then create a new file called
fsm.sh
and put the following into it:
if [[ "$WCCOA_DIR" =~ "WinCC_OA/3.11" ]]; then
    echo "Starting the FSM on WinCC-OA 3.11"
    /group/online/ecs/Shortcuts311/LHCb/ECS/ECS_UI_FSM.sh
elif [[ "$WCCOA_DIR" =~ "WinCC_OA/3.15" ]]; then
    echo "Starting the FSM on WinCC-OA 3.15"
    /group/online/ecs/Shortcuts315/LHCb/ECS/ECS_UI_FSM.sh
elif [[ "$WCCOA_DIR" =~ "WinCC_OA/3.16" ]]; then
    echo "Starting the FSM on WinCC-OA 3.16"
    /group/online/ecs/Shortcuts316/LHCb/ECS/ECS_UI_FSM.sh
else
    echo "You need to edit fsm.sh and add functionality for this WinCC version:"
    echo "$WCCOA_DIR"
fi
1. Open the panel from a ui or plus machine:
ssh -Y ui
(if you haven't already)
source fsm.sh
2. Right-click on LHCb_Align when the FSM panel shows up
3. Take the partition: Only take the partition if no one else is using the alignment!
click "take" and then
wait for a smaller new panel to show up (this may take a while, just be patient). In the new panel just click "dismiss".
When you have taken the partition, a complete loss of internet connection will log you off the FSM panel.
During data taking, you will not need to take the partition. The partition will be set up at the pit. The bad news is that you now effectively share the partition with everyone else. The good news is that if you lose your internet connection, your alignments will still run (unless one of the automated alignments overrides your alignment; automated alignments have priority).
4. Choose your alignment: select the activity from the menu on the right of the Run Info panel. This will always be
Alignment|Rich1
or Alignment|Rich2
.
5. If the state is NOT ALLOCATED, then ALLOCATE.
6. Reserve alignment farm: Put your name into the panel and click the button to reserve the alignment. This is so other people know you are there and that they shouldn't touch anything until you are done. Also, if they need the alignment they can contact you. During commissioning periods we use a
TimeTable to let everyone know when we have booked the farm (the link may need to be updated each year).
7. Select runs: Select the run range to use in the alignment by clicking on "Choose runs for alignment". Note that the fill numbers are also displayed for your convenience. The runs to use are the first ones of each fill (if they have enough events), unless the event count is being cut off (currently 3M) in which case use all of them in a fill.
Information about all runs and fills can be found in the
run database.
Make sure to pick "Rich" from the list at the top! Now select the runs you want to run over and click "Ok". The number of events for each run listed in this window is the combined number of events provided by the RICH1 and RICH2 mirror HLT lines. It is not quite a perfect mix: right now it is 57% RICH1 and 43% RICH2, but with substantial error bars (we aimed for 50-50, so not so bad).
To get the number of events actually processed, use the
numTriggers.C script or just look in
AlignmentView
.
8. Verify / select farms to run on: You don't usually have to do this. However, to check the status of the HLT subfarms, click on "HLT", which will open up an overview of the HLT subfarms and nodes.
The selection panel on the left shows the included/excluded nodes and subfarms. In order to include/exclude subfarms/nodes, select them in the panels and
then click the button with the arrow pointing toward the left/right. Then
also click the "Include"/"Remove" button. This might take a while, just be patient.
The subfarms marked in red are not working right now (for whatever reason) and cannot and must not be included.
When all available nodes are included their number should be between 1200 and 1800. If it is less than 1200, you need to
DEALLOCATE and re-ALLOCATE the partition. This almost always solves the problem, and can also solve other problems with the mirror alignment (e.g. if
LHCbA gets into a bad state and nothing seems to work exactly right). If the problem persists go and complain to an Online piquet (or expert [Beat Jost, Clara Gaspar, Markus Frank]) and/or the Alignment Piquet. See the
ShiftDB
for who's on shift.
9. Look at the status of the subfarms and nodes: Also something you don't have to do usually. If need be though, click the "PARTAlign" button in the HLT panel.
Another panel will open that will show you the state of the subfarms. By clicking on a certain subfarm you will get the state of its individual nodes.
Choose steps 10a AND 11a if an alignment has not been performed in a while on the farm.
Otherwise try step 10-11b.
10a. Start alignment by configuring: The configuring process might take a couple of minutes and some nodes might take significantly longer than others. That's normal. If one node seems crazy slow or fails, you can remove it as shown in step 8.
There is a time limit on how long the configuring is allowed to take (6-8 minutes). Configuring may take longer than that; nodes that exceed the limit will appear as ERROR while other nodes in the HLT panel are already READY. Give it time and verify in the
LHCbA errorlogger that something is still happening. If some node(s) take(s) too long or seems stuck for some reason, just remove it.
Do not change the status of things anywhere else than in this specific panel (unless you are an expert). Don't give commands to subfarms or nodes directly! (But also don't worry, it's very unlikely a DEALLOCATE and ALLOCATE won't fix it)
11a. Start the run: From the same dropdown panel select "start run".
The "RunInfo" button on the left will change to "running". Wait a few seconds and see what happens to the "HLT" button. Sometimes it goes straight to running and sometimes it will be back at "Ready". In that case just select "start run" again.
10-11b. Autopilot: In lieu of the two steps (or three, if the farm is not allocated yet), if the runs and alignment are selected you can run the alignment in one step by switching the AUTOPILOT from OFF to ON. However, if there is a bug in the code or some other problem, this may cause weird loops to occur; if you see this happening, switch AUTOPILOT from ON to OFF. Then try to diagnose and repair the problem following the guidelines below, and next time don't use the AUTOPILOT.
12. Now lean back and relax
If you want to you can go into the working directory and see the files being written, or watch the log files scroll by =)
13. If the alignment is successful, everything goes into READY. However once this happens, it is very important that
you execute a RESET command as soon as possible after the alignment goes into READY. We have no idea why this is not done automatically, but since it is not, you have to do it.
14a. If and only if you want to perform another alignment, you MUST RECONFIGURE by following these steps:
- You should have already done this, but if not: The alignment will have gone into READY, click on READY at the top and pick RESET
- Wait until the alignment goes into NOT READY
- Now do whatever you need to do to prepare for the next alignment: it may be changing run numbers, changing configuration and/or changing the starting XML file (REMEMBER YOU MUST COMPILE if you change any of the code, including the Configuration)
- Steps 10-11a or 10-11b
14b. If you want to stop doing mirror alignments, send a RESET, and free the alignment farm again
- You should have already done this, but if not: The alignment will have gone into READY, click on READY at the top and pick RESET
- DEALLOCATE the farm
- If you took the farm, then you should click on the lock and release the farm.
Monitoring the procedure
The results of each iteration are saved; this includes the XML files and histograms produced at the end of each iteration (more details below).
For the moment the monitoring should be performed manually by looking at the output of the fits, and especially at the plots! See
AlignmentView
; you should also be familiar with
all of the information on
LHCbRichMirrorAlignShiftInfo to help you evaluate an alignment.
The output of each performed alignment will be saved under
/group/online/AligWork/MirrorAlignments/Rich{whichRich}/{YYYYMMDD_hhmmss}
.
Xml files location
- The xml files produced in the job you are running will be saved to
/group/online/AligWork/MirrorAlignments/Rich{whichRich}/{YYYYMMDD_hhmmss}
- The xml file the alignment starts with is picked up from
/group/online/alignment/Rich{whichRich}/MirrorAlign/
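The two locations above can be built from the RICH number and the alignment's time stamp. This is a small sketch with hypothetical function names; it only reproduces the path layout given on this page and does not check that the directories exist.

```python
def work_dir(which_rich, stamp):
    """Directory holding the output of one alignment (stamp: YYYYMMDD_hhmmss)."""
    return "/group/online/AligWork/MirrorAlignments/Rich" + str(which_rich) + "/" + stamp


def start_xml_dir(which_rich):
    """Directory from which the starting xml file is picked up."""
    return "/group/online/alignment/Rich" + str(which_rich) + "/MirrorAlign/"


# Example with the alignment name used earlier on this page.
print(work_dir(1, "20170612_004007"))
print(start_xml_dir(2))
```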
Histograms location
The histograms are produced automatically when running the alignment. There is one ROOT file for each iteration; they are stored under
/hist/Savesets/2015/LHCbA/
Nomenclature convention: e.g.
/hist/Savesets/2015/LHCbA/AligWork_Rich1/07/10/AligWrk_Rich1-1569160001-20150710T100630-EOR.root
- Rich1 is the name of the activity (Rich1, Rich2, Muon, Tracker or Velo)
- 156916 is the first run number in the run list
- 0001 is the iteration number
- 20150710T100630 is the time when the file has been generated.
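The naming convention above can be decoded programmatically, for instance to sort savesets by iteration. The sketch below assumes that every saveset name looks like the single example given here, with the last four digits of the numeric field being the iteration and the digits before them the run number; the function name and the returned dictionary layout are hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the one example on this page:
# <task>_<activity>-<run><4-digit iteration>-<YYYYMMDDThhmmss>-EOR.root
SAVESET_RE = re.compile(
    r"^(?P<task>[^-]+)-(?P<run>\d+)(?P<iteration>\d{4})"
    r"-(?P<stamp>\d{8}T\d{6})-EOR\.root$"
)


def parse_saveset_name(filename):
    """Split a saveset file name into activity, run, iteration and time."""
    match = SAVESET_RE.match(filename)
    if match is None:
        raise ValueError("unexpected saveset name: %r" % (filename,))
    # The activity is whatever follows the first underscore of the task name.
    activity = match.group("task").partition("_")[2]
    return {
        "activity": activity,
        "run": int(match.group("run")),
        "iteration": int(match.group("iteration")),
        "time": datetime.strptime(match.group("stamp"), "%Y%m%dT%H%M%S"),
    }


info = parse_saveset_name("AligWrk_Rich1-1569160001-20150710T100630-EOR.root")
print(info["activity"], info["run"], info["iteration"], info["time"])
```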
The histograms will be copied into the work directory for fitting etc. All histograms will also be saved at the end of the alignment procedure to
/group/online/AligWork/MirrorAlignments/Rich{whichRich}/{YYYYMMDD_hhmmss}
To run root on plus:
lb-run root root.exe FILENAME.root
Known issues and workarounds
Everything takes ages. That's normal
Actual errors
When taking the LHCbA partition, you get a warning that says that PARTAlign is owned by ECS:LHCbARunControl (or someone else).
If you have to take
LHCbA, and you get a warning like this, chances are that
LHCbA has been taken by someone else.
Typically this partition should not be taken by someone else unless they are using it or the alignment is in automatic mode
(otherwise you should have been warned that
LHCbA is occupied and you shouldn't use it).
If none of the subfarms are able to be included and PARTAlign remains OFFLINE,
you should try to force exclude PARTAlign, deallocate everything and then you can try to allocate yourself
LHCbA from top level.
To force exclude, you can right click on the lock to go into expert mode and then select the option.
Alignment goes into state “READY,” but the alignment hasn’t converged (ALWAYS CHECK! You should get an email if it converged, and in the ELOG there should be a summary report)
- In this case you probably followed Step 10a but not Step 11a. Just click on "start run" again (no reconfiguring, etc.!)
PARTAlign_Master RUNNING, nodes READY
It is possible that the job is stuck with the PARTAlign_Master in
RUNNING and the various nodes (e.g. HLTA05_A) in
READY.
In this case select again
START_RUN (as mentioned above). This can happen after configuring but also between iterations.
Only a few nodes included
If only a few nodes are in the "Included Nodes and Removed Nodes", try to DEALLOCATE and then ALLOCATE again.
HLT goes into state error: Check whether it is all nodes or just one
- If it's all nodes, then there is an actual error in the code; try checking the error loggers.
- If only individual nodes are in error it could be
- Time-out during configuring
- give it time ~20 min
- There is a time limit on how long the configuring is allowed to take.
- Configuring can take longer, and nodes that exceed the limit will appear as error.
- Give it time and verify in the errorlogger that something is still happening.
- a crashed node during running
- In LHCbA_HLT: TOP
- Select the node (e.g. HLTE1028_A)
- Click “arrow” to exclude area
- Click “remove” to exclude (might take a while, just wait)
Nodes going into error while running
Sometimes a node goes into error state for no apparent reason. This will prevent the others from getting to the next iteration. Just exclude the node and click "start run" again.
Red herrings
something with Options "../../jobs " something ==> a problem in
your python-configurations
- use
tail -n 20 /clusterlogs/partitions/LHCbA/daq/LHCbA.log
to see more details of the error (can increase the number 20
if need be)
something complaining about some
Configuration.py
file in the
OnlineDev
something something directory ==> something in
our own Configuration file is wrong
Specific errors
In
LHCbA log, or when running on a single node, if you see
Aug29-182253[ERROR]hltc0507: CTRL(HLTC0507_A_Controller): start: FAILED Summary:FINISHED:1 TIMEOUT:1 NodeAdder_0 .....
you may want to mention it to Beat and Clara, but it is usually not a problem
In
LHCbA log, or when running on a single node, if you see something like
Aug29-165429[ERROR]hltc0222: GaudiOnlineExe.exe(LHCbA_HLTC0222_AligWrk_0): RawDecoder: Rich::DAQ::RawDataFormatTool:: 'Unknown Level0 hardware ID 181' | L1HardID=10 Ingress=1 Input=1 [DecodeRawRichHLT] StatusCode=FAILURE
This means that
you are using the wrong DB tags.
In
LHCbA log, or in a single node if you see
Aug30-144215[WARN] hlt02: cp: cannot stat ‘/group/online/dataflow/options/LHCbA/HLT/LHCbA_HLT01_HLT.py’: No such file or directory
Aug30-144215[WARN] hlt02: cp: cannot stat ‘/group/online/dataflow/options/LHCbA/LHCbA_HLT01_HLT.opts’: No such file or directory
Just ignore it
Errors and warnings that are not a/our problem
There will be several harmless warnings from the analyzers during configuring and running such as:
Warning: using CKThetaQuarzRedractCorrections = [0,-0.0001,0]
UpdateManagerSvc: Override condition for path 'Conditions/Environment/Rich1/RefractivityScaleFactor' is defined more than once
FinalTrackClones:TrackBuildCloneTable:: The WARNING message is suppressed: 'Probleme extrapolating state'
TrackBestTrackCreator.Fitter:TrackMasterFitter:: The WARNING message is suppressed: 'unable to fit the track' ...
and others discussed on
LHCbRichMirrorAlignShiftInfo
After the alignment has run
After the alignment has run check by hand in the working directory if it has converged and/or look at the output of the iterator in the hlt02 error logger / logging files (discussed in
LHCbRichMirrorAlignCodeOnline).
In order to make sure that no nonsense happened in the alignment you should:
1. Have a look at the output plots for each iteration and see if the fits make sense. Maybe also skim for changes between iterations that seem to have gotten worse rather than better.
2. Look at the amount the mirrors were tilted and check for abnormally high values.
Log files
There are the hlt02 error logger / logging files (discussed in
LHCbRichMirrorAlignCodeOnline). They should show up in the YYYYMMDD_HHMMSS directory after a successful alignment, otherwise they are in the Logging directory.
There will be a log file produced each day (being written to all the time) that contains the output of all analyzers that ran from the
LHCbA partition ( = the alignment farm).
This log file will be at
/clusterlogs/partitions/LHCbA/daq/LHCbA.log
(under plus). Older log files are in
/clusterlogs/partitions/LHCbA/daq/old
.
Note that these log files contain the entire output from all analyzers of this day (including other alignments etc). They are extremely large and probably not very practical. However, you can filter the log using
grep
to isolate one node (e.g. hltf0119) to see how that node is configured and if everything ran (Brunel produces a table of processed events).
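When grep alone is not enough (for instance to select one node and one severity at once), the same filtering can be sketched in a few lines. The line layout assumed below (time stamp, severity in square brackets, node name, message) is inferred only from the example messages quoted on this page, and the function name is hypothetical; lines that do not fit the layout are skipped.

```python
import re

# Layout inferred from examples such as
# "Aug29-182253[ERROR]hltc0507: CTRL(...): start: FAILED ..."
LOG_LINE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\d{2}-\d{6})"
    r"\[(?P<severity>[A-Z]+)\]\s*"
    r"(?P<node>[^:\s]+):\s*(?P<message>.*)$"
)


def filter_log(lines, node=None, severity=None):
    """Yield parsed log entries, optionally restricted to one node/severity."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not in the assumed layout
        entry = match.groupdict()
        if node is not None and entry["node"].lower() != node.lower():
            continue
        if severity is not None and entry["severity"] != severity:
            continue
        yield entry


# Shortened versions of two messages quoted on this page.
sample = [
    "Aug29-182253[ERROR]hltc0507: CTRL(HLTC0507_A_Controller): start: FAILED",
    "Aug30-144215[WARN] hlt02: cp: cannot stat",
]
for entry in filter_log(sample, node="hlt02"):
    print(entry["stamp"], entry["severity"], entry["message"])
```

To run it on the real log, pass an open file object, e.g. open("/clusterlogs/partitions/LHCbA/daq/LHCbA.log"), instead of the sample list; the file is read line by line, so its size is not a problem.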
Explanation of files in the alignment folder (from Claire's email, 28 September 2015)
In
/group/online/AligWork/MirrorAlignments/
there are folders for different dates which in turn contain alignments for Rich1 and/or Rich2.
The alignment procedure starts from an xml file like
Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0.xml
and from this makes the histograms in
RichRecQCHistos_rich1_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0.root
If not using predetermined magnification factors: the starting xml-file (
Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0.xml
) is taken and tilts of +0.3 and -0.3 (+/-0.7 for RICH1) are applied around the y- and the z-axis for the primary and secondary mirrors respectively, which gives 8 new xml files with endings like the following (these are for the calculation of the magnification factors):
Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_pri_negYzerZ_i0.xml
The 8
.xml
files are used to make the corresponding histograms stored in files like:
RichRecQCHistos_rich1_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_pri_negYzerZ_i0.root
The 2D fits are performed on each of the 8 (+1 for untilted) root-files and the output is stored in folders, which also contain the plots for each mirror pair with the line of the 2D fit:
Rich1MirrCombinFit_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0
and with the ending corresponding to the tilted mirrors:
Rich1MirrCombinFit_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_pri_negYzerZ_i0
Then the magnification factors are calculated for each tilt and each mirror and stored in files like:
Rich1MirrMagnFactors_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_pri_negYzerZ_i0.txt
If using predetermined magnification factors: files like the one just above are already pre-made.
After that the actual mirror-corrections are determined and summarised in the file (here only one per iteration):
Rich1MirrAlignOut_i0.txt
(If all numbers under "only this iteration's additional corrections to compensations: truncated, not rounded" are .0 the alignment has converged.)
If the alignment has not converged the determined mirror-tilts are applied to the "old" xml-file (
Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0.xml
in this case) to make the next one:
Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i1.xml
and the whole procedure continues with the next iteration...
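Since every file in the alignment folder carries its iteration as an "_i<N>" suffix, the iteration bookkeeping described above can be sketched as follows. The helper names are hypothetical, and the code assumes all names end in "_i<N>" optionally followed by a file extension, as in the examples from this section.

```python
import re

# "_i<N>" at the end of the name, optionally followed by an extension.
ITERATION_RE = re.compile(r"_i(\d+)(\.[A-Za-z]+)?$")


def iteration_of(name):
    """Return the iteration number encoded in a file or folder name."""
    match = ITERATION_RE.search(name)
    if match is None:
        raise ValueError("no iteration suffix in %r" % (name,))
    return int(match.group(1))


def next_iteration_name(name):
    """Return the same name with the iteration number increased by one."""
    match = ITERATION_RE.search(name)
    if match is None:
        raise ValueError("no iteration suffix in %r" % (name,))
    extension = match.group(2) or ""
    return "%s_i%d%s" % (name[:match.start()], int(match.group(1)) + 1, extension)


def last_iteration(names):
    """Return the highest iteration number found among the given names."""
    return max(iteration_of(name) for name in names)


start = "Rich1CondDBUpdate_Mp6Wi4.0Fv3Cm2Sm1_online_Collision15_i0.xml"
print(next_iteration_name(start))
```

last_iteration can be applied to a directory listing (e.g. the Rich1MirrAlignOut_i*.txt files) to find the final iteration of an alignment.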
To figure out how many events were actually run over in the alignment
This info should be in
AlignmentView
now. You can also use the following:
It can be very useful to figure out how many events were actually run over in the alignment for each RICH detector. It is hard to get this number directly, because our HLT Line provides both Rich1 and Rich2 samples to the online RICH mirror alignment together.
First, create a file called
numTriggers.C
In it put the following:
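#include <iostream>
#include "TFile.h"
#include "TH1.h"
#include "TList.h"
#include "TString.h"
#include "TSystemDirectory.h"
#include "TSystemFile.h"

// Print the number of entries of histogram TH1name for one .root file in dirname.
void numTriggers(TString dirname, TString TH1name)
{
  const TString ext = ".root";
  TSystemDirectory dir(dirname, dirname);
  TList *files = dir.GetListOfFiles();
  if (!files) return;  // directory missing or unreadable

  TIter next(files);
  TSystemFile *file = nullptr;
  bool done = false;
  // Stop after the first .root file that contains the requested histogram.
  while (!done && (file = (TSystemFile*)next())) {
    TString fname = file->GetName();
    if (file->IsDirectory() || !fname.EndsWith(ext)) continue;

    TFile *f = TFile::Open(dirname + "/" + fname);
    if (!f || f->IsZombie()) { delete f; continue; }  // skip unreadable files
    TH1 *h1 = dynamic_cast<TH1*>(f->Get(TH1name));
    if (h1) {
      std::cout << fname << " has " << h1->GetEntries() << " entries in TH1 " << TH1name << "." << std::endl;
      done = true;
    }
    delete f;  // closes the file; the histogram is owned by it
  }
  delete files;
}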
Save it, and then to find the number of entries in the TH1 histogram of your choice, in the latest .root file in a directory, just run (for example):
lb-run root root.exe -b -l -q 'numTriggers.C("/group/online/AligWork/MirrorAlignments/Rich1/20160103_232542/","RICH/RichODIN/TriggerType")'
So in this case, it spits out the actual number of events the alignment ran over for the alignment
Rich1/20160103_232542
:
RichRecQCHistos_rich1_Mp6Wi8.0Fv3Cm0Sm1_online_Collision15_i2.root has 2.06806e+06 entries in TH1 RICH/RichODIN/TriggerType.
So you know this particular alignment ran over about 2 million RICH1 triggered events.
As long as the alignment runs over only one
TriggerType (it should, due to the HLT filter), this will be accurate and of course much faster than looking in a TBrowser.