%CERTIFY%

SCTCalibrationMonitoringCAF

UPDATE

The information on this page is rather out of date. It will be updated, and part of the information will come from here.

Introduction

SCT calibrations are performed online. The purpose of the SCT offline 'calibration' on CAF is to monitor the goodness of the online calibration, and to update the information on noisy and dead channels used in reconstruction when necessary. The intention is that this should be done within the '36 hour calibration loop' before bulk reconstruction begins.

CAF Calibration Monitoring

Currently the SCT Calibration Monitoring on CAF contains the following jobs/tasks:

| calibration job | input | stream | Min. Events | main developer |
| Noisy Strips | NTUP_TRKVALID | physics_RNDM | 5k | Junji, Peter V |
| HV Trips | NTUP_TRKVALID | physics_RNDM | 5k | Tim |
| Dead Chip / Strip | NTUP_TRKVALID | physics_MinBias | >100k/200k | Minoru |
| Noise Occupancy, BSErrorDB | HIST | physics_RNDM | 5k | Minoru |
| Efficiency, Raw Occupancy | HIST | physics_MinBias | 5k | Minoru |
| Lorentz Angle | - | - | - | Elias |

| task | main developer |
| Upload | Misha (Jose) |
| Job Transforms | Misha (Peter R) |
| Automation/TMS | Misha (Peter R) |

Processing at Tier-0 or CAF

The calibration tasks will be handled by the CAF/Tier-0 Task Management System (TMS) with the help of the job transform sct_calib.py. The TMS system basically consists of two daemons: a task creation daemon (TOM) run by the SCTCalib team and a job supervisor daemon (Eowyn) run by the TMS-CAF team.

The sctTOM daemon:

  • searches for new unprocessed datasets in the TMS DB matching the criteria specified above;
  • if any is found, creates one or several new tasks (NoisyStrip, HV, ...) in the TMS DB;
  • once all the input files are ready on Castor, creates the job in the task.

The Eowyn daemon:

  • submits the individual jobs to Tier-0/CAF;
  • handles staging in/out of input/output;
  • monitors the status of the running jobs and updates the status of the tasks (a sketch of this division of labour is given below).
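As an illustration only, the division of labour described above could be sketched as follows; all object, method and table names here (tms_db, find_unprocessed_datasets, ...) are hypothetical, since the real TOM and Eowyn implementations live in the Tier-0 software:

# Hypothetical sketch of the TOM/Eowyn workflow described above.
import time

def tom_loop(tms_db, dataset_pattern):
    # Poll the TMS DB for new datasets matching the configured search string.
    while True:
        for ds in tms_db.find_unprocessed_datasets(dataset_pattern):
            for kind in ("NoisyStrip", "HV"):      # one task per calibration type
                task = tms_db.create_task(ds, kind)
                if ds.files_ready_on_castor():     # the job is created only once
                    task.create_job()              # the input is ready on Castor
        time.sleep(600)                            # poll interval (assumed)

def eowyn_loop(tms_db):
    # Supervise the jobs belonging to the tasks created by TOM.
    while True:
        for job in tms_db.pending_jobs():
            job.stage_in()                         # stage input files
            job.submit_to_caf()                    # submit to Tier-0/CAF batch
        for job in tms_db.running_jobs():
            tms_db.update_task_status(job)         # propagate job -> task status
        time.sleep(600)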

The TMS tasks can be monitored using the TaskLister.

The output of the task is copied both to Castor (permanent) and to an AFS location (temporary), from where it can be displayed on our web page and uploaded upon request of the shifter.

CAF processing

Noisy Channels

The noisy channel search will be performed using the SCT_Calib tool (a sketch of the selection logic is given after the list):

  • Run over all events in an Athena job in order to have access to the CondDB.
  • Chain the ntuples and check that the number of entries is sufficient (number?).
  • Order events in the ntuples and save new ntuples (?).
  • Use the CondDB:
    • CalibrationSvc
    • ConfigurationSvc
    • MonitoringSvc (previous run whenever possible, to check for differences)
  • Look for noisy channels:
    • Threshold to be defined.
  • If a noisy channel is found, check whether it is already known (Calibration, Configuration).
  • Compare the new noisy channels with the existing ones in MonitoringSvc for previous IOVs; if the difference is more than 1000 channels, the output code is ERROR.
  • Save two files: one with the known noisy strips and one with the new ones. Write the new noisy channels into mylocal.db and an XML file, with the format as in the DB: DBFormat.png
  • Set the IOVs from the first and last entry analysed in the ntuple chain.
  • Generate a text file with a report and the output code from the job. This information will be mailed by the T0 process:
    • IOV range and stream details, runs, ...
    • Number of new noisy channels
    • Known channels found
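A minimal sketch of the occupancy-based selection described above; the hit-map format, the helper names and the occupancy cut are assumptions (the actual threshold is still to be defined), not the real SCT_Calib implementation:

# Hypothetical sketch of the noisy-strip search described above.
NOISY_OCCUPANCY_CUT = 1.5e-3   # noise occupancy threshold (to be defined)
MAX_NEW_CHANNELS    = 1000     # more new channels than this -> ERROR

def find_noisy_strips(hit_map, n_events, known_noisy):
    """hit_map: {strip_id: n_hits}; known_noisy: set of already known strips.
    Returns (known, new) sets of noisy strips."""
    noisy = set(s for s, hits in hit_map.items()
                if hits / float(n_events) > NOISY_OCCUPANCY_CUT)
    return noisy & known_noisy, noisy - known_noisy

def output_code(new_noisy):
    # More than 1000 new noisy channels w.r.t. previous IOVs -> ERROR
    return "ERROR" if len(new_noisy) > MAX_NEW_CHANNELS else "OK"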

Dead Channels/Chips

Similar to the Noisy Channels search; still to be better defined.

HV Trips / Anomalous Behavior

The HV trip search will use the TrkValidation ntuple produced at Tier-0. If possible, it will use SCT_Calib or a similar tool (a sketch of the trip-flagging logic follows the list below).

  • The tool will use the CondDB (DCSSvc) to check for trips.
  • Run the job every 6(?) hours; the time interval is still to be determined. If possible, it would be convenient to make it coincide with the other searches.
  • If present and needed, it will use the ordered TrkValidation ntuples.
  • Save a log file of the process.
  • Look for trips; the following need to be defined/checked:
    • Hits corresponding to a trip
    • How long a trip should last to be flagged
    • Whether both links show the trip
    • Confirmation with DCS. If there is no confirmation from DCS, flag it as a trip anyway if it is above the limit: high occupancy (0.1?)
  • If trips are found and confirmed:
    • Save the trip information in a text file: run, module, IOVs, duration (s)
    • Save the information in the local DB and XML
    • Folder: /Monitoring/DCS/ (to be created)
    • Payload: hashid (int), status (int)
  • Return the output code.
  • Mail the output code and text files to the shifter, with:
    • IOV range and stream details, runs, ... checked
    • Trips found, with details (run, IOV, module, duration)
    • Whether the trips were confirmed by DCS
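A minimal sketch of the trip-flagging logic described above; the sample format, the minimum duration and the helper names are assumptions, and only the 0.1 occupancy limit is taken from the list:

# Hypothetical sketch of the HV-trip search described above.
OCCUPANCY_LIMIT = 0.1     # occupancy above which a module is trip-like
MIN_DURATION_S  = 10.0    # minimal duration to flag a trip (assumed)

def find_trips(samples, dcs_confirmed):
    """samples: time-ordered list of (time_s, occupancy) for one module.
    dcs_confirmed(t0, t1) -> bool queries DCS for a trip in [t0, t1]."""
    trips, start = [], None
    for t, occ in samples:
        if occ > OCCUPANCY_LIMIT and start is None:
            start = t                                  # candidate trip begins
        elif occ <= OCCUPANCY_LIMIT and start is not None:
            if t - start >= MIN_DURATION_S:            # long enough to flag
                trips.append((start, t, dcs_confirmed(start, t)))
            start = None
    if start is not None:                              # trip still open at the end
        t_end = samples[-1][0]
        if t_end - start >= MIN_DURATION_S:
            trips.append((start, t_end, dcs_confirmed(start, t_end)))
    return trips    # list of (t_begin, t_end, confirmed_by_DCS)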

Noise occupancy

To be completed.

Upload DB and data display

Upload can be done in two ways:

  • Upload of the data using CherryPy with a web interface:
    • The user will log in to the calibration page with a NICE password. Only the shifter of the week and the experts will be allowed to enter with write permissions.
    • The upload form page (using Python or PHP) will upload to the DB. The data to upload have to be in XML, in the format defined by CherryPy.
  • Upload using the local DB file:
    • Information on the command to upload to the DB can be found on the AtCoolCopy page.
    • A working example of how to upload a mycool.db file is:
AtlCoolMerge.py mycool.db COMP200 ATONR_COOL ATLAS_COOLOFL_SCT_W DB_PASSWORD
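For reference (our understanding of the arguments, to be double-checked against the AtCoolCopy page): mycool.db is the local SQLite file produced by the job, COMP200 the target COOL database instance, ATONR_COOL the database server alias, ATLAS_COOLOFL_SCT_W the SCT offline writer account, and DB_PASSWORD a placeholder for the actual writer password.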

An example of a minimal web display:

WebDisplay_Noise.png WebDisplay_HV.png
Password-protected access; access granted gives permission to upload or to see the results, depending on the user.

Database Derived directory structure

  • /SCT/Derived/Monitoring [0,0,10000,0,3,-3,11,'NOISY',150.0,' 727 ]
  • /SCT/Derived/NoiseOccupancy [0,0,10000,2,8,0,50,54.4253921508789]
  • /SCT/Derived/DCS (?) [XXXXXXXXXXXXXXXXXXXXXXXXXXX]

Instructions for running the calibration loop (noise processing) on CAF

The SCT 36h Calibration & Monitoring is implemented in the Athena package InnerDetector/InDetCalibAlgs/SCT_CalibAlgs. As described above, the package should be run automatically by the TMS. However, in case the automation fails, the running of the package on CAF has to be initiated manually. There currently exist three ways of doing this.

Running the code by jobOptions on CAF

To run the code on CAF, do the following:

1. Log in as sctcalib on lxplus.

2. Set the Athena environment: source setEnv.sh

3. Go to the run directory: cd /afs/cern.ch/user/s/sctcalib/testarea/AtlasOffline-15.5.1/InnerDetector/InDetCalibAlgs/SCT_CalibAlgs/run

4. Locate the data files for the runs of interest (SCT included in the run, more than 50000 events in the IDCosmic stream). Normally they are stored on Castor in the directory /castor/cern.ch/grid/atlas/tzero/prod1/perm/data09_cos/physics_IDCosmic/, but this also depends on the tag, which can be data09_calophys or something else, so you should identify the right tag for the run in advance.

5. In run/LISTS create a text file containing the absolute data-file paths for the run of interest. For example, for run 135294 the file list_00135294.txt must be created with the following content (a sketch of a helper that generates such a list is given after the example):

/castor/cern.ch/grid/atlas/tzero/prod1/perm/data09_cos/physics_IDCosmic/0135294/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197._0001.1

/castor/cern.ch/grid/atlas/tzero/prod1/perm/data09_cos/physics_IDCosmic/0135294/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197._0002.1

/castor/cern.ch/grid/atlas/tzero/prod1/perm/data09_cos/physics_IDCosmic/0135294/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197/data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197._0003.1
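As an illustration, a small Python helper that builds such a list by querying Castor with nsls (the helper itself is hypothetical; the directory layout is the one from the example above):

# Hypothetical helper to build run/LISTS/list_<run>.txt from Castor.
import subprocess

def write_file_list(run, dataset_dir):
    """List all files of the dataset on Castor and write their full paths."""
    out_name = "LISTS/list_%08d.txt" % run
    names = subprocess.check_output(["nsls", dataset_dir]).decode().split()
    with open(out_name, "w") as out:
        for name in sorted(names):
            out.write("%s/%s\n" % (dataset_dir, name))

write_file_list(
    135294,
    "/castor/cern.ch/grid/atlas/tzero/prod1/perm/data09_cos/"
    "physics_IDCosmic/0135294/"
    "data09_cos.00135294.physics_IDCosmic.merge.NTUP_TRKVALID.f157_m197",
)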

6. Edit run/list_runs.py accordingly, with the name (number) of your run(s) and the location of the list file (list_00135294.txt in the example above).

7. Finally, set the number of runs you will run over in xrunCaf.py (variable s; for one run: s=1), and then execute from the run directory: python xrunCaf.py

Note, you can run over as many runs as you like at the same time.

8. You will see in your terminal the total number of events per run you are running over. You can cross-check this number with the one you find on http://atlas-runquery.cern.ch/. They should be the same; otherwise you have forgotten to include some data file. You can also use the site http://atlasdqm.cern.ch:8080/atlaswebdq?subsys=Overview as a reference for the quality of the runs.

9. Check that the jobs are running with bjobs. Sometimes they are pending for quite some time before going into run status. The results are output in /afs/cern.ch/user/s/sctcalib/scratch0/results/Cosmics_2009. The following files should be produced (for run 135294, for example):

BadStripsAllFile_00135294.xml, BadStripsNewFile_00135294.xml, BadStripsSummaryFile_00135294.xml, SCTHitMaps_00135294.root, job_00135294_00135294.sh, job_00135294_00135294.sh.e, job_00135294_00135294.sh.o, joboptions.py, log_00135294, mycool_00135294.db.

Check the log files in case of problems.

10. Study BadStripsAllFile_00135294.xml, BadStripsNewFile_00135294.xml, BadStripsSummaryFile_00135294.xml for new noisy modules/chips/strips.

11. If needed, report on the SCT Weekly Meeting (Wednesdays at 10:30).

12. Within 4 hours the results should automatically appear on: https://pc-sct-www01.cern.ch/24hLoop/index.php

Processing using JobTransform on CAF

These are instructions for running the JobTransform on CAF for noisy strip processing, as of 2.10.2012 (information as of 25.10.2009, for cosmic runs, is given in parentheses). The following applies to the NoisyStrip loop, whereas all histogram-based parts use the express_express stream.

0) Input datasets: daq.RAW (TrkVal ntuples)

  • In the Run List, find the run number and project tag (e.g. data12_8TeV (data09_900GeV)). The criteria are:
    • SCT is ON
    • Min. number of events: 5000 (as of the 21st Nov SCT daily meeting: please use 5k instead of 10k. This is to check data with fewer events during the commissioning period, with a mixture of no beam, 1 beam, 2 beams and collisions.)
    • Rec ENABLED

  • In Dataset status or Dataset lister (use Dataset Type = RAW, Stream = SCTNoise and Stream Type = calibration), find the dataset for the run.
    • The default dataset is from the calibration_SCTNoise (physics_RNDM) stream, and the full name is defined by ProjectTag.RunNumber.calibration_SCTNoise.daq.RAW (ProjectTag.RunNumber.physics_IDCosmic.merge.NTUP_TRKVALID.RecoTag_MergeTag).
    • An example from run 211772 is data12_8TeV.00211772.calibration_SCTNoise.daq.RAW.
    • Skip the run if no daq.RAW dataset is available from calibration_SCTNoise.
    • Check the status of the dataset, which is processed if the status is On CAF in the "Replication to CAF" column. Note that there might be a time delay for staging after the On CAF indication.

1) Login to lxplus

  • Use the sctcalib account and go to the directory below, from where we submit our jobs to CAF:
  • /afs/cern.ch/user/s/sctcalib/testarea/latest/InnerDetector/InDetCalibAlgs/SCT_CalibAlgs/run

2) Job submission to CAF

  • Execute the following command in the run directory, where Dataset is the dataset name found in step 0) and Tag is either c0 or c00 for the standard and trigger-aware analysis, respectively:
    • >./tfOnCAF.sh Tag Dataset doNoisyStrips (>./trfOnCAF.sh Dataset doNoisyStrips)
    • tfOnCAF.sh has replaced trfOnCAF.sh, which ran using the old transformation framework (PyJobTransformCore).
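    • For example, for the run 211772 dataset above and the standard analysis, the invocation would presumably be >./tfOnCAF.sh c0 data12_8TeV.00211772.calibration_SCTNoise.daq.RAW doNoisyStrips (a hypothetical invocation built from the pattern above).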

  • Status of a job
    • The job status can be checked with the bjobs command.
    • When a job has finished, the following log files will appear in the run directory:
      • ProjectTag.RunNumber.calibration_SCTNoise.sct_LoopName.SCTCALIB.Tag.o, ProjectTag.RunNumber.calibration_SCTNoise.sct_LoopName.SCTCALIB.Tag.e

3) Results

  • Check in the log files from 2) whether the job finished successfully or not.
    • If not successful, the errors are most likely due to low statistics in the dataset (less than 10000 events). In this case, we skip the processing for that run.
    • If not successful due to other errors, investigation is needed. Report the errors to Peter R. and Junji by e-mail.

  • If the job was successful, check the outputs stored under the following directory:
    • /afs/cern.ch/user/s/sctcalib/scratch0/tmp/results/ProjectTag/calibration_SCTNoise/RunNumber (/afs/cern.ch/user/s/sctcalib/scratch0/results/Cosmics_2009/RunNumber/doNoisyStrips)
    • A list of outputs: BadStripsAllFile.xml, BadStripsNewFile.xml, BadStripsSummaryFile.xml, mycool.db, sct_calib.log, SCTHitMaps.root
    • All outputs are automatically copied every 4 hours to the directory /var/www/html/24hLoop/Results on pc-sct-www01.

  • A summary of the results from BadStripsSummaryFile.xml can be found in the 36h calibration loop web display.
    • The web display is hosted on pc-sct-www01. The results can be seen after the automatic copy has run.

4) Upload to COOL

  • The tag used now is SctDerivedMonitoring-UPD4-002 (SctDerivedMonitoring-002-00).
  • To upload the data to COOL (for use in reprocessing):
    • Go to the 36h calibration loop web display development page.
    • Decide which runs you wish to upload (see criteria below).
    • Click on the upload box (right-hand side of the page) for the runs to be uploaded. If there are no boxes, you do not have upload rights - contact Jose.
    • Click on the 'send' button at the bottom of the page. This will write a list of runs to be uploaded into a directory on pc-sct-www01. The actual upload will be done by a cron job which runs at 17:00 (?) each day.
    • When a run has been uploaded, a tick will appear instead of the upload box, together with a link to the upload log file. Check the upload log files to make sure there were no errors. If any errors are seen, mail Pat.
    • In case of doubt that the upload was done in time for the bulk reconstruction, use the automated check available here (check PCalEnd for the corresponding run). For data12_8TeV this works from run 212619 onwards!
  • Criteria for uploading (a sketch of these checks is given after this list):
    • Do NOT upload if the number of events processed is < 10k (this should already be a requirement in the job).
    • Do NOT upload if the number of MODULES with at least 1 noisy strip is > 200 (obsolete? ML: > 5000).
    • If the number of modules with at least 1 noisy strip has changed by more than 20% from the average of the last 5 runs, find out why (obsolete? ML: > 50%). Do NOT upload unless a good reason for the change is identified (e.g. a change of threshold).
    • If the number of noisy strips has changed by > 128 from the last run, find out why (obsolete? ML: > 2560). Do NOT upload unless a good reason for the change is identified (e.g. the change is due to a new noisy chip).
    • Do not bother to upload runs which are marked totally bad (red) by the SCT offline DQ shifter.
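A minimal sketch of these upload checks (the function name and the bookkeeping arguments are hypothetical; the thresholds are the original ones quoted above, not the "ML" values):

# Hypothetical sketch of the upload criteria listed above. Returning
# False means "do not upload without further investigation".
def ok_to_upload(n_events, n_noisy_modules, n_noisy_strips,
                 last5_noisy_modules, last_noisy_strips, dq_totally_bad):
    if dq_totally_bad:
        return False                    # run marked totally bad (red) by DQ
    if n_events < 10000:                # fewer than 10k events processed
        return False
    if n_noisy_modules > 200:           # too many modules with >= 1 noisy strip
        return False
    avg5 = sum(last5_noisy_modules) / float(len(last5_noisy_modules))
    if abs(n_noisy_modules - avg5) > 0.20 * avg5:
        return False                    # >20% change w.r.t. last 5 runs: investigate
    if abs(n_noisy_strips - last_noisy_strips) > 128:
        return False                    # big jump in noisy strips: investigate
    return True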

5) If needed, report in the SCT Weekly Meeting and e-log

  • Report the results in the SCT Weekly Meeting (Wednesdays at 10:30) to give feedback to SCT Operations, especially on noisy links, noisy chips, big changes in the number of noisy strips, etc.
  • Write a summary of processing and findings into e-log.

Instructions for monitoring the calibration loop on TMS

0) Find the list of recent runs as in step 0) of the instructions for running the calibration loop (noise processing) on CAF.

1) Check that all the tasks have been created using the TaskLister (only show tasks with username sctcalib).

2) Check that no jobs have failed. If so, send an e-mail to Peter.Lundgaard.Rosendahl@cernNOSPAMPLEASE.ch and Mykhailo.Lisovyi@cernNOSPAMPLEASE.ch with the taskid of the failing tasks. If a task does not contain any jobs, check that the status of the input dataset is OnCAF.

3) Do steps 4)-5) from the instructions for running the calibration loop (noise processing) on CAF.

Expert Instructions for the calibration loop on TMS

TOM daemon management

The TOM daemon, which handles task creation and supervision, is currently run on pc-sct-www01. All files related to this daemon are currently located in

/afs/cern.ch/atlas/project/tzero/sct/run

A detailed description of how to start and stop the daemon is given in help_sctTOM.txt. *Note that the daemon should always be started from screen, as described in the help file*.

The configuration of the TOM is done in the sctTOM.cfg file. Some general documentation is found at AtlasTierZeroTOM; some important settings are listed below (a hypothetical example follows the list):

  • automaticmode [on/off] - controls whether automation is on for this type of job.
  • inputdsspecautomatic - specifies the search string for new data. For every new dataset matching this search a new task will be created.
  • tasktransinfo trfsetupcmd - specifies the Athena setup command.
  • tasktransinfo trfpath - specifies the transformation to be run.
  • inputs - specifies the name of the input argument for the trf, i.e. input, followed by the metatype of the input. Use "!{'metatype' : 'inputLFNlistDA'}" to turn off staging of input files.
  • outputs - specifies the name of the output argument of the trf, i.e. prefix, along with the storage information.
  • phconfig - all other arguments that should be passed to the trf can be specified with the phconfig option, e.g. phconfig SCTCalibConfig "!{'value': ['SCTCalibConfig.py'],}"
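Purely as an illustration, the relevant sctTOM.cfg entries might look roughly as follows; the option names are the ones listed above, but the exact syntax and the <...> values are assumptions, not a verified configuration:

automaticmode on
inputdsspecautomatic <search string matching e.g. data12_8TeV.*.calibration_SCTNoise.daq.RAW>
tasktransinfo trfsetupcmd <command sourcing the Athena setup>
tasktransinfo trfpath sct_calib.py
inputs input "!{'metatype' : 'inputLFNlistDA'}"
outputs prefix <storage information>
phconfig SCTCalibConfig "!{'value': ['SCTCalibConfig.py'],}"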

Manual submission to TMS

In order to manually create tasks in TMS, you first need to check out two packages:

svn co svn+ssh://svn.cern.ch/reps/atlasoff/Production/tzero/trunk tzero

and

svn co svn+ssh://svn.cern.ch/reps/atlasoff/Production/tzcontrib/sct/trunk sct

Then, in the sct directory, make sure that the PYTHONPATH in setup.sh is pointing to the tzero directory that you just checked out, and then source setup.sh.

Now you can create new tasks using the input dataset name:

./SCTCalib.py "input" "part", where part is either ns, hv, ...

If everything is OK, the command should end with "New task created".
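For example, for the run 211772 dataset mentioned above, a noisy-strip task would presumably be created with ./SCTCalib.py data12_8TeV.00211772.calibration_SCTNoise.daq.RAW ns (a hypothetical invocation built from the pattern above).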

You can see the DB information about a task by using Inspect.py:

./Inspect.py "taskid"

Guide to the TaskLister

For a general description of the TaskLister, please see https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasTierZeroNewMonitoringPagesHelp#taskLister

  • Only show sctcalib tasks. This is done by selecting only "sctcalib" under "UserName" in the left sidebar.

  • Locate the task. A task is most easily found by its run number (in the first column) and the Type (in the 5th column).

  • Taskname. The task name is generated from the input dataset name, appending the AMI tag, the step (sctns, sctdc, ...) and the final ".task".

  • Job status. When all the jobs in a task are done and the output dataset is staged, the "Status" column shows "FINISHED". The other task states are:

| Status | Meaning |
| RUNNING | The input dataset has been created, but no jobs are defined yet |
| YELLOW | Jobs are running |
| GREEN | The jobs have finished, but the staging of output and logs is not completed |
| RED | A job has failed and automatic resubmission has reached the maximum number of attempts. Consult the log file. |
| FINISHED | Task completed |

  • Job Statistics. The "Job Statistic" columns show the number of jobs in the different states. In case a task is "FINISHED", check the "#Done" column.

| Column | Comment |
| #Done | "1" means the calibration job was successful. "1(1)" means the job exited with "Need checks" and you should consult the log file. |
| #TJF | Number of failed jobs. Failed jobs are automatically rerun twice. If the task is "RED", consult the log file. |
| Event | Not filled |

  • Consulting log files. To see the log file for a given task, click either in the "#Done" or "#TJF" column and then on the "log file". In the new window, scroll down to see the "ErrorInfo".

Troubleshooting:

| Problem | Comment |
| A task is "RUNNING", but the number of jobs in the task is zero | The input dataset is created, but not closed. The task is therefore created but not yet filled with any jobs. Check that the input dataset is "onCAF". |
| The password has been changed | Follow /home/sctcalib/bin/README |


Major updates:
-- PeterLundgaardRosendahl - 13-Apr-2010

-- PeterVankov - 23-Oct-2009

-- JoseEnriqueGarcia - 04 Aug 2009

%RESPONSIBLE% Main.unknown
%REVIEW% Never reviewed
