This page describes and documents the Tallinn analysis code for SingleTopPolarization.

Main workflow

The data processing is done in several steps:

  • step1: CMSSW: PFBRECO and object ID, last run ~Summer 2013
  • step2: CMSSW: event reco, top reco, calculation of single-top-specific variables (cos θ*), systematically varied weights, last run ~November 2013
    • run using CRAB
    • input datasets, split by tag (date) of the step1 processing
    • takes ~2 days
    • The finished MC datasets are in datasets/step2/mc/Sep8, Sep8_qcd and datasets/step2/mc_syst/Sep8; these are completely done.
    • The real data are in datasets/step2/data/Sep8, but these are not completely done (~90%); publication is about 10% done
    • The crab configuration files can be created with python datasets/ -s step2 -s step2_syst -t TAG --pset=PATH/src/step2/step2_*, adding additional command-line arguments as needed.

Running step3

Step3 was written in Julia and consists of two main parts:

  • src/skim/skim.jl which processes step2 files into flat ntuples suitable for BDT training
  • src/analysis/evloop2.jl which loops over the events and creates the systematically varied histograms

Julia compiles the code dynamically, but runs only on SLC6 boxes: thebe/ied/phys/wn*. Access to ROOT and EDM files in Julia is provided through libraries maintained by Joosep.

To set up Julia, run

source /home/software/julia/

Step3 EDM->flat ntuple

To test whether you can process step3 locally, run from $STPOL_DIR/src/skim:

/home/software/.julia/v0.3/CMSSW/julia skim.jl testout /hdfs/cms/store/user/andres/s2_Oct22/iso/nominal/T_t_ToLeptons/output_1_1_oR9.root

The output should be testout.root (the ntuple) and testout_processed.root (metadata). If this test is successful, next test submission of step3 to the cluster.
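One way the cluster test can be organized is to split a list of step2 input files into fixed-size chunks and write one job script per chunk. This is only a hedged sketch: the file list, the chunk size, and the job-script layout below are illustrative assumptions, not the actual submission scripts of the analysis; only the julia invocation follows the local-test command above.

```shell
#!/bin/bash
# Hedged sketch of preparing step3 cluster jobs: split a step2 file list
# into per-job scripts suitable for sbatch. The stand-in list, the chunk
# size and the job layout are assumptions for illustration.
for i in $(seq 1 120); do echo "/hdfs/example/step2/output_${i}.root"; done > step2_files.txt
CHUNK=50                     # files per cluster job (arbitrary choice)
mkdir -p jobs
split -l "$CHUNK" -d step2_files.txt jobs/chunk_
i=0
for c in jobs/chunk_*; do
    {
        echo '#!/bin/bash'
        # each job runs skim.jl with an output prefix followed by its inputs
        echo "/home/software/.julia/v0.3/CMSSW/julia skim.jl out_${i} \$(cat $c)"
    } > "jobs/job_${i}.sh"
    chmod +x "jobs/job_${i}.sh"
    i=$((i+1))
done
echo "created $i job scripts"   # submit each with: sbatch jobs/job_N.sh
```

With 120 input files and a chunk size of 50 this produces three job scripts, each processing at most 50 files.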

Debugging step3:

  • ERROR: undefined BASE: modify analysis/basedir.jl

Step3 metadata

After running step3, the CSV files it produces should be aggregated to calculate the generator-level (gen) weight. This can be done with

cd src/skim
julia metahadd.jl ../../datasets/step4/csvt.dat metadata.json

The file metadata.json contains the total number of generated events for all samples.
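Conceptually, the aggregation sums the per-file generated-event counts into one total per sample. The sketch below illustrates this with awk; the CSV layout (sample,ngen) and the numbers are assumptions for illustration only, and metahadd.jl is what writes the real totals to metadata.json.

```shell
# Hedged illustration of the aggregation: sum per-file generated-event
# counts into per-sample totals. The CSV layout is an assumption.
cat > counts.csv <<'EOF'
T_t_ToLeptons,1000
T_t_ToLeptons,2500
Tbar_t_ToLeptons,1800
EOF
# accumulate the second column keyed on the sample name
awk -F, '{tot[$1] += $2} END {for (s in tot) print s, tot[s]}' counts.csv
```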

Creating step3.root.added files with BDT

Next, one must call a script in src/skim/ to add the gen weight and recalculate the BDT variables. This must be done for every step3 output file. In the script, replace @METADATAFILE@ with the absolute path to the output of metahadd.jl and define $FILE_NAMES.
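The substitution step can be sketched as follows. The template name job_template.sh and the input file names are stand-ins for the real (unnamed) template and the actual step3 outputs; only the @METADATAFILE@ placeholder and the $FILE_NAMES variable come from the workflow above.

```shell
# Hedged sketch of preparing one job: substitute the @METADATAFILE@
# placeholder and define $FILE_NAMES. "job_template.sh" and the input
# paths are hypothetical stand-ins.
cat > job_template.sh <<'EOF'
#!/bin/bash
METADATAFILE=@METADATAFILE@
EOF
# replace the placeholder with the absolute path to metahadd.jl's output
sed "s|@METADATAFILE@|$PWD/metadata.json|" job_template.sh > job.sh
export FILE_NAMES="/hdfs/example/step3/output_1.root /hdfs/example/step3/output_2.root"
grep METADATAFILE job.sh   # the placeholder is now an absolute path
```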

Running step4: histograms

Simple test command

julia analysis/evloop2.jl testout.root analysis/infile.json 0 10000000 /home/andres/single_top/stpol_pdf/src/step3/output/Oct28_reproc/iso/nominal/T_s/output0.root

The general form is (firstevent and lastevent are inclusive):

julia analysis/evloop2.jl outputfile.root jsonfile firstevent lastevent inputfile1.root inputfile2.root
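Since firstevent and lastevent are inclusive, splitting a file into chunks for the cluster needs care at the boundaries. This hedged sketch prints one evloop2.jl command per chunk; the event count and chunk size are made-up numbers for illustration.

```shell
# Hedged sketch: split an inclusive [firstevent, lastevent] range into
# fixed-size chunks, one evloop2.jl invocation per chunk.
NEVENTS=100     # total events in the input file (stand-in number)
STEP=30         # events per job (arbitrary)
first=0
while [ "$first" -lt "$NEVENTS" ]; do
    last=$((first + STEP - 1))
    # clamp the final chunk to the last existing event
    if [ "$last" -ge "$NEVENTS" ]; then last=$((NEVENTS - 1)); fi
    echo "julia analysis/evloop2.jl out_${first}.root analysis/infile.json $first $last input.root"
    first=$((last + 1))
done > step4_cmds.txt
cat step4_cmds.txt
```

For 100 events in chunks of 30 this yields four commands covering [0,29], [30,59], [60,89] and [90,99], with no event processed twice.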

To run a larger set, submit with sbatch.

infile.json configures various options when running:

    "qcd_cut":"mva_nominal", #also possible metmtw_nominal
    "b_weight_nominal": "b_weight", #for CSVT, also possible b_weight_old for TCHPT 
    "do_delta_r": true, #currently not used
    "bdt_var": "bdt_sig_bg", #bdt variable to cut on
    "soltype": "none", #none - use all solutions, also possible real or complex
    "vars_to_use": "analysis", #analysis - small set, all_crosscheck - long list of variables
    "do_ljet_rms":false #rms cut on histograms
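Stripped of the inline comments (strict JSON does not allow them), a complete infile.json assembled from the options above would read:

```json
{
    "qcd_cut": "mva_nominal",
    "b_weight_nominal": "b_weight",
    "do_delta_r": true,
    "bdt_var": "bdt_sig_bg",
    "soltype": "none",
    "vars_to_use": "analysis",
    "do_ljet_rms": false
}
```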


Tips for running CRAB2

  • Run crab status with crab -c DIR -USER.xml_report=RReport.xml -status
  • Aggregate the results by running the report script over CRABDIRS*/share/RReport.xml (you will need numpy and scipy)
  • The script will produce CRABDIRS*.files.txt with the list of good files from these crab jobs; these lists are needed for the later steps.

PDF weights

The PDF weights should be computed in a separate run, using dedicated configuration files.

Two configurations are necessary since the default LHAPDF in CMSSW supports only 3 concurrent PDF sets. These PDFs should be calculated for these datasets:

Steps for going from histograms to unfolded result

  1. BDT fit using stpol/src/fraction_fit/
  2. unfolding using stpol/src/stpol-unfold/

BDT fit

The fit is run using theta-auto, based on the standalone script

This config file takes a single command-line argument: a simple .cfg file with the particular fit configuration, which determines the systematic scenario and the specific priors.

The fit configuration files are found in

The template of a fit config is

In short, to run all the nominal fit results on new histograms, edit fitconfigs/bdt/mu/nominal.cfg to reflect the correct inputs and propagate the change to the other fitconfigs/bdt/*/*.cfg files using

 ./ fitconfigs/bdt/mu/nominal.cfg
 rm -Rf fitconfigs/bdt/ele
 cp -R fitconfigs/bdt/mu fitconfigs/bdt/ele
 sed -i 's/mu/ele/g' fitconfigs/bdt/ele/*.cfg    # GNU sed; on BSD/OSX use: sed -i '' ...

and all the fits should be run using

find fitconfigs/bdt -name "*.cfg" -exec ./ {} \;

where the script being executed is just a wrapper that calls theta-auto.

The outputs of the fits are stored in the folder results/bdt, which is also added to SVN in AN-14-001/trunk/data/fits.


Unfolding

The unfolding is based on the KIT code with some modifications, and is accessible in

Before compiling the code, edit the Makefile with the path to TUnfold, then do

make unfold

If that works, edit the Makefile to change DATADIR to point to the histograms, and copy the contents of AN-14-001/trunk/data/fits to a folder fitresults, which should contain files matching

$ find `pwd`/fitresults -name "*.txt" | head

To run a single unfolding, the program unfold should be called as follows:

#./unfold /path/to/histograms systematic /path/to/fitresult outfile.root
./unfold $(DATADIR)/mu/ nominal fitresults/nominal/mu histos/mu__nominal.root
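The single call above generalizes to a loop over systematics. This is only a hedged sketch: the systematic names other than nominal are hypothetical, and a stub stands in for the compiled unfold binary so the loop structure can be shown self-contained; only the argument order follows the calling convention documented above.

```shell
#!/bin/bash
# Hedged sketch: run the unfolding once per systematic variation.
# A stub stands in for the compiled unfold binary; the systematic
# names besides "nominal" are hypothetical.
cat > unfold <<'EOF'
#!/bin/bash
# stub: args are /path/to/histograms systematic /path/to/fitresult outfile.root
echo "unfolded: $1 $2 $3" > "$4"
EOF
chmod +x unfold
DATADIR=histograms            # stand-in for the real DATADIR
mkdir -p histos
for syst in nominal syst_up syst_down; do
    ./unfold "$DATADIR/mu/" "$syst" "fitresults/$syst/mu" "histos/mu__${syst}.root"
done
ls histos
```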

To run all the unfoldings, one can do

make do_unfold_mu
make do_unfold_ele

The unfolded histograms are stored in histos/*.root, and are uploaded with the AN to AN-14-001/trunk/data/unfolded/


The output files are stored in Hadoop on T2_EE_ESTONIA.

Primary pat::tuples (step1) with PFBRECO

These are published in DBS and accessible via CRAB.

File lists for local processing can be extracted using DAS, e.g.: --query="file dataset=/T_t-channel_TuneZ2star_8TeV-powheg-tauola/jpata-stpol_step1_v2_1_noSkim-6d0886f8efd932bc8d37cab903c44a2c/USER instance=cms_dbs_ph_analysis_02" --limit=0

Intermediate (step2) ntuples in EDM-format with plain event content

The most recent conglomeration of files is

Final ntuples (step3) in ROOT format

Located on T2_EE_Estonia; the files can be listed with

lcg-ls -b -D srmv2 -v --vo cms srm://

The datasets are split according to /btagger/processing-tag/lepton-isolation/systematic-variation/sample-name/job/

One can find the source for step3 above, under the step2 section.

The job folders contain

  • output.root - main kinematic variables in one TTree ("dataframe")
  • output.root.added - BDT output and some separately-calculated event weights in a similar TTree

The most important event weights are

  • xsweight - the sample-dependent cross-section weight corresponding to L=1/pb
  • pu_weight - the pile-up correction weight
  • ...

The most important cut variables are

  • hlt, n_signal_mu/ele, n_veto_mu/ele, njets, ntags, met/mtw/bdt_qcd, bdt_sig_bg

Fitting with theta

Apply the following patch to theta to be able to configure signal priors with mle(..., signal_prior="gauss:mean,sigma", ...)

Index: utils2/theta_auto/
--- utils2/theta_auto/	(revision 403)
+++ utils2/theta_auto/	(working copy)
@@ -603,6 +603,11 @@
         elif spec.startswith('fix:'):
             v = float(spec[4:])
             signal_prior_dict = {'type': 'delta_distribution', 'beta_signal': v}
+        elif spec.startswith('gauss:'):
+            v = spec[6:].split(",")
+            mean = float(v[0])
+            sigma = float(v[1])
+            signal_prior_dict = {'type': 'gauss1d', 'parameter':'beta_signal', 'range':["-inf","inf"], 'mean':mean, 'width':sigma}
         else: raise RuntimeError, "signal_prior specification '%s' unknown" % spec
         if type(spec) != dict: raise RuntimeError, "signal_prior specification has to be a string ('flat' / 'fix:X') or a dictionary!"

Scanned histograms (step4) in ROOT format

  • the histograms for the BDT fit are in hists/preselection/Nj_Mt/{mu,ele}/merged/bdt_sig_bg.root
  • The histograms are provided pre-fit, that is, the BDT fit coefficients are not applied
  • The MC histograms (also the transfer matrix) are normalized to measured luminosity
  • for the transfer matrix, unreconstructed events are put into the underflow bin of the reco-axis, and the corresponding projections are provided. proj_x should correspond to the number of generated events.

Aug 5, 2014

BDT scan.

old PAS

Out-of-the-box histograms from the PAS (TCHPT, QCD by MET/MTW, old BDT). Uses the *old* analysis code.

Fit results

Topic revision: r23 - 2014-11-27 - JoosepPata