Unbinned PID resampling using kernel density estimation


We have launched migration to the improved PIDGen2 tool which can be found here: https://gitlab.cern.ch/lhcb-rta/pidgen2/. PIDCorr2 is still under development (see the conversation on the mailing list), so if you need access to that functionality, you must still use the old tool, described below.

Please use the lhcb-phys-pid-calibration mailing list in case of questions.


PIDGen is a set of scripts within PIDCalib package to correct the PID response for MC signal samples based on PID calibration data (the same data as used by "traditional" PIDCalib approach). It implements two alternative approaches:

  • PID resampling, where the PID response is completely replaced by the one randomly generated from calibration PDFs.

  • Transformation of PID variables, where the PID variables from the simulation are transformed in such a way that they are distributed as in data.

Like in PIDCalib, the PID correction is a function of track kinematics (momentum and pseudorapidity) and event multiplicity (number of tracks). Unlike PIDCalib, the correction is done with an unbinned approach, where the calibration PDFs in four dimensions (PID, Pt, eta, Ntrack) are described by a kernel density estimation procedure using the Meerkat library.

The corrected PID response from both the PID resampling and PID variable transformation can be used as an input to a multivariate classifier.

The PID resampling approach has an important limitation that the PID variables for the same track are generated independently, and thus no correlations between them are reproduced. Therefore, only one PID variable per track can be used in the selection (correlations between variables for different tracks are, naturally, preserved via correlations with kinematics of tracks).

The approach with PID variable transformation aims to remove this limitation. In that case, the corrected PID variables preserve correlations with the output of simulation, and through the correlations in simulation the correlations between PID variables for the same track are reproduced "in the first order". This is not perfect by probably acceptable in most cases. The drawback of this approach is that it also relies on the parametrisation of PID PDFs in simulation (which are extracted from samples that are typically much smaller than calibration data).

More details about the approach can be found in this presentation and in the LHCb-INT-2017-007 internal note draft.

Contents of the package

All the user-level scripts are located in $PIDPERFSCRIPTSROOT/scripts/python/PIDGenUser/ directory under PIDCalib package in Urania project. Here is the GitLab repository of the code.

  • PIDGen.py - Python script for PID resampling.

  • PIDCorr.py - Python script for PID variable transformation.

  • Examples/Lb2Lcpi/Lb2Lcpi_pidgen.py - Example PID resampling script for Lb->Lcpi MC sample

  • Examples/Lb2Lcpi/Lb2Lcpi_pidcorr.py - Example script for PID variable transformation for Lb->Lcpi MC sample

Note that the scripts use data stored in CERN EOS, and assume that it's accessible at root://eoslhcb.cern.ch/ server.

Using PIDGen with MC samples

Generation of PID response is performed by one of the two scripts, PIDGen.py or PIDCorr.py. The scripts require many parameters (MC data location, variable names, location of the calibrated PID response etc.). The sample python scripts are provided, Examples/Lb2Lcpi/Lb2Lcpi_{pidgen, pidcorr}, which the user can modify to actually run the PID generation for their own MC sample.

Running PID resampling (PIDGen)

To run PIDGen resampling, you need to setup Urania v10r1, copy the example script to a directory with write access (the output ntuple will be created there) and run it. To benefit from the latest updates in the package, it is recommended to run from Urania HEAD nightlies:

$ lb-run -c best --nightly lhcb-run2-patches Urania/HEAD bash
$ cp $PIDPERFSCRIPTSROOT/scripts/python/PIDGenUser/Examples/Lb2Lcpi/Lb2Lcpi_pidgen.py {your_place}
$ cd {your_place}
$ python Lb2Lcpi_pidgen.py

If the first line crashes, try running the following instead:

$ lb-run -c x86_64_v2-centos7-gcc11-opt --nightly lhcb-run2-patches Urania/HEAD bash

if this crashes, too, try setting up Urania/v10r1 which is up-to-date (as of end 2021):

$ lb-run -c best Urania/v10r1 bash

The complete list of PID configurations available in Run 1 and Run 2 can be obtained by running PIDGen.py without arguments:

$ python $PIDPERFSCRIPTSROOT/scripts/python/PIDGenUser/PIDGen.py
Note that Run 1 and Run 2 PID configurations have different names!

Running PID variable transformation (PIDCorr)

To run PIDCorr variable transformation, you need to setup Urania, copy the example script to a directory with write access (the output ntuple will be created there) and run it:

$ lb-run -c best --nightly lhcb-run2-patches Urania/HEAD bash
$ cp $PIDPERFSCRIPTSROOT/scripts/python/PIDGenUser/Examples/Lb2Lcpi/Lb2Lcpi_pidcorr.py {your_place}
$ cd {your_place}
$ python Lb2Lcpi_pidcorr.py

The result of PIDCorr is dependent on the version of MC used to generate your signal ("sim08" or "sim09" for Run1 or "run2" for Run2)! The complete list of PID configurations available can be obtained by running PIDCorr.py without arguments:

$ python $PIDPERFSCRIPTSROOT/scripts/python/PIDGenUser/PIDCorr.py

PID templates available in Urania v8r0 and nightlies

Check the Run 1 control plots for all available combinations of track and PID variable in Run1.

Check the Run 2 control plots for all available combinations of track and PID variable in Run2.

Electron templates ending with _Stripping are produced from the Stripping28 samples (updated sWeights calculation using HasBremAdded category). It is recommended to use those instead of the _Brunel versions (direct output from Turcal).

K_MC15TuneV1_ProbNNK_Brunel_Mod2 and pi_MC15TuneV1_ProbNNpi_Brunel_Mod2 templates feature improved description of the distribution near ProbNN=0 and are recommended to be used instead of x_MC15TuneV1_ProbNNx_Brunel versions (where a step-like structure can be observed around zero).

PIDGen templates MC15TuneV1_ProbNNpiNotK_Brunel and MC15TuneV1_ProbNNKNotpi_Brunel provide resampling for ProbNNpi*(1-ProbNNK)* and ProbNNK*(1-ProbNNpi) combinations of variables, respectively (for instance, for tracker-only MC where PIDCorr cannot be used).

Templates with the names containing e.g. "PIDKgt4" or similar are for resampling of variables with a certain cut on PID (e.g. PIDKgt4_MC15TuneV1_ProbNNK_Brunel is for the ProbNNK with PIDK>4 cut).


In some cases, separate templates are created for different categories of datasets. There are currently two classes of such categories:

  • Templates from datasets split by track charge. These are available for most pion, kaon and proton responses and can be accessed by appending "_Plus" or "_Minus" to the dataset name, e.g. "MagDown_2012_Plus".

  • Templates from datasets split by "BremAdded" category for electrons. Can be accessed by appending "_Brem" or "_NoBrem" to the dataset name, e.g. "MagDown_2012_Brem".

Evaluation of systematic uncertainties

In general, the uncertainties in PID efficiencies can come from at least four sources:

  1. Statistics of MC calibration samples.
  2. Parametrisation of weighted PID control samples. In the case of PIDGen/PIDCorr it is based on kernel density, PIDCalib uses histograms
  3. Subtraction of background from PID control samples using sWeights. This procedure assumes no correlation between weighting variable (e.g. D* or D mass) and the other variables (Pt, eta of the track, PID variables), and in general any correlation will introduce bias.
  4. Correlations of PID response with the other parameters of the event not taken into account by PIDCalib (anything apart from P, Eta, Ntracks) and between the tracks (e.g. if the Cherenkov rings overlap in RICH).

The "standard" procedure in PIDCalib or PIDGen/PIDCorr only accounts for 1. and 2.

Source 1. Should be small for pions and kaons, and can be estimated by using alternative PID templates obtained from bootstrapped samples. The PIDGen.py and PIDCorr.py scripts have a command line parameter '-v' or '--var', to specify the "variation" of the template. By default it's "default", but one can also specify "stat_0", ... "stat_4" to choose one of the bootstrapped templates (usually there are at least 5 bootstrapped templates for each PID combination) and fill the corresponding corrected variables.

Source 2. can be substantial, and can be estimated similarly, by using an alternative template called "syst_1". It's the template with the modified kernel width (50% wider) and should give the idea about the influence of the KDE procedure.

Concerning 3 and 4, there is no universal way to estimate the uncertainty. The source 3. can probably be ignored at least for pions and kaons (the D*->Dpi sample is very clean). For 4., one would need to use something different from PIDCalib (which always relies on the assumption that PID is only correlated with P, Eta and track multiplicity). Some analyses are using control channels in data to compare MC-corrected and data distributions, which should cover 4. Alternatively, one could e.g. compare PIDCorr and PIDGen (in PIDGen, the correlations with anything else are completely ignored, while in PIDCorr they are in principle preserved from simulation, which would give at least some idea about the effect).

Frequently asked questions

I don't have the Eta (pseudorapidity) variable in my simulation ntuple.
You can use PIDGen (PIDCorr) with either Pt and Eta variables (-p <ptvar> -e <etavar>), or Pt and P variables (-p <ptvar> -q <pvar>) taken from the ntuple. In the latter case Eta will be calculated from P and Pt in the script.

The PID response I need is missing in PIDGen
Probably nobody asked for it before, please let us know. But check the Urania nightlies first, it might happen that the response you need was just added recently.

The PID response I need is present in PIDGen but not in PIDCorr
Probably nobody asked for it before, please let us know. But PIDCorr relies on MC samples as well, so you might need to wait until the sample is generated before the PID response you need becomes available.

The result of PIDGen does not match data
  1. Check that there are no "hidden" PID cuts in the simulated or data samples (e.g. in the stripping).
  2. Check that the kinematic distribution of the MC tracks (momentum, eta) is covered by the calibration data (e.g., there is typically no PID calibration data for tracks softer than 3 GeV (below RICH threshold) or above 150 GeV).
  3. Check that kinematic distributions of the simulated sample match those in data. It might be necessary to e.g. reweigh MC events with the Dalitz plot distributions observed in data.
  4. Usually, one needs to either reweigh MC such that the number of tracks distribution matches that in data (or at least rescale the Ntracks variable by multiplying it by 1.10-1.15 such that it roughly resembles data distribution). If the nTracks is the only variable you need to correct, you can use the command line option in PIDGen/PIDCorr.py scripts such as "--ntrscale 1.15" to do this internally.
  5. If data distribution is obtained with sWeights, make sure there are no biases in the distribution (e.g. correlations between reweighting and PID variables, or backgrounds that peak in the reweighting variable).

The result of PIDCorr does not match data
In addition to the issues mentioned above, there can be a few issues specific to PIDCorr:
  1. The version of simulation (sim08, sim09) used to generate your signal sample should match the version used to create PID response templates. PIDCorr allows one to choose either sim08 or sim09 (but note that not all PID responses are available for both versions).
  2. Typically, MC samples used to create simulation response templates are much smaller than the data calibration samples. Therefore, uncertainty due to the size of calibration samples can be larger with PIDCorr than with PIDGen.
  3. The kinematic coverage of the MC sample can be different from the coverage of the calibration sample, in which case PIDCorr can perform worse than PIDGen. This is known to be the case for Run2 low-Pt pions (below 500 MeV), to be fixed.

Can I set up PIDGen locally in my university?
It can be done, but will require plenty of space:
  1. Make sure Urania is accessible in your local machine (e.g. via cvmfs)
  2. Copy the contents of /eos/lhcb/wg/PID/PIDGen to your local machine and change the value of eosdir and eosrootdir variables in PIDGenExpert/Run*/Config*.py files to point to this location.
  3. Change the environment variable $PIDPERFSCRIPTSROOT to point to the location of modified PIDCalib scripts before running PIDGen or PIDCorr

What stripping version is used for Run1 electron samples?
e_CombDLLe and e_VxProbNNe variables correspond to Stripping 21, e_S20CombDLLe and e_S20VxProbNNe to Stripping 20.

I use weights to match simulation kinematics to data. Should I pass the weights to PIDGen/PIDCorr?
You don't need to pass your weights to PIDGen/PIDCorr. PIDGen/PIDCorr ensures that for each particular Pt, Eta, Ntracks the PID distribution will match data. Thus if your Pt, Eta, Ntracks distribution matches data, PID response will be correct. How exactly you make Pt, Eta, Ntracks match data is up to you. If you use event-by-event weighs, you will just correct each single event with PIDGen/PIDCorr and then use that event weight later to e.g. compare with your data calibration sample of BDT tuning.

Distribution of ProbNNp for protons in Run2 doesn't look good
Run2 uses Lambda0 calibration sample which is biased for tracks coming from the PV. See this presentation. The solution is to either use the variable p_LbLcPi_MC15TuneV1_ProbNNp_Brunel (coming from the low-stats Lb->LcPi sample) or wait for inclusive Lc sample to be produced.

Nightlies are broken
Sometimes the nightlies in "today's" slot are not compiled properly. In this case, one can try to specify an older slot: $ lb-run -c best --nightly lhcb-run2-patches Mon Urania/HEAD bash (note the "Mon" part, which means take the Monday slot).

-- AntonPoluektov - 2017-05-09

Edit | Attach | Watch | Print version | History: r59 < r58 < r57 < r56 < r55 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r59 - 2022-04-07 - AntonPoluektov
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LHCb All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback