Prospects for computer-assisted data quality monitoring at the CMS pixel detector.

Abstract

Data quality monitoring (DQM) and data certification (DC) are of vital importance to advanced detectors such as CMS, and are key ingredients in assuring solid results of high-level physics analyses. The current approach for DQM and DC at CMS is mainly based on manual monitoring of reference histograms summarizing the status and performance of the detector. This requires a large amount of person power, and the time granularity is kept rather coarse so that the number of histograms to check remains manageable. We investigate methods for computer-assisted DQM and DC at the CMS detector, focusing on a case study in the pixel tracker. In particular, using data taken in 2017, we show that autoencoder techniques are able to accurately spot anomalous detector behaviour, with a time granularity previously inaccessible to the human certification procedure.

Definitions and terminology

Luminosity section (LS): An elementary time unit of continuous data taking in CMS, during which the instantaneous luminosity is assumed unchanged. An LS lasts 2^18 LHC orbits, or approximately 23.3 seconds.

Run: A time unit of data taking in CMS, typically consisting of a few tens to a few hundreds of luminosity sections.

Fill: A period during which the same proton beams are circulating in the LHC, typically spanning multiple runs.

Data quality monitoring (DQM): The process of checking the quality of recorded data, with the aim of spotting potential detector issues.

Data certification (DC): The process of checking the quality of recorded data, aiming to certify the data as good for usage in physics analyses.

Non-negative matrix factorization (NMF): A type of factorization method for non-negative inputs, computing a set of basis components that optimally span the space of input instances [3].

Barrel pixel (BPIX) layer and forward pixel (FPIX) disk: Components of the pixel tracker at CMS [5]. There are four barrel layers, numbered BPIX L1 to BPIX L4, and three forward disks on each side, numbered FPIX- D3 to FPIX+ D3.
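
The LS duration quoted in the definitions above can be cross-checked with a short calculation. The sketch below assumes an LS of 2^18 orbits and a nominal LHC revolution frequency of about 11245 Hz; the latter value is not given in this document.

```python
# Cross-check of the luminosity section (LS) duration.
# Assumption (not stated in this document): nominal LHC revolution
# frequency of about 11245 Hz.
LHC_REVOLUTION_FREQUENCY_HZ = 11245.0
ORBITS_PER_LS = 2 ** 18  # 262144 orbits

ls_duration_s = ORBITS_PER_LS / LHC_REVOLUTION_FREQUENCY_HZ
print(f"{ls_duration_s:.1f} s")  # prints "23.3 s"
```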

Introduction

The histograms used in this case study represent the distributions of the collected electric charge (in elementary charge units) per cluster, for BPIX layers and FPIX disks [5]. Each histogram contains the data collected during a single LS. In this case study, we use BPIX L2, L3 and L4, as well as FPIX D1, D2 and D3, where the distributions of the collected cluster charge have a relatively stable behaviour over time. An example distribution for BPIX L2 is shown on the left. In the following, all distributions have been normalized to unity, and the last bin of each histogram contains the overflow. The data set used in this case study is the 2017 dataset reconstructed with Legacy reprocessing [7].
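
As an illustration of the preprocessing described above, the following minimal sketch builds a per-LS histogram with the overflow folded into the last bin and normalizes it. It assumes that "normalized to unity" means unit sum of the bin contents; the function name, the binning and the toy charge values are hypothetical and not taken from the CMS software.

```python
import numpy as np


def normalized_histogram(values, bin_edges):
    """Histogram `values`, fold the overflow into the last bin, normalize to unit sum."""
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)
    counts, _ = np.histogram(values, bins=bin_edges)
    counts = counts.astype(float)
    # np.histogram drops entries above the last edge; add them to the last bin.
    counts[-1] += np.count_nonzero(values > bin_edges[-1])
    total = counts.sum()
    return counts / total if total > 0 else counts


# Toy example with a hypothetical binning (10 bins) and toy cluster charges.
edges = np.linspace(0.0, 100.0, 11)
charges = np.array([5.0, 15.0, 15.0, 250.0])
hist = normalized_histogram(charges, edges)
print(hist)
```

In this toy example the entry at 250 lies above the last bin edge and is therefore counted in the last bin.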

The strategy and goal of the computer-assisted DQM and DC procedures presented here is not to replace human decision-making. Instead, these methods are intended to assist the people responsible for monitoring and certification, efficiently and effectively pointing them towards potentially anomalous behaviour with a finer time granularity than is directly accessible to them. Furthermore, the methods presented here are prospects and have not yet been deployed.

We study and compare the following methods:

Moments method: The first and second order moments of the histograms in the training set are calculated. For a given histogram, a score is assigned by comparing its moments to the average values and standard deviations of those moments in the training set.

Landau fit method: Each histogram is fitted with a Landau distribution and the mean-squared-error (MSE) between the original histogram and the best fit is calculated.

Templates method: For each histogram to be classified, the MSE is calculated between this histogram and each of a set of reference histograms, and the minimum value is chosen as MSE score for this histogram.

NMF method: A set of basis components is extracted from the training set using NMF. A given histogram is reconstructed as an optimized linear combination of the basis components, and the MSE between the original histogram and its reconstruction is computed.

Autoencoder method: Similar to the NMF method, but where the histogram is reconstructed using an autoencoder. More details on the autoencoder method can be found in [1].
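
Two of the methods listed above can be sketched in a few lines of NumPy. The moments score below combines the moment deviations by taking the largest one in units of the training standard deviation, which is an assumption, since the exact combination is not specified here. The NMF part uses the multiplicative update rules of [3] for the squared-error objective. All function names and the toy data are hypothetical, and this is not the implementation used for the results shown.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def histogram_moments(hists, bin_centers):
    """First and second order moments of each unit-normalized histogram (one per row)."""
    hists = np.atleast_2d(np.asarray(hists, dtype=float))
    return np.stack([hists @ bin_centers, hists @ bin_centers ** 2], axis=1)


def moments_score(hist, train_hists, bin_centers):
    """Largest deviation of the moments from the training mean, in training std units."""
    train = histogram_moments(train_hists, bin_centers)
    own = histogram_moments(hist, bin_centers)[0]
    z = (own - train.mean(axis=0)) / (train.std(axis=0) + EPS)
    return float(np.max(np.abs(z)))


def fit_nmf(X, n_components, n_iter=500, seed=0):
    """Factorize X (n_hist x n_bins) ~ A @ B with non-negative A and B; return B."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], n_components)) + EPS
    B = rng.random((n_components, X.shape[1])) + EPS
    for _ in range(n_iter):
        A *= (X @ B.T) / (A @ B @ B.T + EPS)
        B *= (A.T @ X) / (A.T @ A @ B + EPS)
    return B  # rows are the basis components


def nmf_score(hist, B, n_iter=500):
    """MSE between hist and a non-negative combination of the rows of B fitted to it."""
    hist = np.asarray(hist, dtype=float)[None, :]
    a = np.full((1, B.shape[0]), 1.0 / B.shape[0])
    for _ in range(n_iter):
        a *= (hist @ B.T) / (a @ B @ B.T + EPS)
    return float(np.mean((hist - a @ B) ** 2))


# Toy data: training histograms are mixtures of two bumps (hypothetical shapes).
x = np.arange(40, dtype=float)


def bump(mu):
    shape = np.exp(-0.5 * ((x - mu) / 3.0) ** 2)
    return shape / shape.sum()


rng = np.random.default_rng(1)
w = rng.random((200, 1))
train = w * bump(12.0) + (1.0 - w) * bump(18.0)  # rows sum to one
basis = fit_nmf(train, n_components=2)
good = 0.5 * bump(12.0) + 0.5 * bump(18.0)
bad = bump(30.0)  # shape not present in the training set

print("moments:", moments_score(good, train, x), moments_score(bad, train, x))
print("NMF MSE:", nmf_score(good, basis), nmf_score(bad, basis))
```

In both cases a higher score indicates a histogram that is less compatible with the training set. The autoencoder method follows the same reconstruct-and-compare pattern as `nmf_score`, with the reconstruction provided by a trained network instead.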

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/example.pdf

Overview of global approach

Training and testing on a full year of data taking

Schematic overview of the method: from the input histograms to a quantitative anomaly flagging performance metric with a comparison between different models.

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/sketch_overview_global.pdf

Good and anomalous histograms

Distributions of the collected electric charge (in elementary charge units) per cluster, for the different BPIX layers and FPIX disks.

The blue histograms are obtained as averages from the data set and represent the range of expected shapes for each distribution (that may vary slightly over time due to changing detector conditions). The black histograms correspond to an anomalous LS caused by beam dump effects, when the proton beams in the accelerator are disposed of, and the red histograms are the autoencoder reconstructions.

The averages shown in this figure (in blue) are calculated by partitioning the LS in the dataset (in chronological order) into 50 approximately equally large sets, and averaging all histograms of a given type within each set to a single histogram. This method is used to obtain a set of reference histograms while keeping the spectrum of expected shapes.
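
The averaging procedure described above can be sketched as follows, assuming the histograms of one type are stored as rows of an array in chronological order. The function name and the toy data are hypothetical.

```python
import numpy as np


def averaged_partitions(hists, n_partitions=50):
    """Average chronologically ordered histograms within contiguous, roughly equal chunks."""
    hists = np.asarray(hists, dtype=float)
    chunks = np.array_split(hists, n_partitions, axis=0)
    # Skip empty chunks, which occur if there are fewer histograms than partitions.
    return np.stack([chunk.mean(axis=0) for chunk in chunks if len(chunk) > 0])


# Toy example: 1000 unit-normalized histograms with 5 bins each.
rng = np.random.default_rng(0)
hists = rng.random((1000, 5))
hists /= hists.sum(axis=1, keepdims=True)
references = averaged_partitions(hists, n_partitions=50)
print(references.shape)  # (50, 5)
```

Because each reference is an average of unit-normalized histograms, the references remain normalized to unity.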

It can be observed that the autoencoder reconstructs the good histograms (overlapping with the blue spectrum) accurately, and the anomalous histograms (not overlapping with the blue spectrum) less accurately. For the anomalous histograms, this results in a larger mean squared error (MSE) between the histograms and their reconstructions.

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/figure_run306139_ls1112.pdf

Output score distributions and correlations

The panels on the diagonal line show the output score distributions for the good test set (in blue) and the anomalous runs (in shades of red) for the different models. The y-axis scale is logarithmic, and the distributions have been normalized to unity.

The panels away from the diagonal display the correlations between models: each dot represents one LS with its assigned score according to one model on the x-axis, and according to another model on the y-axis. The scores for each of the models have been rescaled to the range 0 to 1.
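
The rescaling of the scores to the range 0 to 1 could, for example, be done with a min-max transformation per model. This is an assumption, since the exact rescaling is not specified here, and the function name is hypothetical.

```python
import numpy as np


def rescale_to_unit_range(scores):
    """Min-max rescaling of the scores of one model to the range 0 to 1."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores identical: map them to zero to avoid dividing by zero.
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)


rescaled = rescale_to_unit_range([2.0, 4.0, 6.0])
print(rescaled)  # [0.  0.5 1. ]
```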

The horizontal and vertical arrays of points correspond to lower bounds on the fitted probability density where it cannot be numerically distinguished from zero.

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/correlations.pdf

The training set consists of the full dataset with filters applied to select LS where the CMS detector was fully switched on and collected reasonable statistics. An additional filter removes histograms with a relatively large MSE with respect to a set of reference histograms, which are obtained as averaged partitions of the set of luminosity sections passing the earlier filters. This additional filter removes anomalies from the training set while retaining as much training data as possible.
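
The reference-based filter described above amounts to computing, for each histogram, the minimum MSE with respect to the set of references (as in the templates method) and keeping only histograms below a threshold. A minimal sketch follows; the function names, the toy histograms and the threshold value are hypothetical.

```python
import numpy as np


def min_mse_to_references(hists, references):
    """For each histogram, the smallest MSE with respect to any reference histogram."""
    hists = np.atleast_2d(np.asarray(hists, dtype=float))
    references = np.atleast_2d(np.asarray(references, dtype=float))
    # diff has shape (n_hists, n_references, n_bins)
    diff = hists[:, None, :] - references[None, :, :]
    return np.mean(diff ** 2, axis=2).min(axis=1)


def select_training_set(hists, references, max_mse):
    """Keep the histograms whose minimum MSE to the references does not exceed max_mse."""
    hists = np.asarray(hists, dtype=float)
    keep = min_mse_to_references(hists, references) <= max_mse
    return hists[keep], keep


# Toy example with two references and one clearly anomalous histogram.
references = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.5, 0.25, 0.0]]
hists = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
selected, keep = select_training_set(hists, references, max_mse=0.01)
print(keep)  # [ True False]
```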

The anomalous test set consists of a number of anomalous runs. Resampling techniques have been applied on the histograms belonging to the anomalous runs in order to increase their statistics. The resampling adds representative variation to the histograms while keeping their essential shape characteristics.
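
The specific resampling techniques are not detailed here. As one generic possibility, shown purely for illustration and not necessarily the technique used for these results, a unit-normalized histogram can be treated as a set of bin probabilities from which fluctuated copies are drawn with a multinomial distribution. The function name and the number of entries per copy are hypothetical.

```python
import numpy as np


def resample_histogram(hist, n_copies, n_entries, seed=0):
    """Draw n_copies statistically fluctuated, unit-normalized copies of a histogram."""
    rng = np.random.default_rng(seed)
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()  # bin probabilities
    counts = rng.multinomial(n_entries, p, size=n_copies)
    return counts / n_entries


copies = resample_histogram([0.2, 0.5, 0.3, 0.0], n_copies=10, n_entries=1000)
print(copies.shape)  # (10, 4)
```

Each copy keeps the overall shape of the input, and empty bins stay empty, while the bin contents fluctuate from copy to copy.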

The good test set is obtained in a similar way to the training set but with slightly stricter thresholds to ensure that no anomalous luminosity sections are selected while still covering the full spectrum of good histogram shapes. Alternative approaches have been used as a cross-check, where the good test set consists of a number of predefined good runs (with or without resampling), or of averaged partitions from the training set (with or without resampling).

Overview of operational approach

Dedicated training for single application run

Schematic overview of the method, modified with respect to Fig. 1 to highlight the differences for the operational application of the method. Typical applications during data taking include certifying single runs using a dedicated training set tailored to the specific application run.

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/sketch_overview_local.pdf

Application of operational approach

Illustration of an operational implementation of the autoencoder model.

In Figs. 1 to 3, the model was trained and tested globally on the full 2017 dataset (with some filters applied as discussed before). Here, an alternative training and testing approach is chosen that represents more closely the operational situation in practical applications of the model. In this so-called local training, for each application run, the model is updated with a dedicated training on the runs preceding the application run.
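
The selection of the local training set can be sketched as follows, assuming that run numbers increase chronologically. The number of preceding runs used for training is a hypothetical parameter, since it is not specified here, and the run numbers are toy values.

```python
def preceding_runs(all_runs, application_run, n_training_runs=5):
    """Return up to n_training_runs run numbers immediately preceding application_run."""
    if n_training_runs <= 0:
        return []
    earlier = sorted(run for run in all_runs if run < application_run)
    return earlier[-n_training_runs:]


# Toy run numbers, not actual CMS runs.
runs = [100, 101, 103, 104, 107, 110]
training_runs = preceding_runs(runs, application_run=107, n_training_runs=3)
print(training_runs)  # [101, 103, 104]
```

The model would then be retrained on the histograms from `training_runs` before being applied to the application run.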

The fraction of flagged LS is low for good runs and higher for known anomalous or otherwise special runs (indicated with vertical coloured lines in the bottom pad), showing that the model flags anomalous LS accurately in local training as well as in global training.

The runs with a high fraction of flagged LS can be classified in a number of categories. Indicated in red are the runs with timing scans or other anomalies, and in orange the runs that have only a fraction of their LS showing anomalous distributions. In purple are runs with low pileup or trigger rates, causing statistical fluctuations in the distributions. Between fills, discrete changes in accelerator or detector conditions may take place, causing a mismatch between the training runs and the application run. For this reason, the method is not yet applied to the first run of a fill.

The threshold score value above which an LS is considered anomalous is calculated as the 97th percentile of the scores on the local training set plus a fixed margin, which is (preliminarily) optimized to achieve a low false alarm rate while accurately flagging known anomalous LS.
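
The threshold definition above, together with the fraction of flagged LS per run, can be sketched as follows. The function names, the toy scores and the margin value are hypothetical; the optimized margin is not quoted here.

```python
import numpy as np


def anomaly_threshold(training_scores, margin, percentile=97.0):
    """Percentile of the local training scores plus a fixed margin."""
    return float(np.percentile(training_scores, percentile)) + margin


def flagged_fraction(scores, threshold):
    """Fraction of LS in the application run with a score above the threshold."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores > threshold))


# Toy example: training scores 0, 1, ..., 100 and a hypothetical margin of 1.
training_scores = np.arange(101, dtype=float)
threshold = anomaly_threshold(training_scores, margin=1.0)
fraction = flagged_fraction([10.0, 50.0, 99.0, 120.0], threshold)
print(threshold, fraction)
```

In this toy example two of the four application scores exceed the threshold, so half of the LS are flagged.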

https://twiki.cern.ch/twiki/pub/Sandbox/ML4DQMPixelMay2022/local_playback.pdf

References

[1] CMS Collaboration, Tracker DQM Machine Learning studies for data certification, CERN-CMS-DP-2021-034, https://cds.cern.ch/record/2799472.

[2] V. Azzolini et al., The Data Quality Monitoring Software for the CMS experiment at the LHC: past, present and future, EPJ Web of Conferences 214, 02003 (2019) https://doi.org/10.1051/epjconf/201921402003.

[3] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems 13 (NIPS 2000), Neural Information Processing Systems Foundation.

[4] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385.

[5] CMS Collaboration, The CMS Phase-1 Pixel Detector Upgrade, CERN-CMS-NOTE-2020-005, http://cds.cern.ch/record/2745805.

[6] CMS Collaboration, The Phase-1 Pixel Detector Performance in 2018, CERN-CMS-DP-2021-007, https://cds.cern.ch/record/2765491.

[7] CMS Collaboration, Strategies and performance of the CMS silicon tracker alignment during LHC Run 2, arXiv:2111.08757v2.

-- LukaLambrecht - 2022-04-14

Topic revision: r11 - 2022-05-13 - LukaLambrecht
 