Quantum Annealing for stop paper
General comment: This manuscript applies the quantum annealing for machine learning with zooming (QAML-Z) algorithm to the search for top squarks at the LHC. The authors’ main contribution appears to be the addition of a preprocessing step of the input variables with principal component analysis (PCA), which improves the performance of the algorithm. In particular, there is a possible improvement over the classical algorithm used in a
CMS search: a boosted decision tree (BDT). While the analysis presented here is sound, I have some major comments. My main major comments are (1) to expand the description of and comparison to the classical ML algorithm (in terms of performance and time to solution) and (2) to clarify the novel contributions of this work relative to the previously published literature on QAML and QAML-Z methods. Below, please find my line-by-line comments and questions.
General reply
Dear journal, we would like to thank the referee for his/her comments & questions as these have helped us to improve the quality of the paper.
Concerning comment (1)
- It is indeed important to compare the classification performance of the quantum annealer with the one of a classical ML algorithm. This is precisely a major point of the paper where we compare the QAML-Z algorithm and the BDT, putting the former as much as possible on a comparable ground to the latter. In the reply to the first question below, we provide more details on the description of the BDT, that we have now incorporated in an Appendix of the paper. Beyond the BDT, we studied the performance of a DNN for the same classification problem; more details are provided in the response to the first question below. The DNN's performance was observed to be worse than the one of BDT.
- The important parameter in this type of problem, namely classification which is done offline, is not the time taken by the algorithm to find the best solution. The most important is, by far, the performance, namely the capacity of the classification algorithm to separate as well as possible the signal from the background, this for two reasons: the cross-section of the SM background which is dominating the one of new signals by several orders of magnitude, and the overlap of signal and background in the kinematic and topological variables used to discriminate them; a good illustration of the latter effect is provided in Fig. 2.
Concerning comment (2), there are four contributions which are novel at various degrees, and it is true that some of them are not clear enough in the first version of the paper:
- The classification problem is new: stop versus SM background. The QAML-Z algorithm, after its invention, has been tested only on the Higgs problem, and it is important to test it in another classification problem where the challenges are different: different signal cross-section versus SM background cross-section, different overlap of signal and background in the distribution of the discriminating variables. With the application of QAML-Z to a second classification problem, we now have a more complete picture of the capability and potential of this approach for classification. Even though already mentioned in the abstract, this aspect is now additionally clarified in section 1 and the conclusion.
- The use of a statistically complete figure of merit to gauge the performance of a selection. The Area Under Roc curve, used in many optimization studies as a metric, is not a complete one, as it is equivalent to an S/B maximization, not accounting for the statistical uncertainty of neither the signal nor the background, and not accounting for the systematic uncertainty of the background prediction. These uncertainties, part of the experimental reality of a search, should be taken into account. With the maximization of the FOM as defined in the paper, they are accounted for in this study. This aspect is now further clarified in section 2, and has now been clarified as novelty in the conclusion.
- More systematic and complete coverage of quantum annealing settings for each set of tested variables. The figures 6 and 7 of Appendix C are illustrating the extensive scan that we did to find the best setting for the sets of variables A and B (defined in Tab. 2); namely, we tested various augmentation schemes, different cutoff values and the use (or not) of variable fixing. Even though only illustrated for the sets A and B, this work has been systematically done for all sets of variables defined in Tab. 2 and is, as far as we know, a novelty of our study. This is now clarified in section 5 and in the conclusion.
- PCA. This aspect has already been introduced both as a necessity to put the QAML-Z on an equal footing to the BDT, and as a novelty versus ref. [3], so we have not added any further clarification.
We thank the referee for bringing the point of clarifying the novel contributions of our work.
Physics comments:
Q1: 79 How is the BDT trained? Is it optimal? Could a DNN or other classical ML algorithm achieve better performance? Given that one of the main claims of the paper is that there may be an improvement for the quantum algorithm compared to the classical ML algorithm, it would be good to better understand the specifics of the classical ML algorithm used.
The BDT has been trained with the TMVA package, which is the standard multivariate analysis package of root. The number of trees NT is 400. The maximal depth of the trees MD is 3. The maximal node size MN, which is the percentage of number of Signal or Background events at which the splitting of data stops, is 2.5%; MN is a stopping condition of the training. Finally, and as mentioned in the paper, the data is diagonalized.
The internal parameters of the BDT (NT, MD, MN, diagonalization or not) have been varied, a new BDT trained each time, and its performance assessed via the FOM maximization, all these while ensuring that there is no over-training. The chosen parameters correspond to the best performance, while having a trustable BDT training (no over-training). For the choice of input variables for the BDT: as briefly explained in lines 80-84 of the version 1, a new variable v is added to an already existing set of variables S, a new BDT training, and the FOM maximized versus the output of the BDT. If the maximal FOM reached for the set S+v is higher than for S, v is incorporated as input variable; if it is compatible with the one of S, it is not. This procedure is repeated until there is no new variable at disposal. In view of our approach, we are confident that the BDT is optimal.
After the study of the BDT, we purposefully took the same input variables to train a DNN, varying its internal parameters, re-training a new DNN, and assessing its performance via the FOM maximization. The question to address here was whether for a given classification problem (here stop versus SM), and with the same set of input variables as for the BDT, a DNN architecture can achieve a better result than a BDT. Among all options explored (approximately 50) covering different number of nodes and hidden layers, batch sizes, number of epochs, learning rates, weight initializer and activation functions, we did not observe an improvement of the performance with a DNN versus the BDT.
Q2: 91 How big of an impact does the choice of f=20% play? Do the optimal solutions vary depending on this choice? Similarly, does assuming no systematic uncertainty on the signal play a role (I assume it’s an even smaller effect than the background systematic uncertainty)?
It should first be stressed that extreme values of f do not correspond to any realistic analysis in HEP: no measurement of the SM background is without any systematic uncertainty (f=0), nor do we have f=100% which corresponds to the case where the prediction of the background is totally out of control and the corresponding search isn't worth pursuing. values of f between 15% and 40%, which correspond to realistic precisions of background prediction, have been tested. They mainly result in the FOM (1) being maximized at a different value of the BDT output, and (2) having a different maximal value. However, the very choice of the input variables, which is what we want to determine with the FOM maximization, doesn't change with values of f in this range. Please note that a systematic uncertainty on the signal is not considered in similar metrics, as most discoveries are foremost limited by statistical uncertainty. Now indeed, assuming some systematic uncertainty on the signal has no significant effect on the outcome of the FOM maximization.
Q3: 95-98 Maybe point to a reference that shows why maximizing FOM is a good idea
This is done in the second version, thank you.
Q4: 122 Could a quick summary of the weak classifier construction be given? In particular the equation shown in the Methods section of [3] doesn’t seem to give binary values as this paper claims.
For a variable i, a weak classifier chi_i is a function built as a function of different percentiles of signal and background; the corresponding details are given the supplementary material of ref. [3]. Now for a given event of index tau, it is indeed the binary value sgn(chi_i(tau)) = +-1 which is taken. The very final value is then divided by the number N of total weak classifiers. The text has been clarified in this regard, many thanks.
Q5: 210 Could you report how many events are used for training/testing/validation for both signal and background?
As mentioned in line 266, N(Train)=50.E3 events. Lines 200-203 mention that the size of the training and testing samples are equal, so N(Test)=50.E3. Each of these samples includes 19.E3 events for signal and 31.E3 events for background. These points have been clarified in the text.
Finally, N(Assess) is 196.E3 events for background (i.e. all the rest of the background sample), and 7.E3 for signal as we asses the performance of the signal on a single signal point as mentioned in lines 209-210; this has also been clarified in the text.
Q6: 216 Is it possible to run the algorithm on a D-Wave Advantage machine with a Pegasus graph? Or discuss the gains possible by doing that?
We don't have access to the Pegasus version of D-Wave graphs yet, the access to the latest hardware being more difficult. However, we indeed hope to get access to this machine, our plan being to run the different options of table III as to obtain a systematic comparison of the same settings, input variables, etc across 2 different machines. We kindly refer the referee to the lines 366-375 for the discussion about the possible gains.
Q7: 237 How are the cutoff C and variable-fixing scheme related? If you use both, does it effectively remove more variables?
The cutoff is the percentile pruning of the J_{ij} matrix to allow the implementation of the classification problem on a hardware with a limited number of connected qubits; as such, it is part of the embedding.
The variable fixing scheme is a classical polynomial-time procedure to fix the value of a portion of the input variables to values that have a high probability of being optimal. If used, the variable fixing allows to find a fraction of the solution spins via a classical approach; the quantum annealing is therefore used only on the non-fixed variables, i.e. it is less used (hence the remark in lines 341-343).
As such, these two schemes are not related, even though they both contribute to the embedding by decreasing the size (number of terms) of the Ising Hamiltonian to embed.
When both are used, the variable fixing algorithm is run after the pruning of J_{ij} coefficients (cutoff). Therefore the more coefficients we remove, the easier it is for the algorithm to find classical solutions for some variables (variable fixing).
Q8: 237 Could a citation or brief description of the variable-fixing theme be added?
The description provided above has been added to the text.
Q9: 240 Could a full comparison be made to classical ML and/or classical simulated annealing?
The paper is built to provide an as complete as possible comparison between the quantum annealing and classical ML (here BDT) approaches for classification: same problem (stop versus SM background), same input variables, same pre-selection, diagonalization of data applied to both. We are confident that the comparison of the performances of quantum annealing (different settings and input variables) with the BDT provided in table III is the most complete we can provide. As mentioned above, we have tested a DNN and observed that its performance isn't better than the one of BDT.
The purpose of this paper is to test under realistic conditions whether the QAML-Z out-performs, in this classification problem, a classical ML tool which is typically used in a HEP search. As such, a classical simulated annealing is not in the scope of this paper,
where we make the second attempt on classification in high energy physics with quantum annealing, and where we want to focus on the quantum annealing (i.e. different settings) and show how it compares (given the present graph) with a classical ML approach.
Q10: 253 Is there a justification that 10 times is enough?
For the same quantum annealing setting, for the same set of input variables and with the same events, we determined the standard deviation sigma with 5, 10 and 20 runs. We observed that the value of sigma determined with 5 was already very close to the one obtained with 10 or 20. Being conservative, we kept 10 as the final number of annealing runs to determine sigma.
Q11: 342 It is my understanding that the variable fixing scheme is necessary to put it on a physical quantum computer with limited qubits. Is that correct? So is the point of this statement that once more qubits are available, the performance will improve?
It is the embedding scheme used by D-Wave which is necessary to implement the classification problem on the physical quantum computer, namely to embed the Ising Hamiltonian on the Chimera graph. We point the referee to what is explained in lines 217-219. It is indeed true that once more qubits are available, among others, a larger number of qubits will be connected. This will allow us to use are larger number/fraction of the J_{i,j}, i.e. to prune less the J_{i,j} matrix by using lower values of C as mentioned in lines 366-372. Another consequence of the larger number of
connected qubits is that the chains will be more stable, thus less information will be lost by broken chains (please refer to lines 373-375).
About line 342: The variables which are fixed by variable fixing don't enter the QA process; in those cases, the classification therefore uses a smaller fraction of qubits, and the QA is therefore less put at use.
Q12: 399 I was under the impression that the BDT used was the same/similar to the one from the CMS publication [5]. However, clearly, the data used here is based on Delphes simulation. So, presumably the BDT was retrained on this more simplified dataset?
Indeed: in order to make the comparison with the results of quantum annealing as valid as possible, a BDT was re-trained with the the Delphes simulation. It has to be noted that the performance of the BDT (for the same signal) is compatible between this new simulation and the full simulation of the CMS detector.
Text comments

200 Usually in ML, the convention of “training” (used in training), “validation” (used to validate / select the “best” model), and “testing” (held out for final performance checks) datasets are used.
We acknowledge the usual convention. However, since the "QA", "Train" and "Test" samples are unequivocally defined and better correspond to the needs of this work (i.e. 3 orthogonal samples with one specifically sent to the quantum annealing algorithm), we prefer to keep this notation.
Stop1 direct search at 8 TeV
8 TeV stop studies can be found
here
Replies to JHEP's referee can be found
here
Stop1 grid
This can be found
here
SW guide
This can be found
here
Old Jet and MET studies
Some Jet and MET studies can be found
here
Old triggers studies studies
trigger studies
--
PedrameBargassa - 28 Jan 2009