# Q&A for VBS WW OS leptonic

- Comment is acknowledged and answered
- Authors are working on answering the comment
- Comment requires further work to be addressed or needs attention from the internal reviewer regarding a specific issue
- We do not agree with the comment and arguments are given

## FR

Listed below are only the Type B comments for the Final Reading; the corresponding answers are provided in line.

• On the figures (2-4), I would find it helpful if the signal template was identified.

Done.

• Also, Fig 2 & 3 captions, I would prefer "signal region" to SR to make the figures more stand-alone comprehensible.

We'd rather keep them as they are, for consistency with the rest of the body.

• And, Fig 4, similarly "control regions (CR)" spelled out, although a rephrasing to emphasise that it covers the whole figure would remove the necessity to repeat it in the caption.

We'd rather keep them as they are, for consistency with the rest of the body.

• Finally, and more basic, I do not understand what the DNN output is. This may reflect my ignorance of the use of neural networks, but I find it deeply unsatisfactory.

The "DNN output" is the distribution obtained by training a machine learning algorithm to distinguish our signal from the main backgrounds. It is an output in the sense that it is the outcome of the last DNN layer, a measure of how likely an event is to be signal (high DNN score/output) or background (low DNN score/output).
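As an illustration, a minimal toy network with a sigmoid final layer; the weights and inputs below are invented for the example and are not those of the analysis DNN:

```python
import numpy as np

def dnn_output(x, W1, b1, W2, b2):
    """Toy two-layer network: the 'DNN output' is the value of the final
    sigmoid layer, a score between 0 and 1. Signal-like events are pushed
    towards 1, background-like events towards 0 during training."""
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer (ReLU activation)
    z = h @ W2 + b2                   # last-layer pre-activation
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid -> classifier score

# Illustrative random weights (a real network would learn these)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

score = dnn_output(np.array([1.0, 2.0]), W1, b1, W2, b2)
```

Whatever the input, the score always lies in [0, 1], which is why it can be binned and fit like any other discriminating distribution.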

general:

• Figure 1 middle: you should add the higgs, i.e. "Z/gamma/H". Are the higgs exchange diagrams included in the MC predictions?

Done.

• Try to avoid "data": this is jargon. Everything is data! Better: experimental data, CMS data, measured events...

• Pileup: appears several times in the paper, so it would be nice to say how big the effect is: give average number of pileup interactions or something equivalent. And a reference would not be bad.

We have now included the following reference, where these numbers are quoted:

@article{Sirunyan:2020foa,
author = "Sirunyan, Albert M and others",
collaboration = "CMS",
title = "{Pileup mitigation at CMS in 13 TeV data}",
journal = "JINST",
volume = "15",
pages = "P09018",
year = "2020",
eprint = "2003.00503",
archivePrefix = "arXiv",
primaryClass = "hep-ex",
reportNumber = "CMS-JME-18-001, CERN-EP-2020-017",
doi = "10.1088/1748-0221/15/09/p09018",
}


The size of the effect due to pileup uncertainties in our measurement is given in Table 2.

• in line 130/131 you state "interference ... of the order of a few percent..." Which is not soooo small! Apparently you have not included this in the analysis, so you should assign an uncertainty to the interference effect.

The interference term plays a negligible role in our selection. The "few percent" statement actually refers to the overall normalization by which the signal would change, but since we measure a cross section this contribution does not need to be taken into account; the residual shape uncertainty is much smaller, and we therefore neglected it.

• you say very little about the theoretical uncertainty ("scale"). First, it would be good to underline that this is an estimate of the missing higher order contributions, since you use LO cross sections. Second, why only factorization scale is varied, and not the renormalization scale?

Line 261 now reads: "Among theoretical uncertainties, effects due to the choice of the renormalization and factorization scales are evaluated, in order to cover missing higher order terms in the perturbation series of cross section calculations." VBS diagrams have no strong-coupling (alpha_S) vertices, hence varying the renormalization scale would have no effect on our signal.

abstract:

• "(expected in the standard model)" [then the term standard model could be left out in the last line]

We'd rather keep it as it is.

• Maybe split the last sentences: "..hypothesis. The measured..."

Done.

• I would drop "(scale)" here. It is not necessary in the abstract, and it would then be natural that you also specify what the measurement error is due to.

Done.

body:

• 11 much clearer: "(W+W+, W-W-)"

We'd rather keep it as it is, as this is how same-sign W bosons are labeled in all references we quote.

• Figure 1 caption: "and QCD-induced" (right)"

Done.

• 70 difficult to digest why MET should be relevant for finding the z vertex position. Can we drop that part of the sentence? MET is in any case defined (again) in line 87.

Done.

• 75 "true" is a dangerous term. Better: "of the particle level jet momentum"

Done.

• 100-112: I am wondering if a concise table would not be better than all these words.

We provide a summary table as supplementary material.

• table 1: would be nice to give the list a meaningful order, eg discrimination power!

The correlation matrix reported in Figure 19 of ANv9 was used to select input variables of interest, since the network exploits differences in these correlations between signal and background samples. Therefore, variables are not ordered by their discrimination power, because this was not the criterion we employed to choose them.
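The idea of exploiting correlation differences can be sketched with toy samples; the three "variables" and their values below are invented for illustration and are not the analysis inputs:

```python
import numpy as np

# Toy "signal" and "background" samples of three input variables: in the
# signal, variables 0 and 1 are correlated; in the background they are not.
rng = np.random.default_rng(1)
sig = rng.normal(size=(5000, 3))
sig[:, 1] += sig[:, 0]            # induce a correlation only in the signal
bkg = rng.normal(size=(5000, 3))

# Variable pairs whose correlations differ most between the two samples
# are the ones a network can exploit to separate signal from background.
diff = np.abs(np.corrcoef(sig.T) - np.corrcoef(bkg.T))
i, j = np.unravel_index(np.argmax(diff), diff.shape)
```

Here the (0, 1) pair stands out, even though each variable taken alone has essentially no discrimination power.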

• general: The use of "while" should be limited to the meaning "during the time that".

Done.

• general: When making comparisons between items that are expected to be similar, the use of "with" is preferred. When making poetic comparisons "compare to a summer evening" the use of "to" is preferred. For our paper it should be "with".

Done.

• line 242: What is a unity Gaussian distribution? Is the area under the Gaussian = 1?

Yes, it is: the area under the Gaussian is normalized to unity.

• line 265: for the scale variations - do you take a sum or envelope (the envelope, i hope) - please add this information to the text.

Yes, it is the envelope; we have added this information to the text.

• line 265 and further: about the PDF uncertainties - what kind of uncertainties do you use? please specify it in the text.

PDF uncertainties are computed as recommended by PDF4LHC, we have added the corresponding reference:

@article{Rojo:2016ymp,
archiveprefix = {arXiv},
author = {Rojo, Juan},
doi = {10.22323/1.265.0018},
eprint = {1606.08243},
journal = {PoS},
pages = {018},
primaryclass = {hep-ph},
title = {{PDF4LHC recommendations for Run II}},
volume = {DIS2016},
year = {2016}
}


• figure 2: please make the legend larger (also for the ratio panels); please place "data" first; please make the x-axis labels larger; please add info to the caption what is represented by the vertical bars on the data points and on the ratio points; there are two red curves in the figure - i do not understand what they are and to which the legend (VBS) refers to? i strongly suggest to make data symbols larger.

Done, the superimposed red line has been removed.

• figure 3: please move "CMS" outside of the figure box, top left; please place "data" first; please make the legend in the ratio panels larger; please make the x-axis labels larger; please add info to the caption what is represented by the vertical bars on the data points and on the ratio points; there are two red curves in the figure - i do not understand what they are and to which the legend (VBS) refers to? please make data symbols larger; please left-align the cuts on m_jj and eta_jj.

Done, the superimposed red line has been removed.

• figure 4: please move "CMS" outside of the figure box, top left; please place "data" first; please make the legend larger, also in the ratio panels; please make the x-axis labels larger; please add info to the caption what is represented by the vertical bars on the data points and on the ratio points; there are two red curves in the figure - i do not understand what they are and to which the legend (VBS) refers to? please make data symbols larger.

Done, the superimposed red line has been removed.

Abstract: some suggestions

• 5th line: by requiring exactly two opposite sign leptons
• remove charged (7th line)
• A signal is observed with a significance of 5.6 standard deviations (5.2 expected) ...

Done.

Intro:

• You say that the analysis is more difficult than for the same sign W's. What is therefore the added value? A higher cross section ? Something else? The reader would like to know why you did this analysis !

The motivation for this analysis is explained in the first paragraph of the introduction section; we just wanted to highlight that the W+W- channel is more challenging because of the larger backgrounds with respect to the same-sign final state.

• L166 DeepJet (as in the reference)

Done.

• L166 give the definition before the cut: The transverse mass defined as mT=.., where phi is the azimuthal angle in radians, is required to be above 60 GeV.

Done.

• L177 why phase space ? -> sample

Done.

• Figure 3 is totally confusing for me. First, I do not understand the binning of the mass distribution. But furthermore, is this mass distribution used since out of the Delta_etajj > 3.5 you seem to use only the region 300<mjj<500 GeV (3rd bin, see L234-236) ?

The bin edges of the mass distribution, [500, 750, 1000, 1500, 2000, inf], are displayed on top of the dashed black lines. Besides, the third bin has been included in the mjj distribution since it is just the number of events with mjj in [300, 500] within the same detajj region. The caption now says: "[..] The third bin contains the number of events in the $300 < \mjj[\GeV] < 500$ and $\detajj > 3.5$ regions and, for display purposes, is included in the \mjj distribution, shown in the last five bins. [..]"

• Figure 4. Is the DY tautau CR described anywhere in the text?

Yes it is, see lines 178-179.

• A deeper explanation on how the 5.6 standard deviations are calculated would be appreciated.

Lines 282-285 are dedicated to explaining how the statistical significance has been computed, focusing on the meaning of the p-value, which is in turn converted into a number of standard deviations.
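That p-value to standard-deviations conversion is the one-sided Gaussian-tail relation; a minimal illustrative sketch (not the analysis code, which uses the full profile-likelihood machinery):

```python
from math import erf, sqrt

def p_from_z(z):
    """One-sided tail probability of a standard normal: p = 1 - Phi(z)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def z_from_p(p, lo=0.0, hi=10.0):
    """Invert p_from_z by bisection: significance in standard deviations."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_from_z(mid) > p:  # p_from_z is decreasing in z
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# A one-sided p-value of about 1e-8 corresponds to roughly 5.6 sigma
significance = z_from_p(1.07e-8)
```

In practice one would use scipy.stats.norm.isf for the inversion; the bisection just keeps the sketch dependency-free.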

Abstract:

• L10: What is "(scale)" ? An abstract should be understood by outsiders. What about adding from

It has been removed; the theoretical uncertainty is better explained in the body.

• L291 a somewhat shortened version of the definition? Or just "(scale) " => "(factorization scale) "?

Done.

Main text:

• L17: What is "α_EW"? Please specify, is it the SU(2) "g" or the NC combination (that is unlikely)?

It is the electroweak coupling, as stated just before being invoked.

• Fig. 1: The interaction points should be blobbed, especially for the VBS crossings.

We'd rather keep a real example to give the reader a feeling of a SM VBS process.

• L31: No b tagging? It is used later, in L117.

Indeed, the vetoing is based on inverting the loose working point requirement of the b tagging algorithm, as stated in line 165.

• L54: " (p_T) " => " p_T " (the parentheses are used for electrons vs jets). Or simply: "momentum p_T > 50 GeV (>100 GeV)". In the case of inequality you do not need ≈.

Done.

• L69: "The physics objects ... are the jets and the associated missing transverse momentum, taken as the negative vector sum of the pT of those jets.": The lepton momenta are forgotten? Later, in LL86-87 the missing momentum is correctly defined. It should be defined just once, correctly.

That line has been removed and the primary vertex definition has been changed to: "The primary vertex (PV) is taken to be the vertex corresponding to the hardest scattering in the event, evaluated using tracking information alone, as described in Section 9.4.1 of Ref.~\cite{CMS-TDR-15-02}."

• L79: Please quote "Pileup mitigation at CMS in 13 TeV data," JINST 15 (2020) no.09, P09018, [arXiv:2003.00503 [hep-ex]] even if you do not use that particular method.

Done.

• L84: "Additional selection criteria": what are those? Reference or examples.

That line has been removed.

• L94: 36.3 + 41.5 + 59.7 fb−1 = 137.5 fb−1 . In the Abstract, the figures, and the Summary it is rounded up to 138. Is this not a problem? Maybe we should mention rounding in L97, after telling its uncertainty.

As far as we know, CMS plots are now published with "L=138 fb^-1"; the rounding should be implicit.

• LL100-112: This is quite a set of cuts, depending on various conditions, circumstances. Maybe it would be worth to mention here how they were optimized.

More details are given in the analysis note; in any case, our selection includes quite standard VBS requirements for the signal region. The choice of CRs is better explained in Section 5.

• L122: It is strange to see that the earlier, 2016 data were simulated by a later Pythia version, than the 2017-18 ones. Also in L146. At the same time the NNPDF sets follow the times.

In order to include the dipole recoil scheme in the 2016 simulation, we had to use a special CMSSW release (7_1_47_PTYTHIA240), because the Pythia versions employed in older CMSSW releases lacked this parton shower setting. Pythia 8.240 is the version integrated in CMSSW 7_1_47_PTYTHIA240, which is more recent than the versions used in the 2017 and 2018 simulations.

• L123: "The W boson decay to a τ lepton is part of the signal definition." This contradicts the Abstract: "Events are selected by requiring exactly two leptons (electrons or muons) ...". This can be smoothened by mentioning that the tau decays to lighter leptons are considered in the simulation.

It is not a contradiction, since electrons and muons can also be selected from tau leptonic decays. We have rephrased line 123 as "contributions from \PGt decays to lighter leptons are also considered and included in the simulation."

• In L170 the DY tau-lepton decays (ττ → eμ) are mentioned as background. Moreover, L296 says: "Electrons and muons coming from a τ decay are vetoed.", and the Summary mentions " leptons (electrons or muons)", so the sentence in L123 should probably be deleted.

In the fiducial volume we have vetoed leptons from tau decays, which in any case constitute a tiny fraction of our events.

• L141: "weighted" => "reweighted" or "rescaled" (one weights by cross sections)

Done.

• L202: "Nonprompt leptons ... mainly come from W + jets events." What about WW -> tau X ?

All sources of nonprompt leptons are taken into account, but W+jets events give the major contribution to this (minor) background.

• Figs. 2 and 3: Please put in the plots "Z_ll < 1" (left) and "Z_ll >1" (right) for conference use as that is the only difference between the plot conditions.

Done.

• L266: "The b tagging [26] introduces different uncertainty sources." => "b tagging [26] also introduces uncertainty sources." (different from what?)

Done.

• L273: "These uncertainties are added in quadrature" Maybe the conservativeness of this estimation could be underlined here, as there could be correlations among the various sources of systematics.

Indeed, that sentence is not accurate; perhaps some CWR comments got mixed up. The overall systematic uncertainty is computed by freezing all systematic uncertainties to their best fit values and then performing a second likelihood scan, which gives the statistical component. This is in turn subtracted in quadrature from the total uncertainty. We have removed that sentence.
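The final step of that subtraction is simple arithmetic; a minimal sketch with toy numbers (for illustration only, not the paper's values):

```python
from math import sqrt

def syst_component(total, stat):
    """Systematic component of the uncertainty: the statistical part (from
    the likelihood scan with all nuisance parameters frozen at their best
    fit values) is subtracted in quadrature from the total uncertainty."""
    return sqrt(total**2 - stat**2)

# Toy numbers: total uncertainty 2.0, statistical component 1.2
syst = syst_component(2.0, 1.2)  # -> 1.6
```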

• Fig. 4, caption: The sentence "In the left-hand plot ..." is redundant as it repeats what is written in the legend. In the right plot, the legends are confusing, it should be better: "DY 0 PU jet CR" => "Δηjj < 5 DY CR" and "DY 1 PU jet CR" => "Δηjj > 5 DY CR" as in the caption.

We'd rather keep the SF CR plot as it is, since its labels are the same as those in the legend, and keep the caption too for the sake of clarity.

• L292: "The latter volume" => "The other fiducial volume" (too far away from L286)

Done.

• Table 3 is very useful, but it should be clearer with the columns left-aligned.

Done.

References:

• L338: "051801. 27 p," => "051801," (or write it everywhere)

Done.

## CWR

### Comments from Andreas Meyer (cds)

• title: the term "leptonically decaying W+W pair" seems unfortunate, for two reasons: strictly, the WW pair does not decay, but both W do separately and independently; also, it sounds like "observation" refers to the observation of a WW pair decaying leptonically not to the electroweak production of W+W-. How about: "First observation of electroweak production of W+W-; with two leptons and two jets in the final state, in pp collisions at sqrt(s) = 13 TeV". I think the notion "associated" can be dropped.

We dropped the "first" too and modified the title to: "Observation of electroweak W+W- pair production in association with two jets in proton-proton collisions at sqrt{s} = 13 TeV"

• abstract:
• suggest: "The fiducial volume is defined as containing"
• replace first occurrence of "having" by "and"

We rephrased the abstract as follows: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ fb."

• around line 11, it would be nice to say already here why same-sign WW is so much simpler, namely due to the absence of tt background, which for this analysis is dominant, and crucial to understand.

We changed the following sentence: "The EW production of two W bosons with the same electric charge ($\PW^{\pm}\PW^{\pm}$) in the fully leptonic final state has extensively been studied by the ATLAS and CMS Collaborations~\cite{ATLAS8TeV, SMP-13-015, ATLAS2016, SMP-17-004, SMP-19-012, SMP-20-006}. In this paper, the full 2016--2018 data set, recorded by the CMS experiment, is exploited to search for the purely EW production of a pair of opposite-sign (OS) \PW bosons with two jets, a process that has never been observed so far. This analysis faces more challenges with respect to the $\PW^{\pm}\PW^{\pm}$ channel, namely because of the \ttbar production that enters in the OS signal selection."

• I would propose to add a paragraph about the analysis strategy in the introduction. The choice of categories and cuts is rather complex. A short summary upfront can help greatly. It also gives an opportunity to explain what "data-driven" means specifically. As such, I consider "data-driven" (line 27) jargon.

Adding another paragraph would produce some redundancy in the text; we would prefer to leave the structure of the paper as it is. However, we may add a table summarizing the different categories and their selections as additional material. This table would help the reader understand the different phase spaces of the analysis.

We removed the jargon "data-driven" and substituted it with a more appropriate expression: "its contamination in the signal region is measured in data through dedicated control regions enriched in \ttbar events".

• In this context, I suggest to change the title of section 4 "Analysis strategy" into "Event selection".

Done.

• line 44: "time evolution" sounds complicated. There are simply 3 separate luminosity calibrations for 3 different years, with partially uncorrelated uncertainties. Suggest to remove the sentence "the improvement in precision .... effects.", and place the references [5-7] at the end of the preceding sentence, after 1.6\%.

This was the standard sentence for the luminosity and it was recently updated to: "The integrated luminosities for the 2016, 2017, and 2018 data-taking years have 1.2--2.5\% individual uncertainties~\cite{CMS-LUM-17-003,CMS-PAS-LUM-17-004,CMS-PAS-LUM-18-002}, while the overall uncertainty for the 2016--2018 period is 1.6\%."

• Too little information is given about the trigger. Suggest to add the standard sentence in section 2. Then in line 47, suggest to give the pt thresholds of the trigger. My comment is triggered by the fact that the offline lepton thresholds, esp for the 2nd lepton, are very low, 13 GeV. I trust that these events are selected by single lepton triggers, but then 25 GeV for the single electron trigger is also low.

In the offline analysis, events are selected by both single and double lepton triggers. Single electron trigger pt thresholds are 27 GeV, 35 GeV, and 32 GeV, and single muon trigger pt thresholds are 24 GeV, 27 GeV, and 24 GeV for the 2016, 2017, and 2018 data sets, respectively. The per-leg trigger efficiency is measured in data with the tag-and-probe technique and applied to the simulation. Single and double lepton trigger efficiencies are then combined as disjoint probabilities, and the uncertainty in this estimate is evaluated by varying the pt of the probe lepton. Ultimately, the impact on the final measurement is around 1%. This uncertainty also comprises the contribution from double lepton triggers, whose pt thresholds are 23 GeV / 12 GeV (e/mu), 17 GeV / 8 GeV (mu/mu), and 23 GeV / 12 GeV (e/e), and for which the analysis selection is trigger safe. Thus, we do not expect any significant improvement from increasing the thresholds on the lepton transverse momenta. As a further cross-check, we raised the offline lepton pt thresholds to 30 GeV (leading lepton) and 20 GeV (subleading lepton), and the expected significance was found to be 5.0 standard deviations instead of 5.2 with respect to the background-only hypothesis.

Added standard sentence in Sec.2: "Events of interest are selected using a two-tiered trigger system. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100\unit{kHz} within a fixed latency of about 4\mus~\cite{CMS:2020cmk}. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1\unit{kHz} before data storage~\cite{CMS:2016ngn}."

Added trigger pt-thresholds in Line 47: "This analysis requires events filtered by trigger algorithms that select either a single lepton passing a high-\pt threshold, or two leptons with a lower \pt threshold, satisfying both isolation and identification criteria. In the 2016 data set, the \pt threshold of the single electron trigger is 25\GeV for $\lvert\eta\rvert < 2.1$ and 27\GeV for $2.1 < \lvert\eta\rvert < 2.5$, whereas the \pt threshold of the single muon trigger is 24\GeV. Double lepton triggers have lower \pt thresholds, namely 23\GeV (12\GeV) for the leading (trailing) lepton in the double electron trigger, 17\GeV (8\GeV) for the leading (trailing) lepton in the double muon trigger, and 23\GeV (8\GeV in the first part of the data set, corresponding to 17.7\fbinv, and 12\GeV in the second one) for the leading (trailing) lepton in the electron-muon trigger. In the 2017 data set, single electron and single muon \pt thresholds are raised to 35\GeV and 27\GeV, respectively. Likewise, in the 2018 data set the corresponding single lepton \pt thresholds are 32\GeV and 24\GeV. Double lepton \pt thresholds in the 2017 and 2018 data sets are the same as those described for the 2016 data set, except for the \pt threshold of the trailing lepton in the electron-muon trigger, which is 12\GeV."

• line 85: "high DIJET invariant mass" for clarity. Maybe reorder "with a large separation in pseudorapidity, and a high invariant mass" since the mass is a consequence of the separation in eta.

Reordered as suggested. Substituted "invariant mass" with \mjj and "separation in pseudorapidity" with \detajj as these variables were defined in the Introduction.

• section 4: consider making a table to give an overview of the various categories. In the text it is hard to understand.

• line 110: add reference for Zeppenfeld variable.

Reference to the Zeppenfeld variable:

@article{Rainwater:1996ud,
archiveprefix = {arXiv},
author = {Rainwater, David L. and Szalapski, R. and Zeppenfeld, D.},
doi = {10.1103/PhysRevD.54.6680},
eprint = {hep-ph/9605444},
journal = {Phys. Rev. D},
pages = {6680},
title = {{Probing color singlet exchange in $\PZ$ + two jet events at the CERN LHC}},
volume = {54},
year = {1996}
}


• line 115: "this region" refers to the signal region ? For clarity, suggest to replace "this" by "the signal".

Actually it refers to the DY CR, but we propose to cut the last part of the sentence: "In DY CRs, the \PQb veto requirement is the same as that in the SR."

• line 119: replace "purity" by "fraction of DY events", for clarity.

Done.

• section 4: I suggest to add a couple of figures showing the most important distributions pre-fit in the different categories.

• line 122: "Data-driven": I suspect, data-driven means "fit in categories" ?.

We can rephrase: "Normalizations of the major backgrounds are measured by the fit to data in dedicated control regions."

• lines 126 and 138: free to float: I understand "free to float" as the nuisance parameter not constrained by any prior. Why not put a (loose) prior ?

Specific nuisance parameters are included in the likelihood function to constrain the normalization of all major backgrounds. Such parameters are described by a flat pdf defined in the [-10,10] range so that the fit can properly adjust them. However, these are mainly constrained by the number of events in the corresponding CR; therefore, changing the prior function is not expected to have any sizeable effect on this procedure.

• line 139: nonprompt not explained. Suggest to rephrase: "Regardless of the lepton flavour, nonprompt leptons, i.e.\,leptons produced in decays of hadrons, are mainly produced IN W + jets events".

We rephrased it as: "Nonprompt leptons, i.e.\,either leptons produced in decays of hadrons or jets misidentified as leptons, are mainly due to \wj events."

• line 160: I don't think this information is necessary. Suggest to remove this paragraph. Optionally add information about the number of layers and total number of nodes in line 154 above.

We removed this paragraph.

• line 159: propose to add a couple of figures of the most important input variables.

We can add the figures of the most important input variables as additional material.

• line 191: How can the uncertainty due to luminosity be 2.1%, while the luminosity uncertainty is 1.6% ? For sure, the luminosity should not be constrained or pulled in the fit. Cross sections are generally orthogonal to luminosity. However, technically, the fit of MC distributions to data can lead to pulls that also affect the luminosity uncertainty. Has the pull of the luminosity been checked ?

Indeed, we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (the text mistakenly reports 2.1%, but the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. This number differs from the 1.6% a priori value, but it should be noted that this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit, and the error propagation to the final result. The combined action of these effects makes the a priori 1.6% luminosity uncertainty a larger contribution in the cross section measurement.

• Figures 2,3,4: legends and axis titles are too small. Fonts of the Y-axis values and titles of the upper panels are ok, but could also be bigger.

• line 210-212: scales for which samples are varied ? For ttbar, parton shower scales (ISR, FSR) and hdamp should also be varied.

Scale uncertainties are varied for the signal and all background samples. We have checked the effect of varying hdamp in the ttbar sample for the 2018 data set, and it resulted in a negligible contribution to the final result; therefore this uncertainty source has not been included in the fit.

### Comments from Albert De Roeck (cds)

General questions:

• line 20 we mention the reduced central event hadron activity, but our selection does not seem to make explicit use of that. Is that due to the pile-up, or for some other reason?
• Both the ttb scale and normalisation systematic errors are among the most dominant ones. Are we fully sure there is no double counting here? I know you say the scale uncertainties only look at the change in shape, not the total event numbers.

Requiring a reduced hadron activity between the two tagging jets would mean applying some kinematic selection to variables related to the third jet. This would in turn mean explicitly relying on the parton shower modeling for the definition of our phase space, which is not very convenient. Instead, we decided to select our signal region according to the centrality of the dilepton system with respect to the two jets, since it has good discrimination power and is scarcely correlated with the third-jet kinematics. Concerning the systematic uncertainties, there is no double counting between the ttbar normalization and the QCD scale variations.

• "First" in the title: is this conform with our publication rules? This was under discussion some time ago, whether we should use it (or not) in titles.

Dropped "First". Title changed to: "Observation of electroweak W+W- pair production in association with two jets in proton-proton collisions at sqrt{s} = 13 TeV".

• ref [1] is a recent paper of course, but not the one that originally pointed out what is reported in this statement, Shouldn't we refer (also) to the original paper?

We changed the Ref[1] to:

@article{PhysRevD.16.1519,
title = {Weak interactions at very high energies: The role of the Higgs-boson mass},
author = {Lee, Benjamin W. and Quigg, C. and Thacker, H. B.},
journal = {Phys. Rev. D},
volume = {16},
issue = {5},
pages = {1519--1531},
numpages = {0},
year = {1977},
month = {Sep},
publisher = {American Physical Society},
doi = {10.1103/PhysRevD.16.1519},
}


• line 39: We give no comment on the CMS trigger here at the end of the section as we usually do in our papers? space limitations of PRL?.

We will add the standard CMS description:

"The central feature of the CMS apparatus is a superconducting solenoid of 6\unit{m} internal diameter, providing a magnetic field of 3.8\unit{T}. A silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections, are installed within the solenoid. Forward calorimeters extend the pseudorapidity coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system and the relevant kinematic variables, can be found in Ref.~\cite{CMS_detector}.

Events of interest are selected using a two-tiered trigger system. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100\unit{kHz} within a fixed latency of about 4\mus~\cite{CMS:2020cmk}. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1\unit{kHz} before data storage~\cite{CMS:2016ngn}.

During the 2016 and 2017 data-taking, a gradual shift in the timing of the inputs of the ECAL L1 trigger in the region at $\abs{\eta} > 2.0$ caused a specific trigger inefficiency. For events containing an electron (a jet) with \pt larger than $\approx$50\GeV ($\approx$100\GeV), in the region $2.5 < \abs{\eta} < 3.0$ the efficiency loss is $\approx$10--20\%, depending on \pt, $\eta$, and time. Correction factors were computed from data and applied to the acceptance evaluated by simulation.

The particle-flow (PF) algorithm~\cite{CMS:2017yfk} aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies. The candidate vertex with the largest value of summed physics-object $\pt^2$ is taken to be the primary $\Pp\Pp$ interaction vertex. The physics objects used for this determination are the jets and the associated missing transverse momentum, taken as the negative vector sum of the \pt of those jets.

Hadronic jets are clustered from all the PF candidates in an event using the infrared and collinear safe anti-\kt algorithm~\cite{Cacciari:2008gp, Cacciari:2011ma} with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10\% of the true momentum over the whole \pt spectrum and detector acceptance. Additional proton-proton interactions within the same or nearby bunch crossings can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, tracks identified to be originating from pileup vertices are discarded and an offset correction is applied to correct for remaining contributions. Jet energy corrections are derived from simulation studies so that the average measured energy of jets becomes identical to that of particle level jets. In situ measurements of the momentum balance in dijet, $\text{photon} + \text{jet}$, $\PZ + \text{jet}$, and multijet events are used to determine any residual differences between the jet energy scale in data and in simulation, and appropriate corrections are made~\cite{CMS:2016lmd}. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures.

The missing transverse momentum vector \ptvecmiss is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as \ptmiss~\cite{CMS:2019ctu}. The \ptvecmiss is modified to account for corrections to the energy scale of the reconstructed jets in the event. The pileup per particle identification (PUPPI) algorithm~\cite{Bertolini:2014bba} is applied to reduce the pileup dependence of the \ptvecmiss observable. The \ptvecmiss is computed from the PF candidates weighted by their probability to originate from the primary interaction vertex~\cite{CMS:2019ctu}."

• line 56: complicated way of saying that you re-weight the events according to the pile-up distribution as observed in data

"For each data set, simulated events are reweighted to match the pileup profile observed in data."
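The reweighting above can be sketched in a few lines; all profile values below are hypothetical, chosen only to illustrate the data/MC ratio weighting:

```python
import numpy as np

# Minimal sketch (hypothetical profiles) of pileup reweighting: each simulated
# event receives a weight equal to the data/MC probability ratio in its bin of
# the number of pileup interactions.
data_profile = np.array([0.2, 0.5, 0.3])   # assumed pileup profile observed in data
mc_profile = np.array([0.25, 0.5, 0.25])   # assumed pileup profile in simulation
bin_weights = data_profile / mc_profile    # per-bin reweighting factors

n_pu = np.array([0, 1, 2, 2, 1])           # pileup bin of each simulated event
event_weights = bin_weights[n_pu]          # weight applied to each event
print(event_weights)
```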

• what is the "dipole recoil setting"? Is that a MC parameter? is there a reference for it? please specify. It is not clear here if this is just detail or relevant for this study. This should be better explained.

The sentence was rephrased as "The dipole approach~\cite{sjostrand_2018} is used to model the initial-state radiation, rather than the standard \pt-ordered one used in the \PYTHIA parton shower.". We added this reference:

@article{sjostrand_2018,
title={Some dipole shower studies},
volume={78},
ISSN={1434-6052},
url={http://dx.doi.org/10.1140/epjc/s10052-018-5645-z},
DOI={10.1140/epjc/s10052-018-5645-z},
number={3},
journal={Eur. Phys. J. C},
publisher={Springer Science and Business Media LLC},
author={Cabouat, Baptiste and Sjöstrand, Torbjörn},
year={2018},
month={Mar}
}


• What tool was used for this study to estimate the effect of the interference? Or from which reference was this extracted? Please specify.

The interference contribution is estimated by subtraction: the individual EWK and QCD samples are subtracted from the inclusive sample, which is simulated at LO by generating the EWK and QCD diagrams together with their interference. The remaining term, which can be negative, is the interference itself.
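A minimal numerical sketch of this bin-by-bin subtraction, with assumed yields, is:

```python
import numpy as np

# Toy sketch (hypothetical bin yields) of the subtraction described above:
# interference = inclusive (EWK + QCD + interference, generated together at LO)
# minus the standalone EWK and QCD templates. Bins may come out negative.
inclusive = np.array([120.0, 80.0, 45.0])  # assumed LO EWK+QCD+interference yields
ewk = np.array([70.0, 50.0, 30.0])         # assumed EWK-only yields
qcd = np.array([48.0, 32.0, 16.0])         # assumed QCD-only yields

interference = inclusive - ewk - qcd       # can be negative in some bins
print(interference)
```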

• line 76 and line 77: these correction procedures need a reference where these have been presented in detail.

The reweighting is applied to obtain a better data/MC agreement. We added the following references for these methods.

@article{Khachatryan:2016mnb,
author         = "Khachatryan, Vardan and others",
title          = "Measurement of differential cross sections for top quark
pair production using the lepton+jets final state in
proton-proton collisions at {13\TeV}",
collaboration  = "CMS",
journal        = "Phys. Rev. D",
volume         = "95",
year           = "2017",
pages          = "092001",
doi            = "10.1103/PhysRevD.95.092001",
eprint         = "1610.04191",
archivePrefix  = "arXiv",
primaryClass   = "hep-ex",
reportNumber   = "CMS-TOP-16-008, CERN-EP-2016-227",
SLACcitation   = "%%CITATION = ARXIV:1610.04191;%%"
}

@article{Sirunyan:2019bzr,
author         = "Sirunyan, Albert M and others",
title          = "Measurements of differential {\PZ} boson production cross
sections in proton-proton collisions at {$\sqrt{s}=13\TeV$}",
collaboration  = "CMS",
journal        = "JHEP",
volume         = "12",
year           = "2019",
pages          = "061",
doi            = "10.1007/JHEP12(2019)061",
eprint         = "1909.04133",
archivePrefix  = "arXiv",
primaryClass   = "hep-ex",
reportNumber   = "CMS-SMP-17-010, CERN-EP-2019-175",
SLACcitation   = "%%CITATION = ARXIV:1909.04133;%%"
}


• Some missing information (I guess due to PRL space pressure??):
  - We do not define how we select a primary vertex, which I assume we do in the analysis (we do talk about pileup vertices later).
  - pTmiss is not defined; we usually define that in papers.
  - I assume we use particle flow in the analysis? Not mentioned here.

See the answer to the comment about the CMS trigger (line 39).

• line 131: I assume this conclusion is drawn from MC studies, or does it come from the real data experience? Perhaps good to spell that out.

"A large fraction of the DY background" -> "A large fraction of the DY MC background"

• line 138: "Regardless of the final state, nonprompt leptons are mainly produced via W + jets events" somewhat unlucky phrasing, as the W boson delivered a prompt lepton, but I imagine you talk about the additional lepton here...

Indeed the nonprompt lepton is not the one coming from the W boson, but rather it can be either a jet mis-reconstructed as a lepton or a real lepton coming from a B meson decay produced in the jet itself. We will rephrase as: "Nonprompt leptons, i.e.\,either leptons produced in decays of hadrons or jets misidentified as leptons, are mainly due to \wj events."

• line 182-188: so the signal is shown twice on the plots, once as a contribution to the distribution, and once as a stacked contribution to compare the sum with data..? One can guess that, but it is not clear from the text.

The signal is shown both as stacked and superimposed histogram, those lines have been removed and the caption of Figure 2 now reads: "The contributions from background and signal processes are shown as stacked histograms; the signal template is also displayed as a superimposed line to highlight the difference in shape with respect to the background distribution. Systematic uncertainties are plotted as dashed gray bands. This description holds for Figures 3 and 4 as well."

• line 191: how does this 2.1% lumi normalization uncertainty relates the 1.6% lumi uncertainty reported in line 43? Perhaps I am missing something...

Indeed we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (the text mistakenly reports 2.1%; the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. It differs from the 1.6% a priori value because this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit, and the error propagation to the final result. The combined action of these effects makes the 1.6% a priori uncertainty in the luminosity a larger contribution in the cross section measurement.

• line 227: we derive a fiducial and a more inclusive cross section. It is clear to me how we defined the fiducial cross section, measured within a certain phase space region. The more inclusive cross section is defined with parton level cuts, which is fine on a MC. But how do we actually derive this from the measurement? We are not unfolding the data, are we? And is this a cross section for W^+W^- or for lepton^+lepton^- channel for the fiducial cross section? The discussion here needs to be expanded on how we derive this experimental number to make this information more significant and usable.

The measured cross section in the more inclusive fiducial volume is obtained from the MC cross section multiplied by the measured signal strength. The cross section refers to the electroweak W+W-jj -> l+l-vvjj production, as this is how the sample was generated.
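As a sketch with assumed numbers (the best-fit signal strength mu_hat below is hypothetical), the scaling is simply:

```python
# Toy illustration of the procedure above: the measured cross section in the
# more inclusive volume is the MC prediction scaled by the signal strength
# mu_hat returned by the fit. Both numbers here are assumptions.
sigma_mc = 9.1          # fb, assumed MC cross section in the inclusive volume
mu_hat = 1.12           # assumed best-fit signal strength
sigma_measured = mu_hat * sigma_mc
print(f"{sigma_measured:.1f} fb")
```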

• line 252: I suggest to repeat the fiducial cuts that have been imposed here, as this cross section is defined only within that regions, as is done in the abstract or just with reference to table 3

We added the reference to Table 3.

### Comments from Guillelmo Gomez Ceballos Retuerto (cds)

• Title. I believe we should change the title in several ways. I suggest something like: "Observation of electroweak W+W- pair production in association with two jets at sqrt{s} = 13 TeV pp collisions". There is no need to say "first", when writing observation, it implies it's the first (in the experiment or everywhere, it depends). There is also no need to mention the leptonic final states either.

Modified the title as: "Observation of electroweak W+W- pair production in association with two jets in proton-proton collisions at sqrt{s} = 13 TeV"

• Abstracts. final state leptons --> final state charged leptons (we don't split the final states depending on the neutrino flavors)

"An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb."

• l9. Remove "(fully leptonic or semi-leptonic)" there is no need to say that. In fact, we also have fully-hadronic final states.

Done.

• l9-11. I feel this sentence should be completely re-written. You mention the 'first' observation with two papers, which doesn't make sense. Notice that you are not mentioning ATLAS, which is probably not good either. Let me suggest the following sentence: "The EW production of two W bosons with the same electric charge in the fully leptonic final state has been extensively studied by the ATLAS and CMS Collaborations~\cite{ATLAS8TeV, SMP-13-015, ATLAS2016, SMP-17-004, SMP-19-012, SMP-20-006}."

Done.

• l110. Add a reference to the Zeppenfeld variable

Reference to the Zeppenfeld variable:

@article{Rainwater:1996ud,
archiveprefix = {arXiv},
author = {Rainwater, David L. and Szalapski, R. and Zeppenfeld, D.},
doi = {10.1103/PhysRevD.54.6680},
eprint = {hep-ph/9605444},
journal = {Phys. Rev. D},
pages = {6680},
title = {{Probing color singlet exchange in $\PZ$ + two jet events at the CERN LHC}},
volume = {54},
year = {1996}
}


• l139. I think we should remove "Regardless of the final state,"

We rephrase it as: "Nonprompt leptons, i.e.\,either leptons produced in decays of hadrons or jets misidentified as leptons, are mainly due to \wj events."

• l162. What are the L2 weights? A reference is needed, is it reference 25 valid?

We removed lines 160-163 since they are too technical.

• l76. Add a reference to the top-pt reweighting

We added references to the top quark \pt reweighting (Khachatryan:2016mnb) and \PZ boson \pt reweighting (Sirunyan:2019bzr); the full entries are quoted in the answer to lines 76 and 77 above.


• l185-187. I think those lines are not needed. Nevertheless, they should be part of caption in Figure 2, and mention in Figure 3 captions that those comments are applicable from Figure 2.

The caption of Figure 2 has been updated as quoted in the answer to the Final Reading comment on lines 182-188 above, and the captions of Figures 3 and 4 now refer to it.

• l194. Either we explain the prefiring corrections or not, but we can't keep it as it. Maybe just leave it as trigger efficiency if you don't want to enter in the explanation, and then you should change the systematic table

We added a paragraph on the ECAL L1 trigger timing inefficiency to the CMS detector description, quoted in full in the answer to the Final Reading comment on line 39.

Then we changed line 194 as: "Uncertainties in the trigger timing shift".

• Table 2. While I understand the ordering you have chosen, I think it would be better to put together the similar sources. There are two uncertainties I don't completely understand: JES and ptmiss energy scale should be fully correlated, what did you mean by splitting them? In addition, how can it be that the luminosity uncertainty is 2.1%? First of all, are you really quoting the uncertainty in the signal yield? If so, then there shouldn't even be uncertainty in the signal... or is it the uncertainty in the cross section? Second, even if it's in the signal cross section, the uncertainty should be about 1.6%. All the important backgrounds are estimated from data, hence the luminosity plays no role. Can you clarify?

Indeed we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (the text mistakenly reports 2.1%; the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. It differs from the 1.6% a priori value because this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit, and the error propagation to the final result. The combined action of these effects makes the 1.6% a priori uncertainty in the luminosity a larger contribution in the cross section measurement. The ptmiss energy scale is actually the residual contribution from the uncertainty in the PF candidates that are not associated with any lepton or jet. It may be better to call it "unclustered missing energy".

• Section 7. The significance should be given before the cross section measurements, it's strange to talk about measurements before knowing about the significance. Just move the last paragraph after the first three lines in the section. I believe they could perfectly be in the same paragraph, just saying a significant excess of events is observed, etc...

Done.

• You seem to be missing a table with the yields. I am not suggesting to have it in the paper, but at least as supplementary material.

We will add the table as supplementary material.

• l209. The most impactful are very weird words. Let's say: The theoretical uncertainties with the largest impact in the analysis are those corresponding to the choice of the QCD renormalization and factorization scales.

Will correct.

### Comments from Chiara Mariotti (cds)

We will add the standard CMS detector, trigger, and object reconstruction description, quoted in full in the answer to the Final Reading comment on line 39.

• line 76: it is not clear how the events are “weighted”. Are they weighted to match (N)NLO distributions ? or simply to get data/MC agreement in the control regions?

The events are reweighted to obtain a better data/MC agreement in the control regions, not to match (N)NLO distributions. We added references to these methods (Khachatryan:2016mnb and Sirunyan:2019bzr); the full entries are quoted in the answer to lines 76 and 77 above.


• line 97: what does "Optimal" mean? It would be better/easier to understand if you write down the criteria.

In this context, "optimal" means that the selection is performed by maximizing the signal-to-background ratio. We removed the sentence, since it may be ambiguous and does not add relevant information.

### Comments from Sergei Gninenko (cds)

• The abstract: The natural question here is whether the measured (or observed significance) cross-section agrees with the Standard Model expectations. Please clarify.

We rephrased the abstract as follows: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb."

• Conclusion: The same question. If LO calculations do not exist, that should be mentioned. Please clarify.

We could add this sentence as the last line: "Results are compatible with SM predictions within one standard deviation."

• Introduction: Here you discuss vector boson scattering (VBS), defining the vector bosons to be W, Z, and photon. Then, for example, in line 58, "The VBS opposite-sign electroweak process...": do you mean here the W opposite-sign electroweak process, or something else? If just W, the redefinition of VBS produces unnecessary confusion. I would suggest keeping VBS in the introduction, but then using W throughout the paper, which is strictly dedicated to W scattering.

We corrected as "The signal process..."

• Fig. 1: Diagrams shown are confusing. For example, the process q(q') -> q(q') W shown on the left one does not exist. It should be, e.g., q -> q' W±. Please also indicate the charges of the virtual W's.

Done.

### Comments from Hyunyong Kim (cds)

• Abstract: "to 138 fb^-1" -> "to an integrated luminosity of 138 fb^-1"; "at 13 TeV" -> "at sqrt(s) = 13 TeV".

Done.

• Section 2: add a trigger part

We will add the standard description of the two-tiered trigger system (L1 and HLT), quoted in full in the answer to the Final Reading comment on line 39.

• L112-113: We understand you require pT > 20 GeV. Mention WP and Algorithm in paper with a proper reference for the b-tagging.

WP and algorithm are mentioned in lines 101-102.

### Comments from Greg Landsberg (cds)

• Title: PRL specifically prohibits "First" in the title; even for the PLB you should drop "First"; also "... with two jets in proton-proton collisions at sqrt(s) = 13 TeV".

We will drop the "first" too and modify the title as: "Observation of electroweak W+W- pair production in association with two jets in proton-proton collisions at sqrt{s} = 13 TeV"

• Abstract, L10: drop "a cone size ΔR=0.4 and"; first of all, the anti-kT jets are not conical in shape; second, it makes little sense to introduce the distance parameter in the abstract when you don't even mention the jet algorithm used!

We dropped the description of the fiducial volume, which might be too technical for the abstract: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb."

• Introduction: the introduction completely misses the previous work on this subject; you should review the literature and previous measurements of WW production; you may also want to mention an observation of WZ EW production by ATLAS.

The EW production of two W bosons with the same electric charge in the fully leptonic final state has been extensively studied by the ATLAS and CMS Collaborations~\cite{ATLAS8TeV, SMP-13-015, ATLAS2016, SMP-17-004, SMP-19-012, SMP-20-006}.

• L9: first, the CMS Style strongly recommends not to use "semileptonic" in this context; you should use leptons+jets. Second, why do you only mention these two channels, and not also all-hadronic, where the background is clearly overwhelming, thus illustrating the point even better?

We removed what was inside those brackets to make the discussion more general.

• L21: drop "containing W boson decays" - the final states are selected based on leptons; there is no guarantee that they come from W boson decays.

Done.

• Section 2: mention the trigger and cite the L1 and HLT trigger papers; alternatively do so on LL47-49.

• LL47-49: please, specify trigger thresholds used in the analysis.

"This analysis requires events filtered by trigger algorithms that select either a single lepton passing a high-\pt threshold, or two leptons with a lower \pt threshold, satisfying both isolation and identification criteria. In the 2016 data set, the \pt threshold of the single electron trigger is 25\GeV for $\lvert\eta\rvert < 2.1$ and 27\GeV for $2.1 < \lvert\eta\rvert < 2.5$, while the \pt threshold of the single muon trigger is 24\GeV. Double lepton triggers have lower \pt thresholds, namely 23\GeV (12\GeV) for the leading (trailing) lepton in the double electron trigger, 17\GeV (8\GeV) for the leading (trailing) lepton in the double muon trigger, and 23\GeV (8\GeV in the first part of the data set, corresponding to 17.7\fbinv, and 12\GeV in the second one) for the leading (trailing) lepton in the electron-muon trigger. In the 2017 data set, single electron and single muon \pt thresholds are raised to 35\GeV and 27\GeV, respectively. Likewise, in the 2018 data set the corresponding single lepton \pt thresholds are 32\GeV and 24\GeV. Double lepton \pt thresholds in the 2017 and 2018 data sets are the same as those described for the 2016 data set, except for the \pt threshold of the trailing lepton in the electron-muon trigger, which is 12\GeV."

• L54: Ref. [8] is not a proper reference for the tag-and-probe method; you should use our first W/Z cross section measurement paper instead.

We added the reference to the W/Z cross section measurement:

@article{2011,
  title = {Measurements of inclusive W and Z cross sections in pp collisions at $\sqrt{s} = 7$ TeV},
  volume = {2011},
  issn = {1029-8479},
  url = {http://dx.doi.org/10.1007/JHEP01(2011)080},
  doi = {10.1007/JHEP01(2011)080},
  number = {1},
  journal = {Journal of High Energy Physics},
  publisher = {Springer Science and Business Media LLC},
  collaboration = {CMS},
  year = {2011},
  month = {Jan}
}


• LL54-55: The b jet tagging efficiency [we tag jets, not quarks!] is measured in simulation and corrections are derived using ...

Done.

• L59: give full PYTHIA version here, 8.2xy.

PYTHIA 8.240 and 8.230 are used for the 2016 and 2017-2018 data sets, respectively.

• Section 4: no details of electron, muon, jet, or pmissT reconstruction are given. You should add the standard text about PF reconstruction and explain what algorithm is used to reconstruct jets; introduce the distance size; explain the definition of pmissT, etc. Since you are talking about pileup vertices later, you should also define the selection of the primary vertex of the event.

We will add these details in section 2, together with the CMS trigger description. We will add the standard CMS description:

"The central feature of the CMS apparatus is a superconducting solenoid of 6\unit{m} internal diameter, providing a magnetic field of 3.8\unit{T}. A silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections, are installed within the solenoid. Forward calorimeters extend the pseudorapidity coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system and the relevant kinematic variables, can be found in Ref.~\cite{CMS_detector}.

Events of interest are selected using a two-tiered trigger system. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100\unit{kHz} within a fixed latency of about 4\mus~\cite{CMS:2020cmk}. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1\unit{kHz} before data storage~\cite{CMS:2016ngn}.

During the 2016 and 2017 data-taking, a gradual shift in the timing of the inputs of the ECAL L1 trigger in the region at $\abs{\eta} > 2.0$ caused a specific trigger inefficiency. For events containing an electron (a jet) with \pt larger than $\approx$50\GeV ($\approx$100\GeV), in the region $2.5 < \abs{\eta} < 3.0$ the efficiency loss is $\approx$10--20\%, depending on \pt, $\eta$, and time. Correction factors were computed from data and applied to the acceptance evaluated by simulation.

The particle-flow (PF) algorithm~\cite{CMS:2017yfk} aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies. The candidate vertex with the largest value of summed physics-object $\pt^2$ is taken to be the primary $\Pp\Pp$ interaction vertex. The physics objects used for this determination are the jets and the associated missing transverse momentum, taken as the negative vector sum of the \pt of those jets.

Hadronic jets are clustered from all the PF candidates in an event using the infrared and collinear safe anti-\kt algorithm~\cite{Cacciari:2008gp, Cacciari:2011ma} with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10\% of the true momentum over the whole \pt spectrum and detector acceptance. Additional proton-proton interactions within the same or nearby bunch crossings can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, tracks identified to be originating from pileup vertices are discarded and an offset correction is applied to correct for remaining contributions. Jet energy corrections are derived from simulation studies so that the average measured energy of jets becomes identical to that of particle level jets. In situ measurements of the momentum balance in dijet, $\text{photon} + \text{jet}$, $\PZ + \text{jet}$, and multijet events are used to determine any residual differences between the jet energy scale in data and in simulation, and appropriate corrections are made~\cite{CMS:2016lmd}. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures.

The missing transverse momentum vector \ptvecmiss is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as \ptmiss~\cite{CMS:2019ctu}. The \ptvecmiss is modified to account for corrections to the energy scale of the reconstructed jets in the event. The pileup per particle identification (PUPPI) algorithm~\cite{Bertolini:2014bba} is applied to reduce the pileup dependence of the \ptvecmiss observable. The \ptvecmiss is computed from the PF candidates weighted by their probability to originate from the primary interaction vertex~\cite{CMS:2019ctu}."

• LL91-92: give a reference to "loosely identified", as the statement is meaningless otherwise.

Removed "loosely identified".

• L101 and further in text: generally, you could save considerable amount of space by introducing the SR and CR acronyms and using them throughout the paper.

Done. Defined in Introduction signal region as SR and control regions as CRs.

• L115: there is no mention of any mT selection for the signal region in the paper; moreover mT is not even defined. This needs to be properly described in the paper.

The signal region selection on \mT is reported in lines 102-104, which state: "In the $\Pe\PGm$ [signal] category, the transverse mass \mT formed by the combination of \ptll and \ptmiss is required to be above 60\GeV, while no cut on this variable is applied to $\Pe\Pe$ and $\PGm\PGm$ categories." We rephrased those lines 102-104 in this way: "The transverse mass \mT is required to be above 60\GeV in the $\Pe\PGm$ SR, and is defined as $\mT = \sqrt{2 \ptll \ptmiss \left[1 - \cos\Delta\phi(\ptvecll, \ptvecmiss)\right]}$." to define \mT in a clearer way.
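As a side illustration for the reader, the \mT definition above can be written as a small numerical check (a hypothetical helper, not part of the analysis code):

```python
import math

def transverse_mass(pt_ll, pt_miss, dphi):
    """Transverse mass of the dilepton + ptmiss system:
    mT = sqrt(2 * pt_ll * pt_miss * (1 - cos(dphi)))."""
    return math.sqrt(2.0 * pt_ll * pt_miss * (1.0 - math.cos(dphi)))

# Back-to-back dilepton system and ptmiss (dphi = pi) maximizes mT:
# with pt_ll = pt_miss = 60 GeV, mT = sqrt(2 * 60 * 60 * 2) = 120 GeV.
print(transverse_mass(60.0, 60.0, math.pi))  # -> 120.0
```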

• L118: is this really the requirement on meμ [in which case use eμ, not ℓℓ as the subscripts] or you meant mT instead?

The mℓℓ variable refers to the invariant mass of the two final-state leptons with the highest pT, regardless of their flavor. A strict mℓℓ window around the Z boson mass is used in the same-flavor categories (ee or μμ final states) to select the DY CR, whereas mℓℓ is required to be higher than 120 GeV in the same-flavor SR, in order to discard Drell-Yan events. In the different-flavor category we require mℓℓ > 50 GeV, with an upper bound of 80 GeV in the DYττ CR.

• LL156-157: according to the CMS style, for all continuous variables one should use strict inequalities, as the probability of the variable to be exactly equal to a given number is basically zero, even given the finite machine precision. Thus, you should use "(Zℓℓ>1)".

Used strict inequalities.

• LL162-163: give references to the L2 weights and early stopping methods; note that the latter should not be capitalized.

We removed lines 160-163 since they were too technical.

• Table 1 caption, L1 and Figs. 2-3 captions, L1: use strict inequality.

Done.

• LL179-180: since you have shape uncertainties [e.g., scale uncertainties], you should mention that those are given by Gaussian distributions.

We added: "Nuisance parameters associated with shape uncertainties are given by a unit Gaussian distribution."

• L191: given that most of your backgrounds are estimated from data and not simulation, why would the 1.6\% integrated luminosity uncertainty translate in the 2.1\% uncertainty in the signal yield rather than to approximately 1.6\%? Please, explain.

Indeed we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (the text mistakenly reports 2.1%; the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. It differs from the 1.6% a priori value because this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit procedure, and the error propagation to the final result. The combined action of these effects turns the 1.6% a priori luminosity uncertainty into a larger contribution to the cross section measurement.

• L194: Additional trigger inefficiency corrections due to ECAL trigger timing drift [xx] are included ... [use our L1 paper as a reference for preferring and avid using jargon].

See comment about section 4. We will then change line 194 from "Additional trigger corrections ("prefiring corrections")" to "Uncertainties in the trigger timing shift".

• Figures 2-4: the figures do not comply with the CMS Style. There should be no "L =" in front of the integrated luminosity number. The legend "nonprompt" should start with a capital letter for consistency; all the inequalities in Fig. 3 should be strict. Fig. 2 x axis label should read "DNN output". Figs. 3-4 x axis labels should be capitalized: "Bins". "Multiboson" background has never been defined in the paper. How is it different from WW? Does WW mean QCD-induced WW? - Then it should be labeled as such. Legends defining categories in Figs. 3-4 should not be italicized, as this is inconsistent with the notations in the body of the paper. The signal seems to be shown twice: once overlaid with the background, and once stacked with it. This should be clearly explained in the captions and/or paper. There may be some problem with the figure graphics: while it looks fine on the screen, the two color printers I tried to print the paper on both gave me a mess of black background with color lines overlaid.

See Figure 2, Figure 3, Figure 4. The caption of Figure 2 now reads: "The contributions from background and signal processes are shown as stacked histograms; the signal template is also displayed as a superimposed line to highlight the difference in shape with respect to the background distribution. Systematic uncertainties are plotted as dashed gray bands. This description holds for Figures 3 and 4 as well."

• LL209-214: give a typical range of the scale uncertainties here, as you do for other sources.

We would rather drop all ranges and just put the reference to Table 3 for the impact of each source in the cross section measurement.

• LL209,213: drop "QCD" - these are not "QCD scales" but rather scales of the QCD RGE evolution, but this goes beyond the scope of the paper!

Done.

• L228: why do you only vary the factorization scale, and not both? Please, explain.

Because the VBS signal is a purely electroweak process at LO, varying the renormalization scale has no effect, since that scale only enters through the strong coupling constant αs.

• LL232-233: anti-kT jets are not conical; please rephrase as follows: "Additionally, if such lepton is found within a distance ΔR=0.4 from a jet axis, the event is discarded".

Will do.

• LL234-235: is this also LO cross section? Please specify so.

Yes it is, we added it to the text.

• Table 2: you need to explain that the fiducial phase space definition uses "dressed" leptons and define pbare ℓT [note "bare" in Roman!].

We dropped the "bare" superscript, added the "dressed" superscript on the left-hand side of the equation in table 2 and rephrased lines 230-232 as follows: "If a photon is found within a distance $\Delta R < 0.1$ from a lepton, its four-momentum is added to that of the lepton, making a "dressed" lepton."

• L247: ... QCD-induced production of W boson pairs and ...

Done.

### Comments from Markus Klute (cds)

• General: The paper is rather short, we think one could add an event yield table to provide the reader a better overview of how the different regions are populated, as this is always hard to imagine from log-scale plots. Maybe the choice of journal should be reconsidered to allow a more detailed discussion.

We will provide the event yield table as additional material. The paper will be submitted to PLB.

• Abstract: Is such a detailed definition of the fiducial volume really needed in the Abstract? I think it is too technical for an Abstract.

An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb.
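For reference, the significance quoted in the abstract corresponds to a one-sided Gaussian tail probability; a quick numerical check (illustrative only, not the statistical machinery used in the analysis):

```python
import math

def one_sided_p_value(z):
    """One-sided Gaussian tail probability for a significance of z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# 5.6 sigma corresponds to a p-value of roughly 1e-8.
print(one_sided_p_value(5.6))
```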

• line 24: or at another place early in the text: “The W decay channels into tau leptons are not used for this measurement and treated as sources of background to the signals with electrons and muons”.

Leptons coming from tau decays are actually included in our signal definition, but they represent a minor contribution to the number of signal events. When defining the fiducial volume, however, a veto is applied to reject such leptons. This is now explicitly stated when defining the fiducial volume, and table 3 has been updated accordingly.

• line 43: “…the improvement in precision relative to Refs. [5–7] reflecting the (uncorrelated) time evolution of some systematic effects.“ This doesn’t tell us anything. Anyone not from CMS would understand even less.

This was the standard sentence for the luminosity and it was recently updated to: "The integrated luminosities for the 2016, 2017, and 2018 data-taking years have 1.2--2.5\% individual uncertainties~\cite{CMS-LUM-17-003,CMS-PAS-LUM-17-004,CMS-PAS-LUM-18-002}, while the overall uncertainty for the 2016--2018 period is 1.6\%."

• line 54: missing Ref. before [8]

• line 59: It should be explicitly stated that Pythia 8 was used

PYTHIA 8.240 and 8.230 are used for the 2016 and 2017-2018 data sets, respectively. Added in the text.

• line 60: How is the subtraction performed? Is the diagrams removed from the ME? Or is their contribution subtracted from the signal cross-section? In the latter case, how do you take care of the interference terms?

Diagrams involving a top quark line are removed from the ME since they are already accounted for in the ttbar + tW background sample.

• line 62: Is there a reference for this setting available or can it briefly be discussed what this setting does?

The sentence was rephrased as "The dipole approach~\cite{sjostrand_2018} is used to model the initial-state radiation, rather than the standard \pt-ordered one used in the PYTHIA parton shower". We added this reference:

@article{sjostrand_2018,
  title = {Some dipole shower studies},
  volume = {78},
  issn = {1434-6052},
  url = {http://dx.doi.org/10.1140/epjc/s10052-018-5645-z},
  doi = {10.1140/epjc/s10052-018-5645-z},
  number = {3},
  journal = {The European Physical Journal C},
  publisher = {Springer Science and Business Media LLC},
  author = {Cabouat, Baptiste and Sjöstrand, Torbjörn},
  year = {2018},
  month = {Mar}
}


• line 66: "On-shell Higgs boson production mechanisms…" we would remove "On-shell" here as you are not defining when you account on- and off-shell.

Done.

• line 70-77: Not entirely clear which background is modeled with what simulation and what is the accuracy in their predicted cross section? Suggest adding a table summarizing the signal and background simulations, their respective cross-section predictions with the level of accuracy.

Such a table would not fit the PLB format; we would rather leave this section as it is.

• line 110: “quantified by the Zeppenfeld variable” please add so-called: “quantified by the so-called Zeppenfeld variable”

• line 122: The term "data driven" is a bit misleading here since you still rely on your MC simulation for the modeling of most backgrounds in your signal region, you just add a control region to constrain the normalization.

We rely on the MC simulation for describing the shape of our background templates, but the normalizations of the most important background samples are entirely measured in data. It is true that the normalizations are mainly constrained in the CRs, but they are also used to measure the background contributions in the SRs. The text now reads: "Normalizations of major backgrounds are measured by the fit to data with dedicated control regions."

• line 139: ".., with fever events in the mumu category.'' This construction sounds strange, suggesting to write something along the lines that the probability for nonprompt electrons is higher than for nonprompt muons in general.

We rephrased it as: "Nonprompt leptons, i.e.\,either leptons produced in decays of hadrons or jets misidentified as leptons, are mainly due to \wj events."

• line 152: “Other minor backgrounds are Higgs boson production, greatly suppressed by the mll cut, ..” Add “SM” to the Higgs. After all, our analysis allows for an extended or anomalous Higgs sector.

We think that this might be clear from the context, given that the paper assumes the SM hypothesis.

• line 154ff: From the text it reads that the separation by the Zeppenfeld variable is just introduced to gain performance in the DNN training, but later it gets clear that you also separate your event categories by this cut, even the regions where no DNN is applied at all. How is this motivated?

As stated in lines 108-111, the division of the SR into two subcategories with respect to the Zeppenfeld variable is performed to improve the sensitivity of the analysis. Indeed, the Zll < 1 SR is enriched in signal and has less background contamination. In line 154 we just want to express that two different DNN models are trained for the two SR subcategories, instead of using a single model for both the Zll < 1 and Zll > 1 SR phase spaces.

• line 160-163: What is the activation function used in DNN? How many epochs are trained before Early stopping sets in? What is the area under ROC ?

Activation functions are the Rectified Linear Unit (ReLU) for the hidden layers and the sigmoid for the output layer. Early stopping stops the training when the difference between the loss values calculated on the validation data set for two consecutive epochs is less than 0.0002 after 30 epochs. We compared the shape of the ROC curve of different models to understand their performance, but we didn't take into account the area under the ROC as a figure of merit. We instead checked the behavior of metrics such as loss, efficiency (or recall), and purity (or precision) on the validation and training data sets to make sure the networks were reliable. However, we are going to remove lines 160-163 since they are too technical.
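The early-stopping criterion described above can be sketched as follows (a hypothetical standalone illustration, assuming a per-epoch validation-loss history; the actual training used standard DNN tooling):

```python
def early_stop_epoch(val_losses, min_delta=0.0002, warmup=30):
    """Return the epoch at which training would stop: after `warmup`
    epochs, stop as soon as the validation loss changes by less than
    `min_delta` between two consecutive epochs."""
    for epoch in range(max(1, warmup), len(val_losses)):
        if abs(val_losses[epoch - 1] - val_losses[epoch]) < min_delta:
            return epoch
    return len(val_losses) - 1  # criterion never met: train to the last epoch
```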

• line 164 -168: The signal extraction should be explained in a consistent way using equations for the likelihood construction. One needs to clearly indicate which is the POI and what are the different nuisance parameters in the likelihood. Also, how does the POI translate into signal cross section?

In line 166 we added: "The signal strength of the EW W+W- production is the parameter of interest and is translated into a cross section measurement in two different fiducial volumes."

• Section 6: Too brief. Lacks necessary details/references about why certain systematics such as prefiring conditions'' are required. This may not be apparent to a person outside CMS. Also, the impact of the systematic variation is redundant in the text as they can be easily read out from Table 2. I think more focus should be given on how the variation of a particular source (scale / resolution parameter) is performed along with the necessary references.

"Uncertainties in the integrated luminosity, lepton reconstruction and identification efficiency~\cite{CMS:2015xaf,CMS:2018rym}, trigger efficiency and additional trigger timing shift have been taken into account. The electron and muon momentum scale uncertainties are computed by varying the momenta of leptons within one standard deviation from their nominal value. Similarly, jet energy scale and resolution uncertainties~\cite{CMS:2016lmd} are evaluated by shifting the \pt of the jets by one standard deviation, and this directly affects the reconstructed jet multiplicity and \ptmiss measurement: several independent sources are considered and partially correlated among different data sets. Uncertainties in the residual \ptmiss~\cite{CMS:2019ctu} are also included and calculated by varying the momenta of unclustered particles that are not identified with either a jet or a lepton. The \PQb tagging~\cite{Bols_2020} introduces different uncertainty sources. The uncertainties from the \PQb tagging algorithm itself are correlated among all data sets, while uncertainties due to the finite size of the control samples are uncorrelated. Finally, the uncertainty in the pileup reweighting procedure is applied to all relevant simulated samples.

Among theoretical uncertainties, the effects due to the choice of the renormalization and factorization scales are evaluated. These are computed by varying the scales up and down independently by a factor of two with respect to their nominal values, ignoring the extreme cases where they are shifted in opposite directions~\cite{Catani:2003zt,Cacciari:2003fi}. Only shape effects are considered when varying such scales, since signal and background normalizations are directly measured in data. PDF uncertainties in the signal process do not introduce any shape effect in the \mjj and DNN output distributions, hence they have not been considered. For \ttbar and DY backgrounds, since normalization effects have no impact in the fit, PDF uncertainties can only affect the ratio of the expected yields between the SR and the CR. Such uncertainties have been included in the CR and estimated to be 1\% and 2\% for \ttbar and DY backgrounds, respectively. The modeling of both the parton shower and the underlying event has been taken into account."

The breakdown of all systematic uncertainties in the cross section measurement is shown in Table~\ref{tab:syst}. Systematic uncertainties have been added in quadrature, and the systematic component of the overall relative uncertainty in the signal cross section measurement is 13.1\%. The statistical uncertainty has been computed by freezing all systematic sources to their best-fit result and its value is 14.9\%. The combined relative uncertainty in the cross section measurement is 19.8\%.
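As an aside, the scale-variation prescription described above (independent factor-of-two variations of the renormalization and factorization scales, excluding the two opposite-direction combinations) can be sketched as follows; `yields` is a hypothetical mapping from (μR factor, μF factor) to a predicted yield:

```python
from itertools import product

def scale_envelope(yields):
    """7-point scale variation: vary (muR, muF) independently by factors
    of 0.5, 1, 2 and drop the two opposite-direction combinations.
    Returns the (up, down) excursions relative to the nominal yield."""
    kept = [
        yields[(kr, kf)]
        for kr, kf in product((0.5, 1.0, 2.0), repeat=2)
        if (kr, kf) not in {(0.5, 2.0), (2.0, 0.5)}
    ]
    nominal = yields[(1.0, 1.0)]
    return max(kept) - nominal, nominal - min(kept)
```

The opposite-direction points are excluded because simultaneous shifts of μR and μF in opposite directions are considered unphysical extremes.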

• line 220: "Fit results are combined in all categories." What exactly does this mean? Do you fit the regions independently and then combine the results in the end or are they fitted simultaneously, thus taking correlations into account?

All regions are simultaneously fit to data and correlations are properly taken into account. We will rephrase this sentence as: "All categories are simultaneously fit to data."

• line 223ff: The definition of the fiducial volume (and the corresponding table) should go before the results chapter, i.e., where you outline your analysis strategy.

It is common to define fiducial volumes after the signal extraction procedure has been explained. We'd rather keep it as it is.

• line 227: How was that predicted value derived or do you have a reference for it?

The predicted cross section value in the inclusive fiducial volume is the nominal cross section of the MadGraph signal sample employed in our analysis, where only a few parton-level requirements are applied [1]. Uncertainties are computed by varying the choice of the QCD factorization scale.

• line 235: The same question with this prediction, also are these the same uncertainties as above?

The predicted cross section value was derived by running the fiducial selection on the signal sample, using gen-level variables and counting how many signal events pass the fiducial selection. Uncertainties are computed by varying the choice of the QCD factorization scale.

• line 249ff: The paper just ends quoting the numbers, but no conclusion is given. How do the measured numbers agree with the prediction? The title also mentions first observation, so this should be more advertised.

We added this sentence in the last line: "Results are compatible with SM predictions within one standard deviation."

• Section 7: Suggest to show comparison plots for inclusive and fiducial cross sections between measured value and prediction. It would be nice to add a plot of the p-value as a function of the POI here or as additional material.

The tabulated results for this analysis will be provided in the HEPData record. Hence, we prefer not to add the suggested plot as additional material, since similar information will be summarized in the HEPData record.

• All figures: The ordering of processes (and colors) in the stacked histogram and the legend is not identical. Would be nice to have the same ordering here.

Ordering in the stacked histograms is usually chosen according to the abundance of the different backgrounds, so that all the various contributions remain visible. Hence, it may differ between the SR of the emu and ee/mumu categories. We can try to arrange the same ordering if it does not spoil visibility.

• Figure 1: "...production of W+W-." -> "...production of W+W- in association with two quarks."

Done.

• Figure 2: "data" is missing in the legend

• Figure 3: Bins are of equal width -> no horizontal uncertainty bars needed. "Bins" on the x axis takes only integer values -> remove the ticks on the x axis in between.

We removed the ticks on the x axis. We prefer to maintain the horizontal uncertainty bars to be consistent with the other plots in the paper.

• It seems like the data is not well described in bins with high mjj and Delta eta, i.e., there is at least 10% more data than estimated for this region. I guess the normalization of background processes is mainly driven from the control regions where the agreement is fine, but the differences are not covered by uncertainties. Table 2 mentions an "ttbar normalization" uncertainty, what exactly is this uncertainty?

The "ttbar normalization" uncertainty is given by varying the nuisance parameter that scales the ttbar + tW background contribution in the likelihood function. Deviations are compatible with statistical fluctuations.

• Figure 4: Bins are of equal width -> no horizontal uncertainty bars needed. The plot has only two bins, remove "0.5" and "1.5" from the x axis as these values don't make any sense. Please remove also the ticks on the x axis between "0" and "1" ; and "1" and "2"

We prefer to maintain the horizontal uncertainty bars to be consistent with the other plots in the paper. We removed the ticks at non-integer values from the x axis.

• Table 1: Why is only the transverse W boson mass of the leading lepton used and not also the one obtained from the subleading lepton?

We chose the input variables of the DNN from a larger set of observables. The mTW2 variable (calculated with the pT of the subleading lepton) was considered as a candidate input as well. However, including mTW2 as an input variable did not produce a significant improvement in the DNN performance, hence it was not chosen.

• Table 3: "MET" -> "missing transverse momentum"

Done.

### Comments from Tommaso Dorigo (cds)

• The main comment concerns the "probability inversion" mistake you make on line 237-238, where you say that the p-value "corresponds to the probability of the background hypothesis under the asymptotic approximation". What the p-value is, is the probability of observing data at least as discrepant as those observed, under the background-only hypothesis. By talking about the probability of hypotheses you are implicitly converting the frequentist calculation into Bayesian reasoning (which implies marrying a certain prior for the hypotheses). We don't do that in CMS papers. Please don't forget, and change the text accordingly.

"The statistical significance of the signal is quantified by means of a $p$-value, converted to an equivalent Gaussian significance, which corresponds to the probability of observing data with a larger discrepancy with respect to the background-only hypothesis, under the asymptotic approximation~\cite{Cowan_2011}."
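For reference, the conversion between the p-value and the equivalent one-sided Gaussian significance can be sketched as follows (an illustrative SciPy snippet, not the analysis code; the 5-sigma input is a placeholder, not a number from the paper):

```python
from scipy.stats import norm

def significance_to_pvalue(z):
    # One-sided tail probability under the background-only hypothesis.
    return norm.sf(z)

def pvalue_to_significance(p):
    # Equivalent one-sided Gaussian significance for a given p-value.
    return norm.isf(p)

# Round trip at an illustrative 5-sigma level.
p = significance_to_pvalue(5.0)
z = pvalue_to_significance(p)
```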

• In the abstract, there is an "additionaly" to correct. But I would remove the full last sentence starting with "The fiducial..." which describes cut values and details that are better left for the body of the text. I understand that fiduciality is important as you quote the fiducial cross section above, but I do believe this detail is not for an abstract.

We rephrased the abstract as follows: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb."

• L9 "depending on which pair of bosons .... IS chosen" (not "are chosen").

Done.

• Figures 2, 3 and 4 show ratios between Poisson variables. How did you compute the uncertainties on the ratios?

In each bin of the bottom panel, errors are computed as the ratio of the data uncertainty to the sum of all background samples. The nominal data uncertainty is displayed as an asymmetric vertical bar, as recommended by the Statistics Committee: https://twiki.cern.ch/twiki/bin/viewauth/CMS/PoissonErrorBars. For the expected number of events the Poisson uncertainty is not shown in the upper plot, and thus not propagated to the ratio plot. The uncertainty in the expected number of events is, to first approximation, Gaussian, which explains why in the ratio plot the dashed error band is symmetric around 1.
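The recommended asymmetric (Garwood) Poisson interval, and its propagation to the ratio panel when the prediction carries no displayed uncertainty, can be sketched as follows (illustrative only; the count `n` and background `bkg` are made-up numbers, not taken from the plots):

```python
from scipy.stats import chi2

def poisson_interval(n, cl=0.6827):
    # Central (Garwood) confidence interval for an observed Poisson count n,
    # the construction recommended for per-bin data error bars.
    alpha = 1.0 - cl
    lo = 0.0 if n == 0 else 0.5 * chi2.ppf(alpha / 2.0, 2 * n)
    hi = 0.5 * chi2.ppf(1.0 - alpha / 2.0, 2 * (n + 1))
    return lo, hi

# The asymmetric data bar divided by the (error-free) background sum
# gives the asymmetric error drawn in the ratio panel.
n, bkg = 3, 4.2
lo, hi = poisson_interval(n)
err_down, err_up = (n - lo) / bkg, (hi - n) / bkg
```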

• Figure 2 is odd, as you use a thick line for the VBS signal overlaid to the full histograms, AND use the same for the total model. I think you need to change this, as the total model is not the VBS alone. Furthermore the chosen colour is hard to see on the right panel.

See Figure 2, Figure 3, Figure 4. The signal is also shown as a superimposed line to highlight the differences between the signal distribution and the other backgrounds; we will modify all captions accordingly.

• Figure 2 (left) has a under-fluctuating point in the upper panel, which disappears in the ratio data/expected lower panel. What is the purpose of that lower panel if you cut it to exclude overfluctuating bins? Please change the range of the ratio shown such that all points and their uncertainty are properly included in it.

We changed the range of the Y axis of the ratio plot to make the last point and its uncertainty visible. See Figure 2.

• Figure 3 is annoyingly complicated and too full of unreadable text. I suggest that you remove all the data concerning cuts in the three left bins, and leave this to the caption. Also, the vertical red line has no function as it separates bins defined by some cuts from bins defined by other cuts, that you still have to define in the caption. It is the single most striking visual element in the histogram but it does not belong there. Also, note that the legend "Mjj>500 GeV, deltaEta>3.5" in the left graph of Fig. 3 overlaps and makes the text hard to read.

See Figure 2, Figure 3, Figure 4.

• Table 2 has sources of systematics and a column called "impact". I would not call that "impact" as one could mistakenly take that to mean that it is the difference it makes on the total uncertainty to include or exclude each source separately. We usually call the latter "impact", while what you report are seemingly independent sources of uncertainty. Maybe you can relabel that as "value".

Table 2 has been corrected.

### Comments from Marco Cipriani (cds)

• Unlike the summary, the abstract only provides the observed significance. A priori, there is no rule stating that abstracts must give the significance to claim an observation, given the nice cross-section measurement that implies a 5 sigma observed significance and interference with the QCD-induced process is negligible, cf. l. 64. If significance is deemed to be crucial for the abstract, then the expected should also be mentioned so that the reader can gauge if there may be some form of anomaly in the result, like an excess or deficit in the data.

We rephrased the abstract as follows: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb.".

• The only theoretical reference and context is provided in Ref.[1]. Please cite the original and relevant sources. Among them, Higgs observation and motivation: arXiv:1207.7214, arXiv:1207.7235, arXiv:1303.4571, PRL13 (1964) 321, PRL 13 (1964) 508, Phys. Rev. 145 (1966) 1156, and the contest of VBS and recent theoretical developments Phys. Rev. 155 (1967) 1554, Phys. Rev. Lett. 13 (1964) 585, Phys. Rev. D 87 (2013) 055017, Phys. Rev. D 87 (2013) 093005, Phys. Rev. Lett. 38 (1977) 883, Phys. Rev. D 16 (1977) 1519, and maybe also a bit of context of the BSM status in EW-VBS, like aQGC and dim6.

Added theoretical references in line 3: PRD 87 (2013) 055017 , PRD 87 (2013) 093005.

Added references for observation of Higgs boson in line 4: PLB 716 (2012) 1, PLB 716 (2012) 30, JHEP 06 (2013) 081.

Added the following citations for recent ATLAS and CMS publications in VBS same-sign: PRL 114 (2015) 051801, PRL 113 (2014) 141803, PRL 120 (2018) 081801, PRL 123 (2019) 161801 , PLB 809 (2020) 135710. Rephrased lines 9-11 as: "The EW production for two W bosons with the same electric charge in the fully leptonic final state has extensively been studied by the ATLAS and CMS Collaborations~\cite{ATLAS8TeV, SMP-13-015, ATLAS2016, SMP-17-004, SMP-19-012, SMP-20-006}."

• Fig 1: Does gamma gamma -->WW also contribute to the signal process? Then it would be better to change “W” to “W/Z/y” to cover more possible diagrams? If not, since we define VBS scattering as VV’ → VV’ in the first sentence of the paper, we need to explain succinctly why only W exchanges are shown in the plots.

Yes it does; W bosons are shown just as an example. Note, however, that not all boson combinations are allowed in the middle diagram of Figure 1.

• L47-49: It might be worth specifying the thresholds for the triggers. In the case of the single lepton triggers, these are probably >= 27 GeV for electrons (even > 32 GeV for 2017 and 2018) and >= 24 GeV for muons if using the tight working points. Later in the preselection we say that the leading lepton must have pt > 25 GeV, which seems quite low at least for electrons coming from the SingleElectron triggers. Even if we model the trigger turn on and use dedicated scale factors, the efficiency plateau is probably above 40 GeV (driven by L1), so we might get very large uncertainties (although in the systematic section you state it is less than 1%, so this might be wrong). Is there a significant gain in using low offline electron pt thresholds, compared to raising them to around 30 GeV for instance?

We added the following standard sentence for the trigger: Line 47: "This analysis requires events filtered by trigger algorithms that select either a single lepton passing a high-\pt threshold, or two leptons with a lower \pt threshold, satisfying both isolation and identification criteria. In the 2016 data set, the \pt threshold of the single electron trigger is 25\GeV for $\lvert\eta\rvert < 2.1$ and 27\GeV for $2.1 < \lvert\eta\rvert < 2.5$, while the \pt threshold of the single muon trigger is 24\GeV. Double lepton triggers have lower \pt thresholds, namely 23\GeV (12\GeV) for the leading (trailing) lepton in the double electron trigger, 17\GeV (8\GeV) for the leading (trailing) lepton in the double muon trigger and 23\GeV (8\GeV in the first part of the data set, corresponding to 17.7\fbinv, and 12\GeV in the second one) for the leading (trailing) lepton in the electron-muon trigger. In the 2017 data set, single electron and single muon \pt thresholds are raised to 35\GeV and 27\GeV, respectively. Likewise, in the 2018 data set the corresponding single lepton \pt thresholds are 32\GeV and 24\GeV. Double lepton \pt thresholds in 2017 and 2018 data sets are the same as those described for the 2016 data set, except for the \pt threshold of the trailing lepton in the electron-muon trigger, which is 12\GeV."

In the offline analysis, events are selected by both single and double lepton triggers. Single electron trigger pt thresholds are 27 GeV, 35 GeV and 32 GeV, and single muon trigger pt thresholds are 24 GeV, 27 GeV and 24 GeV for the 2016, 2017 and 2018 data sets, respectively. The per-leg trigger efficiency is measured in data with the Tag-and-Probe technique and applied to MC. Single and double lepton trigger efficiencies are then combined together as disjoint probabilities, and the error in this estimate is evaluated by varying the pt of the probe lepton. Ultimately, the impact on the final measurement is around 1%. This uncertainty comprises the contribution from double lepton triggers as well, whose pt thresholds are 23 GeV / 12 GeV (e/mu), 17 GeV / 8 GeV (mu/mu) and 23 GeV / 12 GeV (e/e), and for which the analysis selection is trigger safe. Thus, we do not expect any significant improvement from increasing the thresholds of the lepton transverse momenta. As a further cross-check, we have verified this by setting the offline lepton pt thresholds to 30 GeV (leading lepton) and 20 GeV (subleading lepton): the expected significance was found to be 5 sigmas instead of 5.2 with respect to the background-only hypothesis.
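The combination of the single and double lepton trigger paths can be illustrated schematically; this toy assumes independent per-path efficiencies, which is a simplification of the actual per-leg Tag-and-Probe treatment, and the efficiency values are placeholders:

```python
def combined_trigger_efficiency(eff_single, eff_double):
    # Probability that at least one of the two paths fires, under the
    # simplifying assumption of independent per-path efficiencies.
    return 1.0 - (1.0 - eff_single) * (1.0 - eff_double)

# Illustrative numbers only, not measured efficiencies.
eff = combined_trigger_efficiency(0.95, 0.90)
```

The combined efficiency is always at least as large as the better of the two paths, which is why adding the double lepton triggers recovers events below the single lepton plateau.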

• L60-61: Was the theory community consulted on the fact that ‘removing’ diagrams is the right thing to do? What is done with off-shell top diagrams?

These diagrams are actually not computed in the matrix element calculation, but the electroweak top contribution is already included in the ttbar + tW sample; therefore all top diagrams are correctly taken into account.

• L63-65: The paper states that the interference has an effect of a few percent on signal yield, but it also says it is negligible. This seems contradictory, especially when looking at table 2 which reports many systematic uncertainties of less than 2%. Is any item in table 2 already accounting for the interference?

The interference term was neglected because it is a small effect that was found to have no impact on the final measurement. In fact, it does not contribute to the high signal-to-background-ratio bins, and the overall normalization effect is less than 2%.

• L70: Is the nonresonant WW production induced by gluons merged with the QCD-induced WW background in the plots (i.e. label “WW” in the legend), or in the multiboson component, or others?

The gluon-induced WW production (ggWW) is included in the QCD-induced background, labeled as "WW" in all plots.

• L94: Please add the standard description and bibliography refs. of AK4 jet reconstruction and jet energy corrections etc, currently the text doesn’t even mention anti-kT or R=0.4 anywhere.

We will add the standard CMS description:

"The central feature of the CMS apparatus is a superconducting solenoid of 6\unit{m} internal diameter, providing a magnetic field of 3.8\unit{T}. A silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections, are installed within the solenoid. Forward calorimeters extend the pseudorapidity coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system and the relevant kinematic variables, can be found in Ref.~\cite{CMS_detector}.

Events of interest are selected using a two-tiered trigger system. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100\unit{kHz} within a fixed latency of about 4\mus~\cite{CMS:2020cmk}. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1\unit{kHz} before data storage~\cite{CMS:2016ngn}.

During the 2016 and 2017 data-taking, a gradual shift in the timing of the inputs of the ECAL L1 trigger in the region at $\abs{\eta} > 2.0$ caused a specific trigger inefficiency. For events containing an electron (a jet) with \pt larger than $\approx$50\GeV ($\approx$100\GeV), in the region $2.5 < \abs{\eta} < 3.0$ the efficiency loss is $\approx$10--20\%, depending on \pt, $\eta$, and time. Correction factors were computed from data and applied to the acceptance evaluated by simulation.

The particle-flow (PF) algorithm~\cite{CMS:2017yfk} aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies. The candidate vertex with the largest value of summed physics-object $\pt^2$ is taken to be the primary $\Pp\Pp$ interaction vertex. The physics objects used for this determination are the jets and the associated missing transverse momentum, taken as the negative vector sum of the \pt of those jets.

Hadronic jets are clustered from all the PF candidates in an event using the infrared and collinear safe anti-\kt algorithm~\cite{Cacciari:2008gp, Cacciari:2011ma} with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10% of the true momentum over the whole \pt spectrum and detector acceptance. Additional proton-proton interactions within the same or nearby bunch crossings can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, tracks identified to be originating from pileup vertices are discarded and an offset correction is applied to correct for remaining contributions. Jet energy corrections are derived from simulation studies so that the average measured energy of jets becomes identical to that of particle level jets. In situ measurements of the momentum balance in dijet, $\text{photon} + \text{jet}$, $\PZ + \text{jet}$, and multijet events are used to determine any residual differences between the jet energy scale in data and in simulation, and appropriate corrections are made~\cite{CMS:2016lmd}. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures.

The missing transverse momentum vector \ptvecmiss is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as \ptmiss~\cite{CMS:2019ctu}. The \ptvecmiss is modified to account for corrections to the energy scale of the reconstructed jets in the event. The pileup per particle identification (PUPPI) algorithm~\cite{Bertolini:2014bba} is applied to reduce the pileup dependence of the \ptvecmiss observable. The \ptvecmiss is computed from the PF candidates weighted by their probability to originate from the primary interaction vertex~\cite{CMS:2019ctu}."

• L112-120: It might make sense to mention also the W+jets control regions here, for completeness. Consider adding a table defining all signal and control regions.

There might be a misunderstanding here. The W+jets control region is not specific to our analysis; rather, it is defined with a looser selection, which is also shared by other analyses. For further reference please check Figures 89-100 of the object AN (v8): https://cms.cern.ch/iCMS/jsp/db_notes/noteInfo.jsp?cmsnoteid=CMS%20AN-2019/125 . The W+jets control region selection is defined at page 93.

• L162: It would add clarity and improve readability if the loss function is described here (currently it only mentions that it is something with L2 regularization). Consider adding references for “L2” and “early stopping”.

We removed these lines since they are too technical.

• L160-163: An overly technical paragraph that can probably be omitted.

Done.

• L169-176: The reader wonders why the bin with 300 < m_jj < 500 GeV and Deta_jj > 3.5 is not included in the m_jj spectrum with the other 5 m_jj bins at the same Deta_jj. Even if that m_jj bin is less enriched in signal, using the full shape of m_jj should already account for it. In general, was the possibility of using control regions in bins of m_jj and/or Deta_jj considered? What the reader understands is that the background shape is fixed, in the sense that only the normalization is scaled, by using the control regions. Was there any issue in using differential distributions for the control regions too?

The mjj > 500 GeV and Deta_jj > 3.5 phase space has been separated from the rest of the signal region because it is more enriched in VBS events. We have now included that bin in the full mjj spectrum, as shown in Figure 3. The issue with employing differential distributions in the control regions as well is the lack of statistical power, especially in the measurement of the two DY contributions in the same-flavor categories.

• Fig 2: Do we understand why the last bin in the data for the left plot is so far from the prediction? It appears to be well compatible with the background-only hypothesis. It would be very useful to zoom in on the ratio plot range in order to include this point too (by eye, it seems that Data/Expected is around 0.4)

We have checked the DNN distribution in the CRs and the agreement between data and MC of the input variables, and no significant deviation that might explain this behaviour has been observed. This is therefore a statistical fluctuation.

• L190: The uncertainty on the integrated luminosity is not an uncertainty on the signal cross section. Please rephrase the statement accordingly. Is 2.1% the actual number for the full Run2, including correlations? The recommended lumi uncertainty for full Run 2 is 1.6%. Similarly, L201-L208. There is here a logical step to make that the result is not only the cross section, but also the significance, and you should state that these uncertainties are included in the yields of both.

Indeed we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (for some reason the text mistakenly reports 2.1%, but the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. This number differs from the a priori 1.6% value, but it should be noted that this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit procedure, and the error propagation to the final result. The combined action of these effects makes the contribution of the a priori 1.6% luminosity uncertainty larger in the final cross section measurement.

• L190-218: Currently this chapter gives the impression that for some systematics we quote the prefit uncertainty on the estimated yields, and for some others we quote the impact on the signal strength (which are referred to as the “cross section”). We should be consistent with what we quote, and should also make it clear from the beginning.

This section has been rephrased as follows:

"Uncertainties in the integrated luminosity, lepton reconstruction and identification efficiency~\cite{CMS:2015xaf,CMS:2018rym}, trigger efficiency and additional trigger timing shift have been taken into account. The electron and muon momentum scale uncertainties are computed by varying the momenta of leptons within one standard deviation from their nominal value. Similarly, jet energy scale and resolution uncertainties~\cite{CMS:2016lmd} are evaluated by shifting the \pt of the jets by one standard deviation, and this directly affects the reconstructed jet multiplicity and \ptmiss measurement: several independent sources are considered and partially correlated among different data sets. Uncertainties in the residual \ptmiss~\cite{CMS:2019ctu} are also included and calculated by varying the momenta of unclustered particles that are not identified with either a jet or a lepton. The \PQb tagging~\cite{Bols_2020} introduces different uncertainty sources. The uncertainties from the \PQb tagging algorithm itself are correlated among all data sets, while uncertainties due to the finite size of the control samples are uncorrelated. Finally, the uncertainty in the pileup reweighting procedure is applied to all relevant simulated samples.

Among theoretical uncertainties, the effects due to the choice of the renormalization and factorization scales are evaluated. These are computed by varying the scales up and down independently by a factor of two with respect to their nominal values, ignoring the extreme cases where they are shifted in opposite directions~\cite{Catani:2003zt,Cacciari:2003fi}. Only shape effects are considered when varying such scales, since signal and background normalizations are directly measured in data. PDF uncertainties in the signal process do not introduce any shape effect in the \mjj and DNN output distributions, hence they have not been considered. For \ttbar and DY backgrounds, since normalization effects have no impact in the fit, PDF uncertainties can only affect the ratio of the expected yields between the SR and the CR. Such uncertainties have been included in the CR and estimated to be 1\% and 2\% for \ttbar and DY backgrounds, respectively. The modeling of both the parton shower and the underlying event has been taken into account.

The breakdown of all systematic uncertainties in the cross section measurement is shown in Table~\ref{tab:syst}. Systematic uncertainties have been added in quadrature, and the systematic component of the overall relative uncertainty in the signal cross section measurement is 13.1\%. The statistical uncertainty has been computed by freezing all systematic sources to their best-fit result and its value is 14.9\%. The combined relative uncertainty in the cross section measurement is 19.8\%."
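As a quick cross-check of the quoted numbers, the quadrature combination of the statistical and systematic components works out as follows:

```python
from math import hypot

# Quadrature sum of the quoted components:
# 13.1% (systematic) and 14.9% (statistical) -> 19.8% (total).
syst, stat = 13.1, 14.9
total = hypot(syst, stat)
```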

• L194: Please express in one line what these uncertainties are due to.

See comment about line 94. We also changed line 194 from "Additional trigger corrections ("prefiring corrections")" to "Uncertainties in the trigger timing shift".

• L194: Is this only the ECAL prefiring or does it also include muon prefiring. Please note that this is not the effect of ECAL prefiring on muons, there is actually a muon-specific prefiring affecting the muon chambers. It mainly affects 2016 but the effect here can be as large as 2-3% on the inclusive DY yields; please see here: https://cms.cern.ch/iCMS/jsp/openfile.jsp?tp=draft&files=AN2021_086_v2.pdf

This effect is the one due to the ECAL prefiring.

• L196-198: is the momentum scale varied by one sigma of the momentum resolution?

Yes it is.

• L213-214: The argument that only shape effects are considered “because normalizations… are directly measured in data” should be made more clear – why are all the other normalization uncertainties included then?

Normalization uncertainties are only included for those backgrounds that are not estimated in data. All other effects are described by nuisance parameters that modify the shape of the histograms, rather than their integral.

• L214-215: It is unclear whether what is meant is that PDFs induce a less than 1% variation on the signal yield before the fit or that this <1% is the impact they have on the final measurement. In the former case, PDFs can also affect the background yields; is this effect also not considered or neglected? How large is the impact on the measured cross section if they are included in both signal and backgrounds? It may not be negligible compared to other small uncertainties quoted. If this is exactly what the text was already saying, then the secondary sentence “hence they have not been included” is not understandable.

The "less than 1% variation on the signal yield" is evaluated before the fit, that is why PDFs were not included as one of the major uncertainty sources. Moreover, they don't introduce any shape variation in the observables of interest and therefore they have no impact in the cross section measurement. Regarding PDF for the main backgrounds, similar considerations hold but we took into account the PDF uncertainty contribution that might affect the ratio of the expected yields between SR and CR. We rephrased that paragraph as: "PDF uncertainties in the signal process do not introduce any shape effect in the \mjj and DNN output distributions, hence they have not been considered. For \ttbar and DY backgrounds, since normalization effects have no impact in the fit, PDF uncertainties can only affect the ratio of the expected yields between the SR and the CR. Such uncertainties have been included in the CR and estimated to be 1\% and 2\% for \ttbar and DY backgrounds, respectively."

• L220-222: What are the “expected results” quoted here? At least the “expected” yields in Fig. 4 are based on a fit to real data and not Asimov data set (hopefully). Does “expected” here refer to the expected significance of 5.2 sigma? Please also specify what is meant by “the prediction” for the Asimov data set – a fit with signal regions blinded maybe? Finally, Asimov data set is usually not called a “toy” data set, since it is an alternative to “throwing toys” i.e. running multiple pseudoexperiments.

"Expected results" in Figure 4 are of course based on the fit to real data; the Asimov data set is only exploited to derive the a priori errors in the signal strength and the expected significance. The Asimov data set is centered on the Monte Carlo prediction and is used instead of real data in all regions: this is a fast and blinded procedure to get an estimate of such numbers without performing the actual fit to real data. Rephrased L220-222: "Expected results are assessed by using the Asimov data set~\cite{Cowan_2011}, a pseudo-experiment in which data are set in each bin to the value provided by the prediction".
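For a single counting bin, the expected significance obtained from an Asimov data set reduces to the well-known asymptotic formula of Cowan et al. (arXiv:1007.1727). The sketch below is a toy with made-up yields; the actual analysis sums many bins and profiles nuisance parameters, which this ignores:

```python
from math import log, sqrt

def asimov_significance(s, b):
    # Median expected discovery significance for one counting bin,
    # asymptotic formula from Cowan et al.; reduces to s/sqrt(b) for s << b.
    return sqrt(2.0 * ((s + b) * log(1.0 + s / b) - s))

# Illustrative yields only, not taken from the analysis.
z = asimov_significance(s=50.0, b=100.0)
```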

• L222, what is the value of the nuisance parameters? (prefit or postfit)

Nuisance parameters are included with their postfit values even when the Asimov data set is employed. In this case though, most of the postfit values are identical to their prefit estimate.

• L228: Why is only the factorization scale considered?

Because the VBS signal is a purely electroweak process at LO, varying the renormalization scale has no effect, since it only affects the strong coupling constant alpha_s.

• Table1: the transverse mass m_T^W1 of the leading lepton and ptmiss in e\mu events is never mentioned in the text, while from Section 4 it appears that one cuts on the mT defined using the lepton pair and ptmiss. Is it a typo? If not, further details on this m_T^W1 variable may be needed. For example, whether it is more discriminating than the other mT.

No, it is not a typo; mT and mTW1 are two different variables. In particular, a cut on the mT variable (mT > 60 GeV) is applied in the SR to suppress the DYtt process. The mTW1 is instead used as an input variable of the DNN, as it adds more discrimination power to the DNN than mT. Indeed, the DNN was trained to distinguish the signal from the top and QCD-WW backgrounds; the DYtt one, which constitutes a minor background for the emu final state, is not considered in the training of the network. Because of the limited space, the full procedure adopted to build the DNN is not described in the paper. However, these details are reported in Chapter 7 of the analysis note v9.
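For illustration only: both mT and mTW1 are instances of the same generic transverse-mass formula, applied to different inputs (the dilepton system plus ptmiss for mT, the leading lepton plus ptmiss for mTW1). A minimal sketch, with names chosen here for clarity; the exact definitions in the paper may differ.

```python
import math

def transverse_mass(pt_a, pt_b, dphi):
    """Transverse mass of two massless transverse-momentum vectors
    separated by azimuthal angle dphi:
    m_T = sqrt(2 * pT_a * pT_b * (1 - cos(dphi)))."""
    return math.sqrt(2.0 * pt_a * pt_b * (1.0 - math.cos(dphi)))
```

With pt_a the dilepton pT and pt_b = ptmiss one obtains mT; with pt_a the leading-lepton pT one obtains an mTW1-like variable.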

### Comments from Laurent Thomas (cds)

General:

• The paper is very short and could benefit from a few additional sentences describing the physics object definition, triggers,...

Added standard description of CMS detector in Sec.2.

• The signal extraction strategy should be discussed and motivated more. It is not clear why a DNN is used in the opposite flavour channel but not in the same flavour channels. The choice of the input variables for the DNN seems also arbitrary (why not using the 4 momentum of all selected physics objects? Why not to use soft activity variables such as the number of soft jets?). There is very little to no discussion on the validation in data of the DNN output distribution/mjj spectrum predicted by simulation.

The different flavor channel is purer with respect to the same flavor one: the same flavor final state signal purity is spoiled by the presence of the DY background, which is suppressed in the emu final state. Therefore, the main contribution to the final significance derives from the different flavor channel. Since the emu final state is the most sensitive, we chose to train a DNN in this phase space to boost the performance of the analysis. As to the DNN input variables, they were selected among a larger set of observables by looking at their shapes and correlations. Different models with different input variables were tested, and the best-performing one was chosen by comparing the ROC curves of the different models. Soft activity variables were not taken into account since they come with large theoretical uncertainties. Because of the limited space, the full procedure adopted to build the DNN is not described in the paper. However, these details are reported in Chapter 7 of the analysis note v9. For the same reason, we did not discuss in the paper the validation in data of the DNN output/mjj spectrum predicted by simulation. We checked the agreement between data and Monte Carlo for the mjj and DNN output distributions in the top and DY CRs, which are orthogonal to the signal region, before the unblinding. These studies are reported in detail in the analysis note v9, chapter 7.3 (for the DNN) and chapters 8.3-8.4 (for mjj).

• Abstract: We suggest to quote the expected results

We rephrased the abstract as follows: "An observation of the electroweak production of a W$^{+}$W$^{-}$ pair with two jets, with both W bosons decaying leptonically, is reported. The data sample corresponds to an integrated luminosity of 138 fb$^{-1}$ of proton-proton collisions at $\sqrt{s}=13$ TeV, collected by the CMS detector at the CERN LHC. Events are selected by requiring exactly two leptons (electrons or muons) and two jets with large pseudorapidity separation and high dijet invariant mass. Events are categorized based on the flavor of the final-state charged leptons. A signal is observed (expected) with a significance of 5.6 (5.2) standard deviations with respect to the background-only hypothesis and the measured fiducial cross section is $10.2 \pm 2.0$ fb, consistent with the Standard Model prediction of $9.1 \pm 0.6$ (scale) fb.".

• Abstract: Quoting the lepton pt thresholds seems too detailed for an abstract.

Removed description of the fiducial region.

• "with a cone size ∆R = 0.4" => We assume (but can't confirm from the text) that you meant to say that you are using AK4 jets (i.e. jets clustered with the anti kT algorithm with a distance parameter of 0.4) but that sentence, as written, is not correct.

See comment above.

• l11: "we exploit the full LHC 2016–2018 dataset" => the full 2016–2018 data set recorded by the CMS experiment

Done.

• l18: "with two high transverse momenta (pT) forward jets." => Actually the pt of these jets is not that high in general

-> "with two forward jets."

• l20: "suppressed hadronic activity between them" => This information is actually not used in the analysis unlike other similar analyses (e.g. Hmumu). Why?

Requiring reduced hadronic activity between the two tagging jets would mean applying kinematic selections to variables related to the third jet. This would in turn mean explicitly relying on the parton shower modeling for the definition of our phase space, which is not desirable. Instead, we select our signal region according to the centrality of the dilepton system with respect to the two jets, since it has good discrimination power and is only weakly correlated with the third-jet kinematics.

• l29 "other background sources derive from DY production and W + jets events" => this sounds odd. We would suggest something like "other relevant sources of background are the Drell-Yan process and the production of a W boson, both in association with jets".

Done.

• Section 2: we suggest to elaborate on the event reconstruction and triggering.

We will add the standard CMS description:

"The central feature of the CMS apparatus is a superconducting solenoid of 6\unit{m} internal diameter, providing a magnetic field of 3.8\unit{T}. A silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections, are installed within the solenoid. Forward calorimeters extend the pseudorapidity coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system and the relevant kinematic variables, can be found in Ref.~\cite{CMS_detector}.

Events of interest are selected using a two-tiered trigger system. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100\unit{kHz} within a fixed latency of about 4\mus~\cite{CMS:2020cmk}. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1\unit{kHz} before data storage~\cite{CMS:2016ngn}.

During the 2016 and 2017 data-taking, a gradual shift in the timing of the inputs of the ECAL L1 trigger in the region at $\abs{\eta} > 2.0$ caused a specific trigger inefficiency. For events containing an electron (a jet) with \pt larger than $\approx$50\GeV ($\approx$100\GeV), in the region $2.5 < \abs{\eta} < 3.0$ the efficiency loss is $\approx$10--20\%, depending on \pt, $\eta$, and time. Correction factors were computed from data and applied to the acceptance evaluated by simulation.

The particle-flow (PF) algorithm~\cite{CMS:2017yfk} aims to reconstruct and identify each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies. The candidate vertex with the largest value of summed physics-object $\pt^2$ is taken to be the primary $\Pp\Pp$ interaction vertex. The physics objects used for this determination are the jets and the associated missing transverse momentum, taken as the negative vector sum of the \pt of those jets.

Hadronic jets are clustered from all the PF candidates in an event using the infrared and collinear safe anti-\kt algorithm~\cite{Cacciari:2008gp, Cacciari:2011ma} with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10% of the true momentum over the whole \pt spectrum and detector acceptance. Additional proton-proton interactions within the same or nearby bunch crossings can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, tracks identified to be originating from pileup vertices are discarded and an offset correction is applied to correct for remaining contributions. Jet energy corrections are derived from simulation studies so that the average measured energy of jets becomes identical to that of particle level jets. In situ measurements of the momentum balance in dijet, $\text{photon} + \text{jet}$, $\PZ + \text{jet}$, and multijet events are used to determine any residual differences between the jet energy scale in data and in simulation, and appropriate corrections are made~\cite{CMS:2016lmd}. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures.

The missing transverse momentum vector \ptvecmiss is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as \ptmiss~\cite{CMS:2019ctu}. The \ptvecmiss is modified to account for corrections to the energy scale of the reconstructed jets in the event. The pileup per particle identification (PUPPI) algorithm~\cite{Bertolini:2014bba} is applied to reduce the pileup dependence of the \ptvecmiss observable. The \ptvecmiss is computed from the PF candidates weighted by their probability to originate from the primary interaction vertex~\cite{CMS:2019ctu}."

• l50: "All samples " => "All physics processes considered"?

Done.

• l54: Since you use a mix of single and dilepton triggers which introduces a correlation between the two leptons, it is not clear how the Tag and Probe technique can be used to measure the trigger efficiency. Can you clarify?

Although we make use of both single and dilepton triggers, our events are mostly selected by the latter, and therefore the correlation between them is negligible. The efficiency of dilepton triggers is measured in data: the T&P method is used to determine the efficiency per leg, assuming that the two legs are independent of each other. Single and dilepton trigger efficiencies are then combined as disjoint probabilities.
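The combination described above can be sketched as follows, assuming the two single-lepton paths and the two dilepton-trigger legs fire independently (function name and inputs are illustrative, not taken from the analysis code):

```python
def event_trigger_eff(e_s1, e_s2, e_leg1, e_leg2):
    """Probability that at least one trigger path fires for the event,
    given the two single-lepton path efficiencies (e_s1, e_s2) and the
    per-leg efficiencies of the dilepton path (e_leg1, e_leg2),
    assuming independence between paths."""
    e_dilep = e_leg1 * e_leg2                      # both legs must fire
    p_none = (1 - e_s1) * (1 - e_s2) * (1 - e_dilep)
    return 1 - p_none                              # union probability
```

The "disjoint probabilities" statement corresponds to writing the union probability as one minus the probability that no path fires.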

• l55: after "b-quark jets" please provide a reference for the b-tagging algorithm performance

It is given in line 102 but we could mention it here.

• l66-69: as the VBF Higgs production process interferes destructively with the VBS signal, elaborating how it is treated in the fit may be worthwhile. It cannot be treated as the other processes, since the interference will also change with the signal strength. Perhaps provide a reference to a more detailed discussion in a previous CMS paper ?

Only the on-shell Higgs boson contribution (s-channel diagrams) has been removed from our signal definition, while the off-shell Higgs boson production is included and does interfere with the EW WW signal. The on-shell VBF Higgs boson mechanism is thus treated as a background process, whereas the off-shell contribution is part of the signal sample; a single parameter is used to scale the signal process in the fit to data.

• l72-73: "Most of the events are generated at NLO" : with the advent of generators with negative event weights, one cannot really say that every event is generated at NLO -> "Most of the event samples are generated at NLO"

Done.

• l75-77: a reference to the pt reweighting would be useful. Actually is this pt reweighting adapted here given that your most sensitive variable is likely mjj? Was it considered to derive a correction as a function of this variable instead?

The reweighting has been applied in order to get a better data/MC agreement. We added the following reference to this method.

@article{Khachatryan:2016mnb,
author         = "Khachatryan, Vardan and others",
title          = "Measurement of differential cross sections for top quark
pair production using the lepton+jets final state in
proton-proton collisions at {13\TeV}",
collaboration  = "CMS",
journal        = "Phys. Rev. D",
volume         = "95",
year           = "2017",
pages          = "092001",
doi            = "10.1103/PhysRevD.95.092001",
eprint         = "1610.04191",
archivePrefix  = "arXiv",
primaryClass   = "hep-ex",
reportNumber   = "CMS-TOP-16-008, CERN-EP-2016-227",
SLACcitation   = "%%CITATION = ARXIV:1610.04191;%%"
}

• l93: According to the approval talk, it looks like PUPPIMET is used. This should be mentioned in the text.

• l97 and l112, 114: "two control regions" vs. "The ttbar control regions..." and "In DY control regions...": in the first, "regions" refers to two types of control samples, and in the second, to different regions of the same type of control sample. Perhaps use "control samples" in the first instance.

In all instances, "control regions" refer to single-bin categories where the normalization of the corresponding background is constrained by the fit procedure to data.

• l110: Zeppenfeld variable: is there a motivation for this particular definition ? Consider adding a reference. Would a definition involving the rapidity of the Z, y_Z- 0.5*(eta_j1 + eta_j2), also work ?

We are interested in the centrality of the two leptons with respect to the tagging jets, and those leptons in general are not produced by a Z boson. Reference to the Zeppenfeld variable:

@article{Rainwater:1996ud,
archiveprefix = {arXiv},
author = {Rainwater, David L. and Szalapski, R. and Zeppenfeld, D.},
doi = {10.1103/PhysRevD.54.6680},
eprint = {hep-ph/9605444},
journal = {Phys. Rev. D},
pages = {6680},
title = {{Probing color singlet exchange in $\PZ$ + two jet events at the CERN LHC}},
volume = {54},
year = {1996}
}
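As an illustration of the variable in question, a sketch of the dilepton centrality (Zeppenfeld-type) definition; the exact form and normalization used in the analysis may differ:

```python
def zeppenfeld(eta_ll, eta_j1, eta_j2):
    """Centrality of the dilepton system with respect to the two tagging
    jets: Z_ll = |eta_ll - (eta_j1 + eta_j2)/2| / |eta_j1 - eta_j2|.
    Z_ll = 0 when the dilepton system sits exactly between the jets."""
    return abs(eta_ll - 0.5 * (eta_j1 + eta_j2)) / abs(eta_j1 - eta_j2)
```

A cut such as Z_ll < 1 (as in the figures) keeps events whose leptons lie within the rapidity gap spanned by the tagging jets.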


• l113: is the ttbar purity similar in ee/mumu and emu categories?

Yes, it is. The ttbar purity is around 95% in all the ee, mumu, and emu top CRs.

• l125-127 the description of the way the QCD-induced W+W- production is constrained would be better placed after the DY, which is constrained by a CR as the ttbar background just described.

We changed the ordering.

• l125-127 the sentence reads somewhat strange; the QCD WW background is constrained by all regions. Proposal: "..., is left to float freely and is the outcome of the global fit to all signal and control regions."

We will rephrase as suggested.

• l129 "A large fraction" -> for the reader, having an idea of the fraction value would be interesting (how many times do we have PU jets polluting our VBS sample). Consider adding a value.

In the ee/mumu SRs, the fraction of DY + at least 1 PU jet events is about 50%-60% of the total DY events. We can add the value.

• l149 "(from EW production of Z+jets)": why a parenthesis ? If it is the only process contributing, consider remove the parenthesis. If it is not, then consider elaborating which other contamination is also estimated from simulation and removed.

We removed the parenthesis and rephrased: "The remaining contamination of prompt leptons from electroweak production of a Z boson in association with jets is estimated from simulation and removed."

• Table 1: is there a reason to call the last variable mT^W1 or is it internal to CMS ? Since the MET is global, computing the transverse mass of one of the W bosons is not possible. Consider using mT^1 instead

The variable name mT^W1 was chosen to point out that the calculation uses the pt of the lepton from the first W boson decay. We can certainly change the name of the variable so that it is not misleading.

• l162-163 maybe add a reference for L2 weights and the early stopping ?

We removed lines 160-163 since they were too technical.

• l167-168 "In the emu category the binned DNN outputs are chosen...": this sentence comes just after stating that the control regions are included as single-bin templates... so it’s not clear anymore if the binned DNN outputs are used as discriminating variables in the signal region, or as single-bin template in the CRs

The DNN outputs are used as discriminating variables in the emu SR. We changed the phrase "In the emu category" to "in the emu SR" to make the sentence clearer in the text.

• l177 "The number of events in each category and in each bin of the discriminating distributions is modeled as a Poisson random variable...": this phrasing seems incorrect in statistics. As is well-known, the bin contents in a category and the total content of this category are not both Poissonian, one is multinomial and the other is Poissonian; however all the bin contents can be modeled as independent Poisson variables (if the total is unconstrained)

We will rephrase as: "The number of events in each bin of the templates included in the likelihood function is modeled as a Poisson random variable..."
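The rephrased statement corresponds to the standard binned Poisson likelihood. A minimal sketch of the negative log-likelihood, omitting nuisance parameters and the constant log(n!) terms:

```python
import math

def binned_poisson_nll(observed, expected):
    """Negative log-likelihood for independent Poisson bins,
    -ln L = sum_i [ e_i - n_i ln(e_i) ] up to constants,
    where n_i are observed counts and e_i > 0 the expected yields."""
    return sum(e - n * math.log(e) for n, e in zip(observed, expected))
```

The fit minimizes this quantity over the model parameters; an expectation matching the data gives a smaller NLL than a mismatched one.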

• Fig 2: The ratio plot for Zll<1 doesn't show the last bin.

We are going to change the range of Y axis of the ratio plot to make the last point and its uncertainty visible. See Figure 2.

• Fig 2/3: The postfit data/MC agreement doesn't actually look very good, especially for Zll<1 in fig 2. Did you quantify the goodness of the fit?

We performed the goodness-of-fit test and found an acceptable value. We have also run a channel compatibility test by measuring the VBS signal with different POIs for each data set and flavor category, and the p-value of that test was around 15%.

• Fig 4: is the background control region a single bin? How do you constraint the shape of the DNN or of the mjj distribution?

Yes, control regions are included as single bin to constrain the normalization of the major backgrounds (ttbar-tw and DY). As to the shape of all the backgrounds, we rely on MC simulations. The reliability of MC simulations was checked in the CRs (before unblinding) by comparing the MC with the data. The agreement between data and MC in the CRs was good. These studies are reported in detail in the analysis note v9 chapter 7.3 (for the DNN) and chapters 8.3-8.4 (for mjj).

• l190/191: it's counterintuitive that a 1.6% uncty on the integrated lumi leads to 2.1% on the signal cross section. Can you comment on that?

Indeed we apply a 1.6% uncertainty in the luminosity over the full Run 2 data set, taking correlations into account as recommended by the LUMI group. The 2.0% value (the text mistakenly reports 2.1%; the actual number is 2.0%) is the contribution of this uncertainty to the cross section measurement. This number differs from the 1.6% a priori value, but it should be noted that this nuisance parameter is only defined for the signal sample and for those backgrounds whose normalization is not measured in data. The luminosity uncertainty in the cross section measurement ultimately depends on two effects: the correlation among different processes, which can slightly pull the nuisance parameter during the fit, and the error propagation to the final result. The combined action of these effects makes the a priori 1.6% luminosity uncertainty contribute more than 1.6% to the cross section measurement.

• l192: it is unusual that the muon efficiency uncties are larger than electrons. Can you comment on this?

Indeed the numbers we quoted in the text about the uncertainty in the electron and muon efficiencies were mistakenly swapped.

### Comments from Sijin Qian (cds)

• L162-163
• It seems that a Reference should be given to the "Early Stopping" on L163, but I'm not sure whether it is [25] or not;
• if yes, it may be solved by combining two sentences with a semi-colon to replace the period dot of the 1st sentence;
• also, the commas in the 2nd sentence seems a little too many, it may be clearer if two of them can be replaced by a pair of brackets;
• it seems better if "L2" can be explained briefly, e.g. "Adam optimizer [25]. To prevent overtraining, regularization techniques, such as L2 weights decay and Early Stopping, are implemented." --> "Adam optimizer [25]; to prevent overtraining, regularization techniques (such as L2 weights decay (where L2 is ...) and Early Stopping) are implemented."

We are going to remove lines 160-163 since they are too technical.

• L191-218, those systematic and theoretical uncertainty percentages (e.g. 2.1%, 1.5%, 2.0%, 3.3% and 2.6%, etc.) should perhaps cite some reference articles, so that readers do not wonder why these percentages are chosen rather than any other numbers. At least, "The uncertainty on the integrated luminosity" should be given a Ref. to be consistent with all other CMS papers.

This section has been rephrased as follows:

"Uncertainties in the integrated luminosity, lepton reconstruction and identification efficiency~\cite{CMS:2015xaf,CMS:2018rym}, trigger efficiency and additional trigger timing shift have been taken into account. The electron and muon momentum scale uncertainties are computed by varying the momenta of leptons within one standard deviation from their nominal value. Similarly, jet energy scale and resolution uncertainties~\cite{CMS:2016lmd} are evaluated by shifting the \pt of the jets by one standard deviation, and this directly affects the reconstructed jet multiplicity and \ptmiss measurement: several independent sources are considered and partially correlated among different data sets. Uncertainties in the residual \ptmiss~\cite{CMS:2019ctu} are also included and calculated by varying the momenta of unclustered particles that are not identified with either a jet or a lepton. The \PQb tagging~\cite{Bols_2020} introduces different uncertainty sources. The uncertainties from the \PQb tagging algorithm itself are correlated among all data sets, while uncertainties due to the finite size of the control samples are uncorrelated. Finally, the uncertainty in the pileup reweighting procedure is applied to all relevant simulated samples.

Among theoretical uncertainties, the effects due to the choice of the renormalization and factorization scales are evaluated. These are computed by varying the scales up and down independently by a factor of two with respect to their nominal values, ignoring the extreme cases where they are shifted in opposite directions~\cite{Catani:2003zt,Cacciari:2003fi}. Only shape effects are considered when varying such scales, since signal and background normalizations are directly measured in data. PDF uncertainties in the signal process do not introduce any shape effect in the \mjj and DNN output distributions, hence they have not been considered. For \ttbar and DY backgrounds, since normalization effects have no impact on the fit, PDF uncertainties can only affect the ratio of the expected yields between the SR and the CR. Such uncertainties have been included in the CR and estimated to be 1\% and 2\% for the \ttbar and DY backgrounds, respectively. The modeling of both the parton shower and the underlying event has been taken into account.

The breakdown of all systematic uncertainties in the cross section measurement is shown in Table~\ref{tab:syst}. Systematic uncertainties have been added in quadrature, and the systematic component of the overall relative uncertainty in the signal cross section measurement is 13.1\%. The statistical uncertainty has been computed by freezing all systematic sources to their best-fit result and its value is 14.9\%. The combined relative uncertainty in the cross section measurement is 19.8\%."
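As a numerical cross-check of the quoted totals, combining the 13.1% systematic and 14.9% statistical components in quadrature indeed reproduces the 19.8% overall uncertainty:

```python
import math

def combine_in_quadrature(*uncertainties):
    """Total relative uncertainty from independent components,
    sigma_tot = sqrt(sum_i sigma_i^2)."""
    return math.sqrt(sum(u * u for u in uncertainties))

total = combine_in_quadrature(13.1, 14.9)  # systematic, statistical (%)
```

This gives about 19.84%, consistent with the 19.8% quoted above after rounding.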

### Plots updated to CWR comments

Figure 2 paper v13:

Figure 3 paper v13:

Figure 4 paper v13:

All the questions have been addressed, see the presentation: https://cernbox.cern.ch/index.php/s/sq0sXQ27fEcdL9L

### Questions

• Cross sections should replace signal strengths in the paper. Replaced signal strength with cross section in paper v10

• The so-called inclusive cross section is also a fiducial cross section. We should report the requirements to define that cross section, in addition to the (more exclusive) fiducial cross section. Added descriptions in paper v10 section Results

• Systematic uncertainty table should show the total systematic uncertainty, the data statistical uncertainty, and the total uncertainty to make a complete story with just that table Updated Table 2 in Sec.6 of paper v10

• Show the results with a single WW rate parameter instead of splitting in 2016 and 2017+2018. See the presentation above

• Show the individual results with the new samples per year and per flavor, including the postfit/prefit yields for the main backgrounds. See the presentation above

### v5 26 September 2021

• L1: missing 3 primes from 3 Vs

• L20: preferable to add dijet in front of "invariant mass" for clarification even if it is implicitly understood by experts. A single jet also has an inv. mass.

• L42, ref 5 and l195-197: Regarding luminosity uncertainty, here is the recommendation from PubComm page (https://twiki.cern.ch/twiki/bin/viewauth/CMS/Internal/PubDetector) for short letters: The total Run~2 (2016--2018) integrated luminosity has an uncertainty of 1.6\%, the improvement in precision relative to Refs.~\cite{CMS-LUM-17-003,CMS-PAS-LUM-17-004,CMS-PAS-LUM-18-002} reflecting the (uncorrelated) time evolution of some systematic effects. The improved uncertainty of 1.2% for 2016 was advertised widely, here is for example the announcement from early April: https://hypernews.cern.ch/HyperNews/CMS/get/physics-announcements/6191.html So I find it unacceptable that the new results are not used / referenced when they are available since half a year. A lot of effort went into providing these almost unprecedentedly precise calibrations to the collaboration. I can understand that you might not want to redo your fits for a sub-dominant uncertainty but I would still insist to present it in a way that gives justice to the work. An option is to use the above recommendation from PubComm. And if we ever need to redo the fit, use the correct lumi calibration, please.

• L47: The sentence on single lepton trigger strictness is not correct or unclear. The analysis cuts on the lepton pT are 25 and 13 GeV. The single lepton trigger cuts of 24-27 GeV (muons) and 27-32 GeV (electrons) are more restrictive. The issue is probably what your pronoun "they" refers to in the sentence. Removed

• L90: Events with an additional … are rejected.

• L100: Comma after category

• L117: Add that for the sf ll DY, you consider two subregions (judging from Fig 4) and why Added that the SF DY CR is further divided into \detajj bins. For the explanation of this choice, we refer to the lines below in background estimation.

• L138: … defined ones populating …

• L142-144: Something is not clear in the description. You say that the fake-leptons have to fail the tight isolation requirements used in the signal region. This means that they are non-overlapping with the leptons in the signal region definition. If so the probability for a jet to satisfy both these loose criteria and the signal lepton criteria is zero. What you measure (I assume) is a transfer factor (i.e. the relative rate of these) and not a probability.

• L148: It is also a bit confusing that you call this a "dijet" CR as it seems you select lepton + jet events. Removed "dijet".

• L164: kinematic

• Tab 1, last line: extra space in the subscript between T and l_1

• Fig 2: stop the y axis at a smaller value (10^5 and 5*10^5 ?) to stretch a bit the histograms

• Fig 3: start the y axis at 0.1 for both plots to squeeze the histograms less. It might be my printer, but the labels on top of the bins are somewhat difficult to read; increase font size a bit?

• L195: see above the comment on luminosity uncertainty

• L213: start new para for theory errors

• L217: for my education, is it a new theory community recommendation to ignore these extreme cases? These are the "canonical 7-point variations", as also mentioned in the YR4 https://arxiv.org/pdf/1610.07922.pdf
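For reference, the "canonical 7-point" scale variations mentioned here can be enumerated explicitly; a small illustrative sketch (not analysis code):

```python
from itertools import product

def seven_point_variations():
    """(mu_R, mu_F) multiplicative factors for the canonical 7-point
    scheme: all combinations of {0.5, 1, 2}, excluding the two
    opposite-direction extremes (0.5, 2) and (2, 0.5)."""
    factors = [0.5, 1.0, 2.0]
    return [(r, f) for r, f in product(factors, factors)
            if (r, f) not in [(0.5, 2.0), (2.0, 0.5)]]
```

The envelope of the observable over these seven variations is then taken as the scale uncertainty.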

• L189-90: each histograms related >>>> each histogram related

• L197, : strenght >>> strength

• L204-205 "…shifting the pT of jets…": Shifted by 1 sigma? Please mention.

• L229, 243 Signal strength: 1.32 (+0.29, -0.27) = 1.32 (+0.20, -0.18 syst) and (+0.20, -0.19 stat). If the syst. and stat. errors are combined in quadrature, it should give 1.32 (+0.28, -0.26). How are the errors combined? Errors are combined in quadrature. The small difference of 0.01 arises from truncation error.

• line 30 and elsewhere: "W + jets" - it looks like there are two spaces between "W" and "+", please check.

• line 45: personally, I would remove the comma from "cut, or an electron" and add a comma to "threshold, satisfying both"

• Figure 3 and 4: the labels on the different bins are very small, the font size could/should be made at least a factor 2 larger

• line 88: "described in [22, 23]." -> "described in Refs. [22, 23]."

• line 223: "impact are not listed:" -> "impact are not listed."

• Table 2: "Most impactful systematic uncertainties" -> "Most relevant systematic uncertainties"? [just a personal preference]

• Ref. 3 should be listed as 'CMS Collaboration', and include the arXiv reference

• Ref. 7 should be listed like Refs. 5 and 6, as CMS-PAS-LUM-18-002 (not 'technical report')

• Ref. 25 is missing the arXiv reference

• Ref. 26 looks incomplete, it seems unpublished but it should at least include the arXiv reference: arXiv:1412.6980

• Abstract: Please write "137 fb-1" instead of "137.1 fb-1". I do not think the ".1" is needed.

• line 242: "overwhelming background". The word "overwhelming" suggests that the background could not be dealt with. Please replace "overwhelming" with "very large" or "dominant".

### v4 30 August 2021

• L2: VV' -> VV' with V and V' being a W, Z or gamma vector boson

• L4: with a mass of 125 GeV

• L8: Wgamma missing? But actually the list in () could be dropped

• L11: with the same electric charge

• Fig 1: one could include a diagram with TGC couplings as well

• Caption Fig1: Examples of

• L18: pT not introduced

• centrality not defined, only in L107

• L28: b tagging veto is a jargon, veto against jets originating from b quark production

• L29: via (or by) data driven techniques

• L36: electromagnetic

• *L44: ref 7 should be the 2018 lumi PAS

• *L44 I guess, ref 5 is outdated, we have a publication with much improved uncertainty:
Yes, it is true. But ref5 is what we have used in the analysis.

• L49: is it really less restrictive, eg. Single lepton trigger pT threshold?
Yes it is, since the pT thresholds are 24 GeV (2016, 2018 data set) and 27 GeV (2017 data set) for the single muon trigger and 27 GeV (2016 data set), 35 GeV (2017 data set) and 32 GeV (2018 data set) for the single electron trigger, while they are lower in dilepton triggers.

• L51: PU reweighing not mentioned

• L62: negligible contribution - can it be more qualitative?

• L71: line too long

• L85, 90, 91: It would be better to capitalise the start of the items, and finish them with full stop, as the first item has actually several sentences.

• L86: merge sentences by dropping "The leptons"

• *L87: Ref 9 is not citable. Use the published gamma and muon papers.

• L99: Coma after category

• L101: comma after categories

• L102: events from Z boson production.

• *L109-114: Some indication of how clean these control regions are would be useful to add

• *L137-144: the W+jets CR is mentioned but not really described, neither the dijet CR. It would be useful to add a sentence what the looser definition of lepton ID means, and also how the dijet CR is defined.

• L142: DY events in simulation?

• L141: The prompt lepton contribution to the dijet control region

• *L152: the input variables to the DNN should be given

• *Figs 2, 3, 4: CMS Preliminary inside the frame, /fb should be fb^-1, enlarge plots so that labels etc are better visible in the preprint, They are unreadable as is. VBS legend item should just be a line, not a box and a wider line would be more visible. All MC legend items would be better in the lower panel. Are these before or after fit plots? Should be specified in the caption. Captions: Add that numbers in [] give the expectation.

• Figs 2, 3: the y scale could start at a larger value so that the interesting part is not so squeezed

• Fig 3: legend item: EW Zjj?

• Figs 2, 3: any comment on the observed differences between data and expectation?

• *Figs 4-6: please merge into a single plot the three single-bin plots. Takes a lot of space without much info. Then you have space to show the control regions from where the background normalisations come from. That would be much more informative.

• *L180: 2016 lumi uncertainty is 1.2%

• L185: decide energy or momenta, and "leptons' momenta" should be "momenta of leptons"

• L190: strength

• L199: case

• L201: what does negligible mean?

• *Sec 6: Not much on background uncertainties. I would find useful a table summarising all considered uncertainties in the paper.
The listed 1-3% uncertainty per source quoted in the text hardly explains the 20% total syst., though not all uncertainties have a quoted size (b tagging, ...).

• L207: central values of the templates

• L210: line too long

Title and abstract:

• should one change 'observation' to 'first observation', given L13: "which has not yet been observed"?

• "with large pseudorapidity separation and high invariant mass" -> "and high dijet invariant mass" ?
The "high invariant mass" refers to the "two jets"; using "dijet" would sound a bit redundant, but we will change it if needed.

Section 1:

• L21: "analsyis" -> "analysis"

• L32: "Other background sources include W + jets and DY production.": should one mention here that the first is estimated from control regions and the second from MC? Section 3 mentions only DY, and one has to read up to Section 5, L138 to read that W+jets is estimated from data.

Section 3:

• L44: "respectively [5-7]": please check, references 5 & 6 refer to 2016 and 2017 luminosity, reference 7 is not about 2018 luminosity

• L51: "are modeled via simulation that has been reweighted"-> "are modeled via Monte Carlo simulation, reweighted"

Section 4:

• L94: "defined such that the signal-to-background ratio is higher": "higher" than what ? maybe something like "optimal" sounds better ?

• L114: "|mZ-mll|<15 GeV" -> "|mll-mZ|<15 GeV" (admittedly, just a personal preference)

Section 5:

• L116-117: "For the normalization of the major backgrounds data driven estimates using control regions are employed" -> "Data driven estimates using control regions are employed for the normalization of the major backgrounds"

• L133: "A minor source of background is DYtt events" -> "A minor source of background is due to DYtt events"

• Figure 2: the label "DNNoutput_lowZ_s2b5e3_2016" on the X axis looks quite cryptic, maybe just "DNN output"?

• Figure 3: the label mjj on the X axis should be changed to "mjj [GeV]"

• Figure 2 & 3: the labels on the Y axis look unconventional (personal opinion, I've not checked the guidelines)

• Figure 4 & 5: "events" as label on the X axis ?

• suggestion: if you put together the Zll<1 and Zll>1 in one single two-bin histogram ("-1" and "+1"), then you could use Zll as the label on the X axis

Section 6:

• L180-181: should one add the references to the luminosity uncertainties estimates?

• L195: "Finally, the uncertainty on the pileup is applied" -> (maybe) "Finally, the uncertainty on the pileup reweighting procedure is applied"

• L199: "nominal value, ignoring the extreme case" -> "nominal valueS, ignoring the extreme caSe" (i.e. one "s" missing, one "s" extra)

References (cut and paste form InspireHEP does not always work):

• L234: wz -> WZ

• L235: "at s=13tev" -> "at sqrt(s)=13 TeV"

• ref. [7] is not about luminosity

• reference [9] is an AN, i.e. not publicly available

• L258: Nnlops -> NNLOPS, w+w- -> W^+W^-

• L264: Update -> update (or use caps consistently for the whole title)

• L267: lhc -> LHC

• L269: mcfm -> MCFM

### v1 25 February 2021

Comments implemented in the v2 of the paper.

Introduction:

• Fig1: I’m not really sure if you need 5 Feynman diagrams to get the point across. ok, reduced to one example

• Ln 20: It’s a bit odd to cite the 2016 result only. added citation of latest result

• Ln 41: Would be better to define QCD-induced production more explicitly earlier done

Section 4:

• I think Table 1 is not relevant to have. Nevertheless, we should have a table (or tables) with the (post)fit data, signal, and background yields.

• This reads more like an AN, and you cite an AN. The likelihood ratio test statistic is used in ~every CMS measurement with a search or low stats measurement. The appropriate citations are the usual profile likelihood ratio ones, not a CMS paper (refer to any published VBS result). ok, added correct citation.

• It’s also important to stress that your measurement is a search for a process, defined by the signal strength of the process, and that the significance of the signal strength is then quantified by the significance of the likelihood ratio test statistic. ok, rephrased.

• Refer to published VBS results and restructure along these lines. ok

• Ln 142: You can restore the broken line numbers by wrapping the equation in \begin{linenomath*} and \end{linenomath*} fixed

• DeepFlav: is this the right way to refer to this? I don’t see any working points defined in the paper. It is also referred to as DeepJet; changed.

• Ln 159: You extract the signal strength and then calculate its significance ok

Section 5:

• We should see full run2 distributions, the individual years are irrelevant for outsiders. We should have the mjj and DNN distributions for all SRs and CRs. I believe it would also be good in the AN. You can have the split distributions in years in the Appendix, but it's better to show the combined distributions in the main body. Exceptions in cases when studying specific 2017 and 2018 issues. ok, adding full run II distributions for SR and CRs.

• Fitting strategy. While it's not completely clear in the AN (as mentioned by Yacine and Kenneth), it's not clear in the paper draft either. I am not sure if I understand l189 "ttbar and DY normalizations....after being initially left free to float". What does it mean initially?

For the normalization of the major backgrounds (tt and DY) data driven estimates using control regions are employed. The normalization of top and DY is left to float freely in the fit and constrained by the corresponding control region.

Results section:

• You don’t really describe how the two analyses should be considered. Is the DNN one the nominal one? Yes, the DNN is the nominal one. In the paper (v2) we are going to quote the results obtained combining DF (DNN) with SF (mjj) categories.

• It's interesting to show the significances and signal strengths per channels and analyses, but not per year. Okay, updated table.

• Consider looking at the VVV discovery paper for inspiration on how to treat the two together side by side ok

Conclusions:

• You don’t have any and you should added

Here questions regarding the AN are collected and addressed for each available version. Link to the gitlab repository: https://gitlab.cern.ch/tdr/notes/AN-20-073/-/tree/master

### To be added in ANv7

• Converge and clarify the signal definition choice (for both analyses)
• Add one or two bins in the cutbased analysis with events with mjj:[300-500] GeV and detajj<3.5

We have added three bins in the cut-based analysis, defined as follows:

1) 300 < mjj < 500 GeV && 2.5 < detajj < 3.5

2) 300 < mjj < 500 GeV && detajj > 3.5

3) mjj > 500 GeV && 2.5 < detajj < 3.5

Such bins have been included in each Zll region. In this way the two analyses share the same phase space definition, hence we can now make an apples-to-apples comparison. Evaluating the expected significance in the different flavour channel (for each data set) we get the following results:

| data set | mjj shape-based analysis | DNN analysis |
|----------|--------------------------|--------------|
| 2016 | 1.83 sigma | 1.89 sigma |
| 2017 | 1.95 sigma | 1.92 sigma |
| 2018 | 2.82 sigma | 2.88 sigma |
| full Run 2 | 3.79 sigma | 3.75 sigma |

The mjj shape-based analysis clearly benefits from loosening the VBS-like phase space definition, and so we will use this selection.

• Check the DNN sensitivity by cutting on mjj>500GeV and detajj>3.5

We raise the thresholds for mjj from 300 to 500 GeV and for detajj from 2.5 to 3.5. We use 2016 to estimate how this affects the DNN performance. We find that the significance decreases by about 2% with tighter cuts, passing from 2.12 to 2.07 (all leptonic channels included). The plots below represent DNN:mjj (on the left) and DNN:detajj (on the right) for the signal in the Zll < 1 (top row) and Zll > 1 (bottom row).

As you can see, it is not 100% true that an event with low mjj and low detajj ends in the low score region of the DNN output. This explains why the significance with a tighter cut on mjj and detajj decreases.

• Check the possible anticorrelation between QCD WW and EW WW in the fit

VBS signal and QCD WW normalisations are 30% anti-correlated; see the correlation matrix below, where only scaling parameters are displayed:

We did investigate the reliability of the fit procedure through the use of toys, see Kenneth's question below in the "Slides" section.

• Further validation of the datacards
• merging ee and mumu categories if it doesn't help

Merging ee/mumu categories slightly reduces the expected statistical significance (e.g. 2018 data set: 1.78 sigma ->1.52 sigma, mjj > 300 && detajj > 2.5). Although a 15% gain on the expected significance in the SF category does not mean that the combined fit improves by the same amount, since the analysis is mainly driven by the DF category, it could be worth keeping the ee/mumu splitting in the analysis.

• merging production processes using a scheme like the ones in the figures

The Higgs contribution is not relevant at all in the SF categories, due to the very tight mll cut (>120 GeV), so we decided to neglect such samples. Moreover, merging all the different Higgs contributions is not very feasible, since each production mode is affected by different theoretical systematics which are treated separately.

• WW+QCD MC samples: WWJJ vs. WW inclusive.
• Show mjj distribution starting at mjj>300 GeV, combine all lepton flavors, Zll regions, and all years
• Check the fraction of events of 0, 1, and 2 parton jets at GEN level from existing WW inclusive sample

We compared the MadGraph LO WWJJ sample we are currently employing in the analysis with an inclusive WW NNLO sample generated with powheg (WWJ) [1]. This is a fair comparison, since the QCD precision of the second jet is at LO in both samples. At reco level we observe an overall good agreement in mjj, although the two samples significantly differ in the very first bin. Events have been selected with mjj > 300 GeV and detajj > 2.5.

Indeed the fraction of 0/1 gen-jets entering the signal region is much higher for the WWJ sample at low mjj values, whereas the prediction for events with at least 2 gen-jets with pt > 30 GeV is basically the same.

We also drew a comparison applying the analysis preselection defined with gen-level variables: no relevant discrepancies are found in the shape of (gen-)mjj.

We could hence replace the WWjj MadGraph sample we are currently employing with the WWJ Powheg one.

• Make use of EW LLJJ MC samples instead of EW ZJJ (that overlap with dibosons)

We are processing the EW LLJJ sample; in the meantime we are cutting on mjj > 120 GeV at LHE level, removing the overlap with the semi-leptonic sample.

• Investigate further the surprising agreement for the third jet distribution, using different PS configurations. Share the configuration setup.

Configuration setup has been shared, see https://hypernews.cern.ch/HyperNews/CMS/get/SMP-21-001/17.html

### Slides

Q Guillelmo s16 is this selection or signal definition ? signal definition

Q then it is inconsistent between cutbased and DNN ? not settled yet on the signal definition. Indeed need to compare with same mjj cut

Q Guillelmo Wouldn't it make sense to have a more VBS-like definition? How much do you gain by relaxing the cuts?

Q Aram: When you say you get better performance, do you mean the ROC curve or going all the way to the expected result? We compare first the ROC curves, to understand qualitatively which models have better performance, and to make a first selection of all the models tested. Then we extract the expected results to have a quantitative measure of the gain of the DNN with respect to mjj.

Q Guillelmo Just to make sure you have a gain, you could add a bin with all events with mjj in [300, 500] or detajj in [2.5, 3.5] to have a more fair comparison. Also, you should have a consistent cut between the two channels in order to define a consistent cross-section

Q Paolo: Personally I don't think going down to a low value is a problem as long as you stay away from the triboson production ok

Q Guillelmo s40, for the CRs are you using 1 bin per region? The CR you would make would still be dominated by top

Q Guillelmo If itâ€™s free floating, it must be very anti-correlated with the signal. Do you really have the ability to separate them?

Q Kenneth You could do some tests with toys, possibly drawing the data from a biased distribution built from a*QCD + b*EW. You should see how reliably you recover the values of a and b that you put in vs

We checked the fit reliability through the use of toys (500 for each configuration), generating data with different a,b values. The fit procedure shows that input parameters are recovered regardless of initial settings.

| input parameters | fitted parameters |
|------------------|-------------------|
| a = 0.5 ; b = 0.5 | a_fit = 0.497 +/- 0.015 ; b_fit = 0.500 +/- 0.012 |
| a = 0.5 ; b = 1 | a_fit = 0.510 +/- 0.015 ; b_fit = 0.963 +/- 0.014 |
| a = 0.5 ; b = 2 | a_fit = 0.483 +/- 0.016 ; b_fit = 1.994 +/- 0.016 |
| a = 1 ; b = 0.5 | a_fit = 0.994 +/- 0.015 ; b_fit = 0.489 +/- 0.012 |
| a = 1 ; b = 1 | a_fit = 1.013 +/- 0.016 ; b_fit = 0.970 +/- 0.014 |
| a = 1 ; b = 2 | a_fit = 0.992 +/- 0.016 ; b_fit = 1.983 +/- 0.017 |
| a = 2 ; b = 0.5 | a_fit = 1.974 +/- 0.015 ; b_fit = 0.475 +/- 0.012 |
| a = 2 ; b = 1 | a_fit = 1.985 +/- 0.015 ; b_fit = 0.977 +/- 0.014 |
| a = 2 ; b = 2 | a_fit = 1.923 +/- 0.015 ; b_fit = 1.982 +/- 0.017 |
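The toy-based closure check described above can be sketched as follows. The template shapes and yields are illustrative placeholders (not the analysis templates), and a crude grid scan of a Poisson likelihood stands in for the fit actually performed with combine.

```python
import math
import random

random.seed(7)

# Illustrative per-bin template yields (placeholders, not the analysis templates).
EW  = [2.0, 5.0, 12.0, 30.0]   # EW WW (signal) template
QCD = [40.0, 25.0, 10.0, 4.0]  # QCD WW template
BKG = [80.0, 50.0, 20.0, 8.0]  # other backgrounds, kept fixed in the fit

def nll(a, b, data):
    """Poisson negative log-likelihood (constants dropped) for mu_i = a*QCD_i + b*EW_i + BKG_i."""
    return sum(a * q + b * e + k - d * math.log(a * q + b * e + k)
               for d, e, q, k in zip(data, EW, QCD, BKG))

def poisson(mu):
    """Knuth's Poisson sampler; adequate for the modest bin yields used here."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

GRID = [0.05 * i for i in range(1, 61)]  # scan a, b in 0.05 .. 3.00

def fit(data):
    """Grid-scan minimisation of the NLL over (a, b)."""
    return min(((nll(a, b, data), a, b) for a in GRID for b in GRID))[1:]

def toy_study(a_true, b_true, n_toys=200):
    """Throw Poisson toys from a_true*QCD + b_true*EW + BKG and refit (a, b)."""
    fits = [fit([poisson(a_true * q + b_true * e + k)
                 for e, q, k in zip(EW, QCD, BKG)])
            for _ in range(n_toys)]
    return (sum(f[0] for f in fits) / n_toys,
            sum(f[1] for f in fits) / n_toys)

a_mean, b_mean = toy_study(1.0, 1.0)
print(f"injected a=1, b=1 -> recovered a={a_mean:.2f}, b={b_mean:.2f}")
```

As in the table above, the injected (a, b) values are recovered on average, demonstrating that the anti-correlation between the two normalisations does not bias the fit.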

Q Guillelmo Did you make a check of merging the ee and mm channels? We checked that we really don't lose much, we can probably do this, but we need to study it a bit more

Q Guillelmo In the data cards, you should really combine the small processes rather than having them all split. There are quite a few warnings that need to be addressed

Q Paolo: We should understand the off-shell effects. we did try to make a sample of p p > l v l v j j and some tests, in the backup

Q Paolo Is the sample really LO WW+2j only? Did you make some comparison? Yes also in backup

Q Kenneth Combine channels and years (at least 2017/8) to have more clear comparison of the gen-level differences

Q check the fraction of 0,1 partons contributing in the inclusive samples ok

Q Paolo s8 Z+2jets EW ==> switch to EW LLJJ samples

Q Manjit If you compare your sherpa and MadGraph samples, there is a bump in the ratio plot, can you really ignore this? It's only a couple of bins, so yes, not so relevant.

Q Manjit s19: How do you choose 80% and 20% for the split? roughly yes, maybe not the exact numbers, but the training should be the larger one

Q Manjit You've used mjj for one channel and DNN for another, if you're going to combine them, do you make some compatibility check of the two?

Q Paolo s13: Surprising you don't see much difference in the PS settings for the third jet. We are sure that it was configured correctly. We can share the settings in any case

• TableSec 7.4, Fig 42 and 43: are the pileup jets defined by an ID or by matching to GEN?

The DY_PUJets process is defined by requiring that at least one of the two leading reco jets with pt > 30 GeV is not matched to a GEN jet with pt > 25 GeV. The DY_hardJets sample therefore has both leading jets matched at GEN level. This sentence will be added to the AN as well.
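The matching condition above can be sketched as a small gen-matching function. The ΔR < 0.4 matching radius and the (pt, eta, phi) tuple representation are illustrative assumptions, not taken from the AN.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance between two jets, with phi wrapped into [0, pi]."""
    dphi = abs(phi1 - phi2)
    if dphi > math.pi:
        dphi = 2 * math.pi - dphi
    return math.hypot(eta1 - eta2, dphi)

def has_pu_jet(reco_jets, gen_jets, dr_max=0.4):
    """True if any of the two leading reco jets (pt > 30 GeV) has no GEN jet
    (pt > 25 GeV) within dr_max -- the DY_PUJets condition sketched above.
    Jets are (pt, eta, phi) tuples; dr_max = 0.4 is an assumed matching radius."""
    leading = sorted((j for j in reco_jets if j[0] > 30.0),
                     key=lambda j: -j[0])[:2]
    gens = [g for g in gen_jets if g[0] > 25.0]
    for pt, eta, phi in leading:
        if not any(delta_r(eta, phi, g[1], g[2]) < dr_max for g in gens):
            return True  # unmatched reco jet -> likely pileup
    return False
```

An event whose second leading jet has no nearby GEN jet would thus be classified as DY_PUJets; if both leading jets are matched, it falls into DY_hardJets.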

• Are you sure that the issue is pileup, or could it just be mismeasurement outside the tracker? Do you have plots showing bins in jet eta tracker vs. outside? (perhaps both jets eta < 2.5, 1 jet eta < 2.5, and 2 jets < 2.5). Would also be kind of interesting to see inside HF or not (eta > 3).

We believe that the issue, mostly visible in the 2016 DY sample, is due to the simulation of the hard radiation and/or to the relative fraction of events w/ and w/o PU jets. Plots of detajj with both jets inside the tracker or at least one outside are shown below:

Two tracker jets (ee/mumu):

One tracker jet (ee/mumu):

As you may see, the region with both jets inside the tracker is entirely populated by the DY_hardJets process and shows a large disagreement. As a further cross-check, we did try to use this categorization to determine both DY_hardJets and DY_PUJets normalisations and results are in agreement with the strategy we are using in the AN (slide 7-8 https://hypernews.cern.ch/HyperNews/CMS/get/SMP-21-001/8/1.html).

• Fig 47 and 48: the nonprompt background statistics really seem insufficient. I don't think it's a good idea to fit with this background estimation. How many raw data events do you have here? Some possible approaches: - Loosen the ID somehow to have a better sample of events? - Combine all years rather than fitting separately - Derive the shape from a looser region and scale with the ratio of signal region/loose region

We cannot define a looser selection than the one we are currently using to estimate the fake rate: the definition of the lepton working points is the loosest possible satisfying the trigger-safe requirement. Moreover, nonprompt leptons are really a marginal background for the SF analysis; we expect 3 events in the full Run 2 for the mumu signal region category, which is basically less than an event per bin in mjj, and 12 in the ee region.

• In section 8, you regularly reference splitting the DY into PU and no PU jet events. How is this defined in the signal region? Purely by splitting events with etajj > 5 or < 5? I assume you also split the signal region in these bins? Why is this never shown? It would be good to see the signal distributions with the DY colored according to the two contributions.

Our strategy is to treat DY_PUJets and DY_hardJets as two different processes. Each of them has a dedicated control region: detajj < 5 DY CR is enriched with DY_hardJets events, while the other one is mainly populated by the DY_PUJets contribution. There is no detajj splitting in the signal region (see table 8) and the two processes contribute there with the yields determined in their respective CR. Both samples are shown in figures 47-48 (light green = "hard" DY process, dark green = DY with at least 1 PU jet).

• Fig. 49: Why is this the only place that Z EW is referenced? Is it included in other plots but not labelled? Also, DY EW isn't really meaningful since it's not a Drell-Yan process

We will keep the Zjj sample separated from the pure DY, as it is in the rest of the AN.

• I'm kind of concerned that the stats are so low in the DNN distribution, Fig. 49. This should definitely be rebinned. Ideally the stats of the backgrounds would also be increased.

We have rebinned the DNN output requiring in each bin at least one signal event, at least 2 signal + background events and a maximum 30% statistical error on the background. For minor backgrounds we require a yield > 0 in all bins. The binning has been derived on the 2016 data set and then applied to the other two years. The figure shows in the top (bottom) row the Zll < 1 (Zll > 1) signal region for 2016/2017/2018 respectively.
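The rebinning criteria quoted above can be sketched as a greedy merge starting from the high-score end of a fine binning. This is a simplified illustration of the quoted requirements, not the actual analysis code; the minor-background yield condition is omitted for brevity.

```python
def rebin(signal, background, bkg_err,
          min_sig=1.0, min_tot=2.0, max_rel_err=0.30):
    """Merge fine bins, from the high-score end, until each merged bin has
    >= min_sig signal events, >= min_tot signal+background events and a
    relative background stat. error below max_rel_err.
    Returns the merged bin edges as indices into the fine binning."""
    edges = [len(signal)]
    s = b = e2 = 0.0
    for i in range(len(signal) - 1, -1, -1):
        s += signal[i]
        b += background[i]
        e2 += bkg_err[i] ** 2
        if (s >= min_sig and s + b >= min_tot
                and b > 0 and e2 ** 0.5 / b <= max_rel_err):
            edges.append(i)   # close the current merged bin at fine edge i
            s = b = e2 = 0.0
    if edges[-1] != 0:
        edges[-1] = 0  # merge any leftover low-score bins into the last accepted bin
    return edges[::-1]
```

Applied to a falling background and a signal peaking at high score, the function keeps fine bins where the signal is plentiful and merges the signal-poor, low-score region into a single wide bin.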

New results are extracted and reported in table below.

| year | significance | err. on signal strength |
|------|--------------|-------------------------|
| 2016 | 2.16 | -0.48/+0.53 |
| 2017 | 2.33 | -0.44/+0.48 |
| 2018 | 3.28 | -0.32/+0.34 |
| full Run 2 | 4.39 | -0.24/+0.26 |

These results are going to be updated in AN v7.

• Can you clarify what DY sample you are using? Quite a lot are listed in the introduction, and the stats don't seem great

Table 6 shows all DY samples we are employing for the SF analysis; for some of them we are also using the available extensions to further increase the statistics. We will include those in the list as well.

• Is WZ the major source of multiboson background? How many events do you have in the sample, and how many raw events pass the final selection?

Here's the number of events of each process entering the "multiboson" definition (2018 data set). Plots are drawn in the inclusive e/mu, e/e and mu/mu signal regions respectively: while WZ is the major contribution in the different flavour category, it is as important as Vg in the same flavour analysis.

• Did you check the WW+jj samples against the WW inclusive ones? We usually didn't use these VV+2j LO samples in the past, because the matching scale in Pythia gives very hard 3j radiation. It's worth at least checking the impact of using other samples if you have the statistics.

Here you may find the comparison between the inclusive and the WWjj sample: the inclusive WW sample is plotted as data, while the LO WWjj sample is the solid azure histogram. The dashed grey bands include both MC stat and theory uncertainties, and the mjj shapes agree within error bars in almost every signal region.

• Are the impact plots up to date (Fig. 50-52)? I don't see all the parameters for the DY as I would expect. Could you share the complete impact plot files (a link in the twiki would be enough)? There are various high-ranked uncertainties that are purely statistical. This landscape would change with the binning changed.

Plots shown in the AN are updated, the r_vbs estimation is mainly driven by the DF analysis and that's why SF-related nuisances don't impact much in the VBS measurement. You may find all pages here:

Link to full impact plots mjj analysis:

Link to full impact plots DNN analysis:

• Did you share your combine cards with Pietro (and us) yet?

This is the gitlab repository where all datacards have been uploaded: https://gitlab.cern.ch/cms-hcg/cadi/smp-21-001

### v5 26 January 2021

In v5 all comments from v4 have been implemented. The main concerns related to v5 are the following:

• We would expect to gain more sensitivity when using a DNN approach to extract the signal, at this level both mjj and the DNN score show similar results. Is there room for any optimization?

The training procedure now includes both QCD WW and ttbar pair production as backgrounds. Doing so, the expected statistical significance increases by roughly 15% in the different flavour analysis, and when combining all categories together we almost reach 5 expected sigma.

• The same flavour DY control region shows some problems in the data/MC agreement, especially for the 2016 data set. Have you tried to implement a bin-by-bin correction?

In order to tackle the observed data/MC disagreement we changed the paradigm for the same flavour analysis. The new strategy is based on two main points: 1) the discrepancies strongly depend on detajj, which could hint at a PU dependency; 2) CR and SR need to be as similar as possible. We therefore split the DY sample into two contributions, one including events in which at least one jet comes from PU and the other one for the remaining "hard" events. Two independent parameters are used to scale their normalizations in the fit procedure. In order to gain sensitivity to these contributions, the DY control region has been divided into 2 detajj bins (above and below 5). In addition, we increased the MET cut to 60 GeV, as in the SR. Although "hard"-like events are unlikely to be found in such a high-MET region, the categorization in detajj is suitable for separating the two DY sub-samples and allows a better estimation of their yields.

### v4 13 January 2021

We never managed to produce a meaningful sample with POWHEG.

• Could you be more specific on the issues encountered when trying to produce those samples? Perhaps GEN group can be of help? Even if the issues are critical with POWHEG it's worth documenting the studies that you made for reference.

The issue we encountered with POWHEG was related to the sample generation, as it appeared that all events had the same seed. We tried to get in contact with POWHEG's authors but we never had a follow-up on that, hence we dropped the study.

• A study of MadGraph+Herwig at Gen level would also be useful. This could be done on NanoGen pretty easily. We can help you with the configuration, then you just need to generate events and make a few comparison plots of your sensitive variables at Gen level. Since this is the first time this state has been studied, it would make the analysis stronger.

This has been documented in the AN (see figure 6).

We performed a preliminary study where we compare our signal sample at GEN level (starting from MiniAOD files) with a LO VBS W+W- sample generated with Sherpa, along with its built-in parton shower. The Rivet analysis employed for this comparison contains the main cuts which define our signal selection. We considered jets with pt > 30 GeV, from which we further removed leptons with pt > 10 GeV contained in their cone (R = 0.4). Both samples are affected by an issue in the colour reconnection scheme, which results in generating more jets within the pseudorapidity gap of the two tagging jets. In Sherpa, a fix for this problem is available, and the difference in the production rate of the third jet is clearly visible. Nevertheless, inclusive two-jet distributions agree within a fair 10%, and there are no relevant shape differences affecting mjj, which is our chosen fit variable.

• Could you please specify what PS did you use for the Madgraph samples? is it with the default Pythia8 or Herwig? it would be useful to have both. In the case of the Pythia8 it would be useful to check the dipoleRecoil option as well. Would it be possible to update these plots with more statistics?

The PS used with MadGraph samples is the default Pythia8. Plots with more statistics have been uploaded in the AN (see figures 4 and 5).

• Also the Sherpa PS fix vs MadGraph plots show large differences mainly in the 3rd jet variables, and that's indeed due to the colour reconnection scheme. Even though the checks done previously showed that the cut-based analysis is not affected by the issue, now that you have a DNN approach the conclusion might be different. I would suggest also checking the impact on the DNN with the Sherpa PS fix to start, and with MG5+Herwig when ready.

We compare the ROC curves obtained applying the models to the analysis samples to estimate the discrimination power of one network with respect to another. As to overfitting, we check that the loss function evaluated on the validation data set does not increase with the number of epochs, but decreases or remains stable (as the ones shown in fig. 8 of the AN). Moreover, we also consider two other metrics: the recall (TP/(TP+FN)) and the precision (TP/(TP+FP)). Finally, we also check that the distributions of the DNN score obtained with the training and with the validation samples overlap.
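For reference, the two quoted metrics can be computed from the DNN scores as below; this is a minimal sketch, with a hypothetical score threshold of 0.5, not the actual analysis code.

```python
def recall_precision(scores, labels, threshold=0.5):
    """Recall = TP/(TP+FN) and precision = TP/(TP+FP) for a cut on the DNN score.
    scores: classifier outputs in [0, 1]; labels: 1 = signal, 0 = background.
    The 0.5 threshold is an illustrative choice."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

Scanning the threshold and plotting recall (signal efficiency) against the false-positive rate reproduces the ROC curve used for the model comparison.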

• If I understand correctly the optimisation is done by "hand", so you check if the loss function is relatively flat and does not increase with the epoch. Is that correct? have you tried using a more quantitative approach such as Kolmogorov-Smirnov test? This is, I believe, what the SMP-20-013 is using.

We are implementing, as suggested, the Kolmogorov-Smirnov test in the optimisation procedure of the latest networks to further check the absence of overfitting.

• Figure 10-11-12: I see that the loss function is oscillating with the number of epochs (same pattern with the efficiency and purity). Do you have an explanation for this? To my knowledge, such behaviour is symptomatic of an optimisation oscillating around a saddle point. Maybe you can reduce the learning rate so that the gradient descent doesn't overshoot the minima. Also, I see that (line 380) the LR is automatically optimised as the learning progresses. Could you show a plot of the LR as a function of the epochs? Maybe the oscillation is an artefact of this automation.

The oscillation pattern you see in the metrics is due to the Cyclical Learning Rate algorithm [1] used in the training. With this method, three parameters control the learning rate: a lower bound, an upper bound and a step size. The learning rate increases from the lower to the upper bound in steps; on reaching the upper bound, it decreases until the lower bound is reached, and the process repeats throughout the training. Figure [2] shows an example of the behavior of the learning rate during each iteration of the training. The wave-like behavior of the loss is a consequence of this learning rate oscillation: the bottom of the wave corresponds to the minimum learning rate, while the top corresponds to the maximum. The Cyclical Learning Rate helps prevent overfitting and reduces the number of iterations needed to optimize the networks.
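The triangular schedule described above can be sketched as follows; the bound and step-size values are illustrative placeholders, not the ones used in the training.

```python
def cyclical_lr(iteration, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular cyclical learning rate: ramp linearly from base_lr to
    max_lr over step_size iterations, then back down, and repeat.
    The parameter values here are illustrative, not the analysis settings."""
    cycle = iteration // (2 * step_size)
    # x goes 1 -> 0 -> 1 over one full cycle
    x = abs(iteration / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * (1 - x)
```

Plotting `cyclical_lr` against the iteration number gives the sawtooth wave whose period matches the oscillation seen in the loss and in the other metrics.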

### v3 04 January 2021

• The numbers in Tables 9-11 between v2 and v3 have changed quite a lot, the signal is changing by almost 10% in 2016. We really need a more detailed explanation of what changed here. This is still the same selection, without the DNN involved, right? It would really speed up our review to give a breakdown of the impact of individual changes. Just NanoAODv5 --> NanoAODv7 is too vague, we need to know what corrections etc are changing that impact the physics results.

In addition to the change in NanoAOD version there are two additional modifications: the working point for the muons has been changed, following a similar change in the HWW analysis from which we inherit the object definition. In particular, for muons we have moved from a cut-based WP to a WP cutting at 0.8 on the ttHmva, as described in AN 2019/125. Also, we have moved the b veto from the DeepCSV loose WP to the DeepFlavor loose WP. Both improve the sensitivity in almost all categories.

• Table 9-11: How do you treat the negative nonprompt yields? (there is still one negative yield in the new version, there were several in the old).

At the moment they go into combine as they are.

• General point: I agree with Yacine’s comment that studying the signal with another generator would be useful. I remember studying POWHEG some time ago. Did you conclude that there was an issue with POWHEG?

We never managed to produce a meaningful sample with POWHEG.

• A study of MadGraph+Herwig at Gen level would also be useful. This could be done on NanoGen pretty easily. We can help you with the configuration, then you just need to generate events and make a few comparison plots of your sensitive variables at Gen level. Since this is the first time this state has been studied, it would make the analysis stronger.

We performed a preliminary study in which we compare our signal sample at GEN level (starting from MiniAOD files) with a LO VBS W+W- sample generated with Sherpa, using its built-in parton shower. The Rivet analysis employed for this comparison contains the main cuts defining our signal selection. We considered jets with pt > 30 GeV, from which we further removed leptons with pt > 10 GeV contained in their cone (R = 0.4). Both samples are affected by an issue in the colour reconnection scheme, which results in the generation of more jets within the pseudorapidity gap between the two tagging jets. In Sherpa a fix for this problem is available, and the difference in the production rate of the third jet is clearly visible. Nevertheless, the inclusive two-jet distributions agree within a fair 10%, and there are no relevant shape differences affecting mjj, which is our chosen fit variable.

• We think it would be important to make a combined EW+QCD measurement in a fiducial region. Using the shape-based fit for this, with EW WW and QCD WW as signal, should be an easy addition that is appreciated by theorists.

We are currently working on that and will soon add the measurement to the documentation. We have not yet settled on a fiducial-volume definition, but we propose to perform the fit in such a way that the fiducial and nonfiducial signal components entering the signal region are scaled together. With this approach the fiducial-volume definition does not matter when fitting, and plays a role only when translating the signal strength extracted from the fit into a fiducial cross section. We have already been able to fit the EWK+QCD sample as signal, obtaining an expected signal strength of 1 +/- 0.26. We would like to finalize the definition of the fiducial region between now and the preapproval.

• Ln 100: There are a lot of definitions of the Zeppenfeld variable. The one you use is sometimes called the centrality (zeta), with the Zeppenfeld variable reserved for zetall/etajj. Did you try the zeppenfeld with this definition as well? It would probably be clearer to adopt this language (as in SMP-18-001)

We have tried to categorize the signal region using Z_ll/Δη_jj = |(η_l1 + η_l2) − (η_j1 + η_j2)| / (2 |η_j1 − η_j2|) instead of the usual Z_ll (defined at line 100 of the AN). We ran a quick test using only the different-flavour categories and only the top control region in the final fit. We tried several scenarios, splitting the signal region into two categories with respect to Z_ll/Δη_jj and varying the cut value from 0.1 to 0.5 in steps of 0.05. The results are reported in the table below.

| Cut on Z_ll/Δη_jj | Significance |
|-------------------|--------------|
| 0.10 | 2.34 |
| 0.15 | 2.43 |
| 0.20 | 2.39 |
| 0.25 | 2.38 |
| 0.30 | 2.36 |
| 0.35 | 2.27 |
| 0.40 | 2.25 |
| 0.45 | 2.21 |
| 0.50 | 2.36 |

The significance obtained with the usual categorization (i.e. Zeppll < 1 / Zeppll > 1) is 2.56. Therefore, the usual categorization has the best performance.

We will adopt, as suggested, the naming convention as in SMP-18-001.
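For reference, the two categorisation variables compared above can be sketched as follows (a minimal illustration based on the definitions quoted in this answer; the function and variable names are ours, not the analysis code):

```python
def zeppenfeld_vars(eta_l1, eta_l2, eta_j1, eta_j2):
    """Return the dilepton centrality Z_ll and its version normalised
    by the tagging-jet pseudorapidity separation, Z_ll / deta_jj.

    Z_ll measures how central the dilepton system is with respect to
    the axis of the two tagging jets.
    """
    detajj = abs(eta_j1 - eta_j2)
    # centrality: |eta_ll - (eta_j1 + eta_j2)/2|, written out explicitly
    z_ll = abs((eta_l1 + eta_l2) - (eta_j1 + eta_j2)) / 2.0
    return z_ll, z_ll / detajj
```

For central leptons both variables are small, which is why the low-Z_ll category is signal enriched.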

• Sec. 6.1: It’s awfully hard to see the improvement in a lot of these plots. Can you show only the region of interest, and plot abs(eta) as well so there are more stats to see the performance?

We have plotted abs(eta) of the two leading jets for all the flavour categories (ee, mm, em) in the top [1] and DY [2] control regions. The data/MC agreement in the horns region (2.5 < |ηjet| < 3.2) is good everywhere. These plots will be included in section 6.1 of ANv4.

Questions on the impact plots, Fig. 42-44:

• QCDscale_top_2j wasn’t shown in the previous version. Is this the shape uncertainty of the top background? Was it just overlooked? Is it not included in the norm param because of the shape effect?

In the previous version (ANv3), QCDscale_top_2j wasn't accounted for; it is the QCD scale uncertainty related to the top background. Both up and down variations are calculated as the difference between the nominal histogram and the envelope obtained by taking the highest up and down QCD scale variation in each bin. This uncertainty is treated as a shape effect and the varied distribution is normalised to the nominal integral (which is indeed why it is not included in the rate parameter).
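The envelope-plus-normalisation procedure described here can be sketched in a few lines (a hedged illustration of the recipe, not the analysis code; histogram names are ours):

```python
import numpy as np

def scale_shape_variations(nominal, scale_variations):
    """Build shape-only up/down QCD-scale templates.

    Take the per-bin envelope (max/min) over the scale-variation
    histograms, then normalise each envelope to the nominal integral,
    so only the shape difference survives.
    """
    nominal = np.asarray(nominal, dtype=float)
    variations = np.asarray(scale_variations, dtype=float)
    up = variations.max(axis=0)      # per-bin upper envelope
    down = variations.min(axis=0)    # per-bin lower envelope
    norm = nominal.sum()
    return up * norm / up.sum(), down * norm / down.sum()
```

By construction both returned templates integrate to the nominal yield, so the variation carries no normalisation component.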

• In the previous version, you had an uncertainty labeled CMS_scale_met, and I was wondering what this is. Is this the JES propagated to the MET or is it the unclustered energy? Did you remove it or did it get pushed further down the ranking?

In the current version CMS_scale_met is still present but has been pushed slightly down the ranking by other uncertainties. It is computed by varying the MET energy scale of PF-algorithm candidates that are not clustered into jets, and it is properly propagated to the other variables that depend on the MET itself. The up and down histograms are then normalised to the nominal one, so this contribution is treated as a shape effect.

• What is the primary source of the stat uncertainties that are dominant in the impact plots? Is it the stat uncertainty on the nonprompt?

The statistical uncertainties in the impact plot are mainly due to the top sample in almost all mjj bins across the flavour categories, and to the DY contribution in the same-flavour categories.

Here the uncertainty breakdown is shown, obtained from a likelihood scan on the Asimov dataset. The total error is split into JES, systematic, and statistical contributions; the latter is clearly what limits our analysis. The plots will be included in an appendix of AN version 4.

• Where is the nonprompt norm uncertainty? For the combined fit, can you put all the nuisances into the appendix?

The main uncertainty source on the "Fake" sample is a normalization uncertainty of 30% derived from a closure test in MC. This uncertainty is modeled as a lognormal distribution, separately for events with a subleading electron or muon. They rank 78 and 169 in the combined impacts plot with an effect of 0.5% and 0.2% respectively on the signal strength. We will create an appendix in version 4 of the AN for all nuisances considered in the combined fit.

• Some of your JES and JER uncertainties are pretty one-sided. Can you add a few illustrative examples of the input shapes you use to the AN?

Overall, the JES/JER uncertainties seem reasonable, although for some of them the up/down variations are indeed one-sided in a few mjj bins, as may be observed here for the 2018 dataset:

The most impactful JES and JER uncertainties are drawn for the main processes, i.e. the VBS, top, and WW samples, in each signal category. Similar plots are extracted for the other years. These plots will be included as an appendix in version 4 of the AN.

Section 5.0:

• You have mentioned that the datasets should be balanced, so you increased the signal samples weights in training. Does that mean that you include the event weights in a way or another in the DNN training? if so can you be more explicit how this information is incorporated in the DNN?

Yes, the event weights are used in the DNN training. In particular, the loss computed for each event is multiplied by its weight. In this way, the back-propagation behaves differently depending on the event weights, giving more importance to events with higher weights.

We start by assigning each event the weight XS * lumi * SF, and then apply a balancing: the total number of weighted events of the signal dataset should match that of all the background datasets combined. This is achieved by increasing the weight of the signal samples in the training, using weight/mean(weights) as the signal weight, while for the background we use weight * nS / sum(weights), where nS is the number of simulated signal events.
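The balancing recipe above can be written compactly (a sketch of the rescaling as described; names are illustrative, not the training code):

```python
import numpy as np

def balance_weights(sig_w, bkg_w):
    """Rescale per-event weights (xs * lumi * SF) so that signal and
    the combined backgrounds carry the same total weight in training.

    Signal: weight / mean(weights)  -> total signal weight = n_sig
    Background: weight * n_sig / sum(weights) -> total = n_sig
    """
    sig_w = np.asarray(sig_w, dtype=float)
    bkg_w = np.asarray(bkg_w, dtype=float)
    n_sig = len(sig_w)
    sig_balanced = sig_w / sig_w.mean()
    bkg_balanced = bkg_w * n_sig / bkg_w.sum()
    return sig_balanced, bkg_balanced
```

After this rescaling both classes sum to the number of simulated signal events, so neither class dominates the loss purely through its total weight.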

• Since you have divided the samples into two datasets one for training and the other for validation, I think it would be good to show the DNN probability distributions for both training and testing to illustrate the absence of overfitting.

We will include the DNN probability distributions for both training and testing in the updated documentation v4.

• What loss function are you using?

We are using the binary cross entropy as loss function.
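For completeness, the weighted binary cross entropy used in the training can be sketched as follows (a minimal numpy illustration, not the framework implementation; the per-event weighting matches the description given earlier):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, weights=None, eps=1e-7):
    """Binary cross entropy, optionally weighted per event.

    The per-event loss -[y log p + (1-y) log(1-p)] is multiplied by
    the event weight before averaging, as done in the training above.
    """
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=float)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    if weights is not None:
        loss = loss * np.asarray(weights, dtype=float)
    return loss.mean()
```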

• It seems that you have used only ttbar samples as background. Out of curiosity, have you tried including other backgrounds to see if the discrimination power improves or deteriorates?

Until now we have considered only ttbar as background, because it is the dominant one in the signal region (its yield is ~10 times that of QCD WW, the second most relevant background). We are trying to add QCD WW to the training as well, to see if it improves the network performance.

• It would be nice to see some of the ROC curves you are mentioning in the text.

We will add the ROC comparison for mjj and the DNN in version 4 of the AN. Here [1] ([2]) are some examples for the low-Zll (high-Zll) categories for the three years. The DNN performs better than mjj.

Section 5.1:

• Could you substitute the N in the text to reflect the results obtained? As I understand, the DNN optimisation is still ongoing, but it would be good to mention the architecture used to make sense of the results.

We will fix this in v4 of the AN. For reference, we are using neural networks with 2 or 3 hidden layers, and a number of neurons ranging from 50 to 150.

• Maybe this is not important in your case, but have you tried using dropout layers? this has proven to reduce overfitting.

During the optimisation of a network we try different architectures and tools, including dropout layers. It is true that they help reduce overtraining, but in some cases they degrade the DNN performance, and in those cases they are discarded.

• You mentioned that a down-weight of mjj/2000 is applied, I am curious to know how this information is used in the DNN.

We multiply the event weights by mjj/2000, but only if mjj >= 2000 GeV. In this way we give more importance to the high-mjj events (i.e. the events with mjj > 2000 GeV) during the training process. This information enters the DNN training through a direct rescaling of the loss function: the loss computed for each event is multiplied by its weight, so the back-propagation gives more importance to events with higher weights.
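The reweighting just described is a one-liner (a sketch under the stated recipe; the function name and threshold argument are illustrative):

```python
def mjj_upweight(weight, mjj, threshold=2000.0):
    """Multiply the training weight by mjj/2000 for events with
    mjj >= 2000 GeV; leave events below the threshold untouched.

    This emphasises the high-mjj tail in the loss function.
    """
    return weight * (mjj / threshold) if mjj >= threshold else weight
```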

Section 5.2:

• The strategy consists of choosing the best variables and have a tradeoff between overtraining (line 370) and discrimination power. For the discrimination power, I guess you used the area under ROC, right? Could you provide us with the methodology used to estimate the overfitting?

We compare the ROC curves obtained by applying the models to the analysis samples to estimate the discrimination power of one network with respect to another. As to the overfitting, we check that the loss function evaluated on the validation dataset does not increase with the number of epochs, but decreases or remains stable (as those shown in Fig. 8 of ANv3). Moreover, we also consider two other metrics: the recall (TP/(TP+FN)) and the precision (TP/(TP+FP)). Finally, we check that the distributions of the DNN score obtained with the training and validation samples overlap.
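The two extra metrics mentioned above are straightforward to compute at a chosen score threshold (a self-contained sketch; the threshold of 0.5 is illustrative, not the analysis choice):

```python
def precision_recall(y_true, y_scores, threshold=0.5):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) of a binary
    classifier, counting events above the score threshold as
    predicted signal."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, y_scores):
        pred = score >= threshold
        if pred and truth:
            tp += 1          # true positive: predicted and real signal
        elif pred and not truth:
            fp += 1          # false positive: background called signal
        elif truth:
            fn += 1          # false negative: signal called background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Comparing these metrics between the training and validation samples gives another handle on overfitting, alongside the loss curves.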

• In line 367, you mention that an optimal value has to be searched, it would be nice to show more details on that.

To find the optimal configuration, we started with a small DNN (2 layers with 20 neurons each) and trained it with as many variables as possible. If the DNN overtrained, we ranked the variables using SHAP values (see AN-2019/239), considering their importance in terms of impact on the DNN output, and removed the two least important variables. We then repeated the process (training -> ranking -> variable removal) until the DNN no longer overtrained. If the resulting performance was not satisfactory, we enlarged the DNN structure (number of layers and/or neurons) and repeated the process until we found the optimal set of training variables with the new structure. We iterated this whole procedure until we found a DNN with satisfactory performance, i.e. with a ROC curve that outperforms mjj over the whole phase space.
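The inner pruning loop of this procedure can be sketched as follows. The three callables (training, the overtraining check, and the SHAP-based ranking) are hypothetical stand-ins for the actual analysis tools; only the loop structure reflects the description above:

```python
def prune_training_variables(variables, train_fn, is_overtrained_fn,
                             rank_fn, n_drop=2, min_vars=4):
    """Train -> check for overtraining -> rank variables by importance
    -> drop the n_drop least important -> repeat, until the model no
    longer overtrains (or a minimum variable count is reached).

    train_fn(vars) -> model; is_overtrained_fn(model) -> bool;
    rank_fn(model, vars) -> vars sorted most to least important.
    """
    variables = list(variables)
    while len(variables) > min_vars:
        model = train_fn(variables)
        if not is_overtrained_fn(model):
            return variables, model
        ranked = rank_fn(model, variables)  # most -> least important
        variables = ranked[:-n_drop]        # drop the least important
    return variables, train_fn(variables)
```

If the loop terminates without satisfactory performance, the outer procedure enlarges the architecture and restarts.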

### v2 23 November 2020

General comments: * Various references are missing (example: Line 258, Line 289, …)

References are updated.

* Out of curiosity, I see you have mentioned a DNN approach in line 112: are you also considering implanting a DNN analysis besides the cut-based one?

Yes, we are working in parallel on a DNN approach in the different flavor category to boost the analysis performance.

Section 3:

• Why are you using NanoAODv5 for 16 and 17 datasets? The current version is v7, are you planning to update soon?

Yes, we are planning to update the analysis, moving it to NanoAODv7 datasets.

* For the signal, you are using MG5 interfaced with Pythia 8, where you require 2 jets in the final state at LO. This could lead to large discrepancies in the case of a third-jet veto (such as the Zll variable), due to a mismodelling of colour connection in Pythia 8. You could consider generating WW+3j at LO with dipoleRecoil=on in the Pythia 8 settings in order to mitigate this issue. You can find more details at the following links:

I would also recommend using a different parton shower (Herwig++ or 7) as cross-check

The Zll variable should not introduce additional mismodelling in our signal sample, since it is not strictly related to the third-jet kinematics. Rather, it describes the polar distribution of the dilepton system with respect to the two tagging jets and, for the signal, we expect more activity in the central region. Indeed this is what happens, and that is why the Zll < 1 category is enriched in signal and has a favourable S/B ratio. As additional evidence of this behaviour, we provide the main jet distributions for the signal sample, evaluated both inclusively in Zll and applying the categorization (example provided for the 2016 dataset; might be updated with 2017 and 2018):

No differences in the shape of the distributions are visible, meaning that the Zll cut does not affect the third jet kinematics.

* Have you checked if you are affected by the HEM issue in the 2018 dataset?

We apply on 2018 datasets the recipe to cure the HEM issue [1]. The effect on our control regions is negligible, as you can see comparing plots where corrections are applied (top [2], DY [3]) to the ones in which they are not (top [4], DY [5]). The checks on HEM issue will be included in section 6.2 of ANv3.

Section 5:

* You applied the PUJID only the 2.5 < |ηjet| < 3.2 region, have you tried to apply the pileup id to other eta regions? maybe this would improve the agreement of the very forward jets

We are already applying the loose PUJID over the full eta range for all jets with pt < 50 GeV. In addition, in 2017 we require the two leading jets to pass the tight PUJID WP if their eta is in the range 2.5 < |ηjet| < 3.2.

* In the note we understand that the jet horns are an issue only in 2017? Have you checked for 2016 and 2018? I do remember that in VBF Higgs we have seen the same issue in 2016 dataset as well.

We checked both the 2016 and 2018 datasets to see whether the jet-horns issue was affecting them. As to 2018, in both the DY [1] and top [2] CRs the agreement in the 2.5 < |ηjet| < 3.2 region looks quite good. In 2016 the agreement is somewhat worse (DY [3], top [4]), in particular for the DY CR in the same-flavour categories, but still not comparable to what is observed for 2017 [see Figs. 8-12 of ANv2].

* Also on the same note, have you applied the latest JEC/JES recommendations? If not, you might consider updating to the latest recipe that showed better Data/MC agreement in the horns region.

We are planning to soon update the analysis from NanoAODv5 to NanoAODv7, which includes the latest JEC/JES recommendations (here a GT comparison of the two versions [0]).

Section 8:

* Can you be more explicit about the treatment of the theory uncertainty in the VBS signal? from the text it seems as if you varied only the factorisation scale by 1/2 and 2.

The theory uncertainty on the VBS signal is indeed evaluated by varying the factorisation scale by 1/2 and 2. However, since the normalization of the signal is measured during the fit procedure, we divided the varied histograms by the integral of the nominal one (i.e. the one with mu_F = 1), in order to account for possible modifications affecting only the shape of the distributions.

* Can you also comment on how the experimental uncertainties are correlated across years?

Experimental uncertainties are kept uncorrelated across the three years, as mentioned in lines 440-442 ANv2.

### v1

Empty skeleton, first draft.

## General questions and discussion

### Minutes from 15-09-2020 SMP-VV

Philip :
• The e/mu regions are still dominated by the top backgrounds, you might consider finding more variables to reduce this. In ATLAS in Run I, this was done with a cut on the mT2 variable. Take a look at the corresponding paper and see if this variable would be useful. This was meant to target top quark mass to discriminate against ttbar. If i remember correctly the variable was computed with some min, or max of [ MT2(lvlv+vbfjet1), MT2(lvlv+vbfjet2) ]. https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/HIGG-2013-13/fig_20.pdf/

We are planning to include mT2 in the analysis to see if it can help suppress the top backgrounds. We are investigating the exact definition of the variable.

Paolo :

• You’re using the LO MC for Drell-Yan, can you switch to the NLO one?

The NLO DY sample does not have enough statistics to populate the signal region defined in the analysis, thus we use LO HT-binned samples to compensate for the lack of MC statistics in the same-flavour categories. We share this approach with the HWW high-mass analysis.

We did try to employ the NLO DY sample instead of the HT-binned samples and we observed a general improvement in the high-Z_ll DY CR. However, this does not hold for the low-Z_ll category, where a significant discrepancy between data and MC is still present.

• You also process the VBF Z sample, one would expect that this could be significant.

We included the Zjj sample in the analysis. Still, its contribution does not seem significant and it does not cover the data-MC gap.

• Can you request the signal sample with the Pythia dipole recoil shower (and Herwig)? Perhaps in the UL?
Working on it.

Yacine :

• On the categorization, you say that the Zeppenfeld variable improves the sensitivity. Did you try it wrt other variables? Have you tried using the Z_{l1} rather than just Z_{ll}?

We tried using Z_{l1} (instead of Z_ll) to split the signal region into two categories for 2018 (Z_{l1} < 1 and Z_{l1} >= 1). The signal purity in the Z_{l1} < 1 region (expected to have the most favourable S/sqrt(B)) is not as good as that of the old Z_ll < 1 category. Thus we obtain a statistical significance (2.49) worse than the one obtained with the old configuration (3.07).

• Also, how did you optimize the binning for the mjj?

We optimize the binning by requiring that no bin be empty.
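A simple version of the "no empty bins" requirement is to merge each empty bin into its neighbour, scanning from the right (a hedged sketch; the actual binning procedure in the analysis may differ):

```python
def merge_empty_bins(edges, counts):
    """Merge empty bins into a neighbouring bin until none remain.

    edges has len(counts) + 1 entries; bin i spans edges[i]..edges[i+1].
    Each empty bin is merged into its left neighbour (or the right one,
    if it is the leftmost bin).
    """
    edges, counts = list(edges), list(counts)
    i = len(counts) - 1
    while i >= 0:
        if counts[i] == 0 and len(counts) > 1:
            j = i - 1 if i > 0 else 1        # neighbour absorbing the bin
            counts[j] += counts[i]
            del counts[i]
            del edges[i if i > 0 else 1]     # drop the shared edge
        i -= 1
    return edges, counts
```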

MattiaLizzo - 2021-03-01

Topic revision: r113 - 2022-03-17 - MattiaLizzo
