Questions and answers for HIG-18-032

Color code

For the answers, the following color code is used:
* green - comment answered
* blue - comment requires further work to be addressed
* orange - work-in-progress

Comments from Cecile, on FF AN v3:

• Figures with stacked histograms: please increase the size of the legend (impossible to read even in A4 format), and possibly increase the size of the figures
Done.

• The documentation does not align with the twiki and sometimes contradicts itself. In particular, is the prompt contribution subtracted (eq 1, l173/174)? are the FF binned in decay mode (l146, l283-286, l295-296)? are the fractions renormalised to the sum of fake fractions (l171)? Please correct and double check everything!
Indeed due to some changes we implemented following your comments in the HTT meeting there were quite a few inconsistencies. They should all be resolved now in the AN.

• Related to what I understand from the twiki but is not transparent in this note. You recommend to compute the fractions in bins of mvis and njets: the fractions for tt, qcd, and w in general do not sum to 1.0 because there is also a fraction of real taus. Because the fractions do not sum to 1.0 you do not need to subtract the prompt contribution (what you are doing in practice is applying a fake factor of 0 to events with real taus). But the fractions can depend on other variables that you use in the analysis. Let’s take nbtag as an example. When you derive the fractions in the inclusive region, you can have something like f_qcd=0.2, f_w=0.5, f_tt=0.1, f_real=0.2 for mvis under the ZTT peak. But for events where there is nbtag>0 (selected preferentially by the NN to populate the ttbar CR) the fractions you should have used could have been instead f_qcd=0.05, f_w=0.3, f_tt=0.6, f_real=0.05. Assuming FFqcd=FFw=FFtt you end up with a 15% difference because you considered there was 20% real taus instead of 5%. If FFqcd!=FFw!=FFtt the difference is even bigger. In my opinion a much better way to proceed is to renormalize f_qcd, f_w and f_tt so they always sum up to 1. Then you need to apply the sum of fake factors also to real events in the AR and subtract them from the reweighted data. In that case f_real is always correct, for any choice of variable. Please comment.
We agree. We have switched to “renormalisation and subtraction”, as suggested.
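
The effect described in the comment can be reproduced with the quoted fractions. The following is only a minimal sketch: the function names and the common fake-factor value of 0.2 are assumptions made for illustration, and it is not the implementation used in the analysis.

```python
# Illustrative sketch only: all names and the FF value are hypothetical.
FAKE_PROCESSES = ("qcd", "w", "tt")


def ff_without_renormalisation(fractions, ffs):
    """Weighted FF when the fake fractions are used as they are (sum = 1 - f_real)."""
    return sum(fractions[p] * ffs[p] for p in FAKE_PROCESSES)


def ff_with_renormalisation(fractions, ffs):
    """Weighted FF after rescaling the fake fractions so that they sum to 1."""
    total = sum(fractions[p] for p in FAKE_PROCESSES)
    return sum(fractions[p] / total * ffs[p] for p in FAKE_PROCESSES)


def fake_yield(data_weights, real_mc_weights):
    """Reweighted AR data minus the reweighted real-tau simulation in the AR."""
    return sum(data_weights) - sum(real_mc_weights)


# Fractions quoted in the comment: inclusive selection vs. nbtag > 0.
inclusive = {"qcd": 0.2, "w": 0.5, "tt": 0.1, "real": 0.2}
btagged = {"qcd": 0.05, "w": 0.3, "tt": 0.6, "real": 0.05}
ffs = {"qcd": 0.2, "w": 0.2, "tt": 0.2}  # assumption: FFqcd = FFw = FFtt

# Without renormalisation the weighted FF inherits the 0.15 shift in f_real.
gap = ff_without_renormalisation(btagged, ffs) - ff_without_renormalisation(inclusive, ffs)
# With renormalisation the weighted FF no longer depends on f_real.
same = ff_with_renormalisation(btagged, ffs) - ff_with_renormalisation(inclusive, ffs)
```

With equal fake factors, `gap` is 0.15 times the fake factor while `same` vanishes; the real-tau contribution is then removed explicitly, as in `fake_yield`.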

• The subtraction of real taus/leptons on the fly would also allow us to treat correctly all systematics like the tau energy scale or tau ID efficiency, which are otherwise neglected when determining the fractions even if not negligible. Please comment.
We agree. We have switched to “renormalisation and subtraction”, as suggested.

• l26: I think the integrated luminosity corresponding to this json is 41.4 fbinv
This is what we consistently get for our triggers. It is also the same as e.g. for HIG-18-002. What is the source for 41.4 fbinv?

• l45/46: Why do you use these two single tau triggers? They are not part of the baseline selection for the analysis and should not bring you much acceptance
This was a mistake in the note which has been fixed now.

• l64: in 2017 there is no non-triggering MVA ID anymore
Indeed, corrected (to non-iso MVA ID)

• Table 3: the cross sections should differ from 2016 because of different tune
We have now added the numbers for 2017 in the note (for the analysis and results, we were always using the correct values, so this just affected the note).

• l70: why don’t you also derive fake factors for 20 < pt < 30 GeV? That would allow more flexibility in the analysis, and application to more different analyses
Yes, we were in fact already doing this, but had simply not plotted this region. Now we do.

• l95: this is not really true; in the Z mass peak region you can have 20% real taus in the AR
We believe it is nonetheless true that the typical impurities are at this level; but we have an additional statement explaining that they can be higher in some specific cases.

• l111: Have you checked if the fake factors would be different for njets=1 and njets>1?
Yes, indirectly, by applying the njets==1 FFs to a selection requiring njets>1. Directly measuring the FFs for njets>1 is not really possible, at least for QCD, due to statistics - they are consistent with njets==1 but that does not tell much given the huge uncertainties.

• eq 1: according to the twiki and to part of the discussion later in the AN you do NOT subtract n_true
Fixed.

• l133: how do you choose the thresholds for the isolation?
We look at the distributions such that we exclude as much as possible of the true-tau contribution while still keeping a minimum level of events in the CR needed to do a proper measurement.

• l135: the mT cut is the same in the DR and SR, no need to write it
Done.

• l145: “is negligible” -> do you mean you neglect it? From the plots it is larger than 10%, which is not what I call negligible
No, we do not neglect it. The statement is changed from “negligible” to “small” in the AN.

• l146: I understand there is no decay mode dependence anymore
Fixed.

• Figure 2: for the tautau plot, do you require exactly one anti-isolated tau, whichever it is?
Yes.

• l153: I thought the SR cut was mt < 50 GeV
Indeed, fixed in the AN.

• l153: why don’t you go higher in mt in the DR to have a better purity?
The purity is already very high, and what we gain in statistics brings more than uncertainties associated to the impurity we would reduce. Even more importantly, as there is the kinematic correlation of the FF with mT, this makes the corrections more reliable.

• figure 4 and others: is it the tau pt or the jet pt? I believe it is tau pt, please correct everywhere
Indeed tau pt. Done.

• figure 4: how can you explain the pt dependence in the tautau case (even if it is a different WP)?
I assume you mean compared to etau/mutau? This is an effect of the triggers used (mostly single-lepton with a bit of lep+tau vs di-tau).

• figure 4 and others: how do you choose the binning? If the bins have different widths you need to indicate the horizontal error bars. Why do you go higher in pt in the case of etau >=1 jet than mutau >=1 jet?
Horizontal bars have been added. The binning is chosen fine enough at low pt to describe the dependence in this region well, and at high pt it is entirely dictated by statistics, i.e. we merge bins until we get data points with uncertainties which allow for a reasonable fit. We do not have an upper cut on pt, so the range we cover only depends on the events available in the DR. The data points are positioned at the centers of gravity of the bins.
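
As an illustration of placing a data point at the center of gravity of a bin rather than at its geometric center, here is a minimal sketch. The function name, bin edges and pt values are hypothetical, and the entries are taken as unweighted.

```python
def bin_centres_of_gravity(pt_values, bin_edges):
    """Return the mean pt of the entries in each bin (None for empty bins)."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for pt in pt_values:
        for i in range(n_bins):
            if bin_edges[i] <= pt < bin_edges[i + 1]:
                sums[i] += pt
                counts[i] += 1
                break
    return [s / n if n else None for s, n in zip(sums, counts)]


# Hypothetical example: one narrow bin and one wide, sparsely populated bin.
edges = [30.0, 40.0, 80.0]
centres = bin_centres_of_gravity([31.0, 33.0, 45.0, 50.0, 55.0], edges)
```

For the wide bin the point sits at 50 GeV, well below the geometric bin center of 60 GeV, because the entries cluster at low pt.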

• figure 4 and others: how did you choose the fit function (Landau + polynomial)? Is the polynomial a normal one (=line if order one) or a special one (Bernstein, Chebychev, …)? Do you extrapolate with a constant at high tau pT?
we chose what worked best heuristically - in this case we often observe first a falling part and then a rather constant part. There is no underlying analytical description of the dependence, so we have to make do without theoretical motivation. The polynomial is “normal”, i.e. kx+d in this case. Yes, we extrapolate with a constant at high pt. This info has been added to the AN as well now.
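
The constant extrapolation at high pt can be sketched as follows. This is an illustration under stated assumptions: the falling term is an exponential placeholder and not the Landau function used in the AN, and the parameter values and the `pt_max` cut-off are hypothetical.

```python
import math


def linear(pt, k, d):
    """The 'normal' first-order polynomial k*x + d mentioned in the answer."""
    return k * pt + d


def falling_term(pt, amplitude, scale):
    """Placeholder for the falling part; an exponential, NOT a Landau."""
    return amplitude * math.exp(-pt / scale)


def fake_factor(pt, params, pt_max):
    """Evaluate the fitted shape, held constant above pt_max."""
    x = min(pt, pt_max)  # constant extrapolation at high pt
    return (falling_term(x, params["amplitude"], params["scale"])
            + linear(x, params["k"], params["d"]))


# Hypothetical parameter values, for illustration only.
params = {"amplitude": 0.3, "scale": 20.0, "k": 0.0002, "d": 0.15}
```

Any pt above `pt_max` returns the value at `pt_max`, so the fit is never evaluated outside the range constrained by data.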

• figs 5 and 6: here you go as low as tau pT = 20 GeV
yes, as explained above, we do use lower-pt taus, they were just not part of the plotted range in some of the plots.

• l159: why don’t you bin the TT fake fractions in bins of njets? You should have enough statistics for njets=0 too.
there are so few TT events without any jets (as every ttbar event contains at least two hard jets) that indeed we do not have the statistics to do so - and also no application for it as the 0-jet events in the AR only have a very small ttbar fraction.

• l171: this is in complete contradiction with the instructions in the twiki!! According to the twiki, you don’t renormalize to the sum of fractions for fake backgrounds (no denominator), which is the only reason why you are allowed not to subtract the prompt contribution from the AR.
• l173/174: I thought there was no subtraction anymore?!
• l185-189: again this applies only if you make a subtraction, which is not the case according to the twiki
We re-introduced subtraction following your comment. The twiki has now been updated as well.

• l201: why don’t you bin the corrections in njets too? Can you demonstrate that all the corrections described in this chapter would be the same for njets=0 and njets>0?
We have produced corrections binned in njets. They are almost identical for QCD; not relevant for ttbar since we merge in njets there; but there is a slight difference for W. We will try applying these separate corrections in our analysis to see if it has any significant impact, this will take a few days, though.

• l208: the differences from the finite binning are supposed to be covered already by the total uncertainty (the one described in l293)
Corrected.

• 212: why do you expect the fake factors to depend on the light lepton isolation?
In particular for QCD, the two objects which are reco’ed as tau and electron are correlated (e.g. through their flavor, g/u/d/c/s/b; or through their pt due to momentum balance, which also enters the relative isolation) and this can be parametrized e.g. via the light lepton isolation. It took us a while to find out that this is needed to achieve any reasonable modeling.

• l215: observed -> calculated
corrected.

• l218: why do you expect a non closure with mvis?
we do not specifically expect a non-closure with mvis. We expect a non-closure in general, and then chose one variable in which to parametrize it. Since mvis is correlated to most of our variables-of-interest, it was a natural (but still somewhat arbitrary) choice.

• figure 9: indeed all these corrections look very compatible with a flat line centered at 1. Are you sure you are not overfitting? Please indicate the chi2/ndof for your correction and for a horizontal line to judge if you are overfitting. This applies to all the corrections throughout the AN.
We are not sure if a chi2/ndof is a meaningful test for a smoothing algorithm (as opposed to a fit). We agree that several of the corrections are compatible with 1. We still need to assign an uncertainty, though, and if e.g. we get 1.05 with an uncertainty of 0.2, we are not sure whether it is better to quote the correction as 1.0+0.25/-0.15 or as 1.05+/-0.2 (as we have now). In any case, the difference in practice is very small: in the region where most of our data is, the uncertainty is small by definition and the smoothing will be close to this value.
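
The two ways of quoting the correction cover the same interval, as the short sketch below shows for the numbers in the answer. The function names are hypothetical.

```python
def interval(central, up, down):
    """Return the (low, high) range covered by central +up/-down."""
    return (central - down, central + up)


def recentre(central, up, down, new_central=1.0):
    """Quote the same interval around a different central value."""
    low, high = interval(central, up, down)
    return new_central, high - new_central, new_central - low


# 1.05 +/- 0.2 re-expressed around 1.0.
new_central, new_up, new_down = recentre(1.05, 0.2, 0.2)
```

Here `new_up` and `new_down` come out as 0.25 and 0.15 (up to floating-point rounding), so only the central value differs between the two conventions, not the covered range.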

• l238-242: please show the data and predicted MC in all these regions (OS anti-isolated, …) and the raw FF obtained here
Added to the appendix. Data/MC obviously only for corrections which use data and are not just sim-based.

• l248-253: given your explanation of the mT dependence, why is it not covered by the binning in njets (low mT means more jets)?
the binning already decreases the dependence, but is too coarse (2 bins) to fully account for the dependence.

• figure 12 left: the average correction seems well below 1. How is that possible if you apply the fake factors derived using the same events and the fit functions match the data points very well in Fig 7? I could expect a shape dependence with mvis but not a correction of the normalisation.
For mvis~100+/-20, where most of the data is (i.e. what determines the average correction), most of the data points are between 0.95 and 1. So it is true that there is a small normalisation shift; however, this can be explained by the imperfect parametrisation (e.g. if you look at the pt fits used, Fig 7, 0 jet & low pt [where most of the data is], the smoothed curve is about 3% below the data points, so this alone could account for most of the difference). But of course there are also other potential sources, e.g. neglected dependencies.

• l265: Z->mumu + jets
done

• l260-266: please add plots and numbers related to this check
we do not have the information in the ntuples (selection of two muons) to do this for the current setup, but this has been checked e.g. here: https://indico.cern.ch/event/489919/contributions/1168013/attachments/1242344/1827701/2016-03-11_h2tau_ff.pdf#page=10

• fig13: How can you relate these correction shapes to your explanation in l248-253?

• fig 13: don’t you think a flat line would make as good a job with less overfitting? please check as mentioned already for fig 9. This wavy correction needs to have a physical explanation.
Note that we only use the left part of the figure, i.e. mT < 50 GeV, since this is our SR. In this region, there is a clear systematic shift for mu-tau; for e-tau, it is less clear due to lower statistics, but the central curve is very similar (as it should be, since this is a kinematic effect, not connected to lepton flavor). About the physical explanation: following from the explanation in l248-53, this effect is significant for low mT, in particular mT<30, and still non-negligible up to mT<50 GeV. So the best description of the dependence would likely be some function of mT up to 50 GeV, and then a flat line. This is compatible with what we see and could be implemented; however, since we do not use mT>50 GeV it would not have any practical consequence.

• l271: is it a problem if it overlaps with the emu signal region, even if we don’t use FF in this final state?
there is no overlap: we require a b jet while the emu selection vetoes b jets.

• l283-286: so you split in DM??
Fixed.

• l295-296: are you sure it is 12 and 4 if there is no DM dependence anymore? If there is no DM dependence, what are these DM-dependent uncertainties still prescribed in the twiki??
we provided new merged FFs for 2016 data only since a couple of days ago, and have only updated the twiki now. Sorry about that. And thanks for spotting it, corrected to 6 and 2, we had forgotten to update this bullet.

• l293-296: How can you end up with only one uncertainty per final state per jet multiplicity if the fit function has 4(?) degrees of freedom? In principle you should have as many uncertainties per function as the number of degrees of freedom. Otherwise the low pT region for example can constrain the high pT region without any good physical reason. This question applies also to the uncertainties in the corrections; please answer this question for all uncertainties
It is correct that the a-priori correct way would be to have an uncertainty for each dof. However, as also e.g. with most POG-provided uncertainties, we found that we can sufficiently cover the uncertainties by something similar to grouping them. For the fits, there are actually two uncertainties we use for fits: one alternative shape, normalized to the nominal shape; and the normalization value as norm uncertainty, as described below. This way it is possible for the fit to adjust e.g. just the low pT region, by moving towards the corresponding shape and at the same time to adjust the norm, such that the high-pt region stays ~unchanged, if the fit prefers that. GoF tests and NP pulls and constraints have shown this to work well in past measurements. For corrections, the answer is the same. In addition, there it would be difficult to do it in a different way since the functions are smoothed, not fit.
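
The split into a shape uncertainty and a norm uncertainty described above can be sketched as follows. The function name and the bin contents are hypothetical; the templates are plain lists of bin yields.

```python
def split_shape_and_norm(nominal, alternative):
    """Scale the alternative template to the nominal integral.

    Returns the normalised alternative shape and the relative change in
    normalisation, which would be treated as a separate norm uncertainty.
    """
    nominal_integral = sum(nominal)
    alternative_integral = sum(alternative)
    scale = nominal_integral / alternative_integral
    shape = [scale * content for content in alternative]
    norm_uncertainty = alternative_integral / nominal_integral - 1.0
    return shape, norm_uncertainty


# Hypothetical templates in three pt bins (low to high pt).
nominal = [100.0, 50.0, 25.0]
alternative = [120.0, 52.0, 24.0]
shape, norm = split_shape_and_norm(nominal, alternative)
```

In this toy case the normalised shape moves the low-pt bin up and the high-pt bin down at fixed integral, while the 12% normalisation change is carried by the separate norm parameter.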

• fig 16: don’t you think the uncertainty band at low pt is too small compared to the size of the error bars on the points?
It seems to be roughly ok: the uncertainty bands are slightly, but not much, smaller than the error bars on the data points, which is as it should be since there is some constraint coming from neighbouring points with smaller uncertainties. For the very low pt end (~first two data points), I agree it is a bit on the small side. We can give the algorithm more freedom there (i.e. less dependence on neighbouring points) if you prefer. However, the practical consequence will be small, as we have barely any ttbar background at very low mvis.

• l316-318: What is that? Where is it explained? How do you determine it? (uncertainty on fractions)
This uncertainty takes into account that for tau-tau we use QCD FFs also for the W/Z/ttbar fakes. We have added the explanation to the paper. It is derived by varying the fractions of these backgrounds up/down within their uncertainties, as given by the cross section uncertainties of the processes.

• l319-320: again is there or not a subtraction???
sorry for the confusion since we were on the way of changing this. It should now all be consistent (with subtraction)

• l325-327: please give the details on how you arrive to these numbers
We normalize the alternative shapes to the nominal one, and the values used for the normalization are added in quadrature. This procedure is the result of quite a lot of discussion for the MSSM paper.
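
Adding the normalisation values in quadrature amounts to the following. The function name and the two input values are hypothetical and are not the numbers from the AN.

```python
import math


def add_in_quadrature(values):
    """Combine independent relative uncertainties in quadrature."""
    return math.sqrt(sum(v * v for v in values))


# Hypothetical normalisation values from two alternative shapes.
total = add_in_quadrature([0.03, 0.04])
```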

• fig 17 and other control plots: please reduce the range of the y axis in the ratio plots
Done

Comments from Cecile, on Main AN v3:

• General: What is the status of the data card level synchronization?
For etau, mutau, tautau, the agreement is at the sub-percent level. For e-mu, we have very good agreement at sync ntuple level. We also have very good agreement for the distributions of various variables. The datacard synchronization is currently ongoing and will be available in approximately one day.

Please show the level of synchronization. How independent are the frameworks? (e.g. do they use the same trainings? etc)

For e-mu, the SyncNtuple framework at DESY is fully independent of the other frameworks, whereas the neural net framework is exactly the same as the one used by Vienna (it only differs in the configuration file: DESY uses one for e-mu, Vienna one for et, mt, tt). As said, the agreement at SyncNtuple level is very good for e-mu. The only difference in 2017 between KIT and DESY is the different electron ID. The inclusive datacard synchronization for 2017 is already at an advanced stage and is still ongoing to solve the last issues. Synchronization summary

The synchronization should compare the absolute difference in expected and observed limits between the groups. It is not enough to compare a few distributions by eye. The data_obs agreement in tautau is poor (usually we reach 100% agreement for data) and this probably creates large differences in observed limits (except if the differences cancel by chance). The synchronization should be finalised before unblinding.

This is one of our main priorities now and we will complete this by the pre-approval talk at the very latest.

Closing here as it is discussed as comment on v8 below.

• General: Can you consider to simply veto b jets to remove the overlap with ttH? This is not urgent but in view of the combination
Thanks, we will consider this for the full run-2 analysis which will enter the combination.

• General: You should add a section to explain why you use each of the training variables (= why it is supposed to help to separate the processes).
Paragraph added to AN-18-255. In addition, the sensitivity of the NN response to the variables is documented in AN-18-256.

• General: Some variables, like the b jet pT, do not seem useful to me to separate the processes. Can you quantify how much each variable helps (sort of ranking)? Have you tried to remove the variables that look useless and compare the expected results?
In the AN-18-256, we have described a method to perform a variable ranking using derivatives of the NN function. We are providing in this document plots and tables, which let you identify which variables are most sensitive for each output node of the NN including the importance of variable pairs.

So can you give a number for how much including the b jet pT improves the expected results?

Referring to the NN AN: http://cms.cern.ch/iCMS/user/noteinfo?cmsnoteid=CMS%20AN-2018/256, Appendix D. An analysis of Taylor coefficients of the neural net output as function of inputs was performed.

The variables bpt_1 and bpt_2 themselves show some sensitivity for process separation (values mostly around 0.05, roughly between 0.02 to 0.11), see Figure 23-28.

More important are the correlations of bpt_1,2 with m_sv and/or m_vis (see Tables 5-10) with higher values around 0.15.

Based on this we've decided to keep bpt_1 and bpt_2 as variables supporting nbtag variable, as long as they pass the GOF test.

An extensive study of b-jet pt's on the full analysis from retraining of the neural net up to constraints on r wasn't done, because the study of Taylor coefficients presented in AN-2018/256 is considered as sufficient.

How can I translate Taylor coefficients to an impact on the sensitivity? How can you explain that there is a correlation between bpt_2 and m_vis, and then that it matters?

Details on the Taylor coefficient study are described here: https://arxiv.org/abs/1803.08782 (1) and were also presented in the HTT meeting: https://indico.cern.ch/event/718595/contributions/3000074/ (2).

To summarize this briefly: the output of the neural net, i.e. the various classification scores, is considered as a vector function of the vector of inputs (the inputs corresponding to the selected variables).

A Taylor coefficient of this function's Taylor expansion gives information about how strongly the classification score would change if we varied the inputs. This allows us to quantify how important a variable (coefficients of 1st derivatives) or the correlation of two variables (coefficients of 2nd derivatives) is for assigning an event to a certain class.

Several examples are shown on slides 4 and 5 of (2) to demonstrate possible use cases. They show that the measures introduced (means across the used dataset of the absolute values of the 1st and 2nd Taylor coefficients) can be used to quantify how well a 1D or 2D variable could be used to assign an event to a certain class.
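
Written out, the measures described above take roughly the following form. The notation is an assumption made here for illustration and is not copied from the AN: f_c is the score of output class c, x_i are the input variables, N is the number of events in the dataset, and constant factors from the Taylor expansion are omitted.

```latex
t_{i}(\vec{x}) = \frac{\partial f_{c}}{\partial x_{i}}(\vec{x}), \qquad
t_{ij}(\vec{x}) = \frac{\partial^{2} f_{c}}{\partial x_{i}\,\partial x_{j}}(\vec{x}),
\qquad
\langle t_{i} \rangle = \frac{1}{N} \sum_{k=1}^{N} \left| t_{i}(\vec{x}_{k}) \right|, \qquad
\langle t_{ij} \rangle = \frac{1}{N} \sum_{k=1}^{N} \left| t_{ij}(\vec{x}_{k}) \right|
```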

So in other words, these coefficients can quantify the separation power of variables, which is learned by the network. In that sense, they can quantify (indirectly) the impact on the analysis sensitivity.
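
To make the idea concrete, here is a minimal sketch that estimates the mean absolute first-order coefficients for a toy score function. It uses central finite differences as a stand-in for the derivatives of a real network, and the toy function, dataset and names are all hypothetical.

```python
import math


def toy_score(x):
    """Hypothetical stand-in for one NN output node: x[0] matters, x[1] barely."""
    return math.tanh(2.0 * x[0]) + 0.1 * x[1]


def mean_abs_first_order(score, dataset, step=1e-4):
    """Mean over the dataset of |d score / d x_i|, via central differences."""
    n_inputs = len(dataset[0])
    totals = [0.0] * n_inputs
    for point in dataset:
        for i in range(n_inputs):
            up = list(point)
            down = list(point)
            up[i] += step
            down[i] -= step
            totals[i] += abs((score(up) - score(down)) / (2.0 * step))
    return [total / len(dataset) for total in totals]


# Hypothetical dataset of (x0, x1) inputs.
dataset = [(-0.5, 0.3), (0.0, -1.0), (0.4, 2.0)]
coefficients = mean_abs_first_order(toy_score, dataset)
```

The first entry of `coefficients` comes out much larger than the second, reflecting that the toy score depends strongly on `x[0]` and only weakly on `x[1]`.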

We did studies leaving important variables (like m_vis) out of the training and saw a decrease in the expected constraint on r. Since such a study is a major effort (1-2 days), this cannot be done on a short time-scale for several combinations of variables. If required, we would propose to do such studies after the pre-approval talk for a requested set of variables (both 1D and 2D).

Concerning the example of bpt vs. m_vis:

Please have a look at the inclusive 2D plots in et channel for some processes:

Please also note that the network also learns the fallback default value used when there is no leading b jet. In that way, the variable supports the nbtag == 0 veto.

Some separation can be seen by comparing the peak width and its orientation for the three processes in the bpt_1 vs m_vis plane (2nd order feature); however, it is not strong.

We would like to perform the exact quantification of how much that matters in terms of constraints on r after the pre-approval talk, if this quantification is still required.

• Table 2: The EWK samples are missing

• Table 3: Shouldn’t the ZLL cross section be different between 2016 and 2017 because of different tunes?
Yes. Added correct k-factors for 2017

• Table 3: Please add the Wgamma sample names
No Wgamma samples are used for 2017. Removed from table.

Why would you use these samples in 2016 and not in 2017?

Thank you very much for asking. Indeed, after checking the contribution of WGamma to the total WJets events, we saw that it is non-negligible (we had originally believed it to be negligible). Therefore we included the sample (WGToLNuG; the WGstarToEE/MuMu samples are not available in 2017) in our 2017 analysis. This is now included in the results in the AN. The table in the AN is updated accordingly.

• Table 3: What does * * * mean (not in the caption)?
Removed

• l98: can you clarify why this ele ID choice makes it “simpler”?
Reformulated the sentence to: “... makes definition of isolation-based sideband regions possible, as required e.g. by the fake factor method”

• l118: Why is the electron isolation different between 2016 and 2017? Why not align?
For 2017 the recommended electron isolation by egamma is the rho corrected isolation while for 2016 the dbeta corrected one was used as it was recommended. For the 2017 isolation the optimal cut was found to be 0.15 -> Study by Albert

• l187: There is also a very very loose WP in 2017
Thanks, info added to the AN.

• l191: ML->MVA to remove the confusion with the ML used to extract the signal
Done

• l242: You should move to the latest JEC (v32)
We noticed it when this recommendation came up recently. It requires a reprocessing of the ntuples and new recoil corrections, FFs, etc and hence ~2-3 weeks time. We will work on it once there is an official global tag (not yet there as of Dec. 18).

• l261: I have noticed in the sync ntuples that KIT does not cut the b jet pseudo rapidity at 2.4. Can you confirm everything is done correctly in the analysis?
Can confirm that the cut is applied correctly in the analysis. Cut is also applied in sync ntuples for 2016 KIT ntuples

• l263: Danny reports a bad data/MC agreement using deepCSV in 2017. Have you checked?
This has been checked and compared to the agreement given by CSV2, but it looked very similar, so we decided to use the recommended deepCSV.

• Table 7: I thought you also use the single electron 35 trigger in 2017
Added to the trigger table

• l353: Have you tried to reoptimize the Dzeta cut? Is it still the best cut?
The cut of dzeta>-35 GeV has been chosen such that on the one hand a large fraction of ttbar events is discarded, and on the other hand enough ttbar events are left for the NN training. This is helpful since the ttbar background is mostly made of di-leptonic ttbar events with real muon and electron candidates making it difficult to reduce the ttbar background to a small amount by simply cutting on variables. Having enough ttbar events in the training helps to efficiently separate the ttbar background from the signal processes. Additionally, due to the resulting pure ttbar background region, the ttbar background normalization can be controlled, which reduces the uncertainties on the ttbar background.

Does that mean you have not reoptimized the cut?

If you mean by optimization that we ran the NN for various dzeta cuts and checked afterwards the improvement/loss in the signal strength constraint, then we haven’t done a re-optimization. Still, we believe that this cut is chosen very carefully since, as said, it is a compromise between suppressing the background and keeping the signal high. From a theoretical point of view, a perfect NN should perform best without any cut, because in this case the NN can still exploit the full signal. But this is of course not true from a practical point of view, since the NN can improve its performance when some background is already cut away. Still, there is no one-to-one correspondence between a cut and the s/sqrt(b) ratio as there is for a cut-based analysis. In particular, since dzeta itself is used in the NN training, the neural net has the freedom to use it as a discriminating variable. Anyway, we checked that roughly 5% of the signal is cut away with a dzeta cut of >-35, which is in our view still reasonable in view of the large background suppression. Of course we might gain something when loosening the dzeta cut further (tightening it will definitely not help). But the sensitivity gain will be at most 5% for e-mu (if all signal and background events were 100% correctly categorized), which would only be a very small gain in sensitivity for the whole analysis given the contribution of e-mu to the full analysis sensitivity. The estimated time of work for loosening this cut is unfortunately quite long (4-5 days), since we would need to redo the SyncNtuples again. We therefore think that it is not worth the effort right now to check how much sensitivity we would gain by loosening this cut.

• l370: isn’t it tight instead of medium?
Yes, in the SR the tight WP is applied. However, no ID is applied during pair selection. Removed this bullet

• l370: do you really apply the tight tau ID before choosing the best pair? How do you do to select the anti-isolated events needed for the FF method?
No ID is applied during pair selection. Removed

You still apply the VLoose ID during the pair selection, right?

To be precise: for 2016 data/MC, no byIso Tau ID is used during pair selection; for 2017 data/MC, the VVLoose WP of the byIso ID 2017v2 is used during pair selection. For both years, the old decay mode finding is required during pair selection. This is done so for the following reasons:

• In 2016, we use the default 2015 trainings; in 2017, the trainings recomputed on top of MiniAOD 2017v2. Since there are several developments ongoing for the Tau ID (new DM vs. old DM, several trainings, new methods like DeepPFTau etc., one ID for all years proposal, ...), an extensive Tau ID optimization would make sense in the scope of Run II legacy results. For this analysis, we decided to stick to the IDs chosen above (best choice for the sample sets used).
• 2016 working-point range: VLoose to VTight; 2017 working-point range VVLoose to VVTight. To be aligned with the NanoAOD production of 2017 samples (search for nTau), we choose for 2017 data/MC VVLoose working point during pair selection.

Concerning the choice of signal region & anti-isolated regions for FF method: The byIso Tau ID working points are applied on the first pair after sorting by the pair selection algorithm.

• Signal region (both 2016 & 2017): Tight WP of the corresponding byIso Tau ID
• Anti-Isolated region for FF (both 2016 & 2017): VLoose WP of the corresponding byIso Tau ID
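
The region assignment for the selected tau can be sketched as below. Note the assumption, which is not spelled out in the answer above: the anti-isolated region is taken to require passing VLoose while failing Tight, so that it does not overlap with the signal region. The function name and flags are hypothetical.

```python
def tau_region(passes_vloose, passes_tight):
    """Classify the tau of the first pair after sorting.

    Assumption (not stated explicitly in the text): the anti-isolated
    region (AR) requires passing VLoose AND failing Tight.
    """
    if passes_tight:
        return "SR"  # signal region: Tight WP
    if passes_vloose:
        return "AR"  # anti-isolated region used for the FF method
    return None  # event enters neither region
```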

• l384: Do I understand correctly that you require ptH>50 in 2017 and ptH>50 && pt_1>50 in 2016? Please reformulate so it is crystal clear. Why do you apply this ptH cut? I think it was not applied in 2016. How was it optimised? And if I understood the cuts correctly, why don’t you ask pt_1>50 in 2017 too?
Both cuts were adopted from HIG-16-043. The pt_1 cut has only been found to be optimal for 2016 and therefore we stuck to the 40 GeV cuts as motivated by the triggers. The ptH cut has turned out to be beneficial for the final result and for the data/MC agreement in the GoF tests, also in 2017. Added this comment to AN.

This has never been presented nor discussed at any HTT meeting. The pTH cut in HIG-16-043 was applied only in the VBF category, and the events failing that cut then entered the boosted category. The cut was chosen based on an optimisation of the expected results. Can you show any study that proves that pt_1>50 is not optimal in 2017? Can you compare the plots with and without ptH cut (as I understand the cut was chosen just to solve data/MC agreement problems and not to optimise the analysis)?

When we switched to the pt_tt cut in 2017 we observed an improvement in the signal strength constraint of about 0.003 which is not very much indeed, and it also led to a general improvement in the 2D GoF tests which also was a reason to keep it. We'll check the impact of a pt_1 cut towards or after preapproval.

You cannot have different cuts in 2016 and 2017 without a physics motivation. You should align them, preferentially in the way that gives the best expected sensitivity.

We will study this by pre-approval.

• l391: This is not completely correct as you separate the events so there is no overlap for the triggers. There should be an upper pT cut for the events associated to the cross trigger
Added range instead of lower pt cut for cross trigger
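
A non-overlapping assignment of events to the two trigger paths could look like the sketch below. The threshold values are hypothetical, and taking the single-lepton threshold as the upper edge of the cross-trigger range is an assumption for illustration, not a statement of the cuts in the AN.

```python
def trigger_path(lepton_pt, cross_trigger_min, single_trigger_min):
    """Assign an event to exactly one trigger path based on the lepton pt.

    Assumption: the cross-trigger range ends where the single-lepton
    range begins, so the two selections cannot overlap.
    """
    if lepton_pt >= single_trigger_min:
        return "single"
    if cross_trigger_min <= lepton_pt < single_trigger_min:
        return "cross"
    return None  # below both thresholds: event is rejected


# Hypothetical offline thresholds in GeV.
CROSS_MIN = 21.0
SINGLE_MIN = 25.0
```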

• l422: same as above
Added range instead of lower pt cut for cross trigger

• l495: Your T background contains any tau decay. How do you treat the lnN uncertainties in that case (e.g. only a fraction of these events should have a tau ID uncertainty, a µ->tau ID uncertainty, or an e->tau ID uncertainty)?
The T background contains all events with two genuine taus. This background has been checked to contain >99% genuine hadronic taus in all final states, and is assigned the genuine hadronic tau ID uncertainty. Note that this background is replaced completely by embedded events and not used in the final analysis.

• l502: Does this mean that there is no imposed tau decay (e.g. only hadronic) in the embedded samples?
Thanks for the comment! Indeed, the embedded events only cover the four respective final states, and especially for Z->tautau->ee and Z->tautau->mumu there are no embedded events simply because they are not produced. An event such as Z->tau tau -> e e which is reconstructed as e tau_h due to an ele->tau misidentification is therefore missed. The effect is negligible for the results presented (this would increase the Z->tau tau yield by between 0.1% (mutau) and 0.8% (eltau)); however, for the next iteration (in about 1-2 days) all these events will be taken from MC to be completely consistent.

• l516: the numerator and denominator are not probabilities but numbers of events. Please reformulate.
Done

• l528: I thought you don’t use tau pT < 30 GeV
In the current version of the fake factors taus with pt > 23 GeV are used

• l582: please add more details (or copy paste from [20])
More information on the QCD estimation method in emu was added.

• l584: Is the shape taken from SS with a relaxed isolation to increase the statistics?
The shape of the QCD background in the (OS) signal region is taken from the SS region that is created by inverting the charge requirement on the electron-muon pair. Thus, the isolation criteria in this SS region are the same as in the signal region. The correction factors that account for the extrapolation from the SS to the OS region are determined in OS and SS regions where the muon candidate fails the isolation requirement of the baseline selection but still passes Irel < 0.5. Since these extrapolation factors depend on the pT of the selected electron candidate, the pT of the selected muon candidate, the number of jets in the event and the distance between the electron and the muon candidate measured in the η-φ plane, their application also alters the shape of the QCD background in the signal region. We have not observed any problems with the statistics of the QCD background. The statistics we get by performing the background estimation as described is fully sufficient.

• Section 6.5. I need to read in more details. But have these been presented to EGamma and Muon POG? Please always plot data/MC in the ratio plot and not the opposite (e.g. fig 12)
This was not presented in these POGs, but this was not done in the previous year either with the lepton scale factors that the Htautau group had determined for themselves, and the procedure is unchanged. The ratio will be plotted the other way round in future plots.

There are always exchanges with the POGs. This is even a condition for preapproval, so it is better for you to do it early rather than be stopped later waiting for their answer.

Corresponding e-mail is sent to EGamma and Muon POG conveners.

Closed here as it is discussed below as v8 comment

• Table 10: You should consider rounding the uncertainty so it is the same for all decay modes
We will bring it up with the TauPOG but follow their recommendation until they change it.

• l797: please show the equivalent for 2017
The data/MC plots and the weights have been added for 2017 data.

• l875: Can you show the shifted templates for all shape uncertainties?
First version here: http://www-ekp.physik.uni-karlsruhe.de/~swozniewski/unc_plots/index.html

I have troubles understanding what the plots exactly show. I think the title gives the category, but where can I see which process it is?

The plots always show total background. I think it would be a mess of plots to show it for every background separately but please feel free to make suggestions. Maybe to take just the backgrounds affected by this uncertainty?

• l875: Can you prove that your ML has a linear effect on the shape uncertainties? What I mean: does twice the variation of the 1 sigma shifted template correspond to the shifted template if you shift the input by 2 sigmas?
The NN just reorders the regions of the phase space. So if an uncertainty increases the population of a certain phase space bin, this works also twice. Technically there's no difference compared to propagating e.g. tauES through the svfit algorithm or the m_vis calculation. This is also endorsed by the Machine Learning Forum of CMS where we had this discussion once.
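The linearity asked about above could be tested directly by comparing twice the 1 sigma template shift with the template obtained from a 2 sigma input shift. The following is a minimal sketch of such a check, assuming the three histograms are available as plain lists of bin contents; the function name and the toy numbers are hypothetical and not taken from the analysis.

```python
def linearity_residuals(nominal, shift_1sigma, shift_2sigma):
    """Per-bin difference between the 2 sigma template and the
    linear extrapolation of the 1 sigma template."""
    residuals = []
    for nom, s1, s2 in zip(nominal, shift_1sigma, shift_2sigma):
        extrapolated = nom + 2.0 * (s1 - nom)  # twice the 1 sigma variation
        residuals.append(s2 - extrapolated)
    return residuals


# Toy bin contents (hypothetical), chosen to be exactly linear.
nominal = [100.0, 50.0, 10.0]
shift_1 = [102.0, 49.0, 10.5]
shift_2 = [104.0, 48.0, 11.0]

residuals = linearity_residuals(nominal, shift_1, shift_2)
```

Residuals compatible with zero within the statistical precision of the templates would support the statement that the uncertainty acts linearly on the NN output.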

• l886: Have you checked that the shifted templates corresponding to your JES have smooth variations compared to your nominal template? If this is not the case you should consider smoothing them or replacing them by lnN depending on how they look like.
This has not been checked, we will have a closer look.

Closed here as it is discussed below as v8 comment

• l919: Have you made sure that the shifted templates still have the same normalisation as the nominal template?
We checked and the inclusive yield does not change, but of course there is an effect on different phase space regions by construction of a reweighting - this means that there can be a small norm effect e.g. in the different categories. We have checked this effect and it is at order -2% within the analysis selection.
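The check described above can be illustrated with a small sketch that compares the relative yield change inclusively and per category. The category names and bin contents below are toy values chosen for illustration only, not analysis numbers.

```python
def relative_yield_change(nominal, shifted):
    """Relative change of the total yield, (shifted - nominal) / nominal."""
    return sum(shifted) / sum(nominal) - 1.0


# Toy templates per category (hypothetical names and numbers).
nominal = {"cat_a": [50.0, 50.0], "cat_b": [100.0, 100.0]}
shifted = {"cat_a": [49.0, 49.0], "cat_b": [101.0, 101.0]}

per_category = {
    name: relative_yield_change(nominal[name], shifted[name]) for name in nominal
}

# Inclusive yield: concatenate all categories before summing.
all_nominal = [x for name in nominal for x in nominal[name]]
all_shifted = [x for name in nominal for x in shifted[name]]
inclusive = relative_yield_change(all_nominal, all_shifted)
```

In this toy case the inclusive yield is unchanged while the individual categories move by a few percent, which is the behaviour expected from a reweighting.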

• l924: You should not correlate 2016 and 2017 since the tunes are different (and I guess the corrections look very different too)
It is hard to say what the actual correlation is. The option we provide to the fit is to apply more or less of the correction weight. We decided to correlate this because the technique is the same and a possible underlying systematic error should have the same effect in both eras.

It is not a question of technique but of MC sample. In case you do not know how much to correlate the conservative way is to uncorrelate.

switched to uncorrelated

• l924: isn’t it conservative given the statistics you have to derive the corrections?
Yes, it is conservative but it was the way to go in previous analyses (SM/MSSM) and there were no deeper investigations since then. It is expected to be constrained by the fit.

This is not true. HIG-16-043 used an uncertainty equal to 10% of the correction.

You are right, we missed that. We now apply 10% of the correction as well.

• l925: Does it really have a shape effect? Can't you replace the shapes by lnN?
It has a shape effect, especially in the ttbar categories.

Can you explain why it has a shape uncertainty? Can you show the templates nominal/up/down?

As the b-tagging is used to build the ttbar category the efficiency has an impact on the migration into this category and its highest bins.

• l935: Why is it fully correlated between 2016 and 2017?
This is seen as an uncertainty in the embedding technique and should not depend on different detector conditions in 2016 and 2017.

• l936: How do you correlate 2016 and 2017?
50% correlated following the tau POG recommendation. added to AN

• l944: Is it really 6 if there is no DM splitting anymore?
Yes: 3 (QCD/W/tt) times 2 (njet0/njet>0).

• l975: Haven’t you forgotten the electron energy scale uncertainty?
Indeed, has now been added to AN

• l982: It is hard to understand what you do with the tau ID uncertainty. In 2016 there were two parts: one fully correlated between DM and one fully uncorrelated. Now the TAU POG recommends 50% correlation between 2016 and 2017. Do you split both 2016 uncertainties into 2?
yes

Can you write explicitly in the AN what all the uncertainties are and what their magnitudes are.

done
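A partial correlation between two eras is commonly implemented by splitting one uncertainty into a fully correlated and a fully uncorrelated part. The sketch below shows that construction for a 50% correlation; it is an assumption about how such a split can be done, not a confirmed description of the implementation in the AN, and the total uncertainty used is a toy value.

```python
import math


def split_uncertainty(sigma, rho):
    """Split sigma into a part correlated across eras and a part
    uncorrelated per era, such that the era-to-era correlation is rho."""
    correlated = sigma * math.sqrt(rho)
    uncorrelated = sigma * math.sqrt(1.0 - rho)
    return correlated, uncorrelated


sigma_total = 0.05  # toy magnitude, not taken from the AN
corr_part, uncorr_part = split_uncertainty(sigma_total, 0.5)
```

The two parts add in quadrature to the original uncertainty, so the total per-era uncertainty is unchanged by the split.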

• l1007: is 10% recommended by the TAU POG?
As confirmed by Tyler, there is no official recommendation and this is what has been estimated and used in past analyses.

• l1005: why isn’t there a different uncertainty for the single lepton and the cross triggers? They affect the phase space differently, which could have an impact on the muon/electron pT spectrum for example.
Good point. We have checked the impacts to estimate how important this is. The lepton trigger scale factor uncertainties are ranked lower than 130 and as we would have to implement additional two shape uncertainties for this, we would prefer to stick to the current setup.

Even if the impact is low, what you are doing is not correct.

Ok, we now apply separate uncertainties for the cross trigger and the single lepton triggers.

So it is a shape uncertainty?

Yes

• l1011: why 4% for the double mu trigger while all the other triggers before had only 2%?
We take 2% for each muon leg

But for the cross triggers you take 2% even if there are 2 legs too.

Yes, because the cross trigger makes just a minor part of the triggered events. Since we now have separate uncertainties (comment above), we apply 7% to the cross trigger (2% for the lepton leg and 5% for the tau leg)
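For reference, the 7% quoted above corresponds to adding the two legs linearly. The short sketch below compares this with a quadrature sum, which would apply only if the two legs were treated as independent; the variable names are illustrative.

```python
import math

lepton_leg = 0.02  # 2% for the lepton leg, as quoted above
tau_leg = 0.05     # 5% for the tau leg, as quoted above

linear_sum = lepton_leg + tau_leg                        # the 7% used above
quadrature_sum = math.sqrt(lepton_leg**2 + tau_leg**2)   # independent legs
```

The linear sum is the more conservative of the two choices.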

• l1014 (and actually everywhere): please add references for all the uncertainty values
done

• l1020: why isn’t there at least a partial correlation?
The reason for this was that the underlying HLT path changed which is also consistent with the treatment of the single lepton triggers

• l1027: are you sure of this number? Please add the reference
yes, done

• l1030: same as above
done

• l1036: any plot to support this statement?

• Table 14: You can round to have the same uncertainties in 2016 and 2017
Sure, could be done and will also be done in the public documentation. Internally there is no harm in keeping the higher precision.

• l1083: even after reading the whole note I do not understand your criteria for checking “by eye”
This has been reformulated

• l1084: why do you need to remove outliers?
Due to the regularization measures these few events should not be able to bias the training even if there is a misdescription. Therefore, we decided to exclude them from the GoFs.

• Fig 25 and others: Why don’t you remove the isolation from the GOF? You know it is going to fail and why, and it is therefore not used in the NN
done

• Fig 25 and others: Are all systematic uncertainties included when running the GOF?
Yes, it is stated in the text

• Fig 25 and others: How is the signal treated in the GOF?
Signal is included in the fit as its existence is confirmed by the previous analysis. Note that for this saturated goodness of fit test, no fit is run. The test statistic is determined based on the prefit distribution. However, the signal is in any case almost negligible in inclusive distributions.

• Fig 25 and others: What kind of GOF is performed?
Saturated goodness of fit test. Added now in the caption
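For illustration, the saturated test statistic for Poisson-distributed bin contents can be written as a likelihood ratio between the model and a "saturated" model that matches the data exactly in every bin. The sketch below is a simplified version that ignores nuisance parameters and does not reproduce the toy-based p-value evaluation of the actual tool; the function name is hypothetical.

```python
import math


def saturated_test_statistic(observed, expected):
    """q = 2 * sum_i [ nu_i - n_i + n_i * ln(n_i / nu_i) ] for Poisson bins,
    with n_i the observed and nu_i the expected content of bin i."""
    q = 0.0
    for n, nu in zip(observed, expected):
        q += nu - n
        if n > 0:  # the logarithmic term vanishes for empty bins
            q += n * math.log(n / nu)
    return 2.0 * q


q_perfect = saturated_test_statistic([10, 20, 5], [10.0, 20.0, 5.0])
q_shifted = saturated_test_statistic([10, 20, 5], [12.0, 18.0, 5.5])
```

Each bin enters the sum independently, so the statistic is insensitive to the ordering of the bins and to correlated trends across neighbouring bins.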

• Fig 32, Fig 34: You have a significant number of 2D tests failing when involving the ditau mass, but you do not remove this variable. Please comment.
Statistically seen, the number of failing tests over the whole variable range is less than expected. It is also expected that one finds variables with less failing tests and some variables with a bit more. So this is not really critical and the di-tau mass itself (1D) is known to be important for the signal extraction so we decided to keep it after looking at all relevant distributions.

You have 3/17 checks failing for the ditau mass (~20%), how is that less than expected (5%)?

Concerning the specific case: With recent analysis updates this is now not the case anymore. Concerning the general procedure to deal with such cases: This has in the meanwhile been discussed in a HTT meeting, see https://indico.cern.ch/event/762191/contributions/3250124/attachments/1771052/2877884/Standard_Model_H-_machine_learning_approach_-_Review.pdf .
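To put the 3/17 figure from the question into perspective, the probability of observing at least that many failures by chance can be estimated with a binomial model. The sketch below assumes independent tests with a 5% failure probability each; this is a simplification, since 2D tests sharing one variable are correlated.

```python
import math


def prob_at_least(k, n, p):
    """Binomial probability of at least k failures in n independent tests."""
    below = sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k))
    return 1.0 - below


# 3 or more failing tests out of 17 at a 5% threshold, as in the question.
p_three_or_more = prob_at_least(3, 17, 0.05)
```

Under these assumptions the probability comes out at roughly 5%, so such an outcome is unlikely but not excluded for any single variable.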

• Fig 36: Same as above. 7/16 tests have pvalue <= 0.05!
This had to be rerun, as the pt_tt>50 cut was not yet included. In the newer version, the results are fine

• Table 16: the ditau mass in etau looks pretty bad in Fig 72. How can you explain it passed the GOF? Is it because it is a saturated GOF and not the one that accumulates the variations over neighbouring bins?
Yes, the saturated GOF does not check the correlation between neighbouring bins. But also the trend in the uncertainty shape suggests, that a correlated shift in the low mass region is allowed. Apart from that, being located on this side of the Z peak this should not contaminate the signal region.

But it is not at all covered by the uncertainty band. What can you pull (and by how much) to reach a decent agreement?

• l1105: please define “major background”
done

• l1122: I understand why you do not use the FF background for the training in etau/mutau. But why not use it in tautau, where you don’t separate the fake backgrounds in classes?
As we have no simulated events for QCD, we use same sign events in the et, mt channel and anti-isolated events in the tt channel for our training. This is motivated by the sideband regions used in the former QCD estimation methods. For tt this in fact already corresponds to the events used for the fake factor method. Additionally we train on Wjets and tt bar MC, as there are associated fake factor uncertainties which can be constrained. An explanation on the usage of these data events is now added to the AN

• l1122: The ttbar contamination in the embedded samples should be small enough for it not to be a problem. If the major problem is too many embedded samples, you could train with a fraction of them only.
The possibility to train on embedded samples would require some further investigation and we currently see it as a goal for the run-2 paper.

• l1127: Can you quantify “suboptimal”?
This is currently hard to tell. See comment before. However, MC is in general usable for analyses and also our current result suggest that it works fine.

• Table 17: What enters misc in emu? In general can you please define misc in each case?

• l1129: I do not understand at all this sentence.
It refers to the fact, that the exact border where one event migrates from one to the other category also depends on the assigned class weights in the training (prior probability). It is not enough to just apply the categorization. For a good signal extraction it is important to use the exclusive probability for further discrimination. The formulation in the AN has been changed.

• l1148: not for now but for the full run-2 paper. Would you like to ask for a bigger simulation to be able to perform a training with several classes based on STXS stage 1?
Even with larger statistics a training with one output node per stage 1 category would be hard to control and the gain would probably be minimal since the stage1 cuts are known perfectly well. However it would likely help to train better on stage 1 signals that are currently not populated well. Therefore we will gladly try it as soon as samples with larger statistics are available

• Section 11: How do you define the binning for each of the subcategories?
This has been done by eye looking where the slope of the s/b ratio is large and it is therefore worth to split and keeping at least roughly one signal and background event. For 2017, we use the same binning as in 2016.

You should find a systematic way to do it instead of by eye. Please find an algorithm that optimises the expected results while keeping enough statistics in each bin.

Agreed, propose to do so after pre-approval.

You have to do it before unblinding, which means before preapproval and ARC review. Otherwise it is too late to change.

I (Artur) apologize for causing a misunderstanding with my formulation. I agree with you that this can be made a condition for unblinding and thus for pre-approval. I intended to propose to make the procedure automatic after the pre-approval talk. We will have a look at the auto-rebin option of CH which was used e.g. in MSSM HTT, see here: https://github.com/cms-analysis/CombineHarvester/blob/master/MSSMFull2016/bin/MorphingMSSMFull2016.cpp#L454-L459. However, this needs technical adaptations to handle unrolled distributions properly to match our needs.

We will work on the implementation in the next days.

Closed here as it is discussed below as v8 comment
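As an illustration of what an automatic binning procedure could look like, the sketch below merges adjacent bins, starting from the high-score end, until every bin holds a minimum expected background yield. It is a simplified stand-in and not the CombineHarvester auto-rebin implementation; the function name, the threshold and the toy histogram are assumptions.

```python
def merge_bins(edges, contents, min_content):
    """Merge adjacent bins from the upper end until each merged bin
    holds at least min_content. Returns (new_edges, new_contents)."""
    new_edges = [edges[-1]]
    new_contents = []
    accumulated = 0.0
    for i in range(len(contents) - 1, -1, -1):
        accumulated += contents[i]
        if accumulated >= min_content:
            new_edges.append(edges[i])  # close the merged bin at this lower edge
            new_contents.append(accumulated)
            accumulated = 0.0
    if new_edges[-1] != edges[0]:
        # Leftover low-end bins: attach them to the lowest merged bin.
        if new_contents:
            new_contents[-1] += accumulated
            new_edges[-1] = edges[0]
        else:
            new_contents.append(accumulated)
            new_edges.append(edges[0])
    return new_edges[::-1], new_contents[::-1]


# Toy background histogram in the NN output (hypothetical numbers).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
background = [40.0, 12.0, 3.0, 0.6, 0.3]
new_edges, new_background = merge_bins(edges, background, min_content=1.0)
```

For unrolled distributions the same merging would have to be applied separately within each slice, so that bins from different slices are never combined.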

• Section 11: Can you please add an explanation on how you choose the lower bound of the histograms? I only understood that yesterday during the working meeting.
Explanation added to the NN section

• Section 11: Are all uncertainties taken into account in the uncertainty band (using Combine Harvester?)?
yes

• Fig 41: Please also show the VH signal, at least in the categories targeting VH.
VH samples are ready in principle but not yet included in the analysis. This will be done before pre-approval.

This is very important. Please do it before preapproval.

done

• Fig 42 and others: Can you show in linear scale and normalised to unity the signal and backgrounds in the ggH categories? The shapes look really the same to me now. If the shapes are so close, don’t you think you could find a better separating variable?
This can be inferred better from the ratio plot, which shows the ratio of s+b over the b-only expectation. E.g. in Fig 42, the S/B ratio is rising from close to 0 to slightly above 5% steadily with the NN value.

And there is no other variable that would do a better job?

Plot with linear scale + normalization to unity here:

The ggH signal peaks between 0.4 and 0.5 with large tails to 1, whereas the backgrounds peak between 0.3 and 0.4 with less prominent tails to 1. The hardest background to distinguish from ggH in this ggH inclusive category (no stage 1 splitting) is genuine taus from embedding (EMB). However, a shape difference between ggH and EMB is already visible by eye.

• Fig 51 bottom left: The agreement looks bad. Please comment.
We have checked this thoroughly, the explanation seems to be a small systematic effect plus a fluctuation. Our uncertainty model is able to handle this case: The b-only fit is able to model this well, without any strong NP pulls (none above 1 sigma, and very few above 0.5).

What is pulled? Is it compatible with the prefit agreement in the other categories?

The prefit distribution on the left is transformed by the fully combined bg-only fit (all channels, 2016+2017) into the postfit distribution on the right. The most prominent change is visible in the FF background, so the FF related systematics for this year, channel and category (both normalization & shape-changing) are pulled. This is compatible with other background categories as can be seen in the prefit and postfit distributions (the same fully combined bg-only fit): http://ekpwww.etp.kit.edu/~wunsch/2018-12-12/blinded_backgrounds/index.php

• Fig 53: In this first bin I am not sure I understand what is the y axis. I would guess it is one event divided by the bin width (0.8), which should be a point above 1 instead of below 1. Can you please explain how the number of events per bin is computed?
It is indeed normalized by bin width but note that these are asimov data shown in the signal categories and also the fake factor method (subtraction method and fake rate applied to data events) can have events with weight smaller than 1. So here the model expectation is less than 1 event.
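The bin-width normalization can be made explicit with a short sketch. The bin width of 0.8 is the one mentioned in the question; the weighted expectation of 0.6 events is a toy value chosen only to show how a point below 1 can arise.

```python
def events_per_unit_width(content, width):
    """Bin content divided by bin width, as plotted on the y axis."""
    return content / width


bin_width = 0.8  # width of the first bin, as quoted in the question

# One unweighted event would indeed give a point above 1 ...
one_event = events_per_unit_width(1.0, bin_width)

# ... but a weighted model expectation below one event gives a point below 1.
toy_expectation = 0.6  # toy value, e.g. a sum of fake-factor weights
weighted = events_per_unit_width(toy_expectation, bin_width)
```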

• Section 11: You certainly know it but 2016 emu plots and 2017 all plots are missing
Indeed. They have been added now.

• Section 12: Can you please provide me with the full impact plot?
To be added to the AN for the final combination, today or tomorrow.

• l1173: How is VH treated? Is it part of what you call qqH in Table 18? Or is it floated like a background? It would make sense to have a VH line in this table and have another coupling strength modifier
Done.

• Fig 58: what is uncertainty 2? Is it the ttbar cross section? Why is it ranked so high?
Yes, it is the ttbar normalization. It is a bit artificial that it is second of the impacts. Please note that the JEC and the TauES uncertainties, which used to be on the first ranks in former analyses, are now split into parts due to MC/EMB and 2016/2017 correlations. Each of these splits is ranked lower but together they should have the largest impact as before.

How can you explain that the ttbar cross section has a ~5% impact on your result while you expect almost no ttbar in the signal region (and half is taken from the FF, and the uncertainty is only 6% on the ttbar yield)? It looks fishy to me. This uncertainty should rank very low.

Indeed it looks fishy at first glance. But please have a look at the direction of the impact. An upshift of the ttbar cross section is connected to an upshift of the signal, so just the opposite from what you would expect if they were overlapping. So it seems to be an indirect effect. Please note that there usually is ttbar background in the signal categories (e.g. fig.47 VBF categories) but not in the significant bins. The overall yield in the category is similar to the signal yield, but moving to higher NN outputs the ttbar fraction decreases while the signal fraction increases. Therefore, an upshift of ttbar probably decreases the yield of ff and embedding (profiling the other nuisance parameters) which then pulls the signal up. Also in the ttbar category an upshift of ttbar would decrease ff and embedding. So this is probably the reason for this impact. We will try to have a deeper look into the profiling of this impact fit.

To quantify this hypothesis with a study:

At first, an Asimov fit was performed with the ttbar xsec uncertainty nuisance as POI and with fixed r=1 to identify nuisances correlated with the ttbar xsec uncertainty. A list of nuisances with abs(correlation coefficient) > 0.01, ordered by correlation, is given here: http://ekpwww.etp.kit.edu/~akhmet/plots_archive/2018_12_12/tjXsecAsPOI_diff_nuisances.html

As a next step a fit was performed for the signal POI r on asimov with signal for fixed values of ttbar xsec uncertainty nuisance: tjXsec=-0.99, tjXsec=0.01 (ca. best fit), tjXsec=1.01. Please note, this nuisance is already normalized to prefit sigma, i.e. the prefit sigma equals to 1. Results for the subset of nuisances shown above, ordered by correlation to r, can be seen here:

http://ekpwww.etp.kit.edu/~akhmet/plots_archive/2018_12_12/tjXsec-0.99_diff_nuisances_selected.html

http://ekpwww.etp.kit.edu/~akhmet/plots_archive/2018_12_12/tjXsec0.01_diff_nuisances_selected.html

http://ekpwww.etp.kit.edu/~akhmet/plots_archive/2018_12_12/tjXsec1.01_diff_nuisances_selected.html

In general it is difficult to quantify and deduce the interplay between all nuisances playing a role. However, what can be observed is for instance the fact, that with higher values of the ttbar nuisance parameter, the luminosity uncertainty nuisance as well as some other lnN uncertainties nuisances are pulled down. The signal is affected by the lumi nuisance directly. In turn, there is more "space" for r to go up, after lumi had a downward pull. Again: this is only an example how it might happen, that the ttbar xsec has a strong impact on the signal strength going into the same direction.

• Fig 58 lines 17 and 22: why is there qqh in the names of these FF uncertainties?
This refers to a category uncorrelated uncertainty applied to the jet fakes, in this case in the qqh category

Why do you uncorrelate this uncertainty per category? This has never been explained nor justified.

Closed here as it is discussed as comment on v8 below

• Fig 58 line 5: the jet energy scale is still constrained. It may be due to artificial constraints due to spiky templates as I wrote before. Please check where the constraints come from. The JES are not applied to FF and embedded so there should not be a constraint.
We will have a closer look. But the constraint is similar to HIG-16-043 so this is qualitatively understood.

The comparison with HIG-16-043 is not valid because the JES were affecting the ZTT and W samples at that moment, which is not your case.

You are probably right that it is due to the fact that the JES uncertainties are not very smooth (see http://www-ekp.physik.uni-karlsruhe.de/~swozniewski/unc_plots/plots_CMS_scale_j_RelativeBal/index.html ). Switching to lnN uncertainties however does not seem to be an option as there are rather clear shape effects in some cases, e.g. in the et ZLL category.

You need to find a way to smoothen the shapes to remove the artificial constraints.

Closed here as it is discussed as comment on v8 below
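One possible way to smooth spiky shifted templates, sketched here for illustration only, is to smooth the ratio of the shifted to the nominal template with a moving average and to multiply the result back onto the nominal template. This is not necessarily the procedure adopted later; the function name, the window size and the toy numbers are assumptions.

```python
def smooth_shifted_template(nominal, shifted, window=3):
    """Smooth the shifted/nominal ratio with a moving average and
    rebuild the shifted template from the smoothed ratio."""
    ratios = [s / n if n > 0 else 1.0 for n, s in zip(nominal, shifted)]
    half = window // 2
    smoothed = []
    for i in range(len(ratios)):
        lo = max(0, i - half)
        hi = min(len(ratios), i + half + 1)  # window is truncated at the edges
        smoothed.append(sum(ratios[lo:hi]) / (hi - lo))
    return [n * r for n, r in zip(nominal, smoothed)]


# Toy templates with a spiky shift (hypothetical numbers).
nominal = [100.0, 100.0, 100.0, 100.0, 100.0]
shifted = [101.0, 105.0, 99.0, 104.0, 100.0]
smoothed_shift = smooth_shifted_template(nominal, shifted)

spread_before = max(shifted) - min(shifted)
spread_after = max(smoothed_shift) - min(smoothed_shift)
```

Working on the ratio rather than on the template itself keeps genuine shape effects of the nominal distribution intact, while a uniform shift passes through unchanged.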

• Fig 58: Can you please justify all the constraints in the text?
Done. To be updated together with the impact plot for the final combination

• Table 19: The 0 jets measurement is less good than that from HIG-16-043 (which can be further improved by having the tau pT as a second dimension instead of the decay mode). Can you explain why the ML performs worse? I know 0 jet is not the most important category, but you could consider to have a cut-based 0 jet category? This would demonstrate also that you control all your backgrounds well (ZTT peak, ZL peak, …).
Thank you for the suggestion, this will be investigated further.

• Fig 60: I guess the fake background is dominating at high electron pT. If so there is a 20% overestimation of the background by the FF method. This is also probably why you reject most of the pT distributions from the GOF in 2017. Can you investigate that?
the average over-estimation at high pt is about 10%. This is well within the uncertainties, as shown also in post-fit plots.

• Fig 72: As said before I am surprised the ditau mass does not fail the GOF test. Can you explain why?
Yes, the saturated GOF does not check the correlation between neighbouring bins. But also the trend in the uncertainty shape suggests, that a correlated shift in the low mass region is allowed. Apart from that, being located on this side of the Z peak this should not contaminate the signal region.

• Comparing Fig 68 and 80: You have twice as many events in 2016. Why?
There are different reasons, i.e. higher lumi, different triggers, higher fake rate due to pileup, and the control plots for 2017 include 23<tau pt<30 which was first tried to use, but is now limited to 30 GeV (better modeling) as done in 2016.

Can you remove taus with pT < 30 GeV from the control plots?

done

• Comparing Fig 78 and 66: You have twice as many events in 2017. Why? Given the muon pT spectrum it cannot only be due to the trigger. Same question for etau.
see comment above

• l871: What about VH and VBF?
this is a special treatment for ggH provided by the HiggsWG due to higher uncertainties associated with this production process containing the quark loop

But you still need PDF and QCD scale uncertainties (related to the acceptance) for VBF and VH.

Right, this is what is written in the AN for VBF (l.866). For VH this will be updated as soon as it is in the analysis.

• l451: I think the pair ordering biases the application of the FF method. For the FF to work you need to treat the anti-isolated events in exactly the same way as the isolated events. However you favour isolated events through the pair ordering, which creates a “deficit” in the anti-isolated region. I think that if the FF method is used the pairs should be ordered based on pT only (or also light lepton isolation but not tau isolation). Please comment.
This is correct. This was investigated in the past and an effect was found to affect ttbar. Since then, the FF is measured and applied consistently (this effect is taken into account in both cases, i.e. the pairs on which the FF is measured are selected the same way - instead of using all pairs for the FF measurement as was done in the beginning, about 2 years ago) and the remaining small effect is below stat. Unc. and is mitigated via non-closure correction.

Comments from Martin, on Main AN v4:

• Table 2: k factor for * * * is missing
Removed

• Table 3: sample name for Wgamma is missing
No Wgamma samples are used for 2017. Removed from table.

• l 98: makes simpler -> makes definition of isolation-based sideband regions possible, as required e.g. by the fake factor method
Reformulated sentence

• Table 4: can you clarify which of the values are for the WP of which year in the AN?
Added 2016 in caption to make it more clear for which year the WPs are. As stated in the text there are no WPs for 2017

• 192: hadronic tau decays is used

• 252: please point out that it is a POG recommendation to use the tight WP for 2017 (otherwise it seems arbitrary that you do not use the same as for 2016)
Done

• 284: (emu) So one of the triggers for data is not used in simulated samples? How is that corrected for? Via trigger SFs? Please explain in the AN.
The formulation in the AN was not fully correct. Only the dz filter of the trigger employed in 2016 Run G and H is not modelled in simulation. In order to account for the missing $dz$ filter in simulation, the simulation is corrected by the efficiency of the filter in data, which amounts to 97.9%. The text in the AN has been corrected.

• 316: (emu) For the 2016 electron ID, it says you use an ID that includes isolation variables. Is that correct? I thought this ID exists only for 2017.
Yes, you are right. The mentioned ID only exists for 2017. The formulation in the AN was wrong. It has been corrected.

• 340: same comment as above

• 349: of the event

• 358: has to be

• 370: it says the medium WP is used for tautau, but l. 192 states that tight is used everywhere.
Yes, in the SR the tight WP is applied. However, no ID is applied during pair selection. Removed this bullet

• 385: why the additional requirement in 2016, and not in 2017?
Both cuts were adopted from HIG-16-043. The pt_1 cut has not been checked in 2017 and therefore we stuck to the 40 GeV cuts as motivated by the triggers. The ptH cut has turned out to be beneficial for the final result and for the data/MC agreement in the GoF tests, also in 2017. Added this comment to AN

• 481: object*s*
Done

• 5.4: Please give the range in which the OS/SS correction numbers are, and state how uncertainties are determined.
More information about the QCD background estimation (including the range of the OS/SS correction factors and the determination of the uncertainties) has been added to the AN.

• 595: please give the typical efficiency loss of this jets removal (i.e. something like "about X% for VBF and below Y% for all other samples")
Added numbers for VBF to AN

• 629: event*s*
Done

• 724 (or elsewhere): give typical values (or a range) for the embedding SF for both 2016, and 2017
“Typical values for 2017 are around 0.7 in the turn-on region and around 0.85 in the plateau region, whereas the 2016 scalefactors are typically around 0.94 in both regions.” was added to the draft

• 798: is the same weight applied for 2016 and 2017, or do you derive them separately?
No, the weights for 2016 and 2017 have been determined and are applied separately. Information about the weights for 2017 has been added to the AN.

• Fig 25: pt(e) is badly modeled, pt(tau) is fine; yet Table 15: you use pt(e), not pt(tau) ??
Indeed, that’s a typo. Changed table 15 to pt(tau)

• S 7.1: The subsection is only about theory signal uncertainties, right? Then please rename it to "Signal theory uncertainty"
done

• 994: Does that mean BBB uncertainties are not used for non-MC backgrounds (embedding, FF)?
No, indeed this is applied to all backgrounds. Reformulated in AN

• S 7: Concerning correlations between 2016 and 2017: wherever your decision is based on a POG recommendation, please say so in the text.
done

• 1110: Please describe already here the strategy to deal with variables which fail some of the 2D checks
done

• 1157: different treatment between -> discrepancy between
done

• S 9: This needs a bit more detail, even if it is in a different AN - just very briefly describe the architecture (preprocessing, activation function, number of hidden layers, number of nodes in these layers, type=fully connected feed-forward NN)
done

• 1175: node (without s)
done

Comments from Cecile, on FF AN v4:

• General: I see some changes in the FF wrt v3 of the AN (I guess due to adding the bins with 20 < pt < 30 GeV) but the corrections have not changed at all. Why?
We were already before using taus also with pt<30 GeV (just not using them in the pt fits and plotting them) hence they were already considered in the corrections before.

But the fits are changed by adding the low pT taus even if the point with pT > 30 GeV are unchanged.

They are very similar, and indeed it would be very strange if they changed much by just adding bins to the fit. Note that they are not identical, though, e.g.

• General: Have the changes been propagated to the analysis (to the FF repository) yet?
Yes, all the changes described are in the FF repository and in the most recent results in the main AN.

• General: I understand you have only one shape uncertainty per “uncorrected FF” (see also followup on your answer). If so can you add to all plots like Figure 5 the Up and Down FF corresponding to ± 1 sigma?
The error band shown in the plots already shows the up/down variation due to the stat. unc. If you mean the systematic uncertainties, they also depend on other quantities (e.g. typically mvis for the closure, mT for corrections in W, etc) hence the 1D up/down shapes cannot be shown in 1D plots (internally they are products of 2D and 3D histos) in an unambiguous way - or in a way that is more meaningful than the way we present them now in the note, each correction by itself.

Do you mean that the +1 sigma function is the one that surrounds your band by above, and the -1 sigma the one that surrounds your band by below? In this case you have mostly a norm effect (which you say you cover in a different way) and not a shape effect.

Yes, the bands as shown in the plots, and the resulting distributions (after applying the varied FF) are then normalized to the nominal one, so it is a pure shape effect (this is already described in the AN where the uncertainties are discussed).

• General: If the uncorrected FF are fit with a flat line at high pT (e.g. Figure 5), why does the uncertainty band have this shape in this pT range and not a linear shape too? (BTW how do you define “high pT”?)
The higher we get in pT, the more the value is an extrapolation (in the sense that it is not supported by data points nearby) and hence the intrinsic uncertainty increases. We take this into account by increasing the uncertainty beyond the point where the function is made constant, by the amount we would get if we actually prolonged the fit beyond that point (in other words: while we make the central value constant at some value, we keep the original size of the uncertainties). This treatment was actually quite relevant for the MSSM analysis since in the sensitive region there were a lot of high-pt taus. It does not have any practical effect on the SM analysis, but we kept it for consistency and because it is nonetheless the most correct thing to do for these very rare events.
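
The treatment described above can be sketched as follows. This is only an illustration under stated assumptions, not the analysis code: the linear fit function, its parameter values, the covariance matrix and the cut-off PT_CONST are all hypothetical stand-ins for the actual fit functions used in the AN.

```python
import numpy as np

# Hypothetical stand-ins: a linear fit FF(pt) = p0 + p1 * pt with a made-up
# parameter covariance matrix, and the pt value above which the central
# value is frozen.
PARAMS = np.array([0.20, -0.0005])
COV = np.array([[4.0e-4, -4.0e-6],
                [-4.0e-6, 6.0e-8]])
PT_CONST = 100.0  # GeV

def fit_value(pt):
    """Central value of the (unfrozen) fit function."""
    return PARAMS[0] + PARAMS[1] * pt

def fit_sigma(pt):
    """Uncertainty of the unfrozen fit from linear error propagation."""
    jac = np.stack([np.ones_like(pt), pt], axis=-1)  # d(FF)/d(p0, p1)
    return np.sqrt(np.einsum("...i,ij,...j->...", jac, COV, jac))

def fake_factor_with_band(pt):
    """Central value frozen above PT_CONST, uncertainty of the prolonged fit."""
    pt = np.asarray(pt, dtype=float)
    central = fit_value(np.minimum(pt, PT_CONST))  # constant above PT_CONST
    sigma = fit_sigma(pt)                          # evaluated at the true pt
    return central, central - sigma, central + sigma

central, lower, upper = fake_factor_with_band([50.0, 150.0, 300.0])
```

In this sketch the central value is identical at 150 and 300 GeV, while the band is wider at 300 GeV because the uncertainty is still taken from the prolonged fit.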

• General: Why do you expect the FF and their corrections to differ between etau and mutau?
Most of the FF and corrections are actually very similar and we will investigate in the future if these can be merged. However, as there are correlations between the two tau decay products in the event, this is not actually trivial and needs a dedicated investigation.

• l31: write explicitly that the choice of Powheg/Madgraph depends on the data period
Done.

• l69: I thought you used pT > 20 GeV everywhere now
Corrected (to 23 GeV, the minimum we can use embedding for).

• l75: choosing the most isolated object biases your FF; the probability to be isolated will be higher if you favour isolated taus
This is true, and in the past we observed an effect on ttbar (it only affects events with a large number of jets = large number of tau candidates). However, after we changed the FF measurement itself to also consider only the candidate of the most isolated pair, chosen exactly the same way, the effect was gone (as far as we can judge, in any case below stat. unc.). A small remaining effect, if any, is one of the things that would be corrected by the non-closure correction.

• l121 and 172: I do not understand how you make the prompt subtraction. Can you clarify? I guess you take anti-isolated MC events, reweigh them with the FF, then subtract the result from the reweighed data. Is that correct? The text seems to indicate that you just take MC events from the SR without reweighting.
This is correct. Clarified in the AN as well.
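
The subtraction confirmed above can be sketched as follows. This is a minimal illustration with made-up numbers; the function name, the argument names and the binning are hypothetical, and "AR" denotes the anti-isolated application region.

```python
import numpy as np

def fake_background(data_ar, ff_data, mc_ar, ff_mc, mc_weights, bins):
    """FF-reweighted AR data minus FF-reweighted prompt MC, per bin."""
    data_hist, _ = np.histogram(data_ar, bins=bins, weights=ff_data)
    # MC events carry their usual event weights in addition to the FF.
    mc_hist, _ = np.histogram(mc_ar, bins=bins, weights=ff_mc * mc_weights)
    return data_hist - mc_hist

# Toy inputs: observable values (e.g. mvis in GeV) and per-event fake factors.
bins = np.array([50.0, 100.0, 150.0])
data_ar = np.array([60.0, 70.0, 110.0])      # anti-isolated data events
ff_data = np.array([0.1, 0.1, 0.2])
mc_ar = np.array([65.0])                     # anti-isolated MC, real taus
ff_mc = np.array([0.1])
mc_weights = np.array([0.5])

estimate = fake_background(data_ar, ff_data, mc_ar, ff_mc, mc_weights, bins)
```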

• l189 and about the fractions in general: I think you derive the fractions in an inclusive category. But the fractions depend on the categories (e.g. much more ttbar in a given mvis bin for events that enter the ttbar CR). How do you take that into account? Do you have an uncertainty on the fractions? Or could you compute the fractions per mvis bin of the CR/SR defined by the NN?
For the final results, we now compute the fractions for each category separately. This is documented in the (upcoming) version of the main AN but not here in the FF AN as we describe the method here in general and not wrt the specific NN implementation.

And do you have an uncertainty on the fractions.

Yes, see section 5.

• Figure 5: The mt fits look bad.
It is not pretty, but all data points agree with the fit within uncertainties.

I don't think the chi2 would be reasonable.

Thanks for the comment - indeed we have now improved the fit, this is updated in the note and also propagated to the analysis. To explain: For these two fits, we had actually used Landau+pol0 (i.e. just a constant) because of very bad behavior at large pt values otherwise. This has become better in this iteration, plus we cut off the fit to become constant at the region without data points, so this now gives a satisfactory result.

• Figure 10 and others: please add the horizontal error bars for the corrections too
Done.

• Figure 12: How do you explain the difference between etau and mutau?
In HTT studies, we have always observed differences between etau and mutau when it comes to OS/SS corrections; the SFs have typically not been compatible. The underlying reason is mostly the branching ratios of heavy meson decays involving electrons and muons due to the phase space, but they also depend on the selection of the lepton (e vs mu ID and kinematic selection).

• Figure 19 top right: The agreement is bad in the tail
It is not pretty, but it is roughly within uncertainties (there is a correlated uncertainty which affects mostly the tail). Also, the variable is not used directly in the NN.

• Figure 19 center right: The agreement is bad too. Can the uncertainties really cover it (I mean pulling only one uncertainty instead of pulling each bin by 1 sigma)?
Yes, this has been checked.

Which uncertainty? Do you have plots? If this uncertainty is pulled, are the other plots still fine?

Here are the plots. CMS_ff_qcd_njet1_et_stat_13TeV is pulled by -1.09 sigma, no other FF uncertainty is pulled more than 0.4. Of all other uncertainties, the only significantly pulled uncertainty is on the ZL shape which only affects the low-pt-like regions (low pt or low mvis) of the plots and not the FF-dominated one.

• Figure 23: The FF are about 50% higher than in 2017. This is more an analysis question, but was it considered to use an isolation WP that would have the same FF for both 2016 and 2017? Since the 2016 WP was optimised, the 2017 WP might be suboptimal.
We will look into that, but the main criterion is to have a loose denominator, that is a) as loose as possible while b) still ensures that differences between FFs from different processes significantly decrease. In that respect the WP is fine for both years.

• Figure 31: How can the correction be much below 1 for all mvis values if you apply the FF measured in exactly the same region? If this is really the same region there is obviously a problem.
Thanks for spotting this... while investigating this we found that while we use embedding for everything, somehow for these corrections we used MC. This had very little effect almost everywhere except, for some reason, for this W correction. This is now fixed.

I cannot understand how using MC instead of embedded samples could cause a 10% change (given that ZTT is at most 10%), especially in the mvis tail where you don't expect any ZTT. Please comment.

It is not because of using MC instead of embedding but because MC was used like embedding. This led to the problem that several weights that need to be applied to MC were left out because the weight string for embedding was used. The bottom line is that this effect is of purely technical nature and should be considered a bug introduced in the MC/embedding transition, which is now fixed.

Followup on your answers to my comments on AN v3

• The documentation does not align with the twiki and sometimes contradicts itself. In particular, is the prompt contribution subtracted (eq 1, l173/174)? are the FF binned in decay mode (l146, l283-286, l295-296)? are the fractions renormalised to the sum of fake fractions (l171)? Please correct and double check everything!
• Indeed due to some changes we implemented following your comments in the HTT meeting there were quite a few inconsistencies. They should all be resolved now in the AN.
• Can you clarify in the AN exactly how the subtraction is done?
Done.

• Related to what I understand from the twiki but is not transparent in this note. You recommend to compute the fractions in bins of mvis and njets: the fractions for tt, qcd, and w in generally do not sum to 1.0 because there is also a fraction or real taus. Because the fractions do not sum to 1.0 you do not need to subtract the prompt contribution (what you are doing in practice is applying a fake factor of 0 to events with real taus). But the fractions can depend on other variables that you use in the analysis. Let’s take nbtag as an example. When you derive the fractions in the inclusive region, you can have something like f_qcd=0.2, f_w=0.5, f_tt=0.1, f_real=0.2 for mvis under the ZTT peak. But for events where there is nbtag>0 (selected preferentially by the NN to populate the ttbar CR) the fractions you should have used could have been instead f_qcd=0.05, f_w=0.3, f_tt=0.6, f_real=0.05. Assuming FFqcd=FFw=FFtt you end up with a 15% difference because you considered there was 20% real taus instead of 5%. If FFqcd!=FFw!=FFtt the difference is even bigger. In my opinion a much better way to proceed is to renormalize f_qcd, f_w and f_tt so they always sum up to 1. Then you need to apply the sum of fake factors also to real events in the AR and subtract them from the reweighed data. In that case f_real is always correct, for any choice of variable. Please comment.
• We agree. We have switched to “renormalisation and subtraction”, as suggested.
• This does not solve the problem if FFqcd!=FFw!=FFtt, see my general comment above
This is already discussed in a comment above (in short, we bin the fractions individually per category).

And do you have an uncertainty for the fractions?

Yes, see section 5.
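
The numbers in the comment above can be checked with a short sketch. The fractions are the ones quoted in the comment; the common fake factor value of 1.0 is an arbitrary assumption (only relative differences matter), and the function names are hypothetical.

```python
FAKE_PROCESSES = ("qcd", "w", "tt")

def ff_weight(fractions, fake_factors):
    """Sum over fake processes of fraction times fake factor."""
    return sum(fractions[p] * fake_factors[p] for p in fake_factors)

def renormalise(fractions, fake_processes=FAKE_PROCESSES):
    """Rescale the fake fractions so that they sum to 1."""
    total = sum(fractions[p] for p in fake_processes)
    return {p: fractions[p] / total for p in fake_processes}

# Fractions quoted in the comment: inclusive vs nbtag>0.
inclusive = {"qcd": 0.2, "w": 0.5, "tt": 0.1, "real": 0.2}
btagged = {"qcd": 0.05, "w": 0.3, "tt": 0.6, "real": 0.05}
# Assumption: FFqcd = FFw = FFtt, set to an arbitrary common value.
equal_ff = {p: 1.0 for p in FAKE_PROCESSES}

# Without renormalisation, the weight depends on the assumed real fraction.
w_inclusive = ff_weight(inclusive, equal_ff)
w_btagged = ff_weight(btagged, equal_ff)
difference = w_btagged - w_inclusive

# With renormalisation the weight no longer depends on f_real; real-tau
# events are then removed by subtracting reweighted MC instead.
w_inclusive_renorm = ff_weight(renormalise(inclusive), equal_ff)
w_btagged_renorm = ff_weight(renormalise(btagged), equal_ff)
```

With equal fake factors, the difference of 0.15 in the first approach corresponds to the 15% quoted in the comment, while the renormalised weights agree. For unequal fake factors the renormalised weights still differ between the two sets of fractions, which is why the fractions are binned per category.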

• l133: how do you choose the thresholds for the isolation?
• We look at the distributions such that we exclude as much as possible of the true-tau contribution while still keeping a minimum level of events in the CR needed to do a proper measurement.
• Do you have a more quantitative answer?
I am afraid not. As there is a negative trade-off between two things (purity and high stats) this needs a bit of manual fine-tuning in the end. But typically as soon as we exclude the very low-iso bins (i.e. from 0 to 0.02) we get rid of most of the true taus, and if we can afford it due to high stats, like in mu-tau, we choose to go even higher and increase the purity.

• figure 4: how can you explain the pt dependence in the tautau case (even if it is a different WP)?
• I assume you mean compared to etau/mutau? This is an effect of the triggers used (mostly single-lepton with a bit of lep+tau vs di-tau).
• Can you please explain how the triggers influence the pT dependence? The isolation at trigger level should be looser than the VL WP of the tau iso.
It is looser on average, but the resolution is worse, so there will always be some events removed by the trigger which would be accepted (as very loose) offline. Just looking at the actual raw fake factors there is a significant difference. And this resolution effect affects taus more the lower the pt, which might explain the shape difference.

• figure 4 and others: how do you choose the binning? If the bins have different widths you need to indicate the horizontal error bars. Why do you go higher in pt in the case of etau >=1 jet than mutau >=1 jet?
• Horizontal bars have been added. The binning is chosen small enough at low pt to describe the dependency in this region well, and at high pt it is entirely dictated by statistics, i.e. we merge bins until we get data points with uncertainties which allow for a reasonable fit. We do not have an upper cut on pt, so the range we cover only depends on the events available in the DR. The data points are positioned at the centers of gravity of the bins.
• Can it be done in a more quantitative way instead of by eye? What do you mean by “reasonable fit”?
We have not found a way to fully automate it. A reasonable fit means that the error bars are not so large that essentially any fit would be possible or that the minimisation fails to converge.

• 212: why do you expect the fake factors to depend on the light lepton isolation?
• In particular for QCD, the two objects which are reco’ed as tau and electron are correlated (e.g. through their flavor, g/u/d/c/s/b; or through their pt due to momentum balance, which also enters the relative isolation) and this can be parametrized e.g. via the light lepton isolation. It took us a while to find out that this is needed to achieve any reasonable modeling.
• Can you add the explanation to the AN?
Sure - done.

• figure 9: indeed all these corrections look very compatible with a flat line centered at 1. Are you sure you are not overfitting? Please indicate the chi2/ndof for your correction and for a horizontal line to judge if you are overfitting. This applies to all the corrections throughout the AN.
• We are not sure if a chi2/ndof is a meaningful test for a smoothing algorithm (as opposed to a fit). We agree that several of the corrections are compatible with 1. Now we still need to assign an uncertainty, and we are not sure if e.g. we get 1.05 and an uncertainty of 0.2 it is better to have the correction as 1.0+0.25/-0.15, or as 1.05+/-0.2 (as we have now). In any case, the difference in practice is very small: In the region where most of our data is, the uncertainty is small by definition and the smoothing will be close to this value.
• I agree that chi2 does not work for the smoothing. But you should first check if a constant or linear function makes a good job (here you use a chi-square test) before overfitting your corrections with unphysical shapes. If the chi-square of the constant or the linear function is not good enough, then only you use the smoothing. I agree from your example that 1.0+0.25/-0.15, or as 1.05+/-0.2 look as good as each other, but the problem is that you introduce shape effects that are not covered by systematic uncertainties if you correct by wavy unphysical corrections (e.g. the correction in one mT region is artificially high because of a statistical fluctuation, but the uncertainty on the correction can only shift the whole correction and not that particular mT bin).
We will look into it, but this will take a little while. What you say is of course true, but at the same time the number of corrections is high and it is technically a lot easier to use the same procedure everywhere without changing the result noticeably. Also, there is hardly any "waviness" above 1-2% and the correction is not done in our final observable, so even this averages out. So in the end there is almost no practical effect on the results. Nonetheless we will investigate this more closely since indeed it is not very elegant as it is now.
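
The check requested in this thread (chi-square of a constant before resorting to the smoothing) could look like the sketch below. The correction values and uncertainties are made up for illustration, and the function name is hypothetical; the best-fit constant is simply the inverse-variance weighted mean.

```python
import numpy as np

def constant_fit_chi2(values, errors):
    """Best-fit constant (weighted mean), its chi2 and the ndof."""
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.asarray(errors, dtype=float) ** 2
    best_const = np.sum(weights * values) / np.sum(weights)
    chi2 = np.sum(weights * (values - best_const) ** 2)
    ndof = values.size - 1  # one fitted parameter
    return best_const, chi2, ndof

# Made-up correction values per bin with equal uncertainties.
corr = [1.02, 0.98, 1.05, 0.97, 1.01]
corr_err = [0.03, 0.03, 0.03, 0.03, 0.03]

best_const, chi2, ndof = constant_fit_chi2(corr, corr_err)
chi2_per_ndof = chi2 / ndof
```

A chi2/ndof close to 1 would indicate that a flat line already describes the points; the same function with a fixed constant of 1.0 in place of the weighted mean would test compatibility with no correction at all.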

• figure 12 left: the average correction seems well below 1. How is that possible if you apply the fake factors derived using the same events and the fit functions match the data points very well in Fig 7? I could expect a shape dependence with mvis but not a correction of the normalisation.
• For mvis~100+/-20, where most of the data is (i.e. determines the average correction), most of the data points are between 0.95 and 1. So it is true that there is a small normalisation shift, however, this can be explained by the imperfect parametrisation (e.g. if you look at the used pt fits, Fig 7, 0 jet & low pt [where most of the data is], the smoothed curve is about 3% below the data points, so this alone could account for most of the difference). But of course there are also other potential sources, e.g. neglected dependencies.
• Almost all the points of Fig 12 left are between 0.90 and 0.95. This means the estimated background in this region is too large compared to the observation. That would be the case if the FF is overestimated. However you say the smoothed curve is too low. Can you check? Also I don’t think neglecting dependencies can lead to a general yield disagreement (but it could make a shape difference) because you are using exactly the same events to derive the FF and to apply it.
Same cause as for your comment for Fig 31 above, solved now.

• fig 13: don’t you think a flat line would make as good a job with less overfitting? please check as mentioned already for fig 9. This wavy correction needs to have a physical explanation.
• Note that we only use the left part of the figure, i.e. mT<50 GeV, since this is our SR. In this region, there is a clear systematic shift for mu-tau; for e-tau, it is less clear due to lower statistics, but the central curve is very similar (as it should be, since this is a kinematic effect, not connected to lepton flavor). About the physical explanation: following from the explanation in l248-53, this effect is significant for low mT, in particular mT<30, and still non-negligible up to mT<50 GeV. So the best description of the dependence would likely be some function of mT up to 50 GeV, and then a flat line. This is compatible with what we see and could be implemented - however, since we do not use mT>50 GeV it would not have any practical consequence.
• If you do not care about mT > 50 GeV, you should remove this region from the fit. You should not use any correction that does not look physical even if this is what you obtain from the smoothing. See the answer to the followup on figure 9.
Done - but since for the smoothing only neighbouring bins are relevant, the result is almost unchanged.

• l293-296: How can you end up with only one uncertainty per final state per jet multiplicity if the fit function has 4(?) degrees of freedom? In principle you should have as many uncertainties per function as the number of degrees of freedom. Otherwise the low pT region for example can constrain the high pT region without any good physical reason. This question applies also to the uncertainties in the corrections; please answer this question for all uncertainties
• It is correct that the a-priori correct way would be to have an uncertainty for each dof. However, as also e.g. with most POG-provided uncertainties, we found that we can sufficiently cover the uncertainties by something similar to grouping them. For the fits, there are actually two uncertainties we use for fits: one alternative shape, normalized to the nominal shape; and the normalization value as norm uncertainty, as described below. This way it is possible for the fit to adjust e.g. just the low pT region, by moving towards the corresponding shape and at the same time to adjust the norm, such that the high-pt region stays ~unchanged, if the fit prefers that. GoF tests and NP pulls and constraints have shown this to work well in past measurements. For corrections, the answer is the same. In addition, there it would be difficult to do it in a different way since the functions are smoothed, not fit.
• What do you mean with "we found that we can sufficiently cover the uncertainties by something similar to grouping them”? Can you show all the 1 sigma alternative shapes for each of these fits to check if these indeed seem to cover all stat uncertainties? How is the alternative shape determined from all these dof?
We tested by using toys (for the MSSM analysis): We randomly modify data points used in the fit within their uncertainties and redo the fit (many times) and checked if the stat. unc. can correct such fluctuations - and they can, at least up to a very small remaining shape effect which is well below our (other) uncertainties. The real test are of course the control regions which showed (for Z->tautau and MSSM) and still show for this analysis that the modeling works. The alternative shape is derived the same way as for the check explained above: we modify data points of the fit within their uncertainties, obtaining many toy experiments, and then use the 68% boundaries of these toys as 1 sigma variation.

In any case, can you show the +/- 1 sigma functions if they are different from the borders of the orange bands? (And if they are equal to the borders of the orange bands I do not think it is correct, see my previous comment.)

They are equal to the borders, except for the normalization. Since they are split in norm and shape uncertainty, the fit can effectively cover the whole orange band, in any orientation (and not just linear combinations of the upper and lower end, as it would be without the split).
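
The toy procedure described in this thread (fluctuate the fitted data points within their uncertainties, refit many times, take the 68% boundaries) can be sketched as below. This is an illustration under stated assumptions only: the data points are made up, and a straight-line fit with np.polyfit stands in for the actual fit functions of the AN.

```python
import numpy as np

def toy_band(x, y, yerr, n_toys=2000, degree=1, seed=1):
    """Return grid, nominal fit and the 16%/84% envelopes of toy refits."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(x.min(), x.max(), 50)
    # np.polyfit expects weights 1/sigma for Gaussian uncertainties.
    nominal = np.polyval(np.polyfit(x, y, degree, w=1.0 / yerr), grid)
    curves = np.empty((n_toys, grid.size))
    for i in range(n_toys):
        y_toy = rng.normal(y, yerr)  # fluctuate each point within its error
        curves[i] = np.polyval(np.polyfit(x, y_toy, degree, w=1.0 / yerr), grid)
    # The central 68% of the toy curves defines the +/- 1 sigma band.
    lower, upper = np.percentile(curves, [16, 84], axis=0)
    return grid, nominal, lower, upper

# Made-up fake-factor points versus tau pt (GeV), for illustration only.
pt = np.array([25.0, 35.0, 45.0, 60.0, 80.0, 120.0])
ff = np.array([0.21, 0.19, 0.18, 0.17, 0.16, 0.15])
ff_err = np.array([0.005, 0.005, 0.006, 0.008, 0.012, 0.020])

grid, nominal, lower, upper = toy_band(pt, ff, ff_err)
```

The split into a shape and a normalization uncertainty discussed above is not part of this sketch; it only produces the band borders.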

Comments from Tyler, on Main AN v7:

Certainly, done!

• I see many great mass plots in the control plots at the end. Can you please try to think of how one could show a mass plot using the ML approach? I know we have talked about biases and sculpting of the shapes in the past but this would still be very interesting to see. What if you did 1 plot each per channel for ggH and qqH categories for reconstructed mass for NN_score > 0.8 or something like that?

Done, as shown in HTT meetings. A version of this will also be added to the PAS.

• Regarding results (section 12), you only have 4 lines discussing results that are not about impact plots. I don’t see any comparison to the final 2016 HTT cut-based results. You should quote the number in https://arxiv.org/abs/1809.03590 for the ggH + qqH analysis (HIG-16-043) after it was updated for the 2016 combination (WG1 uncertainties and NNLOPS ggH reweighting which you also use). The number is expected mu=1.00 +0.25 -0.23. This number is actually identical to your 2016 combined inclusive expected mu value I see in the table. This is certainly worth mentioning even though the analyses have different uncertainty models as well as ZTT treatment and fake modeling.

• Thanks, we will add this. However, are the expected mu numbers also available for the original paper somewhere, plus the mu_VBF and mu_ggH expected values? We could not find them in either of the two papers.

• (Tyler:) I do not see the expected values listed in the papers either. I can work with you to help produce these. I think the most relevant comparison is with the 2016 cut-based analysis updated with WG1 and NNLOPS. I have expected mu for this scenario + VH analysis which gives expected mu_ggH = 1.00 +0.48 -0.43, mu_qqH = 1.00 +0.37 -0.39. There is very little ggH and qqH in the VH portion, so this is probably a good proxy for the moment.

Done.

• l388, I do not understand this statement: “The transverse momentum of the di-tau plus MET system and in 2016 the transverse momentum of the leading tau as well are required to be larger than 50 GeV respectively in order to suppress QCD multijet background. Both cuts are adopted from the previous analysis”. There was no di-tau plus MET system > 50 GeV cut in the 2016 analysis. Can you clarify? No optimization was checked when using the ML approach to loosen leading tau pT back to what is allowed by the triggers? What is the motivation for the “transverse momentum of the di-tau plus MET system” cut? Can you show a distribution of what this looks like pre-cut? This is essentially Higgs_pT?

Removed the statement that this was adopted from the 2016 analysis. The motivation is a very slight increase in sensitivity, and a much better data modeling. Plots are already in the AN appendix (labelled pT(tautau)).

• l379, what is your loss in signal acceptance by choosing the isolated pair before trigger matching and extra lepton vetos? Reading the other channels, I do not see trigger matching listed a bit. In MuTau, if the muon only fires the cross trigger, is the tau required to match the tau HLT object? At what point is this enforced?

I made a small study on a VBF signal sample (tt channel signal selection): first checking how many events have more than one valid pair, and then comparing two approaches, namely only requiring the HLT to have fired vs. requiring a trigger object match to the most isolated pair. The difference in event counts between the two approaches gives an upper bound on the number of events lost by applying the trigger match after selecting the most isolated pair.

If you are interested in the details of how these numbers are calculated, I've put a python script with the input ntuple here: /afs/desy.de/user/a/aakhmets/public/ForTyler/. The output of the script is the following:

Investigating impact of making trigger match after pair selection
At first, have a look at events with only one pair (result should be exactly the same for both methods):
Events with fired HLT for one pair: 14795
Events with the the one isolated pair matched to TOs from HLT: 14301
Resulting efficiency: 0.966610341332
Now, consider events with more than one pair
Events with fired HLT and with several pairs: 235
Events with several pairs and TOs from HLT matched to the most isolated pair: 180
Resulting efficiency: 0.765957446809
Dividing out the overall trigger matching efficiency determined for exactly one pair. Resulting (unfolded) efficiency: 0.792415944726
Potential number of events lost by trigger match on most isolated pair: 53.1635687732
As last step, calculating the overall fraction of events being lost by applying the trigger matching on the isolated pair
This fraction is equal or less than 0.0028997256793
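
The arithmetic behind the quoted output can be reconstructed from the four event counts. This is a sketch that is consistent with the printed numbers, not the actual script; the variable names are hypothetical. The final overall fraction is not reproduced because its denominator (the total event count) does not appear in the output.

```python
# Event counts taken from the script output above.
n_hlt_single, n_matched_single = 14795, 14301   # exactly one pair
n_hlt_multi, n_matched_multi = 235, 180         # more than one pair

# Trigger matching efficiency for events with exactly one pair.
eff_single = n_matched_single / n_hlt_single

# Matching efficiency for the most isolated pair in multi-pair events.
eff_multi = n_matched_multi / n_hlt_multi

# Divide out the overall matching efficiency measured with one pair.
eff_unfolded = eff_multi / eff_single

# Upper bound on events lost by matching only the most isolated pair:
# the unmatched multi-pair events, scaled by the single-pair efficiency.
n_lost = (n_hlt_multi - n_matched_multi) * eff_single

print(eff_single, eff_multi, eff_unfolded, n_lost)
```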

Agreed - the information on trigger matching has been added to the AN. The matching is required for both the muon and the tau, again for the most isolated pair.

• Figure 15, why are the embedded efficiencies so different? The Elec WP90 high eta plot shows the opposite trend of what is shown in data and other MC. Is the cause understood? Figure 17 also shows differences in measured efficiencies for electrons at the high eta region.

This is a known issue of embedded samples, which we are investigating. Since we have a hybrid event based on both simulated objects and objects from data, some low-level reconstruction may go wrong in case the difference in alignment and calibration between data and simulation plays a major role. We tracked this issue down to a refit procedure of the electron trajectory, which reads in this alignment and calibration info to interpret the sub-detector information given in local coordinates. In this specific case, this leads to a mismodeling of the delta-eta between the extrapolated electron track and the supercluster, which enters as an input variable for the new MVA discriminator. We are currently working on a solution for this, which will enter a new version of the embedded samples.

• Figures 21,22, single Mu and single E triggers have very different efficiencies in embedded vs. simulation. Do you know why?

This is a known issue of embedded samples. Currently, the HLT simulation is done on the simulated part of the event only. This involves only the footprints from the two simulated tau leptons. It was found that the vertexing with only two tracks can lead to a bad vertex quality, especially in the z-direction, and the calculation of impact parameters becomes difficult. This could be one of the reasons for a shifted turn-on in pt for the trigger efficiencies. In general, the environment for HLT reconstruction is completely different from the usual use-case, so such differences are not unexpected.

Again, the following work-around is applied and proposed to account for such cases in the currently provided samples: use the efficiency SFs measured for embedding if they look fine and provide a proper data/MC agreement. If not, apply neither the trigger nor the corresponding SF in the problematic region and instead apply the trigger efficiency measured in data.
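
As an illustration of this work-around, a per-event trigger weight could be chosen as in the sketch below. The function and argument names are hypothetical, and the decision whether a region "looks fine" is assumed to be made beforehand and passed in as a flag.

```python
def trigger_weight(passes_sim_trigger, sf_embedded, eff_data, region_ok):
    """Per-event trigger weight for an embedded event (illustrative only).

    passes_sim_trigger: whether the simulated trigger fired for the event
    sf_embedded:        trigger efficiency SF measured for embedding
    eff_data:           trigger efficiency measured in data
    region_ok:          True if the embedding SF is trusted in this region
    """
    if region_ok:
        # Usual case: require the simulated trigger and correct with the SF.
        return sf_embedded if passes_sim_trigger else 0.0
    # Problematic region: ignore the simulated trigger decision and the SF,
    # and weight the event with the efficiency measured in data instead.
    return eff_data

weight_good_region = trigger_weight(True, 0.9, 0.8, region_ok=True)
weight_bad_region = trigger_weight(False, 0.9, 0.8, region_ok=False)
```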

• Section 6.14, I thought ttbar reweighting was no longer recommended in Run-II. Am I wrong? Figure 28, this is after the Z-->mumu mass and transverse momentum reweighting mentioned in section 6.13? Does 2016 show the same improvement?

On Z(pt, mass) reweighting: Both Figure 27 (for 2016) and Figure 28 (for 2017) refer to the reweighting procedure mentioned in Section 6.13. Corrections for both years are validated with Z->ee events and, in the case of the signal regions, an improvement is visible for DY events. For details, please see the 2016 presentation (slides 87-89) and the 2017 presentation related to this correction.

Referring to TopPt reweighting, use-case 3, we've decided to use the Run I reweighting strategy as our analysis-specific reweighting, since we see an improvement by applying this correction (both in 2016 and 2017). Additionally, we have the ttbar category to constrain both the shape and normalization of ttbar further.

• l991 HTT group should abandon the electron 1% in barrel 2.5% in endcap and adopt EGamma POG recommendations (I am guilty of using this in the past). If not for this analysis, this should certainly happen for the full Run-II analysis.

Agreed.

• l996 Was there confirmation from JetMET on the JES eta region groupings proposed by Danny?
Yes (as discussed on the hypernews).

• l1028 MET Recoil correction uncertainties: Is it correct that these replace the MET unclustered energy uncertainties for the W/Z/Higgs events? Can you quantify the improvement in sensitivity from this switch? This will help better compare the 2016 cut-based analysis vs. the ML version.

Correct, MET recoil corrections and the corresponding uncertainties are applied to boson-like events, i.e. W/Z/H events, replacing the MET unclustered energy uncertainty. The amount of shape variation is expected to be smaller than for the unclustered energy uncertainty, since these uncertainties are derived from the recoil corrections themselves, taking the more precise knowledge of the boson 4-momentum into account.

However, a one-to-one study of the uncertainty of the MET recoil correction vs. the MET unclustered energy correction was not performed, and the impact on the analysis sensitivity up to the r constraint has not been studied. We propose to do so after the pre-approval, if such a study is considered necessary.

• l1033, top pt reweighting, again I don’t recall this being recommended.

See above

• l1050-1054, I think the Tau POG tracking efficiency is built into the Tau ID SFs. You are measuring a tracking efficiency for "Higher track reconstruction efficiency due to reconstruction in an empty detector environment" and "Migration effects for 1ProngPi0 due to the footprint of replaced muons". Do you apply the normal Tau POG Tau ID SF on top of this? If so, do you know how much double counting may be happening?

This is an additional uncertainty for embedded samples only. It corrects for differences with respect to data/MC in track reconstruction performed in a clean environment, which is done in case of embedded samples during the simulation step.

We do not use the Tau POG Tau ID SF for embedded. Danny measured (with the tracking efficiency correction applied) independent Tau ID SF for embedded events which we use, these are 1.02 for 2016 and 0.97 for 2017.

• l1096, are you using Barlow-Beeston, or Barlow-Beeston-lite (autoMCStats in Combine 81X)? The Lite version may save you some time on fits, if you are currently using Barlow-Beeston.

We have switched to 81X and the "lite" uncertainties.

• Table 19, there is a degradation of about 10% in expected mu uncertainty comparing the inclusive 2017 mu vs. 2016 mu. Where does this degradation come from considering we had more integrated luminosity in 2017? It looks like the degradation is strongest in tau_h tau_h channel.

We will investigate this more closely - but in particular for tau-tau, the increased trigger requirements more than compensate for the additional lumi in 2017, see e.g. the event counts in the sensitive bins of Fig 68 vs 89.

I agree the change in yields in the significant bins probably leads to a lower significance, but why is this the resulting 2017 shape?

One major effect is the prefiring, which has a stronger effect in 2017 compared to 2016. In general also the data understanding plays a role, reflected by the different sets of variables taken for training. To be investigated further for additional effects.

This is the type of stuff which should be checked before unblinding. I see multiple changes in the entire profile of these shapes. You train on different input variables, that alone could lead to different performance.

If I look at raw yields, you have WAY more 2017 data in your inclusive region. I am comparing Figure 112 Number of b-tagged jets (easy to estimate total) = 12k events in 2016 vs Figure 127 Number of b-tagged jets = 40k events in inclusive. If these are correct, you have something VERY different in the initial selections. (Tau pT cuts?) Can you align the tau pT cuts in 2017 with the 2016 approach and give total inclusive data yields?

These control plots were done without a cut on pt_tt < 50 GeV (see Figure 126, middle left plot), therefore the large difference in event yields. This will be corrected. The analysis on the NN output itself includes the pt_tt < 50 GeV cut already. done

• Fig 104, jet eta only go to 4. Do you keep jets up to 4.7? (same for other jet eta plots). Is there cause for worry for the last +/-0.7 in eta?

Thank you for spotting this! It is just due to a deprecated binning config. Plots in AN are updated with proper range.

This was just a consequence of arbitrary choice in the plotting script. The new draft contains plots going all the way up in eta.

Comments from Cecile, on FF AN v4/v5:

• Now I understand you take the extremities of the orange bands as shape uncertainties on the fake factors as a function of pT (after normalising because the normalization is another uncertainty). I do not think it is correct. Imagine you fit your functions with a line. In that case the correct way to take the uncertainty into account is to have an uncertainty on the intercept (= normalisation) and one on the slope (=shape). But with your technique you would get the normalisation uncertainty correctly, but the shape uncertainty would have a zero effect (the error band extremities are parallel to the nominal line, which means they are equal to the nominal line after you renormalise them). You completely lose the slope effect. Can you please comment?

It is of course true that if you decrease the number of independent contributions in the uncertainty model, by definition you cannot exactly model each case. This is the same with every uncertainty and can only be justified for a specific use case, not in general - similar to the JES uncertainties: to be fully correct, one would always have to use the full model with dozens of specific contributions. For actual analyses it is typically sufficient to consider a subset of them; in fact, it is better to do so, as otherwise one would only become sensitive to fluctuations in the large number of templates.

For our specific case, we convinced ourselves that the current model is sufficient by looking at the pre- and post-fit distributions in various control/validation regions, e.g. the inclusive region and background-enriched categories. In addition, as in past analyses, Danny has this time looked at other background-enriched regions, e.g. W-, tt- or QCD-enhanced regions obtained by modifying cuts, which are to some degree complementary to our background categories. So there is overwhelming evidence that the model works.

For the specific case you give: the uncertainties are almost never parallel, since they increase with larger pt. Thus, in any given variable, the region where higher-pt taus are enhanced gets a larger uncertainty and gives the fit more freedom to adjust - as it should be. If the uncertainties are indeed parallel, it would actually mean that across the range of this variable taus with similar average and spread in pt contribute to each bin, i.e. that each bin samples the FF pt fits in a similar way - in which case there actually should not be any freedom for the fit to adjust the shape, and a norm uncertainty is justified. To fully illustrate this point, we will prepare an alternative uncertainty model which actually addresses this "slope effect" and compare the results.
If the differences (beyond the effect that the alternative model will, in some way, have overly conservative uncertainties which will have some impact where this is not constrained from measurement) are negligible we propose to stick with our model. If not, then we will adjust it accordingly. EDIT: After discussion in the HTT meeting, we have adopted the alternative model.
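
The renormalisation argument can be illustrated with a toy calculation. The following is only a minimal sketch, not the procedure used in the analysis: the pt binning, the linear fake-factor fit and the band widths are hypothetical numbers chosen for illustration. Under these assumptions it shows that a band with constant relative width leaves no shape effect after renormalisation, whereas a band that widens with pt does.

```python
import numpy as np

def renormalised_shape_shift(nominal, shifted):
    """Scale `shifted` to the integral of `nominal`; return the residual ratio minus one."""
    scale = nominal.sum() / shifted.sum()
    return shifted * scale / nominal - 1.0

# Hypothetical tau pt bin centres (GeV) and a hypothetical linear fake-factor fit
pt = np.linspace(30.0, 100.0, 8)
ff_nominal = 0.20 - 0.001 * (pt - 30.0)

# Case 1 (assumption): upper band edge with a constant relative width of 10 %
up_constant = ff_nominal * 1.10
# Case 2 (assumption): relative width growing linearly from 5 % to 20 % with pt
up_growing = ff_nominal * (1.0 + np.linspace(0.05, 0.20, pt.size))

residual_constant = renormalised_shape_shift(ff_nominal, up_constant)
residual_growing = renormalised_shape_shift(ff_nominal, up_growing)

print("constant band:", np.round(residual_constant, 4))
print("growing band: ", np.round(residual_growing, 4))
```

In the second case the residual changes sign between low and high pt, which is the freedom the fit retains after the normalisation part has been split off.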

• l319 of v4: why is the subtraction of true taus not a shape uncertainty?
Because it is a) small and b) only shows a mild shape effect. Note that statistical uncertainties of the true tau subtraction are directly dealt with the same way as every bin-by-bin uncertainty.

• l312: Can you clarify exactly how you derive the uncertainties in the fractions? If you modify the W cross section, all the fractions will change at the same time. How many uncertainties does this correspond to? Also what is the uncertainty on the W (should be more than the cross section because you also have the jet->tauh fake rate uncertainty)?
The uncertainty on the fractions includes both cross section and experimental uncertainties. This adds two NPs. We evaluate it by changing one fraction at a time (and of course adjusting the rest to get to 100%). All this is now better explained in the note (included starting from v7).
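
As an illustration of varying one fraction at a time, here is a minimal sketch. It assumes that the remaining fractions are rescaled proportionally so that the sum stays at 100%; the answer above does not specify how the rest is adjusted, so this choice, the function name and the example numbers are hypothetical.

```python
def vary_fraction(fractions, name, relative_shift):
    """Shift one fraction by a relative amount and rescale the others to keep the sum at 1.

    Assumption: the other fractions keep their relative proportions.
    """
    varied = dict(fractions)
    varied[name] = fractions[name] * (1.0 + relative_shift)
    remainder = 1.0 - varied[name]
    others_total = sum(value for key, value in fractions.items() if key != name)
    for key in fractions:
        if key != name:
            varied[key] = fractions[key] * remainder / others_total
    return varied

# Hypothetical nominal fractions in one bin
nominal = {"qcd": 0.3, "w": 0.5, "tt": 0.2}
w_up = vary_fraction(nominal, "w", +0.10)
print(w_up)
```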

Comments from Cecile, on NN AN v4:

• l76: seems to be in contradiction with l84 where you say you discard all failing variables and do not do further scrutiny

The sentence in l76 emphasises that, due to the nature of the p-value, variables may fail the test and may then still be added to the list of training variables if the agreement is validated manually. Since the crucial variables, most importantly the SVFIT mass, do not fail, we did not invest further work in studying the variables failing the 1D GoF tests and rejected the failing ones.

• l92: what does 6.8 refer to? Is it the total number of expected failing tests and not the number of expected failing tests for m_vis only? It is a bit confusing.

The 6.8 refers to the expectation for all tests of variable pairs with all variables. As explained in l86, with 17 variables in et (2017), we expect (N**2 - N)*0.5*0.05 = 6.8 variable pairs to be below the 5% threshold, and four out of seven involve m_vis. This could be a statistical effect, however we decided to be conservative and remove m_vis from the variable selection in et (2017).
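
The expectation quoted above can be reproduced with a few lines. This is a minimal sketch; the function name is ours, and the per-variable number assumes that every pair has the same 5% chance to fall below the threshold.

```python
def expected_failing_pairs(n_variables, threshold=0.05):
    """Number of unordered variable pairs and the expected count below the p-value threshold."""
    n_pairs = (n_variables**2 - n_variables) // 2
    return n_pairs, n_pairs * threshold

n_pairs, expected_total = expected_failing_pairs(17)
# Each variable is part of N - 1 pairs, so the expectation per variable is:
expected_per_variable = (17 - 1) * 0.05

print(n_pairs, round(expected_total, 1), round(expected_per_variable, 1))
```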

• Fig 4: In practice how do you do the MC subtraction when you train on QCD data events? With negative weights in the training?

For the QCD training, we do not perform any MC subtraction. Indeed, the QCD-enriched sample used for training on QCD may contain events of other background processes. The confusion matrices in appendix B and the plots of the QCD categories in AN-18-255 show that in general the separation works well.

• Tab 4: Please add emu. Where is the single top?

emu has been added. Single top is a separate event class in emu. We have included for emu the mapping of the physical processes to the event classes in the table.

• l151: is it standard to validate with training events?

Yes, this approach is standard. The syntax used here is taken from the machine learning community. Validation means that we monitor during the training the performance of the model, more precisely the convergence of the loss function. These validation events are taken from the training dataset, which is used during training partially to optimize the free parameters and partially to monitor the training. This procedure is indeed described in an understandable way for non-ML people in the PAS/paper draft.

• Appendix B: Why do you do confusion matrices? I understand what they are, but how do you use them? What do you want to check/validate?

We validate with these matrices that the NN has learned its task. These matrices are created on the test dataset, which is drawn independently from the same population as the training dataset, and therefore we use it to validate the multi-class classification performance in detail. E.g., we can see which classes interfere most in the classification task, such as Higgs from gluon-fusion and Z to tau tau. For a random choice among 8 categories the positive predictive value (PPV) (w/o prevalence) would be 0.125. You can see from the corresponding confusion matrices (e.g. etau PPV 0.72(!)) how successful the NN has been in separating out each given process. This information is also given in the PAS/paper draft to allow the reader to assess it.
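
To make the quoted random-choice baseline concrete, the following sketch computes purity (PPV) and efficiency from a confusion matrix. The matrix convention (rows are true classes, columns are assigned categories) and the uniformly filled example matrix are assumptions for illustration, not the matrices of the AN.

```python
import numpy as np

def purity_and_efficiency(confusion):
    """confusion[i, j]: (weighted) events of true class i assigned to category j."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    purity = correct / confusion.sum(axis=0)      # PPV: diagonal over column sums
    efficiency = correct / confusion.sum(axis=1)  # diagonal over row sums
    return purity, efficiency

# Random assignment of 8 equally weighted classes: all cells equally populated
random_matrix = np.full((8, 8), 100.0)
random_purity, random_efficiency = purity_and_efficiency(random_matrix)
print(random_purity)
```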

• Fig 11 bottom: Is it really useful to have a misc category if the purity is only < 10%?

The purity in Fig 11 is indeed only 11%, however, the efficiency is around 30%. This means that we are able to put 30% of this background correctly in this class. The purity is that low, mainly because ZTT events are misclassified as such. This may reduce the constraint on the events, which belong truly to this event class but still we keep the background events successfully out of the signal regions and the “misc” events out of other background categories to improve the constraints there. By construction a "misc" category is not expected to be very pure. Therefore the name...

• Fig 11 center: You have almost as many ggH events in the ggH category as in the zll category.

This is correct. ggH is inherently ambiguous and many events migrate to other classes. This is not unexpected.

• Appendix C: I have no idea about how to read these plots and what kind of information they bring. Can you please add some explanation of why they are useful to have in the documentation?

The plots are referenced in the paragraph in l150-l157. They show the loss function of the training for these two datasets, one used for the gradient steps and one to monitor the performance. We have now added the relevant text from the body also to the beginning of the appendix section. See also your question regarding the overtraining.

• Appendix D: I cannot understand the difference between the top and bottom plots. And what do you mean by “two on the top” in the caption? I also do not understand how to interpret these numbers (except that the higher the better…)

I’m sorry for the confusion with the caption, the text has not been modified correctly. We have corrected this issue. The interpretation of the numbers is discussed in detail in https://arxiv.org/abs/1803.08782. It will also be referred to in the paper/PAS draft and briefly explained in enough detail that a non-expert reader can understand the ranking wherever we refer to it. In short, you can see for each output node of the NN which inputs are the most sensitive ones (to first order). For the figures in appendix D, the absolute value of the first derivative of the respective output node of the NN function is calculated with respect to the inputs. This is performed for each event in the testing dataset and then aggregated by calculating the mean, which is shown in the figures. This means that zero can be read as “no sensitivity”, since the NN response does not change with respect to this variable. However, the absolute value does not have a simple interpretation and it needs to be read relative to the others.
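
The aggregation described above (mean over events of the absolute first derivative) can be sketched as follows. This is an illustration only: the toy function stands in for one NN output node, the derivative is taken with central finite differences rather than with the method of the referenced paper, and all names and numbers are hypothetical.

```python
import numpy as np

def mean_abs_gradient(model, inputs, eps=1e-4):
    """Mean over events of |d model / d x_i|, estimated with central finite differences."""
    n_features = inputs.shape[1]
    scores = np.zeros(n_features)
    for i in range(n_features):
        step = np.zeros(n_features)
        step[i] = eps
        derivative = (model(inputs + step) - model(inputs - step)) / (2.0 * eps)
        scores[i] = np.mean(np.abs(derivative))
    return scores

def toy_output(x):
    # Hypothetical output node: depends strongly on input 0, weakly on input 1, not on input 2
    return np.tanh(2.0 * x[:, 0] + 0.5 * x[:, 1])

rng = np.random.default_rng(0)
events = rng.normal(size=(1000, 3))  # stand-in for standardised test events
scores = mean_abs_gradient(toy_output, events)
print(scores)
```

The score of the unused input is zero, and the other two scores are only meaningful relative to each other, as stated in the answer.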

• Fig 23: Why is nbtag not useful to classify in the tt category (at least not more useful than for the other processes)?

The variable shows its largest sensitivity among classes in the tt class. As you point out the overall sensitivity in this training seems to be very large. In the overall picture you can see that the efficiency (purity) of the tt category is about 80% (40%). Following the study in Fig. 23, the NN seems to achieve this performance mainly by using several inputs, among those njets, m_vis, the (sub-) leading b-jet pt's (which already are an indication of the existence of two b-jets). This case actually shows the kinematic redundancy of variables at work.

• Fig 23: mt_1 seems pretty useless even for w. Is it because of the mt_1<50 requirement applied before? If so it looks to me like this variable could be simply removed (as could be jpt_2, bpt_1, and bpt_2 from this plot)

You are right -- mt_1 is most likely useless because we apply the mt_1<50 cut already in the baseline selection. Indeed, variables not contributing significantly to the classification of any event class could be removed. It should be mentioned though that well-described variables, even if not effective, are also not harmful. According to the discussion above they even add to the redundancy, which can be useful in case other variables are not optimally described in the training samples. Such a pruning process could easily be performed during an ARC review.

• Fig 23: jdeta does not seem to be useful to classify in the qqh category

It is useful even though there are variables that contribute even more to the classification. The explanation is that the information carried by this variable is already contained in another variable, or in a combination of variables (e.g. mjj, jpt1, jpt2), as you can actually nicely see from Fig. 23.

• Most of the above comments apply to Fig 24 too

This is actually reassuring, since it shows that the NN has figured out the similarities between the mutau and etau channels and converges to a similar solution for the event classification. Note that such a conclusion can only be drawn from studies such as those carried out in appendix D.

• Fig 25: Why does m_vis perform better than m_sv for all categories?

Note that m_vis and m_sv are not uncorrelated quantities. Actually, m_vis should already carry the bulk of the information and m_sv should add only little. This may be reflected by your finding. Which of the two correlated variables is picked up as the most contributing one might be a matter of statistical fluctuations. The rankings of all single variables and correlations also have to be viewed in this sense.

• Fig 28: This trend is completely different here where m_sv is much much better. How can you explain this difference between 2016 and 2017?

We do not control the source of information picked up by the NN during training. Since m_vis and m_sv carry to a large part very similar information, small differences e.g. in resolution between the years could change the relative importance of the two variables drastically, in particular if one could to a large degree replace the other. Also note that the meaning of the absolute values is not easily interpretable, so the main conclusion from these matrices regarding m_vis and m_sv is that both contribute very strongly.

• Tab 5: As said before, mt_1 does not seem to be anywhere in the useful variables for etau

Agreed. We are happy to re-consider during ARC review. A variable pruning is simple, but since it involves many calculation and training steps it should be done in a well-monitored way and not be rushed.

• Tab 5: in the ggH column, how can you explain that bpt_2:m_vis is ranked so high?

See the answer to the bpt/m_vis correlation in your questions to the main AN. It should be noted that the ranking implies separability of the given event class from all(!) other event classes, which enter the training with equal weight. In many cases the ranking can be very well understood at least on a qualitative basis, which we think is very helpful to understand what the NN is doing. A deeper understanding of individual cases would require more diagnostics, e.g. checking the separation not from all but from individual other classes, or just inspecting the correlation plot itself in both versions, for the given class against all and against individual other classes. This could easily be done during ARC review.

• Tab 6: m_vis is clearly the best variable (before m_sv) in 2016 but was discarded in 2017 because of bad GOF. Wouldn’t it make sense to work more on getting a good data/MC agreement for m_vis to be used in 2017 too?

Studying the classification performance in the confusion matrices in Fig. 12 and Fig. 15, the resulting classification power of the NN seems very similar. This backs up the line of argument above, that m_sv is an improved version of m_vis and both carry very similar information. In this sense the "loss" of m_vis is backed up by m_sv, which shows another strength of the NN. Of course we will invest some work to better understand the m_vis distribution in this case. This would be a typical task towards the full run-2 publication.

• In general, can there be overtraining with this method? Have you checked if this happens?

You could observe overtraining in the plots shown in appendix C -- this is exactly the purpose of showing them. If the performance of the model on an independent dataset decreases (which is equal to an increasing loss), the model would have overtrained on features only present in the dataset used for the gradient steps. We check this with the training and validation datasets, which are used for the gradient steps and the monitoring, respectively.
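
The check described above can be sketched as a comparison of the two loss curves. The criterion below (validation loss not improving for a number of epochs while the training loss keeps decreasing) and the loss values are hypothetical and only illustrate the idea; they are not the monitoring code of the analysis.

```python
def overtraining_onset(train_loss, val_loss, patience=3):
    """Return the epoch of the best validation loss if overtraining is seen, else None.

    Overtraining is flagged (assumption) when the validation loss has not improved
    for at least `patience` epochs while the training loss kept decreasing.
    """
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    if epochs_since_best >= patience and train_loss[-1] < train_loss[best_epoch]:
        return best_epoch
    return None

# Hypothetical loss curves per epoch
train = [1.00, 0.80, 0.65, 0.55, 0.48, 0.42, 0.37, 0.33]
val_overtrained = [1.05, 0.85, 0.72, 0.66, 0.67, 0.69, 0.72, 0.76]
val_healthy = [1.05, 0.85, 0.72, 0.66, 0.62, 0.59, 0.57, 0.56]

onset_overtrained = overtraining_onset(train, val_overtrained)
onset_healthy = overtraining_onset(train, val_healthy)
print(onset_overtrained, onset_healthy)
```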

Comments from Martin, on NN AN v5:

• General: Most plots look a bit blurry. Have you tried adding them as pdf instead of png? Not really important if it takes a lot of time, but if it can be done easily...

• General: There are a number of terms which can be left unexplained in a ML journal, but would need a reference (paper) or an explanation (AN) in a physics paper. Not overly urgent, but if you can briefly define/explain in the text some of these terms it would help the average reader a lot. Also, if you have references/explanations (as e.g. for Adam) put them (also, multiple references are fine) where they are first mentioned, or at least forward-reference the explanations.
- softmax - dropout - L2 - epoch - Adam - Glorot

• 4: Please add an actual reference to the AN.

• 7: for the fits used to measure observables such as signal strength modifiers or cross sections

• 36: Please clarify in the AN whether you first standardize (using events having sensible values) and then assign the default value of -10, or the other way round (i.e. those events are considered in the standardization). It is implicitly contained in the text but not very clear, in particular since you described standardization first, and then the default values.

• 47: "manual" optimization is weird, in particular without further explanation. You need not explain your procedure necessarily, but remove the "manual" and add wrt what you optimised (mu? loss? confusion?). Since you use the same value for all channels, I assume you are not very sensitive to the choice, anyway - if that is true, please also state it.

• 74/76: could you make this a bit clearer? First you say that these "bad" variables are not used; and then two lines later that "maybe" they are used after all.

• 87: they -> there

• 104, 109: than -> as

• 106: a suitable simulation sample of this process is not available [since it is not true that there are no QCD samples at all]

• 108ff: this is fine, and just as a long-term comment: You could try to, in addition to inverting the charge requirement, also require e.g. tight isolation, to get a sample purer in QCD. Of course it is not clear if this will improve anything since the "impurities" could also improve the training performance.

• Table 2, 3: emu column is empty (of course you know that...)

• 127: "ground truth" - is that a real word? Either call it "true classes" (plural, as it is a vector) or explain the term.

• 129: "probability...to be found in data" -> "proportional to the rate with which this event is expected to be selected" [it is definitely not a probability]

• 162-164: "dropout layer", "dropout probability" and "L2 penalty" need to be explained.

• 187: I assume "both", i.e. the two matrices per channel, refer to the two folds? This should be made clear here and in the figure captions

• 222: should be -> are

• Fig 11ff: Not urgent, but for the next update, could you make sure the labels are the same as in 18-255 (e.g. QCD, not SS)?

Comments from Isobel, on Main AN v8:

• P.41 Is it expected to have more accurate tau reconstruction in the embedded vs. the MC samples (meaning SF closer to 1)?

They are expected to be somewhat different due to the different tau reconstruction in embedded events. It should also be mentioned that an additional correction is applied which covers the effects of the tracking efficiency due to the tracking in an empty detector environment, as well as effects of ECAL remnants after muon cleaning on the pi0-reconstruction. These corrections are between 0.93 and 1.02 depending on the tau DM. If this is multiplied by the embedded Tau ID SF of 0.97, e.g. for 2017, this results in an effective scale-factor applied to embedded events between 0.90±0.04 and 0.99±0.04, as the ID SFs were measured after the tracking corrections. The MC SF are 0.89±0.03, so this is reasonable.
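
The central values of the effective scale factor follow from a simple product, reproduced here as a short sketch using only the numbers quoted above (the uncertainties are not propagated, and the variable names are ours).

```python
tracking_correction_range = (0.93, 1.02)  # quoted range, depending on the tau decay mode
embedded_tau_id_sf_2017 = 0.97            # quoted embedded Tau ID SF for 2017

effective_sf_range = tuple(
    round(correction * embedded_tau_id_sf_2017, 2)
    for correction in tracking_correction_range
)
print(effective_sf_range)
```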

• Fig 27. This plot shows good agreement by construction as the ZPT weights are derived based on the ZPT data/MC distribution (correct?) It would be good to validate the agreement in another region.

You are right -- and this is exactly what the plots should show, since they are meant as closure tests. They demonstrate that the procedure is correct. Control plots in different regions and variables (e.g. MET) have been shown by DESY in the past and showed a clear improvement after reweighting (see slide 21 to 25 of this presentation https://indico.cern.ch/event/762837/contributions/3172618/attachments/1731302/2798220/Recoils_20181010.pdf given in the HTT meeting for a closure test of the corrections in Z->ee events). You can also take the control plots that have been made for this analysis (see e.g. in recent presentations of this analysis in the HTT meeting) and the NN output (see e.g. AN Figs. 61, 67, 82, 88) in the zll category as checks.

• P.56 How can the GoF perform well in a high signal region without biasing the result? One could expect poor data/MC agreement there unless Signal is injected.

Not clear what you are referring to. p=page points to the systematics section in the AN, and GoFs are not discussed there; p=Fig points to a prefit plot w/o GoF. In general, GoFs are applied to single input distributions on the inclusive dataset, or to NN outputs in the background categories and/or with blinded signal. Blinded means either (i) that the signal categories are not part of the fit, (ii) that they are filled with Asimov data, or (iii) (my personally preferred version) that the sensitive bins in the signal are set to 0 in data and in the model while the rest of the bins contribute to the fit. That signal cannot interfere with the GoF at this stage is a logical consequence of CMS blinding policies.

• Concerning the impacts, many of them seem different from the 2016 cut based impacts/pulls. It would be good to understand why.

This can be easily understood with a bit of time on a case-by-case basis. E.g. Cecile brought up the question why Nbtag does not dominantly contribute to the selection of ttbar events. On closer inspection you quickly see that a very good separation can be achieved by exploiting a set of at least 5-6 event variables with nearly equal share. In this way the NN does not rely just on Nbtag and the modelling of this input parameter. This usually then translates into the impacts. If the NN e.g. uses m_sv but also m_vis for signal extraction, MET will play a minor role in the impacts. In short: this is a completely different analysis strategy, which comes with its own sensitivities to its inputs. In general the inputs are more numerous and more diverse, therefore I'd expect the impacts of single uncertainties in general to be smaller than for HIG-16-043. To boil it down: in a quite complex analysis, concrete questions should be followed up on a case-by-case basis. I'm happy to do this with anybody interested in a 10min chat, but I cannot possibly guess every (concrete) question you might come up with.

Comments from Cecile, on Main AN v8:

More info on the next 5 bullets here: https://indico.cern.ch/event/762852/#14-sm-htt-20162017-review-comm

• Synchronization on expected and observed limits by category and by data taking period
Decent agreement, typically at the 10% level, has been achieved.

• Find an algorithm to determine the most optimal binning in each category
Done.

• Solve the constraints of the JES by finding a way to smoothen the shifted templates
Templates have been smoothed, constraints are a bit weaker.

• Get the scale factors approved by the EGamma and Muon POGs
No objections yet in discussions, still following up a few checks requested for pre-approval.

• Align the cuts between 2016 and 2017 (e.g. there is an unjustified difference in the tautau selection with the tau pT)

Cut options have been compared for 2016 and 2017, including a retraining, and no benefit from the pt_1>50 cut was observed. Therefore we now cut at 40 GeV in both eras.

The documentation is missing some important discussions:

• A longer and better justified discussion of all the constrained impacts, and of the reason why the leading impacts are ranked so high

Discussion has been extended.

• A discussion comparing the 2016 and 2017 results (strong differences in emu and tautau, which cannot be understood without explanation)

• A discussion comparing this analysis to HIG-16-043

• A more motivated explanation of the different variables (why is it useful?, role of correlations, role of redundancy, …)

Added a motivation and reference to NN AN to variable validation section

• For each single shape uncertainty, illustrate the effect with a chosen process in a chosen category (histograms nominal/up/down)

This has been added at least for each kind of uncertainty. It would be a bit much for every split like decay mode for example.

In terms of cosmetics:

• All emu plots should be aligned with the others

Done.

• All plots/subplots should have a label with the name of the subcategory

We had discussed this and concluded to label the subclasses in the caption. The reason is that it is difficult to have the labels set automatically and looking nice without interfering with the legend or parts of the histograms in all plots. We will again think about this with respect to the PAS and potentially update the AN later on.

New general comment:

• You should show the visible or full mass distributions for all your final plots (instead of plotting the NN output in each category you plot the mass). This is helpful to understand how much including the mass as a training variable biases the shape, and it also overlaps with Tyler’s comment to see a ditau mass distribution for the signal in a category with high signal NN output. I think it could also bring an additional sensitivity, for example in the zll category, where mvis will most likely provide a better discrimination than the NN output, which looks pretty flat. Depending on how the plots look like you can remake the fit and see if using the mass in some/all the categories can improve your results.

Done (as shown in the HTT meeting, see link above) - for now a validation plot, showing that the NN behaves as expected. For the final PAS we will also prepare a usual mass plot weighted by some function of the expected S/B.

And a few detailed comments on the second half of the AN:

• l1066: I don’t think “inexplicably” is the good term…

reformulated

• l1068 and elsewhere: don’t forget the single top

• l1078: how does the extrapolation of 1.5% give 2%?

This is meant to be conservative. reformulated

• l1080: your explanation of the uncorrelation is in contradiction with what you do for the tau ID (where the discriminators have also changed)

The tau ID treatment follows a Tau POG recommendation, and since it deals with real taus the role of the anti-lepton discriminators is less important than here. Here we have our own measurements and the conservative way is to uncorrelate.

• l1138-1140: I do not understand at all this sentence, and how many nuisances you add at the end
Clarified in the text. There are two nuisances added.
• l1112: are all of them shape uncertainties?
Yes, except for the last, which was in the wrong subsection. Moved to the normalization subsection.
• l1142: why 10%
This is a rough conservative estimate combining effects from cross section uncertainties (about 5% for most processes), tauID uncertainty and other uncertainties affecting the yield.
• l1151: what is the fourth?
The fourth shape uncertainty describes uncertainties in the extrapolation of the correction factors from the anti-isolated to the isolated region. The text in the AN has been changed.
• l1157: does that mean you have switched to autoMCstat?
No, we have not switched to the Barlow-Beeston-lite approach ("autoMCstat"), but are still using the bin-by-bin factor and the regular Barlow-Beeston approach. autoMCstat is not available in the release we are using.
• l1169: why not partially uncorrelated as for the tauh?
• l1193: they should be added in quadrature because they are different objects

Right, updated this

• l1211: from the ref it looks like 3% instead. please check

The diboson processes have uncertainties between 3-5%. To be conservative, we use a value of 5%, as was done for previous SM and MSSM HTT papers.

• l1217: I cannot see this number in the reference twiki

Indeed, this is not on the twiki. As in previous analyses, we conservatively use the same number as for the QCD Z process.

• l1232: there should be a correlation by process (I understand the composition is different, but you should just scale the uncertainty per process by the fraction of this process and correlate between categories)
The correlation is weak and splitting the uncertainties in correlated and uncorrelated part has a negligible impact while complicating the uncertainty model and fit substantially.
• l1239: the problem does not come from the tools but from the incorrect way with which you define the shape variations (see comments on FF AN too)
Reformulated. Every uncertainty model used in CMS analyses groups degrees of freedom, so this is common and necessary practice and we do not consider it incorrect.
• Fig 46: the ditau mass fails the GOF but is included in the input variables

The corresponding features are ranked below 25 in the feature ranking, so this is acceptable. Added comment to AN

• l1397: this statement is discussable given the number of constraints you still have …

Rephrased %DONE%

• Sections 12 and 13 need much more explanations and discussions

We have added a bit more discussion, but will add even more once we have digested the final results. %DONE%

• l1399: the constraint is most likely artificial and should be resolved

Yes, comment in that spirit added to the AN. %DONE%

• l1400: the fact that there is a ttbar control region does not explain why the xs uncertainty is constrained and why it is acceptable

Agreed, rephrased. %DONE%

• Impacts: why is impact81 split by category? why is the met unclustered energy scale so constrained?

impact81: This describes the systematic uncertainty on the subtraction of true taus from the fakes background and is only very weakly correlated with other categories. Since the impact is small, and it is not constrained, it would make the fit a lot more complex to model the weak correlations without affecting the end result. met_unclustered: we will follow this up, but since the impact is so small a potential different treatment will not change the results. %DONE%

Comments from Cecile, on PAS v0:

Title: This is not the first measurement. ATLAS already has STXS results for HTT.

That's a pity. The title has been changed.

Abstract line 2: I don’t see any measurement of the branching fraction in the result section

The measurement is on cross section times branching fraction. This is what is stated in the abstract.

l62-66: you can remove the details about photons

While in principle yes, photons play a role in the definition of tau had decay modes. A very similar description has been used for HIG-17-020, which went smoothly through the whole publication process and journal review.

l120 and around: CVS -> CSV

Fixed.

l139-147: the WP are different in 2016 and 2017

Added statement that this is for 2016, while for 2017 the WPs have been reoptimized, with a similar performance (no exact numbers, as there is no reference for it).

l163-164: This sentence is strange

Modified to potentially read less strange.

Section 4.2: I do not see where a ttbar-> mu tau_h event would enter in your 3 groups (no fake, but not two taus). Also, l 270, I think embedded samples cover only events with two tau decays, and not with at least one

The events you are referring to would naturally fall into the first event class (l253-l259 electron or muon misinterpreted as originating from a tau decay). L269 has been corrected to refer to event with two genuine tau decays.

l298: There is no mention of the subtraction of prompt contributions

Not sure what you are referring to as "prompt". If you are referring to the subtraction of backgrounds other than QCD, Wjet and ttbar, this is mentioned in l290f of PAS v0.

Table 2: It is not clear which thresholds are for single lepton triggers and which for cross triggers (the tauh cut seems to apply in all cases)

Your observation is correct. The "cross triggers" are used to lower the electron or muon pt thresholds as explained in the text (l313-316 in v0 of the PAS). A clear explanation of what values refer to what triggers is given in the caption of the table.

l360: I thought the tauh pT cuts differed between 2016 and 2017

It was a triviality to harmonize this in the meantime (I think even on a well justified earlier request during review).

Table 3: I do not understand this table and what PPV means

Event classes are introduced in the text (l375ff). For what a positive predictive value is in statistics, have a look e.g. here.

l464: You should explain what is meant by mvis vs mvis correlation

Not clear what needs to be explained here. We can add something on the physics interpretation for the reader's convenience.

l 517-538: I had not understood before that you make 2 different fits, one for the inclusive cross sections, and one for the STXS. If this is the case it should be explained clearly in the AN too, and you should justify it. It looks very strange to me to have two different fits since you could obtain both results from a single fit to the STXS categories, and that would most likely also improve your inclusive results. Please check the difference in sensitivity by extracting the inclusive results from the STXS fit.

Indeed the same event category is used for each signal model. The text has been corrected accordingly.

l597: I think 0.8% is not for all decay modes in 2017

Indeed, it is 0.8-1.0%, depending on decay mode -- this is what is already described in the AN and what we are using. Corrected now also in the PAS.

l604: I doubt the tau energy scale still covers the electron energy scale since its prefit uncertainty is much smaller now

Electron energy scale uncertainty has been added.

l637: ZH, WH, and ttH should be part of the inclusive measurement

The formulation has been fixed now to refer only to the stxs stage 0 and 1 measurements.

l633-635: What are these additional uncertainties? I don’t remember seeing that in the AN

This statement has been removed from the text.

Section 8: You should include a table for the definition of the STXS categories instead of the graphs, which do not give precise information and show categories that you do not include.

All available information is part of the figures. More information is given in the text (l549ff in PAS v0). A table cannot add more information and will be more difficult to layout/read.

Section 8: It would be useful to present the STXS results in figures (sigma/sigma_th for all categories)

Agreed the representation of the results is not yet settled. We are happy to work on a proposal towards pre-approval and during ARC review. We will include your suggestions in this proposal.

Section 8 or as additional material: It would be useful to have a matrix that shows the purity of each STXS category in the different STXS signals (see Figure 6 of HIG-18-029 for example)

Agreed. We will add this information during ARC review.

References: Please extend them. At the very least you should include ATLAS 2016 paper, which has STXS measurements, and HIG-16-043, which is the origin of many of the cuts and corrections presented in this PAS.

We have added ATLAS and HIG-16-043 to the references.

Datacards should be checked and approved by the combination group

Done

Fill in the HIG JetMET: https://twiki.cern.ch/twiki/bin/viewauth/CMS/HIGJetMET

Done, reply see below

Include the signal samples: bbH, ggZH, ttH signals, and H \to WW decay. For bbH and ggZH, which are not centrally produced, estimate them from the ggH and qqZH samples and check with a small-scale private production

Done (for details, see mail to hypernews)

Slide48 \mu(qqH) -> \mu_V vs. \mu_F

To be done after unblinding with observed data.

The analysis is not sensitive to VH, so the only thing that can be done is tying them to the SM or tying them to another process. Include VH hadronic together with VBF in the plots and explain it in the text and captions/labels.

VH hadronic is now combined with VBF. VH leptonic is tied to the SM expectation

On the way to approval, prepare a PR plot to be added to the PAS

We are working on it, will be added before approval.

One of the W+gamma MC samples used is bugged, replace it

Done

Finish your TODO list : synchronization, scale uncertainty

Done

how do you choose the final binning of the DNN distribution?

The DNN score is in the range from 1 / (number of classes) to 1, e.g., from 0.2 to 1.0 in the fully hadronic channel. We make the input shapes for Combine Harvester with a bin width of 0.05 in the sensitive regions with high neural net score. Finally, we merge bins with no background contribution from right to left. The full procedure is implemented in the CombineHarvester dev branch here [0].
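
As a rough illustration of the binning procedure described above, the following Python sketch builds score bin edges and merges bins without background from right to left. It is not the CombineHarvester implementation referenced in [0]: the function names are hypothetical, and both the use of 0.05-wide bins over the full score range (with any remainder absorbed by a narrower first bin) and the rule of merging an empty bin into its left neighbour are assumptions made for the example.

```python
def score_bin_edges(n_classes, width=0.05):
    """Bin edges from 1/n_classes up to 1.0 (assumed layout, see text above)."""
    lower = 1.0 / n_classes
    # Number of full-width bins that fit below 1.0
    n_steps = int((1.0 - lower) / width + 1e-9)
    edges = [round(1.0 - k * width, 10) for k in range(n_steps, -1, -1)]
    if edges[0] - lower > 1e-9:
        # Remainder of the score range goes into a first, narrower bin
        edges.insert(0, lower)
    return edges


def merge_empty_background_bins(edges, background):
    """Scan from right to left and merge every bin without background
    into its left neighbour. The leftmost bin is never merged."""
    edges = list(edges)
    background = list(background)
    i = len(background) - 1
    while i > 0:
        if background[i] <= 0.0:
            background[i - 1] += background[i]
            del background[i]
            del edges[i]  # drop the boundary between bin i-1 and bin i
        i -= 1
    return edges, background


if __name__ == "__main__":
    # Fully hadronic channel: 5 classes, so scores range from 0.2 to 1.0
    print(score_bin_edges(5))
    # Toy background yields: the two highest-score bins are empty
    print(merge_empty_background_bins([0.0, 0.25, 0.5, 0.75, 1.0],
                                      [5.0, 2.0, 0.0, 0.0]))
```

In the toy call, the two empty high-score bins are absorbed into the bin to their left, so the last remaining bin extends up to a score of 1.0.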

why there is some deviation from 1 in the Asimov dataset (e.g. Fig. 3 of PAS)

The definition of our Asimov dataset is signal plus background expectation. Since we stack only the backgrounds in the upper part of the plots, the deviation from the Asimov data is equal to the signal expectation. Same applies for the ratio plot and the deviations from 1.
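
The statement about the ratio panel can be checked with a few lines of Python. This is only a sketch with made-up per-bin yields (the numbers below are illustrative, not taken from the analysis): with the Asimov dataset defined as signal plus background and only the background stacked, the ratio in each bin is 1 + s/b.

```python
def asimov_ratio(signal, background):
    """Ratio of the Asimov dataset (s + b) to the stacked background b, per bin."""
    return [(s + b) / b for s, b in zip(signal, background)]


if __name__ == "__main__":
    # Illustrative yields only: a background-dominated and a signal-enriched bin
    signal = [1.0, 5.0]
    background = [100.0, 50.0]
    for s, b, r in zip(signal, background, asimov_ratio(signal, background)):
        print(f"s={s}, b={b}, ratio={r:.3f}, s/b={s / b:.3f}")
```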

In principle, it would be better that the optimization samples are also stat. orthogonal to the ones used for signal extraction. (this is unclear from reply in sec 3.11)

I'll clarify our approach in the following. We are using a two-fold approach for the optimization and signal extraction. We divide the full dataset in half, train the NN on one half and apply it to the other half for the signal extraction. For the second half, we do the same vice versa with an independent training. Therefore, the events used for the optimization of the NN are statistically independent of the ones used for the fitting of the signal.
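
The two-fold scheme can be sketched in Python as follows. This is a minimal illustration, not the analysis code: the split by event-number parity, the event dictionary layout, and the `toy_train` stand-in for the NN training are all assumptions introduced for the example.

```python
def two_fold_apply(events, train, fold_of=lambda ev: ev["event"] % 2):
    """Train on one half, apply to the other half, and vice versa.

    Returns (event number, fold, score) for every event, where the score
    always comes from the model that did NOT see that event in training.
    """
    folds = {0: [], 1: []}
    for ev in events:
        folds[fold_of(ev)].append(ev)
    scored = []
    for apply_fold in (0, 1):
        # Independent training on the complementary half
        model = train(folds[1 - apply_fold])
        for ev in folds[apply_fold]:
            scored.append((ev["event"], apply_fold, model(ev)))
    return scored


def toy_train(training_events):
    """Hypothetical stand-in for the NN training: a threshold on one feature."""
    seen = frozenset(ev["event"] for ev in training_events)
    mean_x = sum(ev["x"] for ev in training_events) / len(training_events)

    def model(ev):
        if ev["event"] in seen:
            raise ValueError("event was used in the training of this model")
        return 1.0 if ev["x"] > mean_x else 0.0

    return model


if __name__ == "__main__":
    events = [{"event": i, "x": float(i % 7)} for i in range(20)]
    for entry in two_fold_apply(events, toy_train):
        print(entry)
```

Every event receives exactly one score, so the full dataset remains available for the signal extraction while no event is evaluated by a model trained on it.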

About the JECs correlation between 2016 and 2017. Can you elaborate a bit more about the value that you are using? Are you aware of this recommendations: https://twiki.cern.ch/twiki/bin/view/CMS/JECUncertaintySources#2016_2017_JEC_uncertainty_correl

We can try to quickly tick off the question about the treatment of JES. The procedure that we are following had been presented in the JetMET meeting and blessed for usage for a preliminary result, during the follow-up of the JetMET conveners (added in cc to this post, so that they can confirm). The JetMET conveners' main concern was about the anticipated treatment in a legacy combination with other Higgs analyses. This concern is well appreciated and shared by us, but since we are talking about a PAS and not the legacy paper it does not apply to the current case. This has also been the explicit understanding of the JetMET conveners. We will carefully reevaluate the JetMET uncertainty model for any legacy results and follow the recommendations given by JetMET, and any eventual modification that might be agreed upon in a coordinated way within the Higgs PAG.

About the JER scale factors. Can you elaborate why are they not applied? Are you aware of this recommendations: https://twiki.cern.ch/twiki/bin/view/CMS/JetResolution#JER_Scaling_factors_and_Uncertai

On the JER uncertainties: as Alexei stated in an earlier mail, we inherited and further developed parts of the uncertainty model from the published HIG-16-043, where these uncertainties were not considered. In the course of this, the check that these uncertainties can be neglected in the face of the impact of the JES uncertainties slipped through. That they can be neglected coincides with our intuition, which is why we took it for granted that such a check had been made in the past. Further investigations so far seem to indicate that nobody has explicitly checked this up to now. So we are preparing a quick study to confirm our intuition. We will come back to answer this question once the results of this study are available. For a legacy measurement everything will be put back on the table and we'll take care of better documenting what has been done and what not.

First, one clarification: the stage0 and inclusive fits are done from the same categorized signal regions of stage1, or from fitting the inclusive ggH and qqH signal regions?

They are done from the same categorized signal regions of stage1

For the stage0 and inclusive fit, there is a sizable difference on the 2016 dataset compared to HIG-16-043, especially in the non-VBF categories (mu changes from being ~0.3 sigma above unity to being ~1.5 sigma below unity), possibly from the lepton-tau categories (mu-tau and e-tau both change of ~1 sigma, while e-mu and tau-tau are more stable). While changes are not unexpected, it may be useful to get a bit more insight on this. * In this analysis, the 2016 result is from the fit of 2016 alone or from the combined 2016+2017 fit with separate POIs?

The 2016 result is from the 2016 datacards only

* Can you fit separately ggH and qqH in the old cards (HIG-16-043), split by decay mode and combined, in a way that maps better to what you have shown in the current unblinding slides?

For the stage1 fit, it appears that the fit is behaving poorly probably due to the correlation between the parameters (e.g. the qqH vs ggH in the VBF topology bins). Probably you have to either tie some bins together in the fit or fix them to SM predictions within uncertainties in order to get a more meaningful result. It would be good to see the fit result and the correlation matrix of the POIs in some more restricted scenarios. Some suggestions could be: * tie together the pT 120-200 and >200 since you have merged the bins in most final states (and possibly also 1j + 2j?), or tie together pTH > 200 for 1j and 2j (it will happen anyway in stage 1.1) * fix the ggH VBF-like events to SM, or make a single POI for ggH 2j (except possibly pT > 200, or pT > 120)

to be checked

Comments from Andrei, on AN v10 and PAS v2:

The snapshots of PAS v3 and AN v11 are added in this HN post link

The STXS includes ggH, VBF, VH (including gg->ZH), ttH, bbH, tqH (see yellow report), with various sub-bins. While this analysis is clearly targeted to ggH, VBF, VH(hadronic), all other production modes need to be included or shown to have zero contribution. When contribution is not zero, overlap with other dedicated analyses should also be addressed. When you run the fit, it should be clear what is done with each component, if it is floated as signal or background, or fixed. More on this below.

You can find the information on which production modes are considered for the fit, and how, already in PAS v2 L530-535. We will check whether the information given there is complete and clear enough for everybody to understand what is being done, and update the PAS if considered necessary in further versions during the ARC review.

What I see in Tables 2 and 3 (in AN) with MC samples, there are still missing gg->ZH, bbH, tqH MC samples.

As you know there are no official samples available for gg->ZH, bbH or tqH. The (non-)importance of these production modes for the current analysis has already been studied in detail on privately produced samples or using MSSM bbH Monte Carlo, as a prerequisite to the official pre-approval by the Higgs PAG conveners. As a reminder you can find the discussion posted to the HN here: link. For your convenience we summarize the results of the studies that have been made: i) the effect of gg->ZH is 1.6% of the combination of VBF with all ZH production modes (excluding gg->ZH), with no significant dependency on the stxs stage-1 bins, when compared to the expected sensitivity of the analysis in each corresponding bin; the cross sections of the corresponding production modes have been scaled up inclusively by 1.6%. ii) the effect of bbH is at most 1% of the ggH production cross section (with an expected uncertainty on mu_ggH of 0.2) w/o dependency on the stxs stage-1 bins. The ggH cross section has been scaled up inclusively by 1%.

The production cross section of tqH is roughly one order of magnitude below the ttH production cross section (see link). The expectation of ttH events after the baseline selection of the analysis is ≤1 event in basically all event categories. Explicit numbers for the signal categories in the mt channel are ggH: 0.000 (0.003) and qqH: 1.05 (0.99) in 2016 (2017). Note that these are absolute event numbers. The tqH production mode has been neglected for the current analysis. A corresponding clarification has been added to the captions of Tables 2 and 3 and in the text in AN v11. For the legacy paper the importance of these samples for the analysis in the tautau final state will be re-discussed within the HTT subgroup.

I expected to see a table of the relevant contribution of various STXS-1 input modes contributing to different categories adopted in analysis. Just to give you an example from the HZZ pre-approval today, look at slide 23 in [1]. You promised to show explicit numbers for ttH and other such contributions, when we talked on Wednesday.

We will add such a table to the AN and, if considered interesting during the ARC review, also to the PAS. Maybe with less granularity than presented in the example that you are pointing to, since we see most of the entries empty or plain zero. (What is the difference between empty and plain zero, by the way?) For the expected ttH event yield, see our answer to one of your previous questions.

In the PAS, I see tables 5 and 6, but they are completely empty (why?), and even when those numbers are filled, this would be only partial information. I understand that there is limited space in the PAS, but AN has no such restriction.

These tables have been filled with numbers (after unblinding) in PAS (v3). For completeness the same tables have also been added to AN (v11). Note that tables 5 and 6 in the PAS refer to the 2016 dataset, while the numbers for the 2017 dataset are given in tables 9 and 10. Let us know what you still need as a more complete information for the public. For CMS internal review or more specific questions you are invited to check the complete statistical model that has been uploaded to the corresponding HCG repository (link).

The main result is the fit of mu(qqH) vs mu(ggH), as well as the combined mu. However, from the documentation (Results section in the PAS for example) it is absolutely not clear what is signal, what is background, and what is fixed for H(125). At the HTT meeting you said (please correct me if I am wrong) that ttH is fixed to SM, V(leptonic)H is treated as background, and V(hadronic)H is merged with VBF into the qqH signal. Even if there is not much of a practical difference due to the dominant contribution of ggH and VBF, this needs to be done in a clean way:

(1) spell everything out in the documentation

We appreciate your comment and agree that this should be done in a cleaner way in the documentation. We have added the corresponding clarifications in PAS (v3 L707-716) and AN (v11) (Sections 2 and 12).

(2) when we quote 4.7 sigma H->TT signal, all H(125) contributions should be treated as signal, such as ttH, bbH, tqH, V(leptonic+hadronic)H, VBF, ggH.

The result is the same. A corresponding sentence has been added to PAS (v3 L707-716).

(3) when a 2D scan of mu(qqH) vs mu(ggH) is done, it makes sense to merge all HVV-initiated processes (VBF+VH) into mu_HVV (or mu_V) and all fermion-initiated processes (ggH, bbH, ttH, tqH) into mu_Hff (or mu_f). Can you move to this presentation of results instead of the current Fig.7 content?

This is already what we do, as stated in the caption of Fig. 7. The axis labels have been changed to make it clear to everybody also from the plot itself (w/o caption) what is being done.

In the HTT report on Wednesday, you had a nice split of fit results between 2016 and 2017, and we discussed that you would approach compatibility comparison with the previous published results on the 2016 dataset.

I do not find any discussion of comparison on the 2016 dataset. To make the proper comparison, one would have to estimate overlap between the two analyses (both in terms of events and in terms of the likelihood fit). This is close to impossible in a practical way (otherwise you would have to run a lot of toy MC with both analyses at the same time). However, it is possible to compare under the two extreme models with 0% and 100% overlap just to see the bounds, and you can also estimate the event overlap in the signal region with the same selection, to get another bound on overlap. However, it is a must to compare to the previous published result in some practical way.

A study of the compatibility between the results of HIG-16-043 and HIG-18-032, including a feasible check of the event overlap is in preparation and will be provided for discussion, soon.

The reported signal is 4.7 sigma. At the HTT meeting you promised to provide plots which should show this unambiguous signal in several ways. There are many plots in AN and PAS, but I do not think you have provided the plots which you promised on Wednesday. I also do not think any of the existing plots illustrate the re-discovered signal with ~5 sigma.

As discussed repeatedly and in many places, such plots will be prepared on the way to approval. We understand that you are eager to see such plots. But we stick to a given order with our work, starting with important items (e.g. related to data understanding) up to presentation items. We think that organizing our work this way makes more sense. As soon as we converge with the ARC on the plots, we will provide them for further discussion to the whole collaboration.

Comments from the ARC v1, Part I: Physics

Answers can be found here.

Comments from the ARC v1, Part II: Text

• Line 75-76 (G.G.-C.) Regarding the HLT output, don't write below 1kHz, that's the old run-I value, we were writing above 2-3kHz in the run-II, and even more in 2018 with the B-parking. (G.G.-C.)
Thanks - corrected to "a few kHz"

• l198: the share -> they share (A.-C.L.)
Thanks - corrected.

• l227: it sounds a bit like no genuine pTmiss can be defined for ttbar (A.-C.L.)
Removed the reference to genuine pTmiss (the idea behind the half-sentence was that here the genuine ptmiss can also be defined in terms of a boson against a hadronic recoil).

• l229: Section3 -> Section 3 (A.-C.L.)
Thanks - corrected.

• l236: is the energy scale for electrons misidentified as taus applied for all samples ? how important is that correction ? (A.-C.L.)
It is only applied to Z->ee events - clarified in the text now. This correction is actually very important for the e-tau channel, in which MC could not model the data without it.

• l295-299: maybe good to drop "used for the NN training", it makes it a bit difficult to read, two ideas in one sentence, i.e. the cross check with simulation, and the validity of NN training samples (A.-C.L.)
Done.

• l204: the end of the sentence is not clear, "from a DR with 0.2< I_rel^mu< 0.5" (A.-C.L.)
Rephrased - "in a region with inverted muon isolation requirements".

• l372: not clear to me at this stage what are the 3 to 6 background categories (A.-C.L.)
Changed to "several". The details are not important at this point and are explained in the following subsection and its table.

• l377: four NN in total in total for each year --> four NN per year of data-taking (A.-C.L.)
Done.

• l426: the superscripts "fit" need to be explained (A.-C.L.)
It referred to the method, "SVFit". However, it does not actually serve any purpose and has been removed.

• l434-447: one could maybe also stress that these tests ensure a robust classification in data (expected to be the same as in simulation) (A.-C.L.)

• l448- 478: interesting but not clear to me if it allows to spot problems or how it is used to cross check the stability of the analysis (A.-C.L.)
The purpose of these specific studies described between those lines is just to identify the most important variables, and not to spot problems or for stability reasons.

• Table 3 and l484 define PPV, remove it from the table. Nevertheless, can you explain more clearly what it is? (G.G.-C.)
Removed the definition from the table (but not using the abbreviation, since the table appears before its reference and this could be confusing). The PPV explanation has been expanded: "The PPV for a given class is defined as the number of events of that class associated to the correct category, divided by the total number of events assigned to this category. Uniform prevalence gives the same statistical weight to all event classes by normalizing them to each contribute the same total number of events prior to the classification."
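
The PPV definition quoted above can be illustrated with a short Python sketch. The function name, the layout `confusion[true_class][assigned_category]`, the assumption of exactly one category per class, and the toy counts are all introduced here for illustration and are not taken from the analysis.

```python
def ppv(confusion, uniform_prevalence=True):
    """Positive predictive value per class.

    confusion[true_class][assigned_category] holds event counts. With
    uniform prevalence, every true class is first normalized to the same
    total number of events.
    """
    classes = list(confusion)
    if uniform_prevalence:
        weights = {t: 1.0 / sum(confusion[t].values()) for t in classes}
    else:
        weights = {t: 1.0 for t in classes}
    result = {}
    for c in classes:
        # All (weighted) events assigned to category c, from any true class
        assigned = sum(weights[t] * confusion[t].get(c, 0.0) for t in classes)
        correct = weights[c] * confusion[c].get(c, 0.0)
        result[c] = correct / assigned if assigned > 0 else float("nan")
    return result


if __name__ == "__main__":
    # Toy counts: a rare "sig" class and an abundant "bkg" class
    confusion = {
        "sig": {"sig": 80.0, "bkg": 20.0},
        "bkg": {"sig": 100.0, "bkg": 900.0},
    }
    print(ppv(confusion, uniform_prevalence=False))
    print(ppv(confusion, uniform_prevalence=True))
```

In the toy example the PPV of the rare class is dominated by the abundant class when raw counts are used, and rises once both classes are normalized to the same total, which is the effect the uniform prevalence is meant to achieve.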

• l551-553: POIs probably need to be introduced (A.-C.L.)
Thanks - done.

• l538-540: One could maybe say here that the information from simulation is used to build templates (A.-C.L.)
Since here only the event-by-event classification is discussed it would lead to confusion to introduce templates here, so we prefer to leave it as it is.

• l550: to what refer "event classes" here ? (A.-C.L.)
Since here only the event-by-event classification is discussed it would lead to confusion to introduce templates here, so we prefer to leave it as it is.

• l558-559 : "Each event class in this way leads to a dedicated category", same for caption of Figure 6. (A.-C.L.) --> does this mean that the NN output classification works well and that then one can further categorize ? (Just trying to check if I'm following correctly ;-))
This just means that typically, a class (defined on gen-level) corresponds to a category (defined on reco-level), with the exception explained in the following.

• Figure 5, caption : "with the exception of the >=2 jet classes with H->qq topology". lines 560-561 also mention that it's different for emu etauh and mutauh pTH>200 GeV categories. (A.-C.L.)
Added also to the figure.

• line 565-563: "This is the case... category." (A.-C.L.) -> The NN output naturally leads to an association of such events to the qqH event class. (A.-C.L.) or --> The NN output naturally leads to an association of such events to the qqH categories (A.-C.L.) depicted in Figure 6. ( Else one maybe tries to identify the qqH "inclusive" category on figures 5 and 6.)
Done.

• l564 : Don't we have 16 signal categories for tauh tauh ? And then 14+14+14+16+21 = 79 in total per dataset ? (A.-C.L.)
Thanks - in fact 14 for tauh tauh, and 12 for the rest, so 50 signal + 21 bkg = 71 in total. Corrected.

• l715-716 Concerning the quoted cross-sections (G.G.-C.) I don't understand those values. Following the latest HXSWG recommendations, sigmaVBFxBR(Htautau)->3.766*0.06272=0.236pb. Nevertheless, you quote 0.32pb for signal strength 1.03 (?). For ggH, it's almost in agreement, the value is 48.61*0.06272=3.05, while you quote 1.11/0.36=3.08. For the inclusive cross section, does it mean fully inclusive or is it just ggH and VBF. If it's the second one should get (48.61+3.766)*0.06272*0.75=2.46, you quote 2.54. Can you check? And needless to say, can you check you are using the right XS values in the analysis?
The qqH cross section includes also V(->qq)H, and ggH also bbH/ttH. If you add these numbers then you exactly reproduce the HXSWG numbers (we tried it...). Similarly, the inclusive cross section includes ggH, VBF, VH, and bbH/ttH. While this is stated in the paper in several places we agree it is confusing. We welcome any suggestions to make this clearer, one (ugly) way would be to always quote all processes, even in subscripts of variables, instead of using qqH both for VBF and V(->qq)H.
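
For reference, the arithmetic in the comment above can be reproduced with the following Python sketch. It uses only the numbers quoted in the comment (cross sections in pb, the branching fraction, and the factor 0.75). The V(->qq)H, bbH and ttH contributions mentioned in the answer are not included, since their values are not given here, so the printed numbers correspond to the referee's calculation and not to the quoted results.

```python
# Values as quoted in the referee comment above
BR_HTT = 0.06272    # branching fraction H -> tautau
XS_GGH_PB = 48.61   # ggH production cross section in pb
XS_VBF_PB = 3.766   # VBF production cross section in pb


def xs_times_br(xs_pb, br=BR_HTT):
    """Cross section times branching fraction in pb."""
    return xs_pb * br


if __name__ == "__main__":
    vbf = xs_times_br(XS_VBF_PB)
    ggh = xs_times_br(XS_GGH_PB)
    # Factor 0.75 as used in the comment for the inclusive value
    inclusive = 0.75 * xs_times_br(XS_GGH_PB + XS_VBF_PB)
    print(f"VBF only:      {vbf:.3f} pb")
    print(f"ggH only:      {ggh:.2f} pb")
    print(f"ggH+VBF x0.75: {inclusive:.2f} pb")
    # Ratio used in the comment to compare with the quoted ggH result
    print(f"1.11 / 0.36:   {1.11 / 0.36:.2f} pb")
```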

Comments from the ARC v2

• General: where appropriate (e.g. line 741 and 742)
change qq -> H into VBF + V(qq)H
change gg -> H into gg -> H, bbH, ttH

Done

• Abstract: Change the first sentence. Make it simple and punchy. Not clear what the “They” in the second sentence refers to. You also have “A measurement” which becomes “The measurements”. Centre should be written “centre” (see discovery paper). Also we are missing a result in the abstract. Suggestion: re-write the abstract. Suggestion:
A measurement of the inclusive cross section $\sigma_{incl}$ for the production of a Higgs (H) boson decaying in a pair of tau leptons, $\beta {\rm H} \rightarrow \tau \tau$, is presented. The measurement is based on the data collected with the CMS experiment in 2016 and 2017 which corresponds to integrated luminosity of 78 fb$^{-1}$ with pp collisions at a centre-of-mass of 13 TeV. A value of $\sigma_{incl} \beta {\rm H} \rightarrow \tau \tau = 2.54 \pm 0.47 {\rm (stat.)} \pm 0.34 {\rm (syst.)}$ pb is obtained. More results in the form of simplified template cross section with a splitting by production mode and kinematic regimes are also provided.

Thanks for the suggestion. The abstract has been updated

• Line 2-5: No measurement is ever “exact” change to “precise” Also simplify sentence and avoid too many “kinematic properties” in the same paragraph. Suggestion line 3 to 5:
After the discovery of the Higgs (H) boson at [.], one of the main targets of the experiments at the LHC is the precise measurement of the H boson production kinematics, as input to a detailed analysis of its coupling structure.

Thanks - Has been changed

• Line 12: define tau ($\tau) ounce forever: like the tau ($\tau$) lepton. Done • Line 13-14: Not should set in, we don’t know that. Also better merge in one sentence: The size of these deviations depends on the scale at which new physics could set in as well as on the kinematic properties of the measured process. Thanks for the suggestion - Done • Line 16-36 we cannot use I think the word “former” as it sounds like “obsolete”The results are not obsolete as to what concerns the “discovery” threshold, which we do not want to over-emphasize here in the paper. Also the equivalence Higgs = H as been made in the first sentence of the introduction such that “H” should be used elsewhere. “Higgs” is controversial. “H” is universally accepted. Finally a kind of executive summary of the reason why we present a new (different) analysis method is needed early in the paragraph Suggestion for lines 16 to 35 ([] = your text ): ATLAS [8] and CMS [9] experiments have previously each reported on the first observation of the H boson in the$\tau\tau$final state and have provided first cross section measurements. In this paper. In this paper, an inclusive measurement of the product of the cross section for the production of the Higgs boson and the branching fraction for its subsequent decay into tau leptons is presented. Cross section measurements split by production modes and in different kinematic regimes are also determined and presented as simplified template cross sections, as defined by the LHC Higgs Cross Section Working Group [10]. The measurement exploits 78 fb$^{-1}$of data collected with pp collisions at a centre-of-mass of 13 TeV. The analysis relies on a new method where a differentiation between the individual signal and background sources is performed using a neural net (NN) multi-classification algorithm that allows to maximize the purity of each signal and background process. 
The NN is used to separate nearly any individual process considered in the analysis of the data into a dedicated event category, based on a fully supervised training. The central element of the analysis [] pair. The$\tau\tau$final state provides [] (VBF). Tau leptons have a distinct [] Four different finalstates [] decay.Simulated events samples [] method [12]. Thanks for the rephrasing- Has been changed in the PAS • Line 181 and 182: Grammar problem, “,” missing plus need re-shuffling. Suggestion: For this purpose, simulated events are used for each individual process except for QCD multijet production. Thanks - Done • Line 183 to 187: Very difficult and long sentence as it is. From what we understood, we try our best to reformulate here: The training of the NN with respect to QCD multijet events relies on data events selected as described in Section 5 but with the following differences for each final states. In the$e\mu$, [] final states, the selected leptons are requested to carry equal instead of opposite charge. Moreover, the trailing [] is requested to pass the Loose working point and to fail the Tight working point [] otherwise used in the [] final state. Thanks for the suggestion - Has been changed • Line 290: missing “,” -> For this purpose, the number of Done • Line 293: avoid yet another “For this purpose”. -> To achieve this, the Done • Line 303: avoid yet another “For this purpose”. Suggestion: For this objective, all corrections Done • Line 466-496 : one should consider shrinking this part. Done • Line 669: not clear what is meant with “A dedicated uncertainty scheme” Has been reformulated. • Figure 7 bottom: please proceed to the merging of some of the multijet bins as discussed at the ARC-Author meeting. 
The results are now shown as discussed at the ARC-Author meeting

• Provide the details of how the NN output "probabilities" are constructed, in particular write explicitly that the "probabilities" are the NN targets assuming equal yields from each of the signal and background species.
Details have been added in Section 5.4

• Maybe add a sentence on the fact that linear correlations between NN outputs have been investigated and very good agreement is observed between the predictions and the data.
Done, at the beginning of Section 5.4.2.

• Try to add some additional information/caveats on the systematic uncertainty bands shown on the plots along the lines of what Roger explained during the meeting (neighbouring bin correlations, eventually some info on the bkg normalisations uncertainties, etc)
Additional information on uncertainty bands has been added in the caption of Figure 2

• line 388-389: The sentence “For the For the following discussion” is very difficult. A suggestion: After the selection step, the events are further split into two signal-like and several background-like categories. These categories are to be distinguished from the event classes introduced to indicate the hypothesized truth. An event class usually coincides with a single
The sentence has been rephrased according to the suggestion

Comments from Anne-Catherine on PAS v5

• l23: "centre-of-mass of 13 TeV" -> "centre-of-mass energy of 13 TeV"
Done

• l26: "purity" -> "selection purity" (?)
Has been changed to “The analysis relies on a new method where a differentiation between the individual signal and background sources is performed using a neural net (NN) multi-classification algorithm that allows establishing several categories very pure in the respective signal or background process.”

• l28: "supervised training" is mentioned in several places in the document, to be skipped in some places.
Has been removed at L407: The training of the NN is performed based on…

• l26-27: "The NN... training" -> "The NN is used to separate nearly any individual process considered in the analysis and to assign the events to dedicated categories."
Changed to “The NN is used to separate nearly any individual process considered in the analysis by assigning the events to dedicated event categories, based on a fully supervised training.”

• l87: What is meant by "for further processing" ?
Can be removed

• l91: Not so clear if tracks are inputs to jets or to the primary vertex.
Rephrased to “The physics objects for this purpose are the jets, clustered using the jet finding algorithm [21, 22] and the associated missing transverse momentum. Hereby, jets are built using all charged tracks associated with the vertex, including tracks from lepton candidates and missing transverse momentum is taken as the negative vector sum of the pT of those jets.”

• l95: "purity" -> "selection purity" ?
Has been changed to “To reduce the number of particles wrongly identified as electrons...”

• l98: "to identify electrons" can probably be removed, can be guessed from the previous sentences.
Has been changed to “For this analysis working points with an identification efficiency between 80 and 90% are used.”

• l119-120: re-optimised version -> optimised version or simply "(b jets) the combined secondary vertex b tagging algorithm (CSV v2) and a NN..." Some readers may start to wonder if the dependency to the data taking period is related to the choice of the discriminant or to the optimisation of CSV.
Changed to “an optimized version”

• l128-130: One should probably specify that all tau decay modes are considered, including 3pi and pi+2pi0s.
We think that this is already specified. The text says: “For the analysis the decay into three charged hadrons and the decay into a single charged hadron with up to two neutral pions with pT > 2.5GeV are used.”

• Footnote from Table 1: "Decays into genuine tau leptons taken from EMB." -> "Decays into genuine tau leptons taken from embedded samples."
Done

• l171: "Drell-Yan production in the subsequent decay into tau leptons" -> "Drell-Yan production in tautau final states" Or something easier to read.
Changed to “The most prominent background process in the eμ, eτh and μτh final states is Drell–Yan production of Z \rightarrow \tau\tau”

• l177-178: Maybe a bit vague. "Similar arguments...(diboson)." -> "Similarly, single top-quarks and vector boson pair production (diboson) contribute to the etauh and mutauh final states."
Done

• l243-245: "For the statistical analysis (...) Section 5.4." -> "The overall normalization of this background is constrained for the relevant final states by ttbar categories, that reaches purities of >= 90%, as described in Section 5.4." To make the sentence shorter. "expected purity" -> it probably doesn't matter here how this purity is computed (on simulation).
Changed to “The overall normalization of this background is constrained for the relevant final states by dedicated $\ttbar$ categories, which reach purities of ${\gtrsim90\%}$, as described in Section~\ref{sec:Event_categorization}.”

• l248: "whenever simulation is used" can probably be skipped.
Yes, we agree. We removed it from the sentence.

• l250: One can probably skip "for the signal extraction". -> "The backgrounds, described in Section 6, can be misinterpreted as signal in three ways:"
Yes, we agree. We changed the sentence to “All backgrounds can be misinterpreted as signal in three ways:”

• l261-62: What kind of selection is used to select Z->mumu ? a quite loose preselection ? looser than the ones from the analysis ?
The selection is tight enough to ensure a high purity of genuine µµ events and at the same time loose enough to minimize biases of the embedded event samples. The selection of the muons defines the minimal selection requirements to be used in the analysis plus a requirement on the invariant mass of the muon pair to ensure that Z → μμ events are selected predominantly.

• l273: "via the reuse" -> "by considering" (?)
Changed to “This is achieved by reusing the full data set of selected \mu\mu events for each final state.”

• l274-276: "It has been checked that the overlap of the resulting embedded event samples is small enough such that the distributions that are related to the part of the event that originates from the observed data are uncorrelated." -> "It has been checked that the overlap of the resulting embedded samples is small enough such that the distributions related to the part of the event originating from the (actual ?) data are uncorrelated."
Changed to "It has been checked that the overlap of the resulting embedded samples is small enough such that the distributions related to the part of the event originating from collision data are uncorrelated."

• l323: "for the analysis" can probably be removed.
Done

• l389-390: The first sentence says the same as the second one, to be removed (?)
Yes, we agree. The first sentence has been removed.

• l392: "to indicate the hypothesized truth" -> "to describe the origin of events" (?) Maybe I'm wrong but "hypothesized truth" sounds complicated
Replaced with “These categories are to be distinguished from the event classes that correspond to the true origin of the processes.”

• l402: "to order events by expected purity" -> "to assign events to categories with high purity" Can one define an "event purity" ?
Has been reformulated to “This probability allows to order events in a given category by expected purity from low to high NN output.”

• l407: "based" -> "using"
Done

• l411: "exception of a miscellaneous (misc) event class for each final state" -> "exception of miscellaneous (misc) event classes" "for each final state" can probably be dropped as everything is well explained in the next sentence.
Done.

• l418: "The training is split in two folds." -> not clear why, because of two data periods ?
No, we divide the samples into two independent sets for training.
In a first step, one set is used for training and the other set for the evaluation. Afterwards, the set used for training before is used for evaluation and the former evaluation set is used for training. We changed the sentence in the PAS to: “For each final state and data taking period two fold cross validation is used for training” We hope that this is clearer.

• l454: "20(18)" -> according to the table it's "20(17)" (?)
Yes, that is true. Thanks for spotting this.

• l489: What is meant with "predominantly", all events at the end should lie in a given category.
Changed wording to “This procedure ensures that each event class is predominantly collected in the dedicated event category.”

• l562: "and a total of 9 PIOs" can probably be skipped.
Yes, we removed it.

Comments from the ARC on PAS v6

Introduction

• Line 2: There are 6 « Higgs » and one floating « H » in the intro. Introduce at first occurrence « Higgs (H) »
Done.

• Line 5, 9, 12 change « Higgs boson » to « H boson »
Done.

• NDLR: Please propagate everywhere Higgs bosons -> H boson
Done.

• Line 8: Higgs sector is slang. Change to « in the SM scalar sector »
Done.

• Line 10: I guess you mean the standard wording « . to Brout-Englert-Higgs mechanism »
Done.

• Line 32: The following sentence does not look right grammatically: « . separated well from the overwhelmingly large background comprised of jets produced . » Maybe you meant « . composed of jets »
I’ve picked this sentence from the CMS publication guidelines. But I will cross-check with John’s comments.

• Line 36 to 40: It is unclear why the section 4 comes after section 5 in the list. I fail to find what is wrong with the natural ordering in the text with sections 2,3,4,5, .
This was a bug and has been changed for v7 already. Sorry that we have bothered you with this.

• Lines 105 and 106: Suggestion to make it simpler « The contributions . backgrounds to the electron or muon selection are further reduced .
I don’t understand the suggestion you make to simplify, but I’ve moved “contributionS” from singular to plural as you suggest.

• Line 113: Just a comma missing « To mitigate any distortions from PU, only those . »
Done.

• Line 137-139: Just a (needed) comma missing « To distinguish ... quarks or gluons, a multivariate . is used
Done.

• Line 214-217: This sentence is very difficult, maybe saved by (important here) missing « , » Suggestion: « For the comparison . data, corrections are derived . trigger paths, for . efficiency, and in the efficiency . »
Done. I’ve added commas, where suggested.

• Line 232: Here appears for the first and only time « Zjets » Change to « Z + jets events »
Done. This was also a bug that has been fixed for v7 already. Thanks for spotting.

• Line 288: You write « All processes given in Table 1 typically contribute to more than one of these groups. » Are you here referring to the « three ways » introduced with the previous bullets ??? How does Z -> tau tau contribute « to more than one » group ??? Isn't it only the second group ?? Suggestion for line 250: « All backgrounds in Table 1 can be misinterpreted in different ways. They can be grouped as in the following. »
Done. Thanks for this suggestion.

• Line 317: Suggestion: « Same-charge and opposite-charge transfer factors . »
This sentence has slightly changed since v7.

• Line 333: A (needed here) « , » If the events passed . of the triggers, the lepton identified
Done.

• Table 3: Change in caption (since you will re-use the labels later in Fig. 1): « All event classes (labelled below ggH, qqH, ztt, QCD, tt, misc, zll, wj, db and st) enter the training . »
Done.

• Line 462: Just a (needed) comma missing: « . between the input parameters, those are
Done.

• Line 474: Suggestion: Here a « Thus » fits better: « Thus, it is expected that . » or « It is thus expected that »
Done.

• Line 495-496: Suggestion « Given the uniform prevalence, the PPV . »
This section has changed in v7.

• Line 635 and 636: You refer to section 4.1 but what you call « reweighting method » here was called « weighing » in line 241 of section 4.1 . weighing or reweighing ??? Choose.
Changed to weighting

• Line 681: A (needed here) « , » « In turn, a significant
DONE

Conclusion

• Line 749: the present tense « are » looks strange . Suggestion: « . have been investigated in terms . »
Changed

Comments from the LE on PAS v7

general

• I don't think we should say "\tau\tau pair" anywhere. It's either (for example) a "\tau\tau final state" or a "tau pair". It might seem to some people that a "\tau\tau pair" is four taus!
Changed to "tau pair"

title

• I think the title needs work. It sounds like we are measuring something very general - the Higgs boson production cross section (which has several components) - but in the tau tau final state, which is one of the decay modes. So it's a bit schizophrenic. I suggest: "Measurement of Higgs boson production and decay to the \tau\tau final state" More discussion might be in order.
Changed. Thanks for the suggestion.

abstract

• Second sentence: "The measurement is based on pp collision data collected by the CMS experiment in 2016 and 2017 corresponding to an integrated luminosity of $77.4~\text{fb}^{-1}$ at a center-of-mass energy of $13~\text{TeV}$."
Changed

• Last sentence: "Results are also presented in terms of cross sections for individual production modes and kinematic regimes."
Changed

text

• line 3+ "targets" -> "goals"; "as input so a detailed analysis of its coupling structure" -> "in order to elucidate its coupling structure."
Changed

• line 7 "On the other hand, " (comma)
Done

• line 12 Remove "especially"; change "like" -> "such as". Physics-wise this statement applies in Type II 2HDM; I am not sure we should call the tau a "down-type fermion".
Added changes

• line 13 What scale? Energy? Mass?
Written “energy scale” and added “This scale could be set by the mass of one or more additional heavy Higgs bosons.” in L14

• line 16 "experimental sensitivity"
Added experimental

• line 27 "multiclassification" or simply "multiclass"; no hyphen in either case. Perhaps "classification" without "multi" is even better: after all, if you are classifying there are multiple classes.
Changed to classification

• line 85 "LHC run periods"; move "on average" to the end.
DONE

• line 90 Specify R for anti kt jets?
Added: “FASTJET [21,22] and discussed below, and the associated...”

• line 91 Suggest dropping "Hereby"
Dropped

• line 96 "wrongly" -> "incorrectly"
Done

• line 104 "For this analysis, ..." (comma)
Done

• line 174 "...followed by that from Z -> \tau \tau."
Done

• line 179 This paragraph is not making sense to me. The opening sentence promises "two approaches" but it is far from clear what the different approaches are; only one overall approach is described in the rest of the paragraph.
Removed first sentence

• line 184 "...relies on observed events..." I would begin the next two sentences with "Firstly, ..." and "Secondly, ..." respectively.
Done

• line 195 "The production samples of...."
Changed to “For simulation of the diboson production processes...”

• line 230 "...are determined from observed $\ttbar$ events (for true $\cPqb$ jets) and observed $\Zjets$ events (for misidentified light quark and gluon jets)."
Changed

• line 239 "Correction factors accounting for deficiencies in the modeling of Drell-Yan $\Pe\Pe$, $\Pgm\Pgm$ and $\Pgt\Pgt$ final states are determined from the ratio of observed to simulated $\ZMM$ events in bins of $\pt(\Pgm\Pgm)$ and $m(\Pgm\Pgm)$."
Replaced

• line 242 "...to better match the observed top quark p_t distribution [43]."
Done

• line 249 Not making sense...if it is a background it must look like the signal! I think what you meant is "The various processes leading to background in our signal selection arise in a number of ways:"
Thanks.
Changed

• line 251 I think you mean "...misinterpreted as originating from a $\Pgt$ decay to hadrons." This is crucial for understanding the rest of the paragraph.
Changed

• line 259 I suggest the following rewrite: "The number and kinematic distribution of these events are estimated using the $\Pgt$-embedding technique described in Ref. [11], which uses observed $\ZMM$ events in which the energy deposits and charged tracks associated with the muons are removed and replaced with those from simulated $\Pgt$ decays. In that way, hybrid events..." I think a good deal of the rest of the paragraph could be dropped, as all this is discussed in Ref. 11.
Changed sentence and removed L266-L271 and the last sentence

• line 279 "Especially" -> "In particular, events..."
Changed

• line 290 In eq. 5, the left side has an index i and the right side does not
Added

• line 295 "differentially"....hmmmm. Bin-by-bin or is there a smoothed shape/function? Clarify.
Changed to: “The $\FF^{i}$ are determined differentially as a smooth function of the $\pt$ of the $\Pgth$ candidate”

• line 317 "Events are selected online during pp collision running using different trigger requirements."
Changed

• line 329 "If an event passes only one of the triggers, the lepton..."
Changed

• line 344 "Finally, to distinguish..." (comma)
Done

• line 345 "...final state, all events..." (comma)
Done

• line 406 "For the training, these..."
Done

• line 409 "run period"
Done

• line 410ff I am wondering if the entire paragraph can be reduced to a single sentence describing a NN setup and training which is standard in some way, described elsewhere, and can be a reference. This is almost a minitutorial…
This is intentional since it is the first time this approach is used in HTT.
It is targeted at readers not familiar with the concepts of machine learning

• line 449 Suggest "All NN input parameter distributions are tested for the level of accuracy with which they are described by the statistical model used for signal extraction, in each final state."
Thanks. Changed

• line 457 "Because the NN analysis gains statistical power compared with a cut-based analysis by utilizing the correlations among the input parameter distributions, these correlations are examined in addition to the marginal distributions of each parameter."
Changed

• line 465 comma after "approach"
Changed

• line 514 "The statistical uncertainty in the observed number of events..."
DONE

• line 518 "The ratio of the observed number of events to the expectation ..."
DONE

• line 520 comma after "signal"
DONE

• line 529 comma after "model"
DONE

• line 569 "limited statistics"
DONE

• line 574 I would not mind a reference to my 2011 paper somewhere around here:
Added reference
@Article{Conway-PhyStat, author = "Conway, J. S.", title = "Nuisance Parameters in Likelihoods for Multisource Spectra", journal = "Proceedings of PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011, edited by H.B. Prosper and L. Lyons", volume = {CERN-2011-006}, pages = 115-120, url = "http://cdsweb.cern.ch/record/1306523/files/CERN-2011-006.pdf", year = "2011"}

• line 576 "as uncorrelated"
DONE

• line 584 "triggers used"
DONE

• line 610 "as correlated"
DONE

• line 626 "as correlated" (check this everywhere)
Checked and changed

• line 669 "In turn, ..."
DONE

• line 703 Somewhere in this paragraph you need to mention the priors assumed on the nuisance parameters.
Added in L700: “... fit to the data. For all normalization uncertainties $\mathcal{C}_{j}(\hat{\theta}_{j}|\theta_{j})$ is chosen to be a lognormal distribution. Shape altering uncertainties are implemented using a vertical morphing algorithm.
The uncertainties may...”
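
For readers less familiar with the priors named in this answer, the following is a minimal sketch, not the analysis code, of how a lognormal normalization uncertainty and a vertically morphed shape uncertainty can act on a binned template. All function names are hypothetical. The parametrisation (yield scaled by kappa**theta with a unit Gaussian constraint on theta) and the piecewise-linear interpolation between templates are assumptions made for illustration; the interpolation used in the actual fit may differ.

```python
def lognormal_scale(kappa, theta):
    """Yield multiplier for a lognormal normalization uncertainty.

    kappa: 1 + relative uncertainty (e.g. 1.10 for 10%); assumed parametrisation.
    theta: nuisance parameter in units of standard deviations.
    """
    return kappa ** theta


def gaussian_penalty(theta):
    """Negative log of a unit Gaussian constraint on theta, up to a constant."""
    return 0.5 * theta ** 2


def vertical_morph(nominal, up, down, theta):
    """Shift each bin linearly towards the +1 or -1 sigma template (assumed scheme)."""
    morphed = []
    for n, u, d in zip(nominal, up, down):
        target = u if theta >= 0 else d
        # Clip at zero so that a large pull cannot produce a negative yield.
        morphed.append(max(n + abs(theta) * (target - n), 0.0))
    return morphed


if __name__ == "__main__":
    # Toy three-bin templates, invented for this example.
    nominal = [10.0, 20.0, 5.0]
    up = [12.0, 21.0, 4.0]
    down = [8.0, 19.0, 6.0]
    theta = 0.5
    shape = vertical_morph(nominal, up, down, theta)
    scale = lognormal_scale(1.10, theta)
    print([scale * b for b in shape], gaussian_penalty(theta))
```

With theta = 0 both functions return the nominal prediction, and theta = ±1 reproduces the up and down templates exactly, which is the property the morphing is meant to have.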

figures

• Generally, everywhere we have a label such as Z -> ll, please use the script \ell, done in Root by using \\ell\\ell.
Unfortunately, \ell is broken in TLatex and cannot be used as a label.

• Figs. 1-2. Is it too late to properly capitalize the row/column axis labels for each category in Fig. 1, and the titles in Fig. 2?
The labels correspond to the class/category names and are therefore not capitalized

• Figs. 2-4, 13-23. The colored bars for the signal vanish when viewing the document at even 125% on my screen. What can we do to avoid this?
Increased width of signal

tables

• Table 7 The capitalization of row and column titles needs to be fixed here.
DONE

-- SebastianWozniewski - 2019-02-01

Topic revision: r82 - 2020-01-18 - RogerWolf
