-- DennisMarioRoy - 2020-06-30

Color code for answers
Green led - comment is acknowledged and answered
Blue led - comment requires further work to be addressed or need attention from the reviewer regarding a specific issue
Red led - we don't agree with your comment and the arguments are given
Gray led - comment still to be answered

AN-20-046-v2 comments

Comments by Lorenzo

Green led- Sec 2.2.2: don't you use HT binned DY samples for the same flavor channels? Even though they are LO samples you would gain a lot in MC statistics.

No, it looked like the inclusive sample actually has more events. But after checking, there are indeed more events from the HT binned samples passing the cuts, so I'll try using these now for all years. If the DY CR plots look better (along with using new DY pTll reweighting), I'll change it in the documentation.

Addendum: I'm now using the HT-binned samples. I added them to the AN too.

Green led- Fig 3, 6, 7: Are these the shapes that are used in the fit? How was the binning chosen? Please also cross check that there are no bins with 0 expected background events for all years.

Yes, these are the SR plots used in the fit. The binning isn't optimized yet: I started with a 20 GeV bin width in the high-statistics region and then made the bins wider as the statistics decrease. None of the bins are empty in background.
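
The binning procedure described above (fine bins where statistics are high, wider bins towards the tail, no bin left without expected background) could be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_sparse_bins`, the threshold `min_bkg` and the example yields are hypothetical and are not taken from the analysis code.

```python
def merge_sparse_bins(edges, bkg_yields, min_bkg=1.0):
    """Merge adjacent bins, starting from the high end, until each merged
    bin holds at least min_bkg expected background events.

    edges      -- n+1 bin edges in increasing order
    bkg_yields -- n expected background yields, one per bin
    """
    new_edges = [edges[-1]]
    new_yields = []
    acc = 0.0
    # Walk from the last (low-statistics) bin towards the first one.
    for i in range(len(bkg_yields) - 1, -1, -1):
        acc += bkg_yields[i]
        if acc >= min_bkg:
            new_edges.append(edges[i])  # close the merged bin at this lower edge
            new_yields.append(acc)
            acc = 0.0
    # Bins left over at the low end that never reached the threshold.
    if new_edges[-1] != edges[0]:
        if new_yields:
            new_yields[-1] += acc       # absorb them into the lowest merged bin
            new_edges[-1] = edges[0]
        else:
            new_edges.append(edges[0])  # everything collapses into a single bin
            new_yields.append(acc)
    return new_edges[::-1], new_yields[::-1]


# Hypothetical example: 20 GeV bins whose yields drop towards high mass.
edges, yields = merge_sparse_bins([0, 20, 40, 60, 80, 100],
                                  [50.0, 10.0, 0.6, 0.3, 0.2])
print(edges, yields)
```

In this toy input the three sparsely populated bins above 40 GeV end up merged into a single bin, while the well-populated 20 GeV bins are kept.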

Green led- Fig. 3, 6, 7 cont: does the bottom right figure really correspond to the VBF region? I would expect the top background to be dominant there. Also what is the definition of the VBF category? I couldn't find it in the text.

Yes, those are the VBF category plots. The definition is explained in the DNN chapter, for which I added a reference/explanation in section 4. I never explicitly studied the change in the background contribution in the VBF category, but it seems that more Top events are rejected from the VBF category in favor of WW background events.

Addendum: Actually I did see an issue with the previous DNN, so I retrained it with some modifications. The background distribution is now closer to what you would expect.

Green led- Appendix B and C: why do these signal region plots have a different binning w.r.t. Fig. 3?

I forgot to change the binning in these plots. Still, the binning used for the fit in all years is the one from Fig. 3.

Green led- L286: what is the goal of the mTl1+MET>60 GeV cut and why it is applied only in SF final states? You should explain in the text.

It helps to further reduce the DY background. I've elaborated this in the text.

Green led- L309: Does it mean that you apply the same cuts except for the lepton flavor? Given that the SF signal region cuts are different wrt OF, I think you should align the SF top CR cuts to the ones in the signal region to have a CR "closer" to the SR.

I meant that the SF Top CR is defined by the same b-tagging requirement change w.r.t. the SF signal region. I rewrote the sentence.

Green led- Sec 4.4: I understand that you use these additional cuts only to extract limits for mH>1 TeV? Is that right? If yes, you should write it explicitly in the text and show the signal region plots in these categories as well.

No, I only optimized the cuts based on signals with mH > 1 TeV. I compute separate limits across the full mass range, once using only the regular cuts and once also using the high mass cuts. I then combine them, taking the high mass result wherever it starts being more sensitive.
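
A minimal sketch of such a combination is given below. It assumes that "more sensitive" is decided per mass point by comparing expected upper limits. The function name `combine_limits`, the selection labels and all numbers are placeholders for illustration, not analysis results.

```python
def combine_limits(expected_regular, expected_highmass):
    """For each mass point, keep the selection with the lower (better)
    expected upper limit.

    Both arguments map the mass in GeV to an expected upper limit.
    Returns a dict: mass -> (selection label, expected limit).
    """
    combined = {}
    for mass in sorted(expected_regular):
        reg = expected_regular[mass]
        high = expected_highmass.get(mass)
        if high is not None and high < reg:
            combined[mass] = ("highmass", high)
        else:
            combined[mass] = ("regular", reg)
    return combined


# Placeholder numbers, only to show the switch-over between selections.
regular = {200: 1.00, 800: 0.10, 2000: 0.020}
highmass = {200: 1.50, 800: 0.12, 2000: 0.015}
result = combine_limits(regular, highmass)
print(result)
```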

Green led- L495-504: can you clarify if you use the whole datasets for the actual analysis? Or only half of each sample?

With the 2-fold approach, I can use the full samples for the DNN training and the analysis. I've rewritten it.
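
The idea behind a 2-fold scheme can be sketched as follows: two models are trained on disjoint halves, and every event is evaluated by the model that did not see it during training, so the full sample remains usable in the analysis. Splitting by event-number parity, the `train` callable and the event dictionaries are assumptions made for this illustration only.

```python
def split_folds(events):
    """Split events into two disjoint halves by event-number parity."""
    fold0 = [e for e in events if e["event"] % 2 == 0]
    fold1 = [e for e in events if e["event"] % 2 == 1]
    return fold0, fold1


def two_fold_scores(events, train):
    """Score every event with the model that was NOT trained on it.

    train -- hypothetical callable: list of events -> model,
             where model(event) returns a score.
    """
    fold0, fold1 = split_folds(events)
    model_a = train(fold0)  # has seen only even-numbered events
    model_b = train(fold1)  # has seen only odd-numbered events
    scores = {}
    for e in fold0:
        scores[e["event"]] = model_b(e)
    for e in fold1:
        scores[e["event"]] = model_a(e)
    return scores


def dummy_train(training_events):
    """Toy 'training': the model just reports whether it saw the event."""
    seen = {e["event"] for e in training_events}
    return lambda e: e["event"] in seen


events = [{"event": n} for n in range(10)]
scores = two_fold_scores(events, dummy_train)
print(scores)
```

With the toy trainer every score is `False`, which confirms that no event is evaluated by a model that was trained on it.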

Green led- L524: have you tried including also additional jets if present, e.g. 3rd leading jet? According to what is written in L514-517 the 3rd jet kinematics might be relevant.

No. I forgot now if there was a reason I didn't. Maybe I can try adding that information, but even so I don't know if it would bring much improvement, as the DNN already performs quite well.

Addendum: While redoing the DNN (see above), I added information of the 4 leading jets with mjj and detajj for all possible combinations.
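
For illustration, the pairwise jet inputs mentioned above (mjj and detajj for all combinations of the 4 leading jets) could be built as in the sketch below. The jet representation as `(pt, eta, phi, mass)` tuples, the feature names, and the use of the absolute value for detajj are assumptions, not the actual DNN input code.

```python
import math
from itertools import combinations


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(jet_a, jet_b):
    """Invariant mass of the two-jet system."""
    e1, x1, y1, z1 = four_vector(*jet_a)
    e2, x2, y2, z2 = four_vector(*jet_b)
    m2 = (e1 + e2) ** 2 - (x1 + x2) ** 2 - (y1 + y2) ** 2 - (z1 + z2) ** 2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


def dijet_features(jets):
    """jets: list of (pt, eta, phi, mass), sorted by decreasing pt.
    Returns mjj and |delta eta| for every pair among the 4 leading jets."""
    features = {}
    for (i, a), (j, b) in combinations(enumerate(jets[:4], start=1), 2):
        features[f"mjj_{i}{j}"] = invariant_mass(a, b)
        features[f"detajj_{i}{j}"] = abs(a[1] - b[1])
    return features


# Two massless back-to-back jets at eta = 0: mjj equals the sum of the pts.
feats = dijet_features([(50.0, 0.0, 0.0, 0.0), (50.0, 0.0, math.pi, 0.0)])
print(feats)
```

Four jets give 6 pairs and therefore 12 features; events with fewer jets simply produce fewer entries, so a real input layer would need some default value for the missing ones.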

Green led- L538-547: have you considered using a different DNN cut for different mass hypotheses?

This is technically unfeasible because a mass-dependent cut would mean the categorization of background events and data changes as a function of the mass. The most I can do is consider a different cut value for the high mass categorization.

Addendum: I am using a different value for the high mass categories now.

Green led- Figs. 25, 26: Do these plots include only the interference among gluon induced processes? Do you have similar plots for the interference among quark induced diagrams?

They are the sum of the gluon- and quark-induced processes. Should they be separate instead?

Gray led- Sec 10: it would be useful to write explicitly what shapes are used in the fit in each region, maybe a pictorial diagram would help.


Green led- Sec. 10.1.1: I would like to see in the AN the impacts for each year (not only 2018), as well as the combined impacts including all years. From the impacts I understand that you have a different rateParam for each year, and also a different one for em, ee and mm. In principle different flavor combinations should scale with the same rateParam, and also I'm not sure whether decorrelating the rateParams across the years is the right choice. Maybe you could check the impact of different rateParam correlation schemes on the final results (e.g. same top rateParam for ee, mm and em).

In HIG-17-033 the rateParams were decorrelated between final states. I can try to correlate them now, also between years.

Addendum: I'm now correlating the rateParams for Top and WW between all final states, and for DY between ee and mm (separate from em because of Embedding vs. MC). I also correlated 2017 and 2018, but not with 2016 because of the different MC tunes (CUET for 2016 vs. CP5 for 2017/18).

Green led- Fig. 29: why for 2000 GeV the MC stat nuisances are much more important than 200 and 800 GeV? I would expect a similar ranking of the nuisances for different masses. Or is it because you use the high mass categorization in this case?

I didn't use the high mass categorization for the workspace checks. About the nuisance stat: This is an effect of the not-yet fully optimized binning. Bins 25 and 26 refer to the two last bins, and these are the only bins that contain expected events for a 2000 GeV signal. Ideally I want the bins in the high mass range to be finer, but this would cause problems again if there's too little statistics in each bin (or no background at all).

Blue led- Fig. 30: any clue of what is causing these huge correlations? Do you have any constraint on the overall top/WW/DY normalization from other nuisances? Also, we should understand why the correlation is less significant at high mass.

Not yet. The only constraint I add is that the normalization parameter should be between 0 and 5. There used to be problems, I think for WW in the VBF category, where the statistics are too low and the rateParam would end up negative.

Green led- For model independent limits the combination of the 3 years is missing, but I guess you are still working on it?

Yes, but since I have no ggF 5000 GeV sample yet for 2016, I don't know whether I should do the combination only up to 4000 GeV or use only 2017+18 in the range 4000-5000 GeV.

Comments by Arun

Green led1. Since we plan to go ahead with the fully leptonic analysis at this time, you need to change a few things in the abstract, the title, and a few other places. Just comment out the parts related to the semi-leptonic analysis for the moment, like Section 5.


Green led2. In the Latinos group, I think it's kind of an unwritten policy that we should have mention of all Latino authors in the main AN of any analysis. So, please add the names of all authors along with the institute on the first page.
Let me and Lorenzo know if you have any reservations about it. You can find the list here : https://docs.google.com/spreadsheets/d/1Kajq5TnIPWb6CIyzVoxVkt2Ai0hQkmJEfSsYq0afiJE/edit#gid=0


Green led3. Line 45-46 : Add a reference for the Higgs discovery paper and maybe the latest results on the Higgs.


Green led4. Line 89-90 : remove the semi-leptonic part


Green led5. Line 94 : Instead of mentioning all the table numbers, better to just write Table [19-30].

This doesn't work automatically in LaTeX, so I changed it manually for now.

Green led6. In Tables 7-9, why are the DY and ZG X-sec different between 2016 and 2017/18 ? Also, please provide the source of all the cross-sections used in AN.

For Zg, the phase spaces are different between 2016 and 2017/18. The 2016 sample is "ZGTo2LG" while the 2017/18 sample is "ZGToLLG_01J". See also the discussion on our Mattermost channel: https://mattermost.web.cern.ch/latinos/pl/8kxwc1skhpnwtq5dtthfsj9pwr .
The cross section is also different for WpWmJJ. For those samples I double-checked the cross section with GenXSecAnalyzer myself for different years, and found it to be correct the way it is. I assume there's a small change in the phase space here as well.
For DY the different cross section comes from here: https://hypernews.cern.ch/HyperNews/CMS/get/generators/4072.html ; from what I understand, the difference w.r.t. 2016 comes from the change of the MC tune. I also found this post, https://hypernews.cern.ch/HyperNews/CMS/get/generators/4446.html , suggesting that the cross section should be changed again, but I'm not worried about it since it's only a small fix and we leave the DY normalization floating in the fit.

Green led7. Line 106-107 : provide the ref of Tau-Embedding.


Green led8. Line 109-111 : Can you provide a quantitative measure of this fraction of leptons ?

It's about 8%. This seemed larger than I remembered, until I realized that it also contains DY events that pass any trigger other than the E-Mu trigger. I've added this information.

Green led9. Section 2.2.2 : please add the effect of this reweighting in the form of some distributions for all years. Maybe ptll distribution?

I'm currently applying a new reweighting scheme, I'll update this section with plots once it's done. But if I show DY CR plots at this point, I'll have to refer to a later section about the definition of the DY CR.

Addendum: Done

Green led10. Line 128 : Antiquark -> Antitop


Green led11. Section 2.2.3 : please add the effect of this reweighting in the form of some distributions for all years.


Green led12. Section 2.2.3 : there is a mention of availability of CP5 tuned sample in 2016. Why don't you use it and get rid of this additional reweighting?

The short answer is that we didn't postprocess that sample. I took it privately from DAS just to determine this additional CUET->CP5 weight. On the other hand, looking at the list of MC samples we use, it looks like CUET is generally used for 2016 samples, while CP5 is used for 2017 and 2018. So "consistency" with other 2016 background samples might be another reason.

Green led13. I think it would be nice to add the Feynman diagrams of all the major background processes to the AN, especially for the WW and Vg/Vg* backgrounds, since there are multiple processes contributing to them.


Green led14. Line 175-176 : mention in words that this factor of 1.11 comes from the 3-lepton CR used in other analyses in H->WW.


Green led15. Section 2.2.7 title should be : SM Higgs background. Also, there is no mention of VBF in there.


Green led16. Section 3 : It's ok to refer to the details of objects AN, but it will be good to have a table here containing the working point definition of each and every object used in this analysis. Object AN is more for the corrections and details of procedures to estimate those corrections. This analysis should be self consistent in explaining all the selection.

I added a full list of all objects I use

Green led17. Line 199 : "opposite flavour" --> "opposite flavour leptons"

Fixed, I did the same for "same flavour leptons".

Green led18. Line 202 : remove "motives"

I meant to write "motivates".

Green led19. Equation 9 : what is 'i' in the symbol of mTi? Please define it.

The "i" stands for "improved", as in "improved transverse mass" (defined in https://github.com/latinos/LatinoAnalysis/blob/master/Gardener/python/variables/WWVar.C#L964-L973 ). I replaced the name "mTi" with "m_reco", as it's named also in HIG-17-033.

Green led20. Line 219-220 : being considered -> considered


Green led 21. Section 4.2 and 4.3 : I would recommend adding a detailed table to add the selection and its details for various phase spaces and scenarios in this analysis. Scenarios like : different flavour cat, same flavour cat, high mass category, selection for DNN training
Phase spaces like : signal, top CR, DY CR. Once you have this table then you can refer to it in various sections. If you have some kind of preselection and final selection, then do categorize it in the table also.

Done, I hope it's fine the way I did it

Green led22. Line 238-239 : 3rd lepton veto is applied to leptons with pt < 10 GeV but they are required to pass the loose selection which you have not defined in the AN. Please add the loose selection and then refer to it when you mention 3rd lepton veto.


Green led23. I don't see a mention of the data-driven estimation of the fake background. Although it is not the most dominant background, please mention it. I think you can add a dedicated section where you describe your signal and all the backgrounds. It partially exists in Section 2.2 when you are describing the MCs, but it's better to have a dedicated section, and in that section you can mention all the details of the background treatment per process. This will definitely help the readability of the AN.

I added this to the Objects section

Green led24. Please make it clear that you are using top and DY CRs in the fit. It was mentioned later in AN but it's better to clarify it in the beginning itself.


Green led 25. Figure 3,6 and others :
i) what is the second peak in the distribution? Is it ttbar at 350 GeV?
ii) Please define what selection was used to make these signal region plots?
iii) Define "multiboson" in legends.
iv) It is also good to add the ratio plots here in order to see the MC uncertainty per bin.
v) Maybe you can add also the log plots for SR only in order to see the shape of signals.

i) It's just that the binning is larger, so there are more events. I'll have to change the plotting to show Events / bin-width.
ii) The signal region cuts are defined at the beginning of section 4.2, 4.3 and 4.4.
iii) I added a full section explaining what's shown in the figures.
iv)+v) I'll do this when I update the plots.
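
As an illustration of answer i) above, dividing each bin content by its width removes the artificial steps that appear wherever the bin width changes. The function name `events_per_bin_width` and the numbers below are hypothetical.

```python
def events_per_bin_width(edges, counts):
    """Convert raw bin contents into events per unit of the x-axis variable."""
    return [c / (hi - lo) for c, lo, hi in zip(counts, edges[:-1], edges[1:])]


# A 60 GeV wide bin holding 60 events has the same density as a
# 20 GeV wide bin holding 20 events, although its raw content is larger.
density = events_per_bin_width([0, 20, 40, 100], [40, 20, 60])
print(density)
```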

Green led 26. You added 2018 plots in the main AN, I would say, for the important variables like mT, it won't harm to add all year plots in main AN. And then you can add only 2018 for all variables.

I'm not sure I understand. Do you mean show additional plots over other variables for 2018 (SR + CR), and bring the 2016 and 2017 from the Appendix into this section? I originally put those separately in the Appendix because I thought, with so many plots it might become confusing.

Addendum: Done

Green led27. Also, I would recommend adding CR plots for all years in the main AN. Another way is to pick a variable and add the plots of different years in a single Figure for each category.


Green led28. Fix the x-axis label in Figure 5, 8. It should be mT not mtH. I think it's also there in all other DY CR plots.

I'd rather rename it to m_{T,ll+MET} as I call it in l.229.

Green led29. Sec 4.3 : Can you show the distribution of the DY MVA ? Blinded for SR and maybe also unblinded for DY CR ?
I know that you have not worked directly on the DY MVA part but I would recommend adding a better description of that in this AN.
It could simply be copy and paste from other AN but as I said before every AN should be self-sufficient.

Since it did show some strange differences between data and MC, and also because the BDT method also cut some small fraction of the signal, I decided to not use the DY MVA anymore. I instead use slightly tighter cuts in the SF final states. This does increase the overall DY background contribution by a bit, but the overall data/MC agreement is better now.

Green led30. Figures 6 and 7 have a huge difference in the yields. Can you explain the reason? Also, mention it in the text.


Green led31. I would recommend adding the yield table for all these CRs because they are very important for your analysis, and it's better to understand the differences and the scaling of their yields with luminosity.


Green led 32. Figure 8,9 : top left plot, what is the source of discrepancy here? lack of MC ? I think it's better to show some simple kinematic plots like the lepton pT or MET. Also, in general whenever you see something weird in plots, please address that in the text and explain it in the best way you can.

The discrepancy could have been caused by low DY statistics (I didn't use the HT-binned samples at the time) and/or by the DY MVA, which I don't use anymore. The agreement is now better, but not yet perfect in 2018. An additional fix that might be needed is to apply recoil corrections to DY.

Green led33. Figure 10,11 : please address the discrepancies.

See above; likely recoil corrections are needed. I added plots on that for 2018.

Green led34. Line 348 - 353 : I don't understand how the neutrino corresponding to the leading lepton will have lower pT, when you correlate to the high mass resonances. Can you please explain it in other words?


Green led 35. Sec 4.4.1 and others :
i) It's not clear from the text what the total selection went into the high mass categorization. Is it on the top of what was applied earlier?
Please make it clear that you perform the full analysis with high mass categorization and then choose the best limit for each mass point at the end.
ii) You defined SR and various CRs 3 times in the AN, one for df, sf and high-mass.
iii) Why don't you just define them once and then describe them in distributions ? Keep the categorization and selection in one section. It will make things easier.

I think this is difficult without making it seem more complicated than it is, but I tried my best.

Green led36. Section 6 : Please provide the reference for DNN. You have used a lot of technical terms so a reference will be handy.

This is difficult. I added the information that I used Keras and Scikit-learn interfaces for the DNN, and added references for those.

Green led 37. Do the two DNNs have separate training? Can you clarify a bit in Section 6.1 the strategy of having two different DNNs with different purposes. One is used as a classifier and another is more like a regression.

Yes, they have separate trainings. I added more information that was missing before. As a result, this should be more clear now.

Green led38. Figure 19, 20, 21 : please clarify the structure of the blue curve in the last bins.


Blue led39. Line 576-577 : Can you add a reference for this statement about the convolution of the BW shape with the gluon PDF? Or maybe just get a figure from somewhere and add a reference to that.

I don't really have a reference. Giulio made the suggestion that this is why the generator level masses don't peak at the resonance, considering the effect is much more significant for ggF than it is for VBF (ggF 5 TeV sample peaks around 800 GeV, while VBF 5 TeV peaks at 5 TeV and merely has a tail towards lower masses). Should I add these plots on the generator level Higgs mass, or simply refer to a public plot on the difference of the Gluon PDF as a function of the energy Q?

Green led40. Line 598-599 : what has been done to take into account this effect of changing the shape of mTi distribution for low mass samples with different width scenarios?

Nothing, it just shows that larger width scenarios for lower mass signals aren't recommended to be used.

Green led41. Equation 11 : What is the motivation of this kind of signal model where the Interference term is scaled by sqrt(r) ?

The factor sqrt(r) results from the square of the matrix element. I elaborated this in the text.
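
A sketch of this argument, assuming that the signal yield is scaled by a real, non-negative strength r so that the signal amplitude scales with sqrt(r). The symbols A_S and A_B for the signal and background amplitudes are notation introduced here and need not match Equation 11 of the AN:

```latex
\left| \sqrt{r}\,A_S + A_B \right|^2
  = r\,|A_S|^2
  + \sqrt{r}\;2\,\mathrm{Re}\!\left(A_S A_B^{*}\right)
  + |A_B|^2
```

The first term is the pure signal (scaling with r), the last term is the pure background (independent of r), and the cross term is the interference, which therefore scales with sqrt(r).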

Green led42. Line 650 : remove "two"

It should have been "to"

Green led43. I am wondering if the scenarios used here are aligned with the other analyses in CMS. e.g. H->tautau, H->ZZ ?

H->TauTau uses the same MSSM scenarios, H->ZZ makes no THDM/MSSM predictions. (I didn't add this to the AN; should I?)

Green led44. Line 720 - 724 : Another source of uncertainty in trigger efficiencies is the narrowing down of Z mass window between tag and probe pairs by 10 GeV on both sides. (70,110) in place of (60,120).


Green led45. Line 726 : How did you get this number of 4% for 2016 and 1.5% for other years?

I think I compared the yields (Up/Nominal and Down/Nominal) from the sum of all background MC from the signal region.

Gray led46. Please add the yield table and final shapes that entered in the statistical analysis.


Gray led47. Add the impacts of other years also in the AN for some mass points.


Green led48. Line 930 : Figure references are wrong here.


Gray led49. Summary section is missing. It will be nice to have a summary containing the exclusion of mass range for various scenarios considered here.


Green led50. Why is mTi used as the final variable for 2016, whereas it is DNN for other years?

This was only an example to show the differences; the DNN is intended to be used for all years.

Green led 51. Do you plan to switch to ttHMVA id ? If not then please remove the limit plot for that part. It's confusing to have it there only for one year. If you want to add it then do so in an appendix for extra checks.

Based on these results, I don't plan to use ttHMVA.

Gray led 52. Can you quantify the gain in performance for using high mass categorization ? Also, can you quantify the gain because of DNN generated Higgs mass variable usage w.r.t. mTi ?


Topic revision: r12 - 2020-09-23 - DennisMarioRoy