---+ Chapter 3.6: Analysis of the collision data

%TOC%

---++ Conceptual overview

Running over large amounts of data can be rather tricky, but good practices can make your life a lot easier. A judicious use of the computing resources and software tools at your disposal can make your analysis run more smoothly and efficiently, while putting less strain on the limited computing resources.

Before we get started on this WorkBook chapter, it is useful to review how data are acquired and distributed, in the context of accessing them for your analysis. The data are collected by the detector and processed through the HLT. The HLT paths are assigned to specific "Primary Datasets" (PDs), which are the "quantum" of the computing infrastructure: PDs are distributed in their entirety to T1 and T2 sites, so they are the main unit through which you will access the data. The Primary Dataset Working Group ([[%SCRIPTURL{"view"}%auth/CMS/PdwgMain][PDWG]]) is a good resource for keeping up to speed with the PDs and their deployment.

Quite a lot of random triggers fire when the detector is not taking data, so the best practice is to run only on luminosity sections where the detector was "on". [[https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/][This webpage]] is constantly updated with "good run lists" in a JSON format that correspond to the "DCS bit" being on and the detector taking data. By using such a good run list in your CRAB jobs, you reduce the strain on the resources and run only on the data that is interesting for your physics analysis.

There are also many triggers within a given primary dataset. It is fast and efficient to require a single trigger (or group of triggers) at the beginning of your path, so that the rest of the sequence is processed only if that trigger fired. This also reduces the strain on the computing resources and lets your jobs finish much more quickly.

Much of the machine background, coming for instance from beam-gas interactions, can be removed by prescriptions from the various DPGs. These prescriptions are best followed, and are provided as a standard cleaning recipe below.

The following recipes will help you get started with performing an effective analysis of the collision data while minimizing the impact on the computing environment.

---++ Recipes to get started

In terms of software the user should always follow the instructions from WorkBookWhichRelease. The Physics Data And Monte Carlo Validation (PdmV) group maintains analysis recipe pages [[%SCRIPTURL{"view"}%auth/CMS/PdmV][here]]. Especially for Run II, the pages [[CMS.PdmV2015Analysis][2015]], [[CMS.PdmV2016Analysis][2016]], [[CMS.PdmV2017Analysis][2017]] and [[CMS.PdmV2018Analysis][2018]] collect useful information to be aware of when performing an analysis or producing ntuples for analysis: guidelines on which release, detector conditions, and datasets to use, as well as special filters or tools to remove atypical or problematic events.

There are several recipes for you to use to clean up the event sample as recommended by the PVT group. For 2010 and 2011 this is a collation of the information presented on the [[%SCRIPTURL{"view"}%auth/CMS/Collisions2010Recipes][Collisions2010Recipes]] and [[CMS.Collisions2011Analysis][Collisions2011Analysis]] TWiki pages.
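Independently of the year-specific recipes, a common first step is to restrict an interactive (non-CRAB) job to the certified luminosity sections from the good run lists introduced above. Below is a minimal sketch; the input file and the JSON file name (=goodRunList.json=) are placeholders, not official certification files, which you should take from the DQM certification page linked above.
<verbatim>
import FWCore.ParameterSet.Config as cms
import FWCore.PythonUtilities.LumiList as LumiList

process = cms.Process("ANA")

# Placeholder input file: replace with your own dataset files.
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('file:myInputFile.root')
)

# Keep only the certified luminosity sections listed in the JSON good run list.
# 'goodRunList.json' is a placeholder name for the certification file.
process.source.lumisToProcess = LumiList.LumiList(filename = 'goodRunList.json').getVLuminosityBlockRange()
</verbatim>
When running with CRAB3, the same JSON file is instead passed through the =Data.lumiMask= parameter of the CRAB configuration, as described in the links in the following sections.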
The following sample cleanups should be done for most 2010 and 2011 analyses, unless you know what you are doing and have a reason to change them.

   * Beam background removal
<verbatim>
process.noscraping = cms.EDFilter("FilterOutScraping",
    applyfilter = cms.untracked.bool(True),
    debugOn = cms.untracked.bool(True),
    numtrack = cms.untracked.uint32(10),
    thresh = cms.untracked.double(0.25)
)
</verbatim>
   * Primary vertex requirement
<verbatim>
process.primaryVertexFilter = cms.EDFilter("GoodVertexFilter",
    vertexCollection = cms.InputTag('offlinePrimaryVertices'),
    minimumNDOF = cms.uint32(4),
    maxAbsZ = cms.double(24),
    maxd0 = cms.double(2)
)
</verbatim>
   * HBHE event-level noise filtering
<verbatim>
process.load('CommonTools/RecoAlgos/HBHENoiseFilter_cfi')
</verbatim>

More specific recipes on data analysis can be found in the [[CMS.Collisions2010Analysis][Collisions2010Analysis]] and [[CMS.Collisions2011Analysis][Collisions2011Analysis]] TWiki pages.

---++ Selection of good runs

The PVT group maintains centralized good run lists in two different formats [[https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/][here]]. The files are stored either in JSON format or directly as a CMSSW configuration snippet. The JSON files should be used in conjunction with CRAB; either format can be used interactively in FWLite or in the full CMSSW framework.

   * Using the JSON format with CRAB3 is described [[WorkBookCRAB3Tutorial#2_CRAB_configuration_file_to_run][here]].
   * Using the JSON format with CMSSW or FWLite is described in these links:
      * [[CMS.SWGuidePythonTips#Use_a_JSON_file_of_good_lumi_sec][Using a JSON file to select good lumi sections]].
      * [[CMS.SWGuideGoodLumiSectionsJSONFile][Detailed description of usage in various contexts]].

JSON file updates are announced on the physics-validation HyperNews forum ([[https://hypernews.cern.ch/HyperNews/CMS/get/physics-validation.html][link]]). There you can find discussions regarding the physics validation of MC and data production for physics analyses, including good/bad run lists based on the DQM certification tools. <br />If you want to contact the experts, the email gateway for this forum is: hn-cms-physics-validation@cern.ch

---++ Usage of computing resources

The recommended way for users to process collision data with CRAB3 is:

   * Use the good run lists (in the JSON format) in CRAB3 as described [[WorkBookCRAB3Tutorial#2_CRAB_configuration_file_to_run][here]].
      * The good run lists are available [[https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/][here]].
   * From there, publish the dataset if you intend to use grid resources to access it later. Instructions for CRAB3 publication can be found [[WorkBookCRAB3Tutorial#Output_dataset_publication_AN1][here]].
   * The final good run lists should be applied at the analysis level. The latest good run lists are available [[https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions17/13TeV/][here]].

---++ Trigger Selection

It is often extremely useful to apply your trigger selection directly in your skim or PAT-tuple creation step. This reduces the load on the computing resources and gives you a smaller output. As an illustration, here is how to select =HLT_Dimuon25_Jpsi=:
<verbatim>
process.triggerSelection = cms.EDFilter("TriggerResultsFilter",
    triggerConditions = cms.vstring('HLT_Dimuon25_Jpsi_v*'),
    hltResults = cms.InputTag("TriggerResults", "", "HLT"),
    l1tResults = cms.InputTag(""),
    throw = cms.bool(False)
)

...
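# (The "..." above stands for any other module definitions in your configuration.)
# Putting the trigger filter first in the sequence means that, when the sequence
# is run inside a cms.Path, the modules after it are executed only for events in
# which the selected trigger fired, which is what keeps the processing fast.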
process.mySequence = cms.Sequence(
    process.triggerSelection *
    process.myOtherStuff
)
</verbatim>
where =myOtherStuff= is whatever other modules you want to run. More information on trigger access in analysis can be found at WorkBookHLTTutorial and WorkBookMiniAOD2017#Trigger.

---++ Analysis of the processed data

There are several choices for how to analyze the collision data. A few examples will help get you started: an [[%SCRIPTURL{"view"}%auth/CMS/WorkBookFWLiteExamples#Example_4_Realistic_Example_With][example in FWLite]], [[WorkBookTrackAnalysis][TrackAnalysis]] and [[WorkBookMuonAnalysis][MuonAnalysis]].

---++ Luminosity information

Luminosity should be calculated with the official =brilcalc= tool, which is the recommended tool for both Run 1 and Run 2 data. For more information, please see the [[CMS.TWikiLUM][Lumi POG page]] and the [[https://cms-service-lumi.web.cern.ch/cms-service-lumi/brilwsdoc.html][official brilcalc documentation]].

---++ About Datasets

The data are collected into "primary datasets", "secondary datasets", and "central skims" for distribution to the computing resources. The datasets are defined so as to keep roughly equal rates for each PD. The PDWG is the group that defines and monitors these datasets; [[CMS.PdwgMain][the PDWG TWiki]] describes in detail how this is done. In particular for Run II, the data used for producing analysis ntuples are made available in different formats, such as AOD, MiniAOD and NanoAOD. These formats are defined so as to retain the information needed by each kind of analysis.

---++ Overlapping run ranges

In general, there are many different primary datasets and/or run ranges that you will have to process for your analysis. As things evolve, older data are re-reconstructed and primary datasets are split into smaller pieces. Because runs can appear in different datasets across re-recos, it is often necessary to *first* define a run range for your dataset when running your grid jobs. This will ease your own accounting later on in the analysis chain.

To do this, it is often advantageous to split up the good run lists into exclusive run ranges, and then pass the split sections to the various grid jobs you are running for the various primary or secondary datasets. See the [[CMS.SWGuideGoodLumiSectionsJSONFile#filterJSON_py][guide on good lumi sections]] for details of how to do this. For instance, given two run ranges "1-20" and "21-50", you could split up your good run list in the JSON format as:
<verbatim>
filterJSON.py --min 1 --max 20 old.json --output first.json
filterJSON.py --min 21 --max 50 old.json --output second.json
</verbatim>
This will create disjoint run ranges for all of your datasets, simplifying your accounting.

---++ Review status

| *Reviewer/Editor and Date (copy from screen)* | *Comments* |
| Main.JhovannyMejia - 28 Aug 2018 | Update to RunII |
| Main.SalvatoreRoccoRappoccio - 28 Sep 2010 | Author |

-- Main.SalvatoreRoccoRappoccio - 28-Sep-2010