Framework Tests for the Xrootd Demonstrator

Applications used

The CmsXrootdTestsApplications page documents the different applications used.

Results

Experiences analyzing a 100GB file

The following message was sent to the hn-cms-edmFramework list regarding performance analyzing a 100GB file:

I took a handful of 10GB files from a recent real-data RECO and merged them together into a large, 100GB file (about 330k events, 640 lumis, 2 runs). This was in the new 3_9_0_pre3 format. I then took this file and ran a few transatlantic analysis tests. The effect of this setup was twofold: (1) I learned which of our I/O patterns scale with file size, and (2) the extra latency emphasized any slowdowns from uncached I/O. Here are my findings:

  1. Problem: We read through the EventAux branch out-of-order. This defeats the caching and added about 3 minutes to the file-open time. Scaling down to typical file sizes and latencies, this could add 1-5 seconds to opening the file. David Dagenhart is working on a solution that would allow us to read this branch in-order. This scan through the EventAux branch can be avoided by turning off the duplicate checker and keeping the event sort off (it defaults to off in 3_8_0 and later).
  2. Problem: Because of the way we load up the file, the first basket of the EventAux branch is already in memory before we start the first event. This causes the TTreeCache to miss the fact that we read it every event, so it never caches the branch (the second basket doesn't get loaded until well after the training stops). The same thing happens to the EventBranchEntryInfo branch. Solution: Since we are required to read both of these, always manually register the *Aux branch (for run, lumi, and event) and EventBranchEntryInfo with the cache (see the sketch after this message).
  3. Problem: The ParameterSet TTree is written contiguously but read back in small chunks because it is not cached. Solution: Add a cache. I believe we also leak one such TTree object per file, which Bill is investigating.
  4. Observation: DQM information consumes a huge amount of memory and can be read in inadvertently. Unless it is dropped in the inputCommands (and I doubt most folks know to do this; see the configuration sketch after this message), it increases the memory footprint of the job by 900MB when reading real data. My understanding is that the DQM information has limited appeal, but it seems really easy to accidentally bring it forward. Suggestion: Just as we can mark products as transient, allow information to be marked as not copied forward by default. That way, the DQM data can self-describe as probably not interesting to most people. I would be interested in poking around in a few folks' output files to see how common this mistake is (or at least in hearing the opinions of the PAT folks).
  5. Problem: All Run and Lumi branches must be read (these branches can't be read on demand because of what they are), yet we still train the TTreeCache on them. Solution: manually train these trees. The effect will mostly be noticed on files with a large number of lumis/runs and on files where the DQM data is present.
  6. Observation: There's a large increase in memory usage which I can't quite figure out. If I run with a "drop *" in the input commands, the difference between a 10GB file and a 100GB file is about 100MB of RAM. This is expected - there are just that many more baskets in the large file. However, if I keep everything, there is a 440MB increase in RAM between the two files, and I don't understand where the remaining 340MB is coming from. igprof results are inconclusive so far. I'm working with Philippe and Chris to understand what this is, but I've at least narrowed it down to the Events tree.

If we can understand where the memory usage is going - and/or reduce the extra almost 500MB of RAM usage - it might be plausible, as far as the framework is concerned, to actually produce files of this size (this ignores the fact that there are other places in the computing infrastructure that would fall apart with 100GB files).

This work also shows that we are getting down to "odds and ends" in making WAN streaming of such files a reasonable activity for some analyses. Currently, I see no need for further file format changes - just improvements in how we read the file format.
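
The cache fixes in items 2, 3, and 5 above can be illustrated with a small PyROOT sketch. This is not the framework's actual code; the file name, cache size, and exact branch and tree names are assumptions used only for illustration.

  import ROOT

  # Open the file and get the Events tree (file name is a placeholder).
  f = ROOT.TFile.Open("reco.root")
  events = f.Get("Events")

  # Give the tree a read cache; 20 MB is an arbitrary example size.
  events.SetCacheSize(20 * 1024 * 1024)

  # Manually register the always-read branches so the TTreeCache keeps them
  # even though their first baskets are already in memory before training starts.
  # The branch names are assumptions based on the discussion above.
  for bname in ("EventAuxiliary", "EventBranchEntryInfo"):
      events.AddBranchToCache(bname, True)

  # Item 3: give the otherwise uncached ParameterSet tree its own small cache.
  # The tree name is an assumption.
  psets = f.Get("ParameterSets")
  if psets:
      psets.SetCacheSize(1024 * 1024)

Item 4's suggestion can already be approximated on the reading side with inputCommands on the PoolSource. The snippet below is only a hedged example; the module label used in the drop statement is an assumption about where the DQM products come from and may need to be adjusted for a given file.

  import FWCore.ParameterSet.Config as cms

  # Drop DQM products when reading a file, keep everything else.
  # "MEtoEDMConverter" is an assumed module label for the DQM payload.
  source = cms.Source("PoolSource",
      fileNames = cms.untracked.vstring("file:reco.root"),   # placeholder
      inputCommands = cms.untracked.vstring(
          "keep *",
          "drop *_MEtoEDMConverter_*_*",
      ),
  )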

Evolution of CMSSW access patterns

The measurements below were made using the "read_few.py" application documented on the applications page above.

I wanted to share how our I/O patterns have evolved over the last 4 software versions (from 3.6 to 3.9). I took a recent RECO file and created a simple job that reads out a handful of branches from the first 3000 events (an illustrative sketch of such a job follows this paragraph). For each of the 4 software releases, I used the "shipping settings" as much as possible. Below, I report two numbers - the number of reads issued by ROOT and the number of reads issued to the OS (the difference being the effect of the cache).
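
A minimal sketch of such a job, assuming a cmsRun configuration that skims a few products into an output file (the real read_few.py is documented on the applications page; the input file and product patterns below are placeholders):

  import FWCore.ParameterSet.Config as cms

  process = cms.Process("READFEW")

  # Placeholder input file; in the test above this was a recent RECO file.
  process.source = cms.Source("PoolSource",
      fileNames = cms.untracked.vstring("file:reco.root"),
  )

  # Only the first 3000 events are processed.
  process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(3000))

  # Writing a skim that keeps only a handful of products forces just those
  # branches to be read; the product patterns are examples, not the ones used.
  process.out = cms.OutputModule("PoolOutputModule",
      fileName = cms.untracked.string("file:skim.root"),
      outputCommands = cms.untracked.vstring(
          "drop *",
          "keep recoTracks_generalTracks_*_*",
          "keep recoMuons_muons_*_*",
      ),
  )
  process.ep = cms.EndPath(process.out)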

Version   ROOT reads   Actual reads (to OS)   Commentary
3_6_1     13807        11038                  TTreeCache off (default for the release)
3_7_0     13807        6264                   TTreeCache on (default for the release)
3_8_2     14254        6711                   Increase probably due to construction of the index into the file
3_9_0     14014        3371                   Decrease likely due to more aggressive caching (Run and Lumi products are now cached)

That is a 3.27x decrease in reads issued to the OS from 3_6_1 to 3_9_0 (11038 / 3371). I still expect another 2x decrease for ROOT 5.28, whenever that lands.
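
The read counts above come from the framework's own statistics. As a rough cross-check outside CMSSW, the effect of the TTreeCache can also be observed with a PyROOT sketch along the following lines; the file name, branch list, and cache size are assumptions for illustration only.

  import ROOT

  def count_os_reads(filename, branches, n_events=3000, cache_size=0):
      # Count the read calls issued to the file while reading the selected
      # branches for the first n_events entries, with or without a TTreeCache.
      f = ROOT.TFile.Open(filename)
      tree = f.Get("Events")

      # Read only the branches we care about.
      tree.SetBranchStatus("*", 0)
      for b in branches:
          tree.SetBranchStatus(b, 1)

      if cache_size > 0:
          tree.SetCacheSize(cache_size)          # enable the TTreeCache
          tree.SetCacheEntryRange(0, n_events)   # limit it to our entry range

      for i in range(min(n_events, int(tree.GetEntries()))):
          tree.GetEntry(i)                       # reads only the enabled branches

      reads = f.GetReadCalls()                   # reads actually issued to storage
      f.Close()
      return reads

  # Example usage with placeholder file and branch names:
  # cached   = count_os_reads("reco.root", ["EventAuxiliary"], cache_size=20*1024*1024)
  # uncached = count_os_reads("reco.root", ["EventAuxiliary"], cache_size=0)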

Latency test using JPE

This test ran over approximately 60,000 events using the following files:

  • root://brian@cmsdca0.fnal.gov//store/relval/CMSSW_3_9_0_pre3/EG/RECO/GR_R_38X_V9_RelVal_col_10_special-v2/0000/0200947B-0AB7-DF11-998B-001A92971B20.root
  • root://brian@cmsdca0.fnal.gov//store/relval/CMSSW_3_9_0_pre3/EG/RECO/GR_R_38X_V9_RelVal_col_10_special-v2/0000/686115F6-E7B6-DF11-BC87-00248C55CC7F.root
  • root://brian@cmsdca0.fnal.gov//store/relval/CMSSW_3_9_0_pre3/EG/RECO/GR_R_38X_V9_RelVal_col_10_special-v2/0000/18475378-0AB7-DF11-88F3-00261894382D.root
  • root://brian@cmsdca0.fnal.gov//store/relval/CMSSW_3_9_0_pre3/EG/RECO/GR_R_38X_V9_RelVal_col_10_special-v2/0000/0CC3CAE9-0BB7-DF11-9847-0030486792B6.root

Duplicate checks were turned on (pre3's duplicate-check performance is poor because of the out-of-order reading described above). CMSSW_3_9_0_pre3 was used. The source site was FNAL Xrootd; the data server was cmsstor137.fnal.gov. CPU efficiency below is CPU time divided by wall time.
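
A minimal sketch of how the source for this test might be configured, assuming a standard PoolSource reading the files above directly over Xrootd; the duplicateCheckMode value shown is an assumption about the exact setting used.

  import FWCore.ParameterSet.Config as cms

  # Read the RelVal files directly over Xrootd; only the first file from the
  # list above is shown, the others follow the same pattern.
  source = cms.Source("PoolSource",
      fileNames = cms.untracked.vstring(
          "root://brian@cmsdca0.fnal.gov//store/relval/CMSSW_3_9_0_pre3/EG/RECO/GR_R_38X_V9_RelVal_col_10_special-v2/0000/0200947B-0AB7-DF11-998B-001A92971B20.root",
          # ... remaining files from the list above
      ),
      # Duplicate checking enabled, as in the test; the exact mode is an assumption.
      duplicateCheckMode = cms.untracked.string("checkAllFilesOpened"),
  )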

  • t3-sl5 (ping time 17 ms): wall 80 s, CPU 55 s, CPU efficiency 69%
  • cmslpc03 (ping time 0.1 ms): wall 80 s, CPU 75 s, CPU efficiency 94%
  • lxplus256 (ping time 128 ms): wall 161 s, CPU 61 s, CPU efficiency 38%
  • cmslpc03 via dCap: wall 135 s, CPU 67 s, CPU efficiency 50%